[Paper] Joint-Centric Dual Contrastive Alignment with Structure-Preserving and Information-Balanced Regularization
Source: arXiv - 2604.16247v1
Overview
The paper introduces HILBERT, a new multimodal framework that learns joint audio‑text embeddings from long, segmented documents—think podcasts, lecture recordings, or video subtitles—while working with limited training data. By cleverly combining frozen pre‑trained speech and language models with a dual‑contrastive alignment strategy, HILBERT produces robust, balanced representations that outperform existing methods on highly imbalanced classification tasks.
Key Contributions
- Hierarchical Long‑Sequence Fusion: Uses cross‑modal attention and self‑attentive pooling to merge segment‑level audio and text features into a single document‑level embedding.
- Reciprocal Dual Contrastive Alignment: Aligns each modality (audio, text) to a shared joint space independently (audio→joint, text→joint) rather than forcing a direct audio‑text match, which mitigates dimensionality imbalance.
- Structure‑Preserving Regularizer (CKA): Enforces that the internal geometry of each modality is retained after projection into the joint space.
- Information‑Balanced Regularizer: Equalizes the amount of information contributed by audio and text, preventing one modality from dominating the joint representation.
- Mixture‑of‑Experts (MoE) Classifier: Handles heterogeneous label sets by dynamically weighting expert predictions based on the concatenated audio, text, and joint embeddings.
- Low‑Resource Viability: Demonstrates strong performance even when only a small amount of labeled multimodal data is available.
Methodology
- Feature Extraction:
- Audio segments are passed through a frozen pre‑trained speech encoder (e.g., wav2vec 2.0).
- Text segments (transcripts, subtitles) are encoded with a frozen language model (e.g., BERT).
- Both encoders output fixed‑size vectors per segment, preserving the original temporal order.
- Hierarchical Fusion:
- Cross‑Modal Attention: Each audio segment attends to all text segments (and vice‑versa), allowing the model to capture cross‑modal cues such as “the speaker says X while a music cue occurs.”
- Self‑Attentive Pooling: The attended segment vectors are aggregated into a single representation per modality, producing audio‑only and text‑only document embeddings.
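The two fusion steps above can be sketched in NumPy. This is a minimal, single-head illustration under assumed shapes; the function names, dimensions, and scoring vector `w` are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(queries, keys, values):
    """Each segment of one modality attends over all segments of the
    other modality. Shapes: (n_q, d), (n_k, d), (n_k, d)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)      # (n_q, n_k)
    return softmax(scores, axis=-1) @ values    # (n_q, d)

def self_attentive_pooling(segments, w):
    """Collapse attended segment vectors into one document-level vector
    using a learned scoring vector w of shape (d,)."""
    alpha = softmax(segments @ w)               # (n_seg,)
    return alpha @ segments                     # (d,)

rng = np.random.default_rng(0)
audio_seg = rng.standard_normal((6, 32))   # 6 audio segments, dim 32
text_seg = rng.standard_normal((4, 32))    # 4 text segments, dim 32
w = rng.standard_normal(32)

audio_attended = cross_modal_attention(audio_seg, text_seg, text_seg)
doc_audio = self_attentive_pooling(audio_attended, w)
```

The same two calls, with the roles of audio and text swapped, produce the text-side document embedding.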
- Joint Embedding Construction:
- The modality‑specific embeddings are concatenated and passed through a lightweight transformer that outputs a joint cross‑attentive embedding.
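A compact sketch of this construction, with a single affine layer standing in for the paper's lightweight transformer (the dimensions and `tanh` nonlinearity are assumptions for illustration):

```python
import numpy as np

def joint_embedding(h_audio, h_text, W, b):
    """Concatenate the two document-level embeddings and project them
    into the joint space. A single affine layer stands in here for
    the paper's lightweight transformer."""
    fused = np.concatenate([h_audio, h_text])   # (d_a + d_t,)
    return np.tanh(W @ fused + b)               # (d_joint,)

rng = np.random.default_rng(1)
h_audio = rng.standard_normal(32)
h_text = rng.standard_normal(48)            # modality dims may differ
W = rng.standard_normal((64, 80)) * 0.1     # d_joint=64, input 32+48
b = np.zeros(64)
z_joint = joint_embedding(h_audio, h_text, W, b)
```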
- Reciprocal Dual Contrastive Loss:
- Two contrastive objectives are optimized simultaneously:
- Audio‑to‑Joint: Pulls the audio embedding close to its joint counterpart while pushing away other joint embeddings.
- Text‑to‑Joint: Same idea for text.
- This “reciprocal” setup avoids directly contrasting audio with text, which can be problematic when their dimensionalities differ drastically.
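The two objectives can be written as a pair of InfoNCE terms against the joint embedding, a minimal sketch assuming the modality embeddings have already been projected to the joint dimensionality (the temperature value and batch shapes are illustrative):

```python
import numpy as np

def info_nce(anchors, targets, tau=0.07):
    """One-directional InfoNCE: the i-th anchor should match the
    i-th target; all other targets in the batch are negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = a @ t.T / tau                       # (batch, batch)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def dual_contrastive_loss(z_audio, z_text, z_joint, tau=0.07):
    """Audio->joint plus text->joint; audio and text are never
    contrasted against each other directly."""
    return info_nce(z_audio, z_joint, tau) + info_nce(z_text, z_joint, tau)

rng = np.random.default_rng(2)
z_joint = rng.standard_normal((8, 64))
# well-aligned case: small perturbations of the joint anchors
z_audio = z_joint + 0.01 * rng.standard_normal((8, 64))
z_text = z_joint + 0.01 * rng.standard_normal((8, 64))
aligned = dual_contrastive_loss(z_audio, z_text, z_joint)
shuffled = dual_contrastive_loss(z_audio[::-1], z_text, z_joint)
```

Aligned pairs yield a near-zero loss, while mismatched (shuffled) pairs are penalized heavily.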
- Auxiliary Regularizers:
- CKA Loss: Measures similarity of the internal covariance structure before and after projection, encouraging the joint space to preserve each modality’s geometry.
- Mutual Information Balancing Loss: Estimates the information flow from each modality into the joint space and penalizes imbalances.
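The CKA term can be sketched with the standard linear-CKA formula; a regularizer would then penalize `1 - linear_cka(features_before, features_after)`. This is a generic sketch, not necessarily the paper's exact estimator:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two representations of
    the same n examples (shapes (n, d1), (n, d2)). Equals 1.0 when the
    two geometries match up to rotation and isotropic scaling."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, 'fro') ** 2
    norm_x = np.linalg.norm(X.T @ X, 'fro')
    norm_y = np.linalg.norm(Y.T @ Y, 'fro')
    return hsic / (norm_x * norm_y)

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 32))           # pre-projection features
Q, _ = np.linalg.qr(rng.standard_normal((32, 32)))
same_geometry = linear_cka(X, X @ Q)         # orthogonal map: geometry kept
different = linear_cka(X, rng.standard_normal((100, 32)))
```

An orthogonal rotation of the features leaves linear CKA at exactly 1, which is why it is a natural structure-preservation score for a learned projection.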
- Downstream Classification:
- A Mixture‑of‑Experts classifier receives the concatenation of the three embeddings (audio, text, joint) and learns to weight expert predictions according to the task’s label distribution.
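A minimal MoE sketch with linear experts and a linear gate over the concatenated embeddings; the expert/gate parameterization here is an assumption for illustration, not the paper's exact classifier:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def moe_classify(z_audio, z_text, z_joint, expert_weights, gate_weights):
    """Gate over experts using the concatenated embeddings; each expert
    is a linear classifier here for illustration."""
    x = np.concatenate([z_audio, z_text, z_joint])             # (d_total,)
    gates = softmax(np.array([x @ g for g in gate_weights]))   # (n_experts,)
    expert_probs = np.stack([softmax(x @ W) for W in expert_weights])
    return gates @ expert_probs                                # (n_classes,)

rng = np.random.default_rng(4)
d_total, n_classes, n_experts = 32 + 32 + 64, 5, 3
experts = [rng.standard_normal((d_total, n_classes)) * 0.1
           for _ in range(n_experts)]
gates_w = [rng.standard_normal(d_total) * 0.1 for _ in range(n_experts)]
probs = moe_classify(rng.standard_normal(32), rng.standard_normal(32),
                     rng.standard_normal(64), experts, gates_w)
```

Because the gate weights are a convex combination and each expert outputs a distribution, the mixture is itself a valid class distribution.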
Results & Findings
- Semantic Coherence: t‑SNE visualizations show that HILBERT’s joint embeddings cluster by topic across modalities, confirming successful alignment.
- Classification Gains: On a multi‑class, highly imbalanced audio‑text dataset (e.g., podcast genre classification), HILBERT improves macro‑F1 by 8–12% over strong baselines such as simple concatenation or single‑contrastive alignment.
- Robustness to Low Data: With only 10% of the labeled data, performance degrades by less than 3%, indicating that the frozen encoders and regularizers effectively leverage pre‑learned knowledge.
- Ablation Insights: Removing either the CKA or the information‑balancing loss drops F1 by ~4%, confirming that both regularizers are essential for stable long‑sequence fusion.
Practical Implications
- Podcast & Video Analytics: Companies can build more accurate genre, sentiment, or content‑tagging pipelines without needing massive labeled corpora.
- Assistive Tech: Speech‑to‑text systems for the hearing impaired can benefit from richer joint embeddings that capture context across modalities.
- Content Recommendation: Balanced audio‑text representations enable better similarity search (e.g., “find videos with similar spoken content and background music”).
- Low‑Resource Languages: Because HILBERT relies on frozen encoders, it can be adapted to languages where labeled multimodal data is scarce, simply by swapping in a language‑specific speech or text model.
- Modular Integration: The architecture is plug‑and‑play; developers can replace the backbone encoders (e.g., use Whisper for audio) without redesigning the alignment mechanism.
Limitations & Future Work
- Dependence on Pre‑trained Encoders: Quality of the joint embedding is bounded by the capabilities of the frozen speech and language models; domain‑specific vocabularies may still suffer.
- Computational Overhead: Cross‑modal attention across long sequences can be memory‑intensive; scaling to hour‑long recordings may require segment‑wise batching or efficient attention tricks.
- Limited Evaluation Domains: Experiments focus on a few benchmark datasets; broader testing on diverse media (e.g., movies, meetings) would strengthen claims.
- Future Directions:
- Explore learnable adapters on top of frozen encoders to fine‑tune domain nuances.
- Incorporate multimodal transformers that handle variable‑length inputs more efficiently (e.g., Longformer, Performer).
- Extend the MoE classifier to multi‑task settings where audio‑text embeddings serve downstream tasks like summarization or question answering.
HILBERT demonstrates that thoughtful alignment and regularization can unlock high‑quality multimodal representations even when data is scarce—a promising step toward more intelligent, context‑aware audio‑text applications.
Authors
- Habibeh Naderi
- Behrouz Haji Soleimani
- Stan Matwin
Paper Information
- arXiv ID: 2604.16247v1
- Categories: cs.LG, cs.AI
- Published: April 17, 2026