[Paper] Joint-Centric Dual Contrastive Alignment with Structure-Preserving and Information-Balanced Regularization
Source: arXiv - 2604.16247v1
Overview
The paper introduces HILBERT, a new multimodal framework that learns joint audio‑text embeddings from long, segmented documents—think podcasts, lecture recordings, or video subtitles—while working with limited training data. By cleverly combining frozen pre‑trained speech and language models with a dual‑contrastive alignment strategy, HILBERT produces robust, balanced representations that outperform existing methods on highly imbalanced classification tasks.
Key Contributions
- Hierarchical Long‑Sequence Fusion: Uses cross‑modal attention and self‑attentive pooling to merge segment‑level audio and text features into a single document‑level embedding.
- Reciprocal Dual Contrastive Alignment: Aligns each modality (audio, text) to a shared joint space independently (audio→joint, text→joint) rather than forcing a direct audio‑text match, which mitigates dimensionality imbalance.
- Structure‑Preserving Regularizer (CKA): Enforces that the internal geometry of each modality is retained after projection into the joint space.
- Information‑Balanced Regularizer: Equalizes the amount of information contributed by audio and text, preventing one modality from dominating the joint representation.
- Mixture‑of‑Experts (MoE) Classifier: Handles heterogeneous label sets by dynamically weighting expert predictions based on the concatenated audio, text, and joint embeddings.
- Low‑Resource Viability: Demonstrates strong performance even when only a small amount of labeled multimodal data is available.
Methodology
- Feature Extraction:
- Audio segments are passed through a frozen pre‑trained speech encoder (e.g., wav2vec 2.0).
- Text segments (transcripts, subtitles) are encoded with a frozen language model (e.g., BERT).
- Both encoders output fixed‑size vectors per segment, preserving the original temporal order.
- Hierarchical Fusion:
- Cross‑Modal Attention: Each audio segment attends to all text segments (and vice‑versa), allowing the model to capture cross‑modal cues such as “the speaker says X while a music cue occurs.”
- Self‑Attentive Pooling: The attended segment vectors are aggregated into a single representation per modality, producing audio‑only and text‑only document embeddings.
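The two fusion steps above can be sketched in NumPy. This is a minimal, single-head illustration under assumed shapes; the function names, dimensions, and scoring vector `w` are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(queries, keys, values):
    """Each segment of one modality attends over all segments of the
    other modality. Shapes: (n_q, d), (n_k, d), (n_k, d)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)      # (n_q, n_k)
    return softmax(scores, axis=-1) @ values    # (n_q, d)

def self_attentive_pooling(segments, w):
    """Collapse attended segment vectors into one document-level vector
    using a learned scoring vector w of shape (d,)."""
    alpha = softmax(segments @ w)               # (n_seg,)
    return alpha @ segments                     # (d,)

rng = np.random.default_rng(0)
audio_seg = rng.standard_normal((6, 32))   # 6 audio segments, dim 32
text_seg = rng.standard_normal((4, 32))    # 4 text segments, dim 32
w = rng.standard_normal(32)

audio_attended = cross_modal_attention(audio_seg, text_seg, text_seg)
doc_audio = self_attentive_pooling(audio_attended, w)
```

The same two calls, with the roles of audio and text swapped, produce the text-side document embedding.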
- Joint Embedding Construction:
- The modality‑specific embeddings are concatenated and passed through a lightweight transformer that outputs a joint cross‑attentive embedding.
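A compact sketch of this construction, with a single affine layer standing in for the paper's lightweight transformer (the dimensions and `tanh` nonlinearity are assumptions for illustration):

```python
import numpy as np

def joint_embedding(h_audio, h_text, W, b):
    """Concatenate the two document-level embeddings and project them
    into the joint space. A single affine layer stands in here for
    the paper's lightweight transformer."""
    fused = np.concatenate([h_audio, h_text])   # (d_a + d_t,)
    return np.tanh(W @ fused + b)               # (d_joint,)

rng = np.random.default_rng(1)
h_audio = rng.standard_normal(32)
h_text = rng.standard_normal(48)            # modality dims may differ
W = rng.standard_normal((64, 80)) * 0.1     # d_joint=64, input 32+48
b = np.zeros(64)
z_joint = joint_embedding(h_audio, h_text, W, b)
```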
- Reciprocal Dual Contrastive Loss:
- Two contrastive objectives are optimized simultaneously:
- Audio‑to‑Joint: Pulls the audio embedding close to its joint counterpart while pushing away other joint embeddings.
- Text‑to‑Joint: Same idea for text.
- This “reciprocal” setup avoids directly contrasting audio with text, which can be problematic when their dimensionalities differ drastically.
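The two objectives can be written as a pair of InfoNCE terms against the joint embedding, a minimal sketch assuming the modality embeddings have already been projected to the joint dimensionality (the temperature value and batch shapes are illustrative):

```python
import numpy as np

def info_nce(anchors, targets, tau=0.07):
    """One-directional InfoNCE: the i-th anchor should match the
    i-th target; all other targets in the batch are negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = a @ t.T / tau                       # (batch, batch)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def dual_contrastive_loss(z_audio, z_text, z_joint, tau=0.07):
    """Audio->joint plus text->joint; audio and text are never
    contrasted against each other directly."""
    return info_nce(z_audio, z_joint, tau) + info_nce(z_text, z_joint, tau)

rng = np.random.default_rng(2)
z_joint = rng.standard_normal((8, 64))
# well-aligned case: small perturbations of the joint anchors
z_audio = z_joint + 0.01 * rng.standard_normal((8, 64))
z_text = z_joint + 0.01 * rng.standard_normal((8, 64))
aligned = dual_contrastive_loss(z_audio, z_text, z_joint)
shuffled = dual_contrastive_loss(z_audio[::-1], z_text, z_joint)
```

Aligned pairs yield a near-zero loss, while mismatched (shuffled) pairs are penalized heavily.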
- Auxiliary Regularizers:
- CKA Loss: Measures similarity of the internal covariance structure before and after projection, encouraging the joint space to preserve each modality’s geometry.
- Mutual Information Balancing Loss: Estimates the information flow from each modality into the joint space and penalizes imbalances.
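The CKA term can be sketched with the standard linear-CKA formula; a regularizer would then penalize `1 - linear_cka(features_before, features_after)`. This is a generic sketch, not necessarily the paper's exact estimator:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two representations of
    the same n examples (shapes (n, d1), (n, d2)). Equals 1.0 when the
    two geometries match up to rotation and isotropic scaling."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, 'fro') ** 2
    norm_x = np.linalg.norm(X.T @ X, 'fro')
    norm_y = np.linalg.norm(Y.T @ Y, 'fro')
    return hsic / (norm_x * norm_y)

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 32))           # pre-projection features
Q, _ = np.linalg.qr(rng.standard_normal((32, 32)))
same_geometry = linear_cka(X, X @ Q)         # orthogonal map: geometry kept
different = linear_cka(X, rng.standard_normal((100, 32)))
```

An orthogonal rotation of the features leaves linear CKA at exactly 1, which is why it is a natural structure-preservation score for a learned projection.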
- Downstream Classification:
- A Mixture‑of‑Experts classifier receives the concatenation of the three embeddings (audio, text, joint) and learns to weight expert predictions according to the task’s label distribution.
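A minimal MoE sketch with linear experts and a linear gate over the concatenated embeddings; the expert/gate parameterization here is an assumption for illustration, not the paper's exact classifier:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def moe_classify(z_audio, z_text, z_joint, expert_weights, gate_weights):
    """Gate over experts using the concatenated embeddings; each expert
    is a linear classifier here for illustration."""
    x = np.concatenate([z_audio, z_text, z_joint])             # (d_total,)
    gates = softmax(np.array([x @ g for g in gate_weights]))   # (n_experts,)
    expert_probs = np.stack([softmax(x @ W) for W in expert_weights])
    return gates @ expert_probs                                # (n_classes,)

rng = np.random.default_rng(4)
d_total, n_classes, n_experts = 32 + 32 + 64, 5, 3
experts = [rng.standard_normal((d_total, n_classes)) * 0.1
           for _ in range(n_experts)]
gates_w = [rng.standard_normal(d_total) * 0.1 for _ in range(n_experts)]
probs = moe_classify(rng.standard_normal(32), rng.standard_normal(32),
                     rng.standard_normal(64), experts, gates_w)
```

Because the gate weights are a convex combination and each expert outputs a distribution, the mixture is itself a valid class distribution.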
Results & Findings
- Semantic Coherence: t‑SNE visualizations show that HILBERT’s joint embeddings cluster by topic across modalities, confirming successful alignment.
- Classification Gains: On a multi‑class, highly imbalanced audio‑text dataset (e.g., podcast genre classification), HILBERT improves macro‑F1 by 8–12% over strong baselines such as simple concatenation or single‑contrastive alignment.
- Robustness to Low Data: With only 10% of the labeled data, performance degrades by less than 3%, indicating that the frozen encoders and regularizers effectively leverage pre‑learned knowledge.
- Ablation Insights: Removing either the CKA or the information‑balancing loss drops F1 by ~4%, confirming that both regularizers are essential for stable long‑sequence fusion.
Practical Implications
- Podcast & Video Analytics: Companies can build more accurate genre, sentiment, or content‑tagging pipelines without needing massive labeled corpora.
- Assistive Tech: Speech‑to‑text systems for the hearing impaired can benefit from richer joint embeddings that capture context across modalities.
- Content Recommendation: Balanced audio‑text representations enable better similarity search (e.g., “find videos with similar spoken content and background music”).
- Low‑Resource Languages: Because HILBERT relies on frozen encoders, it can be adapted to languages where labeled multimodal data is scarce, simply by swapping in a language‑specific speech or text model.
- Modular Integration: The architecture is plug‑and‑play; developers can replace the backbone encoders (e.g., use Whisper for audio) without redesigning the alignment mechanism.
Limitations & Future Work
- Dependence on Pre‑trained Encoders: Quality of the joint embedding is bounded by the capabilities of the frozen speech and language models; domain‑specific vocabularies may still suffer.
- Computational Overhead: Cross‑modal attention across long sequences can be memory‑intensive; scaling to hour‑long recordings may require segment‑wise batching or efficient attention tricks.
- Limited Evaluation Domains: Experiments focus on a few benchmark datasets; broader testing on diverse media (e.g., movies, meetings) would strengthen claims.
- Future Directions:
- Explore learnable adapters on top of frozen encoders to fine‑tune domain nuances.
- Incorporate multimodal transformers that handle variable‑length inputs more efficiently (e.g., Longformer, Performer).
- Extend the MoE classifier to multi‑task settings where audio‑text embeddings serve downstream tasks like summarization or question answering.
HILBERT demonstrates that thoughtful alignment and regularization can unlock high‑quality multimodal representations even when data is scarce—a promising step toward more intelligent, context‑aware audio‑text applications.
Authors
- Habibeh Naderi
- Behrouz Haji Soleimani
- Stan Matwin
Paper Information
- arXiv ID: 2604.16247v1
- Categories: cs.LG, cs.AI
- Published: April 17, 2026