[Paper] A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations
Source: arXiv - 2602.23300v1
Overview
Emotion Recognition in Conversations (ERC) sits at the intersection of natural language processing, speech processing, and affective computing. The paper introduces MiSTER‑E, a modular Mixture‑of‑Experts (MoE) architecture that treats speech and text as separate “experts” while also learning a cross‑modal expert. By decoupling modality‑specific context modeling from multimodal fusion, the authors achieve state‑of‑the‑art performance on three widely used ERC benchmarks without ever using speaker identity.
Key Contributions
- Mixture‑of‑Experts framework for ERC: three experts (speech‑only, text‑only, and cross‑modal) whose outputs are combined by a learned gating network.
- LLM‑backed utterance embeddings: large language models fine‑tuned on speech and text provide rich, contextual representations before temporal modeling.
- Convolution‑recurrent context layer: captures the flow of dialogue across turns while preserving modality‑specific nuances.
- Supervised contrastive loss: explicitly aligns paired speech‑text embeddings, encouraging the two modalities to speak the same “emotional language.”
- KL‑divergence regularisation across experts: forces the three experts to stay consistent, reducing over‑reliance on any single modality.
- Speaker‑agnostic design: the system works without speaker IDs, making it applicable to anonymised or multi‑speaker settings.
- Strong empirical results: weighted F1 scores of 70.9 % (IEMOCAP), 69.5 % (MELD), and 87.9 % (MOSI), surpassing prior speech‑text ERC baselines.
Methodology
Embedding extraction
- Speech: a pre‑trained self‑supervised speech encoder (e.g., wav2vec 2.0), fine‑tuned on emotion‑annotated speech, yields an utterance‑level vector.
- Text: a transformer‑based language model (e.g., BERT), fine‑tuned on the same task, provides a complementary textual vector.
Context modeling
- Each modality’s sequence of utterance embeddings passes through a 1‑D convolution (captures local turn‑to‑turn patterns) followed by a bidirectional GRU (captures longer‑range dependencies).
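This convolution‑recurrent layer can be sketched as follows. The `ContextLayer` name, all dimensions, and the ReLU between the two stages are illustrative assumptions (the paper does not specify them here); the sketch assumes a PyTorch implementation:

```python
import torch
import torch.nn as nn

class ContextLayer(nn.Module):
    """Illustrative sketch of the convolution-recurrent context layer:
    a 1-D convolution over the turn axis (local turn-to-turn patterns)
    followed by a bidirectional GRU (longer-range dependencies)."""
    def __init__(self, emb_dim=256, hidden=128, kernel=3):
        super().__init__()
        # Convolve along the dialogue-turn axis; padding keeps the length.
        self.conv = nn.Conv1d(emb_dim, emb_dim, kernel, padding=kernel // 2)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):  # x: (batch, turns, emb_dim)
        # Conv1d expects (batch, channels, turns), so transpose in and out.
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        out, _ = self.gru(torch.relu(h))
        return out  # (batch, turns, 2 * hidden)

utterances = torch.randn(4, 10, 256)  # 4 dialogues, 10 turns each
ctx = ContextLayer()(utterances)
print(ctx.shape)  # torch.Size([4, 10, 256])
```

One stream of this shape is produced per modality, so the speech and text contexts stay separate until the expert heads.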
Expert heads
- Speech‑only expert: predicts emotion from the speech‑context stream.
- Text‑only expert: predicts from the text‑context stream.
- Cross‑modal expert: concatenates the two streams, passes them through a small feed‑forward network, and outputs a joint prediction.
Dynamic gating
- A lightweight gating network ingests the three expert logits and learns a soft weighting (via a softmax) that varies per utterance, effectively deciding “which expert to trust more” in each context.
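A minimal sketch of such a gating step, assuming a single linear layer (`W`, `b`) over the concatenated expert logits; the weight shapes and the exact combination rule are illustrative, not the paper's design:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gate_and_combine(speech_logits, text_logits, cross_logits, W, b):
    """Hypothetical gating sketch: a linear layer over the concatenated
    expert logits yields one weight per expert (softmax-normalised per
    utterance), and the output is the gate-weighted sum of the experts'
    class distributions."""
    feats = np.concatenate([speech_logits, text_logits, cross_logits], axis=-1)
    gates = softmax(feats @ W + b)                        # (batch, 3)
    experts = np.stack([softmax(speech_logits),
                        softmax(text_logits),
                        softmax(cross_logits)], axis=1)   # (batch, 3, classes)
    return (gates[..., None] * experts).sum(axis=1)       # (batch, classes)

rng = np.random.default_rng(0)
s, t, c = (rng.normal(size=(2, 4)) for _ in range(3))
W, b = rng.normal(size=(12, 3)) * 0.1, np.zeros(3)
probs = gate_and_combine(s, t, c, W, b)  # rows sum to 1
```

Because the gates are recomputed per utterance, the model can lean on the speech expert for a sarcastic turn and on the text expert for a noisy audio segment, for example.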
Training objectives
- Cross‑entropy for the primary emotion classification.
- Supervised contrastive loss on paired speech‑text embeddings to pull together representations of the same emotion and push apart different emotions.
- KL‑divergence regularisation between the three expert output distributions to keep them aligned while still allowing specialization.
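The two auxiliary objectives can be sketched as below. Both functions are simplified illustrations (a single-temperature contrastive term and a pairwise symmetric KL), not the paper's exact formulation:

```python
import numpy as np

def sup_con_loss(speech_emb, text_emb, labels, tau=0.1):
    """Simplified supervised contrastive term over paired speech/text
    embeddings: anchors are pulled toward all same-emotion embeddings
    (across both modalities) and pushed from the rest."""
    z = np.concatenate([speech_emb, text_emb])            # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    y = np.concatenate([labels, labels])
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                        # exclude self-pairs
    log_p = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    loss = 0.0
    for i in range(len(y)):
        pos = (y == y[i]) & (np.arange(len(y)) != i)      # same-emotion pairs
        loss -= log_p[i, pos].mean()
    return loss / len(y)

def kl_consistency(p, q, eps=1e-9):
    """Symmetric KL between two experts' output distributions (batch, C);
    applied pairwise, it keeps the experts' predictions aligned."""
    p, q = p + eps, q + eps
    return 0.5 * ((p * np.log(p / q)).sum(1) + (q * np.log(q / p)).sum(1)).mean()
```

In training, these terms would be added to the cross-entropy loss with weighting coefficients; the source does not state the coefficients, so none are shown here.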
Inference
- The final prediction is the gate‑weighted mixture of the expert distributions; the emotion label is the argmax of that mixture.
Results & Findings
| Dataset | Weighted F1 (MiSTER‑E) | Prior Best* |
|---|---|---|
| IEMOCAP | 70.9 % | 68.2 % |
| MELD | 69.5 % | 66.7 % |
| MOSI | 87.9 % | 85.3 % |
*Prior best refers to the strongest published speech‑text ERC baseline.
- Ablation studies show that removing the contrastive loss drops F1 by ~2 pts, while disabling the gating network reduces performance by ~3 pts, confirming each component’s contribution.
- The cross‑modal expert alone is weaker than the gated combination, highlighting the benefit of letting the model decide per‑utterance which modality dominates.
- The system remains robust when speaker IDs are omitted, unlike many earlier ERC models that rely on speaker turn information.
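For reference, the weighted F1 reported in the table above averages per‑class F1 scores weighted by each class's support (its true‑label count); a minimal NumPy implementation:

```python
import numpy as np

def weighted_f1(y_true, y_pred):
    """Weighted F1: per-class F1, averaged with weights proportional
    to each class's share of the true labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total, score = len(y_true), 0.0
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (np.sum(y_true == c) / total) * f1
    return score

print(round(weighted_f1([0, 0, 1, 1, 2], [0, 1, 1, 1, 2]), 3))  # 0.787
```

This matches `sklearn.metrics.f1_score(..., average="weighted")` and is the standard choice for class‑imbalanced ERC benchmarks such as MELD.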
Practical Implications
- Customer‑service bots: Real‑time emotion detection from both voice and transcribed text can enable more empathetic responses without needing to track who is speaking.
- Call‑center analytics: Aggregating speech‑text emotion scores across calls can surface trends (e.g., rising frustration) while respecting privacy (no speaker IDs required).
- Multimodal UI/UX: Apps that capture both spoken commands and chat messages can adapt UI elements (color, tone) based on the inferred emotional state.
- Healthcare tele‑monitoring: Detecting emotional cues from patient‑doctor video calls can flag potential mental‑health concerns early, even when only audio or text streams are available.
- Developer-friendly integration: Because each expert is a self‑contained module, teams can swap in their own speech or text encoders (e.g., Whisper, GPT‑4) without redesigning the whole pipeline.
Limitations & Future Work
- Dataset bias: The three benchmarks are relatively small and domain‑specific (acted dialogues, TV shows, product reviews). Generalisation to noisy, real‑world call data remains to be proven.
- Compute cost: Fine‑tuning large speech and text LLMs, plus running three experts and a gating network, can be resource‑intensive for edge deployments.
- Speaker‑agnostic trade‑off: While removing speaker IDs improves privacy, it may discard useful turn‑taking cues that could boost accuracy in settings where speaker information is safe to use.
- Future directions suggested by the authors include: scaling the MoE to more than three experts (e.g., adding visual cues), exploring lightweight distillation for on‑device inference, and testing the framework on multilingual ERC tasks.
Authors
- Soumya Dutta
- Smruthi Balaji
- Sriram Ganapathy
Paper Information
- arXiv ID: 2602.23300v1
- Categories: cs.CL, eess.AS
- Published: February 26, 2026