[Paper] A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations
Source: arXiv - 2602.23300v1
Overview
Emotion Recognition in Conversations (ERC) sits at the intersection of natural language processing, speech processing, and affective computing. The paper introduces MiSTER‑E, a modular Mixture‑of‑Experts (MoE) architecture that treats speech and text as separate “experts” while also learning a cross‑modal expert. By decoupling modality‑specific context modeling from multimodal fusion, the authors achieve state‑of‑the‑art performance on three widely used ERC benchmarks without ever using speaker identity.
Key Contributions
- Mixture‑of‑Experts framework for ERC: three experts (speech‑only, text‑only, and cross‑modal) whose outputs are combined by a learned gating network.
- LLM‑backed utterance embeddings: large language models fine‑tuned on speech and text provide rich, contextual representations before temporal modeling.
- Convolution‑recurrent context layer: captures the flow of dialogue across turns while preserving modality‑specific nuances.
- Supervised contrastive loss: explicitly aligns paired speech‑text embeddings, encouraging the two modalities to speak the same “emotional language.”
- KL‑divergence regularisation across experts: forces the three experts to stay consistent, reducing over‑reliance on any single modality.
- Speaker‑agnostic design: the system works without speaker IDs, making it applicable to anonymised or multi‑speaker settings.
- Strong empirical results: weighted F1 scores of 70.9 % (IEMOCAP), 69.5 % (MELD), and 87.9 % (MOSI), surpassing prior speech‑text ERC baselines.
Methodology
Embedding extraction
- Speech: a pre‑trained self‑supervised speech encoder (e.g., wav2vec 2.0), fine‑tuned on emotion‑annotated speech, yields an utterance‑level vector.
- Text: a transformer‑based language model (e.g., BERT), fine‑tuned on the same task, provides a complementary textual vector.
Context modeling
- Each modality’s sequence of utterance embeddings passes through a 1‑D convolution (captures local turn‑to‑turn patterns) followed by a bidirectional GRU (captures longer‑range dependencies).
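This convolution‑recurrent layer can be sketched as follows. The `ContextLayer` name, all dimensions, and the ReLU between the two stages are illustrative assumptions (the paper does not specify them here); the sketch assumes a PyTorch implementation:

```python
import torch
import torch.nn as nn

class ContextLayer(nn.Module):
    """Illustrative sketch of the convolution-recurrent context layer:
    a 1-D convolution over the turn axis (local turn-to-turn patterns)
    followed by a bidirectional GRU (longer-range dependencies)."""
    def __init__(self, emb_dim=256, hidden=128, kernel=3):
        super().__init__()
        # Convolve along the dialogue-turn axis; padding keeps the length.
        self.conv = nn.Conv1d(emb_dim, emb_dim, kernel, padding=kernel // 2)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):  # x: (batch, turns, emb_dim)
        # Conv1d expects (batch, channels, turns), so transpose in and out.
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        out, _ = self.gru(torch.relu(h))
        return out  # (batch, turns, 2 * hidden)

utterances = torch.randn(4, 10, 256)  # 4 dialogues, 10 turns each
ctx = ContextLayer()(utterances)
print(ctx.shape)  # torch.Size([4, 10, 256])
```

One stream of this shape is produced per modality, so the speech and text contexts stay separate until the expert heads.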
Expert heads
- Speech‑only expert: predicts emotion from the speech‑context stream.
- Text‑only expert: predicts from the text‑context stream.
- Cross‑modal expert: concatenates the two streams, passes them through a small feed‑forward network, and outputs a joint prediction.
Dynamic gating
- A lightweight gating network ingests the three expert logits and learns a soft weighting (via a softmax) that varies per utterance, effectively deciding “which expert to trust more” in each context.
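A minimal sketch of such a gating step, assuming a single linear layer (`W`, `b`) over the concatenated expert logits; the weight shapes and the exact combination rule are illustrative, not the paper's design:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gate_and_combine(speech_logits, text_logits, cross_logits, W, b):
    """Hypothetical gating sketch: a linear layer over the concatenated
    expert logits yields one weight per expert (softmax-normalised per
    utterance), and the output is the gate-weighted sum of the experts'
    class distributions."""
    feats = np.concatenate([speech_logits, text_logits, cross_logits], axis=-1)
    gates = softmax(feats @ W + b)                        # (batch, 3)
    experts = np.stack([softmax(speech_logits),
                        softmax(text_logits),
                        softmax(cross_logits)], axis=1)   # (batch, 3, classes)
    return (gates[..., None] * experts).sum(axis=1)       # (batch, classes)

rng = np.random.default_rng(0)
s, t, c = (rng.normal(size=(2, 4)) for _ in range(3))
W, b = rng.normal(size=(12, 3)) * 0.1, np.zeros(3)
probs = gate_and_combine(s, t, c, W, b)  # rows sum to 1
```

Because the gates are recomputed per utterance, the model can lean on the speech expert for a sarcastic turn and on the text expert for a noisy audio segment, for example.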
Training objectives
- Cross‑entropy for the primary emotion classification.
- Supervised contrastive loss on paired speech‑text embeddings to pull together representations of the same emotion and push apart different emotions.
- KL‑divergence regularisation between the three expert output distributions to keep them aligned while still allowing specialization.
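The two auxiliary objectives can be sketched as below. Both functions are simplified illustrations (a single-temperature contrastive term and a pairwise symmetric KL), not the paper's exact formulation:

```python
import numpy as np

def sup_con_loss(speech_emb, text_emb, labels, tau=0.1):
    """Simplified supervised contrastive term over paired speech/text
    embeddings: anchors are pulled toward all same-emotion embeddings
    (across both modalities) and pushed from the rest."""
    z = np.concatenate([speech_emb, text_emb])            # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    y = np.concatenate([labels, labels])
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                        # exclude self-pairs
    log_p = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    loss = 0.0
    for i in range(len(y)):
        pos = (y == y[i]) & (np.arange(len(y)) != i)      # same-emotion pairs
        loss -= log_p[i, pos].mean()
    return loss / len(y)

def kl_consistency(p, q, eps=1e-9):
    """Symmetric KL between two experts' output distributions (batch, C);
    applied pairwise, it keeps the experts' predictions aligned."""
    p, q = p + eps, q + eps
    return 0.5 * ((p * np.log(p / q)).sum(1) + (q * np.log(q / p)).sum(1)).mean()
```

In training, these terms would be added to the cross-entropy loss with weighting coefficients; the source does not state the coefficients, so none are shown here.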
Inference
- The final prediction is the gate‑weighted mixture of the expert distributions; the emotion label is the argmax of that mixture.
Results & Findings
| Dataset | Weighted F1 (MiSTER‑E) | Prior Best* |
|---|---|---|
| IEMOCAP | 70.9 % | 68.2 % |
| MELD | 69.5 % | 66.7 % |
| MOSI | 87.9 % | 85.3 % |
*Prior best refers to the strongest published speech‑text ERC baseline.
- Ablation studies show that removing the contrastive loss drops F1 by ~2 pts, while disabling the gating network reduces performance by ~3 pts, confirming each component’s contribution.
- The cross‑modal expert alone is weaker than the gated combination, highlighting the benefit of letting the model decide per‑utterance which modality dominates.
- The system remains robust when speaker IDs are omitted, unlike many earlier ERC models that rely on speaker turn information.
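For reference, the weighted F1 reported in the table above averages per‑class F1 scores weighted by each class's support (its true‑label count); a minimal NumPy implementation:

```python
import numpy as np

def weighted_f1(y_true, y_pred):
    """Weighted F1: per-class F1, averaged with weights proportional
    to each class's share of the true labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total, score = len(y_true), 0.0
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (np.sum(y_true == c) / total) * f1
    return score

print(round(weighted_f1([0, 0, 1, 1, 2], [0, 1, 1, 1, 2]), 3))  # 0.787
```

This matches `sklearn.metrics.f1_score(..., average="weighted")` and is the standard choice for class‑imbalanced ERC benchmarks such as MELD.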
Practical Implications
- Customer‑service bots: Real‑time emotion detection from both voice and transcribed text can enable more empathetic responses without needing to track who is speaking.
- Call‑center analytics: Aggregating speech‑text emotion scores across calls can surface trends (e.g., rising frustration) while respecting privacy (no speaker IDs required).
- Multimodal UI/UX: Apps that capture both spoken commands and chat messages can adapt UI elements (color, tone) based on the inferred emotional state.
- Healthcare tele‑monitoring: Detecting emotional cues from patient‑doctor video calls can flag potential mental‑health concerns early, even when only audio or text streams are available.
- Developer-friendly integration: Because each expert is a self‑contained module, teams can swap in their own speech or text encoders (e.g., Whisper, GPT‑4) without redesigning the whole pipeline.
Limitations & Future Work
- Dataset bias: The three benchmarks are relatively small and domain‑specific (acted dialogues, TV shows, product reviews). Generalisation to noisy, real‑world call data remains to be proven.
- Compute cost: Fine‑tuning large speech and text LLMs, plus running three experts and a gating network, can be resource‑intensive for edge deployments.
- Speaker‑agnostic trade‑off: While removing speaker IDs improves privacy, it may discard useful turn‑taking cues that could boost accuracy in settings where speaker information is safe to use.
- Future directions suggested by the authors include: scaling the MoE to more than three experts (e.g., adding visual cues), exploring lightweight distillation for on‑device inference, and testing the framework on multilingual ERC tasks.
Authors
- Soumya Dutta
- Smruthi Balaji
- Sriram Ganapathy
Paper Information
- arXiv ID: 2602.23300v1
- Categories: cs.CL, eess.AS
- Published: February 26, 2026