[Paper] A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations

Published: February 26, 2026 at 01:08 PM EST
5 min read

Source: arXiv - 2602.23300v1

Overview

Emotion Recognition in Conversations (ERC) sits at the intersection of natural language processing, speech processing, and affective computing. The paper introduces MiSTER‑E, a modular Mixture‑of‑Experts (MoE) architecture that treats speech and text as separate “experts” while also learning a cross‑modal expert. By decoupling modality‑specific context modeling from multimodal fusion, the authors achieve state‑of‑the‑art performance on three widely used ERC benchmarks without ever using speaker identity.

Key Contributions

  • Mixture‑of‑Experts framework for ERC: three experts (speech‑only, text‑only, and cross‑modal) whose outputs are combined by a learned gating network.
  • Pre‑trained‑model‑backed utterance embeddings: large pre‑trained speech and text models, fine‑tuned for emotion recognition, provide rich contextual representations before temporal modeling.
  • Convolution‑recurrent context layer: captures the flow of dialogue across turns while preserving modality‑specific nuances.
  • Supervised contrastive loss: explicitly aligns paired speech‑text embeddings, encouraging the two modalities to speak the same “emotional language.”
  • KL‑divergence regularisation across experts: forces the three experts to stay consistent, reducing over‑reliance on any single modality.
  • Speaker‑agnostic design: the system works without speaker IDs, making it applicable to anonymised or multi‑speaker settings.
  • Strong empirical results: weighted F1 scores of 70.9 % (IEMOCAP), 69.5 % (MELD), and 87.9 % (MOSI), surpassing prior speech‑text ERC baselines.

Methodology

  1. Embedding extraction

    • Speech: a pre‑trained self‑supervised speech encoder (e.g., wav2vec 2.0) fine‑tuned on emotion‑annotated speech yields an utterance‑level vector.
    • Text: a pre‑trained transformer language model (e.g., BERT) fine‑tuned on the same task provides a complementary textual vector.
  2. Context modeling

    • Each modality’s sequence of utterance embeddings passes through a 1‑D convolution (captures local turn‑to‑turn patterns) followed by a bidirectional GRU (captures longer‑range dependencies).
  3. Expert heads

    • Speech‑only expert: predicts emotion from the speech‑context stream.
    • Text‑only expert: predicts from the text‑context stream.
    • Cross‑modal expert: concatenates the two streams, passes them through a small feed‑forward network, and outputs a joint prediction.
  4. Dynamic gating

    • A lightweight gating network ingests the three expert logits and learns a soft weighting (via a softmax) that varies per utterance, effectively deciding “which expert to trust more” in each context.
  5. Training objectives

    • Cross‑entropy for the primary emotion classification.
    • Supervised contrastive loss on paired speech‑text embeddings to pull together representations of the same emotion and push apart different emotions.
    • KL‑divergence regularisation between the three expert output distributions to keep them aligned while still allowing specialization.
  6. Inference

    • The final emotion label is the weighted sum of expert predictions, as dictated by the gating network.
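Steps 4 and 6 above can be sketched numerically. This is a minimal NumPy illustration, not the paper's implementation: the random per‑expert logits and the single linear gating layer (`W_gate`) are stand‑ins for the trained expert heads and gating network.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_utts, n_classes = 4, 6  # 4 utterances, 6 emotion classes

# Stand-in logits from the three experts (speech-only, text-only, cross-modal).
speech_logits = rng.normal(size=(n_utts, n_classes))
text_logits = rng.normal(size=(n_utts, n_classes))
cross_logits = rng.normal(size=(n_utts, n_classes))

# Gating network (here a single hypothetical linear layer): ingests the
# concatenated expert logits and emits one soft weight per expert, per utterance.
W_gate = rng.normal(size=(3 * n_classes, 3)) * 0.1
gate_input = np.concatenate([speech_logits, text_logits, cross_logits], axis=1)
gate = softmax(gate_input @ W_gate)  # (n_utts, 3); each row sums to 1

# Inference: gate-weighted sum of the experts' class distributions.
expert_probs = np.stack(
    [softmax(speech_logits), softmax(text_logits), softmax(cross_logits)], axis=1
)  # (n_utts, 3, n_classes)
final_probs = (gate[..., None] * expert_probs).sum(axis=1)
pred = final_probs.argmax(axis=1)
```

Because the softmax gate varies per utterance, the model can lean on the speech expert for an emotionally charged vocal delivery and on the text expert when the wording carries the signal.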
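The auxiliary objectives in step 5 can likewise be sketched. A minimal NumPy version, assuming a standard supervised contrastive formulation and an elementwise KL term; the embeddings, labels, and temperature `tau` below are illustrative, not values from the paper.

```python
import numpy as np

def log_softmax_rows(s):
    m = s.max(axis=1, keepdims=True)
    return s - m - np.log(np.exp(s - m).sum(axis=1, keepdims=True))

def supcon_loss(speech_emb, text_emb, labels, tau=0.1):
    """Supervised contrastive loss over paired speech/text embeddings:
    embeddings sharing an emotion label are pulled together, others pushed apart."""
    z = np.concatenate([speech_emb, text_emb], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity space
    y = np.concatenate([labels, labels])
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)  # exclude self-similarity
    log_prob = log_softmax_rows(sim)
    pos = (y[:, None] == y[None, :]) & ~np.eye(len(y), dtype=bool)
    # Per-anchor loss: mean negative log-probability over that anchor's positives.
    return -np.where(pos, log_prob, 0.0).sum(axis=1) / pos.sum(axis=1)

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) along the last axis, for expert-consistency regularisation."""
    return (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)

rng = np.random.default_rng(1)
labels = np.array([0, 0, 1, 1])  # two utterances per emotion class
speech_emb = rng.normal(size=(4, 16))
text_emb = rng.normal(size=(4, 16))
contrastive = supcon_loss(speech_emb, text_emb, labels).mean()

# Identical expert distributions incur zero consistency penalty.
p_speech = np.full((4, 6), 1 / 6)
p_text = np.full((4, 6), 1 / 6)
consistency = kl_divergence(p_speech, p_text).mean()
```

In training, terms like these would be added to the cross‑entropy objective with weighting coefficients; the KL term keeps the experts' output distributions close without forcing them to be identical.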

Results & Findings

| Dataset | Weighted F1 (MiSTER‑E) | Prior Best* |
|---------|------------------------|-------------|
| IEMOCAP | 70.9 %                 | 68.2 %      |
| MELD    | 69.5 %                 | 66.7 %      |
| MOSI    | 87.9 %                 | 85.3 %      |

*Prior best refers to the strongest published speech‑text ERC baseline.

  • Ablation studies show that removing the contrastive loss drops F1 by ~2 pts, while disabling the gating network reduces performance by ~3 pts, confirming each component’s contribution.
  • The cross‑modal expert alone is weaker than the gated combination, highlighting the benefit of letting the model decide per‑utterance which modality dominates.
  • The system remains robust when speaker IDs are omitted, unlike many earlier ERC models that rely on speaker turn information.

Practical Implications

  • Customer‑service bots: Real‑time emotion detection from both voice and transcribed text can enable more empathetic responses without needing to track who is speaking.
  • Call‑center analytics: Aggregating speech‑text emotion scores across calls can surface trends (e.g., rising frustration) while respecting privacy (no speaker IDs required).
  • Multimodal UI/UX: Apps that capture both spoken commands and chat messages can adapt UI elements (color, tone) based on the inferred emotional state.
  • Healthcare tele‑monitoring: Detecting emotional cues from patient‑doctor video calls can flag potential mental‑health concerns early, even when only audio or text streams are available.
  • Developer-friendly integration: Because each expert is a self‑contained module, teams can swap in their own speech or text encoders (e.g., Whisper, GPT‑4) without redesigning the whole pipeline.
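That last point, swappable encoders behind a stable interface, can be expressed as a narrow contract. A hypothetical Python sketch: `UtteranceEncoder`, `BagOfCharsEncoder`, and `embed_dialogue` are illustrative names, and the toy encoder merely stands in for a real speech or text model.

```python
from typing import List, Protocol

class UtteranceEncoder(Protocol):
    """Anything that maps one utterance to a fixed-size vector can slot in."""
    def encode(self, utterance: str) -> List[float]: ...

class BagOfCharsEncoder:
    """Toy stand-in for a real encoder; a production system might wrap
    Whisper for speech or a BERT-style model for text behind this interface."""
    def __init__(self, dim: int = 8):
        self.dim = dim

    def encode(self, utterance: str) -> List[float]:
        vec = [0.0] * self.dim
        for ch in utterance.lower():
            vec[ord(ch) % self.dim] += 1.0
        return vec

def embed_dialogue(encoder: UtteranceEncoder, turns: List[str]) -> List[List[float]]:
    """Downstream context and expert layers only see vectors, so swapping
    encoders never touches the rest of the pipeline."""
    return [encoder.encode(t) for t in turns]

embeddings = embed_dialogue(BagOfCharsEncoder(), ["I'm fine.", "Are you sure?"])
```

Any encoder satisfying the protocol, regardless of its internals, plugs into the same downstream context modeling and expert heads.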

Limitations & Future Work

  • Dataset bias: The three benchmarks are relatively small and domain‑specific (acted dialogues, TV shows, product reviews). Generalisation to noisy, real‑world call data remains to be proven.
  • Compute cost: Fine‑tuning large speech and text LLMs, plus running three experts and a gating network, can be resource‑intensive for edge deployments.
  • Speaker‑agnostic trade‑off: While removing speaker IDs improves privacy, it may discard useful turn‑taking cues that could boost accuracy in settings where speaker information is safe to use.
  • Future directions suggested by the authors include: scaling the MoE to more than three experts (e.g., adding visual cues), exploring lightweight distillation for on‑device inference, and testing the framework on multilingual ERC tasks.

Authors

  • Soumya Dutta
  • Smruthi Balaji
  • Sriram Ganapathy

Paper Information

  • arXiv ID: 2602.23300v1
  • Categories: cs.CL, eess.AS
  • Published: February 26, 2026