[Paper] StutterFuse: Mitigating Modality Collapse in Stuttering Detection with Jaccard-Weighted Metric Learning and Gated Fusion

Published: December 15, 2025 at 01:28 PM EST
4 min read

Source: arXiv - 2512.13632v1

Overview

StutterFuse is the first retrieval‑augmented classifier designed for multi‑label stuttering detection. By pulling in real clinical examples from a non‑parametric memory bank, the model classifies speech based on reference patterns rather than trying to memorize every possible disfluency combination—an especially tough problem when multiple stutters overlap.

Key Contributions

  • Retrieval‑Augmented Classification (RAC) for speech pathology – introduces a memory‑based “look‑up” mechanism to a Conformer encoder, a first for stuttering detection.
  • Identification of “Modality Collapse” – a phenomenon where naive retrieval inflates recall but harms precision, akin to an echo chamber.
  • SetCon loss – a Jaccard‑weighted metric‑learning objective that directly optimizes multi‑label set similarity, mitigating collapse (a concrete Jaccard example follows this list).
  • Gated Mixture‑of‑Experts fusion – dynamically balances acoustic evidence with retrieved examples, improving overall decision quality.
  • Strong empirical gains – weighted F1 of 0.65 on the SEP‑28k benchmark, surpassing prior state‑of‑the‑art models and showing zero‑shot cross‑lingual robustness.
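
To make the "multi‑label set similarity" idea behind SetCon concrete, here is a tiny, illustrative Jaccard computation between two annotated clips. The label names are placeholders chosen for this sketch, not necessarily the paper's exact taxonomy.

```python
# Illustrative only: Jaccard similarity between two multi-label annotations.
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|: how much two label sets overlap (1.0 = identical)."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

clip_1 = {"block", "prolongation"}          # placeholder disfluency labels
clip_2 = {"block", "sound_repetition"}

print(jaccard(clip_1, clip_2))              # 1/3 ≈ 0.33: a partial positive pair
```

SetCon uses this kind of graded overlap, rather than a hard positive/negative split, to decide how strongly two samples should be pulled together in embedding space.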

Methodology

  1. Base Encoder – A Conformer (convolution‑augmented transformer) processes the raw audio waveform into a high‑level acoustic representation.
  2. Memory Bank – A non‑parametric store of annotated clinical utterances (audio + label sets) is built from the training corpus. During inference, the encoder queries this bank with a similarity search (e.g., cosine distance) to retrieve the k most relevant examples.
  3. SetCon Loss – Instead of the usual cross‑entropy, the model is trained with a set‑based contrastive loss. For each training sample, the Jaccard similarity between its true label set and the label sets of retrieved neighbors is computed; the loss pushes the encoder to bring high‑Jaccard pairs closer together and low‑Jaccard pairs farther apart.
  4. Gated Fusion – A lightweight gating network decides, per time‑step, how much weight to give the encoder’s acoustic logits versus the logits derived from the retrieved examples (treated as a “soft label” distribution). This mixture‑of‑experts approach prevents the model from over‑relying on either source; a code sketch of steps 2–4 follows this list.
  5. Training Pipeline – The whole system (encoder + gate) is end‑to‑end differentiable; the memory bank is frozen after the first epoch to keep retrieval stable while the encoder learns to align with it.
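
The sketch below illustrates steps 2–4 under stated assumptions: the tensor shapes, temperature, k, and the exact form of the loss and gate are placeholders chosen for clarity, not the paper's actual implementation.

```python
# Minimal, illustrative sketch of memory-bank retrieval, a Jaccard-weighted
# set-contrastive (SetCon-style) loss, and gated fusion. Shapes and
# hyperparameters are assumptions for illustration only.
import torch
import torch.nn.functional as F


def jaccard_matrix(labels_a: torch.Tensor, labels_b: torch.Tensor) -> torch.Tensor:
    """Pairwise Jaccard similarity between multi-hot label matrices.
    labels_a: (B, C) floats in {0, 1}; labels_b: (M, C) -> output (B, M)."""
    inter = labels_a @ labels_b.T                                    # |A ∩ B|
    union = labels_a.sum(1, keepdim=True) + labels_b.sum(1) - inter  # |A ∪ B|
    return inter / union.clamp(min=1e-8)


def setcon_loss(embeds: torch.Tensor, labels: torch.Tensor,
                temperature: float = 0.1) -> torch.Tensor:
    """Jaccard-weighted contrastive objective: high-Jaccard pairs are pulled
    together, low-Jaccard pairs pushed apart (one plausible realization)."""
    z = F.normalize(embeds, dim=-1)                      # (B, D) unit embeddings
    logits = z @ z.T / temperature                       # pairwise cosine logits
    weights = jaccard_matrix(labels, labels)             # graded positive weights
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(eye, float("-inf"))      # ignore self-pairs
    weights = weights.masked_fill(eye, 0.0)
    targets = weights / weights.sum(1, keepdim=True).clamp(min=1e-8)
    return -(targets * F.log_softmax(logits, dim=1)).sum(1).mean()


def retrieve(query: torch.Tensor, bank_embeds: torch.Tensor,
             bank_labels: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Cosine top-k lookup in the frozen memory bank; returns a soft label
    distribution (similarity-weighted average of the neighbors' label sets)."""
    q = F.normalize(query, dim=-1)                       # (B, D)
    b = F.normalize(bank_embeds, dim=-1)                 # (M, D)
    top_sim, idx = (q @ b.T).topk(k, dim=-1)             # (B, k) nearest neighbors
    w = F.softmax(top_sim, dim=-1).unsqueeze(-1)         # (B, k, 1) weights
    return (w * bank_labels[idx]).sum(1)                 # (B, C) soft labels


class GatedFusion(torch.nn.Module):
    """Lightweight gate that mixes acoustic logits with retrieval evidence."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.gate = torch.nn.Linear(dim, num_classes)

    def forward(self, features, acoustic_logits, retrieval_probs):
        g = torch.sigmoid(self.gate(features))           # per-class mixing weight
        retrieval_logits = torch.logit(retrieval_probs.clamp(1e-4, 1 - 1e-4))
        return g * acoustic_logits + (1 - g) * retrieval_logits
```

In this sketch the gate mixes the two evidence sources per class and per utterance; the paper describes gating per time‑step, and the loss above is only one plausible way to turn Jaccard overlap into a contrastive target.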

Results & Findings

Model                                   Weighted F1   Precision   Recall
Baseline Conformer (CE)                 0.58          0.61        0.55
Conformer + naive retrieval             0.62          0.55        0.71
StutterFuse (SetCon + Gated Fusion)     0.65          0.63        0.68

  • Modality Collapse mitigated – naive retrieval boosted recall dramatically but dropped precision; the gated fusion restored balance.
  • Zero‑shot cross‑lingual test (German & Mandarin samples) retained an F1 ≈ 0.60, confirming that the memory‑based reasoning generalizes beyond the English training set.
  • Ablation shows SetCon alone improves F1 by +0.03, while the gated fusion adds another +0.02 on top of that.

Practical Implications

  • Clinical Decision Support – Speech‑language pathologists can get more reliable multi‑label stutter annotations, especially for complex utterances where multiple disfluencies co‑occur.
  • Low‑Resource Languages – Because the model leans on retrieved examples rather than massive language‑specific training data, it can be adapted quickly to new languages or dialects with only a handful of annotated recordings.
  • Edge Deployment – The retrieval step can be pre‑computed and cached; the gating network adds negligible overhead, making StutterFuse feasible for on‑device or tele‑health applications.
  • Beyond Stuttering – The same RAC + SetCon + gated‑fusion recipe could be transplanted to other multi‑label audio tasks (e.g., cough classification, emotion detection) where rare label combinations are a bottleneck.

Limitations & Future Work

  • Memory Scalability – The current implementation stores all training examples; scaling to millions of recordings will require approximate nearest‑neighbor indexing or hierarchical memory structures (a brief ANN sketch follows this list).
  • Label Granularity – The SEP‑28k taxonomy is relatively coarse; finer‑grained disfluency types may need richer annotation schemes and possibly hierarchical retrieval.
  • Real‑World Noise – Experiments were conducted on relatively clean clinical recordings; robustness to background noise and far‑field microphones remains to be validated.
  • User Interaction – Future versions could expose the retrieved examples to clinicians for verification, turning the system into an interactive “retrieval‑augmented annotation tool.”
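
On the memory‑scalability point, the exhaustive similarity search could in principle be swapped for an approximate nearest‑neighbor index. The sketch below uses FAISS with assumed dimensions and index parameters; it is an illustration, not part of the paper.

```python
# Illustrative only: replacing exhaustive memory-bank search with an approximate
# nearest-neighbor (ANN) index via FAISS. Dimensions and parameters are assumed.
import faiss
import numpy as np

d = 256                                              # embedding size (assumed)
bank = np.random.rand(100_000, d).astype("float32")  # stand-in bank embeddings
faiss.normalize_L2(bank)                             # cosine sim via inner product

quantizer = faiss.IndexFlatIP(d)                     # coarse quantizer
index = faiss.IndexIVFFlat(quantizer, d, 1024, faiss.METRIC_INNER_PRODUCT)
index.train(bank)                                    # learn the 1024 IVF cells
index.add(bank)
index.nprobe = 16                                    # cells probed per query

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
sims, ids = index.search(query, 8)                   # top-8 approximate neighbors
```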

StutterFuse demonstrates that blending modern neural encoders with a well‑designed retrieval component can overcome data scarcity in pathological speech, opening the door to more accurate, adaptable, and explainable detection systems.

Authors

  • Guransh Singh
  • Md Shah Fahad

Paper Information

  • arXiv ID: 2512.13632v1
  • Categories: cs.LG
  • Published: December 15, 2025