[Paper] StutterFuse: Mitigating Modality Collapse in Stuttering Detection with Jaccard-Weighted Metric Learning and Gated Fusion
Source: arXiv - 2512.13632v1
Overview
StutterFuse is the first retrieval‑augmented classifier designed for multi‑label stuttering detection. By pulling in real clinical examples from a non‑parametric memory bank, the model classifies speech based on reference patterns rather than trying to memorize every possible disfluency combination—an especially tough problem when multiple stutters overlap.
Key Contributions
- Retrieval‑Augmented Classification (RAC) for speech pathology – introduces a memory‑based “look‑up” mechanism to a Conformer encoder, a first for stuttering detection.
- Identification of “Modality Collapse” – a phenomenon where naive retrieval inflates recall but harms precision, akin to an echo chamber.
- SetCon loss – a Jaccard‑weighted metric‑learning objective that directly optimizes multi‑label set similarity, mitigating collapse.
- Gated Mixture‑of‑Experts fusion – dynamically balances acoustic evidence with retrieved examples, improving overall decision quality.
- Strong empirical gains – weighted F1 of 0.65 on the SEP‑28k benchmark, surpassing prior state‑of‑the‑art models and showing zero‑shot cross‑lingual robustness.
Methodology
- Base Encoder – A Conformer (convolution‑augmented transformer) processes the raw audio waveform into a high‑level acoustic representation.
- Memory Bank – A non‑parametric store of annotated clinical utterances (audio + label sets) is built from the training corpus. During inference, the encoder queries this bank with a similarity search (e.g., cosine distance) to retrieve the k most relevant examples (a retrieval sketch follows this list).
- SetCon Loss – Instead of the usual cross‑entropy, the model is trained with a set‑based contrastive loss. For each training sample, the Jaccard similarity between its true label set and the label sets of retrieved neighbors is computed; the loss pulls high‑Jaccard pairs closer together in the embedding space and pushes low‑Jaccard pairs apart (a loss sketch follows this list).
- Gated Fusion – A lightweight gating network decides, per time‑step, how much weight to give the encoder’s acoustic logits versus the logits derived from the retrieved examples (treated as a “soft label” distribution). This mixture‑of‑experts approach prevents the model from over‑relying on either source (a fusion sketch follows this list).
- Training Pipeline – The whole system (encoder + gate) is end‑to‑end differentiable; the memory bank is frozen after the first epoch to keep retrieval stable while the encoder learns to align with it.
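The memory‑bank lookup can be made concrete with a short sketch. The interface below is an illustrative assumption rather than the authors’ code: embeddings are L2‑normalised so a dot product equals cosine similarity, and the k most similar annotated utterances are returned along with their label sets.

```python
import torch
import torch.nn.functional as F

class MemoryBank:
    """Hypothetical non-parametric store of (embedding, label-set) pairs."""

    def __init__(self, embeddings: torch.Tensor, label_sets: torch.Tensor):
        # embeddings: (N, d) encoder outputs for annotated utterances
        # label_sets: (N, C) multi-hot label vectors for those utterances
        self.keys = F.normalize(embeddings, dim=-1)
        self.label_sets = label_sets

    def retrieve(self, query: torch.Tensor, k: int = 8):
        # query: (B, d) utterance-level encoder output
        q = F.normalize(query, dim=-1)
        sims = q @ self.keys.T               # (B, N) cosine similarities
        scores, idx = sims.topk(k, dim=-1)   # k nearest neighbours per query
        return scores, self.label_sets[idx]  # (B, k) scores, (B, k, C) labels
```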
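The SetCon objective can be read as a Jaccard‑weighted contrastive loss over pairs of examples. The sketch below is one plausible formulation under that reading, not the paper’s exact loss: each anchor is pulled toward batch neighbours in proportion to the Jaccard overlap of their label sets and pushed away from low‑overlap ones.

```python
import torch
import torch.nn.functional as F

def jaccard_matrix(labels: torch.Tensor) -> torch.Tensor:
    # labels: (B, C) multi-hot label sets as a float tensor
    inter = labels @ labels.T                                      # |A ∩ B|
    union = labels.sum(-1, keepdim=True) + labels.sum(-1) - inter  # |A ∪ B|
    return inter / union.clamp(min=1)

def setcon_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                temperature: float = 0.1) -> torch.Tensor:
    z = F.normalize(embeddings, dim=-1)
    logits = (z @ z.T) / temperature                      # pairwise similarities
    mask = ~torch.eye(len(z), dtype=torch.bool, device=z.device)
    weights = jaccard_matrix(labels).masked_fill(~mask, 0.0)
    targets = weights / weights.sum(-1, keepdim=True).clamp(min=1e-8)
    log_probs = logits.masked_fill(~mask, float("-inf")).log_softmax(-1)
    # soft cross-entropy: each anchor attends to in-batch neighbours in
    # proportion to the Jaccard overlap of their label sets
    return -(targets * log_probs.masked_fill(~mask, 0.0)).sum(-1).mean()
```

Weighting by Jaccard rather than exact set equality lets partially overlapping label sets act as soft positives, which is the behaviour the summary attributes to SetCon.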
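Finally, a minimal sketch of the gated fusion, simplified to an utterance‑level gate (the paper describes per‑time‑step gating); names and shapes are illustrative assumptions. A sigmoid gate interpolates between the encoder’s own predictions and a similarity‑weighted soft label built from the retrieved neighbours.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Hypothetical two-expert fusion: acoustic head vs. retrieved soft labels."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.acoustic_head = nn.Linear(dim, num_classes)
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, features, neighbor_scores, neighbor_labels):
        # features: (B, d); neighbor_scores: (B, k); neighbor_labels: (B, k, C)
        acoustic_probs = self.acoustic_head(features).sigmoid()   # acoustic expert
        # retrieval expert: similarity-weighted average of neighbour label sets
        w = neighbor_scores.softmax(dim=-1).unsqueeze(-1)         # (B, k, 1)
        retrieved_probs = (w * neighbor_labels.float()).sum(dim=1)  # (B, C)
        g = self.gate(features)                                   # (B, 1) in [0, 1]
        return g * acoustic_probs + (1 - g) * retrieved_probs     # fused probabilities
```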
Results & Findings
| Model | Weighted F1 | Precision | Recall |
|---|---|---|---|
| Baseline Conformer (CE) | 0.58 | 0.61 | 0.55 |
| Conformer + naive retrieval | 0.62 | 0.55 | 0.71 |
| StutterFuse (SetCon + Gated Fusion) | 0.65 | 0.63 | 0.68 |
- Modality Collapse mitigated – naive retrieval boosted recall dramatically but dropped precision; the gated fusion restored balance.
- Zero‑shot cross‑lingual test (German & Mandarin samples) retained an F1 ≈ 0.60, confirming that the memory‑based reasoning generalizes beyond the English training set.
- Ablation shows SetCon alone improves F1 by +0.03, while the gated fusion adds another +0.02 on top of that.
Practical Implications
- Clinical Decision Support – Speech‑language pathologists can get more reliable multi‑label stutter annotations, especially for complex utterances where multiple disfluencies co‑occur.
- Low‑Resource Languages – Because the model leans on retrieved examples rather than massive language‑specific training data, it can be adapted quickly to new languages or dialects with only a handful of annotated recordings.
- Edge Deployment – The retrieval step can be pre‑computed and cached; the gating network adds negligible overhead, making StutterFuse feasible for on‑device or tele‑health applications.
- Beyond Stuttering – The same RAC + SetCon + gated‑fusion recipe could be transplanted to other multi‑label audio tasks (e.g., cough classification, emotion detection) where rare label combinations are a bottleneck.
Limitations & Future Work
- Memory Scalability – The current implementation stores all training examples; scaling to millions of recordings will require approximate nearest‑neighbor indexing or hierarchical memory structures (see the sketch after this list).
- Label Granularity – The SEP‑28k taxonomy is relatively coarse; finer‑grained disfluency types may need richer annotation schemes and possibly hierarchical retrieval.
- Real‑World Noise – Experiments were conducted on relatively clean clinical recordings; robustness to background noise and far‑field microphones remains to be validated.
- User Interaction – Future versions could expose the retrieved examples to clinicians for verification, turning the system into an interactive “retrieval‑augmented annotation tool.”
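As a concrete illustration of the memory‑scalability point above, the sketch below shows how approximate nearest‑neighbour retrieval over a large memory bank might look with FAISS; this is an assumption about one possible remedy, not something evaluated in the paper, and the dimensions and index settings are illustrative.

```python
import numpy as np
import faiss

d = 256                                             # embedding dimension (illustrative)
bank = np.random.rand(100_000, d).astype("float32")  # stand-in for bank embeddings
faiss.normalize_L2(bank)                            # cosine similarity via inner product

quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024, faiss.METRIC_INNER_PRODUCT)
index.train(bank)                                   # learn the coarse clustering
index.add(bank)
index.nprobe = 16                                   # clusters searched per query

query = np.random.rand(1, d).astype("float32")      # stand-in for an encoder output
faiss.normalize_L2(query)
scores, ids = index.search(query, 8)                # approximate top-8 neighbours
```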
StutterFuse demonstrates that blending modern neural encoders with a well‑designed retrieval component can overcome data scarcity in pathological speech, opening the door to more accurate, adaptable, and explainable detection systems.
Authors
- Guransh Singh
- Md Shah Fahad
Paper Information
- arXiv ID: 2512.13632v1
- Categories: cs.LG
- Published: December 15, 2025