[Paper] StutterFuse: Mitigating Modality Collapse in Stuttering Detection with Jaccard-Weighted Metric Learning and Gated Fusion
Source: arXiv - 2512.13632v1
Overview
StutterFuse is the first retrieval‑augmented classifier designed for multi‑label stuttering detection. By pulling in real clinical examples from a non‑parametric memory bank, the model classifies speech based on reference patterns rather than trying to memorize every possible disfluency combination—an especially tough problem when multiple stutters overlap.
Key Contributions
- Retrieval‑Augmented Classification (RAC) for speech pathology – introduces a memory‑based “look‑up” mechanism to a Conformer encoder, a first for stuttering detection.
- Identification of “Modality Collapse” – a phenomenon where naive retrieval inflates recall but harms precision, akin to an echo chamber.
- SetCon loss – a Jaccard‑weighted metric‑learning objective that directly optimizes multi‑label set similarity, mitigating collapse.
- Gated Mixture‑of‑Experts fusion – dynamically balances acoustic evidence with retrieved examples, improving overall decision quality.
- Strong empirical gains – weighted F1 of 0.65 on the SEP‑28k benchmark, surpassing prior state‑of‑the‑art models and showing zero‑shot cross‑lingual robustness.
Methodology
- Base Encoder – A Conformer (convolution‑augmented transformer) processes the raw audio waveform into a high‑level acoustic representation.
- Memory Bank – A non‑parametric store of annotated clinical utterances (audio + label sets) is built from the training corpus. During inference, the encoder queries this bank with a similarity search (e.g., cosine distance) to retrieve the k most relevant examples (a retrieval sketch follows this list).
- SetCon Loss – Instead of the usual cross‑entropy, the model is trained with a set‑based contrastive loss. For each training sample, the Jaccard similarity between its true label set and the label sets of retrieved neighbors is computed; the loss pulls high‑Jaccard pairs closer together in the embedding space and pushes low‑Jaccard pairs apart (a loss sketch follows this list).
- Gated Fusion – A lightweight gating network decides, per time‑step, how much weight to give the encoder’s acoustic logits versus the logits derived from the retrieved examples (treated as a “soft label” distribution). This mixture‑of‑experts approach prevents the model from over‑relying on either source (a fusion sketch follows this list).
- Training Pipeline – The whole system (encoder + gate) is end‑to‑end differentiable; the memory bank is frozen after the first epoch to keep retrieval stable while the encoder learns to align with it.
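The memory‑bank lookup can be made concrete with a short sketch. The interface below is an illustrative assumption rather than the authors’ code: embeddings are L2‑normalised so a dot product equals cosine similarity, and the k most similar annotated utterances are returned along with their label sets.

```python
import torch
import torch.nn.functional as F

class MemoryBank:
    """Hypothetical non-parametric store of (embedding, label-set) pairs."""

    def __init__(self, embeddings: torch.Tensor, label_sets: torch.Tensor):
        # embeddings: (N, d) encoder outputs for annotated utterances
        # label_sets: (N, C) multi-hot label vectors for those utterances
        self.keys = F.normalize(embeddings, dim=-1)
        self.label_sets = label_sets

    def retrieve(self, query: torch.Tensor, k: int = 8):
        # query: (B, d) utterance-level encoder output
        q = F.normalize(query, dim=-1)
        sims = q @ self.keys.T               # (B, N) cosine similarities
        scores, idx = sims.topk(k, dim=-1)   # k nearest neighbours per query
        return scores, self.label_sets[idx]  # (B, k) scores, (B, k, C) labels
```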
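The SetCon objective can be read as a Jaccard‑weighted contrastive loss over pairs of examples. The sketch below is one plausible formulation under that reading, not the paper’s exact loss: each anchor is pulled toward batch neighbours in proportion to the Jaccard overlap of their label sets and pushed away from low‑overlap ones.

```python
import torch
import torch.nn.functional as F

def jaccard_matrix(labels: torch.Tensor) -> torch.Tensor:
    # labels: (B, C) multi-hot label sets as a float tensor
    inter = labels @ labels.T                                      # |A ∩ B|
    union = labels.sum(-1, keepdim=True) + labels.sum(-1) - inter  # |A ∪ B|
    return inter / union.clamp(min=1)

def setcon_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                temperature: float = 0.1) -> torch.Tensor:
    z = F.normalize(embeddings, dim=-1)
    logits = (z @ z.T) / temperature                      # pairwise similarities
    mask = ~torch.eye(len(z), dtype=torch.bool, device=z.device)
    weights = jaccard_matrix(labels).masked_fill(~mask, 0.0)
    targets = weights / weights.sum(-1, keepdim=True).clamp(min=1e-8)
    log_probs = logits.masked_fill(~mask, float("-inf")).log_softmax(-1)
    # soft cross-entropy: each anchor attends to in-batch neighbours in
    # proportion to the Jaccard overlap of their label sets
    return -(targets * log_probs.masked_fill(~mask, 0.0)).sum(-1).mean()
```

Weighting by Jaccard rather than exact set equality lets partially overlapping label sets act as soft positives, which is the behaviour the summary attributes to SetCon.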
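Finally, a minimal sketch of the gated fusion, simplified to an utterance‑level gate (the paper describes per‑time‑step gating); names and shapes are illustrative assumptions. A sigmoid gate interpolates between the encoder’s own predictions and a similarity‑weighted soft label built from the retrieved neighbours.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Hypothetical two-expert fusion: acoustic head vs. retrieved soft labels."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.acoustic_head = nn.Linear(dim, num_classes)
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, features, neighbor_scores, neighbor_labels):
        # features: (B, d); neighbor_scores: (B, k); neighbor_labels: (B, k, C)
        acoustic_probs = self.acoustic_head(features).sigmoid()   # acoustic expert
        # retrieval expert: similarity-weighted average of neighbour label sets
        w = neighbor_scores.softmax(dim=-1).unsqueeze(-1)         # (B, k, 1)
        retrieved_probs = (w * neighbor_labels.float()).sum(dim=1)  # (B, C)
        g = self.gate(features)                                   # (B, 1) in [0, 1]
        return g * acoustic_probs + (1 - g) * retrieved_probs     # fused probabilities
```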
Results & Findings
| Model | Weighted F1 | Precision | Recall |
|---|---|---|---|
| Baseline Conformer (CE) | 0.58 | 0.61 | 0.55 |
| Conformer + naive retrieval | 0.62 | 0.55 | 0.71 |
| StutterFuse (SetCon + Gated Fusion) | 0.65 | 0.63 | 0.68 |
- Modality Collapse mitigated – naive retrieval boosted recall dramatically but dropped precision; the gated fusion restored balance.
- Zero‑shot cross‑lingual test (German & Mandarin samples) retained an F1 ≈ 0.60, confirming that the memory‑based reasoning generalizes beyond the English training set.
- Ablation shows SetCon alone improves F1 by +0.03, while the gated fusion adds another +0.02 on top of that.
Practical Implications
- Clinical Decision Support – Speech‑language pathologists can get more reliable multi‑label stutter annotations, especially for complex utterances where multiple disfluencies co‑occur.
- Low‑Resource Languages – Because the model leans on retrieved examples rather than massive language‑specific training data, it can be adapted quickly to new languages or dialects with only a handful of annotated recordings.
- Edge Deployment – The retrieval step can be pre‑computed and cached; the gating network adds negligible overhead, making StutterFuse feasible for on‑device or tele‑health applications.
- Beyond Stuttering – The same RAC + SetCon + gated‑fusion recipe could be transplanted to other multi‑label audio tasks (e.g., cough classification, emotion detection) where rare label combinations are a bottleneck.
Limitations & Future Work
- Memory Scalability – The current implementation stores all training examples; scaling to millions of recordings will require approximate nearest‑neighbor indexing or hierarchical memory structures (see the sketch after this list).
- Label Granularity – The SEP‑28k taxonomy is relatively coarse; finer‑grained disfluency types may need richer annotation schemes and possibly hierarchical retrieval.
- Real‑World Noise – Experiments were conducted on relatively clean clinical recordings; robustness to background noise and far‑field microphones remains to be validated.
- User Interaction – Future versions could expose the retrieved examples to clinicians for verification, turning the system into an interactive “retrieval‑augmented annotation tool.”
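As a concrete illustration of the memory‑scalability point above, the sketch below shows how approximate nearest‑neighbour retrieval over a large memory bank might look with FAISS; this is an assumption about one possible remedy, not something evaluated in the paper, and the dimensions and index settings are illustrative.

```python
import numpy as np
import faiss

d = 256                                             # embedding dimension (illustrative)
bank = np.random.rand(100_000, d).astype("float32")  # stand-in for bank embeddings
faiss.normalize_L2(bank)                            # cosine similarity via inner product

quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024, faiss.METRIC_INNER_PRODUCT)
index.train(bank)                                   # learn the coarse clustering
index.add(bank)
index.nprobe = 16                                   # clusters searched per query

query = np.random.rand(1, d).astype("float32")      # stand-in for an encoder output
faiss.normalize_L2(query)
scores, ids = index.search(query, 8)                # approximate top-8 neighbours
```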
StutterFuse demonstrates that blending modern neural encoders with a well‑designed retrieval component can overcome data scarcity in pathological speech, opening the door to more accurate, adaptable, and explainable detection systems.
Authors
- Guransh Singh
- Md Shah Fahad
Paper Information
- arXiv ID: 2512.13632v1
- Categories: cs.LG
- Published: December 15, 2025