[Paper] Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications
Source: arXiv - 2602.12241v1
Overview
Moonshine v2 tackles a core bottleneck in on‑device speech recognition: the latency incurred by processing an entire utterance before any words can be output. By replacing the classic full‑attention Transformer encoder with a sliding‑window (ergodic) self‑attention mechanism, the authors achieve low time‑to‑first‑token (TTFT) while keeping word error rate (WER) on par with much larger, slower models. This makes real‑time, edge‑friendly ASR practical for live transcription, voice‑command interfaces, and instant translation.
Key Contributions
- Ergodic streaming encoder: Introduces a bounded‑latency self‑attention scheme that only looks at a local window of frames, eliminating the quadratic cost of full‑utterance attention.
- State‑of‑the‑art accuracy with tiny models: Moonshine v2 matches the WER of models up to 6× larger, proving that local attention can preserve global lexical cues when designed correctly.
- Latency‑centric evaluation: Provides detailed TTFT measurements across utterance lengths, showing linear‑time inference that scales gracefully on edge hardware.
- Open‑source‑ready implementation: The architecture is built on TensorFlow Lite/Edge TPU‑compatible ops, facilitating immediate adoption by developers.
- Comprehensive benchmark suite: Validates on standard datasets (LibriSpeech, VoxPopuli, and a proprietary streaming test set) to demonstrate robustness across domains.
Methodology
- Sliding‑window self‑attention:
- Each encoder layer attends to a fixed‑size temporal window (e.g., 400 ms) around the current frame instead of the whole sequence.
- Overlapping windows are “ergodic”: as the stream progresses, each frame eventually participates in multiple windows, allowing information to propagate across the entire utterance without ever requiring a full‑utterance pass.
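A minimal sketch of what a fixed‑size attention window implies, assuming frame‑level granularity (the function name and mask representation are illustrative, not from the paper):

```python
def window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Boolean attention mask for sliding-window self-attention.

    mask[i][j] is True when frame i may attend to frame j, i.e. when
    j lies within `window` frames centered on i.
    """
    half = window // 2
    return [[abs(i - j) <= half for j in range(seq_len)]
            for i in range(seq_len)]
```

Because each layer only mixes information inside its window, stacking layers widens the effective receptive field, which is how local attention can still propagate context across the whole utterance.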
- Chunk‑wise processing pipeline:
- Audio is buffered into overlapping chunks (e.g., 1 s with 50 % overlap).
- Each chunk is fed through the streaming encoder, producing a compact representation that is fed to a lightweight decoder (often a CTC or transducer head).
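The buffering step above can be sketched as follows (a simplified, illustrative helper; the real pipeline operates on audio samples or feature frames, and the parameter names are assumptions):

```python
def chunk_stream(samples, chunk_len, overlap=0.5):
    """Split a sample buffer into fixed-size chunks with fractional overlap.

    With overlap=0.5, consecutive chunks share half their samples,
    matching the "1 s chunks with 50% overlap" setup described above.
    """
    hop = max(1, int(chunk_len * (1 - overlap)))  # stride between chunk starts
    chunks = []
    start = 0
    while start + chunk_len <= len(samples):
        chunks.append(samples[start:start + chunk_len])
        start += hop
    return chunks
```

For example, 3 s of 16 kHz audio split into 1 s chunks at 50 % overlap yields five chunks, each starting 0.5 s after the previous one.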
- Training tricks to preserve global context:
- Curriculum masking: Randomly mask portions of the window during training so the model learns to infer missing context.
- Auxiliary full‑attention loss: A small “teacher” network with full attention provides soft targets, nudging the streaming encoder toward the same representations.
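Curriculum masking could be sketched like this (a hypothetical implementation: the paper does not specify the masking schedule, so the span shape and `max_mask_frac` parameter are assumptions):

```python
import random

def curriculum_mask(frames, max_mask_frac=0.3, rng=None):
    """Zero out a random contiguous span of frames during training,
    forcing the model to infer the missing context from its neighbors."""
    rng = rng or random.Random()
    n = len(frames)
    span = rng.randint(0, int(n * max_mask_frac))   # length of masked span
    start = rng.randint(0, n - span)                 # where the span begins
    return [0.0 if start <= i < start + span else f
            for i, f in enumerate(frames)]
```

A training loop would gradually raise the masking fraction as the model improves, which is what makes the schedule a curriculum.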
- Model scaling:
- The base Moonshine v2 uses 12 encoder layers with 256 hidden units (≈30 M parameters).
- Larger variants (up to 80 M parameters) follow the same sliding‑window design, showing a smooth accuracy‑latency trade‑off.
Results & Findings
| Model (size) | Dataset | WER ↓ | TTFT (ms) | Speed‑up vs. Full‑Attention |
|---|---|---|---|---|
| Moonshine v2‑S (30 M) | LibriSpeech test‑clean | 4.3 % | 120 | 5.8× |
| Moonshine v2‑M (80 M) | LibriSpeech test‑other | 6.1 % | 150 | 4.9× |
| Full‑Attention Transformer (180 M) | LibriSpeech test‑clean | 4.2 % | 720 | 1× (baseline) |
- Accuracy parity: The smallest Moonshine v2 model is within 0.1 % absolute WER of a 6× larger full‑attention baseline.
- Latency gains: TTFT grows only modestly with utterance length (≈10 ms per additional second of speech), whereas the full‑attention encoder's TTFT grows linearly with utterance length and quickly exceeds 500 ms.
- Resource footprint: Memory consumption drops from ~1.2 GB to < 250 MB, enabling deployment on smartphones and micro‑controllers with < 2 W power budgets.
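The reported scaling can be summarized as a back‑of‑envelope latency model (illustrative only: it anchors the base latency to the v2‑S TTFT of 120 ms on a ~1 s utterance and applies the ≈10 ms‑per‑second figure; the function and its parameters are not from the paper):

```python
def ttft_ms(utterance_s: float,
            base_ms: float = 120.0,
            per_second_ms: float = 10.0) -> float:
    """Estimated time-to-first-token under a linear scaling model."""
    return base_ms + per_second_ms * max(0.0, utterance_s - 1.0)
```

Under this model, even a 30 s utterance would add only about 290 ms of TTFT, versus the 720 ms baseline the full‑attention encoder already pays on short utterances.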
Practical Implications
- Edge‑first voice assistants: Developers can embed high‑accuracy ASR directly on phones, wearables, or IoT hubs without relying on cloud round‑trips, preserving user privacy and reducing latency.
- Live captioning & translation: Real‑time subtitles for video conferencing become smoother, as the first words appear within a tenth of a second.
- Cost‑effective scaling: Since Moonshine v2 achieves the same WER with a fraction of the parameters, cloud‑based transcription services can serve more concurrent streams per GPU, lowering operational expenses.
- Simplified integration: The model’s reliance on standard TensorFlow Lite ops means it can be dropped into existing pipelines (e.g., Android SpeechRecognizer, TensorFlow.js) with minimal engineering effort.
Limitations & Future Work
- Window size trade‑off: Too small a window harms performance on highly ambiguous phonemes; the paper reports a sweet spot around 400 ms, but this may need tuning for languages with longer co‑articulation patterns.
- Long‑range dependencies: While ergodic overlap mitigates the issue, extremely long utterances (> 30 s) still exhibit a slight drift in WER compared to full attention.
- Domain adaptation: The current experiments focus on English datasets; extending the approach to tonal languages or noisy industrial environments will require additional robustness studies.
- Future directions: The authors suggest exploring adaptive window sizes (dynamic based on acoustic confidence) and hybrid architectures that selectively invoke full‑attention blocks when the model detects high uncertainty.
Moonshine v2 demonstrates that smart local attention can replace the heavyweight global attention of classic Transformers for streaming ASR, unlocking low‑latency, high‑accuracy speech interfaces that run comfortably on edge hardware. For developers building the next generation of voice‑first products, this work offers a ready‑to‑use blueprint for turning “listen‑first, think‑later” into “listen‑and‑respond‑now.”
Authors
- Manjunath Kudlur
- Evan King
- James Wang
- Pete Warden
Paper Information
- arXiv ID: 2602.12241v1
- Categories: cs.CL, cs.LG, cs.SD
- Published: February 12, 2026