[Paper] Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications
Source: arXiv - 2602.12241v1
Overview
Moonshine v2 tackles a core bottleneck in on‑device speech recognition: the latency incurred by processing an entire utterance before any words can be output. By replacing the classic full‑attention Transformer encoder with a sliding‑window (ergodic) self‑attention mechanism, the authors achieve low time‑to‑first‑token (TTFT) while keeping word error rate (WER) on par with much larger, slower models. This makes real‑time, edge‑friendly ASR practical for live transcription, voice‑command interfaces, and instant translation.
Key Contributions
- Ergodic streaming encoder: Introduces a bounded‑latency self‑attention scheme that only looks at a local window of frames, eliminating the quadratic cost of full‑utterance attention.
- State‑of‑the‑art accuracy with tiny models: Moonshine v2 matches the WER of models up to 6× larger, proving that local attention can preserve global lexical cues when designed correctly.
- Latency‑centric evaluation: Provides detailed TTFT measurements across utterance lengths, showing linear‑time inference that scales gracefully on edge hardware.
- Open‑source‑ready implementation: The architecture is built on TensorFlow Lite/Edge TPU‑compatible ops, facilitating immediate adoption by developers.
- Comprehensive benchmark suite: Validates on standard datasets (LibriSpeech, VoxPopuli, and a proprietary streaming test set) to demonstrate robustness across domains.
Methodology
- Sliding‑window self‑attention:
- Each encoder layer attends to a fixed‑size temporal window (e.g., 400 ms) around the current frame instead of the whole sequence.
- Overlapping windows are “ergodic”: as the stream progresses, each frame eventually participates in multiple windows, allowing information to propagate across the entire utterance without ever requiring a full‑utterance pass.
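A minimal sketch of what a fixed‑size attention window implies, assuming frame‑level granularity (the function name and mask representation are illustrative, not from the paper):

```python
def window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Boolean attention mask for sliding-window self-attention.

    mask[i][j] is True when frame i may attend to frame j, i.e. when
    j lies within `window` frames centered on i.
    """
    half = window // 2
    return [[abs(i - j) <= half for j in range(seq_len)]
            for i in range(seq_len)]
```

Because each layer only mixes information inside its window, stacking layers widens the effective receptive field, which is how local attention can still propagate context across the whole utterance.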
- Chunk‑wise processing pipeline:
- Audio is buffered into overlapping chunks (e.g., 1 s with 50 % overlap).
- Each chunk is fed through the streaming encoder, producing a compact representation that is fed to a lightweight decoder (often a CTC or transducer head).
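The buffering step above can be sketched as follows (a simplified, illustrative helper; the real pipeline operates on audio samples or feature frames, and the parameter names are assumptions):

```python
def chunk_stream(samples, chunk_len, overlap=0.5):
    """Split a sample buffer into fixed-size chunks with fractional overlap.

    With overlap=0.5, consecutive chunks share half their samples,
    matching the "1 s chunks with 50% overlap" setup described above.
    """
    hop = max(1, int(chunk_len * (1 - overlap)))  # stride between chunk starts
    chunks = []
    start = 0
    while start + chunk_len <= len(samples):
        chunks.append(samples[start:start + chunk_len])
        start += hop
    return chunks
```

For example, 3 s of 16 kHz audio split into 1 s chunks at 50 % overlap yields five chunks, each starting 0.5 s after the previous one.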
- Training tricks to preserve global context:
- Curriculum masking: Randomly mask portions of the window during training so the model learns to infer missing context.
- Auxiliary full‑attention loss: A small “teacher” network with full attention provides soft targets, nudging the streaming encoder toward the same representations.
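Curriculum masking could be sketched like this (a hypothetical implementation: the paper does not specify the masking schedule, so the span shape and `max_mask_frac` parameter are assumptions):

```python
import random

def curriculum_mask(frames, max_mask_frac=0.3, rng=None):
    """Zero out a random contiguous span of frames during training,
    forcing the model to infer the missing context from its neighbors."""
    rng = rng or random.Random()
    n = len(frames)
    span = rng.randint(0, int(n * max_mask_frac))   # length of masked span
    start = rng.randint(0, n - span)                 # where the span begins
    return [0.0 if start <= i < start + span else f
            for i, f in enumerate(frames)]
```

A training loop would gradually raise the masking fraction as the model improves, which is what makes the schedule a curriculum.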
- Model scaling:
- The base Moonshine v2 uses 12 encoder layers with 256 hidden units (≈30 M parameters).
- Larger variants (up to 80 M parameters) follow the same sliding‑window design, showing a smooth accuracy‑latency trade‑off.
Results & Findings
| Model (size) | Dataset | WER ↓ | TTFT (ms) | Speed‑up vs. Full‑Attention |
|---|---|---|---|---|
| Moonshine v2‑S (30 M) | LibriSpeech test‑clean | 4.3 % | 120 | 5.8× |
| Moonshine v2‑M (80 M) | LibriSpeech test‑other | 6.1 % | 150 | 4.9× |
| Full‑Attention Transformer (180 M) | LibriSpeech test‑clean | 4.2 % | 720 | 1× (baseline) |
- Accuracy parity: The smallest Moonshine v2 model is within 0.1 % absolute WER of a 6× larger full‑attention baseline.
- Latency gains: TTFT grows only modestly with utterance length (≈10 ms per additional second of speech), whereas the full‑attention encoder's TTFT grows linearly with utterance length and quickly exceeds 500 ms.
- Resource footprint: Memory consumption drops from ~1.2 GB to < 250 MB, enabling deployment on smartphones and micro‑controllers with < 2 W power budgets.
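The reported scaling can be summarized as a back‑of‑envelope latency model (illustrative only: it anchors the base latency to the v2‑S TTFT of 120 ms on a ~1 s utterance and applies the ≈10 ms‑per‑second figure; the function and its parameters are not from the paper):

```python
def ttft_ms(utterance_s: float,
            base_ms: float = 120.0,
            per_second_ms: float = 10.0) -> float:
    """Estimated time-to-first-token under a linear scaling model."""
    return base_ms + per_second_ms * max(0.0, utterance_s - 1.0)
```

Under this model, even a 30 s utterance would add only about 290 ms of TTFT, versus the 720 ms baseline the full‑attention encoder already pays on short utterances.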
Practical Implications
- Edge‑first voice assistants: Developers can embed high‑accuracy ASR directly on phones, wearables, or IoT hubs without relying on cloud round‑trips, preserving user privacy and reducing latency.
- Live captioning & translation: Real‑time subtitles for video conferencing become smoother, as the first words appear within a tenth of a second.
- Cost‑effective scaling: Since Moonshine v2 achieves the same WER with a fraction of the parameters, cloud‑based transcription services can serve more concurrent streams per GPU, lowering operational expenses.
- Simplified integration: The model’s reliance on standard TensorFlow Lite ops means it can be dropped into existing pipelines (e.g., Android SpeechRecognizer, TensorFlow.js) with minimal engineering effort.
Limitations & Future Work
- Window size trade‑off: Too small a window harms performance on highly ambiguous phonemes; the paper reports a sweet spot around 400 ms, but this may need tuning for languages with longer co‑articulation patterns.
- Long‑range dependencies: While ergodic overlap mitigates the issue, extremely long utterances (> 30 s) still exhibit a slight drift in WER compared to full attention.
- Domain adaptation: The current experiments focus on English datasets; extending the approach to tonal languages or noisy industrial environments will require additional robustness studies.
- Future directions: The authors suggest exploring adaptive window sizes (dynamic based on acoustic confidence) and hybrid architectures that selectively invoke full‑attention blocks when the model detects high uncertainty.
Moonshine v2 demonstrates that smart local attention can replace the heavyweight global attention of classic Transformers for streaming ASR, unlocking low‑latency, high‑accuracy speech interfaces that run comfortably on edge hardware. For developers building the next generation of voice‑first products, this work offers a ready‑to‑use blueprint for turning “listen‑first, think‑later” into “listen‑and‑respond‑now.”
Authors
- Manjunath Kudlur
- Evan King
- James Wang
- Pete Warden
Paper Information
- arXiv ID: 2602.12241v1
- Categories: cs.CL, cs.LG, cs.SD
- Published: February 12, 2026