[Paper] MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

Published: April 14, 2026 at 12:17 PM EDT

Source: arXiv - 2604.12928v1

Overview

MoshiRAG tackles a pressing problem for real‑time conversational AI: how to keep a full‑duplex speech‑to‑speech model (one that can talk and listen simultaneously) factually accurate without blowing up latency or compute costs. By marrying a lightweight, always‑on dialogue engine with an asynchronous retrieval‑augmented generation (RAG) module, the system can fetch up‑to‑date knowledge on the fly while still sounding natural and responsive.

Key Contributions

Asynchronous Knowledge Retrieval – Introduces a “listen‑first, speak‑later” pipeline in which the model speaks filler or back‑channel cues right away and delivers the factual answer later, once a separate retriever has fetched the content in the background.
Modular Full‑Duplex Interface – Keeps the core speech model small and fast, enabling real‑time inference on commodity hardware.
Plug‑and‑Play Retrieval Backbone – Supports off‑the‑shelf retrievers (e.g., dense vector search, BM25) without any additional fine‑tuning of the speech model.
Factuality Parity with Non‑Duplex SOTA – Achieves accuracy on knowledge‑heavy benchmarks comparable to the best publicly released non‑duplex speech language models.
Strong Out‑of‑Domain Reasoning – Demonstrates robust performance on unseen mathematical reasoning tasks, showing the retrieval component can supply domain‑specific knowledge on demand.

Methodology

Base Full‑Duplex Speech Model – A compact encoder‑decoder that processes incoming audio in streaming mode, producing partial utterances (e.g., “uh‑huh”, “right”) as soon as enough acoustic context is available.
Knowledge‑Demand Detector – A lightweight classifier runs on the partially generated transcript to decide whether the upcoming response will need external facts (e.g., a question about a date, a definition).
Asynchronous Retrieval Thread – If the detector flags a knowledge‑demanding turn, a separate thread launches a retrieval query against a pre‑indexed knowledge base (Wikipedia, domain‑specific corpora, or a vector store).
Response Fusion – When the retrieval results arrive, they are injected into the generation beam of the speech model, replacing or augmenting any placeholder filler that is still queued and continuing from the filler that was already spoken. Because the filler occupies the natural pause before the “core” answer, the user perceives a seamless, uninterrupted conversation.
Modular Plug‑In – The retrieval component can be swapped (dense embedding model, sparse BM25, LLM‑based re‑ranker) without retraining the speech encoder‑decoder, making the system future‑proof.
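The steps above can be sketched with Python's asyncio. This is a minimal illustration under stated assumptions, not the authors' implementation: the keyword rule in `needs_knowledge`, the in-memory `KNOWLEDGE_BASE`, and the text strings standing in for speech segments are all hypothetical. It shows only the control flow: retrieval starts in the background, filler is emitted without waiting, and the core answer follows once the facts arrive.

```python
import asyncio
from typing import List, Optional

# Hypothetical in-memory knowledge base; a real system would query an index.
KNOWLEDGE_BASE = {
    "eiffel tower": "The Eiffel Tower was completed in 1889.",
    "speed of light": "The speed of light is about 299,792 km/s.",
}

# Assumption: a keyword rule stands in for the trained knowledge-demand detector.
QUESTION_WORDS = {"who", "what", "when", "where", "define"}


def needs_knowledge(partial_transcript: str) -> bool:
    """Flag turns that look like factual questions."""
    words = partial_transcript.lower().replace("?", " ").split()
    return any(word in QUESTION_WORDS for word in words)


async def retrieve(query: str, delay_s: float = 0.05) -> Optional[str]:
    """Simulated retriever: sleeps to mimic lookup latency, then matches key phrases."""
    await asyncio.sleep(delay_s)
    lowered = query.lower()
    for key, fact in KNOWLEDGE_BASE.items():
        if key in lowered:
            return fact
    return None


async def respond(user_utterance: str) -> List[str]:
    """Return the spoken segments in the order they would be emitted."""
    if not needs_knowledge(user_utterance):
        return ["Right, I see."]

    # Launch retrieval in the background, then speak filler without waiting for it.
    retrieval = asyncio.create_task(retrieve(user_utterance))
    spoken = ["Sure, let me check..."]

    # The core answer is produced only once the retrieved fact is available.
    fact = await retrieval
    spoken.append(fact if fact is not None else "I could not find that.")
    return spoken


if __name__ == "__main__":
    print(asyncio.run(respond("When was the Eiffel Tower completed?")))
```

In a real full-duplex system the filler and the answer would be audio streams rather than strings, and the detector would run on a partial transcript while the user is still speaking; the sketch collapses both into a single call for readability.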

Results & Findings

Metric	MoshiRAG (Full‑Duplex)	Non‑Duplex SOTA (e.g., Whisper‑RAG)
Factual Accuracy (QA)	84.2 %	85.0 %
Latency (average turn)	210 ms (incl. filler)	480 ms (blocking)
Real‑time Interactivity Score*	0.93	0.71
Out‑of‑Domain Math Reasoning (accuracy)	78 %	71 %

*The interactivity score measures how often the system can keep speaking while waiting for knowledge (higher is better).

Key takeaways

MoshiRAG comes within 0.8 percentage points of the factual accuracy of much larger, single‑pass models (84.2 % vs. 85.0 %) while averaging 210 ms per turn, well under 250 ms, preserving the feel of a live conversation.
The asynchronous design leverages the natural “thinking” pause humans use, turning what would be dead time into productive retrieval.
Plug‑and‑play retrieval yields consistent gains across different knowledge sources, confirming the modularity claim.
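The headline comparisons can be recomputed directly from the figures in the results table. The short script below is only arithmetic on those reported numbers; the dictionary keys are illustrative names, not identifiers from the paper.

```python
# Figures copied from the results table; key names are illustrative.
moshirag = {"qa_acc": 84.2, "latency_ms": 210, "interactivity": 0.93, "math_acc": 78.0}
baseline = {"qa_acc": 85.0, "latency_ms": 480, "interactivity": 0.71, "math_acc": 71.0}

# Accuracy differences in percentage points.
qa_gap = round(baseline["qa_acc"] - moshirag["qa_acc"], 1)
math_gain = round(moshirag["math_acc"] - baseline["math_acc"], 1)

# Latency: relative reduction (in percent) and speed-up factor.
latency_cut_pct = round(100 * (1 - moshirag["latency_ms"] / baseline["latency_ms"]), 2)
speedup = round(baseline["latency_ms"] / moshirag["latency_ms"], 2)

# Absolute difference in the interactivity score.
interactivity_gain = round(moshirag["interactivity"] - baseline["interactivity"], 2)

print(f"QA accuracy gap:      {qa_gap} points")
print(f"Math accuracy gain:   {math_gain} points")
print(f"Latency reduction:    {latency_cut_pct} %")
print(f"Speed-up factor:      {speedup}x")
print(f"Interactivity gain:   {interactivity_gain}")
```

Read this way, the trade is a 0.8-point loss in QA accuracy for an average turn latency cut of roughly 56 %, alongside a 7-point gain on the math benchmark.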

Practical Implications

Use‑Case	How MoshiRAG Helps
Customer Support Bots	Agents can acknowledge a user instantly (“Sure, let me check…”) while the system pulls the latest policy documents, avoiding long “hold” periods.
Voice Assistants in Low‑Power Devices	The lightweight speech core runs on edge hardware; heavy retrieval can be offloaded to the cloud without breaking the interaction flow.
Live Translation / Interpretation	The model can start rendering a provisional translation, then refine it with domain‑specific terminology fetched on the fly.
Educational Tutors	When a student asks a factual question, the tutor can give a brief “hold on” cue while retrieving a precise answer, keeping the session engaging.
Multimodal Conversational Agents	The same asynchronous pattern can be extended to fetch images, code snippets, or UI components while the agent continues speaking.

For developers, the biggest win is that the speech model does not need retraining when you swap in a better retriever or a newer knowledge base: just plug it in and keep the same deployment pipeline.
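One way to picture this swap is a small retriever interface that the dialogue side depends on. The sketch below is an assumption about how such a boundary could look, not the paper's API: `Retriever`, `KeywordRetriever`, `LookupRetriever`, and `answer` are hypothetical names, and the word-overlap ranking is a simple stand-in for a sparse method, not real BM25.

```python
from typing import Dict, List, Protocol


class Retriever(Protocol):
    """Hypothetical interface: any object with this method can back the pipeline."""

    def search(self, query: str, k: int = 1) -> List[str]: ...


class KeywordRetriever:
    """Sparse-style stand-in that ranks documents by word overlap (not real BM25)."""

    def __init__(self, docs: List[str]) -> None:
        self.docs = docs

    def search(self, query: str, k: int = 1) -> List[str]:
        query_words = set(query.lower().split())
        ranked = sorted(
            self.docs,
            key=lambda doc: len(query_words & set(doc.lower().split())),
            reverse=True,
        )
        return ranked[:k]


class LookupRetriever:
    """Dictionary stand-in for a vector store or any other backend."""

    def __init__(self, table: Dict[str, str]) -> None:
        self.table = table

    def search(self, query: str, k: int = 1) -> List[str]:
        lowered = query.lower()
        hits = [passage for key, passage in self.table.items() if key in lowered]
        return hits[:k]


def answer(query: str, retriever: Retriever) -> str:
    """The caller sees only the interface, so the backend can be swapped freely."""
    hits = retriever.search(query, k=1)
    return hits[0] if hits else "No supporting passage found."


if __name__ == "__main__":
    policy = "Refunds are accepted within 30 days of purchase."
    docs = ["Paris is the capital of France.", policy]
    print(answer("refunds within how many days", KeywordRetriever(docs)))
    print(answer("what is the refund policy", LookupRetriever({"refund": policy})))
```

Both backends return the same passage through the same `answer` call, which is the property the summary attributes to the plug-and-play design.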

Limitations & Future Work

Detection Errors – The knowledge‑demand classifier occasionally misfires, either fetching unnecessary data (wasting bandwidth) or missing a needed fact, leading to generic filler.
Retrieval Latency Variability – While the average latency stays low, worst‑case retrieval spikes (e.g., network hiccups) can still cause noticeable pauses if the filler runs out.
Domain Coverage – Performance drops when the external knowledge source lacks up‑to‑date information for niche domains; the system relies heavily on the quality of the indexed corpus.
Evaluation Scope – Benchmarks focus on QA and math reasoning; real‑world conversational nuances (humor, sarcasm) remain under‑explored.
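The retrieval-latency limitation suggests bounding how long the system waits before it falls back. The following sketch is an illustrative mitigation, not something the summary attributes to the paper: `slow_retrieve`, `answer_with_budget`, and the fallback sentence are hypothetical, and `budget_s` stands in for however long the filler can plausibly cover.

```python
import asyncio
from typing import List


async def slow_retrieve(query: str, delay_s: float) -> str:
    """Simulated retriever whose latency is chosen by the caller."""
    await asyncio.sleep(delay_s)
    return f"Fact about {query}"


async def answer_with_budget(query: str, delay_s: float, budget_s: float) -> List[str]:
    """Speak filler, then either the retrieved fact or a graceful fallback."""
    spoken = ["Let me check..."]
    try:
        # wait_for cancels the retrieval if it exceeds the filler budget.
        fact = await asyncio.wait_for(slow_retrieve(query, delay_s), timeout=budget_s)
        spoken.append(fact)
    except asyncio.TimeoutError:
        spoken.append("I'm still looking that up; I can follow up in a moment.")
    return spoken


if __name__ == "__main__":
    # Fast retrieval fits the budget; slow retrieval triggers the fallback.
    print(asyncio.run(answer_with_budget("tides", delay_s=0.01, budget_s=0.5)))
    print(asyncio.run(answer_with_budget("tides", delay_s=0.5, budget_s=0.01)))
```

The design choice here is to prefer an honest holding phrase over silence when a network spike outlasts the filler.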

Future directions highlighted by the authors include: improving the detector with confidence‑aware thresholds, integrating cache‑aware retrieval to reduce repeated look‑ups, and extending the framework to multimodal retrieval (e.g., code, diagrams) for richer developer‑focused assistants.
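Two of these directions, confidence-aware thresholds and cache-aware retrieval, are easy to illustrate in a few lines. The sketch below is one possible reading under assumptions of my own: the threshold value of 0.6, the query normalization, and the use of `functools.lru_cache` are illustrative choices, and `_backend_lookup` is a hypothetical placeholder for a real retrieval call.

```python
from functools import lru_cache

# Counts backend calls so the effect of caching is visible.
CALLS = {"count": 0}


def _backend_lookup(query: str) -> str:
    """Hypothetical placeholder for an expensive retrieval call."""
    CALLS["count"] += 1
    return f"passage for: {query}"


def normalize(query: str) -> str:
    """Lower-case and collapse whitespace so equivalent queries share a cache key."""
    return " ".join(query.lower().split())


@lru_cache(maxsize=256)
def cached_retrieve(normalized_query: str) -> str:
    """Repeated look-ups for the same normalized query skip the backend."""
    return _backend_lookup(normalized_query)


def should_retrieve(detector_confidence: float, threshold: float = 0.6) -> bool:
    """Trigger retrieval only when the detector is confident enough (assumed threshold)."""
    return detector_confidence >= threshold


if __name__ == "__main__":
    for raw_query in ["Eiffel  Tower height", "eiffel tower height"]:
        if should_retrieve(detector_confidence=0.8):
            print(cached_retrieve(normalize(raw_query)))
    print("backend calls:", CALLS["count"])
```

Raising the threshold trades missed facts for fewer unnecessary fetches, which is the detection-error trade-off described in the limitations.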

Bottom line: MoshiRAG shows that you don’t have to sacrifice interactivity for factuality. By decoupling “listen‑first” speech generation from “think‑later” knowledge retrieval, developers can build responsive, fact‑grounded voice agents that run efficiently on today’s hardware.

Authors

Chung-Ming Chien
Manu Orsini
Eugene Kharitonov
Neil Zeghidour
Karen Livescu
Alexandre Défossez

Paper Information

arXiv ID: 2604.12928v1
Categories: cs.CL, eess.AS
Published: April 14, 2026
