[Paper] From Signal to Turn: Interactional Friction in Modular Speech-to-Speech Pipelines
Source: arXiv - 2512.11724v1
Overview
The paper “From Signal to Turn: Interactional Friction in Modular Speech‑to‑Speech Pipelines” investigates why today’s voice‑based AI assistants often feel “stilted” or broken, even though their underlying language models are highly capable. By dissecting a real‑world Speech‑to‑Speech Retrieval‑Augmented Generation (S2S‑RAG) system, the authors show that the conversational glitches stem not from model errors but from the way modular components are stitched together.
Key Contributions
- Identification of three systematic friction patterns in modular S2S pipelines:
  - Temporal Misalignment – delays that break the natural rhythm of dialogue.
  - Expressive Flattening – loss of prosody, tone, and other paralinguistic cues, leading to overly literal replies.
  - Repair Rigidity – architectural gating that prevents users from correcting the system on the fly.
- A diagnostic framework that moves beyond latency‑only metrics to evaluate “conversation‑level” health (a minimal scoring sketch follows this list).
- Empirical analysis of a production‑grade system, demonstrating that these friction points are structural side‑effects of modular design choices.
- Design recommendations that re‑frame spoken‑AI development as an infrastructure‑choreography problem rather than a component‑optimization problem.
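To make the diagnostic idea concrete, the sketch below shows one plausible way to score a logged turn along the three friction dimensions. The field names, thresholds, and equal weighting are illustrative assumptions, not the paper’s actual formulation.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    """One user-assistant exchange from an interaction log (illustrative fields)."""
    gap_ms: float            # silence between end of user speech and start of the reply
    pitch_variance: float    # prosodic variance of the synthesized reply
    repair_attempted: bool   # the user tried to correct the system mid-generation
    repair_honored: bool     # the correction actually changed the response

def friction_score(turn: Turn,
                   ideal_gap_ms: float = 300.0,
                   neutral_pitch_var: float = 12.0) -> float:
    """Combine the three friction dimensions into a single 0-3 score (higher = worse)."""
    # Temporal misalignment: how far the turn gap exceeds a natural pause, capped at 1.
    temporal = min(max(turn.gap_ms - ideal_gap_ms, 0.0) / 1000.0, 1.0)
    # Expressive flattening: how close the reply sits to flat, neutral prosody.
    flattening = max(0.0, 1.0 - turn.pitch_variance / neutral_pitch_var)
    # Repair rigidity: a correction was attempted but the pipeline ignored it.
    rigidity = 1.0 if (turn.repair_attempted and not turn.repair_honored) else 0.0
    return temporal + flattening + rigidity
```

Averaged over a session, a score like this captures rhythm, expressivity, and repair handling in a way that a plain latency average cannot.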
Methodology
- System Selection – The authors chose a representative production S2S‑RAG pipeline that includes:
  - Speech‑to‑Text (ASR)
  - Retrieval‑Augmented Generation (RAG)
  - Text‑to‑Speech (TTS)
- Interaction Logging – They collected thousands of real user‑assistant turns, annotating each with timestamps, prosodic features, and user‑initiated repair attempts.
- Pattern Mining – Using a combination of statistical timing analysis, acoustic feature comparison, and qualitative coding, they surfaced recurring breakdowns.
- Root‑Cause Tracing – For each friction pattern, the team traced the failure back to a specific module boundary (e.g., ASR latency spilling into TTS buffering); a sketch of such a per‑turn trace follows this list.
- Validation – A small user study compared the original pipeline with a “seam‑aware” prototype that introduced buffering and adaptive turn‑taking logic, confirming that friction scores dropped significantly.
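The logging and root‑cause‑tracing steps are easiest to picture as per‑turn records stamped at each module boundary. The sketch below uses assumed field names, not the paper’s actual logging schema, and shows how such a record lets the seam that contributes most to the turn gap be singled out.

```python
from dataclasses import dataclass

@dataclass
class TurnTrace:
    """Timestamps (seconds) captured at each module boundary for one turn (assumed schema)."""
    user_speech_end: float
    asr_final: float        # ASR emits the final transcript
    retrieval_done: float   # RAG retrieval returns its passages
    llm_first_token: float  # the generator produces its first token
    tts_first_audio: float  # TTS starts playing audio to the user

def seam_latencies(t: TurnTrace) -> dict:
    """Break the user-perceived gap into per-seam contributions."""
    return {
        "asr_finalization": t.asr_final - t.user_speech_end,
        "retrieval": t.retrieval_done - t.asr_final,
        "generation_start": t.llm_first_token - t.retrieval_done,
        "tts_startup": t.tts_first_audio - t.llm_first_token,
    }

def worst_seam(t: TurnTrace) -> str:
    """Root-cause tracing: name the boundary that adds the most delay to this turn."""
    lat = seam_latencies(t)
    return max(lat, key=lat.get)
```

A per‑seam breakdown like this is what allows a failure such as “ASR latency spilling into TTS buffering” to be attributed to one boundary rather than to the pipeline as a whole.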
Results & Findings
| Friction Pattern | Primary Cause | Measured Impact |
|---|---|---|
| Temporal Misalignment | ASR‑to‑RAG handoff latency + TTS synthesis lag | Average turn‑taking pause ↑ from 300 ms (ideal) to 1.2 s, causing a 27 % drop in perceived naturalness |
| Expressive Flattening | TTS models trained on neutral prosody; loss of speaker intent during retrieval | Users rated responses 22 % less engaging; sentiment analysis showed reduced affective variance |
| Repair Rigidity | Fixed gating that discards user input once RAG generation starts | 41 % of user‑initiated corrections were ignored, leading to frustration spikes in post‑interaction surveys |
The authors argue that these numbers illustrate systemic design trade‑offs: modular pipelines give engineers fine‑grained control and scalability, but the seams introduce conversational “friction” that users experience as broken dialogue.
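Repair rigidity is the friction pattern most directly tied to a code‑level design choice: once generation has been handed off, many pipelines simply stop listening. The fragment below is a deliberately simplified illustration of that gating pattern and of a variant that lets a correction through; `generate_rag_reply` is a stand‑in stub, and none of this is taken from the system studied in the paper.

```python
import asyncio

async def generate_rag_reply(text: str) -> str:
    """Stand-in for the retrieval-augmented generation step (assumed interface)."""
    await asyncio.sleep(1.0)                           # simulated generation latency
    return f"answer to: {text}"

async def respond_with_fixed_gating(utterance: str, incoming: asyncio.Queue) -> str:
    """Fixed gating: once generation has started, later user speech is silently dropped."""
    reply = await generate_rag_reply(utterance)
    while not incoming.empty():
        incoming.get_nowait()                          # the user's correction is discarded here
    return reply

async def respond_with_live_repair(utterance: str, incoming: asyncio.Queue) -> str:
    """Seam-aware variant: speech arriving mid-generation restarts generation with the new input."""
    generation = asyncio.create_task(generate_rag_reply(utterance))
    listen = asyncio.create_task(incoming.get())
    done, _ = await asyncio.wait({generation, listen}, return_when=asyncio.FIRST_COMPLETED)
    if listen in done:                                 # the user spoke again: treat it as a repair
        generation.cancel()
        return await generate_rag_reply(listen.result())
    listen.cancel()
    return generation.result()
```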
Practical Implications
- For Voice‑Assistant Engineers – Treat handoff points as first‑class “conversation APIs.” Adding lightweight buffers, predictive turn‑taking, and dynamic prosody transfer can dramatically improve user experience without overhauling core models; a streaming‑handoff sketch follows this list.
- Product Managers – Metrics like “average latency” are insufficient; incorporate Interactional Friction Scores (derived from the paper’s framework) into OKRs to capture rhythm and expressivity.
- Tooling Vendors – Opportunities to create middleware that synchronizes ASR, RAG, and TTS in real time, exposing hooks for repair handling and prosody preservation.
- Developers of Retrieval‑Augmented Systems – Consider context‑aware retrieval that respects the conversational tempo, e.g., by pre‑fetching likely knowledge snippets during user pauses (a prefetch sketch also appears after this list).
- Open‑Source Communities – The paper’s diagnostic scripts (available in the supplemental repo) can be integrated into CI pipelines to flag new friction‑inducing changes before release.
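One way to treat a handoff point as a conversation API is to stream the generator’s output into TTS at sentence boundaries instead of waiting for the full reply, so audio for the first sentence plays while the rest is still being produced. The sketch below is a minimal illustration under assumed interfaces; `synthesize` stands in for whatever TTS call a given stack exposes.

```python
import re
from typing import Callable, Iterable, Iterator

SENTENCE_END = re.compile(r"([.!?])\s")

def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Flush text to the next stage at sentence boundaries rather than at end of reply."""
    buffer = ""
    for token in tokens:
        buffer += token
        match = SENTENCE_END.search(buffer)
        if match:
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()

def speak_streaming(token_stream: Iterable[str], synthesize: Callable[[str], None]) -> None:
    """Hand each completed sentence to TTS immediately instead of buffering the whole reply."""
    for sentence in sentences_from_tokens(token_stream):
        synthesize(sentence)   # audio for sentence 1 plays while sentence 2 is still generating
```

With a print stub for `synthesize`, a token stream such as `["The answer ", "is 42. ", "Here is why."]` produces two flushes instead of one, which is the seam‑level change that shortens the perceived turn gap.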
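For retrieval that respects conversational tempo, a speculative prefetch can be launched from the partial ASR hypothesis as soon as the user pauses, then kept or discarded once the final transcript arrives. The sketch below assumes a generic `retrieve(query)` function and is not the paper’s implementation.

```python
import asyncio

async def retrieve(query: str) -> list[str]:
    """Stand-in for the RAG retrieval call (assumed interface)."""
    await asyncio.sleep(0.4)                  # simulated retrieval latency
    return [f"snippet for: {query}"]

async def answer_with_prefetch(partial_hypothesis: str,
                               final_transcript: asyncio.Future) -> list[str]:
    """Start retrieval on the partial hypothesis during a pause; reuse it if it still matches."""
    speculative = asyncio.create_task(retrieve(partial_hypothesis))
    final_text = await final_transcript       # the user finishes speaking; ASR finalizes
    if final_text.startswith(partial_hypothesis):
        return await speculative              # prefetch was right: retrieval latency is hidden
    speculative.cancel()                      # hypothesis changed: fall back to a fresh retrieval
    return await retrieve(final_text)
```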
Limitations & Future Work
- Scope of Evaluation – Focuses on a single commercial S2S‑RAG system; results may differ for end‑to‑end neural models or multilingual setups.
- User Diversity – Participants were primarily English‑speaking adults; cultural variations in turn‑taking norms were not explored.
- Repair Mechanisms – Proposes architectural changes but does not implement a full “live‑repair” protocol; future work could prototype a bidirectional correction channel.
- Prosody Transfer – Preserving speaker intent across retrieval remains an open challenge; integrating expressive embeddings into the retrieval step is a promising direction.
By reframing spoken‑AI development as a choreography of modular seams, this research opens a practical pathway for developers to move beyond “fast but stiff” voice assistants toward truly fluid, human‑like conversations.
Authors
- Titaya Mairittha
- Tanakon Sawanglok
- Panuwit Raden
- Jirapast Buntub
- Thanapat Warunee
- Napat Asawachaisuvikrom
- Thanaphum Saiwongin
Paper Information
- arXiv ID: 2512.11724v1
- Categories: cs.HC, cs.AI, cs.CL, cs.SE
- Published: December 12, 2025