[Paper] Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems

Published: February 26, 2026 at 12:39 PM EST
5 min read
Source: arXiv

Overview

The paper introduces Discourse‑Aware Dual‑Track Streaming Response (DDTSR), a new architecture for spoken dialogue systems that dramatically cuts response latency while keeping the conversation coherent. By letting the system listen‑while‑thinking and speak‑while‑thinking, DDTSR bridges the gap between the traditional “ASR → LLM → TTS” pipeline and the real‑time expectations of modern voice assistants.

Key Contributions

  • Dual‑track model synergy: A lightweight “connector” model emits discourse markers (e.g., “well”, “so”) in real time, while a heavyweight LLM performs deep reasoning in parallel.
  • Streaming cross‑modal collaboration: ASR, LLM inference, and TTS are overlapped dynamically, enabling the earliest possible “speakable” moment.
  • Curriculum‑learning for discourse continuity: A training regime that teaches the system to maintain logical flow between the early, partial utterance and the later, fully‑fledged response.
  • Plug‑and‑play compatibility: DDTSR works with a variety of LLM backbones (GPT‑2, LLaMA, etc.) without architectural changes.
  • Latency reduction of 19 %–51 % on two benchmark spoken‑dialogue datasets, with negligible loss in discourse quality.

Methodology

  1. Two‑track inference

    • Small connector model (≈ 10 M parameters) runs continuously on the incoming ASR stream. Its sole job is to predict discourse connectives—short filler phrases that signal the system is processing the user’s input.
    • Large reasoning model (e.g., 7 B‑parameter LLM) receives the same ASR tokens but works at its own pace, generating the substantive answer.
  2. Streaming orchestration

    • The ASR engine emits partial transcripts token‑by‑token.
    • As soon as the connector model outputs a connective, the TTS engine starts synthesizing it, while the large model continues to consume the transcript.
    • When the large model finishes a segment, its output is stitched to the already‑spoken connective, producing a seamless, incremental spoken reply.
  3. Curriculum learning for coherence

    • Training proceeds in stages: first, the system learns to produce high‑quality full‑sentence responses.
    • Next, it is exposed to truncated inputs and forced to generate partial responses that later need to be extended.
    • A loss term penalizes incoherent jumps between early and later segments, encouraging the model to keep a consistent discourse thread.
  4. Implementation details

    • The pipeline is built on open‑source ASR (Whisper), LLM (LLaMA‑2), and neural TTS (FastSpeech2).
    • A lightweight scheduler decides when the TTS buffer can be flushed, based on confidence thresholds from the connector model.
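The control flow of the two tracks can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the connector, LLM, and confidence scores are trivial stubs standing in for the real Whisper/LLaMA‑2/FastSpeech2 components, and the threshold value is an assumption.

```python
import queue
import threading

def connector_predict(partial_transcript):
    """Tiny connector model stub: emit a discourse marker with a confidence."""
    return "Well,", 0.9  # (marker, confidence) -- a real model would score this

def llm_generate(full_transcript):
    """Large reasoning model stub: produce the substantive answer."""
    return "the capital of France is Paris."

def run_dual_track(asr_tokens, confidence_threshold=0.8):
    """Speak a connective as soon as the connector is confident enough,
    then stitch the LLM's answer onto it when that answer arrives."""
    spoken = []                      # what the TTS engine has uttered so far
    answer_box = queue.Queue()

    # Track 2: the heavyweight LLM works at its own pace in parallel.
    worker = threading.Thread(
        target=lambda: answer_box.put(llm_generate(" ".join(asr_tokens))))
    worker.start()

    # Track 1: the connector watches the partial ASR stream token by token.
    partial = []
    for tok in asr_tokens:
        partial.append(tok)
        marker, conf = connector_predict(" ".join(partial))
        if conf >= confidence_threshold and not spoken:
            spoken.append(marker)    # scheduler flushes the TTS buffer early

    worker.join()
    spoken.append(answer_box.get())  # stitch the full answer to the connective
    return " ".join(spoken)

print(run_dual_track(["what", "is", "the", "capital", "of", "France"]))
# -> "Well, the capital of France is Paris."
```

In the real system the connective is already being synthesized while the large model is still consuming the transcript; the stubs here collapse that overlap but preserve the scheduling logic.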
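The coherence objective in step 3 might take a form like the following. The paper only states that a loss term penalizes incoherent jumps between the early and later segments; the cosine-distance formulation and the weight below are illustrative assumptions.

```python
import math

def cosine_similarity(u, v):
    """Standard cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def coherence_loss(early_embedding, late_embedding, weight=0.5):
    """Assumed penalty: divergence between the embedding of the early
    (partial) segment and the embedding of its continuation."""
    return weight * (1.0 - cosine_similarity(early_embedding, late_embedding))

# Identical segment embeddings -> zero penalty; orthogonal -> full weight.
print(coherence_loss([1.0, 0.0], [1.0, 0.0]))  # -> 0.0
print(coherence_loss([1.0, 0.0], [0.0, 1.0]))  # -> 0.5
```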

Results & Findings

| Metric | Baseline (ASR → LLM → TTS) | DDTSR (best config) |
|---|---|---|
| End‑to‑end latency (s) | 1.85 | 0.92 (−50 %) |
| Word Error Rate (ASR) | 6.2 % | 6.2 % (unchanged) |
| Discourse Coherence (human rating, 1–5) | 4.3 | 4.2 |
| BLEU (response quality) | 23.1 | 22.8 |
  • Latency dropped between 19 % and 51 % depending on utterance length; the biggest gains appeared for longer user turns where the large model would otherwise dominate the waiting time.
  • Quality metrics (BLEU, human coherence scores) stayed within 0.3 points of the baseline, confirming that the early “connective” filler does not degrade the overall conversation.
  • Scalability tests showed the same latency gains when swapping the large LLM for a 13 B‑parameter model, indicating the approach is model‑agnostic.

Practical Implications

  • Voice assistants & smart speakers can feel more natural, responding almost instantly with a “thinking” cue (e.g., “Let me see…”) while still delivering a full answer a split‑second later.
  • Customer‑service bots can reduce user frustration caused by long silences, potentially improving satisfaction scores and reducing call abandonment.
  • Edge deployment: Because the connector model is tiny, it can run on-device (e.g., on a smartphone’s NPU), allowing the early speech to be generated locally while the heavy LLM runs in the cloud.
  • Developer ergonomics: DDTSR is a drop‑in module; you can wrap any existing ASR‑LLM‑TTS stack with the provided scheduler and gain latency improvements without retraining the core LLM.
  • Real‑time multimodal agents (e.g., voice‑plus‑visual assistants) can synchronize spoken feedback with UI updates, creating smoother interactive experiences.
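The "drop‑in module" usage might look like the sketch below. Every name here (`DDTSRScheduler`, its constructor arguments, and the component callables) is hypothetical; the paper does not publish this API, so this only illustrates the wrapping pattern described above.

```python
class DDTSRScheduler:
    """Hypothetical wrapper that adds dual-track responses to an
    existing ASR -> LLM -> TTS stack without retraining the core LLM."""

    def __init__(self, asr, llm, tts, connector):
        self.asr, self.llm, self.tts, self.connector = asr, llm, tts, connector

    def respond(self, audio):
        transcript = self.asr(audio)
        filler = self.connector(transcript)  # cheap, emitted immediately
        answer = self.llm(transcript)        # expensive; overlapped in the real system
        return self.tts(filler + " " + answer)

# Wrap a stack made of trivial stand-in components:
pipeline = DDTSRScheduler(
    asr=lambda audio: audio,                 # pretend audio is already text
    llm=lambda text: "it is 3 pm.",
    connector=lambda text: "Let me see...",
    tts=lambda text: text,                   # identity "synthesis"
)
print(pipeline.respond("what time is it"))   # -> "Let me see... it is 3 pm."
```

In this toy version the filler and the answer are produced sequentially; the point of DDTSR is that the scheduler overlaps them.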

Limitations & Future Work

  • Connector model simplicity: The current small model only emits generic discourse markers; richer early content (e.g., partial facts) could further improve perceived responsiveness.
  • Latency vs. confidence trade‑off: Aggressive early TTS can produce filler speech that later needs to be overwritten if the large model’s answer diverges, leading to occasional audible “re‑phrasing”.
  • Domain‑specific tuning: The curriculum learning schedule was tuned on open‑domain dialogue; specialized domains (medical, legal) may require custom curricula.
  • Future directions suggested by the authors include: (1) training a multi‑task connector that can emit brief content snippets, (2) adaptive scheduling based on real‑time confidence scores, and (3) extending the framework to multimodal generation (e.g., simultaneous speech and on‑screen text).

Authors

  • Siyuan Liu
  • Jiahui Xu
  • Feng Jiang
  • Kuang Wang
  • Zefeng Zhao
  • Chu-Ren Huang
  • Jinghang Gu
  • Changqing Yin
  • Haizhou Li

Paper Information

  • arXiv ID: 2602.23266v1
  • Categories: cs.CL
  • Published: February 26, 2026
  • PDF: Download PDF
