[Paper] Spoken DialogSum: An Emotion-Rich Conversational Dataset for Spoken Dialogue Summarization

Published: December 16, 2025 at 01:54 PM EST
4 min read

Source: arXiv - 2512.14687v1

Overview

The paper introduces Spoken DialogSum, the first large‑scale dataset that pairs raw conversational audio with two types of summaries—one factual and one emotion‑rich—while also providing utterance‑level annotations for speaker age, gender, and emotion. By bridging speech, text, and paralinguistic cues, the authors enable end‑to‑end audio‑language models (Audio‑LLMs) to generate summaries that preserve both the content and the emotional tone of spoken dialogues.

Key Contributions

  • A novel multimodal corpus: 13,460 spoken dialogues synthesized with expressive TTS, each linked to (a) a factual summary, (b) an emotion‑focused summary, and (c) fine‑grained speaker/utterance metadata (age, gender, emotion, pitch, speaking rate).
  • Two‑stage data creation pipeline:
    1. LLM‑driven rewriting of the DialogSum text corpus to inject natural fillers, back‑channels, and emotion tags.
    2. High‑fidelity expressive TTS that renders the annotated scripts into audio aligned with the paralinguistic labels.
  • Benchmark baselines: Comparison of a cascaded ASR‑LLM pipeline vs. a unified Audio‑LLM, showing a 28 % relative gain in ROUGE‑L for emotion‑rich summaries when using the end‑to‑end model.
  • Open‑source release: Dataset, audio samples, and code are publicly available, encouraging reproducibility and downstream research.
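To make the corpus structure concrete, here is a minimal sketch of how a single record could be represented in Python. The field names (`audio_path`, `factual_summary`, `emotion_summary`, and so on) are illustrative assumptions based on the annotations listed above, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record layout inferred from the annotations described in the
# paper; the released dataset may use different field names and formats.

@dataclass
class Utterance:
    speaker: str        # e.g., "A" or "B"
    text: str           # enriched script with fillers and back-channels
    emotion: str        # e.g., "happy", "sad", "angry"
    age: str            # speaker age annotation
    gender: str         # speaker gender annotation
    pitch: str          # pitch tag used to condition the expressive TTS
    speaking_rate: str  # speaking-rate tag used to condition the TTS

@dataclass
class DialogueRecord:
    dialogue_id: str
    audio_path: str          # expressive-TTS rendering of the enriched script
    factual_summary: str     # content-only reference summary
    emotion_summary: str     # emotion-rich reference summary
    utterances: List[Utterance] = field(default_factory=list)
```

A loader built around a record like this could feed either baseline: the cascaded pipeline transcribes the audio with ASR, while the Audio‑LLM consumes `audio_path` directly.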

Methodology

  1. Script Enrichment: The authors start from the existing DialogSum text dataset. A large language model (LLM) rewrites each dialogue script, inserting Switchboard‑style phenomena (e.g., “uh‑mm”, “yeah”, “right”) and annotating each utterance with an emotion label (e.g., happy, sad, angry).
  2. Paralinguistic Tagging: For every utterance, additional acoustic attributes—pitch contour and speaking rate—are generated based on the assigned emotion, ensuring that the synthesized speech reflects the intended affect (a minimal tagging sketch follows this list).
  3. Expressive Text‑to‑Speech: Using a state‑of‑the‑art expressive TTS engine, the annotated scripts are turned into high‑quality audio. The TTS system is conditioned on the emotion and acoustic tags, producing speech that naturally varies in tone, intonation, and speed.
  4. Summarization Targets: Two reference summaries are created for each dialogue: a factual summary (content‑only) and an emotion‑rich summary (explicitly mentions affective states).
  5. Model Evaluation: Two baselines are tested:
    • Cascaded: Automatic Speech Recognition (ASR) → Text‑LLM summarizer.
    • End‑to‑end Audio‑LLM: Directly consumes audio and generates the summary.
    Performance of both systems is measured with ROUGE‑L and qualitative emotion preservation (see the evaluation sketch after this list).
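As a rough illustration of the paralinguistic tagging step, the snippet below maps an emotion label to pitch and speaking‑rate tags that an expressive TTS engine could be conditioned on. The tag values and the mapping itself are assumptions made for illustration; they are not the paper's published tagging rules.

```python
# Illustrative emotion-to-prosody mapping (assumed values, not the paper's
# actual rules): each utterance's emotion label is converted into coarse
# pitch and speaking-rate tags that condition the expressive TTS engine.
PROSODY_TAGS = {
    "happy":   {"pitch": "high",   "speaking_rate": "fast"},
    "sad":     {"pitch": "low",    "speaking_rate": "slow"},
    "angry":   {"pitch": "high",   "speaking_rate": "fast"},
    "neutral": {"pitch": "medium", "speaking_rate": "medium"},
}

def tag_utterance(text: str, emotion: str) -> dict:
    """Attach prosody tags to a single enriched utterance."""
    tags = PROSODY_TAGS.get(emotion, PROSODY_TAGS["neutral"])
    return {"text": text, "emotion": emotion, **tags}

print(tag_utterance("Uh-mm, yeah, that sounds great!", "happy"))
# {'text': 'Uh-mm, yeah, that sounds great!', 'emotion': 'happy',
#  'pitch': 'high', 'speaking_rate': 'fast'}
```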
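For the evaluation step, the reported metric (ROUGE‑L) can be computed with the open‑source `rouge-score` package. The sketch below scores hypothetical cascaded and end‑to‑end outputs against an emotion‑rich reference; the summary strings are invented placeholders, and running the actual models is out of scope here.

```python
# ROUGE-L comparison of the two baselines' outputs against an emotion-rich
# reference summary. Requires `pip install rouge-score`. The summaries below
# are invented placeholders, not examples from the dataset.
from rouge_score import rouge_scorer

reference = ("Person A excitedly shares a job offer, and Person B reacts "
             "with joy and congratulates them.")
cascaded_output = "Person A tells Person B about a job offer and B congratulates them."
audio_llm_output = ("Person A excitedly announces a job offer, and Person B "
                    "responds with joyful congratulations.")

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

for name, hypothesis in [("cascaded ASR -> LLM", cascaded_output),
                         ("end-to-end Audio-LLM", audio_llm_output)]:
    score = scorer.score(reference, hypothesis)["rougeL"]
    print(f"{name}: ROUGE-L F1 = {score.fmeasure:.3f}")
```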

Results & Findings

  • The Audio‑LLM outperforms the cascaded pipeline on emotion‑rich summarization, achieving a 28 % relative improvement in ROUGE‑L (and noticeable gains in emotion recall).
  • For factual summaries, the gap between the two systems narrows, indicating that the primary advantage of end‑to‑end modeling lies in preserving affective cues that often get lost in ASR transcription.
  • Human evaluation confirms that the Audio‑LLM’s summaries better capture speaker emotions and subtle conversational dynamics (e.g., sarcasm, excitement).
  • The dataset itself proves useful for training models that need to align acoustic features with textual sentiment, opening doors for multimodal sentiment analysis and empathetic AI.

Practical Implications

  • Customer‑service automation: Voice‑based agents can generate post‑call summaries that highlight not only the issue but also the caller’s emotional state, enabling more personalized follow‑ups.
  • Meeting transcription tools: End‑to‑end summarizers can produce minutes that flag moments of tension or enthusiasm, helping teams prioritize action items.
  • Accessibility: For hearing‑impaired users, emotion‑aware captions can convey the affect behind spoken content, improving comprehension.
  • Content moderation & analytics: Media monitoring platforms can automatically flag emotionally charged segments in podcasts or call‑center recordings for further review.
  • Training empathetic conversational agents: Developers can fine‑tune dialogue systems on Spoken DialogSum to better recognize and respond to user emotions in real time.

Limitations & Future Work

  • Synthetic audio: Although the expressive TTS is high‑quality, the dataset relies on synthetic speech, which may not capture all nuances of natural human prosody and background noise.
  • Emotion taxonomy: The study uses a limited set of coarse emotion categories; finer‑grained affective states (e.g., frustration vs. anger) remain unexplored.
  • Scalability to real recordings: Future work should validate the Audio‑LLM’s performance on noisy, real‑world conversational recordings and investigate domain adaptation techniques.
  • Multilingual extension: The current corpus is English‑only; extending the pipeline to other languages would broaden applicability.

Spoken DialogSum opens a new frontier for emotion‑aware speech summarization, offering developers a ready‑made resource to build more empathetic, context‑rich voice applications.

Authors

  • Yen-Ju Lu
  • Kunxiao Gao
  • Mingrui Liang
  • Helin Wang
  • Thomas Thebaud
  • Laureano Moro-Velazquez
  • Najim Dehak
  • Jesus Villalba

Paper Information

  • arXiv ID: 2512.14687v1
  • Categories: cs.CL, cs.AI, cs.LG, eess.AS
  • Published: December 16, 2025