[Paper] DEBISS: a Corpus of Individual, Semi-structured and Spoken Debates

Published: March 5, 2026

Source: arXiv - 2603.05459v1

Overview

The paper introduces DEBISS, a newly released corpus that captures individual, semi‑structured spoken debates. By combining raw audio, transcriptions, speaker diarization, argument‑mining tags, and quality‑assessment labels, the dataset fills a notable gap in resources for researchers and developers building tools that model real‑world debating behavior.

Key Contributions

  • A first‑of‑its‑kind spoken‑debate corpus covering a variety of topics (politics, education, everyday issues) with natural, unscripted speech.
  • Multi‑layer annotation pipeline:
    • Automatic speech‑to‑text (ASR) plus manual correction.
    • Speaker diarization to separate multiple participants.
    • Argument mining tags (claim, premise, rebuttal, support).
    • Debater quality scores (coherence, persuasiveness, logical consistency).
  • Open‑access release of audio files, transcripts, and annotation files in a unified format (JSONL + WAV).
  • Baseline benchmarks for several NLP tasks (ASR error rates, argument component detection, quality classification) to help the community get started quickly.
  • Comprehensive documentation and a small “starter kit” (scripts for data loading, evaluation metrics, and a pre‑trained transformer model fine‑tuned on DEBISS).

Methodology

  1. Data Collection – 150 participants were recruited through university mailing lists and social media. Each recorded a 5–10 minute monologue on a prompt (e.g., “Should remote work become permanent?”). Recordings were made with consumer‑grade microphones to reflect realistic audio quality.
  2. Pre‑processing – Raw audio was run through a state‑of‑the‑art ASR system (Whisper‑large) to generate initial transcripts. Human annotators then corrected errors and added timestamps.
  3. Speaker Diarization – Since most recordings featured a single speaker, diarization was primarily used to detect interruptions (e.g., self‑interruptions, filler words) and to segment the monologue into logical turns.
  4. Argument Mining Annotation – Trained linguists labeled each sentence with argument components (claim, premise, evidence, rebuttal) following the Toulmin model. Inter‑annotator agreement (Cohen’s κ) averaged 0.78, indicating reliable labeling.
  5. Quality Assessment – A panel of debate coaches rated each monologue on three dimensions (coherence, persuasiveness, logical consistency) on a 1‑5 Likert scale. These scores serve as supervised targets for quality‑prediction models.
  6. Benchmarking – The authors fine‑tuned BERT‑based classifiers for argument component detection and a regression head for quality prediction, reporting baseline F1 scores (≈ 0.71) and RMSE (≈ 0.84) respectively.
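The κ = 0.78 reported in step 4 is Cohen's kappa, which corrects raw agreement for the agreement two annotators would reach by chance. A self‑contained sketch of the statistic for two annotators (not the authors' code):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators.

    Assumes both lists label the same items in the same order, and that
    expected agreement is below 1 (labels are not all identical).
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the two annotators labeled independently.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)
```

Values around 0.78 are conventionally read as substantial agreement, which is why the paper treats the argument labels as reliable.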

Results & Findings

  • ASR performance: Word Error Rate (WER) of 12.4 % after manual correction, showing that even consumer‑grade recordings are tractable for downstream NLP.
  • Argument component detection: The best model achieved 71 % F1 across all four component classes, with claims being the easiest to detect (F1 = 0.78) and rebuttals the hardest (F1 = 0.62).
  • Quality prediction: Regression models could predict overall debater quality with a Pearson correlation of 0.68 to human scores, suggesting that linguistic cues capture a substantial portion of perceived persuasiveness.
  • Data diversity: Topic distribution was balanced (≈ 10 % per topic) and speaker demographics (age 18‑55, 55 % female) matched typical online debate populations, enhancing external validity.
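The WER figure above is the standard metric: word‑level Levenshtein distance (substitutions, insertions, deletions) normalized by reference length. A minimal sketch for readers who want to reproduce it on their own transcripts; production code would typically use an established library instead:

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits turning ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, substitution)
    return dp[-1][-1] / len(ref)
```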

Practical Implications

  • Automated debate coaching tools – Developers can build real‑time feedback systems that flag weak arguments or suggest evidence, leveraging the argument‑mining labels.
  • Speech‑enabled debate platforms – Voice‑first discussion apps (e.g., Clubhouse‑style rooms) can use the diarization and ASR pipelines to generate searchable transcripts and highlight argumentative structure on the fly.
  • Persuasion analytics for marketers & policymakers – Quality‑assessment scores enable quality‑aware ranking of spoken content, helping organizations surface the most compelling arguments from webinars or town‑hall meetings.
  • Educational tech – Language‑learning platforms can incorporate DEBISS to teach argumentative writing and speaking, providing learners with authentic examples and automated grading.
  • Research acceleration – The open‑source starter kit lowers the entry barrier for experiments in spoken argument mining, multimodal debate analysis, and cross‑modal transfer learning (e.g., using text‑only debate corpora to improve spoken models).
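To illustrate the debate‑coaching use case, a feedback tool could flag claims that no premise supports before the next claim begins. A toy sketch over (sentence, label) pairs, using a hypothetical label schema that mirrors the DEBISS argument tags:

```python
def flag_unsupported_claims(labeled_sentences):
    """Return claims not followed by at least one premise before the next claim.

    `labeled_sentences` is a list of (text, label) pairs; labels such as
    'claim' and 'premise' are a hypothetical schema, not the official one.
    """
    flagged = []
    current_claim, supported = None, True
    for text, label in labeled_sentences:
        if label == "claim":
            if current_claim is not None and not supported:
                flagged.append(current_claim)  # previous claim had no premise
            current_claim, supported = text, False
        elif label == "premise" and current_claim is not None:
            supported = True
    if current_claim is not None and not supported:
        flagged.append(current_claim)
    return flagged
```

A real coaching system would operate on the model's predicted component labels rather than gold annotations, but the feedback logic stays the same.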

Limitations & Future Work

  • Monologue focus – While the dataset captures “individual” debates, it lacks true multi‑party, interactive exchanges, which are common in real‑world forums.
  • Audio quality variance – Recordings were made in relatively quiet environments; noisy, in‑the‑wild audio (e.g., street interviews) remains untested.
  • Cultural and language scope – All participants were native Portuguese speakers; extending DEBISS to other languages and cultural debate styles is needed for broader applicability.
  • Annotation granularity – Current argument labels follow a coarse Toulmin scheme; finer‑grained rhetorical moves (e.g., rhetorical questions, analogies) could enrich downstream models.

The authors plan to expand DEBISS with multi‑speaker debates, incorporate noisy field recordings, and open a crowdsourcing platform for continuous annotation growth.

Authors

  • Klaywert Danillo Ferreira de Souza
  • David Eduardo Pereira
  • Cláudio E. C. Campelo
  • Larissa Lucena Vasconcelos

Paper Information

  • arXiv ID: 2603.05459v1
  • Categories: cs.CL, cs.DB
  • Published: March 5, 2026