[Paper] ARCADE: A City-Scale Corpus for Fine-Grained Arabic Dialect Tagging

Published: January 5, 2026 at 10:32 AM EST
3 min read
Source: arXiv - 2601.02209v1

Overview

The ARCADE paper introduces ARCADE (Arabic Radio Corpus for Audio Dialect Evaluation), the first large‑scale Arabic speech dataset that tags audio at the city level. By harvesting radio broadcasts from across the Arab world and annotating each 30‑second clip with fine‑grained dialect, emotion, and speech‑type metadata, the authors provide a powerful new resource for building and evaluating dialect‑aware speech technologies.

Key Contributions

  • City‑level dialect granularity: 3,790 unique audio segments labeled for 58 cities in 19 Arab countries.
  • Multi‑task annotation schema: Each clip includes dialect, emotion, speech type (e.g., news, talk‑show), and a validity flag for dialect identification.
  • Robust data pipeline: Automated streaming capture, quality filtering, and human verification by 1–3 native reviewers per clip.
  • Open‑source release: Full dataset (6,907 annotations) hosted on Hugging Face, ready for immediate use in research and product development (see the loading sketch after this list).
  • Benchmark baselines: Multi‑task learning models and evaluation metrics for city‑level dialect tagging are provided as reference points.
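
Because the release is hosted on Hugging Face, getting started should take only a few lines with the `datasets` library. A minimal loading sketch follows; the dataset ID "org/arcade" is a hypothetical placeholder, not the official identifier, and the exact field names may differ from the released schema.

```python
# Minimal loading sketch. "org/arcade" is a hypothetical placeholder ID;
# substitute the official Hugging Face identifier from the ARCADE release.
from datasets import load_dataset

arcade = load_dataset("org/arcade")  # hypothetical dataset ID

# Each row should carry the audio plus the multi-task labels described above
# (dialect/city, emotion, speech type, validity flag); field names here may
# differ from the official schema.
example = arcade["train"][0]
print(example.keys())
```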

Methodology

  1. Data Collection – The team identified 1,200+ Arabic radio stations on public streaming platforms. A custom crawler continuously recorded 30‑second windows from each live stream, ensuring a mix of Modern Standard Arabic (MSA) and regional dialects (see the capture sketch after this list).
  2. Quality Assurance – Audio segments were automatically screened for signal‑to‑noise ratio, clipping, and language detection; low‑quality clips were discarded (see the screening sketch after this list).
  3. Human Annotation – Native Arabic speakers (1–3 per clip) listened to each segment via a web interface and supplied:
    • Dialect label (city, country, and broader dialect family)
    • Emotion (neutral, happy, sad, angry, etc.)
    • Speech type (news, interview, music‑intro, etc.)
    • Validity flag (whether the dialect can be confidently identified)
  4. Dataset Curation – After annotation, the authors performed statistical checks (label balance, inter‑annotator agreement) and split the data into train/validation/test sets, preserving city distribution (see the split sketch after this list).
  5. Baseline Modeling – Using wav2vec‑2.0 embeddings, they trained a multi‑task classifier that jointly predicts dialect, emotion, and speech type, reporting city‑level accuracy and macro‑F1 scores (see the model sketch after this list).
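
The paper's crawler implementation is not detailed in this summary, but the capture step (1) can be approximated with ffmpeg. A minimal sketch, assuming ffmpeg is installed and each station exposes a plain HTTP(S) audio stream; the station name and URL below are hypothetical:

```python
# Capture sketch for step 1: record fixed 30-second windows from live streams.
# Assumes ffmpeg on PATH; the station URL below is a hypothetical placeholder.
import subprocess
import time
from pathlib import Path

def record_window(stream_url: str, out_path: Path, seconds: int = 30) -> None:
    """Grab one 30-second mono 16 kHz window from a live radio stream."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", stream_url,     # live stream endpoint
            "-t", str(seconds),   # fixed window length, as in the paper
            "-ac", "1",           # downmix to mono
            "-ar", "16000",       # 16 kHz, typical for speech models
            str(out_path),
        ],
        check=True,
        capture_output=True,
    )

stations = {"station_example": "https://example.com/live.mp3"}  # placeholder
for name, url in stations.items():
    record_window(url, Path(f"{name}_{int(time.time())}.wav"))
```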
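
For step 2, the paper's exact filters are not given here; a crude screening pass for clipping and near-silence might look like the sketch below. The thresholds are illustrative assumptions, and real SNR estimation would need a noise model.

```python
# Quality-screening sketch for step 2. Thresholds are illustrative, not the
# paper's values.
import numpy as np
import soundfile as sf

def passes_quality_check(path: str, clip_frac=0.001, min_rms=0.01) -> bool:
    audio, _ = sf.read(path)
    audio = audio.ravel()                      # flatten stereo if present
    clipped = np.mean(np.abs(audio) >= 0.99)   # fraction of clipped samples
    rms = np.sqrt(np.mean(audio ** 2))         # overall energy
    return clipped < clip_frac and rms > min_rms
```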
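
Step 4's city-preserving split can be reproduced with stratified sampling on the city label. A sketch assuming the annotations are exported to a CSV; both the filename and the "city" column name are assumptions:

```python
# Split sketch for step 4: stratify on the city label so per-city proportions
# survive the train/validation/test partition. File and column names are
# hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

clips = pd.read_csv("arcade_annotations.csv")  # hypothetical export

# Carve out the test set first, then split the remainder, stratifying on the
# city label each time.
trainval, test = train_test_split(
    clips, test_size=0.10, stratify=clips["city"], random_state=42
)
train, val = train_test_split(
    trainval, test_size=0.10, stratify=trainval["city"], random_state=42
)
print(len(train), len(val), len(test))
```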
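
For step 5, a shared wav2vec 2.0 encoder with one classification head per task is the natural shape for such a multi-task baseline. The sketch below follows that pattern; the pretrained checkpoint, pooling strategy, and head sizes for emotion and speech type are assumptions, not the paper's exact recipe.

```python
# Multi-task baseline sketch for step 5: a shared wav2vec 2.0 trunk with three
# linear heads. Architecture details are assumptions, not the paper's recipe.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class MultiTaskTagger(nn.Module):
    def __init__(self, n_cities=58, n_emotions=5, n_speech_types=4):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        hidden = self.encoder.config.hidden_size
        self.dialect_head = nn.Linear(hidden, n_cities)       # 58 cities
        self.emotion_head = nn.Linear(hidden, n_emotions)     # assumed count
        self.speech_head = nn.Linear(hidden, n_speech_types)  # assumed count

    def forward(self, waveform: torch.Tensor):
        # waveform: (batch, samples) at 16 kHz
        feats = self.encoder(waveform).last_hidden_state
        pooled = feats.mean(dim=1)  # mean-pool over time
        return (
            self.dialect_head(pooled),
            self.emotion_head(pooled),
            self.speech_head(pooled),
        )

# Training would sum one cross-entropy loss per head; the reported ~4% dialect
# gain comes from this joint objective versus a dialect-only model.
model = MultiTaskTagger()
dialect, emotion, speech = model(torch.randn(2, 16000 * 30))  # two 30 s clips
```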

Results & Findings

  • Dialect tagging: The baseline model achieved ≈68% top‑1 accuracy on the 58‑city classification task, a strong start given the fine granularity.
  • Multi‑task gains: Jointly learning emotion and speech type boosted dialect accuracy by ~4% compared to a single‑task model, indicating useful cross‑signal information.
  • Data quality: Inter‑annotator agreement (Cohen’s κ) for dialect labels was 0.78, confirming that native speakers can reliably distinguish city‑level speech cues (a minimal κ computation is sketched after this list).
  • Label distribution: Some megacities (e.g., Cairo, Riyadh) dominate the dataset, but the authors applied stratified sampling to keep smaller‑city representation sufficient for training.
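
As a quick reference for the agreement figure above, Cohen's κ between two annotators' dialect labels can be computed with scikit-learn. The labels below are illustrative only, not drawn from the dataset:

```python
# Agreement sketch for the reported kappa = 0.78. Labels are illustrative.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["Cairo", "Riyadh", "Tunis", "Cairo", "Amman"]
annotator_b = ["Cairo", "Riyadh", "Tunis", "Alexandria", "Amman"]
print(cohen_kappa_score(annotator_a, annotator_b))
```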

Practical Implications

  • Improved ASR & TTS: Speech recognition and synthesis systems can be fine‑tuned to city‑specific pronunciations, reducing error rates for localized applications (e.g., voice assistants in Saudi Arabia vs. Egypt).
  • Dialect‑aware NLP: Sentiment analysis, intent detection, and chatbot responses can be adapted to regional lexical choices, enhancing user experience.
  • Content personalization: Media platforms can automatically route news or advertisements to listeners whose dialect matches the content, increasing relevance.
  • Sociolinguistic analytics: Companies can monitor dialect trends in real‑time (e.g., emerging slang) by feeding live radio streams into models trained on ARCADE.
  • Low‑resource language tech: The open dataset lowers the barrier for startups and research labs to prototype dialect‑specific models without costly data collection.

Limitations & Future Work

  • Geographic bias: Larger urban centers are over‑represented; rural dialects may still be under‑captured.
  • Single‑modality: Only audio is provided; pairing with transcripts would enable end‑to‑end speech‑to‑text research.
  • Static snapshot: Radio content evolves; periodic updates are needed to keep the corpus current.
  • Annotation depth: While emotion and speech type are included, finer sociolinguistic tags (e.g., speaker age, gender) are absent.

Future work could expand coverage to community radio, add text transcriptions, and explore continual data pipelines that auto‑ingest new broadcasts while preserving annotation quality.

Authors

  • Omer Nacar
  • Serry Sibaee
  • Adel Ammar
  • Yasser Alhabashi
  • Nadia Samer Sibai
  • Yara Farouk Ahmed
  • Ahmed Saud Alqusaiyer
  • Sulieman Mahmoud AlMahmoud
  • Abdulrhman Mamdoh Mukhaniq
  • Lubaba Raed
  • Sulaiman Mohammed Alatwah
  • Waad Nasser Alqahtani
  • Yousif Abdulmajeed Alnasser
  • Mohamed Aziz Khadraoui
  • Wadii Boulila

Paper Information

  • arXiv ID: 2601.02209v1
  • Categories: cs.CL, cs.CY, cs.SD
  • Published: January 5, 2026