[Paper] Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research
Source: arXiv - 2602.14819v1
Overview
Testimole‑Conversational is a new openly released dataset of more than 30 billion Italian word tokens harvested from public discussion boards spanning 1996‑2024. By offering a chronologically deep snapshot of informal Italian online communication, the corpus is intended as a resource for training native‑Italian large language models (LLMs) and for sociolinguistic studies of digital discourse.
Key Contributions
- Scale: Over 30 B word‑tokens, making it one of the largest monolingual Italian corpora ever released.
- Temporal breadth: Covers 28 years of discussion‑board activity, enabling diachronic analyses of language change.
- Domain richness: Captures a wide variety of informal registers, slang, emojis, code‑switching, and forum‑specific conventions.
- Open access: The authors will distribute the cleaned, tokenized dataset under a permissive license for research and commercial use.
- Dual utility: Serves both NLP practitioners (pre‑training, domain adaptation, conversational AI) and sociolinguists (studying language variation, online social behavior).
Methodology
- Data collection – Publicly available Italian discussion boards were scraped using polite crawling practices (robots.txt compliance, rate limiting).
- Cleaning pipeline – Duplicate posts, signatures, and boilerplate navigation text were removed. Non‑Italian content and spam were filtered out with language‑identification heuristics and a lightweight classifier.
- Tokenization & metadata – Text was tokenized with the Italian spaCy tokenizer; each message was annotated with timestamp, forum category, and thread ID to preserve conversational context.
- Quality checks – Random samples were manually inspected for noise, and basic statistics (vocabulary size, type‑token ratio) were computed to verify corpus health.
The pipeline is deliberately simple so that other researchers can reproduce or extend it for additional forums or languages.
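The cleaning and quality‑check steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the signature delimiter, the whitespace tokenizer (a stand‑in for the spaCy tokenizer mentioned above), and all function names are assumptions made for the example.

```python
import hashlib
import re

# Assumption: a signature starts at a line containing only "--" (optionally
# followed by whitespace). Real forums use many other conventions.
SIGNATURE_RE = re.compile(r"\n--\s*\n.*\Z", re.DOTALL)


def strip_signature(text: str) -> str:
    """Drop everything from the assumed signature delimiter onward."""
    return SIGNATURE_RE.sub("", text).strip()


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def deduplicate(posts):
    """Keep the first occurrence of each post, comparing normalized text."""
    seen = set()
    unique = []
    for post in posts:
        cleaned = strip_signature(post)
        digest = hashlib.sha1(normalize(cleaned).encode("utf-8")).hexdigest()
        if cleaned and digest not in seen:
            seen.add(digest)
            unique.append(cleaned)
    return unique


def type_token_ratio(posts):
    """Distinct word forms divided by total words (whitespace tokens)."""
    tokens = [tok for post in posts for tok in normalize(post).split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


if __name__ == "__main__":
    raw = [
        "Ciao a tutti!\n--\nMario, Roma",
        "ciao  a tutti!",
        "Qualcuno ha provato il nuovo firmware?",
    ]
    clean = deduplicate(raw)
    print(clean)
    print(type_token_ratio(clean))
```

Exact‑hash deduplication only catches posts that are identical after normalization; near‑duplicates such as quoted replies would need a fuzzier method, which the summary does not describe.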
Results & Findings
- Vocabulary richness: Over 2 M unique lemmas, with a long tail of region‑specific slang and neologisms that appear only in recent years.
- Temporal drift: Frequency analysis shows a clear rise in English loanwords, emojis, and internet memes after 2010, reflecting broader cultural shifts.
- Conversational dynamics: Thread‑level metadata enables the extraction of turn‑taking patterns, reply latency, and user interaction graphs, all of which are valuable for training dialogue systems.
- Baseline language models: Fine‑tuning a 1.3 B‑parameter Italian transformer on the corpus yields a 12 % reduction in perplexity on downstream Italian QA and chat benchmarks, relative to models pre‑trained on generic web crawls.
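The temporal‑drift and conversational‑dynamics analyses above rely on the per‑message metadata described under Methodology (timestamp, thread ID). The sketch below shows one way such measurements could be computed. It is illustrative only: the record layout, the sample messages, and the tiny loanword list are hypothetical, and latency is measured between consecutive messages in a thread because the summary does not say whether explicit reply links are available.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical records: (thread_id, ISO timestamp, text).
MESSAGES = [
    ("t1", "2005-03-01T10:00:00", "qualcuno conosce un buon forum di cucina"),
    ("t1", "2005-03-01T10:30:00", "prova questo sito è ottimo"),
    ("t2", "2015-06-10T09:00:00", "che ne pensate del nuovo update"),
    ("t2", "2015-06-10T09:05:00", "update top ma il feedback online è negativo"),
]

# Assumption: a hand-picked loanword list, purely for illustration.
LOANWORDS = {"update", "feedback", "online"}


def loanword_rate_by_year(messages, loanwords):
    """Share of whitespace tokens per year that appear in `loanwords`."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for _thread, stamp, text in messages:
        year = datetime.fromisoformat(stamp).year
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += sum(tok in loanwords for tok in tokens)
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}


def reply_latencies(messages):
    """Seconds between consecutive messages within each thread."""
    by_thread = defaultdict(list)
    for thread, stamp, _text in messages:
        by_thread[thread].append(datetime.fromisoformat(stamp))
    gaps = []
    for stamps in by_thread.values():
        stamps.sort()
        gaps.extend((b - a).total_seconds() for a, b in zip(stamps, stamps[1:]))
    return gaps


if __name__ == "__main__":
    print(loanword_rate_by_year(MESSAGES, LOANWORDS))
    print(median(reply_latencies(MESSAGES)))
```

Normalizing counts by the yearly token total matters here, because forum activity varies widely across the years covered and raw counts would mostly reflect corpus size per year.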
Practical Implications
- Better Italian LLMs: Pre‑training on Testimole‑Conversational could help narrow the performance gap between English‑centric LLMs and native Italian models, improving code generation, summarization, and virtual assistant quality for Italian users.
- Domain‑adapted chatbots: Companies building customer‑support bots can fine‑tune on this data to capture the informal tone and idiomatic expressions typical of Italian online users.
- Content moderation tools: The corpus provides a realistic testbed for training classifiers that detect hate speech, harassment, or misinformation in Italian forums.
- Sociolinguistic dashboards: Researchers and marketers can track the emergence of new slang, sentiment trends, or regional language usage over nearly three decades, informing product localization and cultural analysis.
Limitations & Future Work
- Platform bias: The dataset is limited to the specific forums that were publicly accessible; niche communities (e.g., gaming, LGBTQ+, regional dialect forums) may be under‑represented.
- Noise residuals: Despite cleaning, some spam, bot‑generated posts, and non‑Italian fragments remain, requiring downstream filtering for sensitive applications.
- Ethical considerations: While the data is public, user anonymity cannot be guaranteed; future releases should explore differential privacy techniques or consent‑aware sampling.
- Extension roadmap: The authors plan to augment the corpus with multimodal signals (images, emojis as separate tokens) and to release a version with speaker‑level anonymized IDs for richer conversational modeling.
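As an illustration of what speaker‑level IDs could look like, the sketch below maps usernames to stable pseudonyms with a keyed hash. The summary does not say which technique the authors intend, so the function name, key handling, and ID length are all assumptions. Keyed hashing is pseudonymization rather than full anonymization: it hides the username but does not prevent re‑identification from message content.

```python
import hashlib
import hmac


def pseudonymize(username: str, secret_key: bytes, length: int = 16) -> str:
    """Map a username to a stable pseudonym using HMAC-SHA256.

    The same username and key always give the same ID, so conversational
    structure is preserved without exposing the original handle.
    """
    digest = hmac.new(secret_key, username.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


if __name__ == "__main__":
    # Hypothetical key; in practice it would be random and kept private.
    key = b"replace-with-a-private-random-key"
    print(pseudonymize("mario_rossi", key))
```

Using a secret key rather than a plain hash makes it harder to recover pseudonyms by hashing a list of known usernames.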
Testimole‑Conversational opens the door to a new generation of Italian‑centric AI tools while also offering a living laboratory for scholars interested in how language evolves in the digital public sphere.
Authors
- Matteo Rinaldi
- Rossella Varvara
- Viviana Patti
Paper Information
- arXiv ID: 2602.14819v1
- Categories: cs.CL
- Published: February 16, 2026