[Paper] 'Chi nas dal soch el sent de legn' -- Auditing Text Corpora for Lombard

Published: June 4, 2026 at 12:20 PM EDT
Source: arXiv - 2606.06349v1

Overview

Several of the world’s languages are still under-resourced in terms of Natural Language Processing (NLP) tools. This is mostly due to the lack of high-quality datasets to train, develop, and evaluate systems and models for several tasks, such as Machine Translation (MT). We conduct a manual audit of the parallel and monolingual corpora available for Lombard, an under-resourced language continuum from Italy. Our analysis reveals that the perceived abundance of web-scraped data is an illusion, with massive datasets plagued by severe language misidentification, boilerplate text, and non-linguistic noise. Furthermore, we analyze the orthographic composition of the valid Lombard portions across web-scraped datasets, curated corpora, and benchmarks. Our findings show conflicting orthographical systems and severe representational bias across all corpora: high-quality data is heavily skewed towards Western Lombard varieties, with Eastern ones left on the margins. This underscores the need for variety-aware, community-driven data curation rather than purely quantity-driven scraping.

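Key Contributions

  • A manual audit of the parallel and monolingual corpora available for Lombard.
  • Evidence that the apparent abundance of web-scraped data is illusory: massive datasets are plagued by language misidentification, boilerplate text, and non-linguistic noise.
  • An analysis of the orthographic composition of the valid Lombard portions across web-scraped datasets, curated corpora, and benchmarks, showing conflicting orthographic systems and a skew towards Western Lombard varieties.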

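Methodology

The authors manually audit the parallel and monolingual corpora available for Lombard, then analyze the orthographic composition of the portions identified as valid Lombard. Please refer to the full paper for the detailed methodology.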
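
As a rough illustration of what tallying a manual audit can look like, the sketch below computes the share of each annotation label in a sample of corpus lines. The label names, the `audit_summary` helper, and the toy annotations are all assumptions made for this example; they are not the paper's annotation scheme or its data.

```python
from collections import Counter

# Hypothetical audit labels; the paper's actual annotation scheme may differ.
LABELS = {"lombard", "other_language", "boilerplate", "noise"}


def audit_summary(annotations):
    """Return the share of each label among manually annotated samples."""
    unknown = set(annotations) - LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    total = len(annotations)
    if total == 0:
        # No annotations yet: report zero for every label.
        return {label: 0.0 for label in sorted(LABELS)}
    counts = Counter(annotations)
    return {label: counts[label] / total for label in sorted(LABELS)}


# Toy annotations, invented purely for illustration (not data from the paper).
sample = [
    "lombard", "other_language", "other_language", "boilerplate",
    "noise", "lombard", "other_language", "other_language",
]
summary = audit_summary(sample)
for label, share in summary.items():
    print(f"{label}: {share:.1%}")
```

A per-label breakdown like this makes it easy to compare the nominal size of a corpus with the fraction a human annotator judged to be valid text in the target language.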

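Practical Implications

The findings suggest that resources for Lombard NLP tasks such as MT would benefit more from variety-aware, community-driven data curation than from purely quantity-driven scraping.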

Authors

  • Edoardo Signoroni
  • Pavel Rychlý

Paper Information

  • arXiv ID: 2606.06349v1
  • Categories: cs.CL
  • Published: June 4, 2026