[Paper] 'Chi nas dal soch el sent de legn' -- Auditing Text Corpora for Lombard

Published: 6 days ago (June 4, 2026 at 12:20 PM EDT)

2 min read

Source: arXiv

Source: arXiv - 2606.06349v1

Overview

Several of the world’s languages are still under-resourced in terms of Natural Language Processing (NLP) tools. This is mostly due to the lack of high-quality datasets to train, develop, and evaluate systems and models for several tasks, such as Machine Translation (MT). We conduct a manual audit of the parallel and monolingual corpora available for Lombard, an under-resourced language continuum from Italy. Our analysis reveals that the perceived abundance of web-scraped data is an illusion, with massive datasets plagued by severe language misidentification, boilerplate text, and non-linguistic noise. Furthermore, we analyze the orthographic composition of the valid Lombard portions across web-scraped datasets, curated corpora, and benchmarks. Our findings show conflicting orthographical systems and severe representational bias across all corpora: high-quality data is heavily skewed towards Western Lombard varieties, with Eastern ones left on the margins. This underscores the need for variety-aware, community-driven data curation rather than purely quantity-driven scraping.

Key Contributions

This paper presents research in the following areas:

cs.CL

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

This research contributes to the advancement of cs.CL.

Authors

Edoardo Signoroni
Pavel Rychlý

Paper Information

arXiv ID: 2606.06349v1
Categories: cs.CL
Published: June 4, 2026
PDF: Download PDF

[Paper] 'Chi nas dal soch el sent de legn' -- Auditing Text Corpora for Lombard

Overview

Key Contributions

Methodology

Practical Implications

Authors

Paper Information

Related posts

[Paper] How reliable are LLMs when it comes to playing dice?

[Paper] Agentopia: Long-Term Life Simulation and Learning in Agent Societies

[Paper] MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

[Paper] Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings