[Paper] Pervasive Annotation Errors Break Text-to-SQL Benchmarks and Leaderboards
Source: arXiv - 2601.08778v1
Overview
The paper Pervasive Annotation Errors Break Text‑to‑SQL Benchmarks and Leaderboards examines a problem that may be quietly skewing the research landscape for text‑to‑SQL systems: pervasive annotation errors in the most widely used benchmark datasets. By quantifying the error rates and showing how they alter model scores and rankings, the authors argue that many “state‑of‑the‑art” claims may rest on faulty ground truth.
Key Contributions
- Error‑rate audit of two flagship text‑to‑SQL benchmarks (BIRD and Spider 2.0‑Snow), uncovering >50 % erroneous entries in sampled subsets.
- Manual correction of a representative slice of the BIRD development set (BIRD Mini‑Dev) to create a clean evaluation benchmark.
- Re‑evaluation of 16 open‑source text‑to‑SQL agents on both the original and corrected subsets, demonstrating performance swings of ‑7 % to +31 % (relative) and rank shifts of up to 9 positions.
- Correlation analysis showing that rankings on the noisy subset still predict performance on the full (uncorrected) dev set (Spearman ρ = 0.85) but fail to predict performance on the clean subset (ρ = 0.32).
- Release of the corrected data and evaluation scripts to the community (GitHub link).
Methodology
- Sampling & Expert Review – Randomly sampled 200 examples from each benchmark’s development split. Two domain experts independently inspected the natural‑language question, the associated SQL query, and the underlying database schema to flag mismatches, ambiguous phrasing, or outright errors. Disagreements were resolved by a third reviewer.
- Error Rate Computation – An entry was counted as erroneous if any of the following held: (a) the SQL did not correctly answer the question, (b) the question was ambiguous given the schema, or (c) the annotation violated SQL syntax/semantics.
- Creation of BIRD Mini‑Dev – All flagged errors in the sampled BIRD subset were corrected, producing a high‑quality “gold‑standard” dev set.
- Model Re‑evaluation – The 16 publicly available text‑to‑SQL systems listed on the BIRD leaderboard were run on both the original and corrected subsets using the authors’ evaluation script (exact‑match accuracy).
- Statistical Analysis – Relative performance changes were computed, and Spearman rank correlation was used to compare leaderboard orderings on (i) the original noisy subset, (ii) the corrected subset, and (iii) the full BIRD dev set (a minimal sketch of the scoring and correlation steps follows this list).
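To make the re‑evaluation and correlation steps concrete, here is a minimal sketch assuming simple question‑ID → SQL dictionaries and per‑system accuracy dictionaries; the data structures and function names are illustrative and not taken from the authors' released scripts.

```python
# Illustrative sketch (not the authors' evaluation code): exact-match scoring on a
# subset, plus Spearman correlation between two leaderboards over the same systems.
from scipy.stats import spearmanr


def exact_match_accuracy(predicted: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of examples whose predicted SQL matches the gold SQL after
    lowercasing and whitespace normalization (a simple stand-in for the
    benchmark's official metric)."""
    def norm(sql: str) -> str:
        return " ".join(sql.lower().split())
    hits = sum(norm(predicted[qid]) == norm(gold[qid]) for qid in gold)
    return hits / len(gold)


def leaderboard_correlation(scores_a: dict[str, float],
                            scores_b: dict[str, float]) -> tuple[float, float]:
    """Spearman rho and p-value between two score sets over the same systems,
    e.g. accuracy on the noisy subset vs. accuracy on the corrected subset."""
    systems = sorted(scores_a)
    rho, p_value = spearmanr([scores_a[s] for s in systems],
                             [scores_b[s] for s in systems])
    return rho, p_value
```

Comparing `leaderboard_correlation(scores_noisy, scores_full_dev)` against `leaderboard_correlation(scores_noisy, scores_clean)` reproduces the kind of contrast reported in the Results section below.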
Results & Findings
| Benchmark | Sampled Size | Annotation Error Rate |
|---|---|---|
| BIRD Mini‑Dev | 200 | 52.8 % |
| Spider 2.0‑Snow | 200 | 62.8 % |
- Performance volatility: After correcting the BIRD Mini‑Dev subset, some models improved by up to 31 % in relative accuracy, while others dropped by 7 % (a worked sketch of these calculations follows this list).
- Leaderboard reshuffling: Rank positions moved by as many as ±9 spots; the model ranked #1 on the noisy set fell to #10 on the clean set, while previously low‑ranked models climbed toward the top.
- Correlation insight: Rankings on the noisy subset still predict the full (uncorrected) dev set (ρ = 0.85, p = 3.26e‑5), indicating that the leaderboard is essentially measuring “how well you cope with bad data.” In contrast, rankings on the clean subset have a weak, non‑significant correlation (ρ = 0.32, p = 0.23).
- Implication: Current leaderboards may be rewarding robustness to annotation noise rather than genuine SQL generation ability.
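For clarity on how "relative" swings and rank shifts of this kind are computed, here is a small sketch; the helper names and the example numbers in the comments are illustrative, not figures from the paper.

```python
def relative_change(acc_original: float, acc_corrected: float) -> float:
    """Relative accuracy change from the noisy subset to the corrected subset.
    E.g. moving from 40.0% to 52.4% accuracy is a +31% relative gain
    (hypothetical numbers, for illustration only)."""
    return (acc_corrected - acc_original) / acc_original


def rank_shifts(scores_original: dict[str, float],
                scores_corrected: dict[str, float]) -> dict[str, int]:
    """Per-system leaderboard movement; positive values mean the system climbs
    after the annotation errors are corrected."""
    before = sorted(scores_original, key=scores_original.get, reverse=True)
    after = sorted(scores_corrected, key=scores_corrected.get, reverse=True)
    return {system: before.index(system) - after.index(system) for system in before}
```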
Practical Implications
- Model selection: Companies evaluating off‑the‑shelf text‑to‑SQL tools should not rely solely on benchmark scores; a sanity‑check on a clean, domain‑specific validation set is essential.
- Dataset hygiene: Teams building internal QA pipelines or custom benchmarks must invest in rigorous annotation verification to avoid misleading performance reports.
- Tooling upgrades: The released corrected BIRD Mini‑Dev can serve as a quick sanity test for new architectures, helping developers spot over‑fitting to noisy patterns.
- Research direction: Efforts that focus on “noise‑robust” training tricks may be over‑valued if the underlying benchmark is itself noisy; shifting focus toward better schema‑question alignment and error‑aware training could yield more real‑world gains.
- Deployment risk mitigation: Since annotation errors can inflate or deflate perceived accuracy, production systems should incorporate runtime validation (e.g., execution‑based checks) rather than trusting model‑generated SQL at face value; a minimal sketch of such a check follows this list.
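Below is a minimal sketch of such an execution‑based check for a SQLite backend, using only the Python standard library; the function name, row limit, and flagging criteria are assumptions, not a prescription from the paper.

```python
import sqlite3


def validate_generated_sql(db_path: str, sql: str, max_rows: int = 10_000) -> tuple[bool, str]:
    """Run model-generated SQL in read-only mode and report whether it executes
    cleanly and returns a non-degenerate result, before results reach the user."""
    try:
        # Open the database read-only so a bad query cannot modify data.
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
        try:
            rows = conn.execute(sql).fetchmany(max_rows)
        finally:
            conn.close()
    except sqlite3.Error as exc:
        return False, f"execution error: {exc}"
    if not rows:
        # An empty result is not necessarily wrong, but it is worth flagging for review.
        return False, "query executed but returned no rows"
    return True, f"ok ({len(rows)} rows)"
```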
Limitations & Future Work
- Sample size: The error audit covers only ≈200 examples from each benchmark; while the observed rates are alarming, the true global error proportions may differ (a confidence‑interval sketch follows this list).
- Human‑in‑the‑loop bias: Expert judgments, though systematic, are still subjective; a larger pool of annotators could provide more robust error classifications.
- Scope to other benchmarks: The study focuses on BIRD and Spider 2.0‑Snow; extending the analysis to other text‑to‑SQL datasets (e.g., WikiSQL, CoSQL) would validate whether the problem is systemic.
- Automated detection: Future work could explore machine‑learning‑based tools to flag likely annotation errors at scale, reducing the manual effort required for dataset cleaning.
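One way to quantify the sample‑size caveat is a confidence interval for the audited error rate. The sketch below uses a standard Wilson score interval and treats a flagged count of roughly 106 of 200 as an approximation of the reported 52.8 %; it is an illustration of the uncertainty, not an analysis from the paper.

```python
import math


def wilson_interval(flagged: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for an error proportion estimated
    from n randomly sampled, independently audited examples."""
    p = flagged / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# Roughly 106 of 200 sampled BIRD Mini-Dev entries flagged (~53%):
low, high = wilson_interval(106, 200)  # ≈ (0.46, 0.60)
```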
The authors have open‑sourced their corrected subset and evaluation scripts, inviting the community to build cleaner, more trustworthy text‑to‑SQL benchmarks.
Authors
- Tengjun Jin
- Yoojin Choi
- Yuxuan Zhu
- Daniel Kang
Paper Information
- arXiv ID: 2601.08778v1
- Categories: cs.AI, cs.DB
- Published: January 13, 2026