[Paper] CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation

Published: May 8, 2026 at 01:44 PM EDT

Source: arXiv - 2605.08057v1

Overview

The paper presents CA‑SQL, a new Text‑to‑SQL inference pipeline that adapts its search effort to the estimated difficulty of each query. By dynamically allocating more compute to harder problems and using an evolutionary‑style prompt seeding strategy, the authors push the limits of what a modest LLM (GPT‑4o‑mini) can achieve on the notoriously tough “challenging” tier of the BIRD benchmark.

Key Contributions

Complexity‑aware compute budgeting – a lightweight difficulty estimator decides how many candidate SQL statements to generate for each natural‑language question.
Exploratory prompt seeding – a custom prompt that injects diverse “seed” queries, inspired by evolutionary search, encourages the base LLM to produce a broader solution space.
Novel voting selector – after generation, a simple yet effective voting scheme picks the most promising candidate based on execution feedback and soft similarity metrics.
State‑of‑the‑art results on BIRD – 51.72 % accuracy on the “challenging” development set using only GPT‑4o‑mini, surpassing larger‑model baselines.
Open‑source‑friendly design – the pipeline relies on standard LLM APIs and does not require fine‑tuning, making it easy to integrate into existing developer workflows.

Methodology

Difficulty Estimation – For each NL question, a fast heuristic (e.g., length, number of tables/joins mentioned, lexical complexity) predicts a difficulty score.
Compute Allocation – The score maps to a budget: easy queries get a single‑shot generation, while harder ones trigger multiple generations (e.g., 5‑10 candidates).
Prompt Seeding – Instead of a plain “translate this to SQL” prompt, the system prepends a small set of synthetically created seed queries that vary in structure (different join orders, sub‑queries, aliasing). This nudges the LLM to explore alternative formulations.
Candidate Generation – The LLM produces a batch of SQL statements per budget slot, each conditioned on a different seed.
Execution & Voting – Each candidate is run against the target database (or a sandbox) to collect execution results. A voting algorithm combines execution success, similarity to other candidates, and a soft F1‑style token overlap to rank the outputs. The top‑ranked SQL is returned as the final answer.
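The first three steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature weights, the cue-word list, the score thresholds, and the seed queries are all hypothetical, chosen only to match the description in this summary (single-shot for easy questions, 5 to 10 candidates for harder ones).

```python
import re

# Hypothetical cue words that hint at aggregation, ranking, or filtering.
COMPLEX_CUES = {"average", "each", "most", "least", "ratio", "percentage",
                "between", "rank", "top", "highest", "lowest"}

# Hypothetical seed queries with different structures (join, sub-query, CTE).
SEED_QUERIES = [
    "SELECT a.x FROM a JOIN b ON a.id = b.a_id",
    "SELECT x FROM a WHERE id IN (SELECT a_id FROM b)",
    "WITH t AS (SELECT a_id FROM b) SELECT a.x FROM a JOIN t ON a.id = t.a_id",
]


def estimate_difficulty(question: str, schema_tables: list[str]) -> float:
    """Score a question in [0, 1] from surface features (assumed heuristic)."""
    tokens = re.findall(r"[a-z0-9_]+", question.lower())
    token_set = set(tokens)
    length_score = min(len(tokens) / 40.0, 1.0)
    tables_mentioned = sum(1 for t in schema_tables if t.lower() in token_set)
    table_score = min(tables_mentioned / 3.0, 1.0)
    cue_score = min(len(token_set & COMPLEX_CUES) / 4.0, 1.0)
    # Weights are illustrative, not taken from the paper.
    return 0.3 * length_score + 0.4 * table_score + 0.3 * cue_score


def allocate_budget(score: float) -> int:
    """Map a difficulty score to a number of candidates (assumed thresholds)."""
    if score < 0.3:
        return 1   # easy: single-shot generation
    if score < 0.6:
        return 5   # medium
    return 10      # hard


def build_prompts(question: str, schema_ddl: str, budget: int) -> list[str]:
    """Build one prompt per budget slot, cycling through the seed queries."""
    prompts = []
    for slot in range(budget):
        seed = SEED_QUERIES[slot % len(SEED_QUERIES)]
        prompts.append(
            f"Schema:\n{schema_ddl}\n\n"
            f"Example of a possible query shape:\n{seed}\n\n"
            f"Question: {question}\nSQL:"
        )
    return prompts
```

Each prompt would then be sent to the LLM to obtain one candidate. Cycling through seeds is one simple way to vary the context per slot; the paper's seeding strategy may differ.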

The whole pipeline is inference‑only; no gradient updates or model retraining are required.

Results & Findings

Metric (BIRD dev)	CA‑SQL (GPT‑4o‑mini)	Prior In‑Context Baselines
Challenging tier accuracy	51.72 %	~38 % (GPT‑4)
Overall execution accuracy	61.06 %	~55 %
Soft F1	68.77 %	~62 %

Key takeaways

Extra generations yield diminishing returns on easy queries but substantially improve accuracy on hard ones, which suggests that “one‑size‑fits‑all” generation is suboptimal.
Prompt seeding adds ~6‑8 % absolute gain on the challenging tier, showing that even a frozen LLM benefits from richer context.
The voting selector outperforms naive “first‑candidate” or “majority‑vote” strategies, especially when execution feedback is noisy.
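To make the idea of an execution-based selector concrete, here is a simplified sketch using Python's built-in sqlite3 module. It is an assumption-laden stand-in, not the paper's voting scheme: candidates that fail to execute are dropped, and each survivor is scored by the sum of its row-level F1 overlap with every other survivor's result set. The function names and the toy orders table are hypothetical.

```python
import sqlite3


def run_query(conn, sql):
    """Return the result rows as a frozenset, or None if execution fails.

    Using a set ignores row order and duplicate rows (a simplification).
    """
    try:
        return frozenset(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def row_f1(rows_a, rows_b):
    """F1 overlap between two result sets, treating rows as items."""
    if not rows_a and not rows_b:
        return 1.0
    overlap = len(rows_a & rows_b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(rows_a)
    recall = overlap / len(rows_b)
    return 2 * precision * recall / (precision + recall)


def select_candidate(conn, candidates):
    """Pick the executable candidate that agrees most with the others."""
    valid = []
    for sql in candidates:
        rows = run_query(conn, sql)
        if rows is not None:
            valid.append((sql, rows))
    if not valid:
        return None

    def agreement(i):
        return sum(row_f1(valid[i][1], rows)
                   for j, (_, rows) in enumerate(valid) if j != i)

    # Ties are broken in favour of the earliest candidate.
    best = max(range(len(valid)), key=agreement)
    return valid[best][0]


# Toy usage on an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer TEXT, total REAL);
    INSERT INTO orders VALUES (1, 'ann', 10.0), (2, 'bob', 25.0), (3, 'ann', 40.0);
""")
candidates = [
    "SELECT customer FROM orders WHERE total > 20",
    "SELECT customer FROM orders WHERE total >= 25",
    "SELECT customer FROM ordrs WHERE total > 20",   # typo: fails to execute
    "SELECT customer FROM orders WHERE total > 30",
]
best = select_candidate(conn, candidates)
```

Unlike a strict majority vote, which only counts exact matches between result sets, the soft overlap gives partial credit to candidates whose results nearly agree. In the toy example the misspelled table name is filtered out and the first query is selected.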

Practical Implications

Developer tooling – IDE plugins or low‑code platforms that translate user questions into SQL can embed CA‑SQL’s budgeting logic to allocate more compute only when needed, keeping latency low for routine queries.
Cost‑effective scaling – Organizations can achieve near‑state‑of‑the‑art performance with cheaper LLM endpoints (e.g., mini‑models) by spending extra API calls selectively on hard problems.
Robust data‑access layers – Applications that need to generate ad‑hoc analytics (e.g., BI dashboards) can use the voting selector to guard against malformed SQL that would otherwise cause runtime errors.
Educational tools – Automated tutoring systems can expose students to multiple plausible query formulations, fostering deeper understanding of relational algebra.
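The cost argument above comes down to simple arithmetic: the expected number of LLM calls per question is the budget of each difficulty tier weighted by how often that tier occurs. The sketch below uses a purely hypothetical workload mix and the illustrative budgets of 1, 5, and 10 calls; none of these numbers are reported in the paper.

```python
def expected_calls(tier_share, tier_budget):
    """Expected LLM calls per question for a given difficulty mix."""
    if abs(sum(tier_share.values()) - 1.0) > 1e-9:
        raise ValueError("tier shares must sum to 1")
    return sum(share * tier_budget[tier] for tier, share in tier_share.items())


# Hypothetical workload: most questions are easy, few are hard.
share = {"easy": 0.6, "medium": 0.3, "hard": 0.1}
adaptive = expected_calls(share, {"easy": 1, "medium": 5, "hard": 10})
uniform = expected_calls(share, {"easy": 10, "medium": 10, "hard": 10})
savings = 1.0 - adaptive / uniform
```

Under this assumed mix, adaptive budgeting averages roughly 3.1 calls per question against 10 for a uniform budget. The real saving depends entirely on the actual difficulty distribution and on how accurate the estimator is.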

Limitations & Future Work

Heuristic difficulty estimator – The current estimator is hand‑crafted; a learned predictor could better capture nuanced complexities.
Execution sandbox requirement – Voting relies on running candidate queries, which may be infeasible in highly restricted environments or with privacy‑sensitive data.
Scalability to massive schemas – The approach has been validated on BIRD (moderate schema size); handling enterprise‑scale catalogs with hundreds of tables may need additional pruning strategies.
Future directions suggested by the authors include integrating reinforcement learning to adapt the budget online, exploring richer seed generation (e.g., using program synthesis), and extending the framework to other code‑generation tasks beyond SQL.

Authors

James Petullo
Nianwen Xue

Paper Information

arXiv ID: 2605.08057v1
Categories: cs.CL, cs.AI
Published: May 8, 2026
