[Paper] CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation
Source: arXiv - 2605.08057v1
Overview
The paper presents CA‑SQL, a Text‑to‑SQL inference pipeline that adapts its search effort to the estimated difficulty of each question. By allocating more compute to harder questions and using an evolutionary‑style prompt seeding strategy, the authors report 51.72 % accuracy on the “challenging” tier of the BIRD development set with a modest LLM (GPT‑4o‑mini).
Key Contributions
- Complexity‑aware compute budgeting – a lightweight difficulty estimator decides how many candidate SQL statements to generate for each natural‑language question.
- Exploratory prompt seeding – a custom prompt that injects diverse “seed” queries, inspired by evolutionary search, encourages the base LLM to produce a broader solution space.
- Novel voting selector – after generation, a simple yet effective voting scheme picks the most promising candidate based on execution feedback and soft similarity metrics.
- State‑of‑the‑art results on BIRD – 51.72 % accuracy on the “challenging” tier of the development set using only GPT‑4o‑mini, surpassing larger‑model baselines.
- Open‑source‑friendly design – the pipeline relies on standard LLM APIs and does not require fine‑tuning, making it easy to integrate into existing developer workflows.
Methodology
- Difficulty Estimation – For each NL question, a fast heuristic (e.g., length, number of tables/joins mentioned, lexical complexity) predicts a difficulty score.
- Compute Allocation – The score maps to a budget: easy queries get a single‑shot generation, while harder ones trigger multiple generations (e.g., 5‑10 candidates).
- Prompt Seeding – Instead of a plain “translate this to SQL” prompt, the system prepends a small set of synthetically created seed queries that vary in structure (different join orders, sub‑queries, aliasing). This nudges the LLM to explore alternative formulations.
- Candidate Generation – The LLM produces SQL candidates up to the allocated budget, each conditioned on a different seed.
- Execution & Voting – Each candidate is run against the target database (or a sandbox) to collect execution results. A voting algorithm combines execution success, similarity to other candidates, and a soft F1‑style token overlap to rank the outputs. The top‑ranked SQL is returned as the final answer.
The whole pipeline is inference‑only; no gradient updates or model retraining are required.
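
The first three steps can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the cue words, feature weights, thresholds, and the names `estimate_difficulty`, `allocate_budget`, and `build_prompts` are hypothetical and are not taken from the paper. Only the overall shape (a cheap heuristic score, a budget of one candidate for easy questions and roughly 5 to 10 for harder ones, and one seed per prompt) follows the description above.

```python
import re

# Hypothetical cue words that often signal joins, aggregation, or nesting.
COMPLEXITY_CUES = {
    "each", "per", "average", "most", "least", "highest", "lowest",
    "ratio", "percentage", "between", "excluding",
}


def estimate_difficulty(question: str, schema_tables: list[str]) -> float:
    """Return a heuristic difficulty score in [0, 1] (illustrative weights)."""
    tokens = set(re.findall(r"[a-z0-9_]+", question.lower()))
    n_tokens = len(re.findall(r"[a-z0-9_]+", question.lower()))
    length_score = min(n_tokens / 40.0, 1.0)
    # Count schema tables whose name appears verbatim in the question.
    tables_mentioned = sum(1 for t in schema_tables if t.lower() in tokens)
    table_score = min(tables_mentioned / 3.0, 1.0)
    cue_score = min(len(COMPLEXITY_CUES & tokens) / 3.0, 1.0)
    return 0.3 * length_score + 0.4 * table_score + 0.3 * cue_score


def allocate_budget(score: float) -> int:
    """Map a difficulty score to a number of candidates (assumed thresholds)."""
    if score < 0.25:
        return 1   # easy: single-shot generation
    if score < 0.5:
        return 5   # moderate: a handful of candidates
    return 10      # hard: the largest budget


def build_prompts(question: str, seeds: list[str], budget: int) -> list[str]:
    """Build one prompt per budget slot, cycling through the seed queries."""
    if not seeds:
        raise ValueError("at least one seed query is required")
    return [
        f"-- Seed query to vary from:\n{seeds[i % len(seeds)]}\n"
        f"-- Question: {question}\n-- SQL:"
        for i in range(budget)
    ]
```

Each prompt returned by `build_prompts` would then be sent to the LLM as a separate request, so the number of API calls equals the allocated budget.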
Results & Findings
| Metric (BIRD dev) | CA‑SQL (GPT‑4o‑mini) | Prior In‑Context Baselines |
|---|---|---|
| Challenging tier accuracy | 51.72 % | ~38 % (GPT‑4) |
| Overall execution accuracy | 61.06 % | ~55 % |
| Soft F1 | 68.77 % | ~62 % |
Key takeaways
- Extra candidates yield diminishing returns on easy questions but substantially improve accuracy on hard ones, which supports the claim that “one‑size‑fits‑all” generation is suboptimal.
- Prompt seeding adds roughly 6‑8 percentage points on the challenging tier, showing that even a frozen LLM benefits from richer context.
- The voting selector outperforms naive “first‑candidate” or “majority‑vote” strategies, especially when execution feedback is noisy.
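
To make the difference from a plain majority vote concrete, here is a small sketch of an execution-based selector. It is not the authors' algorithm: it assumes a SQLite sandbox, treats execution success as a hard filter, and scores each surviving candidate by the summed soft F1 overlap between its result rows and those of every other candidate. The function names `run_query`, `soft_f1`, and `select_candidate` are hypothetical.

```python
import sqlite3
from collections import Counter


def run_query(conn: sqlite3.Connection, sql: str):
    """Execute a candidate; return its rows as a multiset, or None on failure."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def soft_f1(a: Counter, b: Counter) -> float:
    """F1 overlap between two multisets of result rows."""
    if not a and not b:
        return 1.0  # two empty results agree completely
    overlap = sum((a & b).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(a.values())
    recall = overlap / sum(b.values())
    return 2 * precision * recall / (precision + recall)


def select_candidate(conn: sqlite3.Connection, candidates: list[str]):
    """Return the executable candidate with the highest total agreement."""
    executed = []
    for sql in candidates:
        result = run_query(conn, sql)
        if result is not None:
            executed.append((sql, result))
    if not executed:
        return None
    # Partial overlap still counts, unlike an exact-match majority vote.
    return max(
        executed,
        key=lambda item: sum(soft_f1(item[1], other) for _, other in executed),
    )[0]


# Usage with a throwaway in-memory database (use a read-only sandbox in practice).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
candidates = [
    "SELECT name FROM t WHERE id > 1",
    "SELECT name FROM t WHERE id >= 2",
    "SELECT name FROM t",
    "SELECT nme FROM t",  # fails to execute and is filtered out
]
best = select_candidate(conn, candidates)
```

Because agreement is graded rather than exact, a candidate whose result is almost the same as the consensus still contributes support, which is one plausible reason a soft scheme can be more robust than strict majority voting.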
Practical Implications
- Developer tooling – IDE plugins or low‑code platforms that translate user questions into SQL can embed CA‑SQL’s budgeting logic to allocate more compute only when needed, keeping latency low for routine queries.
- Cost‑effective scaling – Organizations can achieve near‑state‑of‑the‑art performance with cheaper LLM endpoints (e.g., mini‑models) by spending extra API calls selectively on hard problems.
- Robust data‑access layers – Applications that need to generate ad‑hoc analytics (e.g., BI dashboards) can use the voting selector to guard against malformed SQL that would otherwise cause runtime errors.
- Educational tools – Automated tutoring systems can expose students to multiple plausible query formulations, fostering deeper understanding of relational algebra.
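
The cost argument can be checked with simple arithmetic. The sketch below compares the total number of LLM calls under a complexity-aware budget with a uniform budget. The workload mix (60 easy, 30 moderate, 10 hard questions) and the per-tier budgets are made-up numbers for illustration, not figures from the paper, and `total_calls` is a hypothetical helper.

```python
def total_calls(question_counts: dict[str, int], budgets: dict[str, int]) -> int:
    """Total LLM calls when every question in a tier gets that tier's budget."""
    return sum(n * budgets[tier] for tier, n in question_counts.items())


# Hypothetical workload: number of questions per difficulty tier.
mix = {"easy": 60, "moderate": 30, "hard": 10}

adaptive = total_calls(mix, {"easy": 1, "moderate": 5, "hard": 10})
uniform = total_calls(mix, {"easy": 10, "moderate": 10, "hard": 10})
print(adaptive, uniform)  # 310 1000
```

Under these assumed numbers the adaptive policy issues 310 calls instead of 1000, while hard questions still receive the full budget.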
Limitations & Future Work
- Heuristic difficulty estimator – The current estimator is hand‑crafted; a learned predictor could better capture nuanced complexities.
- Execution sandbox requirement – Voting relies on running candidate queries, which may be infeasible in highly restricted environments or with privacy‑sensitive data.
- Scalability to massive schemas – The approach has been validated on BIRD (moderate schema size); handling enterprise‑scale catalogs with hundreds of tables may need additional pruning strategies.
- Future directions – The authors suggest integrating reinforcement learning to adapt the budget online, exploring richer seed generation (e.g., using program synthesis), and extending the framework to other code‑generation tasks beyond SQL.
Authors
- James Petullo
- Nianwen Xue
Paper Information
- arXiv ID: 2605.08057v1
- Categories: cs.CL, cs.AI
- Published: May 8, 2026