From 40% to 100% SQL Generation Accuracy: Why Local AI Needs Self-Correction, Not Perfect Prompts
Source: Dev.to
Overview
Self‑correction loops beat “perfect‑first‑time” approaches for local AI. Building a Retail Analytics Copilot that runs entirely on a laptop (using a quantized 24B model) sounds great for privacy, but it’s a nightmare for reliability. Unlike hosted GPTs, which follow instructions like a senior engineer, local models behave more like enthusiastic interns: they try hard, but they hallucinate syntax, forget schema details, and love to chat when they should be coding.
Baseline
My initial baseline was dismal: only 40 % of generated SQL queries actually executed. The rest were plagued by syntax errors, hallucinated columns, or conversational fluff (“Here is your query …”).
The Shift: Expect Errors, Then Repair
Instead of trying to prevent errors, I built a system that expects them and repairs them automatically.
Repair Loop Workflow
- Generate – The model writes a query.
- Execute – Run it against SQLite.
- Catch – If it fails, capture the error (e.g., no such column: 'Price').
- Feedback – Feed the exact error message back to the model: “The previous query failed with error X. Fix it.”
This pattern generalises beyond SQL. Whenever you work with probabilistic systems, external APIs that can fail, or ambiguous user input, design for graceful degradation.
Training Data from Failures
Each (failed_query, error_message, corrected_query) triple becomes a potential few‑shot example for future optimisation. You’re not just fixing bugs; you’re building a self‑improving system.
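A minimal sketch of how those triples could be captured for later mining, assuming a JSONL log; the helper name and file path are illustrative, not from the project:

```python
import json

def log_repair_example(failed_query: str, error_message: str, corrected_query: str,
                       path: str = "repair_examples.jsonl") -> None:
    """Append a (failed_query, error_message, corrected_query) triple for later few-shot use."""
    record = {
        "failed_query": failed_query,
        "error_message": error_message,
        "corrected_query": corrected_query,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```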
Repair Loop Implementation (Python)
```python
import re
import sqlite3

MAX_REPAIRS = 2  # retry cap; the exact value used in the project isn't shown

def sql_execution_node(state: AgentState) -> AgentState:
    """Execute SQL and handle errors gracefully."""
    query = state["sql_query"]
    try:
        cursor.execute(query)  # cursor: an sqlite3 cursor created elsewhere
        state["sql_results"] = cursor.fetchall()
        state["errors"] = []
    except sqlite3.OperationalError as e:
        # Don't crash—capture and route to repair
        state["sql_results"] = []
        state["errors"].append(str(e))
        state["feedback"] = f"SQL execution failed: {e}. Fix the query."
        state["repair_count"] = state.get("repair_count", 0) + 1
    return state

def should_repair(state: AgentState) -> str:
    """Conditional edge: repair or continue?"""
    if state["errors"] and state["repair_count"] < MAX_REPAIRS:
        return "repair"
    return "continue"

def clean_sql_output(text: str) -> str:
    """Strip conversational fluff and return only the SQL statement."""
    match = re.search(r"(SELECT\s+.*)", text, re.IGNORECASE)
    return match.group(1).strip() if match else ""
```
Even better: enforce structured output (e.g., JSON) so you avoid free‑form parsing altogether.
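A rough sketch of what that could look like, assuming the model is asked for a single JSON object; llm() is a placeholder for whatever local-model call you use:

```python
import json

# Ask for machine-readable output instead of prose, then parse strictly.
response = llm('Answer with only a JSON object: {"sql": "<a single SELECT statement>"}')
try:
    sql = json.loads(response)["sql"]
except (json.JSONDecodeError, KeyError):
    sql = ""  # treat unparseable output as a failure and route it to the repair loop
```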
Moving from Prompt Engineering to DSPy
I stopped hand‑writing prompts and switched to DSPy, which treats prompt optimisation as a learnable problem.
Think of traditional prompt engineering like manually tuning hyper‑parameters; DSPy is like back‑propagation, automatically searching for optimal prompts using gradient‑free optimisation over a metric you define.
Defining a Metric
“A query is good if it executes and returns non‑empty results.”
Using the BootstrapFewShot optimiser, DSPy generated multiple candidate SQL queries, ran them, and kept only the ones that passed the metric as few‑shot examples.
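In code, the metric and optimiser could be wired up roughly like this; sql_generator, trainset, and the pred.sql field name are assumptions standing in for the project's actual module:

```python
import sqlite3
import dspy
from dspy.teleprompt import BootstrapFewShot

def sql_executes(example, pred, trace=None):
    """A query is good if it executes against SQLite and returns non-empty results."""
    try:
        # cursor: the same sqlite3 cursor used in the execution node above
        rows = cursor.execute(clean_sql_output(pred.sql)).fetchall()
        return len(rows) > 0
    except sqlite3.OperationalError:
        return False

# Keep only candidate queries that pass the metric as few-shot demonstrations.
optimizer = BootstrapFewShot(metric=sql_executes, max_bootstrapped_demos=4)
compiled_sql_generator = optimizer.compile(sql_generator, trainset=trainset)
```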
Metric Improvements
| Metric | Baseline | After Optimisation | After Repair Loop |
|---|---|---|---|
| Valid SQL (%) | 40 % | 85 % | 100 % |
| Correct Format (%) | 30 % | 60 % | 95 % |
| End‑to‑End Success (%) | 12 % | 51 % | 66 %* |
*The gap between 100 % valid SQL and 66 % end‑to‑end success shows that optimising one component creates new bottlenecks elsewhere (orchestration logic, not model quality).
The Repair Pattern in Practice
Add a feedback field to your LLM state. On failure, inject the error message and retry. This costs roughly one extra LLM call but can improve reliability by an order of magnitude.
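Condensed into a plain loop (outside any graph framework), the pattern looks roughly like this; llm() and the retry cap are placeholders:

```python
def generate_with_repair(question: str, max_repairs: int = 2):
    """Generate SQL; on failure, feed the exact error back to the model and retry."""
    feedback = ""
    for _ in range(max_repairs + 1):
        prompt = question if not feedback else f"{question}\n\n{feedback}"
        sql = clean_sql_output(llm(prompt))  # llm() stands in for your local model call
        try:
            return cursor.execute(sql).fetchall()
        except sqlite3.OperationalError as e:
            feedback = f"The previous query failed with error: {e}. Fix the query."
    return []
```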
The “ELECT” Test
```python
assert clean_sql_output("SQL: SELECT * FROM orders") == "SELECT * FROM orders"
assert clean_sql_output("SELECT * FROM orders") == "SELECT * FROM orders"
```
If these assertions fail, you’re misusing string methods.
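The name is a nod to the classic trap: str.lstrip() removes a set of characters, not a prefix, so a naive cleanup eats the leading S:

```python
"SQL: SELECT * FROM orders".lstrip("SQL: ")        # 'ELECT * FROM orders'  (oops)
"SQL: SELECT * FROM orders".removeprefix("SQL: ")  # 'SELECT * FROM orders' (Python 3.9+)
```

The regex-based clean_sql_output above sidesteps the problem entirely.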
DSPy Starter Template
- Define your task as (inputs, output, metric).
- Let the optimiser discover effective few-shot examples instead of writing them manually (a minimal signature sketch follows).
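As a deliberately minimal starting point, assuming a text-to-SQL task; the field names and docstring here are illustrative, not the project's actual signature:

```python
import dspy

class GenerateSQL(dspy.Signature):
    """Translate a retail-analytics question into a single SQLite SELECT statement."""
    question = dspy.InputField()
    schema = dspy.InputField(desc="table and column definitions")
    sql = dspy.OutputField(desc="one executable SELECT statement, no commentary")

sql_generator = dspy.ChainOfThought(GenerateSQL)
# Pair this with a metric like sql_executes above and let BootstrapFewShot find the demos.
```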
Why This Matters (Beyond This Project)
The AI landscape is bifurcating:
| Approach | Characteristics |
|---|---|
| Cloud‑first (GPT‑5.1, Claude, …) | Powerful but expensive; higher privacy risk. |
| Edge‑first (local models) | Cheaper, private, but harder to wrangle. |
Companies that master local AI will dominate regulated industries (healthcare, finance, government) where cloud LLMs are non‑starters. The bottleneck isn’t model weights; it’s reliability engineering.
If you can make a 7B model behave like a 70B model through clever orchestration, you’re solving a $100 B problem. This is not just a technical exercise; it’s a strategic moat.
Skills That Matter
- Systems thinking – Understanding failure modes and designing around them.
- Optimization – Treating prompts as learnable parameters, not art.
- Defensive engineering – Building for a probabilistic world.
Master these, and you’ll compete not with prompt engineers but with infrastructure engineers at AI‑native companies.
Project repository: