[Paper] Anka: A Domain-Specific Language for Reliable LLM Code Generation

Published: December 29, 2025 at 12:28 AM EST
4 min read
Source: arXiv - 2512.23214v1

Overview

The paper introduces Anka, a tiny domain‑specific language (DSL) built for data‑transformation pipelines, and shows that large language models (LLMs) can generate correct Anka code almost flawlessly—even when the model has never seen the language before. By constraining the syntax and making state changes explicit, Anka cuts the error rate of LLM‑driven code generation on multi‑step tasks by roughly 40 percentage points compared with Python.

Key Contributions

  • A purpose‑made DSL for data‑pipeline tasks that eliminates common sources of ambiguity (e.g., implicit variable lifetimes, flexible ordering).
  • Zero‑shot learning demonstration: Claude 3.5 Haiku achieves 99.9 % parse success and 95.8 % task accuracy without any fine‑tuning on Anka.
  • An empirical benchmark of 100 multi‑step problems on which the same LLM reaches 100 % task accuracy with Anka versus 60 % with Python.
  • Cross‑model validation with GPT‑4o‑mini, confirming a 26.7‑point boost on multi‑step tasks when using Anka.
  • Open‑source release of the language implementation, benchmark suite, and evaluation harness to spur further research.

Methodology

  1. Designing the DSL – The authors distilled a typical ETL (extract‑transform‑load) workflow into a handful of primitives (read, filter, map, join, write, etc.) with a strict, line‑oriented syntax. Each statement’s inputs and outputs are declared explicitly, so the model never has to guess which variable a later operation should consume (a sketch of this style follows the list).
  2. Prompt engineering – For each benchmark problem, a short in‑context prompt supplies a few hand‑crafted Anka examples (the “training” data for the LLM) and a natural‑language description of the desired pipeline.
  3. Zero‑shot generation – The LLM is asked to emit Anka code directly; no fine‑tuning or external tooling is involved.
  4. Evaluation pipeline – Generated code is first parsed by the Anka interpreter (to catch syntax errors) and then executed against hidden test cases. Success is measured as parse success (syntactically valid) and task accuracy (produces the correct output).
  5. Comparative baseline – The same prompts are used to ask the LLM to produce Python code for the identical tasks, allowing a head‑to‑head error analysis.
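
Anka’s concrete grammar isn’t reproduced in this summary, so the following Python sketch is only a hypothetical rendering of the design that steps 1 and 4 describe: a strict line‑oriented syntax whose statements declare their output and input explicitly, plus an evaluation pass that parses before it executes. The FILTER/MAP keywords and the WHERE/TAKE clauses are illustrative assumptions, not Anka’s actual syntax.

```python
import re

# Hypothetical line-oriented pipeline statements (NOT Anka's real grammar):
#   adults = FILTER rows WHERE age > 30
#   names  = MAP adults TAKE name
# Every statement names its output (left of '=') and its input (after the
# keyword), so data flow between steps is fully explicit.
STMT = re.compile(r"^(\w+)\s*=\s*(FILTER|MAP)\s+(\w+)\s+(WHERE|TAKE)\s+(.+)$")

def parse(source):
    """Reject anything outside the grammar: the 'parse success' check."""
    program = []
    for lineno, line in enumerate(source.strip().splitlines(), 1):
        m = STMT.match(line.strip())
        if m is None:
            raise SyntaxError(f"line {lineno}: invalid statement: {line!r}")
        program.append(m.groups())
    return program

def execute(program, env):
    """Run statements in order; an undefined input fails immediately."""
    for out, op, inp, _, expr in program:
        if inp not in env:
            raise NameError(f"{inp!r} is consumed before it is defined")
        rows = env[inp]
        if op == "FILTER":
            # eval() keeps the sketch short; a real interpreter would
            # parse the predicate itself instead of delegating to Python.
            env[out] = [r for r in rows if eval(expr, {}, dict(r))]
        else:  # MAP ... TAKE column
            env[out] = [r[expr.strip()] for r in rows]
    return env

rows = [{"name": "Ada", "age": 36}, {"name": "Bob", "age": 19}]
program = parse("adults = FILTER rows WHERE age > 30\n"
                "names = MAP adults TAKE name")
print(execute(program, {"rows": rows})["names"])  # -> ['Ada']
```

Because every statement must match the grammar and name its input, both failure classes the paper measures (parse errors and wrong data flow) are caught mechanically rather than discovered in the output.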

Results & Findings

| Metric | Anka (Claude 3.5 Haiku) | Python (Claude 3.5 Haiku) | Python (GPT‑4o‑mini) |
| --- | --- | --- | --- |
| Parse success | 99.9 % | 92.3 % | 94.1 % |
| Overall task accuracy | 95.8 % | 60.0 % | 68.5 % |
| Multi‑step pipeline accuracy | 100 % | 60 % | 73.3 % |
| Anka’s accuracy gain vs. this baseline | – | +35.8 pp | +27.3 pp |

Key takeaways

  • The constrained syntax virtually eliminates syntactic mistakes, letting the LLM focus on the logical ordering of operations.
  • Even though the models have been trained on massive Python corpora, they struggle with implicit state handling, leading to frequent “variable not defined” or “wrong order” bugs (illustrated in the sketch below).
  • The DSL’s explicit data flow makes the generation problem easier to solve with in‑context learning alone.
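
Neither snippet below comes from the paper; it is a hypothetical Python illustration of the failure mode the takeaways describe, paired with the kind of declare‑before‑use check that explicit data flow makes possible.

```python
rows = [{"amount": 5}, {"amount": None}]

# Typical free-form Python slip: a name is consumed before the statement
# that defines it, and Python only notices once the line actually runs.
try:
    total = sum(r["amount"] for r in cleaned)  # 'cleaned' not defined yet
except NameError as err:
    print("runtime failure:", err)

cleaned = [r for r in rows if r["amount"] is not None]

# With explicit inputs and outputs, the same slip is a static error:
# one pass verifies every input is either a source or an earlier output.
def check_order(program, sources=("rows",)):
    defined = set(sources)
    for out, inputs in program:
        for name in inputs:
            if name not in defined:
                raise SyntaxError(f"{out!r} consumes undefined {name!r}")
        defined.add(out)

try:
    check_order([("total", ["cleaned"]), ("cleaned", ["rows"])])
except SyntaxError as err:
    print("caught before execution:", err)
```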

Practical Implications

  • Rapid prototyping of data pipelines – Teams can embed Anka (or a similar DSL) into internal tooling, letting developers describe transformations in plain English and receive ready‑to‑run pipeline code.
  • Reduced debugging overhead – Because the generated code parses cleanly and follows a deterministic execution order, the time spent hunting down LLM‑induced bugs drops dramatically.
  • Safer LLM integration – In regulated environments (finance, healthcare) where code correctness is non‑negotiable, a DSL acts as a guardrail, constraining the model to a vetted subset of operations (one possible check is sketched after this list).
  • Model‑agnostic benefits – The cross‑model validation suggests that any sufficiently capable LLM can leverage a well‑designed DSL, making the approach portable across vendor APIs.
  • Extensible to other domains – The same design principles (explicit state, limited primitives) could be applied to networking configs, cloud‑infra manifests, or UI layout specifications, turning LLMs into reliable code assistants for niche stacks.
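
The summary does not say how such a guardrail would be enforced; one plausible shape, assumed here purely for illustration, is to vet every operation in a generated program against a whitelist before the interpreter runs anything.

```python
# Hypothetical guardrail: refuse to run a generated program that uses any
# operation outside a vetted set. ALLOWED_OPS mirrors the primitives the
# paper summary names; the (op, args) program shape is an assumption.
ALLOWED_OPS = {"read", "filter", "map", "join", "write"}

def vet(program):
    bad = sorted({op for op, _ in program if op not in ALLOWED_OPS})
    if bad:
        raise PermissionError(f"disallowed operations: {bad}")
    return program

vet([("read", "orders.csv"), ("filter", "amount > 0"), ("write", "out.csv")])
try:
    vet([("read", "orders.csv"), ("exec", "arbitrary code")])
except PermissionError as err:
    print(err)
```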

Limitations & Future Work

  • Domain narrowness – Anka targets data‑transformation pipelines; its advantages may not transfer directly to more general programming tasks (e.g., algorithmic problem solving).
  • Learning curve for developers – Teams need to adopt a new syntax and tooling, which could be a barrier without strong incentives.
  • Scalability of the benchmark – The study uses 100 curated problems; larger, more diverse real‑world workloads could reveal edge cases.
  • Prompt dependence – Performance hinges on the quality of the few‑shot examples; automated prompt generation or curriculum learning remains unexplored.
  • Future directions – The authors propose expanding the DSL to cover streaming data, adding type inference, and investigating fine‑tuning LLMs on DSL corpora to push accuracy even higher.

Authors

  • Saif Khalfan Saif Al Mazrouei

Paper Information

  • arXiv ID: 2512.23214v1
  • Categories: cs.CL, cs.LG, cs.PL, cs.SE
  • Published: December 29, 2025