[Paper] Anka: A Domain-Specific Language for Reliable LLM Code Generation

Published: December 29, 2025 at 12:28 AM EST
4 min read
Source: arXiv - 2512.23214v1

Overview

The paper introduces Anka, a tiny domain‑specific language (DSL) built for data‑transformation pipelines, and shows that large language models (LLMs) can generate correct Anka code almost flawlessly—even when the model has never seen the language before. By constraining the syntax and making state changes explicit, Anka cuts the error rate of LLM‑driven code generation on multi‑step tasks by roughly 40 percentage points compared with Python.

Key Contributions

  • A purpose‑made DSL for data‑pipeline tasks that eliminates common sources of ambiguity (e.g., implicit variable lifetimes, flexible ordering).
  • Zero‑shot learning demonstration: Claude 3.5 Haiku achieves 99.9 % parse success and 95.8 % task accuracy without any fine‑tuning on Anka.
  • An empirical benchmark of 100 multi‑step problems on which the same LLM reaches 100 % task accuracy with Anka versus 60 % with Python.
  • Cross‑model validation with GPT‑4o‑mini, confirming a 26.7‑point boost on multi‑step tasks when using Anka.
  • Open‑source release of the language implementation, benchmark suite, and evaluation harness to spur further research.

Methodology

  1. Designing the DSL – The authors distilled a typical ETL (extract‑transform‑load) workflow into a handful of primitives (read, filter, map, join, write, etc.) with a strict, line‑oriented syntax. Each statement’s inputs and outputs are declared explicitly, so the model never has to guess which variable a later operation should consume (a sketch of this style follows the list).
  2. Prompt engineering – For each benchmark problem, a short in‑context prompt supplies a few hand‑crafted Anka examples (the “training” data for the LLM) and a natural‑language description of the desired pipeline.
  3. Zero‑shot generation – The LLM is asked to emit Anka code directly; no fine‑tuning or external tooling is involved.
  4. Evaluation pipeline – Generated code is first parsed by the Anka interpreter (to catch syntax errors) and then executed against hidden test cases. Success is measured as parse success (syntactically valid) and task accuracy (produces the correct output).
  5. Comparative baseline – The same prompts are used to ask the LLM to produce Python code for the identical tasks, allowing a head‑to‑head error analysis.
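
Anka’s concrete grammar isn’t reproduced in this summary, so the following Python sketch is only a hypothetical rendering of the design that steps 1 and 4 describe: a strict line‑oriented syntax whose statements declare their output and input explicitly, plus an evaluation pass that parses before it executes. The FILTER/MAP keywords and the WHERE/TAKE clauses are illustrative assumptions, not Anka’s actual syntax.

```python
import re

# Hypothetical line-oriented pipeline statements (NOT Anka's real grammar):
#   adults = FILTER rows WHERE age > 30
#   names  = MAP adults TAKE name
# Every statement names its output (left of '=') and its input (after the
# keyword), so data flow between steps is fully explicit.
STMT = re.compile(r"^(\w+)\s*=\s*(FILTER|MAP)\s+(\w+)\s+(WHERE|TAKE)\s+(.+)$")

def parse(source):
    """Reject anything outside the grammar: the 'parse success' check."""
    program = []
    for lineno, line in enumerate(source.strip().splitlines(), 1):
        m = STMT.match(line.strip())
        if m is None:
            raise SyntaxError(f"line {lineno}: invalid statement: {line!r}")
        program.append(m.groups())
    return program

def execute(program, env):
    """Run statements in order; an undefined input fails immediately."""
    for out, op, inp, _, expr in program:
        if inp not in env:
            raise NameError(f"{inp!r} is consumed before it is defined")
        rows = env[inp]
        if op == "FILTER":
            # eval() keeps the sketch short; a real interpreter would
            # parse the predicate itself instead of delegating to Python.
            env[out] = [r for r in rows if eval(expr, {}, dict(r))]
        else:  # MAP ... TAKE column
            env[out] = [r[expr.strip()] for r in rows]
    return env

rows = [{"name": "Ada", "age": 36}, {"name": "Bob", "age": 19}]
program = parse("adults = FILTER rows WHERE age > 30\n"
                "names = MAP adults TAKE name")
print(execute(program, {"rows": rows})["names"])  # -> ['Ada']
```

Because every statement must match the grammar and name its input, both failure classes the paper measures (parse errors and wrong data flow) are caught mechanically rather than discovered in the output.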

Results & Findings

| Metric | Anka (Claude 3.5 Haiku) | Python (Claude 3.5 Haiku) | Python (GPT‑4o‑mini) |
| --- | --- | --- | --- |
| Parse success | 99.9 % | 92.3 % | 94.1 % |
| Overall task accuracy | 95.8 % | 60.0 % | 68.5 % |
| Multi‑step pipeline accuracy | 100 % | 60 % | 73.3 % |
| Anka’s accuracy gain vs. this baseline | – | +35.8 pp | +27.3 pp |

Key takeaways

  • The constrained syntax virtually eliminates syntactic mistakes, letting the LLM focus on the logical ordering of operations.
  • Even though the models have been trained on massive Python corpora, they struggle with implicit state handling, leading to frequent “variable not defined” or “wrong order” bugs (illustrated in the sketch below).
  • The DSL’s explicit data flow makes the generation problem easier to solve with in‑context learning alone.
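
Neither snippet below comes from the paper; it is a hypothetical Python illustration of the failure mode the takeaways describe, paired with the kind of declare‑before‑use check that explicit data flow makes possible.

```python
rows = [{"amount": 5}, {"amount": None}]

# Typical free-form Python slip: a name is consumed before the statement
# that defines it, and Python only notices once the line actually runs.
try:
    total = sum(r["amount"] for r in cleaned)  # 'cleaned' not defined yet
except NameError as err:
    print("runtime failure:", err)

cleaned = [r for r in rows if r["amount"] is not None]

# With explicit inputs and outputs, the same slip is a static error:
# one pass verifies every input is either a source or an earlier output.
def check_order(program, sources=("rows",)):
    defined = set(sources)
    for out, inputs in program:
        for name in inputs:
            if name not in defined:
                raise SyntaxError(f"{out!r} consumes undefined {name!r}")
        defined.add(out)

try:
    check_order([("total", ["cleaned"]), ("cleaned", ["rows"])])
except SyntaxError as err:
    print("caught before execution:", err)
```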

Practical Implications

  • Rapid prototyping of data pipelines – Teams can embed Anka (or a similar DSL) into internal tooling, letting developers describe transformations in plain English and receive ready‑to‑run pipeline code.
  • Reduced debugging overhead – Because the generated code parses cleanly and follows a deterministic execution order, the time spent hunting down LLM‑induced bugs drops dramatically.
  • Safer LLM integration – In regulated environments (finance, healthcare) where code correctness is non‑negotiable, a DSL acts as a guardrail, constraining the model to a vetted subset of operations (one possible check is sketched after this list).
  • Model‑agnostic benefits – The cross‑model validation suggests that any sufficiently capable LLM can leverage a well‑designed DSL, making the approach portable across vendor APIs.
  • Extensible to other domains – The same design principles (explicit state, limited primitives) could be applied to networking configs, cloud‑infra manifests, or UI layout specifications, turning LLMs into reliable code assistants for niche stacks.
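
The summary does not say how such a guardrail would be enforced; one plausible shape, assumed here purely for illustration, is to vet every operation in a generated program against a whitelist before the interpreter runs anything.

```python
# Hypothetical guardrail: refuse to run a generated program that uses any
# operation outside a vetted set. ALLOWED_OPS mirrors the primitives the
# paper summary names; the (op, args) program shape is an assumption.
ALLOWED_OPS = {"read", "filter", "map", "join", "write"}

def vet(program):
    bad = sorted({op for op, _ in program if op not in ALLOWED_OPS})
    if bad:
        raise PermissionError(f"disallowed operations: {bad}")
    return program

vet([("read", "orders.csv"), ("filter", "amount > 0"), ("write", "out.csv")])
try:
    vet([("read", "orders.csv"), ("exec", "arbitrary code")])
except PermissionError as err:
    print(err)
```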

Limitations & Future Work

  • Domain narrowness – Anka targets data‑transformation pipelines; its advantages may not transfer directly to more general programming tasks (e.g., algorithmic problem solving).
  • Learning curve for developers – Teams need to adopt a new syntax and tooling, which could be a barrier without strong incentives.
  • Scalability of the benchmark – The study uses 100 curated problems; larger, more diverse real‑world workloads could reveal edge cases.
  • Prompt dependence – Performance hinges on the quality of the few‑shot examples; automated prompt generation or curriculum learning remains unexplored.
  • Future directions – The authors propose expanding the DSL to cover streaming data, adding type inference, and investigating fine‑tuning LLMs on DSL corpora to push accuracy even higher.

Authors

  • Saif Khalfan Saif Al Mazrouei

Paper Information

  • arXiv ID: 2512.23214v1
  • Categories: cs.CL, cs.LG, cs.PL, cs.SE
  • Published: December 29, 2025