[Paper] SteuerLLM: Local specialized large language model for German tax law analysis
Source: arXiv - 2602.11081v1
Overview
The paper introduces SteuerLLM, a 28‑billion‑parameter large language model fine‑tuned specifically for German tax law. To evaluate it, the authors also release SteuerEx, the first open benchmark built from real German university tax‑law exams, complete with a partial‑credit scoring scheme that mirrors how students are actually graded. The work shows that a domain‑adapted LLM can beat much larger general‑purpose models on legally rigorous tasks, underscoring that specialized data matters more than sheer model size.
Key Contributions
- SteuerEx benchmark – 115 expert‑validated exam questions covering six core tax‑law topics, with a statement‑level, partial‑credit evaluation that reflects real‑world grading.
- Synthetic training pipeline – A controlled retrieval‑augmented generation process that turns authentic exam material into a large, high‑quality synthetic dataset while preserving legal terminology and citation style.
- SteuerLLM model – A 28 B parameter LLM fine‑tuned on the synthetic tax‑law corpus; it consistently outperforms comparable‑size instruction‑tuned models and even larger general models on the SteuerEx benchmark.
- Open science release – All benchmark data, training corpora, model weights, and evaluation scripts are publicly available, plus a live web demo for interactive testing.
Methodology
1. Benchmark construction
- Collected past German university tax‑law exam papers.
- Selected 115 questions spanning income tax, corporate tax, VAT, inheritance tax, trade tax, and international tax.
- Each question was broken down into individual statements; experts assigned partial‑credit scores (0–1) to reflect the nuanced grading used in academia.
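The statement‑level, partial‑credit scoring described above can be sketched as follows. This is an illustrative reconstruction, not the authors' released evaluation code; the `Statement` type and function name are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Statement:
    text: str
    credit: float  # expert-assigned weight in [0, 1]

def partial_credit_score(statements, judgments):
    """Score one exam question: each statement earns its expert-assigned
    credit when the model's answer covers it correctly (judgment True).
    Returns a normalized score in [0, 1]."""
    total = sum(s.credit for s in statements)
    earned = sum(s.credit for s, ok in zip(statements, judgments) if ok)
    return earned / total if total else 0.0
```

Normalizing by the total credit keeps questions with many statements from dominating the benchmark average.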
2. Synthetic data generation
- Used a retrieval‑augmented pipeline: a base LLM first retrieved relevant statutes and prior exam solutions, then generated new question‑answer pairs that mimic the style and citation rigor of the original exams.
- Applied strict post‑processing filters (e.g., correct citation format, numeric consistency) to ensure legal fidelity.
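One such post‑processing filter, the citation‑format check, might look like the sketch below. The regex and rejection rule are hypothetical; the paper's actual filtering rules are not detailed in this summary. The pattern targets common German statute citations such as "§ 15 Abs. 1 EStG".

```python
import re

# Hypothetical pattern for German statutory citations,
# e.g. "§ 15 Abs. 1 Nr. 2 EStG".
CITATION_RE = re.compile(
    r"§\s*\d+[a-z]?(\s+Abs\.\s*\d+)?(\s+Nr\.\s*\d+)?(\s+Satz\s*\d+)?"
    r"\s+(EStG|KStG|UStG|ErbStG|GewStG|AO)"
)

def passes_citation_filter(answer: str) -> bool:
    """Keep a synthetic Q&A pair only if every '§' token in the answer
    is part of a well-formed statute citation."""
    return answer.count("§") == len(CITATION_RE.findall(answer))
```

A numeric‑consistency filter could be layered on in the same way, rejecting pairs whose computed amounts disagree between question and answer.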
3. Model fine‑tuning
- Started from a strong German‑language LLM (28 B parameters).
- Trained on the synthetic tax‑law corpus with instruction‑following objectives (answer generation, citation extraction, numerical reasoning).
- Employed LoRA adapters to keep compute requirements manageable while allowing rapid experimentation.
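LoRA keeps the pretrained weight matrix frozen and learns only a low‑rank additive update, which is why fine‑tuning a 28 B model stays affordable. A minimal stdlib sketch of the forward pass (illustrative math only, not the paper's training code):

```python
def matmul(A, B):
    """Naive matrix multiply over lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_forward(x, W, A, B, scale):
    """y = x*W + scale * x*(A*B): frozen weight W (d x k) plus a trainable
    low-rank update A (d x r) * B (r x k), with r << d, k."""
    base = matmul(x, W)            # frozen pretrained path
    delta = matmul(matmul(x, A), B)  # low-rank adapter path
    return [[b + scale * d for b, d in zip(rb, rd)]
            for rb, rd in zip(base, delta)]
```

Only `A` and `B` are trained, so the number of updated parameters scales with the rank `r` rather than with the full weight dimensions.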
4. Evaluation
- Ran all models on SteuerEx, scoring each statement with the partial‑credit metric.
- Compared against several baselines: a generic instruction‑tuned 28 B model, a 70 B general‑purpose LLM, and a smaller 7 B domain‑specific model.
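Putting the two evaluation steps together, a benchmark run averages the per‑question partial‑credit scores for each model. A hedged sketch, with a hypothetical `grade_fn` standing in for the expert correctness judgment:

```python
def evaluate_model(questions, grade_fn):
    """Run one model over a SteuerEx-style benchmark.

    questions: list of (statements, weights) pairs, where weights are
    the expert-assigned credits per statement.
    grade_fn: maps a statement to a correctness flag for the model
    under test (in practice, this judgment comes from expert grading).
    Returns the average partial-credit score over all questions.
    """
    scores = []
    for statements, weights in questions:
        total = sum(weights)
        earned = sum(w for s, w in zip(statements, weights) if grade_fn(s))
        scores.append(earned / total if total else 0.0)
    return sum(scores) / len(scores)
```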
Results & Findings
| Model | Avg. Partial‑Credit Score (0‑1) | Relative Gain vs. Generic 28 B |
|---|---|---|
| Generic 28 B (instruction‑tuned) | 0.48 | – |
| 70 B General LLM | 0.51 | +6 % |
| SteuerLLM (28 B) | 0.66 | +38 % |
| Small domain‑specific (7 B) | 0.58 | +21 % |
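The "Relative Gain" column follows directly from the partial‑credit averages, as the arithmetic below confirms (e.g. (0.66 − 0.48) / 0.48 = 37.5 %, reported as +38 %):

```python
def relative_gain_pct(score: float, baseline: float) -> int:
    """Relative gain over the generic 28 B baseline, in whole percent."""
    return round(100 * (score - baseline) / baseline)
```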
- SteuerLLM beats the larger 70 B model despite having fewer parameters, confirming that domain‑specific data matters more than raw scale for legal reasoning.
- The model shows marked improvements in statutory citation accuracy (↑ 45 % correct citations) and numerical precision (error rate ↓ 30 %).
- Human evaluators noted that SteuerLLM’s explanations follow the structured argumentation style required in tax‑law reasoning, something generic models often miss.
Practical Implications
- Legal tech startups can embed SteuerLLM (or a similar domain‑adapted model) into tax‑advisory chatbots, reducing the need for costly human review of routine queries.
- Enterprise tax departments may automate the first draft of tax filings, statutory citations, or internal compliance memos, freeing accountants to focus on high‑value analysis.
- The retrieval‑augmented synthetic data pipeline offers a reproducible recipe for other regulated domains (e.g., GDPR, financial reporting) where annotated data are scarce.
- Because the model is released under an open license, developers can fine‑tune it further for company‑specific statutes, regional variations, or integration with existing document‑management systems.
Limitations & Future Work
- Synthetic bias: Although the generation pipeline enforces legal formality, it may still propagate subtle biases from the base LLM, leading to occasional mis‑interpretations of ambiguous statutes.
- Scope: SteuerEx covers university‑level exams; real‑world tax consulting often involves more complex, multi‑jurisdictional scenarios that were not tested.
- Explainability: The model can produce plausible citations, but it does not expose a transparent reasoning trace that auditors could verify.
- Future directions suggested by the authors include: expanding the benchmark to cover corporate‑level tax filings, integrating external legal databases for real‑time retrieval, and exploring chain‑of‑thought prompting to improve interpretability.
Authors
- Sebastian Wind
- Jeta Sopa
- Laurin Schmid
- Quirin Jackl
- Sebastian Kiefer
- Fei Wu
- Martin Mayr
- Harald Köstler
- Gerhard Wellein
- Andreas Maier
- Soroosh Tayebi Arasteh
Paper Information
- arXiv ID: 2602.11081v1
- Categories: cs.CL, cs.AI, cs.LG
- Published: February 11, 2026