[Paper] SteuerLLM: Local specialized large language model for German tax law analysis
Source: arXiv - 2602.11081v1
Overview
The paper introduces SteuerLLM, a 28‑billion‑parameter large language model fine‑tuned specifically for German tax law. To evaluate it, the authors also release SteuerEx, the first open benchmark built from real German university tax‑law exams, complete with a partial‑credit scoring scheme that mirrors how students are actually graded. The work shows that a domain‑adapted LLM can beat much larger general‑purpose models on legally rigorous tasks, underscoring that specialized data matters more than sheer model size.
Key Contributions
- SteuerEx benchmark – 115 expert‑validated exam questions covering six core tax‑law topics, with a statement‑level, partial‑credit evaluation that reflects real‑world grading.
- Synthetic training pipeline – A controlled retrieval‑augmented generation process that turns authentic exam material into a large, high‑quality synthetic dataset while preserving legal terminology and citation style.
- SteuerLLM model – A 28 B parameter LLM fine‑tuned on the synthetic tax‑law corpus; it consistently outperforms comparable‑size instruction‑tuned models and even larger general models on the SteuerEx benchmark.
- Open science release – All benchmark data, training corpora, model weights, and evaluation scripts are publicly available, plus a live web demo for interactive testing.
Methodology
1. Benchmark construction
- Collected past German university tax‑law exam papers.
- Selected 115 questions spanning income tax, corporate tax, VAT, inheritance tax, trade tax, and international tax.
- Each question was broken down into individual statements; experts assigned partial‑credit scores (0–1) to reflect the nuanced grading used in academia.
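The statement‑level, partial‑credit scoring described above can be sketched as follows. This is an illustrative reconstruction, not the authors' released evaluation code; the `Statement` type and function name are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Statement:
    text: str
    credit: float  # expert-assigned weight in [0, 1]

def partial_credit_score(statements, judgments):
    """Score one exam question: each statement earns its expert-assigned
    credit when the model's answer covers it correctly (judgment True).
    Returns a normalized score in [0, 1]."""
    total = sum(s.credit for s in statements)
    earned = sum(s.credit for s, ok in zip(statements, judgments) if ok)
    return earned / total if total else 0.0
```

Normalizing by the total credit keeps questions with many statements from dominating the benchmark average.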
2. Synthetic data generation
- Used a retrieval‑augmented pipeline: a base LLM first retrieved relevant statutes and prior exam solutions, then generated new question‑answer pairs that mimic the style and citation rigor of the original exams.
- Applied strict post‑processing filters (e.g., correct citation format, numeric consistency) to ensure legal fidelity.
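One such post‑processing filter, the citation‑format check, might look like the sketch below. The regex and rejection rule are hypothetical; the paper's actual filtering rules are not detailed in this summary. The pattern targets common German statute citations such as "§ 15 Abs. 1 EStG".

```python
import re

# Hypothetical pattern for German statutory citations,
# e.g. "§ 15 Abs. 1 Nr. 2 EStG".
CITATION_RE = re.compile(
    r"§\s*\d+[a-z]?(\s+Abs\.\s*\d+)?(\s+Nr\.\s*\d+)?(\s+Satz\s*\d+)?"
    r"\s+(EStG|KStG|UStG|ErbStG|GewStG|AO)"
)

def passes_citation_filter(answer: str) -> bool:
    """Keep a synthetic Q&A pair only if every '§' token in the answer
    is part of a well-formed statute citation."""
    return answer.count("§") == len(CITATION_RE.findall(answer))
```

A numeric‑consistency filter could be layered on in the same way, rejecting pairs whose computed amounts disagree between question and answer.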
3. Model fine‑tuning
- Started from a strong German‑language LLM (28 B parameters).
- Trained on the synthetic tax‑law corpus with instruction‑following objectives (answer generation, citation extraction, numerical reasoning).
- Employed LoRA adapters to keep compute requirements manageable while allowing rapid experimentation.
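LoRA keeps the pretrained weight matrix frozen and learns only a low‑rank additive update, which is why fine‑tuning a 28 B model stays affordable. A minimal stdlib sketch of the forward pass (illustrative math only, not the paper's training code):

```python
def matmul(A, B):
    """Naive matrix multiply over lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_forward(x, W, A, B, scale):
    """y = x*W + scale * x*(A*B): frozen weight W (d x k) plus a trainable
    low-rank update A (d x r) * B (r x k), with r << d, k."""
    base = matmul(x, W)            # frozen pretrained path
    delta = matmul(matmul(x, A), B)  # low-rank adapter path
    return [[b + scale * d for b, d in zip(rb, rd)]
            for rb, rd in zip(base, delta)]
```

Only `A` and `B` are trained, so the number of updated parameters scales with the rank `r` rather than with the full weight dimensions.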
4. Evaluation
- Ran all models on SteuerEx, scoring each statement with the partial‑credit metric.
- Compared against several baselines: a generic instruction‑tuned 28 B model, a 70 B general‑purpose LLM, and a smaller 7 B domain‑specific model.
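Putting the two evaluation steps together, a benchmark run averages the per‑question partial‑credit scores for each model. A hedged sketch, with a hypothetical `grade_fn` standing in for the expert correctness judgment:

```python
def evaluate_model(questions, grade_fn):
    """Run one model over a SteuerEx-style benchmark.

    questions: list of (statements, weights) pairs, where weights are
    the expert-assigned credits per statement.
    grade_fn: maps a statement to a correctness flag for the model
    under test (in practice, this judgment comes from expert grading).
    Returns the average partial-credit score over all questions.
    """
    scores = []
    for statements, weights in questions:
        total = sum(weights)
        earned = sum(w for s, w in zip(statements, weights) if grade_fn(s))
        scores.append(earned / total if total else 0.0)
    return sum(scores) / len(scores)
```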
Results & Findings
| Model | Avg. Partial‑Credit Score (0‑1) | Relative Gain vs. Generic 28 B |
|---|---|---|
| Generic 28 B (instruction‑tuned) | 0.48 | – |
| 70 B General LLM | 0.51 | +6 % |
| SteuerLLM (28 B) | 0.66 | +38 % |
| Small domain‑specific (7 B) | 0.58 | +21 % |
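The "Relative Gain" column follows directly from the partial‑credit averages, as the arithmetic below confirms (e.g. (0.66 − 0.48) / 0.48 = 37.5 %, reported as +38 %):

```python
def relative_gain_pct(score: float, baseline: float) -> int:
    """Relative gain over the generic 28 B baseline, in whole percent."""
    return round(100 * (score - baseline) / baseline)
```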
- SteuerLLM beats the larger 70 B model despite having fewer parameters, confirming that domain‑specific data matters more than raw scale for legal reasoning.
- The model shows marked improvements in statutory citation accuracy (↑ 45 % correct citations) and numerical precision (error rate ↓ 30 %).
- Human evaluators noted that SteuerLLM’s explanations follow the structured argumentation style required in tax‑law reasoning, something generic models often miss.
Practical Implications
- Legal tech startups can embed SteuerLLM (or a similar domain‑adapted model) into tax‑advisory chatbots, reducing the need for costly human review of routine queries.
- Enterprise tax departments may automate the first draft of tax filings, statutory citations, or internal compliance memos, freeing accountants to focus on high‑value analysis.
- The retrieval‑augmented synthetic data pipeline offers a reproducible recipe for other regulated domains (e.g., GDPR, financial reporting) where annotated data are scarce.
- Because the model is released under an open license, developers can fine‑tune it further for company‑specific statutes, regional variations, or integration with existing document‑management systems.
Limitations & Future Work
- Synthetic bias: Although the generation pipeline enforces legal formality, it may still propagate subtle biases from the base LLM, leading to occasional mis‑interpretations of ambiguous statutes.
- Scope: SteuerEx covers university‑level exams; real‑world tax consulting often involves more complex, multi‑jurisdictional scenarios that were not tested.
- Explainability: The model can produce plausible citations, but it does not expose a transparent reasoning trace that auditors could verify.
- Future directions suggested by the authors include: expanding the benchmark to cover corporate‑level tax filings, integrating external legal databases for real‑time retrieval, and exploring chain‑of‑thought prompting to improve interpretability.
Authors
- Sebastian Wind
- Jeta Sopa
- Laurin Schmid
- Quirin Jackl
- Sebastian Kiefer
- Fei Wu
- Martin Mayr
- Harald Köstler
- Gerhard Wellein
- Andreas Maier
- Soroosh Tayebi Arasteh
Paper Information
- arXiv ID: 2602.11081v1
- Categories: cs.CL, cs.AI, cs.LG
- Published: February 11, 2026