[Paper] LLMs in Interpreting Legal Documents
Source: arXiv - 2512.09830v1
Overview
Simone Corbo’s recent chapter investigates how Large Language Models (LLMs) can be harnessed to interpret legal documents—statutes, contracts, and case law. By mapping out concrete use‑cases and benchmarking performance, the work shows both the promise and the pitfalls of plugging generative AI into the legal workflow.
Key Contributions
- Use‑case taxonomy for LLM‑driven legal tasks (e.g., statutory interpretation, contract summarisation, negotiation assistance, and legal information retrieval).
- Risk analysis covering algorithmic monoculture, hallucinations, and regulatory compliance (EU AI Act, U.S. AI initiatives, emerging Chinese guidelines).
- Two novel benchmarks tailored to the legal domain, measuring factual accuracy, interpretability, and compliance of LLM outputs.
- Guidelines for responsible deployment, linking technical safeguards to specific legal requirements across jurisdictions.
Methodology
Corbo adopts a mixed‑methods approach, organised into five steps that developers can readily follow:
- Task definition – Real‑world legal activities were broken down into discrete NLP sub‑tasks (e.g., clause extraction, legal reasoning, summarisation).
- Model selection – Off‑the‑shelf LLMs (GPT‑4, Claude, LLaMA‑2) were fine‑tuned on publicly available legal corpora and evaluated in both zero‑shot and few‑shot settings (a prompt‑assembly sketch appears after this list).
- Benchmark construction – Two datasets were curated:
  - Statute‑QA: 1,200 multiple‑choice questions derived from EU and U.S. statutes.
  - Contract‑Interpret: 500 contract excerpts with expert‑annotated interpretations.
- Evaluation metrics – Accuracy, factual consistency (measured as a hallucination rate), and a compliance score (how often the model’s answer aligns with regulatory constraints); a minimal scoring sketch also follows the list.
- Risk assessment – Simulated deployment scenarios to surface failure modes such as “algorithmic monoculture” (over‑reliance on a single model) and privacy leakage.
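The paper does not publish its prompt templates, so the following is only a minimal sketch of how a zero‑/few‑shot prompt for a Statute‑QA‑style item might be assembled; the record fields and the example content are assumptions, not material from the chapter.

```python
# Sketch of few-shot prompt assembly for a Statute-QA-style task.
# The field names ("question", "options", "answer") and the example
# item are assumptions; the paper does not publish its prompt format.

FEW_SHOT_EXAMPLES = [
    {
        "question": "Under the GDPR, within how many hours must a personal "
                    "data breach be reported to the supervisory authority?",
        "options": {"A": "24", "B": "48", "C": "72", "D": "96"},
        "answer": "C",
    },
]

def format_item(item: dict, include_answer: bool) -> str:
    """Render one multiple-choice item as a prompt segment."""
    lines = [f"Question: {item['question']}"]
    lines += [f"{key}. {text}" for key, text in item["options"].items()]
    answer = item["answer"] if include_answer else ""
    lines.append(f"Answer: {answer}".rstrip())
    return "\n".join(lines)

def build_prompt(target: dict, shots: list[dict]) -> str:
    """Prepend k worked examples (few-shot); pass shots=[] for zero-shot."""
    parts = ["You are answering multiple-choice questions about statutes."]
    parts += [format_item(s, include_answer=True) for s in shots]
    parts.append(format_item(target, include_answer=False))
    return "\n\n".join(parts)
```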
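At the metric level, all three headline scores reduce to proportions over benchmark items. The sketch below shows one plausible scoring harness; the `EvalRecord` fields, in particular the `citations_valid` and `compliant` flags, are assumptions standing in for whatever citation verification and compliance checks the benchmarks actually apply.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class EvalRecord:
    """One model response paired with its gold label (fields are assumed)."""
    predicted: str          # model's answer
    gold: str               # expert-annotated answer
    citations_valid: bool   # did every cited statute/case actually exist?
    compliant: bool         # did the answer stay within regulatory scope?

def score(records: Sequence[EvalRecord]) -> dict[str, float]:
    """Compute the three headline metrics as simple proportions."""
    n = len(records)
    return {
        "accuracy": sum(r.predicted == r.gold for r in records) / n,
        "hallucination_rate": sum(not r.citations_valid for r in records) / n,
        "compliance_score": sum(r.compliant for r in records) / n,
    }

records = [EvalRecord("C", "C", True, True), EvalRecord("B", "C", False, True)]
print(score(records))
# {'accuracy': 0.5, 'hallucination_rate': 0.5, 'compliance_score': 1.0}
```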
Results & Findings
| Task | Best Model (Fine‑tuned) | Accuracy | Hallucination Rate | Compliance Score |
|---|---|---|---|---|
| Statute‑QA | GPT‑4‑FT | 78% | 4% | 92% |
| Contract‑Interpret | LLaMA‑2‑FT | 71% | 6% | 88% |
- Accuracy: Fine‑tuned LLMs outperform zero‑shot baselines by 12–18 percentage points.
- Hallucinations: Even top models still generate incorrect legal citations in ~5% of responses, a non‑trivial risk for downstream decisions.
- Compliance: Most outputs respect the “no‑advice‑beyond‑scope” rule, but edge cases (e.g., ambiguous statutory language) trigger compliance violations.
The benchmarks reveal that while LLMs can reliably extract and paraphrase legal text, deeper reasoning—especially where statutory interpretation hinges on nuanced policy intent—still lags behind human experts.
Practical Implications
- Legal tech vendors can integrate fine‑tuned LLMs for first‑draft contract review, cutting manual review time by up to 30% (according to internal pilot studies).
- In‑house counsel may use LLM‑powered Q&A assistants to quickly surface relevant statutes, but must implement a “human‑in‑the‑loop” verification step to catch hallucinations (a minimal gating sketch appears after this list).
- Compliance teams gain a concrete compliance‑score metric to monitor AI outputs against EU AI Act requirements, facilitating audit trails.
- Open‑source communities gain a clear benchmark against which to evaluate new legal‑specific LLMs, accelerating innovation beyond the dominant commercial models.
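As a rough illustration of the human‑in‑the‑loop step, the sketch below auto‑releases an answer only when every citation verifies against a trusted index and escalates everything else to a reviewer queue; `DraftAnswer`, `verify_citations`, and the trusted‑index set are hypothetical names, not part of the paper.

```python
from dataclasses import dataclass

@dataclass
class DraftAnswer:
    """An LLM-drafted answer plus the authorities it cites (hypothetical)."""
    text: str
    citations: list[str]

def verify_citations(citations: list[str], known: set[str]) -> bool:
    """Placeholder check: every cited authority must appear in a trusted index."""
    return all(c in known for c in citations)

def gate(draft: DraftAnswer, known_authorities: set[str]) -> str:
    """Auto-release only answers whose citations all verify; else escalate."""
    if verify_citations(draft.citations, known_authorities):
        return "release"
    return "route_to_human_review"
```

The design choice here is deliberately conservative: an unverifiable citation is treated as a hallucination until a human says otherwise, which is what the ~5% hallucination rate in the benchmarks suggests is prudent.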
Limitations & Future Work
- Dataset scope: Benchmarks focus on EU and U.S. law; non‑common‑law jurisdictions (e.g., China, civil‑law systems) remain under‑represented.
- Interpretability: The study does not yet provide fine‑grained explanations for why a model chose a particular legal interpretation.
- Regulatory dynamics: Rapidly evolving AI regulations mean compliance scores may need continual recalibration.
- Suggested future directions include expanding multilingual legal corpora, integrating retrieval‑augmented generation (RAG) to reduce hallucinations, and developing model‑agnostic audit tools for real‑time compliance monitoring; a minimal RAG outline appears below.
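The retrieval‑augmented direction can be outlined as: retrieve the governing provisions first, then constrain the model to answer only from them. In this minimal sketch, `retrieve` and `generate` are hypothetical stand‑ins for a real vector‑search index and LLM client, not APIs from the paper.

```python
def answer_with_rag(question: str, retrieve, generate, k: int = 5) -> str:
    """Retrieval-augmented answering: ground the model in retrieved statutes.

    `retrieve(question, k)` returns the top-k statute excerpts and
    `generate(prompt)` calls an LLM; both are hypothetical callables.
    """
    passages = retrieve(question, k)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer using ONLY the numbered excerpts below and cite them by "
        "number; reply 'insufficient authority' if they do not answer the "
        f"question.\n\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)
```

Grounding every citation in a retrieved excerpt makes it verifiable after the fact, which targets exactly the citation‑hallucination failure mode the benchmarks surface.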
Authors
- Simone Corbo
Paper Information
- arXiv ID: 2512.09830v1
- Categories: cs.CL, cs.AI
- Published: December 10, 2025