[Paper] LLMs in Interpreting Legal Documents
Source: arXiv - 2512.09830v1
Overview
Simone Corbo’s recent chapter investigates how Large Language Models (LLMs) can be harnessed to interpret legal documents—statutes, contracts, and case law. By mapping out concrete use‑cases and benchmarking performance, the work shows both the promise and the pitfalls of plugging generative AI into the legal workflow.
Key Contributions
- Use‑case taxonomy for LLM‑driven legal tasks (e.g., statutory interpretation, contract summarisation, negotiation assistance, and legal information retrieval).
- Risk analysis covering algorithmic monoculture, hallucinations, and regulatory compliance (EU AI Act, U.S. AI initiatives, emerging Chinese guidelines).
- Two novel benchmarks tailored to the legal domain, measuring factual accuracy, interpretability, and compliance of LLM outputs.
- Guidelines for responsible deployment, linking technical safeguards to specific legal requirements across jurisdictions.
Methodology
Corbo adopts a mixed‑methods approach, organised into five steps that developers can readily follow:
- Task definition – Real‑world legal activities were broken down into discrete NLP sub‑tasks (e.g., clause extraction, legal reasoning, summarisation).
- Model selection – Off‑the‑shelf LLMs (GPT‑4, Claude, LLaMA‑2) were fine‑tuned on publicly available legal corpora and evaluated in both zero‑shot and few‑shot settings (a prompt‑assembly sketch appears after this list).
- Benchmark construction – Two datasets were curated:
  - Statute‑QA: 1,200 multiple‑choice questions derived from EU and U.S. statutes.
  - Contract‑Interpret: 500 contract excerpts with expert‑annotated interpretations.
- Evaluation metrics – Accuracy, factual consistency (measured as a hallucination rate), and a compliance score (how often the model’s answer aligns with regulatory constraints); a minimal scoring sketch also follows the list.
- Risk assessment – Simulated deployment scenarios to surface failure modes such as “algorithmic monoculture” (over‑reliance on a single model) and privacy leakage.
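The paper does not publish its prompt templates, so the following is only a minimal sketch of how a zero‑/few‑shot prompt for a Statute‑QA‑style item might be assembled; the record fields and the example content are assumptions, not material from the chapter.

```python
# Sketch of few-shot prompt assembly for a Statute-QA-style task.
# The field names ("question", "options", "answer") and the example
# item are assumptions; the paper does not publish its prompt format.

FEW_SHOT_EXAMPLES = [
    {
        "question": "Under the GDPR, within how many hours must a personal "
                    "data breach be reported to the supervisory authority?",
        "options": {"A": "24", "B": "48", "C": "72", "D": "96"},
        "answer": "C",
    },
]

def format_item(item: dict, include_answer: bool) -> str:
    """Render one multiple-choice item as a prompt segment."""
    lines = [f"Question: {item['question']}"]
    lines += [f"{key}. {text}" for key, text in item["options"].items()]
    answer = item["answer"] if include_answer else ""
    lines.append(f"Answer: {answer}".rstrip())
    return "\n".join(lines)

def build_prompt(target: dict, shots: list[dict]) -> str:
    """Prepend k worked examples (few-shot); pass shots=[] for zero-shot."""
    parts = ["You are answering multiple-choice questions about statutes."]
    parts += [format_item(s, include_answer=True) for s in shots]
    parts.append(format_item(target, include_answer=False))
    return "\n\n".join(parts)
```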
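At the metric level, all three headline scores reduce to proportions over benchmark items. The sketch below shows one plausible scoring harness; the `EvalRecord` fields, in particular the `citations_valid` and `compliant` flags, are assumptions standing in for whatever citation verification and compliance checks the benchmarks actually apply.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class EvalRecord:
    """One model response paired with its gold label (fields are assumed)."""
    predicted: str          # model's answer
    gold: str               # expert-annotated answer
    citations_valid: bool   # did every cited statute/case actually exist?
    compliant: bool         # did the answer stay within regulatory scope?

def score(records: Sequence[EvalRecord]) -> dict[str, float]:
    """Compute the three headline metrics as simple proportions."""
    n = len(records)
    return {
        "accuracy": sum(r.predicted == r.gold for r in records) / n,
        "hallucination_rate": sum(not r.citations_valid for r in records) / n,
        "compliance_score": sum(r.compliant for r in records) / n,
    }

records = [EvalRecord("C", "C", True, True), EvalRecord("B", "C", False, True)]
print(score(records))
# {'accuracy': 0.5, 'hallucination_rate': 0.5, 'compliance_score': 1.0}
```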
Results & Findings
| Task | Best Model (Fine‑tuned) | Accuracy | Hallucination Rate | Compliance Score |
|---|---|---|---|---|
| Statute‑QA | GPT‑4‑FT | 78% | 4% | 92% |
| Contract‑Interpret | LLaMA‑2‑FT | 71% | 6% | 88% |
- Accuracy: Fine‑tuned LLMs outperform zero‑shot baselines by 12–18 percentage points.
- Hallucinations: Even top models still generate incorrect legal citations in ~5% of responses, a non‑trivial risk for downstream decisions.
- Compliance: Most outputs respect the “no‑advice‑beyond‑scope” rule, but edge cases (e.g., ambiguous statutory language) trigger compliance violations.
The benchmarks reveal that while LLMs can reliably extract and paraphrase legal text, deeper reasoning—especially where statutory interpretation hinges on nuanced policy intent—still lags behind human experts.
Practical Implications
- Legal tech vendors can integrate fine‑tuned LLMs for first‑draft contract review, cutting manual review time by up to 30% (according to internal pilot studies).
- In‑house counsel may use LLM‑powered Q&A assistants to quickly surface relevant statutes, but must implement a “human‑in‑the‑loop” verification step to catch hallucinations (a minimal gating sketch appears after this list).
- Compliance teams gain a concrete compliance‑score metric to monitor AI outputs against EU AI Act requirements, facilitating audit trails.
- Open‑source communities gain a clear benchmark against which to evaluate new legal‑specific LLMs, accelerating innovation beyond the dominant commercial models.
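As a rough illustration of the human‑in‑the‑loop step, the sketch below auto‑releases an answer only when every citation verifies against a trusted index and escalates everything else to a reviewer queue; `DraftAnswer`, `verify_citations`, and the trusted‑index set are hypothetical names, not part of the paper.

```python
from dataclasses import dataclass

@dataclass
class DraftAnswer:
    """An LLM-drafted answer plus the authorities it cites (hypothetical)."""
    text: str
    citations: list[str]

def verify_citations(citations: list[str], known: set[str]) -> bool:
    """Placeholder check: every cited authority must appear in a trusted index."""
    return all(c in known for c in citations)

def gate(draft: DraftAnswer, known_authorities: set[str]) -> str:
    """Auto-release only answers whose citations all verify; else escalate."""
    if verify_citations(draft.citations, known_authorities):
        return "release"
    return "route_to_human_review"
```

The design choice here is deliberately conservative: an unverifiable citation is treated as a hallucination until a human says otherwise, which is what the ~5% hallucination rate in the benchmarks suggests is prudent.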
Limitations & Future Work
- Dataset scope: Benchmarks focus on EU and U.S. law; non‑common‑law jurisdictions (e.g., China, civil‑law systems) remain under‑represented.
- Interpretability: The study does not yet provide fine‑grained explanations for why a model chose a particular legal interpretation.
- Regulatory dynamics: Rapidly evolving AI regulations mean compliance scores may need continual recalibration.
- Suggested future directions include expanding multilingual legal corpora, integrating retrieval‑augmented generation (RAG) to reduce hallucinations, and developing model‑agnostic audit tools for real‑time compliance monitoring; a minimal RAG outline appears below.
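The retrieval‑augmented direction can be outlined as: retrieve the governing provisions first, then constrain the model to answer only from them. In this minimal sketch, `retrieve` and `generate` are hypothetical stand‑ins for a real vector‑search index and LLM client, not APIs from the paper.

```python
def answer_with_rag(question: str, retrieve, generate, k: int = 5) -> str:
    """Retrieval-augmented answering: ground the model in retrieved statutes.

    `retrieve(question, k)` returns the top-k statute excerpts and
    `generate(prompt)` calls an LLM; both are hypothetical callables.
    """
    passages = retrieve(question, k)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer using ONLY the numbered excerpts below and cite them by "
        "number; reply 'insufficient authority' if they do not answer the "
        f"question.\n\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)
```

Grounding every citation in a retrieved excerpt makes it verifiable after the fact, which targets exactly the citation‑hallucination failure mode the benchmarks surface.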
Authors
- Simone Corbo
Paper Information
- arXiv ID: 2512.09830v1
- Categories: cs.CL, cs.AI
- Published: December 10, 2025