[Paper] Legal RAG Bench: an end-to-end benchmark for legal RAG

Published: March 2, 2026
4 min read
Source: arXiv (2603.01710v1)

Overview

The paper presents Legal RAG Bench, a new end‑to‑end benchmark designed to evaluate Retrieval‑Augmented Generation (RAG) systems that operate on legal texts. By pairing a curated set of 4,876 passages from the Victorian Criminal Charge Book with 100 expert‑level questions, the authors provide a realistic testbed for measuring how well retrieval and reasoning components work together in a legal context.

Key Contributions

  • Comprehensive benchmark: 4,876 annotated legal passages + 100 hand‑crafted, complex criminal‑law questions with reference long‑form answers and supporting citations.
  • Full‑factorial evaluation framework: isolates the impact of the retrieval model vs. the generative LLM, enabling “apples‑to‑apples” comparisons.
  • Hierarchical error decomposition: breaks down failures into retrieval errors, reasoning errors, and hallucinations, revealing the true source of mistakes.
  • Empirical study: evaluates three state‑of‑the‑art embedding retrievers (Kanon 2, Gemini Embedding 001, OpenAI Text Embedding 3 Large) and two frontier LLMs (Gemini 3.1 Pro, GPT‑5.2).
  • Open‑source release: code, data, and evaluation scripts are publicly available for reproducibility and community extensions.

Methodology

  1. Dataset construction – Passages were extracted from the Victorian Criminal Charge Book, a publicly available legal reference, and 100 multi‑step legal questions were written that require synthesis of several passages and procedural knowledge.
  2. Retrieval component – Each passage is indexed with dense embeddings from the three selected models. At query time, the top‑k passages (k = 5, 10, 20) are retrieved.
  3. Generation component – Retrieved passages are fed to the LLM (either Gemini 3.1 Pro or GPT‑5.2) using a standard RAG prompt that asks the model to produce a long‑form answer and cite the supporting passages.
  4. Full factorial design – Every retriever is paired with every generator, yielding six system configurations. This isolates the contribution of each component.
  5. Hierarchical error analysis – Errors are classified as:
    • Retrieval failure (relevant passage missing)
    • Reasoning failure (relevant passage present but answer incorrect)
    • Hallucination (answer contains unsupported claims)
    Human annotators score correctness (0‑100) and groundedness (extent of citation coverage).
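The full‑factorial design in steps 2–4 can be sketched as a simple evaluation loop. The retriever/model identifiers and the `evaluate` stub below are illustrative placeholders, not the authors' released code; the point is that pairing every retriever with every generator (and each value of k) lets score differences be attributed to a single component:

```python
from itertools import product

# Hypothetical stand-ins for the three retrievers and two LLMs in the paper.
RETRIEVERS = ["kanon-2", "gemini-embedding-001", "text-embedding-3-large"]
GENERATORS = ["gemini-3.1-pro", "gpt-5.2"]

def evaluate(retriever: str, generator: str, questions, k: int = 10):
    """Placeholder: retrieve top-k passages per question, prompt the LLM
    for a cited long-form answer, and score correctness/groundedness
    against the reference answers."""
    return {"retriever": retriever, "generator": generator, "k": k}

def full_factorial(questions, ks=(5, 10, 20)):
    # Every retriever x every generator = 3 x 2 = 6 system configurations,
    # each run at each retrieval depth k.
    return [
        evaluate(r, g, questions, k)
        for r, g in product(RETRIEVERS, GENERATORS)
        for k in ks
    ]

runs = full_factorial(questions=[])
assert len(runs) == 3 * 2 * 3  # 6 configurations x 3 values of k
```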

Results & Findings

| Retriever | LLM | Correctness ↑ | Groundedness ↑ | Retrieval Accuracy ↑ |
| --- | --- | --- | --- | --- |
| Kanon 2 | Gemini 3.1 Pro | +17.5 pts vs. baseline | +4.5 pts | +34 pts |
| Gemini Embedding 001 | GPT‑5.2 | +9.2 pts | +2.1 pts | +18 pts |
| OpenAI Text Embedding 3 Large | Gemini 3.1 Pro | +6.4 pts | +1.8 pts | +12 pts |
  • Retrieval dominates performance: improvements in embedding quality translate to the biggest jumps in both correctness and groundedness.
  • LLM impact is modest: swapping Gemini 3.1 Pro for GPT‑5.2 changes correctness by < 3 points, suggesting that once the right passages are retrieved, current LLMs perform similarly.
  • Hallucinations often stem from missing evidence: many “fabricated” statements disappear when the correct passage is retrieved, confirming that retrieval sets the performance ceiling.

Practical Implications

  • Prioritize high‑quality retrieval: For legal‑tech products (e.g., contract analysis, case‑law assistants), investing in domain‑specific embedding models or fine‑tuning retrievers yields a larger return than chasing ever‑larger LLMs.
  • Benchmark‑driven development: Legal RAG Bench offers a ready‑made test suite that mirrors real‑world lawyer queries, enabling teams to iterate quickly and measure progress in a reproducible way.
  • Safety & compliance: By exposing the retrieval‑driven source of hallucinations, developers can implement guardrails (e.g., “cite‑first” policies) that reject answers when the supporting passage set is insufficient, reducing the risk of providing incorrect legal advice.
  • Model selection guidance: The study suggests that a strong embedder like Kanon 2 combined with a competent but not necessarily cutting‑edge LLM is sufficient for many enterprise legal use‑cases, allowing cost‑effective deployments.
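A "cite‑first" guardrail of the kind mentioned above can be sketched in a few lines. The function name and the passage‑ID representation are assumptions for illustration; the idea is simply to refuse any answer whose citations do not resolve to retrieved passages:

```python
def cite_first_guardrail(answer: str, cited_ids: set, retrieved_ids: set,
                         min_citations: int = 1):
    """Reject an answer unless every cited passage was actually retrieved
    and at least `min_citations` passages support the answer."""
    unsupported = cited_ids - retrieved_ids
    if unsupported or len(cited_ids) < min_citations:
        return None  # refuse rather than risk ungrounded legal advice
    return answer

# An answer citing only retrieved passages passes through...
assert cite_first_guardrail("...", {"p1"}, {"p1", "p2"}) == "..."
# ...while one citing an unretrieved passage is rejected.
assert cite_first_guardrail("...", {"p9"}, {"p1", "p2"}) is None
```

In production this check would sit between generation and delivery, turning the paper's observation that hallucinations track retrieval failures into an actionable refusal policy.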

Limitations & Future Work

  • Jurisdiction‑specific scope: The benchmark is built around Victorian (Australia) criminal law, so findings may not directly transfer to other legal systems or civil‑law domains.
  • Question diversity: Only 100 questions were handcrafted; expanding the question set and covering more practice areas (e.g., corporate, IP) would improve generalizability.
  • Static passage collection: The benchmark does not model updates to statutes or case law, a realistic challenge for production systems.
  • Future directions proposed by the authors include: extending the benchmark to multi‑jurisdictional corpora, exploring hybrid retrieval (sparse + dense) strategies, and integrating tool‑use (e.g., calculators, citation managers) into the generation step.

Authors

  • Abdur-Rahman Butler
  • Umar Butler

Paper Information

  • arXiv ID: 2603.01710v1
  • Categories: cs.CL, cs.IR, cs.LG
  • Published: March 2, 2026
