[Paper] Large-Language Memorization During the Classification of United States Supreme Court Cases

Published: December 15, 2025 at 01:47 PM EST
3 min read

Source: arXiv - 2512.13654v1

Overview

This paper investigates how large language models (LLMs) memorize and retrieve information when classifying United States Supreme Court (SCOTUS) decisions, a notoriously tough NLP benchmark because of long sentences, dense legal jargon, and irregular document structure. By comparing modern prompt‑based LLMs with classic BERT‑style classifiers, the authors show that memory‑augmented prompting can edge out traditional fine‑tuning by roughly two points of absolute accuracy, even on a 279‑class taxonomy.

Key Contributions

  • Domain‑focused memorization study – First systematic analysis of LLM memory behavior on a large, legally‑rich corpus (SCOTUS opinions).
  • Two‑tier classification benchmark – Experiments on both a coarse‑grained 15‑topic task and a fine‑grained 279‑topic task, providing a rare multi‑scale evaluation.
  • Prompt‑based vs. fine‑tuned baselines – Demonstrates that parameter‑efficient fine‑tuning (PEFT) and retrieval‑augmented prompting (e.g., DeepSeek) outperform prior BERT‑based pipelines by ~2 % absolute accuracy.
  • Empirical recipe for “memory‑rich” prompting – Offers concrete prompt templates, retrieval‑engine settings, and PEFT hyper‑parameters that can be reused for other long‑document classification problems.
  • Error‑analysis framework – Breaks down hallucination vs. genuine memorization errors, linking them to specific legal constructs (e.g., citations, procedural history).

Methodology

  1. Dataset preparation – Collected the full text of SCOTUS opinions (≈ 30 k cases) and annotated them with two label schemes: a 15‑topic taxonomy (e.g., First Amendment, Due Process) and a detailed 279‑topic taxonomy derived from the CourtListener “jurisdiction‑topic” tags.
  2. Model families
    • Baseline BERT‑style: RoBERTa‑large, fully fine‑tuned with a classification head.
    • PEFT: LoRA/Adapter‑style fine‑tuning applied to LLaMA‑2‑13B and Mistral‑7B, keeping most base weights frozen (a minimal LoRA configuration sketch follows this list).
    • Prompt‑based with memory: Used DeepSeek‑Chat (30B) and GPT‑4‑Turbo with retrieval‑augmented prompting. The retrieval component indexes the entire SCOTUS corpus with BM25 + dense embeddings; the top‑k snippets are injected into the prompt.
  3. Prompt design – Structured prompts that explicitly ask the model to “classify the following opinion into one of the listed topics” and include a short “memory dump” of the most relevant prior cases (a retrieve‑then‑prompt sketch follows this list).
  4. Evaluation – Standard accuracy and macro‑F1 on held‑out test splits, plus a qualitative “hallucination audit” in which outputs are compared against the retrieved snippets to determine whether the model is copying or fabricating information (a short metrics sketch also follows).
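
The PEFT setup in step 2 can be reproduced in outline with the Hugging Face peft library. The snippet below is a minimal sketch, assuming transformers + peft; the base checkpoints, adapter rank, and other hyper‑parameters are illustrative placeholders rather than the paper's reported settings.

```python
# Minimal sketch of LoRA-style PEFT on a mostly frozen decoder LLM for
# sequence classification. Library choice (transformers + peft) and the
# hyperparameter values are illustrative assumptions, not the paper's settings.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

BASE_MODEL = "meta-llama/Llama-2-13b-hf"  # or "mistralai/Mistral-7B-v0.1"
NUM_LABELS = 279                          # fine-grained SCOTUS topic taxonomy

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    BASE_MODEL, num_labels=NUM_LABELS
)

# Low-rank adapters on the attention projections; all other weights stay frozen.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,                                  # adapter rank (assumed)
    lora_alpha=32,                         # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # typically well under 1% of the base weights
```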
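
Steps 2 (retrieval) and 3 (prompt design) combine into a retrieve‑then‑prompt routine. The sketch below assumes rank_bm25 and sentence-transformers for the hybrid BM25 + dense index, and the corpus, topic list, score weighting, and prompt wording are all placeholders; the paper's actual retrieval‑engine settings and templates may differ.

```python
# Sketch of the retrieve-then-prompt step: score prior opinions with BM25 plus
# dense embeddings, then inject the top-k snippets into a classification prompt.
# Library choices and the prompt wording are assumptions, not the paper's exact setup.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = ["...full text of prior SCOTUS opinions..."]  # placeholder documents
topics = ["First Amendment", "Due Process", "..."]     # placeholder 15- or 279-label set

bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
encoder = SentenceTransformer("all-MiniLM-L6-v2")      # assumed dense encoder
corpus_emb = encoder.encode(corpus, convert_to_tensor=True)

def retrieve(query: str, k: int = 3):
    """Blend sparse and dense scores, return the top-k snippets."""
    sparse = bm25.get_scores(query.lower().split())
    dense = util.cos_sim(encoder.encode(query, convert_to_tensor=True),
                         corpus_emb)[0].cpu().numpy()
    blended = 0.5 * sparse / (sparse.max() + 1e-9) + 0.5 * dense
    top = blended.argsort()[::-1][:k]
    return [corpus[i] for i in top]

def build_prompt(opinion: str) -> str:
    """Prepend a 'memory dump' of retrieved cases to the classification request."""
    memory = "\n---\n".join(retrieve(opinion))
    return (
        "Relevant prior cases (memory dump):\n" + memory + "\n\n"
        "Classify the following opinion into one of the listed topics:\n"
        + ", ".join(topics) + "\n\nOpinion:\n" + opinion + "\n\nTopic:"
    )
```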
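
For step 4, accuracy and macro‑F1 can be computed with scikit‑learn, as in the brief sketch below; the label lists are placeholders, and the hallucination audit itself is a separate qualitative comparison against the retrieved snippets.

```python
# Accuracy and macro-F1 on a held-out split; labels shown are placeholders.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["First Amendment", "Due Process", "Due Process"]      # gold topics
y_pred = ["First Amendment", "Due Process", "First Amendment"]  # model predictions

print("accuracy:", accuracy_score(y_true, y_pred))
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
```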

Results & Findings

| Model | 15‑topic accuracy | 279‑topic accuracy |
| --- | --- | --- |
| RoBERTa‑large (full fine‑tune) | 78.4 % | 55.1 % |
| LoRA‑LLaMA‑2‑13B | 79.6 % | 56.3 % |
| DeepSeek‑Chat (prompt + retrieval) | 81.2 % | 58.0 % |
| GPT‑4‑Turbo (prompt + retrieval) | 80.8 % | 57.5 % |
  • Prompt‑based models consistently beat the fully fine‑tuned BERT baseline by ~2 % absolute accuracy on both tasks.
  • Retrieval‑augmented prompts reduce “hallucination” errors by ~30 %: the model more often copies exact citations from the retrieved snippets rather than fabricating them.
  • Memory‑rich prompting shines on the fine‑grained 279‑class task, where the sheer number of labels makes pure fine‑tuning prone to over‑fitting.

Practical Implications

  • Legal tech pipelines – Companies building case‑law search or automated briefing tools can adopt retrieval‑augmented prompting to improve topic tagging without massive fine‑tuning budgets.
  • Long‑document classification – The recipe works for any domain with lengthy, jargon‑heavy texts (e.g., patents, medical records), suggesting a shift from “fit‑everything‑into‑a‑transformer” to “retrieve‑then‑prompt”.
  • Cost‑effective model updates – PEFT + prompting lets teams keep a single large LLM (e.g., LLaMA‑2) and adapt it to new classification schemas by swapping prompts and retrieval indexes, avoiding costly re‑training cycles.
  • Regulatory compliance – More accurate, transparent classification reduces the risk of mis‑labeling sensitive decisions, a key concern for AI‑assisted legal analytics platforms.

Limitations & Future Work

  • Scale of retrieval – The study uses a relatively small BM25 + dense index; scaling to millions of documents may introduce latency challenges.
  • Generalization beyond SCOTUS – While the legal domain is a strong testbed, results may differ for other specialized corpora (e.g., multilingual statutes).
  • Hallucination metric – The current audit is binary (copy vs. fabricate); a finer‑grained measure of factual consistency would better capture subtle errors.
  • Future directions – The authors propose exploring hybrid adapters that learn to weight retrieved snippets, integrating chain‑of‑thought prompting for multi‑label decisions, and testing on real‑time legal‑tech deployments.

Authors

  • John E. Ortega
  • Dhruv D. Joshi
  • Matt P. Borkowski

Paper Information

  • arXiv ID: 2512.13654v1
  • Categories: cs.CL, cs.AI, cs.ET, cs.IR
  • Published: December 15, 2025