[Paper] Classifying long legal documents using short random chunks
Source: arXiv - 2512.24997v1
Overview
Legal document classification is notoriously hard: the texts are massive, domain‑specific, and often exceed the token limits of modern Transformer models. In this paper, Luis Adrián Cabrera‑Diego proposes a lightweight yet powerful pipeline that classifies long legal files by feeding a model only 48 random short chunks (≤ 128 tokens each) drawn from each document. The approach combines a DeBERTa v3 encoder with an LSTM aggregator and demonstrates strong accuracy while keeping inference costs modest enough for CPU‑only deployment.
Key Contributions
- Random‑Chunk Sampling: Introduces a simple strategy of selecting 48 random excerpts of up to 128 tokens per document, sidestepping the need for full‑document encoding.
- Hybrid Architecture: Couples a state‑of‑the‑art DeBERTa v3 encoder (for chunk‑level representations) with a lightweight LSTM that fuses the chunk embeddings into a document‑level prediction.
- Production‑Ready Pipeline: Implements the end‑to‑end workflow on Temporal, a durable execution platform, ensuring reliable, fault‑tolerant batch processing.
- Performance Benchmark: Achieves a weighted F‑score of 0.898 on a real‑world legal corpus, with a median processing time of ~5 seconds per file (≈ 498 s per 100 files) on a single CPU core.
Methodology
- Chunk Extraction – For each legal file, 48 non‑overlapping windows of up to 128 tokens are sampled uniformly at random (see the sampling sketch after this list). This keeps every input well within the 512‑token limit of DeBERTa v3 and reduces memory pressure.
- Chunk Encoding – Each chunk is passed through a pre‑trained DeBERTa v3 model (fine‑tuned on the classification task), which outputs a fixed‑size embedding (typically the CLS token).
- Sequence Aggregation – The 48 embeddings form a short sequence that is fed into a single‑layer LSTM. The LSTM captures inter‑chunk dependencies and produces a final hidden state used for classification (see the encoder‑plus‑aggregator sketch below).
- Training Regime – The system is trained end‑to‑end with cross‑entropy loss; re‑sampling the chunks with a different random seed each epoch acts as data augmentation and makes the model robust to the stochastic chunk selection (a minimal training loop is sketched below).
- Deployment via Temporal – Inference jobs are wrapped as Temporal workflows, which handle retries, scaling, and state persistence, allowing the pipeline to run on commodity CPU machines without GPU acceleration (see the workflow sketch below).
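A minimal sketch of the chunk‑sampling step. The paper only specifies 48 non‑overlapping windows of up to 128 tokens sampled uniformly at random; the procedure below (cutting the token stream into consecutive windows and keeping a random subset of them) is one straightforward way to satisfy those constraints, and the function name and signature are illustrative rather than the paper's exact code.

```python
import random
from typing import List, Optional

def sample_chunks(token_ids: List[int],
                  n_chunks: int = 48,
                  chunk_len: int = 128,
                  seed: Optional[int] = None) -> List[List[int]]:
    """Pick up to n_chunks non-overlapping windows of at most chunk_len tokens."""
    rng = random.Random(seed)
    # Cut the token stream into consecutive, non-overlapping windows;
    # sampling whole windows guarantees the selected chunks never overlap.
    windows = [token_ids[i:i + chunk_len]
               for i in range(0, len(token_ids), chunk_len)]
    if len(windows) <= n_chunks:
        return windows                              # short document: keep all
    picked = sorted(rng.sample(range(len(windows)), n_chunks))
    return [windows[i] for i in picked]
```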
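Next, a sketch of the encoder‑plus‑aggregator architecture in PyTorch. The checkpoint name, LSTM hidden size, and the choice of the CLS position as the chunk embedding are assumptions made for illustration; the paper's exact hyper‑parameters may differ.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class ChunkedDocumentClassifier(nn.Module):
    """DeBERTa v3 chunk encoder + single-layer LSTM aggregator (sketch)."""

    def __init__(self, n_classes: int,
                 encoder_name: str = "microsoft/deberta-v3-base",
                 lstm_hidden: int = 256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.lstm = nn.LSTM(self.encoder.config.hidden_size, lstm_hidden,
                            num_layers=1, batch_first=True)
        self.classifier = nn.Linear(lstm_hidden, n_classes)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor):
        # input_ids / attention_mask: (batch, n_chunks, chunk_len)
        b, c, l = input_ids.shape
        out = self.encoder(input_ids=input_ids.view(b * c, l),
                           attention_mask=attention_mask.view(b * c, l))
        cls = out.last_hidden_state[:, 0, :]      # per-chunk embedding (b*c, h)
        chunk_seq = cls.view(b, c, -1)            # (b, n_chunks, h)
        _, (h_n, _) = self.lstm(chunk_seq)        # LSTM fuses the chunk sequence
        return self.classifier(h_n[-1])           # logits: (b, n_classes)
```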
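A toy end‑to‑end training loop under the same assumptions (one document per step, no batching or learning‑rate scheduling), mainly to show where per‑epoch chunk re‑sampling fits. chunk_len is set to 127 so that a prepended [CLS] token keeps every chunk at 128 tokens; none of this is the paper's exact training code.

```python
import torch
import torch.nn.functional as F

def train(model, tokenizer, documents, labels, epochs=3, lr=2e-5, device="cpu"):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.to(device).train()
    for epoch in range(epochs):
        for i, (doc, label) in enumerate(zip(documents, labels)):
            ids = tokenizer.encode(doc, add_special_tokens=False)
            # A fresh seed per (epoch, document) re-samples the chunks each epoch.
            chunks = sample_chunks(ids, chunk_len=127, seed=epoch * 100_000 + i)
            batch = tokenizer.pad(
                {"input_ids": [[tokenizer.cls_token_id] + c for c in chunks]},
                padding="max_length", max_length=128, return_tensors="pt")
            logits = model(batch["input_ids"].unsqueeze(0).to(device),
                           batch["attention_mask"].unsqueeze(0).to(device))
            loss = F.cross_entropy(logits, torch.tensor([label], device=device))
            opt.zero_grad()
            loss.backward()
            opt.step()
```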
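Finally, a sketch of how inference could be wrapped in a Temporal workflow with the temporalio Python SDK. The paper states that the pipeline runs on Temporal but does not give workflow code; the timeout, retry policy, and the classify_file activity below are illustrative. A worker process would register ClassifyDocumentWorkflow and classify_file on a task queue via temporalio.worker.Worker, and batch jobs would then submit one workflow execution per file.

```python
from datetime import timedelta
from temporalio import activity, workflow
from temporalio.common import RetryPolicy

@activity.defn
async def classify_file(path: str) -> str:
    # Hypothetical activity: load the file, sample 48 chunks, run the
    # DeBERTa+LSTM classifier on CPU, and return the predicted label.
    ...

@workflow.defn
class ClassifyDocumentWorkflow:
    @workflow.run
    async def run(self, path: str) -> str:
        # Temporal handles retries, timeouts, and state persistence.
        return await workflow.execute_activity(
            classify_file,
            path,
            start_to_close_timeout=timedelta(minutes=5),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )
```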
Results & Findings
| Metric | Value |
|---|---|
| Weighted F‑score | 0.898 |
| Median inference time (100 files, single CPU core) | 498 s (~5 s per file) |
| Tokens processed per file | at most 48 × 128 ≈ 6 k |
- The random‑chunk approach retains most of the discriminative signal despite seeing only ~5 % of a typical 120 k‑token legal document.
- The LSTM aggregator consistently outperformed simple averaging or max‑pooling of chunk embeddings (order‑agnostic baselines like those sketched after this list), indicating that discarding chunk order loses useful context.
- CPU‑only inference proved viable for batch workloads, eliminating the need for costly GPU infrastructure in many legal tech settings.
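For reference, a minimal sketch of the order‑agnostic baselines mentioned above; the paper does not give its pooling implementations, so these are the standard formulations over a (batch, n_chunks, hidden) tensor of chunk embeddings.

```python
import torch

def pool_chunks(chunk_embeddings: torch.Tensor, how: str) -> torch.Tensor:
    """Order-agnostic document embedding from (batch, n_chunks, hidden)."""
    if how == "mean":                       # simple averaging over chunks
        return chunk_embeddings.mean(dim=1)
    if how == "max":                        # element-wise max over chunks
        return chunk_embeddings.max(dim=1).values
    raise ValueError(f"unknown pooling: {how}")
```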
Practical Implications
- Scalable Legal Tech Services: Companies can now offer document triage, routing, or compliance checks without provisioning expensive GPU clusters.
- Rapid Prototyping: The random‑chunk method is model‑agnostic; developers can swap DeBERTa for any other encoder (e.g., RoBERTa, LLaMA) and retain the same pipeline skeleton (see the snippet after this list).
- Cost‑Effective Cloud Deployments: Running on CPUs reduces cloud spend dramatically—especially for batch jobs that can be scheduled during off‑peak hours.
- Robust Production: Temporal’s workflow engine provides built‑in retry, timeout, and audit capabilities, making the system resilient to flaky data sources or transient hardware failures.
- Privacy‑Friendly Processing: Since only small excerpts are ever loaded into memory, the approach can be combined with on‑premise chunk extraction to minimize data exposure.
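As an illustration of that model‑agnosticism, under the assumptions of the earlier architecture sketch, swapping the backbone is just a change of checkpoint name: any Hugging Face encoder that exposes last_hidden_state can stand in. The checkpoint name and class count below are examples, not values from the paper.

```python
from transformers import AutoTokenizer

# Swap the backbone by changing the checkpoint name; the rest of the
# pipeline (sampling, LSTM aggregation, Temporal workflow) is unchanged.
encoder_name = "roberta-base"                       # instead of DeBERTa v3
tokenizer = AutoTokenizer.from_pretrained(encoder_name)
model = ChunkedDocumentClassifier(n_classes=10, encoder_name=encoder_name)
```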
Limitations & Future Work
- Sampling Bias: Random chunks may miss rare but decisive sections (e.g., specific clauses), potentially limiting performance on highly heterogeneous corpora.
- Fixed Chunk Count: The choice of 48 chunks is heuristic; adaptive strategies based on document length or confidence could yield better efficiency.
- Domain Transfer: The model is fine‑tuned on a specific legal dataset; applying it to other jurisdictions or document types may require additional labeled data.
- Explainability: Aggregating many chunk embeddings via an LSTM makes it harder to pinpoint which parts of the document drove a particular classification—future work could integrate attention‑based aggregators or post‑hoc interpretability tools.
Overall, the paper demonstrates that clever sampling combined with a modest neural architecture can bring high‑quality legal document classification within reach of everyday development teams, opening the door to more accessible AI‑powered legal workflows.
Authors
- Luis Adrián Cabrera-Diego
Paper Information
- arXiv ID: 2512.24997v1
- Categories: cs.CL, cs.AI
- Published: December 31, 2025