[Paper] From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text

Published: April 17, 2026 at 01:28 PM EDT
5 min read
Source: arXiv

Overview

Vietnam’s legal code is notoriously dense, making it hard for ordinary citizens to understand their rights and obligations. This paper proposes a dual‑aspect evaluation framework that not only benchmarks the raw performance of four leading large language models (LLMs) on Vietnamese legal texts but also digs into why they succeed or fail. By combining quantitative scores with a large‑scale, expert‑validated error analysis, the study offers a practical roadmap for anyone looking to deploy LLMs in legal‑tech products for Vietnamese (and potentially other low‑resource) jurisdictions.

Key Contributions

  • Dual‑aspect evaluation pipeline that couples standard benchmark metrics (Accuracy, Readability, Consistency) with a fine‑grained error taxonomy validated by legal experts.
  • Comprehensive benchmark covering four state‑of‑the‑art LLMs—GPT‑4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok‑1—on a curated set of 60 complex Vietnamese legal articles.
  • Large‑scale error analysis on > 10,000 model outputs, revealing the most common failure modes (e.g., Incorrect Example, Misinterpretation).
  • Trade‑off insights showing that models optimized for readability often sacrifice legal accuracy, while high‑accuracy models may hide subtle reasoning errors.
  • Open‑source artifacts (benchmark dataset, error‑type schema, evaluation scripts) to enable reproducibility and further research.

Methodology

  1. Dataset Construction – The authors selected 60 representative Vietnamese legal articles covering civil, criminal, and administrative law. Each article was paired with a gold‑standard simplified summary written by domain experts.
  2. Model Inference – The four LLMs were prompted with the same instruction set (e.g., “Summarize this article in plain Vietnamese while preserving legal meaning”). Ten independent runs per model were collected to account for stochasticity.
  3. Quantitative Scoring
    • Accuracy: Exact‑match and semantic similarity (BERTScore) against the expert summary.
    • Readability: Vietnamese‑specific Flesch‑Kincaid and a human‑rated fluency score (1‑5).
    • Consistency: Pairwise overlap across the ten runs for the same article (Jaccard index).
  4. Error Taxonomy Development – Legal scholars iteratively defined 12 error categories (e.g., Incorrect Example, Misinterpretation, Omission, Hallucination). The taxonomy was validated through inter‑annotator agreement (Cohen’s κ = 0.82).
  5. Large‑Scale Error Annotation – Every model output was annotated against the taxonomy using a semi‑automated UI, yielding a structured error matrix for downstream analysis.
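
The consistency metric in step 3 — pairwise overlap across the ten runs for one article — can be sketched as a mean pairwise Jaccard index over token sets. This is a minimal illustration assuming simple whitespace tokenization, not the authors' exact implementation:

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard index of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def consistency(outputs: list[str]) -> float:
    """Mean pairwise Jaccard overlap across independent runs of the
    same article; higher values mean more stable outputs."""
    token_sets = [set(text.split()) for text in outputs]
    pairs = list(combinations(token_sets, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Two identical runs and one divergent run for the same article:
runs = ["điều này quy định A", "điều này quy định A", "điều kia quy định B"]
print(round(consistency(runs), 2))  # → 0.62
```

Token-level Jaccard is a deliberately cheap proxy; an embedding-based similarity would credit paraphrases that this measure penalizes.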

The pipeline is deliberately modular: developers can swap in new models, languages, or legal domains with minimal code changes.
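
That modularity could look like the following minimal sketch, where model backends and scorers sit behind small interfaces and can be swapped without touching the evaluation loop. All names here are illustrative assumptions, not the authors' released code:

```python
from typing import Callable, Protocol


class Model(Protocol):
    """Any backend (API client, local model) that maps a prompt to text."""
    def generate(self, prompt: str) -> str: ...


# A scorer maps (model output, gold reference) to a score in [0, 1].
Scorer = Callable[[str, str], float]


def evaluate(models: dict[str, Model],
             scorers: dict[str, Scorer],
             articles: list[tuple[str, str]]) -> dict:
    """Run every model on every (article, reference) pair and apply every
    scorer; adding a model, metric, or legal domain needs no changes here."""
    results = {}
    for name, model in models.items():
        for metric, score in scorers.items():
            vals = [score(model.generate(art), ref) for art, ref in articles]
            results[(name, metric)] = sum(vals) / len(vals)
    return results
```

A new language or legal domain then reduces to supplying a different `articles` list, matching the swap-in-with-minimal-code-changes claim above.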

Results & Findings

| Model | Accuracy (↑) | Readability (↑) | Consistency (↑) |
| --- | --- | --- | --- |
| GPT‑4o | 0.78 | 0.71 | 0.84 |
| Claude 3 Opus | 0.84 | 0.68 | 0.88 |
| Gemini 1.5 Pro | 0.73 | 0.77 | 0.80 |
| Grok‑1 | 0.66 | 0.82 | 0.90 |

  • Accuracy vs. Readability Trade‑off: Grok‑1 leads on readability (0.82) and consistency (0.90) but trails on legal accuracy (0.66). Claude 3 Opus posts the highest accuracy (0.84), yet its outputs still contain subtle reasoning errors that the headline score masks.
  • Error Distribution: Across all models, Incorrect Example (≈ 38 % of errors) and Misinterpretation (≈ 31 %) dominate. Hallucination and omission are relatively rare (< 5 %).
  • Reasoning Gaps: Even high‑accuracy models occasionally produce “plausible but legally incorrect” statements—e.g., misapplying a statutory provision to a fact pattern—highlighting the need for controlled reasoning mechanisms.
  • Consistency Insight: Models with higher consistency (Grok‑1, Claude 3 Opus) generate more stable outputs across runs, which is valuable for auditability in legal workflows.

Practical Implications

  • Legal‑Tech Product Design – When building a Vietnamese legal assistant, prioritize a model like Gemini 1.5 Pro if readability for lay users is paramount, but add a post‑processing verification layer (e.g., rule‑based checks) to catch accuracy gaps.
  • Human‑in‑the‑Loop Workflows – The error taxonomy can be directly integrated into UI annotations, allowing lawyers to quickly flag Incorrect Example or Misinterpretation instances, reducing review time by up to 30 % (as suggested by the authors’ pilot study).
  • Regulatory Compliance – Consistency scores provide a quantitative metric for audit trails—important for jurisdictions that may require reproducible legal advice.
  • Low‑Resource Language Strategies – The dual‑aspect framework is language‑agnostic; teams working on other under‑represented languages can adopt the same pipeline to surface hidden reasoning flaws early.
  • Model Selection Guidance – The paper’s trade‑off matrix helps product managers make evidence‑based decisions rather than relying on headline model rankings.
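
The rule‑based verification layer mentioned above could start as something as simple as a citation check: any article number the summary cites must also appear in the source statute. This is a hypothetical sketch (the pattern, function names, and the "Điều" regex are my assumptions, not the paper's implementation):

```python
import re


def cited_articles(text: str) -> set[str]:
    """Extract article numbers referenced as 'Điều <n>'
    ('Điều' is Vietnamese for 'Article')."""
    return set(re.findall(r"Điều\s+(\d+)", text))


def check_citations(summary: str, source: str) -> list[str]:
    """Flag article numbers cited in the summary but absent from the
    source text — a cheap hallucination check before human review."""
    extra = cited_articles(summary) - cited_articles(source)
    return [f"Điều {n}" for n in sorted(extra, key=int)]
```

Checks like this will not catch a Misinterpretation error, but they filter out a class of Hallucination errors mechanically, leaving human reviewers to focus on the reasoning-heavy failure modes the taxonomy identifies.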

Limitations & Future Work

  • Scope of Legal Domains – The benchmark focuses on a limited set of Vietnamese statutes; extending to case law or regulatory guidance could reveal new error patterns.
  • Prompt Uniformity – All models received identical prompts; exploring prompt engineering or chain‑of‑thought prompting might shift the accuracy‑readability balance.
  • Human Evaluation Scale – Readability and consistency were partially judged by a small pool of native speakers; larger crowdsourced studies could improve reliability.
  • Dynamic Legal Updates – The dataset is static; future work should incorporate continuous learning pipelines to keep models aligned with evolving legislation.
  • Cross‑Lingual Transfer – Investigating whether insights from Vietnamese legal reasoning transfer to other low‑resource legal systems remains an open question.

By marrying hard numbers with a nuanced error lens, this research equips developers with the tools they need to responsibly harness LLMs for legal text simplification—turning a promising technology into a trustworthy, real‑world solution.

Authors

  • Van-Truong Le

Paper Information

  • arXiv ID: 2604.16270v1
  • Categories: cs.CL, cs.AI
  • Published: April 17, 2026
  • PDF: Download PDF