[Paper] From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text

Published: April 17, 2026 at 01:28 PM EDT
5 min read
Source: arXiv

Overview

Vietnam’s legal code is notoriously dense, making it hard for ordinary citizens to understand their rights and obligations. This paper proposes a dual‑aspect evaluation framework that not only benchmarks the raw performance of four leading large language models (LLMs) on Vietnamese legal texts but also digs into why they succeed or fail. By combining quantitative scores with a large‑scale, expert‑validated error analysis, the study offers a practical roadmap for anyone looking to deploy LLMs in legal‑tech products for Vietnamese (and potentially other low‑resource) jurisdictions.

Key Contributions

  • Dual‑aspect evaluation pipeline that couples standard benchmark metrics (Accuracy, Readability, Consistency) with a fine‑grained error taxonomy validated by legal experts.
  • Comprehensive benchmark covering four state‑of‑the‑art LLMs—GPT‑4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok‑1—on a curated set of 60 complex Vietnamese legal articles.
  • Large‑scale error analysis on > 10,000 model outputs, revealing the most common failure modes (e.g., Incorrect Example, Misinterpretation).
  • Trade‑off insights showing that models optimized for readability often sacrifice legal accuracy, while high‑accuracy models may hide subtle reasoning errors.
  • Open‑source artifacts (benchmark dataset, error‑type schema, evaluation scripts) to enable reproducibility and further research.

Methodology

  1. Dataset Construction – The authors selected 60 representative Vietnamese legal articles covering civil, criminal, and administrative law. Each article was paired with a gold‑standard simplified summary written by domain experts.
  2. Model Inference – The four LLMs were prompted with the same instruction set (e.g., “Summarize this article in plain Vietnamese while preserving legal meaning”). Ten independent runs per model were collected to account for stochasticity.
  3. Quantitative Scoring
    • Accuracy: Exact‑match and semantic similarity (BERTScore) against the expert summary.
    • Readability: Vietnamese‑specific Flesch‑Kincaid and a human‑rated fluency score (1‑5).
    • Consistency: Pairwise overlap across the ten runs for the same article (Jaccard index).
  4. Error Taxonomy Development – Legal scholars iteratively defined 12 error categories (e.g., Incorrect Example, Misinterpretation, Omission, Hallucination). The taxonomy was validated through inter‑annotator agreement (Cohen’s κ = 0.82).
  5. Large‑Scale Error Annotation – Every model output was annotated against the taxonomy using a semi‑automated UI, yielding a structured error matrix for downstream analysis.
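
The consistency metric in step 3 — pairwise overlap across the ten runs for one article — can be sketched as a mean pairwise Jaccard index over token sets. This is a minimal illustration assuming simple whitespace tokenization, not the authors' exact implementation:

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard index of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def consistency(outputs: list[str]) -> float:
    """Mean pairwise Jaccard overlap across independent runs of the
    same article; higher values mean more stable outputs."""
    token_sets = [set(text.split()) for text in outputs]
    pairs = list(combinations(token_sets, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Two identical runs and one divergent run for the same article:
runs = ["điều này quy định A", "điều này quy định A", "điều kia quy định B"]
print(round(consistency(runs), 2))  # → 0.62
```

Token-level Jaccard is a deliberately cheap proxy; an embedding-based similarity would credit paraphrases that this measure penalizes.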

The pipeline is deliberately modular: developers can swap in new models, languages, or legal domains with minimal code changes.
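
That modularity could look like the following minimal sketch, where model backends and scorers sit behind small interfaces and can be swapped without touching the evaluation loop. All names here are illustrative assumptions, not the authors' released code:

```python
from typing import Callable, Protocol


class Model(Protocol):
    """Any backend (API client, local model) that maps a prompt to text."""
    def generate(self, prompt: str) -> str: ...


# A scorer maps (model output, gold reference) to a score in [0, 1].
Scorer = Callable[[str, str], float]


def evaluate(models: dict[str, Model],
             scorers: dict[str, Scorer],
             articles: list[tuple[str, str]]) -> dict:
    """Run every model on every (article, reference) pair and apply every
    scorer; adding a model, metric, or legal domain needs no changes here."""
    results = {}
    for name, model in models.items():
        for metric, score in scorers.items():
            vals = [score(model.generate(art), ref) for art, ref in articles]
            results[(name, metric)] = sum(vals) / len(vals)
    return results
```

A new language or legal domain then reduces to supplying a different `articles` list, matching the swap-in-with-minimal-code-changes claim above.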

Results & Findings

| Model | Accuracy (↑) | Readability (↑) | Consistency (↑) |
| --- | --- | --- | --- |
| GPT‑4o | 0.78 | 0.71 | 0.84 |
| Claude 3 Opus | 0.84 | 0.68 | 0.88 |
| Gemini 1.5 Pro | 0.73 | 0.77 | 0.80 |
| Grok‑1 | 0.66 | 0.82 | 0.90 |

  • Accuracy vs. Readability Trade‑off: Grok‑1 leads on readability (0.82) and consistency (0.90) but trails on legal accuracy (0.66). Claude 3 Opus posts the highest accuracy (0.84), yet its outputs still contain subtle reasoning errors that the headline score masks.
  • Error Distribution: Across all models, Incorrect Example (≈ 38 % of errors) and Misinterpretation (≈ 31 %) dominate. Hallucination and omission are relatively rare (< 5 %).
  • Reasoning Gaps: Even high‑accuracy models occasionally produce “plausible but legally incorrect” statements—e.g., misapplying a statutory provision to a fact pattern—highlighting the need for controlled reasoning mechanisms.
  • Consistency Insight: Models with higher consistency (Grok‑1, Claude 3 Opus) generate more stable outputs across runs, which is valuable for auditability in legal workflows.

Practical Implications

  • Legal‑Tech Product Design – When building a Vietnamese legal assistant, prioritize a model like Gemini 1.5 Pro if readability for lay users is paramount, but add a post‑processing verification layer (e.g., rule‑based checks) to catch accuracy gaps.
  • Human‑in‑the‑Loop Workflows – The error taxonomy can be directly integrated into UI annotations, allowing lawyers to quickly flag Incorrect Example or Misinterpretation instances, reducing review time by up to 30 % (as suggested by the authors’ pilot study).
  • Regulatory Compliance – Consistency scores provide a quantitative metric for audit trails—important for jurisdictions that may require reproducible legal advice.
  • Low‑Resource Language Strategies – The dual‑aspect framework is language‑agnostic; teams working on other under‑represented languages can adopt the same pipeline to surface hidden reasoning flaws early.
  • Model Selection Guidance – The paper’s trade‑off matrix helps product managers make evidence‑based decisions rather than relying on headline model rankings.
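
The rule‑based verification layer mentioned above could start as something as simple as a citation check: any article number the summary cites must also appear in the source statute. This is a hypothetical sketch (the pattern, function names, and the "Điều" regex are my assumptions, not the paper's implementation):

```python
import re


def cited_articles(text: str) -> set[str]:
    """Extract article numbers referenced as 'Điều <n>'
    ('Điều' is Vietnamese for 'Article')."""
    return set(re.findall(r"Điều\s+(\d+)", text))


def check_citations(summary: str, source: str) -> list[str]:
    """Flag article numbers cited in the summary but absent from the
    source text — a cheap hallucination check before human review."""
    extra = cited_articles(summary) - cited_articles(source)
    return [f"Điều {n}" for n in sorted(extra, key=int)]
```

Checks like this will not catch a Misinterpretation error, but they filter out a class of Hallucination errors mechanically, leaving human reviewers to focus on the reasoning-heavy failure modes the taxonomy identifies.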

Limitations & Future Work

  • Scope of Legal Domains – The benchmark focuses on a limited set of Vietnamese statutes; extending to case law or regulatory guidance could reveal new error patterns.
  • Prompt Uniformity – All models received identical prompts; exploring prompt engineering or chain‑of‑thought prompting might shift the accuracy‑readability balance.
  • Human Evaluation Scale – Readability and consistency were partially judged by a small pool of native speakers; larger crowdsourced studies could improve reliability.
  • Dynamic Legal Updates – The dataset is static; future work should incorporate continuous learning pipelines to keep models aligned with evolving legislation.
  • Cross‑Lingual Transfer – Investigating whether insights from Vietnamese legal reasoning transfer to other low‑resource legal systems remains an open question.

By marrying hard numbers with a nuanced error lens, this research equips developers with the tools they need to responsibly harness LLMs for legal text simplification—turning a promising technology into a trustworthy, real‑world solution.

Authors

  • Van-Truong Le

Paper Information

  • arXiv ID: 2604.16270v1
  • Categories: cs.CL, cs.AI
  • Published: April 17, 2026
  • PDF: Download PDF