[Paper] AncientBench: Towards Comprehensive Evaluation on Excavated and Transmitted Chinese Corpora

Published: December 19, 2025 at 11:28 AM EST
3 min read
Source: arXiv - 2512.17756v1

Overview

The paper introduces AncientBench, a new evaluation suite designed to test how well large language models (LLMs) understand excavated and transmitted Chinese ancient texts. By covering everything from glyph shapes to contextual meaning, the benchmark fills a glaring gap in current Chinese NLP resources, which focus almost exclusively on modern language or literary classics.

Key Contributions

  • First comprehensive benchmark for excavated Chinese corpora – captures the unique challenges of ancient inscriptions, bamboo slips, and other archaeological artifacts.
  • Four‑dimensional competency framework – evaluates comprehension of glyphs, pronunciation, meaning, and context.
  • Ten diverse task types (radical identification, phonetic‑radical matching, homophone detection, cloze, translation, etc.) that together form a holistic testbed; a hypothetical item schema is sketched after this list.
  • Baseline “Ancient Model” fine‑tuned on historical data, providing a reference point for future work.
  • Extensive evaluation of state‑of‑the‑art LLMs (e.g., GPT‑4, Claude, LLaMA) against expert archaeologists, exposing both strengths and remaining gaps.
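
To make the task structure concrete, here is a minimal sketch of how a single benchmark item spanning these dimensions and task types might be represented. The schema (field names like `dimension`, `task_type`, `choices`) is an assumption for illustration, not the paper's actual data format, and the example contents are invented.

```python
# Hypothetical AncientBench-style item schema (field names are assumptions,
# not the paper's released data format).
from dataclasses import dataclass

@dataclass
class BenchItem:
    dimension: str              # one of: "glyph", "pronunciation", "meaning", "context"
    task_type: str              # e.g., "radical_identification", "cloze", "translation"
    prompt: str                 # the question shown to the model
    choices: list[str] | None   # present for multiple-choice tasks, None otherwise
    answer: str                 # gold label from the expert annotators

# Example items, one glyph task and one cloze task (contents invented):
items = [
    BenchItem("glyph", "radical_identification",
              "Which radical does the character 江 contain?",
              ["氵", "工", "木", "口"], "氵"),
    BenchItem("meaning", "cloze",
              "學而時習之，不亦＿乎？", ["說", "苦", "怒", "惑"], "說"),
]
```

A schema along these lines makes it easy to slice results by dimension, which is how the paper reports its findings.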

Methodology

  1. Corpus Construction – The authors collected a balanced mix of transmitted (canonical) and excavated (epigraphic) Chinese texts, spanning several dynasties.
  2. Task Design – Each of the four comprehension dimensions is operationalized through specific tasks:
    • Glyph: Identify radicals, strokes, or classify characters by visual components.
    • Pronunciation: Map characters to phonetic radicals or detect homophones.
    • Meaning: Cloze‑style fill‑in‑the‑blank, synonym/antonym judgment, and short translation.
    • Context: Passage‑level inference, chronology ordering, and entity linking.
  3. Human Baseline – A panel of archaeologists and sinologists annotated the test set and provided gold‑standard answers.
  4. Model Evaluation – Both the newly trained Ancient Model and several leading LLMs were prompted with the same tasks; performance metrics (accuracy, F1, BLEU for translation) were computed and compared to human scores. A minimal scoring sketch follows the list.
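
Building on the hypothetical item schema above, the scoring step could look like the sketch below. The metric families (accuracy for choice tasks, corpus BLEU for translation) are the ones the paper names; the harness itself is illustrative, not the authors' evaluation code.

```python
# Minimal scoring sketch: exact-match accuracy for multiple-choice tasks,
# corpus BLEU for translation. Illustrative only.
import sacrebleu  # pip install sacrebleu

def score(items, predictions):
    """`items` and `predictions` are parallel lists; `item.answer` is gold."""
    mc_hits, mc_total = 0, 0
    hyps, refs = [], []
    for item, pred in zip(items, predictions):
        if item.task_type == "translation":
            hyps.append(pred)
            refs.append(item.answer)
        else:
            mc_total += 1
            mc_hits += int(pred.strip() == item.answer)
    results = {}
    if mc_total:
        results["accuracy"] = mc_hits / mc_total
    if hyps:
        # corpus_bleu takes a list of hypotheses and a list of reference streams
        results["bleu"] = sacrebleu.corpus_bleu(hyps, [refs]).score
    return results
```

Scoring the expert panel's answers with the same function would make the human-model gaps in the Results section directly comparable.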

Results & Findings

  • LLMs are surprisingly capable: GPT‑4 achieved ~78 % of human accuracy on transmitted texts but dropped to ~55 % on excavated material.
  • Glyph tasks remain hardest: Even the best LLMs struggled with radical identification on damaged characters, indicating a need for visual‑symbolic reasoning.
  • Pronunciation comprehension is relatively strong: Models correctly matched phonetic radicals >80 % of the time, likely benefitting from large multilingual phonetic corpora.
  • Contextual inference lags: Passage‑level tasks showed the largest human‑model gap (~30 % absolute difference), reflecting limited exposure to fragmented historical narratives.
  • Ancient Model baseline outperformed generic LLMs on glyph and homophone tasks, confirming the value of domain‑specific fine‑tuning.

Practical Implications

  • Archaeology workflows – Automated glyph recognition and preliminary translation can accelerate cataloguing of newly unearthed inscriptions, freeing scholars to focus on higher‑level analysis.
  • Cultural heritage tech – Museums and digital archives can embed AncientBench‑validated models into interactive exhibits, offering visitors real‑time explanations of ancient scripts.
  • LLM product development – Companies building multilingual assistants can use AncientBench as a stress test for rare‑language handling, ensuring robustness beyond modern corpora.
  • Education & outreach – Language‑learning platforms could incorporate ancient Chinese modules, leveraging models that have passed the benchmark to generate authentic practice material.

Limitations & Future Work

  • Data sparsity – Excavated texts are inherently fragmentary; the benchmark still covers a limited set of scripts (e.g., oracle bone, bronze, bamboo) and may not generalize to all epigraphic forms.
  • Visual information – Current evaluations treat characters as Unicode tokens; integrating image‑based glyph features could improve performance on damaged or stylized inscriptions (a minimal rendering sketch follows this list).
  • Cross‑dialect phonology – The benchmark assumes a unified historical pronunciation, which oversimplifies regional variations that archaeologists often need to consider.
  • Scalability – Extending AncientBench to other ancient languages (e.g., Classical Japanese, Sanskrit) would test the universality of the proposed four‑dimensional framework.
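
As a starting point for the image-based glyph idea above, one could render each character to a bitmap so a vision encoder sees actual strokes rather than an opaque token. A minimal sketch, assuming PIL and a placeholder CJK font file:

```python
# Render a single character to a grayscale bitmap for a vision encoder.
# The font path is a placeholder assumption; any CJK-capable font works.
from PIL import Image, ImageDraw, ImageFont

def render_glyph(char: str, size: int = 64,
                 font_path: str = "NotoSerifCJK.ttf") -> Image.Image:
    """Render one character as a size x size grayscale image."""
    font = ImageFont.truetype(font_path, size)
    img = Image.new("L", (size, size), color=255)  # white background
    ImageDraw.Draw(img).text((0, 0), char, font=font, fill=0)
    return img
```

This only covers clean, typeset glyphs; damaged or stylized inscriptions would still require photographs or rubbings of the actual artifacts.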

AncientBench opens the door for LLMs to move from modern chatbots to genuine partners in deciphering humanity’s oldest written records. Developers and researchers alike now have a concrete yardstick to measure progress—and a clear roadmap for the next wave of historically aware AI.

Authors

  • Zhihan Zhou
  • Daqian Shi
  • Rui Song
  • Lida Shi
  • Xiaolei Diao
  • Hao Xu

Paper Information

  • arXiv ID: 2512.17756v1
  • Categories: cs.CL, cs.AI
  • Published: December 19, 2025