[Paper] SumTablets: A Transliteration Dataset of Sumerian Tablets
Source: arXiv - 2602.22200v1
Overview
The paper introduces SumTablets, the first large‑scale, openly licensed dataset that pairs Unicode‑encoded Sumerian cuneiform glyph sequences with their scholarly transliterations. By bridging the gap between the ancient script and modern text, the authors enable NLP researchers and developers to apply state‑of‑the‑art language models to a millennia‑old writing system.
Key Contributions
- Dataset Release – 91,606 Sumerian tablets (≈ 7 M glyphs) aligned with high‑quality transliterations from the Oracc project, packaged as a Hugging Face Dataset (CC BY 4.0).
- Standardized Pre‑processing Pipeline – Open‑source code that normalizes transliterations, maps each reading back to its Unicode glyph, and preserves structural cues (surfaces, line breaks, broken segments) via special tokens.
- Baseline Transliteration Models
- Weighted Sampling from a glyph’s possible readings.
- Fine‑tuned Autoregressive Transformer (GPT‑style) achieving a character‑level chrF of 97.55.
- Reproducibility Infrastructure – All data, scripts, and model checkpoints are publicly available on GitHub and Hugging Face, encouraging community extensions.
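The glyph‑transliteration pairing described above can be pictured with a minimal sketch. The field names, glyphs, and readings below are invented stand‑ins, not the dataset's actual schema; only the special tokens (<SURF>, <NL>, <BROKEN>) come from the paper.

```python
# Hypothetical SumTablets-style record: Unicode glyphs on one side,
# scholarly readings on the other, with layout marked by special tokens.
record = {
    "glyphs": "<SURF> 𒈗 𒆠 <NL> 𒀭 <BROKEN>",
    "transliteration": "<SURF> lugal ki <NL> an <BROKEN>",
}

STRUCTURAL_TOKENS = {"<SURF>", "<NL>", "<BROKEN>"}

def strip_structure(text: str) -> str:
    """Drop layout tokens, keeping only the glyphs or readings."""
    return " ".join(t for t in text.split() if t not in STRUCTURAL_TOKENS)
```

A model that does not need layout information could train on `strip_structure(record["transliteration"])`, while the full string preserves surfaces, line breaks, and damage.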
Methodology
- Data Harvesting – The authors scraped the Open‑Access Repository of Assyriological Cuneiform (Oracc), extracting both the Unicode glyph strings (the “raw” tablet) and the corresponding transliteration text.
- Normalization – Transliteration strings were cleaned (e.g., unified sign‑lists, removed editorial brackets) and tokenized so that each glyph aligns with one or more possible readings.
- Alignment & Token Insertion – Special tokens (<SURF>, <NL>, <BROKEN>) were inserted to retain tablet layout information, which is crucial for downstream models that need to respect line breaks and broken signs.
- Baseline Models
- Weighted Sampling: For each glyph, the probability distribution over its possible readings (derived from the Oracc sign‑list) is used to sample a transliteration.
- Transformer Fine‑tuning: A pretrained autoregressive language model (e.g., GPT‑2) is further trained on the paired glyph‑transliteration sequences, treating the task as a character‑level sequence‑to‑sequence problem.
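The weighted‑sampling baseline can be sketched in a few lines. The glyph‑to‑reading distributions below are invented stand‑ins for the Oracc sign‑list frequencies the paper derives:

```python
import random

# Hypothetical glyph -> reading probabilities (values invented for
# illustration; the paper derives these from the Oracc sign-list).
READING_DIST = {
    "𒀭": {"an": 0.6, "dingir": 0.4},
    "𒈗": {"lugal": 1.0},
}

def weighted_sample_transliteration(glyphs, dist, rng=random):
    """Sample one reading per glyph, independently, from its distribution."""
    readings = []
    for g in glyphs:
        options = dist.get(g)
        if options is None:
            readings.append("<UNK>")  # glyph absent from the sign list
            continue
        choices, weights = zip(*options.items())
        readings.append(rng.choices(choices, weights=weights, k=1)[0])
    return " ".join(readings)
```

Because each glyph is sampled independently of its neighbors, this baseline cannot use context to disambiguate polyvalent signs, which is exactly the weakness the fine‑tuned transformer addresses.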
Results & Findings
- The weighted‑sampling baseline yields a modest chrF (~ 71), confirming that naïve probabilistic decoding is insufficient for high‑quality transliteration.
- The fine‑tuned transformer reaches chrF = 97.55, rivaling human expert consistency on many tablets. Errors are mostly confined to rare or heavily damaged signs where the model lacks sufficient context.
- Structural tokens improve performance by ~ 1.2 chrF points, demonstrating that preserving tablet layout helps the model learn context‑dependent readings.
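For readers unfamiliar with the metric: chrF averages F‑scores over character n‑gram overlaps between hypothesis and reference. A simplified, self‑contained version (not the standard sacrebleu implementation used in evaluations) looks like:

```python
from collections import Counter

def char_ngrams(s: str, n: int) -> Counter:
    """Count the character n-grams of s."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def simple_chrf(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Average F-beta over character n-gram orders 1..max_n (simplified chrF)."""
    scores = []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if sum(h.values()) == 0 or sum(r.values()) == 0:
            continue  # string too short for this n-gram order
        overlap = sum((h & r).values())
        prec = overlap / sum(h.values())
        rec = overlap / sum(r.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return sum(scores) / len(scores) if scores else 0.0
```

An exact match scores 1.0 and disjoint strings score 0.0; a reading with a single wrong sign loses only the n‑grams touching that sign, which is why chrF is well suited to character‑level transliteration.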
Practical Implications
- Rapid Draft Transliteration – Researchers can generate a first‑pass transliteration for thousands of tablets, cutting manual effort from weeks to minutes per tablet.
- Assistive Editing Tools – Integrated into IDE‑like environments (e.g., digital epigraphy platforms), the model can suggest readings that scholars accept, modify, or reject, streamlining the verification workflow.
- Cross‑Disciplinary NLP – The dataset opens a new benchmark for low‑resource, non‑alphabetic script transliteration, encouraging the development of models that handle multimodal inputs (glyph images → Unicode → text).
- Cultural Heritage Preservation – Automated pipelines can be built to digitize and annotate newly discovered tablets, accelerating cataloguing for museums and archives.
Limitations & Future Work
- Coverage Bias – The dataset reflects only tablets that have been entered into Oracc, which skews toward well‑studied periods and regions; many fragmentary or unpublished tablets remain absent.
- Glyph Ambiguity – Some cuneiform signs have multiple legitimate readings depending on context; the weighted‑sampling baseline treats each glyph independently, leading to occasional disambiguation errors.
- Evaluation Scope – chrF measures character overlap but does not capture higher‑level linguistic correctness (e.g., syntactic or semantic plausibility). Future work could incorporate downstream tasks like automatic grammar checking or semantic parsing.
- Multimodal Extensions – Incorporating raw tablet images (pixel data) alongside Unicode glyphs could improve robustness to damaged signs and enable end‑to‑end OCR‑to‑transliteration pipelines.
Authors
- Cole Simmons
- Richard Diehl Martinez
- Dan Jurafsky
Paper Information
- arXiv ID: 2602.22200v1
- Categories: cs.CL
- Published: February 25, 2026