[Paper] SumTablets: A Transliteration Dataset of Sumerian Tablets
Source: arXiv - 2602.22200v1
Overview
The paper introduces SumTablets, the first large‑scale, openly licensed dataset that pairs Unicode‑encoded Sumerian cuneiform glyph sequences with their scholarly transliterations. By bridging the gap between the ancient script and modern text, the authors enable NLP researchers and developers to apply state‑of‑the‑art language models to a millennia‑old writing system.
Key Contributions
- Dataset Release – 91,606 Sumerian tablets (≈ 7 M glyphs) aligned with high‑quality transliterations from the Oracc project, packaged as a Hugging Face Dataset (CC BY 4.0).
- Standardized Pre‑processing Pipeline – Open‑source code that normalizes transliterations, maps each reading back to its Unicode glyph, and preserves structural cues (surfaces, line breaks, broken segments) via special tokens.
- Baseline Transliteration Models
- Weighted Sampling from a glyph’s possible readings.
- Fine‑tuned Autoregressive Transformer (GPT‑style) achieving a character‑level chrF of 97.55.
- Reproducibility Infrastructure – All data, scripts, and model checkpoints are publicly available on GitHub and Hugging Face, encouraging community extensions.
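The glyph‑transliteration pairing described above can be pictured with a minimal sketch. The field names, glyphs, and readings below are invented stand‑ins, not the dataset's actual schema; only the special tokens (<SURF>, <NL>, <BROKEN>) come from the paper.

```python
# Hypothetical SumTablets-style record: Unicode glyphs on one side,
# scholarly readings on the other, with layout marked by special tokens.
record = {
    "glyphs": "<SURF> 𒈗 𒆠 <NL> 𒀭 <BROKEN>",
    "transliteration": "<SURF> lugal ki <NL> an <BROKEN>",
}

STRUCTURAL_TOKENS = {"<SURF>", "<NL>", "<BROKEN>"}

def strip_structure(text: str) -> str:
    """Drop layout tokens, keeping only the glyphs or readings."""
    return " ".join(t for t in text.split() if t not in STRUCTURAL_TOKENS)
```

A model that does not need layout information could train on `strip_structure(record["transliteration"])`, while the full string preserves surfaces, line breaks, and damage.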
Methodology
- Data Harvesting – The authors scraped the Open‑Access Repository of Assyriological Cuneiform (Oracc), extracting both the Unicode glyph strings (the “raw” tablet) and the corresponding transliteration text.
- Normalization – Transliteration strings were cleaned (e.g., unified sign‑lists, removed editorial brackets) and tokenized so that each glyph aligns with one or more possible readings.
- Alignment & Token Insertion – Special tokens (<SURF>, <NL>, <BROKEN>) were inserted to retain tablet layout information, which is crucial for downstream models that need to respect line breaks and broken signs.
- Baseline Models
- Weighted Sampling: For each glyph, the probability distribution over its possible readings (derived from the Oracc sign‑list) is used to sample a transliteration.
- Transformer Fine‑tuning: A pretrained autoregressive language model (e.g., GPT‑2) is further trained on the paired glyph‑transliteration sequences, treating the task as a character‑level sequence‑to‑sequence problem.
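The weighted‑sampling baseline can be sketched in a few lines. The glyph‑to‑reading distributions below are invented stand‑ins for the Oracc sign‑list frequencies the paper derives:

```python
import random

# Hypothetical glyph -> reading probabilities (values invented for
# illustration; the paper derives these from the Oracc sign-list).
READING_DIST = {
    "𒀭": {"an": 0.6, "dingir": 0.4},
    "𒈗": {"lugal": 1.0},
}

def weighted_sample_transliteration(glyphs, dist, rng=random):
    """Sample one reading per glyph, independently, from its distribution."""
    readings = []
    for g in glyphs:
        options = dist.get(g)
        if options is None:
            readings.append("<UNK>")  # glyph absent from the sign list
            continue
        choices, weights = zip(*options.items())
        readings.append(rng.choices(choices, weights=weights, k=1)[0])
    return " ".join(readings)
```

Because each glyph is sampled independently of its neighbors, this baseline cannot use context to disambiguate polyvalent signs, which is exactly the weakness the fine‑tuned transformer addresses.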
Results & Findings
- The weighted‑sampling baseline yields a modest chrF (~ 71), confirming that naïve probabilistic decoding is insufficient for high‑quality transliteration.
- The fine‑tuned transformer reaches chrF = 97.55, rivaling human expert consistency on many tablets. Errors are mostly confined to rare or heavily damaged signs where the model lacks sufficient context.
- Structural tokens improve performance by ~ 1.2 chrF points, demonstrating that preserving tablet layout helps the model learn context‑dependent readings.
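For readers unfamiliar with the metric: chrF averages F‑scores over character n‑gram overlaps between hypothesis and reference. A simplified, self‑contained version (not the standard sacrebleu implementation used in evaluations) looks like:

```python
from collections import Counter

def char_ngrams(s: str, n: int) -> Counter:
    """Count the character n-grams of s."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def simple_chrf(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Average F-beta over character n-gram orders 1..max_n (simplified chrF)."""
    scores = []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if sum(h.values()) == 0 or sum(r.values()) == 0:
            continue  # string too short for this n-gram order
        overlap = sum((h & r).values())
        prec = overlap / sum(h.values())
        rec = overlap / sum(r.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return sum(scores) / len(scores) if scores else 0.0
```

An exact match scores 1.0 and disjoint strings score 0.0; a reading with a single wrong sign loses only the n‑grams touching that sign, which is why chrF is well suited to character‑level transliteration.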
Practical Implications
- Rapid Draft Transliteration – Researchers can generate a first‑pass transliteration for thousands of tablets, cutting manual effort from weeks to minutes per tablet.
- Assistive Editing Tools – Integrated into IDE‑like environments (e.g., digital epigraphy platforms), the model can suggest readings that scholars accept, modify, or reject, streamlining the verification workflow.
- Cross‑Disciplinary NLP – The dataset opens a new benchmark for low‑resource, non‑alphabetic script transliteration, encouraging the development of models that handle multimodal inputs (glyph images → Unicode → text).
- Cultural Heritage Preservation – Automated pipelines can be built to digitize and annotate newly discovered tablets, accelerating cataloguing for museums and archives.
Limitations & Future Work
- Coverage Bias – The dataset reflects only tablets that have been entered into Oracc, which skews toward well‑studied periods and regions; many fragmentary or unpublished tablets remain absent.
- Glyph Ambiguity – Some cuneiform signs have multiple legitimate readings depending on context; the weighted‑sampling baseline treats each glyph independently, leading to occasional disambiguation errors.
- Evaluation Scope – chrF measures character overlap but does not capture higher‑level linguistic correctness (e.g., syntactic or semantic plausibility). Future work could incorporate downstream tasks like automatic grammar checking or semantic parsing.
- Multimodal Extensions – Incorporating raw tablet images (pixel data) alongside Unicode glyphs could improve robustness to damaged signs and enable end‑to‑end OCR‑to‑transliteration pipelines.
Authors
- Cole Simmons
- Richard Diehl Martinez
- Dan Jurafsky
Paper Information
- arXiv ID: 2602.22200v1
- Categories: cs.CL
- Published: February 25, 2026