[Paper] GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts
Source: arXiv - 2604.12978v1
Overview
The paper introduces GlotOCR Bench, a large‑scale benchmark that tests how well modern OCR systems handle more than 100 writing systems. By rendering real multilingual text into clean and degraded images, the authors expose a stark gap: even state‑of‑the‑art vision‑language models reliably recognize only a few dozen scripts, and many fail completely on the rest.
Key Contributions
- GlotOCR Bench dataset – more than 100 Unicode scripts, each rendered in multiple Google Fonts typefaces. Text is shaped with HarfBuzz, covering both left‑to‑right and right‑to‑left scripts, and rasterized with FreeType. Every sample comes in a clean and a synthetically degraded version.
- Rigorous validation pipeline – manual checks confirm that every script is rendered correctly, which makes the benchmark results trustworthy and reproducible.
- Comprehensive evaluation – tests a wide range of open‑weight (e.g., TrOCR, Donut) and proprietary vision‑language OCR models (e.g., Google Cloud Vision, Azure OCR).
- Empirical insight – demonstrates that OCR performance correlates strongly with the amount of script‑level pre‑training data, not just visual feature learning.
- Open‑source release – both the benchmark dataset and the rendering pipeline are publicly available (GitHub + Hugging Face), enabling the community to extend or adapt the test suite.
Methodology
- Text source selection – multilingual corpora were sampled to obtain representative sentences for each Unicode script.
- Rendering pipeline – each sentence is shaped with HarfBuzz (handling complex scripts, ligatures, RTL direction) and rasterized with FreeType using a random font from Google Fonts, yielding high‑quality PNGs.
- Degradation simulation – Gaussian blur, noise, compression artifacts, and perspective distortion are applied to create “noisy” variants that mimic real‑world scans or camera captures.
- Manual sanity check – a small team inspected a stratified sample from every script to confirm correct glyph rendering and proper directionality.
- Model evaluation – OCR outputs are compared against ground‑truth Unicode strings using exact match and character‑level edit distance. Scripts are grouped by the amount of pre‑training data the models have seen (e.g., Latin vs. N’Ko).
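The two metrics named in the model evaluation step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, and it assumes that edit distance is counted over Unicode code points without any normalization, which the summary does not specify.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # substitute r with h (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by reference length (assumed normalization)."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def exact_match_rate(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Fraction of samples whose OCR output equals the ground truth exactly."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    if not refs:
        return 0.0
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)
```

Exact match is a strict metric: a single wrong character makes the whole sample count as a miss, which is why reporting it alongside a character-level distance gives a fuller picture.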
The pipeline is deliberately modular, so developers can plug in new fonts, degradation types, or OCR engines with minimal effort.
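One way to picture this modularity is a degradation stage built from interchangeable functions. The sketch below is an assumption about how such a stage could be organized, not the released pipeline: all names are hypothetical, images are plain nested lists of grayscale values, and a 3x3 box blur stands in for Gaussian blur to keep the example dependency-free.

```python
import random
from typing import Callable, List

Image = List[List[int]]  # grayscale rows, pixel values in 0..255
Degradation = Callable[[Image, random.Random], Image]


def add_gaussian_noise(img: Image, rng: random.Random, sigma: float = 12.0) -> Image:
    """Add zero-mean Gaussian noise to every pixel, clamped to 0..255."""
    return [
        [min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
        for row in img
    ]


def box_blur(img: Image, rng: random.Random) -> Image:
    """3x3 mean filter (a simple stand-in for Gaussian blur); rng is unused."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                img[yy][xx]
                for yy in range(max(0, y - 1), min(h, y + 2))
                for xx in range(max(0, x - 1), min(w, x + 2))
            ]
            row.append(round(sum(window) / len(window)))
        out.append(row)
    return out


def degrade(img: Image, steps: List[Degradation], seed: int = 0) -> Image:
    """Apply each degradation in order, using a seeded RNG for reproducibility."""
    rng = random.Random(seed)
    for step in steps:
        img = step(img, rng)
    return img
```

Under this design, adding a new degradation type means writing one more function with the same signature and appending it to `steps`; the seed keeps the degraded set reproducible across runs.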
Results & Findings
| Metric | Best open‑weight model | Best proprietary model |
|---|---|---|
| Scripts with >90% exact match | 12 | 18 |
| Scripts with >50% exact match | 28 | 33 |
| Scripts with <10% exact match | 57 | 49 |
- Coverage ceiling – Even the strongest model exceeds 90% exact match on only 18 of the 100+ scripts, and 50% exact match on only 33.
- Pre‑training matters – Scripts that appear frequently in the model’s language‑model pre‑training corpus (e.g., Latin, Cyrillic, Arabic) achieve dramatically higher scores.
- Failure modes – When faced with an unseen script, models either output garbled noise or “hallucinate” characters from a script they know (e.g., confusing Devanagari with Bengali).
- Degradation impact – Accuracy drops roughly 15–20% across the board on the degraded image set, highlighting that visual noise compounds the script‑generalization problem.
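The rows of the table above are counts of scripts whose exact-match rate crosses a threshold. A minimal sketch of that aggregation follows; the function name is hypothetical and the per-script rates are made-up toy values for illustration, not results from the paper.

```python
from typing import Dict


def coverage_buckets(exact_match_by_script: Dict[str, float]) -> Dict[str, int]:
    """Count scripts above or below the exact-match thresholds used in the table."""
    rates = exact_match_by_script.values()
    return {
        ">90%": sum(r > 0.90 for r in rates),
        ">50%": sum(r > 0.50 for r in rates),
        "<10%": sum(r < 0.10 for r in rates),
    }


# Toy input only: script codes mapped to invented exact-match rates.
toy_rates = {
    "Latn": 0.97,
    "Cyrl": 0.93,
    "Arab": 0.71,
    "Deva": 0.42,
    "Nkoo": 0.02,
    "Adlm": 0.00,
}
print(coverage_buckets(toy_rates))
```

Note that the buckets overlap: every script counted under ">90%" is also counted under ">50%", which matches the table, where the second row is always at least as large as the first.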
Practical Implications
- Product road‑maps – Companies building OCR SaaS should prioritize expanding script coverage in their pre‑training pipelines rather than relying solely on visual feature improvements.
- Internationalization – Apps targeting emerging markets (e.g., Africa, South‑East Asia) cannot assume out‑of‑the‑box OCR will work; custom data collection for low‑resource scripts is still required.
- Testing & QA – The GlotOCR Bench can be integrated into CI pipelines to catch regressions in script support when updating OCR models.
- Hybrid approaches – Combining a visual recognizer with a lightweight script‑identification module could route inputs to script‑specific fine‑tuned models, mitigating hallucination.
- Open‑source tooling – The rendering pipeline can be repurposed to generate synthetic training data for under‑represented scripts, accelerating data‑centric development.
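The script-identification idea in the hybrid approach can be prototyped with the Python standard library alone. The sketch below is a rough heuristic under stated assumptions, not a production detector: it takes the first word of each letter's Unicode character name as a proxy for its script, which works for names such as `DEVANAGARI LETTER NA` but mislabels scripts with multi-word names (for example, Old Persian). All function names are hypothetical.

```python
import unicodedata
from collections import Counter
from typing import Optional


def char_script(ch: str) -> Optional[str]:
    """Approximate a letter's script by the first word of its Unicode name."""
    if not ch.isalpha():
        return None  # skip digits, punctuation, whitespace, combining marks
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def dominant_script(text: str) -> Optional[str]:
    """Most frequent approximate script among the letters of the text."""
    counts = Counter(s for s in map(char_script, text) if s)
    return counts.most_common(1)[0][0] if counts else None


def looks_hallucinated(expected_script: str, ocr_output: str) -> bool:
    """Flag OCR output whose dominant script differs from the expected one."""
    found = dominant_script(ocr_output)
    return found is not None and found != expected_script


# Devanagari input transcribed as Bengali letters would be flagged.
print(looks_hallucinated("DEVANAGARI", "\u09a8\u09ae"))
```

A check like this could serve two purposes: routing an input to a script-specific model when the script is known in advance, or flagging outputs in the wrong script as likely hallucinations. A real system would more likely rely on the Unicode Script property than on character names.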
Limitations & Future Work
- Synthetic vs. real data – While the benchmark mimics real‑world noise, it still relies on synthetic degradations; performance on truly scanned documents may differ.
- Script granularity – Some scripts share glyphs (e.g., Latin‑derived alphabets) and are not distinguished, potentially inflating scores for closely related scripts.
- Model scope – The study focuses on vision‑language models; classic OCR pipelines (e.g., Tesseract with language packs) were not evaluated.
- Future directions – The authors suggest expanding the benchmark with handwritten samples, adding more extreme degradations, and exploring curriculum‑learning strategies that gradually introduce new scripts during pre‑training.
Authors
- Amir Hossein Kargaran
- Nafiseh Nikeghbal
- Jana Diesner
- François Yvon
- Hinrich Schütze
Paper Information
- arXiv ID: 2604.12978v1
- Categories: cs.CL, cs.CV
- Published: April 14, 2026