[Paper] GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts

Published: April 14, 2026 at 01:12 PM EDT

Source: arXiv - 2604.12978v1

Overview

The paper “GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts” introduces a new, large‑scale benchmark that tests how well modern OCR systems handle more than 100 different writing systems. By rendering real multilingual text into clean and degraded images, the authors expose a stark gap: even state‑of‑the‑art vision‑language models reliably recognize only a few dozen scripts, and many fail completely on the rest.

Key Contributions

GlotOCR Bench dataset – more than 100 Unicode scripts, each rendered in multiple fonts from Google Fonts, shaped with HarfBuzz (covering both left‑to‑right and right‑to‑left text) and rasterized with FreeType. Includes clean and synthetically degraded versions.
Rigorous validation pipeline – manual checks of samples from every script confirm correct rendering, so results on the benchmark can be trusted and reproduced.
Comprehensive evaluation – tests a wide range of open‑weight (e.g., TrOCR, Donut) and proprietary vision‑language OCR models (e.g., Google Cloud Vision, Azure OCR).
Empirical insight – demonstrates that OCR performance tracks the amount of script‑level pre‑training data rather than visual feature learning alone.
Open‑source release – both the benchmark dataset and the rendering pipeline are publicly available (GitHub + Hugging Face), enabling the community to extend or adapt the test suite.

Methodology

Text source selection – multilingual corpora were sampled to obtain representative sentences for each Unicode script.
Rendering pipeline – each sentence is shaped with HarfBuzz (handling complex scripts, ligatures, RTL direction) and rasterized with FreeType using a random font from Google Fonts, yielding high‑quality PNGs.
Degradation simulation – Gaussian blur, noise, compression artifacts, and perspective distortion are applied to create “noisy” variants that mimic real‑world scans or camera captures.
Manual sanity check – a small team inspected a stratified sample from every script to confirm correct glyph rendering and proper directionality.
Model evaluation – OCR outputs are compared against ground‑truth Unicode strings using exact match and character‑level edit distance. Scripts are grouped by the amount of pre‑training data the models have seen (e.g., Latin vs. N’Ko).
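
The two metrics named in the evaluation step can be sketched as follows. This is a minimal illustration, not the authors' scoring code: the function names are hypothetical, strings are compared code point by code point without any Unicode normalization, and dividing the edit distance by the reference length (a character error rate) is an assumption, since the summary only says "character‑level edit distance".

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def exact_match_rate(preds: Sequence[str], refs: Sequence[str]) -> float:
    """Fraction of predictions identical to their ground-truth string."""
    if not refs or len(preds) != len(refs):
        raise ValueError("preds and refs must be non-empty and the same length")
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


def char_error_rate(preds: Sequence[str], refs: Sequence[str]) -> float:
    """Total edit distance divided by total reference length (assumed normalization)."""
    if not refs or len(preds) != len(refs):
        raise ValueError("preds and refs must be non-empty and the same length")
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(p, r) for p, r in zip(preds, refs))
    return total_edits / total_chars
```

Exact match is strict: a single wrong character zeroes the sample, which is why the edit‑distance view is useful for scripts where models are close but not perfect.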

The pipeline is deliberately modular, so developers can plug in new fonts, degradation types, or OCR engines with minimal effort.
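
One way to picture this modularity is a list of interchangeable degradation steps applied in order. The sketch below is an assumption about the design, not the released pipeline: it uses only the Python standard library, represents an image as rows of grayscale values, and swaps in a simple 3x3 box blur for the Gaussian blur the paper describes. All names (`add_noise`, `box_blur`, `degrade`) are hypothetical.

```python
import random
from typing import Callable, List

Image = List[List[int]]  # grayscale rows, pixel values 0..255
Degradation = Callable[[Image, random.Random], Image]


def add_noise(sigma: float) -> Degradation:
    """Build a step that adds zero-mean Gaussian noise and clamps to 0..255."""
    def apply(img: Image, rng: random.Random) -> Image:
        return [[min(255, max(0, round(p + rng.gauss(0, sigma)))) for p in row]
                for row in img]
    return apply


def box_blur(img: Image, rng: random.Random) -> Image:
    """3x3 mean filter (a stand-in for Gaussian blur); edges use existing neighbours."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[yy][xx]
                      for yy in range(max(0, y - 1), min(h, y + 2))
                      for xx in range(max(0, x - 1), min(w, x + 2))]
            row.append(round(sum(window) / len(window)))
        out.append(row)
    return out


def degrade(img: Image, steps: List[Degradation], seed: int = 0) -> Image:
    """Apply each step in order with a seeded RNG so results are reproducible."""
    rng = random.Random(seed)
    for step in steps:
        img = step(img, rng)
    return img
```

Adding a new degradation then means writing one more function with the same signature and appending it to the list passed to `degrade`.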

Results & Findings

Metric	Best open‑weight model	Best proprietary model
Scripts with >90% exact match	12	18
Scripts with >50% exact match	28	33
Scripts with <10% exact match	57	49
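
For readers who want to reproduce this kind of table from their own runs, the rows can be derived from per‑script exact‑match rates as sketched below. The function name and the example scores are hypothetical, and the sketch assumes the rows are cumulative, so a script above 90% is also counted above 50%.

```python
from typing import Dict


def count_scripts_by_threshold(exact_match: Dict[str, float]) -> Dict[str, int]:
    """Count scripts in each bucket; rates are fractions in 0..1, keyed by script."""
    rates = list(exact_match.values())
    return {
        ">90%": sum(r > 0.90 for r in rates),
        ">50%": sum(r > 0.50 for r in rates),  # includes the >90% scripts
        "<10%": sum(r < 0.10 for r in rates),
    }


# Illustrative values only, not results from the paper.
example = {"Latn": 0.97, "Cyrl": 0.93, "Deva": 0.61, "Nkoo": 0.02}
print(count_scripts_by_threshold(example))
```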

Coverage ceiling – Even the best model in each category exceeds 50% exact match on only about 30 of the 100+ scripts (28 open‑weight, 33 proprietary), and exceeds 90% on fewer than 20.
Pre‑training matters – Scripts that appear frequently in the model’s language‑model pre‑training corpus (e.g., Latin, Cyrillic, Arabic) achieve dramatically higher scores.
Failure modes – When faced with an unseen script, models either output garbled noise or “hallucinate” characters from a script they know (e.g., confusing Devanagari with Bengali).
Degradation impact – Accuracy drops roughly 15–20% across the board on the degraded image set, highlighting that visual noise compounds the script‑generalization problem.
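
The cross‑script hallucination failure mode can be flagged automatically by comparing the dominant script of a prediction with that of its reference. The sketch below is a rough heuristic, not the paper's analysis: Python's standard library does not expose the Unicode Script property, so the first word of each character's Unicode name is used as a proxy. That proxy is an assumption and is imprecise for scripts with multi‑word names (for example, names starting with "OLD").

```python
import unicodedata
from collections import Counter
from typing import Optional


def script_label(ch: str) -> Optional[str]:
    """First word of the Unicode character name, used as a rough script proxy."""
    if not ch.isalpha():
        return None  # skip digits, punctuation, whitespace, combining marks
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        return None  # character has no name in this Unicode database


def dominant_script(text: str) -> Optional[str]:
    """Most frequent script label among the letters of `text`, or None."""
    counts = Counter(label for label in map(script_label, text) if label)
    return counts.most_common(1)[0][0] if counts else None


def is_cross_script_hallucination(prediction: str, reference: str) -> bool:
    """True when both strings contain letters but their dominant scripts differ."""
    pred, ref = dominant_script(prediction), dominant_script(reference)
    return pred is not None and ref is not None and pred != ref


# Bengali output for a Devanagari reference is flagged as a cross-script error.
print(is_cross_script_hallucination("\u0995\u0996\u0997", "\u0915\u0916\u0917"))
```

A check like this separates "wrong script entirely" from "right script, wrong characters", which the exact‑match score alone cannot distinguish.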

Practical Implications

Product roadmaps – Companies building OCR SaaS should prioritize expanding script coverage in their pre‑training pipelines rather than relying solely on visual feature improvements.
Internationalization – Apps targeting emerging markets (e.g., Africa, South‑East Asia) cannot assume out‑of‑the‑box OCR will work; custom data collection for low‑resource scripts is still required.
Testing & QA – GlotOCR Bench can be integrated into CI pipelines to catch regressions in script support when updating OCR models.
Hybrid approaches – Combining a visual recognizer with a lightweight script‑identification module could route inputs to script‑specific fine‑tuned models, mitigating hallucination.
Open‑source tooling – The rendering pipeline can be repurposed to generate synthetic training data for under‑represented scripts, accelerating data‑centric development.
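
As an illustration of the CI idea above, a regression gate could compare per‑script scores from a candidate model against a stored baseline. This is a hypothetical sketch: the function name, the tolerance value, and the example scores are assumptions, and how the scores are produced from the benchmark is left out.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, float],
                     candidate: Dict[str, float],
                     tolerance: float = 0.02) -> List[str]:
    """List scripts whose exact-match rate fell by more than `tolerance`."""
    regressions = []
    for script, old in sorted(baseline.items()):
        # A script missing from the candidate run is treated as scoring 0.
        new = candidate.get(script, 0.0)
        if old - new > tolerance:
            regressions.append(f"{script}: {old:.2f} -> {new:.2f}")
    return regressions


if __name__ == "__main__":
    # Illustrative scores only, keyed by ISO 15924 script code.
    baseline = {"Latn": 0.95, "Arab": 0.80, "Nkoo": 0.05}
    candidate = {"Latn": 0.96, "Arab": 0.70}
    failures = find_regressions(baseline, candidate)
    for line in failures:
        print("REGRESSION", line)
    # A CI job would exit non-zero here if `failures` is non-empty.
```

Checking each script separately matters because an aggregate score dominated by high‑resource scripts can hide a collapse on a low‑resource one.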

Limitations & Future Work

Synthetic vs. real data – While the benchmark mimics real‑world noise, it still relies on synthetic degradations; performance on truly scanned documents may differ.
Script granularity – Some scripts share glyphs (e.g., Latin‑derived alphabets) and are not distinguished, potentially inflating scores for closely related scripts.
Model scope – The study focuses on vision‑language models; classic OCR pipelines (e.g., Tesseract with language packs) were not evaluated.
Future directions – The authors suggest expanding the benchmark with handwritten samples, adding more extreme degradations, and exploring curriculum‑learning strategies that gradually introduce new scripts during pre‑training.

Authors

Amir Hossein Kargaran
Nafiseh Nikeghbal
Jana Diesner
François Yvon
Hinrich Schütze

Paper Information

arXiv ID: 2604.12978v1
Categories: cs.CL, cs.CV
Published: April 14, 2026