[Paper] Adapting TrOCR for Printed Tigrinya Text Recognition: Word-Aware Loss Weighting for Cross-Script Transfer Learning

Published: (April 22, 2026 at 01:43 PM EDT)
4 min read
Source: arXiv (2604.20813v1)

Overview

The paper introduces the first successful adaptation of TrOCR, a transformer‑based OCR system, to recognize printed Tigrinya text written in the Ge’ez script. By extending the model’s tokenizer and adding a novel Word‑Aware Loss Weighting technique, the authors turn a Latin‑centric OCR engine into a high‑accuracy recognizer for an African syllabic writing system—achieving sub‑0.3 % character error rate in just a few hours of training on a consumer‑grade GPU.

Key Contributions

  • First TrOCR adaptation for Ge’ez (Tigrinya) – demonstrates that a state‑of‑the‑art Latin‑focused OCR can be repurposed for a non‑Latin, syllabic script.
  • Tokenizer extension – expands the byte‑pair‑encoding (BPE) vocabulary to cover 230 unique Ge’ez characters while preserving the original model architecture.
  • Word‑Aware Loss Weighting (WALW) – a loss‑scaling scheme that penalizes errors at word boundaries, fixing systematic failures caused by Latin‑centric BPE tokenization.
  • Efficient training pipeline – full adaptation (tokenizer, WALW, fine‑tuning) completes in < 3 hours on a single 8 GB GPU.
  • Open‑source release – code, pretrained weights, synthetic dataset, and evaluation scripts are publicly available, encouraging reproducibility and further research.
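
The tokenizer-extension idea above can be sketched in a few lines. This is a toy illustration with a plain dict standing in for TrOCR's BPE vocabulary: the Ethiopic code-point range and helper names are assumptions, and the paper's 230-character set is a curated subset of the full Unicode block. With Hugging Face Transformers, the equivalent step would typically be `tokenizer.add_tokens(...)` followed by resizing the decoder's token embeddings to the new vocabulary size.

```python
# Sketch: extending a toy token->id vocabulary with Ge'ez (Ethiopic) characters.
# The paper extends TrOCR's BPE vocabulary; names here are illustrative only.

# The basic Ethiopic Unicode block spans U+1200..U+137F; the paper reports
# 230 glyphs in actual use, a subset of this range.
GEEZ_CODEPOINTS = range(0x1200, 0x1380)

def extend_vocab(vocab: dict[str, int], chars) -> dict[str, int]:
    """Append new single-character tokens, preserving all existing ids."""
    next_id = max(vocab.values()) + 1 if vocab else 0
    for ch in chars:
        if ch not in vocab:
            vocab[ch] = next_id
            next_id += 1
    return vocab

# Toy base vocabulary standing in for the original Latin-centric BPE tokens.
vocab = {"<pad>": 0, "<s>": 1, "</s>": 2, "a": 3, "b": 4}
base_size = len(vocab)
vocab = extend_vocab(vocab, (chr(cp) for cp in GEEZ_CODEPOINTS))
added = len(vocab) - base_size
```

Keeping the original ids untouched is what "preserving the original model architecture" requires in practice: existing embedding rows stay valid, and only new rows need to be initialized during fine-tuning.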

Methodology

  1. Base Model Selection – The authors start from the publicly released TrOCR model pre‑trained on large Latin‑script OCR corpora.
  2. Tokenizer Augmentation – They generate a new BPE vocabulary that includes all 230 Ge’ez glyphs, keeping the original byte‑level tokens for backward compatibility.
  3. Synthetic Training Data – Using the GLOCR pipeline, 5 k printed‑font images of Tigrinya sentences are rendered, providing a clean, labeled dataset for fine‑tuning.
  4. Word‑Aware Loss Weighting – During training, the cross‑entropy loss for tokens that sit at word boundaries is multiplied by a higher weight (empirically set to 5×). This forces the model to learn the correct segmentation of Ge’ez syllables, which are often merged under a Latin‑centric BPE scheme.
  5. Fine‑tuning – The extended model is trained for 3 epochs with a modest learning rate, using mixed‑precision on a single consumer GPU. No architectural changes are made beyond the tokenizer and loss weighting.
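
The loss-weighting step (step 4) can be sketched as a weighted cross-entropy. This is a minimal NumPy illustration, not the authors' implementation: the 5× weight matches the value reported in the paper, but the plain-mean normalization and the shape conventions are assumptions.

```python
import numpy as np

def word_aware_ce(logits, targets, boundary_mask, boundary_weight=5.0):
    """Cross-entropy where word-boundary tokens receive a larger weight.

    logits: (T, V) unnormalized decoder scores, targets: (T,) token ids,
    boundary_mask: (T,) bool, True for tokens at word boundaries.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_token_ce = -log_probs[np.arange(len(targets)), targets]
    # Boundary tokens are penalized boundary_weight times as heavily.
    weights = np.where(boundary_mask, boundary_weight, 1.0)
    return float((weights * per_token_ce).mean())

# Toy example: 4 tokens, vocabulary of 3; token 1 sits at a word boundary.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))
targets = np.array([0, 2, 1, 0])
mask = np.array([False, True, False, False])
plain = word_aware_ce(logits, targets, np.zeros(4, dtype=bool))
weighted = word_aware_ce(logits, targets, mask)
```

Because boundary errors dominate the loss, gradient updates concentrate on exactly the segmentation mistakes a Latin-centric BPE scheme tends to make.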

Results & Findings

| Metric | Baseline (no adaptation) | Tokenizer only | Full (Tokenizer + WALW) |
|---|---|---|---|
| Character Error Rate (CER) | > 30 % (unusable) | 2.1 % | 0.22 % |
| Exact-Match Accuracy | < 1 % | 45 % | 97.20 % |
| Training Time | n/a | 2.8 h | 2.9 h |
  • Ablation study shows that simply extending the vocabulary reduces CER from >30 % to ~2 %, but the Word‑Aware Loss Weighting delivers the final order‑of‑magnitude improvement.
  • The model reaches 97 % exact‑match on a held‑out test set of 5 k synthetic images, indicating near‑human performance for printed Tigrinya.
  • Training completes in under three hours on an 8 GB GPU, proving the approach is practical for small research labs or startups.
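
For reference, the two evaluation metrics above are straightforward to compute. The sketch below uses a standard dynamic-programming edit distance; CER is conventionally edit distance divided by reference length, and exact match is a simple string comparison. The Ge'ez example strings are illustrative, not from the paper's test set.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

# Toy check with Ge'ez text: one substituted character out of six.
ref, hyp = "ሰላም ነው", "ሰላም ነዉ"
print(round(cer(ref, hyp), 3))  # 0.167
```

A CER of 0.22 % therefore means roughly one character error per 450 reference characters, which is why the exact-match rate over whole lines can still sit a few points below 100 %.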

Practical Implications

  • Rapid Localization – Companies looking to add OCR support for African languages can now fine‑tune a pre‑existing transformer OCR model rather than building one from scratch.
  • Low‑Cost Deployment – The modest hardware requirements (single consumer GPU) make it feasible to embed Tigrinya OCR in edge devices, mobile apps, or low‑budget cloud services.
  • Template for Other Scripts – The Word‑Aware Loss Weighting concept can be transplanted to any script where token boundaries are ambiguous (e.g., Amharic, Burmese, or even historic scripts).
  • Improved Data Pipelines – Accurate printed‑text OCR enables automated digitization of government forms, educational materials, and archival newspapers in Tigrinya, unlocking data for NLP, search, and analytics.
  • Open‑Source Ecosystem – By releasing the code and weights, the authors lower the barrier for community contributions, such as extending the model to handwritten Tigrinya or integrating it into OCR SaaS platforms.

Limitations & Future Work

  • Synthetic‑Only Training – The current evaluation uses purely synthetic images; real‑world scans (varying lighting, noise, or paper quality) may expose robustness gaps.
  • Handwritten Text – The model is tuned for printed fonts; extending to cursive or handwritten Tigrinya will likely require additional data and possibly architectural tweaks.
  • Vocabulary Size Trade‑off – Adding 230 new tokens inflates the tokenizer slightly; scaling to scripts with thousands of characters could strain memory on very small devices.
  • Cross‑Script Generalization – While WALW works well for Ge’ez, its effectiveness on scripts with different morphological properties (e.g., agglutinative languages) remains to be validated.

Future research directions include fine‑tuning on mixed synthetic‑real datasets, exploring hierarchical tokenizers to keep vocabulary compact, and applying the WALW strategy to multilingual OCR models that simultaneously handle Latin, CJK, and African scripts.

Authors

  • Yonatan Haile Medhanie
  • Yuanhua Ni

Paper Information

  • arXiv ID: 2604.20813v1
  • Categories: cs.CV
  • Published: April 22, 2026
