[Paper] Adapting TrOCR for Printed Tigrinya Text Recognition: Word-Aware Loss Weighting for Cross-Script Transfer Learning

Published: (April 22, 2026 at 01:43 PM EDT)
4 min read
Source: arXiv (2604.20813v1)

Overview

The paper introduces the first successful adaptation of TrOCR, a transformer‑based OCR system, to recognize printed Tigrinya text written in the Ge’ez script. By extending the model’s tokenizer and adding a novel Word‑Aware Loss Weighting technique, the authors turn a Latin‑centric OCR engine into a high‑accuracy recognizer for an African syllabic writing system—achieving sub‑0.3 % character error rate in just a few hours of training on a consumer‑grade GPU.

Key Contributions

  • First TrOCR adaptation for Ge’ez (Tigrinya) – demonstrates that a state‑of‑the‑art Latin‑focused OCR can be repurposed for a non‑Latin, syllabic script.
  • Tokenizer extension – expands the byte‑pair‑encoding (BPE) vocabulary to cover 230 unique Ge’ez characters while preserving the original model architecture.
  • Word‑Aware Loss Weighting (WALW) – a loss‑scaling scheme that penalizes errors at word boundaries, fixing systematic failures caused by Latin‑centric BPE tokenization.
  • Efficient training pipeline – full adaptation (tokenizer, WALW, fine‑tuning) completes in < 3 hours on a single 8 GB GPU.
  • Open‑source release – code, pretrained weights, synthetic dataset, and evaluation scripts are publicly available, encouraging reproducibility and further research.
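
The tokenizer-extension idea above can be sketched in a few lines. This is a toy illustration with a plain dict standing in for TrOCR's BPE vocabulary: the Ethiopic code-point range and helper names are assumptions, and the paper's 230-character set is a curated subset of the full Unicode block. With Hugging Face Transformers, the equivalent step would typically be `tokenizer.add_tokens(...)` followed by resizing the decoder's token embeddings to the new vocabulary size.

```python
# Sketch: extending a toy token->id vocabulary with Ge'ez (Ethiopic) characters.
# The paper extends TrOCR's BPE vocabulary; names here are illustrative only.

# The basic Ethiopic Unicode block spans U+1200..U+137F; the paper reports
# 230 glyphs in actual use, a subset of this range.
GEEZ_CODEPOINTS = range(0x1200, 0x1380)

def extend_vocab(vocab: dict[str, int], chars) -> dict[str, int]:
    """Append new single-character tokens, preserving all existing ids."""
    next_id = max(vocab.values()) + 1 if vocab else 0
    for ch in chars:
        if ch not in vocab:
            vocab[ch] = next_id
            next_id += 1
    return vocab

# Toy base vocabulary standing in for the original Latin-centric BPE tokens.
vocab = {"<pad>": 0, "<s>": 1, "</s>": 2, "a": 3, "b": 4}
base_size = len(vocab)
vocab = extend_vocab(vocab, (chr(cp) for cp in GEEZ_CODEPOINTS))
added = len(vocab) - base_size
```

Keeping the original ids untouched is what "preserving the original model architecture" requires in practice: existing embedding rows stay valid, and only new rows need to be initialized during fine-tuning.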

Methodology

  1. Base Model Selection – The authors start from the publicly released TrOCR model pre‑trained on large Latin‑script OCR corpora.
  2. Tokenizer Augmentation – They generate a new BPE vocabulary that includes all 230 Ge’ez glyphs, keeping the original byte‑level tokens for backward compatibility.
  3. Synthetic Training Data – Using the GLOCR pipeline, 5 k printed‑font images of Tigrinya sentences are rendered, providing a clean, labeled dataset for fine‑tuning.
  4. Word‑Aware Loss Weighting – During training, the cross‑entropy loss for tokens that sit at word boundaries is multiplied by a higher weight (empirically set to 5×). This forces the model to learn the correct segmentation of Ge’ez syllables, which are often merged under a Latin‑centric BPE scheme.
  5. Fine‑tuning – The extended model is trained for 3 epochs with a modest learning rate, using mixed‑precision on a single consumer GPU. No architectural changes are made beyond the tokenizer and loss weighting.
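
The loss-weighting step (step 4) can be sketched as a weighted cross-entropy. This is a minimal NumPy illustration, not the authors' implementation: the 5× weight matches the value reported in the paper, but the plain-mean normalization and the shape conventions are assumptions.

```python
import numpy as np

def word_aware_ce(logits, targets, boundary_mask, boundary_weight=5.0):
    """Cross-entropy where word-boundary tokens receive a larger weight.

    logits: (T, V) unnormalized decoder scores, targets: (T,) token ids,
    boundary_mask: (T,) bool, True for tokens at word boundaries.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_token_ce = -log_probs[np.arange(len(targets)), targets]
    # Boundary tokens are penalized boundary_weight times as heavily.
    weights = np.where(boundary_mask, boundary_weight, 1.0)
    return float((weights * per_token_ce).mean())

# Toy example: 4 tokens, vocabulary of 3; token 1 sits at a word boundary.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))
targets = np.array([0, 2, 1, 0])
mask = np.array([False, True, False, False])
plain = word_aware_ce(logits, targets, np.zeros(4, dtype=bool))
weighted = word_aware_ce(logits, targets, mask)
```

Because boundary errors dominate the loss, gradient updates concentrate on exactly the segmentation mistakes a Latin-centric BPE scheme tends to make.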

Results & Findings

| Metric | Baseline (no adaptation) | Tokenizer only | Full (Tokenizer + WALW) |
|---|---|---|---|
| Character Error Rate (CER) | > 30 % (unusable) | 2.1 % | 0.22 % |
| Exact-Match Accuracy | < 1 % | 45 % | 97.20 % |
| Training Time | n/a | 2.8 h | 2.9 h |
  • Ablation study shows that simply extending the vocabulary reduces CER from >30 % to ~2 %, but the Word‑Aware Loss Weighting delivers the final order‑of‑magnitude improvement.
  • The model reaches 97 % exact‑match on a held‑out test set of 5 k synthetic images, indicating near‑human performance for printed Tigrinya.
  • Training completes in under three hours on an 8 GB GPU, proving the approach is practical for small research labs or startups.
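
For reference, the two evaluation metrics above are straightforward to compute. The sketch below uses a standard dynamic-programming edit distance; CER is conventionally edit distance divided by reference length, and exact match is a simple string comparison. The Ge'ez example strings are illustrative, not from the paper's test set.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

# Toy check with Ge'ez text: one substituted character out of six.
ref, hyp = "ሰላም ነው", "ሰላም ነዉ"
print(round(cer(ref, hyp), 3))  # 0.167
```

A CER of 0.22 % therefore means roughly one character error per 450 reference characters, which is why the exact-match rate over whole lines can still sit a few points below 100 %.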

Practical Implications

  • Rapid Localization – Companies looking to add OCR support for African languages can now fine‑tune a pre‑existing transformer OCR model rather than building one from scratch.
  • Low‑Cost Deployment – The modest hardware requirements (single consumer GPU) make it feasible to embed Tigrinya OCR in edge devices, mobile apps, or low‑budget cloud services.
  • Template for Other Scripts – The Word‑Aware Loss Weighting concept can be transplanted to any script where token boundaries are ambiguous (e.g., Amharic, Burmese, or even historic scripts).
  • Improved Data Pipelines – Accurate printed‑text OCR enables automated digitization of government forms, educational materials, and archival newspapers in Tigrinya, unlocking data for NLP, search, and analytics.
  • Open‑Source Ecosystem – By releasing the code and weights, the authors lower the barrier for community contributions, such as extending the model to handwritten Tigrinya or integrating it into OCR SaaS platforms.

Limitations & Future Work

  • Synthetic‑Only Training – The current evaluation uses purely synthetic images; real‑world scans (varying lighting, noise, or paper quality) may expose robustness gaps.
  • Handwritten Text – The model is tuned for printed fonts; extending to cursive or handwritten Tigrinya will likely require additional data and possibly architectural tweaks.
  • Vocabulary Size Trade‑off – Adding 230 new tokens inflates the tokenizer slightly; scaling to scripts with thousands of characters could strain memory on very small devices.
  • Cross‑Script Generalization – While WALW works well for Ge’ez, its effectiveness on scripts with different morphological properties (e.g., agglutinative languages) remains to be validated.

Future research directions include fine‑tuning on mixed synthetic‑real datasets, exploring hierarchical tokenizers to keep vocabulary compact, and applying the WALW strategy to multilingual OCR models that simultaneously handle Latin, CJK, and African scripts.

Authors

  • Yonatan Haile Medhanie
  • Yuanhua Ni

Paper Information

  • arXiv ID: 2604.20813v1
  • Categories: cs.CV
  • Published: April 22, 2026
