[Paper] Application of deep learning approaches for medieval historical documents transcription
Source: arXiv - 2512.18865v1
Overview
The paper introduces a deep‑learning pipeline that can automatically transcribe Latin handwritten texts from medieval manuscripts (9th–11th c.). By tailoring modern OCR/HTR techniques to the quirks of early‑medieval scripts, the authors achieve a level of accuracy that makes large‑scale digitisation of historical archives feasible.
Key Contributions
- Domain‑aware dataset creation – a curated collection of medieval Latin manuscript images with line‑ and word‑level annotations, plus a thorough exploratory data analysis.
- End‑to‑end transcription pipeline – combines object detection (locating text blocks), a classification model for word‑level recognition, and a learned embedding space for handling out‑of‑vocabulary glyphs.
- Comprehensive evaluation – reports recall, precision, F1, IoU, confusion matrices, and mean string distance, providing a transparent view of performance across script variations.
- Open‑source implementation – full code, trained models, and data preprocessing scripts are released on GitHub, enabling reproducibility and community extensions.
Methodology
Data Preparation
- Scanned manuscript pages are pre‑processed (binarisation, deskewing).
- Manual annotations define bounding boxes for individual words and lines.
- Augmentation (random rotation, elastic distortion) mimics the variability of ink, parchment, and scribe styles.
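A minimal augmentation sketch, assuming a PyTorch/torchvision pipeline; the parameter values are illustrative and not taken from the paper:

```python
# Hypothetical augmentation pipeline for word/line crops; parameters are illustrative.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=3),                # small skew typical of scanned folios
    transforms.ElasticTransform(alpha=30.0, sigma=4.0),  # mimics stroke and parchment deformation
    transforms.ToTensor(),                               # PIL image -> float tensor in [0, 1]
])

# Usage: tensor = augment(pil_word_crop)
```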
Object Detection
- A lightweight CNN‑based detector (e.g., Faster R‑CNN) scans each page to locate word‑sized regions.
- Detected boxes are filtered by an Intersection‑over‑Union (IoU) threshold to reduce false positives.
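The summary does not spell out the exact filtering rule, but a common IoU-threshold filter is greedy non-maximum suppression; a minimal sketch (the threshold value is illustrative):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def filter_boxes(boxes, scores, iou_threshold=0.5):
    """Greedy suppression: keep the highest-scoring box, drop others that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return [boxes[i] for i in keep]
```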
Word Recognition
- Detected word images are fed into a classification network (ResNet‑based) that maps them to a fixed vocabulary of Latin lemmas.
- For out‑of‑vocabulary or ambiguous glyphs, a word‑embedding branch learns a continuous representation, allowing similarity‑based decoding.
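A minimal sketch of a two-headed recogniser consistent with this description, assuming a PyTorch ResNet backbone; the vocabulary size, embedding dimension, and confidence threshold are illustrative, not the paper's values:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class WordRecognizer(nn.Module):
    """ResNet backbone with a closed-vocabulary head and an embedding head."""
    def __init__(self, vocab_size=5000, embed_dim=128):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()                     # keep the 512-d pooled features
        self.backbone = backbone
        self.classifier = nn.Linear(512, vocab_size)    # logits over the lemma vocabulary
        self.embedder = nn.Linear(512, embed_dim)       # continuous word representation

    def forward(self, x):                               # x: (batch, 3, H, W)
        feats = self.backbone(x)
        return self.classifier(feats), F.normalize(self.embedder(feats), dim=-1)

def decode(logits, embedding, vocab, lexicon_embeddings, lexicon_words, conf_threshold=0.5):
    """Single-sample decoding: trust the classifier when confident, otherwise
    fall back to nearest-neighbour search in the embedding space."""
    conf, idx = logits.softmax(dim=-1).max(dim=-1)
    if conf.item() >= conf_threshold:
        return vocab[idx.item()]                        # in-vocabulary decision
    sims = lexicon_embeddings @ embedding               # cosine similarity (rows are L2-normalised)
    return lexicon_words[int(sims.argmax())]            # out-of-vocabulary fallback
```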
Post‑Processing
- A language model trained on medieval Latin corpora (character‑level LSTM) refines the raw predictions, correcting unlikely sequences.
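A minimal sketch of a character-level LSTM language model of the kind described, in PyTorch; hyperparameters are illustrative, and the exact way the score is combined with the recogniser's output is not specified here:

```python
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    """Next-character model; candidate transcriptions can be re-ranked by their score."""
    def __init__(self, n_chars, embed_dim=64, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_chars, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_chars)

    def forward(self, char_ids):                        # (batch, seq_len) integer ids
        h, _ = self.lstm(self.embed(char_ids))
        return self.out(h)                              # (batch, seq_len, n_chars) next-char logits

def sequence_log_prob(model, char_ids):
    """Log-probability of a candidate transcription under the language model."""
    log_probs = model(char_ids[:, :-1]).log_softmax(dim=-1)
    targets = char_ids[:, 1:]
    return log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1).sum(dim=-1)
```

In practice such a score would be used to re-rank the recogniser's top hypotheses, penalising character sequences that are unlikely in medieval Latin.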
Evaluation
- Metrics are computed at both the detection level (IoU, precision, recall) and the transcription level (F1 score, mean string distance).
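Mean string distance here is the Levenshtein edit distance averaged over predicted/reference word pairs; a minimal sketch of how it can be computed:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                    # deletion
                            curr[j - 1] + 1,                # insertion
                            prev[j - 1] + (ca != cb)))      # substitution
        prev = curr
    return prev[-1]

def mean_string_distance(predictions, references):
    """Average edit distance between predicted and reference transcriptions."""
    return sum(levenshtein(p, r) for p, r in zip(predictions, references)) / len(references)
```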
Results & Findings
| Metric | Value |
|---|---|
| Detection Precision | 0.92 |
| Detection Recall | 0.88 |
| Word‑level F1 Score | 0.84 |
| Mean String Distance (Levenshtein) | 1.7 characters |
| IoU (average) | 0.78 |
- The detector reliably isolates words despite irregular spacing and ink bleed.
- Classification accuracy remains high even for rare ligatures, thanks to the embedding fallback.
- Language‑model post‑processing cuts the average edit distance by ~30 %, demonstrating the value of contextual constraints.
Practical Implications
- Mass Digitisation – Archives can process thousands of pages with minimal human oversight, dramatically reducing the time and cost of creating searchable corpora.
- Digital Humanities Tools – Researchers gain near‑real‑time access to transcribed texts, enabling large‑scale linguistic, palaeographic, and cultural analyses that were previously impractical.
- Cross‑Domain Transfer – The modular pipeline (detector + classifier + embedding) can be re‑trained for other low‑resource historical scripts (e.g., early Cyrillic, Arabic) with modest data.
- Integration with Existing Platforms – The open‑source code can be wrapped as a micro‑service (REST API) and plugged into document‑management systems, library catalogues, or crowdsourcing platforms like Zooniverse.
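As a concrete illustration of the micro-service idea, a minimal wrapper sketch using FastAPI; the endpoint name, the transcribe_page function, and the response schema are hypothetical and not part of the released code:

```python
# Hypothetical REST wrapper around the transcription pipeline; `transcribe_page`
# stands in for whatever entry point the released code actually exposes.
from fastapi import FastAPI, File, UploadFile

app = FastAPI(title="Manuscript transcription service")

def transcribe_page(image_bytes: bytes) -> list[dict]:
    """Placeholder: run detection + recognition + post-processing on one page image."""
    raise NotImplementedError

@app.post("/transcribe")
async def transcribe(page: UploadFile = File(...)):
    words = transcribe_page(await page.read())
    return {"filename": page.filename, "words": words}
```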
Limitations & Future Work
- Vocabulary Coverage – The classifier relies on a predefined Latin lemma list; rare or corrupted words still fall back to the embedding route, which yields lower confidence.
- Script Diversity – Experiments are limited to 9th–11th c. Latin scripts; later medieval scripts with more elaborate abbreviations may require additional model capacity.
- Ground‑Truth Scarcity – Manual annotation is labor‑intensive; semi‑supervised or active‑learning strategies could further reduce the labeling burden.
- Real‑World Deployment – The current evaluation uses relatively clean scans; robustness to low‑resolution photographs or heavily damaged folios remains an open question.
The authors’ GitHub repository provides the full pipeline, trained weights, and instructions for extending the system to new manuscript collections.
Authors
- Maksym Voloshchuk
- Bohdana Zarembovska
- Mykola Kozlenko
Paper Information
- arXiv ID: 2512.18865v1
- Categories: cs.CV, cs.CL, cs.LG
- Published: December 21, 2025