[Paper] RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering
Source: arXiv - 2604.20738v1
Overview
The RespondeoQA benchmark introduces the first large‑scale question‑answering (QA) dataset pairing Latin with English. Comprising roughly 7,800 curated QA pairs drawn from textbooks, exam sheets, and quiz‑bowl‑style trivia spanning two centuries, the resource lets researchers and engineers evaluate how well modern language models handle a "dead" language in both comprehension and translation tasks.
Key Contributions
- A bilingual Latin‑English QA corpus (≈ 7.8 k pairs) covering diverse question types: factual recall, multihop reasoning, constrained translation, and literary‑device analysis.
- A reproducible pipeline for extracting, cleaning, and manually validating QA items from legacy pedagogical sources—easily adaptable to other low‑resource or historical languages.
- Baseline evaluation of three state‑of‑the‑art LLMs (LLaMA 3, Qwen QwQ, OpenAI o3‑mini) that highlights systematic weaknesses on skill‑oriented Latin queries.
- Open‑source release (GitHub) with the data, extraction pipeline, and evaluation scripts, encouraging community contributions and cross‑lingual benchmarking.
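To make the extraction idea concrete, a minimal sketch of the kind of regular‑expression pass described in the pipeline might look like the following. The pattern, sample text, and field names here are illustrative assumptions, not the authors' actual code; real exam archives would need a pattern per source format.

```python
import re

# Illustrative pattern for exam-style lines such as
# "1. Quis erat primus rex Romae? A: Romulus".
# Assumed format: number, question ending in "?", then an "A:" answer key.
QA_PATTERN = re.compile(
    r"(?P<num>\d+)\.\s+(?P<question>.+?\?)\s+A:\s+(?P<answer>.+)"
)

def extract_qa_pairs(text: str) -> list[dict]:
    """Pull (question, answer) pairs out of exam-style plain text."""
    pairs = []
    for line in text.splitlines():
        m = QA_PATTERN.search(line)
        if m:
            pairs.append({
                "question": m.group("question").strip(),
                "answer": m.group("answer").strip(),
            })
    return pairs

sample = "1. Quis erat primus rex Romae? A: Romulus\n2. Quot sunt casus? A: Sex"
print(extract_qa_pairs(sample))
```

Items matched this way would then flow into the cleaning, alignment, and human‑review stages before entering the gold set.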
Methodology
- Source Mining – The authors scraped publicly available Latin teaching materials (exam archives, quiz‑bowl databases, and classic textbooks).
- Automated Extraction – Regular‑expression patterns and simple NLP heuristics identified question stems, answer keys, and any accompanying English translations.
- Cleaning & Normalization – Duplicate removal, spelling normalization (Latin diacritics, English orthography), and token‑level alignment were performed automatically.
- Human Review – A team of Latin scholars manually verified each pair for correctness, language consistency, and difficulty level, resulting in a high‑quality gold standard.
- Task Formulation – Each entry can be used in two ways:
  - QA – given a question in either Latin or English, generate the answer;
  - Translation QA – translate the question before answering, testing cross‑lingual reasoning.
- Baseline Experiments – The three LLMs were prompted in zero‑shot mode with both language variants, and performance was measured using exact‑match and F1 scores across the different question categories.
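The exact‑match and F1 scores used in the baseline experiments follow the standard extractive‑QA recipe: exact match checks whether the normalized prediction equals the gold answer, and F1 measures token overlap. A minimal implementation in the common SQuAD style (an assumption about the details, not necessarily the paper's exact scoring code) looks like this:

```python
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase and tokenize; full scorers also strip punctuation/articles."""
    return text.lower().split()

def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the gold answer, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the gold answer."""
    p, g = normalize(pred), normalize(gold)
    common = Counter(p) & Counter(g)  # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(token_f1("rex Romulus", "Romulus"))  # partial credit for overlap
```

Scores are typically averaged per question category, which is how per‑category weaknesses like those below become visible.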
Results & Findings
| Model | Best overall score (F1) | Strongest area | Weakest area |
|---|---|---|---|
| LLaMA 3 | 0.42 | Scansion & literary‑device detection (Latin) | Skill‑oriented factual recall (English) |
| Qwen QwQ | 0.44 | Slight edge on Latin‑language questions | Multihop reasoning |
| OpenAI o3‑mini | 0.38 | Consistent across languages for simple fact QA | Complex reasoning & translation constraints |
- All models struggled most with skill‑oriented questions that require knowledge of Latin grammar, meter, or rhetorical devices.
- Reasoning‑enhanced prompts (chain‑of‑thought) gave modest gains on scansion tasks but did not close the gap on multihop or translation‑heavy items.
- The language of the prompt matters: QwQ performed slightly better when the question was presented in Latin, suggesting some models retain language‑specific priors even after extensive multilingual pre‑training.
Practical Implications
- Educational Tech – Platforms that auto‑grade Latin exams or generate practice quizzes can now benchmark their pipelines against a realistic, diverse dataset rather than synthetic examples.
- Cross‑lingual Retrieval – Search engines targeting historical texts (e.g., digitized manuscripts) can use RespondeoQA to fine‑tune retrieval‑augmented generation models for Latin‑English query translation.
- Low‑Resource Model Development – The open pipeline demonstrates a viable path to bootstrap QA resources for other under‑represented languages (e.g., Classical Greek, Old Norse).
- Prompt Engineering – The observed sensitivity to question language underscores the need for language‑aware prompting strategies when deploying multilingual LLMs in production.
Limitations & Future Work
- Domain Concentration – The dataset leans heavily on academic and trivia sources; real‑world user queries (e.g., casual historical curiosity) are under‑represented.
- Size – At ~7.8 k pairs, RespondeoQA is modest compared to mainstream QA corpora, limiting its utility for large‑scale fine‑tuning.
- Evaluation Scope – Only zero‑shot performance was examined; future work could explore few‑shot or adapter‑based fine‑tuning to quantify potential gains.
- Extension to Other Classical Languages – The authors propose adapting the pipeline to Greek, Sanskrit, or even extinct scripts, but this remains to be demonstrated.
RespondeoQA opens a new frontier for evaluating language models in a niche yet culturally rich domain. By providing both the data and a reproducible creation workflow, it invites developers to experiment with multilingual reasoning, improve educational tooling, and extend the approach to other low‑resource languages.
Authors
- Marisa Hudspeth
- Patrick J. Burns
- Brendan O’Connor
Paper Information
- arXiv ID: 2604.20738v1
- Categories: cs.CL
- Published: April 22, 2026