[Paper] RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering
Source: arXiv - 2604.20738v1
Overview
The RespondeoQA benchmark introduces the first large‑scale question‑answering (QA) dataset pairing Latin with English. Comprising roughly 7,800 curated QA pairs drawn from textbooks, exam sheets, and quiz‑bowl‑style trivia spanning two centuries, the resource lets researchers and engineers evaluate how well modern language models handle a "dead" language in both comprehension and translation tasks.
Key Contributions
- A bilingual Latin‑English QA corpus (≈ 7.8 k pairs) covering diverse question types: factual recall, multihop reasoning, constrained translation, and literary‑device analysis.
- A reproducible pipeline for extracting, cleaning, and manually validating QA items from legacy pedagogical sources—easily adaptable to other low‑resource or historical languages.
- Baseline evaluation of three state‑of‑the‑art LLMs (LLaMA 3, Qwen QwQ, OpenAI o3‑mini) that highlights systematic weaknesses on skill‑oriented Latin queries.
- Open‑source release (GitHub) with the data, extraction pipeline, and evaluation scripts, encouraging community contributions and cross‑lingual benchmarking.
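To make the extraction idea concrete, a minimal sketch of the kind of regular‑expression pass described in the pipeline might look like the following. The pattern, sample text, and field names here are illustrative assumptions, not the authors' actual code; real exam archives would need a pattern per source format.

```python
import re

# Illustrative pattern for exam-style lines such as
# "1. Quis erat primus rex Romae? A: Romulus".
# Assumed format: number, question ending in "?", then an "A:" answer key.
QA_PATTERN = re.compile(
    r"(?P<num>\d+)\.\s+(?P<question>.+?\?)\s+A:\s+(?P<answer>.+)"
)

def extract_qa_pairs(text: str) -> list[dict]:
    """Pull (question, answer) pairs out of exam-style plain text."""
    pairs = []
    for line in text.splitlines():
        m = QA_PATTERN.search(line)
        if m:
            pairs.append({
                "question": m.group("question").strip(),
                "answer": m.group("answer").strip(),
            })
    return pairs

sample = "1. Quis erat primus rex Romae? A: Romulus\n2. Quot sunt casus? A: Sex"
print(extract_qa_pairs(sample))
```

Items matched this way would then flow into the cleaning, alignment, and human‑review stages before entering the gold set.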
Methodology
- Source Mining – The authors scraped publicly available Latin teaching materials (exam archives, quiz‑bowl databases, and classic textbooks).
- Automated Extraction – Regular‑expression patterns and simple NLP heuristics identified question stems, answer keys, and any accompanying English translations.
- Cleaning & Normalization – Duplicate removal, spelling normalization (Latin diacritics, English orthography), and token‑level alignment were performed automatically.
- Human Review – A team of Latin scholars manually verified each pair for correctness, language consistency, and difficulty level, resulting in a high‑quality gold standard.
- Task Formulation – Each entry can be used in two ways:
  - QA – given a question in either Latin or English, generate the answer;
  - Translation QA – translate the question before answering, testing cross‑lingual reasoning.
- Baseline Experiments – The three LLMs were prompted in zero‑shot mode with both language variants, and performance was measured using exact‑match and F1 scores across the different question categories.
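The exact‑match and F1 scores used in the baseline experiments follow the standard extractive‑QA recipe: exact match checks whether the normalized prediction equals the gold answer, and F1 measures token overlap. A minimal implementation in the common SQuAD style (an assumption about the details, not necessarily the paper's exact scoring code) looks like this:

```python
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase and tokenize; full scorers also strip punctuation/articles."""
    return text.lower().split()

def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the gold answer, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the gold answer."""
    p, g = normalize(pred), normalize(gold)
    common = Counter(p) & Counter(g)  # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(token_f1("rex Romulus", "Romulus"))  # partial credit for overlap
```

Scores are typically averaged per question category, which is how per‑category weaknesses like those below become visible.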
Results & Findings
| Model | Best overall score (F1) | Strongest area | Weakest area |
|---|---|---|---|
| LLaMA 3 | 0.42 | Scansion & literary‑device detection (Latin) | Skill‑oriented factual recall (English) |
| Qwen QwQ | 0.44 | Slight edge on Latin‑language questions | Multihop reasoning |
| OpenAI o3‑mini | 0.38 | Consistent across languages for simple fact QA | Complex reasoning & translation constraints |
- All models struggled most with skill‑oriented questions that require knowledge of Latin grammar, meter, or rhetorical devices.
- Reasoning‑enhanced prompts (chain‑of‑thought) gave modest gains on scansion tasks but did not close the gap on multihop or translation‑heavy items.
- The language of the prompt matters: QwQ performed slightly better when the question was presented in Latin, suggesting some models retain language‑specific priors even after extensive multilingual pre‑training.
Practical Implications
- Educational Tech – Platforms that auto‑grade Latin exams or generate practice quizzes can now benchmark their pipelines against a realistic, diverse dataset rather than synthetic examples.
- Cross‑lingual Retrieval – Search engines targeting historical texts (e.g., digitized manuscripts) can use RespondeoQA to fine‑tune retrieval‑augmented generation models for Latin‑English query translation.
- Low‑Resource Model Development – The open pipeline demonstrates a viable path to bootstrap QA resources for other under‑represented languages (e.g., Classical Greek, Old Norse).
- Prompt Engineering – The observed sensitivity to question language underscores the need for language‑aware prompting strategies when deploying multilingual LLMs in production.
Limitations & Future Work
- Domain Concentration – The dataset leans heavily on academic and trivia sources; real‑world user queries (e.g., casual historical curiosity) are under‑represented.
- Size – At ~7.8 k pairs, RespondeoQA is modest compared to mainstream QA corpora, limiting its utility for large‑scale fine‑tuning.
- Evaluation Scope – Only zero‑shot performance was examined; future work could explore few‑shot or adapter‑based fine‑tuning to quantify potential gains.
- Extension to Other Classical Languages – The authors propose adapting the pipeline to Greek, Sanskrit, or even extinct scripts, but this remains to be demonstrated.
RespondeoQA opens a new frontier for evaluating language models in a niche yet culturally rich domain. By providing both the data and a reproducible creation workflow, it invites developers to experiment with multilingual reasoning, improve educational tooling, and extend the approach to other low‑resource languages.
Authors
- Marisa Hudspeth
- Patrick J. Burns
- Brendan O’Connor
Paper Information
- arXiv ID: 2604.20738v1
- Categories: cs.CL
- Published: April 22, 2026