[Paper] RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering

Published: April 22, 2026

Source: arXiv - 2604.20738v1

Overview

The RespondeoQA benchmark introduces the first large‑scale question‑answering (QA) dataset that pairs Latin with English. Comprising roughly 7,800 curated QA pairs drawn from textbooks, exam sheets, and quiz‑bowl style trivia spanning two centuries, the resource lets researchers and engineers evaluate how well modern language models handle a “dead” language in both comprehension and translation tasks.

Key Contributions

  • A bilingual Latin‑English QA corpus (≈ 7.8 k pairs) covering diverse question types: factual recall, multihop reasoning, constrained translation, and literary‑device analysis.
  • A reproducible pipeline for extracting, cleaning, and manually validating QA items from legacy pedagogical sources—easily adaptable to other low‑resource or historical languages.
  • Baseline evaluation of three state‑of‑the‑art LLMs (LLaMA 3, Qwen QwQ, OpenAI o3‑mini) that highlights systematic weaknesses on skill‑oriented Latin queries.
  • Open‑source release (GitHub) with the data, pipeline scripts, and evaluation code, encouraging community contributions and cross‑lingual benchmarking.
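The extraction step in the pipeline (regex patterns over legacy exam sheets) can be sketched as follows. This is an illustrative heuristic, not the paper's actual code; the `QA_PATTERN` layout and the `A:`/`Answer:` key format are assumptions about what such source material might look like.

```python
import re

# Hypothetical pattern for a numbered exam sheet where each line holds
# a question stem ending in '?' followed by an inline answer key.
QA_PATTERN = re.compile(
    r"^\s*\d+\.\s*(?P<question>.+?\?)\s*"   # numbered question stem
    r"A(?:nswer)?:\s*(?P<answer>.+)$",      # 'A:' or 'Answer:' key
    re.MULTILINE,
)

def extract_qa_pairs(raw_text: str) -> list[dict]:
    """Pull (question, answer) pairs out of a plain-text exam sheet."""
    return [
        {"question": m.group("question").strip(),
         "answer": m.group("answer").strip()}
        for m in QA_PATTERN.finditer(raw_text)
    ]

sample = """
1. Quis erat primus rex Romae? A: Romulus
2. Quot sunt casus Latini? Answer: Sex
"""
print(extract_qa_pairs(sample))
```

A real pipeline would chain several such patterns per source format and route ambiguous matches to the human-review stage described in the methodology.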

Methodology

  1. Source Mining – The authors scraped publicly available Latin teaching materials (exam archives, quiz‑bowl databases, and classic textbooks).
  2. Automated Extraction – Regular‑expression patterns and simple NLP heuristics identified question stems, answer keys, and any accompanying English translations.
  3. Cleaning & Normalization – Duplicate removal, spelling normalization (Latin diacritics, English orthography), and token‑level alignment were performed automatically.
  4. Human Review – A team of Latin scholars manually verified each pair for correctness, language consistency, and difficulty level, resulting in a high‑quality gold standard.
  5. Task Formulation – Each entry can be used in two ways:
    • QA – given a question in either Latin or English, generate the answer;
    • Translation QA – translate the question before answering, testing cross‑lingual reasoning.
  6. Baseline Experiments – The three LLMs were prompted in zero‑shot mode with both language variants, and performance was measured using exact‑match and F1 scores across the different question categories.
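The exact-match and token-level F1 metrics used in step 6 follow the standard QA-evaluation recipe; a minimal sketch (the `normalize` rules here are an assumption — the authors' harness may normalize Latin diacritics differently):

```python
from collections import Counter

def normalize(text: str) -> list[str]:
    # Lowercase and drop punctuation; a production harness would also
    # fold Latin macrons and other diacritics.
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace()).split()

def exact_match(pred: str, gold: str) -> float:
    """1.0 if prediction and gold answer match after normalization."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the gold answer."""
    p, g = normalize(pred), normalize(gold)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("Romulus et Remus", "Romulus")` gives partial credit (0.5) where `exact_match` would score 0.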

Results & Findings

| Model | Best overall score (F1) | Strongest area | Weakest area |
| --- | --- | --- | --- |
| LLaMA 3 | 0.42 | Scansion & literary‑device detection (Latin) | Skill‑oriented factual recall (English) |
| Qwen QwQ | 0.44 | Slight edge on Latin‑language questions | Multihop reasoning |
| OpenAI o3‑mini | 0.38 | Consistent across languages for simple fact QA | Complex reasoning & translation constraints |
  • All models struggled most with skill‑oriented questions that require knowledge of Latin grammar, meter, or rhetorical devices.
  • Reasoning‑enhanced prompts (chain‑of‑thought) gave modest gains on scansion tasks but did not close the gap on multihop or translation‑heavy items.
  • The language of the prompt matters: QwQ performed slightly better when the question was presented in Latin, suggesting some models retain language‑specific priors even after extensive multilingual pre‑training.

Practical Implications

  • Educational Tech – Platforms that auto‑grade Latin exams or generate practice quizzes can now benchmark their pipelines against a realistic, diverse dataset rather than synthetic examples.
  • Cross‑lingual Retrieval – Search engines targeting historical texts (e.g., digitized manuscripts) can use RespondeoQA to fine‑tune retrieval‑augmented generation models for Latin‑English query translation.
  • Low‑Resource Model Development – The open pipeline demonstrates a viable path to bootstrap QA resources for other under‑represented languages (e.g., Classical Greek, Old Norse).
  • Prompt Engineering – The observed sensitivity to question language underscores the need for language‑aware prompting strategies when deploying multilingual LLMs in production.
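One simple form such a language-aware strategy could take is routing each question to a template in its own language. The templates, the detect-then-route design, and the `LATIN_HINTS` word list below are all illustrative assumptions, not the paper's setup:

```python
# Per-language prompt templates (hypothetical wording).
TEMPLATES = {
    "la": "Responde Latine ad hanc quaestionem:\n{question}\nResponsum:",
    "en": "Answer the following question about Latin:\n{question}\nAnswer:",
}

# Crude lexical cue set: common Latin interrogatives.
LATIN_HINTS = {"quis", "quid", "quomodo", "cur", "ubi", "quot", "estne"}

def guess_language(question: str) -> str:
    """Treat the question as Latin if it contains a Latin interrogative;
    otherwise assume English. A real system would use a language-ID model."""
    tokens = {t.strip("?.,").lower() for t in question.split()}
    return "la" if tokens & LATIN_HINTS else "en"

def build_prompt(question: str) -> str:
    return TEMPLATES[guess_language(question)].format(question=question)
```

Given QwQ's reported edge on Latin-language questions, routing Latin inputs to a Latin-language template is the kind of intervention this finding motivates.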

Limitations & Future Work

  • Domain Concentration – The dataset leans heavily on academic and trivia sources; real‑world user queries (e.g., casual historical curiosity) are under‑represented.
  • Size – At ~7.8 k pairs, RespondeoQA is modest compared to mainstream QA corpora, limiting its utility for large‑scale fine‑tuning.
  • Evaluation Scope – Only zero‑shot performance was examined; future work could explore few‑shot or adapter‑based fine‑tuning to quantify potential gains.
  • Extension to Other Classical Languages – The authors propose adapting the pipeline to Greek, Sanskrit, or even extinct scripts, but this remains to be demonstrated.

RespondeoQA opens a new frontier for evaluating language models in a niche yet culturally rich domain. By providing both the data and a reproducible creation workflow, it invites developers to experiment with multilingual reasoning, improve educational tooling, and extend the approach to other low‑resource languages.

Authors

  • Marisa Hudspeth
  • Patrick J. Burns
  • Brendan O’Connor

Paper Information

  • arXiv ID: 2604.20738v1
  • Categories: cs.CL
  • Published: April 22, 2026