[Paper] JobResQA: A Benchmark for LLM Machine Reading Comprehension on Multilingual Résumés and JDs
Source: arXiv - 2601.23183v1
Overview
The paper presents JobResQA, a new multilingual benchmark that tests how well large language models (LLMs) can read and understand résumé‑job‑description (JD) pairs. By covering five languages and three difficulty levels, the dataset shines a light on the current strengths and blind spots of LLM‑driven HR tools, especially when it comes to privacy‑preserving data and fairness analysis.
Key Contributions
- A multilingual HR‑focused QA benchmark – 581 questions over 105 synthetic résumé‑JD pairs in English, Spanish, Italian, German, and Chinese.
- Three-tiered question complexity – ranging from simple fact extraction to cross‑document reasoning that mimics real recruiter queries.
- Privacy‑first data generation pipeline – de‑identifies real résumés, injects controllable demographic and professional placeholders, and synthesizes realistic content.
- Cost‑effective human‑in‑the‑loop translation (TEaR) – combines machine translation, MQM error annotation, and selective post‑editing to produce high‑quality parallel data.
- Baseline evaluation using “LLM‑as‑judge” – shows strong performance on English/Spanish but steep drops for Italian, German, and Chinese, exposing multilingual gaps.
- Open‑source release – full dataset, generation scripts, and evaluation code are publicly available for reproducibility.
Methodology
- Source Collection & De‑identification – Real résumés and JDs were stripped of personal identifiers.
- Synthetic Pair Creation – Placeholders (e.g., <AGE>, <ROLE>) were inserted to control demographics and job titles, then filled with realistic values using rule‑based generators.
- Question Design – Domain experts authored questions at three levels:
- Level 1: Direct facts (e.g., “How many years of experience does the candidate have?”)
- Level 2: Intra‑document inference (e.g., “Which skill is mentioned most often?”)
- Level 3: Cross‑document reasoning (e.g., “Is the candidate qualified for the senior data‑engineer role?”)
- Multilingual Translation (TEaR) – Machine translation produced initial drafts; annotators applied MQM (Multidimensional Quality Metrics) to flag errors, followed by targeted post‑editing only where the error score exceeded a threshold.
- Evaluation Framework – Several open‑weight LLM families (e.g., Llama‑2, Mistral, BLOOM) were prompted to answer the questions. An LLM‑as‑judge model scored answer correctness, enabling a language‑agnostic performance snapshot.
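The placeholder‑driven pair creation can be sketched as a simple template fill. The placeholder names and value pools below are illustrative assumptions; the paper's actual rule‑based generators are richer than this:

```python
import random
import re

# Hypothetical value pools -- stand-ins for the paper's rule-based generators.
POOLS = {
    "AGE": ["29", "34", "41"],
    "ROLE": ["data engineer", "backend developer"],
}

def fill_placeholders(template: str, seed: int = 0) -> str:
    """Replace each <NAME> placeholder with a value drawn from its pool."""
    rng = random.Random(seed)  # seeded for reproducible synthetic pairs
    def repl(match: re.Match) -> str:
        return rng.choice(POOLS[match.group(1)])
    return re.sub(r"<([A-Z]+)>", repl, template)

resume = "Candidate, age <AGE>, applying for a <ROLE> position."
print(fill_placeholders(resume))
```

Because the attribute values are injected programmatically, the same template can later be re‑rendered with different demographics for bias audits.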
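TEaR's selective post‑editing can be approximated as a threshold filter over MQM error scores: only segments whose weighted error score exceeds a cutoff go to human editors. The severity weights and threshold below are invented for illustration; the summary does not specify the actual values:

```python
# Illustrative MQM severity weights and cutoff (assumptions, not the paper's).
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}
POST_EDIT_THRESHOLD = 5

def mqm_score(errors: list) -> int:
    """Sum the severity weights of the errors annotated on one segment."""
    return sum(SEVERITY_WEIGHTS[sev] for sev in errors)

def needs_post_edit(errors: list) -> bool:
    """Route a machine-translated segment to human post-editing only
    when its MQM error score exceeds the threshold."""
    return mqm_score(errors) > POST_EDIT_THRESHOLD

segments = [[], ["minor"], ["major", "minor"], ["critical"]]
print([needs_post_edit(e) for e in segments])  # [False, False, True, True]
```

This is what makes the workflow cost‑effective: clean machine‑translated segments are accepted as‑is, and human effort is spent only on flagged ones.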
Results & Findings
- English & Spanish: Average exact‑match scores > 70 % for Level 1 and ~55 % for Level 3, indicating solid factual and reasoning abilities.
- Italian, German, Chinese: Scores dropped by 20‑35 % across all levels, with Level 3 often falling below 30 %.
- Cross‑language transfer: Models fine‑tuned on English data performed only marginally better on other languages, suggesting limited multilingual generalization.
- Bias detection: The placeholder‑driven design allowed the authors to surface subtle gender and seniority biases in model outputs, confirming the benchmark’s utility for fairness audits.
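The per‑language, per‑level exact‑match numbers above can be reproduced with a small aggregator. The case/whitespace normalization here is an assumption about how exact match is computed, not a detail stated in the paper:

```python
from collections import defaultdict

def exact_match(pred: str, gold: str) -> bool:
    """Case- and whitespace-insensitive exact match (assumed normalization)."""
    return pred.strip().lower() == gold.strip().lower()

def score_by_group(records):
    """Aggregate exact-match accuracy per (language, level) group.
    Each record is (language, level, prediction, gold_answer)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for lang, level, pred, gold in records:
        totals[(lang, level)] += 1
        hits[(lang, level)] += exact_match(pred, gold)
    return {key: hits[key] / totals[key] for key in totals}

records = [
    ("en", 1, "5 years", "5 years"),
    ("en", 1, "Python ", "python"),
    ("de", 3, "ja", "nein"),
]
print(score_by_group(records))  # {('en', 1): 1.0, ('de', 3): 0.0}
```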
Practical Implications
- Recruitment automation: Companies can use JobResQA to benchmark their in‑house LLMs before deploying résumé screening or JD matching bots, ensuring they meet language‑specific quality thresholds.
- Fairness & compliance: The controlled demographic attributes make it easy to run bias checks (e.g., “Does the model favor male candidates for senior roles?”) and align with GDPR‑style privacy requirements.
- Product roadmap: The stark performance gap for non‑English languages signals a need for targeted multilingual fine‑tuning or hybrid pipelines (e.g., translate‑then‑answer) in global HR SaaS platforms.
- Cost‑effective localization: The TEaR translation workflow demonstrates a scalable way to create high‑quality multilingual training data without the expense of full human translation—useful for any product needing localized QA datasets.
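The controlled placeholders make a minimal bias probe straightforward to sketch: render the same résumé with each value of one demographic attribute and check whether a screening model's verdict changes. The `<GENDER>` placeholder and the toy screener are hypothetical stand‑ins for the benchmark's attributes and a real LLM call:

```python
def bias_probe(screen_fn, template: str, attribute_values):
    """Render the same resume template with each value of a controlled
    attribute and collect the screening model's verdicts. A fair model
    should return the same verdict for every variant."""
    verdicts = {}
    for value in attribute_values:
        verdicts[value] = screen_fn(template.replace("<GENDER>", value))
    consistent = len(set(verdicts.values())) == 1
    return verdicts, consistent

# Stand-in for a real LLM screening call (assumption for this sketch).
def toy_screener(resume: str) -> str:
    return "qualified"  # dummy verdict, independent of the variant

template = "Candidate (<GENDER>), 8 years as a data engineer."
verdicts, consistent = bias_probe(toy_screener, template, ["male", "female"])
print(consistent)  # True: the dummy screener treats all variants identically
```

Swapping in an actual model for `toy_screener` turns this into the kind of fairness audit the benchmark is designed to support.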
Limitations & Future Work
- Synthetic nature – Although grounded in real résumés, the data is still synthetic; edge‑case language use or industry‑specific jargon may be under‑represented.
- Evaluation reliance on LLM‑as‑judge – The scoring model itself can inherit biases; human validation on a subset would strengthen reliability.
- Scope of languages – Only five languages were covered; expanding to low‑resource languages (e.g., Arabic, Hindi) is a natural next step.
- Dynamic HR contexts – Real‑time job market changes (new skill terms, remote‑work terminology) are not captured; periodic dataset refreshes are needed.
JobResQA opens the door for more transparent, fair, and multilingual HR AI systems—making it a valuable resource for developers building the next generation of recruitment tools.
Authors
- Casimiro Pio Carrino
- Paula Estrella
- Rabih Zbib
- Carlos Escolano
- José A. R. Fonollosa
Paper Information
- arXiv ID: 2601.23183v1
- Categories: cs.CL
- Published: January 30, 2026