[Paper] JobResQA: A Benchmark for LLM Machine Reading Comprehension on Multilingual Résumés and JDs
Source: arXiv - 2601.23183v1
Overview
The paper presents JobResQA, a new multilingual benchmark that tests how well large language models (LLMs) can read and understand résumé‑job‑description (JD) pairs. By covering five languages and three difficulty levels, the dataset shines a light on the current strengths and blind spots of LLM‑driven HR tools, especially when it comes to privacy‑preserving data and fairness analysis.
Key Contributions
- A multilingual HR‑focused QA benchmark – 581 questions over 105 synthetic résumé‑JD pairs in English, Spanish, Italian, German, and Chinese.
- Three-tiered question complexity – ranging from simple fact extraction to cross‑document reasoning that mimics real recruiter queries.
- Privacy‑first data generation pipeline – de‑identifies real résumés, injects controllable demographic and professional placeholders, and synthesizes realistic content.
- Cost‑effective human‑in‑the‑loop translation (TEaR) – combines machine translation, MQM error annotation, and selective post‑editing to produce high‑quality parallel data.
- Baseline evaluation using “LLM‑as‑judge” – shows strong performance on English/Spanish but steep drops for Italian, German, and Chinese, exposing multilingual gaps.
- Open‑source release – full dataset, generation scripts, and evaluation code are publicly available for reproducibility.
Methodology
- Source Collection & De‑identification – Real résumés and JDs were stripped of personal identifiers.
- Synthetic Pair Creation – Placeholders (e.g., <AGE>, <ROLE>) were inserted to control demographics and job titles, then filled with realistic values using rule‑based generators.
- Question Design – Domain experts authored questions at three levels:
- Level 1: Direct facts (e.g., “How many years of experience does the candidate have?”)
- Level 2: Intra‑document inference (e.g., “Which skill is mentioned most often?”)
- Level 3: Cross‑document reasoning (e.g., “Is the candidate qualified for the senior data‑engineer role?”)
- Multilingual Translation (TEaR) – Machine translation produced initial drafts; annotators applied MQM (Multidimensional Quality Metrics) to flag errors, followed by targeted post‑editing only where the error score exceeded a threshold.
- Evaluation Framework – Several open‑weight LLM families (e.g., Llama‑2, Mistral, BLOOM) were prompted to answer the questions. An LLM‑as‑judge model scored answer correctness, enabling a language‑agnostic performance snapshot.
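The placeholder‑driven pair creation can be sketched as a simple template fill. The placeholder names and value pools below are illustrative assumptions; the paper's actual rule‑based generators are richer than this:

```python
import random
import re

# Hypothetical value pools -- stand-ins for the paper's rule-based generators.
POOLS = {
    "AGE": ["29", "34", "41"],
    "ROLE": ["data engineer", "backend developer"],
}

def fill_placeholders(template: str, seed: int = 0) -> str:
    """Replace each <NAME> placeholder with a value drawn from its pool."""
    rng = random.Random(seed)  # seeded for reproducible synthetic pairs
    def repl(match: re.Match) -> str:
        return rng.choice(POOLS[match.group(1)])
    return re.sub(r"<([A-Z]+)>", repl, template)

resume = "Candidate, age <AGE>, applying for a <ROLE> position."
print(fill_placeholders(resume))
```

Because the attribute values are injected programmatically, the same template can later be re‑rendered with different demographics for bias audits.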
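TEaR's selective post‑editing can be approximated as a threshold filter over MQM error scores: only segments whose weighted error score exceeds a cutoff go to human editors. The severity weights and threshold below are invented for illustration; the summary does not specify the actual values:

```python
# Illustrative MQM severity weights and cutoff (assumptions, not the paper's).
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}
POST_EDIT_THRESHOLD = 5

def mqm_score(errors: list) -> int:
    """Sum the severity weights of the errors annotated on one segment."""
    return sum(SEVERITY_WEIGHTS[sev] for sev in errors)

def needs_post_edit(errors: list) -> bool:
    """Route a machine-translated segment to human post-editing only
    when its MQM error score exceeds the threshold."""
    return mqm_score(errors) > POST_EDIT_THRESHOLD

segments = [[], ["minor"], ["major", "minor"], ["critical"]]
print([needs_post_edit(e) for e in segments])  # [False, False, True, True]
```

This is what makes the workflow cost‑effective: clean machine‑translated segments are accepted as‑is, and human effort is spent only on flagged ones.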
Results & Findings
- English & Spanish: Average exact‑match scores > 70 % for Level 1 and ~55 % for Level 3, indicating solid factual and reasoning abilities.
- Italian, German, Chinese: Scores dropped by 20‑35 % across all levels, with Level 3 often falling below 30 %.
- Cross‑language transfer: Models fine‑tuned on English data performed only marginally better on other languages, suggesting limited multilingual generalization.
- Bias detection: The placeholder‑driven design allowed the authors to surface subtle gender and seniority biases in model outputs, confirming the benchmark’s utility for fairness audits.
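The per‑language, per‑level exact‑match numbers above can be reproduced with a small aggregator. The case/whitespace normalization here is an assumption about how exact match is computed, not a detail stated in the paper:

```python
from collections import defaultdict

def exact_match(pred: str, gold: str) -> bool:
    """Case- and whitespace-insensitive exact match (assumed normalization)."""
    return pred.strip().lower() == gold.strip().lower()

def score_by_group(records):
    """Aggregate exact-match accuracy per (language, level) group.
    Each record is (language, level, prediction, gold_answer)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for lang, level, pred, gold in records:
        totals[(lang, level)] += 1
        hits[(lang, level)] += exact_match(pred, gold)
    return {key: hits[key] / totals[key] for key in totals}

records = [
    ("en", 1, "5 years", "5 years"),
    ("en", 1, "Python ", "python"),
    ("de", 3, "ja", "nein"),
]
print(score_by_group(records))  # {('en', 1): 1.0, ('de', 3): 0.0}
```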
Practical Implications
- Recruitment automation: Companies can use JobResQA to benchmark their in‑house LLMs before deploying résumé screening or JD matching bots, ensuring they meet language‑specific quality thresholds.
- Fairness & compliance: The controlled demographic attributes make it easy to run bias checks (e.g., “Does the model favor male candidates for senior roles?”) and align with GDPR‑style privacy requirements.
- Product roadmap: The stark performance gap for non‑English languages signals a need for targeted multilingual fine‑tuning or hybrid pipelines (e.g., translate‑then‑answer) in global HR SaaS platforms.
- Cost‑effective localization: The TEaR translation workflow demonstrates a scalable way to create high‑quality multilingual training data without the expense of full human translation—useful for any product needing localized QA datasets.
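The controlled placeholders make a minimal bias probe straightforward to sketch: render the same résumé with each value of one demographic attribute and check whether a screening model's verdict changes. The `<GENDER>` placeholder and the toy screener are hypothetical stand‑ins for the benchmark's attributes and a real LLM call:

```python
def bias_probe(screen_fn, template: str, attribute_values):
    """Render the same resume template with each value of a controlled
    attribute and collect the screening model's verdicts. A fair model
    should return the same verdict for every variant."""
    verdicts = {}
    for value in attribute_values:
        verdicts[value] = screen_fn(template.replace("<GENDER>", value))
    consistent = len(set(verdicts.values())) == 1
    return verdicts, consistent

# Stand-in for a real LLM screening call (assumption for this sketch).
def toy_screener(resume: str) -> str:
    return "qualified"  # dummy verdict, independent of the variant

template = "Candidate (<GENDER>), 8 years as a data engineer."
verdicts, consistent = bias_probe(toy_screener, template, ["male", "female"])
print(consistent)  # True: the dummy screener treats all variants identically
```

Swapping in an actual model for `toy_screener` turns this into the kind of fairness audit the benchmark is designed to support.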
Limitations & Future Work
- Synthetic nature – Although grounded in real résumés, the data is still synthetic; edge‑case language use or industry‑specific jargon may be under‑represented.
- Evaluation reliance on LLM‑as‑judge – The scoring model itself can inherit biases; human validation on a subset would strengthen reliability.
- Scope of languages – Only five languages were covered; expanding to low‑resource languages (e.g., Arabic, Hindi) is a natural next step.
- Dynamic HR contexts – Real‑time job market changes (new skill terms, remote‑work terminology) are not captured; periodic dataset refreshes are needed.
JobResQA opens the door for more transparent, fair, and multilingual HR AI systems—making it a valuable resource for developers building the next generation of recruitment tools.
Authors
- Casimiro Pio Carrino
- Paula Estrella
- Rabih Zbib
- Carlos Escolano
- José A. R. Fonollosa
Paper Information
- arXiv ID: 2601.23183v1
- Categories: cs.CL
- Published: January 30, 2026