[Paper] JobResQA: A Benchmark for LLM Machine Reading Comprehension on Multilingual Résumés and JDs

Published: January 30, 2026 at 12:06 PM EST
4 min read
Source: arXiv - 2601.23183v1

Overview

The paper presents JobResQA, a new multilingual benchmark that tests how well large language models (LLMs) can read and understand résumé‑job‑description (JD) pairs. By covering five languages and three difficulty levels, the dataset shines a light on the current strengths and blind spots of LLM‑driven HR tools, especially when it comes to privacy‑preserving data and fairness analysis.

Key Contributions

  • A multilingual HR‑focused QA benchmark – 581 questions over 105 synthetic résumé‑JD pairs in English, Spanish, Italian, German, and Chinese.
  • Three-tiered question complexity – ranging from simple fact extraction to cross‑document reasoning that mimics real recruiter queries.
  • Privacy‑first data generation pipeline – de‑identifies real résumés, injects controllable demographic and professional placeholders, and synthesizes realistic content.
  • Cost‑effective human‑in‑the‑loop translation (TEaR) – combines machine translation, MQM error annotation, and selective post‑editing to produce high‑quality parallel data.
  • Baseline evaluation using “LLM‑as‑judge” – shows strong performance on English/Spanish but steep drops for Italian, German, and Chinese, exposing multilingual gaps.
  • Open‑source release – full dataset, generation scripts, and evaluation code are publicly available for reproducibility.

Methodology

  1. Source Collection & De‑identification – Real résumés and JDs were stripped of personal identifiers.
  2. Synthetic Pair Creation – Placeholders (e.g., <AGE>, <ROLE>) were inserted to control demographics and job titles, then filled with realistic values using rule‑based generators.
  3. Question Design – Domain experts authored questions at three levels:
    • Level 1: Direct facts (e.g., “How many years of experience does the candidate have?”)
    • Level 2: Intra‑document inference (e.g., “Which skill is mentioned most often?”)
    • Level 3: Cross‑document reasoning (e.g., “Is the candidate qualified for the senior data‑engineer role?”)
  4. Multilingual Translation (TEaR) – Machine translation produced initial drafts; annotators applied MQM (Multidimensional Quality Metrics) to flag errors, followed by targeted post‑editing only where the error score exceeded a threshold.
  5. Evaluation Framework – Several open‑weight LLM families (e.g., Llama‑2, Mistral, Bloom) were prompted to answer the questions. An LLM‑as‑judge model scored answer correctness, enabling a language‑agnostic performance snapshot.
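Step 2 above can be sketched in a few lines. The template, slot names, and value pools below are illustrative assumptions, not the authors' actual generation code; the key idea is that filling a placeholder both produces realistic text and records a gold label for later question answering and bias analysis.

```python
import random

# Hypothetical de-identified template with controllable slots, in the
# spirit of the paper's <AGE>/<ROLE> placeholders (values are invented).
TEMPLATE = "Candidate, <AGE>, applying for <ROLE> with <YOE> years of experience."

SLOT_VALUES = {
    "<AGE>": ["29", "34", "41"],
    "<ROLE>": ["data engineer", "HR analyst"],
    "<YOE>": ["3", "7", "12"],
}

def fill_template(template: str, rng: random.Random) -> dict:
    """Fill every placeholder and keep the chosen values as gold labels."""
    gold = {}
    text = template
    for slot, values in SLOT_VALUES.items():
        choice = rng.choice(values)
        gold[slot] = choice
        text = text.replace(slot, choice)
    return {"text": text, "gold": gold}

pair = fill_template(TEMPLATE, random.Random(0))
```

Because every slot value is recorded, Level 1 questions (e.g., years of experience) come with free gold answers, and demographic slots can later be varied independently for fairness audits.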
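The LLM-as-judge scoring in step 5 can be sketched as follows. The judge prompt, verdict format, and `call_llm` stand-in are assumptions for illustration; the paper's exact prompting setup may differ.

```python
# Hedged sketch of "LLM-as-judge" correctness scoring. `call_llm` is a
# placeholder for whatever inference API is available.
JUDGE_PROMPT = (
    "Question: {q}\nReference answer: {ref}\nModel answer: {ans}\n"
    "Reply with exactly CORRECT or INCORRECT."
)

def judge_answer(call_llm, question: str, reference: str, answer: str) -> bool:
    """Ask the judge model whether the candidate answer matches the reference."""
    verdict = call_llm(JUDGE_PROMPT.format(q=question, ref=reference, ans=answer))
    return verdict.strip().upper().startswith("CORRECT")

def accuracy(call_llm, items):
    """Score a list of (question, reference, model_answer) triples."""
    correct = sum(judge_answer(call_llm, q, r, a) for q, r, a in items)
    return correct / len(items)

# Toy judge that just checks exact string equality, to exercise the plumbing:
fake_llm = lambda p: "CORRECT" if "Reference answer: 7\nModel answer: 7" in p else "INCORRECT"
score = accuracy(fake_llm, [("Years of experience?", "7", "7"),
                            ("Top skill?", "SQL", "Python")])
```

An LLM judge sidesteps exact-match brittleness across languages, which is why the paper can report a language-agnostic snapshot; its own biases are a limitation the authors note.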

Results & Findings

  • English & Spanish: Average exact‑match scores > 70 % for Level 1 and ~55 % for Level 3, indicating solid factual and reasoning abilities.
  • Italian, German, Chinese: Scores dropped by 20‑35 % across all levels, with Level 3 often falling below 30 %.
  • Cross‑language transfer: Models fine‑tuned on English data performed only marginally better on other languages, suggesting limited multilingual generalization.
  • Bias detection: The placeholder‑driven design allowed the authors to surface subtle gender and seniority biases in model outputs, confirming the benchmark’s utility for fairness audits.
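The bias-detection finding rests on counterfactual pairs the placeholder design makes cheap: two documents identical except for one demographic slot. A minimal sketch, with an invented template and a deliberately biased toy screener (`screen` stands in for any model under audit):

```python
def counterfactual_pair(template: str, slot: str, a: str, b: str):
    """Two documents that differ only in one demographic placeholder."""
    return template.replace(slot, a), template.replace(slot, b)

def flags_bias(screen, template: str, slot: str, a: str, b: str) -> bool:
    """True if the model's verdict flips when only the demographic slot changes."""
    doc_a, doc_b = counterfactual_pair(template, slot, a, b)
    return screen(doc_a) != screen(doc_b)

template = "<NAME>, 8 years of experience, applying for Senior Data Engineer."
# Toy screener with an obvious name bias, for demonstration only.
biased_screen = lambda doc: "qualified" if doc.startswith("John") else "not qualified"
flagged = flags_bias(biased_screen, template, "<NAME>", "John Smith", "Jane Smith")
```

Since everything except the audited attribute is held fixed, any verdict flip is attributable to that attribute rather than to confounding résumé content.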

Practical Implications

  • Recruitment automation: Companies can use JobResQA to benchmark their in‑house LLMs before deploying résumé screening or JD matching bots, ensuring they meet language‑specific quality thresholds.
  • Fairness & compliance: The controlled demographic attributes make it easy to run bias checks (e.g., “Does the model favor male candidates for senior roles?”) and align with GDPR‑style privacy requirements.
  • Product roadmap: The stark performance gap for non‑English languages signals a need for targeted multilingual fine‑tuning or hybrid pipelines (e.g., translate‑then‑answer) in global HR SaaS platforms.
  • Cost‑effective localization: The TEaR translation workflow demonstrates a scalable way to create high‑quality multilingual training data without the expense of full human translation—useful for any product needing localized QA datasets.
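The cost saving in a TEaR-style workflow comes from routing only the worst machine-translated segments to human post-editors. The field names and threshold below are assumptions, not the paper's exact setup; the point is the triage logic.

```python
MQM_THRESHOLD = 5.0  # hypothetical cutoff; higher MQM score = more severe errors

def route_segments(segments):
    """Split MT segments into (keep_as_is, needs_post_edit) by MQM error score."""
    keep, post_edit = [], []
    for seg in segments:
        (post_edit if seg["mqm_score"] > MQM_THRESHOLD else keep).append(seg)
    return keep, post_edit

drafts = [
    {"id": 1, "mqm_score": 1.0},   # minor issues: ship the MT draft as-is
    {"id": 2, "mqm_score": 12.5},  # severe errors: send to a human editor
]
keep, post_edit = route_segments(drafts)
```

Human effort then scales with the MT system's error rate rather than with corpus size, which is what makes the approach attractive for localizing QA datasets.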

Limitations & Future Work

  • Synthetic nature – Although grounded in real résumés, the data is still synthetic; edge‑case language use or industry‑specific jargon may be under‑represented.
  • Evaluation reliance on LLM‑as‑judge – The scoring model itself can inherit biases; human validation on a subset would strengthen reliability.
  • Scope of languages – Only five languages were covered; expanding to low‑resource languages (e.g., Arabic, Hindi) is a natural next step.
  • Dynamic HR contexts – Real‑time job market changes (new skill terms, remote‑work terminology) are not captured; periodic dataset refreshes are needed.

JobResQA opens the door for more transparent, fair, and multilingual HR AI systems—making it a valuable resource for developers building the next generation of recruitment tools.

Authors

  • Casimiro Pio Carrino
  • Paula Estrella
  • Rabih Zbib
  • Carlos Escolano
  • José A. R. Fonollosa

Paper Information

  • arXiv ID: 2601.23183v1
  • Categories: cs.CL
  • Published: January 30, 2026
  • PDF: Download PDF