[Paper] RoParQ: Paraphrase-Aware Alignment of Large Language Models Towards Robustness to Paraphrased Questions
Source: arXiv - 2511.21568v1
Overview
Large Language Models (LLMs) still stumble when the same question is re‑phrased, revealing that they often latch onto surface wording instead of true meaning. The paper introduces RoParQ, a benchmark that measures how consistently LLMs answer paraphrased multiple‑choice questions, and proposes a fine‑tuning recipe that makes models far more robust to such variations.
Key Contributions
- RoParQ benchmark – a curated set of closed‑book multiple‑choice QA items with multiple paraphrased variants, selected to expose inconsistency in a “judge” model.
- XParaCon metric – a simple, interpretable statistic (standard deviation of accuracies across paraphrase groups) that quantifies cross‑paraphrase robustness.
- Paraphrase‑aware Supervised Fine‑Tuning (SFT) – a reasoning‑centric training regime that explicitly teaches the model to produce the same answer regardless of surface wording.
- Empirical evidence that lightweight, fine‑tuned models can match or surpass the consistency of much larger, off‑the‑shelf LLMs.
Methodology
- Data creation – Starting from existing QA datasets (e.g., RACE, ARC), the authors used proprietary paraphrase generators to produce several re‑phrasings of each question.
- Inconsistency filtering – A separate “judge” LLM evaluated each variant; only questions whose variants produced noticeably different judge confidence were kept, ensuring the benchmark focuses on cases where surface wording alone sways the model.
- Metric design (XParaCon) – For each original question, the accuracies of all its paraphrases are computed; the standard deviation across these accuracies becomes the robustness score (lower = more consistent). A small computational sketch follows this list.
- Paraphrase‑aware SFT – During fine‑tuning, each training example includes all its paraphrases together with a shared target answer. The loss encourages the model to produce identical logits for every variant, effectively aligning its internal reasoning to the underlying semantics rather than the wording (see the loss sketch after this list).
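To make the metric concrete, here is a minimal sketch of an XParaCon‑style computation under one plausible reading of the description above: each question's per‑paraphrase accuracies are reduced to a standard deviation, and these are averaged over the benchmark. The function name, data layout, and the mean aggregation are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of an XParaCon-style robustness score (assumed reading).
# `results` maps each original question id to a list of per-paraphrase
# accuracies (1.0 if that paraphrase was answered correctly, else 0.0).
import numpy as np

def xparacon(results: dict[str, list[float]]) -> float:
    """Average per-question std. dev. of accuracy across paraphrases.

    Lower values mean the model answers rewordings of the same question
    more consistently (0.0 = perfectly consistent).
    """
    per_question_std = [np.std(accs) for accs in results.values() if accs]
    return float(np.mean(per_question_std))

# Toy example: two questions, three paraphrases each.
results = {
    "q1": [1.0, 1.0, 0.0],   # inconsistent across paraphrases
    "q2": [1.0, 1.0, 1.0],   # fully consistent
}
print(f"XParaCon: {xparacon(results):.3f}")
```

Similarly, the paraphrase‑aware SFT objective can be pictured as a standard supervised loss plus a consistency term over a question's paraphrase group. The sketch below combines cross‑entropy with a KL pull toward the group's mean answer distribution; the paper's exact formulation may differ, and every name here is hypothetical.

```python
# Hypothetical paraphrase-aware SFT objective (assumed formulation).
import torch
import torch.nn.functional as F

def paraphrase_aware_loss(choice_logits: torch.Tensor,
                          gold_idx: int,
                          consistency_weight: float = 1.0) -> torch.Tensor:
    """choice_logits: (num_paraphrases, num_choices) answer logits for
    every paraphrase of one question; gold_idx: shared gold answer index."""
    targets = torch.full((choice_logits.size(0),), gold_idx,
                         dtype=torch.long, device=choice_logits.device)
    # Supervised term: every paraphrase must predict the shared gold choice.
    ce = F.cross_entropy(choice_logits, targets)

    # Consistency term: pull each paraphrase's answer distribution toward
    # the group's (detached) mean distribution, penalising the model for
    # answering differently across rewordings.
    log_probs = F.log_softmax(choice_logits, dim=-1)
    mean_probs = log_probs.exp().mean(dim=0, keepdim=True).detach()
    kl = F.kl_div(log_probs, mean_probs.expand_as(log_probs),
                  reduction="batchmean")
    return ce + consistency_weight * kl
```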
Results & Findings
- Baseline inconsistency – Off‑the‑shelf LLMs (e.g., GPT‑3.5, LLaMA‑13B) showed XParaCon scores around 0.12–0.15, indicating noticeable variance across paraphrases.
- After SFT – Fine‑tuned LLaMA‑7B achieved an XParaCon of 0.04, a ~70% reduction in variance, while maintaining comparable overall accuracy.
- Size vs. consistency trade‑off – A 1.3B‑parameter model, after paraphrase‑aware SFT, matched the consistency of a 13B model with no fine‑tuning, suggesting that targeted training can compensate for raw model size.
- Reasoning prompts – Adding chain‑of‑thought style explanations during SFT further lowered variance, confirming that explicit reasoning helps the model focus on semantics.
Practical Implications
- More reliable chatbots & assistants – Users often rephrase queries; a model trained with RoParQ‑style alignment will give stable answers, reducing confusion and support tickets.
- Robust evaluation pipelines – Developers can adopt XParaCon as a quick sanity check for any new LLM deployment, catching brittleness before release (see the sketch after this list).
- Cost‑effective scaling – Smaller models can be fine‑tuned to reach the consistency of larger, more expensive APIs, enabling on‑premise or edge deployments with predictable behavior.
- Improved downstream tasks – Tasks that rely on QA consistency (e.g., automated grading, knowledge‑base extraction) benefit from fewer false negatives caused by paraphrase noise.
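As a concrete illustration of that sanity check, the snippet below wires the `xparacon` helper sketched in the Methodology section into a hypothetical pre‑release gate. The `evaluate_choice` hook, the dataset layout, and the 0.05 threshold are placeholder assumptions, not values from the paper.

```python
def evaluate_choice(model, question: str, choices: list[str]) -> int:
    """Placeholder: return the index of the option the model selects.
    Wire this up to whatever inference stack you actually deploy."""
    raise NotImplementedError

def passes_consistency_gate(model, paraphrase_groups,
                            threshold: float = 0.05) -> bool:
    """paraphrase_groups: iterable of (paraphrases, choices, gold_idx) tuples,
    where `paraphrases` are rewordings of the same underlying question."""
    results = {}
    for qid, (paraphrases, choices, gold_idx) in enumerate(paraphrase_groups):
        results[str(qid)] = [
            1.0 if evaluate_choice(model, q, choices) == gold_idx else 0.0
            for q in paraphrases
        ]
    score = xparacon(results)  # helper defined in the Methodology sketch
    return score <= threshold
```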
Limitations & Future Work
- Paraphrase generation reliance – The benchmark depends on proprietary paraphrasing models; diversity may be limited compared to human‑written variations.
- Closed‑book focus – RoParQ evaluates only multiple‑choice QA without external retrieval; extending to open‑ended or retrieval‑augmented settings remains open.
- Metric simplicity – XParaCon captures variance but not systematic bias (e.g., consistently wrong answers across paraphrases). Future metrics could combine consistency with correctness.
- Scalability of SFT – While effective for medium‑size models, applying the same fine‑tuning regime to the largest LLMs may require more compute and careful regularization to avoid over‑fitting.
Bottom line: By explicitly training LLMs to treat paraphrased inputs as semantically identical, RoParQ and its paraphrase‑aware fine‑tuning recipe give developers a practical path to more dependable AI assistants—without needing to chase ever‑larger model sizes.
Authors
- Minjoon Choi
Paper Information
- arXiv ID: 2511.21568v1
- Categories: cs.CL
- Published: November 26, 2025