[Paper] RoParQ: Paraphrase-Aware Alignment of Large Language Models Towards Robustness to Paraphrased Questions

Published: November 26, 2025
Source: arXiv - 2511.21568v1

Overview

Large Language Models (LLMs) still stumble when the same question is re‑phrased, revealing that they often latch onto surface wording instead of true meaning. The paper introduces RoParQ, a benchmark that measures how consistently LLMs answer paraphrased multiple‑choice questions, and proposes a fine‑tuning recipe that makes models far more robust to such variations.

Key Contributions

  • RoParQ benchmark – a curated set of closed‑book multiple‑choice QA items with multiple paraphrased variants, selected to expose inconsistency in a “judge” model.
  • XParaCon metric – a simple, interpretable statistic (the standard deviation of a model's accuracy across the paraphrases of each question) that quantifies cross‑paraphrase robustness.
  • Paraphrase‑aware Supervised Fine‑Tuning (SFT) – a reasoning‑centric training regime that explicitly teaches the model to produce the same answer regardless of surface wording.
  • Empirical evidence that lightweight, fine‑tuned models can match or surpass the consistency of much larger, off‑the‑shelf LLMs.

Methodology

  1. Data creation – Starting from existing QA datasets (e.g., RACE, ARC), the authors used proprietary paraphrase generators to produce several re‑phrasings of each question.
  2. Inconsistency filtering – A separate “judge” LLM evaluated each variant; only questions whose variants drew significantly varying judge confidence were kept, so the benchmark concentrates on items where wording alone changes the model’s behavior (a filtering sketch follows this list).
  3. Metric design (XParaCon) – For each original question, the accuracies of all its paraphrases are computed; the standard deviation across these accuracies becomes the robustness score (lower = more consistent; see the XParaCon sketch below).
  4. Paraphrase‑aware SFT – During fine‑tuning, each training example includes all its paraphrases together with a shared target answer. The loss encourages the model to produce identical logits for every variant, effectively aligning its internal reasoning to the underlying semantics rather than the wording (see the loss sketch below).
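
A minimal sketch of the inconsistency-filtering step. The `judge_answer` helper is hypothetical (it stands in for a call to the judge LLM), and the simple answer-disagreement criterion here is a simplification of the paper's confidence-based selection:

```python
# Hypothetical filter: keep a question only if the judge's answer changes
# across its paraphrased variants.
def judge_answer(question_text: str, options: list[str]) -> str:
    """Return the judge model's chosen option for one question variant."""
    raise NotImplementedError  # plug in your judge-LLM call here

def keep_question(paraphrases: list[str], options: list[str]) -> bool:
    """True if the judge disagrees with itself across paraphrases."""
    answers = {judge_answer(p, options) for p in paraphrases}
    return len(answers) > 1  # disagreement => wording-sensitive item
```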
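
A sketch of how the XParaCon score could be computed from evaluation results; aggregating the per-question standard deviations with a simple mean is an assumption, not the paper's stated formula:

```python
import numpy as np

def xparacon(per_question_accs: dict[str, list[float]]) -> float:
    """per_question_accs maps each original question id to the accuracy the
    model achieved on each paraphrased variant (1.0/0.0 for a single pass,
    or a rate over several samples). Lower return value = more consistent."""
    group_stds = [float(np.std(accs)) for accs in per_question_accs.values()]
    return float(np.mean(group_stds))

# Example: the model flips its answer on one variant of q1 but is stable on q2.
print(xparacon({"q1": [1.0, 0.0, 1.0], "q2": [1.0, 1.0, 1.0]}))
```

Run over a paraphrase-grouped evaluation set, the same routine doubles as the quick sanity check suggested under Practical Implications below.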
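
One way to realize "the same answer regardless of wording" as a training objective. The paper describes a reasoning-centric SFT recipe; the KL-based consistency term below is an illustrative assumption, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def paraphrase_group_loss(answer_logits: torch.Tensor,
                          target_ids: torch.Tensor,
                          consistency_weight: float = 0.1) -> torch.Tensor:
    """answer_logits: [num_paraphrases, answer_len, vocab] logits each variant
    assigns to the shared gold answer; target_ids: [answer_len] gold token ids."""
    num_vars, _, vocab = answer_logits.shape

    # Standard supervised loss, applied to every paraphrase variant.
    ce = F.cross_entropy(answer_logits.reshape(-1, vocab),
                         target_ids.repeat(num_vars))

    # Consistency term: pull each variant's answer distribution toward the
    # group average, so all paraphrases converge on the same prediction.
    log_probs = F.log_softmax(answer_logits, dim=-1)
    mean_probs = log_probs.exp().mean(dim=0, keepdim=True).expand_as(log_probs)
    consistency = F.kl_div(log_probs, mean_probs, reduction="batchmean")

    return ce + consistency_weight * consistency
```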

Results & Findings

  • Baseline inconsistency – Off‑the‑shelf LLMs (e.g., GPT‑3.5, LLaMA‑13B) showed XParaCon scores around 0.12–0.15, indicating noticeable variance across paraphrases.
  • After SFT – Fine‑tuned LLaMA‑7B achieved an XParaCon of 0.04, a ~70 % reduction in variance, while maintaining comparable overall accuracy.
  • Size vs. consistency trade‑off – A 1.3 B parameter model, after paraphrase‑aware SFT, matched the consistency of a 13 B model with no fine‑tuning, suggesting that targeted training can compensate for raw model size.
  • Reasoning prompts – Adding chain‑of‑thought style explanations during SFT further lowered variance, confirming that explicit reasoning helps the model focus on semantics.

Practical Implications

  • More reliable chatbots & assistants – Users often rephrase queries; a model trained with RoParQ‑style alignment will give stable answers, reducing confusion and support tickets.
  • Robust evaluation pipelines – Developers can adopt XParaCon as a quick sanity check for any new LLM deployment, catching brittleness before release.
  • Cost‑effective scaling – Smaller models can be fine‑tuned to reach the consistency of larger, more expensive APIs, enabling on‑premise or edge deployments with predictable behavior.
  • Improved downstream tasks – Tasks that rely on QA consistency (e.g., automated grading, knowledge‑base extraction) benefit from fewer false negatives caused by paraphrase noise.

Limitations & Future Work

  • Paraphrase generation reliance – The benchmark depends on proprietary paraphrasing models; diversity may be limited compared to human‑written variations.
  • Closed‑book focus – RoParQ evaluates only multiple‑choice QA without external retrieval; extending to open‑ended or retrieval‑augmented settings remains open.
  • Metric simplicity – XParaCon captures variance but not systematic bias (e.g., consistently wrong answers across paraphrases). Future metrics could combine consistency with correctness.
  • Scalability of SFT – While effective for medium‑size models, applying the same fine‑tuning regime to the largest LLMs may require more compute and careful regularization to avoid over‑fitting.

Bottom line: By explicitly training LLMs to treat paraphrased inputs as semantically identical, RoParQ and its paraphrase‑aware fine‑tuning recipe give developers a practical path to more dependable AI assistants—without needing to chase ever‑larger model sizes.

Authors

  • Minjoon Choi

Paper Information

  • arXiv ID: 2511.21568v1
  • Categories: cs.CL
  • Published: November 26, 2025