[Paper] RoParQ: Paraphrase-Aware Alignment of Large Language Models Towards Robustness to Paraphrased Questions

Published: November 26, 2025
Source: arXiv - 2511.21568v1

Overview

Large Language Models (LLMs) still stumble when the same question is re‑phrased, revealing that they often latch onto surface wording instead of true meaning. The paper introduces RoParQ, a benchmark that measures how consistently LLMs answer paraphrased multiple‑choice questions, and proposes a fine‑tuning recipe that makes models far more robust to such variations.

Key Contributions

  • RoParQ benchmark – a curated set of closed‑book multiple‑choice QA items with multiple paraphrased variants, selected to expose inconsistency in a “judge” model.
  • XParaCon metric – a simple, interpretable statistic (the standard deviation of a model's accuracy across the paraphrases of each question) that quantifies cross‑paraphrase robustness.
  • Paraphrase‑aware Supervised Fine‑Tuning (SFT) – a reasoning‑centric training regime that explicitly teaches the model to produce the same answer regardless of surface wording.
  • Empirical evidence that lightweight, fine‑tuned models can match or surpass the consistency of much larger, off‑the‑shelf LLMs.

Methodology

  1. Data creation – Starting from existing QA datasets (e.g., RACE, ARC), the authors used proprietary paraphrase generators to produce several re‑phrasings of each question.
  2. Inconsistency filtering – A separate “judge” LLM evaluated each variant; only questions whose variants drew significantly varying judge confidence were kept, so the benchmark concentrates on items where wording alone changes the model’s behavior (a filtering sketch follows this list).
  3. Metric design (XParaCon) – For each original question, the accuracies of all its paraphrases are computed; the standard deviation across these accuracies becomes the robustness score (lower = more consistent; see the XParaCon sketch below).
  4. Paraphrase‑aware SFT – During fine‑tuning, each training example includes all its paraphrases together with a shared target answer. The loss encourages the model to produce identical logits for every variant, effectively aligning its internal reasoning to the underlying semantics rather than the wording (see the loss sketch below).
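
A minimal sketch of the inconsistency-filtering step. The `judge_answer` helper is hypothetical (it stands in for a call to the judge LLM), and the simple answer-disagreement criterion here is a simplification of the paper's confidence-based selection:

```python
# Hypothetical filter: keep a question only if the judge's answer changes
# across its paraphrased variants.
def judge_answer(question_text: str, options: list[str]) -> str:
    """Return the judge model's chosen option for one question variant."""
    raise NotImplementedError  # plug in your judge-LLM call here

def keep_question(paraphrases: list[str], options: list[str]) -> bool:
    """True if the judge disagrees with itself across paraphrases."""
    answers = {judge_answer(p, options) for p in paraphrases}
    return len(answers) > 1  # disagreement => wording-sensitive item
```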
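
A sketch of how the XParaCon score could be computed from evaluation results; aggregating the per-question standard deviations with a simple mean is an assumption, not the paper's stated formula:

```python
import numpy as np

def xparacon(per_question_accs: dict[str, list[float]]) -> float:
    """per_question_accs maps each original question id to the accuracy the
    model achieved on each paraphrased variant (1.0/0.0 for a single pass,
    or a rate over several samples). Lower return value = more consistent."""
    group_stds = [float(np.std(accs)) for accs in per_question_accs.values()]
    return float(np.mean(group_stds))

# Example: the model flips its answer on one variant of q1 but is stable on q2.
print(xparacon({"q1": [1.0, 0.0, 1.0], "q2": [1.0, 1.0, 1.0]}))
```

Run over a paraphrase-grouped evaluation set, the same routine doubles as the quick sanity check suggested under Practical Implications below.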
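
One way to realize "the same answer regardless of wording" as a training objective. The paper describes a reasoning-centric SFT recipe; the KL-based consistency term below is an illustrative assumption, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def paraphrase_group_loss(answer_logits: torch.Tensor,
                          target_ids: torch.Tensor,
                          consistency_weight: float = 0.1) -> torch.Tensor:
    """answer_logits: [num_paraphrases, answer_len, vocab] logits each variant
    assigns to the shared gold answer; target_ids: [answer_len] gold token ids."""
    num_vars, _, vocab = answer_logits.shape

    # Standard supervised loss, applied to every paraphrase variant.
    ce = F.cross_entropy(answer_logits.reshape(-1, vocab),
                         target_ids.repeat(num_vars))

    # Consistency term: pull each variant's answer distribution toward the
    # group average, so all paraphrases converge on the same prediction.
    log_probs = F.log_softmax(answer_logits, dim=-1)
    mean_probs = log_probs.exp().mean(dim=0, keepdim=True).expand_as(log_probs)
    consistency = F.kl_div(log_probs, mean_probs, reduction="batchmean")

    return ce + consistency_weight * consistency
```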

Results & Findings

  • Baseline inconsistency – Off‑the‑shelf LLMs (e.g., GPT‑3.5, LLaMA‑13B) showed XParaCon scores around 0.12–0.15, indicating noticeable variance across paraphrases.
  • After SFT – Fine‑tuned LLaMA‑7B achieved an XParaCon of 0.04, a ~70 % reduction in variance, while maintaining comparable overall accuracy.
  • Size vs. consistency trade‑off – A 1.3 B parameter model, after paraphrase‑aware SFT, matched the consistency of a 13 B model with no fine‑tuning, suggesting that targeted training can compensate for raw model size.
  • Reasoning prompts – Adding chain‑of‑thought style explanations during SFT further lowered variance, confirming that explicit reasoning helps the model focus on semantics.

Practical Implications

  • More reliable chatbots & assistants – Users often rephrase queries; a model trained with RoParQ‑style alignment will give stable answers, reducing confusion and support tickets.
  • Robust evaluation pipelines – Developers can adopt XParaCon as a quick sanity check for any new LLM deployment, catching brittleness before release.
  • Cost‑effective scaling – Smaller models can be fine‑tuned to reach the consistency of larger, more expensive APIs, enabling on‑premise or edge deployments with predictable behavior.
  • Improved downstream tasks – Tasks that rely on QA consistency (e.g., automated grading, knowledge‑base extraction) benefit from fewer false negatives caused by paraphrase noise.

Limitations & Future Work

  • Paraphrase generation reliance – The benchmark depends on proprietary paraphrasing models; diversity may be limited compared to human‑written variations.
  • Closed‑book focus – RoParQ evaluates only multiple‑choice QA without external retrieval; extending to open‑ended or retrieval‑augmented settings remains open.
  • Metric simplicity – XParaCon captures variance but not systematic bias (e.g., consistently wrong answers across paraphrases). Future metrics could combine consistency with correctness.
  • Scalability of SFT – While effective for medium‑size models, applying the same fine‑tuning regime to the largest LLMs may require more compute and careful regularization to avoid over‑fitting.

Bottom line: By explicitly training LLMs to treat paraphrased inputs as semantically identical, RoParQ and its paraphrase‑aware fine‑tuning recipe give developers a practical path to more dependable AI assistants—without needing to chase ever‑larger model sizes.

Authors

  • Minjoon Choi

Paper Information

  • arXiv ID: 2511.21568v1
  • Categories: cs.CL
  • Published: November 26, 2025