[Paper] *-PLUIE: Personalisable metric with Llm Used for Improved Evaluation
Source: arXiv - 2602.15778v1
Overview
The paper presents *-PLUIE, a family of task‑specific, prompt‑driven metrics that extend the earlier ParaPLUIE approach for evaluating generated text. By leveraging large language models (LLMs) as “confidence estimators” rather than full‑blown judges, the authors achieve human‑level correlation with far lower compute costs—making automatic evaluation more practical for everyday development pipelines.
Key Contributions
- Personalised prompting: Introduces a set of *‑PLUIE variants that tailor the underlying prompt to the target task (e.g., summarisation, translation, dialogue).
- Confidence‑only inference: Uses the LLM’s probability of a “Yes/No” answer to gauge quality, avoiding costly token‑by‑token generation.
- Empirical validation: Demonstrates that *‑PLUIE correlates more strongly with human judgments than the original ParaPLUIE and several popular reference‑free metrics (e.g., BERTScore, COMET‑Q).
- Efficiency analysis: Shows up to 5× speed‑up and 80 % reduction in GPU memory usage compared to conventional LLM‑as‑judge pipelines.
- Open‑source toolkit: Releases a lightweight Python library and a set of ready‑made prompts for common NLG tasks.
Methodology
- Base metric (ParaPLUIE) – Starts from a perplexity‑based score that asks an LLM a binary “Is this output acceptable?” question and reads the model’s confidence (softmax probability) without generating any text.
- Prompt personalisation – For each downstream task, the authors craft a short, task‑aware prompt (e.g., “Given the source article, does the summary capture the main points?”). The prompt is concatenated with the source‑output pair and fed to the LLM.
- Score extraction – The LLM returns logits for “Yes” and “No”. The Yes‑confidence is taken as the quality score, optionally normalised across a validation set.
- Evaluation protocol – They benchmark *‑PLUIE on three public NLG datasets (XSum summarisation, WMT‑21 translation, and MultiWOZ dialogue). Human ratings (Likert‑scale) serve as the gold standard. Correlation metrics (Pearson, Spearman, Kendall‑τ) quantify alignment.
- Efficiency measurement – Runtime and GPU memory are recorded for *‑PLUIE, ParaPLUIE, and a full LLM‑judge (e.g., GPT‑4 with chain‑of‑thought prompting).
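The core of the method — concatenating a task‑aware prompt with the source‑output pair and reading only the "Yes"/"No" logits — can be sketched as follows. This is an illustrative reconstruction, not the released toolkit: the prompt wordings and function names are assumptions, and in practice the two logits would come from a causal LLM's output distribution at the answer position.

```python
import math

# Hypothetical task-aware prompts, loosely paraphrasing the paper's examples;
# the exact wording in the released library may differ.
PROMPTS = {
    "summarization": "Given the source article, does the summary capture the main points?",
    "translation": "Does the translation preserve the meaning of the source sentence?",
}

def build_prompt(task: str, source: str, hypothesis: str) -> str:
    """Concatenate the task-aware question with the source-output pair."""
    return f"{PROMPTS[task]}\nSource: {source}\nOutput: {hypothesis}\nAnswer:"

def yes_confidence(yes_logit: float, no_logit: float) -> float:
    """Softmax over the 'Yes'/'No' logits; the Yes probability is the score.

    Subtracting the max logit first keeps the exponentials numerically stable.
    """
    m = max(yes_logit, no_logit)
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)
```

Because no tokens are generated — only a single forward pass and a two-way softmax — the cost per example is a fraction of a full LLM‑judge run, which is where the reported speed‑ups come from.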
Results & Findings
| Task | Metric | Pearson r (vs. human) | Speedup vs. LLM‑judge |
|---|---|---|---|
| Summarisation (XSum) | *‑PLUIE‑Sum | 0.71 | 4.8× |
| Translation (WMT‑21) | *‑PLUIE‑Trans | 0.68 | 5.2× |
| Dialogue (MultiWOZ) | *‑PLUIE‑Dial | 0.73 | 5.0× |
| All tasks | ParaPLUIE (baseline) | 0.58–0.62 | — |
| All tasks | BERTScore | 0.55–0.60 | — |
| All tasks | COMET‑Q | 0.62–0.66 | — |
- Higher correlation: All *‑PLUIE variants beat both the original ParaPLUIE and strong reference‑free baselines.
- Consistent across domains: The improvement holds for both content‑heavy (summaries) and form‑heavy (translation) tasks.
- Low variance: Confidence scores are stable across multiple random seeds, indicating robustness to prompt wording.
- Resource savings: Even when using a 13B LLM, *‑PLUIE runs under 0.2 s per example on a single RTX 3090, compared to >1 s for a full LLM‑judge.
Practical Implications
- Rapid prototyping: Developers can plug *‑PLUIE into CI pipelines to get instant feedback on model outputs without incurring the cost of full LLM inference.
- Model‑agnostic evaluation: Since the metric only needs the source and hypothesis, it works for any generator, including proprietary or open‑source models.
- Fine‑grained monitoring: The binary confidence can be thresholded to flag low‑quality generations for human review, enabling semi‑automated quality control.
- Cost‑effective scaling: Teams can evaluate millions of generated sentences nightly on a modest GPU cluster, freeing up budget for model training.
- Custom prompt creation: The open‑source library exposes a simple API, e.g. `pluie.evaluate(task="summarization", source=source, hypothesis=hypothesis)`, and encourages domain experts to craft their own prompts for niche applications (e.g., code generation, data‑to‑text).
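The thresholding workflow mentioned above — flagging low‑confidence generations for human review — amounts to a one‑line filter over the scores. A minimal sketch (the function name and default threshold are illustrative, not part of the released API):

```python
def flag_for_review(scores, threshold=0.5):
    """Return the indices of generations whose Yes-confidence falls below
    the threshold, so only those examples go to human reviewers."""
    return [i for i, score in enumerate(scores) if score < threshold]
```

Tuning the threshold on a small human‑rated validation set trades review workload against the risk of letting low‑quality outputs through.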
Limitations & Future Work
- Prompt sensitivity: While the authors report low variance, extreme re‑phrasings can still shift confidence scores; systematic prompt‑search methods are needed.
- LLM dependency: The metric’s quality hinges on the underlying LLM’s knowledge; outdated or domain‑specific LLMs may underperform.
- Binary framing: Reducing evaluation to a Yes/No confidence may miss nuanced errors (e.g., factual hallucinations that are partially correct).
- Future directions: The authors plan to explore multi‑class confidence (e.g., “Excellent/Good/Fair/Poor”), integrate retrieval‑augmented prompts for factual tasks, and benchmark on low‑resource languages.
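The multi‑class extension the authors propose could reuse the same confidence‑only machinery: softmax over the four label logits, then take the expectation of the label values as a graded score. A speculative sketch under that assumption (label set and normalisation are not specified in the paper):

```python
import math

# Assumed ordinal mapping for the four labels mentioned in the paper.
LABEL_VALUES = {"Excellent": 3, "Good": 2, "Fair": 1, "Poor": 0}

def expected_quality(label_logits):
    """Softmax over the four label logits, then the expected label value,
    rescaled to [0, 1] by dividing by the maximum value (3)."""
    m = max(label_logits.values())
    exps = {k: math.exp(v - m) for k, v in label_logits.items()}
    z = sum(exps.values())
    return sum(LABEL_VALUES[k] * e / z for k, e in exps.items()) / 3.0
```

Unlike the binary Yes‑confidence, this expectation distinguishes "Fair" from "Poor", which could help with the nuanced errors the binary framing misses.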
*‑PLUIE demonstrates that a well‑designed, task‑aware prompt can turn a large language model into a lightweight, high‑fidelity evaluator—opening the door for more sustainable and developer‑friendly NLG quality assessment.
Authors
- Quentin Lemesle
- Léane Jourdan
- Daisy Munson
- Pierre Alain
- Jonathan Chevelu
- Arnaud Delhay
- Damien Lolive
Paper Information
- arXiv ID: 2602.15778v1
- Categories: cs.CL
- Published: February 17, 2026