[Paper] *-PLUIE: Personalisable metric with Llm Used for Improved Evaluation
Source: arXiv - 2602.15778v1
Overview
The paper presents *-PLUIE, a family of task‑specific, prompt‑driven metrics that extend the earlier ParaPLUIE approach for evaluating generated text. By leveraging large language models (LLMs) as “confidence estimators” rather than full‑blown judges, the authors achieve human‑level correlation with far lower compute costs—making automatic evaluation more practical for everyday development pipelines.
Key Contributions
- Personalised prompting: Introduces a set of *‑PLUIE variants that tailor the underlying prompt to the target task (e.g., summarisation, translation, dialogue).
- Confidence‑only inference: Uses the LLM’s probability of a “Yes/No” answer to gauge quality, avoiding costly token‑by‑token generation.
- Empirical validation: Demonstrates that *‑PLUIE correlates more strongly with human judgments than the original ParaPLUIE and several popular reference‑free metrics (e.g., BERTScore, COMET‑Q).
- Efficiency analysis: Shows up to 5× speed‑up and 80 % reduction in GPU memory usage compared to conventional LLM‑as‑judge pipelines.
- Open‑source toolkit: Releases a lightweight Python library and a set of ready‑made prompts for common NLG tasks.
Methodology
- Base metric (ParaPLUIE) – Starts from a perplexity‑based score that asks an LLM a binary “Is this output acceptable?” question and reads the model’s confidence (softmax probability) without generating any text.
- Prompt personalisation – For each downstream task, the authors craft a short, task‑aware prompt (e.g., “Given the source article, does the summary capture the main points?”). The prompt is concatenated with the source‑output pair and fed to the LLM.
- Score extraction – The LLM returns logits for “Yes” and “No”. The Yes‑confidence is taken as the quality score, optionally normalised across a validation set.
- Evaluation protocol – They benchmark *‑PLUIE on three public NLG datasets (XSum summarisation, WMT‑21 translation, and MultiWOZ dialogue). Human ratings (Likert‑scale) serve as the gold standard. Correlation metrics (Pearson, Spearman, Kendall‑τ) quantify alignment.
- Efficiency measurement – Runtime and GPU memory are recorded for *‑PLUIE, ParaPLUIE, and a full LLM‑judge (e.g., GPT‑4 with chain‑of‑thought prompting).
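The core of the method — concatenating a task‑aware prompt with the source‑output pair and reading only the "Yes"/"No" logits — can be sketched as follows. This is an illustrative reconstruction, not the released toolkit: the prompt wordings and function names are assumptions, and in practice the two logits would come from a causal LLM's output distribution at the answer position.

```python
import math

# Hypothetical task-aware prompts, loosely paraphrasing the paper's examples;
# the exact wording in the released library may differ.
PROMPTS = {
    "summarization": "Given the source article, does the summary capture the main points?",
    "translation": "Does the translation preserve the meaning of the source sentence?",
}

def build_prompt(task: str, source: str, hypothesis: str) -> str:
    """Concatenate the task-aware question with the source-output pair."""
    return f"{PROMPTS[task]}\nSource: {source}\nOutput: {hypothesis}\nAnswer:"

def yes_confidence(yes_logit: float, no_logit: float) -> float:
    """Softmax over the 'Yes'/'No' logits; the Yes probability is the score.

    Subtracting the max logit first keeps the exponentials numerically stable.
    """
    m = max(yes_logit, no_logit)
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)
```

Because no tokens are generated — only a single forward pass and a two-way softmax — the cost per example is a fraction of a full LLM‑judge run, which is where the reported speed‑ups come from.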
Results & Findings
| Task | Metric | Pearson r (vs. human) | Speedup vs. LLM‑judge |
|---|---|---|---|
| Summarisation (XSum) | *‑PLUIE‑Sum | 0.71 | 4.8× |
| Translation (WMT‑21) | *‑PLUIE‑Trans | 0.68 | 5.2× |
| Dialogue (MultiWOZ) | *‑PLUIE‑Dial | 0.73 | 5.0× |
| All tasks | ParaPLUIE (baseline) | 0.58–0.62 | — |
| All tasks | BERTScore | 0.55–0.60 | — |
| All tasks | COMET‑Q | 0.62–0.66 | — |
- Higher correlation: All *‑PLUIE variants beat both the original ParaPLUIE and strong reference‑free baselines.
- Consistent across domains: The improvement holds for both content‑heavy (summaries) and form‑heavy (translation) tasks.
- Low variance: Confidence scores are stable across multiple random seeds, indicating robustness to prompt wording.
- Resource savings: Even when using a 13B LLM, *‑PLUIE runs under 0.2 s per example on a single RTX 3090, compared to >1 s for a full LLM‑judge.
Practical Implications
- Rapid prototyping: Developers can plug *‑PLUIE into CI pipelines to get instant feedback on model outputs without incurring the cost of full LLM inference.
- Model‑agnostic evaluation: Since the metric only needs the source and hypothesis, it works for any generator, including proprietary or open‑source models.
- Fine‑grained monitoring: The binary confidence can be thresholded to flag low‑quality generations for human review, enabling semi‑automated quality control.
- Cost‑effective scaling: Teams can evaluate millions of generated sentences nightly on a modest GPU cluster, freeing up budget for model training.
- Custom prompt creation: The open‑source library exposes a simple API, e.g. `pluie.evaluate(task="summarization", source=source, hypothesis=hypothesis)`, and encourages domain experts to craft their own prompts for niche applications (e.g., code generation, data‑to‑text).
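The thresholding workflow mentioned above — flagging low‑confidence generations for human review — amounts to a one‑line filter over the scores. A minimal sketch (the function name and default threshold are illustrative, not part of the released API):

```python
def flag_for_review(scores, threshold=0.5):
    """Return the indices of generations whose Yes-confidence falls below
    the threshold, so only those examples go to human reviewers."""
    return [i for i, score in enumerate(scores) if score < threshold]
```

Tuning the threshold on a small human‑rated validation set trades review workload against the risk of letting low‑quality outputs through.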
Limitations & Future Work
- Prompt sensitivity: While the authors report low variance, extreme re‑phrasings can still shift confidence scores; systematic prompt‑search methods are needed.
- LLM dependency: The metric’s quality hinges on the underlying LLM’s knowledge; outdated or domain‑specific LLMs may underperform.
- Binary framing: Reducing evaluation to a Yes/No confidence may miss nuanced errors (e.g., factual hallucinations that are partially correct).
- Future directions: The authors plan to explore multi‑class confidence (e.g., “Excellent/Good/Fair/Poor”), integrate retrieval‑augmented prompts for factual tasks, and benchmark on low‑resource languages.
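The multi‑class extension the authors propose could reuse the same confidence‑only machinery: softmax over the four label logits, then take the expectation of the label values as a graded score. A speculative sketch under that assumption (label set and normalisation are not specified in the paper):

```python
import math

# Assumed ordinal mapping for the four labels mentioned in the paper.
LABEL_VALUES = {"Excellent": 3, "Good": 2, "Fair": 1, "Poor": 0}

def expected_quality(label_logits):
    """Softmax over the four label logits, then the expected label value,
    rescaled to [0, 1] by dividing by the maximum value (3)."""
    m = max(label_logits.values())
    exps = {k: math.exp(v - m) for k, v in label_logits.items()}
    z = sum(exps.values())
    return sum(LABEL_VALUES[k] * e / z for k, e in exps.items()) / 3.0
```

Unlike the binary Yes‑confidence, this expectation distinguishes "Fair" from "Poor", which could help with the nuanced errors the binary framing misses.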
*‑PLUIE demonstrates that a well‑designed, task‑aware prompt can turn a large language model into a lightweight, high‑fidelity evaluator—opening the door for more sustainable and developer‑friendly NLG quality assessment.
Authors
- Quentin Lemesle
- Léane Jourdan
- Daisy Munson
- Pierre Alain
- Jonathan Chevelu
- Arnaud Delhay
- Damien Lolive
Paper Information
- arXiv ID: 2602.15778v1
- Categories: cs.CL
- Published: February 17, 2026