[Paper] Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process
Source: arXiv - 2512.23213v1
Overview
The paper introduces LLM‑PeerReview, an unsupervised ensemble technique that treats a collection of large language models (LLMs) like a panel of reviewers. By scoring, reasoning about, and finally selecting the best answer from multiple candidate responses, the method consistently outperforms strong baselines on a variety of tasks—without any task‑specific fine‑tuning.
Key Contributions
- Peer‑review inspired ensemble – A three‑stage pipeline (scoring → reasoning → selection) that mimics academic peer review, offering a transparent decision process.
- LLM‑as‑a‑Judge – Reuses the same LLMs that generate answers to also evaluate them, eliminating the need for external judges or labeled data.
- Two reasoning strategies – (1) A principled graphical‑model truth inference algorithm; (2) A lightweight score‑averaging scheme, both fully unsupervised.
- Strong empirical gains – Across four benchmark datasets, the approach outperforms the recent Smoothie‑Global ensemble by 6.9 % and 7.3 % in absolute terms, depending on the variant.
- Model‑agnostic and plug‑and‑play – Works with any set of LLMs, making it easy to integrate into existing pipelines.
Methodology
- Generate Candidates
  - For each user query, feed it to a pool of diverse LLMs (e.g., GPT‑4, Claude, LLaMA‑2).
  - Collect the generated answers as the candidate set.
- Scoring (LLM‑as‑a‑Judge)
  - Each LLM in the pool is prompted to rate every candidate against a predefined rubric (e.g., relevance, correctness, fluency).
  - The rubric is phrased as a short instruction, so the model can produce a numeric score (0‑10) or a categorical label.
- Reasoning / Score Aggregation
  - Graphical‑model truth inference: treat the scores as noisy observations of an unknown “true quality” and run an Expectation‑Maximization‑style algorithm to infer a posterior quality estimate for each candidate.
  - Simple averaging: compute the mean of all scores for each candidate (a fast baseline); a minimal sketch of this variant appears after this section.
- Selection
  - Pick the candidate with the highest aggregated score as the final output.
The whole pipeline requires no labeled training data; the only supervision comes from the LLMs’ own internal knowledge when they act as judges.
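The paper's prompts and inference code are not reproduced in this summary, so the following is only a minimal Python sketch of the pipeline under the simple‑averaging variant. The `query_llm` callback, the rubric wording, and the score‑parsing regex are illustrative assumptions, not the authors' implementation.

```python
import re
import statistics
from typing import Callable, List

# Hypothetical helper: sends a prompt to a named LLM and returns its text reply.
# Any provider SDK or local inference server could back this callback.
QueryFn = Callable[[str, str], str]  # (model_name, prompt) -> response text

RUBRIC_PROMPT = (
    "You are reviewing an answer to a user question.\n"
    "Question: {question}\n"
    "Candidate answer: {answer}\n"
    "Rate the answer for relevance, correctness, and fluency.\n"
    "Reply with a single integer score from 0 to 10."
)

def generate_candidates(question: str, models: List[str], query_llm: QueryFn) -> List[str]:
    """Stage 1: collect one candidate answer per model in the pool."""
    return [query_llm(m, question) for m in models]

def score_candidates(question: str, candidates: List[str],
                     judges: List[str], query_llm: QueryFn) -> List[List[float]]:
    """Stage 2: every judge rates every candidate; returns scores[judge][candidate]."""
    scores = []
    for judge in judges:
        row = []
        for answer in candidates:
            reply = query_llm(judge, RUBRIC_PROMPT.format(question=question, answer=answer))
            match = re.search(r"\d+(\.\d+)?", reply)  # pull the first number out of the reply
            row.append(float(match.group()) if match else 0.0)
        scores.append(row)
    return scores

def select_by_average(candidates: List[str], scores: List[List[float]]) -> str:
    """Stages 3-4 (simple-averaging variant): mean score per candidate, pick the argmax."""
    means = [statistics.mean(col) for col in zip(*scores)]
    return candidates[means.index(max(means))]
```

In practice the same model list would typically be passed as both `models` and `judges`, which is exactly the reuse that the LLM‑as‑a‑Judge stage relies on.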
Results & Findings
| Dataset | Baseline (Smoothie‑Global) | LLM‑PeerReview (Graphical) | LLM‑PeerReview (Avg) |
|---|---|---|---|
| TriviaQA | 71.2 % | 78.1 % (+6.9) | 77.8 % |
| Open‑Domain QA | 68.5 % | 75.8 % (+7.3) | 75.4 % |
| Code Generation | 62.0 % | 68.3 % | 68.5 % |
| Summarization | 73.4 % | 79.0 % | 78.6 % |
- The graphical‑model variant consistently edges out the simple averaging version, confirming that modeling inter‑judge reliability adds value; a minimal sketch of reliability‑weighted aggregation follows this list.
- Even when the candidate pool includes weaker models, the ensemble still selects high‑quality answers, demonstrating robustness to heterogeneous model strengths.
- Ablation studies show that using multiple judges (instead of a single one) yields a 3–5 % boost, highlighting the benefit of collective evaluation.
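This summary does not spell out the paper's graphical‑model algorithm, so the snippet below is only an illustrative stand‑in for reliability‑weighted aggregation: a simple Gaussian‑noise EM scheme in which each judge carries an unknown variance, candidate qualities are re‑estimated as precision‑weighted means, and judge variances are re‑fit from the residuals. It conveys the flavor of modeling inter‑judge reliability rather than the authors' exact model.

```python
import numpy as np

def em_truth_inference(scores: np.ndarray, iters: int = 20, eps: float = 1e-6) -> np.ndarray:
    """Illustrative reliability-weighted aggregation (not the paper's exact algorithm).

    scores has shape (num_judges, num_candidates); the return value is an
    estimated "true quality" per candidate.
    """
    num_judges, _ = scores.shape
    variance = np.ones(num_judges)  # start by trusting every judge equally
    for _ in range(iters):
        # E-step: precision-weighted mean of the judges' scores for each candidate.
        weights = 1.0 / (variance + eps)  # reliable (low-variance) judges weigh more
        quality = (weights[:, None] * scores).sum(axis=0) / weights.sum()
        # M-step: re-estimate each judge's noise from its disagreement with the consensus.
        residuals = scores - quality[None, :]
        variance = (residuals ** 2).mean(axis=1)
    return quality

# Toy example: the third judge disagrees with the other two and gets down-weighted.
scores = np.array([[6.0, 8.0, 5.0],
                   [7.0, 9.0, 6.0],
                   [9.0, 2.0, 8.0]])
print(int(np.argmax(em_truth_inference(scores))))  # -> 1, whereas plain averaging picks 0
```

On the toy matrix, plain averaging would pick the first candidate, whereas down‑weighting the dissenting third judge flips the choice to the second one, which is the kind of correction a reliability model is meant to provide.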
Practical Implications
- Plug‑and‑play improvement for existing LLM services – SaaS platforms can wrap their current model APIs with a lightweight peer‑review layer to boost answer quality without retraining.
- Cost‑effective reliability – By reusing the same LLMs for both generation and evaluation, developers avoid paying for separate evaluation models or large labeled datasets.
- Dynamic model selection – The framework naturally adapts when new LLMs become available; they can be added to the candidate pool and instantly contribute to both generation and scoring.
- Safety & bias mitigation – The scoring stage can incorporate additional rubric items (e.g., “does the response contain harmful content?”), allowing the ensemble to filter out risky outputs before selection; a brief illustration follows this list.
- Explainability – Because each judge produces a score and optionally a short justification, developers can surface “why this answer was chosen” to end‑users, a valuable feature for compliance‑heavy domains.
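As a concrete, entirely illustrative version of that rubric extension (nothing below is prescribed by the paper), the judge prompt could carry a second, safety‑focused item, and candidates whose aggregated harm score exceeds a threshold could be dropped before the selection step:

```python
from typing import Dict, List

# Hypothetical rubric with an extra safety item; both wordings are illustrative.
RUBRIC_ITEMS: Dict[str, str] = {
    "quality": "Rate the answer's relevance, correctness, and fluency from 0 to 10.",
    "harm":    "Rate from 0 (harmless) to 10 how likely the answer is to cause harm.",
}

def filter_risky(candidates: List[str], harm_scores: List[float],
                 threshold: float = 3.0) -> List[str]:
    """Drop candidates whose aggregated harm score exceeds the threshold,
    so the final selection only runs over the remaining answers."""
    return [c for c, h in zip(candidates, harm_scores) if h <= threshold]
```

Since the judges are already queried for quality scores, the extra rubric item could be answered in the same call, keeping the added cost small.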
Limitations & Future Work
- Computational overhead – Scoring every candidate with multiple LLMs multiplies inference cost; for latency‑sensitive applications, batching or model distillation may be required.
- Judge quality variance – If the pool contains only similarly weak models, the peer‑review process cannot magically create a strong answer. The method assumes at least one competent generator.
- Prompt design sensitivity – The rubric prompt heavily influences scoring consistency; poorly phrased prompts can lead to noisy scores.
- Future directions suggested by the authors include:
  - learning adaptive weighting schemes for judges,
  - integrating external factual verification tools into the scoring stage,
  - exploring hierarchical ensembles where the peer‑review process itself is cascaded across multiple rounds.
Authors
- Zhijun Chen
- Zeyu Ji
- Qianren Mao
- Junhang Cheng
- Bangjie Qin
- Hao Wu
- Zhuoran Li
- Jingzheng Li
- Kai Sun
- Zizhe Wang
- Yikun Ban
- Zhu Sun
- Xiangyang Ji
- Hailong Sun
Paper Information
- arXiv ID: 2512.23213v1
- Categories: cs.CL, cs.AI
- Published: December 29, 2025
- PDF: https://arxiv.org/pdf/2512.23213v1