[Paper] Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

Published: December 29, 2025 at 12:25 AM EST
4 min read
Source: arXiv - 2512.23213v1

Overview

The paper introduces LLM‑PeerReview, an unsupervised ensemble technique that treats a collection of large language models (LLMs) like a panel of reviewers. By scoring candidate responses, reasoning over those scores, and finally selecting the best answer, the method consistently outperforms strong baselines on a variety of tasks, without any task‑specific fine‑tuning.

Key Contributions

  • Peer‑review inspired ensemble – A three‑stage pipeline (scoring → reasoning → selection) that mimics academic peer review, offering a transparent decision process.
  • LLM‑as‑a‑Judge – Reuses the same LLMs that generate answers to also evaluate them, eliminating the need for external judges or labeled data.
  • Two reasoning strategies – (1) A principled graphical‑model truth inference algorithm; (2) A lightweight score‑averaging scheme, both fully unsupervised.
  • Strong empirical gains – Across four benchmark datasets, the approach outperforms the recent Smoothie‑Global ensemble, with absolute improvements of 6.9 % and 7.3 % depending on the variant.
  • Model‑agnostic and plug‑and‑play – Works with any set of LLMs, making it easy to integrate into existing pipelines.

Methodology

  1. Generate Candidates

    • For each user query, feed it to a pool of diverse LLMs (e.g., GPT‑4, Claude, LLaMA‑2).
    • Collect the generated answers as the candidate set.
  2. Scoring (LLM‑as‑a‑Judge)

    • Each LLM is prompted to rate every candidate on a predefined rubric (e.g., relevance, correctness, fluency).
    • The rubric is phrased as a short instruction, so the model can produce a numeric score (0‑10) or a categorical label.
  3. Reasoning / Score Aggregation

    • Graphical‑model truth inference: Treat scores as noisy observations of an unknown “true quality” and run an Expectation‑Maximization style algorithm to infer a posterior quality estimate for each candidate.
    • Simple averaging: Compute the mean of all scores for each candidate (a fast baseline).
  4. Selection

    • Pick the candidate with the highest aggregated score as the final output (a minimal sketch of the full pipeline follows this list).
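
The summary does not reproduce the paper's exact prompts or implementation. The sketch below is a minimal, hypothetical Python illustration of the score‑averaging variant, assuming a generic `query_llm(model, prompt)` helper that wraps whatever LLM API is available; the rubric wording is likewise an assumption.

```python
import re
from statistics import mean

# Hypothetical helper: send a prompt to a named model and return its text reply.
# Any LLM client (hosted API or local model) can be plugged in here.
def query_llm(model: str, prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM API of choice")

RUBRIC = (
    "Rate the following answer to the question on a 0-10 scale for "
    "relevance, correctness, and fluency. Reply with a single number.\n"
    "Question: {question}\nAnswer: {answer}\nScore:"
)

def peer_review_select(question: str, models: list[str]) -> str:
    """Generate candidates with every model, have every model score every
    candidate, and return the candidate with the highest mean score."""
    # 1. Generate candidates.
    candidates = [query_llm(m, question) for m in models]

    # 2. Scoring: each model acts as a judge for each candidate.
    scores = []  # scores[i] holds the judge scores for candidate i
    for answer in candidates:
        judge_scores = []
        for judge in models:
            reply = query_llm(judge, RUBRIC.format(question=question, answer=answer))
            match = re.search(r"\d+(\.\d+)?", reply)
            judge_scores.append(float(match.group()) if match else 0.0)
        scores.append(judge_scores)

    # 3. Reasoning (simple averaging variant) + 4. Selection.
    best = max(range(len(candidates)), key=lambda i: mean(scores[i]))
    return candidates[best]
```

Only the aggregation in step 3 differs between the two variants; an illustrative sketch of the graphical‑model alternative follows the next paragraph.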

The whole pipeline requires no labeled training data; the only supervision comes from the LLMs’ own internal knowledge when they act as judges.
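
The paper's exact graphical model is not detailed in this summary. As a rough, assumption‑laden sketch in the spirit of EM‑based truth inference, the aggregation below alternates between estimating each candidate's quality as a reliability‑weighted mean of judge scores and re‑estimating each judge's reliability from how closely it tracks those quality estimates; it is not the authors' algorithm.

```python
import numpy as np

def truth_inference(scores: np.ndarray, n_iters: int = 20) -> np.ndarray:
    """Illustrative EM-style aggregation (not the paper's exact algorithm).

    scores[i, j] is the score judge j gave to candidate i. Returns one
    quality estimate per candidate; judges that agree with the consensus
    are up-weighted, noisy judges are down-weighted."""
    reliability = np.ones(scores.shape[1])  # start with uniform judge weights
    for _ in range(n_iters):
        # E-step (roughly): candidate quality as a reliability-weighted mean.
        quality = scores @ reliability / reliability.sum()
        # M-step (roughly): judge reliability from deviation to current quality.
        residuals = ((scores - quality[:, None]) ** 2).mean(axis=0)
        reliability = 1.0 / (residuals + 1e-6)
    return quality

# Usage: select the candidate with the highest inferred quality.
# scores = np.array([[7, 8, 6], [9, 9, 8], [4, 5, 3]])  # 3 candidates x 3 judges
# best = int(np.argmax(truth_inference(scores)))
```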

Results & Findings

| Dataset | Baseline (Smoothie‑Global) | LLM‑PeerReview (Graphical) | LLM‑PeerReview (Avg) |
| --- | --- | --- | --- |
| TriviaQA | 71.2 % | 78.1 % (+6.9) | 77.8 % |
| Open‑Domain QA | 68.5 % | 75.8 % (+7.3) | 75.4 % |
| Code Generation | 62.0 % | 68.3 % | 68.5 % |
| Summarization | 73.4 % | 79.0 % | 78.6 % |

  • The graphical‑model variant consistently edges out the simple averaging version, confirming that modeling inter‑judge reliability adds value.
  • Even when the candidate pool includes weaker models, the ensemble still selects high‑quality answers, demonstrating robustness to heterogeneous model strengths.
  • Ablation studies show that using multiple judges (instead of a single one) yields a 3–5 % boost, highlighting the benefit of collective evaluation.

Practical Implications

  1. Plug‑and‑play improvement for existing LLM services – SaaS platforms can wrap their current model APIs with a lightweight peer‑review layer to boost answer quality without retraining.
  2. Cost‑effective reliability – By reusing the same LLMs for both generation and evaluation, developers avoid paying for separate evaluation models or large labeled datasets.
  3. Dynamic model selection – The framework naturally adapts when new LLMs become available; they can be added to the candidate pool and instantly contribute to both generation and scoring.
  4. Safety & bias mitigation – The scoring stage can incorporate additional rubric items (e.g., “does the response contain harmful content?”), allowing the ensemble to filter out risky outputs before selection (see the sketch after this list).
  5. Explainability – Because each judge produces a score and optionally a short justification, developers can surface “why this answer was chosen” to end‑users, a valuable feature for compliance‑heavy domains.
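
As an illustration of point 4, one hypothetical way to add such a safety check is to pose an extra yes/no rubric question to the judges and drop flagged candidates before selection. The prompt wording and the `filter_unsafe` helper below are assumptions, not taken from the paper; `query_llm` is the same hypothetical client helper used in the earlier sketch.

```python
SAFETY_RUBRIC = (
    "Does the following answer contain harmful, unsafe, or policy-violating "
    "content? Reply with only YES or NO.\nAnswer: {answer}"
)

def filter_unsafe(candidates, judges, query_llm):
    """Keep only candidates that no judge flags as harmful (hypothetical filter)."""
    safe = []
    for answer in candidates:
        flagged = any(
            "YES" in query_llm(judge, SAFETY_RUBRIC.format(answer=answer)).upper()
            for judge in judges
        )
        if not flagged:
            safe.append(answer)
    return safe or candidates  # fall back to all candidates if everything is flagged
```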

Limitations & Future Work

  • Computational overhead – Scoring every candidate with multiple LLMs multiplies inference cost; for latency‑sensitive applications, batching or model distillation may be required.
  • Judge quality variance – If the pool contains only similarly weak models, the peer‑review process cannot magically create a strong answer. The method assumes at least one competent generator.
  • Prompt design sensitivity – The rubric prompt heavily influences scoring consistency; poorly phrased prompts can lead to noisy scores.
  • Future directions suggested by the authors include:
    1. Learning adaptive weighting schemes for judges,
    2. Integrating external factual verification tools into the scoring stage,
    3. Exploring hierarchical ensembles where the peer‑review process itself is cascaded across multiple rounds.

Authors

  • Zhijun Chen
  • Zeyu Ji
  • Qianren Mao
  • Junhang Cheng
  • Bangjie Qin
  • Hao Wu
  • Zhuoran Li
  • Jingzheng Li
  • Kai Sun
  • Zizhe Wang
  • Yikun Ban
  • Zhu Sun
  • Xiangyang Ji
  • Hailong Sun

Paper Information

  • arXiv ID: 2512.23213v1
  • Categories: cs.CL, cs.AI
  • Published: December 29, 2025