[Paper] Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process
Source: arXiv - 2512.23213v1
Overview
The paper introduces LLM‑PeerReview, an unsupervised ensemble technique that treats a collection of large language models (LLMs) like a panel of reviewers. By scoring, reasoning about, and finally selecting the best answer from multiple candidate responses, the method consistently outperforms strong baselines on a variety of tasks—without any task‑specific fine‑tuning.
Key Contributions
- Peer‑review inspired ensemble – A three‑stage pipeline (scoring → reasoning → selection) that mimics academic peer review, offering a transparent decision process.
- LLM‑as‑a‑Judge – Reuses the same LLMs that generate answers to also evaluate them, eliminating the need for external judges or labeled data.
- Two reasoning strategies – (1) A principled graphical‑model truth inference algorithm; (2) A lightweight score‑averaging scheme, both fully unsupervised.
- Strong empirical gains – Across four benchmark datasets, the approach outperforms the recent Smoothie‑Global ensemble by 6.9 % and 7.3 % in absolute terms, depending on the variant.
- Model‑agnostic and plug‑and‑play – Works with any set of LLMs, making it easy to integrate into existing pipelines.
Methodology
- Generate Candidates
  - For each user query, feed it to a pool of diverse LLMs (e.g., GPT‑4, Claude, LLaMA‑2).
  - Collect the generated answers as the candidate set.
- Scoring (LLM‑as‑a‑Judge)
  - Each LLM in the pool is prompted to rate every candidate against a predefined rubric (e.g., relevance, correctness, fluency).
  - The rubric is phrased as a short instruction, so the model can produce a numeric score (0‑10) or a categorical label.
- Reasoning / Score Aggregation
  - Graphical‑model truth inference: treat the scores as noisy observations of an unknown “true quality” and run an Expectation‑Maximization‑style algorithm to infer a posterior quality estimate for each candidate.
  - Simple averaging: compute the mean of all scores for each candidate (a fast baseline); a minimal sketch of this variant appears after this section.
- Selection
  - Pick the candidate with the highest aggregated score as the final output.
The whole pipeline requires no labeled training data; the only supervision comes from the LLMs’ own internal knowledge when they act as judges.
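The paper's prompts and inference code are not reproduced in this summary, so the following is only a minimal Python sketch of the pipeline under the simple‑averaging variant. The `query_llm` callback, the rubric wording, and the score‑parsing regex are illustrative assumptions, not the authors' implementation.

```python
import re
import statistics
from typing import Callable, List

# Hypothetical helper: sends a prompt to a named LLM and returns its text reply.
# Any provider SDK or local inference server could back this callback.
QueryFn = Callable[[str, str], str]  # (model_name, prompt) -> response text

RUBRIC_PROMPT = (
    "You are reviewing an answer to a user question.\n"
    "Question: {question}\n"
    "Candidate answer: {answer}\n"
    "Rate the answer for relevance, correctness, and fluency.\n"
    "Reply with a single integer score from 0 to 10."
)

def generate_candidates(question: str, models: List[str], query_llm: QueryFn) -> List[str]:
    """Stage 1: collect one candidate answer per model in the pool."""
    return [query_llm(m, question) for m in models]

def score_candidates(question: str, candidates: List[str],
                     judges: List[str], query_llm: QueryFn) -> List[List[float]]:
    """Stage 2: every judge rates every candidate; returns scores[judge][candidate]."""
    scores = []
    for judge in judges:
        row = []
        for answer in candidates:
            reply = query_llm(judge, RUBRIC_PROMPT.format(question=question, answer=answer))
            match = re.search(r"\d+(\.\d+)?", reply)  # pull the first number out of the reply
            row.append(float(match.group()) if match else 0.0)
        scores.append(row)
    return scores

def select_by_average(candidates: List[str], scores: List[List[float]]) -> str:
    """Stages 3-4 (simple-averaging variant): mean score per candidate, pick the argmax."""
    means = [statistics.mean(col) for col in zip(*scores)]
    return candidates[means.index(max(means))]
```

In practice the same model list would typically be passed as both `models` and `judges`, which is exactly the reuse that the LLM‑as‑a‑Judge stage relies on.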
Results & Findings
| Dataset | Baseline (Smoothie‑Global) | LLM‑PeerReview (Graphical) | LLM‑PeerReview (Avg) |
|---|---|---|---|
| TriviaQA | 71.2 % | 78.1 % (+6.9) | 77.8 % |
| Open‑Domain QA | 68.5 % | 75.8 % (+7.3) | 75.4 % |
| Code Generation | 62.0 % | 68.3 % | 68.5 % |
| Summarization | 73.4 % | 79.0 % | 78.6 % |
- The graphical‑model variant consistently edges out the simple averaging version, confirming that modeling inter‑judge reliability adds value; a minimal sketch of reliability‑weighted aggregation follows this list.
- Even when the candidate pool includes weaker models, the ensemble still selects high‑quality answers, demonstrating robustness to heterogeneous model strengths.
- Ablation studies show that using multiple judges (instead of a single one) yields a 3–5 % boost, highlighting the benefit of collective evaluation.
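This summary does not spell out the paper's graphical‑model algorithm, so the snippet below is only an illustrative stand‑in for reliability‑weighted aggregation: a simple Gaussian‑noise EM scheme in which each judge carries an unknown variance, candidate qualities are re‑estimated as precision‑weighted means, and judge variances are re‑fit from the residuals. It conveys the flavor of modeling inter‑judge reliability rather than the authors' exact model.

```python
import numpy as np

def em_truth_inference(scores: np.ndarray, iters: int = 20, eps: float = 1e-6) -> np.ndarray:
    """Illustrative reliability-weighted aggregation (not the paper's exact algorithm).

    scores has shape (num_judges, num_candidates); the return value is an
    estimated "true quality" per candidate.
    """
    num_judges, _ = scores.shape
    variance = np.ones(num_judges)  # start by trusting every judge equally
    for _ in range(iters):
        # E-step: precision-weighted mean of the judges' scores for each candidate.
        weights = 1.0 / (variance + eps)  # reliable (low-variance) judges weigh more
        quality = (weights[:, None] * scores).sum(axis=0) / weights.sum()
        # M-step: re-estimate each judge's noise from its disagreement with the consensus.
        residuals = scores - quality[None, :]
        variance = (residuals ** 2).mean(axis=1)
    return quality

# Toy example: the third judge disagrees with the other two and gets down-weighted.
scores = np.array([[6.0, 8.0, 5.0],
                   [7.0, 9.0, 6.0],
                   [9.0, 2.0, 8.0]])
print(int(np.argmax(em_truth_inference(scores))))  # -> 1, whereas plain averaging picks 0
```

On the toy matrix, plain averaging would pick the first candidate, whereas down‑weighting the dissenting third judge flips the choice to the second one, which is the kind of correction a reliability model is meant to provide.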
Practical Implications
- Plug‑and‑play improvement for existing LLM services – SaaS platforms can wrap their current model APIs with a lightweight peer‑review layer to boost answer quality without retraining.
- Cost‑effective reliability – By reusing the same LLMs for both generation and evaluation, developers avoid paying for separate evaluation models or large labeled datasets.
- Dynamic model selection – The framework naturally adapts when new LLMs become available; they can be added to the candidate pool and instantly contribute to both generation and scoring.
- Safety & bias mitigation – The scoring stage can incorporate additional rubric items (e.g., “does the response contain harmful content?”), allowing the ensemble to filter out risky outputs before selection; a brief illustration follows this list.
- Explainability – Because each judge produces a score and optionally a short justification, developers can surface “why this answer was chosen” to end‑users, a valuable feature for compliance‑heavy domains.
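As a concrete, entirely illustrative version of that rubric extension (nothing below is prescribed by the paper), the judge prompt could carry a second, safety‑focused item, and candidates whose aggregated harm score exceeds a threshold could be dropped before the selection step:

```python
from typing import Dict, List

# Hypothetical rubric with an extra safety item; both wordings are illustrative.
RUBRIC_ITEMS: Dict[str, str] = {
    "quality": "Rate the answer's relevance, correctness, and fluency from 0 to 10.",
    "harm":    "Rate from 0 (harmless) to 10 how likely the answer is to cause harm.",
}

def filter_risky(candidates: List[str], harm_scores: List[float],
                 threshold: float = 3.0) -> List[str]:
    """Drop candidates whose aggregated harm score exceeds the threshold,
    so the final selection only runs over the remaining answers."""
    return [c for c, h in zip(candidates, harm_scores) if h <= threshold]
```

Since the judges are already queried for quality scores, the extra rubric item could be answered in the same call, keeping the added cost small.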
Limitations & Future Work
- Computational overhead – Scoring every candidate with multiple LLMs multiplies inference cost; for latency‑sensitive applications, batching or model distillation may be required.
- Judge quality variance – If the pool contains only similarly weak models, the peer‑review process cannot magically create a strong answer. The method assumes at least one competent generator.
- Prompt design sensitivity – The rubric prompt heavily influences scoring consistency; poorly phrased prompts can lead to noisy scores.
- Future directions suggested by the authors include:
  - learning adaptive weighting schemes for judges,
  - integrating external factual verification tools into the scoring stage,
  - exploring hierarchical ensembles where the peer‑review process itself is cascaded across multiple rounds.
Authors
- Zhijun Chen
- Zeyu Ji
- Qianren Mao
- Junhang Cheng
- Bangjie Qin
- Hao Wu
- Zhuoran Li
- Jingzheng Li
- Kai Sun
- Zizhe Wang
- Yikun Ban
- Zhu Sun
- Xiangyang Ji
- Hailong Sun
Paper Information
- arXiv ID: 2512.23213v1
- Categories: cs.CL, cs.AI
- Published: December 29, 2025
- PDF: https://arxiv.org/pdf/2512.23213v1