[Paper] When Are Two Scores Better Than One? Investigating Ensembles of Diffusion Models

Published: January 16, 2026 at 12:07 PM EST
4 min read
Source: arXiv - 2601.11444v1

Overview

Diffusion models have become the go‑to method for generating high‑fidelity images, but most research focuses on building ever larger single models. This paper asks a simple, practical question: can we get better results by ensembling (combining) multiple diffusion models, just as we do with supervised classifiers? The authors systematically evaluate several ensemble strategies and uncover a surprising mismatch between traditional statistical metrics and perceptual image quality.

Key Contributions

  • Comprehensive empirical study of score‑based diffusion model ensembles on CIFAR‑10 and FFHQ, covering Deep Ensembles, Monte‑Carlo Dropout, and a variety of aggregation rules.
  • Metric divergence analysis showing that ensembles consistently improve score‑matching loss and likelihood yet often do not improve perceptual metrics such as FID.
  • Cross‑domain validation with tabular data (random forests) where one aggregation rule consistently outperforms the others, highlighting that the phenomenon is not limited to images.
  • Theoretical insights into how scores add up, linking ensemble behavior to other composition tricks like classifier‑free guidance.
  • Practical guidelines for developers on when (and when not) to invest in ensembling diffusion models.

Methodology

  1. Base models – The authors train several independent diffusion models (identical architecture, different random seeds) on standard image benchmarks.
  2. Ensemble constructions (a code sketch of these aggregation rules follows this list)
    • Deep Ensembles: average the predicted scores from each model.
    • Monte‑Carlo Dropout: enable dropout at inference time and average multiple stochastic forward passes.
    • Alternative aggregations: weighted sums, median, and other robust statistics.
  3. Evaluation metrics
    • Statistical: score‑matching loss (the training objective) and exact log‑likelihood estimates.
    • Perceptual: Fréchet Inception Distance (FID), Inception Score (IS), and visual inspection.
  4. Tabular extension – Random forest regressors are ensembled using the same aggregation rules to see if the pattern holds beyond images.
  5. Theoretical analysis – The paper derives how the sum of score fields behaves under the diffusion SDE, shedding light on why likelihood improves while sample quality may not.
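
To make the aggregation rules in step 2 (and the tabular extension in step 4) concrete, here is a minimal PyTorch sketch of how stacked ensemble predictions could be combined. The toy models, tensor shapes, and the variance-aware weights are illustrative placeholders, not the authors' code.

```python
import torch

def aggregate(preds, rule="mean", member_weights=None):
    """Combine stacked ensemble predictions of shape (M, ...) into one estimate."""
    if rule == "mean":                        # Deep Ensembles / MC-Dropout average
        return preds.mean(dim=0)
    if rule == "median":                      # robust alternative aggregation
        return preds.median(dim=0).values
    if rule == "weighted":                    # weighted sum with per-member weights
        w = member_weights / member_weights.sum()
        return torch.einsum("m,m...->...", w, preds)
    raise ValueError(f"unknown rule: {rule}")

# Toy stand-ins for M trained score networks (placeholders, not the paper's models).
M, B, C, H, W = 4, 2, 3, 32, 32
models = [torch.nn.Sequential(torch.nn.Conv2d(C, C, 3, padding=1),
                              torch.nn.Dropout(0.1)) for _ in range(M)]
x_t = torch.randn(B, C, H, W)                 # a batch of noisy inputs

# Deep Ensemble: stack each member's score prediction and average.
scores = torch.stack([m(x_t) for m in models])        # (M, B, C, H, W)
s_ens = aggregate(scores, rule="mean")

# Monte-Carlo Dropout: keep dropout active at inference on a single model
# and average several stochastic forward passes.
mc_model = models[0].train()                  # .train() keeps Dropout stochastic
mc_scores = torch.stack([mc_model(x_t) for _ in range(8)])
s_mc = aggregate(mc_scores, rule="mean")

# Variance-aware weighted average (illustrative weights, not the paper's exact rule).
weights = 1.0 / (scores.var(dim=(1, 2, 3, 4)) + 1e-8)  # one weight per member
s_weighted = aggregate(scores, rule="weighted", member_weights=weights)
```

The same `aggregate` helper applies unchanged to the tabular case, where `preds` would hold the stacked outputs of several random forest regressors.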

All steps are described with enough detail that a practitioner could reproduce the experiments using popular libraries (e.g., PyTorch, Diffusers).
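
As a rough illustration, the ensembled sampling loop could look like the following Diffusers-based sketch, assuming epsilon-prediction UNet2DModel checkpoints trained from different seeds; the checkpoint paths are placeholders and the snippet is not taken from the paper.

```python
import torch
from diffusers import UNet2DModel, DDPMScheduler

# Placeholder paths to independently trained checkpoints (different random seeds).
ckpt_paths = ["ckpts/seed0", "ckpts/seed1", "ckpts/seed2"]
models = [UNet2DModel.from_pretrained(p).eval() for p in ckpt_paths]

scheduler = DDPMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(1000)

x = torch.randn(16, 3, 32, 32)               # CIFAR-10-sized Gaussian noise
with torch.no_grad():
    for t in scheduler.timesteps:
        # Deep Ensemble step: average the members' noise predictions. With a shared
        # noise schedule this is equivalent (up to a scale) to averaging their scores.
        eps = torch.stack([m(x, t).sample for m in models]).mean(dim=0)
        x = scheduler.step(eps, t, x).prev_sample
```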

Results & Findings

| Metric | Single model | Deep Ensemble | MC Dropout | Best aggregation (tabular) |
|---|---|---|---|---|
| Score-matching loss | Baseline | Lower (≈ 5-10 % reduction) | Lower | – |
| Log-likelihood | Baseline | Higher (≈ 3-7 % boost) | Higher | – |
| FID (CIFAR-10) | 3.9 | 4.1 (worse) | 4.0 (worse) | – |
| FID (FFHQ) | 7.2 | 7.5 (worse) | 7.4 (worse) | – |
| Tabular RMSE | Baseline | – | – | Best (weighted avg) |

  • Statistical gains: Ensembles consistently reduce the training loss and improve estimated likelihood, confirming the classic “variance reduction” effect.
  • Perceptual stagnation: On image generation, the same ensembles either leave FID unchanged or slightly degrade it, despite better scores.
  • Domain dependence: For tabular regression, one aggregation rule (a variance‑aware weighted average) clearly outperforms others, indicating that the disconnect is specific to high‑dimensional generative tasks.
  • Theoretical takeaway: Adding scores corresponds to adding drift terms in the reverse diffusion SDE. While this can tighten the distribution (hence higher likelihood), it may also over‑regularize the stochastic trajectory, limiting the diversity needed for low FID.
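
To make the drift-addition point concrete, in the standard score-SDE notation (a sketch in the usual formulation, not quoted from the paper), the reverse-time dynamics with an ensembled score, and classifier-free guidance as another affine combination of score fields, read:

```latex
% Standard score-SDE notation (sketch, not copied from the paper).
% Reverse-time SDE with the score replaced by an ensemble average:
\mathrm{d}x = \Big[\, f(x,t) - g(t)^{2}\,\bar{s}(x,t) \,\Big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{w},
\qquad
\bar{s}(x,t) = \frac{1}{M}\sum_{m=1}^{M} s_{\theta_m}(x,t).

% Classifier-free guidance is another affine combination of two score fields,
% with weights (1 + w) and (-w):
s_{w}(x,t \mid c) = (1 + w)\, s_{\theta}(x,t \mid c) - w\, s_{\theta}(x,t).
```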

Practical Implications

  • Ensembling is not a free win for image generation – If your primary goal is a lower FID or visually better samples, simply averaging diffusion scores is unlikely to help and may even hurt.
  • Use ensembles for likelihood‑sensitive applications – Tasks such as density estimation, anomaly detection, or any downstream that consumes the model’s log‑probability can benefit from the statistical improvements.
  • Guidance‑style tricks already embed ensemble ideas – The paper’s analysis shows that classifier‑free guidance is mathematically similar to a weighted sum of two scores (conditional + unconditional). Understanding this can help you tune guidance scales more systematically (a short sketch appears after this list).
  • Resource budgeting – Training multiple diffusion models is expensive (GPU‑hours, memory). The modest gains in likelihood may not justify the cost for most generative pipelines.
  • When to consider ensembles – If you already have several pretrained diffusion checkpoints (e.g., from hyperparameter sweeps) and need a tighter likelihood estimate for evaluation or downstream scoring, a quick Deep Ensemble can be worthwhile.
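
For reference, the guidance rule mentioned above can be written as a two-member weighted combination of predictions; this snippet is purely illustrative and uses the common (1 + w, -w) parameterization rather than anything specific to the paper.

```python
# Illustrative only: classifier-free guidance as a two-member weighted "ensemble"
# of noise/score predictions, with weights (1 + w) and (-w).
def cfg_prediction(eps_cond, eps_uncond, w):
    """w = 0 recovers the conditional model; larger w extrapolates further away
    from the unconditional prediction."""
    return (1.0 + w) * eps_cond - w * eps_uncond
```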

Limitations & Future Work

  • Scope of datasets – Experiments are limited to CIFAR‑10 and FFHQ; larger, more diverse datasets (e.g., ImageNet) could reveal different dynamics.
  • Ensemble diversity – All base models share the same architecture and training schedule; richer diversity (different architectures, training objectives) was not explored.
  • Metric breadth – The study focuses on FID/IS; other perceptual metrics (e.g., CLIPScore, human preference studies) might react differently to ensembling.
  • Theoretical gaps – While the paper provides intuition on score addition, a full characterization of when likelihood improvements translate to perceptual gains remains open.

Future work could investigate heterogeneous ensembles and adaptive weighting schemes that balance likelihood and diversity, and could apply these insights to conditional diffusion (text‑to‑image, inpainting), where guidance already plays a central role.

Authors

  • Raphaël Razafindralambo
  • Rémy Sun
  • Frédéric Precioso
  • Damien Garreau
  • Pierre-Alexandre Mattei

Paper Information

  • arXiv ID: 2601.11444v1
  • Categories: cs.LG, cs.CV, math.ST, stat.ME, stat.ML
  • Published: January 16, 2026