[Paper] Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
Source: arXiv - 2512.16899v1
Overview
The paper introduces Multimodal RewardBench 2 (MMRB2), the first large‑scale benchmark that evaluates reward models (RMs) on tasks involving interleaved text‑and‑image data. By providing 1,000 expert‑curated preference pairs for each of four realistic multimodal scenarios, the authors give the community a concrete way to measure how well “omni” models can judge the quality of generated content that mixes language and vision.
Key Contributions
- A comprehensive multimodal benchmark covering text‑to‑image generation, image editing, interleaved generation, and multimodal reasoning.
- 23 state‑of‑the‑art models and agents contribute responses, yielding a diverse pool of candidate outputs.
- Expert‑annotated preference pairs (1,000 per task) with strong consensus, created via an ensemble filtering pipeline to ensure high‑quality ground truth.
- Extensive evaluation of existing judges, including LLM‑as‑a‑judge and fine‑tuned reward models, revealing current performance gaps.
- Correlation analysis showing that higher MMRB2 scores predict better downstream performance in Best‑of‑N sampling setups.
- Open‑source baseline (Qwen3‑VL‑32B) that matches the accuracy of a commercial Gemini 2.5 Flash model, establishing a solid reference point for future research.
Methodology
- Task Design – The authors selected four representative multimodal use cases that developers actually encounter:
- Text‑to‑Image: generating an image from a textual prompt.
- Image Editing: modifying an existing image based on textual instructions.
- Interleaved Generation: producing alternating text and image segments (e.g., a tutorial with screenshots).
- Multimodal Reasoning: answering questions that require “thinking with images.”
- Response Collection – Prompts were drawn from 21 source tasks, and responses were generated by 23 different models and agents (including closed‑source models such as Gemini 3 Pro and GPT‑5, as well as open‑source Qwen3‑VL).
- Preference Pair Creation – Human experts compared pairs of model outputs and selected the better one. To keep the annotation workload manageable, an ensemble filtering step first eliminated obviously inferior candidates, leaving only the most competitive pairs for expert review.
- Judge Evaluation – The benchmark was used to test a variety of judges (a minimal scoring sketch follows this list):
- LLM‑as‑a‑judge (e.g., Gemini 3 Pro, GPT‑5).
- Fine‑tuned multimodal reward models trained on human preference data.
- Correlation Study – The authors measured how well a judge’s accuracy on MMRB2 predicts the success of Best‑of‑N sampling (selecting the highest‑scoring output from a set of candidates) on the same tasks.
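The following is a minimal sketch of how a judge can be scored against MMRB2‑style preference pairs. The `PreferencePair` structure, the `query_judge` interface, and the order‑swapping convention for position bias are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch: pairwise judge accuracy on expert-labeled preference pairs.
# `query_judge` is a hypothetical wrapper around whatever multimodal judge
# (LLM-as-a-judge or fine-tuned reward model) you are evaluating.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PreferencePair:
    prompt: str            # text prompt (may reference input images)
    response_a: dict       # interleaved text/image output of model A
    response_b: dict       # interleaved text/image output of model B
    human_choice: str      # "A" or "B", the expert-consensus label

def judge_accuracy(pairs: List[PreferencePair],
                   query_judge: Callable[[str, dict, dict], str]) -> float:
    """Fraction of pairs where the judge agrees with the expert label.

    Each pair is scored twice with the candidate order swapped to reduce
    position bias; a prediction counts as correct only if both orderings
    agree with the human choice (a common convention, assumed here).
    """
    correct = 0
    for p in pairs:
        first = query_judge(p.prompt, p.response_a, p.response_b)    # "A" or "B"
        swapped = query_judge(p.prompt, p.response_b, p.response_a)  # judged on swapped order
        swapped = "A" if swapped == "B" else "B"                     # undo the swap
        if first == swapped == p.human_choice:
            correct += 1
    return correct / len(pairs)
```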
Results & Findings
| Model (Judge) | Accuracy on MMRB2 (average across tasks) |
|---|---|
| Gemini 3 Pro (latest) | 75‑80 % |
| GPT‑5 / Gemini 2.5 Pro | 66‑75 % |
| GPT‑4o (widely used) | ≈59 % |
| Human experts | >90 % |
| Open‑source Qwen3‑VL‑32B | ≈64 % (on par with Gemini 2.5 Flash) |
- Human consensus remains the gold standard, beating the best commercial judges by a comfortable margin.
- Open‑source models are catching up; Qwen3‑VL‑32B demonstrates that strong multimodal reward performance is achievable without proprietary data.
- Performance on MMRB2 correlates strongly (ρ ≈ 0.78) with downstream Best‑of‑N success, confirming the benchmark’s predictive value (see the sketch after this list).
- Error analysis highlights three weak spots: (1) nuanced visual edits, (2) long‑range interleaved consistency, and (3) reasoning that requires joint text‑image inference.
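As an illustration of the correlation analysis, the sketch below computes a Spearman rank correlation between per‑judge MMRB2 accuracy and downstream Best‑of‑N win rate. The numbers are placeholders, not values from the paper.

```python
# Does a judge's MMRB2 accuracy predict its Best-of-N selection quality?
# One entry per judge; the values below are illustrative placeholders.
from scipy.stats import spearmanr

mmrb2_accuracy = [0.59, 0.64, 0.68, 0.74, 0.79]   # benchmark accuracy per judge
best_of_n_win  = [0.41, 0.48, 0.52, 0.61, 0.66]   # downstream Best-of-N win rate per judge

rho, p_value = spearmanr(mmrb2_accuracy, best_of_n_win)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```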
Practical Implications
- Model‑as‑a‑Judge pipelines: Developers building generative assistants (e.g., AI‑powered design tools, chatbots that embed images) can now plug in a reward model evaluated on MMRB2 to reliably rank candidate outputs before presenting them to users.
- Fine‑tuning data selection: The benchmark’s preference pairs can serve as high‑quality training data for custom reward models, especially for niche domains like medical imaging or e‑commerce product visuals.
- Benchmark‑driven development: Companies can benchmark new multimodal LLMs against MMRB2 to quantify progress, similar to how GLUE and MMLU became standard for pure‑text models.
- Open‑source competitiveness: The strong results from Qwen3‑VL‑32B suggest that startups don’t need massive proprietary datasets to build useful multimodal reward models, lowering the barrier to entry for AI‑augmented products.
- Best‑of‑N sampling strategies: Since MMRB2 scores predict downstream quality, developers can safely adopt a “generate‑many‑then‑rank” workflow, reducing the need for expensive human post‑editing (a minimal workflow sketch follows below).
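Below is a minimal sketch of the generate‑many‑then‑rank workflow; `generate_candidates` and `reward_score` are hypothetical stand‑ins for your generator and an MMRB2‑evaluated reward model, not interfaces defined by the paper.

```python
# Best-of-N: generate several candidates, keep the one the reward model ranks highest.
from typing import Callable, List, Tuple

def best_of_n(prompt: str,
              generate_candidates: Callable[[str, int], List[dict]],
              reward_score: Callable[[str, dict], float],
              n: int = 8) -> Tuple[dict, float]:
    """Generate n candidate outputs and return the highest-scoring one with its score."""
    candidates = generate_candidates(prompt, n)
    scored = [(cand, reward_score(prompt, cand)) for cand in candidates]
    return max(scored, key=lambda cs: cs[1])

# Usage (plug in your own generator and judge):
# best_output, score = best_of_n("Make a 3-step photo tutorial for latte art",
#                                generate_candidates, reward_score, n=8)
```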
Limitations & Future Work
- Scope of modalities: The benchmark focuses on static images; video, audio, or 3‑D data are not covered.
- Prompt diversity: While prompts are “practical,” they are still curated; real‑world user inputs may be noisier or more ambiguous.
- Human annotation cost: Expert consensus is expensive to obtain, limiting rapid iteration on new tasks or domains.
- Model bias: Preference pairs reflect the annotators’ cultural and aesthetic biases, which could affect fairness in downstream applications.
Future research directions include extending MMRB2 to dynamic media (e.g., text‑to‑video), automating parts of the preference‑pair generation with semi‑supervised methods, and investigating debiasing techniques for multimodal reward models.
Authors
- Yushi Hu
- Reyhane Askari-Hemmat
- Melissa Hall
- Emily Dinan
- Luke Zettlemoyer
- Marjan Ghazvininejad
Paper Information
- arXiv ID: 2512.16899v1
- Categories: cs.CL, cs.CV
- Published: December 18, 2025