[Paper] UEval: A Benchmark for Unified Multimodal Generation

Published: January 29, 2026 at 01:59 PM EST
4 min read
Source: arXiv - 2601.22155v1

Overview

The paper presents UEval, a new benchmark designed to test “unified” AI models that can generate both images and text in a single response. By collecting 1,000 carefully curated, real‑world questions that demand multimodal output, the authors provide a way to measure how well current systems reason across vision and language simultaneously.

Key Contributions

  • Unified multimodal benchmark – 1,000 expert‑curated questions spanning eight diverse tasks (e.g., step‑by‑step guides, textbook‑style explanations); an illustrative data layout for one benchmark item is sketched after this list.
  • Rubric‑based automatic scoring – A novel pipeline that uses a multimodal LLM to draft evaluation rubrics, which are then refined by human experts, yielding 10,417 validated criteria.
  • Fine‑grained, scalable evaluation – The rubric system enables automatic, detailed scoring of both image quality and textual correctness without relying solely on a single “LLM‑as‑judge”.
  • Empirical baseline results – Even the strongest evaluated system, the proprietary “GPT‑5‑Thinking”, scores only 66.4/100, while the best open‑source model reaches 49.1/100.
  • Insight on reasoning – Models equipped with explicit reasoning (chain‑of‑thought) consistently outperform non‑reasoning counterparts; transferring reasoning traces narrows the performance gap.
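
Taken together, these contributions imply a simple per‑question data layout: a prompt, ground‑truth references, and a human‑validated rubric. The paper does not publish a concrete schema, so the Python dataclasses below are only an illustrative sketch; every field name here is an assumption.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RubricCriterion:
    """One human-validated check, e.g. 'Is the generated diagram labeled correctly?'"""
    question: str
    weight: float = 1.0  # relative importance when aggregating (assumed, not from the paper)

@dataclass
class UEvalItem:
    """Hypothetical container for one of the 1,000 benchmark questions."""
    prompt: str                        # real-world task that requires image + text output
    task_type: str                     # one of the eight task categories
    reference_image_path: str          # path to the ground-truth image
    reference_text: str                # ground-truth textual answer
    rubric: List[RubricCriterion] = field(default_factory=list)  # validated criteria
```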

Methodology

  1. Task Collection & Curation – The authors gathered real‑world prompts from eight domains (e.g., cooking instructions, scientific explanations) and had domain experts verify that each prompt truly requires both an image and a textual description.
  2. Reference Answers – For every prompt, a high‑quality image and a corresponding text answer were created to serve as ground truth.
  3. Rubric Generation Pipeline
    • A multimodal LLM receives the prompt, reference image, and reference text, and produces an initial set of evaluation criteria (e.g., “Is the generated diagram labeled correctly?” or “Does the caption explain the visual content?”).
    • Human experts review, edit, and validate these criteria, turning them into a rubric for that specific question.
  4. Automatic Scoring – When a model’s output is submitted, the same multimodal LLM applies the validated rubric, scoring each criterion; the per‑criterion scores are then aggregated into a final 0‑100 rating (a minimal aggregation sketch follows this list).
  5. Baseline Experiments – Several commercial and open‑source unified models were evaluated, with and without explicit reasoning steps, to establish performance baselines.
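
To make the scoring step concrete, here is a minimal Python sketch of rubric‑based aggregation. It assumes equal criterion weights and a binary pass/fail judgment per criterion; the paper does not state its exact aggregation formula, and the `judge` callable merely stands in for the multimodal LLM judge.

```python
from typing import Callable, List

def score_output(criteria: List[str], judge: Callable[[str], bool]) -> float:
    """Aggregate per-criterion judgments into a single 0-100 score.

    `judge` stands in for the multimodal LLM judge: given one rubric criterion
    (and, in practice, the model's generated image and text), it returns True
    if the criterion is satisfied. Equal weighting is an assumption.
    """
    if not criteria:
        return 0.0
    passed = sum(1 for criterion in criteria if judge(criterion))
    return 100.0 * passed / len(criteria)

# Toy usage with a stand-in judge that only "passes" criteria mentioning labels.
rubric = [
    "Is the generated diagram labeled correctly?",
    "Does the caption explain the visual content?",
]
print(score_output(rubric, judge=lambda c: "label" in c))  # -> 50.0
```

Keeping the judge behind a plain callable makes it easy to swap in an actual multimodal LLM call, or a cached human verdict, without changing the aggregation logic.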

Results & Findings

Model (Unified)                     Score (out of 100)
GPT‑5‑Thinking (proprietary)        66.4
Best open‑source model              49.1
Non‑reasoning baselines (various)   30‑45

  • Reasoning matters: Models that generate intermediate reasoning traces (e.g., “first draw the diagram, then write the caption”) consistently beat those that produce their output directly.
  • Trace transfer works: Feeding the reasoning trace from a strong reasoning model into a weaker non‑reasoning model improves its score by roughly 10 points, suggesting that the reasoning process itself is a valuable signal (a minimal prompt‑construction sketch follows this list).
  • Current gap: Even the top commercial system leaves a sizable margin to perfect performance, indicating that unified multimodal generation is still an open research problem.
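
The trace‑transfer finding lends itself to a prompt‑level sketch: prepend the stronger model’s reasoning trace to the task prompt before handing it to the weaker model. Prompt‑level transfer is an assumption for illustration, not necessarily the paper’s exact procedure.

```python
def build_transfer_prompt(task_prompt: str, reasoning_trace: str) -> str:
    """Prepend a stronger model's reasoning trace to the original task prompt.

    Prompt-level transfer is an assumption for illustration; the paper's exact
    transfer mechanism may differ.
    """
    return (
        f"Task: {task_prompt}\n\n"
        f"Reasoning plan (from a stronger model):\n{reasoning_trace}\n\n"
        "Follow the plan above, then produce the final image and accompanying text."
    )

# Example: hand a weaker, non-reasoning model a plan written by a reasoning model.
print(build_transfer_prompt(
    task_prompt="Create a step-by-step pour-over coffee guide with an illustration.",
    reasoning_trace="First draw the equipment diagram, then write numbered brewing steps.",
))
```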

Practical Implications

  • Product developers can use UEval to benchmark any in‑house multimodal generation pipeline (e.g., AI assistants that produce annotated diagrams, marketing tools that auto‑create infographics).
  • Fine‑grained feedback from the rubric enables targeted improvements: if a model consistently loses points on “visual consistency with the caption,” engineers know where to focus data or architecture tweaks (see the sketch after this list).
  • Reasoning pipelines: The clear benefit of chain‑of‑thought style reasoning suggests that adding a “think‑first, generate‑later” stage (even as a separate module) could boost real‑world applications such as automated report generation, educational content creation, and design‑assist tools.
  • Open‑source community: The benchmark’s public rubric files and scoring code give hobbyists and startups a low‑cost way to evaluate and iterate on multimodal models without needing expensive human annotation loops.
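
As a rough illustration of turning rubric output into targeted feedback, the sketch below tallies pass rates per criterion category across many evaluated outputs. The category tags and the (category, passed) record format are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def pass_rates_by_category(results: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the pass rate per criterion category across many evaluated outputs.

    `results` holds (category, passed) pairs; the category tags are an assumed
    convention for grouping rubric criteria, not something defined by the paper.
    """
    tallies: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # category -> [passed, seen]
    for category, passed in results:
        tallies[category][0] += int(passed)
        tallies[category][1] += 1
    return {category: passed / seen for category, (passed, seen) in tallies.items()}

report = pass_rates_by_category([
    ("visual consistency with the caption", False),
    ("visual consistency with the caption", True),
    ("textual correctness", True),
])
print(report)  # {'visual consistency with the caption': 0.5, 'textual correctness': 1.0}
```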

Limitations & Future Work

  • Rubric dependence on LLMs: Although human‑validated, the initial rubric generation still relies on a multimodal LLM, which could inherit its biases or blind spots.
  • Scope of tasks: UEval covers eight domains, but many industry scenarios (e.g., medical imaging reports, CAD design) remain untested.
  • Scoring granularity vs. subjectivity: Some criteria (e.g., “aesthetic appeal”) are inherently subjective; future versions could incorporate crowd‑sourced validation to reduce variance.
  • Reasoning trace transfer: The paper shows promising results, but a systematic study on how to best encode, store, and reuse reasoning traces across model families is still needed.

Overall, UEval sets a solid foundation for measuring the next generation of AI systems that must think and draw together—an essential step toward truly unified multimodal assistants.

Authors

  • Bo Li
  • Yida Yin
  • Wenhao Chai
  • Xingyu Fu
  • Zhuang Liu

Paper Information

  • arXiv ID: 2601.22155v1
  • Categories: cs.CV, cs.CL
  • Published: January 29, 2026