[Paper] MMGR: Multi-Modal Generative Reasoning
Source: arXiv - 2512.14691v1
Overview
The paper MMGR: Multi‑Modal Generative Reasoning proposes a new way to test whether video‑ and image‑generation models do more than look good: their outputs should also respect physics, logic, and spatial constraints. By introducing a benchmark that measures five core reasoning abilities, the authors expose a performance gap that perceptual quality scores hide in today's "foundation" generative models.
Key Contributions
- MMGR evaluation framework – a unified benchmark that assesses generative reasoning across five dimensions: Physical, Logical, 3‑D Spatial, 2‑D Spatial, and Temporal.
- Cross‑domain test suite – three distinct domains (Abstract Reasoning, Embodied Navigation, Physical Commonsense) with carefully crafted tasks that require holistic correctness in both video and image outputs.
- Fine‑grained metrics – beyond perceptual scores like FVD, the authors define accuracy‑style metrics that demand global state consistency and causal correctness.
- Comprehensive model audit – systematic evaluation of leading video models (Veo‑3, Sora‑2, Wan‑2.2) and image models (Nano‑banana, Nano‑banana Pro, GPT‑4o‑image, Qwen‑image).
- Diagnostic insights – analysis of why current models fail (over‑reliance on visual plausibility, weak long‑term planning, limited state tracking).
Methodology
- Reasoning Taxonomy – The authors break down reasoning into five abilities:
- Physical: obeying gravity, collisions, material properties.
- Logical: cause‑and‑effect chains, rule‑based deductions.
- 3‑D Spatial: navigation, object placement in a 3‑D world.
- 2‑D Spatial: layout consistency on a single image plane.
- Temporal: maintaining coherent state over time.
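
To make the taxonomy concrete, the sketch below shows one way per-sample results could be recorded and aggregated per dimension. This is an illustrative data structure only, not the paper's code; all class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from statistics import mean


class ReasoningDim(Enum):
    """The five reasoning abilities in the MMGR taxonomy."""
    PHYSICAL = "physical"
    LOGICAL = "logical"
    SPATIAL_3D = "spatial_3d"
    SPATIAL_2D = "spatial_2d"
    TEMPORAL = "temporal"


@dataclass
class SampleResult:
    """Binary pass/fail outcomes for one generated output, keyed by dimension."""
    task_id: str
    outcomes: dict[ReasoningDim, bool] = field(default_factory=dict)


def aggregate(results: list[SampleResult]) -> dict[ReasoningDim, float]:
    """Accuracy per reasoning dimension, over the samples that exercise it."""
    scores: dict[ReasoningDim, float] = {}
    for dim in ReasoningDim:
        hits = [r.outcomes[dim] for r in results if dim in r.outcomes]
        if hits:  # dimensions a task does not test are simply reported as N/A
            scores[dim] = mean(hits)
    return scores
```
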
- Domain Construction – tasks are drawn from three distinct domains:
- Abstract Reasoning: tasks like ARC‑AGI and Sudoku where the model must generate a correct solution grid.
- Embodied Navigation: agents must navigate realistic 3‑D environments and localize themselves, producing video of the trajectory.
- Physical Commonsense: sports scenes and compositional interactions that require correct physics (e.g., a ball bouncing).
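
For a flavor of what a physical-commonsense check could involve, the sketch below tests whether a tracked ball trajectory is consistent with constant downward gravitational acceleration between bounces. It assumes the ball's vertical position has already been extracted per frame; the thresholds and function name are hypothetical, not taken from MMGR.

```python
import numpy as np


def gravity_consistent(y: np.ndarray, fps: float, g: float = 9.81,
                       rel_tol: float = 0.25) -> bool:
    """Check that free-flight motion shows roughly constant downward
    acceleration of magnitude ~g.

    y   : vertical position of the ball per frame, in metres (up = positive)
    fps : frame rate of the generated video
    """
    dt = 1.0 / fps
    # Second-order finite differences give per-frame acceleration estimates.
    accel = np.diff(y, n=2) / dt ** 2
    # Ignore frames near bounces, where impulsive contact forces dominate.
    free_flight = np.abs(accel) < 3 * g
    if free_flight.sum() < 5:
        return False  # too few usable frames to judge
    mean_accel = accel[free_flight].mean()
    return abs(-g - mean_accel) / g < rel_tol
```
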
- Metric Design – For each task, the benchmark computes a holistic correctness score (e.g., does the final Sudoku grid satisfy all constraints? does a generated video respect collision physics?). These scores are binary or percentage‑based, making them comparable across modalities.
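
As an illustration, a holistic Sudoku check of this kind could look like the minimal sketch below, assuming the generated image has already been parsed into a 9×9 integer grid (e.g., via OCR). This is a generic validity check, not the benchmark's actual scorer.

```python
def sudoku_is_valid(grid: list[list[int]]) -> bool:
    """Holistic correctness: every row, column, and 3x3 box must contain
    the digits 1-9 exactly once. A single violation fails the whole sample."""
    target = set(range(1, 10))
    rows = grid
    cols = [[grid[r][c] for r in range(9)] for c in range(9)]
    boxes = [
        [grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)]
        for br in range(0, 9, 3) for bc in range(0, 9, 3)
    ]
    return all(set(unit) == target for unit in rows + cols + boxes)
```
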
- Evaluation Pipeline – Models are prompted to produce either a single image or a short video. The output is automatically parsed (e.g., OCR for Sudoku digits, pose estimation for physical scenes) and fed into the reasoning checks.
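
Put together, an evaluation loop of this shape might be wired as in the sketch below. Everything here is schematic: the model interface, `parse` functions, and per-task checkers are placeholders for whatever generation API and detectors (OCR, pose estimation) a given setup provides, not actual MMGR interfaces.

```python
from typing import Any, Callable

# Hypothetical registry mapping each task to (parser, reasoning check).
TASK_CHECKS: dict[str, tuple[Callable[[bytes], Any], Callable[[Any], bool]]] = {
    # "sudoku": (ocr_to_grid, sudoku_is_valid),
    # "bouncing_ball": (track_ball_height, lambda y: gravity_consistent(y, fps=24)),
}


def evaluate(model, prompts: dict[str, str]) -> dict[str, bool]:
    """Prompt the model, parse its image/video output, and run the reasoning
    check for each task. Returns a pass/fail verdict per task."""
    verdicts = {}
    for task_id, prompt in prompts.items():
        raw = model.generate(prompt)          # image bytes or video frames
        parser, check = TASK_CHECKS[task_id]
        try:
            structured = parser(raw)          # e.g. OCR digits, tracked poses
            verdicts[task_id] = bool(check(structured))
        except Exception:                     # unparseable output counts as a fail
            verdicts[task_id] = False
    return verdicts
```
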
Results & Findings
| Domain | Best‑performing model | Physical | Logical | 3‑D Spatial | 2‑D Spatial | Temporal |
|---|---|---|---|---|---|---|
| Abstract Reasoning (ARC‑AGI) | – (all models) | < 5 % | < 10 % | N/A | N/A | N/A |
| Embodied Navigation | Sora‑2 | 38 % | 22 % | 31 % | 45 % | 27 % |
| Physical Commonsense (sports) | Nano‑banana Pro | 71 % | 64 % | 58 % | 73 % | 66 % |
- Physical commonsense is the strongest area, yet even the top model fails on ~30 % of physics checks.
- Abstract reasoning is a near‑zero success zone; models rarely generate a logically valid solution.
- Long‑horizon spatial planning in navigation tasks shows the biggest drop‑off, indicating weak global state tracking.
- Across the board, visual quality metrics (e.g., FVD) remain high, confirming that current training objectives reward “looks right” more than “behaves right.”
Practical Implications
- Safety‑critical generation – For applications like simulation‑based training, autonomous‑vehicle scenario generation, or virtual‑world building, relying solely on perceptual metrics can produce unsafe or misleading content. MMGR highlights the need for reasoning‑aware checks before deployment.
- Prompt engineering – Developers can use the benchmark’s failure modes to craft better prompts or incorporate external reasoning modules (e.g., physics engines, symbolic solvers) into pipelines.
- Model selection – When choosing a generative model for tasks that require consistency (e.g., game level design, instructional video synthesis), MMGR scores give a more realistic picture of suitability than FVD alone.
- Evaluation tooling – The open‑source MMGR suite can be integrated into CI pipelines, automatically flagging generated assets that violate basic physical or logical constraints.
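
A lightweight CI gate built on such checks might look like the sketch below: it walks a directory of freshly generated assets, runs the relevant reasoning checks on each, and exits non-zero if anything fails so the pipeline blocks the change. The `run_reasoning_checks` helper, module name, and directory layout are hypothetical, not part of the released MMGR suite.

```python
#!/usr/bin/env python3
"""CI gate: fail the build if any generated asset violates a reasoning check."""
import sys
from pathlib import Path

# Hypothetical helper returning {check_name: passed} for one asset file.
from reasoning_checks import run_reasoning_checks  # placeholder module


def main(asset_dir: str = "generated_assets") -> int:
    failures = []
    for asset in sorted(Path(asset_dir).glob("*")):
        results = run_reasoning_checks(asset)
        failed = [name for name, passed in results.items() if not passed]
        if failed:
            failures.append(asset.name)
            print(f"FAIL {asset.name}: {', '.join(failed)}")
    if failures:
        print(f"{len(failures)} asset(s) violated reasoning constraints.")
        return 1
    print("All generated assets passed reasoning checks.")
    return 0


if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:]))
```
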
Limitations & Future Work
- Scope of tasks – While the three domains cover a broad spectrum, they still omit certain reasoning types (e.g., social interaction, language grounding).
- Automated scoring reliability – Some metrics depend on downstream detectors (OCR, pose estimation) that can introduce noise, especially on low‑resolution outputs.
- Model‑agnostic prompting – The benchmark assumes a uniform prompting interface; adapting it to models with vastly different APIs may require extra engineering.
- Future directions – The authors suggest extending MMGR to multi‑agent scenarios, integrating differentiable physics simulators for tighter training loops, and exploring curriculum‑based fine‑tuning that directly optimizes the reasoning metrics.
Authors
- Zefan Cai
- Haoyi Qiu
- Tianyi Ma
- Haozhe Zhao
- Gengze Zhou
- Kung‑Hsiang Huang
- Parisa Kordjamshidi
- Minjia Zhang
- Xiao Wen
- Jiuxiang Gu
- Nanyun Peng
- Junjie Hu
Paper Information
- arXiv ID: 2512.14691v1
- Categories: cs.CL, cs.CV
- Published: December 16, 2025
- PDF: https://arxiv.org/pdf/2512.14691v1