[Paper] MMGR: Multi-Modal Generative Reasoning
Source: arXiv - 2512.14691v1
Overview
The paper MMGR: Multi‑Modal Generative Reasoning proposes a new way to test whether video‑ and image‑generation models do more than look good: their outputs should also respect physics, logic, and spatial constraints. By introducing a benchmark that measures five core reasoning abilities, the authors expose a performance gap that perceptual quality scores hide in today's "foundation" generative models.
Key Contributions
- MMGR evaluation framework – a unified benchmark that assesses generative reasoning across five dimensions: Physical, Logical, 3‑D Spatial, 2‑D Spatial, and Temporal.
- Cross‑domain test suite – three distinct domains (Abstract Reasoning, Embodied Navigation, Physical Commonsense) with carefully crafted tasks that require holistic correctness in both video and image outputs.
- Fine‑grained metrics – beyond perceptual scores like FVD, the authors define accuracy‑style metrics that demand global state consistency and causal correctness.
- Comprehensive model audit – systematic evaluation of leading video models (Veo‑3, Sora‑2, Wan‑2.2) and image models (Nano‑banana, Nano‑banana Pro, GPT‑4o‑image, Qwen‑image).
- Diagnostic insights – analysis of why current models fail (over‑reliance on visual plausibility, weak long‑term planning, limited state tracking).
Methodology
- Reasoning Taxonomy – The authors break down reasoning into five abilities:
- Physical: obeying gravity, collisions, material properties.
- Logical: cause‑and‑effect chains, rule‑based deductions.
- 3‑D Spatial: navigation, object placement in a 3‑D world.
- 2‑D Spatial: layout consistency on a single image plane.
- Temporal: maintaining coherent state over time.
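
To make the taxonomy concrete, the sketch below shows one way per-sample results could be recorded and aggregated per dimension. This is an illustrative data structure only, not the paper's code; all class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from statistics import mean


class ReasoningDim(Enum):
    """The five reasoning abilities in the MMGR taxonomy."""
    PHYSICAL = "physical"
    LOGICAL = "logical"
    SPATIAL_3D = "spatial_3d"
    SPATIAL_2D = "spatial_2d"
    TEMPORAL = "temporal"


@dataclass
class SampleResult:
    """Binary pass/fail outcomes for one generated output, keyed by dimension."""
    task_id: str
    outcomes: dict[ReasoningDim, bool] = field(default_factory=dict)


def aggregate(results: list[SampleResult]) -> dict[ReasoningDim, float]:
    """Accuracy per reasoning dimension, over the samples that exercise it."""
    scores: dict[ReasoningDim, float] = {}
    for dim in ReasoningDim:
        hits = [r.outcomes[dim] for r in results if dim in r.outcomes]
        if hits:  # dimensions a task does not test are simply reported as N/A
            scores[dim] = mean(hits)
    return scores
```
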
- Domain Construction – tasks are drawn from three distinct domains:
- Abstract Reasoning: tasks like ARC‑AGI and Sudoku where the model must generate a correct solution grid.
- Embodied Navigation: agents must navigate realistic 3‑D environments and localize themselves, producing video of the trajectory.
- Physical Commonsense: sports scenes and compositional interactions that require correct physics (e.g., a ball bouncing).
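
For a flavor of what a physical-commonsense check could involve, the sketch below tests whether a tracked ball trajectory is consistent with constant downward gravitational acceleration between bounces. It assumes the ball's vertical position has already been extracted per frame; the thresholds and function name are hypothetical, not taken from MMGR.

```python
import numpy as np


def gravity_consistent(y: np.ndarray, fps: float, g: float = 9.81,
                       rel_tol: float = 0.25) -> bool:
    """Check that free-flight motion shows roughly constant downward
    acceleration of magnitude ~g.

    y   : vertical position of the ball per frame, in metres (up = positive)
    fps : frame rate of the generated video
    """
    dt = 1.0 / fps
    # Second-order finite differences give per-frame acceleration estimates.
    accel = np.diff(y, n=2) / dt ** 2
    # Ignore frames near bounces, where impulsive contact forces dominate.
    free_flight = np.abs(accel) < 3 * g
    if free_flight.sum() < 5:
        return False  # too few usable frames to judge
    mean_accel = accel[free_flight].mean()
    return abs(-g - mean_accel) / g < rel_tol
```
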
- Metric Design – For each task, the benchmark computes a holistic correctness score (e.g., does the final Sudoku grid satisfy all constraints? does a generated video respect collision physics?). These scores are binary or percentage‑based, making them comparable across modalities.
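
As an illustration, a holistic Sudoku check of this kind could look like the minimal sketch below, assuming the generated image has already been parsed into a 9×9 integer grid (e.g., via OCR). This is a generic validity check, not the benchmark's actual scorer.

```python
def sudoku_is_valid(grid: list[list[int]]) -> bool:
    """Holistic correctness: every row, column, and 3x3 box must contain
    the digits 1-9 exactly once. A single violation fails the whole sample."""
    target = set(range(1, 10))
    rows = grid
    cols = [[grid[r][c] for r in range(9)] for c in range(9)]
    boxes = [
        [grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)]
        for br in range(0, 9, 3) for bc in range(0, 9, 3)
    ]
    return all(set(unit) == target for unit in rows + cols + boxes)
```
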
- Evaluation Pipeline – Models are prompted to produce either a single image or a short video. The output is automatically parsed (e.g., OCR for Sudoku digits, pose estimation for physical scenes) and fed into the reasoning checks.
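
Put together, an evaluation loop of this shape might be wired as in the sketch below. Everything here is schematic: the model interface, `parse` functions, and per-task checkers are placeholders for whatever generation API and detectors (OCR, pose estimation) a given setup provides, not actual MMGR interfaces.

```python
from typing import Any, Callable

# Hypothetical registry mapping each task to (parser, reasoning check).
TASK_CHECKS: dict[str, tuple[Callable[[bytes], Any], Callable[[Any], bool]]] = {
    # "sudoku": (ocr_to_grid, sudoku_is_valid),
    # "bouncing_ball": (track_ball_height, lambda y: gravity_consistent(y, fps=24)),
}


def evaluate(model, prompts: dict[str, str]) -> dict[str, bool]:
    """Prompt the model, parse its image/video output, and run the reasoning
    check for each task. Returns a pass/fail verdict per task."""
    verdicts = {}
    for task_id, prompt in prompts.items():
        raw = model.generate(prompt)          # image bytes or video frames
        parser, check = TASK_CHECKS[task_id]
        try:
            structured = parser(raw)          # e.g. OCR digits, tracked poses
            verdicts[task_id] = bool(check(structured))
        except Exception:                     # unparseable output counts as a fail
            verdicts[task_id] = False
    return verdicts
```
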
Results & Findings
| Domain | Best‑performing model | Physical | Logical | 3‑D Spatial | 2‑D Spatial | Temporal |
|---|---|---|---|---|---|---|
| Abstract Reasoning (ARC‑AGI) | – (all models) | < 5 % | < 10 % | N/A | N/A | N/A |
| Embodied Navigation | Sora‑2 | 38 % | 22 % | 31 % | 45 % | 27 % |
| Physical Commonsense (sports) | Nano‑banana Pro | 71 % | 64 % | 58 % | 73 % | 66 % |
- Physical commonsense is the strongest area, yet even the top model fails on ~30 % of physics checks.
- Abstract reasoning is a near‑zero success zone; models rarely generate a logically valid solution.
- Long‑horizon spatial planning in navigation tasks shows the biggest drop‑off, indicating weak global state tracking.
- Across the board, visual quality metrics (e.g., FVD) remain high, confirming that current training objectives reward “looks right” more than “behaves right.”
Practical Implications
- Safety‑critical generation – For applications like simulation‑based training, autonomous‑vehicle scenario generation, or virtual‑world building, relying solely on perceptual metrics can produce unsafe or misleading content. MMGR highlights the need for reasoning‑aware checks before deployment.
- Prompt engineering – Developers can use the benchmark’s failure modes to craft better prompts or incorporate external reasoning modules (e.g., physics engines, symbolic solvers) into pipelines.
- Model selection – When choosing a generative model for tasks that require consistency (e.g., game level design, instructional video synthesis), MMGR scores give a more realistic picture of suitability than FVD alone.
- Evaluation tooling – The open‑source MMGR suite can be integrated into CI pipelines, automatically flagging generated assets that violate basic physical or logical constraints.
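
A lightweight CI gate built on such checks might look like the sketch below: it walks a directory of freshly generated assets, runs the relevant reasoning checks on each, and exits non-zero if anything fails so the pipeline blocks the change. The `run_reasoning_checks` helper, module name, and directory layout are hypothetical, not part of the released MMGR suite.

```python
#!/usr/bin/env python3
"""CI gate: fail the build if any generated asset violates a reasoning check."""
import sys
from pathlib import Path

# Hypothetical helper returning {check_name: passed} for one asset file.
from reasoning_checks import run_reasoning_checks  # placeholder module


def main(asset_dir: str = "generated_assets") -> int:
    failures = []
    for asset in sorted(Path(asset_dir).glob("*")):
        results = run_reasoning_checks(asset)
        failed = [name for name, passed in results.items() if not passed]
        if failed:
            failures.append(asset.name)
            print(f"FAIL {asset.name}: {', '.join(failed)}")
    if failures:
        print(f"{len(failures)} asset(s) violated reasoning constraints.")
        return 1
    print("All generated assets passed reasoning checks.")
    return 0


if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:]))
```
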
Limitations & Future Work
- Scope of tasks – While the three domains cover a broad spectrum, they still omit certain reasoning types (e.g., social interaction, language grounding).
- Automated scoring reliability – Some metrics depend on downstream detectors (OCR, pose estimation) that can introduce noise, especially on low‑resolution outputs.
- Model‑agnostic prompting – The benchmark assumes a uniform prompting interface; adapting it to models with vastly different APIs may require extra engineering.
- Future directions – The authors suggest extending MMGR to multi‑agent scenarios, integrating differentiable physics simulators for tighter training loops, and exploring curriculum‑based fine‑tuning that directly optimizes the reasoning metrics.
Authors
- Zefan Cai
- Haoyi Qiu
- Tianyi Ma
- Haozhe Zhao
- Gengze Zhou
- Kung‑Hsiang Huang
- Parisa Kordjamshidi
- Minjia Zhang
- Xiao Wen
- Jiuxiang Gu
- Nanyun Peng
- Junjie Hu
Paper Information
- arXiv ID: 2512.14691v1
- Categories: cs.CL, cs.CV
- Published: December 16, 2025
- PDF: https://arxiv.org/pdf/2512.14691v1