[Paper] PhyCritic: Multimodal Critic Models for Physical AI
Source: arXiv - 2602.11124v1
Overview
The paper PhyCritic presents a new multimodal “critic” model that can judge and score AI‑generated answers in tasks that require a solid grasp of physics: robotics, simulation, or any system that must reason about objects, forces, and cause‑effect relationships. By training the critic with a two‑stage reinforcement‑learning‑with‑verifiable‑rewards (RLVR) pipeline, the authors show that it not only outperforms existing open‑source judges on standard benchmarks but also boosts the performance of downstream policy models that act in physically grounded environments.
Key Contributions
- Physical‑AI‑focused critic: First open‑source multimodal judge explicitly optimized for perception, causal reasoning, and planning in physical domains.
- Two‑stage RLVR training pipeline:
  - Physical skill warm‑up – pre‑trains the model on physics‑rich perception and reasoning tasks.
  - Self‑referential finetuning – the critic first generates its own answer as an internal reference, then judges candidate responses, improving consistency and reducing hallucinations.
- Strong empirical gains: Sets new state‑of‑the‑art scores on both physical‑AI judge benchmarks (e.g., PHY‑Eval, RoboBench) and general multimodal judge suites (e.g., MME, VQA‑2).
- Dual‑use as policy model: When repurposed as an action‑selection model, PhyCritic enhances perception and planning in simulated robotics tasks, demonstrating the synergy between judging and acting.
- Open‑source release: Model weights, training scripts, and a lightweight inference API are made publicly available, encouraging community adoption and further research.
Methodology
- Dataset Construction – The authors curated a Physical AI dataset containing image‑text pairs that require reasoning about object stability, motion trajectories, material properties, and tool use. Each entry includes a ground‑truth answer, a set of plausible distractors, and a numeric “physical correctness” score.
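To make the entry structure above concrete, here is a minimal sketch of one dataset record and a validation helper. The field names and the scene file are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical shape of one Physical AI dataset entry; field names
# (image, question, ground_truth, distractors, physical_score) are
# assumptions for illustration only.
entry = {
    "image": "scenes/block_tower_042.png",  # physics-rich scene (hypothetical path)
    "question": "Will the tower remain stable if the red block is removed?",
    "ground_truth": "No - the red block supports the two blocks above it.",
    "distractors": [
        "Yes - the tower is glued together.",
        "Yes - the red block carries no load.",
    ],
    "physical_score": 0.92,  # numeric "physical correctness" in [0, 1]
}

def validate(e: dict) -> bool:
    """Check that an entry carries every required field and a score in [0, 1]."""
    required = {"image", "question", "ground_truth", "distractors", "physical_score"}
    return required <= e.keys() and 0.0 <= e["physical_score"] <= 1.0

print(validate(entry))  # True
```

A check like this would typically run once over the whole corpus before training, so malformed entries fail fast rather than corrupting a training run.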
- Stage 1: Physical Skill Warm‑up – Using a standard vision‑language backbone (e.g., CLIP‑ViT + LLaMA), the model is trained with supervised cross‑entropy loss to predict the correct answer and a regression loss for the physical score. This stage injects domain‑specific perception (e.g., depth cues, contact detection) and causal reasoning.
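The Stage 1 objective combines a classification term and a regression term. A minimal pure-Python sketch of that combined loss follows; the weighting factor `alpha` is a hypothetical hyperparameter, since the paper's exact loss weighting is not given here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def warmup_loss(answer_logits, true_idx, score_pred, score_true, alpha=1.0):
    """Stage-1 objective: cross-entropy on the answer choice plus an MSE
    regression term on the physical-correctness score. `alpha` (assumed)
    weights the regression term against the classification term."""
    probs = softmax(answer_logits)
    cross_entropy = -math.log(probs[true_idx])
    mse = (score_pred - score_true) ** 2
    return cross_entropy + alpha * mse

# Uniform logits over 4 choices give cross-entropy log(4); a perfect
# score prediction contributes zero regression loss.
print(warmup_loss([0.0, 0.0, 0.0, 0.0], 0, 0.5, 0.5))
```

In practice this would be computed over batches with an autodiff framework, but the decomposition into the two terms is the same.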
- Stage 2: Self‑Referential Critic Finetuning – The model is placed in a self‑referential loop: given a prompt, it first generates its own answer (the “internal reference”). Then, when presented with a candidate answer from another model, it compares the two, outputs a pairwise preference, a numeric rating, and a short natural‑language justification. Reinforcement learning with verifiable rewards (RLVR) optimizes the critic to maximize agreement with human‑annotated preferences while penalizing inconsistent justifications.
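The two-step generate-then-judge loop above can be sketched as follows. The `model.generate` and `model.compare` interfaces, the `Judgment` fields, and the stub model are all assumptions for illustration; the paper's actual inference interface may differ.

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    preferred: str       # "candidate" or "reference"
    rating: float        # numeric score for the candidate
    justification: str   # short natural-language rationale

def self_referential_judge(model, prompt, candidate):
    """Two-step critique: (1) generate an internal reference answer,
    (2) judge the candidate against that reference. `model.generate`
    and `model.compare` are hypothetical interfaces."""
    reference = model.generate(prompt)                  # step 1: own answer
    return model.compare(prompt, reference, candidate)  # step 2: pairwise judgment

class StubModel:
    """Toy stand-in for the critic, used only to exercise the loop."""
    def generate(self, prompt):
        return "stable"
    def compare(self, prompt, reference, candidate):
        agrees = (reference == candidate)
        return Judgment(
            preferred="candidate" if agrees else "reference",
            rating=1.0 if agrees else 0.0,
            justification="matches internal reference" if agrees
                          else "contradicts internal reference",
        )

print(self_referential_judge(StubModel(), "Is the tower stable?", "stable").preferred)
# candidate
```

Anchoring the judgment on the model's own answer is what gives the consistency benefit the paper reports, at the cost of the self-referential bias noted in the limitations.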
- Evaluation Protocol – Benchmarks are split into physical (requiring physics reasoning) and general (standard vision‑language tasks). Metrics include accuracy of pairwise preferences, correlation with human scores (Spearman’s ρ), and justification quality (BLEU/ROUGE).
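One of the metrics above, Spearman's ρ, is the Pearson correlation of rank vectors. A small self-contained implementation (with average ranks for ties) shows exactly what is being computed when the critic's scores are compared against human scores:

```python
def ranks(values):
    """1-based ranks, averaging over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                      # extend over a run of tied values
        avg_rank = (i + j) / 2 + 1      # average position of the tied run
        for k in range(i, j + 1):
            r[order[k]] = avg_rank
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

print(spearman([0.1, 0.5, 0.9], [1.0, 2.0, 3.0]))  # 1.0 (perfectly monotone)
```

In practice one would call `scipy.stats.spearmanr`, which handles ties the same way; the point here is only that ρ measures monotone agreement, not absolute score calibration.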
Results & Findings
| Benchmark | PhyCritic | Open‑source Baseline (e.g., LLaVA‑1.5) | Δ |
|---|---|---|---|
| PHY‑Eval (pairwise) | 84.2 % | 71.5 % | +12.7 pp |
| RoboBench (numeric score) | 0.78 (ρ) | 0.63 (ρ) | +0.15 |
| MME (general VQA) | 78.9 % | 73.1 % | +5.8 pp |
| VQA‑2 (justification BLEU) | 31.4 | 27.0 | +4.4 |
- Stability boost: The self‑referential step reduced variance in scores across runs by ~30 %, indicating more reliable judgments.
- Policy transfer: When PhyCritic was used as a policy network in a simulated block‑stacking task, success rate rose from 62 % (baseline policy) to 78 %, confirming that the critic’s physics knowledge is transferable.
- Human alignment: User studies showed that explanations generated by PhyCritic were rated as “more trustworthy” 68 % of the time compared to other judges.
Practical Implications
- Better automated testing for robotics & simulation – Developers can plug PhyCritic into CI pipelines to automatically evaluate the physical plausibility of generated plans or simulated scenes.
- Preference‑aligned fine‑tuning – When training large language or vision‑language models for embodied agents, PhyCritic can provide high‑quality pairwise preferences and scores, accelerating RLHF‑style alignment without costly human labeling.
- Explainable AI for safety‑critical systems – The model’s natural‑language justifications give engineers insight into why a particular action is deemed unsafe or physically impossible, aiding debugging and compliance.
- Cross‑modal evaluation – Because PhyCritic works with images, videos, and text, it can serve as a universal judge for multimodal generative models (e.g., video‑to‑text, 3D scene generation) that need to respect physics constraints.
- Open‑source accessibility – The ≈2 B‑parameter model, served through a lightweight inference API, runs on a single RTX 3090, making it feasible for startups and research labs to adopt without massive compute budgets.
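As a sketch of the CI-pipeline use case above, the gate below queries a critic service and fails the build when the plausibility score falls under a threshold. The endpoint URL, payload fields, response key, and threshold are all assumptions, since the paper's released API is not specified here; the scoring backend is injectable so the gate logic itself can be tested without a server.

```python
import json
import urllib.request

PHYCRITIC_URL = "http://localhost:8000/score"  # hypothetical local endpoint

def http_score(image_path: str, plan_text: str) -> float:
    """Query a (hypothetical) critic server for a physical-correctness score."""
    payload = json.dumps({"image": image_path, "plan": plan_text}).encode()
    req = urllib.request.Request(
        PHYCRITIC_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["physical_score"]  # assumed response field

def ci_gate(image_path: str, plan_text: str,
            threshold: float = 0.7, score_fn=http_score) -> bool:
    """Pass the CI check only when the critic's score meets the threshold."""
    return score_fn(image_path, plan_text) >= threshold
```

Injecting `score_fn` keeps the thresholding decision deterministic and unit-testable, while the real HTTP backend is swapped in only inside the pipeline itself.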
Limitations & Future Work
- Domain coverage – The physical dataset focuses on tabletop manipulation and basic dynamics; more complex domains (fluid dynamics, deformable objects) remain under‑represented.
- Scale vs. performance trade‑off – While PhyCritic is competitive, scaling to >10 B parameters could further close the gap with proprietary judges but would increase inference cost.
- Self‑referential bias – Generating its own reference may reinforce the model’s own blind spots; future work could incorporate external expert references or ensemble judgments.
- Real‑world transfer – Benchmarks are largely simulated; validating the critic on real robot logs and sensor data is an open challenge.
Overall, PhyCritic demonstrates that a dedicated physics‑aware critic can dramatically improve both evaluation and action generation for physically grounded AI, opening a path toward safer, more reliable multimodal systems.
Authors
- Tianyi Xiong
- Shihao Wang
- Guilin Liu
- Yi Dong
- Ming Li
- Heng Huang
- Jan Kautz
- Zhiding Yu
Paper Information
- arXiv ID: 2602.11124v1
- Categories: cs.CV
- Published: February 11, 2026