[Paper] PhyCritic: Multimodal Critic Models for Physical AI
Source: arXiv - 2602.11124v1
Overview
The paper PhyCritic presents a new multimodal “critic” model that can judge and score AI‑generated answers in tasks that require a solid grasp of physics: robotics, simulation, or any system that must reason about objects, forces, and cause‑effect relationships. By training the critic with a two‑stage reinforcement‑learning‑with‑verifiable‑rewards (RLVR) pipeline, the authors show that it not only outperforms existing open‑source judges on standard benchmarks but also boosts the performance of downstream policy models that act in physically grounded environments.
Key Contributions
- Physical‑AI‑focused critic: First open‑source multimodal judge explicitly optimized for perception, causal reasoning, and planning in physical domains.
- Two‑stage RLVR training pipeline:
  - Physical skill warm‑up – pre‑trains the model on physics‑rich perception and reasoning tasks.
  - Self‑referential finetuning – the critic first generates its own answer as an internal reference, then judges candidate responses, improving consistency and reducing hallucinations.
- Strong empirical gains: Sets new state‑of‑the‑art scores on both physical‑AI judge benchmarks (e.g., PHY‑Eval, RoboBench) and general multimodal judge suites (e.g., MME, VQA‑2).
- Dual‑use as policy model: When repurposed as an action‑selection model, PhyCritic enhances perception and planning in simulated robotics tasks, demonstrating the synergy between judging and acting.
- Open‑source release: Model weights, training scripts, and a lightweight inference API are made publicly available, encouraging community adoption and further research.
Methodology
- Dataset Construction – The authors curated a Physical AI dataset containing image‑text pairs that require reasoning about object stability, motion trajectories, material properties, and tool use. Each entry includes a ground‑truth answer, a set of plausible distractors, and a numeric “physical correctness” score.
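To make the entry structure above concrete, here is a minimal sketch of one dataset record and a validation helper. The field names and the scene file are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical shape of one Physical AI dataset entry; field names
# (image, question, ground_truth, distractors, physical_score) are
# assumptions for illustration only.
entry = {
    "image": "scenes/block_tower_042.png",  # physics-rich scene (hypothetical path)
    "question": "Will the tower remain stable if the red block is removed?",
    "ground_truth": "No - the red block supports the two blocks above it.",
    "distractors": [
        "Yes - the tower is glued together.",
        "Yes - the red block carries no load.",
    ],
    "physical_score": 0.92,  # numeric "physical correctness" in [0, 1]
}

def validate(e: dict) -> bool:
    """Check that an entry carries every required field and a score in [0, 1]."""
    required = {"image", "question", "ground_truth", "distractors", "physical_score"}
    return required <= e.keys() and 0.0 <= e["physical_score"] <= 1.0

print(validate(entry))  # True
```

A check like this would typically run once over the whole corpus before training, so malformed entries fail fast rather than corrupting a training run.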
- Stage 1: Physical Skill Warm‑up – Using a standard vision‑language backbone (e.g., CLIP‑ViT + LLaMA), the model is trained with supervised cross‑entropy loss to predict the correct answer and a regression loss for the physical score. This stage injects domain‑specific perception (e.g., depth cues, contact detection) and causal reasoning.
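The Stage 1 objective combines a classification term and a regression term. A minimal pure-Python sketch of that combined loss follows; the weighting factor `alpha` is a hypothetical hyperparameter, since the paper's exact loss weighting is not given here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def warmup_loss(answer_logits, true_idx, score_pred, score_true, alpha=1.0):
    """Stage-1 objective: cross-entropy on the answer choice plus an MSE
    regression term on the physical-correctness score. `alpha` (assumed)
    weights the regression term against the classification term."""
    probs = softmax(answer_logits)
    cross_entropy = -math.log(probs[true_idx])
    mse = (score_pred - score_true) ** 2
    return cross_entropy + alpha * mse

# Uniform logits over 4 choices give cross-entropy log(4); a perfect
# score prediction contributes zero regression loss.
print(warmup_loss([0.0, 0.0, 0.0, 0.0], 0, 0.5, 0.5))
```

In practice this would be computed over batches with an autodiff framework, but the decomposition into the two terms is the same.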
- Stage 2: Self‑Referential Critic Finetuning – The model is placed in a self‑referential loop: given a prompt, it first generates its own answer (the “internal reference”). Then, when presented with a candidate answer from another model, it compares the two, outputs a pairwise preference, a numeric rating, and a short natural‑language justification. Reinforcement learning with verifiable rewards (RLVR) optimizes the critic to maximize agreement with human‑annotated preferences while penalizing inconsistent justifications.
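The two-step generate-then-judge loop above can be sketched as follows. The `model.generate` and `model.compare` interfaces, the `Judgment` fields, and the stub model are all assumptions for illustration; the paper's actual inference interface may differ.

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    preferred: str       # "candidate" or "reference"
    rating: float        # numeric score for the candidate
    justification: str   # short natural-language rationale

def self_referential_judge(model, prompt, candidate):
    """Two-step critique: (1) generate an internal reference answer,
    (2) judge the candidate against that reference. `model.generate`
    and `model.compare` are hypothetical interfaces."""
    reference = model.generate(prompt)                  # step 1: own answer
    return model.compare(prompt, reference, candidate)  # step 2: pairwise judgment

class StubModel:
    """Toy stand-in for the critic, used only to exercise the loop."""
    def generate(self, prompt):
        return "stable"
    def compare(self, prompt, reference, candidate):
        agrees = (reference == candidate)
        return Judgment(
            preferred="candidate" if agrees else "reference",
            rating=1.0 if agrees else 0.0,
            justification="matches internal reference" if agrees
                          else "contradicts internal reference",
        )

print(self_referential_judge(StubModel(), "Is the tower stable?", "stable").preferred)
# candidate
```

Anchoring the judgment on the model's own answer is what gives the consistency benefit the paper reports, at the cost of the self-referential bias noted in the limitations.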
- Evaluation Protocol – Benchmarks are split into physical (requiring physics reasoning) and general (standard vision‑language tasks). Metrics include accuracy of pairwise preferences, correlation with human scores (Spearman’s ρ), and justification quality (BLEU/ROUGE).
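One of the metrics above, Spearman's ρ, is the Pearson correlation of rank vectors. A small self-contained implementation (with average ranks for ties) shows exactly what is being computed when the critic's scores are compared against human scores:

```python
def ranks(values):
    """1-based ranks, averaging over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                      # extend over a run of tied values
        avg_rank = (i + j) / 2 + 1      # average position of the tied run
        for k in range(i, j + 1):
            r[order[k]] = avg_rank
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

print(spearman([0.1, 0.5, 0.9], [1.0, 2.0, 3.0]))  # 1.0 (perfectly monotone)
```

In practice one would call `scipy.stats.spearmanr`, which handles ties the same way; the point here is only that ρ measures monotone agreement, not absolute score calibration.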
Results & Findings
| Benchmark | PhyCritic | Open‑source Baseline (e.g., LLaVA‑1.5) | Δ |
|---|---|---|---|
| PHY‑Eval (pairwise) | 84.2 % | 71.5 % | +12.7 pp |
| RoboBench (numeric score) | 0.78 (ρ) | 0.63 (ρ) | +0.15 |
| MME (general VQA) | 78.9 % | 73.1 % | +5.8 pp |
| VQA‑2 (justification BLEU) | 31.4 | 27.0 | +4.4 |
- Stability boost: The self‑referential step reduced variance in scores across runs by ~30 %, indicating more reliable judgments.
- Policy transfer: When PhyCritic was used as a policy network in a simulated block‑stacking task, success rate rose from 62 % (baseline policy) to 78 %, confirming that the critic’s physics knowledge is transferable.
- Human alignment: User studies showed that explanations generated by PhyCritic were rated as “more trustworthy” 68 % of the time compared to other judges.
Practical Implications
- Better automated testing for robotics & simulation – Developers can plug PhyCritic into CI pipelines to automatically evaluate the physical plausibility of generated plans or simulated scenes.
- Preference‑aligned fine‑tuning – When training large language or vision‑language models for embodied agents, PhyCritic can provide high‑quality pairwise preferences and scores, accelerating RLHF‑style alignment without costly human labeling.
- Explainable AI for safety‑critical systems – The model’s natural‑language justifications give engineers insight into why a particular action is deemed unsafe or physically impossible, aiding debugging and compliance.
- Cross‑modal evaluation – Because PhyCritic works with images, videos, and text, it can serve as a universal judge for multimodal generative models (e.g., video‑to‑text, 3D scene generation) that need to respect physics constraints.
- Open‑source accessibility – The ≈2 B‑parameter model, served through a lightweight inference API, runs on a single RTX 3090, making it feasible for startups and research labs to adopt without massive compute budgets.
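As a sketch of the CI-pipeline use case above, the gate below queries a critic service and fails the build when the plausibility score falls under a threshold. The endpoint URL, payload fields, response key, and threshold are all assumptions, since the paper's released API is not specified here; the scoring backend is injectable so the gate logic itself can be tested without a server.

```python
import json
import urllib.request

PHYCRITIC_URL = "http://localhost:8000/score"  # hypothetical local endpoint

def http_score(image_path: str, plan_text: str) -> float:
    """Query a (hypothetical) critic server for a physical-correctness score."""
    payload = json.dumps({"image": image_path, "plan": plan_text}).encode()
    req = urllib.request.Request(
        PHYCRITIC_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["physical_score"]  # assumed response field

def ci_gate(image_path: str, plan_text: str,
            threshold: float = 0.7, score_fn=http_score) -> bool:
    """Pass the CI check only when the critic's score meets the threshold."""
    return score_fn(image_path, plan_text) >= threshold
```

Injecting `score_fn` keeps the thresholding decision deterministic and unit-testable, while the real HTTP backend is swapped in only inside the pipeline itself.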
Limitations & Future Work
- Domain coverage – The physical dataset focuses on tabletop manipulation and basic dynamics; more complex domains (fluid dynamics, deformable objects) remain under‑represented.
- Scale vs. performance trade‑off – While PhyCritic is competitive, scaling to >10 B parameters could further close the gap with proprietary judges but would increase inference cost.
- Self‑referential bias – Generating its own reference may reinforce the model’s own blind spots; future work could incorporate external expert references or ensemble judgments.
- Real‑world transfer – Benchmarks are largely simulated; validating the critic on real robot logs and sensor data is an open challenge.
Overall, PhyCritic demonstrates that a dedicated physics‑aware critic can dramatically improve both evaluation and action generation for physically grounded AI, opening a path toward safer, more reliable multimodal systems.
Authors
- Tianyi Xiong
- Shihao Wang
- Guilin Liu
- Yi Dong
- Ming Li
- Heng Huang
- Jan Kautz
- Zhiding Yu
Paper Information
- arXiv ID: 2602.11124v1
- Categories: cs.CV
- Published: February 11, 2026