[Paper] Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench

Published: December 2, 2025 at 12:11 PM EST
4 min read
Source: arXiv - 2512.02942v1

Overview

A new benchmark called VideoScience-Bench pushes video‑generation models beyond visual fidelity and into the realm of scientific reasoning. By testing whether models can synthesize videos that obey undergraduate‑level physics and chemistry laws, the authors expose a critical blind spot in current video‑generation research and provide a concrete way to measure progress toward truly “zero‑shot” reasoning systems.

Key Contributions

  • First scientific‑reasoning benchmark for video generation – 200 curated prompts covering 14 topics and 103 distinct concepts in physics and chemistry.
  • Multi‑dimensional evaluation framework – scores models on Prompt Consistency, Phenomenon Congruency, Correct Dynamism, Immutability, and Spatio‑Temporal Continuity (a minimal score‑record sketch follows this list).
  • Human‑aligned automatic judging – a vision‑language model (VLM) is used as a “judge” and shown to correlate strongly with expert human ratings.
  • Comprehensive empirical study – seven state‑of‑the‑art text‑to‑video (T2V) and image‑to‑video (I2V) models are benchmarked, revealing systematic gaps in scientific understanding.
  • Open‑source data and evaluation code – the benchmark, prompts, and evaluation scripts are released publicly for reproducibility and community extension.
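
To make the five dimensions concrete, here is a minimal Python sketch of a per‑video score record. The field names, the 0–1 scale, and the equal‑weight composite are illustrative assumptions, not the paper's exact aggregation.

```python
from dataclasses import dataclass, fields

@dataclass
class VideoScore:
    """Scores for one generated video on the five benchmark dimensions (0-1 scale assumed)."""
    prompt_consistency: float
    phenomenon_congruency: float
    correct_dynamism: float
    immutability: float
    spatio_temporal_continuity: float

    def composite(self) -> float:
        # Equal-weight average across dimensions; the paper's aggregation may differ.
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)

# Example: a video that follows the prompt but gets the dynamics wrong.
score = VideoScore(0.9, 0.4, 0.2, 0.6, 0.7)
print(f"composite = {score.composite():.2f}")  # -> composite = 0.56
```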

Methodology

  1. Prompt Design – Each benchmark entry is a natural‑language description that weaves together multiple scientific concepts (e.g., “a metal rod heated at one end while the other end is immersed in liquid nitrogen”). The prompts are vetted by domain experts to ensure they require genuine reasoning, not just visual pattern matching.
  2. Video Generation – The authors run seven leading video‑generation models (e.g., Make‑It‑3D, Imagen‑Video, Phenaki) in two settings:
    • T2V – generate directly from the textual prompt.
    • I2V – generate a keyframe image from the prompt, then animate it.
  3. Human Annotation – A panel of scientists rates each generated video on five dimensions that capture scientific correctness and temporal coherence.
  4. VLM‑as‑Judge – A large vision‑language model (e.g., GPT‑4V) is prompted to evaluate the same dimensions. Correlation analysis shows the VLM scores align closely with human judgments, enabling scalable benchmarking.

The pipeline is deliberately lightweight: prompts → model → VLM judge, making it easy for anyone to plug in new video generators.
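
As a rough illustration of plugging a new generator into that pipeline (not the released evaluation code), the loop below uses hypothetical placeholders `generate_video` and `vlm_judge` rather than the benchmark's actual API.

```python
# Sketch of the prompts -> model -> VLM-judge loop.
# `generate_video` and `vlm_judge` are hypothetical callables, not the benchmark's API.

DIMENSIONS = [
    "Prompt Consistency",
    "Phenomenon Congruency",
    "Correct Dynamism",
    "Immutability",
    "Spatio-Temporal Continuity",
]

def evaluate_model(prompts, generate_video, vlm_judge):
    """Generate one video per prompt, then score it on every dimension with the VLM judge."""
    results = []
    for prompt in prompts:
        video_path = generate_video(prompt)          # T2V call (or an I2V keyframe + animation step)
        scores = {
            dim: vlm_judge(video_path, prompt, dim)  # judge returns a numeric score per dimension
            for dim in DIMENSIONS
        }
        results.append({"prompt": prompt, "scores": scores})
    return results
```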

Results & Findings

  • Overall low scientific fidelity – Even the best‑performing model achieved a composite score below 30%, indicating that current systems rarely respect basic physical or chemical laws.
  • Consistent failure modes – Models often get the appearance right but violate dynamics (e.g., objects float when they should fall) or ignore immutability (e.g., a chemical reaction that should be irreversible is shown reversing).
  • Prompt consistency is the easiest dimension – Models can follow the textual description superficially, yet still produce physically impossible motions.
  • VLM judge reliability – Pearson correlation > 0.85 between VLM scores and human ratings across all dimensions, validating the automated evaluation pipeline (a minimal agreement check is sketched after this list).
  • I2V vs. T2V – Image‑to‑video pipelines tend to preserve spatial details better but struggle more with temporal physics, while pure T2V models sometimes capture dynamics at the cost of visual realism.
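
The judge–human agreement reported above boils down to a standard correlation check per dimension. A minimal version with SciPy, using made‑up scores rather than the paper's data, looks like this:

```python
from scipy.stats import pearsonr

# Illustrative, made-up scores for one dimension (not the paper's data).
vlm_scores   = [0.8, 0.3, 0.6, 0.9, 0.4, 0.7]
human_scores = [0.75, 0.35, 0.55, 0.95, 0.45, 0.65]

r, p_value = pearsonr(vlm_scores, human_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")  # the paper reports r > 0.85 against expert ratings
```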

Practical Implications

  • Safety‑critical simulations – Industries such as robotics, autonomous driving, or virtual labs cannot rely on current video generators for accurate physics; VideoScience‑Bench provides a diagnostic tool to assess readiness.
  • Prompt engineering for scientific content – Developers building educational or training videos now have a benchmark to test whether their prompts elicit scientifically plausible outputs.
  • Model selection & fine‑tuning – The multi‑dimensional scores help teams identify which aspects (e.g., dynamism vs. immutability) need targeted data augmentation or architectural tweaks; a short selection example follows this list.
  • Foundation model evaluation – As multimodal foundation models (e.g., GPT‑4V, Gemini) claim “reasoning” abilities, VideoScience‑Bench offers a concrete, downstream task to verify those claims in the visual domain.
  • Dataset creation pipelines – The benchmark’s prompt‑generation methodology can be adapted to other domains (e.g., biology, engineering) to stress‑test generative models on domain‑specific reasoning.
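
For the model‑selection use case, the per‑dimension scores can be compared directly. The snippet below uses hypothetical numbers (not the paper's results) to flag each model's weakest dimension.

```python
# Hypothetical per-model mean scores on three dimensions (illustrative values only).
model_scores = {
    "model_a": {"correct_dynamism": 0.25, "immutability": 0.40, "continuity": 0.60},
    "model_b": {"correct_dynamism": 0.45, "immutability": 0.30, "continuity": 0.55},
}

for model, dims in model_scores.items():
    weakest = min(dims, key=dims.get)  # dimension with the lowest mean score
    print(f"{model}: weakest dimension is {weakest} ({dims[weakest]:.2f})")
```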

Limitations & Future Work

  • Scope limited to undergraduate physics/chemistry – More advanced topics (quantum phenomena, fluid dynamics) remain untested.
  • Static prompt set – While 200 prompts are diverse, they may not capture the full distribution of real‑world scientific scenarios; future work could include procedurally generated prompts.
  • Reliance on a single VLM judge – Although correlation is high, the judge inherits the biases of its training data; ensemble judging or task‑specific fine‑tuning could improve robustness.
  • Evaluation of reasoning depth – The current metrics assess outcome correctness but not the internal reasoning path of the model; probing model internals or using chain‑of‑thought prompts could provide richer insights.

By exposing these gaps, VideoScience‑Bench sets the stage for the next generation of video models that not only look good but also think like scientists.

Authors

  • Lanxiang Hu
  • Abhilash Shankarampeta
  • Yixin Huang
  • Zilin Dai
  • Haoyang Yu
  • Yujie Zhao
  • Haoqiang Kang
  • Daniel Zhao
  • Tajana Rosing
  • Hao Zhang

Paper Information

  • arXiv ID: 2512.02942v1
  • Categories: cs.CV, cs.AI
  • Published: December 2, 2025