[Paper] MME-CoF-Pro: Evaluating Reasoning Coherence in Video Generative Models with Text and Visual Hints

Published: March 21, 2026, 02:59 AM GMT+9

Source: arXiv - 2603.20194v1

Overview

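The paper introduces MME‑CoF‑Pro, a new benchmark designed to evaluate reasoning coherence in video generative models, that is, whether the events a model generates are causally consistent across frames. By supplying textual and visual “hints” during generation, the authors expose how current models handle intermediate reasoning steps, which matters when deploying video AI in real-world systems.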

Key Contributions

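Benchmark dataset: 303 video samples spanning 16 categories (such as logic puzzles, science scenarios, and everyday actions), each with a carefully annotated ground-truth reasoning chain.
Reasoning Score metric: quantifies the correctness of intermediate, process-level steps rather than final-frame quality alone.
Three hint settings:
1. No hint – pure generation.
2. Text hint – a natural-language cue describing the intended outcome.
3. Visual hint – a short reference clip or keyframe.
Comprehensive evaluation: tests 7 open- and closed-source video models and reveals a systematic gap between visual fidelity and logical coherence.
Insightful analysis: shows that hints can improve apparent correctness while also inducing hallucinations or contradictory reasoning.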

Methodology

Dataset construction – Curators selected diverse scenarios where a clear causal chain is required (e.g., “a ball rolls down a ramp and knocks over a cup”). Each sample includes:
- A reference video (ground truth).
- A text description of the goal.
- Optional visual hint (a short clip of the initial state).
Reasoning Score calculation – Human annotators break down each video into a sequence of logical steps (e.g., “ball accelerates → reaches edge → falls”). The model‑generated video is automatically aligned to these steps using frame‑level similarity and then scored for presence, order, and causal correctness.
Evaluation pipeline – For each model and hint condition, the authors generate a video, compute the Reasoning Score, and also report standard visual quality metrics (FID, CLIP‑Score). This dual view isolates coherence from raw image quality.
Analysis – Statistical tests compare scores across hints and model families, and qualitative case studies illustrate typical failure modes (e.g., “hallucinated” objects that appear out of nowhere).
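The scoring step described above can be illustrated with a minimal sketch. It assumes the alignment stage has already produced a list of step labels detected in the generated video, in order of appearance. The function names, the equal weighting of presence and order, and the use of a longest common subsequence to measure order are all illustrative assumptions, not the paper's actual formula. Causal correctness is left out because this summary does not say how it is judged.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def reasoning_score_sketch(reference: Sequence[str], detected: Sequence[str]) -> float:
    """Hypothetical score in [0, 1] combining step presence and step order.

    presence: fraction of distinct reference steps that appear at all.
    order:    fraction of the reference chain recovered in the right order.
    The 50/50 weighting is an assumption made for illustration only.
    """
    if not reference:
        raise ValueError("reference chain must not be empty")
    distinct = set(reference)
    presence = len(distinct & set(detected)) / len(distinct)
    order = lcs_length(reference, detected) / len(reference)
    return 0.5 * presence + 0.5 * order


# Reference chain taken from the example in the text.
chain = ["ball accelerates", "reaches edge", "falls"]
print(reasoning_score_sketch(chain, chain))  # 1.0
print(reasoning_score_sketch(chain, ["falls", "ball accelerates"]))
```

In the second call two of the three steps are present but appear in the wrong order, so the order term drops further than the presence term. A final-frame metric would not capture that distinction.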

Results & Findings

Hint setting	Mean Reasoning Score	Visual quality (FID)
No hint	0.31	45.2
Text hint	0.38 (↑ 24%)	42.7 (slightly ↓)
Visual hint	0.44 (↑ 42%)	48.9 (↑ 9)
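The percentage gains in the table are relative changes against the no-hint baseline. The short check below recomputes them from the rounded scores shown. The helper name is an assumption for illustration. From the displayed values, the text hint works out to roughly 23% and the visual hint to roughly 42%. The small gap from the listed 24% may simply reflect rounding of the displayed scores.

```python
def relative_change_pct(baseline: float, value: float) -> float:
    """Relative change of value against baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline * 100.0


# Mean Reasoning Scores as displayed in the table (rounded to two decimals).
no_hint = 0.31
scores = {"text hint": 0.38, "visual hint": 0.44}

for setting, score in scores.items():
    print(f"{setting}: {relative_change_pct(no_hint, score):+.1f}%")
```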

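Overall coherence is weak: even the best-performing model scores 0.5 or lower, meaning that half or more of the required reasoning steps are missing or wrong.
Decoupled from visual quality: videos with low FID (i.e., visually strong) often have the lowest Reasoning Scores, confirming that visual realism does not guarantee logical coherence.
Text hints help but can also mislead: models “guess” the final outcome and produce plausible frames while ignoring causality, leading to hallucinations such as objects appearing without any cause.
Visual hints excel on structured tasks (e.g., block stacking) but still fall short on fine-grained perception, such as subtle object interactions or physics-based deformations.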
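One way to check the decoupling claim on your own evaluation runs is a rank correlation between FID and Reasoning Score across videos. The sketch below implements Spearman's correlation in plain Python. The function names are assumptions, and the sample numbers are made-up placeholders, not results from the paper. Lower FID is better and a higher Reasoning Score is better, so a strongly negative value would mean visual quality and coherence move together. A value near zero or above would be consistent with the decoupling described above.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(ranks(x), ranks(y))


# Placeholder values for illustration only, NOT taken from the paper.
fid_per_video = [38.0, 41.5, 44.0, 47.2, 52.3]
reasoning_per_video = [0.22, 0.35, 0.30, 0.41, 0.33]
print(f"Spearman rho: {spearman(fid_per_video, reasoning_per_video):.2f}")
```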

Practical Implications

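Safety‑critical pipelines (e.g., simulation for autonomous driving or robotics) need explicit coherence checks; relying on visual fidelity alone can mask dangerous logical errors.
Prompt engineering for video generation should include visual context where possible, and developers should be aware that text prompts can trigger “shortcut” reasoning.
Model selection: teams should benchmark both FID/CLIP‑Score and the Reasoning Score before integrating a video generator into downstream tasks such as training reinforcement-learning agents or producing educational content.
Tooling: the benchmark's API can be wrapped in a CI pipeline to automatically flag regressions in reasoning coherence as models evolve.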
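The CI idea can be sketched as a simple regression gate. The benchmark's API is not described in this summary, so the sketch assumes the Reasoning Scores have already been computed and loaded as plain dictionaries keyed by hint setting. The function names, the tolerance value, and the example scores are hypothetical.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """List hint settings whose Reasoning Score dropped by more than tolerance."""
    regressions = []
    for setting, old in baseline.items():
        new = current.get(setting)
        if new is None:
            regressions.append(f"{setting}: missing from current run")
        elif old - new > tolerance:
            regressions.append(f"{setting}: {old:.2f} -> {new:.2f}")
    return regressions


def ci_exit_code(baseline: Dict[str, float], current: Dict[str, float]) -> int:
    """Return 1 (fail the build) if any setting regressed, else 0."""
    problems = find_regressions(baseline, current)
    for line in problems:
        print(f"Reasoning Score regression: {line}")
    return 1 if problems else 0


# Illustrative numbers only; a real job would load these from stored results.
previous_release = {"no_hint": 0.31, "text_hint": 0.38, "visual_hint": 0.44}
candidate = {"no_hint": 0.32, "text_hint": 0.33, "visual_hint": 0.44}
print("exit code:", ci_exit_code(previous_release, candidate))
```

In a real pipeline the script would end with `sys.exit(ci_exit_code(...))` so that the CI system marks the run as failed. The tolerance absorbs run-to-run noise from stochastic generation and would need tuning for each model.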

Limitations & Future Work

Dataset size: 303 samples, while diverse, remain modest; scaling to thousands of scenarios could surface subtler failure modes.
Human‑annotated reasoning chains introduce subjectivity; future work may explore automated logical graph extraction.
Model coverage: Only seven models were evaluated; newer diffusion‑based video generators and multimodal transformers are not yet included.
Hint granularity: The study compares only three coarse settings (no hint, text hint, visual hint); intermediate levels (e.g., partial captions) could provide a richer understanding of guidance mechanisms.

Bottom line: MME‑CoF‑Pro shines a light on a blind spot in video AI—logical consistency across time. For developers building systems that reason with video, this benchmark offers a practical yardstick to ensure that generated content does more than look good; it also makes sense.

Authors

Yu Qi
Xinyi Xu
Ziyu Guo
Siyuan Ma
Renrui Zhang
Xinyan Chen
Ruichuan An
Ruofan Xing
Jiayi Zhang
Haojie Huang
Pheng-Ann Heng
Jonathan Tremblay
Lawson L. S. Wong

Paper Info

arXiv ID: 2603.20194v1
Category: cs.CV
Publication date: March 20, 2026