[Paper] OneThinker: All-in-one Reasoning Model for Image and Video
Source: arXiv - 2512.03043v1
Overview
The paper introduces OneThinker, a single multimodal reasoning model that can handle both images and videos across a wide spectrum of visual tasks—from question answering and captioning to object tracking and segmentation. By training one unified model instead of a collection of task‑specific ones, the authors aim to create a more scalable, versatile “generalist” that can share knowledge across tasks and modalities.
Key Contributions
- All‑in‑one architecture that jointly learns 10 fundamental visual tasks for both images and videos.
- OneThinker‑600k dataset, a curated corpus covering diverse tasks, enriched with chain‑of‑thought (CoT) annotations generated by commercial LLMs.
- OneThinker‑SFT‑340k, a supervised‑fine‑tuning (SFT) starter set that jump‑starts the model with high‑quality reasoning traces.
- EMA‑GRPO algorithm, a novel multi‑task reinforcement‑learning optimizer that balances heterogeneous rewards by tracking task‑wise moving averages of reward standard deviations.
- Extensive evaluation on 31 benchmarks, showing strong performance across all tasks and promising zero‑shot transfer capabilities.
- Open‑source release of code, model weights, and data to foster reproducibility and community extensions.
Methodology
- Unified Data Collection – The authors aggregated existing image‑ and video‑centric datasets (e.g., VQA, MS‑COCO, YouCook2, DAVIS) and harmonized them into a single training corpus of 600k examples. Each example includes the raw visual input, a task label, and a CoT annotation that outlines step‑by‑step reasoning (a schematic record is sketched after this list).
- Supervised Fine‑Tuning (SFT) – A subset of 340k examples with high‑quality CoT traces is used to warm‑start the model. This stage teaches the model how to articulate its reasoning in natural language.
- Multi‑Task Reinforcement Learning – After SFT, the model is further refined with RL to maximize task‑specific metrics (e.g., accuracy for QA, IoU for segmentation). Because each task has a different reward scale, the EMA‑GRPO optimizer computes an exponential moving average of each task’s reward standard deviation and normalizes updates accordingly, preventing any single task from dominating training (see the normalization sketch after this list).
- Model Backbone – OneThinker builds on a large multimodal transformer (vision encoder + language decoder) that processes both static frames and video clips (by treating video as a sequence of frames with temporal positional embeddings). The same parameters are shared across all tasks, enabling knowledge transfer.
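For concreteness, a single record in the unified corpus might be shaped as below. This is an illustrative sketch only; the field names and values are assumptions for exposition, not the released OneThinker‑600k schema.

```python
# Illustrative layout of one unified training record (field names are
# assumptions, not the released OneThinker-600k schema).
example = {
    "media": "videos/cooking_0012.mp4",   # raw visual input: an image or a video clip
    "task": "temporal_grounding",         # one of the unified task labels
    "instruction": "When does the person add the onions to the pan?",
    "cot": "The onions are chopped first, then tipped into the pan shortly "
           "after the oil starts to shimmer, so the action begins around 00:41.",
    "answer": "00:41-00:48",              # task-specific target: time span, box, mask, or text
}
```

Framing every task as an instruction plus a target in this way is what allows a single set of parameters to serve all tasks.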
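The reward‑normalization idea behind EMA‑GRPO can be sketched as follows. This is a minimal illustration of the mechanism described above, assuming a GRPO‑style group of rollouts per prompt; the class name, decay value, and exact placement of the EMA in the policy update are assumptions, not the authors' implementation.

```python
import numpy as np

class EMARewardNormalizer:
    """Track a per-task exponential moving average (EMA) of reward std and
    use it to scale GRPO-style advantages (sketch; not the paper's code)."""

    def __init__(self, decay: float = 0.99, eps: float = 1e-6):
        self.decay = decay
        self.eps = eps
        self.ema_std = {}  # task name -> EMA of reward standard deviation

    def update(self, task: str, rewards: np.ndarray) -> None:
        """Fold the std of one rollout group into the task-wise EMA."""
        std = float(rewards.std())
        prev = self.ema_std.get(task, std)  # initialize with the first observation
        self.ema_std[task] = self.decay * prev + (1.0 - self.decay) * std

    def advantages(self, task: str, rewards: np.ndarray) -> np.ndarray:
        """Center rewards within the group, then scale by the task's EMA std so
        no single task's reward scale dominates the policy update."""
        centered = rewards - rewards.mean()
        scale = self.ema_std.get(task, 1.0)
        return centered / (scale + self.eps)

# Example: two tasks with very different reward scales.
norm = EMARewardNormalizer(decay=0.9)
qa_rewards = np.array([0.0, 1.0, 1.0, 0.0])        # binary accuracy reward
seg_rewards = np.array([0.42, 0.55, 0.61, 0.48])   # IoU-style reward
for task, rewards in [("qa", qa_rewards), ("segmentation", seg_rewards)]:
    norm.update(task, rewards)
    print(task, norm.advantages(task, rewards))
```

Scaling by a task‑wise EMA rather than each group's own standard deviation keeps update magnitudes comparable across tasks whose rewards (e.g., binary accuracy vs. continuous IoU) have very different spreads.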
Results & Findings
- Across‑task performance: OneThinker matches or exceeds state‑of‑the‑art specialized models on 31 benchmarks covering QA, captioning, spatial grounding, temporal grounding, tracking, and segmentation.
- Knowledge transfer: Training on captioning improves video QA, and segmentation data boosts object tracking accuracy, demonstrating cross‑task synergy.
- Zero‑shot generalization: Without additional fine‑tuning, the model can handle unseen tasks (e.g., video‑based visual commonsense reasoning) with reasonable performance, hinting at emergent generalist abilities.
- Efficiency: A single model replaces up to 12 separate task‑specific models, reducing deployment footprint and inference latency when serving multiple visual services.
Practical Implications
- Unified AI services – Companies can expose a single API for a suite of visual capabilities (e.g., “describe this video”, “find the person in frame 42”, “track the ball”), simplifying product architecture and maintenance (a hypothetical wrapper is sketched after this list).
- Cost‑effective scaling – Training and hosting one large model is cheaper than maintaining dozens of specialized models, especially for edge or cloud‑constrained environments.
- Rapid prototyping – Developers can leverage the zero‑shot abilities to prototype new visual tasks (e.g., custom video QA) without collecting large labeled datasets.
- Cross‑modal knowledge reuse – Improvements in one modality (e.g., better video segmentation) automatically benefit related tasks (e.g., video captioning), accelerating iteration cycles.
- Open resources – The released dataset and code provide a ready‑made foundation for building domain‑specific extensions (e.g., medical imaging, autonomous driving) with minimal additional data.
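To make the single‑API point concrete, the sketch below routes every request through one prompt‑conditioned call. The function, method, and model names are placeholders, not OneThinker's released serving code.

```python
# Hypothetical serving wrapper: one prompt-conditioned call replaces a set of
# task-specific endpoints. All names here are placeholders, not a released API.
def visual_service(model, media_path: str, instruction: str) -> str:
    """Run a unified image/video reasoning model on a single instruction."""
    return model.generate(media=media_path, prompt=instruction)

# The same endpoint can cover captioning-, grounding-, and tracking-style requests:
#   visual_service(model, "clip.mp4", "Describe this video.")
#   visual_service(model, "clip.mp4", "Find the person in frame 42.")
#   visual_service(model, "clip.mp4", "Track the ball across the clip.")
```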
Limitations & Future Work
- Reward heterogeneity handling – While EMA‑GRPO balances task rewards, it still relies on manually chosen hyper‑parameters (e.g., decay rates) that may need retuning for new tasks.
- Temporal resolution – The model processes video as a fixed‑length frame sequence; very long or high‑fps videos could strain memory and may require hierarchical temporal modeling.
- Domain bias – The training corpus, though large, is dominated by publicly available datasets; performance on niche domains (e.g., satellite imagery) remains untested.
- Explainability – Although CoT annotations improve interpretability, the internal reasoning of the transformer is still a black box; future work could integrate more explicit reasoning modules.
- Continual learning – Extending OneThinker to continually absorb new tasks without catastrophic forgetting is an open research direction.
OneThinker marks a significant step toward a truly multimodal, multi‑task AI assistant that can reason about both images and videos with a single, reusable model. Its open‑source release invites the community to push the boundaries of unified visual reasoning further.
Authors
- Kaituo Feng
- Manyuan Zhang
- Hongyu Li
- Kaixuan Fan
- Shuang Chen
- Yilei Jiang
- Dian Zheng
- Peiwen Sun
- Yiyuan Zhang
- Haoze Sun
- Yan Feng
- Peng Pei
- Xunliang Cai
- Xiangyu Yue
Paper Information
- arXiv ID: 2512.03043v1
- Categories: cs.CV
- Published: December 2, 2025