[Paper] OneThinker: All-in-one Reasoning Model for Image and Video
Source: arXiv - 2512.03043v1
Overview
The paper introduces OneThinker, a single multimodal reasoning model that can handle both images and videos across a wide spectrum of visual tasks—from question answering and captioning to object tracking and segmentation. By training one unified model instead of a collection of task‑specific ones, the authors aim to create a more scalable, versatile “generalist” that can share knowledge across tasks and modalities.
Key Contributions
- All‑in‑one architecture that jointly learns 10 fundamental visual tasks for both images and videos.
- OneThinker‑600k dataset, a curated corpus covering diverse tasks, enriched with chain‑of‑thought (CoT) annotations generated by commercial LLMs.
- OneThinker‑SFT‑340k, a supervised‑fine‑tuning (SFT) starter set that jump‑starts the model with high‑quality reasoning traces.
- EMA‑GRPO algorithm, a novel multi‑task reinforcement‑learning optimizer that balances heterogeneous rewards by tracking task‑wise moving averages of reward standard deviations.
- Extensive evaluation on 31 benchmarks, showing strong performance across all tasks and promising zero‑shot transfer capabilities.
- Open‑source release of code, model weights, and data to foster reproducibility and community extensions.
Methodology
- Unified Data Collection – The authors aggregated existing image‑ and video‑centric datasets (e.g., VQA, MS‑COCO, YouCook2, DAVIS) and harmonized them into a single training corpus of 600k examples. Each example includes the raw visual input, a task label, and a CoT annotation that outlines step‑by‑step reasoning (a schematic record is sketched after this list).
- Supervised Fine‑Tuning (SFT) – A subset of 340k examples with high‑quality CoT traces is used to warm‑start the model. This stage teaches the model how to articulate its reasoning in natural language.
- Multi‑Task Reinforcement Learning – After SFT, the model is further refined with RL to maximize task‑specific metrics (e.g., accuracy for QA, IoU for segmentation). Because each task has a different reward scale, the EMA‑GRPO optimizer computes an exponential moving average of each task’s reward standard deviation and normalizes updates accordingly, preventing any single task from dominating training (see the normalization sketch after this list).
- Model Backbone – OneThinker builds on a large multimodal transformer (vision encoder + language decoder) that processes both static frames and video clips (by treating video as a sequence of frames with temporal positional embeddings). The same parameters are shared across all tasks, enabling knowledge transfer.
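For concreteness, a single record in the unified corpus might be shaped as below. This is an illustrative sketch only; the field names and values are assumptions for exposition, not the released OneThinker‑600k schema.

```python
# Illustrative layout of one unified training record (field names are
# assumptions, not the released OneThinker-600k schema).
example = {
    "media": "videos/cooking_0012.mp4",   # raw visual input: an image or a video clip
    "task": "temporal_grounding",         # one of the unified task labels
    "instruction": "When does the person add the onions to the pan?",
    "cot": "The onions are chopped first, then tipped into the pan shortly "
           "after the oil starts to shimmer, so the action begins around 00:41.",
    "answer": "00:41-00:48",              # task-specific target: time span, box, mask, or text
}
```

Framing every task as an instruction plus a target in this way is what allows a single set of parameters to serve all tasks.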
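The reward‑normalization idea behind EMA‑GRPO can be sketched as follows. This is a minimal illustration of the mechanism described above, assuming a GRPO‑style group of rollouts per prompt; the class name, decay value, and exact placement of the EMA in the policy update are assumptions, not the authors' implementation.

```python
import numpy as np

class EMARewardNormalizer:
    """Track a per-task exponential moving average (EMA) of reward std and
    use it to scale GRPO-style advantages (sketch; not the paper's code)."""

    def __init__(self, decay: float = 0.99, eps: float = 1e-6):
        self.decay = decay
        self.eps = eps
        self.ema_std = {}  # task name -> EMA of reward standard deviation

    def update(self, task: str, rewards: np.ndarray) -> None:
        """Fold the std of one rollout group into the task-wise EMA."""
        std = float(rewards.std())
        prev = self.ema_std.get(task, std)  # initialize with the first observation
        self.ema_std[task] = self.decay * prev + (1.0 - self.decay) * std

    def advantages(self, task: str, rewards: np.ndarray) -> np.ndarray:
        """Center rewards within the group, then scale by the task's EMA std so
        no single task's reward scale dominates the policy update."""
        centered = rewards - rewards.mean()
        scale = self.ema_std.get(task, 1.0)
        return centered / (scale + self.eps)

# Example: two tasks with very different reward scales.
norm = EMARewardNormalizer(decay=0.9)
qa_rewards = np.array([0.0, 1.0, 1.0, 0.0])        # binary accuracy reward
seg_rewards = np.array([0.42, 0.55, 0.61, 0.48])   # IoU-style reward
for task, rewards in [("qa", qa_rewards), ("segmentation", seg_rewards)]:
    norm.update(task, rewards)
    print(task, norm.advantages(task, rewards))
```

Scaling by a task‑wise EMA rather than each group's own standard deviation keeps update magnitudes comparable across tasks whose rewards (e.g., binary accuracy vs. continuous IoU) have very different spreads.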
Results & Findings
- Across‑task performance: OneThinker matches or exceeds state‑of‑the‑art specialized models on 31 benchmarks covering QA, captioning, spatial grounding, temporal grounding, tracking, and segmentation.
- Knowledge transfer: Training on captioning improves video QA, and segmentation data boosts object tracking accuracy, demonstrating cross‑task synergy.
- Zero‑shot generalization: Without additional fine‑tuning, the model can handle unseen tasks (e.g., video‑based visual commonsense reasoning) with reasonable performance, hinting at emergent generalist abilities.
- Efficiency: A single model replaces up to 12 separate task‑specific models, reducing deployment footprint and inference latency when serving multiple visual services.
Practical Implications
- Unified AI services – Companies can expose a single API for a suite of visual capabilities (e.g., “describe this video”, “find the person in frame 42”, “track the ball”), simplifying product architecture and maintenance (a hypothetical wrapper is sketched after this list).
- Cost‑effective scaling – Training and hosting one large model is cheaper than maintaining dozens of specialized models, especially for edge or cloud‑constrained environments.
- Rapid prototyping – Developers can leverage the zero‑shot abilities to prototype new visual tasks (e.g., custom video QA) without collecting large labeled datasets.
- Cross‑modal knowledge reuse – Improvements in one modality (e.g., better video segmentation) automatically benefit related tasks (e.g., video captioning), accelerating iteration cycles.
- Open resources – The released dataset and code provide a ready‑made foundation for building domain‑specific extensions (e.g., medical imaging, autonomous driving) with minimal additional data.
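To make the single‑API point concrete, the sketch below routes every request through one prompt‑conditioned call. The function, method, and model names are placeholders, not OneThinker's released serving code.

```python
# Hypothetical serving wrapper: one prompt-conditioned call replaces a set of
# task-specific endpoints. All names here are placeholders, not a released API.
def visual_service(model, media_path: str, instruction: str) -> str:
    """Run a unified image/video reasoning model on a single instruction."""
    return model.generate(media=media_path, prompt=instruction)

# The same endpoint can cover captioning-, grounding-, and tracking-style requests:
#   visual_service(model, "clip.mp4", "Describe this video.")
#   visual_service(model, "clip.mp4", "Find the person in frame 42.")
#   visual_service(model, "clip.mp4", "Track the ball across the clip.")
```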
Limitations & Future Work
- Reward heterogeneity handling – While EMA‑GRPO balances task rewards, it still relies on manually chosen hyper‑parameters (e.g., decay rates) that may need retuning for new tasks.
- Temporal resolution – The model processes video as a fixed‑length frame sequence; very long or high‑fps videos could strain memory and may require hierarchical temporal modeling.
- Domain bias – The training corpus, though large, is dominated by publicly available datasets; performance on niche domains (e.g., satellite imagery) remains untested.
- Explainability – Although CoT annotations improve interpretability, the internal reasoning of the transformer is still a black box; future work could integrate more explicit reasoning modules.
- Continual learning – Extending OneThinker to continually absorb new tasks without catastrophic forgetting is an open research direction.
OneThinker marks a significant step toward a truly multimodal, multi‑task AI assistant that can reason about both images and videos with a single, reusable model. Its open‑source release invites the community to push the boundaries of unified visual reasoning further.
Authors
- Kaituo Feng
- Manyuan Zhang
- Hongyu Li
- Kaixuan Fan
- Shuang Chen
- Yilei Jiang
- Dian Zheng
- Peiwen Sun
- Yiyuan Zhang
- Haoze Sun
- Yan Feng
- Peng Pei
- Xunliang Cai
- Xiangyu Yue
Paper Information
- arXiv ID: 2512.03043v1
- Categories: cs.CV
- Published: December 2, 2025