[Paper] Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Source: arXiv - 2601.10611v1
Overview
Molmo2 is a new family of open‑weight vision‑language models (VLMs) that push the state‑of‑the‑art for video understanding and, crucially, for pixel‑level grounding (pointing, tracking) across single images, multi‑image sets, and video streams. By releasing both the model weights and the full training data pipeline, the authors give the community a reproducible foundation for building the next generation of video‑centric AI applications.
Key Contributions
- Open‑source data collection: 7 novel video datasets + 2 multi‑image datasets (detailed captions, free‑form Q&A, object‑tracking queries, video‑pointing tasks) gathered without any closed‑source VLMs.
- Molmo2 model family: Scalable architectures up to 8 B parameters, trained with a custom packing and message‑tree encoding scheme that efficiently handles long video sequences.
- Bi‑directional vision‑token attention and a token‑weighting strategy that boost cross‑modal reasoning and grounding accuracy.
- State‑of‑the‑art open‑weight performance: best‑in‑class among open models on short‑video captioning and counting, competitive on long‑video tasks, and ahead of proprietary models on video grounding and tracking benchmarks.
- Comprehensive evaluation suite covering captioning, counting, Q&A, pointing, and tracking on both short and long videos.
Methodology
Data Pipeline
- Curated raw video clips from public sources and annotated them with high‑granularity captions (describing actions, objects, and scene details).
- Built a free‑form video Q&A set where annotators asked natural questions about the clip.
- Designed a video‑pointing dataset: annotators click on objects in frames and provide textual references, enabling models to learn “where” to look.
- Added a complex object‑tracking dataset with multi‑step queries (e.g., “track the red ball after it disappears and reappears”); a sketch of a possible annotation record follows this list.
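The paper describes the pointing and tracking annotations at a high level, but this summary does not reproduce the released schema. Below is a minimal sketch, assuming normalized click coordinates and free‑text labels, of what one annotation record might look like; all field names (`frame_index`, `x`, `y`, `label`, `query`) are illustrative assumptions.

```python
# Hypothetical annotation records for the video-pointing and tracking data.
# Field names and layout are illustrative assumptions, not the released schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class PointAnnotation:
    frame_index: int   # which frame the click refers to
    x: float           # click position, normalized to [0, 1]
    y: float
    label: str         # free-text reference, e.g. "red ball"

@dataclass
class VideoPointingExample:
    video_id: str
    query: str                                       # natural-language pointing query
    points: List[PointAnnotation] = field(default_factory=list)

@dataclass
class TrackingExample:
    video_id: str
    query: str                                       # multi-step query, e.g. re-appearance cases
    track: List[PointAnnotation] = field(default_factory=list)  # one point per visible frame

# Example usage with made-up values:
ex = VideoPointingExample(
    video_id="clip_0001",
    query="Point to the red ball after it reappears.",
    points=[PointAnnotation(frame_index=42, x=0.63, y=0.41, label="red ball")],
)
```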
Model Architecture
- A transformer‑based vision‑language backbone that treats video frames as a sequence of vision tokens.
- Message‑tree encoding packs variable‑length frame sequences into a compact representation, reducing memory overhead.
- Bi‑directional attention lets language tokens attend to vision tokens and vice versa, fostering tighter cross‑modal alignment (see the attention‑mask sketch after this list).
- Token‑weighting strategy assigns higher importance to tokens that are likely to be referenced in grounding tasks (e.g., objects mentioned in the query).
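A minimal sketch of how bi‑directional vision‑token attention could be expressed as an attention mask: vision tokens attend to each other in both directions, while text tokens remain causal. The function name and masking layout are assumptions for illustration; the paper's exact scheme (and its interaction with message‑tree packing) is not reproduced here.

```python
# Minimal sketch of a mixed attention mask: vision tokens attend bidirectionally
# to each other, text tokens stay causal. The exact Molmo2 scheme may differ;
# this only illustrates the idea described above.
import torch

def build_attention_mask(is_vision: torch.Tensor) -> torch.Tensor:
    """is_vision: bool tensor of shape (seq_len,), True where the token is a vision token.
    Returns a (seq_len, seq_len) bool mask where True means 'may attend'."""
    seq_len = is_vision.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Allow full (bidirectional) attention among vision tokens.
    vision_block = is_vision.unsqueeze(0) & is_vision.unsqueeze(1)
    return causal | vision_block

# Example: 4 vision tokens followed by 3 text tokens.
is_vision = torch.tensor([True, True, True, True, False, False, False])
mask = build_attention_mask(is_vision)
# mask[0, 3] is True (an early vision token can see a later one),
# while mask[4, 6] is False (text remains causal).
```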
Training Regimen
- Pre‑training on the large caption dataset to learn generic video‑text alignment.
- Fine‑tuning on the Q&A, pointing, and tracking datasets using a multi‑task loss that balances captioning, classification, and grounding objectives (a loss sketch follows this list).
- Efficient mixed‑precision training on commodity GPUs, making the 8 B model reachable for most research labs.
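A minimal sketch of what a multi‑task objective with per‑token weighting could look like, assuming a simple weighted sum of task losses and hand‑set weights. The weight values, loss names, and weighting rule are placeholders, not the Molmo2 recipe.

```python
# Minimal sketch of a multi-task objective with per-token weighting.
# Loss names, weights, and the weighting rule are illustrative assumptions
# that mirror the description above, not the exact Molmo2 recipe.
import torch
import torch.nn.functional as F

def weighted_lm_loss(logits, targets, token_weights):
    """Cross-entropy over text tokens, up-weighting tokens tied to grounding
    (e.g. coordinates or referenced object names)."""
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1), reduction="none"
    )
    per_token = per_token * token_weights.view(-1)
    return per_token.sum() / token_weights.sum().clamp(min=1.0)

def multi_task_loss(caption_loss, qa_loss, grounding_loss,
                    w_caption=1.0, w_qa=1.0, w_grounding=2.0):
    # Weighted sum of the per-task losses; the weights here are placeholders.
    return w_caption * caption_loss + w_qa * qa_loss + w_grounding * grounding_loss
```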
Results & Findings
| Task | Molmo2‑8B | Qwen3‑VL (open) | Gemini 3 Pro (proprietary) |
|---|---|---|---|
| Video Counting (short, % accuracy) | 35.5 | 29.6 | – |
| Video Pointing (F1) | 38.4 | – | 20.0 |
| Video Tracking (J&F) | 56.2 | – | 41.1 |
| Short‑Video Captioning (BLEU‑4) | State‑of‑the‑art among open models | – | – |
| Long‑Video Understanding | Competitive (within 2–3 % of top closed models) | – | – |
- Molmo2 consistently outperforms existing open‑weight VLMs on grounding‑heavy benchmarks.
- On several tasks (pointing, tracking) it exceeds proprietary baselines, demonstrating that open data + smart training can close the gap.
- Ablation studies show the bi‑directional attention and token‑weighting each contribute ~3–5 % absolute gains on grounding metrics.
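The tracking numbers in the table above are reported in J&F, the standard video‑segmentation metric: J is region similarity (mask IoU averaged over frames), F is boundary accuracy, and J&F is their mean. Below is a minimal sketch of the J component only, for orientation rather than as the official evaluation code.

```python
# Minimal sketch of the J (region similarity) half of the J&F tracking metric:
# per-frame IoU between predicted and ground-truth masks, averaged over frames.
# The F half measures boundary accuracy via contour matching and is omitted here.
import numpy as np

def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: boolean masks of shape (H, W)."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else inter / union

def mean_jaccard(pred_masks, gt_masks) -> float:
    """pred_masks, gt_masks: sequences of boolean (H, W) masks, one per frame."""
    return float(np.mean([jaccard(p, g) for p, g in zip(pred_masks, gt_masks)]))
```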
Practical Implications
- Developer‑ready weights: With the weights and data publicly available, engineers can fine‑tune Molmo2 for domain‑specific video assistants, surveillance analytics, or interactive media applications without licensing costs.
- Enhanced video UI/UX: Point‑and‑click interfaces (e.g., video editors, e‑learning platforms) can leverage a model that understands “where” objects are, enabling features like automatic object tagging, smart clipping, and interactive Q&A over video content; a point‑parsing sketch follows this list.
- Robotics & AR: Real‑time grounding allows robots or AR glasses to follow natural language commands that refer to objects in a live video feed (“hand me the blue mug on the left”).
- Content moderation: Precise grounding helps flag specific frames or regions that violate policies, reducing false positives compared to coarse classification.
- Research acceleration: The released datasets become a benchmark suite for the community, encouraging reproducibility and faster iteration on video‑language research.
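For point‑and‑click integrations, the model's pointing output has to be mapped to pixel coordinates. Assuming Molmo2 follows the earlier Molmo convention of emitting XML‑like `<point x="…" y="…">` tags with coordinates as percentages of the frame size (an assumption; the exact output format may differ), a small parser could look like this:

```python
# Hypothetical parsing of pointing output for a point-and-click UI.
# Assumes the model emits tags like <point x="63.4" y="41.2">red ball</point>
# with coordinates in percent of frame size, following the convention of earlier
# Molmo releases; Molmo2's actual output format may differ.
import re

POINT_RE = re.compile(
    r'<point\s+x="(?P<x>[\d.]+)"\s+y="(?P<y>[\d.]+)">(?P<label>.*?)</point>'
)

def parse_points(text: str, width: int, height: int):
    """Return (pixel_x, pixel_y, label) tuples extracted from model output text."""
    points = []
    for m in POINT_RE.finditer(text):
        x = float(m.group("x")) / 100.0 * width
        y = float(m.group("y")) / 100.0 * height
        points.append((x, y, m.group("label")))
    return points

# Example with a made-up model response:
resp = 'The ball is here: <point x="63.4" y="41.2">red ball</point>'
print(parse_points(resp, width=1280, height=720))  # [(811.52, 296.64, 'red ball')]
```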
Limitations & Future Work
- Scale ceiling: While the 8 B model is strong, it still lags behind the largest proprietary VLMs on very long‑form video reasoning and multi‑modal reasoning that involves audio.
- Compute requirements: Training the full pipeline demands multi‑GPU clusters; smaller labs may need to rely on the provided checkpoints and limited fine‑tuning.
- Domain bias: The datasets, though diverse, are sourced from publicly available videos and may under‑represent niche domains (medical imaging, industrial inspection).
- Future directions suggested by the authors include: integrating audio streams, scaling to >30 B parameters with sparse attention, and expanding the grounding tasks to 3‑D point clouds for mixed‑reality scenarios.
Authors
- Christopher Clark
- Jieyu Zhang
- Zixian Ma
- Jae Sung Park
- Mohammadreza Salehi
- Rohun Tripathi
- Sangho Lee
- Zhongzheng Ren
- Chris Dongjoo Kim
- Yinuo Yang
- Vincent Shao
- Yue Yang
- Weikai Huang
- Ziqi Gao
- Taira Anderson
- Jianrui Zhang
- Jitesh Jain
- George Stoica
- Winson Han
- Ali Farhadi
- Ranjay Krishna
Paper Information
- arXiv ID: 2601.10611v1
- Categories: cs.CV, cs.AI
- Published: January 15, 2026