[Paper] Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Source: arXiv - 2601.10611v1
Overview
Molmo2 is a new family of open‑weight vision‑language models (VLMs) that push the state‑of‑the‑art for video understanding and, crucially, for pixel‑level grounding (pointing, tracking) across single images, multi‑image sets, and video streams. By releasing both the model weights and the full training data pipeline, the authors give the community a reproducible foundation for building the next generation of video‑centric AI applications.
Key Contributions
- Open‑source data collection: 7 novel video datasets + 2 multi‑image datasets (detailed captions, free‑form Q&A, object‑tracking queries, video‑pointing tasks) gathered without any closed‑source VLMs.
- Molmo2 model family: Scalable architectures up to 8 B parameters, trained with a custom packing and message‑tree encoding scheme that efficiently handles long video sequences.
- Bi‑directional vision‑token attention and a token‑weighting strategy that boost cross‑modal reasoning and grounding accuracy.
- State‑of‑the‑art open‑weight performance: best‑in‑class among open models on short‑video captioning and counting, competitive on long‑video tasks, and ahead of proprietary models on video grounding and tracking benchmarks.
- Comprehensive evaluation suite covering captioning, counting, Q&A, pointing, and tracking on both short and long videos.
Methodology
Data Pipeline
- Curated raw video clips from public sources and annotated them with high‑granularity captions (describing actions, objects, and scene details).
- Built a free‑form video Q&A set where annotators asked natural questions about the clip.
- Designed a video‑pointing dataset: annotators click on objects in frames and provide textual references, enabling models to learn “where” to look.
- Added a complex object‑tracking dataset with multi‑step queries (e.g., “track the red ball after it disappears and reappears”); a sketch of a possible annotation record follows this list.
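The paper describes the pointing and tracking annotations at a high level, but this summary does not reproduce the released schema. Below is a minimal sketch, assuming normalized click coordinates and free‑text labels, of what one annotation record might look like; all field names (`frame_index`, `x`, `y`, `label`, `query`) are illustrative assumptions.

```python
# Hypothetical annotation records for the video-pointing and tracking data.
# Field names and layout are illustrative assumptions, not the released schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class PointAnnotation:
    frame_index: int   # which frame the click refers to
    x: float           # click position, normalized to [0, 1]
    y: float
    label: str         # free-text reference, e.g. "red ball"

@dataclass
class VideoPointingExample:
    video_id: str
    query: str                                       # natural-language pointing query
    points: List[PointAnnotation] = field(default_factory=list)

@dataclass
class TrackingExample:
    video_id: str
    query: str                                       # multi-step query, e.g. re-appearance cases
    track: List[PointAnnotation] = field(default_factory=list)  # one point per visible frame

# Example usage with made-up values:
ex = VideoPointingExample(
    video_id="clip_0001",
    query="Point to the red ball after it reappears.",
    points=[PointAnnotation(frame_index=42, x=0.63, y=0.41, label="red ball")],
)
```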
Model Architecture
- A transformer‑based vision‑language backbone that treats video frames as a sequence of vision tokens.
- Message‑tree encoding packs variable‑length frame sequences into a compact representation, reducing memory overhead.
- Bi‑directional attention lets language tokens attend to vision tokens and vice versa, fostering tighter cross‑modal alignment (see the attention‑mask sketch after this list).
- Token‑weighting strategy assigns higher importance to tokens that are likely to be referenced in grounding tasks (e.g., objects mentioned in the query).
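A minimal sketch of how bi‑directional vision‑token attention could be expressed as an attention mask: vision tokens attend to each other in both directions, while text tokens remain causal. The function name and masking layout are assumptions for illustration; the paper's exact scheme (and its interaction with message‑tree packing) is not reproduced here.

```python
# Minimal sketch of a mixed attention mask: vision tokens attend bidirectionally
# to each other, text tokens stay causal. The exact Molmo2 scheme may differ;
# this only illustrates the idea described above.
import torch

def build_attention_mask(is_vision: torch.Tensor) -> torch.Tensor:
    """is_vision: bool tensor of shape (seq_len,), True where the token is a vision token.
    Returns a (seq_len, seq_len) bool mask where True means 'may attend'."""
    seq_len = is_vision.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Allow full (bidirectional) attention among vision tokens.
    vision_block = is_vision.unsqueeze(0) & is_vision.unsqueeze(1)
    return causal | vision_block

# Example: 4 vision tokens followed by 3 text tokens.
is_vision = torch.tensor([True, True, True, True, False, False, False])
mask = build_attention_mask(is_vision)
# mask[0, 3] is True (an early vision token can see a later one),
# while mask[4, 6] is False (text remains causal).
```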
Training Regimen
- Pre‑training on the large caption dataset to learn generic video‑text alignment.
- Fine‑tuning on the Q&A, pointing, and tracking datasets using a multi‑task loss that balances captioning, classification, and grounding objectives (a loss sketch follows this list).
- Efficient mixed‑precision training on commodity GPUs, making the 8 B model reachable for most research labs.
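A minimal sketch of what a multi‑task objective with per‑token weighting could look like, assuming a simple weighted sum of task losses and hand‑set weights. The weight values, loss names, and weighting rule are placeholders, not the Molmo2 recipe.

```python
# Minimal sketch of a multi-task objective with per-token weighting.
# Loss names, weights, and the weighting rule are illustrative assumptions
# that mirror the description above, not the exact Molmo2 recipe.
import torch
import torch.nn.functional as F

def weighted_lm_loss(logits, targets, token_weights):
    """Cross-entropy over text tokens, up-weighting tokens tied to grounding
    (e.g. coordinates or referenced object names)."""
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1), reduction="none"
    )
    per_token = per_token * token_weights.view(-1)
    return per_token.sum() / token_weights.sum().clamp(min=1.0)

def multi_task_loss(caption_loss, qa_loss, grounding_loss,
                    w_caption=1.0, w_qa=1.0, w_grounding=2.0):
    # Weighted sum of the per-task losses; the weights here are placeholders.
    return w_caption * caption_loss + w_qa * qa_loss + w_grounding * grounding_loss
```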
Results & Findings
| Task | Molmo2‑8B | Qwen3‑VL (open) | Gemini 3 Pro (proprietary) |
|---|---|---|---|
| Video Counting (short, % accuracy) | 35.5 | 29.6 | – |
| Video Pointing (F1) | 38.4 | – | 20.0 |
| Video Tracking (J&F) | 56.2 | – | 41.1 |
| Short‑Video Captioning (BLEU‑4) | State‑of‑the‑art among open models | – | – |
| Long‑Video Understanding | Competitive (within 2–3 % of top closed models) | – | – |
- Molmo2 consistently outperforms existing open‑weight VLMs on grounding‑heavy benchmarks.
- On several tasks (pointing, tracking) it exceeds proprietary baselines, demonstrating that open data + smart training can close the gap.
- Ablation studies show the bi‑directional attention and token‑weighting each contribute ~3–5 % absolute gains on grounding metrics.
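The tracking numbers in the table above are reported in J&F, the standard video‑segmentation metric: J is region similarity (mask IoU averaged over frames), F is boundary accuracy, and J&F is their mean. Below is a minimal sketch of the J component only, for orientation rather than as the official evaluation code.

```python
# Minimal sketch of the J (region similarity) half of the J&F tracking metric:
# per-frame IoU between predicted and ground-truth masks, averaged over frames.
# The F half measures boundary accuracy via contour matching and is omitted here.
import numpy as np

def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: boolean masks of shape (H, W)."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else inter / union

def mean_jaccard(pred_masks, gt_masks) -> float:
    """pred_masks, gt_masks: sequences of boolean (H, W) masks, one per frame."""
    return float(np.mean([jaccard(p, g) for p, g in zip(pred_masks, gt_masks)]))
```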
Practical Implications
- Developer‑ready weights: With the weights and data publicly available, engineers can fine‑tune Molmo2 for domain‑specific video assistants, surveillance analytics, or interactive media applications without licensing costs.
- Enhanced video UI/UX: Point‑and‑click interfaces (e.g., video editors, e‑learning platforms) can leverage a model that understands “where” objects are, enabling features like automatic object tagging, smart clipping, and interactive Q&A over video content; a point‑parsing sketch follows this list.
- Robotics & AR: Real‑time grounding allows robots or AR glasses to follow natural language commands that refer to objects in a live video feed (“hand me the blue mug on the left”).
- Content moderation: Precise grounding helps flag specific frames or regions that violate policies, reducing false positives compared to coarse classification.
- Research acceleration: The released datasets become a benchmark suite for the community, encouraging reproducibility and faster iteration on video‑language research.
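For point‑and‑click integrations, the model's pointing output has to be mapped to pixel coordinates. Assuming Molmo2 follows the earlier Molmo convention of emitting XML‑like `<point x="…" y="…">` tags with coordinates as percentages of the frame size (an assumption; the exact output format may differ), a small parser could look like this:

```python
# Hypothetical parsing of pointing output for a point-and-click UI.
# Assumes the model emits tags like <point x="63.4" y="41.2">red ball</point>
# with coordinates in percent of frame size, following the convention of earlier
# Molmo releases; Molmo2's actual output format may differ.
import re

POINT_RE = re.compile(
    r'<point\s+x="(?P<x>[\d.]+)"\s+y="(?P<y>[\d.]+)">(?P<label>.*?)</point>'
)

def parse_points(text: str, width: int, height: int):
    """Return (pixel_x, pixel_y, label) tuples extracted from model output text."""
    points = []
    for m in POINT_RE.finditer(text):
        x = float(m.group("x")) / 100.0 * width
        y = float(m.group("y")) / 100.0 * height
        points.append((x, y, m.group("label")))
    return points

# Example with a made-up model response:
resp = 'The ball is here: <point x="63.4" y="41.2">red ball</point>'
print(parse_points(resp, width=1280, height=720))  # [(811.52, 296.64, 'red ball')]
```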
Limitations & Future Work
- Scale ceiling: While the 8 B model is strong, it still lags behind the largest proprietary VLMs on very long‑form video reasoning and multi‑modal reasoning that involves audio.
- Compute requirements: Training the full pipeline demands multi‑GPU clusters; smaller labs may need to rely on the provided checkpoints and limited fine‑tuning.
- Domain bias: The datasets, though diverse, are sourced from publicly available videos and may under‑represent niche domains (medical imaging, industrial inspection).
- Future directions suggested by the authors include: integrating audio streams, scaling to >30 B parameters with sparse attention, and expanding the grounding tasks to 3‑D point clouds for mixed‑reality scenarios.
Authors
- Christopher Clark
- Jieyu Zhang
- Zixian Ma
- Jae Sung Park
- Mohammadreza Salehi
- Rohun Tripathi
- Sangho Lee
- Zhongzheng Ren
- Chris Dongjoo Kim
- Yinuo Yang
- Vincent Shao
- Yue Yang
- Weikai Huang
- Ziqi Gao
- Taira Anderson
- Jianrui Zhang
- Jitesh Jain
- George Stoica
- Winson Han
- Ali Farhadi
- Ranjay Krishna
Paper Information
- arXiv ID: 2601.10611v1
- Categories: cs.CV, cs.AI
- Published: January 15, 2026