[Paper] Action100M: A Large-scale Video Action Dataset

Published: January 15, 2026 at 12:02 PM EST
4 min read
Source: arXiv - 2601.10592v1

Overview

The paper introduces Action100M, a massive, automatically‑curated video‑action dataset built from 1.2 million instructional videos (over 14 years of footage). By providing roughly 100 million temporally localized segments with open‑vocabulary action labels and rich, hierarchical captions, the authors aim to give the research community a “foundation” resource for training and evaluating video‑understanding models at scale.

Key Contributions

  • Scale‑first dataset: ~100 M annotated video segments covering a wide spectrum of everyday actions, far surpassing existing video‑action corpora.
  • Fully automated pipeline: Combines hierarchical temporal segmentation (V‑JEPA 2), multi‑level caption generation (Tree‑of‑Captions), and a large‑scale reasoning model (GPT‑OSS‑120B) with a Self‑Refine loop to produce high‑quality, structured annotations without human labeling.
  • Open‑vocabulary supervision: Action labels and captions are not limited to a fixed taxonomy, enabling models to learn from natural language descriptions.
  • Demonstrated utility: Training the VL‑JEPA vision‑language model on Action100M yields consistent performance gains and strong zero‑shot results on several downstream action‑recognition benchmarks.
  • Public release: The dataset and the code for the annotation pipeline are made available to the community, encouraging reproducibility and further scaling efforts.

Methodology

  1. Data collection – Harvested 1.2 M publicly available instructional videos from the web, covering domains such as cooking, DIY, and fitness.
  2. Hierarchical temporal segmentation – Using V‑JEPA 2 embeddings (a self‑supervised video encoder), the videos are recursively split into coherent sub‑segments, producing a tree‑like temporal structure (a minimal sketch of this recursive splitting appears after this list).
  3. Tree‑of‑Captions generation – For each segment and its parent frames, a captioning model creates both brief and detailed textual descriptions, forming a multi‑level “caption tree.”
  4. Reasoning & structuring – A 120‑billion‑parameter language model (GPT‑OSS‑120B) ingests the raw captions and performs multi‑round Self‑Refine: it validates, merges, and restructures the information into a consistent annotation schema (action verb, actor, short/long captions).
  5. Dataset assembly – The final output is a set of temporally localized video clips, each paired with a structured, open‑vocabulary label and a hierarchy of natural‑language captions.
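
The authors' actual components (V‑JEPA 2 features, the captioner, GPT‑OSS‑120B) are not reproduced here, but the recursive splitting in step 2 can be sketched generically. In the Python sketch below, SegmentNode, best_split, and build_tree are illustrative names, and the embedding‑dissimilarity criterion is an assumption standing in for whatever boundary criterion the pipeline actually uses.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class SegmentNode:
    """One node of the temporal tree: a [start, end) frame span plus captions."""
    start: int                      # first frame index (inclusive)
    end: int                        # last frame index (exclusive)
    short_caption: str = ""         # filled in later by a captioning model
    long_caption: str = ""
    children: List["SegmentNode"] = field(default_factory=list)

def best_split(emb: np.ndarray, start: int, end: int, min_len: int) -> int:
    """Pick the boundary t that makes the two halves of [start, end) most
    dissimilar, measured by cosine similarity of their mean frame embeddings
    (an assumed stand-in for the paper's actual boundary criterion)."""
    best_t, best_cos = start + min_len, np.inf
    for t in range(start + min_len, end - min_len + 1):
        left = emb[start:t].mean(axis=0)
        right = emb[t:end].mean(axis=0)
        cos = float(left @ right) / (np.linalg.norm(left) * np.linalg.norm(right) + 1e-8)
        if cos < best_cos:          # lower similarity -> sharper boundary
            best_t, best_cos = t, cos
    return best_t

def build_tree(emb: np.ndarray, start: int, end: int,
               min_len: int = 8, max_depth: int = 4, depth: int = 0) -> SegmentNode:
    """Recursively split [start, end) into a tree of coherent sub-segments."""
    node = SegmentNode(start, end)
    if depth < max_depth and end - start >= 2 * min_len:
        t = best_split(emb, start, end, min_len)
        node.children = [
            build_tree(emb, start, t, min_len, max_depth, depth + 1),
            build_tree(emb, t, end, min_len, max_depth, depth + 1),
        ]
    return node

# Toy usage: 128 frames of random 256-d vectors stand in for V-JEPA 2 features.
frame_emb = np.random.randn(128, 256).astype(np.float32)
root = build_tree(frame_emb, 0, len(frame_emb))
```

A captioning model would then visit every node of such a tree and fill in the short/long captions, yielding the multi‑level "caption tree" described in step 3.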

The entire pipeline runs without human intervention, making it feasible to scale to hundreds of millions of examples.
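
For concreteness, one clip‑level record following the schema sketched in steps 4–5 (action verb, actor, short/long captions, temporal bounds) might look roughly like the following. The field names and example values are illustrative assumptions, not the released Action100M format.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class ActionAnnotation:
    """One temporally localized segment with open-vocabulary labels.
    Field names are assumptions for illustration, not Action100M's schema."""
    video_id: str
    start_sec: float
    end_sec: float
    action_verb: str          # open-vocabulary, e.g. "whisk"
    actor: str                # who performs the action
    short_caption: str        # terse description of the segment
    long_caption: str         # detailed description from the caption tree

example = ActionAnnotation(
    video_id="hypothetical_video_0001",
    start_sec=12.4,
    end_sec=19.8,
    action_verb="whisk",
    actor="person",
    short_caption="A person whisks eggs in a bowl.",
    long_caption="Standing at a kitchen counter, a person briskly whisks "
                 "two eggs in a glass bowl until the mixture is uniform.",
)
print(json.dumps(asdict(example), indent=2))
```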

Results & Findings

  • Scaling benefits – Training VL‑JEPA on Action100M (vs. smaller datasets) improves top‑1 accuracy on Kinetics‑400 by ~4 % and on Something‑Else by ~5 %, confirming that more data translates into better visual‑language representations.
  • Zero‑shot transfer – Models pretrained on Action100M achieve state‑of‑the‑art zero‑shot performance on unseen action benchmarks (e.g., HMDB‑51, UCF‑101) without any fine‑tuning, demonstrating the generality of the learned representations.
  • Annotation quality – Human spot‑checks report that >85 % of the generated action labels and captions are semantically correct and temporally aligned, a remarkable figure for a fully automated process.
  • Ablation studies – Removing any pipeline component (e.g., the Self‑Refine step or the hierarchical segmentation) leads to noticeable drops in downstream performance, underscoring the importance of each stage.

Practical Implications

  • Better video AI for developers – Models pre‑trained on Action100M can be fine‑tuned with far fewer labeled examples for specific applications such as video search, content moderation, or automated tutorial generation.
  • Open‑vocabulary action detection – Because the dataset is not constrained to a fixed label set, downstream systems can recognize novel actions described in natural language, enabling more flexible user‑driven queries (“show me clips where someone whisks eggs”); a minimal retrieval sketch follows this list.
  • Reduced annotation cost – The pipeline offers a blueprint for organizations to generate their own domain‑specific video corpora (e.g., industrial safety footage) without the expense of manual labeling.
  • Foundation for multimodal world models – The rich caption hierarchy provides both coarse and fine‑grained semantic context, which can be leveraged in robotics or AR/VR systems that need to reason about ongoing activities.
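
As a sketch of how the open‑vocabulary setting could be used downstream, the snippet below ranks clips against a free‑form text query with a generic dual encoder and cosine similarity. The encode_text and encode_clip functions are placeholders for whatever pretrained vision‑language model you plug in; their interfaces are assumptions rather than VL‑JEPA's actual API.

```python
import numpy as np

def encode_text(query: str) -> np.ndarray:
    """Placeholder text encoder: maps the query to a pseudo-random 512-d vector.
    Replace with a real pretrained model in practice."""
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    return rng.standard_normal(512).astype(np.float32)

def encode_clip(clip_id: str) -> np.ndarray:
    """Placeholder video-clip encoder, same caveat as above."""
    rng = np.random.default_rng(abs(hash(clip_id)) % (2**32))
    return rng.standard_normal(512).astype(np.float32)

def rank_clips(query: str, clip_ids: list, top_k: int = 5) -> list:
    """Zero-shot retrieval: cosine similarity between the query embedding and
    each clip embedding, highest first. No fixed label taxonomy is involved,
    which is what open-vocabulary supervision enables."""
    q = encode_text(query)
    q /= np.linalg.norm(q) + 1e-8
    scored = []
    for cid in clip_ids:
        v = encode_clip(cid)
        v /= np.linalg.norm(v) + 1e-8
        scored.append((cid, float(q @ v)))
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:top_k]

# Toy usage with hypothetical clip IDs.
print(rank_clips("someone whisks eggs", [f"clip_{i:04d}" for i in range(100)]))
```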

Limitations & Future Work

  • Domain bias – The source videos are primarily instructional, which may under‑represent actions common in other contexts (e.g., sports, surveillance).
  • No explicit visual grounding verification – While human checks show high quality, the pipeline lacks a formal metric for temporal alignment accuracy across the entire dataset.
  • Compute‑heavy annotation – The use of a 120 B‑parameter language model makes the pipeline expensive; future work could explore lighter reasoning models or distillation techniques.
  • Extending to multimodal signals – Incorporating audio, speech transcripts, or sensor data could further enrich the dataset and enable richer multimodal reasoning.

Action100M marks a significant step toward democratizing large‑scale video understanding. By making a massive, open‑vocabulary resource publicly available, the authors open the door for developers to build more capable, adaptable video AI systems without the traditional bottleneck of costly manual annotation.

Authors

  • Delong Chen
  • Tejaswi Kasarla
  • Yejin Bang
  • Mustafa Shukor
  • Willy Chung
  • Jade Yu
  • Allen Bolourchi
  • Theo Moutakanni
  • Pascale Fung

Paper Information

  • arXiv ID: 2601.10592v1
  • Categories: cs.CV
  • Published: January 15, 2026