[Paper] Streaming Video Instruction Tuning

Published: December 24, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.21334v1

Overview

The paper introduces Streamo, a real‑time, large‑language‑model‑powered assistant that can understand and interact with live video streams. Unlike prior video‑AI systems that are limited to single tasks such as captioning or answering static questions, Streamo can narrate ongoing scenes, recognize actions, generate event captions, ground temporal queries, and answer time‑sensitive questions—all on the fly. To make this possible, the authors built a massive instruction‑following dataset—Streamo‑Instruct‑465K—that teaches the model to handle a wide variety of streaming‑video tasks in a unified way.

Key Contributions

  • Streamo model: The first general‑purpose LLM that processes continuous video streams in real time, supporting multiple downstream tasks with a single architecture.
  • Streamo‑Instruct‑465K: A 465K‑example instruction‑following dataset curated specifically for streaming video, covering diverse temporal contexts and multi‑task supervision.
  • Unified training pipeline: End‑to‑end training that aligns video encoders with LLMs using the instruction dataset, eliminating the need for task‑specific heads or post‑processing.
  • Comprehensive benchmark suite: Evaluation across narration, action recognition, event captioning, temporal grounding, and time‑sensitive QA, demonstrating strong temporal reasoning and interaction speed.
  • Real‑time performance: Achieves low latency suitable for interactive applications (e.g., live streaming platforms, AR/VR assistants).

Methodology

1. Data Collection & Annotation

  • Harvested raw video streams from public platforms (e.g., live broadcasts, sports feeds).
  • Generated temporally aligned instructions using a mix of human annotators and LLM‑assisted prompting, resulting in 465K (video segment, instruction, response) triples (one such record is sketched below this list).
  • Tasks were interleaved: some examples ask the model to “describe what just happened,” others request “find the moment when the player scores,” and some demand “answer the question within 2 seconds of the event.”
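
To make the triple format concrete, here is a minimal sketch of what a single training record could look like. The field names and schema are assumptions for illustration, not the dataset's actual format.

```python
from dataclasses import dataclass

# Hypothetical record layout for one (video segment, instruction, response) triple.
# Field names are illustrative; the released dataset may use a different schema.
@dataclass
class StreamingInstructionExample:
    video_id: str          # source stream identifier
    start_sec: float       # segment start within the stream
    end_sec: float         # segment end within the stream
    instruction: str       # e.g. "Describe what just happened."
    response: str          # temporally aligned target answer
    task: str              # e.g. "narration", "grounding", "time_sensitive_qa"

example = StreamingInstructionExample(
    video_id="soccer_live_0142",
    start_sec=95.0,
    end_sec=103.5,
    instruction="Find the moment when the player scores.",
    response="The goal occurs at 101.2 s, when the striker heads the ball in.",
    task="grounding",
)
```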

2. Model Architecture

  • Video Encoder: A lightweight, temporally aware transformer (e.g., TimeSformer‑Lite) that processes incoming frames in a sliding‑window fashion, producing a stream of token embeddings.
  • LLM Backbone: A decoder‑only LLM (e.g., LLaMA‑2‑7B) that receives the video token stream concatenated with textual prompts.
  • Cross‑modal Fusion: Simple cross‑attention layers let the LLM attend to the latest video tokens while preserving its language reasoning capabilities (a rough sketch of this flow appears below).
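
To make the data flow concrete, below is a rough PyTorch sketch of sliding‑window frame encoding plus cross‑attention fusion as described above. Module names, dimensions, and the residual‑fusion detail are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SlidingWindowVideoEncoder(nn.Module):
    """Toy stand-in for a temporally aware frame encoder (e.g. a TimeSformer-style model)."""
    def __init__(self, frame_dim=768, d_model=512, window=16):
        super().__init__()
        self.window = window
        self.proj = nn.Linear(frame_dim, d_model)

    def forward(self, frame_feats):              # frame_feats: (T, frame_dim) buffered frame features
        recent = frame_feats[-self.window:]      # keep only the most recent sliding window
        return self.proj(recent)                 # (<=window, d_model) video tokens

class CrossModalFusion(nn.Module):
    """Cross-attention that lets text hidden states attend to the latest video tokens."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text_hidden, video_tokens):
        # text_hidden: (B, L_text, d), video_tokens: (B, L_video, d)
        fused, _ = self.attn(query=text_hidden, key=video_tokens, value=video_tokens)
        return text_hidden + fused               # residual fusion keeps language reasoning intact

# Shapes only; in the real system an LLM backbone (e.g. LLaMA-2-7B) would produce text_hidden,
# and the hidden size would match that backbone rather than the toy d_model=512 used here.
encoder, fusion = SlidingWindowVideoEncoder(), CrossModalFusion()
frames = torch.randn(64, 768)                    # 64 buffered frame features
video_tokens = encoder(frames).unsqueeze(0)      # (1, 16, 512)
text_hidden = torch.randn(1, 32, 512)            # hidden states for the current prompt
fused = fusion(text_hidden, video_tokens)        # (1, 32, 512)
```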

3. Training Procedure

  • Instruction Tuning: The model is fine‑tuned on Streamo‑Instruct‑465K using a standard next‑token loss, treating each instruction–response pair as a supervised sequence.
  • Curriculum Scheduling: Early epochs focus on short clips and simple captions; later epochs introduce longer temporal dependencies and multi‑step QA (a minimal sketch of such a schedule appears below).
  • Latency‑aware Optimization: Gradient checkpointing and mixed‑precision training keep GPU memory low, while a “warm‑up buffer” ensures the model can start responding after a minimal number of frames.
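
Below is a minimal sketch of how a clip‑length curriculum and a standard next‑token loss over response tokens could be wired up. The stage boundaries, masking convention, and function names are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def max_clip_seconds(epoch, total_epochs=10):
    """Illustrative curriculum: short clips early, longer temporal context later."""
    if epoch < total_epochs // 3:
        return 4          # short clips, simple captions
    if epoch < 2 * total_epochs // 3:
        return 16         # medium horizon: event captioning, temporal grounding
    return 64             # long horizon: multi-step, time-sensitive QA

def next_token_loss(logits, target_ids, ignore_index=-100):
    """Standard next-token cross-entropy, with prompt tokens masked out of the targets."""
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),   # predictions for positions 1..L-1
        target_ids[:, 1:].reshape(-1),                 # targets shifted by one position
        ignore_index=ignore_index,
    )

# Toy usage: batch of 2 sequences of length 8 over a 100-token vocabulary.
logits = torch.randn(2, 8, 100)
targets = torch.randint(0, 100, (2, 8))
targets[:, :3] = -100                                  # mask the instruction/prompt portion
print(max_clip_seconds(epoch=1), max_clip_seconds(epoch=8), next_token_loss(logits, targets).item())
```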

Results & Findings

| Task (Metric) | Streamo | Prior State‑of‑the‑Art | Δ |
|---|---|---|---|
| Real‑time narration (BLEU‑4) | 31.2 | 24.5 | +6.7 |
| Action understanding (Top‑1 Acc.) | 78.9% | 71.3% | +7.6% |
| Event captioning (CIDEr) | 112.4 | 89.1 | +23.3% |
| Temporal grounding (R@1, IoU>0.5) | 64.5% | 52.0% | +12.5% |
| Time‑sensitive QA (Accuracy @2 s) | 85.1% | 70.8% | +14.3% |

  • Temporal Reasoning: Streamo consistently outperforms offline models on tasks that require understanding the order and duration of events.
  • Responsiveness: Average end‑to‑end latency is ~180 ms per frame on an A100 GPU, meeting interactive thresholds for live streaming.
  • Generalization: When evaluated on unseen domains (e.g., wildlife streams, news broadcasts), performance drops only ~5 %, indicating robust transferability.

Practical Implications

  • Live Streaming Platforms: Automatic, on‑the‑fly captions, highlights, and moderation cues can be generated without post‑processing, improving accessibility and user engagement.
  • AR/VR Assistants: Real‑time scene narration and contextual Q&A enable hands‑free guidance for remote collaboration, training, or entertainment.
  • Surveillance & Safety: Instant detection of anomalous actions and temporal grounding of incidents can trigger alerts faster than batch‑processed video analytics.
  • Content Creation: Creators can get live suggestions for story arcs, automatic highlight reels, or instant fact‑checking while streaming.
  • Developer Toolkits: The unified API (video‑in, text‑out) simplifies integration—developers no longer need to stitch together separate captioning, action‑recognition, and QA modules (a hypothetical usage sketch appears below).
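
As a hypothetical illustration of what a unified "video‑in, text‑out" integration could look like from a developer's side, consider the sketch below. The class, method names, and model identifier are invented for this example and are not an actual Streamo API.

```python
from typing import Iterator

class StreamingAssistantClient:
    """Hypothetical wrapper around a unified video-in, text-out model endpoint."""
    def __init__(self, model_name: str = "streamo-7b"):
        self.model_name = model_name
        self._frame_buffer = []

    def push_frame(self, frame_bytes: bytes) -> None:
        # In a real client this would forward the frame into the model's sliding window.
        self._frame_buffer.append(frame_bytes)

    def ask(self, instruction: str) -> str:
        # Placeholder response; a real endpoint would decode tokens from the LLM.
        return f"[{self.model_name}] ({len(self._frame_buffer)} frames buffered) {instruction}"

def consume_stream(frames: Iterator[bytes]) -> None:
    client = StreamingAssistantClient()
    for i, frame in enumerate(frames):
        client.push_frame(frame)
        if i % 30 == 0:                          # e.g. query once per second at 30 fps
            print(client.ask("Describe what just happened."))

consume_stream(iter([b"\x00" * 10] * 90))        # dummy 3-second clip of placeholder frames
```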

Limitations & Future Work

  • Hardware Dependency: Real‑time performance still hinges on high‑end GPUs; edge deployment will require model compression or distillation.
  • Temporal Horizon: The sliding‑window approach limits reasoning to a few seconds of past context; longer‑range dependencies (e.g., storyline tracking) remain challenging.
  • Dataset Bias: Streamo‑Instruct‑465K is sourced mainly from English‑language streams, which may affect multilingual or culturally diverse scenarios.
  • Future Directions: The authors plan to explore hierarchical memory modules for extended context, integrate multimodal grounding (audio, text overlays), and release a lightweight variant for on‑device inference.

Authors

  • Jiaer Xia
  • Peixian Chen
  • Mengdan Zhang
  • Xing Sun
  • Kaiyang Zhou

Paper Information

  • arXiv ID: 2512.21334v1
  • Categories: cs.CV
  • Published: December 24, 2025
  • PDF: Download PDF