[Paper] ChronusOmni: Improving Time Awareness of Omni Large Language Models
Source: arXiv - 2512.09841v1
Overview
ChronusOmni is a new “omni” large language model that can reason about when things happen across both video and audio streams. By tightly integrating timestamps into its multimodal representations, the model can answer questions that require explicit timing (e.g., “What happens at 00:45?”) as well as implicit cross‑modal timing (e.g., “What is on screen when the narrator says ‘the storm is coming’?”). The authors also release a fresh benchmark, ChronusAV, to push forward research on audiovisual temporal grounding.
Key Contributions
- Unified timestamp tokenization: Introduces a special token that interleaves with visual and audio embeddings at each time step, enabling a single transformer to model temporal relations across modalities.
- Reinforcement‑learning fine‑tuning: Designs reward functions that explicitly penalize out‑of‑order predictions and reward fine‑grained temporal alignment, sharpening the model’s sense of chronology.
- ChronusAV dataset: A large‑scale, modality‑complete collection of video‑audio clips with densely annotated timestamps for both explicit and implicit grounding tasks.
- State‑of‑the‑art performance: Achieves >30 % relative improvement over prior methods on ChronusAV and sets new best scores on several existing temporal grounding benchmarks.
- Preserved general video/audio understanding: Demonstrates that the added temporal machinery does not degrade performance on standard video‑question answering or audio classification tasks.
Methodology
- Temporal Token Insertion – For every fixed-size time slice (e.g., 0.5 s), the model inserts a timestamp token into the input sequence. The token sits alongside the visual frame embedding and the corresponding audio spectrogram embedding, forming a triplet [timestamp, visual, audio]. This creates a single, ordered sequence that the transformer can attend to, treating time as just another token type.
- Multimodal Encoder – A pre-trained vision encoder (e.g., CLIP ViT) and an audio encoder (e.g., wav2vec 2.0) generate modality-specific vectors. These vectors are projected to a common dimension and concatenated with the timestamp token before being fed into a language-model backbone (e.g., LLaMA). A sketch of this interleaving appears after this list.
- Reinforcement Learning (RL) Stage – After supervised pre-training on ChronusAV, the model is fine-tuned with RL using Proximal Policy Optimization. Two custom rewards are used (also sketched after this list):
  - Temporal Order Reward – Gives higher scores when predicted timestamps follow the ground-truth chronological order.
  - Cross-Modal Alignment Reward – Encourages the model to correctly pair visual events with corresponding audio cues (or vice versa).
- Training Pipeline – The authors first train on a mixture of standard video-language corpora (to retain general capabilities) and ChronusAV (to inject temporal knowledge). The RL stage then refines the model's timing precision without catastrophic forgetting.
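The interleaving step can be pictured with a minimal sketch. This is not the authors' code: the module name, the per-slice learned timestamp embedding, and all dimensions below are assumptions chosen purely for illustration.

```python
import torch
import torch.nn as nn

# Illustrative dimensions only; the paper's actual sizes are not reproduced here.
D_VISION, D_AUDIO, D_MODEL = 768, 512, 4096
SLICE_SEC = 0.5  # fixed time-slice length mentioned in the summary

class TimestampInterleaver(nn.Module):
    """Builds the ordered [timestamp, visual, audio] triplet sequence (hypothetical sketch)."""
    def __init__(self, max_slices: int = 1024):
        super().__init__()
        self.proj_v = nn.Linear(D_VISION, D_MODEL)            # project vision features
        self.proj_a = nn.Linear(D_AUDIO, D_MODEL)             # project audio features
        self.time_embed = nn.Embedding(max_slices, D_MODEL)   # one learned token per slice index (assumption)

    def forward(self, frame_feats: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (T, D_VISION), audio_feats: (T, D_AUDIO), one row per time slice
        T = frame_feats.shape[0]
        ts = self.time_embed(torch.arange(T))                 # (T, D_MODEL) timestamp tokens
        v = self.proj_v(frame_feats)                          # (T, D_MODEL)
        a = self.proj_a(audio_feats)                          # (T, D_MODEL)
        # Interleave into [ts_0, v_0, a_0, ts_1, v_1, a_1, ...]
        triplets = torch.stack([ts, v, a], dim=1)             # (T, 3, D_MODEL)
        return triplets.reshape(T * 3, D_MODEL)               # single ordered sequence for the LLM backbone

# Example: a 10-second clip at 0.5 s slices -> 20 slices, 60 tokens total
interleaver = TimestampInterleaver()
seq = interleaver(torch.randn(20, D_VISION), torch.randn(20, D_AUDIO))
print(seq.shape)  # torch.Size([60, 4096])
```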
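The two RL rewards can likewise be sketched with simple stand-in functions. The paper's exact reward formulas are not reproduced here; the pairwise-ordering count and set-overlap score below are assumptions meant only to convey the idea.

```python
from itertools import combinations

def temporal_order_reward(pred_times: list[float], gt_times: list[float]) -> float:
    """Fraction of timestamp pairs whose predicted order matches the ground-truth order.
    A plausible stand-in for the Temporal Order Reward, not the paper's formula."""
    pairs = list(combinations(range(len(gt_times)), 2))
    if not pairs:
        return 1.0
    correct = sum(
        1 for i, j in pairs
        if (pred_times[i] <= pred_times[j]) == (gt_times[i] <= gt_times[j])
    )
    return correct / len(pairs)

def cross_modal_alignment_reward(pred_pairs: set[tuple[str, str]],
                                 gt_pairs: set[tuple[str, str]]) -> float:
    """Fraction of ground-truth (visual event, audio cue) pairings the model recovered.
    Again a stand-in for the Cross-Modal Alignment Reward."""
    if not gt_pairs:
        return 1.0
    return len(pred_pairs & gt_pairs) / len(gt_pairs)

# Example: three events with one pair predicted out of order, and one missed audio-visual pairing
print(temporal_order_reward([1.0, 5.0, 3.0], [1.0, 3.0, 5.0]))   # ~0.67
print(cross_modal_alignment_reward({("lightning", "thunder")},
                                   {("lightning", "thunder"), ("door", "slam")}))  # 0.5
```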
Results & Findings
| Benchmark | Metric (↑ better) | ChronusOmni | Prior SOTA |
|---|---|---|---|
| ChronusAV (Explicit Grounding) | mIoU | 0.71 | 0.53 |
| ChronusAV (Implicit Cross‑Modal) | Acc@1 | 0.84 | 0.61 |
| TVQA (Video QA) | Accuracy | 0.78 | 0.75 |
| AVSD (Audio‑Visual Dialog) | BLEU‑4 | 0.32 | 0.28 |
- 30 %+ relative gain on ChronusAV’s main metric demonstrates that timestamp tokenization + RL dramatically improves temporal grounding.
- Minimal drop (or slight gain) on unrelated video‑language tasks shows the approach does not sacrifice general understanding.
- Ablation studies reveal that removing the RL stage cuts performance by ~12 %, while omitting audio embeddings drops implicit grounding accuracy by ~18 %.
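For reference, the mIoU metric in the table is the mean temporal Intersection-over-Union between predicted and ground-truth segments. The snippet below implements the standard definition; it is not code from the paper.

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """IoU between two time segments given as (start_sec, end_sec)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds, gts) -> float:
    """Average temporal IoU over matched (prediction, ground-truth) pairs."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)

# Example: an 8 s prediction inside the 40-50 s ground-truth segment
print(temporal_iou((42.0, 50.0), (40.0, 50.0)))  # 0.8
```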
Practical Implications
- Enhanced video assistants: Developers building AI agents that narrate or answer questions about movies, sports replays, or surveillance footage can now retrieve when something happened with higher confidence.
- Multimodal content indexing: Search engines can index video‑audio archives by precise temporal tags, enabling queries like “show the scene where the protagonist first mentions the secret” without manual annotation (a toy indexing sketch follows at the end of this section).
- Real‑time monitoring: In safety‑critical domains (e.g., autonomous driving, industrial monitoring), the model can align sensor audio (alarms) with visual cues to trigger timely alerts.
- Creative tools: Video editors can automatically generate timelines that sync dialogue with on‑screen actions, speeding up subtitling or dubbing pipelines.
Because ChronusOmni builds on existing vision and audio encoders, integrating it into current pipelines requires only swapping the multimodal encoder and adding the timestamp token layer—no massive architectural overhaul.
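As a toy illustration of the content-indexing use case, timestamped model outputs could be collected into a searchable index. The `TimedEvent` schema and `find_events` helper below are hypothetical and not part of the paper or any released ChronusOmni API.

```python
from dataclasses import dataclass

@dataclass
class TimedEvent:
    start: float          # seconds from clip start
    end: float
    modality: str         # "visual", "audio", or "both"
    description: str      # caption produced by the model

def find_events(index: list[TimedEvent], keyword: str) -> list[tuple[float, float]]:
    """Return (start, end) spans whose description mentions the keyword."""
    return [(e.start, e.end) for e in index if keyword.lower() in e.description.lower()]

# Hypothetical model output for a short clip
index = [
    TimedEvent(12.0, 15.5, "audio", "narrator says the storm is coming"),
    TimedEvent(13.0, 18.0, "visual", "dark clouds gather over the harbor"),
]
print(find_events(index, "storm"))  # [(12.0, 15.5)]
```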
Limitations & Future Work
- Fixed time granularity: The current slice size is uniform; very fast events (e.g., rapid cuts) may be missed. Adaptive slicing could improve fidelity.
- Dataset bias: ChronusAV, while diverse, still leans toward scripted media (movies, TV). Real‑world footage (e.g., dashcam, live streams) may exhibit different audio‑visual timing patterns.
- Scalability of RL: Reinforcement learning adds computational overhead and can be unstable on larger models; exploring more efficient fine‑tuning (e.g., LoRA‑style adapters) is an open direction.
The authors suggest extending the timestamp token concept to other modalities (e.g., text streams, sensor data) and investigating self‑supervised temporal pre‑training to reduce reliance on densely annotated data.
Authors
- Yijing Chen
- Yihan Wu
- Kaisi Guan
- Yuchen Ren
- Yuyue Wang
- Ruihua Song
- Liyun Ru
Paper Information
- arXiv ID: 2512.09841v1
- Categories: cs.CL, cs.CV, cs.MM
- Published: December 10, 2025