[Paper] ChronusOmni: Improving Time Awareness of Omni Large Language Models
Source: arXiv - 2512.09841v1
Overview
ChronusOmni is a new “omni” large language model that can reason about when things happen across both video and audio streams. By tightly integrating timestamps into its multimodal representations, the model can answer questions that require explicit timing (e.g., “What happens at 00:45?”) as well as implicit cross‑modal timing (e.g., “What is on screen when the narrator says ‘the storm is coming’?”). The authors also release a fresh benchmark, ChronusAV, to push forward research on audiovisual temporal grounding.
Key Contributions
- Unified timestamp tokenization: Introduces a special token that interleaves with visual and audio embeddings at each time step, enabling a single transformer to model temporal relations across modalities.
- Reinforcement‑learning fine‑tuning: Designs reward functions that explicitly penalize out‑of‑order predictions and reward fine‑grained temporal alignment, sharpening the model’s sense of chronology.
- ChronusAV dataset: A large‑scale, modality‑complete collection of video‑audio clips with densely annotated timestamps for both explicit and implicit grounding tasks.
- State‑of‑the‑art performance: Achieves >30 % relative improvement over prior methods on ChronusAV and sets new best scores on several existing temporal grounding benchmarks.
- Preserved general video/audio understanding: Demonstrates that the added temporal machinery does not degrade performance on standard video‑question answering or audio classification tasks.
Methodology
- Temporal Token Insertion – For every fixed-size time slice (e.g., 0.5 s), the model inserts a timestamp token into the input sequence. The token sits alongside the visual frame embedding and the corresponding audio spectrogram embedding, forming a triplet [timestamp, visual, audio]. This creates a single, ordered sequence that the transformer can attend to, treating time as just another token type.
- Multimodal Encoder – A pre-trained vision encoder (e.g., CLIP ViT) and an audio encoder (e.g., wav2vec 2.0) generate modality-specific vectors. These vectors are projected to a common dimension and concatenated with the timestamp token before being fed into a language-model backbone (e.g., LLaMA). A sketch of this interleaving appears after this list.
- Reinforcement Learning (RL) Stage – After supervised pre-training on ChronusAV, the model is fine-tuned with RL using Proximal Policy Optimization. Two custom rewards are used (also sketched after this list):
  - Temporal Order Reward – Gives higher scores when predicted timestamps follow the ground-truth chronological order.
  - Cross-Modal Alignment Reward – Encourages the model to correctly pair visual events with corresponding audio cues (or vice versa).
- Training Pipeline – The authors first train on a mixture of standard video-language corpora (to retain general capabilities) and ChronusAV (to inject temporal knowledge). The RL stage then refines the model's timing precision without catastrophic forgetting.
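The interleaving step can be pictured with a minimal sketch. This is not the authors' code: the module name, the per-slice learned timestamp embedding, and all dimensions below are assumptions chosen purely for illustration.

```python
import torch
import torch.nn as nn

# Illustrative dimensions only; the paper's actual sizes are not reproduced here.
D_VISION, D_AUDIO, D_MODEL = 768, 512, 4096
SLICE_SEC = 0.5  # fixed time-slice length mentioned in the summary

class TimestampInterleaver(nn.Module):
    """Builds the ordered [timestamp, visual, audio] triplet sequence (hypothetical sketch)."""
    def __init__(self, max_slices: int = 1024):
        super().__init__()
        self.proj_v = nn.Linear(D_VISION, D_MODEL)            # project vision features
        self.proj_a = nn.Linear(D_AUDIO, D_MODEL)             # project audio features
        self.time_embed = nn.Embedding(max_slices, D_MODEL)   # one learned token per slice index (assumption)

    def forward(self, frame_feats: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (T, D_VISION), audio_feats: (T, D_AUDIO), one row per time slice
        T = frame_feats.shape[0]
        ts = self.time_embed(torch.arange(T))                 # (T, D_MODEL) timestamp tokens
        v = self.proj_v(frame_feats)                          # (T, D_MODEL)
        a = self.proj_a(audio_feats)                          # (T, D_MODEL)
        # Interleave into [ts_0, v_0, a_0, ts_1, v_1, a_1, ...]
        triplets = torch.stack([ts, v, a], dim=1)             # (T, 3, D_MODEL)
        return triplets.reshape(T * 3, D_MODEL)               # single ordered sequence for the LLM backbone

# Example: a 10-second clip at 0.5 s slices -> 20 slices, 60 tokens total
interleaver = TimestampInterleaver()
seq = interleaver(torch.randn(20, D_VISION), torch.randn(20, D_AUDIO))
print(seq.shape)  # torch.Size([60, 4096])
```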
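The two RL rewards can likewise be sketched with simple stand-in functions. The paper's exact reward formulas are not reproduced here; the pairwise-ordering count and set-overlap score below are assumptions meant only to convey the idea.

```python
from itertools import combinations

def temporal_order_reward(pred_times: list[float], gt_times: list[float]) -> float:
    """Fraction of timestamp pairs whose predicted order matches the ground-truth order.
    A plausible stand-in for the Temporal Order Reward, not the paper's formula."""
    pairs = list(combinations(range(len(gt_times)), 2))
    if not pairs:
        return 1.0
    correct = sum(
        1 for i, j in pairs
        if (pred_times[i] <= pred_times[j]) == (gt_times[i] <= gt_times[j])
    )
    return correct / len(pairs)

def cross_modal_alignment_reward(pred_pairs: set[tuple[str, str]],
                                 gt_pairs: set[tuple[str, str]]) -> float:
    """Fraction of ground-truth (visual event, audio cue) pairings the model recovered.
    Again a stand-in for the Cross-Modal Alignment Reward."""
    if not gt_pairs:
        return 1.0
    return len(pred_pairs & gt_pairs) / len(gt_pairs)

# Example: three events with one pair predicted out of order, and one missed audio-visual pairing
print(temporal_order_reward([1.0, 5.0, 3.0], [1.0, 3.0, 5.0]))   # ~0.67
print(cross_modal_alignment_reward({("lightning", "thunder")},
                                   {("lightning", "thunder"), ("door", "slam")}))  # 0.5
```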
Results & Findings
| Benchmark | Metric (↑ better) | ChronusOmni | Prior SOTA |
|---|---|---|---|
| ChronusAV (Explicit Grounding) | mIoU | 0.71 | 0.53 |
| ChronusAV (Implicit Cross‑Modal) | Acc@1 | 0.84 | 0.61 |
| TVQA (Video QA) | Accuracy | 0.78 | 0.75 |
| AVSD (Audio‑Visual Dialog) | BLEU‑4 | 0.32 | 0.28 |
- 30 %+ relative gain on ChronusAV’s main metric demonstrates that timestamp tokenization + RL dramatically improves temporal grounding.
- Minimal drop (or slight gain) on unrelated video‑language tasks shows the approach does not sacrifice general understanding.
- Ablation studies reveal that removing the RL stage cuts performance by ~12 %, while omitting audio embeddings drops implicit grounding accuracy by ~18 %.
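For reference, the mIoU metric in the table is the mean temporal Intersection-over-Union between predicted and ground-truth segments. The snippet below implements the standard definition; it is not code from the paper.

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """IoU between two time segments given as (start_sec, end_sec)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds, gts) -> float:
    """Average temporal IoU over matched (prediction, ground-truth) pairs."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)

# Example: an 8 s prediction inside the 40-50 s ground-truth segment
print(temporal_iou((42.0, 50.0), (40.0, 50.0)))  # 0.8
```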
Practical Implications
- Enhanced video assistants: Developers building AI agents that narrate or answer questions about movies, sports replays, or surveillance footage can now retrieve when something happened with higher confidence.
- Multimodal content indexing: Search engines can index video‑audio archives by precise temporal tags, enabling queries like “show the scene where the protagonist first mentions the secret” without manual annotation (a toy indexing sketch follows at the end of this section).
- Real‑time monitoring: In safety‑critical domains (e.g., autonomous driving, industrial monitoring), the model can align sensor audio (alarms) with visual cues to trigger timely alerts.
- Creative tools: Video editors can automatically generate timelines that sync dialogue with on‑screen actions, speeding up subtitling or dubbing pipelines.
Because ChronusOmni builds on existing vision and audio encoders, integrating it into current pipelines requires only swapping the multimodal encoder and adding the timestamp token layer—no massive architectural overhaul.
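As a toy illustration of the content-indexing use case, timestamped model outputs could be collected into a searchable index. The `TimedEvent` schema and `find_events` helper below are hypothetical and not part of the paper or any released ChronusOmni API.

```python
from dataclasses import dataclass

@dataclass
class TimedEvent:
    start: float          # seconds from clip start
    end: float
    modality: str         # "visual", "audio", or "both"
    description: str      # caption produced by the model

def find_events(index: list[TimedEvent], keyword: str) -> list[tuple[float, float]]:
    """Return (start, end) spans whose description mentions the keyword."""
    return [(e.start, e.end) for e in index if keyword.lower() in e.description.lower()]

# Hypothetical model output for a short clip
index = [
    TimedEvent(12.0, 15.5, "audio", "narrator says the storm is coming"),
    TimedEvent(13.0, 18.0, "visual", "dark clouds gather over the harbor"),
]
print(find_events(index, "storm"))  # [(12.0, 15.5)]
```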
Limitations & Future Work
- Fixed time granularity: The current slice size is uniform; very fast events (e.g., rapid cuts) may be missed. Adaptive slicing could improve fidelity.
- Dataset bias: ChronusAV, while diverse, still leans toward scripted media (movies, TV). Real‑world footage (e.g., dashcam, live streams) may exhibit different audio‑visual timing patterns.
- Scalability of RL: Reinforcement learning adds computational overhead and can be unstable on larger models; exploring more efficient fine‑tuning (e.g., LoRA‑style adapters) is an open direction.
The authors suggest extending the timestamp token concept to other modalities (e.g., text streams, sensor data) and investigating self‑supervised temporal pre‑training to reduce reliance on densely annotated data.
Authors
- Yijing Chen
- Yihan Wu
- Kaisi Guan
- Yuchen Ren
- Yuyue Wang
- Ruihua Song
- Liyun Ru
Paper Information
- arXiv ID: 2512.09841v1
- Categories: cs.CL, cs.CV, cs.MM
- Published: December 10, 2025