[Paper] Astra: General Interactive World Model with Autoregressive Denoising

Published: December 9, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.08931v1

Overview

Astra is a new “interactive world model” that can predict realistic video futures for a wide range of real‑world tasks—think autonomous‑driving dash‑cam feeds, robot‑arm manipulation, or even a moving camera in a game engine. By marrying diffusion‑style video generation with an autoregressive denoising backbone, Astra can take past frames and explicit action commands (e.g., steering angle, gripper force) and stream out coherent, long‑horizon video predictions in real time.
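
To make the interaction loop concrete, here is a minimal sketch of the kind of streaming interface such a model exposes. Everything here (the class name, the `denoise_next_frame` call, the window size) is an illustrative assumption, not the paper's actual API:

```python
import torch

class InteractiveWorldModel:
    """Minimal sketch of a streaming world-model interface (names assumed)."""

    def __init__(self, denoiser, context_len: int = 16):
        self.denoiser = denoiser        # trained autoregressive denoising net
        self.context_len = context_len  # sliding window of past frames
        self.history: list[torch.Tensor] = []

    @torch.no_grad()
    def step(self, action: torch.Tensor) -> torch.Tensor:
        """Consume one action command (e.g., steering angle, gripper force)
        and return the predicted next frame."""
        context = self.history[-self.context_len:]   # causal: past frames only
        frame = self.denoiser.denoise_next_frame(context, action)  # assumed call
        self.history.append(frame)
        return frame
```

Because each call conditions only on frames that already exist, predictions can be streamed one frame at a time instead of being generated as a fixed-length clip.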

Key Contributions

  • General‑purpose interactive world model – works across heterogeneous action modalities (camera motion, robot joint commands, navigation actions).
  • Autoregressive denoising architecture – a diffusion transformer that denoises one frame at a time while conditioning on a causal history, enabling streaming predictions.
  • Noise‑augmented history memory – injects controlled noise into past frames to prevent the model from over‑fitting to the exact past, striking a balance between responsiveness and temporal consistency.
  • Action‑aware adapter – a lightweight plug‑in that injects action vectors directly into the denoising layers, ensuring tight alignment between predicted video and the supplied control signals.
  • Mixture‑of‑action‑experts routing – dynamically selects the appropriate expert for each action type (e.g., continuous steering vs. discrete grasp commands), boosting versatility across tasks.
  • State‑of‑the‑art results – superior video fidelity, longer prediction horizons, and tighter action‑video alignment on benchmarks ranging from driving datasets to robot manipulation suites.

Methodology

  1. Temporal Causal Attention – The model processes a sliding window of past frames with a causal mask, so each prediction only sees earlier frames, mimicking real‑time perception (a minimal mask sketch follows this list).
  2. Autoregressive Denoising – Starting from a noisy latent, Astra iteratively denoises one frame at a time, conditioning on the already‑generated frames. This is similar to diffusion models for images, but extended along the time dimension (see the generation‑loop sketch after this list).
  3. Noise‑Augmented History Memory – Before feeding past frames into the transformer, a small amount of Gaussian noise is added. This forces the network to rely on both visual context and the incoming action signal, preventing “copy‑paste” of the past.
  4. Action‑Aware Adapter – Action vectors are projected and added to the intermediate token embeddings at each denoising step, giving the model a direct pathway to modulate pixel‑level changes based on control inputs.
  5. Mixture of Action Experts – A gating network examines the incoming action type and routes the signal through a specialized expert (e.g., a continuous‑control expert for steering, a discrete‑grasp expert for manipulation). The outputs are fused before entering the denoising pipeline (a combined adapter‑and‑routing sketch appears after this list).
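
To make step 1 concrete, here is a minimal sketch of a temporal causal mask, written for PyTorch's `scaled_dot_product_attention` convention (boolean `True` means the position may be attended to); the frame/token layout is an assumption:

```python
import torch
import torch.nn.functional as F

def temporal_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """True where a query token may attend: tokens in its own frame and in
    earlier frames, never in future frames."""
    frame_idx = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    return frame_idx[None, :] <= frame_idx[:, None]

# Usage inside an attention layer (q, k, v: [batch, heads, seq, head_dim]):
# mask = temporal_causal_mask(num_frames=8, tokens_per_frame=256)
# out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```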
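
Steps 2 and 3 combine into a single generation loop. The sketch below assumes an epsilon-predicting denoiser and uses a toy DDIM-style schedule; the paper's actual sampler, step count, and history-noise level are not specified here:

```python
import torch

@torch.no_grad()
def generate_next_frame(denoiser, history, action,
                        num_steps: int = 8, history_noise_std: float = 0.1):
    """One autoregressive step: denoise a fresh latent into the next frame.
    `denoiser(x, t, history, action) -> eps` is an assumed noise predictor."""
    # Noise-augmented history memory: perturbing the context keeps the model
    # from copy-pasting the past, so it must also follow the action signal.
    noisy_history = history + history_noise_std * torch.randn_like(history)

    # Toy linear alpha-bar schedule from nearly clean (1.0) to nearly pure noise.
    alpha_bar = torch.linspace(1.0, 0.02, num_steps + 1)

    x = torch.randn_like(history[-1:])        # the new frame starts as noise
    for t in reversed(range(num_steps)):      # iterative denoising
        eps = denoiser(x, t, noisy_history, action)
        x0_hat = (x - (1 - alpha_bar[t + 1]).sqrt() * eps) / alpha_bar[t + 1].sqrt()
        # Deterministic DDIM-style move from noise level t+1 down to level t.
        x = alpha_bar[t].sqrt() * x0_hat + (1 - alpha_bar[t]).sqrt() * eps
    return x
```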
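
Finally, a compact sketch of steps 4 and 5 together: a gated set of action experts whose fused output is injected into the token embeddings. The dimensions, expert count, and soft (rather than hard top-1) routing are assumptions:

```python
import torch
import torch.nn as nn

class ActionExpertAdapter(nn.Module):
    """Sketch of an action-aware adapter with mixture-of-action-experts routing."""

    def __init__(self, action_dim: int, hidden_dim: int, num_experts: int = 4):
        super().__init__()
        # One small MLP per action modality (e.g., steering vs. grasping).
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(action_dim, hidden_dim), nn.SiLU(),
                          nn.Linear(hidden_dim, hidden_dim))
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(action_dim, num_experts)  # routes by action type

    def forward(self, tokens: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        weights = self.gate(action).softmax(dim=-1)                   # (B, E)
        outs = torch.stack([e(action) for e in self.experts], dim=1)  # (B, E, H)
        action_emb = (weights.unsqueeze(-1) * outs).sum(dim=1)        # (B, H)
        # Inject the fused action embedding into every token of the frame.
        return tokens + action_emb.unsqueeze(1)                       # (B, T, H)
```

Soft routing keeps the gate differentiable end-to-end; a sparse top-1 gate would be the usual alternative in mixture-of-experts layers.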

All components are trained end‑to‑end with a standard diffusion loss (predicting the added noise) plus an auxiliary action‑alignment loss that penalizes mismatches between the commanded action and the resulting motion in the generated video.
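
A hedged sketch of that objective follows. The cosine noise schedule, the `lambda_align` weight, and the `motion_head` that maps frame pairs back to an action estimate are illustrative stand-ins, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def training_loss(denoiser, motion_head, frame, history, action,
                  lambda_align: float = 0.1):
    """Diffusion loss (predict the added noise) plus an auxiliary
    action-alignment penalty on the implied motion."""
    t = torch.randint(0, 1000, (frame.shape[0],))            # random timestep
    noise = torch.randn_like(frame)
    alpha_bar = (torch.cos(t / 1000 * torch.pi / 2) ** 2).view(-1, 1, 1, 1)
    alpha_bar = alpha_bar.clamp(min=1e-4)                    # avoid divide blow-up
    x_t = alpha_bar.sqrt() * frame + (1 - alpha_bar).sqrt() * noise

    eps_pred = denoiser(x_t, t, history, action)
    diffusion_loss = F.mse_loss(eps_pred, noise)

    # Alignment: the motion implied by the prediction should match the command.
    x0_hat = (x_t - (1 - alpha_bar).sqrt() * eps_pred) / alpha_bar.sqrt()
    align_loss = F.mse_loss(motion_head(history[:, -1], x0_hat), action)

    return diffusion_loss + lambda_align * align_loss
```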

Results & Findings

Dataset                      | Horizon (frames) | FVD ↓ (lower is better) | Action‑Alignment ↑
-----------------------------|------------------|-------------------------|-------------------
CARLA (driving)              | 30               | 45.2 (vs. 68.7 SOTA)    | 0.84
RoboNet (robot grasp)        | 20               | 38.9 (vs. 55.1)         | 0.79
Kinetics‑400 (camera motion) | 25               | 52.3 (vs. 71.4)         | 0.81

  • Higher fidelity: Astra’s videos retain fine‑grained textures and motion cues even after 2‑3 seconds of prediction.
  • Longer coherent horizons: The noise‑augmented memory lets the model stay temporally consistent without drifting.
  • Tighter action alignment: The action‑aware adapter reduces the average deviation between commanded steering angle and the predicted lane curvature by ~30 % compared to prior world models.

Qualitative demos show Astra smoothly transitioning from a straight‑driving segment to a sharp turn when given a steering command, and a robot arm correctly adjusting its grip as the target object moves.

Practical Implications

Industry                              | How Astra Helps
--------------------------------------|----------------
Autonomous Vehicles                   | Simulate “what‑if” scenarios on the fly for safety validation, or generate synthetic training data that respects exact control inputs.
Robotics                              | Real‑time visual foresight for manipulation: a robot can preview the outcome of a grasp before executing it, reducing failed attempts.
AR/VR & Gaming                        | Stream interactive cut‑scenes that react to player actions without pre‑baked animations, lowering content creation costs.
Surveillance & Predictive Maintenance | Forecast camera views under planned camera motions, helping inspection drones plan optimal viewpoints.
Research & Simulation                 | A plug‑and‑play world model that can be conditioned on arbitrary action vectors, accelerating prototyping of new control algorithms.

Because Astra runs autoregressively with causal attention, it can be deployed on edge GPUs for online prediction—critical for closed‑loop control where latency matters.
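
In a control stack, that deployment might look like the following sketch; the `camera`, `controller`, and `robot` hooks are hypothetical stand-ins, and `wm` is a streaming interface like the one sketched in the Overview:

```python
def control_loop(wm, camera, controller, robot, horizon: int = 3):
    """Hypothetical closed-loop use: preview a short predicted rollout
    before committing each command (all collaborators are stand-ins)."""
    for frame in camera:                     # live observation at each tick
        action = controller.plan(frame)      # candidate command, e.g. steering
        preview = [wm.step(action) for _ in range(horizon)]  # short foresight
        if controller.accept(preview):       # commit only if the future looks safe
            robot.execute(action)
        else:
            controller.replan(preview)       # otherwise adjust before acting
```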

Limitations & Future Work

  • Compute‑heavy: Autoregressive diffusion still demands several denoising steps per frame, which can be a bottleneck for ultra‑low‑latency applications.
  • Action modality scaling: While the mixture‑of‑experts handles several discrete/continuous actions, adding entirely new modalities (e.g., natural‑language commands) will require retraining or new expert heads.
  • Domain gap: The model is trained on curated datasets; performance may degrade in highly unstructured environments (e.g., off‑road driving) without additional fine‑tuning.

Future directions include distillation of the autoregressive denoiser into a single‑step predictor, expanding the expert library to multimodal language‑action inputs, and integrating reinforcement‑learning loops that let Astra improve its predictions through real‑world interaction feedback.

Authors

  • Yixuan Zhu
  • Jiaqi Feng
  • Wenzhao Zheng
  • Yuan Gao
  • Xin Tao
  • Pengfei Wan
  • Jie Zhou
  • Jiwen Lu

Paper Information

  • arXiv ID: 2512.08931v1
  • Categories: cs.CV, cs.AI, cs.LG
  • Published: December 9, 2025
