[Paper] VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model
Source: arXiv - 2602.10098v1
Overview
The paper introduces VLA‑JEPA, a new pre‑training framework for Vision‑Language‑Action (VLA) agents that learns to predict future latent states instead of raw pixels. By keeping future information out of the model’s input and using it only as a supervision signal, VLA‑JEPA sidesteps the “appearance bias” and nuisance‑motion problems that have plagued earlier latent‑action approaches, leading to more robust policies that transfer better to unseen environments.
Key Contributions
- Leakage‑free latent prediction: A target encoder extracts latent embeddings from future video frames, while the student network only sees the current observation, guaranteeing no information leakage.
- JEPA‑style pre‑training for VLA: Adapts the “Joint Embedding Predictive Architecture” (JEPA) paradigm to vision‑language‑action tasks, eliminating the need for multi‑stage pipelines used in prior work.
- Action‑agnostic dynamics learning: By predicting in latent space, the model captures high‑level state transitions that are invariant to camera motion, background clutter, and other visual noise.
- Two‑stage training recipe: A simple pre‑training stage followed by fine‑tuning a lightweight action head, reducing engineering overhead compared with complex latent‑action pipelines.
- Strong empirical gains: Demonstrates consistent improvements on several benchmarks (LIBERO, LIBERO‑Plus, SimplerEnv, and real‑world manipulation) in terms of generalization and robustness.
Methodology
Student–Teacher Architecture
- Target encoder (teacher) processes future video frames (e.g., next 1–2 seconds) and produces a high‑dimensional latent vector. Its parameters are frozen or slowly updated via an exponential moving average.
- Student encoder receives only the current observation (RGB image + language instruction) and tries to predict the teacher’s latent vector. No pixel‑level reconstruction loss is used; the loss is a simple cosine similarity or L2 distance in latent space.
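The slow teacher update described above is a standard exponential‑moving‑average (EMA) rule. A minimal sketch, assuming per‑tensor parameter lists and a hypothetical momentum value (the paper does not state one):

```python
import numpy as np

def ema_update(teacher_params, student_params, momentum=0.996):
    """EMA update of the teacher toward the student:
    teacher <- momentum * teacher + (1 - momentum) * student
    The teacher receives no gradients; it only tracks the student slowly.
    """
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# Toy example with a single parameter tensor per network.
teacher = [np.zeros(3)]
student = [np.ones(3)]
teacher = ema_update(teacher, student, momentum=0.9)
# Each teacher entry moves 10% of the way toward the student (0.1 here).
```

With momentum close to 1, the teacher changes slowly enough to provide stable latent targets across training steps.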
JEPA Objective
- The loss encourages the student’s latent prediction to match the teacher’s latent target, effectively learning a model of the underlying world dynamics without ever seeing the future frames.
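The objective can be sketched as a simple distance in latent space between the student's prediction and the teacher's target, with the target treated as a constant (no gradient flows into the teacher). Both variants mentioned above (L2 and cosine) are shown; the exact choice and any projection heads are details of the paper not reproduced here:

```python
import numpy as np

def jepa_loss(pred, target, mode="l2"):
    """Latent-space prediction loss; no pixel reconstruction is involved.

    pred:   student's predicted latent, shape (batch, dim)
    target: teacher's latent from future frames, shape (batch, dim),
            treated as a constant target
    """
    if mode == "l2":
        return float(np.mean(np.sum((pred - target) ** 2, axis=-1)))
    # Negative cosine similarity, shifted so a perfect match gives 0.
    p = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    t = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(p * t, axis=-1)))
```

Either way, the loss is minimized when the student's latent matches the teacher's, so the student must model how the scene will evolve without ever observing the future frames directly.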
Training Pipeline
- Stage 1 – Pre‑training: Run the student–teacher pair on large‑scale, unlabelled video‑instruction datasets collected from the internet. The model learns a generic “latent world model”.
- Stage 2 – Fine‑tuning: Attach a lightweight action head (e.g., a transformer or MLP) on top of the frozen student encoder and train it on downstream RL or imitation‑learning tasks.
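The fine‑tuning stage reduces to supervised learning of a small head on frozen features. The toy sketch below illustrates the idea with a fixed random "encoder" and a linear action head trained by gradient descent; the encoder, dimensions, and targets are all stand‑ins, not the paper's actual components:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 stand-in: a "frozen" student encoder. Here it is a fixed random
# projection; in the paper this would be the pre-trained student network.
W_enc = rng.normal(size=(8, 16))
def frozen_encoder(obs):            # obs: (batch, 8) -> latent: (batch, 16)
    return np.tanh(obs @ W_enc)

# Stage 2: train only a lightweight linear action head on top.
W_head = np.zeros((16, 4))          # 4-dim continuous action (hypothetical)
def action_head(z):
    return z @ W_head

obs = rng.normal(size=(64, 8))
actions = rng.normal(size=(64, 4))  # toy imitation-learning targets
lr = 0.05
for _ in range(200):
    z = frozen_encoder(obs)                  # encoder is never updated
    err = action_head(z) - actions
    W_head -= lr * (z.T @ err / len(obs))    # MSE gradient w.r.t. head only

final_mse = float(np.mean((action_head(frozen_encoder(obs)) - actions) ** 2))
```

Because only the head is optimized, swapping tasks means swapping the small set of head parameters while the pre‑trained encoder stays fixed.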
Implementation Details
- Vision backbone: ViT‑B/16 pretrained on ImageNet.
- Language encoder: frozen BERT‑base.
- Temporal horizon: 0.5–1 s into the future, sampled randomly.
- Optimizer: AdamW with cosine learning‑rate decay.
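The cosine learning‑rate decay mentioned above follows the usual annealing formula. A minimal sketch with hypothetical base and minimum rates (the paper states the schedule type, not these values):

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-4, min_lr=1e-6, warmup=0):
    """Cosine learning-rate decay with optional linear warmup.

    Hyperparameter values here are illustrative placeholders.
    """
    if warmup and step < warmup:
        return base_lr * (step + 1) / warmup
    t = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

The rate starts at `base_lr`, falls along a half cosine, and reaches `min_lr` at the final step; this schedule would be passed to AdamW at each optimization step.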
Results & Findings
| Benchmark | Metric (↑ better) | VLA‑JEPA | Prior Latent‑Action (e.g., VINN) | Ablation (no teacher EMA) |
|---|---|---|---|---|
| LIBERO‑Plus (zero‑shot) | Success Rate | 68.4 % | 55.1 % | 60.2 % |
| SimplerEnv (domain shift) | Normalized Score | 84.7 | 71.3 | 78.5 |
| Real‑world pick‑and‑place | Success Rate | 72.1 % | 58.9 % | 66.4 % |
- Robustness to visual distractors: Adding random camera jitter or background textures reduces VLA‑JEPA performance by < 3 %, whereas baselines drop > 10 %.
- Sample efficiency: With the same amount of fine‑tuning data, VLA‑JEPA reaches 90 % of its final performance in half as many episodes as the baselines.
- Ablation insights: Removing the EMA update for the teacher or predicting pixels instead of latents hurts both generalization and stability, confirming the importance of leakage‑free latent prediction.
Practical Implications
- Simpler pipelines for robotics teams – Developers can now adopt a two‑stage pre‑train‑then‑fine‑tune workflow without juggling multiple latent‑action modules, saving engineering time.
- Better transfer to new hardware or environments – Because the latent world model abstracts away camera motion and background changes, policies trained in simulation are more likely to work on real robots with different viewpoints or lighting.
- Reduced data labeling costs – The pre‑training stage only needs raw video‑instruction pairs, which can be scraped from the web, eliminating the need for expensive hand‑crafted state annotations.
- Plug‑and‑play action heads – The frozen student encoder can be reused across tasks (e.g., pick‑and‑place, door opening, assembly), allowing rapid prototyping of new behaviors by swapping in a small task‑specific head.
- Potential for on‑device continual learning – Since the teacher is never queried at inference time, the runtime model stays lightweight, making it feasible for edge devices or low‑power robot controllers.
Limitations & Future Work
- Latent interpretability – The learned latent space is not directly human‑readable, which can make debugging difficult for developers unfamiliar with representation learning.
- Dependence on high‑quality future frames – In highly stochastic environments where future observations are ambiguous, the teacher’s targets may be noisy, limiting prediction accuracy.
- Scalability of teacher updates – Maintaining an EMA teacher for very large models can increase memory overhead; exploring more efficient teacher‑free alternatives is an open direction.
- Extension to multimodal actions – The current work focuses on discrete or low‑dimensional continuous actions; applying VLA‑JEPA to complex dexterous manipulation or whole‑body control remains to be investigated.
Overall, VLA‑JEPA offers a compelling, developer‑friendly route to more robust vision‑language‑action agents, and its leakage‑free latent prediction paradigm could become a new standard for pre‑training in embodied AI.
Authors
- Jingwen Sun
- Wenyao Zhang
- Zekun Qi
- Shaojie Ren
- Zezhi Liu
- Hanxin Zhu
- Guangzhong Sun
- Xin Jin
- Zhibo Chen
Paper Information
- arXiv ID: 2602.10098v1
- Categories: cs.RO, cs.CV
- Published: February 10, 2026