[Paper] VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model
Source: arXiv - 2602.10098v1
Overview
The paper introduces VLA‑JEPA, a new pre‑training framework for Vision‑Language‑Action (VLA) agents that learns to predict future latent states instead of raw pixels. By keeping future information out of the model’s input and using it only as a supervision signal, VLA‑JEPA sidesteps the “appearance bias” and nuisance‑motion problems that have plagued earlier latent‑action approaches, leading to more robust policies that transfer better to unseen environments.
Key Contributions
- Leakage‑free latent prediction: A target encoder extracts latent embeddings from future video frames, while the student network only sees the current observation, guaranteeing no information leakage.
- JEPA‑style pre‑training for VLA: Adapts the “Joint Embedding Predictive Architecture” (JEPA) paradigm to vision‑language‑action tasks, eliminating the need for multi‑stage pipelines used in prior work.
- Action‑agnostic dynamics learning: By predicting in latent space, the model captures high‑level state transitions that are invariant to camera motion, background clutter, and other visual noise.
- Two‑stage training recipe: A simple pre‑training stage followed by fine‑tuning a lightweight action head, reducing engineering overhead compared with complex latent‑action pipelines.
- Strong empirical gains: Demonstrates consistent improvements on several benchmarks (LIBERO, LIBERO‑Plus, SimplerEnv, and real‑world manipulation) in terms of generalization and robustness.
Methodology
Student–Teacher Architecture
- Target encoder (teacher) processes future video frames (e.g., next 1–2 seconds) and produces a high‑dimensional latent vector. Its parameters are frozen or slowly updated via an exponential moving average.
- Student encoder receives only the current observation (RGB image + language instruction) and tries to predict the teacher’s latent vector. No pixel‑level reconstruction loss is used; the loss is a simple cosine similarity or L2 distance in latent space.
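The slow teacher update described above is a standard exponential‑moving‑average (EMA) rule. A minimal sketch, assuming per‑tensor parameter lists and a hypothetical momentum value (the paper does not state one):

```python
import numpy as np

def ema_update(teacher_params, student_params, momentum=0.996):
    """EMA update of the teacher toward the student:
    teacher <- momentum * teacher + (1 - momentum) * student
    The teacher receives no gradients; it only tracks the student slowly.
    """
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# Toy example with a single parameter tensor per network.
teacher = [np.zeros(3)]
student = [np.ones(3)]
teacher = ema_update(teacher, student, momentum=0.9)
# Each teacher entry moves 10% of the way toward the student (0.1 here).
```

With momentum close to 1, the teacher changes slowly enough to provide stable latent targets across training steps.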
JEPA Objective
- The loss encourages the student’s latent prediction to match the teacher’s latent target, effectively learning a model of the underlying world dynamics without ever seeing the future frames.
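The objective can be sketched as a simple distance in latent space between the student's prediction and the teacher's target, with the target treated as a constant (no gradient flows into the teacher). Both variants mentioned above (L2 and cosine) are shown; the exact choice and any projection heads are details of the paper not reproduced here:

```python
import numpy as np

def jepa_loss(pred, target, mode="l2"):
    """Latent-space prediction loss; no pixel reconstruction is involved.

    pred:   student's predicted latent, shape (batch, dim)
    target: teacher's latent from future frames, shape (batch, dim),
            treated as a constant target
    """
    if mode == "l2":
        return float(np.mean(np.sum((pred - target) ** 2, axis=-1)))
    # Negative cosine similarity, shifted so a perfect match gives 0.
    p = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    t = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(p * t, axis=-1)))
```

Either way, the loss is minimized when the student's latent matches the teacher's, so the student must model how the scene will evolve without ever observing the future frames directly.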
Training Pipeline
- Stage 1 – Pre‑training: Run the student–teacher pair on large‑scale, unlabelled video‑instruction datasets collected from the internet. The model learns a generic “latent world model”.
- Stage 2 – Fine‑tuning: Attach a lightweight action head (e.g., a transformer or MLP) on top of the frozen student encoder and train it on downstream RL or imitation‑learning tasks.
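The fine‑tuning stage reduces to supervised learning of a small head on frozen features. The toy sketch below illustrates the idea with a fixed random "encoder" and a linear action head trained by gradient descent; the encoder, dimensions, and targets are all stand‑ins, not the paper's actual components:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 stand-in: a "frozen" student encoder. Here it is a fixed random
# projection; in the paper this would be the pre-trained student network.
W_enc = rng.normal(size=(8, 16))
def frozen_encoder(obs):            # obs: (batch, 8) -> latent: (batch, 16)
    return np.tanh(obs @ W_enc)

# Stage 2: train only a lightweight linear action head on top.
W_head = np.zeros((16, 4))          # 4-dim continuous action (hypothetical)
def action_head(z):
    return z @ W_head

obs = rng.normal(size=(64, 8))
actions = rng.normal(size=(64, 4))  # toy imitation-learning targets
lr = 0.05
for _ in range(200):
    z = frozen_encoder(obs)                  # encoder is never updated
    err = action_head(z) - actions
    W_head -= lr * (z.T @ err / len(obs))    # MSE gradient w.r.t. head only

final_mse = float(np.mean((action_head(frozen_encoder(obs)) - actions) ** 2))
```

Because only the head is optimized, swapping tasks means swapping the small set of head parameters while the pre‑trained encoder stays fixed.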
Implementation Details
- Vision backbone: ViT‑B/16 pretrained on ImageNet.
- Language encoder: frozen BERT‑base.
- Temporal horizon: 0.5–1 s into the future, sampled randomly.
- Optimizer: AdamW with cosine learning‑rate decay.
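The cosine learning‑rate decay mentioned above follows the usual annealing formula. A minimal sketch with hypothetical base and minimum rates (the paper states the schedule type, not these values):

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-4, min_lr=1e-6, warmup=0):
    """Cosine learning-rate decay with optional linear warmup.

    Hyperparameter values here are illustrative placeholders.
    """
    if warmup and step < warmup:
        return base_lr * (step + 1) / warmup
    t = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

The rate starts at `base_lr`, falls along a half cosine, and reaches `min_lr` at the final step; this schedule would be passed to AdamW at each optimization step.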
Results & Findings
| Benchmark | Metric (↑ better) | VLA‑JEPA | Prior Latent‑Action (e.g., VINN) | Ablation (no teacher EMA) |
|---|---|---|---|---|
| LIBERO‑Plus (zero‑shot) | Success Rate | 68.4 % | 55.1 % | 60.2 % |
| SimplerEnv (domain shift) | Normalized Score | 84.7 | 71.3 | 78.5 |
| Real‑world pick‑and‑place | Success Rate | 72.1 % | 58.9 % | 66.4 % |
- Robustness to visual distractors: Adding random camera jitter or background textures reduces VLA‑JEPA performance by < 3 %, whereas baselines drop > 10 %.
- Sample efficiency: With the same amount of fine‑tuning data, VLA‑JEPA reaches 90 % of its final performance in half as many episodes as the baselines.
- Ablation insights: Removing the EMA update for the teacher or predicting pixels instead of latents hurts both generalization and stability, confirming the importance of leakage‑free latent prediction.
Practical Implications
- Simpler pipelines for robotics teams – Developers can now adopt a two‑stage pre‑train‑then‑fine‑tune workflow without juggling multiple latent‑action modules, saving engineering time.
- Better transfer to new hardware or environments – Because the latent world model abstracts away camera motion and background changes, policies trained in simulation are more likely to work on real robots with different viewpoints or lighting.
- Reduced data labeling costs – The pre‑training stage only needs raw video‑instruction pairs, which can be scraped from the web, eliminating the need for expensive hand‑crafted state annotations.
- Plug‑and‑play action heads – The frozen student encoder can be reused across tasks (e.g., pick‑and‑place, door opening, assembly), allowing rapid prototyping of new behaviors by swapping in a small task‑specific head.
- Potential for on‑device continual learning – Since the teacher is never queried at inference time, the runtime model stays lightweight, making it feasible for edge devices or low‑power robot controllers.
Limitations & Future Work
- Latent interpretability – The learned latent space is not directly human‑readable, which can make debugging difficult for developers unfamiliar with representation learning.
- Dependence on high‑quality future frames – In highly stochastic environments where future observations are ambiguous, the teacher’s targets may be noisy, limiting prediction accuracy.
- Scalability of teacher updates – Maintaining an EMA teacher for very large models can increase memory overhead; exploring more efficient teacher‑free alternatives is an open direction.
- Extension to multimodal actions – The current work focuses on discrete or low‑dimensional continuous actions; applying VLA‑JEPA to complex dexterous manipulation or whole‑body control remains to be investigated.
Overall, VLA‑JEPA offers a compelling, developer‑friendly route to more robust vision‑language‑action agents, and its leakage‑free latent prediction paradigm could become a new standard for pre‑training in embodied AI.
Authors
- Jingwen Sun
- Wenyao Zhang
- Zekun Qi
- Shaojie Ren
- Zezhi Liu
- Hanxin Zhu
- Guangzhong Sun
- Xin Jin
- Zhibo Chen
Paper Information
- arXiv ID: 2602.10098v1
- Categories: cs.RO, cs.CV
- Published: February 10, 2026