[Paper] Latent Wasserstein Adversarial Imitation Learning
Source: arXiv - 2603.05440v1
Overview
The paper introduces Latent Wasserstein Adversarial Imitation Learning (LWAIL), a new way for agents to learn from state‑only demonstrations—i.e., without needing the expert’s actions or large, high‑quality datasets. By matching the distribution of states in a specially crafted latent space that respects the environment’s dynamics, LWAIL can achieve expert‑level performance from just a handful of demonstration episodes.
Key Contributions
- State‑only imitation: Enables learning from demonstrations that contain only observations (states), eliminating the need for action labels.
- Dynamics‑aware latent space: Constructs a latent representation using an Intention Conditioned Value Function (ICVF) trained on a small set of random state trajectories, capturing transition dynamics.
- Wasserstein distance in latent space: Employs the Wasserstein metric to compare the agent’s state distribution with the expert’s, providing smoother gradients and more stable training than classic GAN‑style divergences.
- Sample efficiency: Demonstrates that a single or a few expert episodes suffice to reach near‑expert performance on MuJoCo benchmarks.
- Empirical superiority: Outperforms prior Wasserstein‑based and adversarial imitation learning methods across multiple continuous control tasks.
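To make the Wasserstein metric concrete, here is a minimal illustration (not from the paper) of the empirical 1‑Wasserstein distance between two equal‑size one‑dimensional samples; in 1‑D the optimal transport plan simply matches sorted samples, so the distance is the mean absolute difference of the sorted values. LWAIL works with multi‑dimensional latent states and estimates the distance with a learned critic, but the quantity being estimated is the same.

```python
def wasserstein_1d(xs, ys):
    """Empirical 1-Wasserstein (earth mover's) distance between two
    equal-size 1-D samples: optimal transport in 1-D pairs sorted values."""
    assert len(xs) == len(ys)
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

expert_states = [1.0, 2.0, 3.0]
agent_states = [1.5, 2.5, 3.5]
print(wasserstein_1d(expert_states, agent_states))  # → 0.5
```

Unlike KL‑style divergences, this distance stays finite and informative even when the two sample sets do not overlap, which is the source of the smoother gradients noted above.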
Methodology
1. Pre‑training the latent encoder
- Train an Intention Conditioned Value Function on a modest collection of randomly generated state sequences (no expert data).
- The ICVF learns to predict the expected return conditioned on a latent “intention” vector, forcing the latent space to respect how states evolve under the environment’s dynamics.
2. Adversarial imitation in latent space
- Encode both the agent’s roll‑outs and the expert’s state‑only trajectories into the latent space using the trained encoder.
- A discriminator (critic) estimates the Wasserstein distance between the two latent state distributions.
- The policy is updated to minimize this distance, effectively pulling the agent’s state distribution toward the expert’s while staying grounded in the dynamics‑aware latent representation.
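A rough sketch of the critic step, under simplifying assumptions not taken from the paper: latent codes are stood in by synthetic Gaussians, and the neural critic is replaced by a linear function `f(z) = z @ w` trained on the Kantorovich dual objective (maximize the critic's mean output on expert latents minus its mean on agent latents), with WGAN‑style weight clipping as a crude Lipschitz constraint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for ICVF latent codes (illustrative, not the paper's encoder):
# expert and agent latents drawn from Gaussians with different means.
expert_z = rng.normal(1.0, 0.5, size=(256, 4))
agent_z = rng.normal(0.0, 0.5, size=(256, 4))

# Linear critic f(z) = z @ w; ascend the dual E_expert[f] - E_agent[f],
# clipping weights to roughly enforce the 1-Lipschitz constraint.
w = np.zeros(4)
for _ in range(100):
    grad = expert_z.mean(axis=0) - agent_z.mean(axis=0)  # gradient of the dual in w
    w = np.clip(w + 0.1 * grad, -1.0, 1.0)

w1_estimate = (expert_z @ w).mean() - (agent_z @ w).mean()
print(f"estimated W1 distance: {w1_estimate:.2f}")
```

The critic's per‑state output then doubles as a dense reward signal for the policy, which is what lets the agent minimize the distance without ever seeing expert actions.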
3. Training loop
- Alternate between collecting new agent trajectories, encoding them, updating the discriminator, and performing policy gradient steps (e.g., PPO) guided by the Wasserstein loss.
The whole pipeline requires no action labels from the expert and only a tiny amount of expert state data.
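The alternating loop above can be sketched end to end in a toy setting. Everything here is an illustrative stand‑in, not the authors' code: the encoder is the identity, expert latents cluster at +1, the "policy" is just a mean vector, the critic is linear with weight clipping and is retrained from scratch each iteration, and the PPO step is replaced by a plain gradient step on the critic reward.

```python
import numpy as np

rng = np.random.default_rng(1)

expert_z = rng.normal(1.0, 0.3, size=(128, 2))  # fixed expert latent states
policy_mean = np.zeros(2)                        # stand-in policy parameter
for _ in range(100):
    # 1) collect fresh agent trajectories and "encode" them (identity encoder)
    agent_z = policy_mean + rng.normal(0.0, 0.3, size=(128, 2))
    # 2) critic steps: retrain linear critic f(z) = z @ w on the dual
    #    E_expert[f] - E_agent[f], with WGAN-style weight clipping
    w = np.zeros(2)
    for _ in range(50):
        w = np.clip(w + 0.1 * (expert_z.mean(0) - agent_z.mean(0)), -1.0, 1.0)
    # 3) policy step: move toward higher critic reward on agent latents
    #    (the gradient of the mean reward z @ w w.r.t. the mean is w)
    policy_mean = policy_mean + 0.05 * w

gap = np.linalg.norm(policy_mean - expert_z.mean(0))
print(f"distance to expert latent mean: {gap:.3f}")
```

Even in this toy version, the agent's state distribution is pulled toward the expert's using only expert states, mirroring the action‑free property of the full pipeline.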
Results & Findings
- MuJoCo benchmarks (e.g., Hopper, Walker2d, HalfCheetah): LWAIL reaches or exceeds the performance of state‑of‑the‑art imitation methods that rely on full demonstrations (states + actions).
- Data efficiency: With just 1–3 expert episodes, LWAIL matches the performance of baselines that need dozens of episodes.
- Stability: The Wasserstein loss in the latent space yields smoother training curves and less mode collapse compared to traditional GAN‑based IL.
- Ablation studies: Removing the dynamics‑aware pre‑training or using a vanilla latent encoder degrades performance dramatically, confirming the importance of the ICVF‑driven representation.
Practical Implications
- Robotics & automation: Companies can now train robots from video or sensor logs that capture only positions (states) without hand‑labeling actions, dramatically lowering data collection costs.
- Simulation‑to‑real transfer: Since the latent space encodes dynamics, policies learned in simulation can be more readily adapted to real‑world systems where only state observations are available.
- Legacy system integration: Existing logs from legacy controllers (often state‑only) become usable for imitation, enabling rapid upgrades without re‑instrumenting the system for action capture.
- Rapid prototyping: Developers can bootstrap a competent policy with a single demonstration run, speeding up iteration cycles in reinforcement‑learning‑driven product development.
Limitations & Future Work
- Reliance on a good dynamics encoder: The quality of the latent space hinges on the ICVF pre‑training; poor random data or mismatched dynamics can hurt performance.
- Scalability to high‑dimensional visual inputs: The current work focuses on low‑dimensional state vectors (e.g., joint angles). Extending LWAIL to raw images will require more sophisticated encoders.
- Theoretical guarantees: While empirical results are strong, formal convergence proofs for the combined Wasserstein‑latent adversarial setup remain open.
Future research directions include integrating visual perception, exploring self‑supervised dynamics learning to replace random pre‑training, and applying LWAIL to multi‑agent or hierarchical imitation scenarios.
Authors
- Siqi Yang
- Kai Yan
- Alexander G. Schwing
- Yu-Xiong Wang
Paper Information
- arXiv ID: 2603.05440v1
- Categories: cs.LG
- Published: March 5, 2026