[Paper] Hybrid-AIRL: Enhancing Inverse Reinforcement Learning with Supervised Expert Guidance
Source: arXiv - 2511.21356v1
Overview
The paper introduces Hybrid‑AIRL (H‑AIRL), a new twist on Adversarial Inverse Reinforcement Learning (AIRL) that blends adversarial training with a supervised loss derived from expert demonstrations. By testing the approach on the notoriously hard Heads‑Up Limit Hold’em (HULHE) poker environment and several Gymnasium benchmarks, the authors show that adding a modest amount of supervised guidance dramatically improves reward inference, sample efficiency, and learning stability.
Key Contributions
- Hybrid‑AIRL framework: Extends AIRL with a supervised expert‑action loss and a stochastic regularization term to stabilize reward learning.
- Empirical evaluation on HULHE: First systematic study of AIRL (and its hybrid variant) in a high‑complexity, imperfect‑information game with sparse, delayed rewards.
- Benchmark suite: Experiments on a curated set of Gymnasium tasks (e.g., CartPole, LunarLander, MuJoCo‑style continuous control) to demonstrate generality.
- Reward‑function diagnostics: Visual analysis tools that expose how the learned dense reward correlates with game states and expert behavior.
- Sample‑efficiency gains: Quantitative evidence that H‑AIRL reaches comparable performance with 30‑50% fewer environment interactions than vanilla AIRL.
Methodology
- Baseline AIRL recap – AIRL treats IRL as a two‑player game: a discriminator tries to distinguish expert state‑action pairs from those generated by the current policy, while the policy (generator) learns to fool the discriminator, implicitly shaping a reward function.
- Hybrid augmentation – two additions on top of the adversarial objective (the combined objective is written out after this list):
  - Supervised loss: a cross‑entropy term that directly penalizes the policy for deviating from expert actions on the demonstration set, providing a dense, low‑variance learning signal early in training.
  - Stochastic regularization: random masking of portions of the discriminator's input (state or action) during updates, preventing over‑fitting to spurious patterns in the limited expert data.
- Training loop – The policy and discriminator are updated alternately, as in standard AIRL, but the supervised cross‑entropy term is added to the policy objective; a weighting hyper‑parameter balances the adversarial and supervised components (a minimal code sketch follows this list).
- Evaluation pipeline – The authors run multiple seeds on each environment, track cumulative reward, policy entropy, and the learned reward’s correlation with ground‑truth (where available). They also visualize reward heat‑maps over game states in HULHE.
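The summary above does not reproduce the paper's exact notation; the sketch below shows how the pieces fit together, combining the standard AIRL discriminator and recovered reward with the supervised cross‑entropy term described in the list. The weighting symbol λ and the notation 𝒟_E for the demonstration set are assumptions, and AIRL's reward/shaping decomposition of f_θ is omitted for brevity.

```latex
% Standard AIRL background: the discriminator is built from a learned function
% f_theta and the current policy pi; the recovered reward is its log-odds.
\[
D_\theta(s, a) = \frac{\exp f_\theta(s, a)}{\exp f_\theta(s, a) + \pi(a \mid s)},
\qquad
\hat{r}_\theta(s, a) = \log D_\theta(s, a) - \log\bigl(1 - D_\theta(s, a)\bigr)
\]

% Hybrid policy objective (sketch): the usual RL objective under the learned
% reward plus a weighted cross-entropy (behavioral-cloning) term over the
% expert demonstrations D_E; lambda is an assumed symbol for the weighting
% hyper-parameter mentioned in the text.
\[
\mathcal{L}_{\text{policy}}
  = \mathcal{L}_{\text{RL}}\!\left(\pi;\, \hat{r}_\theta\right)
  + \lambda \, \mathbb{E}_{(s, a^{*}) \sim \mathcal{D}_E}
      \bigl[ -\log \pi(a^{*} \mid s) \bigr]
\]
```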
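To make the alternating update concrete, here is a minimal, hypothetical PyTorch sketch for a small discrete‑action task. The network sizes, the simplified one‑step policy surrogate, and all names (`PolicyNet`, `Discriminator`, `hybrid_update`, `LAMBDA_SUP`, `MASK_PROB`) are illustrative assumptions, not the authors' implementation, which would typically pair the learned reward with a stronger policy‑gradient algorithm.

```python
# Hypothetical sketch of one H-AIRL-style update (discrete actions, PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_ACTIONS = 8, 4
LAMBDA_SUP = 0.5   # weight on the supervised cross-entropy term (assumed value)
MASK_PROB = 0.1    # probability of zeroing a discriminator input feature (assumed value)


class PolicyNet(nn.Module):
    """Small MLP policy over discrete actions (illustrative)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(),
                                 nn.Linear(64, N_ACTIONS))

    def forward(self, s):          # returns action logits
        return self.net(s)


class Discriminator(nn.Module):
    """Learns f_theta(s, a); the AIRL logit is f_theta(s, a) - log pi(a|s)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + N_ACTIONS, 64), nn.Tanh(),
                                 nn.Linear(64, 1))

    def forward(self, s, a_onehot):
        return self.net(torch.cat([s, a_onehot], dim=-1)).squeeze(-1)


def masked(x, p=MASK_PROB):
    """Stochastic regularization: randomly zero out input features."""
    return x * (torch.rand_like(x) > p).float()


def hybrid_update(policy, disc, pi_opt, d_opt, rollout, expert):
    s, a = rollout                 # states/actions sampled from the current policy
    es, ea = expert                # states/actions from the expert demonstrations
    a_oh = F.one_hot(a, N_ACTIONS).float()
    ea_oh = F.one_hot(ea, N_ACTIONS).float()

    # --- Discriminator step (expert = 1, policy = 0), with masked inputs ---
    with torch.no_grad():
        logp_pi = F.log_softmax(policy(s), -1).gather(1, a[:, None]).squeeze(1)
        logp_ex = F.log_softmax(policy(es), -1).gather(1, ea[:, None]).squeeze(1)
    logit_pi = disc(masked(s), masked(a_oh)) - logp_pi
    logit_ex = disc(masked(es), masked(ea_oh)) - logp_ex
    d_loss = (F.binary_cross_entropy_with_logits(logit_ex, torch.ones_like(logit_ex))
              + F.binary_cross_entropy_with_logits(logit_pi, torch.zeros_like(logit_pi)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # --- Policy step: adversarial reward term + supervised cross-entropy term ---
    with torch.no_grad():
        reward = disc(s, a_oh) - logp_pi          # AIRL reward: log D - log(1 - D)
    logp = F.log_softmax(policy(s), -1).gather(1, a[:, None]).squeeze(1)
    rl_loss = -(reward * logp).mean()             # simplified one-step surrogate (assumption)
    sup_loss = F.cross_entropy(policy(es), ea)    # imitate expert actions directly
    pi_loss = rl_loss + LAMBDA_SUP * sup_loss
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
    return d_loss.item(), pi_loss.item()


if __name__ == "__main__":
    policy, disc = PolicyNet(), Discriminator()
    pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
    d_opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
    s, a = torch.randn(32, STATE_DIM), torch.randint(N_ACTIONS, (32,))
    es, ea = torch.randn(32, STATE_DIM), torch.randint(N_ACTIONS, (32,))
    print(hybrid_update(policy, disc, pi_opt, d_opt, (s, a), (es, ea)))
```

In a full run, `rollout` would be re‑sampled from the environment each iteration, and the RL term would come from a proper policy‑gradient update (e.g., with returns or advantages) rather than the one‑step surrogate above.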
Results & Findings
| Environment | Env. steps to target (AIRL) | Env. steps to target (H‑AIRL) | Final score (H‑AIRL vs AIRL) | Variance across seeds |
|---|---|---|---|---|
| CartPole | 10k | 6k | 200 (max) | ↓ 0.12 |
| LunarLander | 150k | 85k | 260 vs 240 | ↓ 0.35 |
| MuJoCo‑HalfCheetah | 500k | 280k | 12,300 vs 10,900 | ↓ 0.22 |
| HULHE (poker) | 1.2M | 0.7M | 0.78 vs 0.62 (win rate) | ↓ 0.18 |
- Sample efficiency: H‑AIRL consistently reaches target performance with 30‑50% fewer environment steps.
- Learning stability: The variance across random seeds drops noticeably, indicating the supervised term mitigates the high variance typical of adversarial IRL.
- Reward interpretability: Visualizations reveal that H‑AIRL’s learned reward assigns higher values to hand‑strength states that align with expert betting patterns, whereas vanilla AIRL’s reward appears noisy and less correlated with domain knowledge.
Practical Implications
- Faster prototyping of reward models – Developers can now extract dense reward functions from a modest set of expert logs without needing millions of interactions, which is valuable for robotics, game AI, and autonomous systems where data collection is expensive.
- Safer policy learning – By anchoring the policy to expert actions, H‑AIRL reduces the risk of catastrophic exploration in safety‑critical domains (e.g., autonomous driving simulators).
- Hybrid training pipelines – The approach fits naturally into existing RL libraries (e.g., Stable‑Baselines3, RLlib) as a drop‑in replacement for an AIRL trainer, requiring only the addition of a supervised loss term to the policy update (a minimal integration sketch follows this list).
- Domain‑agnostic applicability – The benchmark suite shows that the method works across discrete and continuous control, suggesting it can be adopted for any setting where a small, high‑quality demonstration dataset exists.
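The paper does not ship library bindings, so the "drop‑in" claim is illustrated below with a generic wrapper pattern: take whatever policy‑loss callable an existing adversarial IRL trainer exposes and add the weighted expert cross‑entropy term on top. The `policy_loss_fn` / `policy_logits_fn` interface is a hypothetical stand‑in, not any specific library's API.

```python
# Hypothetical integration pattern: add the supervised expert-action term to an
# existing adversarial policy loss without touching the discriminator machinery.
from typing import Callable
import torch
import torch.nn.functional as F


def with_supervised_guidance(
    policy_loss_fn: Callable[[dict], torch.Tensor],            # existing adversarial policy loss
    policy_logits_fn: Callable[[torch.Tensor], torch.Tensor],  # maps states to action logits
    expert_states: torch.Tensor,
    expert_actions: torch.Tensor,
    weight: float = 0.5,                                       # assumed weighting value
) -> Callable[[dict], torch.Tensor]:
    """Wrap a policy-loss callable with a weighted expert cross-entropy term."""
    def hybrid_loss(batch: dict) -> torch.Tensor:
        ce = F.cross_entropy(policy_logits_fn(expert_states), expert_actions)
        return policy_loss_fn(batch) + weight * ce
    return hybrid_loss
```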
Limitations & Future Work
- Dependence on demonstration quality – The supervised component assumes the expert data is near‑optimal; noisy or sub‑optimal demonstrations could bias the learned reward.
- Scalability to massive state spaces – While stochastic regularization helps, the discriminator still processes full state representations, which may become a bottleneck in high‑dimensional perception tasks (e.g., raw video).
- Theoretical guarantees – The paper provides empirical evidence but lacks a formal analysis of convergence properties when mixing adversarial and supervised losses.
- Future directions suggested by the authors include:
  - Adaptive weighting schemes that automatically balance the two loss terms.
  - Curriculum strategies that gradually phase out the supervised loss as the policy improves (a toy annealing sketch follows this list).
  - Extending H‑AIRL to multi‑agent environments beyond two‑player poker.
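Neither scheme is specified in the paper; as a toy illustration of the curriculum idea, the supervised weight could simply be decayed over training, as in the hypothetical schedule below (the linear decay and its parameters are assumptions).

```python
# Toy illustration of phasing out the supervised loss over training.
def supervised_weight(step: int, total_steps: int,
                      start: float = 0.5, end: float = 0.0) -> float:
    """Linearly anneal the supervised-loss weight from `start` to `end`."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + (end - start) * frac
```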
Authors
- Bram Silue
- Santiago Amaya-Corredor
- Patrick Mannion
- Lander Willem
- Pieter Libin
Paper Information
- arXiv ID: 2511.21356v1
- Categories: cs.LG, cs.AI
- Published: November 26, 2025