[Paper] ASTRO: Adaptive Stitching via Dynamics-Guided Trajectory Rollouts

Published: November 28, 2025 at 01:35 PM EST
4 min read
Source: arXiv - 2511.23442v1

Overview

Offline reinforcement learning (RL) promises to turn static datasets into high‑performing policies without costly online interaction. The paper “ASTRO: Adaptive Stitching via Dynamics‑Guided Trajectory Rollouts” tackles a core obstacle: real‑world datasets are often riddled with sub‑optimal, fragmented trajectories that make it hard for an agent to infer the true value of states and actions. ASTRO introduces a novel data‑augmentation pipeline that stitches together dynamics‑consistent trajectory fragments, enabling offline RL agents to learn more effectively from imperfect data.

Key Contributions

  • Temporal‑distance representation: Learns a latent metric that quantifies how “far apart” two states are in terms of reachable steps, allowing the system to pick stitch‑compatible start‑and‑goal pairs.
  • Dynamics‑guided stitch planner: Generates connecting action sequences by iteratively correcting rollouts with a Rollout Deviation Feedback signal, ensuring the stitched trajectory respects the true environment dynamics.
  • Distributionally novel augmentations: Unlike prior generative‑model approaches that stay within the behavior policy’s support, ASTRO creates trajectories that explore new state‑action regions while remaining physically plausible.
  • Algorithm‑agnostic augmentation: Works with a variety of offline RL algorithms (e.g., CQL, IQL, TD3‑BC) and consistently improves their performance.
  • Strong empirical gains: Sets new state‑of‑the‑art results on the OGBench benchmark suite and delivers consistent lifts on the widely used D4RL tasks.

Methodology

  1. Learning a temporal‑distance encoder

    • A neural network is trained to predict the number of steps needed to go from state s₁ to state s₂ under the environment’s dynamics.
    • The resulting embedding space clusters states that are reachable within a similar horizon, making it easy to locate promising stitch targets.
  2. Selecting stitch pairs

    • For any trajectory fragment, ASTRO queries the embedding to find a target fragment whose start state lies within a reachable distance but offers higher cumulative reward (a minimal sketch of the encoder and this selection step follows the list).
  3. Dynamics‑guided stitching via Rollout Deviation Feedback (RDF)

    • A provisional action sequence is generated (e.g., by a learned dynamics model or a simple planner).
    • The sequence is executed in a simulated rollout; the resulting state trajectory is compared to the desired target trajectory.
    • The deviation (difference) is fed back to the planner, which iteratively adjusts the actions until the rollout aligns closely with the target while obeying the learned dynamics (a gradient-based sketch appears below).
  4. Augmented dataset construction

    • The stitched, dynamics‑consistent trajectories are added to the original offline dataset.
    • Standard offline RL algorithms are then trained on this enriched dataset, benefiting from longer, higher‑quality trajectories.
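
As a concrete illustration of steps 1 and 2, here is a minimal PyTorch sketch of a temporal-distance encoder and a reachability-filtered stitch-target selector. It is an illustrative reconstruction rather than the authors' code: the MLP architecture, the Euclidean latent metric, the MSE regression target, and the `max_steps` / `candidate_returns` inputs are all assumptions.

```python
import torch
import torch.nn as nn

class TemporalDistanceEncoder(nn.Module):
    """Embeds states so that latent Euclidean distance approximates the
    number of environment steps needed to travel between them."""
    def __init__(self, state_dim, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def temporal_distance(self, s1, s2):
        # Predicted "steps apart" = distance between latent embeddings.
        return torch.norm(self.net(s1) - self.net(s2), dim=-1)

def train_step(encoder, optimizer, s_t, s_t_plus_k, k):
    """One regression step on pairs (s_t, s_{t+k}) sampled from the same
    trajectory, where k is the true step gap between the two states."""
    pred = encoder.temporal_distance(s_t, s_t_plus_k)
    loss = nn.functional.mse_loss(pred, k.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def select_stitch_target(encoder, s_end, candidate_starts, candidate_returns,
                         max_steps=20.0):
    """Among candidate fragment start states, pick the one that is reachable
    from s_end within max_steps and promises the highest return-to-go."""
    with torch.no_grad():
        d = encoder.temporal_distance(s_end.unsqueeze(0), candidate_starts)
    reachable = d <= max_steps
    if not reachable.any():
        return None  # nothing within reach; skip stitching this fragment
    scores = torch.where(reachable, candidate_returns,
                         torch.full_like(candidate_returns, float("-inf")))
    return int(scores.argmax())
```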

The whole procedure is fully differentiable and can be plugged into existing offline RL pipelines with minimal engineering effort.
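
Because the procedure is differentiable, one natural way to realise the RDF loop of step 3 is gradient descent on the action sequence through a learned one-step dynamics model. The sketch below assumes such a differentiable model `dynamics_model(s, a) -> s'`; the authors' planner may use a different correction rule, so treat this as a plausible instantiation rather than the paper's implementation.

```python
import torch

def rdf_stitch(dynamics_model, s_start, target_states, action_dim,
               n_iters=50, lr=0.1):
    """Rollout-Deviation-Feedback stitching, sketched as gradient descent on
    an action sequence through a learned, differentiable dynamics model.

    dynamics_model(s, a): predicts the next state (assumed differentiable).
    s_start:              state at the end of the source fragment.
    target_states:        (H, state_dim) states of the target fragment.
    Returns the refined actions and the final rollout deviation.
    """
    H = target_states.shape[0]
    actions = torch.zeros(H, action_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)

    for _ in range(n_iters):
        s, rollout = s_start, []
        for t in range(H):
            s = dynamics_model(s, actions[t])      # simulated environment step
            rollout.append(s)
        rollout = torch.stack(rollout)
        # Rollout deviation: mismatch between the simulated states and the
        # target fragment we want to stitch onto.
        deviation = ((rollout - target_states) ** 2).mean()
        opt.zero_grad()
        deviation.backward()
        opt.step()

    return actions.detach(), float(deviation)
```

The refined actions, together with the rollout they induce, form the stitched segment that is appended to the dataset in step 4.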

Results & Findings

| Benchmark | Baseline (e.g., CQL) | CQL + ASTRO | Improvement |
| --- | --- | --- | --- |
| D4RL HalfCheetah‑v2 | 94.2 | 101.8 | +7.6 |
| D4RL Walker2d‑medium | 95.5 | 103.1 | +7.6 |
| OGBench (graph‑based control) | 68.4 | 78.9 | +10.5 |
  • Consistent gains across multiple offline RL algorithms (CQL, IQL, TD3‑BC).
  • Higher trajectory diversity measured by state‑space coverage, confirming that ASTRO generates novel yet feasible experiences.
  • Ablation studies show that both the temporal‑distance encoder and the RDF‑guided planner are essential; removing either component drops performance to near‑baseline levels.

Practical Implications

  • Faster policy bootstrapping: Developers can take existing logs (e.g., from robotics, autonomous driving, or recommendation systems) and dramatically improve offline RL performance without additional data collection.
  • Safer exploration: Because stitched trajectories respect learned dynamics, the resulting policies are less likely to propose unsafe actions when later deployed online.
  • Plug‑and‑play augmentation: ASTRO is algorithm‑agnostic; teams can integrate it into their current offline RL pipelines (PyTorch, JAX, etc.) with a few lines of code (see the sketch below).
  • Reduced reliance on high‑quality data: Even datasets dominated by sub‑optimal behavior can be turned into a valuable training resource, lowering the barrier for RL adoption in industry settings where perfect demonstrations are rare.
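
For the plug-and-play point above, the integration amounts to concatenating the stitched trajectories with the original buffer before an otherwise unchanged training loop. A minimal sketch follows; the dictionary keys are an assumption about how the offline dataset is stored (D4RL-style arrays), not a documented ASTRO interface.

```python
import numpy as np

def merge_datasets(original, stitched):
    """Append stitched, dynamics-consistent transitions to the original
    offline buffer. Both are dicts of aligned arrays; the key names below
    are an assumption about the storage format."""
    keys = ("observations", "actions", "rewards", "next_observations", "terminals")
    return {k: np.concatenate([original[k], stitched[k]], axis=0) for k in keys}

# The enriched buffer is then handed to any offline RL algorithm
# (CQL, IQL, TD3-BC, ...) exactly as the original dataset would be.
```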

Limitations & Future Work

  • Dynamics model fidelity: ASTRO’s success hinges on the accuracy of the learned dynamics model; in highly stochastic or partially observable environments, rollout deviation feedback may struggle.
  • Computational overhead: The iterative RDF planning adds runtime compared to naïve data augmentation, which could be a bottleneck for massive datasets.
  • Scalability to high‑dimensional action spaces: While experiments cover standard continuous control, extending the approach to very high‑dimensional or discrete action domains (e.g., large‑scale recommendation) remains an open challenge.

Future research directions suggested by the authors include:

  1. Incorporating uncertainty estimates into the dynamics model to better handle stochasticity.
  2. Exploring hierarchical stitching where multi‑step macro‑actions are composed.
  3. Applying ASTRO to real‑world robotic systems to validate safety and sample‑efficiency gains in the field.

Authors

  • Hang Yu
  • Di Zhang
  • Qiwei Du
  • Yanping Zhao
  • Hai Zhang
  • Guang Chen
  • Eduardo E. Veas
  • Junqiao Zhao

Paper Information

  • arXiv ID: 2511.23442v1
  • Categories: cs.LG, cs.AI
  • Published: November 28, 2025