[Paper] cuNRTO: GPU-Accelerated Nonlinear Robust Trajectory Optimization

Published: March 3, 2026
Source: arXiv (2603.02642v1)

Overview

The paper introduces cuNRTO, a GPU‑accelerated framework for nonlinear robust trajectory optimization (NRTO). By moving the heavy lifting of second‑order cone programming (SOCP) onto CUDA‑enabled GPUs, the authors achieve order‑of‑magnitude speedups, making real‑time robust planning feasible for complex robots such as quadcopters and manipulators.

Key Contributions

  • Two GPU‑friendly NRTO architectures:
    1. NRTO‑DR – employs Douglas‑Rachford splitting to solve SOCP sub‑problems in parallel.
    2. NRTO‑FullADMM – a novel ADMM‑based variant that further exploits sparsity and problem structure for scalability.
  • Custom CUDA kernels for fast second‑order cone (SOC) projections and dense linear‑algebra pipelines (cuBLAS GEMM chains) that handle feedback‑gain updates.
  • Comprehensive experimental validation on three benchmark platforms (unicycle, quadcopter, Franka manipulator) showing up to 139.6× speedup over CPU‑only baselines.
  • Open‑source implementation (released with the paper) that can be integrated into existing ROS/ROS2 pipelines.
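The custom kernels above batch the closed-form Euclidean projection onto the second-order cone. A minimal NumPy sketch of that per-constraint projection (the paper's CUDA kernels apply the same case analysis to thousands of constraints in parallel; the function and variable names here are illustrative, not taken from the released code):

```python
import numpy as np

def project_soc(v):
    """Euclidean projection of rows v = (t, u) onto {(t, u) : ||u||_2 <= t}.

    Standard closed-form case analysis, applied row-wise to a batch of
    constraint vectors of shape (n_constraints, dim).
    """
    v = np.atleast_2d(v).astype(float)
    t, u = v[:, 0], v[:, 1:]
    norm_u = np.linalg.norm(u, axis=1)
    out = np.zeros_like(v)

    inside = norm_u <= t            # already feasible: identity
    polar = norm_u <= -t            # in the polar cone: project to the origin
    boundary = ~inside & ~polar     # otherwise: project onto the cone boundary

    out[inside] = v[inside]
    alpha = 0.5 * (t[boundary] + norm_u[boundary])
    out[boundary, 0] = alpha
    out[boundary, 1:] = (alpha / norm_u[boundary])[:, None] * u[boundary]
    return out

pts = np.array([[2.0, 1.0, 1.0],    # inside the cone: unchanged
                [0.5, 2.0, 0.0],    # outside: projected onto the boundary
                [-3.0, 1.0, 0.0]])  # in the polar cone: projected to zero
projected = project_soc(pts)
```

Because each row is handled independently, the same case analysis maps one constraint per GPU thread, which is what makes the projection step so amenable to CUDA.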

Methodology

  1. Problem Formulation – The robust trajectory optimization problem is cast as a nonlinear program with SOCP constraints that enforce safety under bounded disturbances.
  2. Splitting Strategies
    • Douglas‑Rachford (DR) Splitting: Decomposes the original problem into a proximal step (nonlinear dynamics) and a projection step (SOCP feasibility). The projection step is embarrassingly parallel across time steps, making it ideal for GPU execution.
    • Full ADMM: Extends the DR idea by introducing auxiliary variables for each constraint block, allowing independent updates and tighter convergence guarantees. The ADMM updates are expressed as sparse linear solves that map efficiently onto CUDA’s batched solvers.
  3. GPU Implementation
    • SOC Projection Kernels – Hand‑written CUDA kernels compute the Euclidean projection onto second‑order cones for thousands of constraints simultaneously.
    • Linear‑Algebra Backbone – cuBLAS is used for the dense matrix‑matrix multiplications required in the feedback‑gain (K) updates, while cuSPARSE handles the sparse direct solves in the ADMM step.
    • Memory Layout – Data is stored in a structure‑of‑arrays format to maximize coalesced memory accesses and minimize latency.
  4. Iterative Loop – Each outer NRTO iteration alternates between a forward rollout (nonlinear dynamics) and the chosen splitting update (DR or ADMM), converging to a trajectory that satisfies the robust constraints.
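The DR split above alternates a proximal step on the cost with a cone-projection (feasibility) step. A toy Python sketch of that alternation on a single SOC-constrained least-squares problem (in the actual solver the proximal step involves the linearized dynamics and the projection runs batched on the GPU; here `f` is a simple quadratic stand-in, and all names are illustrative):

```python
import numpy as np

def project_soc(v):
    # Closed-form projection of v = (t, u) onto {(t, u) : ||u||_2 <= t}.
    t, u = v[0], v[1:]
    n = np.linalg.norm(u)
    if n <= t:
        return v.copy()
    if n <= -t:
        return np.zeros_like(v)
    alpha = 0.5 * (t + n)
    return np.concatenate(([alpha], (alpha / n) * u))

def douglas_rachford(a, gamma=0.5, iters=100):
    """Minimize 0.5 * ||x - a||^2 subject to x in the SOC, via DR splitting.

    The prox of the quadratic is closed-form; the constraint step is the
    cone projection -- the same role it plays in the NRTO-DR architecture.
    """
    z = np.zeros_like(a)
    for _ in range(iters):
        x = (z + gamma * a) / (1.0 + gamma)  # prox of gamma * f at z
        y = project_soc(2.0 * x - z)         # projection (feasibility) step
        z = z + y - x                        # DR correction
    return y

# Projecting a = (0.5, 2.0) onto the cone {(t, u) : |u| <= t}
x_star = douglas_rachford(np.array([0.5, 2.0]))
```

Swapping the quadratic for the trajectory cost and batching `project_soc` across all time steps recovers the structure that the paper maps onto the GPU.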

Results & Findings

| Platform | Baseline (CPU) | cuNRTO‑DR | cuNRTO‑FullADMM | Speedup |
| --- | --- | --- | --- | --- |
| Unicycle (10 s horizon) | 1.84 s | 0.16 s | 0.12 s | 11.5× – 15.3× |
| Quadcopter (20 s horizon) | 4.92 s | 0.07 s | 0.05 s | 70.3× – 98.4× |
| Franka manipulator (30 s horizon) | 12.4 s | 0.09 s | 0.09 s | 138.9× – 139.6× |
  • Convergence – Both NRTO‑DR and NRTO‑FullADMM reach the same optimal cost within 5–8 outer iterations, matching the solution quality of the CPU solver.
  • Scalability – As the horizon length and state dimension grow, the ADMM variant shows a flatter runtime curve, confirming its superior handling of large, sparse constraint matrices.
  • Robustness – Under simulated disturbances drawn from the prescribed uncertainty set, the optimized trajectories never violated state or control constraints, demonstrating the method's practical safety guarantees.

Practical Implications

  • Real‑Time Safe Planning – Developers can now embed robust trajectory optimization directly into control loops (e.g., 50 Hz for quadrotor navigation) without offloading to a separate CPU or sacrificing safety margins.
  • Plug‑and‑Play GPU Modules – The provided CUDA kernels can be wrapped as ROS nodes or integrated into existing GPU‑accelerated perception pipelines, enabling end‑to‑end perception‑planning‑control on a single device.
  • Scalable to High‑DOF Systems – The ADMM‑based architecture scales gracefully to manipulators with dozens of joints, opening doors for robust motion planning in industrial automation and collaborative robots.
  • Energy‑Efficient Computation – By leveraging the parallelism of modern GPUs, the overall energy per planning query drops compared to multi‑core CPU clusters, which is attractive for edge devices and autonomous drones with limited power budgets.

Limitations & Future Work

  • Hardware Dependency – The speedups rely on NVIDIA GPUs with sufficient CUDA cores and memory bandwidth; performance on integrated or low‑power GPUs may be modest.
  • Static Uncertainty Bounds – The current formulation assumes known, time‑invariant disturbance sets. Extending to adaptive or learned uncertainty models is left for future research.
  • Solver Warm‑Start – While the authors use the previous trajectory as an initializer, more sophisticated warm‑starting strategies (e.g., learned priors) could further reduce iteration counts.
  • Real‑World Validation – Experiments are confined to simulation; deploying cuNRTO on physical robots under real sensor noise and actuator lag will be a critical next step.

cuNRTO demonstrates that with the right algorithmic splitting and GPU‑centric implementation, robust trajectory optimization—once a computational bottleneck—can become a practical tool for developers building safe, high‑performance autonomous systems.

Authors

  • Jiawei Wang
  • Arshiya Taj Abdul
  • Evangelos A. Theodorou

Paper Information

  • arXiv ID: 2603.02642v1
  • Categories: cs.RO, cs.DC, eess.SY
  • Published: March 3, 2026