[Paper] NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning
Source: arXiv - 2602.21172v1
Overview
The paper introduces NoRD (No Reasoning for Driving), a Vision‑Language‑Action (VLA) model that learns to drive with far less data and without the costly "reasoning" annotations that current end‑to‑end driving models rely on. By pairing a data‑efficient training recipe with a bias‑corrected reinforcement‑learning algorithm (Dr‑GRPO), the authors achieve performance on the Waymo Open Dataset and NAVSIM benchmarks that rivals state‑of‑the‑art systems while using under 60 % of the training data and 3× fewer tokens.
Key Contributions
- Data‑efficient VLA architecture: Demonstrates that high‑quality driving policies can be learned without dense reasoning labels, cutting the required annotation budget dramatically.
- Dr‑GRPO integration: Adapts the “Difficulty‑aware Gradient‑based Policy Optimization” algorithm (originally for LLMs) to mitigate the difficulty bias that hampers standard Group Relative Policy Optimization (GRPO) on small, reasoning‑free datasets.
- Empirical validation on large‑scale simulators: Shows competitive results on Waymo Open Dataset and NAVSIM despite using < 60 % of the data and 3× fewer tokens.
- Ablation study of bias sources: Identifies why GRPO fails under data‑scarcity (high‑variance rollouts are over‑penalized) and quantifies the gains from the bias‑corrected Dr‑GRPO.
- Open‑source‑ready recipe: Provides a reproducible training pipeline that can be plugged into existing VLA stacks, lowering the barrier for researchers and engineers to experiment with data‑efficient autonomous driving.
Methodology
- Model Backbone: A standard transformer‑based VLA that ingests front‑camera images, high‑level language instructions (e.g., “stay in the right lane”), and outputs low‑level control commands (steering, throttle).
- Training Data: Instead of the usual dense “reasoning” annotations (step‑by‑step explanations of why a maneuver is taken), the authors train on raw sensor‑action pairs plus sparse high‑level commands. This reduces the token count by a factor of three.
- Policy Optimization:
- GRPO (Group Relative Policy Optimization) is a critic‑free reinforcement‑learning method: for each scenario it samples a group of rollouts and computes each rollout's advantage relative to the group's mean reward, typically normalized by the group's standard deviation.
- Difficulty Bias: When data is scarce, high‑variance trajectories (e.g., near‑collision scenarios) dominate the gradient, causing unstable updates.
- Dr‑GRPO: Extends GRPO by weighting updates according to the difficulty of each rollout, effectively flattening the variance and allowing stable learning from limited data.
- Fine‑tuning: The model is first pretrained on a large, generic VLA corpus, then fine‑tuned on the reduced driving dataset using Dr‑GRPO. No extra reasoning supervision is required during fine‑tuning.
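To make the annotation saving concrete, here is a toy comparison of a conventional reasoning‑annotated training record against a NoRD‑style sparse record. All field names and text are illustrative assumptions, not the paper's actual schema, and the crude whitespace tokenizer is only meant to show why dropping dense explanations shrinks the token budget:

```python
def count_tokens(text: str) -> int:
    """Crude whitespace tokenizer, just to compare annotation budgets."""
    return len(text.split())

# A conventional record carries a dense step-by-step explanation of the maneuver.
dense_record = {
    "image": "front_cam_0001.png",
    "command": "stay in the right lane",
    "reasoning": (
        "The vehicle ahead is braking, so reduce throttle. "
        "The right lane is clear, so hold the current lane. "
        "A pedestrian waits at the crossing, so stay below 30 km/h."
    ),
    "action": {"steering": 0.02, "throttle": 0.35},
}

# A NoRD-style record keeps only the sparse high-level command.
nord_record = {k: v for k, v in dense_record.items() if k != "reasoning"}

dense_tokens = count_tokens(dense_record["command"]) + count_tokens(dense_record["reasoning"])
sparse_tokens = count_tokens(nord_record["command"])
print(f"dense: {dense_tokens} tokens, sparse: {sparse_tokens} tokens")
```

In this toy example the saving is larger than the paper's reported 3×; the actual ratio depends on how verbose the reasoning annotations are.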
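The difficulty‑bias discussion above can be sketched in code. In vanilla GRPO each rollout's advantage is its reward minus the group mean, divided by the group's standard deviation, so low‑variance groups receive disproportionately inflated updates. One common correction, and a plausible reading of Dr‑GRPO's fix (the paper's exact weighting may differ), is to drop that per‑group normalization:

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Vanilla group-relative advantages: (r - mean) / std.

    Dividing by the group's std inflates updates from low-variance
    (very easy or very hard) groups -- the difficulty bias."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard degenerate groups
    return [(r - mu) / sigma for r in rewards]

def dr_grpo_advantages(rewards: list[float]) -> list[float]:
    """Bias-corrected variant: mean-center only, dropping the per-group
    std normalization so every group contributes in proportion to its
    raw reward spread."""
    mu = statistics.fmean(rewards)
    return [r - mu for r in rewards]
```

On a lopsided group (e.g., rewards `[1, 1, 1, 0]`, std ≈ 0.433) the vanilla advantages are scaled up by roughly 2.3×, while the mean‑centered variant keeps the raw ±0.75 spread, which is the stabilizing effect the summary attributes to Dr‑GRPO.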
Results & Findings
| Benchmark / Metric | Full‑data baseline | NoRD (60 % data) | Prior SOTA (full data) |
|---|---|---|---|
| Waymo Open Dataset – Driving Score (↑) | 0.78 | 0.76 | 0.79 |
| NAVSIM – Success Rate (↑) | 0.84 | 0.82 | 0.85 |
| Token count per episode (↓) | 1.2 k | 0.4 k | 1.2 k |
| Training time, GPU‑hrs (↓) | 48 | 16 | 48 |
- Competitive performance: NoRD’s driving scores are within 2–3 % of full‑data baselines.
- Efficiency gains: Training time drops by ~3×, and the model processes far fewer tokens, which translates to lower memory and compute costs.
- Ablation: Replacing Dr‑GRPO with vanilla GRPO on the reduced dataset drops performance by ~8 %, confirming the importance of bias mitigation.
Practical Implications
- Lower data acquisition cost: Companies can now train robust driving policies without investing in expensive, human‑annotated reasoning pipelines.
- Faster iteration cycles: The 3× reduction in training time enables rapid prototyping of new scenarios (e.g., rare weather conditions) and quicker deployment of updates to fleets.
- Edge‑friendly inference: Fewer tokens per episode mean lighter runtime workloads, which is beneficial for on‑vehicle hardware with limited compute budgets.
- Transferability: The Dr‑GRPO bias‑correction can be applied to other VLA tasks (e.g., robot manipulation, drone navigation) where data is scarce and rollout variance is high.
Limitations & Future Work
- Simulation‑only evaluation: Results are confined to the Waymo Open Dataset and the NAVSIM simulator; real‑world validation on physical vehicles is still pending.
- Sparse reasoning may miss safety edge‑cases: While the model performs well overall, the lack of explicit reasoning labels could limit interpretability in safety‑critical failure modes.
- Scalability of Dr‑GRPO: The bias‑weighting step adds modest overhead; future work could explore more efficient approximations for large‑scale fleets.
- Generalization to multimodal sensors: Current experiments focus on camera‑only inputs; extending NoRD to lidar, radar, and V2X data streams is an open direction.
NoRD demonstrates that autonomous driving systems can be built with far fewer annotations while remaining competitive with state‑of‑the‑art systems, opening the door to more cost‑effective and agile development pipelines for the industry.
Authors
- Ishaan Rawal
- Shubh Gupta
- Yihan Hu
- Wei Zhan
Paper Information
- arXiv ID: 2602.21172v1
- Categories: cs.AI, cs.CV
- Published: February 24, 2026