[Paper] See Less, Drive Better: Generalizable End-to-End Autonomous Driving via Foundation Models Stochastic Patch Selection
Source: arXiv - 2601.10707v1
Overview
A new study shows that autonomous‑driving policies built on foundation‑model patch features become far more robust when they are forced to ignore a random subset of those patches during training. By stochastically masking patch descriptors, the authors dramatically improve out‑of‑distribution (OOD) performance while also cutting inference latency by more than half (a 2.4× speedup).
Key Contributions
- Stochastic‑Patch‑Selection (SPS): a lightweight training trick that randomly drops a configurable fraction of visual patches per frame, preserving spatial layout.
- Redundancy analysis of BLIP‑2 visual tokens using PCA and cross‑patch similarity, revealing that >90 % of variance lives in <30 % of the patches.
- Empirical gains: SPS‑trained policies achieve a 6.2 % average improvement over the previous state‑of‑the‑art across diverse OOD benchmarks, with up to 20.4 % boost in closed‑loop simulation.
- Speedup: inference becomes 2.4× faster because fewer token embeddings are processed.
- Real‑world transfer: the same SPS‑trained model drives a physical car out‑of‑the‑box, without additional fine‑tuning.
Methodology
- Feature extraction – Each camera frame is passed through a frozen BLIP‑2 vision encoder, producing a 64‑patch token grid in which each token is a ≈ 768‑dimensional vector.
- Redundancy quantification – The authors run PCA on a large corpus of tokens and compute pairwise cosine similarity. The analysis shows that most information is duplicated across many patches.
- Stochastic masking – During each training step, a random mask (e.g., 30 % of patches) is applied. Masked tokens are replaced with a learned “null” embedding, but the 2‑D layout of the remaining tokens stays unchanged, so the policy still receives a coherent spatial map.
- Policy network – A lightweight transformer decoder consumes the partially‑masked token grid and outputs steering, throttle, and brake commands in an end‑to‑end fashion.
- Training regime – Standard imitation learning on expert driving data; because the SPS mask is recomputed for every frame, the policy sees many different “views” of the same scene.
- Evaluation – The authors test on several OOD tracks (weather, lighting, novel routes) in simulation and on a real‑world test vehicle, comparing against the best published end‑to‑end baselines.
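The redundancy‑quantification step above can be sketched as follows. This is a toy reconstruction, not the authors’ code: the tokens are random stand‑ins for BLIP‑2 patch embeddings (64 patches × ≈ 768 dims), and the corpus size is kept small for speed. The sketch computes how many principal components capture 90 % of the variance and the mean cross‑patch cosine similarity within one frame.

```python
import numpy as np

# Toy redundancy analysis (PCA + cross-patch cosine similarity).
# Random Gaussian tokens stand in for real BLIP-2 patch embeddings.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2_000, 768))

# PCA via SVD on the centered token matrix.
centered = tokens - tokens.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
explained = (s ** 2) / (s ** 2).sum()
cum = np.cumsum(explained)
k = int(np.searchsorted(cum, 0.90)) + 1   # components needed for 90% variance

# Cross-patch cosine similarity within a single 64-token frame.
frame = tokens[:64]
normed = frame / np.linalg.norm(frame, axis=1, keepdims=True)
sim = normed @ normed.T
mean_offdiag = (sim.sum() - np.trace(sim)) / (64 * 63)
print(k, float(mean_offdiag))
```

On real BLIP‑2 tokens the paper finds >90 % of the variance concentrated in <30 % of the patches; on the random stand‑in data above, variance is spread evenly and off‑diagonal similarity is near zero, which is exactly the contrast the analysis is designed to expose.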
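The stochastic‑masking step can be illustrated with a minimal sketch, assuming a `(batch, 64, 768)` token grid. The function name `apply_sps` and the zero‑initialized `null_embed` are placeholders (in the real model the null embedding is a learned parameter); the key properties from the paper are preserved: a random fraction of tokens is replaced by the null embedding, and every kept token stays at its original grid position.

```python
import numpy as np

def apply_sps(tokens, null_embed, drop_rate=0.3, rng=None):
    """Replace a random drop_rate fraction of patch tokens with the null
    embedding, leaving every kept token at its original grid position."""
    rng = rng if rng is not None else np.random.default_rng()
    batch, n_patches, _ = tokens.shape
    keep = rng.random((batch, n_patches)) >= drop_rate   # True = keep token
    return np.where(keep[..., None], tokens, null_embed)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 64, 768))   # stand-in for BLIP-2 patch tokens
null_embed = np.zeros(768)               # learned parameter in the real model
masked = apply_sps(tokens, null_embed, drop_rate=0.3, rng=rng)
```

Resampling the mask on every training step is what produces the many different “views” of the same scene described in the training regime.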
Results & Findings
| Metric | Baseline (SOTA) | SPS (this work) | Improvement |
|---|---|---|---|
| Average OOD success rate | 71.3 % | 77.5 % | +6.2 % |
| Closed‑loop simulation improvement (best scenario) | 58.1 % | 78.5 % | +20.4 % |
| Inference latency (per frame) | 45 ms | 19 ms | 2.4× faster |
| Parameter count | 12 M | 12 M (unchanged) | – |
Ablation studies show that masking rates between 20 % and 40 % give the best trade‑off: overly aggressive masking (≥ 60 %) degrades performance, while no masking reproduces the over‑fitting behavior of the baseline. Shuffling the spatial positions of the surviving patches also harms the model, confirming that preserving spatial coherence is crucial.
Practical Implications
- Robustness for production fleets – SPS can be added to existing perception‑to‑control pipelines with a single line of code (mask generation) and no extra sensors, helping cars handle novel weather or road conditions without costly data collection.
- Compute savings – Dropping ~30 % of tokens reduces GPU memory bandwidth and inference time, enabling higher‑frequency control loops on edge hardware (e.g., automotive‑grade SoCs).
- Simplified data pipelines – Because the foundation model stays frozen, developers can reuse a single pretrained visual encoder across multiple vehicle platforms, focusing effort on the lightweight policy head.
- Transferability – The same model trained in simulation transferred directly to a real car, suggesting that SPS mitigates the simulation‑to‑real gap—a major pain point for autonomous‑driving startups.
- Generalizable recipe – The stochastic masking idea is model‑agnostic; it could be applied to other token‑based perception stacks (e.g., LiDAR point‑cloud tokens, multimodal transformers) to curb redundancy‑induced over‑fitting.
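The compute‑savings bullet above rests on a quadratic scaling argument, which the following toy calculation illustrates. The numbers are illustrative, not the paper’s measurements: self‑attention cost grows with the square of the token count, so pruning ~30 % of a 64‑token grid before the policy transformer shrinks attention work by roughly the square of the keep ratio.

```python
# Toy illustration of why dropping tokens cuts inference cost:
# self-attention work scales quadratically with token count.
def attention_flops(n_tokens: int, dim: int = 768) -> int:
    # Rough cost of QK^T plus attention-weighted V for one head.
    return 2 * n_tokens ** 2 * dim

full = attention_flops(64)               # all 64 patches
pruned = attention_flops(int(64 * 0.7))  # keep ~70% of patches
speedup = full / pruned                  # > 2x fewer attention FLOPs
```

This back‑of‑the‑envelope ratio (~2.1×) is in the same ballpark as the 2.4× end‑to‑end speedup the paper reports, which also benefits from smaller MLP and memory‑bandwidth costs.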
Limitations & Future Work
- Masking hyper‑parameter sensitivity – The optimal drop rate depends on the encoder’s token count and the downstream policy size; automated tuning is left for future research.
- Static masking distribution – The current implementation samples masks uniformly at random; more sophisticated, content‑aware masking (e.g., focusing on high‑entropy regions) might yield further gains.
- Domain scope – Experiments cover visual‑only driving; extending SPS to multimodal setups (camera + LiDAR + radar) and to higher‑resolution token grids remains an open question.
- Theoretical guarantees – While empirical results are strong, a formal analysis of why stochastic token dropout improves OOD invariance would strengthen the claim.
Overall, the paper offers a pragmatic, low‑cost technique that can be adopted today to make end‑to‑end autonomous driving systems more reliable and faster, bridging a gap between academic breakthroughs and real‑world deployment.
Authors
- Amir Mallak
- Erfan Aasi
- Shiva Sreeram
- Tsun-Hsuan Wang
- Daniela Rus
- Alaa Maalouf
Paper Information
- arXiv ID: 2601.10707v1
- Categories: cs.CV, cs.LG, cs.RO
- Published: January 15, 2026