[Paper] See Less, Drive Better: Generalizable End-to-End Autonomous Driving via Foundation Models Stochastic Patch Selection

Published: January 15, 2026 at 01:58 PM EST
4 min read

Source: arXiv - 2601.10707v1

Overview

A new study shows that autonomous‑driving policies built on foundation‑model patch features become far more robust when they are forced to ignore a random subset of those patches during training. By stochastically masking patch descriptors, the authors substantially improve out‑of‑distribution (OOD) performance while also cutting inference time by more than half (2.4× faster).

Key Contributions

  • Stochastic‑Patch‑Selection (SPS): a lightweight training trick that randomly drops a configurable fraction of visual patches per frame, preserving spatial layout.
  • Redundancy analysis of BLIP‑2 visual tokens using PCA and cross‑patch similarity, revealing that >90 % of variance lives in <30 % of the patches.
  • Empirical gains: SPS‑trained policies achieve a 6.2 % average improvement over the previous state‑of‑the‑art across diverse OOD benchmarks, with up to 20.4 % boost in closed‑loop simulation.
  • Speedup: inference becomes 2.4× faster because fewer token embeddings are processed.
  • Real‑world transfer: the same SPS‑trained model drives a physical car out‑of‑the‑box, without additional fine‑tuning.
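The redundancy analysis behind the second bullet combines PCA with cross‑patch cosine similarity. A minimal sketch of both measurements, using synthetic tokens with low‑rank, frame‑shared structure as a stand‑in for real BLIP‑2 encoder outputs (all shapes and the rank‑10 structure here are illustrative assumptions, not the paper's actual statistics):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for BLIP-2 visual tokens: 64 patches x 768 dims per
# frame. The real analysis uses frozen-encoder outputs; we build tokens
# with a shared low-rank component per frame just to illustrate.
n_frames, n_patches, dim, rank = 100, 64, 768, 10
basis = rng.normal(size=(rank, dim))
frame_c = rng.normal(size=(n_frames, 1, rank))            # shared per frame
patch_c = 0.1 * rng.normal(size=(n_frames, n_patches, rank))
tokens = ((frame_c + patch_c) @ basis).reshape(-1, dim)
tokens += 0.05 * rng.normal(size=tokens.shape)

# PCA via SVD: how many components explain 90% of the variance?
centered = tokens - tokens.mean(axis=0, keepdims=True)
s = np.linalg.svd(centered, compute_uv=False)
cum = np.cumsum(s**2) / np.sum(s**2)
k_90 = int(np.searchsorted(cum, 0.9)) + 1  # small when tokens are redundant

# Mean pairwise cosine similarity across the 64 patches of one frame.
frame = tokens[:n_patches]
unit = frame / np.linalg.norm(frame, axis=1, keepdims=True)
sim = unit @ unit.T
mean_sim = (sim.sum() - n_patches) / (n_patches * (n_patches - 1))
```

With redundant tokens like these, `k_90` collapses to roughly the intrinsic rank and `mean_sim` sits near 1, which is the signature the authors report for real patch features.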

Methodology

  1. Feature extraction – Each camera frame is passed through a frozen BLIP‑2 vision encoder, producing a 64‑patch token grid (each token ≈ 768‑dim vector).
  2. Redundancy quantification – The authors run PCA on a large corpus of tokens and compute pairwise cosine similarity. The analysis shows that most information is duplicated across many patches.
  3. Stochastic masking – During each training step, a random mask (e.g., 30 % of patches) is applied. Masked tokens are replaced with a learned “null” embedding, but the 2‑D layout of the remaining tokens stays unchanged, so the policy still receives a coherent spatial map.
  4. Policy network – A lightweight transformer decoder consumes the partially‑masked token grid and outputs steering, throttle, and brake commands in an end‑to‑end fashion.
  5. Training regime – Standard imitation learning on expert driving data, with the SPS mask recomputed for every frame, yields many different “views” of the same scene.
  6. Evaluation – The authors test on several OOD tracks (weather, lighting, novel routes) in simulation and on a real‑world test vehicle, comparing against the best published end‑to‑end baselines.
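Steps 1–3 above can be sketched as a single masking function. This is a hedged illustration, not the authors' code: the shapes match the described 64‑patch, 768‑dim setup, but the null embedding here is a plain zero vector (the paper learns it), and `stochastic_patch_selection` is a name chosen for this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for one frame's frozen-encoder output and the null token.
n_patches, dim = 64, 768
tokens = rng.normal(size=(n_patches, dim))   # BLIP-2 patch tokens (dummy)
null_embedding = np.zeros(dim)               # learned in the real model

def stochastic_patch_selection(tokens, null_embedding, drop_rate=0.3, rng=rng):
    """Randomly replace a fraction of patch tokens with a null embedding.

    The grid layout is untouched: masked positions keep their slot, so
    the policy still receives a spatially coherent 2-D token map.
    """
    n = tokens.shape[0]
    n_drop = int(round(drop_rate * n))
    drop_idx = rng.choice(n, size=n_drop, replace=False)
    out = tokens.copy()
    out[drop_idx] = null_embedding
    return out, drop_idx

# Recomputed every training step, so each frame yields a fresh "view".
masked, dropped = stochastic_patch_selection(tokens, null_embedding, drop_rate=0.3)
```

Because the mask is resampled per frame (step 5), the policy sees many partial views of each scene, which is what discourages over‑reliance on any single redundant patch.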

Results & Findings

| Metric | Baseline (SOTA) | SPS (this work) | Relative Δ |
| --- | --- | --- | --- |
| Average OOD success rate | 71.3 % | 77.5 % | +6.2 % |
| Closed‑loop simulation improvement (best scenario) | 58.1 % | 78.5 % | +20.4 % |
| Inference latency (per frame) | 45 ms | 19 ms | 2.4× faster |
| Parameter count | 12 M | 12 M | unchanged |

Ablation studies show that masking rates between 20 % and 40 % give the best trade‑off; overly aggressive masking (≥ 60 %) degrades performance, while no masking reproduces the over‑fitting behavior of the baseline. Re‑ordering patches (shuffling their spatial positions) hurts the model, confirming that preserving spatial coherence is crucial.

Practical Implications

  • Robustness for production fleets – SPS can be added to existing perception‑to‑control pipelines with a single line of code (mask generation) and no extra sensors, helping cars handle novel weather or road conditions without costly data collection.
  • Compute savings – Dropping ~30 % of tokens reduces GPU memory bandwidth and inference time, enabling higher‑frequency control loops on edge hardware (e.g., automotive‑grade SoCs).
  • Simplified data pipelines – Because the foundation model stays frozen, developers can reuse a single pretrained visual encoder across multiple vehicle platforms, focusing effort on the lightweight policy head.
  • Transferability – The same model trained in simulation transferred directly to a real car, suggesting that SPS mitigates the simulation‑to‑real gap—a major pain point for autonomous‑driving startups.
  • Generalizable recipe – The stochastic masking idea is model‑agnostic; it could be applied to other token‑based perception stacks (e.g., LiDAR point‑cloud tokens, multimodal transformers) to curb redundancy‑induced over‑fitting.
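The compute‑savings bullet comes down to processing a shorter token sequence. A minimal sketch of where the savings originate (hypothetical shapes; the paper reports the 2.4× figure empirically, while this only illustrates the quadratic self‑attention term, assuming masked tokens are simply dropped at inference):

```python
import numpy as np

rng = np.random.default_rng(1)

n_patches, dim, keep_rate = 64, 768, 0.7
tokens = rng.normal(size=(n_patches, dim))   # dummy frame of patch tokens

# Keep a fixed fraction of patches and feed only those embeddings to the
# policy. Self-attention cost scales quadratically with sequence length,
# so keeping 70% of the tokens cuts attention FLOPs to roughly 0.49x.
keep_idx = np.sort(rng.choice(n_patches, size=int(keep_rate * n_patches),
                              replace=False))
kept = tokens[keep_idx]                      # shorter sequence for the policy

attn_flops_ratio = (len(keep_idx) / n_patches) ** 2
```

The end‑to‑end speedup also reflects smaller activations and memory traffic, so the measured 2.4× need not match the attention‑only ratio exactly.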

Limitations & Future Work

  • Masking hyper‑parameter sensitivity – The optimal drop rate depends on the encoder’s token count and the downstream policy size; automated tuning is left for future research.
  • Static masking distribution – The current implementation samples masks uniformly at random; more sophisticated, content‑aware masking (e.g., focusing on high‑entropy regions) might yield further gains.
  • Domain scope – Experiments cover visual‑only driving; extending SPS to multimodal setups (camera + LiDAR + radar) and to higher‑resolution token grids remains an open question.
  • Theoretical guarantees – While empirical results are strong, a formal analysis of why stochastic token dropout improves OOD invariance would strengthen the claim.

Overall, the paper offers a pragmatic, low‑cost technique that can be adopted today to make end‑to‑end autonomous driving systems more reliable and faster, bridging a gap between academic breakthroughs and real‑world deployment.

Authors

  • Amir Mallak
  • Erfan Aasi
  • Shiva Sreeram
  • Tsun-Hsuan Wang
  • Daniela Rus
  • Alaa Maalouf

Paper Information

  • arXiv ID: 2601.10707v1
  • Categories: cs.CV, cs.LG, cs.RO
  • Published: January 15, 2026