[Paper] See Less, Drive Better: Generalizable End-to-End Autonomous Driving via Foundation Models Stochastic Patch Selection

Published: January 15, 2026 at 01:58 PM EST
4 min read

Source: arXiv - 2601.10707v1

Overview

A new study shows that autonomous‑driving policies built on foundation‑model patch features become far more robust when they are forced to ignore a random subset of those patches during training. By stochastically masking patch descriptors, the authors substantially improve out‑of‑distribution (OOD) performance while also cutting inference time by more than half (2.4× faster).

Key Contributions

  • Stochastic‑Patch‑Selection (SPS): a lightweight training trick that randomly drops a configurable fraction of visual patches per frame, preserving spatial layout.
  • Redundancy analysis of BLIP‑2 visual tokens using PCA and cross‑patch similarity, revealing that >90 % of variance lives in <30 % of the patches.
  • Empirical gains: SPS‑trained policies achieve a 6.2 % average improvement over the previous state‑of‑the‑art across diverse OOD benchmarks, with up to 20.4 % boost in closed‑loop simulation.
  • Speedup: inference becomes 2.4× faster because fewer token embeddings are processed.
  • Real‑world transfer: the same SPS‑trained model drives a physical car out‑of‑the‑box, without additional fine‑tuning.
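The redundancy analysis behind the second bullet combines PCA with cross‑patch cosine similarity. A minimal sketch of both measurements, using synthetic tokens with low‑rank, frame‑shared structure as a stand‑in for real BLIP‑2 encoder outputs (all shapes and the rank‑10 structure here are illustrative assumptions, not the paper's actual statistics):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for BLIP-2 visual tokens: 64 patches x 768 dims per
# frame. The real analysis uses frozen-encoder outputs; we build tokens
# with a shared low-rank component per frame just to illustrate.
n_frames, n_patches, dim, rank = 100, 64, 768, 10
basis = rng.normal(size=(rank, dim))
frame_c = rng.normal(size=(n_frames, 1, rank))            # shared per frame
patch_c = 0.1 * rng.normal(size=(n_frames, n_patches, rank))
tokens = ((frame_c + patch_c) @ basis).reshape(-1, dim)
tokens += 0.05 * rng.normal(size=tokens.shape)

# PCA via SVD: how many components explain 90% of the variance?
centered = tokens - tokens.mean(axis=0, keepdims=True)
s = np.linalg.svd(centered, compute_uv=False)
cum = np.cumsum(s**2) / np.sum(s**2)
k_90 = int(np.searchsorted(cum, 0.9)) + 1  # small when tokens are redundant

# Mean pairwise cosine similarity across the 64 patches of one frame.
frame = tokens[:n_patches]
unit = frame / np.linalg.norm(frame, axis=1, keepdims=True)
sim = unit @ unit.T
mean_sim = (sim.sum() - n_patches) / (n_patches * (n_patches - 1))
```

With redundant tokens like these, `k_90` collapses to roughly the intrinsic rank and `mean_sim` sits near 1, which is the signature the authors report for real patch features.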

Methodology

  1. Feature extraction – Each camera frame is passed through a frozen BLIP‑2 vision encoder, producing a 64‑patch token grid (each token ≈ 768‑dim vector).
  2. Redundancy quantification – The authors run PCA on a large corpus of tokens and compute pairwise cosine similarity. The analysis shows that most information is duplicated across many patches.
  3. Stochastic masking – During each training step, a random mask (e.g., 30 % of patches) is applied. Masked tokens are replaced with a learned “null” embedding, but the 2‑D layout of the remaining tokens stays unchanged, so the policy still receives a coherent spatial map.
  4. Policy network – A lightweight transformer decoder consumes the partially‑masked token grid and outputs steering, throttle, and brake commands in an end‑to‑end fashion.
  5. Training regime – Standard imitation learning on expert driving data, with the SPS mask recomputed for every frame, yields many different “views” of the same scene.
  6. Evaluation – The authors test on several OOD tracks (weather, lighting, novel routes) in simulation and on a real‑world test vehicle, comparing against the best published end‑to‑end baselines.
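Steps 1–3 above can be sketched as a single masking function. This is a hedged illustration, not the authors' code: the shapes match the described 64‑patch, 768‑dim setup, but the null embedding here is a plain zero vector (the paper learns it), and `stochastic_patch_selection` is a name chosen for this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for one frame's frozen-encoder output and the null token.
n_patches, dim = 64, 768
tokens = rng.normal(size=(n_patches, dim))   # BLIP-2 patch tokens (dummy)
null_embedding = np.zeros(dim)               # learned in the real model

def stochastic_patch_selection(tokens, null_embedding, drop_rate=0.3, rng=rng):
    """Randomly replace a fraction of patch tokens with a null embedding.

    The grid layout is untouched: masked positions keep their slot, so
    the policy still receives a spatially coherent 2-D token map.
    """
    n = tokens.shape[0]
    n_drop = int(round(drop_rate * n))
    drop_idx = rng.choice(n, size=n_drop, replace=False)
    out = tokens.copy()
    out[drop_idx] = null_embedding
    return out, drop_idx

# Recomputed every training step, so each frame yields a fresh "view".
masked, dropped = stochastic_patch_selection(tokens, null_embedding, drop_rate=0.3)
```

Because the mask is resampled per frame (step 5), the policy sees many partial views of each scene, which is what discourages over‑reliance on any single redundant patch.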

Results & Findings

| Metric | Baseline (SOTA) | SPS (this work) | Relative Δ |
| --- | --- | --- | --- |
| Average OOD success rate | 71.3 % | 77.5 % | +6.2 % |
| Closed‑loop simulation improvement (best scenario) | 58.1 % | 78.5 % | +20.4 % |
| Inference latency (per frame) | 45 ms | 19 ms | 2.4× faster |
| Parameter count | 12 M | 12 M | unchanged |

Ablation studies show that masking rates between 20 % and 40 % give the best trade‑off; overly aggressive masking (≥ 60 %) degrades performance, while no masking reproduces the over‑fitting behavior of the baseline. Re‑ordering patches (shuffling their spatial positions) hurts the model, confirming that preserving spatial coherence is crucial.

Practical Implications

  • Robustness for production fleets – SPS can be added to existing perception‑to‑control pipelines with a single line of code (mask generation) and no extra sensors, helping cars handle novel weather or road conditions without costly data collection.
  • Compute savings – Dropping ~30 % of tokens reduces GPU memory bandwidth and inference time, enabling higher‑frequency control loops on edge hardware (e.g., automotive‑grade SoCs).
  • Simplified data pipelines – Because the foundation model stays frozen, developers can reuse a single pretrained visual encoder across multiple vehicle platforms, focusing effort on the lightweight policy head.
  • Transferability – The same model trained in simulation transferred directly to a real car, suggesting that SPS mitigates the simulation‑to‑real gap—a major pain point for autonomous‑driving startups.
  • Generalizable recipe – The stochastic masking idea is model‑agnostic; it could be applied to other token‑based perception stacks (e.g., LiDAR point‑cloud tokens, multimodal transformers) to curb redundancy‑induced over‑fitting.
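The compute‑savings bullet comes down to processing a shorter token sequence. A minimal sketch of where the savings originate (hypothetical shapes; the paper reports the 2.4× figure empirically, while this only illustrates the quadratic self‑attention term, assuming masked tokens are simply dropped at inference):

```python
import numpy as np

rng = np.random.default_rng(1)

n_patches, dim, keep_rate = 64, 768, 0.7
tokens = rng.normal(size=(n_patches, dim))   # dummy frame of patch tokens

# Keep a fixed fraction of patches and feed only those embeddings to the
# policy. Self-attention cost scales quadratically with sequence length,
# so keeping 70% of the tokens cuts attention FLOPs to roughly 0.49x.
keep_idx = np.sort(rng.choice(n_patches, size=int(keep_rate * n_patches),
                              replace=False))
kept = tokens[keep_idx]                      # shorter sequence for the policy

attn_flops_ratio = (len(keep_idx) / n_patches) ** 2
```

The end‑to‑end speedup also reflects smaller activations and memory traffic, so the measured 2.4× need not match the attention‑only ratio exactly.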

Limitations & Future Work

  • Masking hyper‑parameter sensitivity – The optimal drop rate depends on the encoder’s token count and the downstream policy size; automated tuning is left for future research.
  • Static masking distribution – The current implementation samples masks uniformly at random; more sophisticated, content‑aware masking (e.g., focusing on high‑entropy regions) might yield further gains.
  • Domain scope – Experiments cover visual‑only driving; extending SPS to multimodal setups (camera + LiDAR + radar) and to higher‑resolution token grids remains an open question.
  • Theoretical guarantees – While empirical results are strong, a formal analysis of why stochastic token dropout improves OOD invariance would strengthen the claim.

Overall, the paper offers a pragmatic, low‑cost technique that can be adopted today to make end‑to‑end autonomous driving systems more reliable and faster, bridging a gap between academic breakthroughs and real‑world deployment.

Authors

  • Amir Mallak
  • Erfan Aasi
  • Shiva Sreeram
  • Tsun-Hsuan Wang
  • Daniela Rus
  • Alaa Maalouf

Paper Information

  • arXiv ID: 2601.10707v1
  • Categories: cs.CV, cs.LG, cs.RO
  • Published: January 15, 2026