[Paper] FOVI: A biologically-inspired foveated interface for deep vision models
Source: arXiv - 2602.03766v1
Overview
The paper introduces FOVI, a biologically‑inspired “foveated” interface that lets modern deep‑vision models process ultra‑high‑resolution images the way human eyes do—high detail at the center (the fovea) and progressively lower resolution toward the periphery. By reshaping a retina‑like sensor into a uniform “V1‑style” manifold and redefining convolutions as k‑nearest‑neighbor (kNN) operations, the authors achieve competitive accuracy while roughly halving compute and memory costs.
Key Contributions
- Foveated sensor manifold: A mapping from a variable‑resolution retinal grid to a dense, uniformly spaced representation that mimics primary visual cortex (V1).
- kNN‑convolution kernel: A novel kernel‑mapping technique that enables standard convolutional operations on the irregular sensor layout using k‑nearest‑neighbour neighborhoods.
- End‑to‑end kNN‑CNN architecture: Demonstrates that a fully convolutional network built on the kNN‑convolution can learn directly from foveated inputs.
- Foveated ViT adaptation: Integrates the foveated front‑end with a state‑of‑the‑art DINOv3 Vision Transformer, using low‑rank adaptation (LoRA) to fine‑tune efficiently.
- Efficiency gains: Both models stay within 0.5 % of full‑resolution baseline accuracy while using ≈30‑50 % fewer FLOPs and ≈40 % less GPU memory on high‑resolution egocentric datasets.
- Open‑source release: Full code, pretrained weights, and models on the Hugging Face Hub are provided for reproducibility and community extension.
Methodology
- Retina‑like sensor array – The input image is sampled with a non‑uniform grid—dense at the gaze point and sparse toward the edges, mirroring human retinal cell density.
- Manifold construction – Each sensor location is embedded in a 2‑D “cortical” space that preserves the topographic relationships of V1 (i.e., nearby retinal points stay nearby in the manifold).
- k‑nearest‑neighbor receptive fields – For any given “pixel” in the manifold, its receptive field is defined as the k closest sensors, yielding an irregular but well‑defined neighborhood for each location.
- Kernel mapping – A learned mapping projects a conventional convolution kernel onto the irregular kNN neighborhoods, effectively performing a kNN‑convolution without hand‑crafted interpolation.
- Model variants
- kNN‑CNN – a stack of kNN‑convolution layers trained from scratch on foveated inputs.
- Foveated ViT – the foveated front‑end feeds token embeddings into a pretrained DINOv3 ViT; only a low‑rank LoRA adapter is trained, keeping the massive transformer weights frozen.
- Training & evaluation – Models are trained on high‑resolution egocentric datasets (e.g., EPIC‑KITCHENS, Ego4D) and benchmarked against uniform‑resolution CNN/ViT baselines.
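The sampling and kNN‑convolution steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the log‑polar layout, the choice of k, and the per‑neighbor weight tensor are all assumptions made for clarity.

```python
import numpy as np

def foveated_sample_points(n_rings=16, n_per_ring=32, r_max=1.0):
    """Log-polar sensor layout: dense near the fovea, sparse in the
    periphery (illustrative; the paper's retinal grid may differ)."""
    # Ring radii grow geometrically, mimicking retinal cell density fall-off.
    radii = r_max * np.geomspace(0.02, 1.0, n_rings)
    thetas = np.linspace(0, 2 * np.pi, n_per_ring, endpoint=False)
    pts = np.stack([(r * np.cos(t), r * np.sin(t))
                    for r in radii for t in thetas])
    return pts  # shape (n_rings * n_per_ring, 2)

def knn_indices(points, k):
    """For each sensor, the indices of its k nearest sensors (incl. itself)."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, :k]  # (N, k)

def knn_conv(features, neighbors, weights):
    """kNN-'convolution': each output is a learned weighted sum over the
    k nearest sensors, replacing a fixed 3x3 grid neighborhood."""
    # features: (N, C_in); neighbors: (N, k); weights: (k, C_in, C_out)
    gathered = features[neighbors]                       # (N, k, C_in)
    return np.einsum('nkc,kcd->nd', gathered, weights)   # (N, C_out)

pts = foveated_sample_points()
nbrs = knn_indices(pts, k=9)
feats = np.random.default_rng(0).normal(size=(len(pts), 3))
w = np.random.default_rng(1).normal(size=(9, 3, 8))
out = knn_conv(feats, nbrs, w)
print(out.shape)  # (512, 8)
```

Note that the paper's kernel mapping is *learned* rather than a fixed per‑neighbor weight slot as shown here; the sketch only conveys the gather‑then‑weight structure that lets a convolution operate on an irregular sensor layout.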
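For the Foveated ViT variant, only low‑rank adapters are trained while the DINOv3 weights stay frozen. A minimal sketch of the standard LoRA update (rank r, scaling alpha); the paper's actual adapter placement and rank are not specified here:

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus trainable low-rank update (alpha/r) * B @ A.
    Standard LoRA formulation; illustrative, not the paper's exact config."""
    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                             # frozen, (d_out, d_in)
        self.A = rng.normal(scale=0.01, size=(r, W.shape[1]))  # trainable
        self.B = np.zeros((W.shape[0], r))                     # trainable, zero-init
        self.scale = alpha / r

    def __call__(self, x):
        # x: (batch, d_in). With B initialized to zero, the output equals
        # the frozen layer's output until the adapter is trained.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

W = np.eye(6)                       # stand-in for a frozen pretrained weight
layer = LoRALinear(W, r=2, alpha=4)
x = np.ones((1, 6))
print(np.allclose(layer(x), x @ W.T))  # True: zero-init B leaves output unchanged
```

Because only A and B (2 × r × d parameters per layer) receive gradients, fine‑tuning cost is a small fraction of updating the full transformer, which is what makes adapting a large pretrained ViT to the foveated front‑end cheap.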
Results & Findings
| Model | Top‑1 Accuracy (Ego4D) | FLOPs (B) | GPU Memory (GB) | Speedup vs. Baseline |
|---|---|---|---|---|
| Uniform ResNet‑50 | 71.2 % | 12.4 | 9.8 | – |
| kNN‑CNN (FOVI) | 70.8 % | 6.8 | 5.6 | ≈1.8× |
| Uniform ViT‑B/16 (DINOv3) | 73.5 % | 15.2 | 11.2 | – |
| Foveated ViT + LoRA | 73.2 % | 7.9 | 6.3 | ≈1.9× |
- Accuracy stays within 0.5 % of the full‑resolution baselines despite the drastic reduction in compute.
- Compute & memory are cut roughly in half, enabling inference on commodity GPUs for images that would otherwise require multi‑GPU pipelines.
- Ablation studies show that the kNN‑convolution mapping is essential; naïve bilinear interpolation of the foveated input degrades performance by >3 %.
- Latency improvements translate to real‑time processing (>30 fps) on 4K egocentric video streams.
Practical Implications
- Edge devices & AR/VR headsets – FOVI’s low‑compute pipeline makes it feasible to run high‑resolution perception models on battery‑constrained wearables that already have eye‑tracking hardware.
- Robotics & autonomous drones – Active‑sensing robots can allocate high‑resolution processing only where the camera is “looking,” saving bandwidth for simultaneous navigation and mapping tasks.
- Surveillance & medical imaging – Systems that need to scan large fields (e.g., whole‑slide pathology) can focus compute on regions of interest while still maintaining contextual awareness.
- Software libraries – The open‑source `fovi-pytorch` package provides drop‑in replacements for `torch.nn.Conv2d` and tokenizers, allowing developers to retrofit existing pipelines with minimal code changes.
- Research acceleration – By reducing resource requirements, larger‑scale experiments (e.g., training on petabyte‑scale video) become more accessible to academic labs and startups.
Limitations & Future Work
- Dependence on gaze data – The current implementation assumes a known fixation point; in scenarios without eye‑tracking, a heuristic (e.g., center‑bias) must be used, which can reduce efficiency.
- Fixed fovea size – The retina grid is static during inference; dynamic resizing of the fovea based on scene complexity is left for future exploration.
- Generalization to non‑egocentric domains – While results are strong on egocentric video, additional benchmarks (e.g., satellite imagery, autonomous driving) are needed to confirm broader applicability.
- Hardware acceleration – kNN‑convolution is not yet optimized for existing GPU kernels; custom CUDA or ASIC implementations could unlock further speed gains.
The authors plan to extend FOVI with adaptive gaze prediction, integrate it with transformer‑based detection heads, and explore hardware‑friendly kernels that natively support irregular sensor layouts.
Authors
- Nicholas M. Blauch
- George A. Alvarez
- Talia Konkle
Paper Information
- arXiv ID: 2602.03766v1
- Categories: cs.CV, cs.NE, q-bio.NC
- Published: February 3, 2026