[Paper] FOVI: A biologically-inspired foveated interface for deep vision models
Source: arXiv - 2602.03766v1
Overview
The paper introduces FOVI, a biologically‑inspired “foveated” interface that lets modern deep‑vision models process ultra‑high‑resolution images the way human eyes do—high detail at the center (the fovea) and progressively lower resolution toward the periphery. By reshaping a retina‑like sensor into a uniform “V1‑style” manifold and redefining convolutions as k‑nearest‑neighbor (kNN) operations, the authors achieve competitive accuracy while roughly halving compute and memory costs.
Key Contributions
- Foveated sensor manifold: A mapping from a variable‑resolution retinal grid to a dense, uniformly spaced representation that mimics primary visual cortex (V1).
- kNN‑convolution kernel: A novel kernel‑mapping technique that enables standard convolutional operations on the irregular sensor layout using k‑nearest‑neighbour neighborhoods.
- End‑to‑end kNN‑CNN architecture: Demonstrates that a fully convolutional network built on the kNN‑convolution can learn directly from foveated inputs.
- Foveated ViT adaptation: Integrates the foveated front‑end with a state‑of‑the‑art DINOv3 Vision Transformer, using low‑rank adaptation (LoRA) to fine‑tune efficiently.
- Efficiency gains: Both models stay within 0.5 % of full‑resolution baseline accuracy while using ≈30‑50 % fewer FLOPs and ≈40 % less GPU memory on high‑resolution egocentric datasets.
- Open‑source release: Full code, pretrained weights, and models on the Hugging Face Hub are provided for reproducibility and community extension.
Methodology
- Retina‑like sensor array – The input image is sampled with a non‑uniform grid—dense at the gaze point and sparse toward the edges, mirroring human retinal cell density.
- Manifold construction – Each sensor location is embedded in a 2‑D “cortical” space that preserves the topographic relationships of V1 (i.e., nearby retinal points stay nearby in the manifold).
- k‑nearest‑neighbor receptive fields – For any given “pixel” in the manifold, its receptive field is defined as the k closest sensors, yielding an irregular but well‑defined neighborhood for each location.
- Kernel mapping – A learned mapping projects a conventional convolution kernel onto the irregular kNN neighborhoods, effectively performing a kNN‑convolution without hand‑crafted interpolation.
- Model variants
- kNN‑CNN – a stack of kNN‑convolution layers trained from scratch on foveated inputs.
- Foveated ViT – the foveated front‑end feeds token embeddings into a pretrained DINOv3 ViT; only a low‑rank LoRA adapter is trained, keeping the massive transformer weights frozen.
- Training & evaluation – Models are trained on high‑resolution egocentric datasets (e.g., EPIC‑KITCHENS, Ego4D) and benchmarked against uniform‑resolution CNN/ViT baselines.
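The sampling and kNN‑convolution steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the log‑polar layout, the choice of k, and the per‑neighbor weight tensor are all assumptions made for clarity.

```python
import numpy as np

def foveated_sample_points(n_rings=16, n_per_ring=32, r_max=1.0):
    """Log-polar sensor layout: dense near the fovea, sparse in the
    periphery (illustrative; the paper's retinal grid may differ)."""
    # Ring radii grow geometrically, mimicking retinal cell density fall-off.
    radii = r_max * np.geomspace(0.02, 1.0, n_rings)
    thetas = np.linspace(0, 2 * np.pi, n_per_ring, endpoint=False)
    pts = np.stack([(r * np.cos(t), r * np.sin(t))
                    for r in radii for t in thetas])
    return pts  # shape (n_rings * n_per_ring, 2)

def knn_indices(points, k):
    """For each sensor, the indices of its k nearest sensors (incl. itself)."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, :k]  # (N, k)

def knn_conv(features, neighbors, weights):
    """kNN-'convolution': each output is a learned weighted sum over the
    k nearest sensors, replacing a fixed 3x3 grid neighborhood."""
    # features: (N, C_in); neighbors: (N, k); weights: (k, C_in, C_out)
    gathered = features[neighbors]                       # (N, k, C_in)
    return np.einsum('nkc,kcd->nd', gathered, weights)   # (N, C_out)

pts = foveated_sample_points()
nbrs = knn_indices(pts, k=9)
feats = np.random.default_rng(0).normal(size=(len(pts), 3))
w = np.random.default_rng(1).normal(size=(9, 3, 8))
out = knn_conv(feats, nbrs, w)
print(out.shape)  # (512, 8)
```

Note that the paper's kernel mapping is *learned* rather than a fixed per‑neighbor weight slot as shown here; the sketch only conveys the gather‑then‑weight structure that lets a convolution operate on an irregular sensor layout.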
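For the Foveated ViT variant, only low‑rank adapters are trained while the DINOv3 weights stay frozen. A minimal sketch of the standard LoRA update (rank r, scaling alpha); the paper's actual adapter placement and rank are not specified here:

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus trainable low-rank update (alpha/r) * B @ A.
    Standard LoRA formulation; illustrative, not the paper's exact config."""
    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                             # frozen, (d_out, d_in)
        self.A = rng.normal(scale=0.01, size=(r, W.shape[1]))  # trainable
        self.B = np.zeros((W.shape[0], r))                     # trainable, zero-init
        self.scale = alpha / r

    def __call__(self, x):
        # x: (batch, d_in). With B initialized to zero, the output equals
        # the frozen layer's output until the adapter is trained.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

W = np.eye(6)                       # stand-in for a frozen pretrained weight
layer = LoRALinear(W, r=2, alpha=4)
x = np.ones((1, 6))
print(np.allclose(layer(x), x @ W.T))  # True: zero-init B leaves output unchanged
```

Because only A and B (2 × r × d parameters per layer) receive gradients, fine‑tuning cost is a small fraction of updating the full transformer, which is what makes adapting a large pretrained ViT to the foveated front‑end cheap.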
Results & Findings
| Model | Top‑1 Accuracy (Ego4D) | FLOPs (B) | GPU Memory (GB) | Speedup vs. Baseline |
|---|---|---|---|---|
| Uniform ResNet‑50 | 71.2 % | 12.4 | 9.8 | – |
| kNN‑CNN (FOVI) | 70.8 % | 6.8 | 5.6 | ≈1.8× |
| Uniform ViT‑B/16 (DINOv3) | 73.5 % | 15.2 | 11.2 | – |
| Foveated ViT + LoRA | 73.2 % | 7.9 | 6.3 | ≈1.9× |
- Accuracy stays within 0.5 % of the full‑resolution baselines despite the drastic reduction in compute.
- Compute & memory are cut roughly in half, enabling inference on commodity GPUs for images that would otherwise require multi‑GPU pipelines.
- Ablation studies show that the kNN‑convolution mapping is essential; naïve bilinear interpolation of the foveated input degrades performance by >3 %.
- Latency improvements translate to real‑time processing (>30 fps) on 4K egocentric video streams.
Practical Implications
- Edge devices & AR/VR headsets – FOVI’s low‑compute pipeline makes it feasible to run high‑resolution perception models on battery‑constrained wearables that already have eye‑tracking hardware.
- Robotics & autonomous drones – Active‑sensing robots can allocate high‑resolution processing only where the camera is “looking,” saving bandwidth for simultaneous navigation and mapping tasks.
- Surveillance & medical imaging – Systems that need to scan large fields (e.g., whole‑slide pathology) can focus compute on regions of interest while still maintaining contextual awareness.
- Software libraries – The open‑source `fovi-pytorch` package provides drop‑in replacements for `torch.nn.Conv2d` and tokenizers, allowing developers to retrofit existing pipelines with minimal code changes.
- Research acceleration – By reducing resource requirements, larger‑scale experiments (e.g., training on petabyte‑scale video) become more accessible to academic labs and startups.
Limitations & Future Work
- Dependence on gaze data – The current implementation assumes a known fixation point; in scenarios without eye‑tracking, a heuristic (e.g., center‑bias) must be used, which can reduce efficiency.
- Fixed fovea size – The retina grid is static during inference; dynamic resizing of the fovea based on scene complexity is left for future exploration.
- Generalization to non‑egocentric domains – While results are strong on egocentric video, additional benchmarks (e.g., satellite imagery, autonomous driving) are needed to confirm broader applicability.
- Hardware acceleration – kNN‑convolution is not yet optimized for existing GPU kernels; custom CUDA or ASIC implementations could unlock further speed gains.
The authors plan to extend FOVI with adaptive gaze prediction, integrate it with transformer‑based detection heads, and explore hardware‑friendly kernels that natively support irregular sensor layouts.
Authors
- Nicholas M. Blauch
- George A. Alvarez
- Talia Konkle
Paper Information
- arXiv ID: 2602.03766v1
- Categories: cs.CV, cs.NE, q-bio.NC
- Published: February 3, 2026