[Paper] Neuromorphic Eye Tracking for Low-Latency Pupil Detection
Source: arXiv - 2512.09969v1
Overview
The paper introduces a neuromorphic eye‑tracking pipeline that can locate a user’s pupil with sub‑5‑pixel error while consuming only a few milliwatts and delivering sub‑3 ms latency. By converting a state‑of‑the‑art event‑based eye‑tracking network into a spiking neural network (SNN) built from leaky‑integrate‑and‑fire (LIF) layers and depth‑wise separable convolutions, the authors demonstrate that high‑accuracy gaze estimation is possible on ultra‑low‑power hardware—an essential step for truly responsive AR/VR wearables.
Key Contributions
- Neuromorphic redesign of a top‑performing event‑based eye‑tracker – replaces heavy recurrent/attention blocks with lightweight LIF layers.
- Model compression – achieves a ~20× reduction in parameter count and an ~850× drop in theoretical compute (MACs) versus the closest ANN baseline.
- Latency‑power trade‑off – projected to run at 3.9–4.9 mW with ~3 ms end‑to‑end latency when processing a 1 kHz event stream.
- Accuracy close to specialized hardware – mean pupil‑center error of 3.7–4.1 px, comparable to the Retina neuromorphic system (3.24 px).
- Generalizable design pattern – shows how depth‑wise separable convolutions and LIF neurons can replace complex ANN modules without sacrificing performance.
Methodology
- Event‑based input – The system consumes asynchronous events from a Dynamic Vision Sensor (DVS) rather than conventional video frames, preserving microsecond‑level temporal detail and eliminating motion blur. A minimal event‑binning sketch follows this list.
- Network architecture – Starting from a high‑performing ANN eye‑tracker, the authors:
  - Swap recurrent and attention modules for stacks of LIF neurons that naturally process spikes over time.
  - Replace standard convolutions with depth‑wise separable convolutions, dramatically cutting parameters and multiply‑accumulate operations. An illustrative sketch of such a block is shown after this list.
- Training pipeline – The SNN is trained using surrogate gradient methods that approximate the non‑differentiable spiking function, allowing back‑propagation on the same labeled event datasets used for the ANN baseline.
- Efficiency estimation – Theoretical compute (MACs) is calculated for both ANN and SNN versions, and power/latency are projected using published neuromorphic accelerator specifications (e.g., Intel Loihi, BrainChip Akida). A back‑of‑the‑envelope MAC comparison follows the sketches below.
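To make the event‑based input concrete, here is a minimal binning sketch (not the paper's code): asynchronous DVS events are accumulated into a dense [T, 2, H, W] tensor with ON/OFF polarity channels so a convolutional SNN can consume them one time bin at a time. The bin count, resolution, and polarity encoding are assumptions for illustration.

```python
import numpy as np

def events_to_bins(x, y, t, p, H=64, W=64, n_bins=10):
    """x, y: pixel coordinates; t: timestamps; p: polarity encoded as {0, 1}."""
    vox = np.zeros((n_bins, 2, H, W), dtype=np.float32)
    t = t - t.min()                                              # shift window start to 0
    bin_idx = np.minimum((t / (t.max() + 1) * n_bins).astype(int), n_bins - 1)
    np.add.at(vox, (bin_idx, p, y, x), 1.0)                      # accumulate event counts per bin
    return vox

# Example with synthetic events over a 10 ms window (64x64 sensor crop, assumed sizes)
rng = np.random.default_rng(0)
n = 5000
vox = events_to_bins(rng.integers(0, 64, n), rng.integers(0, 64, n),
                     np.sort(rng.integers(0, 10_000, n)), rng.integers(0, 2, n))
print(vox.shape)  # (10, 2, 64, 64)
```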
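The core building block can be sketched as follows in PyTorch. This is a minimal, illustrative reconstruction under assumed kernel sizes, leak factor, and threshold, not the authors' released architecture: it pairs a depth‑wise separable convolution with an LIF layer, and the `SpikeFn` class shows how a surrogate gradient (here a fast‑sigmoid derivative) makes the non‑differentiable spike trainable with standard back‑propagation, as described in the training bullet above.

```python
import torch
import torch.nn as nn


class SpikeFn(torch.autograd.Function):
    """Heaviside spike in the forward pass, fast-sigmoid surrogate gradient in the backward pass."""

    @staticmethod
    def forward(ctx, mem_minus_thresh):
        ctx.save_for_backward(mem_minus_thresh)
        return (mem_minus_thresh > 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output / (1.0 + x.abs()) ** 2   # d(spike)/dx approximated by 1 / (1 + |x|)^2


class LIF(nn.Module):
    """Leaky integrate-and-fire neurons unrolled over discrete time steps."""

    def __init__(self, beta=0.9, threshold=1.0):    # assumed leak factor and firing threshold
        super().__init__()
        self.beta, self.threshold = beta, threshold

    def forward(self, x_seq):                       # x_seq: [T, B, C, H, W] input current
        mem = torch.zeros_like(x_seq[0])
        spikes = []
        for x in x_seq:
            mem = self.beta * mem + x               # leaky integration of input current
            spk = SpikeFn.apply(mem - self.threshold)
            mem = mem - spk * self.threshold        # soft reset after a spike
            spikes.append(spk)
        return torch.stack(spikes)


class DWSepLIFBlock(nn.Module):
    """Depth-wise separable convolution followed by an LIF activation, applied per time step."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.lif = LIF()

    def forward(self, x_seq):                       # x_seq: [T, B, C, H, W]
        T, B = x_seq.shape[:2]
        y = self.pointwise(self.depthwise(x_seq.flatten(0, 1)))
        return self.lif(y.view(T, B, *y.shape[1:]))


block = DWSepLIFBlock(in_ch=2, out_ch=16)           # 2 input channels: ON/OFF event polarities
out = block(torch.rand(10, 1, 2, 64, 64))           # 10 time bins, batch of 1, 64x64 crop
print(out.shape)                                    # torch.Size([10, 1, 16, 64, 64])
```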
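To see where the compute reduction comes from, here is a back‑of‑the‑envelope MAC count for a single layer with assumed sizes (not the paper's exact figures): a depth‑wise separable convolution replaces the full K×K×C_in×C_out kernel with a depth‑wise pass plus a 1×1 point‑wise pass, and on event‑driven hardware only active spikes trigger synaptic operations, shrinking the effective count further.

```python
# Assumed layer shape: 64x64 feature map, 3x3 kernel, 32 -> 64 channels.
H = W = 64
K = 3
C_in, C_out = 32, 64

standard_macs = H * W * K * K * C_in * C_out             # standard convolution
dwsep_macs = H * W * (K * K * C_in + C_in * C_out)       # depthwise + pointwise

print(f"standard conv:        {standard_macs / 1e6:.1f} M MACs")
print(f"depthwise separable:  {dwsep_macs / 1e6:.1f} M MACs "
      f"({standard_macs / dwsep_macs:.1f}x fewer)")

# On neuromorphic hardware, synaptic updates fire only for active spikes,
# so effective operations also scale with activation sparsity (10% assumed here).
spike_rate = 0.10
print(f"effective SNN ops:    {dwsep_macs * spike_rate / 1e6:.2f} M")
```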
Results & Findings
| Model | Mean Pupil Error (px) | Params (M) | Theoretical MACs (M) | Estimated Power (mW) | Latency (ms) |
|---|---|---|---|---|---|
| Original ANN (baseline) | 3.5 | 2.1 | 1,800 | ~3,200 | 6 |
| Neuromorphic SNN (proposed) | 3.7–4.1 | 0.10 | 2.1 | 3.9–4.9 | ~3 |
| Retina hardware system | 3.24 | – | – | – | – |
- The SNN retains near‑state‑of‑the‑art accuracy while slashing model size by 20× and compute by ~850×.
- Power and latency estimates place the SNN comfortably within the milliwatt budget of battery‑powered AR glasses, with a response fast enough to support gaze‑contingent rendering (≈300 Hz effective update rate).
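As a quick sanity check, the headline reduction factors and the effective update rate follow directly from the rounded values in the table:

```python
ann_params, snn_params = 2.1e6, 0.10e6     # parameter counts from the table
ann_macs, snn_macs = 1_800e6, 2.1e6        # theoretical MACs from the table
latency_s = 3e-3                           # ~3 ms end-to-end latency

print(f"parameter reduction: {ann_params / snn_params:.0f}x")    # ~21x
print(f"compute reduction:   {ann_macs / snn_macs:.0f}x")        # ~857x
print(f"update rate:         {1 / latency_s:.0f} Hz")            # ~333 Hz
```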
Practical Implications
- AR/VR headsets – Real‑time gaze‑aware rendering can now be done on‑device without offloading to a GPU or cloud, reducing bandwidth, preserving privacy, and extending battery life.
- Assistive wearables – Low‑power eye‑tracking enables eye‑controlled interfaces for users with limited motor ability, even on compact form factors like smart glasses.
- Human‑computer interaction research – Researchers can prototype gaze‑driven UI concepts without needing expensive high‑speed cameras; the event‑based pipeline works robustly under rapid head motion.
- Edge AI hardware – The design aligns with existing neuromorphic chips (e.g., Intel Loihi, BrainChip Akida), making it straightforward to integrate into next‑generation edge processors that already support spiking inference.
Limitations & Future Work
- Hardware validation – Power and latency numbers are projected; real‑world measurements on a physical neuromorphic accelerator are needed to confirm the gains.
- Dataset diversity – Experiments focus on a single event‑based eye‑tracking benchmark; broader testing across lighting conditions, eye shapes, and occlusions would strengthen generalizability.
- Robustness to sensor noise – DVS sensors can produce noisy spikes in low light; future work could explore adaptive thresholding or noise‑aware training.
- Integration with full AR pipelines – Combining the SNN eye‑tracker with downstream gaze‑contingent rendering or foveated rendering modules remains an open systems‑engineering challenge.
Bottom line: By marrying event‑driven vision with spiking neural networks, this work shows that high‑precision, low‑latency eye tracking is no longer a power‑hungry afterthought—it can become a native capability of next‑generation wearable devices.
Authors
- Paul Hueber
- Luca Peres
- Florian Pitters
- Alejandro Gloriani
- Oliver Rhodes
Paper Information
- arXiv ID: 2512.09969v1
- Categories: cs.CV, cs.NE
- Published: December 10, 2025