[Paper] Better, But Not Sufficient: Testing Video ANNs Against Macaque IT Dynamics
Source: arXiv - 2601.03392v1
Overview
A new study asks whether modern video‑capable neural networks can truly mimic the way macaque inferior temporal (IT) cortex processes dynamic visual scenes. While feed‑forward image models have long been the go‑to computational analogues of the primate ventral stream, real‑world vision is inherently temporal. The authors compare several classes of artificial networks—static, recurrent, and video‑trained—to neural recordings from monkeys watching natural movies, revealing where current models succeed and where they fall short.
Key Contributions
- Benchmarking dynamic vision: Introduces the first systematic comparison between macaque IT responses to dynamic stimuli and a suite of video‑trained ANNs, extending the classic static‑image benchmarks.
- Temporal predictivity analysis: Shows that video models modestly improve neural predictivity, especially for later (post‑stimulus) response windows.
- Stress‑test with “appearance‑free” videos: Demonstrates that IT activity generalizes to motion‑only clips (shape/texture removed), whereas all tested ANNs fail to do so.
- Insight into biological dynamics: Provides evidence that IT encodes motion information in an appearance‑invariant manner that current architectures do not capture.
- Roadmap for future model objectives: Argues for training objectives that explicitly incorporate temporal invariances and biological motion statistics.
Methodology
- Neural data collection – Two macaques watched ~30 min of naturalistic video while multi‑unit activity was recorded from IT cortex.
- Model families –
- Static feed‑forward CNNs (e.g., ResNet‑50) applied frame‑by‑frame.
- Recurrent networks (CNN + LSTM/GRU) that integrate information over time.
- Video‑trained networks (e.g., SlowFast, TimeSformer) trained on large video datasets (Kinetics, Something‑Something).
- Predictivity metric – Linear regression decoders were fit from each model’s internal activations to the recorded neural responses, and predictivity was scored with a cross‑validated, noise‑corrected Pearson correlation (a minimal sketch of this pipeline follows the list).
- Temporal windows – Predictivity was measured at early (0–100 ms), middle (100–200 ms), and late (200–300 ms) post‑stimulus intervals to capture the evolution of the neural response.
- Stress test – The same decoders were evaluated on “appearance‑free” videos, in which each frame is replaced by a moving noise texture that preserves the original motion field but destroys object shape and texture (one way to construct such clips is sketched below).
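To make the decoding pipeline concrete, here is a minimal sketch of how frame‑wise activations from a static CNN could be regressed onto windowed IT responses and scored with a cross‑validated, noise‑corrected Pearson correlation. The layer choice, ridge penalty, and split‑half noise‑ceiling estimate are illustrative assumptions, not the paper’s exact settings.

```python
# Minimal sketch: frame-wise CNN activations -> linear decoder -> noise-corrected predictivity.
# Layer choice, ridge penalty, and the noise-ceiling estimate are illustrative assumptions.
import numpy as np
import torch
from torchvision.models import resnet50, ResNet50_Weights
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from scipy.stats import pearsonr

weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights).eval()
preprocess = weights.transforms()
# Everything up to (but not including) the final classification layer.
feature_extractor = torch.nn.Sequential(*list(model.children())[:-1])

@torch.no_grad()
def frame_features(frames):
    """frames: list of PIL images -> (n_frames, 2048) activations."""
    batch = torch.stack([preprocess(f) for f in frames])
    return feature_extractor(batch).flatten(1).numpy()

def noise_corrected_predictivity(X, Y, Y_repeats, n_splits=5):
    """X: (n_stimuli, n_features) activations averaged within one post-stimulus window.
    Y: (n_stimuli, n_neurons) trial-averaged responses in that window.
    Y_repeats: (n_trials, n_stimuli, n_neurons), used for a split-half noise ceiling."""
    n_neurons = Y.shape[1]
    raw = np.zeros(n_neurons)
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(X):
        dec = Ridge(alpha=1.0).fit(X[train], Y[train])
        pred = dec.predict(X[test])
        raw += np.array([pearsonr(pred[:, i], Y[test, i])[0] for i in range(n_neurons)])
    raw /= n_splits
    # Split-half reliability per neuron, Spearman-Brown corrected, as the ceiling.
    half = Y_repeats.shape[0] // 2
    a, b = Y_repeats[:half].mean(0), Y_repeats[half:].mean(0)
    r_half = np.array([pearsonr(a[:, i], b[:, i])[0] for i in range(n_neurons)])
    ceiling = np.sqrt(np.clip(2 * r_half / (1 + r_half), 1e-6, None))
    return np.median(raw / ceiling)
```

Repeating the regression for the early, middle, and late response windows yields the window‑by‑window predictivity curves the paper reports.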
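The summary does not spell out how the appearance‑free clips were built, but one common construction is to push a fixed noise texture through the original video’s estimated optical flow, so the motion field survives while shape and texture do not. The Farneback flow parameters and backward‑warping scheme below are assumptions for illustration, not the paper’s procedure.

```python
# One way to build an "appearance-free" clip: carry a random noise texture
# through the original video's optical flow so motion is preserved while
# shape and texture are destroyed. The paper's exact construction may differ.
import cv2
import numpy as np

def appearance_free_clip(frames, seed=0):
    """frames: list of HxWx3 uint8 BGR frames -> list of HxW noise frames sharing the motion."""
    rng = np.random.default_rng(seed)
    h, w = frames[0].shape[:2]
    noise = rng.integers(0, 256, size=(h, w), dtype=np.uint8)
    out = [noise.copy()]
    prev_gray = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense flow estimated on the ORIGINAL frames, then applied to the noise.
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        map_x = (grid_x - flow[..., 0]).astype(np.float32)
        map_y = (grid_y - flow[..., 1]).astype(np.float32)
        # Backward-warp the previous noise frame along the estimated flow.
        noise = cv2.remap(out[-1], map_x, map_y, cv2.INTER_LINEAR,
                          borderMode=cv2.BORDER_REFLECT)
        out.append(noise)
        prev_gray = gray
    return out
```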
Results & Findings
- Baseline performance: Static CNNs achieve the highest predictivity in the early window, confirming that feed‑forward processing dominates the initial IT response.
- Temporal boost: Recurrent and video‑trained models improve predictivity by ~3–5 % in the middle and late windows, indicating they capture some of the dynamics that emerge after the initial feed‑forward sweep.
- Failure on appearance‑free stimuli: When tested on motion‑only clips, IT responses remain highly correlated with those to the original videos (evidence of appearance‑invariant motion coding), whereas all ANN classes drop to near‑chance predictivity.
- Interpretation: Current video models primarily learn appearance‑bound dynamics (e.g., texture flow) rather than the abstract, motion‑centric representations that IT maintains across visual changes.
Practical Implications
- Computer vision systems: For applications like autonomous driving or robotics that require robust motion understanding under varying appearances (e.g., night vs. day, weather changes), relying on existing video models may leave a blind spot.
- Model design: Incorporating training objectives that reward invariance to texture and shape while preserving motion cues, such as contrastive learning on motion‑only augmentations, could yield more biologically plausible and robust representations (a toy contrastive setup is sketched after this list).
- Neuro‑AI collaboration: The stress‑test paradigm offers a simple, reproducible benchmark for developers to evaluate whether their models truly capture temporal invariances, beyond raw accuracy on standard video classification tasks.
- Hardware acceleration: Understanding that later IT dynamics involve recurrent‑like processing may inspire hardware designers to allocate more resources to temporal memory units for low‑latency video analytics.
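As a rough illustration of the kind of objective suggested above, the sketch below pairs each clip’s embedding with that of its appearance‑free counterpart under a symmetric InfoNCE loss, rewarding representations that survive the removal of shape and texture. The encoder interface, temperature, and loss form are assumptions rather than a recipe from the paper.

```python
# Toy contrastive objective pairing each clip with its appearance-free version,
# so the encoder is rewarded for representations that survive removal of
# shape and texture. Encoder interface and temperature are assumptions.
import torch
import torch.nn.functional as F

def motion_invariance_info_nce(z_orig, z_appfree, temperature=0.1):
    """z_orig, z_appfree: (batch, dim) embeddings of original clips and their
    appearance-free counterparts. Positive pairs share the same row index."""
    z1 = F.normalize(z_orig, dim=1)
    z2 = F.normalize(z_appfree, dim=1)
    logits = z1 @ z2.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    # Symmetric InfoNCE: each clip must pick out its own motion-matched partner.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage sketch (video_encoder and appearance_free are placeholders):
#   z_orig = video_encoder(clips)
#   z_appfree = video_encoder(appearance_free(clips))
#   loss = motion_invariance_info_nce(z_orig, z_appfree)
```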
Limitations & Future Work
- Dataset scope: The neural recordings are limited to a single set of naturalistic videos; broader stimulus families (e.g., controlled motion paradigms) could test generality.
- Model diversity: Only a handful of video architectures were examined; newer transformer‑based or biologically inspired spiking models might perform differently.
- Decoding simplicity: Linear decoders may not capture non‑linear readouts that downstream brain areas use; richer readout models could change predictivity estimates.
- Objective design: The authors call for new training losses that explicitly encode temporal statistics—future work should explore how to formulate and optimize such objectives at scale.
Bottom line: While video‑trained ANNs are a step forward, they still fall short of the appearance‑invariant motion processing seen in macaque IT. Bridging this gap will require rethinking both the data we train on and the objectives we optimize, opening exciting avenues for more dynamic, brain‑inspired AI.
Authors
- Matteo Dunnhofer
- Christian Micheloni
- Kohitij Kar
Paper Information
- arXiv ID: 2601.03392v1
- Categories: cs.CV, cs.NE
- Published: January 6, 2026