[Paper] Interpreting V1 Population Activity via Image-Neural Latent Representation Alignment

Published: May 5, 2026 at 05:15 PM EDT

Source: arXiv - 2605.04309v1

Overview

The paper introduces Dual‑Tower Image‑Neural Alignment (DINA), a contrastive learning framework that aligns visual images and mouse V1 population activity in a common latent space. DINA not only improves the accuracy of decoding visual stimuli from neural recordings but also offers insight into how V1 carries out its visual computations, something prior black‑box decoders have struggled to explain.

Key Contributions

  • Dual‑tower architecture that jointly learns image and neural embeddings at the level of intermediate feature maps, preserving spatial structure for interpretability.
  • Contrastive alignment loss that forces corresponding image‑neural pairs to occupy nearby points in the shared latent space while pushing mismatched pairs apart.
  • Demonstrated decoding performance on a massive two‑photon calcium imaging dataset (≈ 10⁶ spikes from thousands of V1 neurons) that rivals or exceeds state‑of‑the‑art neural decoders.
  • Interpretability pipeline that maps latent dimensions back to image regions and to sparse subsets of highly responsive neurons, revealing which visual cues drive decoding.
  • Empirical insight that V1 decoding relies mainly on coarse, low‑level structure (edges, textures) rather than high‑level semantic content.

Methodology

  1. Data preprocessing – Two‑photon calcium imaging traces are de‑convolved into spike‑rate estimates and paired with the corresponding natural‑scene images shown to the mouse.
  2. Dual‑tower design
    • Image tower: a shallow CNN extracts multi‑scale feature maps (e.g., 32 × 32 spatial resolution, 64 channels).
    • Neural tower: a fully‑connected network reshapes the high‑dimensional population vector into the same spatial layout, then applies 1×1 convolutions to produce comparable feature maps.
  3. Contrastive loss – For each (image, neural) pair, the cosine similarity of their latent feature maps is maximized; similarity with all other pairs in the mini‑batch is minimized (InfoNCE style).
  4. Alignment & decoding – After training, a simple linear probe on the shared latent space predicts the presented image (or its class) from neural activity.
  5. Interpretability analysis
    • Spatial saliency: back‑project latent dimensions onto the original image to see which patches contribute most.
    • Neuron importance: compute gradient‑based attribution scores to identify the sparse neuron subset that drives each latent dimension.
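
The contrastive step (step 3) can be illustrated with a minimal sketch. It is written in plain Python rather than PyTorch so that it runs without dependencies, and it makes several assumptions not stated in the summary: the latent feature maps are flattened into vectors, the loss is applied symmetrically (image to neural and neural to image), and the temperature value of 0.1 is arbitrary. The function name `info_nce` is hypothetical, not taken from the paper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(image_emb, neural_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss over a mini-batch of paired embeddings.

    image_emb[i] and neural_emb[i] are the flattened latents of the i-th
    (image, neural response) pair; all other rows in the batch are negatives.
    The symmetric form and the temperature are assumptions of this sketch.
    """
    img = [l2_normalize(v) for v in image_emb]
    neu = [l2_normalize(v) for v in neural_emb]
    n = len(img)
    # logits[i][j]: temperature-scaled cosine similarity of image i and response j
    logits = [
        [sum(a * b for a, b in zip(img[i], neu[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        # Mean of -log softmax(row)[i], using the log-sum-exp trick for stability
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]  # neural -> image direction
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(columns))

# Toy batch: two images and two neural responses with matching structure
images = [[1.0, 0.0], [0.0, 1.0]]
responses = [[2.0, 0.1], [0.1, 3.0]]
loss_matched = info_nce(images, responses)
loss_mismatched = info_nce(images, [responses[1], responses[0]])
```

With correctly paired rows the loss is close to zero; swapping the responses makes every positive pair dissimilar and the loss rises sharply, which is the behavior the alignment objective relies on.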

All components are implemented in PyTorch and trainable on a single GPU within a few hours.
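
The neuron-importance step of the interpretability analysis can be sketched as gradient × input attribution followed by a top-k selection. This is only an illustration: the summary does not say which gradient-based method the authors use, a real model would obtain gradients from automatic differentiation instead of finite differences, and `toy_latent` is a hypothetical stand-in for one latent dimension of the neural tower.

```python
def numerical_gradient(f, x, eps=1e-5):
    """Central-difference estimate of df/dx_i for each neuron i."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def gradient_times_input(f, x):
    """Attribution score per neuron: local gradient times the neuron's activity."""
    return [g * xi for g, xi in zip(numerical_gradient(f, x), x)]

def top_k_neurons(scores, k):
    """Indices of the k neurons with the largest absolute attribution."""
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return order[:k]

def toy_latent(rates):
    # Hypothetical latent dimension driven mainly by neurons 1 and 3,
    # including one pairwise interaction between them.
    return 2.0 * rates[1] - 1.5 * rates[3] + 0.5 * rates[1] * rates[3] + 0.01 * rates[0]

rates = [1.0, 0.8, 0.3, 1.2, 0.0]          # toy spike-rate estimates for five neurons
scores = gradient_times_input(toy_latent, rates)
top = top_k_neurons(scores, k=2)           # sparse subset driving this latent dimension
```

Ranking by absolute score keeps neurons that suppress a latent dimension as well as those that excite it; whether the paper treats negative attributions this way is not stated.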

Results & Findings

| Metric | DINA (Neural → Image) | Prior CNN‑based Decoder |
|---|---|---|
| Top‑1 image reconstruction accuracy | 78 % | 62 % |
| Pearson correlation (pixel‑wise) | 0.71 | 0.58 |
| Number of neurons needed for 90 % performance | ≈ 12 % of the recorded population | ≈ 35 % |
  • Coarse structure dominates: Ablation experiments that blur images down to their low spatial frequencies, and thereby remove high‑frequency detail, cause only a modest drop in decoding accuracy.
  • Sparse neuron ensembles: The most predictive latent dimensions can be reconstructed from ~5–10 highly responsive neurons plus their pairwise functional interactions, suggesting a sparse coding scheme in which a few neurons carry most of the decodable signal.
  • Distributed spatial mapping: Alignable feature maps arise from multiple, non‑contiguous image patches, indicating that V1 integrates shape and texture cues across the visual field rather than focusing on a single region.
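
For readers who want to reproduce a metric like the pixel-wise Pearson correlation in the table, the following sketch shows one plausible way to compute it. It assumes the correlation is taken over the flattened pixels of a single target image and its decoded counterpart; how the paper aggregates the score across images is not stated here, and the toy images are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences of pixel intensities."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 pixels")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant image")
    return cov / math.sqrt(var_x * var_y)

def flatten(image):
    """Turn a 2D image (list of rows) into a flat list of pixel values."""
    return [pixel for row in image for pixel in row]

# Toy 2x2 grayscale images (hypothetical values)
target = [[0.0, 0.2], [0.8, 1.0]]
decoded = [[0.1, 0.3], [0.7, 0.9]]
r = pearson(flatten(target), flatten(decoded))
```

Because Pearson correlation is invariant to brightness offsets and contrast scaling, a high score indicates that the decoded image preserves the spatial pattern of the target, not that the pixel values match exactly.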

Practical Implications

  • Brain‑computer interfaces (BCIs): DINA’s ability to decode visual content from a relatively small, interpretable neuron set could reduce the sensor count and computational load for visual prosthetics or closed‑loop neurofeedback systems.
  • Neuro‑inspired computer vision: The dual‑tower alignment paradigm offers a template for building models that learn joint representations of sensor data and internal states, useful for robotics where perception must be tightly coupled with internal control signals.
  • Model debugging & neuroscience‑AI synergy: Because latent dimensions map back to concrete image patches and neuron groups, developers can inspect failure cases, guide data collection, or even fine‑tune vision models using neural constraints.
  • Efficient data labeling: In scenarios where ground‑truth labels are scarce but neural recordings are abundant (e.g., animal behavior studies), DINA can serve as a self‑supervised label generator, accelerating dataset creation for downstream ML tasks.

Limitations & Future Work

  • Species & modality specificity: The study is limited to mouse V1 and two‑photon calcium imaging; generalizing to primate cortex or electrophysiology may require architectural tweaks.
  • Temporal dynamics omitted: DINA treats each stimulus‑response pair as static, ignoring the rich temporal evolution of V1 activity that could further improve decoding.
  • Interpretability granularity: While feature maps are spatially resolved, the current attribution methods do not capture sub‑pixel or sub‑neuron microcircuits; more fine‑grained causal probing is needed.
  • Scalability to higher visual areas: Extending the framework to areas that encode semantic information (e.g., V4, IT) will test whether the coarse‑structure bias holds or if higher‑level features become dominant.

Overall, DINA bridges the gap between high‑performance neural decoding and mechanistic insight, offering a practical toolkit for developers interested in neuro‑aware AI systems.

Authors

  • Xin Wang
  • Zhuangzhi Gao
  • Hongyi Qin
  • Zhongli Wu
  • Feixiang Zhou
  • He Zhao

Paper Information

  • arXiv ID: 2605.04309v1
  • Categories: cs.NE
  • Published: May 5, 2026