[Paper] Human-level 3D shape perception emerges from multi-view learning

Published: February 19, 2026
4 min read
Source: arXiv (2602.17650v1)

Overview

A new study shows that a neural network trained only to predict basic visual‑spatial cues from multiple views of a scene can infer 3‑D object shape as accurately as humans. By letting the model learn from naturalistic image collections—without any hand‑crafted 3‑D priors—the researchers demonstrate that human‑level 3‑D perception can emerge from a simple, scalable learning objective.

Key Contributions

  • Multi‑view learning framework that predicts camera pose and depth from unordered image sets, mimicking the visual cues humans use.
  • Zero‑shot evaluation on a classic 3‑D shape perception benchmark, showing the model matches human accuracy without task‑specific fine‑tuning.
  • Fine‑grained behavioral alignment: model response patterns predict human error distributions and reaction‑time trends.
  • Open‑source release of code, stimuli, and human behavioral data, enabling reproducibility and downstream research.

Methodology

  1. Data collection – The authors gathered naturalistic image sequences from real scenes, each sequence containing several photos taken from different camera positions.
  2. Network architecture – A standard convolutional backbone processes each image independently; a shared “view‑encoder” produces a latent representation.
  3. Training objective – Instead of supervising the model with explicit 3‑D meshes, it is trained to predict visual‑spatial signals that are readily observable:
    • The 3‑D camera location (relative to the scene)
    • Per‑pixel depth maps for each view
      These signals are derived automatically from the known capture geometry, so no manual labeling is required.
  4. Zero‑shot testing – After training, the model is given the same 2‑D images used in a classic human psychophysics experiment. A simple readout (e.g., linear probe) extracts the inferred 3‑D shape, which is then compared to human judgments.
  5. Behavioral comparison – Correlation analyses link model confidence scores to human reaction times, and confusion matrices reveal matching error patterns.
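The training objective in step 3 can be sketched as a weighted sum of a camera-pose error and a per-pixel depth error. This is an illustrative reconstruction, not the authors' implementation; the function name, loss terms (L2 for pose, L1 for depth), and weights are assumptions:

```python
import numpy as np

def multi_view_loss(pred_pose, true_pose, pred_depth, true_depth,
                    pose_weight=1.0, depth_weight=1.0):
    """Illustrative objective: supervise only observable visual-spatial
    signals -- camera location (L2 error) and per-pixel depth (L1 error)."""
    pose_loss = np.mean((pred_pose - true_pose) ** 2)
    depth_loss = np.mean(np.abs(pred_depth - true_depth))
    return pose_weight * pose_loss + depth_weight * depth_loss

# Toy batch: 4 views, 3-D camera positions, 32x32 depth maps.
loss = multi_view_loss(np.zeros((4, 3)), np.ones((4, 3)),
                       np.zeros((4, 32, 32)), np.ones((4, 32, 32)))
# With these all-zeros/all-ones toy inputs, loss = 1.0 + 1.0 = 2.0
```

Because both targets are derived from the capture geometry, a loss of this shape needs no 3-D meshes or manual annotation, which is what makes the objective scalable.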

Results & Findings

  • Human‑level accuracy: On the benchmark task, the multi‑view model achieved ~92 % correct shape judgments, statistically indistinguishable from the average human participant.
  • Error pattern similarity: The model’s mistakes clustered on the same ambiguous viewpoints that confuse people (e.g., foreshortened silhouettes).
  • Reaction‑time prediction: Higher model confidence correlated with faster human responses (Pearson r ≈ 0.68), suggesting the network’s internal certainty mirrors human processing speed.
  • Ablation studies: Removing the multi‑view component or training on a single view caused performance to drop to ~70 %, underscoring the importance of spatial consistency across views.
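The confidence-to-reaction-time analysis above is a standard Pearson correlation. As a minimal sketch with made-up data (the arrays below are hypothetical; only the magnitude |r| ≈ 0.68 comes from the paper):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Hypothetical data: higher model confidence pairs with faster (smaller)
# reaction times, so confidence vs. raw RT yields a negative r.
confidence = [0.9, 0.8, 0.7, 0.5, 0.3]
reaction_time_ms = [420, 450, 500, 610, 700]
r = pearson_r(confidence, reaction_time_ms)
```

On real trial-level data the same computation, applied per stimulus, is what links the network's internal certainty to human processing speed.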

Practical Implications

  • Robotics & AR/VR: Systems that need on‑the‑fly 3‑D reconstruction (e.g., drones navigating cluttered environments or AR headsets overlaying graphics) can adopt this lightweight multi‑view training regime instead of expensive 3‑D annotation pipelines.
  • Content creation: Developers building photogrammetry tools can leverage the approach to generate accurate shape estimates from casual photo collections without requiring dense point‑cloud supervision.
  • Human‑computer interaction: The tight link between model confidence and reaction time opens avenues for adaptive UI designs that anticipate user difficulty and adjust visual feedback in real time.
  • Scalable perception models: Because the training data are just ordinary images with known camera poses, the method scales to massive internet photo collections, potentially yielding universal 3‑D perception modules that can be plugged into existing vision stacks.

Limitations & Future Work

  • Dependence on known camera poses: The current training pipeline assumes accurate pose metadata, which may not be available for all datasets.
  • Generalization to novel object categories: While the model works well on the tested set, its performance on highly reflective or transparent objects remains untested.
  • Real‑time constraints: The inference pipeline processes each view independently; latency‑critical applications would need a genuinely real‑time multi‑camera fusion module.
  • Future directions suggested by the authors include self‑supervised pose estimation, extending the framework to video streams, and exploring how additional sensory cues (e.g., tactile feedback) could further close the gap between artificial and human 3‑D perception.

Authors

  • Tyler Bonnen
  • Jitendra Malik
  • Angjoo Kanazawa

Paper Information

  • arXiv ID: 2602.17650v1
  • Categories: cs.CV
  • Published: February 19, 2026
