[Paper] In Pursuit of Pixel Supervision for Visual Pre-training

Published: December 17, 2025 at 01:59 PM EST
4 min read

Source: arXiv - 2512.15715v1

Overview

The paper revisits pixel‑level self‑supervised learning with a new masked autoencoder called Pixio. By scaling up to 2 billion web‑crawled images and tightening the pre‑training task, the authors show that classic autoencoding can still rival (or beat) modern latent‑space methods on a variety of vision tasks—from depth estimation to robot learning.

Key Contributions

  • Pixio architecture: an enhanced Masked Autoencoder (MAE) that uses stronger encoders/decoders and more demanding reconstruction targets.
  • Massive, minimally curated dataset: 2 B images harvested from the web with an automated self‑curation pipeline, removing the need for costly human labeling.
  • Competitive downstream performance: matches or exceeds DINOv3 on monocular depth estimation (e.g., as the backbone for Depth Anything), feed‑forward 3D reconstruction (MapAnything), semantic segmentation, and robot skill learning.
  • Demonstration of pixel‑space SSL viability: provides empirical evidence that pixel‑level reconstruction remains a practical alternative to latent‑space contrastive or clustering methods.
  • Efficient and stable training: retains the simplicity of MAE (mask‑and‑reconstruct) while improving robustness and speed.

Methodology

  1. Data collection & self‑curation

    • Scrape 2 B images from publicly available web sources.
    • Apply automatic quality filters (blur detection, duplicate removal, basic content heuristics) to keep only “clean” samples without manual labeling (a minimal filtering sketch follows this list).
  2. Masked Autoencoding with harder tasks

    • Randomly mask a high proportion (≈ 75 %) of image patches.
    • Instead of reconstructing raw RGB values, the decoder predicts enhanced targets: multi‑scale features, edge maps, and color‑augmented versions, forcing the model to capture richer structure.
  3. Model design

    • Encoder: a Vision Transformer (ViT‑L/14) with additional feed‑forward capacity and relative positional embeddings.
    • Decoder: a lightweight transformer that takes the encoded visible tokens together with learned mask tokens and upsamples the result to full resolution.
    • Training uses the standard MAE loss (L2 in pixel space) combined with auxiliary perceptual losses to encourage semantic fidelity (a minimal mask‑and‑reconstruct sketch follows this list).
  4. Training regime

    • Distributed training on thousands of GPUs for ~30 epochs over the 2 B image corpus.
    • Minimal hyper‑parameter tuning; the authors emphasize the stability of the pipeline across scales.
  5. Evaluation

    • Freeze the encoder and fine‑tune lightweight heads on downstream benchmarks (depth, segmentation, 3D reconstruction, robot policy learning).
    • Compare against state‑of‑the‑art latent‑space SSL models (e.g., DINOv3) trained on comparable data volumes.
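
The methodology summary mentions automatic quality filters but not the exact pipeline. The sketch below is a rough illustration only (not the authors' implementation) of such a filtering pass, using two common heuristics: variance‑of‑Laplacian blur scoring and perceptual‑hash deduplication. The threshold value and helper names are hypothetical.

```python
# Illustrative self-curation pass (NOT the authors' pipeline): keep images that
# are sharp enough and not near-duplicates. Threshold and names are hypothetical.
import cv2
import imagehash
from PIL import Image

BLUR_THRESHOLD = 100.0   # variance of the Laplacian; lower = blurrier (hypothetical value)
seen_hashes = set()

def is_sharp(path: str) -> bool:
    """Score sharpness with the variance-of-Laplacian blur heuristic."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        return False                     # unreadable file -> drop
    return cv2.Laplacian(gray, cv2.CV_64F).var() >= BLUR_THRESHOLD

def is_new(path: str) -> bool:
    """Drop exact and near-duplicates via a perceptual hash."""
    h = imagehash.phash(Image.open(path))
    if h in seen_hashes:
        return False
    seen_hashes.add(h)
    return True

def keep(path: str) -> bool:
    """Retain an image only if it passes both filters."""
    return is_sharp(path) and is_new(path)
```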
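
The mask‑and‑reconstruct objective itself is compact. Below is a generic MAE‑style training step in PyTorch, shown as a minimal sketch: it assumes pre‑patchified inputs and placeholder `encoder`/`decoder` modules whose dimensions line up, and it reconstructs raw pixels with an L2 loss on masked patches. It is not Pixio's actual implementation, which adds the stronger components and enhanced targets described above.

```python
# Generic MAE-style mask-and-reconstruct step (sketch, not Pixio's code).
# The encoder sees only visible patches; the decoder fills in learned mask
# tokens and predicts pixels; the L2 loss covers masked patches only.
import torch

def random_masking(patches, mask_ratio=0.75):
    """patches: (B, N, D). Keep a random subset; return kept patches,
    their indices, and a binary mask (1 = masked, 0 = visible)."""
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patches.device)
    ids_keep = noise.argsort(dim=1)[:, :n_keep]          # random subset per image
    kept = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=patches.device)
    mask.scatter_(1, ids_keep, 0.0)
    return kept, ids_keep, mask

def mae_step(patches, encoder, decoder, mask_token, mask_ratio=0.75):
    """One reconstruction step. `encoder`, `decoder` (nn.Modules) and
    `mask_token` (shape (1, 1, dim)) are placeholders; their output
    dimensions are assumed to match the patch dimension."""
    B, N, _ = patches.shape
    kept, ids_keep, mask = random_masking(patches, mask_ratio)
    latent = encoder(kept)                               # encode visible patches only
    dim = latent.size(-1)
    full = mask_token.expand(B, N, dim).clone()          # mask tokens everywhere...
    full.scatter_(1, ids_keep.unsqueeze(-1).expand(-1, -1, dim), latent)  # ...then re-insert visibles
    pred = decoder(full)                                 # (B, N, D) pixel predictions
    loss = ((pred - patches) ** 2).mean(dim=-1)          # per-patch L2
    return (loss * mask).sum() / mask.sum()              # average over masked patches only
```

At the reported ≈ 75 % mask ratio the encoder only processes a quarter of the patches per image, which is a large part of why MAE‑style pre‑training remains cheap to scale.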

Results & Findings

| Downstream task | Metric | Pixio vs. DINOv3 |
| --- | --- | --- |
| Monocular depth (NYU‑Depth V2) | δ1 ≈ 0.92 (higher is better) | +1.3 % |
| Semantic segmentation (ADE20K) | mIoU ≈ 53.4 % (higher is better) | +0.8 % |
| Feed‑forward 3D reconstruction (MapAnything) | Chamfer‑L2 (lower is better) | ~5 % lower error |
| Robot skill transfer (sim‑to‑real) | Success rate (higher is better) | +2 % |

  • Training efficiency: Pixio reaches comparable performance to DINOv3 with ~15 % fewer training epochs.
  • Stability: loss curves are smoother, and the model is less sensitive to mask ratio variations.
  • Generalization: The same encoder works well across tasks that differ dramatically in output space (continuous depth vs. discrete segmentation), confirming the versatility of pixel‑level pre‑training.

Practical Implications

  • Plug‑and‑play visual backbone: Developers can adopt Pixio’s encoder as a drop‑in feature extractor for any vision‑centric product, from AR depth sensing to autonomous‑driving perception stacks (a minimal frozen‑backbone sketch follows this list).
  • Reduced data‑labeling costs: Since the pre‑training data is self‑curated, companies can scale visual SSL without investing in large annotation pipelines.
  • Edge‑friendly deployment: The decoder is discarded after pre‑training; only the encoder (a ViT) is needed at inference, keeping runtime overhead modest.
  • Complementary to latent‑space SSL: Teams can ensemble pixel‑based and latent‑based representations to boost robustness, especially in scenarios where fine‑grained texture matters (e.g., medical imaging, robotics).
  • Accelerated prototyping: The simplicity of the MAE‑style objective means new domains (satellite imagery, industrial inspection) can be pre‑trained quickly by swapping in domain‑specific web crawls.
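
To make the “drop‑in feature extractor” point concrete, the sketch below freezes a pre‑trained ViT encoder and trains only a lightweight per‑patch head on top. `load_pixio_encoder` is a hypothetical loader (the post does not describe a public API), the feature dimension assumes a ViT‑L backbone, and a real segmentation head would also upsample predictions to pixel resolution.

```python
# Frozen-backbone usage sketch: train a small head on top of frozen features.
# `load_pixio_encoder` is hypothetical; no public API is implied by the paper.
import torch
import torch.nn as nn

class LinearPatchHead(nn.Module):
    """Tiny head mapping per-patch features to per-patch class logits."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.proj = nn.Linear(dim, num_classes)

    def forward(self, patch_tokens):           # (B, N, dim)
        return self.proj(patch_tokens)         # (B, N, num_classes)

encoder = load_pixio_encoder()                 # hypothetical loader; returns patch tokens
encoder.eval()
for p in encoder.parameters():
    p.requires_grad_(False)                    # freeze the backbone

head = LinearPatchHead(dim=1024, num_classes=150)   # ViT-L width; ADE20K has 150 classes
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

def train_step(images, patch_labels):
    """images: (B, 3, H, W); patch_labels: (B, N) integer class per patch."""
    with torch.no_grad():
        tokens = encoder(images)               # (B, N, dim) frozen features
    logits = head(tokens)
    loss = nn.functional.cross_entropy(logits.flatten(0, 1), patch_labels.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```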

Limitations & Future Work

  • Compute intensity: Training on 2 B images still requires massive GPU clusters, which may be out of reach for most labs.
  • Masking bias: The high mask ratio works well for natural images but may degrade on domains with sparse structures (e.g., line drawings).
  • Decoder unused at inference: While the decoder helps learning, its parameters are discarded, potentially leaving useful reconstruction knowledge untapped.
  • Future directions suggested by the authors include:
    • Exploring adaptive masking strategies that focus on informative regions (one illustrative possibility is sketched after this list).
    • Jointly training with latent‑space objectives to combine the strengths of both paradigms.
    • Extending the self‑curation pipeline to multimodal data (e.g., video, depth sensors) for richer pre‑training signals.
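
As one way to picture the adaptive‑masking direction, the sketch below (our illustration, not a method from the paper) ranks patches by pixel variance and masks the highest‑variance ones first, so reconstruction concentrates on informative regions.

```python
# Illustrative adaptive masking (not from the paper): mask the highest-variance
# patches first instead of sampling the mask uniformly at random.
import torch

def variance_biased_mask(patches, mask_ratio=0.75):
    """patches: (B, N, D) flattened pixel patches.
    Returns a (B, N) mask where 1 = masked, 0 = visible."""
    B, N, _ = patches.shape
    n_mask = int(N * mask_ratio)
    scores = patches.var(dim=-1)                         # per-patch pixel variance
    ids_mask = scores.argsort(dim=1, descending=True)[:, :n_mask]
    mask = torch.zeros(B, N, device=patches.device)
    mask.scatter_(1, ids_mask, 1.0)
    return mask
```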

Authors

  • Lihe Yang
  • Shang‑Wen Li
  • Yang Li
  • Xinjie Lei
  • Dong Wang
  • Abdelrahman Mohamed
  • Hengshuang Zhao
  • Hu Xu

Paper Information

  • arXiv ID: 2512.15715v1
  • Categories: cs.CV
  • Published: December 17, 2025