[Paper] In Pursuit of Pixel Supervision for Visual Pre-training
Source: arXiv - 2512.15715v1
Overview
The paper revisits pixel‑level self‑supervised learning with a new masked autoencoder called Pixio. By scaling up to 2 billion web‑crawled images and tightening the pre‑training task, the authors show that classic autoencoding can still rival (or beat) modern latent‑space methods on a variety of vision tasks—from depth estimation to robot learning.
Key Contributions
- Pixio architecture: an enhanced Masked Autoencoder (MAE) that uses stronger encoders/decoders and more demanding reconstruction targets.
- Massive, minimally curated dataset: 2 B images harvested from the web with an automated self‑curation pipeline, removing the need for costly human labeling.
- Competitive downstream performance: matches or exceeds DINOv3 on monocular depth (e.g., Depth Anything), feed‑forward 3D reconstruction (MapAnything), semantic segmentation, and robot skill learning.
- Demonstration of pixel‑space SSL viability: provides empirical evidence that pixel‑level reconstruction remains a practical alternative to latent‑space contrastive or clustering methods.
- Efficient and stable training: retains the simplicity of MAE (mask‑and‑reconstruct) while improving robustness and speed.
Methodology
Data collection & self‑curation
- Scrape 2 B images from publicly available web sources.
- Apply automatic quality filters (blur detection, duplicate removal, basic content heuristics) to keep only “clean” samples without manual labeling; a simplified filtering pass is sketched below.
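As an illustration of the kind of automated filtering described above, the sketch below drops exact duplicates via content hashing and flags blurry images via Laplacian variance. This is a minimal stand‑in under assumptions of my own (the threshold, helper names, and choice of heuristics), not the paper's actual curation pipeline.

```python
# Minimal sketch of an automated self-curation pass (not the paper's pipeline):
# drop exact duplicates via content hashing and discard blurry images via the
# variance of a Laplacian response.
import hashlib
from pathlib import Path

import numpy as np
from PIL import Image
from scipy.ndimage import laplace

BLUR_THRESHOLD = 100.0  # hypothetical cutoff; would be tuned on held-out data


def is_sharp(img: Image.Image) -> bool:
    """Heuristic blur check: low Laplacian variance suggests a blurry image."""
    gray = np.asarray(img.convert("L"), dtype=np.float64)
    return laplace(gray).var() > BLUR_THRESHOLD


def curate(paths: list[Path]) -> list[Path]:
    """Keep one copy of each unique, reasonably sharp image."""
    seen_hashes: set[str] = set()
    kept: list[Path] = []
    for path in paths:
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen_hashes:          # exact-duplicate removal
            continue
        seen_hashes.add(digest)
        with Image.open(path) as img:
            if is_sharp(img):              # blur filter
                kept.append(path)
    return kept
```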
Masked Autoencoding with harder tasks
- Randomly mask a high proportion (≈ 75 %) of image patches (see the masking sketch after this list).
- Instead of reconstructing raw RGB values, the decoder predicts enhanced targets: multi‑scale features, edge maps, and color‑augmented versions, forcing the model to capture richer structure.
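The masking step itself follows the standard MAE recipe. The sketch below shows one common way to keep a random ~25 % of patch tokens and record which patches were hidden; it is a generic implementation for illustration, not necessarily the authors' exact code.

```python
# Generic MAE-style random patch masking at a ~75% ratio.
import torch


def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patch tokens; return kept tokens and a binary mask.

    tokens: (batch, num_patches, dim) patch embeddings.
    """
    b, n, d = tokens.shape
    n_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n, device=tokens.device)       # per-patch random scores
    ids_shuffle = noise.argsort(dim=1)                    # random permutation
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, n, device=tokens.device)         # 1 = masked, 0 = visible
    mask.scatter_(1, ids_keep, 0)
    return visible, mask
```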
Model design
- Encoder: a Vision Transformer (ViT‑L/14) with additional feed‑forward capacity and relative positional embeddings.
- Decoder: a lightweight transformer that takes the encoded visible tokens together with learned mask tokens and reconstructs the image at full resolution.
- Training uses the standard MAE loss (L2 in pixel space) combined with auxiliary perceptual losses to encourage semantic fidelity; a simplified loss sketch follows below.
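A simplified version of such a combined objective is sketched below: an L2 term computed only on masked patches plus an auxiliary feature‑space ("perceptual") term against a frozen feature extractor. The weighting and the choice of feature extractor are assumptions for illustration, not details taken from the paper.

```python
# Sketch of a combined reconstruction objective: L2 on masked patches plus an
# auxiliary feature-space ("perceptual") term. Weights and the frozen feature
# extractor are illustrative assumptions, not the paper's exact recipe.
import torch
import torch.nn.functional as F


def masked_pixel_loss(pred, target, mask):
    """MSE over masked patches only.

    pred, target: (batch, num_patches, patch_dim) per-patch pixel values.
    mask:         (batch, num_patches), 1 where the patch was masked.
    """
    per_patch = ((pred - target) ** 2).mean(dim=-1)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)


def combined_loss(pred, target, mask, recon_img, orig_img, feat_extractor,
                  perceptual_weight=0.1):
    """Pixel L2 on masked patches + L2 between features of a frozen network."""
    pixel = masked_pixel_loss(pred, target, mask)
    with torch.no_grad():
        ref_feat = feat_extractor(orig_img)       # reference features, no gradient
    perceptual = F.mse_loss(feat_extractor(recon_img), ref_feat)
    return pixel + perceptual_weight * perceptual
```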
Training regime
- Distributed training on thousands of GPUs for ~30 epochs over the 2 B‑image corpus (a generic data‑parallel sketch follows below).
- Minimal hyper‑parameter tuning; the authors emphasize the stability of the pipeline across scales.
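For orientation, the sketch below shows a generic PyTorch distributed data‑parallel loop of the kind such a regime builds on. It is not the authors' actual large‑scale setup; batch size, learning rate, and the assumption that the model returns its own loss are placeholders.

```python
# Generic distributed data-parallel training loop (launched via torchrun);
# not the authors' actual infrastructure.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def train_ddp(model, dataset, epochs=30, lr=1.5e-4):
    dist.init_process_group("nccl")                   # one process per GPU
    device = dist.get_rank() % torch.cuda.device_count()
    model = DDP(model.to(device), device_ids=[device])
    sampler = torch.utils.data.distributed.DistributedSampler(dataset)
    loader = torch.utils.data.DataLoader(dataset, batch_size=256, sampler=sampler)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for epoch in range(epochs):
        sampler.set_epoch(epoch)                      # reshuffle shards each epoch
        for images in loader:
            loss = model(images.to(device))           # assume model returns its loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    dist.destroy_process_group()
```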
Evaluation
- Freeze the encoder and fine‑tune lightweight heads on downstream benchmarks (depth, segmentation, 3D reconstruction, robot policy learning); see the probe sketch after this list.
- Compare against state‑of‑the‑art latent‑space SSL models (e.g., DINOv3) trained on comparable data volumes.
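A minimal version of this frozen‑encoder protocol looks like the sketch below; `pretrained_encoder`, `task_head`, and the optimiser settings are placeholders, not a released API.

```python
# Sketch of the frozen-encoder evaluation: freeze the pre-trained backbone and
# train only a lightweight task head. Module names are placeholders.
import torch
import torch.nn as nn


def build_probe(encoder: nn.Module, head: nn.Module) -> nn.Module:
    """Compose a frozen backbone with a trainable lightweight head."""
    for p in encoder.parameters():
        p.requires_grad = False            # gradients flow only into the head
    encoder.eval()
    return nn.Sequential(encoder, head)


# Usage (hypothetical modules and settings):
#   probe = build_probe(pretrained_encoder, task_head)
#   optimizer = torch.optim.AdamW(task_head.parameters(), lr=1e-3)
#   for images, labels in loader:
#       loss = criterion(probe(images), labels)
#       optimizer.zero_grad(); loss.backward(); optimizer.step()
```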
Results & Findings
| Downstream task | Metric (↑/↓ marks the better direction) | Pixio vs. DINOv3 |
|---|---|---|
| Monocular depth (NYU‑Depth V2) | δ1 ≈ 0.92 (↑) | +1.3 % |
| Semantic segmentation (ADE20K) | mIoU ≈ 53.4 % (↑) | +0.8 % |
| Feed‑forward 3D reconstruction (MapAnything) | Chamfer‑L2 (↓) | ~5 % lower error |
| Robot skill transfer (sim‑to‑real) | Success rate (↑) | +2 % |
- Training efficiency: Pixio reaches comparable performance to DINOv3 with ~15 % fewer training epochs.
- Stability: loss curves are smoother, and the model is less sensitive to mask ratio variations.
- Generalization: The same encoder works well across tasks that differ dramatically in output space (continuous depth vs. discrete segmentation), confirming the versatility of pixel‑level pre‑training.
Practical Implications
- Plug‑and‑play visual backbone: Developers can adopt Pixio’s encoder as a drop‑in feature extractor for any vision‑centric product, from AR depth sensing to autonomous‑driving perception stacks.
- Reduced data‑labeling costs: Since the pre‑training data is self‑curated, companies can scale visual SSL without investing in large annotation pipelines.
- Edge‑friendly deployment: The decoder is discarded after pre‑training; only the encoder (a ViT) is needed at inference, keeping runtime overhead modest.
- Complementary to latent‑space SSL: Teams can ensemble pixel‑based and latent‑based representations to boost robustness, especially in scenarios where fine‑grained texture matters (e.g., medical imaging, robotics).
- Accelerated prototyping: The simplicity of the MAE‑style objective means new domains (satellite imagery, industrial inspection) can be pre‑trained quickly by swapping in domain‑specific web crawls.
Limitations & Future Work
- Compute intensity: Training on 2 B images still requires massive GPU clusters, which may be out of reach for most labs.
- Masking bias: The high mask ratio works well for natural images but may degrade on domains with sparse structures (e.g., line drawings).
- Decoder unused at inference: While the decoder helps learning, its parameters are discarded, potentially leaving useful reconstruction knowledge untapped.
- Future directions suggested by the authors include:
  - Exploring adaptive masking strategies that focus on informative regions.
  - Jointly training with latent‑space objectives to combine the strengths of both paradigms.
  - Extending the self‑curation pipeline to multimodal data (e.g., video, depth sensors) for richer pre‑training signals.
Authors
- Lihe Yang
- Shang‑Wen Li
- Yang Li
- Xinjie Lei
- Dong Wang
- Abdelrahman Mohamed
- Hengshuang Zhao
- Hu Xu
Paper Information
- arXiv ID: 2512.15715v1
- Categories: cs.CV
- Published: December 17, 2025