[Paper] In Pursuit of Pixel Supervision for Visual Pre-training

Published: December 17, 2025 at 01:59 PM EST
4 min read

Source: arXiv - 2512.15715v1

Overview

The paper revisits pixel‑level self‑supervised learning with a new masked autoencoder called Pixio. By scaling up to 2 billion web‑crawled images and tightening the pre‑training task, the authors show that classic autoencoding can still rival (or beat) modern latent‑space methods on a variety of vision tasks—from depth estimation to robot learning.

Key Contributions

  • Pixio architecture: an enhanced Masked Autoencoder (MAE) that uses stronger encoders/decoders and more demanding reconstruction targets.
  • Massive, minimally curated dataset: 2 B images harvested from the web with an automated self‑curation pipeline, removing the need for costly human labeling.
  • Competitive downstream performance: matches or exceeds DINOv3 on monocular depth estimation (e.g., as the backbone for Depth Anything), feed‑forward 3D reconstruction (MapAnything), semantic segmentation, and robot skill learning.
  • Demonstration of pixel‑space SSL viability: provides empirical evidence that pixel‑level reconstruction remains a practical alternative to latent‑space contrastive or clustering methods.
  • Efficient and stable training: retains the simplicity of MAE (mask‑and‑reconstruct) while improving robustness and speed.

Methodology

  1. Data collection & self‑curation

    • Scrape 2 B images from publicly available web sources.
    • Apply automatic quality filters (blur detection, duplicate removal, basic content heuristics) to keep only “clean” samples without manual labeling (a minimal filtering sketch follows this list).
  2. Masked Autoencoding with harder tasks

    • Randomly mask a high proportion (≈ 75 %) of image patches.
    • Instead of reconstructing raw RGB values, the decoder predicts enhanced targets: multi‑scale features, edge maps, and color‑augmented versions, forcing the model to capture richer structure.
  3. Model design

    • Encoder: a Vision Transformer (ViT‑L/14) with additional feed‑forward capacity and relative positional embeddings.
    • Decoder: a lightweight transformer that takes the encoded visible tokens together with learned mask tokens and upsamples the result to full resolution.
    • Training uses the standard MAE loss (L2 in pixel space) combined with auxiliary perceptual losses to encourage semantic fidelity (a minimal mask‑and‑reconstruct sketch follows this list).
  4. Training regime

    • Distributed training on thousands of GPUs for ~30 epochs over the 2 B image corpus.
    • Minimal hyper‑parameter tuning; the authors emphasize the stability of the pipeline across scales.
  5. Evaluation

    • Freeze the encoder and fine‑tune lightweight heads on downstream benchmarks (depth, segmentation, 3D reconstruction, robot policy learning).
    • Compare against state‑of‑the‑art latent‑space SSL models (e.g., DINOv3) trained on comparable data volumes.
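
The methodology summary mentions automatic quality filters but not the exact pipeline. The sketch below is a rough illustration only (not the authors' implementation) of such a filtering pass, using two common heuristics: variance‑of‑Laplacian blur scoring and perceptual‑hash deduplication. The threshold value and helper names are hypothetical.

```python
# Illustrative self-curation pass (NOT the authors' pipeline): keep images that
# are sharp enough and not near-duplicates. Threshold and names are hypothetical.
import cv2
import imagehash
from PIL import Image

BLUR_THRESHOLD = 100.0   # variance of the Laplacian; lower = blurrier (hypothetical value)
seen_hashes = set()

def is_sharp(path: str) -> bool:
    """Score sharpness with the variance-of-Laplacian blur heuristic."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        return False                     # unreadable file -> drop
    return cv2.Laplacian(gray, cv2.CV_64F).var() >= BLUR_THRESHOLD

def is_new(path: str) -> bool:
    """Drop exact and near-duplicates via a perceptual hash."""
    h = imagehash.phash(Image.open(path))
    if h in seen_hashes:
        return False
    seen_hashes.add(h)
    return True

def keep(path: str) -> bool:
    """Retain an image only if it passes both filters."""
    return is_sharp(path) and is_new(path)
```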
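
The mask‑and‑reconstruct objective itself is compact. Below is a generic MAE‑style training step in PyTorch, shown as a minimal sketch: it assumes pre‑patchified inputs and placeholder `encoder`/`decoder` modules whose dimensions line up, and it reconstructs raw pixels with an L2 loss on masked patches. It is not Pixio's actual implementation, which adds the stronger components and enhanced targets described above.

```python
# Generic MAE-style mask-and-reconstruct step (sketch, not Pixio's code).
# The encoder sees only visible patches; the decoder fills in learned mask
# tokens and predicts pixels; the L2 loss covers masked patches only.
import torch

def random_masking(patches, mask_ratio=0.75):
    """patches: (B, N, D). Keep a random subset; return kept patches,
    their indices, and a binary mask (1 = masked, 0 = visible)."""
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=patches.device)
    ids_keep = noise.argsort(dim=1)[:, :n_keep]          # random subset per image
    kept = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=patches.device)
    mask.scatter_(1, ids_keep, 0.0)
    return kept, ids_keep, mask

def mae_step(patches, encoder, decoder, mask_token, mask_ratio=0.75):
    """One reconstruction step. `encoder`, `decoder` (nn.Modules) and
    `mask_token` (shape (1, 1, dim)) are placeholders; their output
    dimensions are assumed to match the patch dimension."""
    B, N, _ = patches.shape
    kept, ids_keep, mask = random_masking(patches, mask_ratio)
    latent = encoder(kept)                               # encode visible patches only
    dim = latent.size(-1)
    full = mask_token.expand(B, N, dim).clone()          # mask tokens everywhere...
    full.scatter_(1, ids_keep.unsqueeze(-1).expand(-1, -1, dim), latent)  # ...then re-insert visibles
    pred = decoder(full)                                 # (B, N, D) pixel predictions
    loss = ((pred - patches) ** 2).mean(dim=-1)          # per-patch L2
    return (loss * mask).sum() / mask.sum()              # average over masked patches only
```

At the reported ≈ 75 % mask ratio the encoder only processes a quarter of the patches per image, which is a large part of why MAE‑style pre‑training remains cheap to scale.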

Results & Findings

| Downstream task | Metric | Pixio vs. DINOv3 |
| --- | --- | --- |
| Monocular depth (NYU‑Depth V2) | δ1 ≈ 0.92 (higher is better) | +1.3 % |
| Semantic segmentation (ADE20K) | mIoU ≈ 53.4 % (higher is better) | +0.8 % |
| Feed‑forward 3D reconstruction (MapAnything) | Chamfer‑L2 (lower is better) | ~5 % lower error |
| Robot skill transfer (sim‑to‑real) | Success rate (higher is better) | +2 % |

  • Training efficiency: Pixio reaches comparable performance to DINOv3 with ~15 % fewer training epochs.
  • Stability: loss curves are smoother, and the model is less sensitive to mask ratio variations.
  • Generalization: The same encoder works well across tasks that differ dramatically in output space (continuous depth vs. discrete segmentation), confirming the versatility of pixel‑level pre‑training.

Practical Implications

  • Plug‑and‑play visual backbone: Developers can adopt Pixio’s encoder as a drop‑in feature extractor for any vision‑centric product, from AR depth sensing to autonomous‑driving perception stacks (a minimal frozen‑backbone sketch follows this list).
  • Reduced data‑labeling costs: Since the pre‑training data is self‑curated, companies can scale visual SSL without investing in large annotation pipelines.
  • Edge‑friendly deployment: The decoder is discarded after pre‑training; only the encoder (a ViT) is needed at inference, keeping runtime overhead modest.
  • Complementary to latent‑space SSL: Teams can ensemble pixel‑based and latent‑based representations to boost robustness, especially in scenarios where fine‑grained texture matters (e.g., medical imaging, robotics).
  • Accelerated prototyping: The simplicity of the MAE‑style objective means new domains (satellite imagery, industrial inspection) can be pre‑trained quickly by swapping in domain‑specific web crawls.
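
To make the “drop‑in feature extractor” point concrete, the sketch below freezes a pre‑trained ViT encoder and trains only a lightweight per‑patch head on top. `load_pixio_encoder` is a hypothetical loader (the post does not describe a public API), the feature dimension assumes a ViT‑L backbone, and a real segmentation head would also upsample predictions to pixel resolution.

```python
# Frozen-backbone usage sketch: train a small head on top of frozen features.
# `load_pixio_encoder` is hypothetical; no public API is implied by the paper.
import torch
import torch.nn as nn

class LinearPatchHead(nn.Module):
    """Tiny head mapping per-patch features to per-patch class logits."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.proj = nn.Linear(dim, num_classes)

    def forward(self, patch_tokens):           # (B, N, dim)
        return self.proj(patch_tokens)         # (B, N, num_classes)

encoder = load_pixio_encoder()                 # hypothetical loader; returns patch tokens
encoder.eval()
for p in encoder.parameters():
    p.requires_grad_(False)                    # freeze the backbone

head = LinearPatchHead(dim=1024, num_classes=150)   # ViT-L width; ADE20K has 150 classes
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

def train_step(images, patch_labels):
    """images: (B, 3, H, W); patch_labels: (B, N) integer class per patch."""
    with torch.no_grad():
        tokens = encoder(images)               # (B, N, dim) frozen features
    logits = head(tokens)
    loss = nn.functional.cross_entropy(logits.flatten(0, 1), patch_labels.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```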

Limitations & Future Work

  • Compute intensity: Training on 2 B images still requires massive GPU clusters, which may be out of reach for most labs.
  • Masking bias: The high mask ratio works well for natural images but may degrade on domains with sparse structures (e.g., line drawings).
  • Decoder unused at inference: While the decoder helps learning, its parameters are discarded, potentially leaving useful reconstruction knowledge untapped.
  • Future directions suggested by the authors include:
    • Exploring adaptive masking strategies that focus on informative regions (one illustrative possibility is sketched after this list).
    • Jointly training with latent‑space objectives to combine the strengths of both paradigms.
    • Extending the self‑curation pipeline to multimodal data (e.g., video, depth sensors) for richer pre‑training signals.
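
As one way to picture the adaptive‑masking direction, the sketch below (our illustration, not a method from the paper) ranks patches by pixel variance and masks the highest‑variance ones first, so reconstruction concentrates on informative regions.

```python
# Illustrative adaptive masking (not from the paper): mask the highest-variance
# patches first instead of sampling the mask uniformly at random.
import torch

def variance_biased_mask(patches, mask_ratio=0.75):
    """patches: (B, N, D) flattened pixel patches.
    Returns a (B, N) mask where 1 = masked, 0 = visible."""
    B, N, _ = patches.shape
    n_mask = int(N * mask_ratio)
    scores = patches.var(dim=-1)                         # per-patch pixel variance
    ids_mask = scores.argsort(dim=1, descending=True)[:, :n_mask]
    mask = torch.zeros(B, N, device=patches.device)
    mask.scatter_(1, ids_mask, 1.0)
    return mask
```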

Authors

  • Lihe Yang
  • Shang‑Wen Li
  • Yang Li
  • Xinjie Lei
  • Dong Wang
  • Abdelrahman Mohamed
  • Hengshuang Zhao
  • Hu Xu

Paper Information

  • arXiv ID: 2512.15715v1
  • Categories: cs.CV
  • Published: December 17, 2025