[Paper] Next-Embedding Prediction Makes Strong Vision Learners

Published: December 18, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.16922v1

Overview

The paper introduces Next‑Embedding Predictive Autoregression (NEPA), a self‑supervised pre‑training recipe for vision models that mirrors the generative‑pretraining paradigm that has transformed NLP. Instead of forcing a network to reconstruct pixels or learn contrastive features, NEPA trains a Vision Transformer (ViT) to predict the embedding of the next image patch given the embeddings of previous patches. This single, clean objective yields strong ImageNet accuracy and transfer performance, competitive with state‑of‑the‑art self‑supervised methods, without extra tokenizers, reconstruction heads, or contrastive tricks.
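In symbols, the training objective amounts to a mean‑squared error between each predicted embedding and a stop‑gradient target. The notation below is a plausible formalization of the paper's description, not taken verbatim from it:

$$
\mathcal{L}_{\text{NEPA}} \;=\; \frac{1}{N-1}\sum_{t=2}^{N}\bigl\lVert\, g_\phi\bigl(f_\theta(e_1,\dots,e_{t-1})\bigr) \;-\; \operatorname{sg}(\bar e_t)\,\bigr\rVert_2^2
$$

where $e_t$ is the embedding of the $t$-th patch, $f_\theta$ is the causally masked ViT, $g_\phi$ is the lightweight prediction head, $\bar e_t$ is the target embedding of patch $t$ from a frozen (or momentum) copy of the encoder, and $\operatorname{sg}(\cdot)$ denotes the stop‑gradient.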

Key Contributions

  • Embedding‑level generative pre‑training: Proposes predicting future patch embeddings (rather than pixels) as a universal self‑supervised task for vision.
  • Simple, architecture‑agnostic pipeline: Uses a vanilla ViT backbone with causal masking and a stop‑gradient trick; no discrete tokenizers, reconstruction decoders, or contrastive pairs required.
  • Strong empirical results: Achieves 83.8 % (ViT‑B) and 85.3 % (ViT‑L) top‑1 accuracy on ImageNet‑1K after fine‑tuning, matching or surpassing many contemporary SSL methods.
  • Robust transferability: Demonstrates competitive semantic‑segmentation performance on ADE20K, indicating that the learned embeddings capture high‑level semantics.
  • Scalability & modality‑agnostic promise: Shows that the same next‑embedding prediction formulation could be applied to other modalities (e.g., video, audio) with minimal changes.

Methodology

  1. Patch Embedding Extraction – An input image is split into a sequence of non‑overlapping patches (e.g., 16×16 pixels). Each patch is linearly projected into a fixed‑dimensional embedding, just like the standard ViT tokenization.
  2. Causal Masking – The Transformer processes the patch sequence autoregressively: the representation used to predict patch t may attend only to the embeddings of patches 1 through t‑1. This enforces a “predict‑the‑future” setup.
  3. Stop‑Gradient on Targets – The target embedding for step t is taken from a frozen copy of the same backbone (or a momentum encoder). Gradients do not flow into the target, preventing collapse and stabilizing training.
  4. Prediction Head – A lightweight linear layer maps the Transformer’s hidden state at position t‑1 to the predicted embedding for patch t.
  5. Loss – Simple mean‑squared error (MSE) between the predicted embedding and the stopped‑gradient target embedding. No reconstruction loss, contrastive pairs, or discrete token vocabularies are involved.
  6. Training Regime – The model is pretrained on ImageNet‑1K for a few hundred epochs using the NEPA objective alone, then fine‑tuned on downstream tasks (classification, segmentation) with standard supervised heads.

The entire pipeline fits into the familiar ViT training loop, making it easy to drop into existing codebases.
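As a concrete illustration, the PyTorch‑style sketch below shows how such a training step might look given the description above. Module names (NepaEncoder, patch_embed, pred_head, target_embed) and hyper‑parameters are illustrative assumptions, not the authors' released code:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class NepaEncoder(nn.Module):
    """Vanilla ViT-style Transformer with causal (autoregressive) self-attention.
    Positional embeddings are omitted for brevity."""
    def __init__(self, dim=768, depth=12, heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                          # x: (B, N, dim) patch embeddings
        n = x.size(1)
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), 1)
        return self.blocks(x, mask=causal)         # position t attends to patches <= t

patch_embed  = nn.Conv2d(3, 768, kernel_size=16, stride=16)  # 16x16 patch projection
encoder      = NepaEncoder()
pred_head    = nn.Linear(768, 768)                # lightweight prediction head
target_embed = copy.deepcopy(patch_embed)         # frozen target copy
for p in target_embed.parameters():
    p.requires_grad_(False)

def nepa_loss(images):
    # 1) patchify and embed: (B, 3, H, W) -> (B, N, dim)
    x = patch_embed(images).flatten(2).transpose(1, 2)
    # 2) causal encoding: the state at position t-1 summarizes patches 1..t-1
    h = encoder(x)
    # 3) predict the embedding of the *next* patch from each prefix
    pred = pred_head(h[:, :-1])                    # predictions for patches 2..N
    # 4) stop-gradient targets (here: a frozen copy of the patch embedder;
    #    the paper describes a frozen copy of the backbone or a momentum encoder)
    with torch.no_grad():
        tgt = target_embed(images).flatten(2).transpose(1, 2)[:, 1:]
    # 5) plain MSE between predictions and targets -- no other loss terms
    return F.mse_loss(pred, tgt)
```

If a momentum target is preferred over a frozen copy, target_embed would instead be updated as an exponential moving average of patch_embed after each optimizer step.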

Results & Findings

| Model (Backbone) | Pre‑training (NEPA) | ImageNet‑1K Top‑1 (Fine‑tuned) | ADE20K mIoU (Segmentation) |
| --- | --- | --- | --- |
| ViT‑B/16 | 300 epochs | 83.8 % | 48.2 % |
| ViT‑L/16 | 300 epochs | 85.3 % | 50.1 % |

  • Competitive accuracy: Comparable to state‑of‑the‑art SSL methods (e.g., MAE, DINO) despite using a single loss term.
  • Training efficiency: Because the loss operates on low‑dimensional embeddings, memory and compute footprints are lower than pixel‑reconstruction methods.
  • Representation quality: Linear probing (training only a classifier on frozen features) reaches >70 % top‑1, indicating that the embeddings already encode discriminative information (a minimal probing sketch follows this list).
  • Ablation studies confirm that causal masking and stop‑gradient are essential; removing either drops accuracy by ~2–3 %.
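For context on the linear‑probing number: the protocol freezes the pretrained backbone and trains only a linear classifier on top of its features. A minimal sketch follows, reusing patch_embed and encoder from the training sketch above; the mean‑pooling and probe setup are assumptions, not the paper's exact protocol.

```python
import torch
import torch.nn as nn

# Freeze the pretrained NEPA backbone; only the linear probe is trained.
for p in list(patch_embed.parameters()) + list(encoder.parameters()):
    p.requires_grad_(False)

probe = nn.Linear(768, 1000)                      # 1000 ImageNet-1K classes
optim = torch.optim.AdamW(probe.parameters(), lr=1e-3)

def probe_step(images, labels):
    with torch.no_grad():                          # frozen features
        tokens = encoder(patch_embed(images).flatten(2).transpose(1, 2))
        feats = tokens.mean(dim=1)                 # mean-pool patch embeddings (assumed)
    loss = nn.functional.cross_entropy(probe(feats), labels)
    optim.zero_grad(); loss.backward(); optim.step()
    return loss.item()
```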

Practical Implications

  • Simplified pipelines: Teams can replace complex multi‑loss SSL recipes with a single NEPA pre‑training step, reducing engineering overhead.
  • Faster pre‑training: Lower memory usage enables training larger ViTs on commodity GPUs or scaling to larger datasets without prohibitive cost.
  • Modality‑agnostic extension: Since the objective works on embeddings, the same code can be reused for video frames, audio spectrogram patches, or multimodal token streams, opening doors to unified foundation models.
  • Better downstream fine‑tuning: The embeddings already capture semantic structure, so downstream developers may need fewer fine‑tuning epochs to reach production‑grade performance.
  • Potential for on‑device learning: Because the prediction head is lightweight and the loss is MSE on embeddings, NEPA could be adapted for continual learning scenarios on edge devices.

Limitations & Future Work

  • Dependence on a frozen target encoder: The stop‑gradient target must be a stable copy of the model (or a momentum encoder), which adds a small bookkeeping cost and may limit fully online learning.
  • Evaluation limited to image classification & segmentation: While results are promising, broader benchmarks (object detection, video action recognition, cross‑modal retrieval) remain to be explored.
  • Scaling to extremely large datasets: The paper pre‑trains on ImageNet‑1K; it is unclear how NEPA behaves on web‑scale corpora where token diversity and long‑range dependencies are higher.
  • Potential modality‑specific tweaks: For non‑visual data, the optimal patch size, embedding dimension, and masking strategy may differ; future work should systematically study these hyper‑parameters.

Overall, NEPA offers a clean, effective alternative to the current zoo of self‑supervised vision methods, and its simplicity makes it an attractive building block for next‑generation visual AI systems.

Authors

  • Sihan Xu
  • Ziqiao Ma
  • Wenhao Chai
  • Xuweiyi Chen
  • Weiyang Jin
  • Joyce Chai
  • Saining Xie
  • Stella X. Yu

Paper Information

  • arXiv ID: 2512.16922v1
  • Categories: cs.CV
  • Published: December 18, 2025