[Paper] Next-Embedding Prediction Makes Strong Vision Learners

Published: December 18, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.16922v1

Overview

The paper introduces Next‑Embedding Predictive Autoregression (NEPA), a self‑supervised pre‑training recipe for vision models that mirrors the generative‑pretraining paradigm that has transformed NLP. Instead of forcing a network to reconstruct pixels or learn contrastive features, NEPA trains a Vision Transformer (ViT) to predict the embedding of the next image patch given the embeddings of previous patches. This single, clean objective yields strong ImageNet accuracy and transfer performance, competitive with state‑of‑the‑art self‑supervised methods, without extra tokenizers, reconstruction heads, or contrastive tricks.
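In symbols, the training objective amounts to a mean‑squared error between each predicted embedding and a stop‑gradient target. The notation below is a plausible formalization of the paper's description, not taken verbatim from it:

$$
\mathcal{L}_{\text{NEPA}} \;=\; \frac{1}{N-1}\sum_{t=2}^{N}\bigl\lVert\, g_\phi\bigl(f_\theta(e_1,\dots,e_{t-1})\bigr) \;-\; \operatorname{sg}(\bar e_t)\,\bigr\rVert_2^2
$$

where $e_t$ is the embedding of the $t$-th patch, $f_\theta$ is the causally masked ViT, $g_\phi$ is the lightweight prediction head, $\bar e_t$ is the target embedding of patch $t$ from a frozen (or momentum) copy of the encoder, and $\operatorname{sg}(\cdot)$ denotes the stop‑gradient.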

Key Contributions

  • Embedding‑level generative pre‑training: Proposes predicting future patch embeddings (rather than pixels) as a universal self‑supervised task for vision.
  • Simple, architecture‑agnostic pipeline: Uses a vanilla ViT backbone with causal masking and a stop‑gradient trick; no discrete tokenizers, reconstruction decoders, or contrastive pairs required.
  • Strong empirical results: Achieves 83.8 % (ViT‑B) and 85.3 % (ViT‑L) top‑1 accuracy on ImageNet‑1K after fine‑tuning, matching or surpassing many contemporary SSL methods.
  • Robust transferability: Demonstrates competitive semantic‑segmentation performance on ADE20K, indicating that the learned embeddings capture high‑level semantics.
  • Scalability & modality‑agnostic promise: Shows that the same next‑embedding prediction formulation could be applied to other modalities (e.g., video, audio) with minimal changes.

Methodology

  1. Patch Embedding Extraction – An input image is split into a sequence of non‑overlapping patches (e.g., 16×16 pixels). Each patch is linearly projected into a fixed‑dimensional embedding, just like the standard ViT tokenization.
  2. Causal Masking – The Transformer processes the patch sequence autoregressively: the representation used to predict patch t may attend only to the embeddings of patches 1 through t‑1. This enforces a “predict‑the‑future” setup.
  3. Stop‑Gradient on Targets – The target embedding for step t is taken from a frozen copy of the same backbone (or a momentum encoder). Gradients do not flow into the target, preventing collapse and stabilizing training.
  4. Prediction Head – A lightweight linear layer maps the Transformer’s hidden state at position t‑1 to the predicted embedding for patch t.
  5. Loss – Simple mean‑squared error (MSE) between the predicted embedding and the stopped‑gradient target embedding. No reconstruction loss, contrastive pairs, or discrete token vocabularies are involved.
  6. Training Regime – The model is pretrained on ImageNet‑1K for a few hundred epochs using the NEPA objective alone, then fine‑tuned on downstream tasks (classification, segmentation) with standard supervised heads.

The entire pipeline fits into the familiar ViT training loop, making it easy to drop into existing codebases.
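As a concrete illustration, the PyTorch‑style sketch below shows how such a training step might look given the description above. Module names (NepaEncoder, patch_embed, pred_head, target_embed) and hyper‑parameters are illustrative assumptions, not the authors' released code:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class NepaEncoder(nn.Module):
    """Vanilla ViT-style Transformer with causal (autoregressive) self-attention.
    Positional embeddings are omitted for brevity."""
    def __init__(self, dim=768, depth=12, heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                          # x: (B, N, dim) patch embeddings
        n = x.size(1)
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), 1)
        return self.blocks(x, mask=causal)         # position t attends to patches <= t

patch_embed  = nn.Conv2d(3, 768, kernel_size=16, stride=16)  # 16x16 patch projection
encoder      = NepaEncoder()
pred_head    = nn.Linear(768, 768)                # lightweight prediction head
target_embed = copy.deepcopy(patch_embed)         # frozen target copy
for p in target_embed.parameters():
    p.requires_grad_(False)

def nepa_loss(images):
    # 1) patchify and embed: (B, 3, H, W) -> (B, N, dim)
    x = patch_embed(images).flatten(2).transpose(1, 2)
    # 2) causal encoding: the state at position t-1 summarizes patches 1..t-1
    h = encoder(x)
    # 3) predict the embedding of the *next* patch from each prefix
    pred = pred_head(h[:, :-1])                    # predictions for patches 2..N
    # 4) stop-gradient targets (here: a frozen copy of the patch embedder;
    #    the paper describes a frozen copy of the backbone or a momentum encoder)
    with torch.no_grad():
        tgt = target_embed(images).flatten(2).transpose(1, 2)[:, 1:]
    # 5) plain MSE between predictions and targets -- no other loss terms
    return F.mse_loss(pred, tgt)
```

If a momentum target is preferred over a frozen copy, target_embed would instead be updated as an exponential moving average of patch_embed after each optimizer step.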

Results & Findings

| Model (Backbone) | Pre‑training (NEPA) | ImageNet‑1K Top‑1 (Fine‑tuned) | ADE20K mIoU (Segmentation) |
| --- | --- | --- | --- |
| ViT‑B/16 | 300 epochs | 83.8 % | 48.2 % |
| ViT‑L/16 | 300 epochs | 85.3 % | 50.1 % |

  • Competitive accuracy: Comparable to state‑of‑the‑art SSL methods (e.g., MAE, DINO) despite using a single loss term.
  • Training efficiency: Because the loss operates on low‑dimensional embeddings, memory and compute footprints are lower than pixel‑reconstruction methods.
  • Representation quality: Linear probing (training only a classifier on frozen features) reaches >70 % top‑1, indicating that the embeddings already encode discriminative information (a minimal probing sketch follows this list).
  • Ablation studies confirm that causal masking and stop‑gradient are essential; removing either drops accuracy by ~2–3 %.
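For context on the linear‑probing number: the protocol freezes the pretrained backbone and trains only a linear classifier on top of its features. A minimal sketch follows, reusing patch_embed and encoder from the training sketch above; the mean‑pooling and probe setup are assumptions, not the paper's exact protocol.

```python
import torch
import torch.nn as nn

# Freeze the pretrained NEPA backbone; only the linear probe is trained.
for p in list(patch_embed.parameters()) + list(encoder.parameters()):
    p.requires_grad_(False)

probe = nn.Linear(768, 1000)                      # 1000 ImageNet-1K classes
optim = torch.optim.AdamW(probe.parameters(), lr=1e-3)

def probe_step(images, labels):
    with torch.no_grad():                          # frozen features
        tokens = encoder(patch_embed(images).flatten(2).transpose(1, 2))
        feats = tokens.mean(dim=1)                 # mean-pool patch embeddings (assumed)
    loss = nn.functional.cross_entropy(probe(feats), labels)
    optim.zero_grad(); loss.backward(); optim.step()
    return loss.item()
```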

Practical Implications

  • Simplified pipelines: Teams can replace complex multi‑loss SSL recipes with a single NEPA pre‑training step, reducing engineering overhead.
  • Faster pre‑training: Lower memory usage enables training larger ViTs on commodity GPUs or scaling to larger datasets without prohibitive cost.
  • Modality‑agnostic extension: Since the objective works on embeddings, the same code can be reused for video frames, audio spectrogram patches, or multimodal token streams, opening doors to unified foundation models.
  • Better downstream fine‑tuning: The embeddings already capture semantic structure, so downstream developers may need fewer fine‑tuning epochs to reach production‑grade performance.
  • Potential for on‑device learning: Because the prediction head is lightweight and the loss is MSE on embeddings, NEPA could be adapted for continual learning scenarios on edge devices.

Limitations & Future Work

  • Dependence on a frozen target encoder: The stop‑gradient target must be a stable copy of the model (or a momentum encoder), which adds a small bookkeeping cost and may limit fully online learning.
  • Evaluation limited to image classification & segmentation: While results are promising, broader benchmarks (object detection, video action recognition, cross‑modal retrieval) remain to be explored.
  • Scaling to extremely large datasets: The paper pre‑trains on ImageNet‑1K; it is unclear how NEPA behaves on web‑scale corpora where token diversity and long‑range dependencies are higher.
  • Potential modality‑specific tweaks: For non‑visual data, the optimal patch size, embedding dimension, and masking strategy may differ; future work should systematically study these hyper‑parameters.

Overall, NEPA offers a clean, effective alternative to the current zoo of self‑supervised vision methods, and its simplicity makes it an attractive building block for next‑generation visual AI systems.

Authors

  • Sihan Xu
  • Ziqiao Ma
  • Wenhao Chai
  • Xuweiyi Chen
  • Weiyang Jin
  • Joyce Chai
  • Saining Xie
  • Stella X. Yu

Paper Information

  • arXiv ID: 2512.16922v1
  • Categories: cs.CV
  • Published: December 18, 2025