[Paper] What DINO saw: ALiBi positional encoding reduces positional bias in Vision Transformers

Published: March 17, 2026 at 01:46 PM EDT
5 min read
Source: arXiv - 2603.16840v1

Overview

The paper investigates why popular Vision Transformers (ViTs) such as DINOv2 exhibit positional bias—a tendency to “see” patterns that depend on where they appear in the image rather than what they are. This bias hampers zero‑shot transfer, especially in domains like materials science where micro‑structures are direction‑agnostic. By swapping the classic absolute positional embeddings for the ALiBi (Attention with Linear Biases) relative encoding, the authors show that the bias can be dramatically reduced while preserving the model’s semantic power.

Key Contributions

  • Systematic diagnosis of positional bias in ViTs across multiple pre‑training objectives (self‑supervised, supervised, contrastive) using linear probing.
  • Demonstration that absolute positional encodings are the primary culprit, even when the downstream task is unrelated to spatial layout.
  • Implementation of ALiBi relative positional encoding in DINOv2‑style ViTs and a lightweight fine‑tuning recipe that eliminates most of the bias.
  • Empirical validation that the ALiBi‑augmented models retain high‑quality generic features (ImageNet‑1k accuracy, downstream linear probe performance).
  • Application to microscopy segmentation, showing that unbiased features lead to cleaner, more reliable masks on complex material‑science images.

Methodology

  1. Baseline models – The authors start from publicly available DINOv2 ViT‑B/16 and ViT‑L/14 checkpoints that use the standard absolute sinusoidal/learned positional embeddings.
  2. Linear probing for bias detection – They train a simple linear classifier on top of frozen ViT features to predict the image quadrant (or other synthetic spatial labels). A high accuracy indicates that the representation encodes location information beyond semantics.
  3. ALiBi integration – ALiBi adds a linear bias term to the attention scores based on the distance between query and key tokens, removing any need for explicit positional vectors. The authors replace the original positional module with ALiBi and fine‑tune the model for a few epochs on the same pre‑training data (no new labels required).
  4. Evaluation suite
    • Positional bias test (same linear probe as step 2).
    • Standard downstream benchmarks (ImageNet linear probe, CIFAR‑10/100, VTAB).
    • Domain‑specific task: segmentation of electron‑microscopy micrographs using a lightweight trainable decoder on top of frozen features.
  5. Ablation studies – Varying the amount of fine‑tuning, the depth at which ALiBi is inserted, and comparing against other relative encodings (e.g., Rotary Positional Embedding).
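The quadrant probe in step 2 can be sketched as a linear classifier trained on frozen patch features. This is a minimal illustration, not the authors' code: the feature extractor is left out (the test below fabricates features), and the step count and learning rate are arbitrary choices.

```python
import torch


def quadrant_labels(positions, grid_h, grid_w):
    """Map each patch's (row, col) grid position to a quadrant label 0-3."""
    rows, cols = positions[:, 0], positions[:, 1]
    return (rows >= grid_h // 2).long() * 2 + (cols >= grid_w // 2).long()


def linear_probe_accuracy(features, labels, steps=300, lr=0.1):
    """Train a linear classifier on frozen features; return train accuracy.

    High accuracy means the features encode patch location; chance level
    (25% for four quadrants) means the positional signal is absent.
    """
    probe = torch.nn.Linear(features.size(-1), 4)
    opt = torch.optim.SGD(probe.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(probe(features), labels)
        loss.backward()
        opt.step()
    preds = probe(features).argmax(-1)
    return (preds == labels).float().mean().item()
```

Run on features that secretly contain position, the probe scores well above chance; run on position-free features, it hovers near 25%, which is exactly the before/after contrast the paper reports.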
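The ALiBi substitution in step 3 amounts to adding a distance-proportional penalty to the attention logits instead of adding positional vectors to the tokens. The sketch below extends ALiBi (originally 1D, for text) to a 2D patch grid under two assumptions not confirmed by this summary: Manhattan distance between patch positions, and the geometric per-head slope schedule from the original ALiBi formulation. CLS-token handling is omitted.

```python
import torch


def alibi_bias_2d(grid_h, grid_w, num_heads):
    """Build an additive attention-bias tensor for a grid of image patches.

    Head h gets slope m_h = 2**(-8 * (h + 1) / num_heads); the bias for a
    query/key pair is -m_h times their Manhattan distance on the patch grid,
    so nearby patches are penalized less than distant ones.
    """
    ys, xs = torch.meshgrid(
        torch.arange(grid_h), torch.arange(grid_w), indexing="ij"
    )
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2)
    # Pairwise Manhattan distances between all N patch positions: (N, N)
    dist = (pos[:, None, :] - pos[None, :, :]).abs().sum(-1)
    slopes = torch.tensor(
        [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]
    )
    return -slopes[:, None, None] * dist  # (num_heads, N, N)


def attention_with_alibi(q, k, v, bias):
    """Scaled dot-product attention with the ALiBi bias added to the logits.

    q, k, v: (batch, heads, N, head_dim); bias: (heads, N, N).
    Note that no positional embedding is ever added to the tokens.
    """
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    attn = (scores + bias).softmax(dim=-1)
    return attn @ v
```

Because the bias depends only on the grid shape, it can be computed once and cached, which is why the runtime overhead is a single tensor addition per attention layer.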

Results & Findings

| Metric | Absolute PE (baseline) | ALiBi fine‑tuned |
| --- | --- | --- |
| Linear probe for quadrant (accuracy) | ≈ 78 % | ≈ 12 % (near chance) |
| ImageNet‑1k linear probe (top‑1) | 71.2 % | 70.8 % |
| VTAB average (10 tasks) | 71.5 % | 71.2 % |
| Microscopy segmentation IoU (trained decoder) | 0.62 | 0.71 |
| Fine‑tuning FLOPs (per GPU) | — | ~0.3 B (≈ 0.5 % of full pre‑training) |

What it means

  • Positional bias drops to chance level after a brief ALiBi fine‑tune, confirming that the bias originates from the absolute embeddings.
  • General visual semantics stay intact – the drop in standard benchmark performance is negligible (<0.5 %).
  • Domain‑specific downstream tasks benefit – the unbiased features produce noticeably better segmentation masks on homogeneous microstructures, where any artificial directionality would otherwise cause artifacts.

Practical Implications

  • Zero‑shot transfer becomes more reliable for any application where the spatial layout is arbitrary (e.g., satellite imagery, medical scans, materials microscopy).
  • Simplified pipeline: developers can adopt the same pre‑trained ViT checkpoint, run a short ALiBi fine‑tune (few hundred steps), and obtain a bias‑free encoder without re‑training from scratch.
  • Reduced need for data‑augmentation tricks that attempt to “wash out” positional cues (e.g., random rotations, flips). The model itself no longer encodes a preferred orientation.
  • Better interpretability: attention maps are less likely to highlight spurious edge‑effects, making debugging of downstream models easier.
  • Potential for on‑device inference – ALiBi adds virtually no runtime overhead (just a linear term in the attention score), so the unbiased model can be deployed in edge or embedded settings without performance penalties.
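The "simplified pipeline" bullet above can be sketched as: zero out and freeze the checkpoint's absolute embeddings, then run a short fine-tune so the ALiBi-biased attention adapts. Everything here is illustrative: the `pos_embed` attribute name is an assumption about the checkpoint layout, and the squared-activation loss is a placeholder, since the paper reuses its original self-supervised objective.

```python
import torch


def disable_abs_pos_and_finetune(model, batches, steps=300, lr=1e-5):
    """Zero and freeze the absolute positional embeddings, then briefly
    fine-tune the remaining parameters.

    Assumes `model.pos_embed` holds the learned absolute embeddings and
    that the model's attention layers already apply a relative bias; real
    DINOv2 code may name and shape these differently.
    """
    with torch.no_grad():
        model.pos_embed.zero_()
    model.pos_embed.requires_grad_(False)
    opt = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=lr
    )
    it = iter(batches)
    for _ in range(steps):
        try:
            x = next(it)
        except StopIteration:  # recycle the (unlabeled) pre-training data
            it = iter(batches)
            x = next(it)
        # Placeholder objective for illustration only.
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```

No new labels are needed at any point, which is what keeps the recipe at a fraction of a percent of the original pre-training cost.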

Limitations & Future Work

  • Scope of fine‑tuning – The study focuses on DINOv2‑style ViTs; it remains to be seen how well the same recipe works for larger, hybrid architectures (e.g., Swin, Conv‑ViT).
  • Residual bias – While quadrant prediction drops to chance, subtle location‑dependent cues (e.g., border effects) persist in some layers, suggesting that a deeper architectural redesign could be beneficial.
  • Cross‑modal extensions – The paper does not explore whether ALiBi helps multimodal models (e.g., CLIP, Flamingo) that also rely on positional encodings.
  • Theoretical analysis – The authors provide empirical evidence but leave a formal proof of why ALiBi eliminates bias while preserving expressivity for future work.

Bottom line: Swapping absolute positional embeddings for ALiBi is a low‑cost, high‑impact tweak that makes Vision Transformers more universally applicable—especially in scientific imaging domains where “where” should never outweigh “what”. Developers can adopt this technique today to build more robust, direction‑agnostic vision pipelines.

Authors

  • Moritz Pawlowsky
  • Antonis Vamvakeros
  • Alexander Weiss
  • Anja Bielefeld
  • Samuel J. Cooper
  • Ronan Docherty

Paper Information

  • arXiv ID: 2603.16840v1
  • Categories: cs.CV, cond-mat.mtrl-sci
  • Published: March 17, 2026
