[Paper] Adapting Vision Transformers to Ultra-High Resolution Semantic Segmentation with Relay Tokens

Published: January 9, 2026
4 min read
Source: arXiv


Overview

The paper introduces Relay Tokens, a lightweight add‑on that lets Vision Transformers (ViTs) handle ultra‑high‑resolution (UHR) images for semantic segmentation without sacrificing either global context or fine‑grained detail. By processing the same image at two scales in parallel and exchanging information through a handful of learnable tokens, the authors achieve state‑of‑the‑art results on several demanding UHR benchmarks while adding less than 2 % extra parameters.

Key Contributions

  • Dual‑scale transformer architecture – runs a high‑resolution local branch and a low‑resolution global branch side‑by‑side.
  • Relay tokens – a small set of learnable vectors that shuttle feature information between the two branches, enabling explicit multi‑scale reasoning.
  • Backbone‑agnostic design – works with vanilla ViT, Swin‑Transformer, and other standard transformer encoders without architectural overhaul.
  • Parameter‑efficient – < 2 % increase in model size compared with the baseline transformer.
  • Strong empirical gains – up to 15 % relative mIoU improvement on ultra‑high‑resolution datasets (Archaeoscape, URUR, Gleason) and consistent boosts on the classic Cityscapes benchmark.
  • Open‑source release – code, pretrained weights, and a demo are publicly available, facilitating rapid adoption.

Methodology

  1. Two parallel processing streams

    • Local stream: The input image is split into many small, high‑resolution crops (e.g., 256 × 256). Each crop is fed to a transformer that preserves pixel‑level detail.
    • Global stream: The same image is downsampled to a much lower resolution (e.g., 1/8 of the original size) and processed as a single large crop, giving the model a holistic view of the scene.
  2. Relay tokens as bridges

    • A fixed number (typically 4–8) of learnable token vectors are appended to the token sequence of both streams.
    • After each transformer block, the local and global streams exchange the current values of these tokens. This lets the local branch inject fine‑grained cues into the global representation and vice‑versa, effectively performing multi‑scale feature fusion inside the transformer’s self‑attention mechanism.
  3. Aggregation & prediction

    • The global branch’s output is upsampled and merged with the locally processed patches.
    • A lightweight decoder (e.g., a 1×1 convolution) produces the final per‑pixel class logits.

Because the relay tokens are just a few extra vectors, the computational overhead is minimal, and the approach can be dropped into existing ViT‑based segmentation pipelines with a single line of code.
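The exchange pattern described in steps 1–2 can be illustrated with a minimal numpy sketch. This is not the paper's implementation: `toy_block` stands in for a real transformer block (self-attention plus MLP), and all dimensions, names, and weight initializations here are illustrative. What it does show is the core mechanic: each stream processes its patch tokens together with the relay tokens, and after every block the two streams swap their updated relay tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 32         # token embedding dimension (toy value; the real model uses the ViT hidden size)
N_LOCAL = 16   # patch tokens in one high-resolution local crop
N_GLOBAL = 16  # patch tokens in the downsampled global view
N_RELAY = 6    # relay tokens shared between the two streams

def toy_block(tokens, weight):
    """Stand-in for a transformer block: a residual nonlinear mixing step."""
    return tokens + np.tanh(tokens @ weight)

# Independent "transformer" weights per stream and per block.
n_blocks = 4
w_local = [rng.normal(scale=0.1, size=(D, D)) for _ in range(n_blocks)]
w_global = [rng.normal(scale=0.1, size=(D, D)) for _ in range(n_blocks)]

local_tokens = rng.normal(size=(N_LOCAL, D))
global_tokens = rng.normal(size=(N_GLOBAL, D))
relay = rng.normal(size=(N_RELAY, D))  # learnable parameters in the real model

relay_local = relay.copy()
relay_global = relay.copy()
for wl, wg in zip(w_local, w_global):
    # Each stream runs a block over [patch tokens; relay tokens].
    loc = toy_block(np.concatenate([local_tokens, relay_local]), wl)
    glo = toy_block(np.concatenate([global_tokens, relay_global]), wg)
    local_tokens, relay_local = loc[:N_LOCAL], loc[N_LOCAL:]
    global_tokens, relay_global = glo[:N_GLOBAL], glo[N_GLOBAL:]
    # Swap: each branch receives the other branch's freshly updated relay tokens,
    # so fine-grained local cues reach the global view and vice versa.
    relay_local, relay_global = relay_global, relay_local

print(local_tokens.shape, global_tokens.shape)  # (16, 32) (16, 32)
```

Because only the `N_RELAY` extra vectors cross between streams, the communication channel stays tiny regardless of image size, which is where the parameter efficiency comes from.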

Results & Findings

| Dataset | Baseline (ViT/Swin) mIoU | Relay-Token mIoU | Relative Gain |
|---|---|---|---|
| Archaeoscape (UHR) | 61.2 % | 70.1 % | +14.5 % |
| URUR (UHR) | 68.4 % | 73.9 % | +8.0 % |
| Gleason (UHR pathology) | 72.0 % | 78.5 % | +9.0 % |
| Cityscapes (standard) | 78.3 % | 81.2 % | +3.7 % |
  • The improvements are consistent across very different domains (archaeology aerial imagery, remote sensing, histopathology, and street scenes).
  • Ablation studies show that both branches are necessary: removing the global stream hurts large‑object consistency, while dropping the local stream degrades edge precision.
  • Varying the number of relay tokens reveals diminishing returns after ~6 tokens, confirming that a tiny communication channel suffices.
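Note that the table's gains are relative, not absolute mIoU point differences: e.g., Archaeoscape improves by 8.9 points, which is +14.5 % relative to the 61.2 % baseline. The arithmetic, using the numbers from the table above:

```python
# Relative mIoU gain = (new - old) / old, expressed as a percentage.
results = {
    "Archaeoscape": (61.2, 70.1),
    "URUR": (68.4, 73.9),
    "Gleason": (72.0, 78.5),
    "Cityscapes": (78.3, 81.2),
}
for name, (base, relay) in results.items():
    gain = 100 * (relay - base) / base
    print(f"{name}: +{gain:.1f}%")
# Archaeoscape: +14.5%, URUR: +8.0%, Gleason: +9.0%, Cityscapes: +3.7%
```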

Practical Implications

  • Geospatial & remote‑sensing pipelines can now run end‑to‑end segmentation on satellite or drone imagery (often > 10 k × 10 k pixels) without resorting to costly sliding‑window post‑processing.
  • Medical imaging (e.g., whole‑slide pathology) benefits from preserving cellular detail while still understanding tissue‑level structures, potentially improving computer‑assisted diagnosis.
  • AR/VR content creation and cultural‑heritage digitization can leverage the method to automatically label large archaeological sites, speeding up mapping and preservation efforts.
  • For developers, the approach adds negligible memory overhead and can be integrated into existing PyTorch or TensorFlow transformer libraries, making it a drop‑in upgrade for any high‑resolution segmentation task.
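To get a feel for the scale involved, here is a sketch of how many 256-pixel crops the local branch must cover for a 10 k × 10 k image. The overlap value and tiling scheme here are hypothetical, chosen only for illustration; the paper's actual tiling schedule may differ.

```python
import math

def crop_grid(height, width, crop=256, overlap=32):
    """Rows and columns of overlapping crops needed to cover an image
    (hypothetical tiling scheme for illustration)."""
    stride = crop - overlap
    rows = math.ceil(max(height - overlap, 1) / stride)
    cols = math.ceil(max(width - overlap, 1) / stride)
    return rows, cols

# A 10k x 10k satellite or whole-slide image with 256-px crops, 32-px overlap:
rows, cols = crop_grid(10_000, 10_000)
print(rows, cols, rows * cols)  # 45 45 2025
```

Roughly two thousand crops per image is why the global branch matters: without it, each crop would be segmented blind to the other 99.95 % of the scene.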

Limitations & Future Work

  • The current design assumes a fixed downsampling factor for the global branch; highly anisotropic images may need adaptive scaling strategies.
  • Relay tokens are shared across all spatial locations, which could limit expressiveness for extremely heterogeneous scenes; future work might explore spatially‑varying relay tokens or hierarchical token groups.
  • Real‑time inference on very large images still requires tiling the local branch; optimizing the tiling schedule or leveraging sparse attention could further reduce latency.

Overall, Relay Tokens present a pragmatic, high‑impact solution for bringing the global reasoning power of Vision Transformers to the ultra‑high‑resolution world, opening new doors for developers building next‑generation visual AI systems.

Authors

  • Yohann Perron
  • Vladyslav Sydorov
  • Christophe Pottier
  • Loic Landrieu

Paper Information

  • arXiv ID: 2601.05927v1
  • Categories: cs.CV
  • Published: January 9, 2026
