[Paper] RD-ViT: Recurrent-Depth Vision Transformer for Semantic Segmentation with Reduced Data Dependence: Extending the Recurrent-Depth Transformer Architecture to Dense Prediction

Published: May 5, 2026 at 01:21 PM EDT

Source: arXiv - 2605.03999v1

Overview

The paper introduces RD‑ViT, a Recurrent‑Depth Vision Transformer that adapts the ViT architecture to semantic segmentation. Instead of stacking distinct layers, RD‑ViT applies a single shared transformer block for multiple “depth” iterations, which reduces both the parameter count and the dependence on training data. On the ACDC cardiac MRI benchmark it slightly outperforms a standard ViT on 2‑D slices and reaches 99.4 % of the ViT's Dice score on 3‑D volumes.

Key Contributions

Recurrent‑Depth design for dense prediction – replaces a deep stack of unique transformer layers with one shared block looped T times.
LTI‑stable state injection – guarantees convergence of the recurrent loop, preventing exploding/vanishing representations.
Adaptive Computation Time (ACT) – lets the model allocate more iterations to hard‑to‑segment regions (e.g., organ boundaries) and fewer to easy regions.
Depth‑wise LoRA adaptation – lightweight low‑rank updates applied per recurrence step, enabling fast fine‑tuning with minimal extra parameters.
Optional Mixture‑of‑Experts (MoE) feed‑forward – adds category‑specific experts that automatically specialize (e.g., right ventricle vs. myocardium) without extra supervision.
Comprehensive 2‑D/3‑D evaluation on the ACDC cardiac MRI benchmark, including real‑world Google Colab experiments and full open‑source release.
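
To make the parameter argument behind the shared block and depth‑wise LoRA concrete, here is a minimal sketch in plain Python. It is not the authors' code: the width d, step count T, rank r, and all helper names are illustrative assumptions, and only a single square weight matrix is counted rather than a full transformer block.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a_ik * b_kj for a_ik, b_kj in zip(row, col))
             for col in zip(*b)] for row in a]


def lora_weight(shared_w, lora_b, lora_a, scale=1.0):
    """Effective weight at one recurrence step: W + scale * (B @ A)."""
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(shared_w, delta)]


def stacked_params(d, T):
    """T distinct d x d matrices, one per layer of a conventional stack."""
    return T * d * d


def shared_lora_params(d, T, r):
    """One shared d x d matrix plus a rank-r (B, A) pair per recurrence step."""
    return d * d + T * 2 * d * r


# Hypothetical sizes, chosen only for illustration.
d, T, r = 256, 8, 4
print(stacked_params(d, T))         # 524288
print(shared_lora_params(d, T, r))  # 81920

# Tiny example: the shared weight is the identity, the step update is rank 1.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [0.0]]   # 2 x 1
A = [[0.0, 2.0]]     # 1 x 2
print(lora_weight(W, B, A))  # [[1.0, 2.0], [0.0, 1.0]]
```

Under these assumed sizes, the shared matrix with per‑step low‑rank updates needs roughly one sixth of the weights of eight independent matrices, while each step still sees a slightly different effective weight.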

Methodology

Core Architecture – A single transformer block (self‑attention + feed‑forward) is executed repeatedly. After each pass, its hidden state is updated via a linear time‑invariant (LTI) stable injection, ensuring the recurrent process converges to a fixed point.
Adaptive Computation Time – For each spatial token, a small halting network predicts whether another iteration is needed. Tokens near organ edges tend to run more loops, while homogeneous background tokens stop early, saving compute.
Depth‑wise LoRA – Instead of learning a full set of parameters for each recurrence, low‑rank matrices are added per depth step, drastically reducing the total trainable weight count.
Mixture‑of‑Experts (optional) – The feed‑forward layer can be replaced by a set of experts; a lightweight router selects which expert(s) to apply per token, allowing the model to learn structure‑specific processing.
Training & Inference – The model is trained on 2‑D slices and 3‑D volumes of cardiac MRI. During inference, the number of recurrence steps can be increased (depth extrapolation) without hurting performance, giving developers flexibility to trade latency for accuracy.
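
The stability idea behind the state injection can be illustrated with a small sketch. The summary does not give the exact update rule, so the form used below, h ← a·h + b·e with a per‑channel decay a squashed into (0, 1), is an assumption made for illustration only, and the contribution of the shared transformer block itself is left out. The names stable_decay, run_injection, and fixed_point are hypothetical.

```python
import math


def stable_decay(raw):
    """Squash an unconstrained parameter into (0, 1) with a sigmoid."""
    return 1.0 / (1.0 + math.exp(-raw))


def run_injection(e, raw_a, b, steps):
    """Iterate h <- a * h + b * e per channel, starting from h = 0."""
    a = [stable_decay(r) for r in raw_a]
    h = [0.0] * len(e)
    for _ in range(steps):
        h = [a_i * h_i + b_i * e_i
             for a_i, h_i, b_i, e_i in zip(a, h, b, e)]
    return h


def fixed_point(e, raw_a, b):
    """Closed-form limit of the recurrence: h* = b * e / (1 - a)."""
    return [b_i * e_i / (1.0 - stable_decay(r))
            for e_i, r, b_i in zip(e, raw_a, b)]


# Hypothetical two-channel input embedding and parameters.
e = [0.5, -1.0]
raw_a = [2.0, -1.0]
b = [1.0, 0.3]

h_train = run_injection(e, raw_a, b, steps=200)
h_more = run_injection(e, raw_a, b, steps=400)
target = fixed_point(e, raw_a, b)
print(max(abs(x - y) for x, y in zip(h_train, target)) < 1e-8)  # True
print(max(abs(x - y) for x, y in zip(h_more, h_train)) < 1e-8)  # True
```

Because every decay factor stays strictly below 1, the linear part of the loop settles at a fixed point, and doubling the number of steps leaves the state essentially unchanged. That is the behavior depth extrapolation relies on, shown here only for the simplified linear case.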

Results & Findings

Setting	Data Used	Params	Dice (RD‑ViT)	Dice (Standard ViT)	Relative Gain
2‑D slice‑level	10 % of training set	–	0.774	0.762	+1.6 %
2‑D slice‑level	100 % of training set	–	0.882	0.872	+1.1 %
3‑D volumetric (with MoE)	Full set	3.0 M	0.812	0.817	–0.6 % (99.4 % of ViT)
3‑D volumetric (without MoE)	Full set	–	0.795	0.817	–2.7 %
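
For readers who want to check the table, the sketch below computes a Dice score for binary masks and the relative difference between two Dice values. The summary does not state how the Relative Gain column was derived; reading it as (RD‑ViT − ViT) / ViT is an inference, although it does reproduce all four rows. The function names and the toy masks are hypothetical.

```python
def dice(pred, target):
    """Dice = 2 * |A and B| / (|A| + |B|) for flat binary masks (0/1 values)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention (an assumption here): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def relative_gain(rd_vit, vit):
    """Percentage difference of RD-ViT relative to the ViT baseline."""
    return 100.0 * (rd_vit - vit) / vit


print(dice([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5

# Dice pairs taken from the table above: (RD-ViT, standard ViT).
rows = [(0.774, 0.762), (0.882, 0.872), (0.812, 0.817), (0.795, 0.817)]
print([round(relative_gain(r, v), 1) for r, v in rows])  # [1.6, 1.1, -0.6, -2.7]
```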

Additional observations

Expert specialization: MoE experts self‑organized to focus on RV, MYO, and LV without any explicit label‑based routing.
ACT halting maps: Higher iteration counts clustered around cardiac boundaries, confirming that the model learns to spend more compute where it matters.
Ponder time: Average iterations per token dropped from 2.6 (early training) to 1.4 (later training), showing the network learns to be more efficient.
Depth extrapolation: Running more loops at inference than during training did not degrade Dice, offering a simple knob for latency‑accuracy trade‑offs.
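
The halting behavior described above can be sketched for a single token as follows. This follows the usual ACT formulation (stop once the accumulated halting probability reaches 1 − eps and give the last step the remaining mass); whether RD‑ViT uses exactly this rule is not stated in the summary. The halting probabilities are made‑up inputs standing in for the halting network, and act_weights is a hypothetical helper.

```python
def act_weights(halt_probs, max_steps, eps=0.01):
    """Per-step mixing weights for one token under ACT-style halting.

    The token stops at the first step where the accumulated halting
    probability reaches 1 - eps, or when max_steps is hit. The final
    step receives the remaining probability mass so the weights sum to 1.
    """
    n = min(len(halt_probs), max_steps)
    weights, cumulative = [], 0.0
    for step in range(n):
        p = halt_probs[step]
        if cumulative + p >= 1.0 - eps or step == n - 1:
            weights.append(1.0 - cumulative)
            break
        weights.append(p)
        cumulative += p
    return weights


# Hypothetical halting probabilities, standing in for the halting network.
background = [0.995, 0.9, 0.9]   # confident immediately
boundary = [0.2, 0.3, 0.6]       # needs more passes

w_bg = act_weights(background, max_steps=3)
w_bd = act_weights(boundary, max_steps=3)
print(len(w_bg), len(w_bd))         # 1 3
print((len(w_bg) + len(w_bd)) / 2)  # 2.0  (mean ponder steps)

# Capping iterations: the boundary token is forced to stop after two passes.
print(len(act_weights(boundary, max_steps=2)))  # 2
```

The length of each weight list is that token's ponder count, so averaging the lengths over all tokens gives the kind of per‑token iteration figure reported above.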

Practical Implications

Reduced data hunger – With only 10 % of the training set, RD‑ViT still edged out the standard ViT (Dice 0.774 vs. 0.762), which suggests the approach suits modest medical datasets and other domains with limited annotations. Absolute accuracy remains lower than with the full training set (0.882).
Parameter efficiency – The 3‑D MoE configuration is reported at 3.0 M parameters, small enough to be a plausible fit for edge devices or GPU‑constrained environments such as on‑device diagnostics or real‑time imaging pipelines.
Dynamic compute budgeting – ACT allocates compute per token, and capping the maximum number of iterations gives a direct way to meet strict latency budgets (e.g., in interventional radiology).
Plug‑and‑play MoE – The optional MoE layer adds structure‑specific specialization; in the 3‑D setting it raised Dice from 0.795 to 0.812, which makes it useful when a single model must handle multiple organ classes or modalities.
Open‑source notebooks – The authors provide Colab notebooks, allowing teams to prototype quickly, benchmark against standard ViTs, and adapt the recurrent‑depth idea to other dense‑prediction tasks (e.g., satellite segmentation, autonomous‑driving perception).
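
As a rough illustration of the MoE feed‑forward idea, the sketch below routes each token to a single expert. Top‑1 routing with a softmax gate is an assumption (the summary only says the router selects which expert(s) to apply), and the router weights and toy experts are hypothetical stand‑ins for learned components.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def route_top1(token, router_weights):
    """Score every expert with a linear router and pick the best one."""
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    gates = softmax(scores)
    best = max(range(len(gates)), key=gates.__getitem__)
    return best, gates[best]


def moe_ffn(token, router_weights, experts):
    """Apply only the selected expert and scale its output by the gate value."""
    idx, gate = route_top1(token, router_weights)
    return idx, [gate * y for y in experts[idx](token)]


# Hypothetical router (one row per expert) and toy experts.
router = [[5.0, 0.0], [0.0, 5.0], [-5.0, 0.0]]
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
]

print(moe_ffn([1.0, 0.0], router, experts)[0])  # 0
print(moe_ffn([0.0, 1.0], router, experts)[0])  # 1
```

Only the chosen expert runs for a given token, so adding experts increases the parameter count without multiplying the per‑token compute by the number of experts.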

Limitations & Future Work

Domain focus – Experiments are limited to cardiac MRI; broader validation on natural‑image segmentation benchmarks (e.g., COCO‑Stuff, ADE20K) is needed to confirm generality.
Training stability – While LTI‑stable injection mitigates divergence, the recurrent loop can still be sensitive to learning‑rate schedules and initialization, requiring careful tuning.
ACT overhead – The halting network adds a small compute cost; in ultra‑low‑latency scenarios this may need further pruning.
MoE routing simplicity – The current router is lightweight and unsupervised; future work could explore learned or hierarchical routing to improve expert utilization.
3‑D scalability – Though the model works on 3‑D volumes, memory consumption grows with the number of tokens; hybrid patch‑wise or hierarchical schemes could extend applicability to higher‑resolution volumes.
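
The 3‑D scalability concern comes down to simple arithmetic, sketched below. The input and patch sizes are hypothetical and not taken from the paper, and the calculation assumes full global self‑attention, where one attention matrix has one entry per token pair.

```python
def num_tokens(shape, patch):
    """Number of non-overlapping patches; each dimension must divide evenly."""
    count = 1
    for size, p in zip(shape, patch):
        if size % p != 0:
            raise ValueError(f"{size} is not divisible by patch size {p}")
        count *= size // p
    return count


def attention_entries(n_tokens):
    """Entries in one full self-attention matrix (per head, per pass)."""
    return n_tokens * n_tokens


# Hypothetical input and patch sizes, not taken from the paper.
n2d = num_tokens((224, 224), (16, 16))
n3d = num_tokens((16, 224, 224), (4, 16, 16))
print(n2d, attention_entries(n2d))  # 196 38416
print(n3d, attention_entries(n3d))  # 784 614656
```

In this example, a volume with four times as many tokens as a single slice needs sixteen times as many attention entries, which is why patch‑wise or hierarchical schemes are a natural next step.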

Bottom line: RD‑ViT shows that sharing transformer layers across depth, combined with adaptive compute and lightweight adaptation, can weaken the assumed link between large datasets and large models for semantic segmentation. That makes it a promising direction for efficient vision models in medical imaging and other data‑constrained settings, pending validation beyond cardiac MRI.

Authors

Renjie He

Paper Information

arXiv ID: 2605.03999v1
Categories: cs.CV
Published: May 5, 2026
