[Paper] Repurposing Video Diffusion Transformers for Robust Point Tracking
Source: arXiv - 2512.20606v1
Overview
The paper introduces DiTracker, a new point‑tracking system that repurposes video Diffusion Transformers (DiTs) – models originally trained for video generation – to locate matching points across video frames. By leveraging the spatio‑temporal attention baked into DiTs, the authors achieve far more reliable tracking under fast motion, occlusions, and other real‑world challenges, setting new records on several benchmark suites.
Key Contributions
- Discovery of latent tracking ability in pre‑trained video Diffusion Transformers, showing they already encode robust spatio‑temporal correspondences.
- DiTracker architecture that couples DiT features with a lightweight query‑key attention module for point matching.
- Parameter‑efficient adaptation using LoRA (Low‑Rank Adaptation) fine‑tuning, requiring only a fraction of the original model’s parameters.
- Hybrid cost fusion that blends DiT‑derived matching costs with those from a conventional ResNet backbone, improving robustness without sacrificing speed.
- State‑of‑the‑art results on the ITTO and TAP‑Vid point‑tracking benchmarks while training with an 8× smaller batch size than prior methods.
Methodology
- Backbone selection – The authors start from a video Diffusion Transformer pre‑trained on large, diverse video data. These models already process entire video clips with full spatio‑temporal self‑attention.
- Query‑Key attention matching – For each tracked point, a query vector is extracted from the reference frame, while key vectors are taken from every spatial location of subsequent frames. A dot‑product attention operation yields a dense similarity map, from which the best match is selected (a minimal sketch of this matching step follows this list).
- LoRA fine‑tuning – Instead of updating the whole DiT (which would be computationally heavy), the authors inject low‑rank adaptation matrices into the attention layers. This adds only a few hundred thousand trainable parameters, enabling fast convergence on the tracking task (see the LoRA sketch after this list).
- Cost fusion with ResNet – To capture fine‑grained local texture that DiTs may overlook, a lightweight ResNet backbone processes each frame independently. Its matching cost is linearly combined with the DiT cost, giving a final similarity score that balances global context and local detail (see the fusion sketch after this list).
- Training regime – The system is trained on standard point‑tracking datasets using a contrastive loss that encourages the correct correspondence to have the highest similarity (see the loss sketch after this list). Despite using a batch size 8× smaller than competing methods, the LoRA‑based adaptation converges quickly.
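The sketch below illustrates the query‑key matching step in isolation: a dot‑product between a point's query feature and a frame's key features yields a dense similarity map, and the best match is read off with an argmax. Shapes, the cosine normalization, and the temperature are assumptions for illustration, not the paper's exact implementation.

```python
# Minimal sketch of query-key attention matching (hypothetical shapes/names).
import torch
import torch.nn.functional as F

def match_points(query_feats, frame_feats, temperature=0.07):
    """Compute a dense similarity map and pick the best match per query.

    query_feats: (N, C)    - features of N tracked points in the reference frame
    frame_feats: (C, H, W) - feature map of a target frame
    Returns integer (row, col) matches of shape (N, 2).
    """
    C, H, W = frame_feats.shape
    keys = frame_feats.reshape(C, H * W)           # (C, H*W)
    queries = F.normalize(query_feats, dim=-1)     # cosine-style similarity (assumed)
    keys = F.normalize(keys, dim=0)
    sim = (queries @ keys) / temperature           # (N, H*W) dense similarity map
    best = sim.argmax(dim=-1)                      # hard best-match index per point
    rows, cols = best // W, best % W
    return torch.stack([rows, cols], dim=-1)

# Example: 5 tracked points against a 64-channel 32x32 feature map
matches = match_points(torch.randn(5, 64), torch.randn(64, 32, 32))
print(matches.shape)  # torch.Size([5, 2])
```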
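The next sketch shows the general LoRA pattern referenced above: a frozen pre‑trained projection plus a trainable low‑rank update. The class name, rank, and scaling follow common LoRA practice and are assumptions; the paper's exact injection points and hyperparameters may differ.

```python
# Hedged sketch of LoRA on an attention projection: only the low-rank A/B
# matrices are trainable while the pre-trained weight stays frozen.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # freeze pre-trained DiT weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)             # start as a zero update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Wrapping one (illustrative) query projection of an attention layer:
attn_q = LoRALinear(nn.Linear(1024, 1024), rank=8)
trainable = sum(p.numel() for p in attn_q.parameters() if p.requires_grad)
print(trainable)  # only the low-rank A/B parameters are updated
```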
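For the hybrid cost fusion, one simple reading of "linearly combined" is a learnable scalar blend of the two similarity maps, as sketched below. The sigmoid-gated weight is an assumption; the paper may use a different fusion form.

```python
# Sketch of hybrid cost fusion: blend DiT-based and ResNet-based cost maps.
import torch
import torch.nn as nn

class CostFusion(nn.Module):
    def __init__(self):
        super().__init__()
        # scalar mixing weight, squashed to [0, 1] with a sigmoid
        self.logit_w = nn.Parameter(torch.zeros(1))

    def forward(self, cost_dit, cost_resnet):
        """Both inputs: (N, H, W) similarity maps for N query points."""
        w = torch.sigmoid(self.logit_w)
        return w * cost_dit + (1.0 - w) * cost_resnet

fuse = CostFusion()
fused = fuse(torch.randn(5, 32, 32), torch.randn(5, 32, 32))
print(fused.shape)  # torch.Size([5, 32, 32])
```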
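Finally, a common way to realize the contrastive objective described in the training regime is to treat the dense similarity map as logits over all positions and apply cross‑entropy at the ground‑truth location; the sketch below shows that formulation under assumed shapes, not necessarily the authors' exact loss.

```python
# Sketch of a contrastive correspondence loss over a dense similarity map.
import torch
import torch.nn.functional as F

def correspondence_loss(sim_map, gt_rc):
    """sim_map: (N, H, W) similarities; gt_rc: (N, 2) ground-truth (row, col)."""
    N, H, W = sim_map.shape
    logits = sim_map.reshape(N, H * W)
    target = gt_rc[:, 0] * W + gt_rc[:, 1]   # flatten (row, col) to a class index
    return F.cross_entropy(logits, target)   # correct location should score highest

loss = correspondence_loss(torch.randn(4, 32, 32), torch.randint(0, 32, (4, 2)))
print(loss.item())
```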
Results & Findings
| Benchmark / setting | Metric (higher is better) | DiTracker | Prior SOTA |
|---|---|---|---|
| ITTO (hard occlusion & motion) | PCK@0.1 | 0.78 | 0.71 |
| TAP‑Vid (various motion types) | AUC | 0.84 | 0.82 |
| Inference speed | FPS (1080 Ti) | 45 | 30‑35 |
- Robustness to occlusion: DiTracker maintains high matching scores even when points disappear for several frames, thanks to the DiT’s long‑range temporal context.
- Data efficiency: Achieves SOTA with 8× smaller batch size and far fewer trainable parameters, demonstrating that the pre‑trained DiT already contains most of the needed knowledge.
- Ablation studies: Removing the ResNet cost drops performance by ~4 %, confirming the complementary nature of local CNN features. LoRA tuning contributes ~5 % gain over frozen DiT features alone.
Practical Implications
- Video editing tools – Accurate point tracking is the backbone of rotoscoping, object removal, and motion graphics. DiTracker’s robustness means fewer manual corrections for editors working with shaky or occluded footage.
- Robotics & AR – Real‑time tracking of landmarks on moving objects (e.g., hands, tools) can improve pose estimation pipelines without needing specialized sensors. The lightweight LoRA adaptation keeps the model deployable on edge GPUs.
- 3‑D reconstruction pipelines – Better point correspondences translate directly into cleaner structure‑from‑motion and multi‑view stereo results, reducing the need for expensive post‑processing.
- Foundation model reuse – This work showcases a practical recipe for turning large video generative models into perception modules, encouraging the community to treat diffusion‑based video transformers as universal video backbones.
Limitations & Future Work
- Memory footprint – Full‑resolution DiT attention still demands substantial GPU memory, limiting deployment on very low‑end devices.
- Domain shift – While pre‑training on diverse videos helps, extreme domain gaps (e.g., medical endoscopy, satellite imagery) may require additional fine‑tuning.
- Temporal horizon – The current implementation processes short clips (≈8 frames). Extending the temporal window could further improve long‑term occlusion handling.
- Future directions suggested by the authors include exploring hierarchical DiT variants for multi‑scale tracking, integrating explicit motion priors, and compressing the model via distillation for mobile‑first applications.
Authors
- Soowon Son
- Honggyu An
- Chaehyun Kim
- Hyunah Ko
- Jisu Nam
- Dahyun Chung
- Siyoon Jin
- Jung Yi
- Jaewon Min
- Junhwa Hur
- Seungryong Kim
Paper Information
- arXiv ID: 2512.20606v1
- Categories: cs.CV
- Published: December 23, 2025