[Paper] Repurposing Video Diffusion Transformers for Robust Point Tracking
Source: arXiv - 2512.20606v1
Overview
The paper introduces DiTracker, a new point‑tracking system that repurposes video Diffusion Transformers (DiTs) – models originally trained for video generation – to locate matching points across video frames. By leveraging the spatio‑temporal attention baked into DiTs, the authors achieve far more reliable tracking under fast motion, occlusions, and other real‑world challenges, setting new records on several benchmark suites.
Key Contributions
- Discovery of latent tracking ability in pre‑trained video Diffusion Transformers, showing they already encode robust spatio‑temporal correspondences.
- DiTracker architecture that couples DiT features with a lightweight query‑key attention module for point matching.
- Parameter‑efficient adaptation using LoRA (Low‑Rank Adaptation) fine‑tuning, requiring only a fraction of the original model’s parameters.
- Hybrid cost fusion that blends DiT‑derived matching costs with those from a conventional ResNet backbone, improving robustness without sacrificing speed.
- State‑of‑the‑art results on the ITTO and TAP‑Vid point‑tracking benchmarks while training with an 8× smaller batch size than prior methods.
Methodology
- Backbone selection – The authors start from a video Diffusion Transformer pre‑trained on large, diverse video data. These models already process entire video clips with full spatio‑temporal self‑attention.
- Query‑Key attention matching – For each tracked point, a query vector is extracted from the reference frame, while key vectors are taken from every spatial location of subsequent frames. A dot‑product attention operation yields a dense similarity map, from which the best match is selected (a minimal sketch of this matching step follows this list).
- LoRA fine‑tuning – Instead of updating the whole DiT (which would be computationally heavy), the authors inject low‑rank adaptation matrices into the attention layers. This adds only a few hundred thousand trainable parameters, enabling fast convergence on the tracking task (see the LoRA sketch after this list).
- Cost fusion with ResNet – To capture fine‑grained local texture that DiTs may overlook, a lightweight ResNet backbone processes each frame independently. Its matching cost is linearly combined with the DiT cost, giving a final similarity score that balances global context and local detail (see the fusion sketch after this list).
- Training regime – The system is trained on standard point‑tracking datasets using a contrastive loss that encourages the correct correspondence to have the highest similarity (see the loss sketch after this list). Despite using a batch size 8× smaller than competing methods, the LoRA‑based adaptation converges quickly.
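The sketch below illustrates the query‑key matching step in isolation: a dot‑product between a point's query feature and a frame's key features yields a dense similarity map, and the best match is read off with an argmax. Shapes, the cosine normalization, and the temperature are assumptions for illustration, not the paper's exact implementation.

```python
# Minimal sketch of query-key attention matching (hypothetical shapes/names).
import torch
import torch.nn.functional as F

def match_points(query_feats, frame_feats, temperature=0.07):
    """Compute a dense similarity map and pick the best match per query.

    query_feats: (N, C)    - features of N tracked points in the reference frame
    frame_feats: (C, H, W) - feature map of a target frame
    Returns integer (row, col) matches of shape (N, 2).
    """
    C, H, W = frame_feats.shape
    keys = frame_feats.reshape(C, H * W)           # (C, H*W)
    queries = F.normalize(query_feats, dim=-1)     # cosine-style similarity (assumed)
    keys = F.normalize(keys, dim=0)
    sim = (queries @ keys) / temperature           # (N, H*W) dense similarity map
    best = sim.argmax(dim=-1)                      # hard best-match index per point
    rows, cols = best // W, best % W
    return torch.stack([rows, cols], dim=-1)

# Example: 5 tracked points against a 64-channel 32x32 feature map
matches = match_points(torch.randn(5, 64), torch.randn(64, 32, 32))
print(matches.shape)  # torch.Size([5, 2])
```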
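The next sketch shows the general LoRA pattern referenced above: a frozen pre‑trained projection plus a trainable low‑rank update. The class name, rank, and scaling follow common LoRA practice and are assumptions; the paper's exact injection points and hyperparameters may differ.

```python
# Hedged sketch of LoRA on an attention projection: only the low-rank A/B
# matrices are trainable while the pre-trained weight stays frozen.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # freeze pre-trained DiT weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)             # start as a zero update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Wrapping one (illustrative) query projection of an attention layer:
attn_q = LoRALinear(nn.Linear(1024, 1024), rank=8)
trainable = sum(p.numel() for p in attn_q.parameters() if p.requires_grad)
print(trainable)  # only the low-rank A/B parameters are updated
```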
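For the hybrid cost fusion, one simple reading of "linearly combined" is a learnable scalar blend of the two similarity maps, as sketched below. The sigmoid-gated weight is an assumption; the paper may use a different fusion form.

```python
# Sketch of hybrid cost fusion: blend DiT-based and ResNet-based cost maps.
import torch
import torch.nn as nn

class CostFusion(nn.Module):
    def __init__(self):
        super().__init__()
        # scalar mixing weight, squashed to [0, 1] with a sigmoid
        self.logit_w = nn.Parameter(torch.zeros(1))

    def forward(self, cost_dit, cost_resnet):
        """Both inputs: (N, H, W) similarity maps for N query points."""
        w = torch.sigmoid(self.logit_w)
        return w * cost_dit + (1.0 - w) * cost_resnet

fuse = CostFusion()
fused = fuse(torch.randn(5, 32, 32), torch.randn(5, 32, 32))
print(fused.shape)  # torch.Size([5, 32, 32])
```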
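Finally, a common way to realize the contrastive objective described in the training regime is to treat the dense similarity map as logits over all positions and apply cross‑entropy at the ground‑truth location; the sketch below shows that formulation under assumed shapes, not necessarily the authors' exact loss.

```python
# Sketch of a contrastive correspondence loss over a dense similarity map.
import torch
import torch.nn.functional as F

def correspondence_loss(sim_map, gt_rc):
    """sim_map: (N, H, W) similarities; gt_rc: (N, 2) ground-truth (row, col)."""
    N, H, W = sim_map.shape
    logits = sim_map.reshape(N, H * W)
    target = gt_rc[:, 0] * W + gt_rc[:, 1]   # flatten (row, col) to a class index
    return F.cross_entropy(logits, target)   # correct location should score highest

loss = correspondence_loss(torch.randn(4, 32, 32), torch.randint(0, 32, (4, 2)))
print(loss.item())
```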
Results & Findings
| Benchmark / setting | Metric (higher is better) | DiTracker | Prior SOTA |
|---|---|---|---|
| ITTO (hard occlusion & motion) | PCK@0.1 | 0.78 | 0.71 |
| TAP‑Vid (various motion types) | AUC | 0.84 | 0.82 |
| Inference speed | FPS (1080 Ti) | 45 | 30‑35 |
- Robustness to occlusion: DiTracker maintains high matching scores even when points disappear for several frames, thanks to the DiT’s long‑range temporal context.
- Data efficiency: Achieves SOTA with 8× smaller batch size and far fewer trainable parameters, demonstrating that the pre‑trained DiT already contains most of the needed knowledge.
- Ablation studies: Removing the ResNet cost drops performance by ~4 %, confirming the complementary nature of local CNN features. LoRA tuning contributes ~5 % gain over frozen DiT features alone.
Practical Implications
- Video editing tools – Accurate point tracking is the backbone of rotoscoping, object removal, and motion graphics. DiTracker’s robustness means fewer manual corrections for editors working with shaky or occluded footage.
- Robotics & AR – Real‑time tracking of landmarks on moving objects (e.g., hands, tools) can improve pose estimation pipelines without needing specialized sensors. The lightweight LoRA adaptation keeps the model deployable on edge GPUs.
- 3‑D reconstruction pipelines – Better point correspondences translate directly into cleaner structure‑from‑motion and multi‑view stereo results, reducing the need for expensive post‑processing.
- Foundation model reuse – This work showcases a practical recipe for turning large video generative models into perception modules, encouraging the community to treat diffusion‑based video transformers as universal video backbones.
Limitations & Future Work
- Memory footprint – Full‑resolution DiT attention still demands substantial GPU memory, limiting deployment on very low‑end devices.
- Domain shift – While pre‑training on diverse videos helps, extreme domain gaps (e.g., medical endoscopy, satellite imagery) may require additional fine‑tuning.
- Temporal horizon – The current implementation processes short clips (≈8 frames). Extending the temporal window could further improve long‑term occlusion handling.
- Future directions suggested by the authors include exploring hierarchical DiT variants for multi‑scale tracking, integrating explicit motion priors, and compressing the model via distillation for mobile‑first applications.
Authors
- Soowon Son
- Honggyu An
- Chaehyun Kim
- Hyunah Ko
- Jisu Nam
- Dahyun Chung
- Siyoon Jin
- Jung Yi
- Jaewon Min
- Junhwa Hur
- Seungryong Kim
Paper Information
- arXiv ID: 2512.20606v1
- Categories: cs.CV
- Published: December 23, 2025