[Paper] Learning Association via Track-Detection Matching for Multi-Object Tracking

Published: December 26, 2025 at 01:19 PM EST

Source: arXiv - 2512.22105v1

Overview

This paper introduces Track‑Detection Link Prediction (TDLP), a new “tracking‑by‑detection” framework that learns how to stitch together object detections across video frames without relying on hand‑crafted matching rules. By treating the association problem as a link‑prediction task, TDLP bridges the gap between classic, fast trackers and heavyweight end‑to‑end models, delivering state‑of‑the‑art accuracy while staying computationally light.

Key Contributions

Link‑prediction formulation: Recasts per‑frame data association as a supervised link‑prediction problem between existing tracks and new detections.
Geometry‑first architecture: Designed to work primarily with bounding‑box coordinates, yet easily extensible to incorporate pose, appearance, or other cues.
Learning‑based association without full end‑to‑end pipelines: Eliminates hand‑crafted heuristics (e.g., IoU thresholds, motion models) while preserving the modularity and speed of tracking‑by‑detection pipelines.
Comprehensive benchmark validation: Shows consistent gains over both classic tracking‑by‑detection baselines and recent end‑to‑end trackers on multiple public MOT datasets.
Empirical analysis of link prediction vs. metric learning: Demonstrates that link prediction handles heterogeneous feature sets (e.g., raw boxes + pose) more robustly than traditional metric‑learning association.
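
The link‑prediction formulation above can be pictured as a binary classifier over (track, detection) pairs. The sketch below is an illustration only, not the paper's architecture: it uses a logistic model in place of the neural module, and the feature choice in `pair_features`, the `(cx, cy, w, h)` box format, and the toy training data are all assumptions made for the example.

```python
import math
import random

def pair_features(track_box, det_box):
    """Hypothetical pair features; boxes are (cx, cy, w, h).
    Squared, size-normalised offsets: close to 0 when the boxes agree."""
    tx, ty, tw, th = track_box
    dx, dy, dw, dh = det_box
    raw = [(dx - tx) / tw, (dy - ty) / th, math.log(dw / tw), math.log(dh / th)]
    return [r * r for r in raw]

def link_probability(weights, bias, feats):
    """Sigmoid of a linear score: P(detection continues the track)."""
    z = bias + sum(w * f for w, f in zip(weights, feats))
    z = max(-60.0, min(60.0, z))  # keep exp() in a safe range
    return 1.0 / (1.0 + math.exp(-z))

def train(pairs, labels, epochs=200, lr=0.5):
    """Plain SGD on binary cross-entropy."""
    weights, bias = [0.0] * 4, 0.0
    for _ in range(epochs):
        for (track_box, det_box), y in zip(pairs, labels):
            feats = pair_features(track_box, det_box)
            p = link_probability(weights, bias, feats)
            grad = p - y  # derivative of BCE w.r.t. the pre-sigmoid score
            weights = [w - lr * grad * f for w, f in zip(weights, feats)]
            bias -= lr * grad
    return weights, bias

# Toy supervision (positions vary, sizes stay fixed):
# label 1 = same object one frame later, label 0 = a different object.
random.seed(0)
pairs, labels = [], []
for _ in range(50):
    track = (random.uniform(0, 100), random.uniform(0, 100), 10.0, 20.0)
    near = (track[0] + random.uniform(-1, 1), track[1] + random.uniform(-2, 2), 10.0, 20.0)
    far = (track[0] + random.choice([-1, 1]) * random.uniform(15, 40),
           track[1] + random.choice([-1, 1]) * random.uniform(30, 80), 10.0, 20.0)
    pairs += [(track, near), (track, far)]
    labels += [1, 0]

weights, bias = train(pairs, labels)
p_near = link_probability(weights, bias, pair_features((50, 50, 10, 20), (50.5, 50, 10, 20)))
p_far = link_probability(weights, bias, pair_features((50, 50, 10, 20), (80, 110, 10, 20)))
print(f"near: {p_near:.3f}  far: {p_far:.3f}")
```

The point of the sketch is the training signal: each pair gets a 0/1 label and the model learns a score directly, rather than learning an embedding space and thresholding a distance as in metric learning.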

Methodology

Input preprocessing – For each video frame, a detector supplies a set of bounding boxes (and optionally pose or appearance embeddings).
Track representation – Each active track stores its most recent geometric state (position, size, velocity) and any auxiliary features.
Link‑prediction network – A lightweight neural module receives a pair (track, detection) and outputs a probability that the detection is the true continuation of the track. The network is trained with binary cross‑entropy on ground‑truth association labels from annotated video sequences.
Per‑frame association – For every active track, the model scores all candidate detections. A simple bipartite matching (e.g., Hungarian algorithm) selects the highest‑scoring, conflict‑free links, while unmatched detections spawn new tracks and unmatched tracks are terminated after a short grace period.
Modularity – Because the link predictor only consumes geometric vectors (and optional side‑information), it can be swapped out or combined with any off‑the‑shelf detector, keeping the overall pipeline fast and easy to integrate.
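
The per‑frame association step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `toy_score` is a placeholder for the learned link predictor, the `min_score` and `max_misses` values are invented, and a brute‑force search stands in for the Hungarian algorithm so the example needs only the standard library.

```python
import math
from itertools import count, permutations

def toy_score(track_box, det_box):
    """Placeholder for the learned link predictor: decays with centre distance."""
    dist = math.hypot(track_box[0] - det_box[0], track_box[1] - det_box[1])
    return math.exp(-dist / 10.0)

def best_assignment(scores, min_score=0.5):
    """Conflict-free (track, detection) links with maximal total score.
    Brute force over assignments; exponential, so only for a few objects."""
    n_tracks = len(scores)
    n_dets = len(scores[0]) if scores else 0
    # One "None" slot per track lets any track stay unmatched.
    slots = list(range(n_dets)) + [None] * n_tracks
    best, best_total = [], -1.0
    for perm in permutations(slots, n_tracks):
        links = [(t, d) for t, d in enumerate(perm)
                 if d is not None and scores[t][d] >= min_score]
        total = sum(scores[t][d] for t, d in links)
        if total > best_total:
            best, best_total = links, total
    return best

def associate(tracks, detections, score_fn, id_counter, max_misses=2):
    """One frame of tracking: score pairs, match, update track lifecycle."""
    scores = [[score_fn(tr["box"], det) for det in detections] for tr in tracks]
    links = best_assignment(scores)
    matched_tracks = {t for t, _ in links}
    matched_dets = {d for _, d in links}
    for t, d in links:
        tracks[t]["box"], tracks[t]["misses"] = detections[d], 0
    for i, tr in enumerate(tracks):
        if i not in matched_tracks:
            tr["misses"] += 1  # grace period counter
    survivors = [tr for tr in tracks if tr["misses"] <= max_misses]
    for j, det in enumerate(detections):
        if j not in matched_dets:  # unmatched detection spawns a new track
            survivors.append({"id": next(id_counter), "box": det, "misses": 0})
    return survivors

ids = count()
tracks = [{"id": next(ids), "box": (10, 10, 5, 5), "misses": 0},
          {"id": next(ids), "box": (50, 50, 5, 5), "misses": 0}]
detections = [(52, 49, 5, 5), (11, 12, 5, 5), (90, 90, 5, 5)]
tracks = associate(tracks, detections, toy_score, ids)
for tr in tracks:
    print(tr)
```

For realistic object counts the brute‑force search would be replaced by a polynomial‑time solver, for example `scipy.optimize.linear_sum_assignment` with `maximize=True`, and `toy_score` by the trained link predictor.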

Results & Findings

Performance: TDLP outperforms the previous best tracking‑by‑detection method by +3.2% MOTA and beats the top end‑to‑end tracker by +1.5% MOTA on the MOT17 benchmark, while running at ~30 FPS on a single GPU.
Ablation studies: Removing auxiliary cues (pose, appearance) drops performance modestly (~0.8% MOTA), confirming the core strength lies in the learned geometric link predictor.
Link prediction vs. metric learning: Experiments reveal that metric‑learning based association suffers when mixing heterogeneous features, whereas the link‑prediction formulation maintains high accuracy, especially under occlusions and abrupt motion.
Scalability: The method is reported to scale linearly with the number of detections per frame, making it suitable for high‑density scenes (e.g., crowds, traffic).
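
For readers unfamiliar with the metric quoted above, MOTA (Multiple Object Tracking Accuracy) folds missed objects, false positives, and identity switches into a single score relative to the number of ground‑truth objects. The helper below is a small sketch of the standard definition; the counts passed to it are invented for illustration and are not figures from the paper.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

# Made-up counts: 200 errors against 1000 ground-truth objects.
example = mota(false_negatives=120, false_positives=60, id_switches=20, num_gt=1000)
print(f"MOTA = {example:.3f}")
```

Because the error terms are unbounded, MOTA can be negative when a tracker makes more errors than there are ground‑truth objects.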

Practical Implications

Plug‑and‑play for existing pipelines: Developers can replace heuristic association modules in their current tracking‑by‑detection stacks with the TDLP link predictor, gaining a measurable boost in accuracy without redesigning the whole system.
Edge‑friendly deployment: The model’s modest compute footprint (a few million parameters) enables real‑time operation on embedded accelerators (e.g., Jetson, Coral) for applications like autonomous drones, retail analytics, or smart city cameras.
Flexibility for multimodal data: Because additional cues are optional, TDLP can be adapted to domains where appearance is unreliable (e.g., infrared, thermal) but geometry remains robust.
Open‑source code: The authors provide a ready‑to‑run implementation, complete with training scripts and pretrained weights, lowering the barrier for rapid prototyping and research reproducibility.

Limitations & Future Work

Dependence on detector quality: As with any tracking‑by‑detection approach, TDLP’s performance degrades if the upstream detector produces many false positives or misses objects.
Temporal context depth: The current model only looks at the most recent track state; incorporating longer motion histories (e.g., via LSTMs or transformers) could improve handling of long‑term occlusions.
Limited exploration of rich appearance cues: While pose and simple embeddings are supported, the paper does not evaluate deep visual features (e.g., re‑identification embeddings) that could further boost robustness in crowded scenes.
Future directions suggested include extending the link‑prediction network to a graph‑neural architecture for joint multi‑track reasoning, and investigating self‑supervised pretraining to reduce reliance on large annotated MOT datasets.

Authors

Momir Adžemović

Paper Information

arXiv ID: 2512.22105v1
Categories: cs.CV
Published: December 26, 2025
