[Paper] GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture
Source: arXiv - 2602.14771v1
Overview
The paper introduces GOT‑JEPA, a new pre‑training framework that teaches a generic object tracker to behave more like the human visual system: it continuously fuses past observations, adapts to changing appearances, and reasons about occlusions at a fine‑grained level. By extending the Joint‑Embedding Predictive Architecture (JEPA) from image‑level predictions to tracking‑model predictions, the authors obtain a tracker that generalizes far better to unseen videos and handles heavy occlusion, distractors, and other real‑world nuisances.
Key Contributions
- Model‑predictive pre‑training for tracking – adapts JEPA to predict tracking models (not just image features) from past frames.
- Teacher‑student pseudo‑label scheme – a clean‑frame teacher generates pseudo‑tracking models; a student learns to reproduce them from corrupted (occluded, noisy) frames, providing stable supervision under adverse conditions.
- OccuSolver module – a point‑centric visibility estimator that iteratively refines object‑aware occlusion masks using the tracker’s own object priors.
- Unified training pipeline that jointly improves generalization across domains and occlusion handling without requiring hand‑crafted occlusion annotations.
- Extensive benchmark validation – state‑of‑the‑art performance on seven public tracking datasets, especially in scenarios with heavy occlusion and fast appearance change.
Methodology
1. Historical Context Encoding
- The tracker maintains a short memory of past frames (e.g., the last 5–10 frames). These are encoded into a compact representation that captures motion, appearance, and spatial layout.
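The frame memory described above can be sketched as a fixed-length buffer of per-frame features reduced to a compact context vector. This is a minimal illustration, not the paper's actual encoder; the class name `FrameMemory`, the window size, and the mean-pooling reduction are all illustrative assumptions.

```python
from collections import deque

class FrameMemory:
    """Fixed-length memory of per-frame feature vectors (hypothetical encoder)."""
    def __init__(self, window=8):
        # deque with maxlen automatically evicts the oldest frame.
        self.frames = deque(maxlen=window)

    def push(self, feat):
        self.frames.append(feat)

    def context(self):
        # Compact context: element-wise mean over the stored frame features.
        if not self.frames:
            return None
        dim = len(self.frames[0])
        return [sum(f[i] for f in self.frames) / len(self.frames) for i in range(dim)]

mem = FrameMemory(window=3)
for t in range(5):                 # push 5 frames; only the last 3 are kept
    mem.push([float(t), float(t) * 2.0])
ctx = mem.context()                # mean over frames t = 2, 3, 4 -> [3.0, 6.0]
```

A real implementation would store CNN or transformer features per frame; the eviction-and-pool pattern is the part this sketch illustrates.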
2. Teacher Predictor (Clean View)
- Given the historical context and a clean current frame, the teacher network predicts a pseudo‑tracking model (e.g., a set of per‑object embeddings and motion vectors). This model serves as the “gold standard” for the current time step.
3. Student Predictor (Corrupted View)
- The same historical context is paired with a corrupted version of the current frame (simulated occlusions, noise, motion blur). The student network must predict the same pseudo‑tracking model the teacher produced.
- The loss is a simple L2 distance between teacher and student outputs, encouraging the student to be robust to visual degradations.
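The teacher-student objective above can be sketched end to end: corrupt the current frame, run both views through a shared predictor, and penalize the L2 distance between the two outputs. The corruption recipe (noise level, mask size) and the stand-in linear predictor are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(frame, rng, occlusion=0.3, noise=0.1):
    """Simulate a degraded view: additive Gaussian noise plus a random
    rectangular occlusion mask (a stand-in for the paper's corruptions)."""
    out = frame + noise * rng.standard_normal(frame.shape)
    h, w = frame.shape
    mh, mw = int(h * occlusion), int(w * occlusion)
    y, x = int(rng.integers(0, h - mh)), int(rng.integers(0, w - mw))
    out[y:y + mh, x:x + mw] = 0.0   # zero out the occluded patch
    return out

def consistency_loss(teacher_model, student_model):
    """L2 (mean squared) distance between teacher and student predictions."""
    return float(np.mean((teacher_model - student_model) ** 2))

# Stand-in predictor: a shared linear map applied to frame + context.
frame = rng.standard_normal((16, 16))
context = rng.standard_normal((16, 16))
W = 0.1 * rng.standard_normal((16, 16))
teacher_model = (frame + context) @ W                 # clean view
student_model = (corrupt(frame, rng) + context) @ W   # corrupted view
loss = consistency_loss(teacher_model, student_model)
```

Minimizing this loss over many corruptions pushes the student's predicted tracking model toward the teacher's, which is the robustness mechanism the section describes.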
4. OccuSolver – Occlusion Reasoning Layer
- Built on a point‑centric tracker (e.g., a dense optical‑flow or keypoint tracker).
- Starts with a coarse visibility estimate, then iteratively refines it using object priors (size, shape, motion) generated by the tracker itself.
- The refined visibility mask is fed back into the predictor, allowing it to discount occluded points and focus on reliable cues.
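The coarse-to-fine visibility refinement can be illustrated with a toy point-centric update: each point's visibility score is repeatedly blended with its neighbors' mean, standing in for the object priors (size, shape, motion) the paper uses. The neighbor graph, mixing weight, and iteration count here are assumptions for illustration only.

```python
def refine_visibility(scores, neighbors, iters=3, mix=0.5):
    """Iteratively blend each point's visibility with its neighbors' mean,
    a toy stand-in for OccuSolver's object-prior refinement."""
    vis = list(scores)
    for _ in range(iters):
        # Jacobi-style update: the whole list is recomputed from the old values.
        vis = [
            (1.0 - mix) * v + mix * (sum(vis[j] for j in nb) / len(nb) if nb else v)
            for v, nb in zip(vis, neighbors)
        ]
    return vis

# Five points along the object; the middle one starts fully "occluded" (0.0).
scores = [1.0, 1.0, 0.0, 1.0, 1.0]
chain = [[1], [0, 2], [1, 3], [2, 4], [3]]   # simple chain neighborhood
refined = refine_visibility(scores, chain)
# The occluded point is pulled up by its visible neighbors but stays below 1.
```

In the paper's pipeline the refined mask would then be fed back to the predictor so that low-visibility points contribute less to the tracking model.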
5. Training Loop
- Alternate between (a) pre‑training the teacher‑student pair on large, unlabeled video corpora and (b) fine‑tuning the whole system (including OccuSolver) on standard tracking benchmarks.
- No explicit occlusion labels are required; the system learns them implicitly from the teacher‑student consistency signal.
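The alternation between phases (a) and (b) can be expressed as a simple schedule; the phase names, ratio, and period are illustrative choices, since the paper's summary does not specify the exact alternation.

```python
def training_schedule(steps, pretrain_ratio=0.5):
    """Alternate between (a) teacher-student pre-training on unlabeled video
    and (b) benchmark fine-tuning (including OccuSolver). The alternation
    period is derived from an assumed pretrain/finetune ratio."""
    period = max(1, round(1.0 / pretrain_ratio))
    return [
        "pretrain" if step % period == 0 else "finetune"
        for step in range(steps)
    ]

schedule = training_schedule(6)   # ['pretrain', 'finetune', 'pretrain', ...]
```

Each "pretrain" step would minimize the teacher-student consistency loss on raw video; each "finetune" step would apply a standard tracking loss on benchmark clips.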
Results & Findings
| Benchmark | Baseline Tracker (w/o GOT‑JEPA) | GOT‑JEPA (+ OccuSolver) | Relative Gain |
|---|---|---|---|
| LaSOT | 68.2 % AO (average overlap) | 74.5 % | +9.2 % |
| TrackingNet | 71.0 % AO | 77.3 % | +8.9 % |
| OTB‑100 | 84.5 % success rate | 89.1 % | +5.4 % |
| VOT‑2022 | 0.28 EAO (expected average overlap) | 0.34 | +21 % |
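The "Relative Gain" column appears to be the percentage change relative to the baseline score, (new − old) / old; a quick check against the table's rows:

```python
def relative_gain(baseline, improved):
    """Relative gain in percent: 100 * (new - old) / old."""
    return 100.0 * (improved - baseline) / baseline

# LaSOT row: 68.2 -> 74.5 AO gives roughly +9.2 % relative gain.
lasot_gain = round(relative_gain(68.2, 74.5), 1)
# VOT-2022 row: 0.28 -> 0.34 EAO gives roughly +21 % relative gain.
vot_gain = round(relative_gain(0.28, 0.34))
```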
- Generalization: On out‑of‑distribution videos (e.g., night‑time driving, underwater footage) the GOT‑JEPA tracker retained >70 % AO, whereas conventional trackers dropped below 55 %.
- Occlusion robustness: In synthetic occlusion tests (random masks covering up to 70 % of the object), the visibility‑aware version maintained >60 % AO, a 30 % improvement over the baseline.
- Ablation: Removing the teacher‑student consistency loss reduced performance by ~4 % AO, confirming the importance of pseudo‑supervision. Removing OccuSolver cut occlusion handling gains in half.
Practical Implications
- Plug‑and‑play pre‑training: Developers can adopt the teacher‑student pre‑training recipe on any existing tracker (Siamese, transformer‑based, etc.) to boost robustness without redesigning the core architecture.
- Reduced annotation burden: Because occlusion masks are learned implicitly, teams can train on raw video streams without expensive per‑frame occlusion labeling.
- Edge‑device friendliness: The student predictor and OccuSolver are lightweight (≈2 M parameters total) and run at >30 fps on a modern mobile GPU, making them suitable for AR/VR, robotics, and autonomous‑driving perception stacks.
- Improved safety in dynamic environments: Better handling of sudden occlusions (e.g., pedestrians stepping behind a car) translates to more reliable object‑level situational awareness for autonomous systems.
- Cross‑domain deployment: The same model can be deployed across surveillance, sports analytics, and consumer video editing tools, reducing the need for domain‑specific fine‑tuning.
Limitations & Future Work
- Short‑term memory window: The current design only looks back a few frames; long‑term re‑identification (e.g., after a prolonged disappearance) still challenges the system.
- Synthetic occlusion bias: Training occlusions are generated artificially; real‑world occlusion patterns (e.g., semi‑transparent objects) may differ, potentially limiting transfer to niche domains.
- Scalability to many objects: While the point‑centric approach works well for a handful of targets, scaling to dense multi‑object tracking (hundreds of instances) may require additional hierarchy or grouping mechanisms.
- Future directions suggested by the authors include extending the teacher‑student framework to multi‑modal inputs (e.g., depth, LiDAR), integrating long‑term memory modules, and exploring self‑supervised occlusion synthesis that better mimics real physics.
Authors
- Shih-Fang Chen
- Jun-Cheng Chen
- I-Hong Jhuo
- Yen-Yu Lin
Paper Information
- arXiv ID: 2602.14771v1
- Categories: cs.CV, cs.AI, cs.LG, cs.MM, cs.NE
- Published: February 16, 2026