[Paper] An Open-Source Two-Stage Computer Vision Pipeline for Fine-Grained Vehicle Classification using Vision Transformers
Source: arXiv - 2606.05149v1
Overview
An open‑source computer‑vision pipeline addresses a practical road‑safety problem: automatically classifying the body type of vehicles captured in naturalistic roadway video. By chaining an RT‑DETR object detector with a fine‑tuned Vision Transformer, the authors report 94 % in‑distribution and 89 % out‑of‑distribution accuracy on six vehicle categories that matter for cyclist injury risk, and they release the code, models, and evaluation tools for the community.
Key Contributions
- Two‑stage architecture:
  - Stage 1: RT‑DETR detector for fast, coarse vehicle localization.
  - Stage 2: Vision Transformer (ViT‑Base/16) fine‑tuned to distinguish six fine‑grained body types.
- Confidence‑based abstention: predictions with softmax confidence < 0.60 are labeled “unknown” instead of forcing a possibly wrong class.
- Robustness evaluation:
  - In‑distribution test (3,805 overtaking events) → 94 % accuracy, per‑class F1 = 0.91–0.97.
  - Out‑of‑distribution test (311 events from a different cycling dataset) → 89 % accuracy, with three of four major classes retaining F1 ≥ 0.90.
- Open‑source release: full pipeline, training scripts, pretrained weights, and evaluation utilities are publicly available under a permissive license.
- Domain‑shift analysis: shows that under domain shift the abstention mechanism absorbs much of the uncertainty (e.g., minivan F1 drops because of more abstentions, not more misclassifications).
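The confidence‑based abstention rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' released code: the function names and the class index order (taken from the label list in the Methodology section) are assumptions, and only the 0.60 threshold comes from the paper summary.

```python
import math

# Label order is an assumption; the released model may index classes differently.
LABELS = ["passenger car", "SUV", "pickup", "minivan", "large van", "commercial truck"]


def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_abstention(logits, labels=LABELS, threshold=0.60):
    """Return (label, confidence); the label is "unknown" below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    if confidence < threshold:
        return "unknown", confidence
    return labels[best], confidence


# A peaked score vector is accepted; a near-tie between two classes is rejected.
print(classify_with_abstention([5.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
print(classify_with_abstention([1.0, 1.0, 0.0, 0.0, 0.0, 0.0]))
```

The second call abstains because two classes split most of the probability mass, so neither reaches 0.60.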
Methodology
- Data collection & labeling – 3,805 overtaking events were manually annotated with six vehicle body‑type labels (passenger car, SUV, pickup, minivan, large van, commercial truck).
- Stage 1 – Coarse detection – A pre‑trained RT‑DETR model (a recent transformer‑based detector) scans each video frame, outputting bounding boxes and class‑agnostic confidence scores. This step is lightweight and runs in real time on a single GPU.
- Stage 2 – Fine‑grained classification – Cropped vehicle patches are fed to a ViT‑Base/16 model that was initialized from ImageNet‑21k weights and then fine‑tuned on the annotated dataset. The transformer’s self‑attention layers capture subtle shape cues (roofline, wheelbase, rear profile) that differentiate, say, an SUV from a pickup.
- Abstention logic – After the softmax layer, if the highest class probability is below 0.60, the system emits an “unknown” label instead of a body type. This prevents silent errors when the model is unsure (e.g., due to occlusion or lighting).
- Evaluation – Standard metrics (accuracy, per‑class F1) are computed on both the in‑distribution test set and an out‑of‑distribution set drawn from an open cycling video repository, without any additional fine‑tuning.
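The control flow of the steps above (detect, crop, classify, abstain) can be sketched independently of any particular model. In this sketch `detect` and `classify` are hypothetical callables standing in for the RT‑DETR and ViT stages; their signatures, the `Box` format, and the list‑of‑rows frame representation are assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels; format is an assumption


def crop(frame: Sequence[Sequence], box: Box) -> List[list]:
    """Cut a box out of a frame stored as rows of pixels."""
    x1, y1, x2, y2 = box
    return [list(row[x1:x2]) for row in frame[y1:y2]]


def run_two_stage(frame, detect: Callable, classify: Callable,
                  threshold: float = 0.60):
    """Localize vehicles, classify each crop, and abstain below the threshold."""
    results = []
    for box in detect(frame):                           # stage 1: coarse detection
        label, confidence = classify(crop(frame, box))  # stage 2: body type
        if confidence < threshold:
            label = "unknown"                           # abstain instead of guessing
        results.append((box, label, confidence))
    return results


# Stand-in stages so the sketch runs without any model weights.
frame = [[0] * 8 for _ in range(6)]
fake_detect = lambda f: [(0, 0, 4, 3), (4, 2, 8, 6)]
fake_classify = lambda patch: ("SUV", 0.92) if len(patch) == 3 else ("minivan", 0.41)

for box, label, confidence in run_two_stage(frame, fake_detect, fake_classify):
    print(box, label, confidence)
```

Keeping the two stages behind plain callables means either one can be swapped (a different detector, a retrained classifier) without touching the abstention logic.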
Results & Findings
| Dataset | Overall Accuracy | Avg. F1 | Notable per‑class F1 |
|---|---|---|---|
| In‑distribution (Ann Arbor) | 0.94 | 0.94 | SUV 0.97, Pickup 0.95, Large Van 0.94, Commercial Truck 0.96, Passenger Car 0.93, Minivan 0.91 |
| Out‑of‑distribution (external cycling data) | 0.89 | 0.89 | SUV 0.95, Pickup 0.93, Large Van 0.91, Commercial Truck 0.92, Passenger Car 0.88, Minivan 0.72 |
Why the minivan drop? The abstention rate for minivans jumped from 2.4 % (in‑distribution) to 25 % (out‑of‑distribution). The model opts out rather than forcing a wrong label, so the lower F1 mostly reflects minivans left unlabeled, not minivans mislabeled as another class.
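A toy calculation illustrates how abstention alone can pull a class's F1 down. The counts below are invented for illustration and are not the paper's data; the sketch also assumes that an abstained sample is scored as a miss for its true class, which is consistent with the explanation above but not confirmed here.

```python
def per_class_f1(y_true, y_pred, cls):
    """Precision, recall, and F1 for one class; "unknown" never counts as a hit."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts only: 20 true minivans and 10 true SUVs.
y_true = ["minivan"] * 20 + ["SUV"] * 10
few_abstentions = ["minivan"] * 19 + ["unknown"] * 1 + ["SUV"] * 10
many_abstentions = ["minivan"] * 15 + ["unknown"] * 5 + ["SUV"] * 10

for name, y_pred in [("few", few_abstentions), ("many", many_abstentions)]:
    p, r, f1 = per_class_f1(y_true, y_pred, "minivan")
    print(name, round(p, 3), round(r, 3), round(f1, 3))
# few  1.0 0.95 0.974
# many 1.0 0.75 0.857
```

Precision stays at 1.0 in both cases because no vehicle is wrongly labeled a minivan; only recall falls. The SUV scores are untouched, since abstained minivans are not pushed into another class.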
Practical Implications
- Cyclist‑safety analytics – Transportation agencies can now process existing roadside video archives to quantify exposure to high‑risk vehicle types without manual labeling.
- Real‑time monitoring – The lightweight RT‑DETR + ViT pipeline runs at ~15 fps on a single RTX 3080, making it feasible for live deployment at intersections or bike‑lane cameras.
- Transferable framework – Because the detector is generic and the classifier is fine‑tuned on a modest dataset, the same two‑stage design can be repurposed for other fine‑grained tasks (e.g., distinguishing delivery vans vs. passenger vans for logistics analytics).
- Safety‑policy feedback loop – Automated body‑type statistics can feed into risk‑modeling tools that prioritize infrastructure upgrades (e.g., adding protected bike lanes where large trucks dominate).
- Open‑source ecosystem – Researchers and developers can clone the repo, plug in their own video sources, and extend the label set or add domain‑adaptation tricks without starting from scratch.
Limitations & Future Work
- Dataset scope – Training data comes from a single urban corridor; rare vehicle shapes (e.g., electric vans) may still be mis‑identified.
- Abstention threshold – The 0.60 cutoff is heuristic; adaptive thresholds or Bayesian uncertainty estimates could yield better trade‑offs.
- Temporal context – The current pipeline classifies each detected vehicle independently; incorporating motion cues across frames might improve robustness to occlusion.
- Edge deployment – While feasible on a desktop GPU, further optimization (e.g., TensorRT, model pruning) is needed for low‑power edge devices.
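One simple way to examine the abstention‑threshold trade‑off noted above is to sweep candidate cutoffs on held‑out predictions and compare coverage (the share of predictions kept) against accuracy on the kept predictions. The sketch below is a generic selective‑classification check, not something the paper describes; the function name and the toy confidences are made up for illustration.

```python
def sweep_thresholds(confidences, correct, thresholds):
    """For each threshold, return (threshold, coverage, accuracy on kept items)."""
    rows = []
    for t in thresholds:
        kept = [ok for c, ok in zip(confidences, correct) if c >= t]
        coverage = len(kept) / len(confidences)
        accuracy = sum(kept) / len(kept) if kept else float("nan")
        rows.append((t, coverage, accuracy))
    return rows


# Made-up held-out predictions: top-class confidence and whether it was right.
confidences = [0.95, 0.85, 0.70, 0.65, 0.55, 0.40]
correct = [True, True, True, False, True, False]

for t, coverage, accuracy in sweep_thresholds(confidences, correct, [0.5, 0.6, 0.8]):
    print(f"threshold={t:.2f} coverage={coverage:.2f} accuracy={accuracy:.2f}")
```

Raising the threshold typically trades coverage for accuracy, so a cutoff could be chosen per deployment (or per class) from such a curve instead of being fixed at 0.60.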
The authors invite the community to build on their codebase, explore these extensions, and help bring fine‑grained vehicle classification into everyday traffic‑safety workflows.
Authors
- Gandhimathi Padmanaban
- Fred Feng
Paper Information
- arXiv ID: 2606.05149v1
- Categories: cs.CV, cs.LG, eess.IV
- Published: June 3, 2026