[Paper] ViCrop-Det: Spatial Attention Entropy Guided Cropping for Training-Free Small-Object Detection
Source: arXiv - 2604.26806v1
Overview
The paper introduces ViCrop-Det, a training‑free inference add‑on that boosts small‑object detection in transformer‑based detectors (e.g., DETR variants). By measuring the spatial attention entropy (SAE) of the detector’s cross‑attention maps, ViCrop‑Det automatically crops and re‑processes only the most ambiguous, information‑rich regions, recovering fine‑grained features without changing the underlying model architecture.
Key Contributions
- Training‑free adaptive cropping – Uses the detector’s own attention distribution as a probe, eliminating the need for extra data or fine‑tuning.
- Spatial Attention Entropy (SAE) – A lightweight metric that quantifies local uncertainty in the cross‑attention map, guiding where to focus higher‑resolution processing.
- Dynamic spatial routing – Allocates a fixed compute budget to high‑entropy, high‑saliency patches, effectively “shrinking” the trust region around tiny objects.
- Compatibility with existing DETR families – Works out‑of‑the‑box with RT‑DETR‑R50, Deformable DETR, and other transformer detectors.
- Empirical gains on multiple benchmarks – +2.3 and +2.2 mAP@50 on VisDrone and DOTA‑v1.5 respectively, and +1.9 AP_S on COCO, for roughly 20–23 % extra latency.
Methodology
- Run the baseline detector once to obtain its cross‑attention maps (the same tensors already computed for object queries).
- Compute SAE for each spatial location:

  \[ \text{SAE}(x,y) = -\sum_{h} p_{h}(x,y)\log p_{h}(x,y) \]

  where \(p_{h}\) is the normalized attention weight of head \(h\). High entropy ⇒ the model is “confused” about what lies there.
- Select regions that satisfy two criteria:
  - High saliency (large attention magnitude, indicating a potential object).
  - High entropy (large uncertainty, typical for tiny or densely packed objects).
- Crop those regions, optionally up‑sample them, and feed the crops through the same detector again (re‑using the weights).
- Merge the second‑pass detections with the original ones, keeping the higher‑confidence boxes and discarding duplicates via standard NMS.
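The SAE computation and the region selection described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the cross‑attention has already been reshaped to one non‑negative map per head with shape `(heads, H_f, W_f)`, that \(p_{h}\) is obtained by normalizing across heads at each location, that saliency is the summed attention, and that the two criteria are combined by multiplying their max‑normalized values. The function names and the `budget` and `crop_frac` parameters are hypothetical.

```python
import numpy as np


def spatial_attention_entropy(attn, eps=1e-12):
    """attn: (heads, Hf, Wf) non-negative attention, one map per head.

    Returns (sae, saliency), each of shape (Hf, Wf).
    """
    attn = np.asarray(attn, dtype=np.float64)
    saliency = attn.sum(axis=0)                      # attention mass per location
    # Assumption: p_h is the per-location attention normalised across heads.
    p = attn / np.maximum(saliency, eps)[None]
    # Entropy over heads; clipping avoids log(0) (0 * log(eps) contributes 0).
    sae = -(p * np.log(np.clip(p, eps, 1.0))).sum(axis=0)
    return sae, saliency


def select_crops(sae, saliency, image_hw, budget=3, crop_frac=0.25):
    """Greedily pick up to `budget` crop boxes (x0, y0, x1, y1) in pixels."""
    hf, wf = sae.shape
    img_h, img_w = image_hw
    # Assumption: a region must be both salient and uncertain, so multiply
    # the two max-normalised maps into a single score.
    score = (sae / (sae.max() + eps_)) * (saliency / (saliency.max() + eps_))
    ch = max(1, round(hf * crop_frac))               # crop height in feature cells
    cw = max(1, round(wf * crop_frac))               # crop width in feature cells
    sy, sx = img_h / hf, img_w / wf                  # feature cell -> pixel scale
    boxes = []
    for _ in range(budget):
        if score.max() <= 0:
            break                                    # nothing ambiguous is left
        cy, cx = np.unravel_index(np.argmax(score), score.shape)
        y0 = int(np.clip(cy - ch // 2, 0, hf - ch))  # keep the window inside the map
        x0 = int(np.clip(cx - cw // 2, 0, wf - cw))
        score[y0:y0 + ch, x0:x0 + cw] = 0.0          # later picks must cover new area
        boxes.append((int(x0 * sx), int(y0 * sy),
                      int((x0 + cw) * sx), int((y0 + ch) * sy)))
    return boxes


eps_ = 1e-12  # small constant used when normalising the score maps

# Toy usage: 4 heads on an 8x8 feature grid, for a 64x64 image.
rng = np.random.default_rng(0)
toy_attn = rng.random((4, 8, 8))
toy_sae, toy_sal = spatial_attention_entropy(toy_attn)
print(select_crops(toy_sae, toy_sal, image_hw=(64, 64), budget=2))
```

Capping the loop at `budget` iterations is what keeps the extra compute bounded: the number of second‑pass crops never exceeds that value, regardless of how many high‑entropy locations the map contains.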
Because the detector is unchanged, the only extra cost is the second forward pass on a small subset of the image, which can be bounded by a user‑defined compute budget.
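The merge of first‑pass and second‑pass detections can be illustrated with a short NumPy sketch. It is an assumption‑laden example rather than the paper's code: boxes are `(x0, y0, x1, y1)` arrays, NMS is class‑agnostic with a hypothetical IoU threshold of 0.5, and the helper names (`to_image_coords`, `nms`, `merge_passes`) are invented for illustration. Detections from an up‑sampled crop are first mapped back to full‑image coordinates, then both sets are pooled and de‑duplicated.

```python
import numpy as np


def to_image_coords(boxes, crop_xyxy, scale=1.0):
    """Map boxes found in a crop (up-sampled by `scale`) back to image pixels."""
    boxes = np.asarray(boxes, dtype=np.float64).reshape(-1, 4)
    x0, y0 = crop_xyxy[0], crop_xyxy[1]
    return boxes / scale + np.array([x0, y0, x0, y0], dtype=np.float64)


def iou_one_to_many(box, others):
    """IoU between one box and an (N, 4) array of boxes."""
    ix0 = np.maximum(box[0], others[:, 0])
    iy0 = np.maximum(box[1], others[:, 1])
    ix1 = np.minimum(box[2], others[:, 2])
    iy1 = np.minimum(box[3], others[:, 3])
    inter = np.clip(ix1 - ix0, 0, None) * np.clip(iy1 - iy0, 0, None)
    area_box = (box[2] - box[0]) * (box[3] - box[1])
    area_others = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return inter / (area_box + area_others - inter + 1e-12)


def nms(boxes, scores, iou_thr=0.5):
    """Greedy class-agnostic NMS; returns kept indices, highest score first."""
    order = np.argsort(-scores)
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Drop every remaining box that overlaps the kept one too much.
        order = rest[iou_one_to_many(boxes[i], boxes[rest]) <= iou_thr]
    return keep


def merge_passes(first, second, iou_thr=0.5):
    """first / second: (boxes, scores) tuples, both in full-image coordinates."""
    boxes = np.concatenate([first[0], second[0]])
    scores = np.concatenate([first[1], second[1]])
    keep = nms(boxes, scores, iou_thr)
    return boxes[keep], scores[keep]


# Toy usage: one crop at (90, 90, 154, 154), up-sampled 2x before the second pass.
first_pass = (np.array([[100., 100., 120., 120.], [300., 300., 340., 340.]]),
              np.array([0.4, 0.8]))
crop_boxes = to_image_coords([[20., 20., 60., 60.]], (90, 90, 154, 154), scale=2.0)
second_pass = (crop_boxes, np.array([0.7]))
merged_boxes, merged_scores = merge_passes(first_pass, second_pass)
print(merged_boxes, merged_scores)
```

Because NMS visits boxes in descending score order, a second‑pass box replaces an overlapping first‑pass box only when it is more confident, which matches the "keep the higher‑confidence box" rule.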
Results & Findings
| Dataset (metric) | Baseline (RT‑DETR‑R50) | +ViCrop‑Det | Δ |
|---|---|---|---|
| VisDrone (mAP@50) | 31.2 | 33.5 | +2.3 |
| DOTA‑v1.5 (mAP@50) | 38.7 | 40.9 | +2.2 |
| COCO (AP_S) | 22.1 | 24.0 | +1.9 |
| COCO (AP_M / AP_L) | 38.4 / 45.6 | 38.5 / 45.5 | +0.1 / −0.1 |
- Latency: average inference time rises by only 20–23 %, since just a few cropped patches are re‑processed.
- Accuracy‑speed trade‑off: Compared with uniform image slicing (splitting the whole image into a grid), ViCrop‑Det achieves higher mAP for the same compute budget, confirming the benefit of entropy‑driven routing.
- Robustness: The improvement is consistent across datasets with different object densities and scales, indicating that SAE reliably spots the “hard‑to‑see” small objects.
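As a quick arithmetic check, the absolute gains in the table can be recomputed from the baseline and ViCrop‑Det columns. The snippet below also prints the relative gain, which is a derived quantity and is not reported in the source; the `gains` helper is a hypothetical name.

```python
# Scores copied from the results table: (baseline, with ViCrop-Det).
results = {
    "VisDrone (mAP@50)": (31.2, 33.5),
    "DOTA-v1.5 (mAP@50)": (38.7, 40.9),
    "COCO (AP_S)": (22.1, 24.0),
    "COCO (AP_M)": (38.4, 38.5),
    "COCO (AP_L)": (45.6, 45.5),
}


def gains(baseline, improved):
    """Return (absolute gain, relative gain in percent), rounded to one decimal."""
    delta = improved - baseline
    return round(delta, 1), round(100.0 * delta / baseline, 1)


for name, (base, vicrop) in results.items():
    absolute, relative = gains(base, vicrop)
    print(f"{name:20s} abs {absolute:+.1f}   rel {relative:+.1f}%")
```

Viewed this way, the small‑object metrics move by several percent relative to the baseline, while AP_M and AP_L stay essentially flat.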
Practical Implications
| Who Benefits | Why It Matters |
|---|---|
| Edge AI developers | Can retrofit existing DETR models on limited‑resource devices (e.g., drones, mobile phones) to detect tiny objects without retraining. |
| Surveillance & traffic monitoring | Small vehicles, pedestrians, or wildlife that are often missed can now be captured with marginal latency increase. |
| Geospatial analytics (satellite / aerial imagery) | Improves detection of small structures (e.g., cars, containers) while preserving the global context needed for scene understanding. |
| MLOps pipelines | No extra training data or hyper‑parameter tuning; the method is a plug‑and‑play inference wrapper, simplifying deployment. |
| Research prototyping | Provides a diagnostic tool: high SAE regions highlight where the model’s attention is uncertain, guiding data collection or model redesign. |
In short, ViCrop‑Det offers a low‑cost, high‑impact upgrade path for any transformer‑based detector that struggles with small objects.
Limitations & Future Work
- Heuristic nature – SAE is a proxy for uncertainty; it may misclassify noisy backgrounds as ambiguous regions, leading to occasional false positives.
- Fixed compute budget – The current implementation caps the number of crops; dynamic budgeting based on scene complexity could yield better efficiency.
- Single‑stage re‑processing – Only one additional pass is performed; iterative cropping could further refine detections but would increase latency.
- Evaluation limited to DETR‑style detectors – Extending the approach to CNN‑based detectors or hybrid architectures remains an open question.
Future research could explore learned entropy thresholds, multi‑round adaptive cropping, and integration with model‑aware pruning to further shrink the compute overhead while preserving (or even enhancing) detection quality.
Authors
- Hui Wang
- Hongze Li
- Wei Chen
- Xiaojin Zhang
Paper Information
- arXiv ID: 2604.26806v1
- Categories: cs.CV, cs.AI
- Published: April 29, 2026