[Paper] Is Bigger Always Better? Efficiency Analysis in Resource-Constrained Small Object Detection
Source: arXiv - 2603.02142v1
Overview
The paper Is Bigger Always Better? Efficiency Analysis in Resource‑Constrained Small Object Detection challenges the prevailing “bigger‑is‑better” dogma in computer‑vision model scaling. By rigorously testing three scaling levers—model size, training‑set size, and image resolution—on rooftop photovoltaic (PV) detection in Madagascar, the authors show that tiny, high‑resolution models can outperform their massive counterparts both in raw accuracy and in efficiency (accuracy per megabyte of model).
Key Contributions
- Systematic efficiency framework: Introduces a metric (mAP₅₀ per unit of model size) to compare models on a fair resource‑budget basis.
- Empirical inversion of scaling laws: Demonstrates that the smallest YOLO 11 N model is 24× more efficient than the largest YOLO 11 X while also achieving the highest absolute mAP₅₀ (0.617).
- Resolution as the dominant lever: Shows that increasing input resolution yields up to +120 % efficiency gain, dwarfing the marginal benefits of adding more training data at low resolutions.
- Pareto‑dominance across 44 deployment scenarios: Small, high‑resolution configurations dominate the accuracy‑throughput trade‑off space, eliminating the need for a classic “accuracy vs. speed” compromise.
- Domain‑specific insight for Earth observation (EO): Provides the first large‑scale, data‑scarce analysis of scaling laws for small‑object detection in satellite imagery.
Methodology
- Dataset & Task – The authors curated a rooftop PV detection benchmark from high‑resolution satellite images of Madagascar, a classic “small‑object” problem where each PV panel occupies only a few pixels.
- Scaling Dimensions
- Model size: YOLO 11 variants ranging from the ultra‑light YOLO 11 N (≈2.6 M parameters) to the heavyweight YOLO 11 X (≈57 M parameters).
- Dataset size: Sub‑samples of the training set (10 %, 30 %, 60 %, 100 %).
- Input resolution: Four resolutions (640×640, 960×960, 1280×1280, 1600×1600).
- Training Protocol – All models were trained with identical hyper‑parameters (learning rate schedule, optimizer, augmentation) to isolate the effect of the three scaling knobs.
- Efficiency Metric – For each configuration, the authors compute mAP₅₀ / model‑size (MB), allowing a direct comparison of “accuracy per byte”.
- Pareto Analysis – The 44 deployment configurations (combinations of model variant, input resolution, and training‑set fraction) are plotted in an accuracy‑throughput space; configurations not dominated by any other are identified as Pareto‑optimal.
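The efficiency metric and the Pareto screening above can be sketched in a few lines. The numbers below are illustrative placeholders, not the paper's measurements (except the 0.617 mAP₅₀ reported for YOLO 11 N at 1600×1600); configuration names and model sizes are assumptions for demonstration:

```python
from dataclasses import dataclass

@dataclass
class Config:
    name: str
    map50: float       # detection accuracy (mAP@50)
    size_mb: float     # model size on disk, in MB
    throughput: float  # inference speed, images/second

    @property
    def efficiency(self) -> float:
        # The paper's "accuracy per byte" metric: mAP50 / model size (MB).
        return self.map50 / self.size_mb

def pareto_frontier(configs):
    """Return the configs not dominated in the (mAP50, throughput) plane.

    A config is dominated when another config is at least as good on
    both axes and strictly better on at least one.
    """
    return [
        c for c in configs
        if not any(
            o.map50 >= c.map50 and o.throughput >= c.throughput
            and (o.map50 > c.map50 or o.throughput > c.throughput)
            for o in configs
        )
    ]

# Illustrative configurations (sizes and throughputs are hypothetical).
configs = [
    Config("yolo11n@1600", 0.617, 5.5, 40.0),
    Config("yolo11x@1600", 0.600, 110.0, 6.0),
    Config("yolo11n@640", 0.500, 5.5, 120.0),
]
```

With these placeholder numbers, the large model is dominated by the small high-resolution one (worse on both axes), so only the two small-model configurations remain on the frontier, which is exactly the pattern the paper reports.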
Results & Findings
| Scaling Lever | Impact on mAP₅₀ | Impact on Efficiency (mAP₅₀/MB) |
|---|---|---|
| Model size (YOLO 11 N → YOLO 11 X) | +0.02 mAP₅₀ (tiny gain) | 24× lower (efficiency collapses) |
| Resolution (640 → 1600) | +0.12 mAP₅₀ | +120 % efficiency boost |
| Dataset size (10 % → 100 %) | +0.01–0.03 mAP₅₀ (negligible) | No measurable efficiency change |
- YOLO 11 N at 1600×1600 achieved the best absolute mAP₅₀ (0.617) and the highest efficiency, beating every larger model even when they used the same or higher resolution.
- Adding more labeled images gave diminishing returns, especially when the resolution was low; the model quickly saturated on the information available in each pixel.
- Across all 44 deployment setups, the small, high‑resolution configurations sat on the Pareto frontier, meaning no other configuration could improve accuracy without sacrificing throughput, or vice versa.
Practical Implications
- Model selection for edge/IoT devices – When deploying CV on satellites, drones, or on‑board processors with strict memory limits, developers should prioritize higher input resolution over bigger backbones.
- Cost‑effective data collection – In data‑scarce EO projects, investing heavily in labeling more imagery may not pay off; instead, allocate resources to acquire higher‑resolution sensors or to up‑sample existing data.
- Simplified pipeline – Smaller models reduce inference latency, power consumption, and simplify containerization, enabling real‑time monitoring of rooftop PV installations for grid operators or NGOs.
- Generalizable recipe – The efficiency‑first evaluation can be applied to other small‑object detection domains (e.g., wildlife counting, traffic sign detection) where the object occupies few pixels.
Limitations & Future Work
- Domain specificity – The study focuses on rooftop PV detection in a single geographic region; results may differ for other object classes or terrains.
- Hardware‑agnostic metric – Efficiency is measured per megabyte of model size, not per FLOPs or actual wall‑clock latency on specific hardware; future work could incorporate device‑specific benchmarks.
- Resolution ceiling – Extremely high resolutions may hit memory limits on some edge devices; exploring tiling or multi‑scale inference strategies would be valuable.
- Model families – Only YOLO 11 variants were examined; extending the analysis to transformer‑based detectors or lightweight CNNs (e.g., MobileNet‑V3) could confirm whether the observed inversion holds more broadly.
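The tiling strategy suggested under "Resolution ceiling" can be sketched as a sliding‑window splitter that covers a large scene with overlapping crops, so a border‑straddling PV panel appears whole in at least one tile. The tile and overlap sizes below are illustrative assumptions, not values from the paper:

```python
def tile_image(height, width, tile=1600, overlap=160):
    """Return (y0, x0, y1, x1) windows covering an image.

    Adjacent windows overlap by `overlap` pixels so that small objects
    cut by a tile border are fully contained in a neighboring tile.
    """
    stride = tile - overlap
    ys = list(range(0, max(height - tile, 0) + 1, stride))
    xs = list(range(0, max(width - tile, 0) + 1, stride))
    # Ensure the bottom and right edges of the image are covered.
    if ys[-1] + tile < height:
        ys.append(height - tile)
    if xs[-1] + tile < width:
        xs.append(width - tile)
    return [
        (y, x, min(y + tile, height), min(x + tile, width))
        for y in ys for x in xs
    ]
```

Each window can then be run through the detector at full resolution, with the resulting boxes shifted back by the tile offset and de-duplicated in the overlap regions (e.g., via non-maximum suppression).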
Authors
- Kwame Mbobda‑Kuate
- Gabriel Kasmi
Paper Information
- arXiv ID: 2603.02142v1
- Categories: cs.CV, cs.LG
- Published: March 2, 2026