[Paper] Label-Efficient School Detection from Aerial Imagery via Weakly Supervised Pretraining and Fine-Tuning

Published: (May 5, 2026 at 12:51 PM EDT)
Source: arXiv - 2605.03968v1

Overview

Detecting schools in high‑resolution aerial images is a critical step for NGOs, governments, and telecom operators that need to plan infrastructure, allocate resources, or roll out internet connectivity to underserved regions. This paper presents a weakly supervised, label‑efficient pipeline that can train accurate school detectors with only a handful of manually annotated images, leveraging automatically generated labels from sparse location points and semantic segmentation.

Key Contributions

  • Two‑stage training framework: first pre‑train on automatically generated bounding boxes, then fine‑tune on a tiny clean set (as few as 50 images).
  • Automatic labeling pipeline that converts sparse GPS points into segmentation masks and then into object‑level bounding boxes without human drawing.
  • Demonstrated strong detection performance in low‑data regimes, outperforming fully supervised baselines when manual labels are scarce.
  • Open‑source release of models, code, and the auto‑labeled dataset to accelerate research and real‑world deployments.

Methodology

  1. Data Sources

    • Sparse location points (e.g., GPS coordinates of known schools) obtained from public registries or crowd‑sourced maps.
    • High‑resolution aerial imagery covering the same geographic area.
  2. Automatic Label Generation

    • Run a semantic segmentation network (trained on generic building footprints) over the imagery.
    • Overlay the sparse points onto the segmentation map; the intersecting building‑like regions are extracted as school masks.
    • Convert each mask into a tight bounding box to serve as a pseudo‑label for object detection.
  3. Two‑Stage Training

    • Stage 1 – Weakly supervised pre‑training: Train a standard object detector (e.g., Faster R‑CNN, YOLOv8) on the large set of auto‑labeled boxes. The model learns a generic “school‑like” visual representation.
    • Stage 2 – Fine‑tuning: Use a small, manually verified dataset (≈ 50 images) to refine the detector, correcting noise introduced in Stage 1 and improving localization precision.
  4. Evaluation

    • Standard object detection metrics (AP@0.5, AP@0.75) on a held‑out test set with high‑quality annotations.
    • Ablation studies comparing: (a) fully supervised training with the same 50 images, (b) only Stage 1, and (c) the full two‑stage pipeline.
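The label-generation step (step 2 above) can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The function name `boxes_from_points`, the use of 4-connectivity, and the toy mask are all assumptions made for illustration, and the school points are assumed to be already projected from GPS into pixel coordinates.

```python
from collections import deque


def boxes_from_points(mask, points):
    """Return one tight (row_min, col_min, row_max, col_max) box for each
    building component that contains at least one school point.

    mask   -- 2D list of 0/1 values (1 = building pixel); a stand-in for
              the segmentation network's output (assumption).
    points -- iterable of (row, col) pixel positions of known schools
              (assumed already converted from GPS coordinates).
    Box coordinates are inclusive pixel indices.
    """
    height, width = len(mask), len(mask[0])
    seen = set()  # pixels that already belong to a boxed component
    boxes = []
    for r0, c0 in points:
        start = (r0, c0)
        # Skip points outside the image, on background, or inside a
        # component that an earlier point has already labeled.
        if not (0 <= r0 < height and 0 <= c0 < width):
            continue
        if mask[r0][c0] != 1 or start in seen:
            continue
        # Breadth-first flood fill over 4-connected building pixels,
        # tracking the extent of the component as we go.
        queue = deque([start])
        seen.add(start)
        rmin = rmax = r0
        cmin = cmax = c0
        while queue:
            r, c = queue.popleft()
            rmin, rmax = min(rmin, r), max(rmax, r)
            cmin, cmax = min(cmin, c), max(cmax, c)
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if (0 <= nr < height and 0 <= nc < width
                        and mask[nr][nc] == 1 and (nr, nc) not in seen):
                    seen.add((nr, nc))
                    queue.append((nr, nc))
        boxes.append((rmin, cmin, rmax, cmax))
    return boxes


# Toy example: two building blobs, one school point on the left blob and
# one point that falls on background (simulating a location error).
toy_mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
toy_points = [(1, 2), (0, 5)]
pseudo_boxes = boxes_from_points(toy_mask, toy_points)
print(pseudo_boxes)
```

In this sketch a point that misses every building region is silently dropped, and a building with no point on it never becomes a pseudo-label. A real pipeline would likely need a tolerance radius around each point and some filtering of implausibly small or large regions; those choices are not described in this summary.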

Results & Findings

Training regime                        AP@0.5   AP@0.75
Fully supervised (50 manual images)    0.42     0.21
Stage 1 only (auto‑labels)             0.48     0.24
Two‑stage (auto‑labels + 50 manual)    0.66     0.38

  • The two‑stage approach outperforms fully supervised training on the same 50 clean images by a wide margin: 0.66 vs. 0.42 AP@0.5 and 0.38 vs. 0.21 AP@0.75.
  • Performance plateaus after roughly 50 manual images: additional manual data yields diminishing returns, which supports the method’s label efficiency.
  • Visual inspection shows the detector reliably finds schools in varied contexts (urban blocks, rural clusters, different roof materials) even when the auto‑labels contain noise.
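The two metric columns differ only in how strictly a predicted box must overlap a ground-truth box. By the usual convention, AP@0.5 counts a prediction as a match when its intersection over union (IoU) with a ground-truth box is at least 0.5, and AP@0.75 requires at least 0.75. The sketch below shows only that matching criterion, not the full average-precision computation (which also ranks predictions by confidence and integrates a precision-recall curve). The function names and the example boxes are illustrative assumptions, and the paper's exact evaluation protocol is not described in this summary.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap rectangle; zero if the boxes are disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, truth_box, threshold):
    """True if the prediction overlaps the ground truth enough to count."""
    return iou(pred_box, truth_box) >= threshold


# Hypothetical boxes: the prediction is shifted 2 units to the right.
truth = (0, 0, 10, 10)
pred = (2, 0, 12, 10)

overlap = iou(pred, truth)                # 80 / 120
match_at_050 = is_match(pred, truth, 0.5)
match_at_075 = is_match(pred, truth, 0.75)
print(round(overlap, 3), match_at_050, match_at_075)
```

Here the same prediction counts as correct at the 0.5 threshold but not at 0.75, which is why AP@0.75 is the more sensitive indicator of localization quality and why it is consistently lower than AP@0.5 in the table.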

Practical Implications

  • Scalable mapping for NGOs & governments: Organizations can bootstrap a school‑detection model with only a few dozen verified sites, then roll it out across entire countries using the auto‑label pipeline.
  • Rapid assessment for connectivity projects: Telecom operators can quickly estimate the number and distribution of schools to prioritize broadband rollout, reducing costly field surveys.
  • Cost reduction: Manual annotation budgets shrink dramatically. Work that previously required thousands of hours of labeling can now be achieved with a few days of expert verification.
  • Extensibility: The same weakly supervised recipe can be adapted to other infrastructure types (clinics, water tanks, solar panels) by swapping in a segmentation model and location points suited to the target class.

Limitations & Future Work

  • Quality of auto‑labels depends on the segmentation model; in regions with atypical building styles or heavy vegetation, masks may be noisy, limiting Stage 1 learning.
  • The approach assumes accurate GPS points; systematic location errors can propagate into mislabeled boxes.
  • Experiments were limited to a few geographic regions; broader cross‑continental validation is needed to confirm robustness to diverse imaging conditions.
  • Future directions include: (a) incorporating multimodal data (e.g., SAR, multispectral) to improve mask generation, (b) exploring self‑training or contrastive learning to further reduce reliance on any manual labels, and (c) building an active‑learning loop where the model requests the most informative manual annotations.

Authors

  • Zakarya Elmimouni
  • Fares Fourati
  • Mohamed‑Slim Alouini

Paper Information

  • arXiv ID: 2605.03968v1
  • Categories: cs.CV, cs.AI, cs.LG
  • Published: May 5, 2026