[Paper] Training a Custom CNN on Five Heterogeneous Image Datasets

Published: January 8, 2026 at 03:44 AM EST
4 min read
Source: arXiv - 2601.04727v1

Overview

This paper evaluates how a lightweight, custom‑built Convolutional Neural Network (CNN) stacks up against heavyweight, off‑the‑shelf models (ResNet‑18, VGG‑16) on five very different image collections—from mango‑variety sorting on farms to road‑surface monitoring in cities. By training each model from scratch and with transfer learning, the authors expose the trade‑offs between model size, data volume, and real‑world robustness, offering a practical guide for engineers who need accurate vision solutions on limited hardware.

Key Contributions

  • Custom CNN design – a compact architecture (≈ 0.9 M parameters) that runs comfortably on edge devices while delivering competitive accuracy across all five tasks.
  • Systematic benchmark – side‑by‑side comparison of the custom model, ResNet‑18, and VGG‑16 under three training regimes: (i) random initialization, (ii) ImageNet‑pretrained weights (transfer learning), and (iii) fine‑tuned on each dataset.
  • Cross‑domain analysis – insight into how illumination variance, resolution differences, and class imbalance affect convergence and generalization for each architecture.
  • Guidelines for data‑constrained scenarios – clear recommendations on when transfer learning outweighs the cost of larger models, especially for small or noisy datasets.

Methodology

  1. Datasets – Five publicly released collections covering agricultural (mango, paddy) and urban (road condition, auto‑rickshaw detection, footpath encroachment) domains. Sizes range from ~1 k to ~12 k images, with 2–8 classes per task.
  2. Pre‑processing & augmentation – Uniform resizing to 224 × 224, per‑channel mean subtraction, and on‑the‑fly augmentations (random flips, rotations, brightness jitter) to mitigate class imbalance and illumination shifts (a transforms sketch follows this list).
  3. Model architectures
    • Custom CNN: 3 convolutional blocks (3×3 kernels, batch‑norm, ReLU) → global average pooling → 1 fully‑connected classifier (sketched in PyTorch after this list).
    • ResNet‑18 and VGG‑16: standard PyTorch implementations.
  4. Training regimes
    • Scratch: random weight init, Adam optimizer, learning rate = 1e‑3, cosine annealing.
    • Transfer: load ImageNet weights, freeze early layers (first 2 blocks), fine‑tune remaining layers with a reduced LR (1e‑4); see the fine‑tuning sketch after this list.
  5. Evaluation – 5‑fold cross‑validation; metrics include overall accuracy, per‑class F1, and inference latency on a Raspberry Pi 4 (CPU) and an NVIDIA Jetson Nano (GPU).
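
As a concrete illustration of step 2, here is a minimal torchvision pipeline matching the described recipe. The rotation range, jitter strength, and normalization statistics are illustrative assumptions; the paper specifies per‑channel mean subtraction but not exact values.

```python
# Sketch of the pre-processing/augmentation recipe (step 2).
# Parameter values (rotation range, jitter strength, ImageNet
# normalization stats) are assumptions, not taken from the paper.
import torchvision.transforms as T

train_transform = T.Compose([
    T.Resize((224, 224)),                    # uniform resizing
    T.RandomHorizontalFlip(p=0.5),           # random flips
    T.RandomRotation(degrees=15),            # random rotations
    T.ColorJitter(brightness=0.3),           # brightness jitter
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],  # per-channel mean subtraction
                std=[0.229, 0.224, 0.225]),  # (ImageNet stats, assumed)
])

eval_transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])
```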
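The custom architecture (step 3) fits in a few lines of PyTorch. The summary does not give channel widths; the values below (96/192/384) are assumptions chosen so the parameter count lands near the reported ≈ 0.9 M.

```python
# Sketch of the custom CNN: 3 conv blocks -> global average pooling -> 1 FC layer.
# Channel widths are assumptions, not reported by the authors.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """One block: 3x3 conv -> batch-norm -> ReLU -> 2x2 max-pool."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class CustomCNN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 96),
            conv_block(96, 192),
            conv_block(192, 384),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)          # global average pooling
        self.classifier = nn.Linear(384, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = self.pool(x).flatten(1)
        return self.classifier(x)

model = CustomCNN(num_classes=8)
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")
```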
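And for the transfer regime (step 4), a sketch of the freeze‑and‑fine‑tune setup on ResNet‑18. Interpreting "first 2 blocks" as the stem plus the first two residual stages is an assumption, as is reusing cosine annealing here (the paper lists it explicitly only for the scratch regime).

```python
# Sketch of the transfer regime: ImageNet weights, frozen early layers,
# reduced learning rate. "First 2 blocks" is interpreted as the stem plus
# the first two residual stages -- an assumption, not stated in the paper.
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

num_classes = 8  # up to 8 classes per task in the benchmark

model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, num_classes)  # task-specific head

for module in (model.conv1, model.bn1, model.layer1, model.layer2):
    for p in module.parameters():
        p.requires_grad = False  # freeze early layers

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)  # reduced LR from the paper
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
```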

Results & Findings

| Dataset | Model (Transfer unless noted) | Accuracy ↑ | Params (M) | CPU latency (ms) |
|---|---|---|---|---|
| Mango | Custom CNN | 92.1% | 0.9 | 28 |
| Mango | ResNet‑18 | 93.4% | 11.2 | 112 |
| Paddy | VGG‑16 (Scratch) | 88.7% | 14.7 | 140 |
| Road | Custom CNN | 95.3% | 0.9 | 30 |
| Auto‑Rickshaw | ResNet‑18 | 97.0% | 11.2 | 108 |
| Footpath | Custom CNN | 90.5% | 0.9 | 27 |
  • Transfer learning wins on the two smallest datasets (Mango, Paddy); with pretrained weights the custom CNN reaches >90 % accuracy using far fewer parameters than the deeper backbones.
  • Depth matters for the more visually complex task (Auto‑Rickshaw detection); ResNet‑18 gains ~2 % absolute accuracy over the custom model.
  • Inference speed: the custom CNN is 3–4× faster on edge hardware, making it suitable for real‑time monitoring (a simple timing harness is sketched after this list).
  • Class imbalance is largely mitigated by augmentation; however, VGG‑16 still overfits on the smallest sets when trained from scratch.
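
To reproduce latency numbers of this kind, a simple timing harness suffices. This is a generic sketch, not the authors' benchmark code; real edge measurements would be run on the target device (Raspberry Pi 4, Jetson Nano) itself.

```python
# Generic single-image CPU latency measurement (an assumption about how
# such numbers are obtained, not the paper's actual benchmark script).
import time
import torch

@torch.no_grad()
def cpu_latency_ms(model, n_warmup=10, n_runs=100):
    """Average single-image CPU inference latency in milliseconds."""
    model.eval()
    x = torch.randn(1, 3, 224, 224)  # dummy 224x224 RGB input
    for _ in range(n_warmup):        # warm-up passes
        model(x)
    start = time.perf_counter()
    for _ in range(n_runs):
        model(x)
    return (time.perf_counter() - start) / n_runs * 1e3

# Usage: print(f"{cpu_latency_ms(CustomCNN(num_classes=8)):.1f} ms")
```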

Practical Implications

  • Edge deployment – Developers can ship the custom CNN to low‑cost devices (Raspberry Pi, Jetson Nano) for on‑site agricultural sorting or city‑infrastructure monitoring without sacrificing much accuracy.
  • Rapid prototyping – Transfer‑learning pipelines using ImageNet weights cut training time by ~60 % and boost performance on data‑starved domains, a useful shortcut for startups building niche vision products.
  • Resource budgeting – The paper quantifies the trade‑off between model size and latency, helping product managers decide whether a heavier backbone is justified for a given use‑case (e.g., high‑resolution traffic cameras vs. battery‑powered field sensors).
  • Dataset design – The authors’ augmentation recipe (brightness jitter + random rotations) proves effective across illumination‑varying domains and can be reused directly by engineers handling similar heterogeneity.

Limitations & Future Work

  • Dataset scale – All five collections are relatively small (<12 k images); results may differ on large‑scale industrial datasets where deeper networks typically excel.
  • Domain shift – The study does not explore cross‑domain generalization (e.g., training on mango images and testing on a different fruit), leaving open the question of how well the custom CNN transfers without fine‑tuning.
  • Hardware diversity – Benchmarks are limited to two edge platforms; performance on microcontroller‑class devices (e.g., ARM Cortex‑M) remains untested.
  • Future directions suggested by the authors include:
    1. Integrating lightweight attention modules to boost discriminative power.
    2. Exploring self‑supervised pre‑training on unlabeled farm/urban footage.
    3. Extending the evaluation to video‑streaming scenarios for real‑time anomaly detection.

Authors

  • Anika Tabassum
  • Tasnuva Mahazabin Tuba
  • Nafisa Naznin

Paper Information

  • arXiv ID: 2601.04727v1
  • Categories: cs.CV, cs.NE
  • Published: January 8, 2026