[Paper] Training a Custom CNN on Five Heterogeneous Image Datasets

Published: January 8, 2026 at 03:44 AM EST
4 min read
Source: arXiv - 2601.04727v1

Overview

This paper evaluates how a lightweight, custom‑built Convolutional Neural Network (CNN) stacks up against heavyweight, off‑the‑shelf models (ResNet‑18, VGG‑16) on five very different image collections—from mango‑variety sorting on farms to road‑surface monitoring in cities. By training each model from scratch and with transfer learning, the authors expose the trade‑offs between model size, data volume, and real‑world robustness, offering a practical guide for engineers who need accurate vision solutions on limited hardware.

Key Contributions

  • Custom CNN design – a compact architecture (≈ 0.9 M parameters) that runs comfortably on edge devices while delivering competitive accuracy across all five tasks.
  • Systematic benchmark – side‑by‑side comparison of the custom model, ResNet‑18, and VGG‑16 under three training regimes: (i) random initialization, (ii) ImageNet‑pretrained weights (transfer learning), and (iii) fine‑tuned on each dataset.
  • Cross‑domain analysis – insight into how illumination variance, resolution differences, and class imbalance affect convergence and generalization for each architecture.
  • Guidelines for data‑constrained scenarios – clear recommendations on when transfer learning outweighs the cost of larger models, especially for small or noisy datasets.

Methodology

  1. Datasets – Five publicly released collections covering agricultural (mango, paddy) and urban (road condition, auto‑rickshaw detection, footpath encroachment) domains. Sizes range from ~1 k to ~12 k images, with 2–8 classes per task.
  2. Pre‑processing & augmentation – Uniform resizing to 224 × 224, per‑channel mean subtraction, and on‑the‑fly augmentations (random flips, rotations, brightness jitter) to mitigate class imbalance and illumination shifts (a transforms sketch follows this list).
  3. Model architectures
    • Custom CNN: 3 convolutional blocks (3×3 kernels, batch‑norm, ReLU) → global average pooling → 1 fully‑connected classifier (sketched in PyTorch after this list).
    • ResNet‑18 and VGG‑16: standard PyTorch implementations.
  4. Training regimes
    • Scratch: random weight init, Adam optimizer, learning rate = 1e‑3, cosine annealing.
    • Transfer: load ImageNet weights, freeze early layers (first 2 blocks), fine‑tune remaining layers with a reduced LR (1e‑4); see the fine‑tuning sketch after this list.
  5. Evaluation – 5‑fold cross‑validation; metrics include overall accuracy, per‑class F1, and inference latency on a Raspberry Pi 4 (CPU) and an NVIDIA Jetson Nano (GPU).
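
As a concrete illustration of step 2, here is a minimal torchvision pipeline matching the described recipe. The rotation range, jitter strength, and normalization statistics are illustrative assumptions; the paper specifies per‑channel mean subtraction but not exact values.

```python
# Sketch of the pre-processing/augmentation recipe (step 2).
# Parameter values (rotation range, jitter strength, ImageNet
# normalization stats) are assumptions, not taken from the paper.
import torchvision.transforms as T

train_transform = T.Compose([
    T.Resize((224, 224)),                    # uniform resizing
    T.RandomHorizontalFlip(p=0.5),           # random flips
    T.RandomRotation(degrees=15),            # random rotations
    T.ColorJitter(brightness=0.3),           # brightness jitter
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],  # per-channel mean subtraction
                std=[0.229, 0.224, 0.225]),  # (ImageNet stats, assumed)
])

eval_transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])
```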
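The custom architecture (step 3) fits in a few lines of PyTorch. The summary does not give channel widths; the values below (96/192/384) are assumptions chosen so the parameter count lands near the reported ≈ 0.9 M.

```python
# Sketch of the custom CNN: 3 conv blocks -> global average pooling -> 1 FC layer.
# Channel widths are assumptions, not reported by the authors.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """One block: 3x3 conv -> batch-norm -> ReLU -> 2x2 max-pool."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class CustomCNN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 96),
            conv_block(96, 192),
            conv_block(192, 384),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)          # global average pooling
        self.classifier = nn.Linear(384, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = self.pool(x).flatten(1)
        return self.classifier(x)

model = CustomCNN(num_classes=8)
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")
```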
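And for the transfer regime (step 4), a sketch of the freeze‑and‑fine‑tune setup on ResNet‑18. Interpreting "first 2 blocks" as the stem plus the first two residual stages is an assumption, as is reusing cosine annealing here (the paper lists it explicitly only for the scratch regime).

```python
# Sketch of the transfer regime: ImageNet weights, frozen early layers,
# reduced learning rate. "First 2 blocks" is interpreted as the stem plus
# the first two residual stages -- an assumption, not stated in the paper.
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

num_classes = 8  # up to 8 classes per task in the benchmark

model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, num_classes)  # task-specific head

for module in (model.conv1, model.bn1, model.layer1, model.layer2):
    for p in module.parameters():
        p.requires_grad = False  # freeze early layers

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)  # reduced LR from the paper
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
```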

Results & Findings

| Dataset | Model (Transfer unless noted) | Accuracy ↑ | Params (M) | CPU latency (ms) |
|---|---|---|---|---|
| Mango | Custom CNN | 92.1% | 0.9 | 28 |
| Mango | ResNet‑18 | 93.4% | 11.2 | 112 |
| Paddy | VGG‑16 (Scratch) | 88.7% | 14.7 | 140 |
| Road | Custom CNN | 95.3% | 0.9 | 30 |
| Auto‑Rickshaw | ResNet‑18 | 97.0% | 11.2 | 108 |
| Footpath | Custom CNN | 90.5% | 0.9 | 27 |
  • Transfer learning wins on the two smallest datasets (Mango, Paddy); with pretrained weights the custom CNN reaches >90 % accuracy using far fewer parameters than the deeper backbones.
  • Depth matters for the more visually complex task (Auto‑Rickshaw detection); ResNet‑18 gains ~2 % absolute accuracy over the custom model.
  • Inference speed: the custom CNN is 3–4× faster on edge hardware, making it suitable for real‑time monitoring (a simple timing harness is sketched after this list).
  • Class imbalance is largely mitigated by augmentation; however, VGG‑16 still overfits on the smallest sets when trained from scratch.
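
To reproduce latency numbers of this kind, a simple timing harness suffices. This is a generic sketch, not the authors' benchmark code; real edge measurements would be run on the target device (Raspberry Pi 4, Jetson Nano) itself.

```python
# Generic single-image CPU latency measurement (an assumption about how
# such numbers are obtained, not the paper's actual benchmark script).
import time
import torch

@torch.no_grad()
def cpu_latency_ms(model, n_warmup=10, n_runs=100):
    """Average single-image CPU inference latency in milliseconds."""
    model.eval()
    x = torch.randn(1, 3, 224, 224)  # dummy 224x224 RGB input
    for _ in range(n_warmup):        # warm-up passes
        model(x)
    start = time.perf_counter()
    for _ in range(n_runs):
        model(x)
    return (time.perf_counter() - start) / n_runs * 1e3

# Usage: print(f"{cpu_latency_ms(CustomCNN(num_classes=8)):.1f} ms")
```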

Practical Implications

  • Edge deployment – Developers can ship the custom CNN to low‑cost devices (Raspberry Pi, Jetson Nano) for on‑site agricultural sorting or city‑infrastructure monitoring without sacrificing much accuracy.
  • Rapid prototyping – Transfer‑learning pipelines using ImageNet weights cut training time by ~60 % and boost performance on data‑starved domains, a useful shortcut for startups building niche vision products.
  • Resource budgeting – The paper quantifies the trade‑off between model size and latency, helping product managers decide whether a heavier backbone is justified for a given use‑case (e.g., high‑resolution traffic cameras vs. battery‑powered field sensors).
  • Dataset design – The authors’ augmentation recipe (brightness jitter + random rotations) proves effective across illumination‑varying domains and can be reused directly by engineers handling similar heterogeneity.

Limitations & Future Work

  • Dataset scale – All five collections are relatively small (<12 k images); results may differ on large‑scale industrial datasets where deeper networks typically excel.
  • Domain shift – The study does not explore cross‑domain generalization (e.g., training on mango images and testing on a different fruit), leaving open the question of how well the custom CNN transfers without fine‑tuning.
  • Hardware diversity – Benchmarks are limited to two edge platforms; performance on microcontroller‑class devices (e.g., ARM Cortex‑M) remains untested.
  • Future directions suggested by the authors include:
    1. Integrating lightweight attention modules to boost discriminative power.
    2. Exploring self‑supervised pre‑training on unlabeled farm/urban footage.
    3. Extending the evaluation to video‑streaming scenarios for real‑time anomaly detection.

Authors

  • Anika Tabassum
  • Tasnuva Mahazabin Tuba
  • Nafisa Naznin

Paper Information

  • arXiv ID: 2601.04727v1
  • Categories: cs.CV, cs.NE
  • Published: January 8, 2026