[Paper] A Comparative Study of Custom CNNs, Pre-trained Models, and Transfer Learning Across Multiple Visual Datasets
Source: arXiv - 2601.02246v1
Overview
This study puts the three most common ways of deploying convolutional neural networks—building a small custom model from scratch, using a large pre‑trained network as a frozen feature extractor, and fine‑tuning a pre‑trained backbone—head‑to‑head on five real‑world image classification tasks. By measuring both predictive quality (accuracy, macro F1) and resource usage (training time, parameter count), the paper gives developers a data‑driven guide for picking the right strategy under different compute budgets.
Key Contributions
- Controlled benchmark across five diverse visual datasets (road‑surface defects, crop varieties, plant disease, pedestrian walkway encroachment, and unauthorized vehicle detection).
- Side‑by‑side comparison of three CNN deployment paradigms: (1) custom lightweight CNN trained from scratch, (2) frozen pre‑trained CNN used as a static feature extractor, and (3) transfer learning with partial/full fine‑tuning.
- Multi‑metric evaluation that combines predictive performance (accuracy, macro F1) with efficiency indicators (training time per epoch, total parameters, memory footprint).
- Practical decision matrix that maps dataset characteristics and hardware constraints to the most suitable modeling approach.
- Open‑source reproducibility package (code, configs, and trained checkpoints) that lets practitioners replicate the experiments on their own data.
Methodology
- Datasets – Five publicly available image sets were curated, each representing a distinct domain and class imbalance profile. All images were resized to 224 × 224 px for consistency.
- Model families
- Custom CNN: a 4‑layer architecture (~0.9 M parameters) designed for low‑latency inference.
- Pre‑trained feature extractor: ResNet‑50, EfficientNet‑B0, and MobileNet‑V2 pretrained on ImageNet, with their convolutional stacks frozen and only a linear classifier trained on top.
- Transfer learning: the same backbones with fine‑tuning applied to (a) the classifier head only, (b) the last two blocks, or (c) the entire network (see the setup sketch after this list).
- Training protocol – All experiments used the same optimizer (AdamW), learning‑rate schedule (cosine annealing), batch size (32), and early‑stopping criterion; a training‑loop sketch follows this list. Hyper‑parameters were tuned via a small grid search per paradigm to avoid bias.
- Metrics – Classification accuracy and macro‑averaged F1‑score capture overall and class‑balanced performance. Training time per epoch and total parameter count serve as proxies for compute and memory cost.
- Statistical validation – Each configuration was run three times with different random seeds; results are reported as mean ± standard deviation, and paired t‑tests assess significance between paradigms.
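To make the paradigm setup concrete, here is a minimal PyTorch sketch of the frozen‑extractor and partial fine‑tuning configurations, assuming a torchvision ResNet‑50 stands in for the paper's backbones; the exact head design and layer grouping used in the paper are not specified in this summary.

```python
import torch.nn as nn
from torchvision import models


def build_frozen_extractor(num_classes: int) -> nn.Module:
    """Paradigm 2: frozen ImageNet backbone used as a static feature extractor."""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    for p in model.parameters():
        p.requires_grad = False                                # freeze the entire convolutional stack
    model.fc = nn.Linear(model.fc.in_features, num_classes)   # new linear head, trainable by default
    return model


def build_partial_finetune(num_classes: int) -> nn.Module:
    """Paradigm 3b: fine-tune the last two residual stages plus the classifier head."""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    for p in model.parameters():
        p.requires_grad = False
    for stage in (model.layer3, model.layer4):                 # "last two blocks" mapped to layer3/layer4 (assumption)
        for p in stage.parameters():
            p.requires_grad = True
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```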
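A companion sketch of the shared training protocol and evaluation metrics (AdamW, cosine annealing, accuracy, and macro F1). The epoch count, learning rate, and early‑stopping patience shown here are placeholders, since the paper's exact values are not given in this summary.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from sklearn.metrics import accuracy_score, f1_score


def train(model, train_loader, val_loader, epochs=50, lr=1e-3, patience=5, device="cuda"):
    model.to(device)
    criterion = torch.nn.CrossEntropyLoss()
    # only the parameters left trainable by the chosen paradigm are optimized
    optimizer = AdamW((p for p in model.parameters() if p.requires_grad), lr=lr)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)

    best_f1, stale = 0.0, 0
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:                  # batches of 32, images resized to 224x224 upstream
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        scheduler.step()

        # validation: accuracy and macro-averaged F1, the paper's two quality metrics
        model.eval()
        preds, labels = [], []
        with torch.no_grad():
            for x, y in val_loader:
                preds.extend(model(x.to(device)).argmax(dim=1).cpu().tolist())
                labels.extend(y.tolist())
        acc = accuracy_score(labels, preds)
        macro_f1 = f1_score(labels, preds, average="macro")
        print(f"epoch {epoch}: acc={acc:.3f} macro_f1={macro_f1:.3f}")

        if macro_f1 > best_f1:                     # simple early stopping on validation macro F1
            best_f1, stale = macro_f1, 0
        else:
            stale += 1
            if stale >= patience:
                break
```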
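For the statistical‑validation step, a short sketch of how the mean ± standard deviation and a paired t‑test over per‑seed scores can be computed; the numbers below are illustrative, not the paper's results.

```python
import numpy as np
from scipy.stats import ttest_rel

# macro-F1 from three seeds for two paradigms on the same dataset (illustrative values only)
finetune_f1 = np.array([0.79, 0.78, 0.77])
frozen_f1 = np.array([0.66, 0.67, 0.65])

print(f"fine-tune: {finetune_f1.mean():.3f} ± {finetune_f1.std(ddof=1):.3f}")
t_stat, p_value = ttest_rel(finetune_f1, frozen_f1)   # paired t-test across matched seeds
print(f"paired t-test: t={t_stat:.2f}, p={p_value:.4f}")
```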
Results & Findings
| Paradigm | Avg. Accuracy | Avg. Macro F1 | Params (M) | Training time / epoch (s) |
|---|---|---|---|---|
| Custom CNN (scratch) | 78.4 % | 0.71 | 0.9 | 12 |
| Frozen pre‑trained extractor | 74.1 % | 0.66 | 7.8 (ResNet‑50) | 15 |
| Transfer learning (fine‑tune last 2 blocks) | 84.9 % | 0.78 | 7.8 | 22 |
| Transfer learning (full fine‑tune) | 84.3 % | 0.77 | 7.8 | 28 |
Key takeaways
- Fine‑tuning consistently outperforms both the custom CNN and the frozen extractor, delivering a 6–10 percentage‑point boost in accuracy across all datasets.
- Custom CNNs shine when resources are tight: they achieve respectable performance with <1 M parameters and the fastest epoch time, making them ideal for edge devices or rapid prototyping.
- Frozen feature extraction lags behind in both accuracy and macro F1, especially on datasets with domain‑specific textures (e.g., road‑surface cracks).
- The marginal gain from full‑network fine‑tuning over partial fine‑tuning is small (under 1 percentage point of accuracy) but comes with a noticeable increase in training time, suggesting diminishing returns for the extra compute.
Practical Implications
- Edge‑AI deployments (e.g., IoT sensors on bridges or farms) can adopt the lightweight custom CNN without sacrificing too much accuracy, keeping inference latency and power consumption low.
- Mid‑scale production pipelines (e.g., quality‑control cameras in food processing) benefit most from partial fine‑tuning of a pre‑trained backbone—offering the best trade‑off between model robustness and training cost.
- Rapid‑iteration research can start with a frozen extractor to get baseline results quickly, then switch to fine‑tuning once the data pipeline stabilizes.
- Model‑ops teams can use the provided decision matrix to automate the selection of the optimal paradigm based on available GPU memory, training window, and target latency (a hypothetical encoding is sketched after this list).
- The study’s open‑source suite makes it straightforward to plug in a new dataset and let the same benchmarking script recommend a strategy, accelerating time‑to‑value for visual AI projects.
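The decision matrix itself is not reproduced in this summary, but a hypothetical encoding of its qualitative guidance might look like the following; the function name and thresholds are illustrative placeholders, not values taken from the paper.

```python
def recommend_paradigm(labeled_images: int, gpu_memory_gb: float,
                       edge_deployment: bool) -> str:
    """Hypothetical selector mirroring the study's qualitative guidance.

    Thresholds are illustrative placeholders, not values from the paper.
    """
    if edge_deployment or gpu_memory_gb < 4:
        # tight compute budget: small custom CNN trained from scratch
        return "custom_cnn_from_scratch"
    if labeled_images < 2_000:
        # little data, quick baseline: frozen pre-trained feature extractor
        return "frozen_feature_extractor"
    # best accuracy/cost trade-off reported in the study
    return "finetune_last_two_blocks"
```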
Limitations & Future Work
- The experiments are limited to ImageNet‑pre‑trained backbones; newer self‑supervised or domain‑specific pre‑training could shift the balance.
- Only classification tasks were examined; detection or segmentation pipelines may exhibit different trade‑offs.
- Hardware diversity (e.g., TPU, low‑power microcontrollers) was not explored; performance on non‑GPU platforms could alter the efficiency conclusions.
- Future research could extend the benchmark to larger‑scale datasets, incorporate neural architecture search for custom models, and evaluate inference‑time metrics on real edge hardware.
Authors
- Annoor Sharara Akhand
Paper Information
- arXiv ID: 2601.02246v1
- Categories: cs.CV, cs.AI, cs.LG
- Published: January 5, 2026