[Paper] A Comparative Study of Custom CNNs, Pre-trained Models, and Transfer Learning Across Multiple Visual Datasets

Published: January 5, 2026 at 11:26 AM EST
4 min read

Source: arXiv - 2601.02246v1

Overview

This study puts the three most common ways of deploying convolutional neural networks—building a small custom model from scratch, using a large pre‑trained network as a frozen feature extractor, and fine‑tuning a pre‑trained backbone—head‑to‑head on five real‑world image classification tasks. By measuring both predictive quality (accuracy, macro F1) and resource usage (training time, parameter count), the paper gives developers a data‑driven guide for picking the right strategy under different compute budgets.

Key Contributions

  • Controlled benchmark across five diverse visual datasets (road‑surface defects, crop varieties, plant disease, pedestrian walkway encroachment, and unauthorized vehicle detection).
  • Side‑by‑side comparison of three CNN deployment paradigms: (1) custom lightweight CNN trained from scratch, (2) frozen pre‑trained CNN used as a static feature extractor, and (3) transfer learning with partial/full fine‑tuning.
  • Multi‑metric evaluation that combines predictive performance (accuracy, macro F1) with efficiency indicators (training time per epoch, total parameters, memory footprint); a minimal measurement sketch follows this list.
  • Practical decision matrix that maps dataset characteristics and hardware constraints to the most suitable modeling approach.
  • Open‑source reproducibility package (code, configs, and trained checkpoints) that lets practitioners replicate the experiments on their own data.
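
As a rough illustration of the efficiency indicators listed above, the snippet below counts trainable parameters and times one training epoch in PyTorch. It is a minimal sketch under assumed conventions, not the paper's benchmarking code; `count_parameters`, `time_one_epoch`, and the loop details are illustrative names and choices.

```python
import time
import torch

def count_parameters(model: torch.nn.Module) -> int:
    # Total trainable parameters (the paper reports this in millions).
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def time_one_epoch(model, loader, optimizer, criterion, device="cuda"):
    # Wall-clock seconds for one full training pass over the dataloader.
    model.train()
    model.to(device)
    start = time.perf_counter()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    return time.perf_counter() - start
```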

Methodology

  1. Datasets – Five publicly available image sets were curated, each representing a distinct domain and class imbalance profile. All images were resized to 224 × 224 px for consistency.
  2. Model families
    • Custom CNN: a 4‑layer architecture (~0.9 M parameters) designed for low‑latency inference.
    • Pre‑trained feature extractor: ResNet‑50, EfficientNet‑B0, and MobileNet‑V2 pretrained on ImageNet, with their convolutional stacks frozen and only a linear classifier trained on top.
    • Transfer learning: the same backbones fine‑tuned either (a) only the classifier head, (b) the last two blocks, or (c) the entire network. A PyTorch sketch of all three paradigms appears after this list.
  3. Training protocol – All experiments used the same optimizer (AdamW), learning‑rate schedule (cosine annealing), batch size (32), and early‑stopping criteria. Hyper‑parameters were tuned via a small grid search per paradigm to avoid bias.
  4. Metrics – Classification accuracy and macro‑averaged F1‑score capture overall and class‑balanced performance. Training time per epoch and total parameter count serve as proxies for compute and memory cost.
  5. Statistical validation – Each configuration was run three times with different random seeds; results are reported as mean ± standard deviation, and paired t‑tests assess significance between paradigms. A short evaluation sketch below illustrates these metrics and tests.
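
To make the three paradigms concrete, here is a minimal PyTorch/torchvision sketch of how they can be set up, together with the shared AdamW + cosine‑annealing protocol from step 3. The custom CNN's layer sizes, the choice of ResNet‑50 stages to unfreeze (layer3/layer4 as "the last two blocks"), and all hyper‑parameter values are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

NUM_CLASSES = 5  # hypothetical; set per dataset

# Step 1: all images resized to 224 x 224 px for consistency.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# (1) Custom lightweight CNN trained from scratch (4 conv layers; the paper's
#     version has ~0.9 M parameters, this sketch is intentionally small).
custom_cnn = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(64, NUM_CLASSES),
)

# (2) Frozen pre-trained extractor: freeze the ImageNet backbone, train only a new head.
frozen = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in frozen.parameters():
    p.requires_grad = False
frozen.fc = nn.Linear(frozen.fc.in_features, NUM_CLASSES)  # only this layer trains

# (3) Transfer learning: unfreeze the last two residual stages plus the new head.
finetune = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in finetune.parameters():
    p.requires_grad = False
for stage in (finetune.layer3, finetune.layer4):
    for p in stage.parameters():
        p.requires_grad = True
finetune.fc = nn.Linear(finetune.fc.in_features, NUM_CLASSES)

# Shared protocol (step 3): AdamW with a cosine-annealing schedule.
trainable = [p for p in finetune.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
```

Swapping finetune for frozen or custom_cnn in the optimizer setup reproduces the other two paradigms under the same protocol.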

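Steps 4 and 5 can be illustrated with a short evaluation sketch: macro‑averaged F1 via scikit‑learn, mean ± standard deviation over three seeds, and a paired t‑test between two paradigms via SciPy. The per‑seed accuracy values below are placeholders, not results from the paper.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred):
    # Overall and class-balanced performance for one trained model.
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }

# Per-seed accuracies for two paradigms on the same dataset (hypothetical numbers).
finetune_acc = np.array([0.851, 0.847, 0.849])
scratch_acc = np.array([0.781, 0.786, 0.785])

print(f"fine-tune: {finetune_acc.mean():.3f} ± {finetune_acc.std(ddof=1):.3f}")
print(f"scratch:   {scratch_acc.mean():.3f} ± {scratch_acc.std(ddof=1):.3f}")

# Paired t-test across matched seeds/splits to assess significance.
t_stat, p_value = ttest_rel(finetune_acc, scratch_acc)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```
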
Results & Findings

| Paradigm | Avg. Accuracy | Avg. Macro F1 | Params (M) | Training time / epoch (s) |
| --- | --- | --- | --- | --- |
| Custom CNN (scratch) | 78.4 % | 0.71 | 0.9 | 12 |
| Frozen pre‑trained extractor | 74.1 % | 0.66 | 7.8 (ResNet‑50) | 15 |
| Transfer learning (fine‑tune last 2 blocks) | 84.9 % | 0.78 | 7.8 | 22 |
| Transfer learning (full fine‑tune) | 84.3 % | 0.77 | 7.8 | 28 |

Key takeaways

  • Fine‑tuning consistently outperforms both the custom CNN and the frozen extractor, delivering a 6–10 percentage‑point boost in accuracy across all datasets.
  • Custom CNNs shine when resources are tight: they achieve respectable performance with <1 M parameters and the fastest epoch time, making them ideal for edge devices or rapid prototyping.
  • Frozen feature extraction lags behind in both accuracy and macro F1, especially on datasets with domain‑specific textures (e.g., road‑surface cracks).
  • The marginal gain from full‑network fine‑tuning over partial fine‑tuning is small (<1 % accuracy) but comes with a noticeable increase in training time, suggesting diminishing returns for the extra compute.

Practical Implications

  • Edge‑AI deployments (e.g., IoT sensors on bridges or farms) can adopt the lightweight custom CNN without sacrificing too much accuracy, keeping inference latency and power consumption low.
  • Mid‑scale production pipelines (e.g., quality‑control cameras in food processing) benefit most from partial fine‑tuning of a pre‑trained backbone—offering the best trade‑off between model robustness and training cost.
  • Rapid‑iteration research can start with a frozen extractor to get baseline results quickly, then switch to fine‑tuning once the data pipeline stabilizes.
  • Model‑ops teams can use the provided decision matrix to automate the selection of the optimal paradigm based on available GPU memory, training window, and target latency; a toy version of such a rule is sketched after this list.
  • The study’s open‑source suite makes it straightforward to plug in a new dataset and let the same benchmarking script recommend a strategy, accelerating time‑to‑value for visual AI projects.
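
As a toy illustration of how such an automated choice could look, the rule below maps rough GPU-memory, training-time, and latency budgets to one of the three paradigms. The thresholds and the function name (recommend_paradigm) are hypothetical, not the paper's actual decision matrix.

```python
def recommend_paradigm(gpu_memory_gb: float, training_hours: float,
                       latency_budget_ms: float) -> str:
    # Illustrative thresholds only; tune against your own hardware and data.
    if latency_budget_ms < 10 or gpu_memory_gb < 2:
        return "custom CNN from scratch"          # edge devices, tight latency/memory
    if training_hours < 1:
        return "frozen pre-trained extractor"     # quick baseline, minimal training
    return "transfer learning (fine-tune last blocks)"  # best accuracy/cost trade-off

print(recommend_paradigm(gpu_memory_gb=1.5, training_hours=0.5, latency_budget_ms=5))
```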

Limitations & Future Work

  • The experiments are limited to ImageNet‑pre‑trained backbones; newer self‑supervised or domain‑specific pre‑training could shift the balance.
  • Only classification tasks were examined; detection or segmentation pipelines may exhibit different trade‑offs.
  • Hardware diversity (e.g., TPU, low‑power microcontrollers) was not explored; performance on non‑GPU platforms could alter the efficiency conclusions.
  • Future research could extend the benchmark to larger‑scale datasets, incorporate neural architecture search for custom models, and evaluate inference‑time metrics on real edge hardware.

Authors

  • Annoor Sharara Akhand

Paper Information

  • arXiv ID: 2601.02246v1
  • Categories: cs.CV, cs.AI, cs.LG
  • Published: January 5, 2026