[Paper] A Comparative Study of Custom CNNs, Pre-trained Models, and Transfer Learning Across Multiple Visual Datasets
Source: arXiv - 2601.02246v1
Overview
This study puts the three most common ways of deploying convolutional neural networks—building a small custom model from scratch, using a large pre‑trained network as a frozen feature extractor, and fine‑tuning a pre‑trained backbone—head‑to‑head on five real‑world image classification tasks. By measuring both predictive quality (accuracy, macro F1) and resource usage (training time, parameter count), the paper gives developers a data‑driven guide for picking the right strategy under different compute budgets.
Key Contributions
- Controlled benchmark across five diverse visual datasets (road‑surface defects, crop varieties, plant disease, pedestrian walkway encroachment, and unauthorized vehicle detection).
- Side‑by‑side comparison of three CNN deployment paradigms: (1) custom lightweight CNN trained from scratch, (2) frozen pre‑trained CNN used as a static feature extractor, and (3) transfer learning with partial/full fine‑tuning.
- Multi‑metric evaluation that combines predictive performance (accuracy, macro F1) with efficiency indicators (training time per epoch, total parameters, memory footprint).
- Practical decision matrix that maps dataset characteristics and hardware constraints to the most suitable modeling approach.
- Open‑source reproducibility package (code, configs, and trained checkpoints) that lets practitioners replicate the experiments on their own data.
Methodology
- Datasets – Five publicly available image sets were curated, each representing a distinct domain and class imbalance profile. All images were resized to 224 × 224 px for consistency.
- Model families
- Custom CNN: a 4‑layer architecture (~0.9 M parameters) designed for low‑latency inference.
- Pre‑trained feature extractor: ResNet‑50, EfficientNet‑B0, and MobileNet‑V2 pretrained on ImageNet, with their convolutional stacks frozen and only a linear classifier trained on top.
- Transfer learning: the same backbones with fine‑tuning applied to (a) the classifier head only, (b) the last two blocks, or (c) the entire network (see the setup sketch after this list).
- Training protocol – All experiments used the same optimizer (AdamW), learning‑rate schedule (cosine annealing), batch size (32), and early‑stopping criterion; a training‑loop sketch follows this list. Hyper‑parameters were tuned via a small grid search per paradigm to avoid bias.
- Metrics – Classification accuracy and macro‑averaged F1‑score capture overall and class‑balanced performance. Training time per epoch and total parameter count serve as proxies for compute and memory cost.
- Statistical validation – Each configuration was run three times with different random seeds; results are reported as mean ± standard deviation, and paired t‑tests assess significance between paradigms.
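To make the paradigm setup concrete, here is a minimal PyTorch sketch of the frozen‑extractor and partial fine‑tuning configurations, assuming a torchvision ResNet‑50 stands in for the paper's backbones; the exact head design and layer grouping used in the paper are not specified in this summary.

```python
import torch.nn as nn
from torchvision import models


def build_frozen_extractor(num_classes: int) -> nn.Module:
    """Paradigm 2: frozen ImageNet backbone used as a static feature extractor."""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    for p in model.parameters():
        p.requires_grad = False                                # freeze the entire convolutional stack
    model.fc = nn.Linear(model.fc.in_features, num_classes)   # new linear head, trainable by default
    return model


def build_partial_finetune(num_classes: int) -> nn.Module:
    """Paradigm 3b: fine-tune the last two residual stages plus the classifier head."""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    for p in model.parameters():
        p.requires_grad = False
    for stage in (model.layer3, model.layer4):                 # "last two blocks" mapped to layer3/layer4 (assumption)
        for p in stage.parameters():
            p.requires_grad = True
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```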
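A companion sketch of the shared training protocol and evaluation metrics (AdamW, cosine annealing, accuracy, and macro F1). The epoch count, learning rate, and early‑stopping patience shown here are placeholders, since the paper's exact values are not given in this summary.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from sklearn.metrics import accuracy_score, f1_score


def train(model, train_loader, val_loader, epochs=50, lr=1e-3, patience=5, device="cuda"):
    model.to(device)
    criterion = torch.nn.CrossEntropyLoss()
    # only the parameters left trainable by the chosen paradigm are optimized
    optimizer = AdamW((p for p in model.parameters() if p.requires_grad), lr=lr)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)

    best_f1, stale = 0.0, 0
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:                  # batches of 32, images resized to 224x224 upstream
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        scheduler.step()

        # validation: accuracy and macro-averaged F1, the paper's two quality metrics
        model.eval()
        preds, labels = [], []
        with torch.no_grad():
            for x, y in val_loader:
                preds.extend(model(x.to(device)).argmax(dim=1).cpu().tolist())
                labels.extend(y.tolist())
        acc = accuracy_score(labels, preds)
        macro_f1 = f1_score(labels, preds, average="macro")
        print(f"epoch {epoch}: acc={acc:.3f} macro_f1={macro_f1:.3f}")

        if macro_f1 > best_f1:                     # simple early stopping on validation macro F1
            best_f1, stale = macro_f1, 0
        else:
            stale += 1
            if stale >= patience:
                break
```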
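For the statistical‑validation step, a short sketch of how the mean ± standard deviation and a paired t‑test over per‑seed scores can be computed; the numbers below are illustrative, not the paper's results.

```python
import numpy as np
from scipy.stats import ttest_rel

# macro-F1 from three seeds for two paradigms on the same dataset (illustrative values only)
finetune_f1 = np.array([0.79, 0.78, 0.77])
frozen_f1 = np.array([0.66, 0.67, 0.65])

print(f"fine-tune: {finetune_f1.mean():.3f} ± {finetune_f1.std(ddof=1):.3f}")
t_stat, p_value = ttest_rel(finetune_f1, frozen_f1)   # paired t-test across matched seeds
print(f"paired t-test: t={t_stat:.2f}, p={p_value:.4f}")
```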
Results & Findings
| Paradigm | Avg. Accuracy | Avg. Macro F1 | Params (M) | Training time / epoch (s) |
|---|---|---|---|---|
| Custom CNN (scratch) | 78.4 % | 0.71 | 0.9 | 12 |
| Frozen pre‑trained extractor | 74.1 % | 0.66 | 7.8 (ResNet‑50) | 15 |
| Transfer learning (fine‑tune last 2 blocks) | 84.9 % | 0.78 | 7.8 | 22 |
| Transfer learning (full fine‑tune) | 84.3 % | 0.77 | 7.8 | 28 |
Key takeaways
- Fine‑tuning consistently outperforms both the custom CNN and the frozen extractor, delivering a 6–10 percentage‑point boost in accuracy across all datasets.
- Custom CNNs shine when resources are tight: they achieve respectable performance with <1 M parameters and the fastest epoch time, making them ideal for edge devices or rapid prototyping.
- Frozen feature extraction lags behind in both accuracy and macro F1, especially on datasets with domain‑specific textures (e.g., road‑surface cracks).
- The marginal gain from full‑network fine‑tuning over partial fine‑tuning is small (under 1 percentage point of accuracy) but comes with a noticeable increase in training time, suggesting diminishing returns for the extra compute.
Practical Implications
- Edge‑AI deployments (e.g., IoT sensors on bridges or farms) can adopt the lightweight custom CNN without sacrificing too much accuracy, keeping inference latency and power consumption low.
- Mid‑scale production pipelines (e.g., quality‑control cameras in food processing) benefit most from partial fine‑tuning of a pre‑trained backbone—offering the best trade‑off between model robustness and training cost.
- Rapid‑iteration research can start with a frozen extractor to get baseline results quickly, then switch to fine‑tuning once the data pipeline stabilizes.
- Model‑ops teams can use the provided decision matrix to automate the selection of the optimal paradigm based on available GPU memory, training window, and target latency (a hypothetical encoding is sketched after this list).
- The study’s open‑source suite makes it straightforward to plug in a new dataset and let the same benchmarking script recommend a strategy, accelerating time‑to‑value for visual AI projects.
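The decision matrix itself is not reproduced in this summary, but a hypothetical encoding of its qualitative guidance might look like the following; the function name and thresholds are illustrative placeholders, not values taken from the paper.

```python
def recommend_paradigm(labeled_images: int, gpu_memory_gb: float,
                       edge_deployment: bool) -> str:
    """Hypothetical selector mirroring the study's qualitative guidance.

    Thresholds are illustrative placeholders, not values from the paper.
    """
    if edge_deployment or gpu_memory_gb < 4:
        # tight compute budget: small custom CNN trained from scratch
        return "custom_cnn_from_scratch"
    if labeled_images < 2_000:
        # little data, quick baseline: frozen pre-trained feature extractor
        return "frozen_feature_extractor"
    # best accuracy/cost trade-off reported in the study
    return "finetune_last_two_blocks"
```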
Limitations & Future Work
- The experiments are limited to ImageNet‑pre‑trained backbones; newer self‑supervised or domain‑specific pre‑training could shift the balance.
- Only classification tasks were examined; detection or segmentation pipelines may exhibit different trade‑offs.
- Hardware diversity (e.g., TPU, low‑power microcontrollers) was not explored; performance on non‑GPU platforms could alter the efficiency conclusions.
- Future research could extend the benchmark to larger‑scale datasets, incorporate neural architecture search for custom models, and evaluate inference‑time metrics on real edge hardware.
Authors
- Annoor Sharara Akhand
Paper Information
- arXiv ID: 2601.02246v1
- Categories: cs.CV, cs.AI, cs.LG
- Published: January 5, 2026