You Do NOT need augmentations to train your Classifiers!!
Contents of this blog
- Introduction
- Working with Classifiers: The First Classifier
- Understanding Classifiers: Unreliable Sources
- Actual Research: The Study
- Results
A quick note before we begin: if you’d rather jump straight to the study, start reading from Actual Research and then Results.
Key Terms
Binary Referral – A yes/no clinical decision about whether a patient should be referred to another service, specialist, or level of care.
Exotropia – A form of strabismus (eye misalignment) in which one or both eyes turn outward, away from the nose. It can be constant or intermittent and may cause double vision, eye strain, or reduced depth perception.
Esotropia – A type of strabismus in which one or both eyes turn inward, toward the nose. It can be constant or intermittent and, while most common in children, occurs at all ages.
Resolution – A measure of how well forecasts separate situations with different observed outcome frequencies. Higher resolution means the model assigns different probabilities to cases that genuinely differ in event likelihood.
Introduction
The most common ways to improve classifier performance are:
- Using more data
- Using pretrained architectures
- Employing augmentations
Previously, I’ve written extensively on classifier training—from common pitfalls to augmentation techniques like AutoAugment, RandAugment, and TrivialAugment, with Cutout, CutMix, and MixUp also in progress.
Across those posts, I often guided newcomers toward TrivialAugment or suggested they explore Generative Adversarial Networks (GANs). Within the medical domain, StyleGAN2‑ADA stood out to me: it performs well with limited data, is relatively intuitive once you grasp GAN fundamentals, and holds up strongly against predecessors like StyleGAN and StyleGAN2.
However, my recent research made me rethink some of those assumptions.
June 2025 – The First Classifier
In June, I had just started contributing to an open‑source project, studying chatbots, and polishing a few independent research projects. Around that time I built my first classifier—not for research, but for a small hackathon I decided to join. The classifier was central to the project because it had to provide exercise recommendations based on predictions. Accuracy was crucial.
The project was completed and submitted successfully. Did I win? No, but not because of the classifier. The issues were instead due to dependency updates and package incompatibilities (I discuss them in detail here).
Still, the experience—its frustrations, limitations, and small victories—sparked a six‑month deep dive into classifiers.
July 2025 – August 2025 – Unreliable Sources
A few weeks later, I revisited that original classifier and began experimenting with LLMs to refine it. My goal: learn the best strategy for building an effective classifier with just 500 images across 5 classes.
Initially, everything worked smoothly; the LLM suggestions improved the model. Then the infamous decline began: output quality dropped, changes became less meaningful, and eventually the classifier’s performance worsened.
Despite studying “programs and algorithms,” I found myself repeatedly pressing Ctrl + C and Ctrl + V. Fed up with the irony, I asked myself:
“How hard can studying classifiers actually be?”
TL;DR: Extremely hard if you’re new.
I refreshed my understanding of CNNs (a topic I’d studied long ago and also blogged about in Juggling Multiple Interests). Then I moved on to augmentations.
With my trust in LLMs diminishing due to contradictions and back‑tracking, I still used them for basic definitions, but I could clearly tell when the information was unreliable. Eventually I decided:
“What better way to learn something than from the source?”
That decision came with challenges: AutoAugment requires substantial foundational knowledge. It was ultimately worth it.
During this period I learned about:
- How AutoAugment works
- Computational demands and constraints
- Performance across datasets like ImageNet, CIFAR‑10, SVHN
- Architectural, optimization, GPU, and CPU considerations
This naturally led me to RandAugment, AutoAugment’s computationally cheaper successor. Around the same time, I began looking at augmentation from a medical/clinical perspective, and one particular question stuck with me:
“Which of these would be preferable in a clinical setting?”
That single question became the motivation behind the study I pursued for months.
July 2025 – November 2025 – The Study
In late July, I began an independent research study to benchmark augmentation techniques for a specific task: binary referral.
My goal was to determine whether augmentations genuinely help, not only under ideal conditions but also under the sub‑optimal ones common in clinical settings.
At this point I was already deep into dataset‑specific augmentations (AA, RA, TA). To compare them with more general, robustness‑focused augmentations, I also included Cutout, CutMix, and MixUp.
Final augmentation set
| Augmentation | Type |
|---|---|
| AutoAugment | Dataset‑specific |
| RandAugment | Dataset‑specific (lighter) |
| TrivialAugment | Dataset‑specific (very light) |
| Cutout | General robustness |
| CutMix | General robustness |
| MixUp | General robustness |
| Baseline | No augmentation |
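For concreteness, here is a minimal sketch of how this set could be wired up with torchvision (an assumption on my part: torchvision ≥ 0.16 for the `transforms.v2` API, with `RandomErasing` standing in for Cutout, and illustrative rather than study‑exact hyperparameters):

```python
from torchvision import transforms
from torchvision.transforms import v2

# Dataset-specific policies: applied per image, before normalization
auto_aug    = transforms.AutoAugment(transforms.AutoAugmentPolicy.IMAGENET)
rand_aug    = transforms.RandAugment(num_ops=2, magnitude=9)
trivial_aug = transforms.TrivialAugmentWide()

# Cutout stand-in: RandomErasing operates on tensors, so place it after ToTensor()
cutout = transforms.RandomErasing(p=0.5)

# General-robustness mixing: applied to whole (images, labels) batches inside
# the training loop, e.g. images, labels = cutmix(images, labels)
cutmix = v2.CutMix(num_classes=2)
mixup  = v2.MixUp(num_classes=2)
```

The baseline row simply omits all of the above.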
Simulating sub‑optimal conditions
- Hardware: CPU‑only (no GPU)
- Dataset size: ~100 images per class – a stress test
- Models: Pre‑trained EfficientNet‑B0, MobileNet‑V2, MobileNet‑V3 (ImageNet‑trained) to mitigate data scarcity
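As a rough sketch of that model setup (assuming the torchvision versions of these backbones and, since the post doesn’t specify a variant, MobileNet‑V3 Small; `build_referral_model` is a name I made up for illustration):

```python
import torch.nn as nn
from torchvision import models

def build_referral_model(name: str) -> nn.Module:
    """Load an ImageNet-pretrained backbone and swap its head for two classes."""
    if name == "efficientnet_b0":
        m = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
        m.classifier[1] = nn.Linear(m.classifier[1].in_features, 2)
    elif name == "mobilenet_v2":
        m = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
        m.classifier[1] = nn.Linear(m.classifier[1].in_features, 2)
    elif name == "mobilenet_v3":
        m = models.mobilenet_v3_small(weights=models.MobileNet_V3_Small_Weights.IMAGENET1K_V1)
        m.classifier[3] = nn.Linear(m.classifier[3].in_features, 2)
    else:
        raise ValueError(f"unknown model: {name}")
    return m
```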
A practical issue emerged: these models require 224 × 224 inputs. Cropping was not viable because it removed spatially important medical features. I solved this by padding images into a square, producing a proper 224 × 224 input while preserving structure. Grad‑CAM confirmed that models still localized the correct regions.
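A minimal sketch of that padding step, using Pillow (`pad_to_square` is a hypothetical helper name, not taken from the study’s code):

```python
from PIL import Image, ImageOps

def pad_to_square(img: Image.Image, size: int = 224, fill: int = 0) -> Image.Image:
    """Pad the shorter side until the image is square, then resize to size x size.
    Unlike center-cropping, nothing spatially important gets cut away."""
    w, h = img.size
    side = max(w, h)
    dw, dh = side - w, side - h
    # Split padding evenly across both sides: (left, top, right, bottom)
    img = ImageOps.expand(img, border=(dw // 2, dh // 2, dw - dw // 2, dh - dh // 2), fill=fill)
    return img.resize((size, size), Image.BILINEAR)
```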
Evaluation metrics
- Statistical analysis
- Brier score decomposition
- Odds ratios (and other calibration measures)
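The Brier score decomposition here is presumably the standard Murphy decomposition, Brier = reliability - resolution + uncertainty, where reliability penalizes miscalibration (lower is better) and resolution rewards separating cases with different outcome rates (higher is better). A minimal binned sketch (`brier_decomposition` is my own illustrative helper, not the study’s code):

```python
import numpy as np

def brier_decomposition(probs, outcomes, n_bins=10):
    """Murphy decomposition: Brier = reliability - resolution + uncertainty.
    probs: predicted probability of the positive class; outcomes: 0/1 labels."""
    probs, outcomes = np.asarray(probs), np.asarray(outcomes)
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    base_rate = outcomes.mean()
    uncertainty = base_rate * (1 - base_rate)
    reliability = resolution = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        weight = mask.mean()                # share of forecasts in this bin
        avg_prob = probs[mask].mean()       # mean forecast in the bin
        event_freq = outcomes[mask].mean()  # observed outcome frequency
        reliability += weight * (avg_prob - event_freq) ** 2   # lower is better
        resolution  += weight * (event_freq - base_rate) ** 2  # higher is better
    return reliability, resolution, uncertainty
```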
Results
AUC/DeLong Comparisons
I tested using esotropia and exotropia datasets because of their distinct characteristics.
Esotropia
- AutoAugment gave the most consistent results.
- I haven’t yet done a full qualitative analysis, but AutoAugment likely learned policies that emphasized key esotropia features.
Exotropia
- TrivialAugment performed most consistently.
- This suggests that simple random transformations can help stabilize performance.
Underperformer: CutMix
- CutMix consistently underperformed across nearly all seeds and models.
- DeLong’s test (on AUCs) repeatedly indicated worse performance for CutMix compared to the other augmentations.
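For readers unfamiliar with it, DeLong’s test compares two correlated AUCs (two models scored on the same cases) via the covariance of their placement values. A minimal NumPy/SciPy sketch, with function names of my own invention:

```python
import numpy as np
from scipy import stats

def _placements(pos, neg):
    """For each positive score, the fraction of negatives it beats (ties count
    half), and symmetrically for each negative score."""
    v10 = np.array([(s > neg).mean() + 0.5 * (s == neg).mean() for s in pos])
    v01 = np.array([(pos > s).mean() + 0.5 * (pos == s).mean() for s in neg])
    return v10, v01

def delong_test(scores_a, scores_b, labels):
    """Two-sided DeLong test for the AUC difference of two models evaluated
    on the same cases. Returns (auc_a, auc_b, p_value)."""
    labels = np.asarray(labels).astype(bool)
    aucs, v10s, v01s = [], [], []
    for s in (np.asarray(scores_a), np.asarray(scores_b)):
        v10, v01 = _placements(s[labels], s[~labels])
        aucs.append(v10.mean())  # mean placement value equals the AUC
        v10s.append(v10)
        v01s.append(v01)
    m, n = labels.sum(), (~labels).sum()
    s10, s01 = np.cov(np.vstack(v10s)), np.cov(np.vstack(v01s))
    var = (s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / m \
        + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / n
    z = (aucs[0] - aucs[1]) / np.sqrt(var)
    return aucs[0], aucs[1], 2 * stats.norm.sf(abs(z))
```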
Brier Decomposition
- MixUp showed reliability (calibration) problems most frequently, followed closely by AutoAugment.
- For resolution, AutoAugment was the most consistent, showing a strong ability to differentiate cases.
The Biggest Takeaway
Across multiple seeds, pretrained models, and both datasets, the baseline performed similarly to the augmented versions on nearly all metrics.
Conclusion: You do not need augmentations to train your classifiers.
With a high‑quality dataset, proper preprocessing, and the right pretrained model, even small datasets can reach strong baselines (e.g., ~0.93 across varied metrics).
If you do choose to use an augmentation, my recommendation is AutoAugment.
Further Reading
- Study: Introducing UBAEF (a slight warning: the full paper with appendices is 131 pages long!)
- GitHub Repository: ML‑framework‑s‑taxonomy – contains confidence intervals, training times, and more.
Until next time, with another project.
And remember, sometimes the baseline is already great.