You Do NOT need augmentations to train your Classifiers!!
Contents of this blog
- Introduction
- Working with Classifiers: The First Classifier
- Understanding Classifiers: Unreliable Sources
- Actual Research: The Study
- Results
A quick note before we begin: if you’d rather jump straight to the study, start reading from Actual Research and then Results.
Key Terms
Binary Referral – A yes/no clinical decision about whether a patient should be referred to another service, specialist, or level of care.
Exotropia – A form of strabismus (eye misalignment) in which one or both eyes turn outward, away from the nose. It can be constant or intermittent and may cause double vision, eye strain, or reduced depth perception.
Esotropia – A type of strabismus in which one or both eyes turn inward, toward the nose. It can be constant or intermittent and, while most common in children, occurs at all ages.
Resolution – A measure of how well forecasts separate situations with different observed outcome frequencies. Higher resolution means the model assigns different probabilities to cases that genuinely differ in event likelihood.
Introduction
The most common ways to improve classifier performance are:
- Using more data
- Using pretrained architectures
- Employing augmentations
Previously, I’ve written extensively on classifier training—from common pitfalls to augmentation techniques like AutoAugment, RandAugment, and TrivialAugment, with Cutout, CutMix, and MixUp also in progress.
Across those posts, I often guided newcomers toward TrivialAugment or suggested they explore Generative Adversarial Networks (GANs). Within the medical domain, StyleGAN2‑ADA stood out to me: it performs well with limited data, is relatively intuitive once you grasp GAN fundamentals, and holds up strongly against predecessors like StyleGAN and StyleGAN2.
However, my recent research made me rethink some of those assumptions.
June 2025 – The First Classifier
In June, I had just started contributing to an open‑source project, studying chatbots, and polishing a few independent research projects. Around that time I built my first classifier—not for research, but for a small hackathon I decided to join. The classifier was central to the project because it had to provide exercise recommendations based on predictions. Accuracy was crucial.
The project was completed and submitted successfully. Did I win? No, but not because of the classifier. The issues were instead due to dependency updates and package incompatibilities (I discuss them in detail here).
Still, the experience—its frustrations, limitations, and small victories—sparked a six‑month deep dive into classifiers.
July 2025 – August 2025 – Unreliable Sources
A few weeks later, I revisited that original classifier and began experimenting with LLMs to refine it. My goal: learn the best strategy for building an effective classifier with just 500 images across 5 classes.
Initially, everything worked smoothly; the LLM suggestions improved the model. Then the infamous decline began: output quality dropped, changes became less meaningful, and eventually the classifier’s performance worsened.
Despite studying “programs and algorithms,” I found myself repeatedly pressing Ctrl + C and Ctrl + V. Fed up with the irony, I asked myself:
“How hard can studying classifiers actually be?”
TL;DR: Extremely hard if you’re new.
I refreshed my understanding of CNNs (a topic I’d studied long ago and also blogged about in Juggling Multiple Interests). Then I moved on to augmentations.
With my trust in LLMs diminishing due to contradictions and back‑tracking, I still used them for basic definitions, but I could clearly tell when the information was unreliable. Eventually I decided:
“What better way to learn something than from the source?”
That decision came with challenges: AutoAugment requires substantial foundational knowledge. It was ultimately worth it.
During this period I learned about:
- How AutoAugment works
- Computational demands and constraints
- Performance across datasets like ImageNet, CIFAR‑10, SVHN
- Architectural, optimization, GPU, and CPU considerations
This naturally led me to RandAugment, AutoAugment’s computationally cheaper successor. Around the same time, I began looking at augmentation from a medical/clinical perspective, and one particular question stuck with me:
“Which of these would be preferable in a clinical setting?”
That single question became the motivation behind the study I pursued for months.
July 2025 – November 2025 – The Study
In late July, I began an independent research study to benchmark augmentation techniques for a specific task: binary referral.
My goal was to determine whether augmentations genuinely help, not only under ideal conditions but also under the sub‑optimal ones common in clinical settings.
At this point I was already deep into dataset‑specific augmentations (AA, RA, TA). To compare them with more general, robustness‑focused augmentations, I also included Cutout, CutMix, and MixUp.
Final augmentation set
| Augmentation | Type |
|---|---|
| AutoAugment | Dataset‑specific |
| RandAugment | Dataset‑specific (lighter) |
| TrivialAugment | Dataset‑specific (very light) |
| Cutout | General robustness |
| CutMix | General robustness |
| MixUp | General robustness |
| Baseline | No augmentation |
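For concreteness, here is a minimal sketch of how this set could be wired up with torchvision (an assumption on my part: torchvision ≥ 0.16 for the `transforms.v2` API, with `RandomErasing` standing in for Cutout, and illustrative rather than study‑exact hyperparameters):

```python
from torchvision import transforms
from torchvision.transforms import v2

# Dataset-specific policies: applied per image, before normalization
auto_aug    = transforms.AutoAugment(transforms.AutoAugmentPolicy.IMAGENET)
rand_aug    = transforms.RandAugment(num_ops=2, magnitude=9)
trivial_aug = transforms.TrivialAugmentWide()

# Cutout stand-in: RandomErasing operates on tensors, so place it after ToTensor()
cutout = transforms.RandomErasing(p=0.5)

# General-robustness mixing: applied to whole (images, labels) batches inside
# the training loop, e.g. images, labels = cutmix(images, labels)
cutmix = v2.CutMix(num_classes=2)
mixup  = v2.MixUp(num_classes=2)
```

The baseline row simply omits all of the above.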
Simulating sub‑optimal conditions
- Hardware: CPU‑only (no GPU)
- Dataset size: ~100 images per class – a stress test
- Models: Pre‑trained EfficientNet‑B0, MobileNet‑V2, MobileNet‑V3 (ImageNet‑trained) to mitigate data scarcity
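As a rough sketch of that model setup (assuming the torchvision versions of these backbones and, since the post doesn’t specify a variant, MobileNet‑V3 Small; `build_referral_model` is a name I made up for illustration):

```python
import torch.nn as nn
from torchvision import models

def build_referral_model(name: str) -> nn.Module:
    """Load an ImageNet-pretrained backbone and swap its head for two classes."""
    if name == "efficientnet_b0":
        m = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
        m.classifier[1] = nn.Linear(m.classifier[1].in_features, 2)
    elif name == "mobilenet_v2":
        m = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
        m.classifier[1] = nn.Linear(m.classifier[1].in_features, 2)
    elif name == "mobilenet_v3":
        m = models.mobilenet_v3_small(weights=models.MobileNet_V3_Small_Weights.IMAGENET1K_V1)
        m.classifier[3] = nn.Linear(m.classifier[3].in_features, 2)
    else:
        raise ValueError(f"unknown model: {name}")
    return m
```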
A practical issue emerged: these models require 224 × 224 inputs. Cropping was not viable because it removed spatially important medical features. I solved this by padding images into a square, producing a proper 224 × 224 input while preserving structure. Grad‑CAM confirmed that models still localized the correct regions.
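A minimal sketch of that padding step, using Pillow (`pad_to_square` is a hypothetical helper name, not taken from the study’s code):

```python
from PIL import Image, ImageOps

def pad_to_square(img: Image.Image, size: int = 224, fill: int = 0) -> Image.Image:
    """Pad the shorter side until the image is square, then resize to size x size.
    Unlike center-cropping, nothing spatially important gets cut away."""
    w, h = img.size
    side = max(w, h)
    dw, dh = side - w, side - h
    # Split padding evenly across both sides: (left, top, right, bottom)
    img = ImageOps.expand(img, border=(dw // 2, dh // 2, dw - dw // 2, dh - dh // 2), fill=fill)
    return img.resize((size, size), Image.BILINEAR)
```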
Evaluation metrics
- Statistical analysis
- Brier score decomposition
- Odds ratios (and other calibration measures)
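The Brier score decomposition here is presumably the standard Murphy decomposition, Brier = reliability - resolution + uncertainty, where reliability penalizes miscalibration (lower is better) and resolution rewards separating cases with different outcome rates (higher is better). A minimal binned sketch (`brier_decomposition` is my own illustrative helper, not the study’s code):

```python
import numpy as np

def brier_decomposition(probs, outcomes, n_bins=10):
    """Murphy decomposition: Brier = reliability - resolution + uncertainty.
    probs: predicted probability of the positive class; outcomes: 0/1 labels."""
    probs, outcomes = np.asarray(probs), np.asarray(outcomes)
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    base_rate = outcomes.mean()
    uncertainty = base_rate * (1 - base_rate)
    reliability = resolution = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        weight = mask.mean()                # share of forecasts in this bin
        avg_prob = probs[mask].mean()       # mean forecast in the bin
        event_freq = outcomes[mask].mean()  # observed outcome frequency
        reliability += weight * (avg_prob - event_freq) ** 2   # lower is better
        resolution  += weight * (event_freq - base_rate) ** 2  # higher is better
    return reliability, resolution, uncertainty
```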
Results
AUC/DeLong Comparisons
I tested using esotropia and exotropia datasets because of their distinct characteristics.
Esotropia
- AutoAugment gave the most consistent results.
- I haven’t yet done a full qualitative analysis, but AutoAugment likely learned policies that emphasized key esotropia features.
Exotropia
- TrivialAugment performed most consistently.
- This suggests that simple random transformations can help stabilize performance.
Underperformer: CutMix
- CutMix consistently underperformed across nearly all seeds and models.
- DeLong’s test (on AUCs) repeatedly indicated worse performance for CutMix compared to the other augmentations.
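For readers unfamiliar with it, DeLong’s test compares two correlated AUCs (two models scored on the same cases) via the covariance of their placement values. A minimal NumPy/SciPy sketch, with function names of my own invention:

```python
import numpy as np
from scipy import stats

def _placements(pos, neg):
    """For each positive score, the fraction of negatives it beats (ties count
    half), and symmetrically for each negative score."""
    v10 = np.array([(s > neg).mean() + 0.5 * (s == neg).mean() for s in pos])
    v01 = np.array([(pos > s).mean() + 0.5 * (pos == s).mean() for s in neg])
    return v10, v01

def delong_test(scores_a, scores_b, labels):
    """Two-sided DeLong test for the AUC difference of two models evaluated
    on the same cases. Returns (auc_a, auc_b, p_value)."""
    labels = np.asarray(labels).astype(bool)
    aucs, v10s, v01s = [], [], []
    for s in (np.asarray(scores_a), np.asarray(scores_b)):
        v10, v01 = _placements(s[labels], s[~labels])
        aucs.append(v10.mean())  # mean placement value equals the AUC
        v10s.append(v10)
        v01s.append(v01)
    m, n = labels.sum(), (~labels).sum()
    s10, s01 = np.cov(np.vstack(v10s)), np.cov(np.vstack(v01s))
    var = (s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / m \
        + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / n
    z = (aucs[0] - aucs[1]) / np.sqrt(var)
    return aucs[0], aucs[1], 2 * stats.norm.sf(abs(z))
```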
Brier Decomposition
- MixUp showed reliability (calibration) problems most frequently, followed closely by AutoAugment.
- For resolution, AutoAugment was the most consistent, showing a strong ability to differentiate cases.
The Biggest Takeaway
Across multiple seeds, pretrained models, and both datasets, the baseline performed similarly to the augmented versions on nearly all metrics.
Conclusion: You do not need augmentations to train your classifiers.
With a high‑quality dataset, proper preprocessing, and the right pretrained model, even small datasets can reach strong baselines (e.g., ~0.93 across varied metrics).
If you do choose to use an augmentation, my recommendation is AutoAugment.
Further Reading
- Study: Introducing UBAEF (a slight warning: the full paper with appendices is 131 pages long!)
- GitHub Repository: ML‑framework‑s‑taxonomy – contains confidence intervals, training times, and more.
Until next time, with another project.
And remember, sometimes the baseline is already great.