You Don’t Need Many Labels to Learn

Published: (April 17, 2026 at 11:00 AM EDT)

Source: Towards Data Science

Introduction

Supervised learning usually comes with an implicit assumption: you need a lot of labeled data.

At the same time, many models are capable of discovering structure in data without any labels at all.

Generative models, in particular, often organize data into meaningful clusters during unsupervised training. When trained on images, they may naturally separate digits, objects, or styles in their latent representations.

This raises a simple but important question:

If a model has already discovered the structure of the data without labels, how much supervision is actually needed to turn it into a classifier?

In this article, we explore this question using a Gaussian Mixture Variational Autoencoder (GMVAE) (Dilokthanakul et al., 2016).

Dataset

We use the EMNIST Letters dataset introduced by Cohen et al. (2017), which is an extension of the original MNIST dataset.

  • Source: NIST Special Database 19
  • Processed by: Cohen et al. (2017)
  • Size: 145 600 images (26 balanced classes)
  • Ownership: U.S. National Institute of Standards and Technology (NIST)
  • License: Public domain (U.S. government work)

Disclaimer
The code provided in this article is intended for research and reproducibility purposes only. It is currently tailored to the MNIST and EMNIST datasets, and is not designed as a general‑purpose framework. Extending it to other datasets requires adaptations (data preprocessing, architecture tuning, and hyperparameter selection).

Code and experiments are available on GitHub: https://github.com/murex/gmvae-label-decoding

This choice is not arbitrary. EMNIST is far more ambiguous than the classical MNIST dataset, which makes it a better benchmark to highlight the importance of probabilistic representations (Figure 1).

The GMVAE: Learning Structure in an Unsupervised Way

A standard Variational Autoencoder (VAE) is a generative model that learns a continuous latent representation 𝒛 of the data.

More precisely, each data point 𝒙 is mapped to a multivariate normal distribution 𝒒(𝒛|𝒙), called the posterior.

However, this is not sufficient if we want to perform clustering. With a standard Gaussian prior, the latent space tends to remain continuous and does not naturally separate into distinct groups.

This is where GMVAEs come into play.

A GMVAE extends the VAE by replacing the prior with a mixture of K components, where K is chosen beforehand. To achieve this, a new discrete latent variable 𝒄 is introduced:

c ∈ {1, …, K}

This allows the model to learn a posterior distribution over clusters:

q(c|x)

Each component of the mixture can then be interpreted as a cluster.

In other words, GMVAEs intrinsically learn clusters during training.
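The generative story behind this prior can be sketched in a few lines of NumPy. This is an illustrative toy, not the article's implementation: K, the latent dimension, and the parameters `pi`, `mu`, `sigma` are hypothetical placeholders that a trained GMVAE would learn.

```python
import numpy as np

rng = np.random.default_rng(0)

K, D = 100, 16                # number of mixture components, latent dimension
pi = np.full(K, 1.0 / K)      # mixture weights (uniform here; learned in practice)
mu = rng.normal(size=(K, D))  # per-component means (placeholders)
sigma = np.ones((K, D))       # per-component std devs (placeholders)

def sample_prior(n):
    """Sample from the mixture prior: c ~ Cat(pi), then z ~ N(mu_c, diag(sigma_c^2))."""
    c = rng.choice(K, size=n, p=pi)                 # discrete cluster assignment
    z = mu[c] + sigma[c] * rng.normal(size=(n, D))  # Gaussian draw from component c
    return c, z

c, z = sample_prior(5)
print(c.shape, z.shape)  # (5,) (5, 16)
```

Each sampled `c` picks a component, so decoding `z` through the generator produces samples that inherit the cluster structure of the prior.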

The choice of K controls a trade‑off between expressivity and reliability.

  • If K is too small, clusters tend to merge distinct styles or even different letters, limiting the model’s ability to capture fine‑grained structure.
  • If K is too large, clusters become too fragmented, making it harder to estimate reliable label–cluster relationships from a limited labeled subset.

We choose K = 100 as a compromise: large enough to capture stylistic variations within each class, yet small enough to ensure that each cluster is sufficiently represented in the labeled data (Figure 1).

Figure 1 – Samples generated from several GMVAE components
Different stylistic variants of the same letter are captured, such as an uppercase F (c = 36) and a lowercase f (c = 0). Clusters are not pure: component c = 73 predominantly represents the letter “T”, but also includes samples of “J”.

Turning Clusters Into a Classifier

Once the GMVAE is trained, each image is associated with a posterior distribution over clusters: 𝒒(𝒄|𝒙).

In practice, when the number of clusters is unknown, it can be treated as a hyperparameter and tuned via grid search.

A natural idea is to assign each data point to a single cluster. However, clusters themselves do not yet have semantic meaning. To connect clusters to labels, we need a labeled subset.

A classic baseline is the cluster‑then‑label approach: data are first clustered using an unsupervised method (e.g., k‑means or GMM), and each cluster is assigned a label based on the labeled subset, typically via majority voting. This corresponds to a hard‑assignment strategy.

In contrast, our approach does not rely on a single cluster assignment. Instead, it leverages the full posterior distribution over clusters, allowing each data point to be represented as a mixture of clusters rather than a single discrete assignment. This can be seen as a probabilistic generalization of the cluster‑then‑label paradigm.

How many labels are theoretically required?

In an ideal scenario, clusters are perfectly pure and of equal size. If we could choose which data points to label, a single labeled example per cluster would be sufficient—i.e., only K labels in total.

With N = 145 600 and K = 100, this corresponds to 0.07 % of labeled data.

In practice, we assume that labeled samples are drawn at random. Under this assumption and equal cluster sizes, an approximate lower bound can be derived to cover all K clusters with a chosen confidence level. For K = 100, about 0.6 % labeled data are needed to achieve 95 % confidence.

Relaxing the equal‑size assumption yields a more general inequality, but it does not admit a closed‑form solution. All these calculations are optimistic: real clusters are not perfectly pure (e.g., a cluster may contain both “i” and “l” in comparable proportions).
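Under the equal-size assumption, one such bound can be computed directly. The sketch below uses a union bound, P(some cluster has no labeled sample) ≤ K(1 − 1/K)ⁿ, which is one way to obtain a figure of this order; it gives roughly 757 labels (≈0.5 % of the data), in the same ballpark as the ~0.6 % quoted above (the exact number depends on the bound used).

```python
import math

def labels_needed(K, delta):
    """Smallest n such that, by a union bound, the probability that some
    cluster receives no labeled sample, K * (1 - 1/K)**n, is at most delta.
    Assumes equal-size clusters and labels drawn uniformly at random."""
    return math.ceil(math.log(delta / K) / math.log(1.0 - 1.0 / K))

N, K = 145_600, 100
n = labels_needed(K, delta=0.05)   # 95% confidence of covering all clusters
print(n, f"{n / N:.2%}")           # ~757 labels, i.e. roughly 0.5% of the data
```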

Assigning labels to the remaining data

We compare two strategies:

  • Hard decoding – ignore the probability distributions provided by the model.
  • Soft decoding – fully exploit the posterior distributions.

Hard decoding
  1. Cluster‑to‑label mapping: For each cluster 𝒄, assign the most frequent label among the labeled points belonging to that cluster, yielding a function ℓ(𝒄).

  2. Label prediction: For an unlabeled image 𝒙, find its most likely cluster

     c_hard(𝒙) = argmax_c q(𝒄|𝒙)

     and assign the label ℓ(c_hard(𝒙)).
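The two steps above can be sketched with NumPy, assuming the posteriors q(𝒄|𝒙) are available as row-stochastic arrays. The helper names are ours, and in this toy a cluster with no labeled points silently defaults to label 0.

```python
import numpy as np

def fit_cluster_labels(q_labeled, y_labeled, K, n_classes):
    """Step 1: majority vote -- map each cluster c to the most frequent
    label among labeled points whose most likely cluster is c."""
    hard_c = q_labeled.argmax(axis=1)          # most likely cluster per point
    counts = np.zeros((K, n_classes))
    np.add.at(counts, (hard_c, y_labeled), 1)  # label counts per cluster
    return counts.argmax(axis=1)               # the mapping ell(c)

def hard_decode(q_unlabeled, ell):
    """Step 2: assign ell(c_hard(x)) to each unlabeled point."""
    return ell[q_unlabeled.argmax(axis=1)]

# toy example with hypothetical posteriors (rows sum to 1)
q_lab = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]])
y_lab = np.array([0, 0, 1])
ell = fit_cluster_labels(q_lab, y_lab, K=2, n_classes=2)
print(hard_decode(np.array([[0.6, 0.4]]), ell))  # [0]
```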

Limitations

  1. Ignores model uncertainty (the GMVAE may “hesitate” between several clusters).
  2. Assumes clusters are pure, which is generally false.

Soft decoding

Instead of a single label per cluster, we estimate, for each label ℓ, a probability vector of size K:

m(ℓ) = (m(ℓ)₁, …, m(ℓ)_K)

This vector empirically represents p(𝒄|ℓ): its component m(ℓ)_c estimates the probability that a point labeled ℓ falls into cluster c.

For each image 𝒙, the GMVAE provides a posterior probability vector:

q(𝒙) = (q(c=1|𝒙), …, q(c=K|𝒙))

We assign to 𝒙 the label that maximizes the similarity between m(ℓ) and q(𝒙), e.g. their inner product:

ℓ_soft(𝒙) = argmax_ℓ ⟨m(ℓ), q(𝒙)⟩

This formulation accounts for both uncertainty in cluster assignment and impurity of clusters.

Interpretation: compare q(𝒄|𝒙) with p(𝒄|ℓ) and select the label whose cluster distribution best matches the posterior of 𝒙.
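A minimal sketch of soft decoding, with the same array layout as for hard decoding. Here m(ℓ) is estimated as the average posterior over labeled points carrying label ℓ (one reasonable empirical estimator of p(𝒄|ℓ)), and "similarity" is taken to be the inner product, which amounts to a weighted vote over clusters.

```python
import numpy as np

def fit_label_profiles(q_labeled, y_labeled, n_classes):
    """Estimate m(ell) as the average posterior q(c|x) over labeled
    points with label ell -- an empirical stand-in for p(c|ell)."""
    return np.stack([q_labeled[y_labeled == l].mean(axis=0)
                     for l in range(n_classes)])   # shape (n_classes, K)

def soft_decode(q_unlabeled, m):
    """Pick the label whose cluster profile m(ell) best matches q(c|x);
    the inner product acts as a weighted vote across all clusters."""
    scores = q_unlabeled @ m.T                     # (n_points, n_classes)
    return scores.argmax(axis=1)

# toy example: two clusters, two labels
q_lab = np.array([[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]])
y_lab = np.array([0, 0, 1])
m = fit_label_profiles(q_lab, y_lab, n_classes=2)
print(soft_decode(np.array([[0.55, 0.45]]), m))  # [0]
```

Unlike the hard rule, an ambiguous posterior such as (0.55, 0.45) still lets the second cluster contribute to the score of every label.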

Concrete example where soft decoding helps

Figure 2 illustrates a case where soft decoding outperforms hard decoding.

Figure 2 – Interest of soft decoding
The true label is e. The model’s posterior over clusters (center) assigns high probability to clusters 76, 4, 0, 35, 81, 61.

The hard rule selects the most probable cluster (76), which is mostly associated with label c, leading to an incorrect prediction.

Soft decoding aggregates information from all plausible clusters, effectively performing a weighted vote. In this example, the weighted scores for e exceed those for c, resulting in the correct prediction.

This demonstrates that hard decoding discards most of the information contained in the posterior distribution q(𝒄|𝒙), whereas soft decoding leverages the full uncertainty of the generative model.

How Much Supervision Do We Need in Practice?

Theory aside, we evaluate the approach on real data with the following goals:

  • Determine how many labeled samples are needed to achieve good accuracy.
  • Identify when soft decoding provides a benefit.

We progressively increase the number of labeled samples and evaluate accuracy on the remaining data, comparing against standard baselines: logistic regression, MLP, and XGBoost. Results are reported as mean accuracy with 95 % confidence intervals over 5 random seeds (Figure 3).
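The evaluation protocol can be sketched as follows. `predict_fn` is a hypothetical stand-in for any of the strategies above (hard decoding, soft decoding, or a baseline fit on the labeled subset), and the confidence interval uses a simple normal approximation over seeds.

```python
import numpy as np

def evaluate(predict_fn, q_all, y_all, fractions, n_seeds=5):
    """For each labeled fraction, draw a random labeled subset per seed,
    predict labels for the remaining data, and report mean accuracy with
    a normal-approximation 95% confidence interval."""
    results = {}
    n = len(y_all)
    for frac in fractions:
        accs = []
        for seed in range(n_seeds):
            rng = np.random.default_rng(seed)
            idx = rng.permutation(n)
            n_lab = max(1, int(frac * n))
            lab, unlab = idx[:n_lab], idx[n_lab:]
            y_pred = predict_fn(q_all[lab], y_all[lab], q_all[unlab])
            accs.append((y_pred == y_all[unlab]).mean())
        accs = np.array(accs)
        ci = 1.96 * accs.std(ddof=1) / np.sqrt(n_seeds)
        results[frac] = (accs.mean(), ci)
    return results
```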

Figure 3 – Accuracy vs. labeled data fraction

Even with extremely small labeled subsets, the classifier already performs surprisingly well.

  • With only 73 labeled samples (several clusters not represented), soft decoding achieves an absolute accuracy gain of ~18 percentage points over hard decoding.
  • With 0.2 % labeled data (291 samples, roughly 3 labeled examples per cluster), the GMVAE‑based classifier reaches 80 % accuracy.
  • In contrast, XGBoost requires around 7 % labeled data (≈35× more supervision) to achieve comparable performance.

These results highlight a key point: most of the structure required for classification is already learned during the unsupervised phase—labels are only needed to interpret it.

Conclusion

Using a GMVAE trained without labels, we can build a classifier with as little as 0.2 % labeled data.

  • The unsupervised model learns a large part of the structure required for classification.
  • Labels are used only to interpret clusters that the model has already discovered.
  • A simple hard decoding rule performs well, but leveraging the full posterior distribution provides a consistent improvement, especially when supervision is scarce.

More broadly, this experiment suggests a promising paradigm for label‑efficient machine learning:

  1. Learn structure first (unsupervised).
  2. Add labels later to interpret the learned representations.

In many cases, labels are not needed to learn—only to name what has already been learned.

All experiments were conducted using our own implementation of GMVAE and evaluation pipeline.

References

  • Cohen, G., Afshar, S., Tapson, J., & van Schaik, A. (2017). EMNIST: Extending MNIST to handwritten letters. arXiv:1702.05373.
  • Dilokthanakul, N., Mediano, P. A., Garnelo, M., Lee, M. C., Salimbeni, H., Arulkumaran, K., & Shanahan, M. (2016). Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders. arXiv:1611.02648.

© 2026 MUREX S.A.S. and Université Paris Dauphine — PSL
This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/.
