[Paper] A Theoretical and Empirical Taxonomy of Imbalance in Binary Classification
Source: arXiv - 2601.04149v1
Overview
The paper presents a unified, theory‑driven way to understand why binary classifiers stumble when the classes are imbalanced. By boiling the problem down to three intuitive quantities—how skewed the class frequencies are, how many features you have relative to samples, and how naturally separable the data are—the authors derive concrete “regimes” that predict exactly how performance metrics will degrade.
Key Contributions
- Triplet taxonomy – Introduces a three‑dimensional framework ((\eta, \kappa, \Delta)) that captures class‑frequency imbalance ((\eta)), the dimension‑to‑sample ratio ((\kappa)), and intrinsic separability ((\Delta)).
- Closed‑form Bayes error analysis – Starting from the Gaussian Bayes classifier, derives analytic expressions for the optimal error and shows how the decision boundary shifts with imbalance.
- Four deterioration regimes – Defines Normal, Mild, Extreme, and Catastrophic regimes based on the relationship (\log(\eta) \gtrless \Delta\sqrt{\kappa}).
- Empirical validation on high‑dimensional genomics data – Keeps (\kappa) and (\Delta) fixed while sweeping (\eta); observes that recall, precision, F1, and PR‑AUC follow the theoretical predictions across linear, tree‑based, and kernel models.
- Model‑agnostic insight – Shows that the taxonomy holds regardless of whether the classifier is parametric (e.g., logistic regression) or non‑parametric (e.g., random forest).
Methodology
- Theoretical backbone – Data are assumed to be generated from two multivariate Gaussian distributions with equal covariance; under this setting the Bayes‑optimal classifier is a known linear discriminant. Inserting an imbalance coefficient (\eta = \frac{n_{\text{major}}}{n_{\text{minor}}}) (so that (\eta \ge 1) and (\log\eta) grows with imbalance) into the class priors yields a shifted decision hyperplane and a closed‑form Bayes error that depends on (\eta), the dimension‑to‑sample ratio (\kappa = \frac{p}{n}), and the Mahalanobis distance (\Delta) between the class means.
- Regime derivation – Analyzing the error expression yields a critical threshold (\log(\eta) = \Delta\sqrt{\kappa}). Below the threshold the classifier behaves “normally”; crossing it leads to progressively harsher performance drops, culminating in a “catastrophic” regime where the minority class is essentially invisible.
- Experimental setup – A publicly available high‑dimensional genomic dataset (≈10 k features, a few hundred samples) is first balanced, then artificially re‑imbalanced by subsampling the minority class to hit target (\eta) values while keeping (\kappa) and (\Delta) constant. Multiple learners (logistic regression, SVM, random forest, k‑NN) are trained on each version.
- Metrics tracked – Recall (sensitivity) for the minority class, precision, F1‑score, and area under the precision‑recall curve (PR‑AUC) are reported as functions of (\eta).
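The regime derivation above can be sketched as a small lookup function. This is a minimal illustration (the function name is ours, not the paper's), taking (\eta) as the majority‑to‑minority ratio so that (\log\eta \ge 0), and applying the thresholds (\Delta\sqrt{\kappa} \pm 1) that define the four regimes:

```python
import math

def imbalance_regime(eta: float, kappa: float, delta: float) -> str:
    """Classify the deterioration regime from (eta, kappa, delta).

    eta   -- class-frequency imbalance, n_major / n_minor (>= 1)
    kappa -- dimension-to-sample ratio, p / n
    delta -- Mahalanobis distance between the class means
    """
    threshold = delta * math.sqrt(kappa)
    log_eta = math.log(eta)
    if log_eta < threshold - 1:
        return "Normal"
    elif log_eta < threshold:
        return "Mild"
    elif log_eta < threshold + 1:
        return "Extreme"
    else:
        return "Catastrophic"
```

A balanced dataset ((\eta = 1), so (\log\eta = 0)) always lands in the Normal regime as long as (\Delta\sqrt{\kappa} > 1); increasing (\eta) with (\kappa) and (\Delta) held fixed walks through the four regimes in order.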
Results & Findings
| Regime | Condition (in terms of (\log\eta) vs. (\Delta\sqrt{\kappa})) | Observed behavior |
|---|---|---|
| Normal | (\log\eta < \Delta\sqrt{\kappa} - 1) | Minority recall stays high (> 0.9); precision and F1 are stable. |
| Mild | (\Delta\sqrt{\kappa} - 1 \le \log\eta < \Delta\sqrt{\kappa}) | Recall begins a gentle decline; precision rises slightly because false positives drop. |
| Extreme | (\Delta\sqrt{\kappa} \le \log\eta < \Delta\sqrt{\kappa} + 1) | Recall collapses sharply (often < 0.2); precision becomes erratic; F1 and PR‑AUC drop by more than 30%. |
| Catastrophic | (\log\eta \ge \Delta\sqrt{\kappa} + 1) | Minority class is effectively ignored; recall ≈ 0, while precision ≈ 1 on the vanishingly few positive predictions that remain. |
Across all models, the minority recall curve aligns almost perfectly with the theoretical prediction
[ \text{Recall} \approx \Phi\bigl(\Delta\sqrt{\kappa} - \log\eta\bigr) ]
(where (\Phi) is the standard Gaussian CDF). Precision shows an asymmetric rise because the number of predicted positives (the denominator) shrinks faster than the number of false positives. The composite metrics (F1, PR‑AUC) mirror the transition points, confirming that the taxonomy is model‑agnostic.
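The recall prediction above is easy to evaluate directly. A minimal sketch (function name ours), with the standard‑normal CDF expressed through the error function so only the standard library is needed:

```python
import math

def predicted_minority_recall(eta: float, kappa: float, delta: float) -> float:
    """Recall ~= Phi(delta * sqrt(kappa) - log(eta)),
    where Phi is the standard-normal CDF and eta = n_major / n_minor."""
    x = delta * math.sqrt(kappa) - math.log(eta)
    # Phi(x) = (1 + erf(x / sqrt(2))) / 2
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```

At the critical threshold (\log\eta = \Delta\sqrt{\kappa}) this gives exactly (\Phi(0) = 0.5), and recall decays monotonically as (\eta) grows past it, matching the Extreme‑to‑Catastrophic transition in the table.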
Practical Implications
- Metric‑driven monitoring – Developers can compute (\eta), (\kappa), and an estimate of (\Delta) (e.g., via a quick linear discriminant analysis) to anticipate which deterioration regime their pipeline is entering, allowing proactive mitigation before performance collapses.
- Guidance for data collection – The framework quantifies the trade‑off between acquiring more features (increasing (\kappa)) and maintaining separability. In high‑dimensional domains (genomics, text embeddings), simply adding features without increasing sample size can push you into the Extreme regime even with modest imbalance.
- Algorithm selection – Since the regime effect is model‑agnostic, the taxonomy suggests that “fancy” imbalance‑aware algorithms (cost‑sensitive loss, SMOTE) will only help if you stay in the Normal or Mild regimes; once you cross into Extreme, you need data‑level interventions (collect more minority samples, reduce dimensionality).
- Automated alerts – Production ML monitoring tools can embed the (\log(\eta) > \Delta\sqrt{\kappa}) check as a health‑check rule, triggering alerts or automated re‑balancing pipelines.
- Explainability for stakeholders – The geometric interpretation (boundary shift) offers a simple visual story for product managers: “Your model is still optimal, but the decision line has moved because the minority class is under‑represented.”
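The health‑check rule suggested above can be sketched as a single predicate over quantities a monitoring pipeline already has (class counts, feature count, and an estimate of (\Delta)). The function name and the optional safety margin are our additions, not part of the paper:

```python
import math

def imbalance_alert(n_major: int, n_minor: int, p: int,
                    delta_hat: float, margin: float = 0.0) -> bool:
    """Return True when log(eta) >= delta_hat * sqrt(kappa) - margin,
    i.e. the pipeline is at (or within `margin` of) the critical threshold.

    n_major, n_minor -- per-class sample counts
    p                -- feature dimensionality
    delta_hat        -- estimated Mahalanobis distance between class means
    """
    eta = n_major / n_minor
    kappa = p / (n_major + n_minor)
    return math.log(eta) >= delta_hat * math.sqrt(kappa) - margin
```

A positive `margin` fires the alert while the pipeline is still in the Mild regime, leaving time for re‑balancing before recall collapses.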
Limitations & Future Work
- Gaussian assumption – The closed‑form derivations rely on equal‑covariance Gaussian class distributions; real‑world data often violate this, potentially shifting the regime boundaries.
- Estimating (\Delta) in practice – Computing the true Mahalanobis distance requires knowledge of class means and covariances, which may be noisy in small‑sample settings. Approximation strategies need validation.
- Only binary classification – Extending the taxonomy to multi‑class or multi‑label scenarios is non‑trivial and left for future research.
- Dynamic data streams – The current analysis is static; handling evolving class ratios (concept drift) would require a time‑varying version of the framework.
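On the second limitation, a natural plug‑in estimate of (\Delta) uses the pooled sample covariance, with a small ridge term to keep the solve stable when (p) approaches (n). This is a sketch of one such approximation (function name ours), and, as noted above, it is noisy in small‑sample, high‑dimensional settings:

```python
import numpy as np

def estimate_delta(X_pos, X_neg, ridge: float = 1e-6) -> float:
    """Plug-in Mahalanobis distance between the two class means,
    using the pooled sample covariance with a ridge for stability."""
    X_pos = np.asarray(X_pos, dtype=float)
    X_neg = np.asarray(X_neg, dtype=float)
    mu_diff = X_pos.mean(axis=0) - X_neg.mean(axis=0)
    n1, n2 = len(X_pos), len(X_neg)
    # Pooled covariance: weighted average of the per-class covariances.
    pooled = ((n1 - 1) * np.cov(X_pos, rowvar=False)
              + (n2 - 1) * np.cov(X_neg, rowvar=False)) / (n1 + n2 - 2)
    pooled += ridge * np.eye(pooled.shape[0])
    # delta = sqrt((mu1 - mu2)^T  Sigma^{-1}  (mu1 - mu2))
    return float(np.sqrt(mu_diff @ np.linalg.solve(pooled, mu_diff)))
```

For genuinely high‑dimensional data, a shrinkage covariance estimator (e.g. Ledoit–Wolf) would be a more robust choice than the raw pooled covariance used here.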
Authors
- Rose Yvette Bandolo Essomba
- Ernest Fokoué
Paper Information
- arXiv ID: 2601.04149v1
- Categories: stat.ML, cs.LG
- Published: January 7, 2026