[Paper] Evolved SampleWeights for Bias Mitigation: Effectiveness Depends on Optimization Objectives
Source: arXiv - 2511.20909v1
Overview
This paper investigates how to automatically assign sample‑level weights during model training to reduce algorithmic bias. By evolving the weights with a genetic algorithm (GA) and comparing them against a simple heuristic scheme and uniform weighting, the authors show that carefully tuned weights can strike a better balance between predictive accuracy and fairness, provided the right optimization objectives are chosen.
Key Contributions
- Three weighting strategies compared: (1) GA‑evolved weights, (2) analytically derived weights based only on dataset statistics, and (3) uniform (equal) weights.
- Multi‑objective GA design: Each GA run jointly optimizes one predictive metric (accuracy or AUC‑ROC) paired with one fairness metric (demographic parity difference or subgroup false‑negative rate), giving four objective pairs in total.
- Extensive empirical evaluation: Experiments run on 11 public datasets (including two medical datasets) to assess trade‑offs across a wide range of domains.
- Insight on objective selection: The benefit of evolved weights hinges on which pair of metrics the GA is asked to optimize; the accuracy + demographic parity combination yields the most consistent improvements.
- Statistical validation: Significance testing demonstrates that evolved weights outperform the other two strategies on a majority of datasets for the chosen objectives.
Methodology
- Data & Baselines – For each dataset, the authors train a standard classifier (e.g., logistic regression or a shallow neural net) under three weighting regimes (see the sketch after this list):
- Uniform: every sample weight = 1.
- Heuristic: weights derived from class‑imbalance and protected‑group proportions (no learning).
- GA‑evolved: a population of weight vectors is evolved over generations.
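The paper does not ship reference code, but the first two regimes are easy to express with any estimator that accepts per‑sample weights. Below is a minimal sketch using scikit‑learn; the inverse‑frequency formula for the heuristic weights is our assumption standing in for the paper's analytically derived weights:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uniform_weights(y):
    """Baseline: every sample weight = 1."""
    return np.ones(len(y))

def heuristic_weights(y, group):
    """Assumed stand-in for the paper's analytic weights: inverse
    frequency of each (class, protected-group) cell, so that rare
    combinations receive proportionally more weight."""
    w = np.empty(len(y), dtype=float)
    for c in np.unique(y):
        for g in np.unique(group):
            cell = (y == c) & (group == g)
            w[cell] = len(y) / max(cell.sum(), 1)
    return w / w.mean()  # normalize so the average weight is 1

def train_weighted(X, y, w):
    """Any estimator with a sample_weight argument works here."""
    return LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)
```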
- Genetic Algorithm – key components (a simplified sketch follows this sub‑list):
- Encoding: Each individual encodes a weight for every training instance.
- Fitness: A multi‑objective fitness function pairs one predictive score (accuracy or AUC‑ROC) with one fairness score (demographic parity difference or subgroup false‑negative disparity).
- Selection & Variation: Standard tournament selection, crossover, and mutation operators are used. The GA runs until convergence or a fixed generation limit.
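The paper's exact encoding, operators, and hyper‑parameters are not reproduced here. The sketch below is a simplified scalarization of the accuracy + demographic parity pair (the paper treats the objectives multi‑objectively), with tournament selection, uniform crossover, Gaussian mutation, and elitism; all hyper‑parameter values are placeholders:

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score

def demographic_parity_diff(y_pred, group):
    """Gap in positive-prediction rates across protected groups."""
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return max(rates) - min(rates)

def fitness(w, clf, X_tr, y_tr, X_val, y_val, g_val, lam=1.0):
    """Scalarized stand-in for the paper's multi-objective fitness:
    reward validation accuracy, penalize the parity gap."""
    model = clone(clf).fit(X_tr, y_tr, sample_weight=w)
    pred = model.predict(X_val)
    return accuracy_score(y_val, pred) - lam * demographic_parity_diff(pred, g_val)

def evolve_weights(clf, X_tr, y_tr, X_val, y_val, g_val,
                   pop_size=30, gens=50, mut_sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_tr)
    # Encoding: one positive weight per training instance.
    pop = rng.uniform(0.1, 2.0, size=(pop_size, n))
    for _ in range(gens):
        fits = np.array([fitness(w, clf, X_tr, y_tr, X_val, y_val, g_val)
                         for w in pop])
        best = pop[np.argmax(fits)].copy()  # elitism: carry the best forward
        # Tournament selection: the fitter of two random individuals wins.
        pairs = rng.integers(pop_size, size=(pop_size, 2))
        winners = np.where(fits[pairs[:, 0]] >= fits[pairs[:, 1]],
                           pairs[:, 0], pairs[:, 1])
        parents = pop[winners]
        # Uniform crossover between consecutive parents.
        mask = rng.random((pop_size, n)) < 0.5
        children = np.where(mask, parents, np.roll(parents, 1, axis=0))
        # Gaussian mutation, clipped to keep weights positive.
        children += rng.normal(0.0, mut_sigma, size=children.shape)
        pop = np.clip(children, 0.01, None)
        pop[0] = best
    fits = np.array([fitness(w, clf, X_tr, y_tr, X_val, y_val, g_val)
                     for w in pop])
    return pop[np.argmax(fits)]
```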
- Evaluation – After training with the selected weights, the model is evaluated on a held‑out test set. Paired predictive and fairness metrics are recorded, and the authors use Wilcoxon signed‑rank tests to assess whether GA‑evolved weights give statistically significant gains over the baselines (a minimal example of the test follows).
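The significance test itself is a one‑liner with SciPy; the score arrays below are hypothetical placeholders, not the paper's numbers:

```python
from scipy.stats import wilcoxon

# Hypothetical paired per-dataset accuracies (same 11 datasets, two regimes).
ga_scores      = [0.84, 0.79, 0.91, 0.73, 0.88, 0.81, 0.77, 0.90, 0.85, 0.69, 0.82]
uniform_scores = [0.82, 0.78, 0.90, 0.70, 0.86, 0.80, 0.76, 0.88, 0.83, 0.68, 0.80]

# One-sided test: are GA-evolved scores systematically higher?
stat, p = wilcoxon(ga_scores, uniform_scores, alternative="greater")
print(f"Wilcoxon statistic = {stat:.1f}, p = {p:.4f}")
```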
Results & Findings
| Optimization Pair | Datasets (of 11) with statistically significant wins for GA‑evolved weights |
|---|---|
| Accuracy + Demographic Parity | 8 / 11 |
| Accuracy + Subgroup FNR | 5 / 11 |
| AUC + Demographic Parity | 6 / 11 |
| AUC + Subgroup FNR | 4 / 11 |
- Trade‑off quality: GA‑evolved weights consistently locate points on the Pareto front that are closer to the ideal corner (high accuracy, low parity gap) than the other two methods.
- Magnitude of gain: On average, accuracy improves by ~1.5 % and demographic parity gap shrinks by ~3 % relative to uniform weighting.
- Dataset sensitivity: Gains are larger on datasets with pronounced class imbalance or where the protected attribute is highly correlated with the target.
- Objective dependence: When the GA is asked to optimize AUC together with subgroup false‑negative fairness, the improvements are modest, suggesting that the choice of predictive metric matters.
Practical Implications
- Plug‑and‑play fairness layer: Developers can wrap a GA‑based weight optimizer around any off‑the‑shelf classifier to obtain a model that respects a chosen fairness‑accuracy trade‑off without redesigning the learning algorithm (see the wrapper sketch after this list).
- Customizable objectives: By swapping in different fairness or performance metrics, teams can align the optimizer with product‑specific SLAs (e.g., minimizing false negatives for a medical screening tool while keeping demographic parity).
- Reduced engineering overhead: Compared to adversarial debiasing or post‑processing methods, weight evolution works directly on the training data, meaning existing pipelines (feature engineering, hyper‑parameter tuning) stay intact.
- Scalable to medium‑size data: The GA operates on per‑sample weights, so memory scales linearly with the training set. For tens of thousands of records (common in many SaaS or health‑tech use‑cases) the approach runs in a few minutes on a single CPU core.
- Potential for AutoML integration: The weight‑evolution step can be treated as another hyper‑parameter search dimension, enabling automated fairness‑aware model selection in CI/CD pipelines.
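As a sketch of the plug‑and‑play idea from the first bullet (the class name and interface are our own, not from the paper), the weight evolution can be packaged as a scikit‑learn‑style meta‑estimator that reuses evolve_weights() from the Methodology sketch:

```python
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.model_selection import train_test_split

class FairWeightedClassifier(BaseEstimator, ClassifierMixin):
    """Hypothetical wrapper: evolves per-sample weights against a
    validation split, then refits the base estimator with them."""

    def __init__(self, base_estimator):
        self.base_estimator = base_estimator

    def fit(self, X, y, group):
        X_tr, X_val, y_tr, y_val, g_tr, g_val = train_test_split(
            X, y, group, test_size=0.25, stratify=y, random_state=0)
        w = evolve_weights(self.base_estimator,
                           X_tr, y_tr, X_val, y_val, g_val)
        self.model_ = clone(self.base_estimator).fit(
            X_tr, y_tr, sample_weight=w)
        return self

    def predict(self, X):
        return self.model_.predict(X)
```

The existing feature pipeline and hyper‑parameter tuning stay untouched; only the final fit gains a weight‑search step, which is exactly the reduced‑overhead argument made above.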
Limitations & Future Work
- Scalability to massive datasets: The per‑sample encoding makes the GA expensive for millions of rows; future work could explore surrogate models or clustering‑based weight sharing.
- Metric selection bias: The study only examined four metrics; real‑world deployments may require other fairness notions (e.g., equalized odds) or domain‑specific utility functions.
- Static weighting: Weights are fixed after training; dynamic or instance‑wise weighting at inference time (e.g., based on context) remains unexplored.
- Robustness to noisy protected attributes: The approach assumes accurate group labels; handling label uncertainty or multi‑protected‑attribute scenarios is an open challenge.
Bottom line: Evolving sample weights with a multi‑objective genetic algorithm offers a practical, model‑agnostic path to better fairness‑performance trade‑offs, especially when developers can clearly define the objectives they care about. As tooling matures, this technique could become a standard component of responsible AI pipelines.