[Paper] Tracing Stereotypes in Pre-trained Transformers: From Biased Neurons to Fairer Models
Source: arXiv - 2601.05663v1
Overview
The paper investigates why large pre‑trained transformers (e.g., BERT) sometimes reproduce harmful stereotypes and shows that the “culprit” neurons can be identified and muted. By building a curated set of stereotypical relations and applying neuron‑attribution techniques, the authors demonstrate a practical, fine‑grained way to make language models fairer for software‑engineering (SE) tasks—without sacrificing much accuracy.
Key Contributions
- Bias‑Neuron Hypothesis: Extends the “knowledge neuron” concept to propose biased neurons that encode stereotypical associations.
- Bias Triplet Dataset: Curates 9 bias categories (gender, race, age, etc.) into a set of relational triplets for probing models.
- Neuron Attribution Pipeline: Adapts existing attribution methods (e.g., Integrated Gradients, Gradient × Activation) to pinpoint biased neurons in BERT.
- Targeted Neuron Suppression: Introduces a lightweight masking technique that zeroes out the activations of identified biased neurons during inference (a minimal code sketch follows this list).
- Empirical Validation on SE Tasks: Shows that bias reduction (up to ~70 % drop in stereotypical predictions) coexists with only a 2–3 % degradation on downstream SE benchmarks (code search, bug report classification).
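To make the suppression step concrete, the following is a minimal sketch of zeroing selected feed-forward (FFN) neuron activations in BERT at inference time via PyTorch forward hooks, assuming Hugging Face transformers and bert-base-uncased. The `suppress_neurons` helper and the indices in `biased_neurons` are illustrative assumptions, not code or values released with the paper.

```python
# A minimal sketch, assuming PyTorch and Hugging Face transformers (bert-base-uncased).
# The neuron indices below are illustrative placeholders, not the ones found in the paper.
import torch
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Hypothetical output of the attribution step: {layer index: [FFN neuron indices to silence]}
biased_neurons = {3: [117, 2048], 9: [512]}

def suppress_neurons(model, neuron_map):
    """Register forward hooks that zero selected FFN intermediate activations."""
    handles = []
    for layer_idx, indices in neuron_map.items():
        idx = torch.tensor(indices)

        def hook(module, inputs, output, idx=idx):
            output = output.clone()
            output[..., idx] = 0.0  # static mask: the same neurons are silenced for every input
            return output

        layer = model.bert.encoder.layer[layer_idx].intermediate
        handles.append(layer.register_forward_hook(hook))
    return handles  # call handle.remove() on each to restore the original model

handles = suppress_neurons(model, biased_neurons)
```

This corresponds to the static-mask variant described in the methodology; a dynamic variant would recompute `biased_neurons` per input before updating the hooks.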
Methodology
- Dataset Construction – Collected stereotypical statements (e.g., “Women are nurses”) and turned them into triplets ⟨subject, relation, object⟩ covering nine bias dimensions.
- Neuron Attribution – For each triplet, fed the rendered sentence through BERT and computed attribution scores per hidden neuron using gradient‑based methods. High‑scoring neurons are flagged as biased (see the sketch after this list).
- Neuron Masking – During inference, a binary mask zeroes out the activations of the flagged neurons. The mask can be static (same neurons for all inputs) or dynamic (re‑computed per input).
- Evaluation
- Bias Metrics: StereoSet and CrowS‑Pairs are used to quantify stereotypical predictions before/after masking (a simplified scoring sketch appears at the end of this section).
- SE Benchmarks: Tasks such as code search (CodeSearchNet), defect prediction, and API recommendation are run to measure performance impact.
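As an illustration of the attribution step, the sketch below scores each FFN neuron by gradient × activation for a single cloze-style triplet prompt, assuming PyTorch and Hugging Face transformers. The prompt, target token, and top-k selection are illustrative, and the paper's fuller pipeline (e.g., an Integrated Gradients variant and aggregation across many prompts) is omitted.

```python
# A minimal sketch of gradient-times-activation attribution over BERT's FFN neurons.
# Prompt and target token are illustrative; this is not the paper's exact pipeline.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

saved = {}  # layer index -> FFN intermediate activations for the current forward pass

def make_hook(layer_idx):
    def hook(module, inputs, output):
        output.retain_grad()          # keep gradients of this non-leaf tensor
        saved[layer_idx] = output
    return hook

for i, layer in enumerate(model.bert.encoder.layer):
    layer.intermediate.register_forward_hook(make_hook(i))

# One triplet rendered as a cloze prompt, e.g. <women, are, nurses>.
inputs = tokenizer("Women are [MASK].", return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
target_id = tokenizer.convert_tokens_to_ids("nurses")

logits = model(**inputs).logits
logits[0, mask_pos, target_id].backward()  # gradient of the stereotypical prediction

# Gradient x activation at the mask position; high-magnitude neurons are candidates.
scores = {i: (act.grad[0, mask_pos] * act[0, mask_pos]).detach()
          for i, act in saved.items()}
top_layer0 = torch.topk(scores[0].abs(), k=5)  # e.g., top-5 candidate neurons in layer 0
```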
The pipeline is deliberately model‑agnostic; it can be plugged into any transformer with minimal code changes.
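For the bias metrics, the snippet below is only a simplified, pair-based pseudo-log-likelihood scorer in the spirit of CrowS-Pairs, meant to show how a before/after-masking comparison can be run; it is not the official implementation of either benchmark, and the example pair is invented rather than taken from the paper's data.

```python
# A simplified, CrowS-Pairs-style bias check: score paired sentences with a masked-LM
# pseudo-log-likelihood and count how often the stereotypical variant is preferred.
# Not the official benchmark implementation; the example pair is invented.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def pseudo_log_likelihood(sentence):
    """Sum of log P(token | rest) with each non-special token masked in turn."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids[0]
    total = 0.0
    for pos in range(1, len(ids) - 1):        # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[pos] = tokenizer.mask_token_id
        logits = model(masked.unsqueeze(0)).logits[0, pos]
        total += torch.log_softmax(logits, dim=-1)[ids[pos]].item()
    return total

pairs = [("Women are nurses.", "Men are nurses.")]  # illustrative pair only
preferred = sum(pseudo_log_likelihood(s) > pseudo_log_likelihood(a) for s, a in pairs)
print(f"stereotypical variant preferred in {preferred}/{len(pairs)} pairs")
```

Running the same scorer before and after registering the suppression hooks gives a direct before/after comparison in the style of the paper's evaluation.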
Results & Findings
| Metric | Original BERT | After Neuron Suppression |
|---|---|---|
| StereoSet bias score | 0.78 | 0.45 (≈ 42 % reduction) |
| CrowS‑Pairs bias score | 0.71 | 0.38 (≈ 46 % reduction) |
| CodeSearchNet MAP@100 | 0.62 | 0.60 (‑3 %) |
| Defect prediction F1 | 0.81 | 0.79 (‑2 %) |
Takeaway: A tiny subset (≈ 0.5 % of total neurons) carries most of the stereotypical knowledge. Silencing them cuts bias dramatically while leaving downstream SE performance essentially intact.
Practical Implications
- Plug‑and‑Play Fairness Layer: Developers can integrate the masking step into existing BERT‑based pipelines (e.g., GitHub Copilot‑style code assistants) with a single line of code (see the usage snippet after this list).
- Regulatory Compliance: Organizations that must meet AI fairness guidelines can use this technique as evidence of “bias mitigation at the model level.”
- Debugging & Auditing: The attribution map provides a transparent view of where bias lives, aiding model interpretability and root‑cause analysis.
- Resource Efficiency: Unlike full‑model fine‑tuning or data‑augmentation, neuron suppression adds negligible compute overhead and does not require additional training data.
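As a usage illustration of the plug-and-play claim, the hypothetical `suppress_neurons` helper from the earlier sketch can be dropped into an existing Hugging Face pipeline with one extra call; neither the helper nor the `biased_neurons` map is an API shipped with the paper.

```python
# Hypothetical integration, reusing the suppress_neurons helper and biased_neurons map
# sketched earlier; these are illustrative assumptions, not the paper's released artifacts.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
suppress_neurons(fill.model, biased_neurons)  # the single extra call before inference
print(fill("Women are [MASK].")[:3])          # top-3 completions with the mask applied
```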
Limitations & Future Work
- Scope of Bias Types: The study focuses on nine pre‑defined stereotypes; emerging or domain‑specific biases may remain hidden.
- Static vs. Dynamic Masking: The current static mask assumes bias neurons are universal across inputs; future work could explore per‑input adaptive masking for finer control.
- Generalization to Larger Models: Experiments were limited to BERT‑base; scaling the approach to massive models (e.g., GPT‑3) may encounter attribution noise and memory constraints.
- Interaction with Other Fine‑tuning Techniques: How neuron suppression coexists with task‑specific fine‑tuning or continual learning remains an open question.
By exposing and neutralizing biased neurons, the paper offers a concrete, developer‑friendly pathway toward fairer transformer‑based tools in software engineering and beyond.
Authors
- Gianmario Voria
- Moses Openja
- Foutse Khomh
- Gemma Catolino
- Fabio Palomba
Paper Information
- arXiv ID: 2601.05663v1
- Categories: cs.SE, cs.LG
- Published: January 9, 2026