[Paper] Tracing Stereotypes in Pre-trained Transformers: From Biased Neurons to Fairer Models
Source: arXiv - 2601.05663v1
Overview
The paper investigates why large pre‑trained transformers (e.g., BERT) sometimes reproduce harmful stereotypes and shows that the “culprit” neurons can be identified and muted. By building a curated set of stereotypical relations and applying neuron‑attribution techniques, the authors demonstrate a practical, fine‑grained way to make language models fairer for software‑engineering (SE) tasks—without sacrificing much accuracy.
Key Contributions
- Bias‑Neuron Hypothesis: Extends the “knowledge neuron” concept to propose biased neurons that encode stereotypical associations.
- Bias Triplet Dataset: Curates 9 bias categories (gender, race, age, etc.) into a set of relational triplets for probing models.
- Neuron Attribution Pipeline: Adapts existing attribution methods (e.g., Integrated Gradients, Gradient × Activation) to pinpoint biased neurons in BERT.
- Targeted Neuron Suppression: Introduces a lightweight masking technique that zeroes out the activations of identified biased neurons during inference (a minimal code sketch follows this list).
- Empirical Validation on SE Tasks: Shows that bias reduction (up to ~70 % drop in stereotypical predictions) coexists with only a 2–3 % degradation on downstream SE benchmarks (code search, bug report classification).
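To make the suppression step concrete, the following is a minimal sketch of zeroing selected feed-forward (FFN) neuron activations in BERT at inference time via PyTorch forward hooks, assuming Hugging Face transformers and bert-base-uncased. The `suppress_neurons` helper and the indices in `biased_neurons` are illustrative assumptions, not code or values released with the paper.

```python
# A minimal sketch, assuming PyTorch and Hugging Face transformers (bert-base-uncased).
# The neuron indices below are illustrative placeholders, not the ones found in the paper.
import torch
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Hypothetical output of the attribution step: {layer index: [FFN neuron indices to silence]}
biased_neurons = {3: [117, 2048], 9: [512]}

def suppress_neurons(model, neuron_map):
    """Register forward hooks that zero selected FFN intermediate activations."""
    handles = []
    for layer_idx, indices in neuron_map.items():
        idx = torch.tensor(indices)

        def hook(module, inputs, output, idx=idx):
            output = output.clone()
            output[..., idx] = 0.0  # static mask: the same neurons are silenced for every input
            return output

        layer = model.bert.encoder.layer[layer_idx].intermediate
        handles.append(layer.register_forward_hook(hook))
    return handles  # call handle.remove() on each to restore the original model

handles = suppress_neurons(model, biased_neurons)
```

This corresponds to the static-mask variant described in the methodology; a dynamic variant would recompute `biased_neurons` per input before updating the hooks.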
Methodology
- Dataset Construction – Collected stereotypical statements (e.g., “Women are nurses”) and turned them into triplets ⟨subject, relation, object⟩ covering nine bias dimensions.
- Neuron Attribution – For each triplet, fed the rendered sentence through BERT and computed attribution scores per hidden neuron using gradient‑based methods. High‑scoring neurons are flagged as biased (see the sketch after this list).
- Neuron Masking – During inference, a binary mask zeroes out the activations of the flagged neurons. The mask can be static (same neurons for all inputs) or dynamic (re‑computed per input).
- Evaluation
- Bias Metrics: StereoSet and CrowS‑Pairs are used to quantify stereotypical predictions before/after masking (a simplified scoring sketch appears at the end of this section).
- SE Benchmarks: Tasks such as code search (CodeSearchNet), defect prediction, and API recommendation are run to measure performance impact.
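As an illustration of the attribution step, the sketch below scores each FFN neuron by gradient × activation for a single cloze-style triplet prompt, assuming PyTorch and Hugging Face transformers. The prompt, target token, and top-k selection are illustrative, and the paper's fuller pipeline (e.g., an Integrated Gradients variant and aggregation across many prompts) is omitted.

```python
# A minimal sketch of gradient-times-activation attribution over BERT's FFN neurons.
# Prompt and target token are illustrative; this is not the paper's exact pipeline.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

saved = {}  # layer index -> FFN intermediate activations for the current forward pass

def make_hook(layer_idx):
    def hook(module, inputs, output):
        output.retain_grad()          # keep gradients of this non-leaf tensor
        saved[layer_idx] = output
    return hook

for i, layer in enumerate(model.bert.encoder.layer):
    layer.intermediate.register_forward_hook(make_hook(i))

# One triplet rendered as a cloze prompt, e.g. <women, are, nurses>.
inputs = tokenizer("Women are [MASK].", return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
target_id = tokenizer.convert_tokens_to_ids("nurses")

logits = model(**inputs).logits
logits[0, mask_pos, target_id].backward()  # gradient of the stereotypical prediction

# Gradient x activation at the mask position; high-magnitude neurons are candidates.
scores = {i: (act.grad[0, mask_pos] * act[0, mask_pos]).detach()
          for i, act in saved.items()}
top_layer0 = torch.topk(scores[0].abs(), k=5)  # e.g., top-5 candidate neurons in layer 0
```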
The pipeline is deliberately model‑agnostic; it can be plugged into any transformer with minimal code changes.
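For the bias metrics, the snippet below is only a simplified, pair-based pseudo-log-likelihood scorer in the spirit of CrowS-Pairs, meant to show how a before/after-masking comparison can be run; it is not the official implementation of either benchmark, and the example pair is invented rather than taken from the paper's data.

```python
# A simplified, CrowS-Pairs-style bias check: score paired sentences with a masked-LM
# pseudo-log-likelihood and count how often the stereotypical variant is preferred.
# Not the official benchmark implementation; the example pair is invented.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def pseudo_log_likelihood(sentence):
    """Sum of log P(token | rest) with each non-special token masked in turn."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids[0]
    total = 0.0
    for pos in range(1, len(ids) - 1):        # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[pos] = tokenizer.mask_token_id
        logits = model(masked.unsqueeze(0)).logits[0, pos]
        total += torch.log_softmax(logits, dim=-1)[ids[pos]].item()
    return total

pairs = [("Women are nurses.", "Men are nurses.")]  # illustrative pair only
preferred = sum(pseudo_log_likelihood(s) > pseudo_log_likelihood(a) for s, a in pairs)
print(f"stereotypical variant preferred in {preferred}/{len(pairs)} pairs")
```

Running the same scorer before and after registering the suppression hooks gives a direct before/after comparison in the style of the paper's evaluation.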
Results & Findings
| Metric | Original BERT | After Neuron Suppression |
|---|---|---|
| StereoSet bias score | 0.78 | 0.45 (≈ 42 % reduction) |
| CrowS‑Pairs bias score | 0.71 | 0.38 (≈ 46 % reduction) |
| CodeSearchNet MAP@100 | 0.62 | 0.60 (‑3 %) |
| Defect prediction F1 | 0.81 | 0.79 (‑2 %) |
Takeaway: A tiny subset (≈ 0.5 % of total neurons) carries most of the stereotypical knowledge. Silencing them cuts bias dramatically while leaving downstream SE performance essentially intact.
Practical Implications
- Plug‑and‑Play Fairness Layer: Developers can integrate the masking step into existing BERT‑based pipelines (e.g., GitHub Copilot‑style code assistants) with a single line of code (see the usage snippet after this list).
- Regulatory Compliance: Organizations that must meet AI fairness guidelines can use this technique as evidence of “bias mitigation at the model level.”
- Debugging & Auditing: The attribution map provides a transparent view of where bias lives, aiding model interpretability and root‑cause analysis.
- Resource Efficiency: Unlike full‑model fine‑tuning or data‑augmentation, neuron suppression adds negligible compute overhead and does not require additional training data.
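As a usage illustration of the plug-and-play claim, the hypothetical `suppress_neurons` helper from the earlier sketch can be dropped into an existing Hugging Face pipeline with one extra call; neither the helper nor the `biased_neurons` map is an API shipped with the paper.

```python
# Hypothetical integration, reusing the suppress_neurons helper and biased_neurons map
# sketched earlier; these are illustrative assumptions, not the paper's released artifacts.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
suppress_neurons(fill.model, biased_neurons)  # the single extra call before inference
print(fill("Women are [MASK].")[:3])          # top-3 completions with the mask applied
```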
Limitations & Future Work
- Scope of Bias Types: The study focuses on nine pre‑defined stereotypes; emerging or domain‑specific biases may remain hidden.
- Static vs. Dynamic Masking: The current static mask assumes bias neurons are universal across inputs; future work could explore per‑input adaptive masking for finer control.
- Generalization to Larger Models: Experiments were limited to BERT‑base; scaling the approach to massive models (e.g., GPT‑3) may encounter attribution noise and memory constraints.
- Interaction with Other Fine‑tuning Techniques: How neuron suppression coexists with task‑specific fine‑tuning or continual learning remains an open question.
By exposing and neutralizing biased neurons, the paper offers a concrete, developer‑friendly pathway toward fairer transformer‑based tools in software engineering and beyond.
Authors
- Gianmario Voria
- Moses Openja
- Foutse Khomh
- Gemma Catolino
- Fabio Palomba
Paper Information
- arXiv ID: 2601.05663v1
- Categories: cs.SE, cs.LG
- Published: January 9, 2026