[Paper] Tracing Stereotypes in Pre-trained Transformers: From Biased Neurons to Fairer Models

Published: January 9, 2026
3 min read
Source: arXiv - 2601.05663v1

Overview

The paper investigates why large pre‑trained transformers (e.g., BERT) sometimes reproduce harmful stereotypes and shows that the “culprit” neurons can be identified and muted. By building a curated set of stereotypical relations and applying neuron‑attribution techniques, the authors demonstrate a practical, fine‑grained way to make language models fairer for software‑engineering (SE) tasks—without sacrificing much accuracy.

Key Contributions

  • Bias‑Neuron Hypothesis: Extends the “knowledge neuron” concept to propose biased neurons that encode stereotypical associations.
  • Bias Triplet Dataset: Curates nine bias categories (gender, race, age, etc.) into a set of relational triplets for probing models (see the sketch after this list).
  • Neuron Attribution Pipeline: Adapts existing attribution methods (e.g., Integrated Gradients, Gradient × Activation) to pinpoint biased neurons in BERT.
  • Targeted Neuron Suppression: Introduces a lightweight masking technique that zeroes out the activations of identified biased neurons during inference.
  • Empirical Validation on SE Tasks: Shows that bias reduction (up to ~70 % drop in stereotypical predictions) coexists with only a 2–3 % loss on downstream SE benchmarks (code search, bug report classification).
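
To make the triplet format concrete, here is a minimal sketch of how ⟨subject, relation, object⟩ probes of this kind could be represented and rendered into cloze sentences for a masked language model. The field names, category labels, and example triplets are illustrative assumptions, not the paper's released schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BiasTriplet:
    subject: str    # a demographic group, e.g., "women"
    relation: str   # the stereotypical association being probed
    obj: str        # the stereotyped role or attribute
    category: str   # one of the nine bias dimensions (gender, race, age, ...)

# Illustrative probes only; the paper's actual wording and coverage may differ.
PROBES = [
    BiasTriplet("women", "work as", "nurses", "gender"),
    BiasTriplet("elderly people", "are", "bad with technology", "age"),
]

def to_cloze(t: BiasTriplet) -> str:
    """Render a triplet as a masked sentence for probing a masked language model."""
    return f"{t.subject.capitalize()} {t.relation} [MASK]."

for t in PROBES:
    print(to_cloze(t))   # e.g., "Women work as [MASK]."
```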

Methodology

  1. Dataset Construction – Collected stereotypical statements (e.g., “Women are nurses”) and turned them into triplets ⟨subject, relation, object⟩ covering nine bias dimensions.
  2. Neuron Attribution – For each triplet, fed the sentence through BERT and computed attribution scores per hidden neuron using gradient‑based methods; high‑scoring neurons are flagged as biased (see the sketch after this list).
  3. Neuron Masking – During inference, a binary mask zeroes out the activations of the flagged neurons. The mask can be static (same neurons for all inputs) or dynamic (re‑computed per input).
  4. Evaluation
    • Bias Metrics: StereoSet and CrowS‑Pairs are used to quantify stereotypical predictions before/after masking.
    • SE Benchmarks: Tasks such as code search (CodeSearchNet), defect prediction, and API recommendation are run to measure performance impact.
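
As referenced in step 2, the sketch below illustrates a Gradient × Activation‑style attribution over the feed‑forward (FFN) activations of a masked LM. It is a simplified stand‑in for the paper's pipeline: the checkpoint (bert‑base‑uncased), the probed layer, the target word, and the top‑k cut‑off are all assumptions made for illustration.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def attribute_ffn_neurons(sentence: str, target_word: str, layer: int = 9) -> torch.Tensor:
    """Score one layer's intermediate FFN neurons by |gradient x activation|
    w.r.t. the logit of a stereotyped target word at the [MASK] position."""
    enc = tok(sentence, return_tensors="pt")
    mask_pos = (enc.input_ids == tok.mask_token_id).nonzero()[0, 1]
    target_id = tok.convert_tokens_to_ids(target_word)

    acts = {}
    def grab(_module, _inputs, output):
        output.retain_grad()        # keep gradients on the intermediate activations
        acts["ffn"] = output
    handle = model.bert.encoder.layer[layer].intermediate.register_forward_hook(grab)

    logits = model(**enc).logits
    logits[0, mask_pos, target_id].backward()
    handle.remove()

    a = acts["ffn"][0, mask_pos]        # activations at the masked position
    g = acts["ffn"].grad[0, mask_pos]   # their gradients
    return (a * g).abs()                # one attribution score per FFN neuron

scores = attribute_ffn_neurons("Women work as [MASK].", "nurses", layer=9)
biased = torch.topk(scores, k=20).indices   # flag the highest-scoring neurons
```

A full pipeline would presumably aggregate such scores over many triplets within each bias category before flagging neurons; the sketch shows only the per‑sentence scoring step.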

The pipeline is deliberately model‑agnostic; it can be plugged into any transformer with minimal code changes.
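
Continuing from the attribution sketch above (and reusing its model, tok, and biased variables), the following sketch shows one way the static mask from step 3 could be applied at inference time with a standard PyTorch forward hook; the layer index is the same illustrative assumption as before.

```python
import torch

def make_suppression_hook(biased_idx: torch.Tensor):
    """Build a forward hook that zeroes the activations of the flagged neurons."""
    def hook(_module, _inputs, output):
        output = output.clone()          # do not modify the original tensor in place
        output[..., biased_idx] = 0.0    # silence the flagged FFN neurons
        return output                    # a returned tensor replaces the module output
    return hook

# Static mask: the same neurons are suppressed for every input.
handle = model.bert.encoder.layer[9].intermediate.register_forward_hook(
    make_suppression_hook(biased)
)

# The masked model is then used exactly like the original one.
with torch.no_grad():
    debiased_logits = model(**tok("Women work as [MASK].", return_tensors="pt")).logits

handle.remove()   # detach the mask to restore the original behaviour
```

A dynamic variant would recompute the flagged indices per input before registering the hook, at the cost of an extra attribution pass.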

Results & Findings

| Metric | Original BERT | After Neuron Suppression |
| --- | --- | --- |
| StereoSet bias score | 0.78 | 0.45 (≈ 42 % reduction) |
| CrowS‑Pairs accuracy (bias) | 0.71 | 0.38 (≈ 46 % reduction) |
| CodeSearchNet MAP@100 | 0.62 | 0.60 (‑3 %) |
| Defect prediction F1 | 0.81 | 0.79 (‑2 %) |

Takeaway: A tiny subset (≈ 0.5 % of total neurons) carries most of the stereotypical knowledge. Silencing them cuts bias dramatically while leaving downstream SE performance essentially intact.

Practical Implications

  • Plug‑and‑Play Fairness Layer: Developers can integrate the masking step into existing BERT‑based pipelines (e.g., GitHub Copilot‑style code assistants) with a single line of code (see the usage sketch after this list).
  • Regulatory Compliance: Organizations that must meet AI fairness guidelines can use this technique as evidence of “bias mitigation at the model level.”
  • Debugging & Auditing: The attribution map provides a transparent view of where bias lives, aiding model interpretability and root‑cause analysis.
  • Resource Efficiency: Unlike full‑model fine‑tuning or data‑augmentation, neuron suppression adds negligible compute overhead and does not require additional training data.
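
To illustrate the plug‑and‑play point from the first bullet, the snippet below reuses make_suppression_hook and the flagged indices from the earlier sketches inside an off‑the‑shelf Hugging Face fill‑mask pipeline. The single registration line is the only change to an otherwise unmodified workflow; this is an illustration of the idea, not the paper's released tooling.

```python
from transformers import pipeline

# The "one line": attach the suppression mask to the already-loaded model.
model.bert.encoder.layer[9].intermediate.register_forward_hook(make_suppression_hook(biased))

# Downstream usage is unchanged; the pipeline now runs the masked model.
fill = pipeline("fill-mask", model=model, tokenizer=tok)
print(fill("Women work as [MASK].", top_k=5))
```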

Limitations & Future Work

  • Scope of Bias Types: The study focuses on nine pre‑defined stereotypes; emerging or domain‑specific biases may remain hidden.
  • Static vs. Dynamic Masking: The current static mask assumes bias neurons are universal across inputs; future work could explore per‑input adaptive masking for finer control.
  • Generalization to Larger Models: Experiments were limited to BERT‑base; scaling the approach to massive models (e.g., GPT‑3) may encounter attribution noise and memory constraints.
  • Interaction with Other Fine‑tuning Techniques: How neuron suppression coexists with task‑specific fine‑tuning or continual learning remains an open question.

By exposing and neutralizing biased neurons, the paper offers a concrete, developer‑friendly pathway toward fairer transformer‑based tools in software engineering and beyond.

Authors

  • Gianmario Voria
  • Moses Openja
  • Foutse Khomh
  • Gemma Catolino
  • Fabio Palomba

Paper Information

  • arXiv ID: 2601.05663v1
  • Categories: cs.SE, cs.LG
  • Published: January 9, 2026
  • PDF: Download PDF