[Paper] From If-Statements to ML Pipelines: Revisiting Bias in Code-Generation
Source: arXiv - 2604.21716v1
Overview
The paper “From If‑Statements to ML Pipelines: Revisiting Bias in Code‑Generation” shows that the common practice of measuring bias in AI‑generated code with tiny conditional snippets dramatically underestimates the problem. By probing large language models (LLMs) that generate full machine‑learning pipelines, the authors reveal that bias enters during feature selection far more often than previously thought, raising red flags for any real‑world deployment that relies on generated code.
Key Contributions
- Real‑world bias benchmark: Introduces a novel evaluation suite that asks LLMs to synthesize end‑to‑end ML pipelines (data preprocessing, feature selection, model training) instead of isolated if‑statements.
- Empirical bias gap: Demonstrates that sensitive attributes (e.g., race, gender) appear in 87.7 % of generated pipelines, versus 59.2 % in the traditional conditional‑statement benchmark.
- Cross‑model analysis: Tests both code‑specialized models (e.g., Code‑Llama, StarCoder) and general‑purpose instruction‑tuned models (e.g., GPT‑4, Claude), finding the bias gap persists across architectures.
- Robustness checks: Shows the bias discrepancy holds under various prompt‑level mitigations, different numbers of protected attributes, and pipelines of varying difficulty (simple linear models to complex ensembles).
- Critical insight: Argues that simple conditional statements are insufficient proxies for bias evaluation, urging the community to adopt richer, task‑centric benchmarks.
Methodology
- Task definition: The authors design a set of realistic ML‑pipeline generation prompts (e.g., “Create a credit‑scoring model using the provided dataset”). Each prompt includes a list of potential features, some of which are protected (race, gender, etc.) and some non‑protected (favorite color, zip code).
- Model selection: Six LLMs are evaluated—three code‑oriented (Code‑Llama 13B, StarCoder 15B, Codex) and three general‑purpose instruction‑tuned (GPT‑4, Claude 2, LLaMA‑2‑Chat).
- Prompt variations: For each model, the authors experiment with (a) plain prompts, (b) prompts that explicitly ask the model to avoid bias, and (c) prompts that provide “bias‑mitigation” examples (see the prompt‑construction sketch after this list).
- Bias detection: After generation, the pipeline code is parsed to extract the feature‑selection step. The presence of any protected attribute in the selected feature set is counted as a bias instance (a simplified parsing sketch follows this list).
- Baseline comparison: The same models are also asked to generate simple if‑statement snippets that encode a decision rule (e.g., “if age > 18 then approve”). The frequency of protected attributes in these snippets serves as the traditional benchmark.
- Statistical analysis: Results are aggregated across 500 generated pipelines per model, and significance is assessed with chi‑square tests (a worked example closes this list).
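To make the prompt design concrete, here is a minimal sketch of how the three prompt variants could be constructed. The wording, feature names, and `build_prompts` helper are illustrative assumptions, not the authors’ exact prompts:

```python
# Hypothetical reconstruction of the paper's three prompt variants.
# Feature names and wording are illustrative, not taken from the paper.

PROTECTED = ["race", "gender"]
NON_PROTECTED = ["income", "favorite_color", "zip_code"]

BASE = (
    "Create a credit-scoring model using the provided dataset. "
    "Available features: {features}."
)
MITIGATION_INSTRUCTION = " Do not use protected attributes such as race or gender."
MITIGATION_EXAMPLE = (
    " Example of bias-safe feature selection: "
    "features = ['income', 'zip_code']  # protected attributes excluded"
)

def build_prompts(protected, non_protected):
    """Return the (a) plain, (b) avoid-bias, and (c) example-based prompt variants."""
    features = ", ".join(protected + non_protected)
    plain = BASE.format(features=features)
    return {
        "plain": plain,
        "avoid_bias": plain + MITIGATION_INSTRUCTION,
        "with_example": plain + MITIGATION_EXAMPLE,
    }

if __name__ == "__main__":
    for name, prompt in build_prompts(PROTECTED, NON_PROTECTED).items():
        print(f"--- {name} ---\n{prompt}\n")
```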
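The paper parses each generated pipeline to isolate the feature‑selection step; the exact extraction procedure is not reproduced here. Below is a minimal sketch of the general idea, assuming pipelines select features via string column names (e.g., pandas‑style indexing). The `PROTECTED` set and function name are assumptions for illustration:

```python
import ast

PROTECTED = {"race", "gender", "ethnicity", "religion"}  # illustrative set

def protected_features_used(generated_code: str) -> set[str]:
    """Parse generated pipeline code and collect protected-attribute names
    that appear as string literals (e.g., df[["race", "income"]])."""
    try:
        tree = ast.parse(generated_code)
    except SyntaxError:
        # Unparseable generations would need to be skipped or flagged separately.
        return set()
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if node.value.lower() in PROTECTED:
                found.add(node.value.lower())
    return found

# Example: a generated snippet that selects features by column name.
snippet = 'X = df[["race", "income", "zip_code"]]\nmodel.fit(X, y)'
print(protected_features_used(snippet))  # {'race'} -> counted as a bias instance
```

A string‑literal scan is a simplification: it checks the whole generation rather than only the feature‑selection step, so a production auditor would want to trace which columns actually reach the model.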
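For the significance test, the per‑model counts of biased vs. clean generations form a 2×2 contingency table that can be passed to a chi‑square test. The raw counts below are illustrative reconstructions from the reported rates (the paper publishes percentages, not tables):

```python
from scipy.stats import chi2_contingency

# Counts out of 500 generations per setting; reconstructed from the
# reported rates (~87.7% pipelines, 59.2% if-statements), not actual data.
pipeline_biased, pipeline_clean = 439, 61   # ~87.8% of 500
snippet_biased, snippet_clean = 296, 204    # 59.2% of 500

table = [[pipeline_biased, pipeline_clean],
         [snippet_biased, snippet_clean]]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, p={p:.2e}")  # a small p-value -> the bias gap is significant
```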
Results & Findings
| Model Category | Sensitive Feature Appears in Pipelines | Sensitive Feature Appears in If‑Statements |
|---|---|---|
| Code‑specialized | 88.3 % | 60.1 % |
| General‑purpose | 87.1 % | 58.3 % |
- Bias persists despite mitigation prompts: Even when explicitly instructed to avoid protected attributes, the inclusion rate drops by only ~3 %, staying well above the conditional baseline.
- Feature‑selection logic is the hotspot: Models correctly omit irrelevant protected attributes (e.g., dropping “race” when “favorite color” is more predictive) but still tend to add at least one protected attribute, indicating a systematic bias toward over‑reliance on demographic data.
- Difficulty scaling: More complex pipelines (e.g., multi‑stage preprocessing + ensemble models) show a slightly higher bias rate (≈90 %) than simpler linear‑regression pipelines (≈85 %).
- Robustness: Varying the number of protected attributes from 2 to 6 does not materially change the bias gap, confirming the effect is not an artifact of a particular attribute set.
Practical Implications
- Tooling risk: Developers who rely on LLMs to auto‑generate data‑science code (e.g., “Copilot for ML”) may unintentionally embed discriminatory logic into production systems, even if they run a quick bias‑check on simple conditionals.
- Compliance challenges: Regulations such as the EU AI Act or US Fair Credit Reporting Act require demonstrable mitigation of disparate impact. The hidden bias in feature selection could make compliance audits far more difficult.
- Need for richer evaluation pipelines: Companies should incorporate end‑to‑end bias testing (including feature‑selection audits) into their CI/CD pipelines for AI‑generated code, rather than relying on token‑level or snippet‑level checks (a minimal audit‑gate sketch follows this list).
- Prompt engineering limits: Simple “avoid protected attributes” instructions are insufficient; more sophisticated guardrails (e.g., constrained decoding, external feature‑audit modules) are required.
- Opportunity for new products: The findings open a market for bias‑monitoring SDKs that automatically parse generated pipelines, flag protected features, and suggest replacements.
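As referenced above, here is a minimal sketch of what such a CI audit gate could look like, assuming generated files are passed on the command line. The attribute list, regex heuristic, and script are assumptions for illustration, not tooling from the paper:

```python
#!/usr/bin/env python3
"""Hypothetical CI gate: fail the build if AI-generated pipeline code
references protected attributes. The attribute list and the string-literal
heuristic are assumptions, not from the paper."""
import pathlib
import re
import sys

PROTECTED = {"race", "gender", "ethnicity", "religion"}
PATTERN = re.compile(r"['\"](" + "|".join(PROTECTED) + r")['\"]", re.IGNORECASE)

def audit(path: pathlib.Path) -> list[str]:
    """Return one finding per line that mentions a protected attribute."""
    hits = []
    for line_no, line in enumerate(path.read_text().splitlines(), start=1):
        match = PATTERN.search(line)
        if match:
            hits.append(f"{path}:{line_no}: protected feature '{match.group(1)}'")
    return hits

if __name__ == "__main__":
    findings = [h for f in sys.argv[1:] for h in audit(pathlib.Path(f))]
    print("\n".join(findings) or "No protected features found.")
    sys.exit(1 if findings else 0)  # non-zero exit fails the CI job
```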
Limitations & Future Work
- Scope of tasks: The study focuses on tabular ML pipelines for binary classification; extending to NLP pipelines, reinforcement‑learning agents, or time‑series models may reveal different bias dynamics.
- Dataset bias: The synthetic datasets used may not capture the full complexity of real‑world feature correlations, potentially inflating or deflating bias rates.
- Mitigation techniques: Only prompt‑based mitigations were explored; future work should evaluate model‑level interventions (e.g., fine‑tuning on debiased code, reinforcement learning from human feedback).
- User interaction: The experiments assume a single‑shot generation; interactive coding assistants that refine code over multiple turns could exhibit different bias patterns.
Bottom line: If you’re building or using AI‑powered code generators, it’s time to look beyond tiny if‑statement tests and audit the full pipelines they produce. The hidden bias uncovered here could have real‑world consequences for fairness, compliance, and trust in AI‑driven software.
Authors
- Minh Duc Bui
- Xenia Heilmann
- Mattia Cerrato
- Manuel Mager
- Katharina von der Wense
Paper Information
- arXiv ID: 2604.21716v1
- Categories: cs.CL, cs.SE
- Published: April 23, 2026