[Paper] Once Upon a Team: Investigating Bias in LLM-Driven Software Team Composition and Task Allocation

Published: 1 day ago (January 7, 2026 at 07:13 AM EST)

3 min read

Source: arXiv

Source: arXiv - 2601.03857v1

Overview

Large language models (LLMs) are being tapped to automate many software‑engineering chores—from code generation to project planning. This paper asks a tougher question: what happens when LLMs are asked to decide who should join a software team and which tasks they should get? By simulating thousands of such decisions, the authors uncover systematic demographic biases that could reinforce existing inequities in the industry.

Key Contributions

Empirical bias audit of three popular LLMs (e.g., GPT‑4, Claude, LLaMA) on team composition and task allocation decisions.
Intersectional analysis that jointly considers a candidate’s country of origin and pronoun‑based gender cues, moving beyond single‑attribute studies.
Large‑scale simulation of 3,000 decision scenarios that control for expertise (skill level, experience) to isolate demographic effects.
Evidence of stereotype‑driven task distribution, showing technical vs. leadership roles are allocated unevenly across demographic groups.
Call for fairness‑aware pipelines in LLM‑driven software‑engineering tools, with concrete recommendations for developers and product teams.

Methodology

Scenario generation – The researchers created synthetic candidate profiles varying along two sensitive axes: (a) country (e.g., USA, India, Brazil) and (b) pronoun (he/she/they). Each profile also included realistic expertise attributes (years of experience, known technologies).
Prompt design – For each profile, a prompt asked the LLM to (i) decide whether the candidate should be selected for a team and (ii) assign a specific task (e.g., “backend API development”, “project coordination”). The prompts mirrored how a project manager might interact with an AI assistant.
Model selection – Three state‑of‑the‑art LLMs were queried under identical conditions to compare behavior.
Statistical analysis – Logistic regression and chi‑square tests measured the impact of country and pronoun on selection odds and task categories, while controlling for expertise variables.
Intersectional focus – The analysis examined not just the main effects of each attribute but also their interaction (e.g., “female candidate from Brazil”).

Results & Findings

Selection bias – Candidates from certain countries (e.g., Western Europe, North America) were 12‑18% more likely to be selected than equally qualified peers from other regions, even after accounting for skill level.
Gender‑pronoun effect – Female‑identified pronouns reduced selection probability by ~7% on average; non‑binary pronouns saw the steepest drop (~10%).
Intersectional disparity – The combination of “female + non‑Western country” produced the largest penalty (≈ 20% lower selection odds).
Task allocation stereotypes – Technical tasks (e.g., algorithm design) were disproportionately assigned to male‑identified candidates, while coordination or “soft‑skill” tasks (e.g., stakeholder communication) went more often to female‑identified candidates.
Consistency across models – All three LLMs displayed similar bias patterns, suggesting the issue stems from shared training data rather than model‑specific quirks.

Practical Implications

Tool developers should embed bias‑detection checkpoints when LLMs are used for HR‑related recommendations (e.g., auto‑suggested team rosters).
Project managers need to treat AI suggestions as advice rather than authoritative decisions, especially for staffing and role assignment.
CI/CD pipelines that automatically generate task boards from LLM output must incorporate fairness audits to avoid propagating inequities at scale.
Open‑source communities can contribute bias‑test suites (similar to this paper’s simulation framework) to evaluate new LLM releases before integration.
Legal & compliance teams should be aware that reliance on biased LLM outputs could expose organizations to discrimination claims under labor laws.

Limitations & Future Work

The study uses synthetic profiles, which, while controlled, may not capture the full nuance of real‑world résumés and interpersonal dynamics.
Only three LLMs were examined; newer or domain‑fine‑tuned models could behave differently.
The bias analysis focuses on country and gender pronouns; other protected attributes (e.g., disability, age) remain unexplored.
Future research could integrate human‑in‑the‑loop evaluations, test mitigation strategies (e.g., prompt engineering, post‑processing filters), and expand to live deployment settings where feedback loops might amplify or dampen bias.

Authors

Alessandra Parziale
Gianmario Voria
Valeria Pontillo
Amleto Di Salle
Patrizio Pelliccione
Gemma Catolino
Fabio Palomba

Paper Information

arXiv ID: 2601.03857v1
Categories: cs.SE
Published: January 7, 2026
PDF: Download PDF

[Paper] Once Upon a Team: Investigating Bias in LLM-Driven Software Team Composition and Task Allocation

Overview

Key Contributions

Methodology

Results & Findings

Practical Implications

Limitations & Future Work

Authors

Paper Information

Related posts

[Paper] Smells Depend on the Context: An Interview Study of Issue Tracking Problems and Smells in Practice

[Paper] An Ontology-Based Approach to Security Risk Identification of Container Deployments in OT Contexts

[Paper] Understanding Specification-Driven Code Generation with LLMs: An Empirical Study Design

[Paper] Assessing and Improving the Representativeness of Code Generation Benchmarks Using Knowledge Units (KUs) of Programming Languages -- An Empirical Study