[Paper] Once Upon a Team: Investigating Bias in LLM-Driven Software Team Composition and Task Allocation
Source: arXiv - 2601.03857v1
Overview
Large language models (LLMs) are being tapped to automate many software‑engineering chores—from code generation to project planning. This paper asks a tougher question: what happens when LLMs are asked to decide who should join a software team and which tasks they should get? By simulating thousands of such decisions, the authors uncover systematic demographic biases that could reinforce existing inequities in the industry.
Key Contributions
- Empirical bias audit of three popular LLMs (e.g., GPT‑4, Claude, LLaMA) on team composition and task allocation decisions.
- Intersectional analysis that jointly considers a candidate’s country of origin and pronoun‑based gender cues, moving beyond single‑attribute studies.
- Large‑scale simulation of 3,000 decision scenarios that control for expertise (skill level, experience) to isolate demographic effects.
- Evidence of stereotype‑driven task distribution, showing technical vs. leadership roles are allocated unevenly across demographic groups.
- Call for fairness‑aware pipelines in LLM‑driven software‑engineering tools, with concrete recommendations for developers and product teams.
Methodology
- Scenario generation – The researchers created synthetic candidate profiles varying along two sensitive axes: (a) country (e.g., USA, India, Brazil) and (b) pronoun (he/she/they). Each profile also included realistic expertise attributes (years of experience, known technologies).
- Prompt design – For each profile, a prompt asked the LLM to (i) decide whether the candidate should be selected for a team and (ii) assign a specific task (e.g., “backend API development”, “project coordination”). The prompts mirrored how a project manager might interact with an AI assistant.
- Model selection – Three state‑of‑the‑art LLMs were queried under identical conditions to compare behavior.
- Statistical analysis – Logistic regression and chi‑square tests measured the impact of country and pronoun on selection odds and task categories, while controlling for expertise variables.
- Intersectional focus – The analysis examined not just the main effects of each attribute but also their interaction (e.g., “female candidate from Brazil”).
Results & Findings
- Selection bias – Candidates from certain countries (e.g., Western Europe, North America) were 12‑18% more likely to be selected than equally qualified peers from other regions, even after accounting for skill level.
- Gender‑pronoun effect – Female‑identified pronouns reduced selection probability by ~7% on average; non‑binary pronouns saw the steepest drop (~10%).
- Intersectional disparity – The combination of “female + non‑Western country” produced the largest penalty (≈ 20% lower selection odds).
- Task allocation stereotypes – Technical tasks (e.g., algorithm design) were disproportionately assigned to male‑identified candidates, while coordination or “soft‑skill” tasks (e.g., stakeholder communication) went more often to female‑identified candidates.
- Consistency across models – All three LLMs displayed similar bias patterns, suggesting the issue stems from shared training data rather than model‑specific quirks.
Practical Implications
- Tool developers should embed bias‑detection checkpoints when LLMs are used for HR‑related recommendations (e.g., auto‑suggested team rosters).
- Project managers need to treat AI suggestions as advice rather than authoritative decisions, especially for staffing and role assignment.
- CI/CD pipelines that automatically generate task boards from LLM output must incorporate fairness audits to avoid propagating inequities at scale.
- Open‑source communities can contribute bias‑test suites (similar to this paper’s simulation framework) to evaluate new LLM releases before integration.
- Legal & compliance teams should be aware that reliance on biased LLM outputs could expose organizations to discrimination claims under labor laws.
Limitations & Future Work
- The study uses synthetic profiles, which, while controlled, may not capture the full nuance of real‑world résumés and interpersonal dynamics.
- Only three LLMs were examined; newer or domain‑fine‑tuned models could behave differently.
- The bias analysis focuses on country and gender pronouns; other protected attributes (e.g., disability, age) remain unexplored.
- Future research could integrate human‑in‑the‑loop evaluations, test mitigation strategies (e.g., prompt engineering, post‑processing filters), and expand to live deployment settings where feedback loops might amplify or dampen bias.
Authors
- Alessandra Parziale
- Gianmario Voria
- Valeria Pontillo
- Amleto Di Salle
- Patrizio Pelliccione
- Gemma Catolino
- Fabio Palomba
Paper Information
- arXiv ID: 2601.03857v1
- Categories: cs.SE
- Published: January 7, 2026
- PDF: Download PDF