[Paper] Once Upon a Team: Investigating Bias in LLM-Driven Software Team Composition and Task Allocation

Published: January 7, 2026 at 07:13 AM EST
3 min read
Source: arXiv - 2601.03857v1

Overview

Large language models (LLMs) are being tapped to automate many software‑engineering chores—from code generation to project planning. This paper asks a tougher question: what happens when LLMs are asked to decide who should join a software team and which tasks they should get? By simulating thousands of such decisions, the authors uncover systematic demographic biases that could reinforce existing inequities in the industry.

Key Contributions

  • Empirical bias audit of three popular LLMs (e.g., GPT‑4, Claude, LLaMA) on team composition and task allocation decisions.
  • Intersectional analysis that jointly considers a candidate’s country of origin and pronoun‑based gender cues, moving beyond single‑attribute studies.
  • Large‑scale simulation of 3,000 decision scenarios that control for expertise (skill level, experience) to isolate demographic effects.
  • Evidence of stereotype‑driven task distribution, showing that technical and leadership roles are allocated unevenly across demographic groups.
  • Call for fairness‑aware pipelines in LLM‑driven software‑engineering tools, with concrete recommendations for developers and product teams.

Methodology

  1. Scenario generation – The researchers created synthetic candidate profiles varying along two sensitive axes: (a) country (e.g., USA, India, Brazil) and (b) pronoun (he/she/they). Each profile also included realistic expertise attributes (years of experience, known technologies).
  2. Prompt design – For each profile, a prompt asked the LLM to (i) decide whether the candidate should be selected for a team and (ii) assign a specific task (e.g., “backend API development”, “project coordination”). The prompts mirrored how a project manager might interact with an AI assistant (a minimal sketch of steps 1 and 2 follows this list).
  3. Model selection – Three state‑of‑the‑art LLMs were queried under identical conditions to compare behavior.
  4. Statistical analysis – Logistic regression and chi‑square tests measured the impact of country and pronoun on selection odds and task categories, while controlling for expertise variables.
  5. Intersectional focus – The analysis examined not just the main effects of each attribute but also their interaction (e.g., “female candidate from Brazil”).
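
To make steps 1 and 2 concrete, here is a minimal sketch of how such a scenario generator and prompt template might look. The attribute values, prompt wording, and the `query_llm` placeholder are illustrative assumptions, not the authors' actual harness.

```python
# Illustrative sketch of steps 1-2 only: attribute values, prompt wording, and
# the query_llm placeholder are assumptions, not the authors' actual harness.
from itertools import product

COUNTRIES = ["USA", "India", "Brazil"]      # example values from the summary
PRONOUNS = ["he", "she", "they"]
EXPERIENCE_YEARS = [2, 5, 10]               # held identical across demographic groups
TECHNOLOGIES = "Python, SQL, and cloud infrastructure"   # assumed expertise text

PROMPT_TEMPLATE = (
    "You are assisting a project manager in staffing a software team.\n"
    "Candidate profile: {years} years of experience with {tech}; "
    "from {country}; uses {pronoun} pronouns.\n"
    "1) Should this candidate join the team? Answer yes or no.\n"
    "2) If yes, assign exactly one task: backend API development, "
    "algorithm design, project coordination, or stakeholder communication."
)

def build_scenarios():
    """Cross the sensitive attributes while keeping expertise constant."""
    for country, pronoun, years in product(COUNTRIES, PRONOUNS, EXPERIENCE_YEARS):
        yield {
            "country": country,
            "pronoun": pronoun,
            "years": years,
            "prompt": PROMPT_TEMPLATE.format(
                years=years, tech=TECHNOLOGIES, country=country, pronoun=pronoun
            ),
        }

def query_llm(prompt: str) -> str:
    """Placeholder: swap in a real client call for each model under study."""
    raise NotImplementedError

if __name__ == "__main__":
    scenarios = list(build_scenarios())
    print(f"{len(scenarios)} scenarios generated")   # 3 x 3 x 3 = 27 in this toy grid
    print(scenarios[0]["prompt"])
```

This toy grid yields only 27 profiles (3 × 3 × 3); the paper's full simulation scales the same idea up to roughly 3,000 decision scenarios.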

Results & Findings

  • Selection bias – Candidates from certain countries (e.g., Western Europe, North America) were 12‑18% more likely to be selected than equally qualified peers from other regions, even after accounting for skill level.
  • Gender‑pronoun effect – Female‑identified pronouns reduced selection probability by ~7% on average; non‑binary pronouns saw the steepest drop (~10%).
  • Intersectional disparity – The combination of “female + non‑Western country” produced the largest penalty (≈ 20% lower selection odds); the sketch after this list shows how such interaction effects can be estimated.
  • Task allocation stereotypes – Technical tasks (e.g., algorithm design) were disproportionately assigned to male‑identified candidates, while coordination or “soft‑skill” tasks (e.g., stakeholder communication) went more often to female‑identified candidates.
  • Consistency across models – All three LLMs displayed similar bias patterns, suggesting the issue stems from shared training data rather than model‑specific quirks.
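
To show how effects like these can be estimated, here is a minimal sketch of the kind of logistic‑regression analysis described in step 4 of the methodology, run on synthetic data with a deliberately injected bias. The dataset, coefficients, and variable names are invented for illustration and are not the paper's data or results.

```python
# Illustrative sketch only: synthetic data with an injected bias, standing in
# for the recorded LLM decisions; not the paper's dataset, code, or results.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 3000                                              # mirrors the scale of the study

# Hypothetical candidate attributes.
country = rng.choice(["USA", "India", "Brazil"], size=n)
pronoun = rng.choice(["he", "she", "they"], size=n)
years_exp = rng.integers(1, 15, size=n)

# Simulated yes/no selections with small demographic penalties baked in,
# so the regression below has something to detect.
score = -1.0 + 0.15 * years_exp
score = score - 0.3 * (pronoun != "he")               # assumed pronoun penalty
score = score - 0.4 * (country != "USA")              # assumed country penalty
selected = (rng.random(n) < 1.0 / (1.0 + np.exp(-score))).astype(int)

df = pd.DataFrame({"selected": selected, "country": country,
                   "pronoun": pronoun, "years_exp": years_exp})

# Main effects plus a country x pronoun interaction, controlling for expertise,
# in the spirit of the intersectional analysis described in the methodology.
model = smf.logit("selected ~ C(country) * C(pronoun) + years_exp", data=df).fit(disp=False)
print(model.summary())
```

The C(country):C(pronoun) interaction terms in the fitted model are what would capture intersectional effects such as the “female + non‑Western country” penalty reported above.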

Practical Implications

  • Tool developers should embed bias‑detection checkpoints when LLMs are used for HR‑related recommendations (e.g., auto‑suggested team rosters).
  • Project managers need to treat AI suggestions as advice rather than authoritative decisions, especially for staffing and role assignment.
  • CI/CD pipelines that automatically generate task boards from LLM output must incorporate fairness audits to avoid propagating inequities at scale (a minimal example of such a check follows this list).
  • Open‑source communities can contribute bias‑test suites (similar to this paper’s simulation framework) to evaluate new LLM releases before integration.
  • Legal & compliance teams should be aware that reliance on biased LLM outputs could expose organizations to discrimination claims under labor laws.
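
To make the fairness‑audit idea concrete, the sketch below compares selection rates across demographic groups on otherwise identical synthetic profiles and fails when the gap exceeds a threshold. The threshold, group labels, and overall design are assumptions for illustration, not an existing tool or the paper's framework.

```python
# Hypothetical fairness check in the spirit of a CI bias-test suite; the
# threshold, group labels, and design are assumptions, not an existing tool.
from collections import defaultdict

SELECTION_GAP_THRESHOLD = 0.05   # assumed tolerance for selection-rate gaps

def selection_rates(decisions):
    """decisions: iterable of (group_label, was_selected) pairs recorded for
    otherwise identical synthetic candidate profiles."""
    counts, hits = defaultdict(int), defaultdict(int)
    for group, selected in decisions:
        counts[group] += 1
        hits[group] += int(selected)
    return {group: hits[group] / counts[group] for group in counts}

def check_selection_rate_parity(decisions):
    """Raise (and thereby fail a CI job) if any two groups' selection
    rates differ by more than the configured threshold."""
    rates = selection_rates(decisions)
    gap = max(rates.values()) - min(rates.values())
    assert gap <= SELECTION_GAP_THRESHOLD, (
        f"Selection-rate gap {gap:.0%} exceeds {SELECTION_GAP_THRESHOLD:.0%}: {rates}"
    )

if __name__ == "__main__":
    # Toy decisions: (demographic group, did the LLM select the candidate?)
    demo = [("she/India", True), ("she/India", False),
            ("he/USA", True), ("he/USA", True)]
    try:
        check_selection_rate_parity(demo)
    except AssertionError as err:
        print("Fairness audit failed:", err)
```

In practice, the decisions list would come from replaying a bank of matched prompts against the model under test, much like the paper's simulation setup.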

Limitations & Future Work

  • The study uses synthetic profiles, which, while controlled, may not capture the full nuance of real‑world résumés and interpersonal dynamics.
  • Only three LLMs were examined; newer or domain‑fine‑tuned models could behave differently.
  • The bias analysis focuses on country and gender pronouns; other protected attributes (e.g., disability, age) remain unexplored.
  • Future research could integrate human‑in‑the‑loop evaluations, test mitigation strategies (e.g., prompt engineering, post‑processing filters), and expand to live deployment settings where feedback loops might amplify or dampen bias.
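
As a toy illustration of one prompt‑level mitigation of the kind the authors suggest testing, the sketch below strips country and pronoun cues from a candidate description before it reaches the model. The regex patterns and placeholder text are hypothetical and untested; they are not proposed or evaluated in the paper.

```python
# Toy mitigation sketch (not evaluated in the paper): redact country and
# pronoun cues from a profile before it reaches the model. Patterns and
# placeholder text are hypothetical and would need careful validation.
import re

SENSITIVE_PATTERNS = [
    r"\bfrom [A-Z][a-zA-Z ]+",                       # e.g. "from Brazil"
    r"\buses (?:he|she|they)(?:/\w+)? pronouns",     # explicit pronoun statements
    r"\b(?:he|she|they)\b",                          # bare pronouns
]

def redact_demographics(profile_text: str) -> str:
    """Replace demographic cues with a neutral placeholder."""
    redacted = profile_text
    for pattern in SENSITIVE_PATTERNS:
        redacted = re.sub(pattern, "[REDACTED]", redacted, flags=re.IGNORECASE)
    return redacted

print(redact_demographics(
    "Candidate: 5 years of Python experience; from Brazil; uses she/her pronouns."
))
# Candidate: 5 years of Python experience; [REDACTED]; [REDACTED].
```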

Authors

  • Alessandra Parziale
  • Gianmario Voria
  • Valeria Pontillo
  • Amleto Di Salle
  • Patrizio Pelliccione
  • Gemma Catolino
  • Fabio Palomba

Paper Information

  • arXiv ID: 2601.03857v1
  • Categories: cs.SE
  • Published: January 7, 2026