[Paper] Beyond the Commit: Developer Perspectives on Productivity with AI Coding Assistants
Source: arXiv - 2602.03593v1
Overview
The paper investigates how AI‑powered coding assistants (e.g., GitHub Copilot, Tabnine) actually affect developer productivity. By combining a large‑scale survey of 2,989 engineers at BNY Mellon with in‑depth interviews, the authors reveal that traditional “commit‑count” or “lines‑of‑code” metrics miss many of the ways AI tools influence work—especially over the long term.
Key Contributions
- Mixed‑method evaluation framework: blends quantitative survey data with qualitative interview insights to capture both immediate and lasting productivity effects.
- Six‑factor model of AI‑augmented productivity: identifies short‑term (speed, suggestion relevance) and long‑term (skill growth, ownership, code quality, cognitive load) dimensions.
- Empirical evidence of divergent perceptions: survey respondents are split on AI usefulness, while interviews expose nuanced trade‑offs.
- Critique of legacy metrics: demonstrates why commit‑based or LOC‑based measures are insufficient for AI‑assisted development.
- Guidelines for industry adoption: proposes a holistic evaluation checklist that can be integrated into engineering performance dashboards.
Methodology
- Survey (Quantitative) – Distributed internally at BNY Mellon; 2,989 developers answered Likert‑scale items about AI tool usage, perceived speed gains, error rates, and overall satisfaction.
- Semi‑structured Interviews (Qualitative) – 11 participants were selected to span a range of experience levels and degrees of AI tool adoption. Interviews probed daily workflows, learning experiences, and perceived impact on code ownership.
- Thematic analysis – Interview transcripts were coded using an inductive approach, yielding six recurring themes that formed the new productivity factor model.
- Triangulation – Survey trends were cross‑checked against interview narratives to validate findings and surface contradictions; a minimal sketch of the survey‑tabulation side of this cross‑check follows this list.
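The paper does not release its analysis code, so the following is a minimal sketch, under assumed column names and toy data, of the survey‑tabulation step: computing the share of respondents who agree with each Likert item, the kind of trend later set against the coded interview themes.

```python
# Minimal sketch of the quantitative side of the triangulation step.
# Column names and data are hypothetical; the paper does not publish
# its survey instrument or analysis code.
import pandas as pd

# Toy Likert responses (1 = strongly disagree ... 5 = strongly agree).
survey = pd.DataFrame({
    "speed_gain":        [5, 4, 2, 5, 3, 1, 4],
    "suggestions_buggy": [2, 3, 4, 1, 4, 5, 2],
})

def pct_agree(series: pd.Series, threshold: int = 4) -> float:
    """Share of respondents rating the item at `threshold` or above."""
    return (series >= threshold).mean() * 100

trend = {item: round(pct_agree(col), 1) for item, col in survey.items()}
print(trend)  # {'speed_gain': 57.1, 'suggestions_buggy': 42.9}
```

Percentages of this shape are what the paper's headline figures (e.g., the ~48 % reporting speed gains) summarize before they are compared against the interview narratives.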
Results & Findings
- Mixed sentiment: ~48 % of survey respondents reported “significant speed improvements,” while ~32 % felt AI suggestions often introduced bugs or required extra debugging.
- Six‑factor productivity model:
  - Task Completion Speed – Immediate reduction in keystrokes/time.
  - Suggestion Relevance – Accuracy of generated code snippets.
  - Cognitive Load – Mental effort saved or added when reviewing AI output.
  - Skill Development – Whether AI acts as a learning aid or a crutch.
  - Code Ownership & Trust – Developers’ confidence in the code they ship when it is AI‑generated.
  - Long‑term Code Quality – Impact on maintainability, test coverage, and technical debt.
- Long‑term factors dominate: Interviewees emphasized that the true value of AI assistants lies in how they affect expertise growth and ownership, not just raw speed.
- Metric mismatch: Traditional productivity proxies (e.g., commits per day) correlated only weakly with the six factors, confirming their limited explanatory power; the sketch below illustrates the kind of rank‑correlation check involved.
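To make the metric mismatch concrete, here is a minimal sketch, on entirely synthetic data, of a rank‑correlation check between a legacy proxy and one of the six factor scores. The paper reports the weak relationship but does not publish this computation, so everything below is illustrative.

```python
# Hypothetical illustration of the metric mismatch: commits/day barely
# tracks a self-reported factor score. All data below are synthetic.
from scipy.stats import spearmanr

commits_per_day = [3, 7, 2, 9, 4, 6, 5, 8]   # legacy velocity proxy
skill_dev_score = [4, 5, 2, 3, 5, 2, 4, 3]   # 1-5 Likert, per developer

rho, p_value = spearmanr(commits_per_day, skill_dev_score)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
# With this toy data rho comes out near zero: commit volume explains
# almost none of the factor, mirroring the pattern the paper describes.
```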
Practical Implications
- Rethink performance dashboards: Incorporate metrics like “AI‑assisted learning events” (e.g., number of new APIs discovered via suggestions) and “ownership confidence scores” alongside velocity; a hypothetical record schema follows this list.
- Tool‑selection criteria: Evaluate AI assistants on relevance and cognitive‑load reduction rather than just raw suggestion volume.
- Onboarding & training: Design curricula that teach developers how to critically assess AI output, turning the assistant into a mentorship layer.
- Policy & governance: Establish guidelines for code review that explicitly address AI‑generated sections to preserve accountability and maintainability.
- Product roadmap: Vendors can differentiate by providing transparency features (e.g., provenance of suggestions) that support the long‑term factors highlighted in the study.
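As one way to act on the first implication, the sketch below pairs a classic velocity number with the six factors in a single dashboard record. The schema and every field name are invented for illustration; the paper proposes the holistic checklist, not this particular data model.

```python
# Hypothetical dashboard record combining legacy velocity with the
# paper's six factors. Field names are invented for illustration.
from dataclasses import dataclass, asdict

@dataclass
class ProductivitySnapshot:
    commits_per_week: int        # legacy velocity proxy
    task_speed: float            # 1-5 self-rating
    suggestion_relevance: float  # 1-5 self-rating
    cognitive_load: float        # 1-5; lower is better
    skill_development: float     # e.g., new APIs learned via suggestions
    ownership_confidence: float  # 1-5 trust in shipped AI-assisted code
    long_term_quality: float     # e.g., maintainability/coverage trend

snapshot = ProductivitySnapshot(
    commits_per_week=14,
    task_speed=4.2, suggestion_relevance=3.8, cognitive_load=2.9,
    skill_development=3.5, ownership_confidence=4.0, long_term_quality=3.6,
)
print(asdict(snapshot))  # one row for an engineering performance dashboard
```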
Limitations & Future Work
- Single‑company context: All data come from BNY Mellon; results may differ in open‑source or startup environments.
- Self‑reported bias: Survey responses rely on participants’ perception of productivity, which can be optimistic or defensive.
- Tool heterogeneity: The study aggregates across multiple AI assistants, obscuring tool‑specific strengths/weaknesses.
- Future directions: The authors suggest longitudinal studies tracking skill acquisition over months, experiments comparing specific AI tools, and extensions of the factor model to other domains such as data‑science notebooks or low‑code platforms.
Authors
- Valerie Chen
- Jasmyn He
- Benjamin Williams
- Jason Valentino
- Ameet Talwalkar
Paper Information
- arXiv ID: 2602.03593v1
- Categories: cs.SE
- Published: February 3, 2026