[Paper] A meta-analysis of the effect of generative AI on productivity and learning in programming
Source: arXiv - 2605.04779v1
Overview
A new meta‑analysis synthesises findings from 23 empirical studies (27 effect sizes) to answer a question that’s on every developer’s mind: does using generative AI (GenAI) coding assistants actually make us code faster or learn better? The authors find that GenAI tools give a modest boost to productivity, that the boost is markedly larger in controlled lab experiments than in real‑world settings, and that the evidence for learning gains is inconclusive.
Key Contributions
- First large‑scale quantitative synthesis of GenAI’s impact on both productivity and learning in programming.
- Standardised effect‑size estimates (Hedges’ g) for productivity (g = 0.33) and learning (g = 0.14), with confidence intervals and heterogeneity analyses.
- Contextual breakdown showing stronger productivity gains in controlled lab experiments than in open‑source or enterprise environments.
- Rigorous bias assessment using RoB2 (randomised trials) and ROBINS‑I (non‑randomised studies) to gauge study quality.
- Practical guidelines for educators and industry leaders on when and how to integrate GenAI assistants.
Methodology
- Systematic literature search across ACM, arXiv, Scopus, and Web of Science for papers published 2019‑2025 that compared GenAI‑assisted vs. unassisted programming.
- Inclusion criteria: quantitative measures of (a) productivity (task completion time, number of commits, lines of code) and (b) learning (exam or test scores).
- Data extraction: 27 effect sizes were extracted, each converted to Hedges’ g to correct for small‑sample bias.
- Risk‑of‑bias assessment: RoB2 for randomized controlled trials; ROBINS‑I for observational studies.
- Meta‑analytic model: Random‑effects model to account for between‑study heterogeneity, with subgroup analyses for experimental vs. real‑world settings.
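The conversion to Hedges' g mentioned in the data-extraction step can be sketched as follows. This is a minimal illustration using the standard textbook formulas for two independent groups, not the authors' extraction script; the function name `hedges_g` and the example numbers are hypothetical.

```python
import math


def hedges_g(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Return (g, variance of g) for a treatment group vs. a control group.

    Assumes two independent groups with summary statistics available.
    """
    df = n_t + n_c - 2
    # Pooled standard deviation across both groups.
    sd_pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / df)
    # Cohen's d: the uncorrected standardised mean difference.
    d = (mean_t - mean_c) / sd_pooled
    # Small-sample correction factor J (common approximation).
    j = 1 - 3 / (4 * df - 1)
    # Large-sample approximation of the variance of d, scaled by J^2.
    var_d = (n_t + n_c) / (n_t * n_c) + d**2 / (2 * (n_t + n_c))
    return j * d, j**2 * var_d


# Hypothetical study: test scores of 20 assisted vs. 20 unassisted students.
g, var_g = hedges_g(75.0, 10.0, 20, 70.0, 10.0, 20)
```

The correction factor J is always below 1, so g is slightly smaller in magnitude than Cohen's d, and the gap shrinks as sample sizes grow. For outcomes where lower is better (such as task completion time), the sign would need to be flipped so that positive g still favours the GenAI condition.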
The approach is deliberately transparent: all search strings, inclusion decisions, and statistical scripts are made publicly available, allowing other researchers (or curious developers) to reproduce the analysis.
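For readers who want a feel for the pooling step before opening the published scripts, here is a minimal random-effects sketch. It assumes the DerSimonian-Laird estimator for the between-study variance, which is one common choice; the summary does not state which estimator the authors used, and the function name `random_effects` is hypothetical.

```python
import math


def random_effects(effects, variances):
    """Pool effect sizes with a DerSimonian-Laird random-effects model."""
    k = len(effects)
    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    # Between-study variance tau^2, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Random-effects weights add tau^2 to each study's own variance.
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    # I^2: share of total variability attributed to heterogeneity, in percent.
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    return {
        "g": pooled,
        "ci": (pooled - 1.96 * se, pooled + 1.96 * se),
        "tau2": tau2,
        "i2": i2,
    }


# Hypothetical input: two equally precise studies that disagree strongly.
result = random_effects([0.1, 0.9], [0.01, 0.01])
```

When studies disagree more than their sampling error would explain, tau² grows, the weights even out across studies, and the confidence interval widens.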
Results & Findings
| Outcome | Hedges’ g | 95 % CI | Interpretation |
|---|---|---|---|
| Productivity | 0.33 | [0.09, 0.58] | Small‑to‑moderate positive effect; developers finish tasks faster or produce more code when using GenAI. |
| Learning | 0.14 | [−0.18, 0.47] | Not statistically distinguishable from zero (the interval spans zero); no clear evidence that GenAI improves exam or test scores. |
- Heterogeneity: The I² statistic indicated substantial variability (≈ 70 %) for productivity, driven mainly by study context. Controlled lab experiments reported g ≈ 0.55, while open‑source projects and enterprise teams showed g ≈ 0.15–0.20.
- Bias: Most studies were rated at low‑to‑moderate risk of bias; a few high‑risk observational studies contributed to the heterogeneity.
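The significance reading of the table can be checked from the reported intervals alone. The sketch below assumes the intervals are symmetric normal-based (Wald) intervals, which may not match the authors' exact method; the helper name `z_and_p_from_ci` is hypothetical.

```python
from statistics import NormalDist


def z_and_p_from_ci(estimate, lower, upper, level=0.95):
    """Recover the standard error, z statistic and two-sided p-value
    from a point estimate and its confidence interval."""
    normal = NormalDist()
    # Critical value, about 1.96 for a 95 % interval.
    z_crit = normal.inv_cdf(0.5 + level / 2)
    # The interval spans 2 * z_crit standard errors.
    se = (upper - lower) / (2 * z_crit)
    z = estimate / se
    p = 2 * (1 - normal.cdf(abs(z)))
    return se, z, p


# Values taken from the results table above.
productivity = z_and_p_from_ci(0.33, 0.09, 0.58)
learning = z_and_p_from_ci(0.14, -0.18, 0.47)
```

Under this assumption the productivity effect has a p-value below 0.05 and the learning effect does not, which matches the table's interpretation column.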
Practical Implications
For Developers & Teams
- Adopt GenAI as a productivity aid, especially for repetitive or boilerplate‑heavy tasks (e.g., scaffolding, API calls). Expect roughly a 10‑30 % speedup in ideal conditions, but be prepared for smaller gains in complex, collaborative codebases.
- Pair GenAI with code review: Since the productivity boost is context‑dependent, integrating AI suggestions into existing pull‑request workflows can capture benefits while maintaining quality control.
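A standardised effect size does not map to a percentage speedup on its own; the conversion depends on how variable task times are. The sketch below shows two rough translations of g under stated assumptions: normally distributed outcomes with equal variances for the probability of superiority, and a known coefficient of variation (SD divided by mean) of unassisted completion time for the relative saving. Both function names and the coefficient-of-variation values are hypothetical, not figures from the paper.

```python
import math
from statistics import NormalDist


def probability_of_superiority(g):
    """Chance that a randomly chosen assisted participant outperforms
    a randomly chosen unassisted one (normal, equal-variance assumption)."""
    return NormalDist().cdf(g / math.sqrt(2))


def relative_time_saving(g, cv):
    """Approximate fractional reduction in mean completion time.

    g is roughly (mean_control - mean_assisted) / SD, so dividing the mean
    difference by mean_control gives g * (SD / mean_control) = g * cv.
    """
    return g * cv


pooled_g = 0.33  # pooled productivity effect reported in the meta-analysis
ps = probability_of_superiority(pooled_g)
# Hypothetical coefficients of variation for task completion time.
low = relative_time_saving(pooled_g, 0.3)
high = relative_time_saving(pooled_g, 0.9)
```

The same g therefore implies a small or a large percentage saving depending on how spread out completion times are, which is one reason to treat any single percentage with caution.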
For Tool Vendors
- Focus on integration depth: Tools that surface suggestions within the IDE and allow quick acceptance/rejection tend to show larger effects in controlled settings.
- Provide usage analytics: Giving teams visibility into acceptance rates and time‑saved metrics can help justify ROI and tune the AI model for specific domains.
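As an illustration of the usage-analytics idea, the following sketch aggregates acceptance rate and estimated time saved from a log of suggestion events. The `SuggestionEvent` schema, its field names and the per-suggestion time estimates are all hypothetical; real tools expose different telemetry, and estimating time saved is itself an open measurement problem.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuggestionEvent:
    developer: str
    accepted: bool
    est_seconds_saved: float = 0.0  # hypothetical estimate, 0 if rejected


def summarize(events):
    """Aggregate acceptance rate and estimated minutes saved."""
    events = list(events)
    shown = len(events)
    accepted = [e for e in events if e.accepted]
    return {
        "shown": shown,
        "accepted": len(accepted),
        # Guard against an empty log to avoid division by zero.
        "acceptance_rate": len(accepted) / shown if shown else 0.0,
        "est_minutes_saved": sum(e.est_seconds_saved for e in accepted) / 60,
    }


# Hypothetical log for a two-person team.
log = [
    SuggestionEvent("ana", True, 30.0),
    SuggestionEvent("ana", False),
    SuggestionEvent("ben", True, 60.0),
    SuggestionEvent("ben", True, 90.0),
]
report = summarize(log)
```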
For Educators & Training Programs
- Treat GenAI as a “coach” rather than a shortcut: The meta‑analysis suggests that simply letting students rely on AI does not automatically improve test scores. Structured activities (e.g., “explain the generated code” or “debug AI‑produced snippets”) may be needed to translate assistance into learning.
- Design assessments that isolate AI use: Open‑book style exams or project‑based assessments can better capture whether students are internalising concepts rather than copying AI output.
For Open‑Source Communities
- Expect modest productivity gains: Contributions assisted by AI may still require substantial human review, especially for maintainability and style consistency.
Limitations & Future Work
- Study heterogeneity: The wide spread of effect sizes limits the ability to pinpoint why some contexts benefit more than others (e.g., language, team size, task complexity).
- Short‑term metrics: Most primary studies measured immediate task completion or exam scores; long‑term skill retention and career progression remain unexamined.
- Rapidly evolving tools: The field of GenAI is moving fast; models released after the primary studies were conducted may exhibit different effect profiles than the tools covered in the 2019‑2025 literature.
- Potential publication bias: Although funnel‑plot analyses were performed, the relatively small number of studies means subtle bias could remain.
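Funnel-plot asymmetry is often quantified with Egger's regression, sketched below. The summary only says that funnel-plot analyses were performed, so the use of Egger's test here is an assumption, and the function name `egger_regression` and the example data are hypothetical. The sketch returns point estimates only; a full test would also compute a standard error and t statistic for the intercept.

```python
def egger_regression(effects, std_errors):
    """Regress standardised effects (g / SE) on precision (1 / SE).

    Returns (intercept, slope). An intercept far from zero suggests
    small-study effects such as publication bias; the slope estimates
    the underlying effect.
    """
    x = [1 / se for se in std_errors]
    y = [g / se for g, se in zip(effects, std_errors)]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Ordinary least squares for a single predictor.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data in which less precise studies report larger effects.
ses = [0.1, 0.2, 0.4]
biased_effects = [0.2 + se for se in ses]
intercept, slope = egger_regression(biased_effects, ses)
```

With only 23 studies, such a regression has limited power, which is consistent with the caution expressed above.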
Future research should (1) conduct longitudinal studies tracking developer performance over months, (2) isolate the impact of specific AI features (e.g., code completion vs. full‑function synthesis), and (3) explore pedagogical designs that harness GenAI for deeper learning rather than shortcutting.
Bottom line: Generative AI coding assistants can give developers a measurable productivity edge, but the boost is not a silver bullet, and educators can’t count on them to automatically improve learning outcomes. Thoughtful integration, paired with human oversight and purposeful instructional design, is the key to unlocking their full potential.
Authors
- Sebastian Maier
- Moritz Gunzenhäuser
- Jonas Schweisthal
- Manuel Schneider
- Stefan Feuerriegel
Paper Information
- arXiv ID: 2605.04779v1
- Categories: cs.SE, cs.HC
- Published: May 6, 2026