[Paper] Rethinking Software Empirical Studies with Structural Causal Models

Published: May 27, 2026 at 09:41 AM EDT

Source: arXiv - 2605.28482v1

Overview

The paper presents CausalSE, a practical framework that brings Judea Pearl’s structural causal modeling (SCM) toolbox into empirical software engineering (ESE). By moving beyond correlations, the authors show how developers and researchers can isolate the causal impact of software‑related interventions. They illustrate the approach with a case study on prompt engineering for GPT‑3 code generation.

Key Contributions

CausalSE Framework – a step‑by‑step guide for applying SCMs (graphical models, do‑calculus, and propensity‑score matching) to typical software experiments.
Tutorial‑style Methodology – concrete recipes (data‑preparation, DAG construction, identification, estimation) that require only standard statistical tools (R/Python) and no deep causal‑theory background.
Real‑World Case Study – analysis of the Galeras dataset, revealing why more complex prompts appear beneficial in associative tests but lose significance once confounding is controlled.
Open‑source Artefacts – reproducible code, DAG templates, and a Jupyter notebook that let practitioners plug in their own datasets.
Critical Insight – demonstration that many published “effects” in software engineering literature may be false positives caused by hidden confounders.

Methodology

Define the Treatment & Outcome – e.g., prompt complexity (treatment) vs. code generation quality (outcome).
Build a Causal DAG – sketch variables (prompt length, model temperature, task difficulty, developer expertise) and draw directed edges to encode assumed causal relations.
Identify Confounders – nodes that influence both treatment and outcome (e.g., task difficulty) are flagged for adjustment.
Apply Propensity‑Score Matching (PSM) – compute the probability of receiving a “complex” prompt given the confounders, then pair similar observations across treatment groups.
Estimate the Causal Effect – use the matched sample to estimate the average treatment effect (ATE) and assess its uncertainty with simple statistical procedures (t‑test, bootstrap).
Validate Assumptions – check balance diagnostics (standardized mean differences) and perform sensitivity analysis to gauge robustness to unobserved confounding.
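
The confounder-identification step can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the edges in `DAG` are assumptions made up for this example, and the helper names `descendants` and `confounders` are hypothetical. The check simply looks for common causes of treatment and outcome, which is a simplification of the full back-door criterion.

```python
# Hypothetical DAG for the prompt-complexity example. The edges are
# illustrative assumptions, not the graph used in the paper.
DAG = {
    "task_difficulty": ["prompt_complexity", "code_quality"],
    "developer_expertise": ["prompt_complexity", "code_quality"],
    "model_temperature": ["prompt_complexity", "code_quality"],
    "prompt_complexity": ["prompt_length", "code_quality"],
    "prompt_length": ["code_quality"],  # mediator: lies on the causal path
    "code_quality": [],
}


def descendants(dag, node):
    """Return every node reachable from `node` along directed edges."""
    seen, stack = set(), list(dag[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(dag[current])
    return seen


def confounders(dag, treatment, outcome):
    """Common causes: nodes that reach the treatment and also reach the
    outcome by a path that does not pass through the treatment."""
    # Cut the treatment's outgoing edges so paths through it do not count.
    cut = {n: ([] if n == treatment else kids) for n, kids in dag.items()}
    found = []
    for node in dag:
        if node in (treatment, outcome):
            continue
        causes_treatment = treatment in descendants(dag, node)
        causes_outcome = outcome in descendants(cut, node)
        if causes_treatment and causes_outcome:
            found.append(node)
    return sorted(found)


adjustment_set = confounders(DAG, "prompt_complexity", "code_quality")
print(adjustment_set)
```

In this toy graph the mediator `prompt_length` is left out of the adjustment set, because it lies downstream of the treatment and adjusting for it would block part of the effect being measured.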

All steps are implemented with familiar libraries such as pandas, statsmodels, and causalgraphicalmodels, making the pipeline approachable for developers.
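
To make the matching and estimation steps concrete, here is a minimal, standard-library-only sketch on synthetic data. Everything in it is an assumption for illustration: the data-generating process, the single confounder, the hand-rolled logistic regression, and the function names are not taken from the paper or its artefacts. A real analysis would more likely fit the propensity model with a library such as statsmodels. The true effect of a complex prompt is set to zero, so any naive difference comes from confounding alone.

```python
import math
import random
from statistics import mean


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate(n=1000, seed=0):
    """Synthetic logs: task difficulty drives both prompt choice and quality.
    The true effect of a complex prompt on quality is zero by construction."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        difficulty = rng.random()
        complex_prompt = rng.random() < sigmoid(2.0 - 4.0 * difficulty)
        quality = 1.0 - difficulty + rng.gauss(0.0, 0.1)
        rows.append((difficulty, complex_prompt, quality))
    return rows


def fit_propensity(rows, lr=2.0, steps=500):
    """Logistic regression P(complex | difficulty) via gradient descent."""
    b0 = b1 = 0.0
    n = len(rows)
    for _ in range(steps):
        g0 = g1 = 0.0
        for difficulty, treated, _ in rows:
            err = sigmoid(b0 + b1 * difficulty) - treated
            g0 += err
            g1 += err * difficulty
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return [sigmoid(b0 + b1 * d) for d, _, _ in rows]


def naive_difference(rows):
    """Raw difference in mean quality between the two prompt groups."""
    treated = [q for _, t, q in rows if t]
    control = [q for _, t, q in rows if not t]
    return mean(treated) - mean(control)


def matched_ate(rows, scores):
    """1-nearest-neighbour matching on the propensity score, with
    replacement; every unit is matched to the opposite group."""
    treated_idx = [i for i, r in enumerate(rows) if r[1]]
    control_idx = [i for i, r in enumerate(rows) if not r[1]]
    effects = []
    for i, (_, treated, quality) in enumerate(rows):
        pool = control_idx if treated else treated_idx
        j = min(pool, key=lambda k: abs(scores[k] - scores[i]))
        other = rows[j][2]
        effects.append(quality - other if treated else other - quality)
    return mean(effects)


rows = simulate()
scores = fit_propensity(rows)
naive = naive_difference(rows)
adjusted = matched_ate(rows, scores)
print(f"naive difference: {naive:.3f}")
print(f"matched ATE:      {adjusted:.3f}")
```

Because easier tasks are more likely to receive a complex prompt in this simulation, the naive difference is clearly positive, while the matched estimate stays close to the true value of zero. That is the same qualitative pattern the case study reports, reproduced here only on made-up data.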

Results & Findings

Associational Analysis (raw correlation) suggested that complex prompts improve GPT‑3 code quality by ~12 %.
CausalSE (PSM‑adjusted) found the ATE to be statistically indistinguishable from zero (≈ 1 % improvement, p > 0.2).
Balance Checks confirmed that after matching, prompt complexity was no longer correlated with task difficulty or model temperature, removing the bias from these observed confounders.
Sensitivity Tests indicated that only a very strong hidden confounder could overturn the null result, reinforcing confidence in the causal conclusion.
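
The balance diagnostic behind such checks, the standardized mean difference (SMD), is simple enough to sketch directly. The covariate values below are invented for illustration, and the 0.1 cut-off is a commonly used rule of thumb rather than a threshold reported in the paper.

```python
import math
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """SMD of one covariate: mean gap divided by the pooled standard deviation."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2.0)
    return (mean(treated) - mean(control)) / pooled_sd


# Hypothetical task-difficulty values, invented for illustration only.
before_treated, before_control = [0.2, 0.3, 0.4], [0.6, 0.7, 0.8]
after_treated, after_control = [0.2, 0.5, 0.8], [0.25, 0.5, 0.75]

smd_before = standardized_mean_difference(before_treated, before_control)
smd_after = standardized_mean_difference(after_treated, after_control)

# 0.1 is a rule-of-thumb threshold, not a value taken from the paper.
balanced_before = abs(smd_before) < 0.1
balanced_after = abs(smd_after) < 0.1
print(f"SMD before matching: {smd_before:.2f} (balanced: {balanced_before})")
print(f"SMD after matching:  {smd_after:.2f} (balanced: {balanced_after})")
```

A small SMD only shows balance on the covariates that were measured; it says nothing about unobserved confounders, which is why the sensitivity analysis is a separate step.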

Practical Implications

Better Experiment Design – developers can now plan A/B tests for tooling, APIs, or prompt strategies with a clear checklist for confounder control, reducing wasted effort on spurious optimizations.
More Trustworthy Benchmarks – performance claims for LLM‑based code assistants, static analysis tools, or CI pipelines can be backed by causal evidence, improving stakeholder confidence.
Tooling Integration – the open‑source CausalSE notebooks can be embedded into CI/CD pipelines to automatically assess whether a new feature truly causes a performance lift.
Risk Mitigation – by exposing hidden biases (e.g., dataset composition, developer skill), teams can avoid costly roll‑outs based on misleading correlation‑only studies.
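
As a rough idea of what an automated check in a pipeline might look like, the sketch below computes a percentile bootstrap confidence interval over per-pair effect estimates and passes only if the whole interval lies above zero. The functions `bootstrap_ci` and `shows_lift` and the sample numbers are hypothetical. They are not part of the CausalSE artefacts, and a real gate would need to rest on a properly identified effect estimate.

```python
import random
from statistics import mean


def bootstrap_ci(effects, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean effect."""
    rng = random.Random(seed)
    n = len(effects)
    means = sorted(mean(rng.choices(effects, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def shows_lift(effects, **kwargs):
    """Pass only when the entire interval is above zero."""
    lower, _ = bootstrap_ci(effects, **kwargs)
    return lower > 0.0


# Hypothetical matched-pair effect estimates, invented for illustration.
clear_lift = [0.10, 0.12, 0.08, 0.11, 0.09, 0.13, 0.07, 0.10]
no_lift = [0.10, -0.10, 0.05, -0.05, 0.08, -0.08, 0.02, -0.02]

print("clear_lift passes:", shows_lift(clear_lift))
print("no_lift passes:   ", shows_lift(no_lift))
```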

Limitations & Future Work

Assumption‑Heavy – causal validity hinges on the correctness of the DAG; misspecified relationships can still bias results.
Observational Data Only – the case study relies on existing logs; randomized controlled trials remain the gold standard for certain interventions.
Scalability – while PSM works well for moderate‑size datasets, larger telemetry streams may require more advanced estimators (e.g., doubly robust or machine‑learning‑based propensity models).
Domain Generalization – the authors plan to extend CausalSE to other SE sub‑areas (bug triage, effort estimation) and to integrate with emerging causal‑inference libraries (e.g., DoWhy, CausalML).
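
For readers unfamiliar with the doubly robust estimators mentioned under Scalability, one standard form is the augmented inverse-probability-weighted (AIPW) estimator. The notation here is an assumption introduced for this note, not taken from the paper: $T_i$ is the treatment indicator, $Y_i$ the outcome, $X_i$ the measured confounders, $\hat{e}(X_i)$ the fitted propensity score, and $\hat{\mu}_t(X_i)$ a fitted outcome model for treatment arm $t$.

```latex
\hat{\tau}_{\mathrm{AIPW}}
  = \frac{1}{n} \sum_{i=1}^{n} \left[
      \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)
      + \frac{T_i \,\bigl(Y_i - \hat{\mu}_1(X_i)\bigr)}{\hat{e}(X_i)}
      - \frac{(1 - T_i)\,\bigl(Y_i - \hat{\mu}_0(X_i)\bigr)}{1 - \hat{e}(X_i)}
    \right]
```

The estimate remains consistent if either the propensity model or the outcome model is correctly specified, which is where the name "doubly robust" comes from.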

By equipping software engineers with a ready‑to‑use causal toolbox, the paper paves the way for more reliable, action‑oriented empirical research in the fast‑moving world of software development.

Authors

Daniel Rodriguez-Cardenas
Aya Garryyeva
David Nader Palacio
Antonio Mastropaolo
Denys Poshyvanyk

Paper Information

arXiv ID: 2605.28482v1
Categories: cs.SE
Published: May 27, 2026
