[Paper] Can Causality Cure Confusion Caused By Correlation (in Software Analytics)?
Source: arXiv - 2602.16091v1
Overview
The paper investigates whether causality‑aware decision trees can make software‑analytics models more stable and trustworthy than the traditional correlation‑driven trees that dominate defect prediction, configuration tuning, and other SE tasks. By embedding causal split criteria into symbolic models, the authors ask: Do we get steadier explanations without sacrificing predictive power?
Key Contributions
- Causal split criterion for decision trees – introduces a conditional‑entropy‑based split that filters out confounding variables.
- Large‑scale empirical evaluation – 120+ multi‑objective optimization problems from the MOOT repository are used to compare causal trees, standard correlation‑based trees (EZR), and human expert judgments.
- Rigorous stability measurement – a preregistered bootstrap‑ensemble protocol with win‑score assignments quantifies model variance across repeated runs.
- Statistical analysis framework – applies variance, Gini impurity, Kolmogorov‑Smirnov test, and Cliff’s δ to assess both stability and performance differences.
- Insight into the trade‑off – shows that causal trees can improve stability while maintaining comparable predictive/optimization performance.
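To make the first contribution concrete, here is a minimal sketch of a conditional‑entropy split criterion. This is an illustrative reimplementation, not the authors' code: the function names, the dict‑of‑columns data layout, and the greedy "pick the minimizing feature" loop are all assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(feature, labels):
    """H(Y | X): entropy of the labels within each feature-value group,
    weighted by group size. Lower means the feature explains more."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    return sum(len(ys) / n * entropy(ys) for ys in groups.values())

def best_split(features, labels):
    """Choose the feature (column name -> column values) whose split
    minimizes H(Y | X). A confounder pre-filter would run before this."""
    return min(features, key=lambda name: conditional_entropy(features[name], labels))
```

In the paper's causal variant, a causal‑discovery pre‑filter would first drop features flagged as confounders before this minimization runs.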
Methodology
- Data & Tasks – The authors draw from the MOOT (Multi‑Objective Optimization Tasks) repository, which supplies a diverse set of software‑engineering optimization problems (e.g., defect‑prediction, configuration‑tuning).
- Model Variants
- EZR: a classic decision‑tree learner that uses correlation‑based split criteria (variance reduction / information gain).
- Causal‑Tree: a new learner that replaces the split metric with conditional entropy and explicitly removes variables identified as confounders (using a simple causal discovery pre‑filter).
- Human Experts: participants are asked to rank causal relevance of variables for a subset of tasks, providing a baseline for “human stability.”
- Bootstrap‑Ensemble Protocol – For each task, 30 bootstrap samples are drawn. Each sample trains a model, and the resulting trees are compared pairwise using a win‑score (how often one tree’s split beats another’s on the same data). This yields a distribution of stability scores.
- Evaluation Metrics
- Stability: variance of win‑scores, Gini impurity of split selections, KS test for distribution shifts.
- Performance: standard predictive metrics (e.g., AUC, F1) and multi‑objective optimization quality (hypervolume, Pareto front coverage).
- Effect Size: Cliff’s δ quantifies the magnitude of any observed differences.
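The stability metrics above can be sketched with a few lines of Python. Assume each bootstrap run records which feature was chosen as the root split; the helper names below and the "root split as stability proxy" simplification are assumptions, not the paper's exact protocol.

```python
from collections import Counter
import statistics

def win_scores(picks):
    """Fraction of bootstrap runs in which each feature won the root split."""
    n = len(picks)
    return {f: c / n for f, c in Counter(picks).items()}

def gini_impurity(picks):
    """Gini impurity of the split-selection distribution:
    0 = the same split wins every run (perfectly stable),
    values near 1 = wins are spread thinly across many splits."""
    return 1.0 - sum(p * p for p in win_scores(picks).values())

def score_variance(picks):
    """Population variance of the win-scores across chosen splits."""
    return statistics.pvariance(win_scores(picks).values())
```

For example, a learner that picks feature `a` in 9 of 10 bootstraps and `b` in 1 is far more stable (lower impurity and variance) than one that alternates evenly.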
Results & Findings
| Aspect | Correlation‑Based (EZR) | Causal‑Tree | Human Experts |
|---|---|---|---|
| Stability (variance of win‑scores) | High variance; trees often change split order with small data perturbations. | ~30 % lower variance; split choices are more consistent across bootstraps. | Comparable to causal‑tree; humans are surprisingly stable when asked to rank causal relevance. |
| Predictive performance | Baseline AUC ≈ 0.78 (defect prediction). | AUC ≈ 0.77 – no statistically significant drop (Cliff’s δ ≈ 0.02). | Human judgments alone are not predictive; they serve only as a stability benchmark. |
| Optimization quality | Hypervolume ≈ 0.62. | Hypervolume ≈ 0.61 – within 1 % of EZR. | Not applicable. |
| Statistical significance | – | KS test p < 0.01 for stability improvement; performance tests p > 0.05. | – |
Bottom line: Causal split criteria substantially improve the stability of decision‑tree explanations (roughly 30 % lower variance) without measurably hurting the accuracy or optimization quality of the models.
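The Cliff's δ effect size used throughout the table is simple to compute; a self‑contained sketch (the pairwise O(n·m) loop is the textbook definition, not the paper's implementation):

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs.
    Ranges from -1 to 1; values near 0 (|delta| below ~0.147 by common
    convention) indicate a negligible effect, as in the paper's
    performance comparison (delta ~ 0.02)."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))
```

A δ of 0 means the two score distributions overlap completely; ±1 means complete separation.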
Practical Implications
- More trustworthy analytics dashboards – Teams can rely on the same feature importance explanations across releases, reducing “analysis paralysis” caused by fluctuating model insights.
- Reduced need for frequent model re‑training – Since causal trees are less sensitive to minor data changes, the operational cost of retraining (compute, validation) drops.
- Better integration with DevOps pipelines – Stable models simplify automated alerts and policy enforcement (e.g., “if X is a causal driver of defects, block the PR”).
- Facilitates human‑in‑the‑loop debugging – Because the causal trees’ splits align more closely with human causal reasoning, engineers can more easily validate or contest model suggestions.
- Potential for safer AI‑assisted refactoring – When refactoring code, a causal model can highlight truly risky components rather than spurious correlations, lowering the chance of regressions.
Limitations & Future Work
- Causal discovery is heuristic – The confounder‑filter relies on a lightweight causal discovery step that may miss subtle latent variables; scaling to very high‑dimensional code metrics could be challenging.
- Domain specificity – Experiments focus on multi‑objective SE tasks; results may differ for pure classification or time‑series forecasting problems.
- Human study size – The expert stability benchmark involved a limited number of participants; broader user studies are needed to confirm generalizability.
- Future directions suggested by the authors include: (1) integrating more sophisticated causal discovery algorithms (e.g., PC, GES) into the split process, (2) extending the approach to ensemble methods like Random Forests, and (3) evaluating the impact on downstream CI/CD decision‑making in real industrial settings.
Authors
- Amirali Rayegan
- Tim Menzies
Paper Information
- arXiv ID: 2602.16091v1
- Categories: cs.SE
- Published: February 17, 2026