[Paper] Code Lifespan Survival Analysis (CLSA): Predicting the Survival of Source Code Lines Using AST-Aware Mining

Published: June 3, 2026 at 11:13 AM EDT

Source: arXiv - 2606.04993v1

Overview

Predicting which lines of source code are likely to be removed, and when, can help teams prioritize code reviews, manage technical debt, and streamline refactoring. This paper introduces Code Lifespan Survival Analysis (CLSA), which it presents as the first framework to model the “survival” of individual source‑code lines using survival‑analysis techniques traditionally employed in medical research. Each line is treated as a subject that either experiences the event (it is deleted) or is “censored” (it is still present when observation ends, so its deletion time is unknown). With this framing, CLSA uncovers fine‑grained risk factors that are computable from a single file’s abstract syntax tree (AST) and a few static metrics, without needing the full project history.

Key Contributions

Line‑level survival modeling: First approach to predict deletion risk at the granularity of individual source‑code lines.
AST‑aware feature set: Uses structural (AST depth, parent node type), contextual (branching, entropy), and temporal covariates that can be extracted statically from a file.
Large‑scale empirical study: Analyzes 32.5 M line‑birth events from 120 open‑source TypeScript repositories, using a 5‑stage matching pipeline that filters out refactoring noise (≈8.3 M false “deaths”).
Robust statistical validation: Fits a Cox Proportional Hazards model and validates it with Weibull/Log‑Logistic Accelerated Failure Time (AFT) models, gamma frailty for repository effects, and time‑stratified landmark analyses.
Interpretability & calibration recipe: Provides clear hazard ratios (HRs) for each covariate and a practical method to compute time‑conditional risk scores that can be plugged into IDEs or code‑review tools.
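
A time‑conditional risk score answers the question "given that this line has already survived to age t, how likely is it to be deleted within the next Δ days?". Under a proportional‑hazards model the survival function is S(t | x) = S0(t)^exp(β·x), so the conditional risk is 1 − S(t + Δ | x) / S(t | x). The sketch below illustrates only that arithmetic. The Weibull baseline, its shape and scale, the covariate names, and the coefficients are all hypothetical placeholders, not values or code from the paper.

```python
import math

def baseline_survival(t_days, shape=0.6, scale=2000.0):
    """Hypothetical Weibull baseline S0(t); shape and scale are illustrative, not fitted."""
    return math.exp(-((t_days / scale) ** shape))

def survival(t_days, covariates, coefs):
    """Proportional-hazards survival: S(t | x) = S0(t) ** exp(beta . x)."""
    linear_predictor = sum(coefs[name] * value for name, value in covariates.items())
    return baseline_survival(t_days) ** math.exp(linear_predictor)

def conditional_risk(age_days, horizon_days, covariates, coefs):
    """P(deleted within horizon | survived to age) = 1 - S(age + horizon) / S(age)."""
    s_now = survival(age_days, covariates, coefs)
    s_later = survival(age_days + horizon_days, covariates, coefs)
    return 1.0 - s_later / s_now

# Hypothetical log hazard ratios for two made-up covariates.
coefs = {"entropy": math.log(0.8), "in_branch": math.log(1.2)}
line = {"entropy": 1.5, "in_branch": 1.0}

risk_new = conditional_risk(0, 90, line, coefs)    # brand-new line, next 90 days
risk_old = conditional_risk(365, 90, line, coefs)  # one-year-old line, next 90 days
print(f"new line: {risk_new:.3f}, one-year-old line: {risk_old:.3f}")
```

With a baseline shape below 1 the hazard falls with age, so the same line gets a lower 90‑day risk once it has survived its first year. A real deployment would substitute the fitted baseline and coefficients.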

Methodology

Data collection:
- Cloned 120 popular TypeScript projects from GitHub.
- Tracked every line’s birth (first appearance) across the commit history, yielding 32.5 M events.
Noise removal:
- Implemented a bipartite matching pipeline that distinguishes true deletions from refactorings such as code moves or rewrites.
- This step eliminated 8.3 M spurious “deaths,” ensuring the survival analysis reflects genuine line removals.
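
To make the idea of separating true deletions from moves and rewrites concrete, here is a deliberately simplified two‑stage sketch: exact matching after whitespace normalization, then greedy fuzzy matching. It is not the authors' pipeline. The function names, the 0.8 similarity threshold, and the use of `difflib.SequenceMatcher` are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def normalize(line):
    """Collapse whitespace so pure re-indentation does not count as a change."""
    return " ".join(line.split())

def true_deletions(deleted, added, threshold=0.8):
    """Return deleted lines that have no plausible counterpart among the added lines."""
    remaining = [normalize(a) for a in added]
    unmatched = []
    # Stage 1: exact matches after normalization (moved or re-indented lines).
    for line in deleted:
        key = normalize(line)
        if key in remaining:
            remaining.remove(key)
        else:
            unmatched.append(line)
    # Stage 2: score every remaining pair, then greedily accept the best pairs first.
    candidates = []
    for i, line in enumerate(unmatched):
        for j, cand in enumerate(remaining):
            score = SequenceMatcher(None, normalize(line), cand).ratio()
            if score >= threshold:
                candidates.append((score, i, j))
    candidates.sort(reverse=True)
    used_deleted, used_added = set(), set()
    for _score, i, j in candidates:
        if i not in used_deleted and j not in used_added:
            used_deleted.add(i)
            used_added.add(j)
    return [line for i, line in enumerate(unmatched) if i not in used_deleted]

deleted = ["  const total = a + b;", "return total;", "console.log('debug');"]
added = ["const total = a + b;", "return total * 2;"]
result = true_deletions(deleted, added)
print(result)
```

In this toy commit the re‑indented line and the lightly edited `return` are treated as surviving, so only the `console.log` line counts as a death. Greedy matching is a shortcut; an optimal bipartite assignment would need a dedicated solver.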
Feature extraction:
- Structural: AST depth, node type (e.g., expression, declaration), whether the line is inside a conditional branch.
- Contextual: Shannon entropy of the line’s token distribution (a proxy for “complexity” or “uniqueness”).
- Temporal: Age of the line at observation, repository‑level random effect (frailty).
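
The entropy covariate can be illustrated with a few lines of Python. This is a minimal sketch that assumes a simple regex tokenizer (identifiers, integers, and single punctuation characters); the summary does not say how the paper tokenizes lines, so `TOKEN_PATTERN` and `token_entropy` are hypothetical.

```python
import math
import re
from collections import Counter

# Assumed tokenizer: identifiers, integer literals, or any single non-space character.
TOKEN_PATTERN = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*|\d+|\S")

def token_entropy(line):
    """Shannon entropy (in bits) of the token frequency distribution of one line."""
    tokens = TOKEN_PATTERN.findall(line)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

for sample in ["}", "a b a b", "const total = price * quantity;"]:
    print(repr(sample), round(token_entropy(sample), 3))
```

A lone closing brace has zero entropy, while a line with many distinct tokens scores higher, which matches the intuition of entropy as a proxy for "uniqueness".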
Statistical modeling:
- Primary model: Cox Proportional Hazards with 15 covariates.
- Checked proportional‑hazards assumptions; where violated (e.g., entropy, branch presence), introduced time‑varying coefficients.
- Complementary models: Weibull and Log‑Logistic AFT for robustness; gamma frailty to capture repository‑specific heterogeneity; landmark models to evaluate risk at specific ages (e.g., 0‑90 days, 90‑365 days, >365 days).
Evaluation:
- Concordance index (C‑index) as discrimination metric.
- Calibration plots to verify predicted vs. observed survival probabilities.
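
For readers unfamiliar with the concordance index, the following sketch computes Harrell's C for right‑censored data in plain Python. It is a quadratic‑time illustration on made‑up data, not the evaluation code used in the paper. A pair of lines is comparable when the one with the shorter time was actually deleted, and it is concordant when that line also has the higher predicted risk.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: fraction of comparable pairs ordered correctly by predicted risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable only if line i was observed to be deleted strictly before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy data: days observed, 1 = deleted / 0 = censored, and a made-up risk score.
times = [30, 90, 200, 400]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]
c_index = concordance_index(times, events, risks)
print(c_index)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives context for the C‑index values reported in this summary.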

Results & Findings

Metric	Observation
Overall survivability	More than 50 % of lines were never deleted during the observation window (median survival not reached).
Median lifespan of deleted lines	95.7 days.
Entropy effect	Protective for new code (HR = 0.84, 0‑90 days) and strongly protective for mature code (HR = 0.36, >365 days).
Conditional branch	Slightly protective at birth (HR = 0.97) but becomes a risk factor after 90 days (HR = 1.21).
Repository frailty	Largest source of variance; adding a gamma frailty term raises the C‑index from 0.586 to 0.666.
Time‑varying regimes	Covariate impacts split into three temporal regimes (new, intermediate, mature), confirming that risk factors evolve as code ages.

The models achieve moderate discrimination (C‑index ≈ 0.66) and are well‑calibrated, meaning the predicted probabilities align closely with observed deletion rates across time windows.
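
The phrase "median survival not reached" follows directly from how a Kaplan‑Meier curve is built: if the estimated survival probability never drops to 0.5, no median exists. The sketch below shows this on a small made‑up sample. It is a textbook estimator written for illustration, and the data are invented, not taken from the study.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where at least one deletion occurred."""
    order = sorted(zip(times, events))
    at_risk = len(order)
    survival = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = order[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects sharing the same time t.
        while i < len(order) and order[i][0] == t:
            deaths += order[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deleted and censored lines both leave the risk set
    return curve

def median_survival(curve):
    """First time at which S(t) <= 0.5, or None if the median is not reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None

# Toy sample: 3 deletions, 5 lines still alive (censored) when observation ends.
times = [10, 40, 40, 120, 300, 300, 300, 300]
events = [1, 1, 0, 1, 0, 0, 0, 0]
curve = kaplan_meier(times, events)
print(curve, median_survival(curve))
```

Here the curve ends at about 0.6, so the median over all lines is undefined, even though the deleted lines taken alone have a perfectly well‑defined median lifespan. The two figures in the table measure different things in the same way.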

Practical Implications

IDE‑integrated risk scores: Developers can get a live “deletion risk” badge for each line they write, helping them spot potentially volatile code early (e.g., low‑entropy, deep‑branch statements).
Prioritized code review: Review tools can surface high‑risk lines (e.g., lines older than 90 days inside conditional branches, where the reported HR is 1.21) for extra scrutiny, reducing the chance of introducing bugs or technical debt.
Technical debt dashboards: Project managers can aggregate line‑level survival probabilities to quantify “code churn risk” at the module or repository level, informing refactoring schedules.
Automated refactoring suggestions: Static analysis tools could flag low‑entropy lines, which the model associates with higher deletion risk, or recommend extracting volatile conditional logic into separate functions before it becomes a maintenance hotspot.
Cross‑project benchmarking: The repository frailty term highlights that some projects inherently produce more volatile code; teams can compare their frailty scores against industry baselines to gauge process health.
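
As a rough illustration of the dashboard idea, line‑level deletion probabilities can be rolled up per module: by linearity of expectation, the sum of per‑line probabilities is the expected number of deleted lines. The function name, the input shape, and the sample paths below are hypothetical.

```python
from collections import defaultdict

def expected_churn_by_module(line_risks):
    """Aggregate (module_path, deletion_probability) pairs into per-module summaries."""
    totals = defaultdict(lambda: [0.0, 0])  # module -> [sum of risks, line count]
    for module, risk in line_risks:
        totals[module][0] += risk
        totals[module][1] += 1
    return {
        module: {"expected_deleted_lines": total, "mean_risk": total / count}
        for module, (total, count) in totals.items()
    }

# Made-up per-line risks for two hypothetical files.
sample = [("src/a.ts", 0.5), ("src/a.ts", 0.25), ("src/b.ts", 0.1)]
summary = expected_churn_by_module(sample)
print(summary)
```

The mean risk is comparable across modules of different sizes, while the expected count reflects absolute churn.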

Limitations & Future Work

Language scope: The study focuses exclusively on TypeScript; applicability to other languages (e.g., Java, Python) needs validation.
Static‑only features: While the authors deliberately avoid version‑history or bug‑tracker data, incorporating such dynamic signals could improve predictive power.
C‑index ceiling: A concordance of ~0.66 suggests substantial unexplained variance; richer contextual features (e.g., developer experience, test coverage) might lift performance.
Real‑time deployment: Translating the survival model into a low‑latency IDE plugin will require efficient feature extraction and incremental model updates.
Long‑term evolution: The current analysis covers up to a few years of history; extending to longer horizons could reveal different survival regimes for legacy code.

Overall, CLSA opens a new avenue for fine‑grained, interpretable code‑health analytics that can be directly leveraged by developers and tooling ecosystems.

Authors

Pavel Gurov

Paper Information

arXiv ID: 2606.04993v1
Categories: cs.SE
Published: June 3, 2026
