[Paper] Code for Machines, Not Just Humans: Quantifying AI-Friendliness with Code Health Metrics
Source: arXiv - 2601.02200v1
Overview
The paper Code for Machines, Not Just Humans investigates whether code that scores well on traditional “human‑friendly” quality metrics is also easier for AI coding assistants (e.g., large language models, LLMs) to understand and modify. By running LLM‑driven refactorings on 5,000 Python snippets from competitive‑programming contests, the authors show a strong link between a human‑centric metric called CodeHealth and the AI’s ability to preserve program semantics after automated edits. In short, writing maintainable code today pays off for tomorrow’s AI‑augmented development pipelines.
Key Contributions
- Empirical link between CodeHealth (a human‑oriented maintainability score) and semantic preservation after LLM‑based refactoring.
- Large‑scale experiment on 5,000 real‑world Python solutions, using a state‑of‑the‑art LLM to perform automated refactorings.
- Risk‑assessment framework that leverages CodeHealth to flag code regions where AI‑driven changes are likely safe versus those needing human review.
- Open dataset & tooling (scripts, prompts, and evaluation pipeline) released for reproducibility and further research.
Methodology
- Dataset collection – 5,000 Python solutions from competitive‑programming platforms (e.g., Codeforces, AtCoder) were harvested, providing a diverse mix of algorithmic styles and code quality levels.
- CodeHealth scoring – Each file was evaluated with the CodeHealth metric, which aggregates readability, cyclomatic complexity, naming consistency, and comment density—factors traditionally tied to human maintainability (a simplified stand‑in for this scoring is sketched after this list).
- LLM‑based refactoring – A leading LLM (GPT‑4‑style) was prompted to perform a set of standard refactorings (renaming, extracting functions, simplifying loops, etc.), with the same prompt applied uniformly across all files (see the prompt sketch after this list).
- Semantic preservation check – After refactoring, the original and transformed programs were run against a hidden test suite; a change was deemed semantically preserved only if all tests passed (see the test‑runner sketch after this list).
- Statistical analysis – Correlation and logistic regression were used to quantify how well CodeHealth predicts the likelihood of successful, semantics‑preserving AI edits (see the analysis sketch after this list).
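
The exact CodeHealth formula is not reproduced in this summary, so the snippet below is a minimal stand‑in, assuming a simple weighted mix of cyclomatic complexity, naming consistency, and comment density computed with Python's ast module. Every weight and threshold here is an illustrative assumption, not the real CodeHealth definition.

```python
# Minimal stand-in for a CodeHealth-style score on a 0-10 scale.
# Weights, caps, and signals are assumptions for illustration only;
# they are not the actual CodeHealth formula.
import ast
import re


def cyclomatic_complexity(tree: ast.AST) -> int:
    """Rough cyclomatic-complexity proxy: 1 + number of branching nodes."""
    branches = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp)
    return 1 + sum(isinstance(node, branches) for node in ast.walk(tree))


def naming_consistency(tree: ast.AST) -> float:
    """Fraction of function and variable names written in snake_case."""
    names = [n.id for n in ast.walk(tree) if isinstance(n, ast.Name)]
    names += [f.name for f in ast.walk(tree) if isinstance(f, ast.FunctionDef)]
    if not names:
        return 1.0
    snake = re.compile(r"^[a-z_][a-z0-9_]*$")
    return sum(bool(snake.match(name)) for name in names) / len(names)


def comment_density(source: str) -> float:
    """Share of non-empty lines that contain a comment marker."""
    lines = [line for line in source.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum("#" in line for line in lines) / len(lines)


def code_health_proxy(source: str) -> float:
    """Aggregate the three signals into one 0-10 score (higher is healthier)."""
    tree = ast.parse(source)
    complexity_penalty = min(cyclomatic_complexity(tree) / 20.0, 1.0)  # assumed cap
    score = 10.0 * (0.5 * (1.0 - complexity_penalty)
                    + 0.3 * naming_consistency(tree)
                    + 0.2 * min(comment_density(source) * 4.0, 1.0))
    return round(score, 2)
```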
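
For the uniform refactoring step, the sketch below uses the openai-python chat-completions client as one plausible setup. The model name, the prompt wording, and the zero-temperature setting are assumptions; the paper's released prompts should be consulted for the actual configuration.

```python
# Sketch of the uniform refactoring step. Model name, prompt text, and
# single-call design are assumptions, not the paper's released prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REFACTOR_PROMPT = (
    "Refactor the following Python program: rename unclear identifiers, "
    "extract helper functions, and simplify loops. Do not change its "
    "behaviour. Return only the refactored code.\n\n{code}"
)


def refactor(source: str) -> str:
    """Apply the same refactoring prompt to one solution file."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model; the paper reports a GPT-4-style LLM
        messages=[{"role": "user", "content": REFACTOR_PROMPT.format(code=source)}],
        temperature=0,
    )
    return response.choices[0].message.content
```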
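
The semantic-preservation check can be approximated as below, assuming each snippet comes with a hidden pytest suite. The file layout, the 60-second timeout, and the pytest invocation are illustrative choices rather than the paper's exact harness.

```python
# Sketch of the semantic-preservation check: a refactoring counts as
# preserved only if the transformed program still passes the same hidden
# test suite that the original passed.
import shutil
import subprocess
import tempfile
from pathlib import Path


def passes_tests(solution: str, test_file: Path) -> bool:
    """Run the hidden test suite against one candidate solution."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "solution.py").write_text(solution)
        shutil.copy(test_file, workdir / "test_solution.py")
        try:
            result = subprocess.run(
                ["python", "-m", "pytest", "-q", "test_solution.py"],
                cwd=workdir, capture_output=True, timeout=60,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


def semantics_preserved(original: str, refactored: str, test_file: Path) -> bool:
    """Count a success only when the original itself passed to begin with."""
    return passes_tests(original, test_file) and passes_tests(refactored, test_file)
```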
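
Finally, a sketch of the statistical step on per-file (score, success) pairs, using SciPy for the Pearson correlation and scikit-learn for the logistic regression. The synthetic arrays exist only to make the snippet runnable and are not the paper's data.

```python
# Sketch of the statistical analysis on per-file (CodeHealth, success) pairs.
# The synthetic data below only makes the snippet runnable; the paper uses
# the outcomes of the 5,000-snippet experiment instead.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
scores = rng.uniform(1.0, 10.0, size=200)                          # CodeHealth scores
success = (rng.uniform(0.0, 10.0, size=200) < scores).astype(int)  # 1 = tests passed

# Correlation between CodeHealth and refactoring success.
r, p_value = pearsonr(scores, success)
print(f"Pearson r = {r:.2f} (p = {p_value:.3g})")

# Success rate per CodeHealth quartile (Q1 = lowest, Q4 = highest).
edges = np.quantile(scores, [0.25, 0.5, 0.75])
bins = np.digitize(scores, edges)
for q in range(4):
    print(f"Q{q + 1} success rate: {success[bins == q].mean():.0%}")

# Logistic regression: probability of a semantics-preserving edit given a score.
model = LogisticRegression().fit(scores.reshape(-1, 1), success)
print(model.predict_proba([[7.5]])[0, 1])
```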
Results & Findings
| Finding | Observation |
|---|---|
| Correlation (CodeHealth ↔ success rate) | Pearson r ≈ 0.62 (p < 0.001) – higher CodeHealth strongly predicts successful AI refactoring. |
| Success rate by CodeHealth quartile | Q1 (lowest) ≈ 38 % success, Q4 (highest) ≈ 84 % success. |
| Error types | Most failures stemmed from subtle logic changes (e.g., off‑by‑one errors) rather than syntax issues, and they clustered in low‑CodeHealth files. |
| Prompt robustness | The same prompt worked across the entire corpus, indicating that the observed effect is not prompt‑specific. |
What it means: Code that is already easy for humans to read and maintain is also easier for LLMs to manipulate without breaking functionality. Conversely, “messy” code raises the risk of AI‑induced bugs.
Practical Implications
- AI‑ready code reviews – Teams can integrate CodeHealth checks into CI pipelines to flag high‑risk modules before handing them to AI assistants (e.g., Copilot, Tabnine); a minimal gate is sketched after this list.
- Prioritized refactoring – Organizations can allocate human refactoring effort to low‑CodeHealth hotspots, thereby reducing the chance of costly AI‑generated regressions.
- Tooling enhancements – LLM‑based IDE plugins could surface an “AI‑risk score” derived from CodeHealth, guiding developers to accept or reject suggested edits.
- Onboarding new AI agents – When rolling out a new code‑generation model, companies can start with the “AI‑friendly” portion of their codebase, accelerating adoption while minimizing disruption.
- Cost savings – By preventing AI‑induced bugs early, firms can cut downstream debugging time, which is especially valuable in large monorepos where a single faulty refactor can cascade.
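
As one way to act on the first two bullets, a CI gate could score changed files and fail the build when any fall below a threshold, routing them to human review before AI-assisted edits are accepted. The threshold, the command-line file list, and the reuse of the code_health_proxy() stand-in from the methodology sketch are all assumptions; a real pipeline would call the actual CodeHealth tooling instead.

```python
# Minimal CI-gate sketch: flag low-scoring files for human review before
# AI-assisted changes are merged.
import sys
from pathlib import Path

# Assumes the scoring stand-in from the methodology sketch was saved
# as codehealth_proxy.py next to this script.
from codehealth_proxy import code_health_proxy

REVIEW_THRESHOLD = 6.0  # assumed cut-off on the 0-10 proxy scale


def main(paths: list[str]) -> int:
    risky = []
    for path in paths:
        score = code_health_proxy(Path(path).read_text())
        if score < REVIEW_THRESHOLD:
            risky.append((path, score))
    for path, score in risky:
        print(f"AI-risk flag: {path} (proxy score {score:.1f}), route to human review")
    return 1 if risky else 0  # a non-zero exit code fails the CI step


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```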
Limitations & Future Work
- Domain scope – The study focuses on algorithmic Python scripts; results may differ for large‑scale, object‑oriented systems or other languages.
- LLM version lock‑in – Only one LLM (GPT‑4‑style) was evaluated; newer or smaller models might exhibit different behavior.
- Static metric reliance – CodeHealth captures many maintainability aspects but omits dynamic factors (e.g., runtime performance) that could affect AI friendliness.
- Future directions – Extending the analysis to Java/TypeScript, testing with fine‑tuned domain‑specific LLMs, and exploring additional AI‑centric metrics (e.g., token predictability) are natural next steps.
Bottom line: Investing in clean, maintainable code isn’t just a human‑centric best practice—it also builds a safer foundation for the AI‑augmented development workflows that are rapidly becoming the norm. By measuring and improving CodeHealth today, teams can lower the risk of AI‑generated bugs tomorrow.
Authors
- Markus Borg
- Nadim Hagatulah
- Adam Tornhill
- Emma Söderberg
Paper Information
- arXiv ID: 2601.02200v1
- Categories: cs.SE, cs.AI
- Published: January 5, 2026