[Paper] Code for Machines, Not Just Humans: Quantifying AI-Friendliness with Code Health Metrics
Source: arXiv - 2601.02200v1
Overview
The paper Code for Machines, Not Just Humans investigates whether code that scores well on traditional “human‑friendly” quality metrics is also easier for AI coding assistants (e.g., large language models, LLMs) to understand and modify. By running LLM‑driven refactorings on 5,000 Python snippets from competitive‑programming contests, the authors show a strong link between a human‑centric metric called CodeHealth and the AI’s ability to preserve program semantics after automated edits. In short, writing maintainable code today pays off for tomorrow’s AI‑augmented development pipelines.
Key Contributions
- Empirical link between CodeHealth (a human‑oriented maintainability score) and semantic preservation after LLM‑based refactoring.
- Large‑scale experiment on 5,000 real‑world Python solutions, using a state‑of‑the‑art LLM to perform automated refactorings.
- Risk‑assessment framework that leverages CodeHealth to flag code regions where AI‑driven changes are likely safe versus those needing human review.
- Open dataset & tooling (scripts, prompts, and evaluation pipeline) released for reproducibility and further research.
Methodology
- Dataset collection – 5,000 Python solutions from competitive‑programming platforms (e.g., Codeforces, AtCoder) were harvested, providing a diverse mix of algorithmic styles and code quality levels.
- CodeHealth scoring – Each file was evaluated with the CodeHealth metric, which aggregates readability, cyclomatic complexity, naming consistency, and comment density—factors traditionally tied to human maintainability (a simplified stand‑in for this scoring is sketched after this list).
- LLM‑based refactoring – A leading LLM (GPT‑4‑style) was prompted to perform a set of standard refactorings (renaming, extracting functions, simplifying loops, etc.), with the same prompt applied uniformly across all files (see the prompt sketch after this list).
- Semantic preservation check – After refactoring, the original and transformed programs were run against a hidden test suite; a change was deemed semantically preserved only if all tests passed (see the test‑runner sketch after this list).
- Statistical analysis – Correlation and logistic regression were used to quantify how well CodeHealth predicts the likelihood of successful, semantics‑preserving AI edits (see the analysis sketch after this list).
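
The exact CodeHealth formula is not reproduced in this summary, so the snippet below is a minimal stand‑in, assuming a simple weighted mix of cyclomatic complexity, naming consistency, and comment density computed with Python's ast module. Every weight and threshold here is an illustrative assumption, not the real CodeHealth definition.

```python
# Minimal stand-in for a CodeHealth-style score on a 0-10 scale.
# Weights, caps, and signals are assumptions for illustration only;
# they are not the actual CodeHealth formula.
import ast
import re


def cyclomatic_complexity(tree: ast.AST) -> int:
    """Rough cyclomatic-complexity proxy: 1 + number of branching nodes."""
    branches = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp)
    return 1 + sum(isinstance(node, branches) for node in ast.walk(tree))


def naming_consistency(tree: ast.AST) -> float:
    """Fraction of function and variable names written in snake_case."""
    names = [n.id for n in ast.walk(tree) if isinstance(n, ast.Name)]
    names += [f.name for f in ast.walk(tree) if isinstance(f, ast.FunctionDef)]
    if not names:
        return 1.0
    snake = re.compile(r"^[a-z_][a-z0-9_]*$")
    return sum(bool(snake.match(name)) for name in names) / len(names)


def comment_density(source: str) -> float:
    """Share of non-empty lines that contain a comment marker."""
    lines = [line for line in source.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum("#" in line for line in lines) / len(lines)


def code_health_proxy(source: str) -> float:
    """Aggregate the three signals into one 0-10 score (higher is healthier)."""
    tree = ast.parse(source)
    complexity_penalty = min(cyclomatic_complexity(tree) / 20.0, 1.0)  # assumed cap
    score = 10.0 * (0.5 * (1.0 - complexity_penalty)
                    + 0.3 * naming_consistency(tree)
                    + 0.2 * min(comment_density(source) * 4.0, 1.0))
    return round(score, 2)
```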
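
For the uniform refactoring step, the sketch below uses the openai-python chat-completions client as one plausible setup. The model name, the prompt wording, and the zero-temperature setting are assumptions; the paper's released prompts should be consulted for the actual configuration.

```python
# Sketch of the uniform refactoring step. Model name, prompt text, and
# single-call design are assumptions, not the paper's released prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REFACTOR_PROMPT = (
    "Refactor the following Python program: rename unclear identifiers, "
    "extract helper functions, and simplify loops. Do not change its "
    "behaviour. Return only the refactored code.\n\n{code}"
)


def refactor(source: str) -> str:
    """Apply the same refactoring prompt to one solution file."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model; the paper reports a GPT-4-style LLM
        messages=[{"role": "user", "content": REFACTOR_PROMPT.format(code=source)}],
        temperature=0,
    )
    return response.choices[0].message.content
```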
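
The semantic-preservation check can be approximated as below, assuming each snippet comes with a hidden pytest suite. The file layout, the 60-second timeout, and the pytest invocation are illustrative choices rather than the paper's exact harness.

```python
# Sketch of the semantic-preservation check: a refactoring counts as
# preserved only if the transformed program still passes the same hidden
# test suite that the original passed.
import shutil
import subprocess
import tempfile
from pathlib import Path


def passes_tests(solution: str, test_file: Path) -> bool:
    """Run the hidden test suite against one candidate solution."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "solution.py").write_text(solution)
        shutil.copy(test_file, workdir / "test_solution.py")
        try:
            result = subprocess.run(
                ["python", "-m", "pytest", "-q", "test_solution.py"],
                cwd=workdir, capture_output=True, timeout=60,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


def semantics_preserved(original: str, refactored: str, test_file: Path) -> bool:
    """Count a success only when the original itself passed to begin with."""
    return passes_tests(original, test_file) and passes_tests(refactored, test_file)
```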
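
Finally, a sketch of the statistical step on per-file (score, success) pairs, using SciPy for the Pearson correlation and scikit-learn for the logistic regression. The synthetic arrays exist only to make the snippet runnable and are not the paper's data.

```python
# Sketch of the statistical analysis on per-file (CodeHealth, success) pairs.
# The synthetic data below only makes the snippet runnable; the paper uses
# the outcomes of the 5,000-snippet experiment instead.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
scores = rng.uniform(1.0, 10.0, size=200)                          # CodeHealth scores
success = (rng.uniform(0.0, 10.0, size=200) < scores).astype(int)  # 1 = tests passed

# Correlation between CodeHealth and refactoring success.
r, p_value = pearsonr(scores, success)
print(f"Pearson r = {r:.2f} (p = {p_value:.3g})")

# Success rate per CodeHealth quartile (Q1 = lowest, Q4 = highest).
edges = np.quantile(scores, [0.25, 0.5, 0.75])
bins = np.digitize(scores, edges)
for q in range(4):
    print(f"Q{q + 1} success rate: {success[bins == q].mean():.0%}")

# Logistic regression: probability of a semantics-preserving edit given a score.
model = LogisticRegression().fit(scores.reshape(-1, 1), success)
print(model.predict_proba([[7.5]])[0, 1])
```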
Results & Findings
| Finding | Observation |
|---|---|
| Correlation (CodeHealth ↔ success rate) | Pearson r ≈ 0.62 (p < 0.001) – higher CodeHealth strongly predicts successful AI refactoring. |
| Success rate by CodeHealth quartile | Q1 (lowest) ≈ 38 % success, Q4 (highest) ≈ 84 % success. |
| Error types | Most failures stemmed from subtle logic changes (e.g., off‑by‑one errors) rather than syntax issues, and they clustered in low‑CodeHealth files. |
| Prompt robustness | The same prompt worked across the entire corpus, indicating that the observed effect is not prompt‑specific. |
What it means: Code that is already easy for humans to read and maintain is also easier for LLMs to manipulate without breaking functionality. Conversely, “messy” code raises the risk of AI‑induced bugs.
Practical Implications
- AI‑ready code reviews – Teams can integrate CodeHealth checks into CI pipelines to flag high‑risk modules before handing them to AI assistants (e.g., Copilot, Tabnine); a minimal gate is sketched after this list.
- Prioritized refactoring – Organizations can allocate human refactoring effort to low‑CodeHealth hotspots, thereby reducing the chance of costly AI‑generated regressions.
- Tooling enhancements – LLM‑based IDE plugins could surface an “AI‑risk score” derived from CodeHealth, guiding developers to accept or reject suggested edits.
- Onboarding new AI agents – When rolling out a new code‑generation model, companies can start with the “AI‑friendly” portion of their codebase, accelerating adoption while minimizing disruption.
- Cost savings – By preventing AI‑induced bugs early, firms can cut downstream debugging time, which is especially valuable in large monorepos where a single faulty refactor can cascade.
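
As one way to act on the first two bullets, a CI gate could score changed files and fail the build when any fall below a threshold, routing them to human review before AI-assisted edits are accepted. The threshold, the command-line file list, and the reuse of the code_health_proxy() stand-in from the methodology sketch are all assumptions; a real pipeline would call the actual CodeHealth tooling instead.

```python
# Minimal CI-gate sketch: flag low-scoring files for human review before
# AI-assisted changes are merged.
import sys
from pathlib import Path

# Assumes the scoring stand-in from the methodology sketch was saved
# as codehealth_proxy.py next to this script.
from codehealth_proxy import code_health_proxy

REVIEW_THRESHOLD = 6.0  # assumed cut-off on the 0-10 proxy scale


def main(paths: list[str]) -> int:
    risky = []
    for path in paths:
        score = code_health_proxy(Path(path).read_text())
        if score < REVIEW_THRESHOLD:
            risky.append((path, score))
    for path, score in risky:
        print(f"AI-risk flag: {path} (proxy score {score:.1f}), route to human review")
    return 1 if risky else 0  # a non-zero exit code fails the CI step


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```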
Limitations & Future Work
- Domain scope – The study focuses on algorithmic Python scripts; results may differ for large‑scale, object‑oriented systems or other languages.
- LLM version lock‑in – Only one LLM (GPT‑4‑style) was evaluated; newer or smaller models might exhibit different behavior.
- Static metric reliance – CodeHealth captures many maintainability aspects but omits dynamic factors (e.g., runtime performance) that could affect AI friendliness.
- Future directions – Extending the analysis to Java/TypeScript, testing with fine‑tuned domain‑specific LLMs, and exploring additional AI‑centric metrics (e.g., token predictability) are natural next steps.
Bottom line: Investing in clean, maintainable code isn’t just a human‑centric best practice—it also builds a safer foundation for the AI‑augmented development workflows that are rapidly becoming the norm. By measuring and improving CodeHealth today, teams can lower the risk of AI‑generated bugs tomorrow.
Authors
- Markus Borg
- Nadim Hagatulah
- Adam Tornhill
- Emma Söderberg
Paper Information
- arXiv ID: 2601.02200v1
- Categories: cs.SE, cs.AI
- Published: January 5, 2026