[Paper] PerfCoder: Large Language Models for Interpretable Code Performance Optimization
Source: arXiv - 2512.14018v1
Overview
PerfCoder is a new family of large language models (LLMs) that go beyond just writing code – they automatically rewrite existing code to run faster, while explaining why each change helps. By fine‑tuning on real‑world optimization examples and using runtime‑based reinforcement learning, PerfCoder learns to suggest concrete, human‑readable performance tricks that work for the specific input program.
Key Contributions
- Optimization‑aware LLM: Introduces a model that treats performance improvement as a first‑class objective, not an afterthought.
- Curated optimization trajectories: Builds a dataset of real code edits together with human‑written annotations describing each optimization step.
- Preference‑aligned RL fine‑tuning: Uses actual runtime measurements as a reward signal, teaching the model to prefer edits that yield measurable speedups.
- One‑shot code rewriting: Generates input‑specific, performance‑enhanced code in a single pass, eliminating costly iterative compilation loops.
- Interpretable feedback: Produces natural‑language explanations of the applied optimizations, enabling a “planner‑and‑optimizer” workflow with larger LLMs.
- State‑of‑the‑art results: Beats all prior models on the PIE benchmark in both raw speedup and the proportion of successful optimizations.
Methodology
- Data collection – The authors gathered a corpus of optimization trajectories: original source files, the transformed (faster) versions, and line‑by‑line human comments that describe each change (e.g., “replace std::vector with std::array to avoid heap allocation”); a minimal before/after sketch of this idiom follows this list.
- Supervised fine‑tuning – A base code‑generation LLM is first fine‑tuned on this corpus, teaching it the syntax of performance‑focused edits and the associated natural‑language rationale.
- Reinforcement learning (RL) alignment – For each candidate rewrite, the model’s output is compiled and timed on a set of benchmark inputs. The measured runtime reduction serves as a reward, guiding a policy‑gradient RL step that nudges the model toward edits that actually speed up execution (a sketch of such a reward signal also appears after this list).
- Inference pipeline – At test time, the model receives a source file and emits a single, optimized version plus a concise explanation of each modification. No iterative search or external compiler feedback loop is required.
- Planner‑optimizer cooperation – The generated explanations can be fed to a larger LLM (e.g., a 32B or GPT‑5 model) that acts as a high‑level planner, further refining the optimization strategy.
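To make the kind of annotated edit in the dataset concrete, here is a minimal before/after sketch of the std::vector → std::array idiom quoted above. The functions and sizes are illustrative, not taken from the paper’s corpus; the point is simply that a compile‑time‑sized buffer can live on the stack instead of the heap.

```cpp
#include <array>
#include <cstddef>
#include <cstdio>
#include <vector>

// Original: a fixed-size scratch buffer allocated on the heap on every call.
int histogram_max(const unsigned char* bytes, std::size_t n) {
    std::vector<int> counts(256, 0);        // heap allocation + zero-fill
    for (std::size_t i = 0; i < n; ++i) ++counts[bytes[i]];
    int best = 0;
    for (int c : counts) best = c > best ? c : best;
    return best;
}

// Optimized: the buffer size (256) is known at compile time, so a
// stack-allocated std::array removes the allocation/deallocation entirely.
int histogram_max_fast(const unsigned char* bytes, std::size_t n) {
    std::array<int, 256> counts{};          // zero-initialized on the stack
    for (std::size_t i = 0; i < n; ++i) ++counts[bytes[i]];
    int best = 0;
    for (int c : counts) best = c > best ? c : best;
    return best;
}

int main() {
    unsigned char data[] = {1, 2, 2, 3, 3, 3};
    std::printf("%d %d\n", histogram_max(data, 6), histogram_max_fast(data, 6));
    return 0;
}
```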
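The paper’s reward is derived from measured runtime reduction; its exact formulation is not reproduced here. The sketch below shows one plausible shaping of such a signal, where the function name, the correctness gate, and the clamping range are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstdio>

// Hypothetical runtime-based reward for RL alignment (illustrative only).
// A rewrite that fails to compile or breaks the tests earns no reward;
// otherwise the reward is the relative runtime reduction, clamped to [0, 1]
// so that slowdowns are not rewarded.
double runtime_reward(bool compiles, bool passes_tests,
                      double baseline_sec, double candidate_sec) {
    if (!compiles || !passes_tests) return 0.0;
    double reduction = (baseline_sec - candidate_sec) / baseline_sec;
    return std::clamp(reduction, 0.0, 1.0);
}

int main() {
    // e.g., a correct rewrite that runs in 1.2 s against a 1.6 s baseline
    std::printf("reward = %.3f\n", runtime_reward(true, true, 1.6, 1.2));
    return 0;
}
```

In practice such a reward would be averaged over the benchmark inputs on which each candidate is timed.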
Results & Findings
- Runtime speedup: On the PIE benchmark, PerfCoder achieved an average 23 % reduction in execution time, outpacing the previous best LLM by ~9 %.
- Effective optimization rate: 78 % of the test programs received at least one beneficial rewrite, compared to 61 % for the nearest competitor.
- Scalability vs. strategy awareness: Scaling the base model size alone (e.g., moving from 7B to 32B) did not close the gap; the strategy‑aware fine‑tuning was the dominant factor.
- Cooperative gains: When the interpretable feedback was used to guide a 32B planner model, overall speedups rose to 31 %, and a GPT‑5‑sized model saw a 28 % improvement over its vanilla version.
- Interpretability: Human evaluators rated the generated explanations as “clear and actionable” in 84 % of cases, confirming that the model’s suggestions are not black‑box transformations.
Practical Implications
- Developer productivity: Integrating PerfCoder into IDEs or CI pipelines could automatically suggest performance patches, saving engineers hours of manual profiling and tuning.
- Edge and embedded systems: Tight resource budgets make every microsecond count; PerfCoder can produce low‑overhead code without requiring developers to be performance experts.
- Automated code review: Teams can treat PerfCoder’s output as a supplemental reviewer that flags potential bottlenecks and offers concrete fixes.
- Model‑driven optimization services: Cloud providers could expose PerfCoder as an API, allowing users to upload code and receive an optimized version plus a human‑readable report.
- Educational tool: The natural‑language explanations double as teaching material, helping junior developers learn common performance idioms (e.g., loop unrolling, cache‑friendly data layouts); a small example of the cache‑friendly layout idiom follows this list.
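These idioms are standard performance lore rather than contributions of the paper. As one small illustration of the cache‑friendly data‑layout idiom mentioned above, traversing a row‑major matrix in row order keeps memory accesses contiguous:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Cache-unfriendly: strides down each column of a row-major matrix, so
// consecutive accesses land in different cache lines once n is large.
double sum_column_major(const std::vector<double>& m, std::size_t n) {
    double total = 0.0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            total += m[i * n + j];
    return total;
}

// Cache-friendly: visits elements in the order they are stored, so each
// fetched cache line is fully used before it is evicted.
double sum_row_major(const std::vector<double>& m, std::size_t n) {
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            total += m[i * n + j];
    return total;
}

int main() {
    std::size_t n = 1024;
    std::vector<double> m(n * n, 1.0);
    std::printf("%.0f %.0f\n", sum_column_major(m, n), sum_row_major(m, n));
    return 0;
}
```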
Limitations & Future Work
- Dataset bias: The optimization trajectories come mainly from C/C++ projects; performance patterns in other languages (Rust, Go, Python) may not transfer directly.
- Compilation overhead: Although inference is one‑shot, the RL training still requires compiling and timing many candidates, which limits rapid scaling to new hardware architectures.
- Safety & correctness: The current pipeline assumes the transformed code preserves functional semantics; subtle bugs could slip through if the compiler does not catch them.
- Generalization to large codebases: Experiments focused on relatively small benchmark programs; applying PerfCoder to multi‑module, build‑system‑driven projects remains an open challenge.
- Future directions: The authors plan to expand the dataset to cover more languages, incorporate static‑analysis safety checks, and explore hierarchical planning where a high‑level LLM proposes what to optimize and PerfCoder handles the how.
Authors
- Jiuding Yang
- Shengyao Lu
- Hongxuan Liu
- Shayan Shirahmad Gale Bagi
- Zahra Fazel
- Tomasz Czajkowski
- Di Niu
Paper Information
- arXiv ID: 2512.14018v1
- Categories: cs.SE, cs.AI
- Published: December 16, 2025