[Paper] PerfCoder: Large Language Models for Interpretable Code Performance Optimization
Source: arXiv - 2512.14018v1
Overview
PerfCoder is a new family of large language models (LLMs) that go beyond just writing code – they automatically rewrite existing code to run faster, while explaining why each change helps. By fine‑tuning on real‑world optimization examples and using runtime‑based reinforcement learning, PerfCoder learns to suggest concrete, human‑readable performance tricks that work for the specific input program.
Key Contributions
- Optimization‑aware LLM: Introduces a model that treats performance improvement as a first‑class objective, not an afterthought.
- Curated optimization trajectories: Builds a dataset of real code edits together with human‑written annotations describing each optimization step.
- Preference‑aligned RL fine‑tuning: Uses actual runtime measurements as a reward signal, teaching the model to prefer edits that yield measurable speedups.
- One‑shot code rewriting: Generates input‑specific, performance‑enhanced code in a single pass, eliminating costly iterative compilation loops.
- Interpretable feedback: Produces natural‑language explanations of the applied optimizations, enabling a “planner‑and‑optimizer” workflow with larger LLMs.
- State‑of‑the‑art results: Beats all prior models on the PIE benchmark in both raw speedup and the proportion of successful optimizations.
Methodology
- Data collection – The authors gathered a corpus of optimization trajectories: original source files, the transformed (faster) versions, and line‑by‑line human comments that describe each change (e.g., “replace std::vector with std::array to avoid heap allocation”); a minimal before/after sketch of this idiom follows this list.
- Supervised fine‑tuning – A base code‑generation LLM is first fine‑tuned on this corpus, teaching it the syntax of performance‑focused edits and the associated natural‑language rationale.
- Reinforcement learning (RL) alignment – For each candidate rewrite, the model’s output is compiled and timed on a set of benchmark inputs. The measured runtime reduction serves as a reward, guiding a policy‑gradient RL step that nudges the model toward edits that actually speed up execution (a sketch of such a reward signal also appears after this list).
- Inference pipeline – At test time, the model receives a source file and emits a single, optimized version plus a concise explanation of each modification. No iterative search or external compiler feedback loop is required.
- Planner‑optimizer cooperation – The generated explanations can be fed to a larger LLM (e.g., a 32B or GPT‑5 model) that acts as a high‑level planner, further refining the optimization strategy.
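To make the kind of annotated edit in the dataset concrete, here is a minimal before/after sketch of the std::vector → std::array idiom quoted above. The functions and sizes are illustrative, not taken from the paper’s corpus; the point is simply that a compile‑time‑sized buffer can live on the stack instead of the heap.

```cpp
#include <array>
#include <cstddef>
#include <cstdio>
#include <vector>

// Original: a fixed-size scratch buffer allocated on the heap on every call.
int histogram_max(const unsigned char* bytes, std::size_t n) {
    std::vector<int> counts(256, 0);        // heap allocation + zero-fill
    for (std::size_t i = 0; i < n; ++i) ++counts[bytes[i]];
    int best = 0;
    for (int c : counts) best = c > best ? c : best;
    return best;
}

// Optimized: the buffer size (256) is known at compile time, so a
// stack-allocated std::array removes the allocation/deallocation entirely.
int histogram_max_fast(const unsigned char* bytes, std::size_t n) {
    std::array<int, 256> counts{};          // zero-initialized on the stack
    for (std::size_t i = 0; i < n; ++i) ++counts[bytes[i]];
    int best = 0;
    for (int c : counts) best = c > best ? c : best;
    return best;
}

int main() {
    unsigned char data[] = {1, 2, 2, 3, 3, 3};
    std::printf("%d %d\n", histogram_max(data, 6), histogram_max_fast(data, 6));
    return 0;
}
```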
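The paper’s reward is derived from measured runtime reduction; its exact formulation is not reproduced here. The sketch below shows one plausible shaping of such a signal, where the function name, the correctness gate, and the clamping range are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstdio>

// Hypothetical runtime-based reward for RL alignment (illustrative only).
// A rewrite that fails to compile or breaks the tests earns no reward;
// otherwise the reward is the relative runtime reduction, clamped to [0, 1]
// so that slowdowns are not rewarded.
double runtime_reward(bool compiles, bool passes_tests,
                      double baseline_sec, double candidate_sec) {
    if (!compiles || !passes_tests) return 0.0;
    double reduction = (baseline_sec - candidate_sec) / baseline_sec;
    return std::clamp(reduction, 0.0, 1.0);
}

int main() {
    // e.g., a correct rewrite that runs in 1.2 s against a 1.6 s baseline
    std::printf("reward = %.3f\n", runtime_reward(true, true, 1.6, 1.2));
    return 0;
}
```

In practice such a reward would be averaged over the benchmark inputs on which each candidate is timed.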
Results & Findings
- Runtime speedup: On the PIE benchmark, PerfCoder achieved an average 23 % reduction in execution time, outpacing the previous best LLM by ~9 %.
- Effective optimization rate: 78 % of the test programs received at least one beneficial rewrite, compared to 61 % for the nearest competitor.
- Scalability vs. strategy awareness: Scaling the base model size alone (e.g., moving from 7B to 32B) did not close the gap; the strategy‑aware fine‑tuning was the dominant factor.
- Cooperative gains: When the interpretable feedback was used to guide a 32B planner model, overall speedups rose to 31 %, and a GPT‑5‑sized model saw a 28 % improvement over its vanilla version.
- Interpretability: Human evaluators rated the generated explanations as “clear and actionable” in 84 % of cases, confirming that the model’s suggestions are not black‑box transformations.
Practical Implications
- Developer productivity: Integrating PerfCoder into IDEs or CI pipelines could automatically suggest performance patches, saving engineers hours of manual profiling and tuning.
- Edge and embedded systems: Tight resource budgets make every microsecond count; PerfCoder can produce low‑overhead code without requiring developers to be performance experts.
- Automated code review: Teams can treat PerfCoder’s output as a supplemental reviewer that flags potential bottlenecks and offers concrete fixes.
- Model‑driven optimization services: Cloud providers could expose PerfCoder as an API, allowing users to upload code and receive an optimized version plus a human‑readable report.
- Educational tool: The natural‑language explanations double as teaching material, helping junior developers learn common performance idioms (e.g., loop unrolling, cache‑friendly data layouts); a small example of the cache‑friendly layout idiom follows this list.
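These idioms are standard performance lore rather than contributions of the paper. As one small illustration of the cache‑friendly data‑layout idiom mentioned above, traversing a row‑major matrix in row order keeps memory accesses contiguous:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Cache-unfriendly: strides down each column of a row-major matrix, so
// consecutive accesses land in different cache lines once n is large.
double sum_column_major(const std::vector<double>& m, std::size_t n) {
    double total = 0.0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            total += m[i * n + j];
    return total;
}

// Cache-friendly: visits elements in the order they are stored, so each
// fetched cache line is fully used before it is evicted.
double sum_row_major(const std::vector<double>& m, std::size_t n) {
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            total += m[i * n + j];
    return total;
}

int main() {
    std::size_t n = 1024;
    std::vector<double> m(n * n, 1.0);
    std::printf("%.0f %.0f\n", sum_column_major(m, n), sum_row_major(m, n));
    return 0;
}
```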
Limitations & Future Work
- Dataset bias: The optimization trajectories come mainly from C/C++ projects; performance patterns in other languages (Rust, Go, Python) may not transfer directly.
- Compilation overhead: Although inference is one‑shot, the RL training still requires compiling and timing many candidates, which limits rapid scaling to new hardware architectures.
- Safety & correctness: The current pipeline assumes the transformed code preserves functional semantics; subtle bugs could slip through if the compiler does not catch them.
- Generalization to large codebases: Experiments focused on relatively small benchmark programs; applying PerfCoder to multi‑module, build‑system‑driven projects remains an open challenge.
- Future directions: The authors plan to expand the dataset to cover more languages, incorporate static‑analysis safety checks, and explore hierarchical planning where a high‑level LLM proposes what to optimize and PerfCoder handles the how.
Authors
- Jiuding Yang
- Shengyao Lu
- Hongxuan Liu
- Shayan Shirahmad Gale Bagi
- Zahra Fazel
- Tomasz Czajkowski
- Di Niu
Paper Information
- arXiv ID: 2512.14018v1
- Categories: cs.SE, cs.AI
- Published: December 16, 2025