[Paper] On the Effectiveness of Training Data Optimization for LLM-based Code Generation: An Empirical Study

Published: December 30, 2025 at 09:30 PM EST
3 min read
Source: arXiv - 2512.24570v1

Overview

The paper presents the first large-scale empirical evaluation of training-data-optimization techniques for large language models (LLMs) that generate code. By systematically testing five popular data-curation methods and their pairwise combinations across multiple benchmarks and LLM families, the authors reveal which techniques actually boost functional correctness, reduce code smells, and improve maintainability.

Key Contributions

  • Comprehensive benchmark: Evaluated five widely‑used data‑optimization strategies (synthesis, refactoring, cleaning, selection, and augmentation) on three code‑generation benchmarks and four different LLMs.
  • Effectiveness hierarchy: Identified data synthesis as the single most powerful technique for functional correctness and code‑smell reduction, while refactoring, cleaning, and selection excel at maintainability.
  • Combination insights: Showed that most technique pairs do not further increase functional correctness, but many improve code quality metrics; the synthesis + refactoring combo yields the best overall performance.
  • Fine‑grained analysis: Provided deeper diagnostics (e.g., per‑language, per‑task) that explain why certain methods help or hurt specific aspects of generated code.
  • Practical guidance: Delivered a set of actionable recommendations for researchers and engineers building or fine‑tuning code‑generation LLMs.

Methodology

  1. Data‑Optimization Techniques

    • Data Synthesis – programmatically generate new code snippets (e.g., via rule‑based generators or smaller LLMs).
    • Data Refactoring – transform existing code to more idiomatic forms without changing semantics.
    • Data Cleaning – remove noisy, duplicated, or syntactically invalid samples.
    • Data Selection – filter the corpus by relevance or quality scores (e.g., test‑pass rate).
    • Data Augmentation – apply lightweight perturbations such as variable renaming or comment injection (a minimal renaming sketch appears after this list).
  2. Experimental Setup

    • LLMs: Four state‑of‑the‑art code models (e.g., CodeBERT‑large, StarCoder‑base, GPT‑3.5‑code, and a proprietary 7B model).
    • Benchmarks: HumanEval, MBPP, and a real‑world open‑source task suite covering Python, Java, and JavaScript.
    • Metrics: Functional correctness (pass@k), code‑smell detection (SonarQube), and maintainability index.
  3. Evaluation Procedure

    • Train each LLM on the baseline dataset, then on datasets processed by each single technique, and finally on pairwise combinations (10 combinations in total).
    • Repeat every run with 30 random seeds to control for randomness and report the mean ± 95% confidence interval.
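To make the augmentation bullet in the technique list concrete, here is a minimal sketch of what a variable-renaming perturbation could look like for Python training samples. The paper does not publish its augmentation code, so the `rename_variables` helper and the `var_N` naming scheme below are illustrative assumptions, not the authors' implementation.

```python
import ast


class _VariableRenamer(ast.NodeTransformer):
    """Rewrites variable identifiers according to a precomputed mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Rename both definitions (Store context) and uses (Load context).
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node


def rename_variables(source: str) -> str:
    """Return a semantically equivalent snippet with local variables renamed.

    Only names that appear as assignment targets are remapped, so function
    names, parameters, attributes, and imported modules are left untouched
    (unless they are also reassigned somewhere in the snippet).
    """
    tree = ast.parse(source)
    targets = {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
    }
    mapping = {name: f"var_{i}" for i, name in enumerate(sorted(targets))}
    renamed = _VariableRenamer(mapping).visit(tree)
    return ast.unparse(renamed)  # ast.unparse requires Python 3.9+


if __name__ == "__main__":
    snippet = "def add(a, b):\n    total = a + b\n    return total\n"
    print(rename_variables(snippet))
    # def add(a, b):
    #     var_0 = a + b
    #     return var_0
```

Because only names bound by plain assignments (and their uses) are rewritten, the perturbed snippet stays semantically equivalent, which is the property the augmentation technique relies on.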
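For the evaluation protocol above, functional correctness is reported as pass@k averaged over 30 seeds with a 95% confidence interval. The sketch below uses the standard unbiased pass@k estimator (introduced with the HumanEval benchmark) plus a plain normal-approximation interval; the per-seed numbers are made up and the aggregation details are an assumption, not a description of the authors' exact scripts.

```python
import math
import statistics


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations passes, given c passing ones.
    Equivalent to 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))


def mean_with_ci(values, z: float = 1.96):
    """Mean and half-width of a normal-approximation 95% CI."""
    mean = statistics.mean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half_width


# Hypothetical example: pass@1 for one model over 30 seeds,
# with 20 generations per task and `hits` of them passing the tests.
per_seed_pass1 = [pass_at_k(n=20, c=hits, k=1) for hits in [8, 7, 9, 8, 6] * 6]
mean, ci = mean_with_ci(per_seed_pass1)
print(f"pass@1 = {mean:.3f} +/- {ci:.3f} (95% CI over seeds)")
```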

Results & Findings

| Technique / Combination | Functional Correctness ↑ | Code Smells ↓ | Maintainability ↑ |
| --- | --- | --- | --- |
| Baseline | 38% (pass@1) | 22% smelly | 62% index |
| Data Synthesis | 45% (+7 pp) | 15% (-7 pp) | 58% (-4 pp) |
| Data Refactoring | 39% (+1 pp) | 20% (-2 pp) | 68% (+6 pp) |
| Data Cleaning | 40% (+2 pp) | 18% (-4 pp) | 65% (+3 pp) |
| Data Selection | 41% (+3 pp) | 19% (-3 pp) | 66% (+4 pp) |
| Synthesis + Refactoring | 44% (+6 pp) | 13% (-9 pp) | 70% (+8 pp) |
| Other combinations | ≈ baseline correctness | modest smell reduction | modest maintainability gain |

  • Synthesis shines for getting the right answer but can introduce less‑maintainable patterns.
  • Refactoring (and cleaning/selection) improve readability and long‑term maintainability but add little to raw correctness.
  • The synthesis + refactoring pair gives the best trade‑off: near‑optimal correctness while also delivering the cleanest, most maintainable code.
  • Adding a third technique rarely yields extra gains, suggesting diminishing returns after two‑way synergy.

Practical Implications

  • Fine‑tuning pipelines: Teams should prioritize synthetic data generation when the primary goal is to boost pass rates (e.g., coding assistants, unit‑test generation).
  • Enterprise codebases: For internal tools where maintainability matters, incorporate refactoring and cleaning steps before feeding data to the model.
  • Resource budgeting: Since most combos don’t improve correctness, allocate compute to synthesis + one quality‑focused technique rather than stacking many filters.
  • Tooling integration: Existing CI pipelines can automatically run refactoring tools (e.g., autopep8, google-java-format) on training corpora to reap maintainability gains with minimal overhead; a minimal sketch follows this list.
  • Model selection: Smaller, open‑source LLMs benefit disproportionately from high‑quality synthetic data, narrowing the performance gap with commercial APIs.
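As referenced in the tooling-integration bullet, the snippet below sketches one way a Python training corpus could be normalized with autopep8 before fine-tuning. The corpus path is hypothetical, and the same pattern would apply to google-java-format for Java files; this is an illustration, not part of the paper's pipeline.

```python
import pathlib
import subprocess


def format_corpus(corpus_dir: str) -> None:
    """Run autopep8 in place over every .py file in a training corpus.

    Assumes the `autopep8` CLI is installed and on PATH; swap in another
    formatter (e.g. google-java-format for *.java) as needed.
    """
    for path in pathlib.Path(corpus_dir).rglob("*.py"):
        result = subprocess.run(
            ["autopep8", "--in-place", str(path)],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            # Surface formatter failures without aborting the whole pass.
            print(f"skipped {path}: {result.stderr.strip()}")


if __name__ == "__main__":
    format_corpus("data/train")  # hypothetical corpus location
```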

Limitations & Future Work

  • Language coverage: Experiments focus on Python, Java, and JavaScript; results may differ for lower‑level languages (C/C++) or domain‑specific DSLs.
  • Synthetic data quality: The study uses rule‑based generators; exploring more advanced LLM‑driven synthesis could shift the effectiveness balance.
  • Scalability: Pairwise combinations were evaluated, but higher‑order interactions (triples, quadruples) remain unexplored due to combinatorial cost.
  • Human evaluation: While automated metrics capture many quality aspects, user studies on developer productivity and code review effort are needed to validate real‑world impact.

Bottom line: By dissecting how different data-curation techniques affect LLM code generation, this work equips practitioners with a clear, evidence-backed roadmap for building faster, cleaner, and more reliable AI-powered coding assistants.

Authors

  • Shiqi Kuang
  • Zhao Tian
  • Tao Xiao
  • Dong Wang
  • Junjie Chen

Paper Information

  • arXiv ID: 2512.24570v1
  • Categories: cs.SE
  • Published: December 31, 2025