[Paper] On the Effectiveness of Training Data Optimization for LLM-based Code Generation: An Empirical Study

Published: December 30, 2025 at 09:30 PM EST
3 min read
Source: arXiv - 2512.24570v1

Overview

The paper presents the first large-scale empirical evaluation of training-data-optimization techniques for large language models (LLMs) that generate code. By systematically testing five popular data-curation methods and their pairwise combinations across multiple benchmarks and LLM families, the authors reveal which techniques actually boost functional correctness, reduce code smells, and improve maintainability.

Key Contributions

  • Comprehensive benchmark: Evaluated five widely‑used data‑optimization strategies (synthesis, refactoring, cleaning, selection, and augmentation) on three code‑generation benchmarks and four different LLMs.
  • Effectiveness hierarchy: Identified data synthesis as the single most powerful technique for functional correctness and code‑smell reduction, while refactoring, cleaning, and selection excel at maintainability.
  • Combination insights: Showed that most technique pairs do not further increase functional correctness, but many improve code quality metrics; the synthesis + refactoring combo yields the best overall performance.
  • Fine‑grained analysis: Provided deeper diagnostics (e.g., per‑language, per‑task) that explain why certain methods help or hurt specific aspects of generated code.
  • Practical guidance: Delivered a set of actionable recommendations for researchers and engineers building or fine‑tuning code‑generation LLMs.

Methodology

  1. Data‑Optimization Techniques

    • Data Synthesis – programmatically generate new code snippets (e.g., via rule‑based generators or smaller LLMs).
    • Data Refactoring – transform existing code to more idiomatic forms without changing semantics.
    • Data Cleaning – remove noisy, duplicated, or syntactically invalid samples.
    • Data Selection – filter the corpus by relevance or quality scores (e.g., test‑pass rate).
    • Data Augmentation – apply lightweight perturbations such as variable renaming or comment injection (a minimal renaming sketch appears after this list).
  2. Experimental Setup

    • LLMs: Four state‑of‑the‑art code models (e.g., CodeBERT‑large, StarCoder‑base, GPT‑3.5‑code, and a proprietary 7B model).
    • Benchmarks: HumanEval, MBPP, and a real‑world open‑source task suite covering Python, Java, and JavaScript.
    • Metrics: Functional correctness (pass@k), code‑smell detection (SonarQube), and maintainability index.
  3. Evaluation Procedure

    • Train each LLM on the baseline dataset, then on datasets processed by each single technique, and finally on pairwise combinations (10 combinations in total).
    • Repeat every run with 30 random seeds to control for randomness and report the mean ± 95% confidence interval.
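To make the augmentation bullet in the technique list concrete, here is a minimal sketch of what a variable-renaming perturbation could look like for Python training samples. The paper does not publish its augmentation code, so the `rename_variables` helper and the `var_N` naming scheme below are illustrative assumptions, not the authors' implementation.

```python
import ast


class _VariableRenamer(ast.NodeTransformer):
    """Rewrites variable identifiers according to a precomputed mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Rename both definitions (Store context) and uses (Load context).
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node


def rename_variables(source: str) -> str:
    """Return a semantically equivalent snippet with local variables renamed.

    Only names that appear as assignment targets are remapped, so function
    names, parameters, attributes, and imported modules are left untouched
    (unless they are also reassigned somewhere in the snippet).
    """
    tree = ast.parse(source)
    targets = {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
    }
    mapping = {name: f"var_{i}" for i, name in enumerate(sorted(targets))}
    renamed = _VariableRenamer(mapping).visit(tree)
    return ast.unparse(renamed)  # ast.unparse requires Python 3.9+


if __name__ == "__main__":
    snippet = "def add(a, b):\n    total = a + b\n    return total\n"
    print(rename_variables(snippet))
    # def add(a, b):
    #     var_0 = a + b
    #     return var_0
```

Because only names bound by plain assignments (and their uses) are rewritten, the perturbed snippet stays semantically equivalent, which is the property the augmentation technique relies on.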
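For the evaluation protocol above, functional correctness is reported as pass@k averaged over 30 seeds with a 95% confidence interval. The sketch below uses the standard unbiased pass@k estimator (introduced with the HumanEval benchmark) plus a plain normal-approximation interval; the per-seed numbers are made up and the aggregation details are an assumption, not a description of the authors' exact scripts.

```python
import math
import statistics


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations passes, given c passing ones.
    Equivalent to 1 - C(n-c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))


def mean_with_ci(values, z: float = 1.96):
    """Mean and half-width of a normal-approximation 95% CI."""
    mean = statistics.mean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half_width


# Hypothetical example: pass@1 for one model over 30 seeds,
# with 20 generations per task and `hits` of them passing the tests.
per_seed_pass1 = [pass_at_k(n=20, c=hits, k=1) for hits in [8, 7, 9, 8, 6] * 6]
mean, ci = mean_with_ci(per_seed_pass1)
print(f"pass@1 = {mean:.3f} +/- {ci:.3f} (95% CI over seeds)")
```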

Results & Findings

| Technique / Combination | Functional Correctness ↑ | Code Smells ↓ | Maintainability ↑ |
| --- | --- | --- | --- |
| Baseline | 38% (pass@1) | 22% smelly | 62% index |
| Data Synthesis | 45% (+7 pp) | 15% (-7 pp) | 58% (-4 pp) |
| Data Refactoring | 39% (+1 pp) | 20% (-2 pp) | 68% (+6 pp) |
| Data Cleaning | 40% (+2 pp) | 18% (-4 pp) | 65% (+3 pp) |
| Data Selection | 41% (+3 pp) | 19% (-3 pp) | 66% (+4 pp) |
| Synthesis + Refactoring | 44% (+6 pp) | 13% (-9 pp) | 70% (+8 pp) |
| Other combinations | ≈ baseline correctness | modest smell reduction | modest maintainability gain |

  • Synthesis shines for getting the right answer but can introduce less‑maintainable patterns.
  • Refactoring (and cleaning/selection) improve readability and long‑term maintainability but add little to raw correctness.
  • The synthesis + refactoring pair gives the best trade‑off: near‑optimal correctness while also delivering the cleanest, most maintainable code.
  • Adding a third technique rarely yields extra gains, suggesting diminishing returns after two‑way synergy.

Practical Implications

  • Fine‑tuning pipelines: Teams should prioritize synthetic data generation when the primary goal is to boost pass rates (e.g., coding assistants, unit‑test generation).
  • Enterprise codebases: For internal tools where maintainability matters, incorporate refactoring and cleaning steps before feeding data to the model.
  • Resource budgeting: Since most combos don’t improve correctness, allocate compute to synthesis + one quality‑focused technique rather than stacking many filters.
  • Tooling integration: Existing CI pipelines can automatically run refactoring tools (e.g., autopep8, google-java-format) on training corpora to reap maintainability gains with minimal overhead; a minimal sketch follows this list.
  • Model selection: Smaller, open‑source LLMs benefit disproportionately from high‑quality synthetic data, narrowing the performance gap with commercial APIs.
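As referenced in the tooling-integration bullet, the snippet below sketches one way a Python training corpus could be normalized with autopep8 before fine-tuning. The corpus path is hypothetical, and the same pattern would apply to google-java-format for Java files; this is an illustration, not part of the paper's pipeline.

```python
import pathlib
import subprocess


def format_corpus(corpus_dir: str) -> None:
    """Run autopep8 in place over every .py file in a training corpus.

    Assumes the `autopep8` CLI is installed and on PATH; swap in another
    formatter (e.g. google-java-format for *.java) as needed.
    """
    for path in pathlib.Path(corpus_dir).rglob("*.py"):
        result = subprocess.run(
            ["autopep8", "--in-place", str(path)],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            # Surface formatter failures without aborting the whole pass.
            print(f"skipped {path}: {result.stderr.strip()}")


if __name__ == "__main__":
    format_corpus("data/train")  # hypothetical corpus location
```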

Limitations & Future Work

  • Language coverage: Experiments focus on Python, Java, and JavaScript; results may differ for lower‑level languages (C/C++) or domain‑specific DSLs.
  • Synthetic data quality: The study uses rule‑based generators; exploring more advanced LLM‑driven synthesis could shift the effectiveness balance.
  • Scalability: Pairwise combinations were evaluated, but higher‑order interactions (triples, quadruples) remain unexplored due to combinatorial cost.
  • Human evaluation: While automated metrics capture many quality aspects, user studies on developer productivity and code review effort are needed to validate real‑world impact.

Bottom line: By dissecting how different data-curation techniques affect LLM code generation, this work equips practitioners with a clear, evidence-backed roadmap for building faster, cleaner, and more reliable AI-powered coding assistants.

Authors

  • Shiqi Kuang
  • Zhao Tian
  • Tao Xiao
  • Dong Wang
  • Junjie Chen

Paper Information

  • arXiv ID: 2512.24570v1
  • Categories: cs.SE
  • Published: December 31, 2025