[Paper] Unseen-Codebases-Domain Data Synthesis and Training Based on Code Graphs

Published: February 24, 2026 at 06:36 AM EST
5 min read
Source: arXiv - 2602.20799v1

Overview

The paper tackles a practical pain point for developers adopting brand‑new software frameworks: large language models (LLMs) that power code‑completion or generation tools often stumble when they haven’t seen the framework’s APIs during pre‑training. The authors introduce UCD‑Training, a two‑stage pipeline that automatically synthesises “usage‑aware” training data from the raw source code of an unseen codebase, then fine‑tunes a code LLM so it can reason about API relationships and compose correct calls without hallucination.

Key Contributions

  • Code‑graph construction: Parses an unseen codebase into a rich, file‑level dependency graph that captures import relations, class hierarchies, and API signatures.
  • Dependency‑preserving continued pre‑training (CPT): Keeps the model’s knowledge of the codebase’s structural dependencies while adapting it to the new domain.
  • Graph‑grounded supervised fine‑tuning (SFT) with three novel synthetic data families:
    1. Single‑hop relation reasoning – questions that require the model to identify direct API relationships (e.g., “Which function returns a Tensor?”).
    2. Compositional API reasoning – multi‑step tasks that force the model to chain several calls correctly.
    3. Codebase utilization – realistic usage scenarios (e.g., “Write a data‑loader using the framework’s Dataset class”).
  • Explicit reasoning traces: Each synthetic example includes a step‑by‑step rationale, teaching the model to “think out loud” and reducing hallucinations.
  • UnseenCodeBench: A new benchmark that evaluates code generation on truly unseen frameworks, covering multiple domains (ML libraries, web frameworks, etc.).
  • Comprehensive empirical validation: Shows consistent gains over vanilla LLMs and retrieval‑augmented generation (RAG) baselines across 6 diverse codebases.

Methodology

  1. Parse & Build a Code Graph

    • The raw source tree is fed to a static analyzer that extracts modules, classes, functions, and import statements.
    • Nodes = code entities; edges = “depends‑on” (import), “inherits‑from”, or “calls‑within”.
    • The graph is stored in a lightweight JSON format that can be traversed during data synthesis.
  2. Stage 1 – Dependency‑Preserving Continued Pre‑Training (CPT)

    • Starting from a pre‑trained code LLM (e.g., CodeLlama‑7B), the model is further trained on file‑level snippets paired with their dependency context (e.g., “File A imports B; here is the content of A”).
    • This step aligns the model’s token embeddings with the new codebase’s naming conventions while retaining the global structure.
  3. Stage 2 – Graph‑Grounded Supervised Fine‑Tuning (SFT)

    • Using the code graph, three synthetic corpora are generated automatically:
      • Single‑hop – Randomly pick an edge (e.g., ClassX → methodY) and ask the model to retrieve the target given the source.
      • Compositional – Walk a path of length 2‑3 in the graph, then ask the model to produce a snippet that follows that path, inserting a reasoning trace that explains each hop.
      • Utilization – Sample a realistic “task” (e.g., “Load a CSV using the library’s DataLoader”) and synthesize a full solution, again with an explicit chain‑of‑thought.
    • The model is fine‑tuned on these examples with a standard cross‑entropy loss, encouraging it to emit both code and the accompanying rationale.
  4. Evaluation on UnseenCodeBench

    • Benchmarks consist of prompts that a developer might write when first encountering the framework.
    • Metrics: exact match, functional correctness (via unit tests), and hallucination rate (measured by a static analyzer that flags undefined symbols).
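As a concrete illustration of step 1, a minimal code-graph builder for a Python codebase can be sketched with the standard `ast` module. The node schema and edge labels below (`imports`, `inherits`) are illustrative assumptions; the paper does not specify its parser or its JSON layout.

```python
# Minimal sketch of code-graph construction (step 1), assuming a Python
# codebase. Nodes are modules/classes/functions; edges record imports and
# inheritance, mirroring the "depends-on" / "inherits-from" relations above.
import ast
import json

def build_code_graph(files: dict) -> dict:
    """Map {path: source} to a node/edge graph of modules, classes, functions."""
    nodes, edges = [], []
    for path, source in files.items():
        tree = ast.parse(source)
        nodes.append({"id": path, "kind": "module"})
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                for alias in node.names:
                    edges.append({"src": path, "dst": alias.name, "rel": "imports"})
            elif isinstance(node, ast.ImportFrom) and node.module:
                edges.append({"src": path, "dst": node.module, "rel": "imports"})
            elif isinstance(node, ast.ClassDef):
                cls = f"{path}::{node.name}"
                nodes.append({"id": cls, "kind": "class"})
                for base in node.bases:
                    if isinstance(base, ast.Name):
                        edges.append({"src": cls, "dst": base.id, "rel": "inherits"})
            elif isinstance(node, ast.FunctionDef):
                nodes.append({"id": f"{path}::{node.name}", "kind": "function"})
    return {"nodes": nodes, "edges": edges}

graph = build_code_graph({
    "loader.py": "import torch\n\nclass DataLoader(Dataset):\n    def load(self):\n        pass\n"
})
print(json.dumps(graph, indent=2))  # the lightweight JSON form mentioned above
```

Traversing the resulting `edges` list is all the later synthesis stages need.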
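The single-hop and compositional corpora of step 3 can likewise be sketched as simple traversals of such an edge list. The question templates and the first-continuation path policy below are assumptions for illustration; the paper only states that single-hop samples come from individual edges and compositional ones from 2–3-hop paths, each with a reasoning trace.

```python
# Illustrative sketch of graph-grounded data synthesis (step 3). Each edge
# is a dict {"src": ..., "dst": ..., "rel": ...} as produced by a code graph.

def single_hop_samples(edges):
    """One QA pair per graph edge: given source and relation, name the target."""
    return [
        {
            "question": f"Which entity does {e['src']} have a '{e['rel']}' edge to?",
            "answer": e["dst"],
            "trace": f"{e['src']} --{e['rel']}--> {e['dst']}",
        }
        for e in edges
    ]

def compositional_samples(edges, hops=2):
    """Extend each edge into a path of `hops` edges; the trace explains each hop."""
    by_src = {}
    for e in edges:
        by_src.setdefault(e["src"], []).append(e)
    samples = []
    for start in edges:
        path = [start]
        while len(path) < hops:
            nxt = by_src.get(path[-1]["dst"])
            if not nxt:
                break
            path.append(nxt[0])  # take the first continuation at each hop
        if len(path) == hops:
            walk = " -> ".join(f"{e['src']} ({e['rel']})" for e in path)
            samples.append({
                "task": f"Compose a call chain from {path[0]['src']} to {path[-1]['dst']}.",
                "trace": walk + f" -> {path[-1]['dst']}",
            })
    return samples

edges = [
    {"src": "DataLoader", "dst": "Dataset", "rel": "calls"},
    {"src": "Dataset", "dst": "read_csv", "rel": "calls"},
]
print(single_hop_samples(edges)[0]["answer"])    # Dataset
print(compositional_samples(edges)[0]["trace"])  # DataLoader (calls) -> Dataset (calls) -> read_csv
```

In the paper, such samples additionally carry a full chain-of-thought rationale rather than the one-line trace used here.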

Results & Findings

| Model | Exact‑Match ↑ | Pass@1 (unit tests) ↑ | Hallucination ↓ |
| --- | --- | --- | --- |
| Base CodeLlama‑7B | 21.4 % | 18.7 % | 34 % |
| + Retrieval‑Augmented (RAG) | 27.9 % | 24.3 % | 22 % |
| UCD‑Training (CPT+SFT) | 38.6 % | 35.1 % | 9 % |

  • CPT alone already cuts hallucinations by half, confirming that preserving dependency information matters.
  • SFT with reasoning traces yields the biggest jump in functional correctness, especially on compositional tasks where multi‑step reasoning is required.
  • Across six unseen frameworks (a deep‑learning library, a web‑router, a data‑validation toolkit, etc.), the average improvement over the base model is +17 pp exact‑match and ‑25 pp hallucination.
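The hallucination metric above is computed by a static analyzer that flags undefined symbols. For Python output, a minimal approximation of such a check can be written with the `ast` module; the paper's actual analyzer is not described in detail, so this is an illustrative sketch only (it ignores scoping subtleties like comprehension variables).

```python
# Approximate "hallucination" check: statically flag names a generated
# snippet uses that are neither defined in it, imported, nor builtins.
import ast
import builtins

def undefined_symbols(source: str) -> set:
    tree = ast.parse(source)
    defined = set(dir(builtins))
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defined.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
    return used - defined

# A fabricated API ("GhostLoader") is flagged; the real import is not.
print(undefined_symbols("import os\nx = GhostLoader(os.getcwd())"))  # prints {'GhostLoader'}
```

Dividing the number of flagged snippets by the total generated gives a hallucination rate comparable in spirit to the one reported.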

Practical Implications

  • Faster onboarding: Companies can feed their proprietary SDKs or internal libraries into UCD‑Training and instantly get a code LLM that understands the API surface, reducing the learning curve for new hires.
  • Reduced debugging time: The lower hallucination rate means developers spend less time hunting down “ghost” imports or nonsensical snippets generated by the model.
  • Plug‑and‑play for CI/CD: The pipeline is fully automated—once the source is checked into a repo, a CI job can rebuild the code graph, synthesize data, and fine‑tune the model nightly, keeping the assistant up‑to‑date with evolving APIs.
  • An alternative to RAG: Instead of relying on brittle retrieval of documentation snippets, the model internalises API relationships, offering smoother autocomplete and fewer latency spikes.
  • Open‑source potential: The authors release the graph‑construction scripts and the UnseenCodeBench benchmark, enabling the community to adapt the approach for niche domains like embedded firmware or scientific computing libraries.

Limitations & Future Work

  • Synthetic bias: The generated reasoning traces follow deterministic patterns; real developers may reason differently, potentially limiting generalisation to out‑of‑distribution prompts.
  • Scalability: Building and storing a full code graph for very large monorepos (hundreds of thousands of files) can be memory‑intensive; the current implementation caps at ~10 k files.
  • Evaluation breadth: While functional correctness is measured via unit tests, the benchmark does not yet cover performance‑critical code (e.g., GPU kernels) where subtle bugs matter.
  • Future directions suggested by the authors include:
    1. Incorporating dynamic execution traces (e.g., runtime call graphs) to enrich the synthetic data.
    2. Exploring multi‑modal inputs such as design docs or UML diagrams.
    3. Extending the framework to support incremental fine‑tuning as the codebase evolves.

Authors

  • Guangsheng Ou
  • Qiming Zhang
  • Sirong Chen
  • Anji Li
  • Dong Xu
  • Tiancheng Luo
  • Dekun Dai
  • Cuiyun Gao
  • Long Wang
  • Jun Zhou
  • Mingwei Liu
  • Zibin Zheng

Paper Information

  • arXiv ID: 2602.20799v1
  • Categories: cs.SE
  • Published: February 24, 2026