[Paper] Unseen-Codebases-Domain Data Synthesis and Training Based on Code Graphs
Source: arXiv - 2602.20799v1
Overview
The paper tackles a practical pain point for developers adopting brand‑new software frameworks: large language models (LLMs) that power code‑completion or generation tools often stumble when they haven’t seen the framework’s APIs during pre‑training. The authors introduce UCD‑Training, a two‑stage pipeline that automatically synthesises “usage‑aware” training data from the raw source code of an unseen codebase, then fine‑tunes a code LLM so it can reason about API relationships and compose correct calls without hallucination.
Key Contributions
- Code‑graph construction: Parses an unseen codebase into a rich, file‑level dependency graph that captures import relations, class hierarchies, and API signatures.
- Dependency‑preserving continued pre‑training (CPT): Keeps the model’s knowledge of the codebase’s structural dependencies while adapting it to the new domain.
- Graph‑grounded supervised fine‑tuning (SFT) with three novel synthetic data families:
  - Single‑hop relation reasoning – questions that require the model to identify direct API relationships (e.g., “Which function returns a `Tensor`?”).
  - Compositional API reasoning – multi‑step tasks that force the model to chain several calls correctly.
  - Codebase utilization – realistic usage scenarios (e.g., “Write a data‑loader using the framework’s `Dataset` class”).
- Explicit reasoning traces: Each synthetic example includes a step‑by‑step rationale, teaching the model to “think out loud” and reducing hallucinations.
- UnseenCodeBench: A new benchmark that evaluates code generation on truly unseen frameworks, covering multiple domains (ML libraries, web frameworks, etc.).
- Comprehensive empirical validation: Shows consistent gains over vanilla LLMs and retrieval‑augmented generation (RAG) baselines across 6 diverse codebases.
Methodology
1. Parse & Build a Code Graph
   - The raw source tree is fed to a static analyzer that extracts modules, classes, functions, and import statements.
   - Nodes = code entities; edges = “depends‑on” (import), “inherits‑from”, or “calls‑within”.
   - The graph is stored in a lightweight JSON format that can be traversed during data synthesis.
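The paper does not publish the extractor itself, but the described step maps naturally onto standard static analysis. The sketch below, using Python’s `ast` module, shows one plausible way to derive nodes and “depends‑on”/“inherits‑from” edges from a single file and serialize them as JSON; the node/edge field names are assumptions, not the authors’ schema.

```python
# Illustrative sketch of code-graph extraction for one file.
# The node/edge schema here is an assumption, not the paper's format.
import ast
import json

def build_code_graph(filename: str, source: str) -> dict:
    """Extract entities plus 'depends-on' and 'inherits-from' edges."""
    tree = ast.parse(source)
    nodes, edges = [], []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                target = (alias.name if isinstance(node, ast.Import)
                          else f"{node.module}.{alias.name}")
                edges.append({"src": filename, "dst": target, "kind": "depends-on"})
        elif isinstance(node, ast.ClassDef):
            nodes.append({"id": node.name, "kind": "class"})
            for base in node.bases:
                if isinstance(base, ast.Name):
                    edges.append({"src": node.name, "dst": base.id,
                                  "kind": "inherits-from"})
        elif isinstance(node, ast.FunctionDef):
            nodes.append({"id": node.name, "kind": "function"})
    return {"nodes": nodes, "edges": edges}

source = """
import torch
class MyLoader(Dataset):
    def load(self):
        pass
"""
graph = build_code_graph("my_loader.py", source)
print(json.dumps(graph, indent=2))
```

Because `ast.parse` never executes the code, this works even when the codebase’s dependencies are not installed, which matters for unseen proprietary frameworks.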
2. Stage 1 – Dependency‑Preserving Continued Pre‑Training (CPT)
   - Starting from a pre‑trained code LLM (e.g., CodeLlama‑7B), the model is further trained on file‑level snippets paired with their dependency context (e.g., “File A imports B; here is the content of A”).
   - This step aligns the model’s token embeddings with the new codebase’s naming conventions while retaining the global structure.
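To make the “snippet paired with dependency context” idea concrete, here is a minimal sketch of how such a CPT sample could be assembled from graph edges. The prompt layout is illustrative only; the paper does not specify its exact serialization.

```python
# Hypothetical CPT sample builder: prefix a file's content with a
# comment header describing its dependencies from the code graph.
# The header format is an assumption, not the authors' exact template.
def make_cpt_sample(filename: str, content: str, deps: list[str]) -> str:
    header = "\n".join(f"# {filename} imports {d}" for d in deps)
    return f"{header}\n{content}"

sample = make_cpt_sample(
    "loader.py",
    "class CsvLoader(Dataset):\n    ...",
    deps=["framework.data.Dataset"],
)
print(sample)
```

The key design point is that the dependency header travels with every snippet, so the model repeatedly sees *which* names come from *where* instead of memorizing identifiers in isolation.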
3. Stage 2 – Graph‑Grounded Supervised Fine‑Tuning (SFT)
   - Using the code graph, three synthetic corpora are generated automatically:
     - Single‑hop – randomly pick an edge (e.g., `ClassX → methodY`) and ask the model to retrieve the target given the source.
     - Compositional – walk a path of length 2‑3 in the graph, then ask the model to produce a snippet that follows that path, inserting a reasoning trace that explains each hop.
     - Utilization – sample a realistic “task” (e.g., “Load a CSV using the library’s `DataLoader`”) and synthesize a full solution, again with an explicit chain‑of‑thought.
   - The model is fine‑tuned on these examples with a standard cross‑entropy loss, encouraging it to emit both code and the accompanying rationale.
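The single‑hop family is the simplest to picture: each graph edge becomes a question/answer pair with a one‑step rationale. The sketch below shows one way this templating could work; the edge data, question templates, and output fields are all hypothetical.

```python
# Hypothetical single-hop SFT data synthesis: turn a code-graph edge
# into a (question, reasoning trace, answer) triple. Edge contents and
# templates are illustrative assumptions, not the paper's corpus.
import random

EDGES = [
    {"src": "CsvLoader", "dst": "read_rows", "kind": "calls-within"},
    {"src": "CsvLoader", "dst": "Dataset", "kind": "inherits-from"},
]

TEMPLATES = {
    "calls-within": "Which method does `{src}` call internally?",
    "inherits-from": "Which class does `{src}` inherit from?",
}

def synthesize_single_hop(edge: dict) -> dict:
    question = TEMPLATES[edge["kind"]].format(src=edge["src"])
    # The explicit trace is what teaches the model to "think out loud".
    trace = f"`{edge['src']}` has a '{edge['kind']}' edge to `{edge['dst']}`."
    return {"question": question, "reasoning": trace, "answer": edge["dst"]}

example = synthesize_single_hop(random.choice(EDGES))
print(example)
```

Compositional examples would extend the same idea by concatenating the traces along a 2‑3 edge path before asking for the final snippet.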
4. Evaluation on UnseenCodeBench
   - Benchmarks consist of prompts that a developer might write when first encountering the framework.
   - Metrics: exact match, functional correctness (via unit tests), and hallucination rate (measured by a static analyzer that flags undefined symbols).
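The hallucination metric can be approximated with ordinary static analysis: parse the generated snippet and report names that are loaded but never defined, imported, or built in. The checker below is a minimal sketch in that spirit, not the benchmark’s actual analyzer; the symbol categories it tracks are an assumption.

```python
# Minimal undefined-symbol checker, illustrating (not reproducing) the
# paper's hallucination metric. Coverage of binding forms is partial.
import ast
import builtins

def undefined_symbols(source: str) -> set:
    tree = ast.parse(source)
    defined = set(dir(builtins))
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            defined.update(a.asname or a.name.split(".")[0] for a in node.names)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            defined.add(node.id)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)
    used = {n.id for n in ast.walk(tree)
            if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)}
    return used - defined

# `GhostLoader` is never imported or defined: a "ghost" symbol the
# checker should flag.
generated = (
    "import csv\n"
    "rows = list(csv.reader(open('f.csv')))\n"
    "loader = GhostLoader(rows)"
)
print(undefined_symbols(generated))
```

A real implementation would also resolve attribute chains against the code graph (e.g., verify that `framework.DataLoader` actually exists), which is where the graph pays off over a flat symbol table.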
Results & Findings
| Model | Exact‑Match ↑ | Pass@1 (unit tests) ↑ | Hallucination ↓ |
|---|---|---|---|
| Base CodeLlama‑7B | 21.4 % | 18.7 % | 34 % |
| + Retrieval‑Augmented (RAG) | 27.9 % | 24.3 % | 22 % |
| UCD‑Training (CPT+SFT) | 38.6 % | 35.1 % | 9 % |
- CPT alone already cuts hallucinations by half, confirming that preserving dependency information matters.
- SFT with reasoning traces yields the biggest jump in functional correctness, especially on compositional tasks where multi‑step reasoning is required.
- Across six unseen frameworks (a deep‑learning library, a web‑router, a data‑validation toolkit, etc.), the average improvement over the base model is +17 pp exact‑match and ‑25 pp hallucination.
Practical Implications
- Faster onboarding: Companies can feed their proprietary SDKs or internal libraries into UCD‑Training and instantly get a code LLM that understands the API surface, reducing the learning curve for new hires.
- Reduced debugging time: The lower hallucination rate means developers spend less time hunting down “ghost” imports or nonsensical snippets generated by the model.
- Plug‑and‑play for CI/CD: The pipeline is fully automated—once the source is checked into a repo, a CI job can rebuild the code graph, synthesize data, and fine‑tune the model nightly, keeping the assistant up‑to‑date with evolving APIs.
- Better RAG alternatives: Instead of relying on brittle retrieval of documentation snippets, the model internalises the relationships, offering smoother autocomplete and fewer latency spikes.
- Open‑source potential: The authors release the graph‑construction scripts and the UnseenCodeBench benchmark, enabling the community to adapt the approach for niche domains like embedded firmware or scientific computing libraries.
Limitations & Future Work
- Synthetic bias: The generated reasoning traces follow deterministic patterns; real developers may reason differently, potentially limiting generalisation to out‑of‑distribution prompts.
- Scalability: Building and storing a full code graph for very large monorepos (hundreds of thousands of files) can be memory‑intensive; the current implementation caps at ~10 k files.
- Evaluation breadth: While functional correctness is measured via unit tests, the benchmark does not yet cover performance‑critical code (e.g., GPU kernels) where subtle bugs matter.
- Future directions suggested by the authors include:
- Incorporating dynamic execution traces (e.g., runtime call graphs) to enrich the synthetic data.
- Exploring multi‑modal inputs such as design docs or UML diagrams.
- Extending the framework to support incremental fine‑tuning as the codebase evolves.
Authors
- Guangsheng Ou
- Qiming Zhang
- Sirong Chen
- Anji Li
- Dong Xu
- Tiancheng Luo
- Dekun Dai
- Cuiyun Gao
- Long Wang
- Jun Zhou
- Mingwei Liu
- Zibin Zheng
Paper Information
- arXiv ID: 2602.20799v1
- Categories: cs.SE
- Published: February 24, 2026