[Paper] Unseen-Codebases-Domain Data Synthesis and Training Based on Code Graphs
Source: arXiv - 2602.20799v1
Overview
The paper tackles a practical pain point for developers adopting brand‑new software frameworks: large language models (LLMs) that power code‑completion or generation tools often stumble when they haven’t seen the framework’s APIs during pre‑training. The authors introduce UCD‑Training, a two‑stage pipeline that automatically synthesises “usage‑aware” training data from the raw source code of an unseen codebase, then fine‑tunes a code LLM so it can reason about API relationships and compose correct calls without hallucination.
Key Contributions
- Code‑graph construction: Parses an unseen codebase into a rich, file‑level dependency graph that captures import relations, class hierarchies, and API signatures.
- Dependency‑preserving continued pre‑training (CPT): Keeps the model’s knowledge of the codebase’s structural dependencies while adapting it to the new domain.
- Graph‑grounded supervised fine‑tuning (SFT) with three novel synthetic data families:
  - Single‑hop relation reasoning – questions that require the model to identify direct API relationships (e.g., “Which function returns a `Tensor`?”).
  - Compositional API reasoning – multi‑step tasks that force the model to chain several calls correctly.
  - Codebase utilization – realistic usage scenarios (e.g., “Write a data‑loader using the framework’s `Dataset` class”).
- Explicit reasoning traces: Each synthetic example includes a step‑by‑step rationale, teaching the model to “think out loud” and reducing hallucinations.
- UnseenCodeBench: A new benchmark that evaluates code generation on truly unseen frameworks, covering multiple domains (ML libraries, web frameworks, etc.).
- Comprehensive empirical validation: Shows consistent gains over vanilla LLMs and retrieval‑augmented generation (RAG) baselines across 6 diverse codebases.
Methodology
1. Parse & Build a Code Graph
   - The raw source tree is fed to a static analyzer that extracts modules, classes, functions, and import statements.
   - Nodes = code entities; edges = “depends‑on” (import), “inherits‑from”, or “calls‑within”.
   - The graph is stored in a lightweight JSON format that can be traversed during data synthesis.
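The paper does not publish the extractor itself, but the described step maps naturally onto standard static analysis. The sketch below, using Python’s `ast` module, shows one plausible way to derive nodes and “depends‑on”/“inherits‑from” edges from a single file and serialize them as JSON; the node/edge field names are assumptions, not the authors’ schema.

```python
# Illustrative sketch of code-graph extraction for one file.
# The node/edge schema here is an assumption, not the paper's format.
import ast
import json

def build_code_graph(filename: str, source: str) -> dict:
    """Extract entities plus 'depends-on' and 'inherits-from' edges."""
    tree = ast.parse(source)
    nodes, edges = [], []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                target = (alias.name if isinstance(node, ast.Import)
                          else f"{node.module}.{alias.name}")
                edges.append({"src": filename, "dst": target, "kind": "depends-on"})
        elif isinstance(node, ast.ClassDef):
            nodes.append({"id": node.name, "kind": "class"})
            for base in node.bases:
                if isinstance(base, ast.Name):
                    edges.append({"src": node.name, "dst": base.id,
                                  "kind": "inherits-from"})
        elif isinstance(node, ast.FunctionDef):
            nodes.append({"id": node.name, "kind": "function"})
    return {"nodes": nodes, "edges": edges}

source = """
import torch
class MyLoader(Dataset):
    def load(self):
        pass
"""
graph = build_code_graph("my_loader.py", source)
print(json.dumps(graph, indent=2))
```

Because `ast.parse` never executes the code, this works even when the codebase’s dependencies are not installed, which matters for unseen proprietary frameworks.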
2. Stage 1 – Dependency‑Preserving Continued Pre‑Training (CPT)
   - Starting from a pre‑trained code LLM (e.g., CodeLlama‑7B), the model is further trained on file‑level snippets paired with their dependency context (e.g., “File A imports B; here is the content of A”).
   - This step aligns the model’s token embeddings with the new codebase’s naming conventions while retaining the global structure.
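To make the “snippet paired with dependency context” idea concrete, here is a minimal sketch of how such a CPT sample could be assembled from graph edges. The prompt layout is illustrative only; the paper does not specify its exact serialization.

```python
# Hypothetical CPT sample builder: prefix a file's content with a
# comment header describing its dependencies from the code graph.
# The header format is an assumption, not the authors' exact template.
def make_cpt_sample(filename: str, content: str, deps: list[str]) -> str:
    header = "\n".join(f"# {filename} imports {d}" for d in deps)
    return f"{header}\n{content}"

sample = make_cpt_sample(
    "loader.py",
    "class CsvLoader(Dataset):\n    ...",
    deps=["framework.data.Dataset"],
)
print(sample)
```

The key design point is that the dependency header travels with every snippet, so the model repeatedly sees *which* names come from *where* instead of memorizing identifiers in isolation.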
3. Stage 2 – Graph‑Grounded Supervised Fine‑Tuning (SFT)
   - Using the code graph, three synthetic corpora are generated automatically:
     - Single‑hop – randomly pick an edge (e.g., `ClassX → methodY`) and ask the model to retrieve the target given the source.
     - Compositional – walk a path of length 2‑3 in the graph, then ask the model to produce a snippet that follows that path, inserting a reasoning trace that explains each hop.
     - Utilization – sample a realistic “task” (e.g., “Load a CSV using the library’s `DataLoader`”) and synthesize a full solution, again with an explicit chain‑of‑thought.
   - The model is fine‑tuned on these examples with a standard cross‑entropy loss, encouraging it to emit both code and the accompanying rationale.
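The single‑hop family is the simplest to picture: each graph edge becomes a question/answer pair with a one‑step rationale. The sketch below shows one way this templating could work; the edge data, question templates, and output fields are all hypothetical.

```python
# Hypothetical single-hop SFT data synthesis: turn a code-graph edge
# into a (question, reasoning trace, answer) triple. Edge contents and
# templates are illustrative assumptions, not the paper's corpus.
import random

EDGES = [
    {"src": "CsvLoader", "dst": "read_rows", "kind": "calls-within"},
    {"src": "CsvLoader", "dst": "Dataset", "kind": "inherits-from"},
]

TEMPLATES = {
    "calls-within": "Which method does `{src}` call internally?",
    "inherits-from": "Which class does `{src}` inherit from?",
}

def synthesize_single_hop(edge: dict) -> dict:
    question = TEMPLATES[edge["kind"]].format(src=edge["src"])
    # The explicit trace is what teaches the model to "think out loud".
    trace = f"`{edge['src']}` has a '{edge['kind']}' edge to `{edge['dst']}`."
    return {"question": question, "reasoning": trace, "answer": edge["dst"]}

example = synthesize_single_hop(random.choice(EDGES))
print(example)
```

Compositional examples would extend the same idea by concatenating the traces along a 2‑3 edge path before asking for the final snippet.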
4. Evaluation on UnseenCodeBench
   - Benchmarks consist of prompts that a developer might write when first encountering the framework.
   - Metrics: exact match, functional correctness (via unit tests), and hallucination rate (measured by a static analyzer that flags undefined symbols).
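The hallucination metric can be approximated with ordinary static analysis: parse the generated snippet and report names that are loaded but never defined, imported, or built in. The checker below is a minimal sketch in that spirit, not the benchmark’s actual analyzer; the symbol categories it tracks are an assumption.

```python
# Minimal undefined-symbol checker, illustrating (not reproducing) the
# paper's hallucination metric. Coverage of binding forms is partial.
import ast
import builtins

def undefined_symbols(source: str) -> set:
    tree = ast.parse(source)
    defined = set(dir(builtins))
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            defined.update(a.asname or a.name.split(".")[0] for a in node.names)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            defined.add(node.id)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)
    used = {n.id for n in ast.walk(tree)
            if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)}
    return used - defined

# `GhostLoader` is never imported or defined: a "ghost" symbol the
# checker should flag.
generated = (
    "import csv\n"
    "rows = list(csv.reader(open('f.csv')))\n"
    "loader = GhostLoader(rows)"
)
print(undefined_symbols(generated))
```

A real implementation would also resolve attribute chains against the code graph (e.g., verify that `framework.DataLoader` actually exists), which is where the graph pays off over a flat symbol table.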
Results & Findings
| Model | Exact‑Match ↑ | Pass@1 (unit tests) ↑ | Hallucination ↓ |
|---|---|---|---|
| Base CodeLlama‑7B | 21.4 % | 18.7 % | 34 % |
| + Retrieval‑Augmented (RAG) | 27.9 % | 24.3 % | 22 % |
| UCD‑Training (CPT+SFT) | 38.6 % | 35.1 % | 9 % |
- CPT alone already cuts hallucinations by half, confirming that preserving dependency information matters.
- SFT with reasoning traces yields the biggest jump in functional correctness, especially on compositional tasks where multi‑step reasoning is required.
- Across six unseen frameworks (a deep‑learning library, a web‑router, a data‑validation toolkit, etc.), the average improvement over the base model is +17 pp exact‑match and ‑25 pp hallucination.
Practical Implications
- Faster onboarding: Companies can feed their proprietary SDKs or internal libraries into UCD‑Training and instantly get a code LLM that understands the API surface, reducing the learning curve for new hires.
- Reduced debugging time: The lower hallucination rate means developers spend less time hunting down “ghost” imports or nonsensical snippets generated by the model.
- Plug‑and‑play for CI/CD: The pipeline is fully automated—once the source is checked into a repo, a CI job can rebuild the code graph, synthesize data, and fine‑tune the model nightly, keeping the assistant up‑to‑date with evolving APIs.
- Better RAG alternatives: Instead of relying on brittle retrieval of documentation snippets, the model internalises the relationships, offering smoother autocomplete and fewer latency spikes.
- Open‑source potential: The authors release the graph‑construction scripts and the UnseenCodeBench benchmark, enabling the community to adapt the approach for niche domains like embedded firmware or scientific computing libraries.
Limitations & Future Work
- Synthetic bias: The generated reasoning traces follow deterministic patterns; real developers may reason differently, potentially limiting generalisation to out‑of‑distribution prompts.
- Scalability: Building and storing a full code graph for very large monorepos (hundreds of thousands of files) can be memory‑intensive; the current implementation caps at ~10 k files.
- Evaluation breadth: While functional correctness is measured via unit tests, the benchmark does not yet cover performance‑critical code (e.g., GPU kernels) where subtle bugs matter.
- Future directions suggested by the authors include:
- Incorporating dynamic execution traces (e.g., runtime call graphs) to enrich the synthetic data.
- Exploring multi‑modal inputs such as design docs or UML diagrams.
- Extending the framework to support incremental fine‑tuning as the codebase evolves.
Authors
- Guangsheng Ou
- Qiming Zhang
- Sirong Chen
- Anji Li
- Dong Xu
- Tiancheng Luo
- Dekun Dai
- Cuiyun Gao
- Long Wang
- Jun Zhou
- Mingwei Liu
- Zibin Zheng
Paper Information
- arXiv ID: 2602.20799v1
- Categories: cs.SE
- Published: February 24, 2026