[Paper] Automated Customization of LLMs for Enterprise Code Repositories Using Semantic Scopes

Published: February 5, 2026 at 10:38 AM EST
4 min read
Source: arXiv - 2602.05780v1

Overview

The paper presents a practical pipeline for automatically tailoring large language models (LLMs) to a company’s own codebase. By extracting semantic scopes—logical groupings of related code artifacts—and feeding them into either Retrieval‑Augmented Generation (RAG) or supervised fine‑tuning (FT), the authors show that even modest‑sized models can outperform much larger, generic LLMs on code‑completion tasks inside private repositories.

Key Contributions

  • Semantic‑Scope Ingestion: A systematic method for parsing a repository into meaningful “scopes” (e.g., modules, APIs, domain‑specific patterns) that serve as the backbone of the training data.
  • Dual Customization Strategies: Implementation and comparison of two widely used adaptation techniques—RAG (index‑based retrieval + generation) and supervised fine‑tuning—applied to the same scoped data.
  • Enterprise‑Scale Evaluation: Real‑world experiments on two large, private corporate codebases, demonstrating measurable productivity gains for developers.
  • Benchmark Cross‑Check: Validation on public code‑completion benchmarks (e.g., HumanEval, MBPP) to confirm that the approach does not overfit to a single codebase.
  • Open‑Source Toolkit: Release of the ingestion pipeline and data‑pair generation scripts, enabling other teams to replicate the workflow with minimal effort.

Methodology

  1. Repository Parsing – The tool walks through the entire code tree, extracting syntactic entities (functions, classes, interfaces) and grouping them by semantic scope: a scope can be a package, a microservice, or any logical boundary defined by import graphs and naming conventions.
  2. Training Pair Generation – For each scope, the system creates prompt–completion pairs. The prompt mimics a developer’s partial code snippet (e.g., a function signature or a comment), while the completion is the next logical block of code drawn from the same scope.
  3. Customization Paths
    • RAG: The scoped snippets are indexed with a dense vector store (e.g., FAISS). At inference time, the model first retrieves the most relevant scope documents and then generates the completion conditioned on both the user prompt and the retrieved context.
    • Fine‑Tuning (FT): The same prompt–completion pairs are used to further train a base LLM (e.g., LLaMA‑7B) for a few epochs, allowing the model to internalize repository‑specific idioms.
  4. Evaluation – The authors run automated code‑completion tests (exact match, BLEU, functional correctness) on held‑out files from the private repos and on public benchmark suites. Human developers also performed a short usability study to gauge perceived usefulness.
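Step 2 of the pipeline can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' released tool: it uses Python's `ast` module to split each top-level function into a prompt (signature plus docstring) and a completion (the remaining body), mimicking a developer's partial snippet. All names here (`make_pairs`, the `add` sample) are illustrative.

```python
import ast
import textwrap

def make_pairs(source: str) -> list[tuple[str, str]]:
    """Split each top-level function into a (prompt, completion) pair:
    prompt = signature + docstring, completion = the rest of the body."""
    pairs = []
    tree = ast.parse(source)
    lines = source.splitlines()
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            body = node.body
            start = body[0].lineno
            # If the first statement is a docstring, the "real" body starts after it.
            if (isinstance(body[0], ast.Expr)
                    and isinstance(body[0].value, ast.Constant)
                    and isinstance(body[0].value.value, str)
                    and len(body) > 1):
                start = body[1].lineno
            prompt = "\n".join(lines[node.lineno - 1:start - 1])
            completion = "\n".join(lines[start - 1:node.end_lineno])
            pairs.append((prompt, completion))
    return pairs

sample = textwrap.dedent('''
    def add(a, b):
        """Return the sum of a and b."""
        return a + b
''')
pairs = make_pairs(sample)
```

A production version would additionally truncate prompts at arbitrary cut points and tag each pair with its semantic scope, but the prompt/completion split above is the core idea.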
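The RAG path (step 3) boils down to nearest-neighbor search over embedded scope snippets. The sketch below is a toy stand-in under two stated assumptions: a hashing bag-of-words embedder replaces a real sentence encoder, and a NumPy inner-product scan replaces FAISS; `ScopeIndex` and the example snippets are invented for illustration.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy hashing embedder; a real pipeline would use a code-aware encoder."""
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    n = np.linalg.norm(vec)
    return vec / n if n else vec

class ScopeIndex:
    """Minimal dense index over scope snippets (inner-product search,
    conceptually what FAISS does at scale)."""
    def __init__(self):
        self.docs, self.vecs = [], []

    def add(self, snippet: str):
        self.docs.append(snippet)
        self.vecs.append(embed(snippet))

    def search(self, query: str, k: int = 1) -> list[str]:
        sims = np.array(self.vecs) @ embed(query)  # cosine, since vecs are unit-norm
        top = np.argsort(-sims)[:k]
        return [self.docs[i] for i in top]

index = ScopeIndex()
index.add("billing scope: parse invoice records into line items")
index.add("dashboard scope: render chart data as SVG panels")
hits = index.search("parse invoice records", k=1)
```

At inference time the retrieved snippet(s) would be prepended to the user's prompt before generation, which is what lets a 7 B model condition on repository-specific context it was never trained on.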

Results & Findings

| Model / Strategy | Params | Private Repo CC Score ↑ | Public Bench Score ↑ | Relative Gain vs. Base LLM |
|---|---|---|---|---|
| Base LLM (7B) | 7 B | 42 % | 68 % | – |
| RAG (7B) | 7 B | 58 % (+38 %) | 70 % (+2 %) | Beats 13 B generic LLM |
| FT (7B) | 7 B | 61 % (+45 %) | 71 % (+3 %) | Beats 13 B generic LLM |
| Base LLM (13B) | 13 B | 48 % | 71 % | – |
| FT (13B) | 13 B | 63 % (+31 %) | 73 % (+2 %) | Slight edge over 7B FT |

  • Productivity boost: In a developer survey, participants reported a 23 % reduction in time spent fixing auto‑generated snippets when using the customized models.
  • Model size vs. customization: A 7 B model fine‑tuned with scoped data outperformed an uncustomized 13 B model, highlighting the cost‑effectiveness of the approach.
  • Generalization: Performance on public benchmarks improved modestly, indicating that the scoped fine‑tuning does not catastrophically overfit to the private code.

Practical Implications

  • Faster onboarding: New hires can rely on a model that already “knows” the company’s coding conventions, reducing the learning curve.
  • Lower infrastructure cost: Teams can achieve high‑quality completions with mid‑size models, avoiding the expense of running massive LLMs in production.
  • Security & compliance: Since the customization happens on‑premise and the model never sends proprietary code to external APIs, organizations maintain data confidentiality.
  • Plug‑and‑play integration: The RAG pipeline can be wrapped around existing IDE extensions (e.g., VS Code, JetBrains) with minimal latency (≈150 ms retrieval + generation).
  • Continuous improvement: As the repository evolves, the ingestion pipeline can be rerun nightly, keeping the model up‑to‑date without full retraining.
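The nightly re-ingestion idea in the last bullet only pays off if unchanged scopes are skipped. One simple way to get that, sketched below with invented names (`plan_refresh`, the toy repo dicts), is to keep a content-hash manifest between runs and re-embed only the scopes whose hash changed:

```python
import hashlib

def plan_refresh(snapshot: dict[str, str],
                 previous: dict[str, str]) -> tuple[list[str], dict[str, str]]:
    """Return the scopes whose content changed since the last run,
    plus the new hash manifest to persist for the next run."""
    manifest = {scope: hashlib.sha256(src.encode()).hexdigest()
                for scope, src in snapshot.items()}
    stale = [scope for scope, digest in manifest.items()
             if previous.get(scope) != digest]
    return stale, manifest

# First nightly run: everything is new, so every scope gets embedded.
old_repo = {"billing": "def pay(): ...", "auth": "def login(): ..."}
_, manifest = plan_refresh(old_repo, {})

# Next night: only the billing scope was edited.
new_repo = {"billing": "def pay(amount): ...", "auth": "def login(): ..."}
stale, _ = plan_refresh(new_repo, manifest)
```

For the RAG path this means only re-indexing `stale` scopes; for the FT path it bounds how much new training data a periodic refresh must regenerate.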

Limitations & Future Work

  • Scope definition heuristics: The current method relies on static analysis and naming conventions; highly dynamic languages or unconventional project structures may yield suboptimal scopes.
  • Fine‑tuning data quality: Prompt–completion pairs are automatically generated, which can introduce noisy or ambiguous examples that limit gains.
  • Evaluation breadth: The study focuses on two enterprise codebases; broader validation across diverse domains (e.g., embedded systems, data‑science notebooks) is needed.
  • Future directions suggested by the authors include:
    1. Learning adaptive scope boundaries via graph neural networks.
    2. Exploring parameter‑efficient adaptation methods (e.g., LoRA, adapters) to further reduce compute.
    3. Integrating runtime feedback (e.g., test failures) to close the loop between generation and correctness.

Authors

  • Ulrich Finkler
  • Irene Manotas
  • Wei Zhang
  • Geert Janssen
  • Octavian Popescu
  • Shyam Ramji

Paper Information

  • arXiv ID: 2602.05780v1
  • Categories: cs.SE, cs.AI
  • Published: February 5, 2026