[Paper] Architecture-Aware Multi-Design Generation for Repository-Level Feature Addition

Published: March 2, 2026 at 07:50 AM EST
4 min read
Source: arXiv - 2603.01814v1

Overview

Adding a new feature to an existing codebase is a tough problem for current Large Language Models (LLMs). The models must understand the whole repository’s architecture, locate the right files, and avoid breaking legacy behavior. The paper introduces RAIM, a framework that makes LLM‑driven feature addition architecture‑aware by exploring multiple design alternatives and rigorously validating their impact before committing a patch.

Key Contributions

  • Architecture‑aware localization: A multi‑round graph‑based search that pinpoints all cross‑file locations a new feature touches, even in large repositories.
  • Multi‑design generation: Instead of a single “greedy” patch, RAIM asks the LLM to produce several diverse implementation designs.
  • Impact‑aware selection: Combines static analysis (dependency graphs, type checking) and dynamic testing (automated execution) to automatically pick the safest, most effective design.
  • State‑of‑the‑art results: Achieves a 39.47 % success rate on the NoCode‑bench Verified benchmark—a 36.34 % relative gain over the strongest prior baseline.
  • Model‑agnostic robustness: Works equally well with open‑weight models (e.g., DeepSeek‑v3.2) and large proprietary models, demonstrating strong generalization.

Methodology

  1. Repository Graph Construction – RAIM builds a code graph where nodes are functions, classes, and files, and edges capture import, call, and data‑flow relationships.
  2. Localization Loop – Starting from the feature description, the system performs several rounds of graph traversal, each time refining the set of candidate nodes that need modification. This “localization” step ensures that even dispersed changes (e.g., a new API call that touches UI, backend, and config files) are discovered.
  3. Multi‑Design Prompting – The LLM receives the localized change set and is prompted to generate N distinct design proposals (e.g., different architectural patterns, refactoring strategies).
  4. Impact Evaluation
    • Static: RAIM runs type‑checking, linting, and dependency‑impact analysis on each proposal to flag potential regressions.
    • Dynamic: Each proposal is compiled/executed against a test harness; failing tests or runtime exceptions are recorded.
  5. Selection & Patch Emission – The proposal with the best impact score (fewest static warnings + highest test pass rate) is selected, and the corresponding diff is emitted as the final patch.
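Steps 1–2 above can be sketched with a toy adjacency-set graph. The file names and edges below are invented for illustration, and the real system's graph (functions, classes, data-flow edges) and its refinement criteria are far richer than this plain neighborhood expansion:

```python
from collections import defaultdict

# Hypothetical toy repository graph: nodes are files, edges stand in for
# import / call / data-flow relations (all names here are illustrative).
edges = [
    ("api/routes.py", "core/handlers.py"),    # import
    ("core/handlers.py", "core/models.py"),   # call
    ("core/models.py", "config/settings.py"), # data flow
]

graph = defaultdict(set)
for src, dst in edges:
    graph[src].add(dst)
    graph[dst].add(src)  # traverse relations in both directions

def localize(seeds, rounds=2):
    """Multi-round expansion: each round pulls in neighbors of the
    current candidate set, mimicking RAIM's localization loop."""
    candidates = set(seeds)
    for _ in range(rounds):
        frontier = {n for c in candidates for n in graph[c]}
        candidates |= frontier  # refine/extend the candidate change set
    return candidates

print(sorted(localize(["api/routes.py"], rounds=2)))
# → ['api/routes.py', 'core/handlers.py', 'core/models.py']
```

With more rounds the frontier keeps growing, which is how dispersed touchpoints (here, `config/settings.py` at round 3) are eventually discovered.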

The pipeline is fully automated, requiring only the feature description and the repository as input.
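The impact-aware selection of steps 4–5 can be illustrated with a simple linear score: test pass rate minus a per-warning penalty. The penalty weight and the proposal names are assumptions for this sketch; the paper does not specify the exact scoring function.

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    name: str
    static_warnings: int  # from type checking / linting / dependency analysis
    tests_passed: int
    tests_total: int

def impact_score(p, warning_penalty=0.05):
    """Higher is better: test pass rate minus a penalty per static
    warning (the 0.05 weight is an illustrative assumption)."""
    pass_rate = p.tests_passed / p.tests_total if p.tests_total else 0.0
    return pass_rate - warning_penalty * p.static_warnings

def select_best(proposals):
    """Step 5: emit the proposal with the best impact score."""
    return max(proposals, key=impact_score)

# Three hypothetical designs returned by the multi-design prompt
designs = [
    Proposal("adapter-pattern", static_warnings=1, tests_passed=48, tests_total=50),
    Proposal("inline-hook",     static_warnings=0, tests_passed=45, tests_total=50),
    Proposal("new-subclass",    static_warnings=4, tests_passed=49, tests_total=50),
]
print(select_best(designs).name)  # → adapter-pattern
```

Note how the highest raw pass rate (`new-subclass`) loses to a cleaner design once static warnings are weighed in, which is the point of combining static and dynamic signals.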

Results & Findings

| Metric | RAIM | Best Baseline |
|---|---|---|
| Success Rate (feature fully integrated & passes tests) | 39.47 % | 28.93 % |
| Relative Improvement | +36.34 % | |
| Model‑agnostic gain (open‑weight vs. proprietary) | Open‑weight models surpass proprietary baselines by ~12 % | |

Key observations:

  • Multi‑design generation contributes ~15 % of the total gain; without it, success drops to ~27 %.
  • Impact validation (static + dynamic) cuts regression errors by >40 % compared to a naïve single‑pass approach.
  • The system scales to repositories with >200 k LOC, maintaining localization precision above 90 %.

Practical Implications

  • Faster Feature Integration: Development teams can offload the tedious “find‑all‑touchpoints” work to RAIM, reducing manual code‑review cycles.
  • Safer Automated Refactoring: By vetting multiple designs, RAIM mitigates the risk of introducing regressions—a critical concern for production systems.
  • Model‑independent Tooling: Companies can adopt RAIM with open‑source LLMs, avoiding costly API fees while still achieving high-quality patches.
  • Continuous Integration (CI) Enhancement: RAIM’s static/dynamic impact checks can be plugged into CI pipelines to automatically validate PRs generated by AI assistants.
  • Educational Aid: New developers can see several plausible implementations for a feature, helping them learn architectural trade‑offs.
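As a minimal sketch of the CI idea above: a gate script that runs static and dynamic impact checks on an AI-generated patch and fails the pipeline on any regression. The tool choices (`mypy`, `pytest`) and the gate structure are assumptions for illustration, not prescribed by the paper.

```python
import subprocess

# Checks to run against the patched repository. Paths and tools are
# illustrative; swap in whatever your project already uses.
CHECKS = [
    ["mypy", "src/"],            # static: type checking
    ["pytest", "-q", "tests/"],  # dynamic: execute the test harness
]

def gate(checks, run=subprocess.run):
    """Return 0 if every check passes, 1 on the first failure.
    `run` is injectable so the gate can be tested without real tools."""
    for cmd in checks:
        if run(cmd).returncode != 0:
            print(f"impact check failed: {' '.join(cmd)}")
            return 1
    return 0

# In a CI job, the exit code blocks or allows the merge:
#   sys.exit(gate(CHECKS))
```

A nonzero exit code fails the CI job, so a PR produced by an AI assistant cannot merge until both the static and dynamic impact checks pass.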

Limitations & Future Work

  • Test Dependence: RAIM’s dynamic validation relies on existing test suites; poorly covered codebases may still receive unsafe patches.
  • Computation Overhead: Generating and evaluating multiple designs incurs higher CPU/GPU usage, which could be prohibitive for very large repositories without optimization.
  • Design Diversity Control: The current prompting strategy does not guarantee truly orthogonal designs; future work could incorporate explicit diversity metrics.
  • Beyond Code‑Level Impact: The framework currently focuses on compile‑time and test‑time effects; extending it to performance profiling and security analysis is an open direction.

Overall, RAIM demonstrates that making LLM‑driven code generation architecture‑aware and design‑diverse is a practical path toward reliable, repository‑scale software evolution.

Authors

  • Mingwei Liu
  • Zhenxi Chen
  • Zheng Pei
  • Zihao Wang
  • Yanlin Wang
  • Zibin Zheng

Paper Information

  • arXiv ID: 2603.01814v1
  • Categories: cs.SE
  • Published: March 2, 2026