[Paper] CodeMEM: AST-Guided Adaptive Memory for Repository-Level Iterative Code Generation
Source: arXiv - 2601.02868v1
Overview
The paper introduces CodeMEM, a novel memory‑management layer that lets large language models (LLMs) keep track of a codebase’s evolving state during multi‑turn, repository‑level coding sessions. By representing the repository as an abstract syntax tree (AST) and using that structure to guide what the model remembers, CodeMEM dramatically reduces “forgetting” and cuts down the number of interaction rounds needed to get correct code.
Key Contributions
- AST‑guided Code Context Memory – a dynamic store that updates the repository’s structural representation after each LLM edit, ensuring the model always works with a current view of the code.
- Code Session Memory – a code‑centric log of the entire interaction, built from AST diffs rather than raw text, which enables precise detection of forgotten information.
- Forgetting mitigation via AST analysis – automatic checks that flag when a previously resolved issue re‑appears, prompting the model to retrieve the relevant context.
- State‑of‑the‑art results on two benchmarks (CodeIF‑Bench and CoderEval), with gains of 11.5–12.2 points in instruction‑following accuracy and 2–3 fewer interaction rounds per task.
- Efficiency‑aware design – comparable inference latency and token usage to existing memory approaches despite the richer representation.
Methodology
- AST Extraction – The repository’s source files are parsed into ASTs, providing a language‑agnostic, hierarchical view of code elements (functions, classes, imports, etc.).
- Code Context Memory (CCM) – After each LLM‑generated edit, the system re‑parses the modified files, merges the new AST fragments with the existing CCM, and tags each node with a “freshness” timestamp (a minimal sketch of such a store follows this list).
- Code Session Memory (CSM) – Every turn’s prompt, model output, and resulting AST diff are stored as a compact, structured record. Instead of a long chat transcript, the CSM holds a sequence of AST change operations.
- Forgetting Detection – Before generating the next turn, the model queries the CSM to see if any previously fixed AST nodes have been unintentionally altered. If so, a reminder is injected into the prompt (see the session‑memory sketch after this list).
- Prompt Construction – The LLM receives a hybrid prompt: (a) a short natural‑language summary, (b) a serialized snippet of the current AST (or relevant subtree), and (c) any forgetting alerts. This keeps the token budget low while preserving essential structural context.
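The paper summary above includes no reference code, so the following is a minimal Python sketch of what an AST‑keyed context store with freshness timestamps could look like. The `CodeContextMemory` class, its `refresh` method, and the use of Python's built‑in `ast` module (rather than whichever parser the authors actually use) are illustrative assumptions, not the paper's implementation.

```python
import ast
import time
from pathlib import Path


class CodeContextMemory:
    """AST-level view of the repository, keyed by a qualified name per definition."""

    def __init__(self):
        # e.g. {"billing.total_invoice": {"node": <ast.FunctionDef>, "fresh_at": 1736160000.0}}
        self.nodes = {}

    def refresh(self, file_path):
        """Re-parse one modified file and merge its definitions into the store."""
        module = Path(file_path).stem
        tree = ast.parse(Path(file_path).read_text())
        now = time.time()
        for node in ast.walk(tree):
            # Nested and method definitions are flattened here for brevity.
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                self.nodes[f"{module}.{node.name}"] = {"node": node, "fresh_at": now}


# Usage after each model edit: only the touched files are re-parsed.
# ccm = CodeContextMemory()
# for path in edited_files:
#     ccm.refresh(path)
```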
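Along the same lines, here is a hedged sketch of a session memory that records each turn as AST change operations and later checks for regressions. The `TurnRecord` and `CodeSessionMemory` names and the `ast.dump`‑based fingerprints are assumptions chosen for illustration; the paper's actual diff representation may differ.

```python
import ast
import hashlib
from dataclasses import dataclass, field


def node_fingerprint(node):
    """Hash of the node's structural dump; changes whenever the subtree changes."""
    return hashlib.sha256(ast.dump(node).encode()).hexdigest()


@dataclass
class TurnRecord:
    instruction: str                              # natural-language request for the turn
    changed: dict = field(default_factory=dict)   # qualified name -> fingerprint after the fix


@dataclass
class CodeSessionMemory:
    turns: list = field(default_factory=list)

    def log_turn(self, instruction, changed_nodes):
        """Record which AST nodes this turn touched, as fingerprints rather than raw text."""
        fingerprints = {name: node_fingerprint(n) for name, n in changed_nodes.items()}
        self.turns.append(TurnRecord(instruction, fingerprints))

    def forgetting_alerts(self, current_nodes):
        """Flag nodes fixed in an earlier turn whose current fingerprint no longer matches.

        Simplified: a deliberate later edit to the same node would also be flagged.
        """
        alerts = []
        for turn in self.turns:
            for name, old_fp in turn.changed.items():
                node = current_nodes.get(name)
                if node is not None and node_fingerprint(node) != old_fp:
                    alerts.append(f"{name} no longer matches the fix from: {turn.instruction!r}")
        return alerts
```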
The pipeline runs iteratively: LLM → AST diff → memory update → next prompt, enabling “continuous integration” of code changes without re‑reading the whole repository each time.
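To make this loop concrete, here is a rough sketch of how the two memories and the hybrid prompt could be wired together; `llm.edit`, `task.summary`, `task.relevant_subtree`, `task.names_in`, and `task.tests_pass` are placeholder interfaces for illustration, not APIs described in the paper.

```python
import ast


def build_prompt(summary, subtree, alerts):
    """Hybrid prompt: short NL summary + serialized AST subtree + any forgetting alerts."""
    parts = [summary, "Relevant code structure:", ast.unparse(subtree)]
    if alerts:
        parts.append("Earlier fixes that appear to have regressed:")
        parts.extend(f"- {a}" for a in alerts)
    return "\n".join(parts)


def run_session(task, llm, ccm, csm, max_rounds=8):
    """One iterative session: LLM edit -> AST diff -> memory update -> next prompt."""
    for _ in range(max_rounds):
        current = {name: entry["node"] for name, entry in ccm.nodes.items()}
        prompt = build_prompt(task.summary,
                              task.relevant_subtree(current),   # pick the subtree for this task
                              csm.forgetting_alerts(current))
        edited_files = llm.edit(prompt)        # model proposes edits; files are written to disk
        for path in edited_files:
            ccm.refresh(path)                  # re-parse only the touched files
        changed = {name: ccm.nodes[name]["node"]
                   for name in task.names_in(edited_files) if name in ccm.nodes}
        csm.log_turn(task.summary, changed)
        if task.tests_pass():                  # stop once the task's checks succeed
            break
```

Because only the touched files are re‑parsed and only compact per‑node records are stored, the per‑turn memory update stays lightweight, which is consistent with the latency and token figures reported below.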
Results & Findings
| Benchmark / metric | Measure | Baseline (no memory) | CodeMEM | Δ |
|---|---|---|---|---|
| CodeIF‑Bench (instruction‑following, current turn) | Accuracy | 68.4 % | 80.6 % | +12.2 % |
| CodeIF‑Bench (session‑level) | Accuracy | 61.1 % | 72.6 % | +11.5 % |
| CoderEval (code generation) | Pass@1 | 45.3 % | 48.9 % | +3.6 % |
| Interaction rounds (avg.) | – | 7.4 | 5.1 | –2.3 |
| Inference latency (ms) | – | 210 | 225 | ≈ +7 % |
| Tokens per session | – | 3,200 | 3,150 | –50 |
Key takeaways
- Higher accuracy stems from the model always seeing an up‑to‑date AST, which eliminates stale context.
- Fewer rounds mean developers spend less time back‑and‑forth with the model, accelerating the coding loop.
- Token efficiency is maintained because the AST representation is far more compact than raw source code or full chat histories.
Practical Implications
- IDE plugins & Copilot‑style assistants can embed CodeMEM to keep a live AST snapshot, allowing the assistant to suggest edits that respect the whole project’s structure.
- CI/CD automation: Automated code reviewers can use the memory layer to remember past linting or security findings across multiple PRs, reducing duplicate warnings.
- On‑prem LLM deployments: Enterprises can achieve better code generation quality without inflating compute costs, as the memory updates are lightweight AST diffs.
- Cross‑language support: Because ASTs are language‑agnostic, the same memory engine can serve polyglot repositories (e.g., a micro‑service suite with Python, Go, and TypeScript).
- Reduced hallucinations: By grounding the model in concrete syntax trees, the system curtails the tendency of LLMs to fabricate non‑existent APIs or variables.
Limitations & Future Work
- AST parsing overhead for very large monorepos can become a bottleneck; the authors suggest incremental parsing as a mitigation.
- The approach currently assumes syntactically correct code after each turn; handling partial or broken snippets would require more robust error recovery.
- Memory is AST‑centric, which captures structure but not runtime semantics (e.g., dynamic typing effects). Extending the model with type‑inference or execution traces is an open direction.
- Evaluation focused on benchmark datasets; real‑world user studies are needed to confirm usability gains in production IDEs.
Overall, CodeMEM demonstrates that giving LLMs a “structural memory” of code can make iterative, repository‑level generation far more reliable and developer‑friendly.
Authors
- Peiding Wang
- Li Zhang
- Fang Liu
- Chongyang Tao
- Yinghao Zhu
Paper Information
- arXiv ID: 2601.02868v1
- Categories: cs.SE
- Published: January 6, 2026