[Paper] How Does Chunking Affect Retrieval-Augmented Code Completion? A Controlled Empirical Study
Source: arXiv - 2605.04763v1
Overview
The paper investigates a surprisingly under‑explored piece of the Retrieval‑Augmented Generation (RAG) pipeline for code completion: how we split source code into “chunks” before retrieving relevant snippets. By systematically testing four chunking strategies across dozens of retriever‑generator combos, the authors show that the choice of chunking can materially affect completion quality—and that the intuitive “function‑level” split is actually the worst performer.
Key Contributions
- Large‑scale controlled experiment: 864 distinct RAG configurations (4 chunkers × 4 retrievers × 5 generators × 9 context‑length/chunk‑size settings) evaluated on two public benchmarks (RepoEval, CrossCodeEval).
- Empirical evidence that chunking matters: Statistical analysis confirms a significant impact of chunking strategy on completion accuracy.
- Counter‑intuitive finding: Function‑based chunking underperforms all other methods by 3.5–5.6 % on RepoEval (Cliff’s δ = ‑1.0).
- Dominance of cross‑file context length: Expanding the retrieved context from 2 k to 8 k tokens yields up to a 4.2 % boost, dwarfing the effect of chunk size.
- Pareto‑optimal analysis: Sliding‑Window and cAST chunkers dominate the cost‑quality trade‑off; Function chunking never appears on the Pareto front.
Methodology
Chunking Strategies
- Function – each chunk is a single function definition.
- Declaration – chunks consist of top‑level declarations (imports, class/struct definitions, etc.).
- Sliding Window – a fixed‑size token window moves across the file with overlap, producing overlapping chunks.
- cAST – a syntax‑aware chunker that groups tokens based on abstract‑syntax‑tree (AST) boundaries, preserving logical code units while allowing flexible size.
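As a rough illustration of how two of these strategies differ, the sketch below implements a sliding‑window chunker and a function‑level chunker in Python. The function names, the whitespace tokenizer, and the window and overlap values are assumptions made for illustration; this summary does not describe the paper's actual implementations or tokenizer.

```python
import ast


def sliding_window_chunks(text, window=64, overlap=16):
    """Split text into overlapping windows of whitespace-separated tokens.

    Whitespace splitting stands in for a real tokenizer (an assumption).
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    tokens = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


def function_chunks(source):
    """Return one chunk per function definition found in Python source."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                chunks.append(segment)
    return chunks
```

In this sketch, code that sits outside any function (imports, class headers, module‑level statements) ends up in no function chunk at all, whereas the sliding window covers every token at least once.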
Retrievers & Generators
Four retrievers (BM25, dense vector search, hybrid, etc.) and five code generators (e.g., Codex, StarCoder, CodeGen) were plugged into the same RAG pipeline.
Parameter Grid
Nine configurations varying cross‑file context length (2 k, 4 k, 8 k tokens) and chunk size (small, medium, large).
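The nine settings are the cross product of three context lengths and three chunk sizes. A minimal sketch of that grid follows. The context lengths are read as round token counts, which is an assumption, and the chunk sizes stay symbolic because this summary does not give their numeric values.

```python
from itertools import product

# "2 k, 4 k, 8 k" read as approximate token counts (assumption).
CONTEXT_TOKENS = [2000, 4000, 8000]
# Numeric chunk sizes are not given in this summary, so labels are used.
CHUNK_SIZES = ["small", "medium", "large"]

grid = [
    {"context_tokens": c, "chunk_size": s}
    for c, s in product(CONTEXT_TOKENS, CHUNK_SIZES)
]
print(len(grid))  # 9
```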
Benchmarks
- RepoEval: real‑world repository‑level completion tasks.
- CrossCodeEval: cross‑file code completion across multiple programming languages.
Evaluation
- Completion accuracy measured with exact‑match and functional correctness metrics.
- Statistical significance assessed via paired t‑tests, with effect sizes reported as Cliff’s delta.
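Cliff’s delta compares two groups of scores by counting pairwise wins: δ = (#(x > y) − #(x < y)) / (n·m). The sketch below computes it directly. The score values are made up for illustration and are not taken from the paper.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: (#pairs with x > y minus #pairs with x < y) / (len(xs) * len(ys))."""
    if not xs or not ys:
        raise ValueError("both groups must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Hypothetical accuracy scores, for illustration only.
function_scores = [0.40, 0.41, 0.42]
other_scores = [0.45, 0.46, 0.47]
print(cliffs_delta(function_scores, other_scores))  # -1.0
```

A δ of −1.0 means every score in the first group is lower than every score in the second, which is the strongest possible effect in that direction.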
Results & Findings
| Metric | Function | Declaration | Sliding Window | cAST |
|---|---|---|---|---|
| RepoEval accuracy (Δ vs. best) | ‑3.57 % to ‑5.64 % | – | – | – |
| Cross‑file context boost (2 k → 8 k) | up to +4.2 % | similar | similar | similar |
| Chunk‑size effect | non‑monotonic, modest | – | – | – |
| Pareto‑optimality | Never | Sometimes | Yes | Yes |
- Chunking matters: The function‑based split consistently lags behind the other three, regardless of retriever or generator.
- Context length dominates: Raising the cross‑file token budget from 2 k to 8 k yields the biggest gains (up to +4.2 %), suggesting that more retrieved code is more valuable than finer granularity.
- Chunk size is tricky: Larger chunks do not always help; the relationship is non‑monotonic and depends on the downstream generator’s context window.
- cAST and Sliding Window shine: Both achieve the best trade‑off between retrieval cost (number of chunks to scan) and completion quality, making them strong default choices.
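A configuration is Pareto‑optimal when no other configuration is at least as cheap and at least as accurate while being strictly better on one of the two. A minimal sketch of that filter follows. The configuration names echo the four chunkers, but the cost and quality numbers are invented for illustration.

```python
def pareto_front(configs):
    """Return names of configs not dominated by any other.

    Lower cost is better; higher quality is better.
    """
    front = []
    for name, cost, quality in configs:
        dominated = any(
            c <= cost and q >= quality and (c < cost or q > quality)
            for _, c, q in configs
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical (name, cost, quality) triples; not values from the paper.
configs = [
    ("function", 5.0, 0.40),
    ("declaration", 4.0, 0.44),
    ("sliding_window", 3.0, 0.45),
    ("cast", 4.5, 0.47),
]
print(pareto_front(configs))  # ['sliding_window', 'cast']
```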
Practical Implications
- Tooling & IDE plugins – When building a RAG‑powered autocomplete (e.g., VS Code extensions), prefer Sliding Window or cAST chunking over naïve function‑level indexing.
- Index size & latency – Sliding Window creates overlapping chunks, increasing index size but often reducing the number of retrieval hops needed for high‑quality context. cAST offers a middle ground with syntax‑aware boundaries and fewer redundant chunks.
- Configuration tuning – Allocate more token budget to the cross‑file context (e.g., 8 k tokens) before fiddling with chunk size; the payoff is larger and more predictable.
- Cost‑aware deployment – On cloud‑based code‑completion services where retrieval cost is billed per query, the Pareto analysis suggests you lose nothing on the cost‑quality trade‑off by dropping Function chunking entirely.
- Model‑agnostic benefit – The observed trends hold across a variety of generators, meaning the recommendations are robust even as new LLMs for code emerge.
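To make the token‑budget knob concrete, the sketch below greedily packs ranked chunks into a fixed cross‑file budget. The function name, the whitespace token count, and the skip‑if‑too‑large policy are all assumptions; this summary does not describe how the paper actually assembles its prompts.

```python
def pack_context(ranked_chunks, budget_tokens):
    """Greedily keep chunks, in rank order, while they fit in the token budget.

    Token counts use whitespace splitting as a stand-in for a real tokenizer.
    """
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())
        if used + cost > budget_tokens:
            continue  # skip chunks that do not fit; a smaller one later may still fit
        selected.append(chunk)
        used += cost
    return selected, used


# Hypothetical ranked retrieval results, best match first.
ranked = ["a b c", "d e f g", "h"]
print(pack_context(ranked, budget_tokens=4))  # (['a b c', 'h'], 4)
```

Under a policy like this one, a larger budget lets more (or larger) chunks survive the packing step, so chunk size and context length are not independent settings.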
Limitations & Future Work
- Benchmark scope – Only two benchmarks were used; while they cover multiple languages, they may not capture niche domains (e.g., embedded C, scientific Python).
- Retriever diversity – The study focused on four retrievers; newer graph‑based or multimodal retrievers could interact differently with chunking.
- Static analysis depth – cAST relies on a parser; languages without mature parsers may need alternative syntax‑aware chunkers.
- Real‑world latency – The paper reports quality metrics but does not measure end‑to‑end latency in an IDE setting; future work could profile user‑perceived responsiveness.
- Dynamic code – The experiments assume static source files; handling generated code or notebooks may require adaptive chunking strategies.
Bottom line: If you’re engineering a retrieval‑augmented code completion system, ditch the function‑level chunker, give your model a generous cross‑file context window, and consider a sliding‑window or syntax‑aware (cAST) chunking scheme to hit the sweet spot between speed, cost, and accuracy.
Authors
- Xinjian Wu
- Jingzhi Gong
- Gunel Jahangirova
- Jie Zhang
Paper Information
- arXiv ID: 2605.04763v1
- Categories: cs.SE
- Published: May 6, 2026