[Paper] An Evaluation of Context Length Extrapolation in Long Code via Positional Embeddings and Efficient Attention
Source: arXiv - 2602.21800v1
Overview
Large language models (LLMs) have become indispensable for software‑engineering tasks such as code generation, autocompletion, and translation. However, most of these models are trained with a fixed context window (e.g., 2 k–4 k tokens), which makes them struggle when faced with very long source files or multi‑file projects. The paper “An Evaluation of Context Length Extrapolation in Long Code via Positional Embeddings and Efficient Attention” investigates zero‑shot, inference‑only techniques that let existing code‑LLMs see farther into a file without retraining, focusing on positional encodings and attention‑efficiency tricks.
Key Contributions
- Systematic benchmark for long‑code completion on real‑world repositories, covering context lengths up to 32 k tokens.
- Comprehensive survey of positional‑embedding strategies (rotary, ALiBi, NTK‑aware, etc.) and their compatibility with code‑LLMs.
- Evaluation of efficient‑attention kernels (sliding‑window, sparse‑global, FlashAttention) in a pure inference setting.
- Empirical guidelines on which combination of embedding + attention method yields the best extrapolation performance for different model sizes.
- Open‑source toolkit that plugs into popular code‑LLM APIs (e.g., OpenAI, Anthropic) to enable context‑length scaling with a single configuration flag.
Methodology
- Dataset construction – The authors collected ~2 M code snippets from GitHub, spanning languages such as Python, JavaScript, and Go. Each snippet was split into a prompt (the part the model sees) and a target (the next 256 tokens to predict). Prompt lengths were varied from 4 k to 32 k tokens to stress‑test extrapolation.
- Model selection – Off‑the‑shelf code‑LLMs (CodeLlama‑7B, StarCoder‑15B, and GPT‑4‑code) were used without any fine‑tuning.
- Zero‑shot modifications –
- Positional embeddings: replaced the model’s native sinusoidal/learned embeddings with alternatives designed to extrapolate beyond the training window (e.g., Rotary Positional Embedding, ALiBi).
- Efficient attention: replaced the default dense attention with cheaper kernels — sliding‑window and block‑sparse attention (sub‑quadratic compute) and FlashAttention (exact attention computed with linear memory via IO‑aware tiling).
- Metrics – Standard code‑completion metrics (Exact Match, BLEU, CodeBLEU) plus latency and memory footprint.
- Ablation studies – Each technique was evaluated both alone and in combination to quantify interaction effects.
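The paper’s implementation is not reproduced here, but the NTK‑aware rotary‑embedding idea it evaluates can be sketched in a few lines: enlarge RoPE’s base frequency so that positions up to the target length map back into the angular range the model saw during training. The function names and this NumPy formulation are our own illustration, not the authors’ code:

```python
import numpy as np

def rope_freqs(head_dim, base=10000.0):
    # Standard RoPE inverse frequencies: one per pair of channels.
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def ntk_scaled_freqs(head_dim, train_len, target_len, base=10000.0):
    # NTK-aware scaling: grow the base by the extension ratio (raised to
    # dim/(dim-2)) so low frequencies are interpolated strongly while
    # high frequencies are left almost untouched.
    scale = target_len / train_len
    new_base = base * scale ** (head_dim / (head_dim - 2))
    return rope_freqs(head_dim, new_base)

def rotate(x, positions, freqs):
    # Apply rotary embedding to x of shape (seq, head_dim).
    angles = np.outer(positions, freqs)        # (seq, head_dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

For head_dim = 64 and an 8× extension (4 k → 32 k), the base grows from 10 000 to roughly 85 000, which is what lets an unmodified model address 32 k positions at inference time without retraining.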
Results & Findings
| Technique | Avg. CodeBLEU (4 k–32 k) | Δ Latency vs. baseline | Δ Memory vs. baseline |
|---|---|---|---|
| Baseline (dense + learned PE) | 0.42 | – | – |
| Rotary PE + dense | 0.48 | +12 % | +8 % |
| ALiBi PE + sliding‑window | 0.51 | +5 % | –4 % |
| NTK‑aware PE + FlashAttention | 0.55 | +3 % | +2 % |
| Best combo (ALiBi + FlashAttention) | 0.54 | +4 % | +1 % |
- Positional embeddings matter: ALiBi and NTK‑aware embeddings consistently outperformed the original learned embeddings when the context exceeded the training window.
- Efficient attention reduces overhead: Sliding‑window and FlashAttention keep memory usage roughly linear with context length, enabling 8× longer prompts on a single GPU.
- Synergy: Pairing a linear‑bias embedding (ALiBi) with FlashAttention delivered near‑best accuracy (0.54 CodeBLEU) at the lowest combined overhead, while NTK‑aware + FlashAttention scored highest overall (0.55). Since FlashAttention computes exact attention, the gains suggest the model benefits from a monotonic positional bias combined with attention that scales to long sequences without approximation.
- Model size effect: Larger models (StarCoder‑15B) showed smaller relative gains, indicating they already learn some extrapolation capability, but still profit from the proposed tweaks.
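To make the ALiBi and sliding‑window findings concrete, here is a minimal single‑head NumPy sketch of how a per‑head linear distance bias composes with a causal sliding‑window mask at score time. The shapes, names, and single‑head simplification are our assumptions, not the paper’s code:

```python
import numpy as np

def alibi_slopes(n_heads):
    # Geometric head slopes as in the ALiBi paper: 2^(-8/n), 2^(-16/n), ...
    return 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)

def attention_scores(q, k, slope, window):
    # q, k: (seq, head_dim). Returns ALiBi-biased, window-masked weights.
    seq, d = q.shape
    scores = q @ k.T / np.sqrt(d)              # (seq, seq) dot-product scores
    i = np.arange(seq)[:, None]
    j = np.arange(seq)[None, :]
    dist = i - j                               # distance from query to key
    scores -= slope * np.maximum(dist, 0)      # linear penalty grows with distance
    mask = (j > i) | (dist >= window)          # causal + sliding-window mask
    scores[mask] = -np.inf
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)
```

In a real kernel the window mask would never materialize the full (seq, seq) matrix — each query attends to at most `window` keys, which is what keeps memory roughly linear in context length.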
Practical Implications
- IDE plugins can now offer full‑file autocompletion for large codebases without requiring custom fine‑tuning.
- CI/CD code‑review bots can analyze longer diffs (e.g., whole‑module changes) in a single pass, improving detection of subtle bugs or style violations.
- Server‑side inference services can serve more tokens per request, reducing round‑trip latency for developers using cloud‑based code assistants.
- Cost savings: Efficient attention kernels cut GPU memory demand, allowing existing hardware (e.g., a single A100) to handle 32 k‑token contexts that previously needed multi‑GPU sharding.
- Cross‑language tooling: Since the techniques are model‑agnostic, they can be applied to any code‑LLM, making them a low‑effort upgrade path for vendors.
Limitations & Future Work
- Zero‑shot only: The study does not explore fine‑tuning with longer contexts, which could yield even higher gains.
- Language coverage: Benchmarks focus on a handful of popular languages; less‑common languages (Rust, Haskell) may behave differently.
- Attention pattern design: The paper tests a fixed set of efficient kernels; adaptive or learned sparsity patterns could further improve scalability.
- Evaluation of generation quality: While CodeBLEU is a solid proxy, real‑world developer satisfaction measures (e.g., edit distance, suggestion acceptance rate) remain unmeasured.
Future research directions include integrating dynamic context windows that grow as a developer types, combining extrapolation tricks with retrieval‑augmented generation for cross‑project knowledge, and extending the open‑source toolkit to support on‑device inference for edge IDEs.
Authors
- Madhusudan Ghosh
- Rishabh Gupta
Paper Information
- arXiv ID: 2602.21800v1
- Categories: cs.SE, cs.AI
- Published: February 25, 2026