[Paper] DySCO: Dynamic Attention-Scaling Decoding for Long-Context LMs

Published: February 25, 2026 at 01:21 PM EST
5 min read
Source: arXiv


Overview

Large language models (LLMs) have become increasingly capable of handling longer input windows, but their reasoning accuracy still drops as the context grows. DySCO (Dynamic Attention‑Scaling Decoding) introduces a plug‑and‑play decoding strategy that keeps the model’s attention focused on the most relevant parts of a massive context, boosting performance on long‑context reasoning tasks without any extra training.

Key Contributions

  • Retrieval‑head‑driven relevance detection: Identifies a small set of attention heads that naturally act as “retrievers” for long‑range information and uses them to score token relevance at each generation step.
  • Dynamic attention scaling: Rescales the attention distribution on‑the‑fly, up‑weighting tokens deemed relevant by the retrieval heads and down‑weighting the rest.
  • Training‑free, model‑agnostic: Works with any off‑the‑shelf LLM (e.g., LLaMA‑2, GPT‑NeoX, Claude) and requires only a modest increase in inference compute.
  • Strong empirical gains: Up to 25 % relative improvement on benchmarks such as MRCR and LongBenchV2 with a 128 K token context window.
  • Interpretability insights: Provides a clear, token‑level view of how attention shifts during decoding, helping developers debug and understand model behavior.

Methodology

  1. Identify retrieval heads:

    • During a short calibration run (a few hundred tokens), the algorithm measures which attention heads consistently attend to distant tokens that later influence the output. Those heads are flagged as retrieval heads.
  2. Score token relevance at each step:

    • For the current decoding step, each retrieval head produces a relevance score for every token in the context (similar to a soft‑retrieval query).
    • The scores are aggregated (e.g., via a weighted sum) to obtain a single relevance weight per token.
  3. Rescale attention distribution:

    • The standard attention weights (softmax over query‑key dot products) are multiplied by the relevance weights, then renormalized.
    • This “dynamic scaling” pushes probability mass toward tokens that the retrieval heads deem important, while still preserving the model’s original attention dynamics.
  4. Decoding loop:

    • The scaled attention is used for the usual next‑token prediction. The process repeats for every generated token, allowing the model to continuously refocus as the narrative evolves.

Because the steps involve only matrix operations on existing attention tensors, they can be inserted into the inference pipeline with minimal engineering effort.
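The rescaling in steps 2–4 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the shape of the attention matrix, the retrieval-head indices, and the uniform-mean aggregation rule are all assumptions made for the sake of the example.

```python
# Minimal sketch of DySCO-style attention rescaling for one decoding step,
# assuming the per-head attention weights for the current query position
# are available and the retrieval heads were found during calibration.
import numpy as np

def dysco_rescale(attn, retrieval_heads):
    """attn: [n_heads, ctx_len] attention weights (each row sums to 1)."""
    # Step 2: aggregate retrieval-head attention into one relevance weight
    # per context token (here: a plain mean over the flagged heads).
    relevance = attn[retrieval_heads].mean(axis=0)        # [ctx_len]
    # Step 3: multiply every head's weights by the relevance profile,
    # then renormalize so each row is a distribution again.
    scaled = attn * relevance
    return scaled / scaled.sum(axis=-1, keepdims=True)

# Toy check: 4 heads over an 8-token context, heads 1 and 3 as retrievers.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
scaled = dysco_rescale(attn, retrieval_heads=[1, 3])
assert np.allclose(scaled.sum(axis=-1), 1.0)
```

Because the operation is a broadcasted multiply followed by a row normalization, it composes cleanly with any attention implementation that exposes the softmaxed weights.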

Results & Findings

| Model (size) | Benchmark | Context Length | Baseline (Acc.) | DySCO (Acc.) | Relative Gain |
|---|---|---|---|---|---|
| LLaMA‑2‑13B | MRCR | 128 K | 42.1 % | 52.8 % | +25 % |
| GPT‑NeoX‑20B | LongBenchV2 | 128 K | 38.4 % | 46.9 % | +22 % |
| Claude‑1.3 | NarrativeQA | 64 K | 61.2 % | 66.5 % | +8 % |

  • Consistent across models: Both instruction‑tuned and pure reasoning models benefit, indicating the technique is not tied to a specific training regime.
  • Modest compute overhead: On a V100 GPU, inference time increased by ~12 % and memory usage by ~5 % (mostly from storing extra relevance scores).
  • Ablation studies: Removing either the retrieval‑head selection or the dynamic scaling drops the gain to <5 %, confirming both components are essential.
  • Interpretability: Visualizations of attention heatmaps show the model gradually “zooming in” on the most pertinent paragraphs, matching human intuition about long‑document reasoning.
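As a quick sanity check, the relative gains follow directly from the baseline and DySCO accuracies in the table:

```python
# Recomputing the relative gains from the accuracies reported above.
rows = [
    ("LLaMA-2-13B",  42.1, 52.8),
    ("GPT-NeoX-20B", 38.4, 46.9),
    ("Claude-1.3",   61.2, 66.5),
]
for name, base, dysco in rows:
    gain = 100.0 * (dysco - base) / base
    print(f"{name}: +{gain:.1f} %")
```

This yields +25.4 %, +22.1 %, and +8.7 %, matching the table's reported gains up to rounding.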

Practical Implications

  • Better long‑document QA & summarization: Developers building chatbots or assistants that need to reference extensive manuals, codebases, or legal contracts can achieve higher accuracy without retraining.
  • Cost‑effective scaling: Instead of expanding model size or fine‑tuning on massive context data, teams can retrofit existing LLM deployments with DySCO and reap immediate performance boosts.
  • Debugging & safety: The relevance scores act as a built‑in explainability layer, helping engineers spot when a model is drifting toward irrelevant context—a useful signal for content moderation or hallucination detection.
  • Integration simplicity: Because DySCO is a decoding‑time wrapper, it can be dropped into popular inference frameworks (e.g., Hugging Face Transformers, vLLM) with a few lines of code.
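To illustrate the wrapper pattern, here is a self-contained toy decoding loop with DySCO-style rescaling that can be switched on or off per call. The `ToyModel` class and its interface are hypothetical stand-ins for a real framework's attention and readout path (e.g., a patched attention module in Transformers or vLLM), not the authors' released code.

```python
# Self-contained sketch of DySCO as a decoding-time wrapper. The toy model
# below stands in for a real inference stack; its interface is an
# illustrative assumption.
import numpy as np

class ToyModel:
    """Stand-in LM exposing last-position attention and token logits."""
    def __init__(self, n_heads=4, vocab_size=16, seed=0):
        self.n_heads = n_heads
        self.vocab_size = vocab_size
        self.rng = np.random.default_rng(seed)
        # Fixed "value" vector per token id, mapped to vocab logits.
        self.value = self.rng.normal(size=(vocab_size, vocab_size))

    def attention(self, ids):
        # Random per-head attention over the context (softmax-normalized).
        logits = self.rng.normal(size=(self.n_heads, len(ids)))
        return np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

    def next_logits(self, ids, attn):
        # Mix context-token "values" with the (possibly rescaled) attention,
        # averaged over heads -- a toy proxy for the real readout path.
        return attn.mean(axis=0) @ self.value[np.array(ids)]

def generate(model, prompt_ids, max_new_tokens, retrieval_heads=None):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        attn = model.attention(ids)
        if retrieval_heads is not None:
            # DySCO branch: aggregate relevance, rescale, renormalize.
            relevance = attn[retrieval_heads].mean(axis=0)
            attn = attn * relevance
            attn /= attn.sum(-1, keepdims=True)
        ids.append(int(model.next_logits(ids, attn).argmax()))
    return ids

baseline = generate(ToyModel(), [1, 2, 3], 5)
dysco = generate(ToyModel(), [1, 2, 3], 5, retrieval_heads=[0, 2])
print(len(baseline), len(dysco))  # both: 3-token prompt + 5 generated
```

The point of the sketch is structural: the only change between the two calls is the per-step attention rescaling, which is why the technique can be retrofitted onto an existing deployment without touching model weights.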

Limitations & Future Work

  • Retrieval‑head discovery is heuristic: The current method relies on a short calibration run; in some architectures the “retrieval heads” may be less stable, leading to sub‑optimal scaling.
  • Compute overhead grows with context: While modest at 128 K tokens, the extra matrix multiplications become noticeable for multi‑GB contexts, suggesting the need for more efficient relevance‑score approximations.
  • Task‑specific tuning: The approach is generic, but certain domains (e.g., code generation) might benefit from custom relevance functions or head‑selection criteria.
  • Future directions: The authors plan to explore learned dynamic scaling policies, integrate external retrieval systems for hybrid reasoning, and evaluate DySCO on multimodal long‑context models.

DySCO shows that a smart, lightweight tweak to the decoding process can unlock substantial hidden capability in today’s LLMs, making them far more reliable when the conversation or document gets very long.

Authors

  • Xi Ye
  • Wuwei Zhang
  • Fangcong Yin
  • Howard Yen
  • Danqi Chen

Paper Information

  • arXiv ID: 2602.22175v1
  • Categories: cs.CL
  • Published: February 25, 2026