[Paper] Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling

Published: April 22, 2026 at 01:46 PM EDT
5 min read
Source: arXiv


Overview

The paper “Stream‑CQSA: Avoiding Out‑of‑Memory in Attention Computation via Flexible Workload Scheduling” tackles a core bottleneck of modern large‑language models (LLMs): the quadratic memory blow‑up of exact self‑attention when processing very long sequences. By reformulating attention as a set of independent sub‑computations that can be streamed on‑the‑fly, the authors show that you can run exact attention over billion‑token inputs on a single GPU without any approximation or costly inter‑GPU communication.

Key Contributions

  • CQS Divide operation – a novel decomposition derived from cyclic quorum set (CQS) theory that splits full‑sequence attention into mathematically independent subsequence tasks, guaranteeing exact reconstruction of the original attention matrix.
  • Stream‑CQSA framework – a memory‑adaptive scheduler that dynamically partitions the attention workload to fit any user‑specified GPU memory budget, turning attention into a streamable pipeline.
  • Hardware‑agnostic execution – the approach works on a single device and does not rely on multi‑GPU sharding or custom kernels; it can be layered on top of existing transformer libraries.
  • Empirical validation – experiments demonstrate predictable linear‑in‑memory scaling and successful execution of exact attention on sequences exceeding 1 billion tokens, all while keeping the same mathematical definition of attention.

Methodology

  1. Cyclic Quorum Set (CQS) Theory – In a quorum system, subsets of elements (quorums) intersect in a controlled way. The authors adapt this idea to the query, key, and value tensors of attention, constructing cyclic quorums that partition the sequence into overlapping blocks.
  2. CQS Divide – Using the quorums, the full attention output A = softmax(QKᵀ/√d_k)·V is decomposed into smaller attention computations, each over a subsequence that fits in memory. Because the quorums are designed to cover every pair of positions exactly once, the sub‑results can be recombined (with the appropriate softmax renormalization) into exactly the output of the monolithic computation.
  3. Streaming Scheduler – Stream‑CQSA treats each sub‑attention as a task in a queue. The scheduler monitors GPU memory usage and streams tasks in/out, swapping intermediate tensors to host memory when needed. No inter‑task communication is required because each sub‑task is independent.
  4. Implementation – The authors built the pipeline on top of PyTorch, using standard tensor operations and CUDA streams. The only extra requirement is a lightweight controller that decides block sizes based on the available memory budget.
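The pair‑coverage property in step 1 can be illustrated with a classic cyclic difference set. The paper's actual quorum construction is not given in this summary, so the base set {0, 1, 3} mod 7 below is purely illustrative:

```python
from itertools import combinations

def cyclic_quorums(n, base):
    """Quorum i is the base set rotated by i positions mod n."""
    return [{(i + d) % n for d in base} for i in range(n)]

def covers_all_pairs(n, quorums):
    """Does every pair of block indices appear together in some quorum?"""
    return all(any({a, b} <= q for q in quorums)
               for a, b in combinations(range(n), 2))

# {0, 1, 3} is a planar difference set mod 7: every nonzero residue is a
# difference of two of its elements, so every pair of the 7 blocks
# co-occurs in exactly one rotated quorum.
print(covers_all_pairs(7, cyclic_quorums(7, [0, 1, 3])))   # True
```

Any base set whose pairwise differences cover all nonzero residues gives the same guarantee; the rotation structure is what keeps each sub‑task independent of the others.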

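The exact-recombination idea in steps 2–3 is not fully specified in this summary. As a point of comparison, the well-known streaming-softmax trick (the basis of FlashAttention-style kernels) shows how block-wise partial results can be merged into exactly the monolithic output; a minimal NumPy sketch, not the paper's CQS construction:

```python
import numpy as np

def full_attention(Q, K, V):
    """Reference: monolithic softmax(Q K^T / sqrt(d)) V."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def streamed_attention(Q, K, V, block=4):
    """Visit K/V in fixed-size blocks, keeping only a running
    (row max, softmax denominator, weighted value sum) per query row.
    The final output equals the monolithic result exactly."""
    n_q, d = Q.shape
    m = np.full(n_q, -np.inf)          # running row-wise score max
    s = np.zeros(n_q)                  # running softmax denominator
    o = np.zeros((n_q, V.shape[1]))    # running weighted value sum
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        Sb = Q @ Kb.T / np.sqrt(d)
        m_new = np.maximum(m, Sb.max(axis=-1))
        scale = np.exp(m - m_new)      # rescale earlier partial sums
        Pb = np.exp(Sb - m_new[:, None])
        s = s * scale + Pb.sum(axis=-1)
        o = o * scale[:, None] + Pb @ Vb
        m = m_new
    return o / s[:, None]

rng = np.random.default_rng(0)
Q = rng.standard_normal((6, 8))
K = rng.standard_normal((10, 8))
V = rng.standard_normal((10, 5))
assert np.allclose(full_attention(Q, K, V),
                   streamed_attention(Q, K, V, block=3))
```

Only one block of scores lives in memory at a time, which is what makes the peak footprint a function of the block size rather than the sequence length.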
Results & Findings

Experiment                    Sequence Length   GPU Memory Used    Speed (tokens/s)   Accuracy
Baseline (full attention)     64 K              OOM on 24 GB GPU   –                  –
Stream‑CQSA (budget 12 GB)    1 M               11.8 GB            1.2 K              Exact (0 % error)
Stream‑CQSA (budget 12 GB)    1 B               11.9 GB            0.8 K              Exact (0 % error)
  • Predictable memory scaling – Memory usage grows linearly with the chosen block size, not with the total sequence length.
  • Zero approximation error – Because the decomposition is mathematically exact, downstream model performance (e.g., perplexity on language modeling benchmarks) matches that of a naïve full‑attention run.
  • No extra hardware – All experiments run on a single NVIDIA A100 (40 GB) or even a 24 GB RTX‑3090, demonstrating that the method is practical for most research labs and even high‑end workstations.
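The scaling behavior follows from simple arithmetic: the score matrix materialized at any instant depends on the block size, never the sequence length. A back-of-the-envelope check (the block size here is hypothetical, not taken from the paper):

```python
GIB = 1024 ** 3

def score_matrix_bytes(rows, cols, dtype_bytes=4):
    """Bytes to materialize one fp32 attention-score matrix."""
    return rows * cols * dtype_bytes

seq_len = 1_000_000_000      # 1 B tokens, the paper's largest run
block   = 32_768             # hypothetical scheduler block size

full_gib  = score_matrix_bytes(seq_len, seq_len) / GIB   # ~3.7e9 GiB: hopeless
block_gib = score_matrix_bytes(block, block) / GIB       # 4.0 GiB, fixed
print(f"full: {full_gib:.2e} GiB, per-block: {block_gib:.1f} GiB")
```

Halving the block size quarters the per-block footprint, which is the knob the scheduler turns to meet a user-specified budget.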

Practical Implications

  • Long‑document processing – Applications such as legal contract analysis, scientific paper summarization, or code‑base understanding can now feed entire documents (hundreds of megabytes) into a transformer without truncation or chunk‑wise heuristics.
  • Cost‑effective scaling – Companies can avoid multi‑GPU clusters for inference on long contexts, reducing both hardware spend and engineering complexity.
  • Plug‑and‑play integration – Since Stream‑CQSA works with standard tensor ops, existing transformer codebases (e.g., Hugging Face Transformers, DeepSpeed) can adopt it by swapping the attention module and adding the scheduler wrapper.
  • Enabling new research – Researchers studying attention patterns over massive contexts (e.g., emergent reasoning, long‑range dependency probing) now have an exact tool that won’t be limited by memory.
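The scheduler wrapper mentioned above amounts to a memory-budgeted task queue over independent sub-attention tasks. A toy sketch in pure Python (the task representation and function name are hypothetical, not the paper's API):

```python
from collections import deque

def schedule(tasks, budget_bytes):
    """Greedy scheduler: group queued sub-attention tasks into "waves"
    whose combined estimated footprint fits the budget; tasks in a wave
    are independent, so they need no inter-task communication.
    `tasks` is a list of (name, est_bytes) pairs -- a hypothetical shape."""
    queue, order = deque(tasks), []
    while queue:
        wave, used = [], 0
        while queue and used + queue[0][1] <= budget_bytes:
            name, cost = queue.popleft()
            wave.append(name)
            used += cost
        if not wave:   # a single task exceeds the budget: shrink block size
            raise MemoryError("task larger than budget; reduce block size")
        order.append(wave)
    return order

waves = schedule([("a", 4), ("b", 4), ("c", 4), ("d", 4)], budget_bytes=8)
print(waves)   # [['a', 'b'], ['c', 'd']]
```

A production version would also overlap host/device transfers with compute, but the budget-respecting grouping is the core idea.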

Limitations & Future Work

  • Throughput trade‑off – Streaming introduces extra data movement between GPU and host memory, which can lower raw token‑per‑second throughput compared to a fully in‑GPU implementation on shorter sequences.
  • Scheduler overhead – The current heuristic for block sizing is simple; more sophisticated memory‑prediction models could further optimize performance.
  • Extension to sparse/approximate attention – While the method already achieves exactness, combining it with existing sparse‑attention kernels could yield even higher speedups for ultra‑long sequences.
  • Multi‑device orchestration – The authors plan to explore coordinated streaming across multiple GPUs or TPUs to handle workloads that exceed a single device’s compute capacity while still preserving the exactness guarantee.

Stream‑CQSA turns the “out‑of‑memory” error from a hard wall into a configurable resource knob, opening the door for developers to harness true long‑context attention without sacrificing accuracy.

Authors

  • Yiming Bian
  • Joshua M. Akey

Paper Information

  • arXiv ID: 2604.20819v1
  • Categories: cs.LG, cs.DC
  • Published: April 22, 2026
