[Paper] Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling
Source: arXiv - 2604.20819v1
Overview
The paper “Stream‑CQSA: Avoiding Out‑of‑Memory in Attention Computation via Flexible Workload Scheduling” tackles a core bottleneck of modern large‑language models (LLMs): the quadratic memory blow‑up of exact self‑attention when processing very long sequences. By reformulating attention as a set of independent sub‑computations that can be streamed on‑the‑fly, the authors show that you can run exact attention over billion‑token inputs on a single GPU without any approximation or costly inter‑GPU communication.
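To put the quadratic blow-up in numbers, here is a small back-of-the-envelope sketch. It counts only the score matrix for a single attention head in fp32 and ignores Q, K, V, activations, and framework overhead; those simplifications and both function names are assumptions made for illustration, not figures or code from the paper.

```python
import math

GIB = 1024 ** 3


def score_matrix_bytes(seq_len: int, bytes_per_element: int = 4) -> int:
    """Bytes needed to hold one full seq_len x seq_len score matrix."""
    return seq_len * seq_len * bytes_per_element


def largest_square_tile(budget_bytes: int, bytes_per_element: int = 4) -> int:
    """Largest block size b such that a b x b score tile fits in the budget.

    Hypothetical helper: it ignores every other tensor that shares the GPU.
    """
    return math.isqrt(budget_bytes // bytes_per_element)


if __name__ == "__main__":
    for n in (4_096, 65_536, 1_048_576):
        print(f"{n:>9} tokens -> {score_matrix_bytes(n) / GIB:,.2f} GiB")
    print("tile for a 12 GiB budget:", largest_square_tile(12 * GIB))
```

Under these assumptions, a 65,536-token sequence already needs 16 GiB for one head's scores, while the tile size that fits a fixed budget does not depend on sequence length at all.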
Key Contributions
- CQS Divide operation – a novel decomposition derived from cyclic quorum set (CQS) theory that splits full‑sequence attention into mathematically independent subsequence tasks, guaranteeing exact reconstruction of the original attention matrix.
- Stream‑CQSA framework – a memory‑adaptive scheduler that dynamically partitions the attention workload to fit any user‑specified GPU memory budget, turning attention into a streamable pipeline.
- Hardware‑agnostic execution – the approach works on a single device and does not rely on multi‑GPU sharding or custom kernels; it can be layered on top of existing transformer libraries.
- Empirical validation – experiments demonstrate memory use that scales predictably with the chosen block size rather than with sequence length, and successful execution of exact attention on sequences exceeding 1 billion tokens, all while keeping the same mathematical definition of attention.
Methodology
- Cyclic Quorum Set (CQS) Theory – In a quorum system, subsets of elements (quorums) intersect in a controlled way. The authors adapt this idea to the query, key, and value tensors of attention, constructing cyclic quorums that cover the sequence with overlapping blocks.
- CQS Divide – Using the quorums, the full attention output $A = \text{softmax}(QK^\top)V$ is expressed as a sum of smaller attention sub‑matrices, each computed on a sub‑sequence that fits in memory. Because the quorums are designed to cover every pair of positions exactly once, recombining the sub‑results yields the same output as the monolithic computation.
- Streaming Scheduler – Stream‑CQSA treats each sub‑attention as a task in a queue. The scheduler monitors GPU memory usage and streams tasks in/out, swapping intermediate tensors to host memory when needed. No inter‑task communication is required because each sub‑task is independent.
- Implementation – The authors built the pipeline on top of PyTorch, using standard tensor operations and CUDA streams. The only extra requirement is a lightweight controller that decides block sizes based on the available memory budget.
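The pair-coverage property that CQS Divide relies on can be illustrated with a toy cyclic quorum set. The sketch below assumes the sequence is cut into 7 blocks and uses the base quorum {0, 1, 3}, a perfect difference set modulo 7. The block count, base set, and function names are illustrative choices, not the construction or parameters used in the paper.

```python
from itertools import combinations


def cyclic_quorums(base, n):
    """Return the n cyclic shifts of a base quorum over block indices 0..n-1."""
    return [sorted((b + shift) % n for b in base) for shift in range(n)]


def pair_coverage(quorums, n):
    """Count how many quorums contain each unordered pair of distinct blocks."""
    counts = {pair: 0 for pair in combinations(range(n), 2)}
    for quorum in quorums:
        for pair in combinations(quorum, 2):
            counts[pair] += 1
    return counts


N_BLOCKS = 7
BASE = (0, 1, 3)  # assumed example: a perfect difference set modulo 7

quorums = cyclic_quorums(BASE, N_BLOCKS)
coverage = pair_coverage(quorums, N_BLOCKS)

if __name__ == "__main__":
    for i, quorum in enumerate(quorums):
        print(f"quorum {i}: blocks {quorum}")
    print("every distinct pair covered exactly once:",
          all(count == 1 for count in coverage.values()))
```

Each quorum holds only 3 of the 7 blocks, yet every pair of distinct blocks meets in exactly one quorum, so the work for that pair can be assigned to a single independent task. Same-block pairs (i, i) appear in every quorum that contains block i, so a real scheduler needs a rule for assigning them once; how the paper handles this is not covered in this summary.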
Results & Findings
| Experiment | Sequence Length | GPU Memory Used | Speed (tokens/s) | Accuracy |
|---|---|---|---|---|
| Baseline (full attention) | 64 K | OOM on 24 GB GPU | – | – |
| Stream‑CQSA (budget 12 GB) | 1 M | 11.8 GB | 1.2 K | Exact (0 % error) |
| Stream‑CQSA (budget 12 GB) | 1 B | 11.9 GB | 0.8 K | Exact (0 % error) |
- Predictable memory scaling – Memory usage grows linearly with the chosen block size, not with the total sequence length.
- Zero approximation error – Because the decomposition is mathematically exact, downstream model performance (e.g., perplexity on language modeling benchmarks) matches that of a naïve full‑attention run.
- No extra hardware – All experiments run on a single NVIDIA A100 (40 GB) or even a 24 GB RTX 3090, demonstrating that the method is practical for most research labs and even high‑end workstations.
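The exactness and memory claims above can be checked on a toy problem. The NumPy sketch below streams attention one score tile at a time and merges the partial results with a running softmax normalizer. This is the standard online-softmax technique, used here as a stand-in; it is not the paper's CQS Divide, and the function names are hypothetical. It follows the summary's formula, softmax(QK^T)V, without a scaling factor.

```python
import numpy as np


def full_attention(q, k, v):
    """Reference: softmax(q @ k.T) @ v with the whole score matrix in memory."""
    scores = q @ k.T
    scores -= scores.max(axis=1, keepdims=True)  # stabilize the exponentials
    weights = np.exp(scores)
    return (weights / weights.sum(axis=1, keepdims=True)) @ v


def streamed_attention(q, k, v, block):
    """Same result, holding at most one (block x block) score tile at a time."""
    n = q.shape[0]
    out = np.empty((n, v.shape[1]))
    for qs in range(0, n, block):
        q_blk = q[qs:qs + block]
        rows = q_blk.shape[0]
        running_max = np.full((rows, 1), -np.inf)
        running_sum = np.zeros((rows, 1))
        acc = np.zeros((rows, v.shape[1]))
        for ks in range(0, n, block):
            tile = q_blk @ k[ks:ks + block].T
            new_max = np.maximum(running_max, tile.max(axis=1, keepdims=True))
            rescale = np.exp(running_max - new_max)  # re-base earlier partials
            p = np.exp(tile - new_max)
            running_sum = running_sum * rescale + p.sum(axis=1, keepdims=True)
            acc = acc * rescale + p @ v[ks:ks + block]
            running_max = new_max
        out[qs:qs + block] = acc / running_sum
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = rng.standard_normal((3, 50, 8))
    same = np.allclose(full_attention(q, k, v),
                       streamed_attention(q, k, v, block=16))
    print("streamed result matches full attention:", same)
```

Peak score memory here depends on `block`, not on the sequence length, which mirrors the scaling behavior reported above. In floating point the two results agree to rounding error rather than bit for bit.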
Practical Implications
- Long‑document processing – Applications such as legal contract analysis, scientific paper summarization, or code‑base understanding can now feed entire documents (hundreds of megabytes) into a transformer without truncation or chunk‑wise heuristics.
- Cost‑effective scaling – Companies can avoid multi‑GPU clusters for inference on long contexts, reducing both hardware spend and engineering complexity.
- Plug‑and‑play integration – Since Stream‑CQSA works with standard tensor ops, existing transformer codebases (e.g., Hugging Face Transformers, DeepSpeed) can adopt it by swapping the attention module and adding the scheduler wrapper.
- Enabling new research – Researchers studying attention patterns over massive contexts (e.g., emergent reasoning, long‑range dependency probing) now have an exact tool that won’t be limited by memory.
Limitations & Future Work
- Throughput trade‑off – Streaming introduces extra data movement between GPU and host memory, which can lower raw tokens‑per‑second throughput compared to a fully in‑GPU implementation on shorter sequences.
- Scheduler overhead – The current heuristic for block sizing is simple; more sophisticated memory‑prediction models could further optimize performance.
- Extension to sparse/approximate attention – The method itself is exact; combining its scheduling with existing sparse‑attention kernels could yield higher throughput for ultra‑long sequences.
- Multi‑device orchestration – The authors plan to explore coordinated streaming across multiple GPUs or TPUs to handle workloads that exceed a single device’s compute capacity while still preserving the exactness guarantee.
Stream‑CQSA turns the “out‑of‑memory” error from a hard wall into a configurable resource knob, opening the door for developers to harness true long‑context attention without sacrificing accuracy.
Authors
- Yiming Bian
- Joshua M. Akey
Paper Information
- arXiv ID: 2604.20819v1
- Categories: cs.LG, cs.DC
- Published: April 22, 2026