[Paper] AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference
As large language models scale to longer contexts, loading the growing KV cache during attention computation becomes a critical bottleneck. Previous work has sh...