[Paper] LycheeCluster: Efficient Long-Context Inference with Structure-Aware Chunking and Hierarchical KV Indexing

Published: March 9, 2026 at 10:50 AM EDT
4 min read
Source: arXiv - 2603.08453v1

Overview

Large Language Models (LLMs) struggle with long‑context inference because self‑attention scales quadratically with sequence length and the Key‑Value (KV) cache that stores the keys and values of past tokens quickly eats up GPU memory. LycheeCluster introduces a smarter way to chunk and index the KV cache, keeping semantic chunks intact while turning costly linear scans into fast, logarithmic‑time look‑ups. The result is up to 3.6× faster end‑to‑end inference with almost no loss in answer quality.
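The chunking idea can be illustrated with a toy boundary detector. Everything here is an illustrative assumption rather than the paper's actual detector: the running‑mean summary, the pure‑Python cosine similarity, and the drift threshold are all placeholders for whatever lightweight model the authors actually use.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def chunk_boundaries(embeddings, drift_threshold=0.5):
    """Start a new chunk wherever the next token's embedding drifts
    too far (low cosine similarity) from the running mean of the
    current chunk. Returns the start index of each chunk."""
    boundaries = [0]
    mean = list(embeddings[0])
    count = 1
    for i, emb in enumerate(embeddings[1:], start=1):
        if cosine(mean, emb) < drift_threshold:
            boundaries.append(i)            # semantic drift -> new chunk
            mean, count = list(emb), 1
        else:
            count += 1                      # fold token into running mean
            mean = [(m * (count - 1) + e) / count
                    for m, e in zip(mean, emb)]
    return boundaries
```

On a stream whose embeddings flip from one direction to another, the detector places exactly one boundary at the flip, rather than cutting at an arbitrary fixed offset.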

Key Contributions

  • Boundary‑aware chunking: Dynamically splits the context into semantically coherent chunks instead of naïve fixed‑size windows.
  • Hierarchical KV index: Builds a recursive tree structure based on the triangle inequality, enabling logarithmic‑time pruning of irrelevant cache entries.
  • Lazy update mechanism: Allows the index to be refreshed incrementally during streaming generation, avoiding full rebuilds.
  • Empirical gains: Demonstrates up to 3.6× speedup on standard long‑context benchmarks while matching or slightly improving perplexity and downstream task scores compared with prior KV‑cache tricks (Quest, ClusterKV).
  • Open‑source release: Plans to publish the implementation and custom CUDA kernels, facilitating adoption in existing LLM serving stacks.

Methodology

  1. Structure‑aware chunking

    • The input token stream is scanned with a lightweight semantic detector (e.g., a shallow transformer or cosine similarity on embeddings).
    • Chunk boundaries are placed where the semantic drift exceeds a threshold, preserving local coherence.
  2. Hierarchical KV indexing

    • Each chunk’s KV vectors are summarized by a compact centroid.
    • Centroids are organized into a binary tree; because the underlying distance is a metric, the triangle inequality gives a lower bound on the distance from a query to every KV vector inside a subtree using the subtree's centroid alone.
    • During inference, the query vector traverses the tree, discarding whole sub‑trees whose centroids are provably too far, reducing the candidate set from O(N) to O(log N).
  3. Lazy updates for streaming

    • When new tokens are generated, only the leaf node (the most recent chunk) is updated.
    • Upper‑level centroids are recomputed lazily on demand, amortizing the cost over multiple generation steps.
  4. Integration with existing LLM pipelines

    • The approach plugs into the standard attention cache interface; no changes to model weights or training are required.
    • Custom CUDA kernels accelerate the distance calculations and tree traversal.
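Steps 2 and 3 can be sketched as a small metric tree: each node stores a centroid and a radius covering everything beneath it, and the triangle inequality (dist(q, x) ≥ dist(q, c) − r for any x in the subtree) lets whole subtrees be pruned. The Euclidean metric, the pairwise bottom‑up build, and the toy chunk sizes below are illustrative assumptions; the paper's actual construction, lazy‑update schedule, and CUDA kernels are not reproduced here.

```python
import math

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

class Node:
    def __init__(self, centroid, radius, chunks, left=None, right=None):
        self.centroid, self.radius = centroid, radius
        self.chunks, self.left, self.right = chunks, left, right

def build(chunks):
    """chunks: list of (chunk_id, [kv_vector, ...]).
    Builds leaves (centroid + covering radius per chunk), then merges
    pairs bottom-up into a binary tree of centroids."""
    nodes = []
    for cid, vecs in chunks:
        c = mean(vecs)
        nodes.append(Node(c, max(dist(c, v) for v in vecs), [cid]))
    while len(nodes) > 1:
        merged = []
        for i in range(0, len(nodes) - 1, 2):
            a, b = nodes[i], nodes[i + 1]
            c = mean([a.centroid, b.centroid])
            r = max(dist(c, a.centroid) + a.radius,   # radius must cover
                    dist(c, b.centroid) + b.radius)   # both children
            merged.append(Node(c, r, a.chunks + b.chunks, a, b))
        if len(nodes) % 2:
            merged.append(nodes[-1])                  # odd node carried up
        nodes = merged
    return nodes[0]

def query(node, q, tau, out):
    """Collect ids of chunks that may hold a KV vector within tau of q.
    If dist(q, centroid) - radius > tau, the triangle inequality proves
    every vector in the subtree is farther than tau, so it is pruned."""
    if dist(q, node.centroid) - node.radius > tau:
        return                                        # prune whole subtree
    if node.left is None:                             # leaf: candidate chunk
        out.extend(node.chunks)
        return
    query(node.left, q, tau, out)
    query(node.right, q, tau, out)
```

With well‑separated chunks, a query descends only one branch per level, which is where the O(log N) candidate‑set behaviour comes from; the lazy‑update step would then recompute only the centroids on the path from the newest leaf to the root.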

Results & Findings

| Model / Setting | Baseline (no KV tricks) | Quest | ClusterKV | LycheeCluster |
| --- | --- | --- | --- | --- |
| GPT‑2‑XL (1.5B), 8 k‑token context (speedup) | 1.0× | 1.8× | 2.4× | 3.6× |
| Perplexity (long‑context WikiText) | 12.3 | 12.5 | 12.4 | 12.4 |
| Retrieval‑augmented QA (accuracy) | 78.1 % | 77.9 % | 78.0 % | 78.2 % |
  • Speed: The hierarchical index cuts the number of KV look‑ups dramatically, especially as context length grows beyond 4 k tokens.
  • Memory: Chunk‑level centroids add negligible overhead (<0.5 % of total cache size).
  • Quality: Because chunks respect semantic boundaries, the model’s attention distribution remains faithful, leading to virtually unchanged perplexity and task performance.
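The sub‑0.5 % memory figure is easy to sanity‑check with back‑of‑the‑envelope numbers: one summary centroid per chunk against the chunk's full key and value storage. The chunk size below is an assumption for illustration; the paper's exact configuration may differ.

```python
# Illustrative sanity check of the reported <0.5 % centroid overhead.
chunk_tokens = 256        # assumed tokens per semantic chunk (not from paper)
vectors_per_token = 2     # one key vector + one value vector per token
centroids_per_chunk = 1   # one summary centroid per chunk

overhead = centroids_per_chunk / (chunk_tokens * vectors_per_token)
print(f"centroid overhead: {overhead:.2%}")   # ~0.20%, under the 0.5% bound
```

Even with chunks as small as 128 tokens the same arithmetic stays below 0.5 %, so the claim is plausible across reasonable chunk sizes.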

Practical Implications

  • LLM serving platforms (e.g., Azure OpenAI, Hugging Face Inference) can integrate LycheeCluster to lower GPU memory pressure, enabling higher batch sizes or longer prompts on the same hardware.
  • Chatbot and virtual‑assistant pipelines that need to retain conversation history (often >10 k tokens) can now do so without prohibitive latency.
  • Edge‑device inference: The reduced memory footprint makes it feasible to run medium‑size LLMs with long contexts on consumer GPUs or even high‑end mobile chips.
  • Cost savings: Faster inference translates directly into lower cloud compute bills; a 3.6× speedup can cut the GPU‑hours needed for long‑context workloads to under a third.
  • Developer ergonomics: Because LycheeCluster works as a drop‑in cache manager, existing codebases need only swap the KV cache implementation—no retraining or model‑architecture changes.
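The drop‑in property might look like the following. LycheeCluster's real interface has not been published yet, so the class and method names here are hypothetical: the point is only that a structure‑aware cache can expose the same surface as a naive one, so serving code written against the baseline keeps working unchanged.

```python
class KVCache:
    """Hypothetical baseline cache: keep every entry, scan every entry."""
    def __init__(self):
        self.entries = []                       # (key_vec, value_vec) pairs

    def append(self, key, value):
        self.entries.append((key, value))

    def candidates(self, query):
        """O(N) linear scan: every cached entry is a candidate."""
        return list(range(len(self.entries)))

class LycheeStyleCache(KVCache):
    """Same interface; lookup would be narrowed by a hierarchical index.
    The index traversal is stubbed out here -- a real implementation
    would walk the centroid tree and return only non-pruned chunks."""
    def candidates(self, query):
        return self._index_lookup(query)

    def _index_lookup(self, query):
        # Stub: no pruning. Placeholder for tree traversal.
        return list(range(len(self.entries)))
```

Because `LycheeStyleCache` only overrides `candidates`, any code that appends to and queries a `KVCache` can be handed the indexed version with no other changes, which is the "swap the KV cache implementation" claim in concrete form.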

Limitations & Future Work

  • Chunk detection overhead: The semantic boundary detector adds a small constant cost; for very short prompts the benefit may not outweigh this overhead.
  • Tree balance: In highly irregular token streams the hierarchical tree can become unbalanced, slightly degrading the logarithmic guarantee. The authors suggest adaptive rebalancing as a future improvement.
  • Generalization to multimodal models: The current design assumes pure text KV vectors; extending the indexing to vision‑language or audio‑language models will require additional research.
  • Open‑source timeline: The promised code release is pending publication, so immediate adoption depends on the authors’ follow‑through.

LycheeCluster shows that clever data structures—borrowed from classic nearest‑neighbor search—can unlock substantial performance gains for modern LLMs without sacrificing accuracy, making long‑context inference a practical reality for developers today.

Authors

  • Dongfang Li
  • Zixuan Liu
  • Gang Lin
  • Baotian Hu
  • Min Zhang

Paper Information

  • arXiv ID: 2603.08453v1
  • Categories: cs.LG, cs.AI, cs.CL
  • Published: March 9, 2026
