[Paper] DCO: Dynamic Cache Orchestration for LLM Accelerators through Predictive Management

Published: December 8, 2025 at 03:56 AM EST
4 min read
Source: arXiv - 2512.07312v1

Overview

Large language models (LLMs) are driving AI accelerators toward ever‑more complex memory hierarchies. This paper flips the script: instead of adding deeper, hard‑to‑manage scratchpad memories, the authors propose a shared system‑level cache that is dynamically orchestrated using information from the software stack. By making the cache “aware” of the dataflow in LLM inference workloads, they achieve up to 1.8× speedup with modest hardware overhead, offering a practical path for next‑generation AI chips.

Key Contributions

  • Predictive cache replacement that leverages compile‑time dataflow graphs to anticipate dead blocks and evict them early (a minimal code sketch of this idea follows the list).
  • Application‑aware bypass logic that decides, on a per‑access basis, whether data should skip the cache entirely, reducing unnecessary traffic.
  • Thrashing mitigation mechanisms that detect and break harmful access patterns across cores, preserving cache usefulness even under high contention.
  • Cycle‑accurate simulation + analytical model that together validate the approach on both small‑scale benchmarks and extrapolated large‑scale LLM workloads.
  • RTL prototype synthesized in a 15 nm process (0.064 mm², 2 GHz), demonstrating that the added control logic fits comfortably within a modern accelerator floorplan.
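
To make the predictive replacement idea concrete, here is a minimal sketch of how dead‑block hints could be derived from a compile‑time dataflow graph. The data structures and names (`TensorUse`, `last_use_table`) are illustrative assumptions for this summary, not the paper's actual interface.

```python
# Minimal sketch (not the paper's interface): deriving dead-block hints from a
# compile-time dataflow graph. TensorUse and last_use_table are hypothetical.
from dataclasses import dataclass

@dataclass
class TensorUse:
    tensor: str   # tensor identifier emitted by the compiler
    step: int     # position of the consuming operator in the schedule

def last_use_table(uses: list[TensorUse]) -> dict[str, int]:
    """Map each tensor to the schedule step of its final consumer."""
    table: dict[str, int] = {}
    for u in uses:
        table[u.tensor] = max(table.get(u.tensor, -1), u.step)
    return table

def is_dead(tensor: str, current_step: int, last_use: dict[str, int]) -> bool:
    """A cached block is 'dead' once its tensor's final consumer has executed."""
    return current_step > last_use.get(tensor, -1)

# Example: attention scores are last consumed at step 3, so at step 4 their
# cache lines can be evicted immediately instead of aging out under LRU.
uses = [TensorUse("q_proj", 2), TensorUse("attn_scores", 3), TensorUse("q_proj", 3)]
lu = last_use_table(uses)
assert is_dead("attn_scores", 4, lu) and not is_dead("q_proj", 3, lu)
```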

Methodology

  1. Dataflow Extraction – The compiler (or a lightweight runtime) emits a graph describing which tensors are produced, consumed, and for how long they remain useful.
  2. Cache Policy Engine – A small hardware unit (sketched in code after this list) reads the graph metadata at runtime and:
    • Marks blocks that will never be reused (dead‑block prediction) for immediate eviction.
    • Issues bypass signals for streams that are either one‑time reads or have a predictable reuse distance beyond the cache capacity.
    • Monitors per‑core access counters to spot thrashing (e.g., two cores repeatedly evicting each other’s hot lines) and temporarily pins hot lines.
  3. Simulation Framework – A cycle‑accurate accelerator simulator models a multi‑core LLM inference engine with a shared L2 cache. The authors compare three configurations: (a) vanilla LRU as the baseline, (b) LRU + bypass, and (c) full DCO (bypass + thrashing mitigation + dead‑block prediction).
  4. Analytical Extension – Using measured miss/hit rates, they build a queuing‑theoretic model that predicts performance for larger models (e.g., 175 B parameters) where full simulation would be prohibitive.
  5. RTL Implementation – The policy engine is synthesized to verify area, timing, and power impact on a real silicon process.
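
For readability, the sketch below expresses the per‑access decisions of the policy engine (step 2) in Python rather than RTL. The thresholds, field names, and reuse‑distance hint format are assumptions made for illustration and are not taken from the paper.

```python
# Illustrative sketch of the per-access decisions described in step 2; the
# thresholds and hint fields are placeholder assumptions, not the paper's RTL.
from typing import Optional

CACHE_LINES = 4096        # shared-cache capacity in lines (assumed)
THRASH_THRESHOLD = 8      # cross-core evictions of a set before pinning (assumed)

def should_bypass(reuse_distance: Optional[int], one_time_read: bool) -> bool:
    """Bypass one-time streams and reuse distances beyond cache capacity."""
    if one_time_read:
        return True
    return reuse_distance is not None and reuse_distance > CACHE_LINES

class ThrashMonitor:
    """Counts cross-core evictions per set and flags sets whose hot lines
    should be temporarily pinned when cores start evicting each other."""
    def __init__(self) -> None:
        self.cross_core_evictions: dict[int, int] = {}

    def on_eviction(self, set_index: int, evicting_core: int, owner_core: int) -> bool:
        if evicting_core == owner_core:
            return False
        count = self.cross_core_evictions.get(set_index, 0) + 1
        self.cross_core_evictions[set_index] = count
        return count >= THRASH_THRESHOLD  # True => pin this set's hot lines for a while
```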

Results & Findings

| Configuration | Speedup vs. Baseline LRU | Cache Miss Reduction | Area Overhead |
| --- | --- | --- | --- |
| Bypass only | 1.22× | 15 % | 0.018 mm² |
| Thrashing mitigation | 1.35× | 22 % | 0.025 mm² |
| Full DCO (bypass + thrashing + dead‑block) | 1.80× | 38 % | 0.064 mm² |

  • Dead‑block prediction alone cuts unnecessary evictions by ~12 % on average.
  • Bypass decisions dramatically lower bandwidth pressure on the shared cache, especially for large embedding look‑ups that are streamed once.
  • Thrashing mitigation shines when multiple cores share intermediate activations (e.g., transformer layers), preventing ping‑pong evictions.
  • The analytical model predicts >1.5× speedup for 100‑B‑parameter models, confirming scalability (a simplified illustration of this kind of extrapolation follows the list).
  • Power impact is modest: the extra control logic adds < 2 % to total accelerator power at 2 GHz.
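
The paper's queuing‑theoretic model is not reproduced in this summary, so the sketch below uses a much simpler average‑memory‑access‑time (AMAT) style estimate to show how a measured miss‑rate reduction can be turned into a speedup prediction. The latencies and memory‑bound fraction are placeholder assumptions, and the output is not expected to match the paper's numbers.

```python
# Simplified, AMAT-style speedup estimate from a cache-miss reduction. This is
# NOT the paper's queuing-theoretic model; all constants are placeholders.
def speedup_from_miss_reduction(base_miss_rate: float, miss_reduction: float,
                                hit_cycles: float = 10.0, miss_cycles: float = 200.0,
                                memory_fraction: float = 0.7) -> float:
    new_miss_rate = base_miss_rate * (1.0 - miss_reduction)
    amat_base = hit_cycles + base_miss_rate * miss_cycles
    amat_new = hit_cycles + new_miss_rate * miss_cycles
    # Amdahl-style combination: only the memory-bound fraction of time improves.
    time_new = (1.0 - memory_fraction) + memory_fraction * (amat_new / amat_base)
    return 1.0 / time_new

# e.g. a 38 % miss reduction on an assumed 40 % baseline miss rate:
print(round(speedup_from_miss_reduction(0.40, 0.38), 2))
```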

Practical Implications

  • Simpler Software Stacks – Developers can keep a single shared cache instead of juggling multiple private scratchpads, reducing the need for hand‑tuned memory tiling.
  • Portability Across Accelerators – The policy engine is lightweight enough to be integrated into existing GPU‑like or TPU‑like cores without redesigning the memory hierarchy.
  • Better Multi‑Tenant Utilization – In cloud inference services where many requests share the same hardware, DCO’s dynamic throttling of thrashing improves overall throughput and latency predictability.
  • Compiler‑Driven Optimizations – Existing ML compilers (TVM, XLA) can emit the required dataflow hints with minimal changes, enabling automatic adoption (one hypothetical hint format is sketched after the list).
  • Future Chip‑Scale Designs – The modest area cost (≈ 0.064 mm²) leaves room for additional compute units or larger caches, making DCO a viable building block for next‑gen AI ASICs.
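
As an illustration of what compiler‑emitted dataflow hints might look like, the snippet below serializes per‑tensor metadata that a runtime could hand to the policy engine. The schema is invented for this sketch; it is not an existing TVM or XLA feature.

```python
# Hypothetical hint format a compiler pass could attach to a compiled graph for
# the cache policy engine. The schema is invented for illustration only.
import json

hints = {
    "tensors": [
        {"name": "embedding_table", "bytes": 512 * 1024 * 1024,
         "access": "stream_once", "policy": "bypass"},
        {"name": "layer0_activations", "bytes": 8 * 1024 * 1024,
         "last_use_step": 12, "policy": "cache_until_last_use"},
    ]
}

# The runtime would register this metadata before launching the inference graph.
print(json.dumps(hints, indent=2))
```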

Limitations & Future Work

  • Static Dataflow Assumptions – The approach relies on accurate compile‑time graphs; highly dynamic models (e.g., runtime‑generated control flow) may reduce prediction accuracy.
  • Scalability to Hundreds of Cores – The current evaluation caps at a modest core count; further study is needed to ensure the policy engine doesn’t become a bottleneck in massive many‑core chips.
  • Energy Modeling – While area and timing are measured, a full energy‑per‑operation analysis (especially for bypass paths) is left for future silicon validation.
  • Integration with Existing Cache Coherence Protocols – The paper focuses on a single shared cache; extending DCO to hierarchical or coherent multi‑level caches will require additional protocol tweaks.

Overall, DCO demonstrates that smarter, software‑aware cache management can deliver big performance wins for LLM accelerators without the engineering overhead of deep scratchpad hierarchies—a compelling direction for both chip designers and AI developers.

Authors

  • Zhongchun Zhou
  • Chengtao Lai
  • Yuhang Gu
  • Wei Zhang

Paper Information

  • arXiv ID: 2512.07312v1
  • Categories: cs.AR, cs.AI, cs.DC
  • Published: December 8, 2025
