[Paper] Prefill vs. Decode Bottlenecks: SRAM-Frequency Tradeoffs and the Memory-Bandwidth Ceiling

Published: December 26, 2025 at 10:42 AM EST
4 min read
Source: arXiv - 2512.22066v1

Overview

Large Language Model (LLM) inference is notoriously power‑hungry, and the design of on‑chip memory (SRAM) and clock speed can make a huge difference in both cost and carbon footprint. This paper dissects how SRAM capacity and operating frequency affect the two distinct phases of LLM inference—prefill (compute‑heavy) and decode (memory‑heavy)—and pinpoints a sweet spot that minimizes the energy‑delay product for datacenter‑scale accelerators.

Key Contributions

  • Dual‑phase analysis: Separates the energy‑performance trade‑offs of the compute‑bound prefill stage from the memory‑bound decode stage.
  • SRAM‑size impact: Shows that larger on‑chip buffers increase static (leakage) energy far more than they reduce latency, making small buffers (32‑64 KB) optimal.
  • Frequency‑bandwidth ceiling: Demonstrates that raising the compute clock helps prefill latency but quickly hits a ceiling in decode because external memory bandwidth becomes the bottleneck.
  • Energy‑delay product (EDP) optimum: Identifies a hardware configuration (1200‑1400 MHz, 32‑64 KB SRAM) that delivers the lowest EDP for the evaluated workloads.
  • Methodology integration: Combines OpenRAM (energy), LLMCompass (latency), and ScaleSIM (systolic‑array intensity) into a unified simulation stack, enabling reproducible architectural exploration.

Methodology

  1. Energy modeling with OpenRAM

    • Parameterized SRAM cells (size, voltage, temperature) to estimate dynamic switching energy and static leakage.
  2. Latency simulation via LLMCompass

    • Executes representative transformer workloads (prefill and decode) on a cycle‑accurate model of a systolic array, capturing compute stalls and memory accesses.
  3. Operational intensity from ScaleSIM

    • Calculates the ratio of arithmetic operations to memory traffic for each layer and feeds it into a roofline model to locate compute‑ vs. memory‑bound regimes (a minimal roofline sketch follows this list).
  4. Design space sweep

    • Varies SRAM capacity (8 KB–256 KB) and clock frequency (800 MHz–1500 MHz) across the two phases, recording total energy, latency, and the resulting EDP (a toy version of this sweep is sketched after the bandwidth note below).
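
As a concrete illustration of the roofline step, the sketch below computes per‑phase operational intensity and caps attainable throughput at min(peak compute, bandwidth × intensity). The peak‑FLOPS figure and the per‑phase FLOP/byte counts are illustrative assumptions, not numbers from the paper; only the 400 GB/s bandwidth mirrors the paper's setup.

```python
# Minimal roofline sketch: locate compute- vs. memory-bound regimes from
# operational intensity. All numeric values are illustrative assumptions,
# not figures from the paper.

PEAK_FLOPS = 100e12   # assumed accelerator peak throughput, FLOP/s
DRAM_BW = 400e9       # external DRAM bandwidth from the paper's setup, B/s

def attainable(op_intensity):
    """Classic roofline: performance is capped by compute or by bandwidth."""
    return min(PEAK_FLOPS, DRAM_BW * op_intensity)

def classify(flops, bytes_moved):
    oi = flops / bytes_moved            # operational intensity, FLOP/byte
    ridge = PEAK_FLOPS / DRAM_BW        # OI where the two roofs intersect
    regime = "compute-bound" if oi >= ridge else "memory-bound"
    return oi, attainable(oi), regime

# Hypothetical per-phase numbers for one transformer layer: prefill reuses
# weights across many prompt tokens (high OI), decode streams the weights
# for a single token (low OI).
for phase, flops, moved in [("prefill", 8e12, 2e10), ("decode", 2e10, 2e10)]:
    oi, perf, regime = classify(flops, moved)
    print(f"{phase}: OI = {oi:.1f} FLOP/B, "
          f"attainable = {perf / 1e12:.1f} TFLOP/s ({regime})")
```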

All simulations run on a fixed external DRAM bandwidth (≈ 400 GB/s), mirroring typical datacenter GPU/TPU interconnects.
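
A toy version of the sweep can be written as a grid search over (SRAM size, frequency) pairs that minimizes EDP. The analytical latency and energy models below are hypothetical stand‑ins for the OpenRAM/LLMCompass/ScaleSIM outputs, so the printed optimum is only a property of the toy coefficients, not a reproduction of the paper's result.

```python
# Hypothetical design-space sweep over SRAM capacity and clock frequency,
# minimizing the energy-delay product (EDP). The latency and energy models
# are illustrative stand-ins, not the paper's data.
from itertools import product

DRAM_BW = 400e9                 # B/s, fixed external bandwidth as in the paper

def phase_latency(freq_mhz, work_flops, traffic_bytes, sram_kb):
    """Roofline-style phase time: compute time scales with 1/f, memory time
    with DRAM bandwidth; larger SRAM gives a small (assumed) reuse bonus."""
    reuse = 1.0 + 0.05 * (sram_kb / 64.0)             # illustrative assumption
    compute_t = work_flops / (64 * freq_mhz * 1e6)    # 64 FLOP/cycle (assumed)
    memory_t = (traffic_bytes / reuse) / DRAM_BW
    return max(compute_t, memory_t)

def edp(sram_kb, freq_mhz):
    # Prefill is compute-heavy, decode is memory-heavy (hypothetical magnitudes).
    t_prefill = phase_latency(freq_mhz, work_flops=2e13, traffic_bytes=1e10,
                              sram_kb=sram_kb)
    t_decode = phase_latency(freq_mhz, work_flops=2e11, traffic_bytes=8e10,
                             sram_kb=sram_kb)
    t = t_prefill + t_decode
    p_dynamic = 5.0 * (freq_mhz / 1000.0) ** 2        # W, grows with clock
    p_leakage = 2.0 + 0.1 * sram_kb                   # W, grows with SRAM size
    return (p_dynamic + p_leakage) * t * t            # energy * delay

best = min(product([8, 16, 32, 64, 128, 256], range(800, 1501, 100)),
           key=lambda cfg: edp(*cfg))
print(f"lowest-EDP point in the toy model: {best[0]} KB SRAM @ {best[1]} MHz")
```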

Results & Findings

| Configuration | Prefill Latency | Decode Latency | Total Energy | EDP (Energy × Delay) |
| --- | --- | --- | --- | --- |
| 32 KB SRAM, 1300 MHz | ↓ 18 % vs. 256 KB | Near‑optimal (bandwidth‑limited) | Minimal (leakage cut) | Best |
| 256 KB SRAM, 1300 MHz | Slightly lower latency | Negligible gain (still bandwidth‑bound) | ↑ 45 % (leakage) | Worse |
| 64 KB SRAM, 900 MHz | Higher latency | Bandwidth ceiling reached earlier | ↑ 30 % | Worse |

  • Static energy dominates: Larger buffers add up to 40 % more leakage without proportionate latency reduction.
  • Frequency benefits plateau: Above ~1.2 GHz, prefill speeds up, but decode latency flattens because the external memory cannot feed data any faster.
  • Counter‑intuitive energy win: The higher dynamic power of a faster clock is outweighed by the reduction in static energy (shorter execution → less leakage); a back‑of‑the‑envelope sketch below illustrates this.

The authors also plotted a roofline diagram confirming that decode quickly becomes memory‑bound, regardless of compute frequency.
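
The race‑to‑idle arithmetic behind that last bullet is easy to sanity‑check: leakage power is paid for the entire runtime, so finishing a compute‑bound phase sooner can save more static energy than the faster clock adds in dynamic energy. The power and cycle‑count numbers below are made up for illustration, not taken from the paper.

```python
# Back-of-the-envelope check of the "faster clock can save energy" effect.
# All numbers are illustrative assumptions.

def total_energy(freq_ghz, work_cycles=1.3e9, p_leak_w=2.0):
    runtime_s = work_cycles / (freq_ghz * 1e9)   # compute-bound phase
    p_dyn_w = 4.0 * freq_ghz                     # dynamic power ~ f (assumed)
    return (p_dyn_w + p_leak_w) * runtime_s      # leakage paid for full runtime

for f in (0.9, 1.3):
    print(f"{f} GHz: {total_energy(f):.2f} J total")
```

This only helps while the phase is compute‑bound: once decode hits the bandwidth ceiling, runtime stops shrinking with frequency and the extra dynamic power is pure overhead, which matches the plateau above ~1.2 GHz described earlier.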

Practical Implications

  • Accelerator designers: When sizing on‑chip SRAM for LLM inference, aim for the 32‑64 KB range instead of the commonly used megabyte‑scale buffers. This cuts leakage power dramatically while keeping latency acceptable.
  • Datacenter operators: Deploying chips that run at ~1.3 GHz can lower the overall energy bill, even though they consume more instantaneous power, because jobs finish sooner and the system spends less time in idle/leakage mode.
  • Software stack: Frameworks can expose a “prefill‑decode” mode switch, allowing the scheduler to boost frequency only during the prefill phase and throttle back during decode, extracting the same EDP gains without hardware changes (a sketch of such a policy follows this list).
  • Memory subsystem planning: Since external bandwidth is the ultimate ceiling, investing in higher‑bandwidth DRAM (e.g., HBM2e) or smarter data‑reuse schemes (e.g., activation recomputation) yields larger performance returns than simply cranking up the compute clock.
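
As a rough sketch of that prefill/decode mode switch, a serving loop could request a higher DVFS state only for the prompt pass and drop back for token generation. The `set_frequency_mhz` hook and the `prefill`/`decode_step` methods below are hypothetical placeholders for whatever power‑management and model interfaces a real deployment exposes; they are not APIs described in the paper.

```python
# Sketch of a phase-aware DVFS policy for an inference serving loop.
# `set_frequency_mhz`, `prefill`, and `decode_step` are hypothetical
# stand-ins, not APIs from the paper or any particular framework.

PREFILL_MHZ = 1300   # boost while the prompt pass is compute-bound
DECODE_MHZ = 1000    # throttle once memory bandwidth is the limiter

def set_frequency_mhz(mhz: int) -> None:
    # Placeholder for the platform's real power-management call.
    print(f"[dvfs] requesting {mhz} MHz")

def generate(model, prompt_tokens, max_new_tokens):
    # Prefill: one compute-heavy pass over the whole prompt.
    set_frequency_mhz(PREFILL_MHZ)
    kv_cache = model.prefill(prompt_tokens)

    # Decode: token-by-token generation dominated by weight/KV-cache traffic.
    set_frequency_mhz(DECODE_MHZ)
    tokens = []
    for _ in range(max_new_tokens):
        token, kv_cache = model.decode_step(kv_cache)
        tokens.append(token)
    return tokens

class _ToyModel:
    """Trivial stub so the sketch runs end to end."""
    def prefill(self, prompt_tokens):
        return {"pos": len(prompt_tokens)}    # pretend KV cache
    def decode_step(self, kv_cache):
        kv_cache["pos"] += 1
        return kv_cache["pos"], kv_cache      # pretend next token

if __name__ == "__main__":
    print(generate(_ToyModel(), prompt_tokens=[1, 2, 3], max_new_tokens=4))
```

One natural refinement is to pick the decode clock just high enough to keep the external memory system saturated, since anything beyond that only adds dynamic power.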

Overall, the paper provides a concrete rule‑of‑thumb: “small SRAM + high frequency = best energy‑delay trade‑off for LLM inference.”

Limitations & Future Work

  • Fixed external bandwidth: The study assumes a single DRAM bandwidth value; real systems may have heterogeneous memory hierarchies (HBM, DDR, NVRAM) that could shift the decode bottleneck.
  • Model‑specific workloads: Experiments focus on transformer‑style LLMs; other architectures (e.g., retrieval‑augmented models) might exhibit different compute‑memory balances.
  • Thermal constraints ignored: Running at 1.4 GHz continuously could trigger thermal throttling in practice, which the current simulation does not capture.
  • Future directions: Extending the framework to explore mixed‑precision compute, on‑chip compression of activations, and adaptive frequency scaling across inference phases would deepen the architectural insights.

Authors

  • Hannah Atmer
  • Yuan Yao
  • Thiemo Voigt
  • Stefanos Kaxiras

Paper Information

  • arXiv ID: 2512.22066v1
  • Categories: cs.AR, cs.LG, cs.PF
  • Published: December 26, 2025
  • PDF: Download PDF