[Paper] Prefill vs. Decode Bottlenecks: SRAM-Frequency Tradeoffs and the Memory-Bandwidth Ceiling

Published: December 26, 2025 at 10:42 AM EST
4 min read
Source: arXiv - 2512.22066v1

Overview

Large Language Model (LLM) inference is notoriously power‑hungry, and the design of on‑chip memory (SRAM) and clock speed can make a huge difference in both cost and carbon footprint. This paper dissects how SRAM capacity and operating frequency affect the two distinct phases of LLM inference—prefill (compute‑heavy) and decode (memory‑heavy)—and pinpoints a sweet spot that minimizes the energy‑delay product for datacenter‑scale accelerators.

Key Contributions

  • Dual‑phase analysis: Separates the energy‑performance trade‑offs of the compute‑bound prefill stage from the memory‑bound decode stage.
  • SRAM‑size impact: Shows that larger on‑chip buffers increase static (leakage) energy far more than they reduce latency, making small buffers (32‑64 KB) optimal.
  • Frequency‑bandwidth ceiling: Demonstrates that raising the compute clock helps prefill latency but quickly hits a ceiling in decode because external memory bandwidth becomes the bottleneck.
  • Energy‑delay product (EDP) optimum: Identifies a hardware configuration (1200‑1400 MHz, 32‑64 KB SRAM) that delivers the lowest EDP for the evaluated workloads.
  • Methodology integration: Combines OpenRAM (energy), LLMCompass (latency), and ScaleSIM (systolic‑array intensity) into a unified simulation stack, enabling reproducible architectural exploration.

Methodology

  1. Energy modeling with OpenRAM

    • Parameterized SRAM cells (size, voltage, temperature) to estimate dynamic switching energy and static leakage.
  2. Latency simulation via LLMCompass

    • Executes representative transformer workloads (prefill and decode) on a cycle‑accurate model of a systolic array, capturing compute stalls and memory accesses.
  3. Operational intensity from ScaleSIM

    • Calculates the ratio of arithmetic operations to memory traffic for each layer and feeds it into a roofline model to locate compute‑ vs. memory‑bound regimes (a minimal roofline sketch follows this list).
  4. Design space sweep

    • Varies SRAM capacity (8 KB–256 KB) and clock frequency (800 MHz–1500 MHz) across the two phases, recording total energy, latency, and the resulting EDP (a toy version of this sweep is sketched after the bandwidth note below).
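
As a concrete illustration of the roofline step, the sketch below computes per‑phase operational intensity and caps attainable throughput at min(peak compute, bandwidth × intensity). The peak‑FLOPS figure and the per‑phase FLOP/byte counts are illustrative assumptions, not numbers from the paper; only the 400 GB/s bandwidth mirrors the paper's setup.

```python
# Minimal roofline sketch: locate compute- vs. memory-bound regimes from
# operational intensity. All numeric values are illustrative assumptions,
# not figures from the paper.

PEAK_FLOPS = 100e12   # assumed accelerator peak throughput, FLOP/s
DRAM_BW = 400e9       # external DRAM bandwidth from the paper's setup, B/s

def attainable(op_intensity):
    """Classic roofline: performance is capped by compute or by bandwidth."""
    return min(PEAK_FLOPS, DRAM_BW * op_intensity)

def classify(flops, bytes_moved):
    oi = flops / bytes_moved            # operational intensity, FLOP/byte
    ridge = PEAK_FLOPS / DRAM_BW        # OI where the two roofs intersect
    regime = "compute-bound" if oi >= ridge else "memory-bound"
    return oi, attainable(oi), regime

# Hypothetical per-phase numbers for one transformer layer: prefill reuses
# weights across many prompt tokens (high OI), decode streams the weights
# for a single token (low OI).
for phase, flops, moved in [("prefill", 8e12, 2e10), ("decode", 2e10, 2e10)]:
    oi, perf, regime = classify(flops, moved)
    print(f"{phase}: OI = {oi:.1f} FLOP/B, "
          f"attainable = {perf / 1e12:.1f} TFLOP/s ({regime})")
```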

All simulations run on a fixed external DRAM bandwidth (≈ 400 GB/s), mirroring typical datacenter GPU/TPU interconnects.
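
A toy version of the sweep can be written as a grid search over (SRAM size, frequency) pairs that minimizes EDP. The analytical latency and energy models below are hypothetical stand‑ins for the OpenRAM/LLMCompass/ScaleSIM outputs, so the printed optimum is only a property of the toy coefficients, not a reproduction of the paper's result.

```python
# Hypothetical design-space sweep over SRAM capacity and clock frequency,
# minimizing the energy-delay product (EDP). The latency and energy models
# are illustrative stand-ins, not the paper's data.
from itertools import product

DRAM_BW = 400e9                 # B/s, fixed external bandwidth as in the paper

def phase_latency(freq_mhz, work_flops, traffic_bytes, sram_kb):
    """Roofline-style phase time: compute time scales with 1/f, memory time
    with DRAM bandwidth; larger SRAM gives a small (assumed) reuse bonus."""
    reuse = 1.0 + 0.05 * (sram_kb / 64.0)             # illustrative assumption
    compute_t = work_flops / (64 * freq_mhz * 1e6)    # 64 FLOP/cycle (assumed)
    memory_t = (traffic_bytes / reuse) / DRAM_BW
    return max(compute_t, memory_t)

def edp(sram_kb, freq_mhz):
    # Prefill is compute-heavy, decode is memory-heavy (hypothetical magnitudes).
    t_prefill = phase_latency(freq_mhz, work_flops=2e13, traffic_bytes=1e10,
                              sram_kb=sram_kb)
    t_decode = phase_latency(freq_mhz, work_flops=2e11, traffic_bytes=8e10,
                             sram_kb=sram_kb)
    t = t_prefill + t_decode
    p_dynamic = 5.0 * (freq_mhz / 1000.0) ** 2        # W, grows with clock
    p_leakage = 2.0 + 0.1 * sram_kb                   # W, grows with SRAM size
    return (p_dynamic + p_leakage) * t * t            # energy * delay

best = min(product([8, 16, 32, 64, 128, 256], range(800, 1501, 100)),
           key=lambda cfg: edp(*cfg))
print(f"lowest-EDP point in the toy model: {best[0]} KB SRAM @ {best[1]} MHz")
```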

Results & Findings

| Configuration | Prefill Latency | Decode Latency | Total Energy | EDP (Energy × Delay) |
| --- | --- | --- | --- | --- |
| 32 KB SRAM, 1300 MHz | ↓ 18 % vs. 256 KB | Near‑optimal (bandwidth‑limited) | Minimal (leakage cut) | Best |
| 256 KB SRAM, 1300 MHz | Slightly lower latency | Negligible gain (still bandwidth‑bound) | ↑ 45 % (leakage) | Worse |
| 64 KB SRAM, 900 MHz | Higher latency | Bandwidth ceiling reached earlier | ↑ 30 % | Worse |

  • Static energy dominates: Larger buffers add up to 40 % more leakage without proportionate latency reduction.
  • Frequency benefits plateau: Above ~1.2 GHz, prefill speeds up, but decode latency flattens because the external memory cannot feed data any faster.
  • Counter‑intuitive energy win: The higher dynamic power of a faster clock is outweighed by the reduction in static energy (shorter execution → less leakage); a back‑of‑the‑envelope sketch below illustrates this.

The authors also plotted a roofline diagram confirming that decode quickly becomes memory‑bound, regardless of compute frequency.
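
The race‑to‑idle arithmetic behind that last bullet is easy to sanity‑check: leakage power is paid for the entire runtime, so finishing a compute‑bound phase sooner can save more static energy than the faster clock adds in dynamic energy. The power and cycle‑count numbers below are made up for illustration, not taken from the paper.

```python
# Back-of-the-envelope check of the "faster clock can save energy" effect.
# All numbers are illustrative assumptions.

def total_energy(freq_ghz, work_cycles=1.3e9, p_leak_w=2.0):
    runtime_s = work_cycles / (freq_ghz * 1e9)   # compute-bound phase
    p_dyn_w = 4.0 * freq_ghz                     # dynamic power ~ f (assumed)
    return (p_dyn_w + p_leak_w) * runtime_s      # leakage paid for full runtime

for f in (0.9, 1.3):
    print(f"{f} GHz: {total_energy(f):.2f} J total")
```

This only helps while the phase is compute‑bound: once decode hits the bandwidth ceiling, runtime stops shrinking with frequency and the extra dynamic power is pure overhead, which matches the plateau above ~1.2 GHz described earlier.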

Practical Implications

  • Accelerator designers: When sizing on‑chip SRAM for LLM inference, aim for the 32‑64 KB range instead of the commonly used megabyte‑scale buffers. This cuts leakage power dramatically while keeping latency acceptable.
  • Datacenter operators: Deploying chips that run at ~1.3 GHz can lower the overall energy bill, even though they consume more instantaneous power, because jobs finish sooner and the system spends less time in idle/leakage mode.
  • Software stack: Frameworks can expose a “prefill‑decode” mode switch, allowing the scheduler to boost frequency only during the prefill phase and throttle back during decode, extracting the same EDP gains without hardware changes (a sketch of such a policy follows this list).
  • Memory subsystem planning: Since external bandwidth is the ultimate ceiling, investing in higher‑bandwidth DRAM (e.g., HBM2e) or smarter data‑reuse schemes (e.g., activation recomputation) yields larger performance returns than simply cranking up the compute clock.
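
As a rough sketch of that prefill/decode mode switch, a serving loop could request a higher DVFS state only for the prompt pass and drop back for token generation. The `set_frequency_mhz` hook and the `prefill`/`decode_step` methods below are hypothetical placeholders for whatever power‑management and model interfaces a real deployment exposes; they are not APIs described in the paper.

```python
# Sketch of a phase-aware DVFS policy for an inference serving loop.
# `set_frequency_mhz`, `prefill`, and `decode_step` are hypothetical
# stand-ins, not APIs from the paper or any particular framework.

PREFILL_MHZ = 1300   # boost while the prompt pass is compute-bound
DECODE_MHZ = 1000    # throttle once memory bandwidth is the limiter

def set_frequency_mhz(mhz: int) -> None:
    # Placeholder for the platform's real power-management call.
    print(f"[dvfs] requesting {mhz} MHz")

def generate(model, prompt_tokens, max_new_tokens):
    # Prefill: one compute-heavy pass over the whole prompt.
    set_frequency_mhz(PREFILL_MHZ)
    kv_cache = model.prefill(prompt_tokens)

    # Decode: token-by-token generation dominated by weight/KV-cache traffic.
    set_frequency_mhz(DECODE_MHZ)
    tokens = []
    for _ in range(max_new_tokens):
        token, kv_cache = model.decode_step(kv_cache)
        tokens.append(token)
    return tokens

class _ToyModel:
    """Trivial stub so the sketch runs end to end."""
    def prefill(self, prompt_tokens):
        return {"pos": len(prompt_tokens)}    # pretend KV cache
    def decode_step(self, kv_cache):
        kv_cache["pos"] += 1
        return kv_cache["pos"], kv_cache      # pretend next token

if __name__ == "__main__":
    print(generate(_ToyModel(), prompt_tokens=[1, 2, 3], max_new_tokens=4))
```

One natural refinement is to pick the decode clock just high enough to keep the external memory system saturated, since anything beyond that only adds dynamic power.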

Overall, the paper provides a concrete rule‑of‑thumb: “small SRAM + high frequency = best energy‑delay trade‑off for LLM inference.”

Limitations & Future Work

  • Fixed external bandwidth: The study assumes a single DRAM bandwidth value; real systems may have heterogeneous memory hierarchies (HBM, DDR, NVRAM) that could shift the decode bottleneck.
  • Model‑specific workloads: Experiments focus on transformer‑style LLMs; other architectures (e.g., retrieval‑augmented models) might exhibit different compute‑memory balances.
  • Thermal constraints ignored: Running at 1.4 GHz continuously could trigger thermal throttling in practice, which the current simulation does not capture.
  • Future directions: Extending the framework to explore mixed‑precision compute, on‑chip compression of activations, and adaptive frequency scaling across inference phases would deepen the architectural insights.

Authors

  • Hannah Atmer
  • Yuan Yao
  • Thiemo Voigt
  • Stefanos Kaxiras

Paper Information

  • arXiv ID: 2512.22066v1
  • Categories: cs.AR, cs.LG, cs.PF
  • Published: December 26, 2025
  • PDF: Download PDF