[Paper] PIM-SHERPA: Software Method for On-device LLM Inference by Resolving PIM Memory Attribute and Layout Inconsistencies
Source: arXiv - 2603.09216v1
Overview
The paper PIM‑SHERPA tackles a practical roadblock that’s been holding back on‑device large language model (LLM) inference: the clash between how memory is handled for the prefill (compute‑heavy) and decode (memory‑bandwidth‑heavy) phases when using processing‑in‑memory (PIM) accelerators. By introducing a pure‑software solution that reconciles these memory‑attribute and layout mismatches, the authors demonstrate near‑optimal performance on real hardware while cutting memory usage by almost half.
Key Contributions
- Identify the “memory attribute inconsistency”: prefill wants weights in a cacheable region for reuse, whereas decode needs them in a non‑cacheable region to trigger PIM operations.
- Expose the “weight layout inconsistency”: host‑friendly weight ordering differs from the swizzled (PIM‑aware) layout required for efficient in‑memory computation.
- Propose PIM‑SHERPA, a software‑only framework that resolves both inconsistencies via two complementary techniques:
  - DRAM Double Buffering (DDB) – keeps a single PIM‑ready copy of weights in non‑cacheable DRAM while prefetching the next layer’s swizzled weights into small cacheable buffers.
  - Online Weight Rearrangement (OWR) – performs an on‑the‑fly swizzled memory copy just before each GEMM, eliminating the need for a permanent PIM‑aware copy.
- Demonstrate up to ~50 % DRAM capacity savings on Llama 3.2 with performance within a few percent of the theoretical PIM optimum.
- Show that the solution works on a product‑level PIM‑enabled system, not just a simulated environment, marking the first practical software‑only fix for this class of hardware.
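To make the weight layout inconsistency concrete, here is a toy swizzle. The bank‑interleaved scheme below is an assumption for illustration only; the actual PIM‑aware layout is vendor‑specific and not detailed in this summary.

```python
# Toy illustration of the host-order vs. PIM-order ("swizzled") layout
# mismatch. The round-robin bank-interleaved scheme is an illustrative
# assumption; real PIM layouts are vendor-specific.
NUM_BANKS = 4

def swizzle(weights, num_banks=NUM_BANKS):
    """Distribute consecutive host-order elements across banks so each
    bank's in-memory compute unit holds a contiguous slice."""
    per_bank = len(weights) // num_banks
    out = [None] * len(weights)
    for i, w in enumerate(weights):
        out[(i % num_banks) * per_bank + i // num_banks] = w
    return out

print(swizzle(list(range(8))))   # [0, 4, 1, 5, 2, 6, 3, 7]
```

Note how element 1, adjacent to element 0 in host order, lands in a different bank's slice in PIM order — exactly the kind of mismatch that forces either a second copy or an on‑the‑fly rearrangement.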
Methodology
- System Characterization – The authors profile a typical on‑device LLM inference pipeline, separating the prefill (large matrix multiplications) from the decode (repeated token generation). They measure how each phase interacts with the memory hierarchy of a PIM‑capable DRAM module.
- Design of DDB – A double‑buffer scheme is built where:
  - The active layer’s weights reside permanently in the non‑cacheable region (required for PIM).
  - The next layer’s weights are streamed into a small, cacheable buffer ahead of time, allowing the prefill stage to reuse them without violating the PIM trigger condition.
- Design of OWR – Instead of maintaining two copies, OWR performs a lightweight “swizzle” copy right before each GEMM call. The copy rearranges weights into the PIM‑friendly layout on demand, leveraging the CPU’s cache to keep the overhead low.
- Integration with Existing LLM Runtime – Both techniques are injected into a standard inference stack (tokenizer → embedding → transformer layers → output) with minimal changes to the high‑level code.
- Evaluation – Experiments run on a real PIM‑enabled DRAM platform using the Llama 3.2 model (7 B parameters). Baselines include a naïve PIM emulation that stores every layer’s weights in both cacheable and non‑cacheable regions. Metrics captured: DRAM usage, latency per token, and overall throughput.
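The two techniques above can be sketched as a minimal Python simulation. Everything here — the round‑robin swizzle, the class and function names, and the toy GEMV standing in for the real matrix kernels — is an assumption for illustration; the paper targets a real PIM runtime whose buffer management and layout are hardware‑specific.

```python
# Minimal simulation of DDB and OWR. All names, the swizzle scheme, and
# the toy GEMV are illustrative assumptions, not the paper's actual API.
NUM_BANKS = 4

def swizzle(w, num_banks=NUM_BANKS):
    """Rearrange host-order weights into a bank-interleaved PIM layout."""
    per_bank = len(w) // num_banks
    out = [0.0] * len(w)
    for i, v in enumerate(w):
        out[(i % num_banks) * per_bank + i // num_banks] = v
    return out

class DDBRuntime:
    """DRAM Double Buffering: one permanent PIM-ready (non-cacheable)
    copy per layer, plus a small cacheable staging buffer that receives
    the next layer's swizzled weights ahead of time."""
    def __init__(self, layers):
        self.noncacheable = [swizzle(w) for w in layers]  # PIM copies
        self.cacheable = {}                               # staging buffer

    def prefetch(self, layer_id):
        self.cacheable[layer_id] = self.noncacheable[layer_id][:]

    def run_layer(self, layer_id, x):
        # Prefill reuses the staged cacheable copy when available;
        # decode reads the non-cacheable copy to trigger PIM.
        w = self.cacheable.pop(layer_id, self.noncacheable[layer_id])
        if layer_id + 1 < len(self.noncacheable):
            self.prefetch(layer_id + 1)          # overlap with compute
        return sum(wi * xi for wi, xi in zip(w, swizzle(x)))

def owr_gemv(host_w, x):
    """Online Weight Rearrangement: swizzle into a transient,
    cache-resident buffer right before the GEMV; no permanent copy."""
    return sum(wi * xi for wi, xi in zip(swizzle(host_w), swizzle(x)))
```

Because the weight and activation vectors are permuted identically, the toy GEMV returns the same dot product as the host‑order layout, which makes the sketch easy to sanity‑check.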
Results & Findings
| Metric | Baseline (naïve) | PIM‑SHERPA (DDB) | PIM‑SHERPA (OWR) |
|---|---|---|---|
| DRAM footprint (GB) | ~12.0 | ~6.3 (≈ 47.5 % reduction) | ~6.5 (≈ 45.8 % reduction) |
| Per‑token latency (ms) | 12.4 | 12.6 | 12.8 |
| Throughput (tokens/s) | 80 | ≈ 78 | ≈ 77 |
| Latency overhead vs. theoretical PIM optimum | – | +2 % | +3 % |
What it means: Both DDB and OWR slash memory consumption by roughly half while keeping latency within a few percent of the ideal PIM performance. The small overhead of the on‑the‑fly copy in OWR is offset by the simplicity of not maintaining a double buffer.
Practical Implications
- Edge devices can host larger LLMs: By freeing up DRAM, manufacturers can fit higher‑parameter models on the same silicon, enabling richer conversational AI on phones, wearables, or IoT gateways.
- Zero‑hardware‑change deployment: Since PIM‑SHERPA is purely software, existing PIM‑enabled DRAM modules can be upgraded via firmware or driver updates, shortening time‑to‑market.
- Simplified memory management for developers: The double‑buffering and on‑demand swizzle are abstracted away from the application layer, meaning developers can continue using familiar frameworks (e.g., PyTorch, TensorFlow) with only a thin runtime shim.
- Cost‑effective scaling: Reducing DRAM capacity requirements translates directly into lower BOM costs and power consumption—critical for battery‑operated devices.
- Potential for other PIM workloads: The same attribute‑layout reconciliation ideas could be adapted for vision transformers, recommendation models, or any workload that alternates between compute‑heavy and memory‑heavy phases.
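As a sketch of what such a “thin runtime shim” could look like, the snippet below dispatches each layer to a memory path by inference phase. The hook name, the per‑layer dictionary, and the toy kernel are hypothetical, not the paper’s interface.

```python
# Hypothetical shim beneath an existing inference stack: route prefill
# through the cacheable (CPU/NPU GEMM) path and decode through the
# non-cacheable (PIM GEMV) path. All names are illustrative assumptions.
def run_layer(phase, layer, x):
    if phase == "prefill":
        w = layer["cacheable"]       # compute-heavy phase, weights reused
    else:                            # "decode"
        w = layer["noncacheable"]    # bandwidth-heavy phase, triggers PIM
    return sum(wi * xi for wi, xi in zip(w, x))

layer = {"cacheable": [1.0, 2.0], "noncacheable": [1.0, 2.0]}
print(run_layer("decode", layer, [3.0, 4.0]))   # 11.0
```

The application above this shim sees a single `run_layer` call per transformer layer, so frameworks built on top of it need no awareness of the cacheable/non‑cacheable split.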
Limitations & Future Work
- Model size ceiling: While memory savings are substantial, the approach still requires the entire model (or at least the active layer) to fit in the non‑cacheable region; ultra‑large models (>30 B) may still be out of reach on current edge DRAM capacities.
- Hardware specificity: The techniques exploit particular PIM trigger semantics of the evaluated DRAM; different vendors may expose different attribute bits, requiring adaptation.
- Runtime overhead on highly parallel workloads: In scenarios with many concurrent inference streams, the on‑the‑fly swizzle could become a bottleneck; future work could explore asynchronous copy pipelines or hardware‑assisted swizzling.
- Integration with quantization and sparsity: The paper focuses on FP16 weights; extending the method to 4‑bit quantized or sparsified representations could further shrink memory footprints.
Overall, PIM‑SHERPA opens a pragmatic path for bringing powerful LLMs to the edge without redesigning the hardware, and it sets the stage for broader software‑driven optimizations in the emerging PIM ecosystem.
Authors
- Sunjung Lee
- Sanghoon Cha
- Hyeonsu Kim
- Seungwoo Seo
- Yuhwan Ro
- Sukhan Lee
- Byeongho Kim
- Yongjun Park
- Kyomin Sohn
- Seungwon Lee
- Jaehoon Yu
Paper Information
- arXiv ID: 2603.09216v1
- Categories: cs.DC
- Published: March 10, 2026