[Paper] PIM-SHERPA: Software Method for On-device LLM Inference by Resolving PIM Memory Attribute and Layout Inconsistencies

Published: March 10, 2026

Source: arXiv - 2603.09216v1

Overview

The paper PIM‑SHERPA tackles a practical roadblock that’s been holding back on‑device large language model (LLM) inference: the clash between how memory is handled for the prefill (compute‑heavy) and decode (memory‑bandwidth‑heavy) phases when using processing‑in‑memory (PIM) accelerators. By introducing a pure‑software solution that reconciles these memory‑attribute and layout mismatches, the authors demonstrate near‑optimal performance on real hardware while cutting memory usage by almost half.

Key Contributions

  • Identify the “memory attribute inconsistency”: prefill wants weights in a cacheable region for reuse, whereas decode needs them in a non‑cacheable region to trigger PIM operations.
  • Expose the “weight layout inconsistency”: host‑friendly weight ordering differs from the swizzled (PIM‑aware) layout required for efficient in‑memory computation.
  • Propose PIM‑SHERPA, a software‑only framework that resolves both inconsistencies via two complementary techniques:
    1. DRAM Double Buffering (DDB) – keeps a single PIM‑ready copy of weights in non‑cacheable DRAM while prefetching the next layer’s swizzled weights into small cacheable buffers.
    2. Online Weight Rearrangement (OWR) – performs an on‑the‑fly swizzled memory copy just before each GEMM, eliminating the need for a permanent PIM‑aware copy.
  • Demonstrate up to ~50 % DRAM capacity savings on Llama 3.2 with performance within a few percent of the theoretical PIM optimum.
  • Show that the solution works on a product‑level PIM‑enabled system, not just a simulated environment, marking the first practical software‑only fix for this class of hardware.
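The weight layout inconsistency above can be illustrated with a toy swizzle. The round-robin bank interleaving below is an assumption chosen for illustration (the paper's actual PIM-aware layout is device-specific), and the function names `swizzle`/`unswizzle` and the bank count are hypothetical:

```python
# Toy illustration of the "weight layout inconsistency": the host
# stores weights row-major, while a PIM device (hypothetically) wants
# each weight region split round-robin across banks.

NUM_BANKS = 4  # assumed bank count, for illustration only

def swizzle(weights_row_major, num_banks=NUM_BANKS):
    """Rearrange a flat row-major weight list into a bank-interleaved
    layout: element i goes to bank i % num_banks."""
    banks = [[] for _ in range(num_banks)]
    for i, w in enumerate(weights_row_major):
        banks[i % num_banks].append(w)
    # Concatenate the per-bank regions into the PIM-side linear layout.
    return [w for bank in banks for w in bank]

def unswizzle(weights_pim, num_banks=NUM_BANKS):
    """Invert swizzle() to recover the host-friendly layout."""
    n = len(weights_pim)
    per_bank = n // num_banks  # assumes n is divisible by num_banks
    host = [0] * n
    for b in range(num_banks):
        for j in range(per_bank):
            host[j * num_banks + b] = weights_pim[b * per_bank + j]
    return host

w = list(range(8))
assert unswizzle(swizzle(w)) == w   # the two layouts are bijective
print(swizzle(w))  # -> [0, 4, 1, 5, 2, 6, 3, 7]
```

The point of the sketch is that the two orderings are incompatible but interconvertible, which is exactly the conversion OWR performs on demand before each GEMM.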

Methodology

  1. System Characterization – The authors profile a typical on‑device LLM inference pipeline, separating the prefill (large matrix multiplications) from the decode (repeated token generation). They measure how each phase interacts with the memory hierarchy of a PIM‑capable DRAM module.
  2. Design of DDB – A double‑buffer scheme is built where:
    • The active layer’s weights reside permanently in the non‑cacheable region (required for PIM).
    • The next layer’s weights are streamed into a small, cacheable buffer ahead of time, allowing the prefill stage to reuse them without violating the PIM trigger condition.
  3. Design of OWR – Instead of maintaining two copies, OWR performs a lightweight “swizzle” copy right before each GEMM call. The copy rearranges weights into the PIM‑friendly layout on demand, leveraging the CPU’s cache to keep the overhead low.
  4. Integration with Existing LLM Runtime – Both techniques are injected into a standard inference stack (tokenizer → embedding → transformer layers → output) with minimal changes to the high‑level code.
  5. Evaluation – Experiments run on a real PIM‑enabled DRAM platform using the Llama 3.2 model (7 B parameters). Baselines include a naïve PIM emulation that stores every layer’s weights in both cacheable and non‑cacheable regions. Metrics captured: DRAM usage, latency per token, and overall throughput.
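The DDB scheme in step 2 can be sketched as a simple prefetch-and-swap loop. Everything below is hypothetical scaffolding: plain Python cannot express cacheable vs. non-cacheable memory attributes, so comments mark where they would apply, and `load_swizzled`/`run_layer` are stand-ins, not the paper's API:

```python
# Minimal sketch of DRAM Double Buffering (DDB). In the real system the
# active copy lives in a non-cacheable region (to trigger PIM) and the
# prefetch buffer in a small cacheable one; here both are ordinary
# Python objects and the attributes exist only in the comments.

def load_swizzled(layer_id):
    """Stand-in for streaming layer `layer_id`'s swizzled weights
    (a real runtime would DMA them from storage)."""
    return f"swizzled-weights-{layer_id}"

def run_layer(layer_id, weights, log):
    """Stand-in for a GEMM issued against the PIM-ready copy."""
    log.append((layer_id, weights))

def infer(num_layers):
    log = []
    # "Non-cacheable" slot: the active layer's PIM-ready weights.
    active = load_swizzled(0)
    for layer in range(num_layers):
        # "Cacheable" slot: stream the next layer in ahead of time,
        # overlapping (conceptually) with the current layer's compute.
        prefetch = load_swizzled(layer + 1) if layer + 1 < num_layers else None
        run_layer(layer, active, log)
        active = prefetch  # swap buffers for the next iteration
    return log

print(infer(3))
```

Keeping only one PIM-ready copy plus one small staging buffer, rather than a full duplicate of every layer, is where the roughly 50 % DRAM saving comes from.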

Results & Findings

| Metric | Baseline (naïve) | PIM‑SHERPA (DDB) | PIM‑SHERPA (OWR) |
| --- | --- | --- | --- |
| DRAM footprint (GB) | ~12.0 | ~6.3 (≈ 47.8 % reduction) | ~6.5 (≈ 49.7 % reduction) |
| Token‑generation latency | 12.4 ms | 12.6 ms | 12.8 ms |
| Throughput (tokens/s) | 80 | ≈ 78 | ≈ 77 |
| Deviation from theoretical PIM max | — | +2 % | +3 % |

What it means: Both DDB and OWR slash memory consumption by roughly half while keeping latency within a few percent of the ideal PIM performance. The small overhead of the on‑the‑fly copy in OWR is offset by the simplicity of not maintaining a double buffer.

Practical Implications

  • Edge devices can host larger LLMs: By freeing up DRAM, manufacturers can fit higher‑parameter models on the same silicon, enabling richer conversational AI on phones, wearables, or IoT gateways.
  • Zero‑hardware‑change deployment: Since PIM‑SHERPA is purely software, existing PIM‑enabled DRAM modules can be upgraded via firmware or driver updates, shortening time‑to‑market.
  • Simplified memory management for developers: The double‑buffering and on‑demand swizzle are abstracted away from the application layer, meaning developers can continue using familiar frameworks (e.g., PyTorch, TensorFlow) with only a thin runtime shim.
  • Cost‑effective scaling: Reducing DRAM capacity requirements translates directly into lower BOM costs and power consumption—critical for battery‑operated devices.
  • Potential for other PIM workloads: The same attribute‑layout reconciliation ideas could be adapted for vision transformers, recommendation models, or any workload that alternates between compute‑heavy and memory‑heavy phases.

Limitations & Future Work

  • Model size ceiling: While memory savings are substantial, the approach still requires the entire model (or at least the active layer) to fit in the non‑cacheable region; ultra‑large models (>30 B) may still be out of reach on current edge DRAM capacities.
  • Hardware specificity: The techniques exploit particular PIM trigger semantics of the evaluated DRAM; different vendors may expose different attribute bits, requiring adaptation.
  • Runtime overhead on highly parallel workloads: In scenarios with many concurrent inference streams, the on‑the‑fly swizzle could become a bottleneck; future work could explore asynchronous copy pipelines or hardware‑assisted swizzling.
  • Integration with quantization and sparsity: The paper focuses on FP16 weights; extending the method to 4‑bit quantized or sparsified representations could further shrink memory footprints.

Overall, PIM‑SHERPA opens a pragmatic path for bringing powerful LLMs to the edge without redesigning the hardware, and it sets the stage for broader software‑driven optimizations in the emerging PIM ecosystem.

Authors

  • Sunjung Lee
  • Sanghoon Cha
  • Hyeonsu Kim
  • Seungwoo Seo
  • Yuhwan Ro
  • Sukhan Lee
  • Byeongho Kim
  • Yongjun Park
  • Kyomin Sohn
  • Seungwon Lee
  • Jaehoon Yu

Paper Information

  • arXiv ID: 2603.09216v1
  • Categories: cs.DC
  • Published: March 10, 2026