[Paper] Offloading to CXL-based Computational Memory
Source: arXiv - 2512.04449v1
Overview
The paper introduces KAI, a system that lets CPUs offload compute‑intensive tasks to CXL‑based Computational Memory (CCM) devices. By designing a new “asynchronous back‑streaming” protocol, the authors show how to cut data‑movement overhead in disaggregated memory architectures and boost overall application performance.
Key Contributions
- Trade‑off analysis of existing CXL operation offloading models across the three CXL sub‑protocols (CXL.io, CXL.cache, CXL.mem).
- Asynchronous Back‑Streaming protocol, which overlaps data and control transfers to maximize host–CCM parallelism while keeping hardware changes minimal.
- KAI runtime that implements the protocol, providing lightweight pipelining and asynchronous host‑CCM interaction.
- Empirical evaluation demonstrating up to a 50.4 % reduction in end‑to‑end runtime, with host idle time reduced 3.85× and CCM idle time 22.11× across a suite of heterogeneous workloads.
Methodology
- Characterizing CXL Protocols – The authors first map out the capabilities and latency/throughput characteristics of the three CXL sub‑protocols, identifying where each excels or stalls for compute offload (a latency‑probe sketch follows this list).
- Designing the Protocol – Building on this analysis, they devise an “asynchronous back‑streaming” scheme that decouples data movement from control signaling: the host pushes input data to the CCM, the CCM processes it, and results stream back without the host waiting on each step (a self‑contained sketch of this shape also follows the list).
- Implementing KAI – KAI sits in the host OS kernel and in the CCM firmware. It orchestrates command queues, buffers, and completion notifications, enabling pipelined execution of multiple offloaded kernels.
- Benchmark Suite – They evaluate KAI on a mix of memory‑bound (graph analytics, key‑value stores) and compute‑bound (matrix multiplication, encryption) kernels, comparing against baseline CXL offload approaches that use synchronous, lock‑step transfers.
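To ground the characterization step, the sketch below shows the kind of latency probe it implies: dependent (pointer‑chasing) loads against a CXL.mem region exposed to Linux as a DAX device. The device path `/dev/dax0.0`, the region size, and the chain layout are illustrative assumptions, not the authors' harness.

```c
/* A minimal latency probe in the spirit of the characterization step.
 * Assumes the CCM's CXL.mem range is exposed as a Linux DAX device;
 * /dev/dax0.0, the region size, and the chain layout are placeholders. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define REGION (64UL << 20)   /* 64 MiB window into the device */
#define STRIDE 64             /* one cache line per hop */
#define HOPS   (1UL << 20)

int main(void) {
    int fd = open("/dev/dax0.0", O_RDWR);            /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }
    uint8_t *base = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* Chain the slots so every load depends on the previous one, exposing
     * access latency rather than bandwidth. A serious harness would
     * randomize the chain to defeat the hardware prefetcher. */
    size_t nslots = REGION / STRIDE;
    for (size_t i = 0; i < nslots; i++)
        *(uint64_t *)(base + i * STRIDE) = ((i + 1) % nslots) * STRIDE;

    struct timespec t0, t1;
    volatile uint64_t off = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < HOPS; i++)
        off = *(uint64_t *)(base + off);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("avg load latency: %.1f ns\n", ns / HOPS);
    munmap(base, REGION);
    close(fd);
    return 0;
}
```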
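Likewise, here is a minimal, self‑contained sketch of the back‑streaming shape that the protocol and KAI's command queues suggest, with a software thread standing in for the CCM: the host keeps a submit ring full while the worker drains it and streams results into a completion ring. Every identifier (`kai_desc`, the ring sizes) is invented for illustration; the real runtime spans the host kernel and CCM firmware.

```c
/* Self-contained sketch of the back-streaming shape: decoupled submit and
 * completion rings between a "host" and a thread standing in for the CCM.
 * All names (kai_desc, RING, JOBS) are invented for illustration. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define RING 8
#define JOBS 32

typedef struct { int id; int input; int result; } kai_desc;

static kai_desc sq[RING], cq[RING];        /* submit / completion rings */
static atomic_uint sq_head, sq_tail, cq_head, cq_tail;

/* The "CCM": drain submissions, run a stand-in kernel, stream results back. */
static void *ccm_worker(void *arg) {
    (void)arg;
    for (int done = 0; done < JOBS; ) {
        if (atomic_load(&sq_tail) == atomic_load(&sq_head)) continue;
        kai_desc d = sq[atomic_load(&sq_tail) % RING];
        atomic_fetch_add(&sq_tail, 1);
        d.result = d.input * d.input;                 /* stand-in kernel */
        while (atomic_load(&cq_head) - atomic_load(&cq_tail) == RING)
            ;                                         /* completion ring full */
        cq[atomic_load(&cq_head) % RING] = d;
        atomic_fetch_add(&cq_head, 1);
        done++;
    }
    return NULL;
}

int main(void) {
    pthread_t ccm;
    pthread_create(&ccm, NULL, ccm_worker, NULL);

    int submitted = 0, completed = 0;
    while (completed < JOBS) {
        /* Keep issuing work without waiting on earlier results. */
        if (submitted < JOBS &&
            atomic_load(&sq_head) - atomic_load(&sq_tail) < RING) {
            sq[atomic_load(&sq_head) % RING] =
                (kai_desc){ .id = submitted, .input = submitted };
            atomic_fetch_add(&sq_head, 1);
            submitted++;
        }
        /* Drain whatever has streamed back so far, then go submit more. */
        while (atomic_load(&cq_tail) != atomic_load(&cq_head)) {
            kai_desc d = cq[atomic_load(&cq_tail) % RING];
            atomic_fetch_add(&cq_tail, 1);
            printf("job %d -> %d\n", d.id, d.result);
            completed++;
        }
    }
    pthread_join(ccm, NULL);
    return 0;
}
```

The structural point carried over from the paper is that submission and completion are separate queues, so neither side blocks on the other's per‑item progress.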
Results & Findings
| Metric | Baseline (sync) | KAI (async back‑stream) | Improvement |
|---|---|---|---|
| End‑to‑end runtime (normalized, avg.) | 1.00× | 0.50× | 50.4 % lower |
| Host idle time (normalized) | 1.00× | 0.26× | 3.85× lower |
| CCM idle time (normalized) | 1.00× | 0.045× | 22.11× lower |
| Throughput (GB/s) | 12.3 | 19.8 | 61 % higher |
Key Takeaways
- Asynchrony removes the “stop‑and‑wait” bottleneck, allowing the host to continue issuing work while the CCM streams results back.
- Pipelining across multiple kernels yields near‑linear scaling up to the bandwidth limits of the underlying CXL link (see the back‑of‑envelope model after this list).
- The protocol works across all three CXL flavors, but gains are most pronounced on CXL.mem, where larger payloads can be moved without cache coherence overhead.
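A back‑of‑envelope model makes the first two takeaways concrete. With per‑chunk stage costs for transfer‑in, compute, and transfer‑out (illustrative numbers below, not measurements from the paper), lock‑step transfers pay all three stages per chunk, while a filled pipeline pays only the slowest stage per additional chunk:

```c
/* Back-of-envelope pipeline model; the stage costs are illustrative
 * numbers, not measurements from the paper. */
#include <stdio.h>

int main(void) {
    double t_in = 3.0, t_comp = 4.0, t_out = 3.0;  /* per-chunk stage costs */
    int n = 16;                                    /* chunks per offload */

    /* Lock-step: every chunk pays all three stages in series. */
    double lockstep = n * (t_in + t_comp + t_out);

    /* Ideal 3-stage pipeline: fill and drain once, then the slowest
     * stage gates each additional chunk. */
    double slow = t_comp;                          /* max of the three here */
    double pipelined = (t_in + t_comp + t_out) + (n - 1) * slow;

    printf("lock-step=%.0f  pipelined=%.0f  speedup=%.2fx\n",
           lockstep, pipelined, lockstep / pipelined);
    return 0;
}
```

With these numbers the pipelined makespan drops from 160 to 70 time units, a 2.29× speedup, and the gap widens as the stage costs become more balanced.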
Practical Implications
- Accelerated Disaggregated Systems – Cloud providers can pack more compute into memory‑only nodes, reducing the need for costly CPU cycles and lowering latency for data‑intensive services (e.g., real‑time analytics, AI inference).
- Simplified Offload APIs – KAI’s runtime can be wrapped in familiar programming models (e.g., OpenCL, CUDA‑like kernels), letting developers target CCM without rewriting low‑level CXL drivers (a hypothetical interface sketch follows this list).
- Energy Savings – By keeping both host and CCM busy, idle power draw drops dramatically, which is attractive for hyperscale data centers aiming for greener operations.
- Hardware‑agnostic Benefits – Since the protocol is built on top of standard CXL transactions, existing CXL‑compatible devices can be upgraded via firmware to enjoy KAI’s performance boost without redesigning silicon.
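For a rough idea of what such a wrapper could look like, here is a hypothetical interface shape; every identifier is invented for illustration, and any real binding would be defined by the runtime, not this sketch.

```c
/* Hypothetical shape of a developer-facing wrapper over a KAI-style
 * runtime. Every identifier is invented; a real binding would be
 * defined by the runtime, not this sketch. */
#include <stddef.h>

typedef struct kai_stream kai_stream;   /* opaque per-CCM handle */

/* Enqueue a named kernel with input/output buffers; returns immediately
 * so the caller can keep issuing work. */
int kai_enqueue(kai_stream *s, const char *kernel,
                const void *in, size_t in_len,
                void *out, size_t out_len);

/* Block only at the point the caller actually needs the results. */
int kai_wait(kai_stream *s);
```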
Limitations & Future Work
- Prototype Scope – The evaluation runs on a limited set of CCM prototypes; results may vary with commercial‑grade memory‑compute chips that have different latency characteristics.
- Memory Consistency – KAI assumes a relaxed consistency model; workloads requiring strict ordering may need additional synchronization, potentially eroding some gains.
- Scalability Beyond a Single Link – The paper focuses on a single host‑CCM connection; extending the protocol to multi‑host, multi‑CCM topologies (e.g., fabric‑wide offload) remains an open challenge.
- Tooling & Debug Support – Debugging asynchronous offloads across the CXL boundary is non‑trivial; future work could integrate tracing and profiling hooks into the runtime.
Overall, KAI demonstrates that thoughtful protocol design can unlock the latent performance of CXL‑based computational memory, offering a practical path for developers to harness near‑memory processing in next‑generation disaggregated architectures.
Authors
- Suyeon Lee
- Kangkyu Park
- Kwangsik Shin
- Ada Gavrilovska
Paper Information
- arXiv ID: 2512.04449v1
- Categories: cs.DC
- Published: December 4, 2025