[Paper] DOLMA: A Data Object Level Memory Disaggregation Framework for HPC Applications

Published: December 1, 2025 at 07:39 PM EST
4 min read
Source: arXiv - 2512.02300v1

Overview

The paper introduces DOLMA, a framework that lets high‑performance computing (HPC) applications use disaggregated (remote) memory without a dramatic slowdown. By moving whole data objects to remote memory pools and prefetching them intelligently, DOLMA cuts local memory footprints by roughly two‑thirds on average while keeping the performance hit under 16 % for a suite of real HPC workloads.

Key Contributions

  • Object‑level memory placement – DOLMA automatically decides which data structures belong in local RAM and which can be offloaded to remote memory.
  • Quantitative sizing model – A lightweight analysis predicts the minimum local memory size needed to meet a target slowdown, enabling system‑wide memory budgeting (a sketch of this idea follows the list).
  • Dual‑buffer prefetch engine – Exploits the regular access patterns of many HPC codes to overlap remote fetches with computation, hiding latency.
  • Thread‑aware concurrency – Keeps multi‑threaded applications scalable by coordinating local/remote buffers per thread.
  • Comprehensive evaluation – Tested on eight representative HPC kernels, showing average memory savings of 63 % with ≤ 16 % runtime overhead.
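
To make the sizing model concrete, the following is a minimal sketch of one way such a model could be formulated: a greedy selection that offloads the objects freeing the most memory per unit of estimated slowdown until the budget is spent. The ObjectProfile fields, the ordering heuristic, and the numbers in main are illustrative assumptions, not the paper's actual model.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// One candidate data object: its footprint and an estimated runtime penalty
// (as a fraction of total runtime) if it is moved to remote memory.
struct ObjectProfile {
    std::string name;
    std::size_t bytes;          // local memory freed if offloaded
    double      slowdown_cost;  // e.g. 0.03 means an estimated 3 % slowdown
};

// Greedy sizing model: offload the objects that free the most memory per unit
// of slowdown until the user-specified budget is exhausted, then report the
// minimum local memory that must stay resident.
std::size_t min_local_memory(std::vector<ObjectProfile> objs,
                             std::size_t total_bytes,
                             double slowdown_budget) {
    std::sort(objs.begin(), objs.end(),
              [](const ObjectProfile& a, const ObjectProfile& b) {
                  return a.bytes / (a.slowdown_cost + 1e-9) >
                         b.bytes / (b.slowdown_cost + 1e-9);
              });
    std::size_t offloaded = 0;
    double used_budget = 0.0;
    for (const auto& o : objs) {
        if (used_budget + o.slowdown_cost > slowdown_budget) continue;
        used_budget += o.slowdown_cost;
        offloaded   += o.bytes;
    }
    return total_bytes - offloaded;  // DRAM still required locally
}

int main() {
    std::vector<ObjectProfile> objs = {
        {"matrix_A", 8ull << 30, 0.04},   // 8 GiB, cheap to offload
        {"halo_buf", 2ull << 30, 0.01},
        {"lookup",   1ull << 30, 0.10},   // small but latency-critical
    };
    const std::size_t total = 16ull << 30;             // 16 GiB resident today
    return min_local_memory(objs, total, 0.16) < total ? 0 : 1;
}
```

Whatever form the real model takes, the key point is that once per‑object cost estimates exist, the minimum resident size for a given slowdown target reduces to a simple selection problem, which is what makes system‑wide memory budgeting possible.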

Methodology

  1. Profiling & Object Classification – DOLMA runs a short profiling phase to collect access frequency, stride, and reuse distance for each major data object (arrays, matrices, etc.).
  2. Cost‑Benefit Decision – Using the collected metrics, a simple cost model estimates the latency penalty of remote access versus the memory saved locally. Objects whose remote cost is acceptable are marked for offloading.
  3. Dual‑Buffer Design – For each offloaded object, DOLMA allocates two buffers in local RAM: one holds the chunk currently in use, while the other prefetches the next chunk from remote memory. A background thread issues RDMA reads ahead of the compute thread, so the data is already resident when the algorithm steps to the next region (see the sketch after this list).
  4. Thread‑Level Coordination – In multi‑threaded runs, each thread gets its own pair of buffers, avoiding contention and allowing independent prefetch streams.
  5. Runtime Adaptation – If the observed slowdown exceeds the user‑specified budget, DOLMA can pull additional objects back into local memory on the fly (a small sketch follows the next paragraph).
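
As a rough illustration of the dual‑buffer design in step 3, the sketch below assumes a chunked, streaming access pattern and uses std::async as a stand‑in for an asynchronous RDMA read; fetch_remote_chunk, the chunk size, and the reduction loop are invented for the example and are not DOLMA's actual interface.

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

constexpr std::size_t kChunkElems = 1 << 20;  // elements per prefetched chunk

// Stand-in for an RDMA read of one chunk of a remote object; a real
// implementation would issue an asynchronous RDMA GET here.
std::vector<double> fetch_remote_chunk(std::size_t chunk_idx) {
    std::vector<double> chunk(kChunkElems);
    std::iota(chunk.begin(), chunk.end(), static_cast<double>(chunk_idx));
    return chunk;
}

// Stream over a remote object chunk by chunk: while the compute loop works on
// the "current" buffer, the next chunk is fetched in the background, so remote
// latency overlaps with computation.
double stream_remote_object(std::size_t num_chunks) {
    double sum = 0.0;
    std::vector<double> current = fetch_remote_chunk(0);       // buffer A
    for (std::size_t i = 0; i < num_chunks; ++i) {
        std::future<std::vector<double>> next;                 // buffer B
        if (i + 1 < num_chunks)
            next = std::async(std::launch::async, fetch_remote_chunk, i + 1);

        for (double x : current) sum += x;                     // compute phase

        if (next.valid()) current = next.get();                // swap buffers
    }
    return sum;
}

int main() { return stream_remote_object(8) > 0.0 ? 0 : 1; }
```

The essential property is that buffer B is being filled while the compute loop runs over buffer A, so for regular access patterns the remote latency is paid once up front and thereafter overlapped with computation; the thread‑aware design in step 4 gives each thread its own pair of buffers so these prefetch streams stay independent.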

The whole pipeline is implemented as a lightweight library that can be linked to existing MPI or OpenMP HPC codes with minimal code changes.
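
Step 5's adaptation loop can likewise be pictured as a small periodic check. The sketch below assumes a baseline runtime estimate and per‑object cost bookkeeping similar to the sizing sketch above; these are illustrative assumptions, not DOLMA's actual mechanism.

```cpp
#include <string>
#include <vector>

// Bookkeeping for one offloaded object (hypothetical fields).
struct RemoteObject {
    std::string name;
    double slowdown_cost;   // estimated contribution to the slowdown
    bool   resident;        // true once pulled back into local DRAM
};

// Called periodically with a measured per-iteration time and a baseline
// estimate; if the observed slowdown exceeds the budget, pull back the
// offloaded object with the highest estimated cost.
void adapt_placement(std::vector<RemoteObject>& objs,
                     double measured_s, double baseline_s, double budget) {
    const double slowdown = measured_s / baseline_s - 1.0;
    if (slowdown <= budget) return;            // still within the user's budget
    RemoteObject* worst = nullptr;
    for (auto& o : objs)
        if (!o.resident && (!worst || o.slowdown_cost > worst->slowdown_cost))
            worst = &o;
    if (worst) worst->resident = true;         // migrate back to local memory
}

int main() {
    std::vector<RemoteObject> objs = {{"matrix_A", 0.04, false},
                                      {"lookup", 0.10, false}};
    adapt_placement(objs, 1.25, 1.0, 0.16);    // 25 % slowdown, 16 % budget
    return objs[1].resident ? 0 : 1;           // "lookup" should be pulled back
}
```

In a real run the trigger would come from the runtime's own timing measurements; the sketch only illustrates the pull‑back policy.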

Results & Findings

| Metric | Local‑only baseline | DOLMA (remote) |
| --- | --- | --- |
| Memory usage reduction | 0 % | 63 % average (up to 78 % for some kernels) |
| Runtime overhead | 0 % | ≤ 16 % (average 9 %) |
| Scalability (threads) | Linear up to 32 threads | Near‑linear, < 5 % extra sync cost |
| Prefetch effectiveness | N/A | Hides ~70 % of remote‑access latency |

Key take‑aways

  • The dual‑buffer prefetch hides most of the network latency, especially for regular, stride‑based accesses common in stencil codes and dense linear algebra.
  • Even for more irregular kernels (e.g., graph traversals), DOLMA still stays within the 16 % slowdown budget by falling back to a more conservative local placement.
  • The quantitative sizing model proved accurate within ±5 % of the optimal local memory size determined by exhaustive search.

Practical Implications

  • Data‑center operators can over‑subscribe memory resources across nodes, reducing hardware costs while still supporting memory‑hungry HPC jobs.
  • Application developers gain a drop‑in library that automatically balances memory locality vs. capacity, freeing them from manual data placement or custom paging schemes.
  • System architects can design cheaper nodes with smaller DRAM modules, relying on high‑speed RDMA fabrics (e.g., InfiniBand, RoCE) to provide the bulk of the memory pool.
  • Cloud‑based HPC services could offer “elastic memory” tiers where users pay for extra remote memory only when needed, with DOLMA handling the migration transparently.

Overall, DOLMA demonstrates that memory disaggregation is not just a theoretical scaling trick—it can be made practical for latency‑sensitive scientific codes.

Limitations & Future Work

  • Irregular access patterns still incur higher overhead; the current model assumes fairly predictable strides.
  • The framework relies on a fast RDMA network; performance on commodity Ethernet may be insufficient.
  • DOLMA currently targets C/C++/Fortran codes compiled with MPI/OpenMP; extending support to GPU‑offloaded workloads remains an open challenge.
  • Future research directions include adaptive learning‑based placement (e.g., reinforcement learning) and integration with container orchestration platforms for seamless cloud deployment.

Authors

  • Haoyu Zheng
  • Shouwei Gao
  • Jie Ren
  • Wenqian Dong

Paper Information

  • arXiv ID: 2512.02300v1
  • Categories: cs.DC
  • Published: December 2, 2025
  • PDF: Download PDF