[Paper] DOLMA: A Data Object Level Memory Disaggregation Framework for HPC Applications

Published: December 1, 2025 at 07:39 PM EST
4 min read
Source: arXiv - 2512.02300v1

Overview

The paper introduces DOLMA, a framework that lets high‑performance computing (HPC) applications use disaggregated (remote) memory without a dramatic slowdown. By moving whole data objects to remote memory pools and prefetching them intelligently, DOLMA cuts local memory footprints by roughly two‑thirds on average while keeping the performance hit under 16 % for a suite of real HPC workloads.

Key Contributions

  • Object‑level memory placement – DOLMA automatically decides which data structures belong in local RAM and which can be offloaded to remote memory.
  • Quantitative sizing model – A lightweight analysis predicts the minimum local memory size needed to meet a target slowdown, enabling system‑wide memory budgeting (a sketch of this idea follows the list).
  • Dual‑buffer prefetch engine – Exploits the regular access patterns of many HPC codes to overlap remote fetches with computation, hiding latency.
  • Thread‑aware concurrency – Keeps multi‑threaded applications scalable by coordinating local/remote buffers per thread.
  • Comprehensive evaluation – Tested on eight representative HPC kernels, showing average memory savings of 63 % with ≤ 16 % runtime overhead.
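
To make the sizing model concrete, the following is a minimal sketch of one way such a model could be formulated: a greedy selection that offloads the objects freeing the most memory per unit of estimated slowdown until the budget is spent. The ObjectProfile fields, the ordering heuristic, and the numbers in main are illustrative assumptions, not the paper's actual model.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// One candidate data object: its footprint and an estimated runtime penalty
// (as a fraction of total runtime) if it is moved to remote memory.
struct ObjectProfile {
    std::string name;
    std::size_t bytes;          // local memory freed if offloaded
    double      slowdown_cost;  // e.g. 0.03 means an estimated 3 % slowdown
};

// Greedy sizing model: offload the objects that free the most memory per unit
// of slowdown until the user-specified budget is exhausted, then report the
// minimum local memory that must stay resident.
std::size_t min_local_memory(std::vector<ObjectProfile> objs,
                             std::size_t total_bytes,
                             double slowdown_budget) {
    std::sort(objs.begin(), objs.end(),
              [](const ObjectProfile& a, const ObjectProfile& b) {
                  return a.bytes / (a.slowdown_cost + 1e-9) >
                         b.bytes / (b.slowdown_cost + 1e-9);
              });
    std::size_t offloaded = 0;
    double used_budget = 0.0;
    for (const auto& o : objs) {
        if (used_budget + o.slowdown_cost > slowdown_budget) continue;
        used_budget += o.slowdown_cost;
        offloaded   += o.bytes;
    }
    return total_bytes - offloaded;  // DRAM still required locally
}

int main() {
    std::vector<ObjectProfile> objs = {
        {"matrix_A", 8ull << 30, 0.04},   // 8 GiB, cheap to offload
        {"halo_buf", 2ull << 30, 0.01},
        {"lookup",   1ull << 30, 0.10},   // small but latency-critical
    };
    const std::size_t total = 16ull << 30;             // 16 GiB resident today
    return min_local_memory(objs, total, 0.16) < total ? 0 : 1;
}
```

Whatever form the real model takes, the key point is that once per‑object cost estimates exist, the minimum resident size for a given slowdown target reduces to a simple selection problem, which is what makes system‑wide memory budgeting possible.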

Methodology

  1. Profiling & Object Classification – DOLMA runs a short profiling phase to collect access frequency, stride, and reuse distance for each major data object (arrays, matrices, etc.).
  2. Cost‑Benefit Decision – Using the collected metrics, a simple cost model estimates the latency penalty of remote access versus the memory saved locally. Objects whose remote cost is acceptable are marked for offloading.
  3. Dual‑Buffer Design – For each offloaded object, DOLMA allocates two buffers in local RAM: one holds the chunk currently in use, while the other prefetches the next chunk from remote memory. A background thread issues RDMA reads ahead of the compute thread, so the data is already resident when the algorithm steps to the next region (see the sketch after this list).
  4. Thread‑Level Coordination – In multi‑threaded runs, each thread gets its own pair of buffers, avoiding contention and allowing independent prefetch streams.
  5. Runtime Adaptation – If the observed slowdown exceeds the user‑specified budget, DOLMA can pull additional objects back into local memory on the fly (a small sketch follows the next paragraph).
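
As a rough illustration of the dual‑buffer design in step 3, the sketch below assumes a chunked, streaming access pattern and uses std::async as a stand‑in for an asynchronous RDMA read; fetch_remote_chunk, the chunk size, and the reduction loop are invented for the example and are not DOLMA's actual interface.

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

constexpr std::size_t kChunkElems = 1 << 20;  // elements per prefetched chunk

// Stand-in for an RDMA read of one chunk of a remote object; a real
// implementation would issue an asynchronous RDMA GET here.
std::vector<double> fetch_remote_chunk(std::size_t chunk_idx) {
    std::vector<double> chunk(kChunkElems);
    std::iota(chunk.begin(), chunk.end(), static_cast<double>(chunk_idx));
    return chunk;
}

// Stream over a remote object chunk by chunk: while the compute loop works on
// the "current" buffer, the next chunk is fetched in the background, so remote
// latency overlaps with computation.
double stream_remote_object(std::size_t num_chunks) {
    double sum = 0.0;
    std::vector<double> current = fetch_remote_chunk(0);       // buffer A
    for (std::size_t i = 0; i < num_chunks; ++i) {
        std::future<std::vector<double>> next;                 // buffer B
        if (i + 1 < num_chunks)
            next = std::async(std::launch::async, fetch_remote_chunk, i + 1);

        for (double x : current) sum += x;                     // compute phase

        if (next.valid()) current = next.get();                // swap buffers
    }
    return sum;
}

int main() { return stream_remote_object(8) > 0.0 ? 0 : 1; }
```

The essential property is that buffer B is being filled while the compute loop runs over buffer A, so for regular access patterns the remote latency is paid once up front and thereafter overlapped with computation; the thread‑aware design in step 4 gives each thread its own pair of buffers so these prefetch streams stay independent.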

The whole pipeline is implemented as a lightweight library that can be linked to existing MPI or OpenMP HPC codes with minimal code changes.
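
Step 5's adaptation loop can likewise be pictured as a small periodic check. The sketch below assumes a baseline runtime estimate and per‑object cost bookkeeping similar to the sizing sketch above; these are illustrative assumptions, not DOLMA's actual mechanism.

```cpp
#include <string>
#include <vector>

// Bookkeeping for one offloaded object (hypothetical fields).
struct RemoteObject {
    std::string name;
    double slowdown_cost;   // estimated contribution to the slowdown
    bool   resident;        // true once pulled back into local DRAM
};

// Called periodically with a measured per-iteration time and a baseline
// estimate; if the observed slowdown exceeds the budget, pull back the
// offloaded object with the highest estimated cost.
void adapt_placement(std::vector<RemoteObject>& objs,
                     double measured_s, double baseline_s, double budget) {
    const double slowdown = measured_s / baseline_s - 1.0;
    if (slowdown <= budget) return;            // still within the user's budget
    RemoteObject* worst = nullptr;
    for (auto& o : objs)
        if (!o.resident && (!worst || o.slowdown_cost > worst->slowdown_cost))
            worst = &o;
    if (worst) worst->resident = true;         // migrate back to local memory
}

int main() {
    std::vector<RemoteObject> objs = {{"matrix_A", 0.04, false},
                                      {"lookup", 0.10, false}};
    adapt_placement(objs, 1.25, 1.0, 0.16);    // 25 % slowdown, 16 % budget
    return objs[1].resident ? 0 : 1;           // "lookup" should be pulled back
}
```

In a real run the trigger would come from the runtime's own timing measurements; the sketch only illustrates the pull‑back policy.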

Results & Findings

| Metric | Local‑only baseline | DOLMA (remote) |
| --- | --- | --- |
| Memory usage reduction | 0 % | 63 % average (up to 78 % for some kernels) |
| Runtime overhead | 0 % | ≤ 16 % (average 9 %) |
| Scalability (threads) | Linear up to 32 threads | Near‑linear, < 5 % extra sync cost |
| Prefetch effectiveness | N/A | Hides ~70 % of remote‑access latency |

Key take‑aways

  • The dual‑buffer prefetch hides most of the network latency, especially for regular, stride‑based accesses common in stencil codes and dense linear algebra.
  • Even for more irregular kernels (e.g., graph traversals), DOLMA still stays within the 16 % slowdown budget by falling back to a more conservative local placement.
  • The quantitative sizing model proved accurate within ±5 % of the optimal local memory size determined by exhaustive search.

Practical Implications

  • Data‑center operators can over‑subscribe memory resources across nodes, reducing hardware costs while still supporting memory‑hungry HPC jobs.
  • Application developers gain a drop‑in library that automatically balances memory locality vs. capacity, freeing them from manual data placement or custom paging schemes.
  • System architects can design cheaper nodes with smaller DRAM modules, relying on high‑speed RDMA fabrics (e.g., InfiniBand, RoCE) to provide the bulk of the memory pool.
  • Cloud‑based HPC services could offer “elastic memory” tiers where users pay for extra remote memory only when needed, with DOLMA handling the migration transparently.

Overall, DOLMA demonstrates that memory disaggregation is not just a theoretical scaling trick—it can be made practical for latency‑sensitive scientific codes.

Limitations & Future Work

  • Irregular access patterns still incur higher overhead; the current model assumes fairly predictable strides.
  • The framework relies on a fast RDMA network; performance on commodity Ethernet may be insufficient.
  • DOLMA currently targets C/C++/Fortran codes compiled with MPI/OpenMP; extending support to GPU‑offloaded workloads remains an open challenge.
  • Future research directions include adaptive learning‑based placement (e.g., reinforcement learning) and integration with container orchestration platforms for seamless cloud deployment.

Authors

  • Haoyu Zheng
  • Shouwei Gao
  • Jie Ren
  • Wenqian Dong

Paper Information

  • arXiv ID: 2512.02300v1
  • Categories: cs.DC
  • Published: December 2, 2025
  • PDF: Download PDF