[Paper] IOMMU Support for Virtual-Address Remote DMA in an ARMv8 environment
Source: arXiv - 2511.19258v2
Overview
This thesis demonstrates that ARM’s System Memory Management Unit (SMMU)—the platform’s I/O MMU—can be used to translate virtual addresses for remote DMA transfers, a capability required by the Unimem “virtualized global address space” vision. By building and exercising custom Linux kernel modules on a Xilinx Zynq UltraScale+ MPSoC, the author proves that DMA engines in both the processing system (PS) and programmable logic (PL) can safely and transparently operate on virtual addresses.
Key Contributions
- First‑hand validation of SMMU‑based virtual‑address DMA on a real ARMv8 MPSoC platform.
- Kernel‑module test harnesses that program SMMU translation entries and verify end‑to‑end DMA flows from both PS and PL.
- Dynamic page‑table sharing: a module that points the SMMU at a user‑space page table, eliminating the need for static virtual‑to‑physical mappings.
- Comprehensive documentation of the SMMU programming model, filling gaps in the sparse Linux upstream docs.
Methodology
- Platform selection – The Xilinx Zynq UltraScale+ MPSoC was chosen because it integrates an ARMv8 CPU cluster (PS) with programmable logic (PL) and an on‑chip SMMU.
- Kernel‑module development – Two custom modules were written:
- Mapping module: inserts explicit virtual‑to‑physical entries into the SMMU and triggers a DMA write to the virtual address.
- Dynamic‑translation module: configures the SMMU to use the page‑table base of a user process, letting the hardware walk the page tables on‑the‑fly.
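The difference between the two modules can be illustrated with a minimal software model (illustrative Python only; the real modules are Linux kernel C code, and the actual table formats are defined by the ARM SMMU architecture): a static lookup of pre-installed VA→PA entries versus an on-the-fly walk of a two-level page table.

```python
# Toy model of the two translation modes the modules exercise.
# All structures here are simplified stand-ins for SMMU state.

PAGE_SHIFT = 12
PAGE_SIZE = 1 << PAGE_SHIFT
PAGE_MASK = PAGE_SIZE - 1

def translate_static(entries, va):
    """Mapping-module style: look up an explicitly installed VA->PA entry."""
    page = entries.get(va & ~PAGE_MASK)
    if page is None:
        return None  # would surface as an SMMU translation fault
    return page | (va & PAGE_MASK)

def translate_walk(l1_table, va):
    """Dynamic-module style: walk a two-level table from a page-table base,
    as the SMMU hardware does when handed the process's table pointer."""
    idx1 = (va >> (PAGE_SHIFT + 9)) & 0x1FF  # 9 index bits per level (4 KiB granule)
    idx0 = (va >> PAGE_SHIFT) & 0x1FF
    l0_table = l1_table.get(idx1)
    if l0_table is None or idx0 not in l0_table:
        return None
    return l0_table[idx0] | (va & PAGE_MASK)

# Static: one pre-programmed page mapping 0x40000000 -> 0x80000000
static_map = {0x40000000: 0x80000000}
assert translate_static(static_map, 0x40000123) == 0x80000123

# Dynamic: the same mapping expressed as a two-level table
va = 0x40000000
l1 = {(va >> 21) & 0x1FF: {(va >> 12) & 0x1FF: 0x80000000}}
assert translate_walk(l1, 0x40000123) == 0x80000123
```

The practical distinction is who maintains the table: in the static case the kernel module installs each entry by hand, while in the dynamic case the hardware reuses the mappings the OS already maintains for the process.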
- DMA test patterns – Simple memory buffers were allocated, filled with known data, and transferred via the PS‑DMA engine or a PL‑based DMA IP. After each transfer, the destination buffer was inspected to confirm correct data movement.
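The fill/transfer/verify pattern above can be sketched as follows (a pure-software stand-in: the thesis performs the copy step with the PS-DMA engine or a PL DMA IP, not `memcpy`-style software copies):

```python
import os

def run_dma_test(size=4096):
    """Fill a source buffer with known data, 'transfer' it, and inspect
    the destination, mirroring the test flow described in the thesis."""
    src = bytearray(os.urandom(size))  # source buffer with known data
    dst = bytearray(size)              # zeroed destination buffer
    dst[:] = src                       # software stand-in for the DMA transfer
    return dst == src                  # verify correct data movement

assert run_dma_test()
```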
- Verification – The author logged SMMU translation faults, inspected the IOMMU page-walk hardware counters, and compared physical addresses derived from /proc/iomem with those observed during DMA.
All experiments were performed on a single node; the multi‑node coherence aspect of Unimem is left for later work.
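The /proc/iomem cross-check in the verification step amounts to parsing that file's "start-end : name" lines and matching the resulting ranges against the addresses the DMA engine actually touched. A minimal parser (sample input shown; real entries require root to display non-zero addresses, and the region names on a Zynq board will differ):

```python
def parse_iomem(text):
    """Parse /proc/iomem-format lines ("start-end : name") into
    (start, end, name) tuples for range comparison."""
    regions = []
    for line in text.splitlines():
        rng, _, name = line.partition(" : ")
        start, _, end = rng.strip().partition("-")
        regions.append((int(start, 16), int(end, 16), name.strip()))
    return regions

sample = "fd000000-fdffffff : System RAM\n  fd100000-fd1fffff : reserved"
regions = parse_iomem(sample)
assert regions[0] == (0xFD000000, 0xFDFFFFFF, "System RAM")
assert regions[1][2] == "reserved"
```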
Results & Findings
| Scenario | How the address was supplied | SMMU behavior | Outcome |
|---|---|---|---|
| PS‑initiated DMA with static mapping | Virtual address → pre‑programmed SMMU entry | Translation succeeded, no faults | Data landed correctly in target buffer |
| PL‑initiated DMA with static mapping | Virtual address → pre‑programmed SMMU entry | Same translation path as PS | Correct data transfer confirmed |
| PL‑initiated DMA with dynamic page‑table pointer | Virtual address → user‑process page table | SMMU performed on‑the‑fly page walks | Successful transfer without any manual mapping |
The experiments demonstrate that the SMMU can act as a true IOMMU for virtual‑address remote DMA, handling both CPU‑side and FPGA‑side DMA engines. The dynamic approach shows that a single SMMU configuration can serve an entire user process, dramatically simplifying software stacks.
Practical Implications
- Simplified programming model – Developers can issue DMA to virtual pointers just like regular memory accesses, removing the need for manual pinning and physical address bookkeeping.
- Security & isolation – Because the SMMU enforces per‑process page tables, rogue DMA cannot accidentally access another process’s memory, aligning with modern zero‑trust designs.
- Accelerator integration – FPGA‑based accelerators (e.g., AI inference engines) can be wired to the PL DMA engine and operate directly on user‑space buffers, cutting latency and CPU overhead.
- Foundation for distributed shared memory – The ability to translate virtual addresses across nodes is a prerequisite for systems like Unimem, which aim to present a single address space over a cluster of heterogeneous compute nodes.
- Tooling boost – The kernel modules and documentation created in this work can serve as a starting point for open‑source drivers that need SMMU support (e.g., RDMA NICs, high‑speed storage).
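The programming-model simplification in the first bullet can be shown schematically (every class and function below is a hypothetical stand-in for illustration, not a real DMA or kernel API):

```python
class FakeDmaEngine:
    """Minimal stand-in recording what a driver would program."""
    def __init__(self):
        self.descriptors = []
        self.started_with = None

    def append(self, addr):
        self.descriptors.append(addr)

    def start(self, buf=None):
        self.started_with = buf

def dma_physical(buf_pages, virt_to_phys, engine):
    """Traditional flow: translate each (pinned) page's VA and program
    the physical addresses into the engine by hand."""
    for va in buf_pages:
        engine.append(virt_to_phys(va))  # manual VA->PA bookkeeping
    engine.start()

def dma_virtual(buf, engine):
    """SMMU-backed flow: hand the engine the virtual pointer directly;
    translation happens in hardware at transfer time."""
    engine.start(buf)

eng = FakeDmaEngine()
dma_physical([0x1000, 0x2000], lambda va: va + 0x80000000, eng)
assert eng.descriptors == [0x80001000, 0x80002000]

eng2 = FakeDmaEngine()
dma_virtual(0x1000, eng2)
assert eng2.started_with == 0x1000 and eng2.descriptors == []
```

The point of the contrast is that the per-page bookkeeping in `dma_physical` disappears entirely once the SMMU holds the process's page tables.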
Limitations & Future Work
- Single‑node scope – The thesis validates SMMU behavior only on one MPSoC; scaling to multi‑node coherence (the ultimate Unimem goal) remains untested.
- Feature coverage – Advanced SMMU capabilities such as stream IDs, address‑space identifiers, and fault‑handling callbacks were not explored.
- Performance analysis – The work focuses on correctness; quantitative latency/throughput impact of virtual‑address translation versus traditional physical DMA is left for later benchmarking.
- Portability – The custom modules target Xilinx’s Zynq UltraScale+; adapting the approach to other ARMv8 platforms (e.g., Qualcomm, NXP) may require additional driver work.
Future research can extend the proof‑of‑concept to a full distributed memory system, evaluate performance trade‑offs, and integrate the SMMU‑enabled DMA path into mainstream Linux drivers.
Authors
- Antonis Psistakis
Paper Information
- arXiv ID: 2511.19258v2
- Categories: cs.DC, cs.AR
- Published: November 24, 2025