[Paper] IOMMU Support for Virtual-Address Remote DMA in an ARMv8 environment
Source: arXiv - 2511.19258v2
Overview
This thesis demonstrates that ARM’s System Memory Management Unit (SMMU)—the platform’s I/O MMU—can be used to translate virtual addresses for remote DMA transfers, a capability required by the Unimem “virtualized global address space” vision. By building and exercising custom Linux kernel modules on a Xilinx Zynq UltraScale+ MPSoC, the author proves that DMA engines in both the processing system (PS) and programmable logic (PL) can safely and transparently operate on virtual addresses.
Key Contributions
- First‑hand validation of SMMU‑based virtual‑address DMA on a real ARMv8 MPSoC platform.
- Kernel‑module test harnesses that program SMMU translation entries and verify end‑to‑end DMA flows from both PS and PL.
- Dynamic page‑table sharing: a module that points the SMMU at a user‑space page table, eliminating the need for static virtual‑to‑physical mappings.
- Comprehensive documentation of the SMMU programming model, filling gaps in the sparse Linux upstream docs.
Methodology
- Platform selection – The Xilinx Zynq UltraScale+ MPSoC was chosen because it integrates an ARMv8 CPU cluster (PS) with programmable logic (PL) and an on‑chip SMMU.
- Kernel‑module development – Two custom modules were written:
- Mapping module: inserts explicit virtual‑to‑physical entries into the SMMU and triggers a DMA write to the virtual address.
- Dynamic‑translation module: configures the SMMU to use the page‑table base of a user process, letting the hardware walk the page tables on‑the‑fly.
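The difference between the two modules can be illustrated with a minimal software model (illustrative Python only; the real modules are Linux kernel C code, and the actual table formats are defined by the ARM SMMU architecture): a static lookup of pre-installed VA→PA entries versus an on-the-fly walk of a two-level page table.

```python
# Toy model of the two translation modes the modules exercise.
# All structures here are simplified stand-ins for SMMU state.

PAGE_SHIFT = 12
PAGE_SIZE = 1 << PAGE_SHIFT
PAGE_MASK = PAGE_SIZE - 1

def translate_static(entries, va):
    """Mapping-module style: look up an explicitly installed VA->PA entry."""
    page = entries.get(va & ~PAGE_MASK)
    if page is None:
        return None  # would surface as an SMMU translation fault
    return page | (va & PAGE_MASK)

def translate_walk(l1_table, va):
    """Dynamic-module style: walk a two-level table from a page-table base,
    as the SMMU hardware does when handed the process's table pointer."""
    idx1 = (va >> (PAGE_SHIFT + 9)) & 0x1FF  # 9 index bits per level (4 KiB granule)
    idx0 = (va >> PAGE_SHIFT) & 0x1FF
    l0_table = l1_table.get(idx1)
    if l0_table is None or idx0 not in l0_table:
        return None
    return l0_table[idx0] | (va & PAGE_MASK)

# Static: one pre-programmed page mapping 0x40000000 -> 0x80000000
static_map = {0x40000000: 0x80000000}
assert translate_static(static_map, 0x40000123) == 0x80000123

# Dynamic: the same mapping expressed as a two-level table
va = 0x40000000
l1 = {(va >> 21) & 0x1FF: {(va >> 12) & 0x1FF: 0x80000000}}
assert translate_walk(l1, 0x40000123) == 0x80000123
```

The practical distinction is who maintains the table: in the static case the kernel module installs each entry by hand, while in the dynamic case the hardware reuses the mappings the OS already maintains for the process.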
- DMA test patterns – Simple memory buffers were allocated, filled with known data, and transferred via the PS‑DMA engine or a PL‑based DMA IP. After each transfer, the destination buffer was inspected to confirm correct data movement.
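The fill/transfer/verify pattern above can be sketched as follows (a pure-software stand-in: the thesis performs the copy step with the PS-DMA engine or a PL DMA IP, not `memcpy`-style software copies):

```python
import os

def run_dma_test(size=4096):
    """Fill a source buffer with known data, 'transfer' it, and inspect
    the destination, mirroring the test flow described in the thesis."""
    src = bytearray(os.urandom(size))  # source buffer with known data
    dst = bytearray(size)              # zeroed destination buffer
    dst[:] = src                       # software stand-in for the DMA transfer
    return dst == src                  # verify correct data movement

assert run_dma_test()
```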
- Verification – The author logged SMMU translation faults, inspected the IOMMU page-walk hardware counters, and compared physical addresses derived from /proc/iomem with those observed during DMA.
All experiments were performed on a single node; the multi‑node coherence aspect of Unimem is left for later work.
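The /proc/iomem cross-check in the verification step amounts to parsing that file's "start-end : name" lines and matching the resulting ranges against the addresses the DMA engine actually touched. A minimal parser (sample input shown; real entries require root to display non-zero addresses, and the region names on a Zynq board will differ):

```python
def parse_iomem(text):
    """Parse /proc/iomem-format lines ("start-end : name") into
    (start, end, name) tuples for range comparison."""
    regions = []
    for line in text.splitlines():
        rng, _, name = line.partition(" : ")
        start, _, end = rng.strip().partition("-")
        regions.append((int(start, 16), int(end, 16), name.strip()))
    return regions

sample = "fd000000-fdffffff : System RAM\n  fd100000-fd1fffff : reserved"
regions = parse_iomem(sample)
assert regions[0] == (0xFD000000, 0xFDFFFFFF, "System RAM")
assert regions[1][2] == "reserved"
```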
Results & Findings
| Scenario | How the address was supplied | SMMU behavior | Outcome |
|---|---|---|---|
| PS‑initiated DMA with static mapping | Virtual address → pre‑programmed SMMU entry | Translation succeeded, no faults | Data landed correctly in target buffer |
| PL‑initiated DMA with static mapping | Virtual address → pre‑programmed SMMU entry | Same translation path as PS | Correct data transfer confirmed |
| PL‑initiated DMA with dynamic page‑table pointer | Virtual address → user‑process page table | SMMU performed on‑the‑fly page walks | Successful transfer without any manual mapping |
The experiments demonstrate that the SMMU can act as a true IOMMU for virtual‑address remote DMA, handling both CPU‑side and FPGA‑side DMA engines. The dynamic approach shows that a single SMMU configuration can serve an entire user process, dramatically simplifying software stacks.
Practical Implications
- Simplified programming model – Developers can issue DMA to virtual pointers just like regular memory accesses, removing the need for manual pinning and physical address bookkeeping.
- Security & isolation – Because the SMMU enforces per‑process page tables, rogue DMA cannot accidentally access another process’s memory, aligning with modern zero‑trust designs.
- Accelerator integration – FPGA‑based accelerators (e.g., AI inference engines) can be wired to the PL DMA engine and operate directly on user‑space buffers, cutting latency and CPU overhead.
- Foundation for distributed shared memory – The ability to translate virtual addresses across nodes is a prerequisite for systems like Unimem, which aim to present a single address space over a cluster of heterogeneous compute nodes.
- Tooling boost – The kernel modules and documentation created in this work can serve as a starting point for open‑source drivers that need SMMU support (e.g., RDMA NICs, high‑speed storage).
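The programming-model simplification in the first bullet can be shown schematically (every class and function below is a hypothetical stand-in for illustration, not a real DMA or kernel API):

```python
class FakeDmaEngine:
    """Minimal stand-in recording what a driver would program."""
    def __init__(self):
        self.descriptors = []
        self.started_with = None

    def append(self, addr):
        self.descriptors.append(addr)

    def start(self, buf=None):
        self.started_with = buf

def dma_physical(buf_pages, virt_to_phys, engine):
    """Traditional flow: translate each (pinned) page's VA and program
    the physical addresses into the engine by hand."""
    for va in buf_pages:
        engine.append(virt_to_phys(va))  # manual VA->PA bookkeeping
    engine.start()

def dma_virtual(buf, engine):
    """SMMU-backed flow: hand the engine the virtual pointer directly;
    translation happens in hardware at transfer time."""
    engine.start(buf)

eng = FakeDmaEngine()
dma_physical([0x1000, 0x2000], lambda va: va + 0x80000000, eng)
assert eng.descriptors == [0x80001000, 0x80002000]

eng2 = FakeDmaEngine()
dma_virtual(0x1000, eng2)
assert eng2.started_with == 0x1000 and eng2.descriptors == []
```

The point of the contrast is that the per-page bookkeeping in `dma_physical` disappears entirely once the SMMU holds the process's page tables.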
Limitations & Future Work
- Single‑node scope – The thesis validates SMMU behavior only on one MPSoC; scaling to multi‑node coherence (the ultimate Unimem goal) remains untested.
- Feature coverage – Advanced SMMU capabilities such as stream IDs, address‑space identifiers, and fault‑handling callbacks were not explored.
- Performance analysis – The work focuses on correctness; quantitative latency/throughput impact of virtual‑address translation versus traditional physical DMA is left for later benchmarking.
- Portability – The custom modules target Xilinx’s Zynq UltraScale+; adapting the approach to other ARMv8 platforms (e.g., Qualcomm, NXP) may require additional driver work.
Future research can extend the proof‑of‑concept to a full distributed memory system, evaluate performance trade‑offs, and integrate the SMMU‑enabled DMA path into mainstream Linux drivers.
Authors
- Antonis Psistakis
Paper Information
- arXiv ID: 2511.19258v2
- Categories: cs.DC, cs.AR
- Published: November 24, 2025