[Paper] IOMMU Support for Virtual-Address Remote DMA in an ARMv8 environment

Published: November 24, 2025 at 11:11 AM EST
4 min read

Source: arXiv - 2511.19258v2

Overview

This thesis demonstrates that ARM’s System Memory Management Unit (SMMU)—the platform’s I/O MMU—can be used to translate virtual addresses for remote DMA transfers, a capability required by the Unimem “virtualized global address space” vision. By building and exercising custom Linux kernel modules on a Xilinx Zynq UltraScale+ MPSoC, the author proves that DMA engines in both the processing system (PS) and programmable logic (PL) can safely and transparently operate on virtual addresses.

Key Contributions

  • First‑hand validation of SMMU‑based virtual‑address DMA on a real ARMv8 MPSoC platform.
  • Kernel‑module test harnesses that program SMMU translation entries and verify end‑to‑end DMA flows from both PS and PL.
  • Dynamic page‑table sharing: a module that points the SMMU at a user‑space page table, eliminating the need for static virtual‑to‑physical mappings.
  • Comprehensive documentation of the SMMU programming model, filling gaps in the sparse Linux upstream docs.

Methodology

  1. Platform selection – The Xilinx Zynq UltraScale+ MPSoC was chosen because it integrates an ARMv8 CPU cluster (PS) with programmable logic (PL) and an on‑chip SMMU.
  2. Kernel‑module development – Two custom modules were written:
    • Mapping module: inserts explicit virtual‑to‑physical entries into the SMMU and triggers a DMA write to the virtual address (a hedged sketch of this flow appears at the end of this section).
    • Dynamic‑translation module: configures the SMMU to use the page‑table base of a user process, letting the hardware walk the page tables on‑the‑fly.
  3. DMA test patterns – Simple memory buffers were allocated, filled with known data, and transferred via the PS‑DMA engine or a PL‑based DMA IP. After each transfer, the destination buffer was inspected to confirm correct data movement.
  4. Verification – The author logged SMMU translation faults, inspected the IOMMU page‑walk hardware counters, and compared physical addresses derived from /proc/iomem with those observed during DMA.

All experiments were performed on a single node; the multi‑node coherence aspect of Unimem is left for later work.
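
The summary does not reproduce the thesis's module source, so the following is a minimal sketch of what the mapping module's core flow might look like against the generic Linux IOMMU API. The smmu_demo_* names, the fixed DEMO_IOVA, and the test‑pattern check are assumptions for illustration, and exact signatures (for example iommu_map()'s GFP argument, or iommu_domain_alloc() versus newer domain allocators) vary across kernel versions.

```c
/*
 * Hedged sketch of a "mapping module" flow: create an IOMMU domain for a
 * DMA-capable device, install one IOVA->PA translation, and check a test
 * pattern after the transfer. How dma_dev is obtained (e.g., in a platform
 * driver probe) and how the PS/PL DMA engine is programmed are omitted.
 */
#include <linux/module.h>
#include <linux/iommu.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/device.h>
#include <linux/io.h>

#define DEMO_IOVA	0x40000000UL	/* I/O virtual address seen by the DMA engine (assumed) */
#define DEMO_LEN	PAGE_SIZE

static int smmu_demo_map_and_test(struct device *dma_dev)
{
	struct iommu_domain *dom;
	void *buf;
	int ret;

	/* Buffer the DMA engine will ultimately write into / read from. */
	buf = kzalloc(DEMO_LEN, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;
	memset(buf, 0xA5, DEMO_LEN);		/* known test pattern */

	/* Allocate an unmanaged translation domain behind the SMMU. */
	dom = iommu_domain_alloc(dma_dev->bus);
	if (!dom) {
		ret = -ENODEV;
		goto out_free;
	}

	ret = iommu_attach_device(dom, dma_dev);
	if (ret)
		goto out_domain;

	/* Install a single IOVA -> physical mapping for the buffer. */
	ret = iommu_map(dom, DEMO_IOVA, virt_to_phys(buf), DEMO_LEN,
			IOMMU_READ | IOMMU_WRITE, GFP_KERNEL);
	if (ret)
		goto out_detach;

	/*
	 * Here the PS or PL DMA engine would be programmed with DEMO_IOVA as
	 * its source/destination; after completion the test compares buf
	 * against the expected pattern and checks dmesg for SMMU translation
	 * faults (platform-specific details omitted).
	 */

	iommu_unmap(dom, DEMO_IOVA, DEMO_LEN);
out_detach:
	iommu_detach_device(dom, dma_dev);
out_domain:
	iommu_domain_free(dom);
out_free:
	kfree(buf);
	return ret;
}
```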

Results & Findings

| Scenario | How the address was supplied | SMMU behavior | Outcome |
| --- | --- | --- | --- |
| PS‑initiated DMA with static mapping | Virtual address → pre‑programmed SMMU entry | Translation succeeded, no faults | Data landed correctly in target buffer |
| PL‑initiated DMA with static mapping | Virtual address → pre‑programmed SMMU entry | Same translation path as PS | Correct data transfer confirmed |
| PL‑initiated DMA with dynamic page‑table pointer | Virtual address → user‑process page table | SMMU performed on‑the‑fly page walks | Successful transfer without any manual mapping |

The experiments prove that the SMMU can act as a true IOMMU for virtual‑address remote DMA, handling both CPU‑side and FPGA‑side DMA engines. The dynamic approach shows that a single SMMU configuration can serve an entire user process, dramatically simplifying software stacks.
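
The thesis's dynamic‑translation module points the SMMU at a user process's page table directly; the closest mainline Linux analogue is the Shared Virtual Addressing (SVA) interface. The sketch below shows that mainline path, not the thesis's own module, and signatures differ between kernel versions (older kernels take an extra drvdata argument in iommu_sva_bind_device(), for example).

```c
/*
 * Hedged sketch: letting a device's DMA walk a user process's own page
 * table via the mainline Linux SVA (Shared Virtual Addressing) API.
 * This approximates the thesis's dynamic-translation module; exact
 * signatures vary between kernel versions.
 */
#include <linux/iommu.h>
#include <linux/sched.h>
#include <linux/device.h>
#include <linux/err.h>

static int smmu_demo_share_process_pgtable(struct device *dma_dev)
{
	struct iommu_sva *handle;
	u32 pasid;

	/* The device/SMMU pair must advertise SVA support. */
	if (iommu_dev_enable_feature(dma_dev, IOMMU_DEV_FEAT_SVA))
		return -ENODEV;

	/*
	 * Bind the current process's mm to the device: the SMMU now walks
	 * this process's page tables for DMA tagged with the returned PASID,
	 * so user-space virtual addresses can be handed to the DMA engine
	 * as-is, with no per-buffer mapping step.
	 */
	handle = iommu_sva_bind_device(dma_dev, current->mm);
	if (IS_ERR(handle)) {
		iommu_dev_disable_feature(dma_dev, IOMMU_DEV_FEAT_SVA);
		return PTR_ERR(handle);
	}

	pasid = iommu_sva_get_pasid(handle);

	/*
	 * ... program the PS or PL DMA engine with a user-space virtual
	 * address plus this PASID, run the transfer, then tear down:
	 */
	iommu_sva_unbind_device(handle);
	iommu_dev_disable_feature(dma_dev, IOMMU_DEV_FEAT_SVA);
	return 0;
}
```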

Practical Implications

  • Simplified programming model – Developers can issue DMA to virtual pointers just like regular memory accesses, removing the need for manual pinning and physical‑address bookkeeping (a hypothetical user‑space sketch follows this list).
  • Security & isolation – Because the SMMU enforces per‑process page tables, rogue DMA cannot accidentally access another process’s memory, aligning with modern zero‑trust designs.
  • Accelerator integration – FPGA‑based accelerators (e.g., AI inference engines) can be wired to the PL DMA engine and operate directly on user‑space buffers, cutting latency and CPU overhead.
  • Foundation for distributed shared memory – The ability to translate virtual addresses across nodes is a prerequisite for systems like Unimem, which aim to present a single address space over a cluster of heterogeneous compute nodes.
  • Tooling boost – The kernel modules and documentation created in this work can serve as a starting point for open‑source drivers that need SMMU support (e.g., RDMA NICs, high‑speed storage).
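
To make the simplified programming model concrete, here is a hypothetical user‑space view assuming the kernel module exposes a character device and an ioctl that accepts a raw pointer. The /dev/smmu-demo node, the request layout, and the DEMO_DMA_WRITE ioctl are illustrative assumptions, not interfaces defined by the thesis.

```c
/*
 * Hypothetical user-space view: the application hands an ordinary malloc'd
 * pointer to a DMA ioctl and relies on the SMMU to resolve it. The device
 * node and ioctl are assumptions made for illustration only.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <unistd.h>

struct demo_dma_req {			/* hypothetical request layout */
	void   *dst;			/* user-space virtual destination */
	size_t	len;
};

#define DEMO_DMA_WRITE _IOW('D', 1, struct demo_dma_req)	/* hypothetical */

int main(void)
{
	size_t len = 4096;
	char *buf = malloc(len);	/* plain heap memory, no pinning by the app */
	struct demo_dma_req req = { .dst = buf, .len = len };

	int fd = open("/dev/smmu-demo", O_RDWR);
	if (fd < 0 || !buf) {
		perror("setup");
		return 1;
	}

	/* Ask the module to DMA a test pattern into our virtual address. */
	if (ioctl(fd, DEMO_DMA_WRITE, &req) < 0) {
		perror("DEMO_DMA_WRITE");
		return 1;
	}

	printf("first byte after DMA: 0x%02x\n", (unsigned char)buf[0]);
	close(fd);
	free(buf);
	return 0;
}
```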

Limitations & Future Work

  • Single‑node scope – The thesis validates SMMU behavior only on one MPSoC; scaling to multi‑node coherence (the ultimate Unimem goal) remains untested.
  • Feature coverage – Advanced SMMU capabilities such as stream IDs, address‑space identifiers, and fault‑handling callbacks were not explored.
  • Performance analysis – The work focuses on correctness; quantitative latency/throughput impact of virtual‑address translation versus traditional physical DMA is left for later benchmarking.
  • Portability – The custom modules target Xilinx’s Zynq UltraScale+; adapting the approach to other ARMv8 platforms (e.g., Qualcomm, NXP) may require additional driver work.

Future research can extend the proof‑of‑concept to a full distributed memory system, evaluate performance trade‑offs, and integrate the SMMU‑enabled DMA path into mainstream Linux drivers.

Authors

  • Antonis Psistakis

Paper Information

  • arXiv ID: 2511.19258v2
  • Categories: cs.DC, cs.AR
  • Published: November 24, 2025
  • PDF: https://arxiv.org/pdf/2511.19258v2
