[Paper] Torrent: A Distributed DMA for Efficient and Flexible Point-to-Multipoint Data Movement

Published: December 19, 2025 at 08:57 AM EST
Source: arXiv - 2512.17589v1

Overview

The paper presents Torrent, a distributed Direct‑Memory‑Access (DMA) engine for point‑to‑multipoint (P2MP) data movement on a System‑on‑Chip (SoC) that requires no changes to the underlying Network‑on‑Chip (NoC) hardware or protocol. By chaining destinations together in a logical “linked list” across the NoC, Torrent turns a multicast operation into a series of point‑to‑point transfers that are bandwidth‑efficient and scalable. This addresses a key bottleneck for data‑parallel workloads such as AI inference and training.

Key Contributions

  • Chainwrite Mechanism – Introduces a novel logical‑chain approach that routes a single data stream through an arbitrary number of destinations while preserving the native point‑to‑point nature of the NoC.
  • Hardware‑Lightweight DMA Architecture – Implements Torrent as a distributed DMA block that can be added to existing cores with only ~1.2 % area overhead and ~2.3 % power increase in a 16 nm ASIC.
  • Topology‑Aware Scheduling Algorithms – Provides two algorithms that automatically order the chain to minimize hop count and contention, adapting to any mesh, torus, or custom NoC topology.
  • Comprehensive Evaluation – Demonstrates up to 7.88× speedup over a naïve unicast baseline and superior flexibility compared with network‑layer multicast, validated on RTL simulations, FPGA prototypes, and ASIC synthesis.
  • Scalability Guarantees – Shows that each additional destination costs only 82 clock cycles and 207 µm², so multicast groups can grow to many destinations without exponential hardware cost.
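
To make the chaining idea concrete, the sketch below compares how many NoC link traversals a transfer needs when the source sends a separate copy to every destination versus when each destination forwards the stream to the next one. It is a toy model, not the paper's implementation: the 2D mesh, the XY‑routing hop count, and the function names are all assumptions chosen for illustration.

```python
# Illustrative model only: a 2D mesh with XY routing, where the hop count
# between two nodes is their Manhattan distance. Not taken from the paper.

def hops(a, b):
    """Link traversals between mesh coordinates a and b under XY routing."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def unicast_hops(source, destinations):
    """The source sends a separate copy to every destination."""
    return sum(hops(source, d) for d in destinations)

def chain_hops(source, chain):
    """Each node forwards the stream to the next node in the chain."""
    path = [source] + list(chain)
    return sum(hops(a, b) for a, b in zip(path, path[1:]))

source = (0, 0)
destinations = [(0, 1), (0, 2), (0, 3), (1, 3), (2, 3)]

print(unicast_hops(source, destinations))  # 1 + 2 + 3 + 4 + 5 = 15
print(chain_hops(source, destinations))    # five single-hop forwards = 5
```

In this example the chain uses a third of the link traversals, and every individual transfer is still an ordinary point‑to‑point transaction. The benefit depends heavily on the order in which destinations are chained.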

Methodology

  1. Logical Chain Construction – When a P2MP transfer is requested, Torrent’s controller builds a chain of destination nodes based on the current NoC topology. The data packet is sent from the source to the first destination, which forwards it to the next, and so on, much like a linked list in memory.
  2. Distributed DMA Units – Each core (or memory controller) embeds a small DMA engine that can act as a source, intermediate forwarder, or sink, allowing the chain to be formed entirely in software/firmware without hardware modifications to routers.
  3. Scheduling Algorithms – Two algorithms choose the chain order, which is programmed into the DMA units before the transfer starts:
    • Greedy Hop‑Minimizer: selects the next destination that adds the fewest additional hops.
    • Load‑Balanced Planner: considers both hop count and current router utilization to avoid hotspots.
  4. Prototype & Synthesis – The authors implemented Torrent in RTL, mapped it onto a Xilinx FPGA for functional validation, and performed ASIC synthesis in 16 nm to measure area, power, and timing. Synthetic benchmarks (random multicast patterns) and real AI workloads (tensor reshapes, weight broadcasting) were used to quantify performance.
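
The chain‑ordering step can be sketched in a few lines. The code below is a hypothetical illustration of the two ideas described above, not the authors' algorithms: it assumes a 2D mesh with Manhattan‑distance hop counts, a per‑node utilization value in [0, 1] as the load metric, and a made‑up weight `alpha` that trades hops against load. With `alpha=0` it reduces to a plain nearest‑next greedy ordering.

```python
# Hypothetical sketch: names, the mesh model, and the cost function are
# assumptions for illustration, not the paper's implementation.

def hops(a, b):
    """Manhattan distance between two mesh coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def order_chain(source, destinations, load=None, alpha=0.0):
    """Greedy nearest-next ordering of the destination chain.

    load  : optional dict mapping node -> utilization in [0, 1] (assumed metric)
    alpha : weight of the load term; alpha=0 gives pure hop minimization
    """
    load = load or {}
    remaining = list(destinations)
    chain, current = [], source
    while remaining:
        # Pick the cheapest next node; ties resolve to the earliest listed.
        nxt = min(remaining,
                  key=lambda d: hops(current, d) + alpha * load.get(d, 0.0))
        remaining.remove(nxt)
        chain.append(nxt)
        current = nxt
    return chain

def chain_length(source, chain):
    """Total hops travelled along the chain, starting at the source."""
    path = [source] + list(chain)
    return sum(hops(a, b) for a, b in zip(path, path[1:]))

source = (0, 0)
dests = [(3, 3), (0, 1), (2, 0), (1, 1)]

greedy = order_chain(source, dests)
print(chain_length(source, dests))   # order as given: 16 hops
print(greedy)                        # [(0, 1), (1, 1), (2, 0), (3, 3)]
print(chain_length(source, greedy))  # 8 hops

# Load-aware variant: a busy node at (1, 1) is pushed to the end of the chain.
balanced = order_chain(source, dests, load={(1, 1): 0.9}, alpha=4.0)
print(balanced)                      # [(0, 1), (2, 0), (3, 3), (1, 1)]
```

A greedy ordering is cheap to compute but not guaranteed to be optimal, since finding the shortest chain is a variant of the shortest Hamiltonian path problem.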

Results & Findings

| Metric | Baseline (Unicast) | Network‑Layer Multicast | Torrent |
|---|---|---|---|
| Speedup | | 2.1–3.5× | up to 7.88× |
| Area Overhead | | – (requires router changes) | 1.2 % of total chip area |
| Power Overhead | | – (extra router logic) | 2.3 % of total chip power |
| Latency per Destination | 150 CC* | 120 CC* | 82 CC (fixed) |
| Scalability | Linear with #destinations | Limited by multicast tree depth | Unlimited destinations, constant per‑dest cost |

*CC = clock cycles, measured on a 200 MHz reference NoC.

The results show that Torrent not only outperforms traditional unicast replication but also beats specialized multicast NoCs, while adding only about 1.2 % area and 2.3 % power. The scheduling algorithms reduced average chain length by ~15 % compared with a naïve ordering, translating directly into lower latency and energy.
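
For a sense of scale, the per‑destination cycle counts reported above can be converted to wall‑clock time at the 200 MHz reference clock. The snippet is plain arithmetic on the reported figures; the helper name is ours.

```python
CLOCK_HZ = 200e6  # 200 MHz reference NoC clock from the results footnote

def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a clock-cycle count to nanoseconds."""
    return cycles * 1e9 / clock_hz

per_destination_cc = [
    ("Unicast baseline", 150),
    ("Network-layer multicast", 120),
    ("Torrent", 82),
]

for name, cc in per_destination_cc:
    print(f"{name}: {cc} CC = {cycles_to_ns(cc):.0f} ns")
```

At 5 ns per cycle, Torrent's 82 cycles correspond to 410 ns per destination, against 750 ns for the unicast baseline and 600 ns for network‑layer multicast.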

Practical Implications

  • AI Accelerators – Weight and activation broadcasting, common in CNNs and Transformers, can be handled by a single Torrent DMA launch, cutting memory traffic and freeing bandwidth for compute.
  • Edge SoCs – Devices with strict area and power budgets (e.g., smartphones, IoT gateways) can adopt Torrent without redesigning their NoC, gaining multicast capability at a small area and power cost.
  • Software‑Defined Multicast – Since the chain is built at runtime, developers can dynamically adjust groups based on workload characteristics, enabling adaptive data distribution strategies.
  • Legacy Compatibility – Existing IP blocks can be retrofitted with the lightweight DMA, making Torrent a drop‑in upgrade path for current silicon generations.
  • Energy Savings – Fewer packet injections and reduced router contention lower dynamic power, which is especially valuable in data‑center accelerators where energy per operation is a key metric.

Limitations & Future Work

  • Chain Latency Accumulation – While the per‑destination overhead is small, very long chains (hundreds of nodes) could still introduce noticeable end‑to‑end latency; hierarchical chaining could mitigate this.
  • Topology Dependence – The scheduling algorithms assume knowledge of static NoC topology; dynamic re‑routing or irregular topologies may require more sophisticated heuristics.
  • Fault Tolerance – A broken node in the chain would halt the entire multicast; future extensions could incorporate redundant paths or recovery mechanisms.
  • Software Tooling – The current prototype relies on manual chain generation; integrating Torrent into compiler or runtime libraries (e.g., TVM, LLVM) would streamline adoption.
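
The chain‑latency concern can be estimated with a rough calculation. The sketch below assumes that the 82‑cycle per‑destination overhead accumulates linearly along the chain, and it models hierarchical chaining as a hypothetical two‑level scheme in which a top‑level chain visits one head node per group and each head then feeds its own sub‑chain. Neither the linear model nor the two‑level scheme comes from the paper; both are assumptions for illustration.

```python
import math

PER_DEST_CC = 82   # per-destination overhead reported in the summary
CLOCK_HZ = 200e6   # reference clock from the results footnote

def flat_chain_cc(n):
    """Cycles until the last of n chained destinations is reached,
    assuming the per-destination overhead adds up linearly."""
    return n * PER_DEST_CC

def two_level_cc(n, groups):
    """Assumed two-level scheme: a top chain visits one head per group,
    and each head forwards down its own sub-chain. The worst case is the
    last head plus the remaining nodes of its group."""
    group_size = math.ceil(n / groups)
    return (groups + group_size - 1) * PER_DEST_CC

def to_us(cycles):
    """Convert cycles to microseconds at the reference clock."""
    return cycles * 1e6 / CLOCK_HZ

n = 256
print(flat_chain_cc(n), "cycles,", round(to_us(flat_chain_cc(n)), 2), "us")
print(two_level_cc(n, 16), "cycles,", round(to_us(two_level_cc(n, 16)), 2), "us")
```

Under these assumptions, a flat chain of 256 destinations takes 20992 cycles to reach the last node, while 16 groups of 16 take 2542 cycles. The model ignores the extra injections and contention a hierarchical scheme would add.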

Overall, Torrent opens a pragmatic path to high‑performance, flexible multicast on today’s NoC‑based SoCs, offering a compelling blend of speed, scalability, and minimal hardware impact.

Authors

  • Yunhao Deng
  • Fanchen Kong
  • Xiaoling Yi
  • Ryan Antonio
  • Marian Verhelst

Paper Information

  • arXiv ID: 2512.17589v1
  • Categories: cs.AR, cs.DC
  • Published: December 19, 2025