[Paper] Horizon-LM: A RAM-Centric Architecture for LLM Training

Published: February 4, 2026 at 01:04 PM EST
4 min read
Source: arXiv

Overview

The paper introduces Horizon‑LM, a new training system that flips the traditional GPU‑centric view of large‑language‑model (LLM) training on its head. By treating the host’s RAM as the primary parameter store and using GPUs only as short‑lived compute workers, Horizon‑LM can train models that were previously impossible on a single node, dramatically reducing the reliance on multi‑GPU clusters.

Key Contributions

  • Memory‑centric architecture: Host memory becomes the authoritative parameter repository; GPUs act as transient compute engines.
  • CPU‑master / GPU‑template execution model: Eliminates persistent GPU‑resident model copies and autograd graphs, cutting down on GPU memory pressure.
  • Explicit recomputation & manual gradient propagation: Replaces automatic differentiation with a lightweight, programmer‑controlled pipeline that keeps memory usage bounded to the model’s parameter size.
  • Double‑buffered pipelined engine: Overlaps data movement, forward, and backward passes to keep the GPU busy despite frequent host‑GPU transfers.
  • Scalable single‑node training: Demonstrates training of models up to 120 B parameters on a single NVIDIA H200 GPU in a host equipped with 1.5 TB of RAM.
  • Performance gains: Achieves up to 12.2× higher throughput than DeepSpeed ZeRO‑3 with CPU offloading on a standard A100‑based workstation, while preserving numerical correctness.
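To make the memory-centric idea concrete, here is a minimal sketch of a RAM-resident parameter store with transient device buffers. This is illustrative only and not the paper's actual API: `stream_to_device`, `host_params`, and the plain-numpy "device" compute all stand in for the real host-to-GPU transfer and kernel execution.

```python
import numpy as np

# Illustrative sketch (not Horizon-LM's real API): parameters live
# permanently in host RAM; each layer is "streamed" into a short-lived
# device buffer, used for one matmul, then discarded.

rng = np.random.default_rng(0)

# Authoritative parameter store in host memory: one matrix per layer.
host_params = [rng.standard_normal((64, 64)) * 0.01 for _ in range(4)]

def stream_to_device(layer_weights):
    """Stand-in for a host->GPU copy; returns a transient buffer."""
    return layer_weights.copy()

def forward_step(x):
    for w in host_params:
        device_w = stream_to_device(w)   # transient GPU-resident copy
        x = np.maximum(x @ device_w, 0)  # compute happens on the "device"
        del device_w                     # buffer freed; nothing persists
    return x

out = forward_step(rng.standard_normal((8, 64)))
print(out.shape)  # (8, 64)
```

The key property this models is that peak "device" memory is one layer plus one activation buffer, independent of total model size.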

Methodology

  1. Parameter Store Relocation – All model weights reside in host RAM. The system maintains a single, coherent copy, avoiding the need for each GPU to hold its own shard.
  2. Transient GPU Execution – For each training step, a template of the model is streamed onto the GPU, executed, and then discarded. No persistent autograd graph lives on the device.
  3. Manual Gradient Flow – Instead of relying on the deep learning framework’s automatic differentiation, Horizon‑LM explicitly recomputes activations during the backward pass and manually accumulates gradients in host memory.
  4. Double‑Buffering – Two buffers per GPU stage allow the next micro‑batch to be loaded while the current one is still being processed, hiding PCIe/NVLink transfer latency.
  5. Pipeline Scheduling – The system schedules forward, backward, and weight‑update phases across the buffers so that the GPU is never idle, even though most data lives on the CPU side.
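The double-buffering step above can be sketched with a standard producer-consumer pattern. This is a hedged illustration under simplifying assumptions, not the paper's scheduler: a loader thread stages micro-batches into a two-slot queue (the two buffers) while the consumer processes the current one, so staging latency overlaps with compute.

```python
import queue
import threading
import numpy as np

# Two-slot queue models the two buffers per GPU stage: the loader can
# stage the next micro-batch while the current one is being processed.

rng = np.random.default_rng(0)
NUM_BATCHES = 6
buffers = queue.Queue(maxsize=2)  # two slots = double buffering

def loader():
    for i in range(NUM_BATCHES):
        batch = rng.standard_normal((4, 16))  # simulated host->device copy
        buffers.put((i, batch))               # blocks when both slots are full
    buffers.put(None)                         # sentinel: no more work

def compute():
    results = []
    while True:
        item = buffers.get()
        if item is None:
            break
        i, batch = item
        results.append((i, float(np.square(batch).sum())))  # stand-in forward pass
    return results

t = threading.Thread(target=loader)
t.start()
results = compute()
t.join()
print(len(results))  # 6
```

On real hardware the same overlap is typically achieved with asynchronous copies on separate CUDA streams rather than host threads; the queue here just makes the scheduling constraint visible.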

The overall design is deliberately simple: the CPU orchestrates data movement and gradient aggregation, while the GPU focuses on raw matrix multiplications.
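The manual gradient flow of step 3 can be illustrated with a tiny two-layer network. This is a sketch under assumed names (`host_grads`, `forward`, `backward` are not from the paper): the forward pass stores no activations, the backward pass recomputes them from the saved input, and gradients accumulate in host memory.

```python
import numpy as np

# Minimal sketch of manual gradient propagation with recomputation.
# No autograd graph exists: backward() recomputes the hidden activation
# from the input and writes gradients into a host-resident accumulator.

rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 32)) * 0.1
W2 = rng.standard_normal((32, 8)) * 0.1
host_grads = {"W1": np.zeros_like(W1), "W2": np.zeros_like(W2)}

def forward(x):
    h = np.maximum(x @ W1, 0)  # activation is NOT stored
    return h @ W2

def backward(x, grad_out):
    h = np.maximum(x @ W1, 0)            # recompute the activation
    host_grads["W2"] += h.T @ grad_out   # accumulate in host RAM
    grad_h = (grad_out @ W2.T) * (h > 0) # ReLU mask from recomputed h
    host_grads["W1"] += x.T @ grad_h

x = rng.standard_normal((4, 16))
y = forward(x)
backward(x, np.ones_like(y))
print(host_grads["W1"].shape, host_grads["W2"].shape)
```

The trade is extra FLOPs (one redundant forward per layer) for bounded memory: only inputs and parameters need to outlive the forward pass.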

Results & Findings

| Platform | Host RAM | GPU | Max Model Size Trained | Throughput vs. DeepSpeed ZeRO‑3 |
|---|---|---|---|---|
| NVIDIA H200 (1.5 TB RAM) | 1.5 TB | H200 | 120 B parameters | — |
| NVIDIA A100 (standard workstation) | 256 GB | A100 | 30 B parameters | 12.2× faster |
| NVIDIA A100 (256 GB RAM) | 256 GB | A100 | 45 B parameters | 8.5× faster |

  • Memory predictability: Peak GPU memory never exceeds the theoretical minimum needed for a single micro‑batch, independent of model size.
  • Device utilization: GPU occupancy stays above 85 % across all tested configurations, confirming that the double‑buffered pipeline effectively hides data‑transfer overhead.
  • Numerical fidelity: Training loss curves match those of ZeRO‑3 to within 0.1 % across all experiments, demonstrating that the manual recomputation does not degrade model quality.

Practical Implications

  • Node‑scale fine‑tuning: Researchers and engineers can now perform instruction tuning, alignment, or domain adaptation on 100 B‑scale models without provisioning multi‑node clusters.
  • Cost reduction: By leveraging cheap host RAM (e.g., DDR4/DDR5) instead of expensive GPU memory, organizations can repurpose existing high‑memory servers for LLM work.
  • Simplified stack: Eliminating complex distributed runtimes (e.g., NCCL‑based all‑reduce) reduces operational overhead and debugging complexity.
  • Hardware flexibility: The approach works on any GPU with sufficient PCIe/NVLink bandwidth; even consumer‑grade GPUs become viable for large‑model experiments when paired with ample RAM.
  • Future hardware design: Signals a shift toward “memory‑first” accelerators where the accelerator’s role is pure compute, while the system memory hierarchy handles capacity.

Limitations & Future Work

  • CPU‑GPU bandwidth bound: The approach hinges on high‑throughput interconnects; on systems with slower PCIe links, the double‑buffered pipeline may become a bottleneck.
  • Manual gradient handling: While the paper provides a framework, developers must adapt their training loops to the explicit recomputation model, which could increase code complexity.
  • Scalability beyond a single node: Horizon‑LM focuses on node‑scale training; extending the memory‑centric model to multi‑node clusters (e.g., across multiple servers) remains an open challenge.
  • Support for exotic operators: Custom kernels or non‑tensor operations may need additional engineering to fit into the transient GPU execution model.

The authors suggest exploring adaptive buffering strategies, tighter integration with existing deep‑learning frameworks, and hybrid multi‑node extensions as next steps.

Authors

  • Zhengqing Yuan
  • Lichao Sun
  • Yanfang Ye

Paper Information

  • arXiv ID: 2602.04816v1
  • Categories: cs.OS, cs.CL, cs.DC
  • Published: February 4, 2026