[Paper] Horizon-LM: A RAM-Centric Architecture for LLM Training

Published: February 4, 2026 at 01:04 PM EST
4 min read
Source: arXiv

Overview

The paper introduces Horizon‑LM, a new training system that flips the traditional GPU‑centric view of large‑language‑model (LLM) training on its head. By treating the host’s RAM as the primary parameter store and using GPUs only as short‑lived compute workers, Horizon‑LM can train models that were previously impossible on a single node, dramatically reducing the reliance on multi‑GPU clusters.

Key Contributions

  • Memory‑centric architecture: Host memory becomes the authoritative parameter repository; GPUs act as transient compute engines.
  • CPU‑master / GPU‑template execution model: Eliminates persistent GPU‑resident model copies and autograd graphs, cutting down on GPU memory pressure.
  • Explicit recomputation & manual gradient propagation: Replaces automatic differentiation with a lightweight, programmer‑controlled pipeline that keeps memory usage bounded to the model’s parameter size.
  • Double‑buffered pipelined engine: Overlaps data movement, forward, and backward passes to keep the GPU busy despite frequent host‑GPU transfers.
  • Scalable single‑node training: Demonstrates training of models up to 120 B parameters on a single NVIDIA H200 GPU in a host equipped with 1.5 TB of RAM.
  • Performance gains: Achieves up to 12.2× higher throughput than DeepSpeed ZeRO‑3 with CPU offloading on a standard A100‑based workstation, while preserving numerical correctness.
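To make the memory-centric idea concrete, here is a minimal sketch of a RAM-resident parameter store with transient device buffers. This is illustrative only and not the paper's actual API: `stream_to_device`, `host_params`, and the plain-numpy "device" compute all stand in for the real host-to-GPU transfer and kernel execution.

```python
import numpy as np

# Illustrative sketch (not Horizon-LM's real API): parameters live
# permanently in host RAM; each layer is "streamed" into a short-lived
# device buffer, used for one matmul, then discarded.

rng = np.random.default_rng(0)

# Authoritative parameter store in host memory: one matrix per layer.
host_params = [rng.standard_normal((64, 64)) * 0.01 for _ in range(4)]

def stream_to_device(layer_weights):
    """Stand-in for a host->GPU copy; returns a transient buffer."""
    return layer_weights.copy()

def forward_step(x):
    for w in host_params:
        device_w = stream_to_device(w)   # transient GPU-resident copy
        x = np.maximum(x @ device_w, 0)  # compute happens on the "device"
        del device_w                     # buffer freed; nothing persists
    return x

out = forward_step(rng.standard_normal((8, 64)))
print(out.shape)  # (8, 64)
```

The key property this models is that peak "device" memory is one layer plus one activation buffer, independent of total model size.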

Methodology

  1. Parameter Store Relocation – All model weights reside in host RAM. The system maintains a single, coherent copy, avoiding the need for each GPU to hold its own shard.
  2. Transient GPU Execution – For each training step, a template of the model is streamed onto the GPU, executed, and then discarded. No persistent autograd graph lives on the device.
  3. Manual Gradient Flow – Instead of relying on the deep learning framework’s automatic differentiation, Horizon‑LM explicitly recomputes activations during the backward pass and manually accumulates gradients in host memory.
  4. Double‑Buffering – Two buffers per GPU stage allow the next micro‑batch to be loaded while the current one is still being processed, hiding PCIe/NVLink transfer latency.
  5. Pipeline Scheduling – The system schedules forward, backward, and weight‑update phases across the buffers so that the GPU is never idle, even though most data lives on the CPU side.
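The double-buffering step above can be sketched with a standard producer-consumer pattern. This is a hedged illustration under simplifying assumptions, not the paper's scheduler: a loader thread stages micro-batches into a two-slot queue (the two buffers) while the consumer processes the current one, so staging latency overlaps with compute.

```python
import queue
import threading
import numpy as np

# Two-slot queue models the two buffers per GPU stage: the loader can
# stage the next micro-batch while the current one is being processed.

rng = np.random.default_rng(0)
NUM_BATCHES = 6
buffers = queue.Queue(maxsize=2)  # two slots = double buffering

def loader():
    for i in range(NUM_BATCHES):
        batch = rng.standard_normal((4, 16))  # simulated host->device copy
        buffers.put((i, batch))               # blocks when both slots are full
    buffers.put(None)                         # sentinel: no more work

def compute():
    results = []
    while True:
        item = buffers.get()
        if item is None:
            break
        i, batch = item
        results.append((i, float(np.square(batch).sum())))  # stand-in forward pass
    return results

t = threading.Thread(target=loader)
t.start()
results = compute()
t.join()
print(len(results))  # 6
```

On real hardware the same overlap is typically achieved with asynchronous copies on separate CUDA streams rather than host threads; the queue here just makes the scheduling constraint visible.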

The overall design is deliberately simple: the CPU orchestrates data movement and gradient aggregation, while the GPU focuses on raw matrix multiplications.
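The manual gradient flow of step 3 can be illustrated with a tiny two-layer network. This is a sketch under assumed names (`host_grads`, `forward`, `backward` are not from the paper): the forward pass stores no activations, the backward pass recomputes them from the saved input, and gradients accumulate in host memory.

```python
import numpy as np

# Minimal sketch of manual gradient propagation with recomputation.
# No autograd graph exists: backward() recomputes the hidden activation
# from the input and writes gradients into a host-resident accumulator.

rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 32)) * 0.1
W2 = rng.standard_normal((32, 8)) * 0.1
host_grads = {"W1": np.zeros_like(W1), "W2": np.zeros_like(W2)}

def forward(x):
    h = np.maximum(x @ W1, 0)  # activation is NOT stored
    return h @ W2

def backward(x, grad_out):
    h = np.maximum(x @ W1, 0)            # recompute the activation
    host_grads["W2"] += h.T @ grad_out   # accumulate in host RAM
    grad_h = (grad_out @ W2.T) * (h > 0) # ReLU mask from recomputed h
    host_grads["W1"] += x.T @ grad_h

x = rng.standard_normal((4, 16))
y = forward(x)
backward(x, np.ones_like(y))
print(host_grads["W1"].shape, host_grads["W2"].shape)
```

The trade is extra FLOPs (one redundant forward per layer) for bounded memory: only inputs and parameters need to outlive the forward pass.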

Results & Findings

| Platform | Host RAM | GPU | Max Model Size Trained | Throughput vs. DeepSpeed ZeRO‑3 |
|---|---|---|---|---|
| NVIDIA H200 (1.5 TB RAM) | 1.5 TB | H200 | 120 B parameters | — |
| NVIDIA A100 (standard workstation) | 256 GB | A100 | 30 B parameters | 12.2× faster |
| NVIDIA A100 (256 GB RAM) | 256 GB | A100 | 45 B parameters | 8.5× faster |

  • Memory predictability: Peak GPU memory never exceeds the theoretical minimum needed for a single micro‑batch, independent of model size.
  • Device utilization: GPU occupancy stays above 85 % across all tested configurations, confirming that the double‑buffered pipeline effectively hides data‑transfer overhead.
  • Numerical fidelity: Training loss curves match those of ZeRO‑3 to within 0.1 % across all experiments, demonstrating that the manual recomputation does not degrade model quality.

Practical Implications

  • Node‑scale fine‑tuning: Researchers and engineers can now perform instruction tuning, alignment, or domain adaptation on 100 B‑scale models without provisioning multi‑node clusters.
  • Cost reduction: By leveraging cheap host RAM (e.g., DDR4/DDR5) instead of expensive GPU memory, organizations can repurpose existing high‑memory servers for LLM work.
  • Simplified stack: Eliminating complex distributed runtimes (e.g., NCCL‑based all‑reduce) reduces operational overhead and debugging complexity.
  • Hardware flexibility: The approach works on any GPU with sufficient PCIe/NVLink bandwidth; even consumer‑grade GPUs become viable for large‑model experiments when paired with ample RAM.
  • Future hardware design: Signals a shift toward “memory‑first” accelerators where the accelerator’s role is pure compute, while the system memory hierarchy handles capacity.

Limitations & Future Work

  • CPU‑GPU bandwidth bound: The approach hinges on high‑throughput interconnects; on systems with slower PCIe links, the double‑buffered pipeline may become a bottleneck.
  • Manual gradient handling: While the paper provides a framework, developers must adapt their training loops to the explicit recomputation model, which could increase code complexity.
  • Scalability beyond a single node: Horizon‑LM focuses on node‑scale training; extending the memory‑centric model to multi‑node clusters (e.g., across multiple servers) remains an open challenge.
  • Support for exotic operators: Custom kernels or non‑tensor operations may need additional engineering to fit into the transient GPU execution model.

The authors suggest exploring adaptive buffering strategies, tighter integration with existing deep‑learning frameworks, and hybrid multi‑node extensions as next steps.

Authors

  • Zhengqing Yuan
  • Lichao Sun
  • Yanfang Ye

Paper Information

  • arXiv ID: 2602.04816v1
  • Categories: cs.OS, cs.CL, cs.DC
  • Published: February 4, 2026