[Paper] HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware

Published: May 8, 2026 at 06:41 AM EDT

Source: arXiv - 2605.07569v1

Overview

Training large language models (LLMs) with very long context windows (hundreds of thousands to a million tokens) is becoming essential for next‑generation AI applications. Existing systems rely on Context Parallelism (CP), which splits the token sequence across devices, and Head Parallelism (HP), which splits attention heads across devices, but they assume a homogeneous GPU fleet: the same GPU model, identical memory, and uniform interconnect bandwidth. HexiSeq drops that assumption, enabling CP‑HP training on heterogeneous clusters that mix, for example, H100 and A100 GPUs and have uneven network links. The paper shows that with careful workload placement, developers can get more throughput out of existing, non‑uniform hardware without buying a new homogeneous rack.

Key Contributions

Asymmetric CP‑HP Partitioning: Extends the classic CP and HP schemes to allow arbitrary splits of sequence shards and attention heads across devices with different compute, memory, and bandwidth.
Formal Optimization Model: Casts heterogeneous CP‑HP allocation as a constrained optimization problem that respects per‑GPU memory limits, compute capacity, and communication costs.
Hierarchical Scheduler: Introduces an efficient two‑level scheduler (global coarse‑grained placement + local fine‑grained refinement) that finds near‑optimal schedules in under a second, even for clusters with dozens of GPU types.
Comprehensive Evaluation: Benchmarks HexiSeq on real mixed H100–A100 clusters and on a large simulation suite (32–128 GPUs, up to four GPU models). Shows average throughput gains of 1.11×–1.36× and peak improvements up to 1.72× over homogeneous baselines.
FLOP‑Comparable Parity: Demonstrates that heterogeneous clusters orchestrated by HexiSeq can reach throughput within a few percent of the best homogeneous configuration with the same total FLOPs, suggesting that mixed hardware need not carry a performance penalty.
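To make the idea behind asymmetric partitioning concrete, here is a minimal toy sketch, not the paper's algorithm. It assumes a synchronous step whose duration is set by the slowest GPU, ignores memory and communication entirely, and uses made-up throughput numbers. The names `Gpu`, `proportional_split`, and `step_time` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    name: str
    tflops: float  # assumed sustained compute throughput, TFLOP/s (illustrative)


def proportional_split(total: int, weights: list[float]) -> list[int]:
    """Split `total` integer units in proportion to `weights` (largest-remainder rounding)."""
    total_weight = sum(weights)
    exact = [w * total / total_weight for w in weights]
    shares = [int(x) for x in exact]
    # Hand any leftover units to the entries with the largest fractional parts.
    leftover = total - sum(shares)
    by_fraction = sorted(range(len(weights)), key=lambda i: exact[i] - shares[i], reverse=True)
    for i in by_fraction[:leftover]:
        shares[i] += 1
    return shares


def step_time(shares: list[int], gpus: list[Gpu], work_per_token: float = 1.0) -> float:
    """A synchronous step finishes only when the slowest GPU finishes (arbitrary time units)."""
    return max(s * work_per_token / g.tflops for s, g in zip(shares, gpus))


# Illustrative numbers only, not taken from the paper.
gpus = [Gpu("fast", 900.0), Gpu("fast", 900.0), Gpu("slow", 300.0), Gpu("slow", 300.0)]
tokens = 524_288  # a 512 k-token sequence

uniform = [tokens // len(gpus)] * len(gpus)                         # homogeneous-style split
asymmetric = proportional_split(tokens, [g.tflops for g in gpus])  # compute-aware split

speedup = step_time(uniform, gpus) / step_time(asymmetric, gpus)
print(asymmetric, round(speedup, 2))
```

In this idealized setting the uniform split is held back by the slow GPUs, while the proportional split lets every device finish at the same time. Real gains depend on the memory and bandwidth constraints that the toy model leaves out.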

Methodology

Model the Resources – Each GPU is described by three numbers: compute throughput (TFLOP/s), memory capacity (GB), and network bandwidth (GB/s).
Define the Workload – A long‑context LLM training step is split into:
- Sequence shards (chunks of the input token sequence) for CP.
- Attention heads for HP.
  Both dimensions can be partitioned independently.
Optimization Formulation – The goal is to maximize overall training throughput (tokens processed per second) while satisfying:
- Memory constraints per GPU (shard + head data must fit).
- Compute constraints (no GPU overloaded beyond its TFLOP/s rating).
- Communication constraints (data transferred across the mesh must respect link bandwidth).
  This yields a mixed‑integer linear program (MILP).
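As an illustration of what a program of this shape can look like, the sketch below uses notation that is assumed here and is not taken from the paper. The step is cut into work tiles $u$ (hypothetically, a sequence shard paired with a group of heads) with compute cost $f_u$, memory footprint $m_u$, and communication volume $b_u$. GPU $g$ has throughput $F_g$, memory $M_g$, and bandwidth $B_g$. The binary variable $x_{u,g}$ is 1 when tile $u$ is placed on GPU $g$. For a fixed number of tokens per step, minimizing the step time $T$ is equivalent to maximizing tokens per second.

```latex
\begin{aligned}
\min_{x,\,T}\quad & T \\
\text{s.t.}\quad
& \sum_{g} x_{u,g} = 1 && \forall u
  && \text{(every tile is placed exactly once)} \\
& \sum_{u} m_u\, x_{u,g} \le M_g && \forall g
  && \text{(memory)} \\
& \sum_{u} \left( \frac{f_u}{F_g} + \frac{b_u}{B_g} \right) x_{u,g} \le T && \forall g
  && \text{(compute plus communication time)} \\
& x_{u,g} \in \{0,1\}, \quad T \ge 0
\end{aligned}
```

Every constraint is linear in $x$ and $T$ because $f_u / F_g$ and $b_u / B_g$ are constants, so the problem is a MILP. The sketch simplifies in two ways: it assumes compute and communication do not overlap, and it treats a tile's communication volume as independent of where its neighbours are placed.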
Hierarchical Scheduler – Solving the MILP exactly is too slow for large clusters. HexiSeq therefore:
- Stage 1 (Global): Uses a greedy heuristic to assign large “chunks” of shards/heads to groups of similar GPUs.
- Stage 2 (Local): Refines each group with a lightweight integer solver that fine‑tunes the exact split to respect the remaining constraints.
  The scheduler runs in < 0.5 s for a 128‑GPU cluster.
Implementation – Built on top of an existing CP/HP training stack (e.g., DeepSpeed or Megatron‑LM), HexiSeq adds a thin abstraction layer that intercepts tensor placement calls and injects the schedule computed by the optimizer.
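The two-stage structure described above can be sketched as follows. This is a simplified stand-in, not HexiSeq's scheduler: it partitions only the sequence dimension, replaces the local integer solver with an even integer split inside each group of identical GPUs, and models memory as a hypothetical per-GPU token cap (`max_tokens`). All names and numbers are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    gpu_id: int
    model: str       # GPUs sharing a model string are treated as identical
    tflops: float    # illustrative compute throughput, TFLOP/s
    max_tokens: int  # hypothetical memory-derived cap on tokens held by this GPU


def global_stage(groups: dict[str, list[Gpu]], total_tokens: int, chunk: int) -> dict[str, int]:
    """Stage 1: greedily hand fixed-size chunks to the group that would finish earliest."""
    if total_tokens % chunk:
        raise ValueError("total_tokens must be a multiple of chunk")
    flops = {m: sum(g.tflops for g in gs) for m, gs in groups.items()}
    cap = {m: sum(g.max_tokens for g in gs) for m, gs in groups.items()}
    budget = {m: 0 for m in groups}
    for _ in range(total_tokens // chunk):
        fits = [m for m in groups if budget[m] + chunk <= cap[m]]
        if not fits:
            raise ValueError("sequence does not fit in aggregate memory")
        best = min(fits, key=lambda m: (budget[m] + chunk) / flops[m])
        budget[best] += chunk
    return budget


def local_stage(group: list[Gpu], budget: int) -> dict[int, int]:
    """Stage 2: split a group's budget across its GPUs as evenly as integers allow."""
    base, extra = divmod(budget, len(group))
    shares = {g.gpu_id: base + (1 if i < extra else 0) for i, g in enumerate(group)}
    for g in group:
        if shares[g.gpu_id] > g.max_tokens:
            raise ValueError(f"GPU {g.gpu_id} exceeds its memory cap")
    return shares


def schedule(gpus: list[Gpu], total_tokens: int, chunk: int = 4096) -> dict[int, int]:
    """Group GPUs by model, run the global stage, then refine each group locally."""
    groups: dict[str, list[Gpu]] = defaultdict(list)
    for g in gpus:
        groups[g.model].append(g)
    budgets = global_stage(groups, total_tokens, chunk)
    plan: dict[int, int] = {}
    for model, group in groups.items():
        plan.update(local_stage(group, budgets[model]))
    return plan


# Illustrative cluster: two fast and two slow GPUs with made-up specs.
cluster = [
    Gpu(0, "fast", 900.0, 200_000),
    Gpu(1, "fast", 900.0, 200_000),
    Gpu(2, "slow", 300.0, 100_000),
    Gpu(3, "slow", 300.0, 100_000),
]
plan = schedule(cluster, total_tokens=524_288)
print(plan)
```

Grouping first keeps the global decision small (one budget per GPU model instead of one per GPU), which is one plausible reason a hierarchical scheme can stay fast as the cluster grows.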

Results & Findings

Setup	Model Size	Context Length	Throughput Gain vs. Homogeneous Baseline
Mixed H100 + A100 (8 + 8 GPUs)	30 B	512 k	1.19×
Simulated 32‑GPU (4 GPU models)	70 B	1 M	1.36× average, 1.72× peak
128‑GPU cluster	3 B–70 B	Up to 1 M	1.11×–1.19× on real hardware

Memory Utilization: HexiSeq keeps each GPU’s memory usage at or below 95 % of capacity, avoiding the out‑of‑memory crashes that plague naïve CP/HP on heterogeneous meshes.
Communication Overhead: By aligning high‑bandwidth links with the heaviest data transfers (large shards), the scheduler reduces traffic between different GPU models by ~30 % compared to naïve round‑robin placement.
Scalability: Throughput gains grow with the number of distinct GPU types; the more heterogeneity, the larger the relative benefit.
Parity with Homogeneous FLOP‑Match: When matching total FLOPs (e.g., swapping two A100s for one H100), HexiSeq’s throughput is within 3 % of the best homogeneous configuration, confirming that the optimizer extracts near‑optimal performance.

Practical Implications

Cost‑Effective Scaling: Companies can repurpose older GPUs (A100, V100) alongside newer H100s while staying close to the throughput of a FLOP‑matched homogeneous cluster, extending the ROI of existing hardware.
Cloud Flexibility: In multi‑tenant cloud environments where instance types vary, HexiSeq can automatically stitch together a heterogeneous pod, reducing the need for custom VM selection scripts.
Long‑Context Applications: Researchers building retrieval‑augmented generation, code‑completion, or scientific reasoning models that need million‑token windows can now train at scale without building a dedicated homogeneous super‑cluster.
Tooling Integration: Because HexiSeq sits as a scheduler layer, it can be dropped into popular LLM training pipelines (PyTorch, DeepSpeed, Megatron‑LM) with minimal code changes: the main input is a config file describing each GPU’s specs.
Energy & Utilization: By sizing each GPU’s workload to its capability, devices spend less time idling while they wait for stragglers, which can reduce wasted energy during training runs.
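The summary does not specify what the per‑GPU config file mentioned under Tooling Integration looks like. As a rough sketch only, the example below assumes a hypothetical JSON layout whose fields mirror the three resource numbers from the Methodology section (compute, memory, bandwidth). The field names, the `GpuSpec` class, and all values are placeholders, not HexiSeq's actual format.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSpec:
    model: str
    count: int
    tflops: float          # compute throughput, TFLOP/s
    memory_gb: float       # memory capacity, GB
    bandwidth_gb_s: float  # network bandwidth, GB/s


# Hypothetical config; model names and numbers are placeholders.
CONFIG_TEXT = """
{
  "gpus": [
    {"model": "gpu-a", "count": 8, "tflops": 900.0, "memory_gb": 80.0, "bandwidth_gb_s": 400.0},
    {"model": "gpu-b", "count": 8, "tflops": 300.0, "memory_gb": 80.0, "bandwidth_gb_s": 200.0}
  ]
}
"""


def load_specs(text: str) -> list[GpuSpec]:
    """Parse the config and reject non-positive resource values."""
    specs = [GpuSpec(**entry) for entry in json.loads(text)["gpus"]]
    for s in specs:
        if s.count <= 0 or min(s.tflops, s.memory_gb, s.bandwidth_gb_s) <= 0:
            raise ValueError(f"invalid spec for {s.model}")
    return specs


specs = load_specs(CONFIG_TEXT)
total_gpus = sum(s.count for s in specs)
total_tflops = sum(s.count * s.tflops for s in specs)
# Fraction of the cluster's aggregate compute contributed by each GPU model.
compute_share = {s.model: s.count * s.tflops / total_tflops for s in specs}
print(total_gpus, compute_share)
```

A scheduler could use shares like these as a starting point for placement before memory and bandwidth constraints are applied.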

Limitations & Future Work

Scheduler Overhead on Very Large Meshes: While sub‑second for up to 128 GPUs, the hierarchical approach may need further scaling tricks (e.g., distributed scheduling) for thousands of GPUs.
Static Resource Profiles: HexiSeq assumes static compute/memory/bandwidth numbers; dynamic variations (thermal throttling, network congestion) are not yet modeled.
Limited to CP & HP: Other parallelism strategies (tensor, pipeline) are not covered; integrating them could unlock further gains for extremely large models.
Fault Tolerance: The current prototype does not handle GPU failures mid‑training; future work could incorporate checkpoint‑aware re‑balancing.
Broader Benchmarks: Evaluation focused on transformer‑style LLMs; applying HexiSeq to vision‑language or multimodal models remains an open question.

Bottom line: HexiSeq shows that with a heterogeneity‑aware scheduler, mixed GPU clusters need not be a bottleneck for long‑context LLM training, opening the door to more flexible, cost‑effective AI development pipelines.

Authors

Yan Liang
Youhe Jiang
Ran Yan
Binhang Yuan
Wei Wang
Chuan Wu

Paper Information

arXiv ID: 2605.07569v1
Categories: cs.DC
Published: May 8, 2026