[Paper] Equilibria: Fair Multi-Tenant CXL Memory Tiering At Scale
Source: arXiv - 2602.08800v1
Overview
Memory is the biggest cost and power driver in modern datacenters, and the emergence of Compute Express Link (CXL) promises cheaper, low‑power memory expansion. However, turning raw CXL capacity into predictable performance for dozens of co‑running services is surprisingly hard. The paper Equilibria: Fair Multi‑Tenant CXL Memory Tiering At Scale introduces an operating‑system framework that lets cloud operators allocate, monitor, and enforce fair‑share policies for tiered CXL memory across many containers, while keeping latency‑sensitive workloads on track.
Key Contributions
- Per‑container fair‑share control – a new OS interface that lets admins specify how much CXL memory each container may use, independent of the host’s global memory manager.
- Fine‑grained observability – lightweight metrics and tracing hooks that expose promotion (slow → fast tier) and demotion (fast → slow tier) activity per tenant, enabling root‑cause analysis at scale.
- Policy‑driven promotion/demotion – a flexible regulator that can enforce arbitrary fairness policies (e.g., proportional share, min‑max) while throttling aggressive thrashing that would otherwise cause noisy‑neighbor effects.
- Production‑grade implementation – patches integrated into the mainline Linux kernel (released to the community) and evaluated on a hyperscaler’s fleet with real workloads.
- Performance gains – up to 52 % improvement over the existing Linux tiering solution, Transparent Page Placement (TPP), on production services and 1.7× on benchmark mixes.
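The fairness policies named above (e.g., proportional share) can be illustrated with a small user-space sketch. The function name and budget units are hypothetical; the paper's actual policy engine lives in the kernel.

```python
def proportional_share(budget_bytes, weights):
    """Split a CXL capacity budget across tenants in proportion to their
    configured weights (illustrative policy sketch, not the paper's code)."""
    total = sum(weights.values())
    return {tenant: budget_bytes * w // total for tenant, w in weights.items()}
```

For example, splitting a 1000-byte budget between a tenant with weight 1 and one with weight 3 yields 250 and 750 bytes respectively; a min-max policy would additionally clamp each share to per-tenant floors and caps.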
Methodology
- Design of a new memory tiering layer – built on top of the Linux page‑fault path, the layer intercepts allocation requests and decides whether a page lives in local DRAM or remote CXL memory.
- Container‑aware accounting – each cgroup gets a “fair‑share quota” that the tiering layer consults before demoting a page to the slower CXL tier.
- Regulated promotion engine – instead of naïvely moving pages whenever DRAM pressure rises, the engine applies a token‑bucket‑style regulator that respects the per‑tenant quota and caps promotion rates.
- Observability hooks – the authors added per‑cgroup counters (promotions, demotions, thrash events) and exposed them via procfs/sysfs and eBPF maps, allowing operators to build dashboards without heavy tracing overhead.
- Evaluation – the system was deployed on a real hyperscaler cluster (hundreds of nodes, each with several terabytes of CXL memory). Workloads included production micro‑services, batch jobs, and standard memory‑intensive benchmarks (e.g., Memcached, Redis, SPEC CPU). Metrics collected: SLO compliance (tail latency), overall throughput, and fairness (Jain’s index).
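The token-bucket-style regulator described above can be sketched in user-space Python. Class, method, and parameter names are illustrative assumptions; the actual engine is implemented inside the kernel's migration path.

```python
class PromotionRegulator:
    """Token-bucket rate limiter for page promotions (illustrative sketch).

    Tokens refill at `rate_pages_per_sec` up to a `burst` ceiling; a
    promotion of `npages` is allowed only if that many tokens are available,
    which caps sustained promotion rates while permitting short bursts.
    """

    def __init__(self, rate_pages_per_sec, burst):
        self.rate = rate_pages_per_sec
        self.capacity = burst
        self.tokens = burst   # start full so a fresh tenant can burst
        self.last = 0.0

    def allow(self, now, npages=1):
        # Refill tokens for the elapsed time, clamped to the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= npages:
            self.tokens -= npages
            return True
        return False
```

A per-tenant instance of such a bucket, sized from the tenant's fair-share quota, is what keeps one aggressive container from saturating the migration bandwidth.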
Results & Findings
| Metric | Linux TPP (baseline) | Equilibria | Improvement |
|---|---|---|---|
| 99th‑percentile latency (prod micro‑service) | 12 ms | 7 ms | 42 % lower |
| Throughput (Redis workload) | 1.2 M ops/s | 1.8 M ops/s | +52 % |
| Memory‑bandwidth‑bound benchmark | 0.9× baseline | 1.7× baseline | ≈1.9× over TPP |
| Fairness (Jain’s index) | 0.71 | 0.94 | +0.23 |
| Promotion thrash events | 3.4 k/h | 0.9 k/h | 73 % fewer |
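Jain's fairness index, reported in the table, ranges from 1/n (one tenant gets everything) to 1 (perfectly equal shares) and is computed as (Σxᵢ)² / (n · Σxᵢ²) over per-tenant allocations xᵢ:

```python
def jain_index(allocations):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2).

    Returns 1.0 for perfectly equal allocations and 1/n when a single
    tenant monopolizes the resource.
    """
    n = len(allocations)
    total = sum(allocations)
    return total * total / (n * sum(x * x for x in allocations))
```

Under this measure, the move from 0.71 to 0.94 means per-tenant CXL usage under Equilibria is close to the equal-share ideal.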
Key takeaways:
- By preventing a single tenant from monopolizing the CXL tier, overall latency tails shrink dramatically.
- The regulated promotion logic cuts down on “ping‑pong” page migrations that otherwise waste bandwidth and increase power.
- Operators can now pinpoint which container is causing excessive promotions, something that was impossible with the prior kernel implementation.
Practical Implications
- Cloud providers can roll out CXL‑backed memory pools without fearing that a noisy tenant will degrade the entire node’s performance, enabling cheaper hardware refresh cycles.
- DevOps teams gain a programmable API (cgroup extensions) to enforce memory budgets per service, aligning resource usage with business‑level SLAs.
- Application architects can design workloads that deliberately spill to CXL for large, cold data structures, knowing the OS will keep hot paths in DRAM and prevent surprise latency spikes.
- Observability platforms (Prometheus, Grafana, etc.) can ingest the new metrics with minimal changes, providing real‑time dashboards for memory tier health and fairness compliance.
- The open‑source patches mean any Linux‑based stack—from edge servers to hyperscalers—can adopt the framework without waiting for a vendor‑specific fork.
Limitations & Future Work
- Hardware dependency: The current prototype assumes CXL Type 3 (memory‑expander) devices with predictable latency; performance on future CXL device generations or heterogeneous memory (e.g., NVDIMM) remains untested.
- Policy complexity: While the regulator supports proportional‑share policies, more sophisticated QoS models (e.g., deadline‑aware or burstable memory) would require additional kernel extensions.
- Scalability of counters: Per‑cgroup counters scale well up to a few thousand containers per node, but ultra‑dense workloads (tens of thousands) may need hierarchical aggregation to avoid overhead.
- Cross‑node tiering: The work focuses on intra‑node memory tiering; extending fairness guarantees across a cluster of nodes with shared CXL pools is an open research direction.
Overall, Equilibria demonstrates that with the right OS abstractions, CXL memory can be turned into a practical, fair, and observable resource for modern multi‑tenant datacenters.
Authors
- Kaiyang Zhao
- Neha Gholkar
- Hasan Maruf
- Abhishek Dhanotia
- Johannes Weiner
- Gregory Price
- Ning Sun
- Bhavya Dwivedi
- Stuart Clark
- Dimitrios Skarlatos
Paper Information
- arXiv ID: 2602.08800v1
- Categories: cs.OS, cs.DC
- Published: February 9, 2026