[Paper] Hestia: Hyperthread-Level Scheduling for Cloud Microservices with Interference-Aware Attention

Published: February 27, 2026 at 02:36 AM EST
5 min read
Source: arXiv - 2602.23758v1

Overview

Modern cloud platforms pack dozens of latency‑sensitive microservices onto a single physical server to squeeze out every ounce of compute power. While this improves utilization, the simultaneous‑multithreading (SMT) feature that lets two logical hyperthreads share a physical core introduces subtle, asymmetric interference that can dramatically hurt tail latency. The paper “Hestia: Hyperthread‑Level Scheduling for Cloud Microservices with Interference‑Aware Attention” proposes a new scheduler that works at the hyperthread granularity, using a self‑attention model to predict contention and make smarter placement decisions.

Key Contributions

  • Empirical discovery of two dominant contention patterns – Sharing‑Core (SC) and Sharing‑Socket (SS) – across more than 32,000 microservice instances in production, showing that interference is highly asymmetric.
  • Self‑attention‑based CPU‑usage predictor that captures both SC/SS contention and hardware heterogeneity (different core speeds, cache sizes, etc.).
  • Interference scoring model that quantifies pairwise contention risk, enabling the scheduler to avoid harmful hyperthread pairings.
  • Hestia scheduling framework that operates at the hyperthread level, integrating the predictor and scoring model to place microservice instances dynamically.
  • Extensive evaluation: large‑scale trace‑driven simulation and a real‑world deployment demonstrate up to 80 % reduction in 95th‑percentile latency, a 2.3 % CPU saving, and up to 30.65 % improvement over five state‑of‑the‑art schedulers.

Methodology

  1. Trace Collection & Analysis – The authors mined production logs from 3,132 servers, extracting per‑instance CPU usage, latency, and hardware topology. Statistical clustering revealed that most interference could be explained by two patterns:

    • SC: two hyperthreads on the same physical core compete for execution units and L1/L2 cache.
    • SS: hyperthreads on different cores but sharing the same CPU socket contend for shared resources (LLC, memory bandwidth).
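The SC/SS distinction reduces to comparing the topology of two logical CPUs. On Linux, the package and core of each logical CPU are exposed under `/sys/devices/system/cpu/cpuN/topology/` (`physical_package_id`, `core_id`); a minimal classifier over those pairs might look like this (function name and tuple encoding are illustrative, not from the paper):

```python
def classify_contention(cpu_a, cpu_b):
    """Classify the contention relationship between two logical CPUs.

    Each CPU is described as a (package_id, core_id) pair, as exposed on
    Linux under /sys/devices/system/cpu/cpuN/topology/.
    """
    pkg_a, core_a = cpu_a
    pkg_b, core_b = cpu_b
    if pkg_a == pkg_b and core_a == core_b:
        return "SC"    # sibling hyperthreads on one physical core
    if pkg_a == pkg_b:
        return "SS"    # same socket, different cores (shared LLC, memory bw)
    return "none"      # different sockets: outside the SC/SS patterns

# On an 8-core SMT-2 chip, logical CPUs 0 and 8 often share a core.
print(classify_contention((0, 0), (0, 0)))  # SC
print(classify_contention((0, 0), (0, 3)))  # SS
print(classify_contention((0, 0), (1, 0)))  # none
```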
  2. Self‑Attention Predictor – Inspired by transformer models, a lightweight self‑attention network ingests a microservice’s recent CPU usage vector together with a resource‑profile (core frequency, cache size, SMT state). The attention mechanism learns how the usage of one hyperthread influences another, effectively modeling the asymmetric SC/SS effects without hand‑crafted rules.
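The core of such a predictor is ordinary single-head self-attention over the recent usage history. The sketch below (NumPy, random weights purely for illustration; Hestia trains its weights on production traces, and the exact feature layout is an assumption):

```python
import numpy as np

def self_attention(usage, d_k=8, seed=0):
    """Single-head self-attention over a sequence of per-interval
    CPU-usage feature vectors of shape [T, F]."""
    rng = np.random.default_rng(seed)
    T, F = usage.shape
    # Random projections stand in for learned Q/K/V weight matrices.
    Wq, Wk, Wv = (rng.standard_normal((F, d_k)) / np.sqrt(F) for _ in range(3))
    Q, K, V = usage @ Wq, usage @ Wk, usage @ Wv
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over time steps: how much each interval attends to the others.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V  # context vectors, fed to a regression head

# 6 intervals, 4 features (e.g. own usage, sibling usage, core freq, SMT flag)
x = np.random.default_rng(1).random((6, 4))
ctx = self_attention(x)
print(ctx.shape)  # (6, 8)
```

The attention weights are what let the model weigh a recent usage spike on a sibling hyperthread differently from one on a distant core, which is the asymmetry the paper targets.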

  3. Interference Scoring – For every candidate hyperthread pair, Hestia computes a score = predicted CPU slowdown × latency sensitivity weight. Lower scores indicate safer pairings.
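The scoring formula itself is a simple product; applied across candidate neighbors it yields a ranking (the service names and slowdown numbers below are hypothetical):

```python
def interference_score(predicted_slowdown, latency_weight):
    # score = predicted CPU slowdown x latency-sensitivity weight
    # (lower = safer pairing)
    return predicted_slowdown * latency_weight

# Hypothetical slowdown predictions for co-locating "api-gw" with each neighbor.
predicted = {"redis": 1.35, "batch-job": 1.80, "idle": 1.02}
latency_weight = 0.9  # api-gw is latency-critical
scores = {n: interference_score(s, latency_weight) for n, s in predicted.items()}
print(min(scores, key=scores.get))  # idle
```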

  4. Scheduler Loop – When a new microservice instance is launched or an existing one is scaled, Hestia queries the scoring matrix, selects the hyperthread with the lowest interference risk, and updates the predictor with the new placement’s observed metrics.
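That loop amounts to a greedy argmin over the scoring matrix plus an online feedback step. A minimal sketch, with a stub standing in for Hestia's self-attention predictor (class and method names are assumptions, not the paper's API):

```python
class OnlinePredictor:
    """Stand-in for Hestia's self-attention predictor (illustrative only)."""
    def __init__(self):
        self.observations = []

    def update(self, hyperthread, observed_usage):
        # Hestia would refine its model here; we just record the feedback.
        self.observations.append((hyperthread, observed_usage))

def place_instance(score_matrix, predictor, observed_usage):
    # 1. Pick the free hyperthread with the lowest interference risk.
    best_ht = min(score_matrix, key=score_matrix.get)
    # 2. Feed the placement's observed metrics back into the predictor.
    predictor.update(best_ht, observed_usage)
    return best_ht

predictor = OnlinePredictor()
scores = {"cpu3-ht1": 0.42, "cpu5-ht0": 0.17, "cpu5-ht1": 0.61}
print(place_instance(scores, predictor, observed_usage=0.35))  # cpu5-ht0
```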

  5. Evaluation

    • Simulation: Replay of the collected traces under different schedulers (including bin‑packing, core‑level interference‑aware, and static partitioning).
    • Production: Deployment on a live microservice platform handling a mix of web, database, and cache services, measuring tail latency and CPU utilization.

Results & Findings

| Metric | Hestia vs. baseline (core‑level) | vs. best prior scheduler |
|---|---|---|
| 95th‑percentile latency reduction | up to 80 % | +30.65 % |
| Overall CPU consumption (same workload) | −2.3 % | |
| Scheduling overhead (per decision) | < 0.5 ms (negligible) | |
| Prediction MAE (CPU usage) | 4.1 % | |
  • SC vs. SS asymmetry: SC interference caused up to 3× higher latency spikes than SS, confirming the need for hyperthread‑aware decisions.
  • Self‑attention accuracy: The predictor outperformed linear regression and LSTM baselines by 12–18 % in MAE, thanks to its ability to weigh recent usage spikes differently for each hyperthread.
  • Robustness: Hestia maintained its gains across varying workload mixes (CPU‑bound, I/O‑bound, mixed) and hardware generations (Intel Xeon, AMD EPYC).
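For reference, the MAE metric used to compare the predictors is just the mean absolute error between predicted and observed per-interval CPU usage (the sample values below are made up):

```python
def mae(pred, actual):
    # mean absolute error between predicted and observed CPU usage
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(pred)

print(round(mae([0.50, 0.72, 0.31], [0.48, 0.70, 0.35]), 3))  # 0.027
```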

Practical Implications

  • For Cloud Operators – Deploying Hestia can dramatically improve SLA compliance for latency‑critical services (e.g., API gateways, real‑time analytics) without adding hardware.
  • For DevOps Engineers – The framework integrates with existing Kubernetes or Mesos schedulers via a plug‑in, requiring only the exposure of per‑pod CPU usage and topology metadata.
  • Cost Savings – A 2.3 % reduction in CPU usage translates to lower power consumption and the ability to host more microservice instances per server, directly impacting operational expenditure.
  • Performance‑Sensitive Applications – Gaming back‑ends, fintech transaction processors, and edge‑cloud workloads can benefit from tighter tail‑latency guarantees.
  • Tooling – The self‑attention model is lightweight (≈ 200 KB) and can run on the same control plane that makes scheduling decisions, avoiding the need for heavyweight ML infrastructure.

Limitations & Future Work

  • Model Generalization – Hestia’s predictor was trained on traces from a specific data‑center configuration; retraining may be needed for drastically different hardware (e.g., ARM‑based servers).
  • Scope of Resources – The current interference score focuses on CPU and cache contention; memory bandwidth and I/O interference are not explicitly modeled.
  • Dynamic Workloads – Extremely bursty workloads that change behavior faster than the predictor’s update interval could still suffer brief latency spikes.
  • Future Directions – Extending the attention model to jointly predict memory and network contention, exploring reinforcement‑learning‑based placement policies, and open‑sourcing the scheduler plug‑in for broader community adoption.

Authors

  • Dingyu Yang
  • Fanyong Kong
  • Jie Dai
  • Shiyou Qian
  • Shuangwei Li
  • Jian Cao
  • Guangtao Xue
  • Gang Chen

Paper Information

  • arXiv ID: 2602.23758v1
  • Categories: cs.DC
  • Published: February 27, 2026