[Paper] Metronome: Differentiated Delay Scheduling for Serverless Functions
Source: arXiv - 2512.05703v1
Overview
Serverless platforms (FaaS) promise “pay‑as‑you‑go” compute without the headache of provisioning servers, but the underlying scheduler still struggles to place functions efficiently. The paper Metronome: Differentiated Delay Scheduling for Serverless Functions investigates why classic delay‑scheduling techniques from Hadoop‑style clusters don’t translate directly to serverless workloads, and introduces a machine‑learning‑driven scheduler that adapts the delay on a per‑function basis to improve latency while respecting SLAs.
Key Contributions
- Empirical study of delay scheduling in serverless – systematic evaluation of existing techniques reveals three non‑obvious properties of serverless workloads (input‑dependent locality benefits, dual data + infrastructure locality, and heterogeneous execution times).
- Differentiated delay concept – instead of a one‑size‑fits‑all delay threshold, Metronome decides per‑function how long to wait for a “better” node.
- Online Random Forest regression model – predicts a function’s execution time on each candidate node using lightweight runtime features, enabling the scheduler to pick the node that minimizes overall latency.
- SLA‑aware delay control – the model’s prediction uncertainty feeds a safety margin that keeps the chosen delay from pushing the function past its deadline.
- Prototype on OpenLambda – real‑world experiments show mean execution‑time reductions of 64.9 %–95.8 % versus vanilla OpenLambda and other baselines, even under high concurrency.
Methodology
- Workload Characterization – the authors run a diverse set of benchmark functions (CPU‑bound, I/O‑bound, data‑intensive) on OpenLambda and record where latency improvements appear when a function waits for a “local” node.
- Observation Extraction – three patterns emerge:
- (a) locality gains depend on input size,
- (b) “local” can mean data stored on the same storage node or the same physical host (infrastructure locality), and
- (c) execution times vary wildly, making static delay thresholds ineffective.
- Predictive Scheduler Design – Metronome maintains an online Random Forest model that ingests:
- Function metadata (size of input, language runtime, memory request)
- Node state (CPU load, network bandwidth, cached data presence)
- Historical execution times on that node
The model outputs an estimated runtime for each candidate node.
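The paper’s model is an online Random Forest regressor; its exact features and update rule aren’t reproduced here. As a minimal stand‑in that preserves the interface (online updates in, per‑node runtime estimates out), the sketch below keeps a sliding window of observed runtimes per (function, node) pair and predicts with the window mean — all names are illustrative, not the authors’ API.

```python
from collections import defaultdict, deque

class RuntimePredictor:
    """Toy stand-in for Metronome's online Random Forest regressor.

    Keeps a sliding window of observed execution times per
    (function, node) pair and predicts with the window mean.
    """

    def __init__(self, window=50):
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, func_id, node_id, runtime_s):
        # Online update: record a completed execution's runtime.
        self.history[(func_id, node_id)].append(runtime_s)

    def predict(self, func_id, node_id, default=1.0):
        # Estimated runtime on this node; fall back to a prior
        # when no history exists for the pair yet.
        h = self.history[(func_id, node_id)]
        return sum(h) / len(h) if h else default

pred = RuntimePredictor()
for t in (0.8, 1.2, 1.0):
    pred.observe("resize-img", "node-a", t)
print(pred.predict("resize-img", "node-a"))  # 1.0 (mean of the window)
```

A real deployment would replace the mean with the Random Forest’s regression output over the metadata and node‑state features listed above; the scheduler only needs the `predict` interface.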
- Delay Decision Logic – for a newly arrived function, Metronome:
- Computes the expected finish time on the currently available node.
- Checks whether waiting for a “better” node (according to the model) would finish earlier and stay within the SLA.
- If yes, it delays the dispatch; otherwise it schedules immediately.
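The decision steps above can be sketched as a single comparison of predicted finish times. This is a simplified rendering of the rule the summary describes, with assumed parameter names; the paper’s actual implementation sits inside OpenLambda’s dispatcher.

```python
def decide(now_runtime, best_runtime, expected_wait, sla_deadline, elapsed=0.0):
    """Illustrative version of Metronome's delay decision (names assumed).

    now_runtime   - predicted runtime on the currently available node (s)
    best_runtime  - predicted runtime on the preferred "better" node (s)
    expected_wait - estimated time until the preferred node is available (s)
    sla_deadline  - time budget for this invocation (s)
    elapsed       - time the request has already spent queued (s)
    Returns "delay" or "dispatch".
    """
    finish_now = elapsed + now_runtime
    finish_wait = elapsed + expected_wait + best_runtime
    # Delay only if waiting finishes earlier AND still meets the SLA.
    if finish_wait < finish_now and finish_wait <= sla_deadline:
        return "delay"
    return "dispatch"

print(decide(now_runtime=5.0, best_runtime=1.0,
             expected_wait=0.5, sla_deadline=10.0))  # delay
print(decide(now_runtime=5.0, best_runtime=1.0,
             expected_wait=6.0, sla_deadline=10.0))  # dispatch
```

The second call shows the differentiation: the same function is dispatched immediately once the expected wait erodes the locality benefit.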
- Implementation & Evaluation – integrated into OpenLambda’s dispatcher, the system is tested with varying concurrency levels (up to 500 simultaneous invocations) and compared against:
- No‑delay (immediate dispatch)
- Fixed‑threshold delay scheduling (classic approach)
- Random placement baseline
Results & Findings
| Metric | Metronome | Fixed‑Delay | Immediate | Random |
|---|---|---|---|---|
| Mean execution time reduction (vs. Immediate) | 64.9 % – 95.8 % | 12 % – 28 % | – | 5 % – 10 % |
| SLA violation rate | 0 % (within 95 % confidence) | 3 % – 7 % | 0 % (but higher latency) | 0 % (but much slower) |
| Throughput under high concurrency (500 req/s) | +22 % over Fixed‑Delay | – | – | – |
| Overhead (model inference + bookkeeping) | ~2 ms per request | negligible | – | – |
Key takeaways
- Predictive delay beats static thresholds because it tailors the wait time to each function’s characteristics and the current cluster state.
- Dual locality matters – caching data and co‑locating on the same physical host can cut network latency dramatically for data‑heavy functions.
- SLA safety is preserved – the model’s confidence interval is used to bound the maximum allowed delay, preventing missed deadlines.
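The last takeaway can be made concrete with a small helper. The summary says the model’s confidence interval bounds the maximum allowed delay; the exact interval construction isn’t given, so the sketch below assumes a mean-plus-k-standard-deviations upper bound as one plausible form.

```python
def max_allowed_delay(pred_runtime, pred_std, sla_deadline, elapsed=0.0, k=2.0):
    """Cap the delay so that even a pessimistic runtime estimate
    (mean + k * std, an assumed interval form) still meets the SLA.
    Returns 0.0 when no slack remains."""
    pessimistic = pred_runtime + k * pred_std
    return max(0.0, sla_deadline - elapsed - pessimistic)

print(max_allowed_delay(pred_runtime=2.0, pred_std=0.5, sla_deadline=10.0))  # 7.0
```

Any candidate delay above this cap is rejected, which is how the scheduler trades locality for deadline safety.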
Practical Implications
- Serverless providers can embed a lightweight predictor in their dispatchers to squeeze out latency gains without adding hardware.
- Developers can benefit from lower cold‑start times for data‑intensive functions simply by enabling the platform’s “locality‑aware” mode; no code changes are required.
- Edge‑oriented FaaS (e.g., Cloudflare Workers, AWS Lambda@Edge) can adopt the differentiated delay idea to decide whether to run a function on a nearby edge node or fall back to a central data center, balancing latency vs. resource availability.
- Cost optimization – shorter execution times translate directly into lower billable compute seconds, especially for high‑frequency micro‑services.
- Observability tooling – the same runtime features used for prediction (input size, cache hit/miss) are valuable signals for debugging and capacity planning.
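The edge‑oriented implication above reduces to the same predicted‑finish‑time comparison with network transfer folded in. The following is a hypothetical sketch (not from the paper) of how a differentiated‑delay dispatcher might choose between a nearby edge node and a central data center:

```python
def place(edge_runtime, edge_queue_wait, central_runtime, network_rtt):
    """Hypothetical edge-vs-central placement: run at the edge only
    when queueing there still beats shipping the request to the core."""
    edge_finish = edge_queue_wait + edge_runtime
    central_finish = network_rtt + central_runtime
    return "edge" if edge_finish <= central_finish else "central"

print(place(edge_runtime=0.05, edge_queue_wait=0.2,
            central_runtime=0.05, network_rtt=0.08))  # central
```

Here a congested edge node loses to the data center despite identical per‑request runtimes, mirroring the paper’s point that the right delay depends on current cluster state, not a fixed policy.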
Limitations & Future Work
- Model freshness – Random Forests are retrained periodically; rapid workload shifts could temporarily degrade prediction accuracy.
- Feature set is platform‑specific; porting Metronome to other FaaS stacks (AWS Lambda, Azure Functions) will require re‑engineering the telemetry collection.
- Scalability of the predictor – while inference is cheap, maintaining per‑node models at massive scale (tens of thousands of nodes) may need hierarchical or federated learning approaches.
- Security & multi‑tenant isolation – exposing node‑level load information to the scheduler could raise isolation concerns; future designs must ensure no cross‑tenant leakage.
- The authors suggest exploring reinforcement‑learning schedulers that can directly optimize a composite objective (latency + cost + SLA) and investigating heterogeneous hardware (GPU/FPGA) where locality decisions become even more nuanced.
Authors
- Zhuangbin Chen
- Juzheng Zheng
- Zibin Zheng
Paper Information
- arXiv ID: 2512.05703v1
- Categories: cs.SE, cs.DC
- Published: December 5, 2025