[Paper] Reinforcement Learning-Based Dynamic Management of Structured Parallel Farm Skeletons on Serverless Platforms

Published: February 6, 2026 at 04:57 AM EST



Source: arXiv - 2602.06555v1

Overview

The paper introduces a framework that uses reinforcement learning (RL) to automatically scale structured parallel “farm” skeletons on serverless platforms such as OpenFaaS. Treating autoscaling as a QoS‑aware resource‑management problem, the authors show that learned scaling policies can approach HPC‑like performance and resilience while preserving the high‑level skeleton programming model.

Key Contributions

Reusable farm skeleton template for OpenFaaS that abstracts away low‑level orchestration details.
Gymnasium‑compatible monitoring/control layer exposing queue length, latency, and QoS metrics to external controllers.
Two RL‑based autoscaling agents (a policy‑gradient and a deep Q‑network) trained to adjust the number of parallel workers dynamically.
Comprehensive evaluation against a classic reactive controller derived from a simple analytical performance model.
Evidence that AI‑driven scaling better respects platform limits (e.g., cold‑start latency, concurrency caps) while delivering higher QoS and stable resource usage.
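The farm pattern that the template packages (a dispatcher handing independent tasks to a pool of stateless workers) can be illustrated locally. The sketch below is an assumption, not the authors' template: a thread pool stands in for OpenFaaS worker functions, and `num_workers` plays the role of the replica count that the autoscaler adjusts. The names `farm` and `square` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def farm(tasks, worker_fn, num_workers):
    """Farm skeleton sketch: a dispatcher hands independent tasks to a pool of
    stateless workers and gathers the results in input order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(worker_fn, tasks))


def square(x):
    """Hypothetical stateless worker; on OpenFaaS this would be a function invocation."""
    return x * x


print(farm(range(8), square, num_workers=3))
# [0, 1, 4, 9, 16, 25, 36, 49]
```

Because workers share no state, changing `num_workers` affects only throughput, not results, which is what makes the farm a natural target for autoscaling.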

Methodology

Farm Skeleton Design – The classic Farm pattern (a master that distributes independent tasks to a pool of workers) is implemented as a set of OpenFaaS functions: one dispatcher and many stateless worker functions.
Instrumentation – The system continuously reports three key signals to a central controller:
- Queue depth (how many tasks are waiting),
- Task processing time (per‑worker latency), and
- QoS target (e.g., maximum allowed end‑to‑end latency).
Control Loop – The controller interacts with the system through a Gymnasium environment; each step consists of:
- Observing the current metrics,
- Selecting an action (increase, decrease, or keep the current number of workers),
- Applying the action by invoking OpenFaaS’s scaling API,
- Receiving a reward based on QoS compliance and resource efficiency.
Learning Algorithms –
- Policy Gradient (PG): directly learns a probability distribution over scaling actions.
- Deep Q‑Network (DQN): learns a value function estimating long‑term reward for each action.
Baseline Reactive Controller – Uses a handcrafted rule derived from a simple queueing model (e.g., scale up when queue > threshold, scale down when idle).
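The observe/act/reward loop above can be sketched with a toy, dependency‑free environment that follows Gymnasium's `reset()`/`step()` return conventions. Everything here is hypothetical: `FarmScalingEnv` simulates the queue instead of calling OpenFaaS, and the arrival and service rates, the 20‑worker cap, the latency estimate, and the reward weights are illustrative assumptions, not the paper's values. `reactive_policy` is a threshold rule in the spirit of the baseline, with made‑up thresholds.

```python
import random


class FarmScalingEnv:
    """Hypothetical toy farm-autoscaling environment using Gymnasium's reset/step
    conventions. It does not import gymnasium so it runs standalone; a real
    version would subclass gymnasium.Env and declare its spaces."""

    DELTAS = (-1, 0, +1)  # action 0: remove a worker, 1: keep, 2: add a worker

    def __init__(self, arrival_rate=10, service_rate=2, min_workers=1,
                 max_workers=20, qos_ms=150.0, ms_per_step=50.0, horizon=200, seed=None):
        self.arrival_rate = arrival_rate  # mean tasks arriving per step (assumed)
        self.service_rate = service_rate  # tasks one worker finishes per step (assumed)
        self.min_workers, self.max_workers = min_workers, max_workers  # cap mimics a concurrency limit
        self.qos_ms, self.ms_per_step, self.horizon = qos_ms, ms_per_step, horizon
        self.rng = random.Random(seed)

    def _observe(self):
        # Rough latency estimate: time to drain the current queue at full throughput.
        latency_ms = self.queue / (self.workers * self.service_rate) * self.ms_per_step
        return (self.queue, latency_ms, self.workers)

    def reset(self, seed=None, options=None):
        if seed is not None:
            self.rng.seed(seed)
        self.queue, self.workers, self.t = 0, self.min_workers, 0
        return self._observe(), {}

    def step(self, action):
        previous = self.workers
        self.workers = min(self.max_workers,
                           max(self.min_workers, self.workers + self.DELTAS[action]))
        arrivals = self.rng.randint(0, 2 * self.arrival_rate)  # uniform, mean = arrival_rate
        served = min(self.queue + arrivals, self.workers * self.service_rate)
        self.queue += arrivals - served
        self.t += 1
        obs = self._observe()
        qos_ok = obs[1] <= self.qos_ms
        # +1/-1 for QoS compliance, minus a per-worker cost and a penalty when the
        # worker count actually changes (all weights hypothetical).
        reward = ((1.0 if qos_ok else -1.0) - 0.02 * self.workers
                  - (0.1 if self.workers != previous else 0.0))
        truncated = self.t >= self.horizon
        return obs, reward, False, truncated, {"qos_ok": qos_ok}


def reactive_policy(obs, high=10, low=2):
    """Threshold rule in the spirit of the reactive baseline (thresholds hypothetical)."""
    queue, _latency_ms, _workers = obs
    if queue > high:
        return 2  # scale up
    if queue < low:
        return 0  # scale down
    return 1      # keep


env = FarmScalingEnv(seed=0)
obs, info = env.reset(seed=0)
total_reward, done = 0.0, False
while not done:
    obs, reward, terminated, truncated, info = env.step(reactive_policy(obs))
    total_reward += reward
    done = terminated or truncated
print(f"return={total_reward:.2f}, final (queue, latency_ms, workers)={obs}")
```

A learned PG or DQN policy would take the place of `reactive_policy` in the same loop; the environment interface stays unchanged.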

All experiments run on a modest Kubernetes cluster with OpenFaaS installed, processing synthetic workloads that mimic bursty, latency‑sensitive streams.
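The summary does not specify how the bursty synthetic workload is generated. One common way to produce such traffic, swapped in here purely as an assumption, is a two‑state (calm/burst) Markov‑modulated Poisson process. The function names, rates, and switching probabilities below are hypothetical.

```python
import math
import random


def poisson(rng, mean):
    """Knuth's method for sampling a Poisson-distributed integer (fine for small means)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def bursty_arrivals(steps, calm_rate=5.0, burst_rate=40.0,
                    p_enter_burst=0.05, p_leave_burst=0.2, seed=0):
    """Hypothetical two-state Markov-modulated Poisson arrival process.
    Returns per-step arrival counts and whether each step was in a burst."""
    rng = random.Random(seed)
    bursting = False
    counts, states = [], []
    for _ in range(steps):
        # Switch state with a fixed per-step probability.
        if bursting and rng.random() < p_leave_burst:
            bursting = False
        elif not bursting and rng.random() < p_enter_burst:
            bursting = True
        counts.append(poisson(rng, burst_rate if bursting else calm_rate))
        states.append(bursting)
    return counts, states


counts, states = bursty_arrivals(1000)
burst = [c for c, s in zip(counts, states) if s]
calm = [c for c, s in zip(counts, states) if not s]
print(f"burst steps: {len(burst)}, mean arrivals: burst={sum(burst) / len(burst):.1f}, "
      f"calm={sum(calm) / len(calm):.1f}")
```

With these settings the process spends about 20% of steps in bursts (0.05 / (0.05 + 0.2)), long enough to stress an autoscaler without making bursts the norm.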

Results & Findings

Metric	Reactive Baseline	RL‑PG	RL‑DQN
95th‑percentile latency	210 ms	165 ms	158 ms
Average number of workers	12.4	10.7	10.5
Scaling oscillations (scale‑up/down events per minute)	8.2	4.1	3.9
Cold‑start penalty impact	noticeable spikes	mitigated	mitigated
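The relative improvements behind the findings can be recomputed directly from the table. This short calculation uses only the reported numbers; `relative_reduction` is a hypothetical helper name.

```python
# Values transcribed from the results table above.
results = {
    "p95 latency (ms)": {"reactive": 210.0, "pg": 165.0, "dqn": 158.0},
    "avg workers": {"reactive": 12.4, "pg": 10.7, "dqn": 10.5},
    "scaling events/min": {"reactive": 8.2, "pg": 4.1, "dqn": 3.9},
}


def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


for metric, row in results.items():
    for agent in ("pg", "dqn"):
        pct = relative_reduction(row["reactive"], row[agent])
        print(f"{metric:20s} {agent:4s} {pct:5.1f}% below reactive")
```

This gives roughly 21% (PG) and 25% (DQN) lower 95th‑percentile latency, about 13.7% and 15.3% fewer workers, and 50% and about 52% fewer scaling events than the reactive baseline.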

QoS improvement: Both RL agents cut 95th‑percentile latency by roughly 21–25% (to 165 ms and 158 ms, versus 210 ms), bringing it much closer to the 150 ms target, while the reactive controller frequently violates that target during bursts.
Resource efficiency: RL policies use roughly 14–15% fewer workers on average (10.7 and 10.5 versus 12.4), which should translate into lower cost on pay‑per‑use platforms.
Stability: The learned policies roughly halve the number of scaling events (4.1 and 3.9 versus 8.2 per minute) and avoid the “thrashing” seen with the reactive rule, because the reward function penalizes unnecessary scaling.
Platform awareness: RL agents implicitly learn OpenFaaS‑specific constraints (e.g., maximum concurrent function instances) and adapt scaling decisions accordingly, something the simple model cannot capture.
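The summary says the reward penalizes unnecessary scaling but does not give its formula. The sketch below assumes a simple hypothetical shape (a QoS term, a per‑worker cost, and a fixed penalty whenever the worker count changes) to show why such a term discourages thrashing. `step_reward`, `episode_return`, and all weights are illustrative.

```python
def step_reward(latency_ms, workers, prev_workers, qos_ms=150.0,
                worker_cost=0.05, switch_penalty=0.5):
    """Hypothetical per-step reward: QoS compliance, minus resource use,
    minus a penalty whenever the worker count changes."""
    qos_term = 1.0 if latency_ms <= qos_ms else -1.0
    scaling_term = switch_penalty if workers != prev_workers else 0.0
    return qos_term - worker_cost * workers - scaling_term


def episode_return(latencies_ms, worker_counts, initial_workers):
    """Sum of step rewards over a trajectory of (latency, worker count) pairs."""
    total, prev = 0.0, initial_workers
    for latency, workers in zip(latencies_ms, worker_counts):
        total += step_reward(latency, workers, prev)
        prev = workers
    return total


# Same latency (always within QoS) under two scaling behaviours.
latencies = [120.0] * 6
steady = [10] * 6
thrashing = [10, 11, 10, 11, 10, 11]
print(episode_return(latencies, steady, 10), episode_return(latencies, thrashing, 10))
```

Both trajectories meet QoS at every step, yet the steady one scores about 3.0 and the oscillating one only about 0.35, so an agent maximizing this return learns to avoid flip‑flopping between worker counts.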

Practical Implications

Serverless HPC workloads – Developers could run embarrassingly parallel jobs (image processing, Monte‑Carlo simulations, data enrichment pipelines) on serverless infrastructure with much less hand‑tuning of autoscaling rules.
Cost‑aware scaling – By embedding resource‑usage penalties in the reward, RL agents can automatically balance performance against cloud spend, a common concern for DevOps teams.
Plug‑and‑play integration – The Gymnasium‑compatible control layer means existing RL libraries (Stable‑Baselines3, RLlib) can be swapped in with minimal code changes, opening the door for custom policies tailored to specific SLAs.
Resilience to platform quirks – Cold‑start delays, function concurrency limits, and throttling are learned rather than manually modeled, which could reduce the engineering effort required when moving between serverless providers (OpenFaaS, Knative, AWS Lambda, etc.).
Future‑proofing – The same architecture can be extended to other skeletons (pipeline, map‑reduce) or to hybrid “continuum” environments that blend edge, fog, and cloud resources.
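The cost‑aware trade‑off can be illustrated with a single weight on resource usage. This sketch does not use the paper's analytical model; it assumes a textbook approximation in which arrivals are split uniformly at random over `workers` independent M/M/1 queues, so mean response time is 1 / (μ − λ/n). The reward shape, rates, and helper names are hypothetical.

```python
import math


def mean_latency_ms(workers, arrival_rate, service_rate):
    """Mean response time when arrivals (tasks/s) are split uniformly at random over
    `workers` independent M/M/1 queues, each serving `service_rate` tasks/s."""
    per_worker = arrival_rate / workers
    if per_worker >= service_rate:
        return math.inf  # unstable: the queue grows without bound
    return 1000.0 / (service_rate - per_worker)


def reward(workers, arrival_rate, service_rate, qos_ms, cost_weight):
    """Hypothetical reward: relative QoS violation plus a weighted resource cost."""
    latency = mean_latency_ms(workers, arrival_rate, service_rate)
    violation = max(0.0, latency - qos_ms) / qos_ms  # 0 when the target is met
    return -violation - cost_weight * workers


def best_worker_count(arrival_rate, service_rate, qos_ms, cost_weight, max_workers=32):
    """Worker count maximizing the reward (ties go to the smallest count)."""
    return max(range(1, max_workers + 1),
               key=lambda n: reward(n, arrival_rate, service_rate, qos_ms, cost_weight))


# 100 tasks/s, 10 tasks/s per worker, 150 ms target; sweep the cost weight.
for weight in (0.0, 0.001, 0.01, 0.05, 0.2):
    n = best_worker_count(100.0, 10.0, 150.0, weight)
    print(f"cost_weight={weight}: {n} workers, mean latency {mean_latency_ms(n, 100.0, 10.0):.0f} ms")
```

Raising the cost weight can only lower (never raise) the best worker count, so a single reward coefficient acts as a dial between QoS and spend.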

Limitations & Future Work

Workload diversity – Experiments used synthetic, independent tasks; real‑world applications with data dependencies or variable compute intensity may require richer state representations.
Training overhead – The RL agents need an offline training phase; rapid deployment scenarios might benefit from online or meta‑learning approaches.
Scalability of the controller – The central Gymnasium loop could become a bottleneck in massive multi‑tenant clusters; a decentralized or hierarchical control scheme is a promising direction.
Generalization across platforms – While the framework is OpenFaaS‑centric, porting to managed serverless services will need adapters for differing scaling APIs and metric exposure.
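One way to structure such adapters is to have the controller depend only on a small scaling interface, with one implementation per platform. The sketch below is a hypothetical design, not the authors' code: `ScalingBackend`, `InMemoryBackend`, and `openfaas_scale_request` are illustrative names. The OpenFaaS request assumes the gateway's `POST /system/scale-function/{name}` endpoint with a `serviceName`/`replicas` JSON body; verify this against your gateway version's API documentation. The request is built but not sent, and authentication is omitted.

```python
import json
import urllib.request
from abc import ABC, abstractmethod


class ScalingBackend(ABC):
    """Hypothetical adapter interface: the controller talks only to this,
    and each serverless platform gets its own implementation."""

    @abstractmethod
    def set_replicas(self, function_name: str, replicas: int) -> None: ...

    @abstractmethod
    def get_replicas(self, function_name: str) -> int: ...


class InMemoryBackend(ScalingBackend):
    """Fake backend for testing controllers offline; enforces a concurrency cap."""

    def __init__(self, max_replicas: int = 20):
        self.max_replicas = max_replicas
        self.replicas = {}

    def set_replicas(self, function_name, replicas):
        self.replicas[function_name] = max(0, min(self.max_replicas, replicas))

    def get_replicas(self, function_name):
        return self.replicas.get(function_name, 0)


def openfaas_scale_request(gateway_url, function_name, replicas):
    """Build (but do not send) a scale request for an OpenFaaS gateway.
    Endpoint path and body shape are assumptions to check against your gateway."""
    body = json.dumps({"serviceName": function_name, "replicas": replicas}).encode()
    return urllib.request.Request(
        f"{gateway_url.rstrip('/')}/system/scale-function/{function_name}",
        data=body, method="POST", headers={"Content-Type": "application/json"})


backend = InMemoryBackend(max_replicas=20)
backend.set_replicas("farm-worker", 25)
print(backend.get_replicas("farm-worker"))  # 20: clamped to the cap
req = openfaas_scale_request("http://127.0.0.1:8080", "farm-worker", 5)
print(req.get_method(), req.full_url)
```

Porting to another platform would then mean writing one more `ScalingBackend` subclass rather than changing the controller.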

The authors plan to explore multi‑agent RL for coordinating several skeletons simultaneously and to integrate transfer learning techniques so that policies trained on one workload can bootstrap scaling on another.

Authors

Lanpei Li
Massimo Coppola
Malio Li
Valerio Besozzi
Jack Bell
Vincenzo Lomonaco

Paper Information

arXiv ID: 2602.06555v1
Categories: cs.DC, cs.LG
Published: February 6, 2026