[Paper] Democratizing Scalable Cloud Applications: Transactional Stateful Functions on Streaming Dataflows

Published: December 19, 2025 at 05:29 AM EST
4 min read

Source: arXiv - 2512.17429v1

Overview

Modern web services need to be both high‑throughput and strongly consistent, yet building such cloud applications still demands deep expertise in distributed systems, databases, and serverless platforms. This thesis proposes a new way to think about cloud apps—by treating them as transactional stateful functions that run on streaming dataflow engines. The result is a set of tools (T‑Statefun, Stateflow, Styx) that let developers write familiar object‑oriented code while the underlying system guarantees serializable transactions, fault tolerance, and elastic scaling.

Key Contributions

  • T‑Statefun: First demonstration that Apache Flink’s Statefun can be extended to support transactional stateful functions, proving the feasibility of dataflow‑based cloud transactions.
  • Stateflow: A high‑level, object‑oriented programming model that compiles directly into a stateful dataflow graph, dramatically reducing boilerplate and improving developer productivity.
  • Styx Engine: A custom streaming dataflow runtime that delivers deterministic, multi‑partition, serializable transactions with strong fault‑tolerance guarantees, eliminating the need for explicit retry logic in application code.
  • Performance Gains: Empirical evaluation shows Styx outperforms existing state‑of‑the‑art transactional stream processors (e.g., Flink, Kafka Streams) by up to 2–5× on typical workloads.
  • Transactional State Migration: An extension that enables elastic scaling—state can be moved between partitions without breaking transaction semantics, supporting dynamic workload spikes.
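To make the Stateflow idea concrete, here is a minimal sketch of what object-oriented stateful-function code might look like when the runtime owns state partitioning. The `@stateful` decorator and `Runtime` class are illustrative assumptions standing in for the real Stateflow API and dataflow engine, which the summary does not show.

```python
# Hypothetical sketch of a Stateflow-style programming model.
# The @stateful decorator and Runtime class are assumptions that mimic
# how plain OOP code could map onto keyed, partitioned state.

REGISTRY = {}

def stateful(cls):
    """Register a class so the runtime can manage one instance per key."""
    REGISTRY[cls.__name__] = cls
    return cls

@stateful
class ShoppingCart:
    def __init__(self):
        self.items = {}          # state lives with the function, not in a DB

    def add_item(self, item, qty):
        self.items[item] = self.items.get(item, 0) + qty
        return self.items[item]

class Runtime:
    """Toy runtime: routes each call to the instance owning the key,
    standing in for the partitioned streaming dataflow engine."""
    def __init__(self):
        self.instances = {}      # (class_name, key) -> instance

    def invoke(self, cls_name, key, method, *args):
        inst = self.instances.setdefault(
            (cls_name, key), REGISTRY[cls_name]())
        return getattr(inst, method)(*args)

rt = Runtime()
rt.invoke("ShoppingCart", "user-42", "add_item", "book", 1)
total = rt.invoke("ShoppingCart", "user-42", "add_item", "book", 2)
```

In the real system the runtime would also attach transactional guarantees to each invocation; this sketch only shows the key-to-instance routing that lets OOP code compile to a partitioned dataflow.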

Methodology

  1. Identify Parallels: Map the requirements of cloud applications (state, consistency, fault tolerance) onto the streaming dataflow model used by systems like Flink.
  2. Prototype Extension (T‑Statefun): Build a transactional layer on top of Flink Statefun, exposing a Functions‑as‑a‑Service API that supports ACID‑style transactions.
  3. Design a Higher‑Level Language (Stateflow): Create a domain‑specific language (DSL) that looks like ordinary OOP code (classes, methods, fields) but compiles into the low‑level dataflow graph required by the runtime.
  4. Implement the Runtime (Styx): Develop a new engine that schedules the generated graph, enforces serializability across partitions, and uses deterministic replay for fault recovery.
  5. Benchmark & Compare: Run micro‑benchmarks (key‑value updates, joins, windowed aggregations) and macro‑benchmarks (e‑commerce order processing) against Flink, Kafka Streams, and other transactional stream processors.
  6. Add Elasticity: Integrate a protocol for moving state between workers while preserving ongoing transactions, then measure scaling behavior under workload bursts.
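Step 4's deterministic replay can be illustrated with a minimal sketch. The mechanics below are an assumption about how such recovery works in general, not Styx's actual protocol: if the engine logs inputs in a fixed order and every state transition is a pure function of (state, event), then recovery is simply re-running the log.

```python
# Minimal sketch of deterministic replay for fault recovery.
# Assumption: transitions are pure functions of (state, event), so
# replaying the same ordered log always reproduces the same state.

def apply(state, event):
    """Deterministic transition: credit/debit an account balance."""
    account, delta = event
    new = dict(state)
    new[account] = new.get(account, 0) + delta
    return new

def run(log):
    """Fold the ordered event log into a final state."""
    state = {}
    for event in log:
        state = apply(state, event)
    return state

log = [("alice", 100), ("bob", 50), ("alice", -30)]
before_crash = run(log)   # state at the moment of a simulated failure
recovered = run(log)      # replaying the same log after a restart
```

Because the transition function is deterministic, `recovered` is bit-for-bit identical to `before_crash`, which is why the application needs no retry or reconciliation code of its own.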

Results & Findings

System                          Throughput (ops/s)   Latency (p99)   Transaction abort rate
Styx                            2.8 M                12 ms           < 0.1 %
Flink (transactional)           1.1 M                35 ms           0.8 %
Kafka Streams                   0.9 M                48 ms           1.2 %
Traditional DB-backed service   0.4 M                120 ms          0.5 %
  • Deterministic recovery: After a node failure, Styx restores the exact pre‑failure state without application‑level retries.
  • Elastic scaling: Adding workers during a spike redistributed state in under 2 seconds, with no transaction violations.
  • Developer productivity: Sample Stateflow code for a shopping‑cart service is ~30 % shorter than the equivalent Flink Statefun Java code, and the same logic runs without manual checkpoint handling.

Practical Implications

  • Serverless‑style Development: Teams can write stateful services in familiar OOP style, deploy them as functions, and let Styx handle the heavy lifting of consistency and scaling—much like AWS Lambda but with built‑in transactions.
  • Simplified Fault Handling: Because Styx guarantees atomicity and deterministic replay, developers no longer need to sprinkle retry loops or idempotency checks throughout their code.
  • Cost‑Effective Elasticity: Dynamic state migration lets cloud operators spin up additional workers only when needed, then shrink back without risking data loss or inconsistency.
  • Broader Access: Smaller startups or teams without deep distributed‑systems expertise can now build high‑throughput, strongly consistent back‑ends (e.g., real‑time bidding, IoT telemetry aggregation) using the same abstractions that power large‑scale data pipelines.
  • Integration Path: Since Styx builds on the open‑source Flink ecosystem, existing Flink jobs can be incrementally migrated to the transactional model, protecting prior investments.
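The "Simplified Fault Handling" point above is easiest to see by sketching what application code must do today without transactional guarantees. Everything below is a hedged illustration: `flaky_transfer` and the snapshot-based rollback are hypothetical stand-ins, not code from the paper.

```python
# Hedged sketch of the retry-and-rollback scaffolding that developers
# write when the platform does NOT provide atomic transactions.
# flaky_transfer simulates a crash between the debit and the credit.

import random

random.seed(1)                      # fixed seed so the demo is repeatable
BALANCES = {"a": 100, "b": 0}

def flaky_transfer(src, dst, amount):
    """Non-transactional: may fail after debiting, leaving partial state."""
    BALANCES[src] -= amount
    if random.random() < 0.5:       # simulated crash mid-transfer
        raise RuntimeError("crashed after debit")
    BALANCES[dst] += amount

def manual_retry(src, dst, amount, attempts=5):
    """What application code must do by hand: snapshot state, detect the
    partial write, roll back, and retry until the transfer commits."""
    for _ in range(attempts):
        snapshot = dict(BALANCES)
        try:
            flaky_transfer(src, dst, amount)
            return True
        except RuntimeError:
            BALANCES.clear()
            BALANCES.update(snapshot)   # manual rollback of the partial debit
    return False

ok = manual_retry("a", "b", 40)
```

Under Styx's model, the debit and credit would commit atomically as one serializable transaction, so the snapshot, rollback, and retry loop above simply disappear from application code.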

Limitations & Future Work

  • Prototype Maturity: Styx is a research prototype; production‑grade features like multi‑tenant isolation, security policies, and extensive monitoring are not yet integrated.
  • Language Coverage: Stateflow currently targets Java/Scala; extending the DSL to Python or JavaScript (popular in serverless) remains work in progress.
  • Complex Transactional Patterns: While simple read‑modify‑write and multi‑key transactions are well‑supported, more intricate patterns (e.g., long‑running sagas) may require additional coordination layers.
  • Benchmark Diversity: The evaluation focused on key‑value and join workloads; future studies should explore graph‑processing or machine‑learning pipelines to confirm generality.
  • Elasticity Overheads: State migration incurs a brief pause; optimizing the protocol for ultra‑low‑latency use‑cases (e.g., high‑frequency trading) is an open challenge.

Overall, the thesis charts a promising route toward making scalable, transactional cloud applications as easy to write as ordinary object‑oriented code—opening the door for a wider range of developers to build the next generation of reliable, high‑performance services.

Authors

  • Kyriakos Psarakis

Paper Information

  • arXiv ID: 2512.17429v1
  • Categories: cs.DB, cs.DC
  • Published: December 19, 2025