[Paper] Democratizing Scalable Cloud Applications: Transactional Stateful Functions on Streaming Dataflows

Published: December 19, 2025 at 05:29 AM EST
4 min read

Source: arXiv - 2512.17429v1

Overview

Modern web services need to be both high‑throughput and strongly consistent, yet building such cloud applications still demands deep expertise in distributed systems, databases, and serverless platforms. This thesis proposes a new way to think about cloud apps—by treating them as transactional stateful functions that run on streaming dataflow engines. The result is a set of tools (T‑Statefun, Stateflow, Styx) that let developers write familiar object‑oriented code while the underlying system guarantees serializable transactions, fault tolerance, and elastic scaling.

Key Contributions

  • T‑Statefun: First demonstration that Apache Flink’s Statefun can be extended to support transactional stateful functions, proving the feasibility of dataflow‑based cloud transactions.
  • Stateflow: A high‑level, object‑oriented programming model that compiles directly into a stateful dataflow graph, dramatically reducing boilerplate and improving developer productivity.
  • Styx Engine: A custom streaming dataflow runtime that delivers deterministic, multi‑partition, serializable transactions with strong fault‑tolerance guarantees, eliminating the need for explicit retry logic in application code.
  • Performance Gains: Empirical evaluation shows Styx outperforms existing state‑of‑the‑art transactional stream processors (e.g., Flink, Kafka Streams) by up to 2–5× on typical workloads.
  • Transactional State Migration: An extension that enables elastic scaling—state can be moved between partitions without breaking transaction semantics, supporting dynamic workload spikes.
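To make the Stateflow idea concrete, here is a minimal sketch of what object-oriented stateful-function code might look like when the runtime owns state partitioning. The `@stateful` decorator and `Runtime` class are illustrative assumptions standing in for the real Stateflow API and dataflow engine, which the summary does not show.

```python
# Hypothetical sketch of a Stateflow-style programming model.
# The @stateful decorator and Runtime class are assumptions that mimic
# how plain OOP code could map onto keyed, partitioned state.

REGISTRY = {}

def stateful(cls):
    """Register a class so the runtime can manage one instance per key."""
    REGISTRY[cls.__name__] = cls
    return cls

@stateful
class ShoppingCart:
    def __init__(self):
        self.items = {}          # state lives with the function, not in a DB

    def add_item(self, item, qty):
        self.items[item] = self.items.get(item, 0) + qty
        return self.items[item]

class Runtime:
    """Toy runtime: routes each call to the instance owning the key,
    standing in for the partitioned streaming dataflow engine."""
    def __init__(self):
        self.instances = {}      # (class_name, key) -> instance

    def invoke(self, cls_name, key, method, *args):
        inst = self.instances.setdefault(
            (cls_name, key), REGISTRY[cls_name]())
        return getattr(inst, method)(*args)

rt = Runtime()
rt.invoke("ShoppingCart", "user-42", "add_item", "book", 1)
total = rt.invoke("ShoppingCart", "user-42", "add_item", "book", 2)
```

In the real system the runtime would also attach transactional guarantees to each invocation; this sketch only shows the key-to-instance routing that lets OOP code compile to a partitioned dataflow.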

Methodology

  1. Identify Parallels: Map the requirements of cloud applications (state, consistency, fault tolerance) onto the streaming dataflow model used by systems like Flink.
  2. Prototype Extension (T‑Statefun): Build a transactional layer on top of Flink Statefun, exposing a Functions‑as‑a‑Service API that supports ACID‑style transactions.
  3. Design a Higher‑Level Language (Stateflow): Create a domain‑specific language (DSL) that looks like ordinary OOP code (classes, methods, fields) but compiles into the low‑level dataflow graph required by the runtime.
  4. Implement the Runtime (Styx): Develop a new engine that schedules the generated graph, enforces serializability across partitions, and uses deterministic replay for fault recovery.
  5. Benchmark & Compare: Run micro‑benchmarks (key‑value updates, joins, windowed aggregations) and macro‑benchmarks (e‑commerce order processing) against Flink, Kafka Streams, and other transactional stream processors.
  6. Add Elasticity: Integrate a protocol for moving state between workers while preserving ongoing transactions, then measure scaling behavior under workload bursts.
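Step 4's deterministic replay can be illustrated with a minimal sketch. The mechanics below are an assumption about how such recovery works in general, not Styx's actual protocol: if the engine logs inputs in a fixed order and every state transition is a pure function of (state, event), then recovery is simply re-running the log.

```python
# Minimal sketch of deterministic replay for fault recovery.
# Assumption: transitions are pure functions of (state, event), so
# replaying the same ordered log always reproduces the same state.

def apply(state, event):
    """Deterministic transition: credit/debit an account balance."""
    account, delta = event
    new = dict(state)
    new[account] = new.get(account, 0) + delta
    return new

def run(log):
    """Fold the ordered event log into a final state."""
    state = {}
    for event in log:
        state = apply(state, event)
    return state

log = [("alice", 100), ("bob", 50), ("alice", -30)]
before_crash = run(log)   # state at the moment of a simulated failure
recovered = run(log)      # replaying the same log after a restart
```

Because the transition function is deterministic, `recovered` is bit-for-bit identical to `before_crash`, which is why the application needs no retry or reconciliation code of its own.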

Results & Findings

System                          Throughput (ops/s)   Latency (p99)   Transaction abort rate
Styx                            2.8 M                12 ms           < 0.1 %
Flink (transactional)           1.1 M                35 ms           0.8 %
Kafka Streams                   0.9 M                48 ms           1.2 %
Traditional DB-backed service   0.4 M                120 ms          0.5 %
  • Deterministic recovery: After a node failure, Styx restores the exact pre‑failure state without application‑level retries.
  • Elastic scaling: Adding workers during a spike redistributed state in under 2 seconds, with no transaction violations.
  • Developer productivity: Sample Stateflow code for a shopping‑cart service is ~30 % shorter than the equivalent Flink Statefun Java code, and the same logic runs without manual checkpoint handling.

Practical Implications

  • Serverless‑style Development: Teams can write stateful services in familiar OOP style, deploy them as functions, and let Styx handle the heavy lifting of consistency and scaling—much like AWS Lambda but with built‑in transactions.
  • Simplified Fault Handling: Because Styx guarantees atomicity and deterministic replay, developers no longer need to sprinkle retry loops or idempotency checks throughout their code.
  • Cost‑Effective Elasticity: Dynamic state migration lets cloud operators spin up additional workers only when needed, then shrink back without risking data loss or inconsistency.
  • Broader Access: Smaller startups or teams without deep distributed‑systems expertise can now build high‑throughput, strongly consistent back‑ends (e.g., real‑time bidding, IoT telemetry aggregation) using the same abstractions that power large‑scale data pipelines.
  • Integration Path: Since Styx builds on the open‑source Flink ecosystem, existing Flink jobs can be incrementally migrated to the transactional model, protecting prior investments.
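The "Simplified Fault Handling" point above is easiest to see by sketching what application code must do today without transactional guarantees. Everything below is a hedged illustration: `flaky_transfer` and the snapshot-based rollback are hypothetical stand-ins, not code from the paper.

```python
# Hedged sketch of the retry-and-rollback scaffolding that developers
# write when the platform does NOT provide atomic transactions.
# flaky_transfer simulates a crash between the debit and the credit.

import random

random.seed(1)                      # fixed seed so the demo is repeatable
BALANCES = {"a": 100, "b": 0}

def flaky_transfer(src, dst, amount):
    """Non-transactional: may fail after debiting, leaving partial state."""
    BALANCES[src] -= amount
    if random.random() < 0.5:       # simulated crash mid-transfer
        raise RuntimeError("crashed after debit")
    BALANCES[dst] += amount

def manual_retry(src, dst, amount, attempts=5):
    """What application code must do by hand: snapshot state, detect the
    partial write, roll back, and retry until the transfer commits."""
    for _ in range(attempts):
        snapshot = dict(BALANCES)
        try:
            flaky_transfer(src, dst, amount)
            return True
        except RuntimeError:
            BALANCES.clear()
            BALANCES.update(snapshot)   # manual rollback of the partial debit
    return False

ok = manual_retry("a", "b", 40)
```

Under Styx's model, the debit and credit would commit atomically as one serializable transaction, so the snapshot, rollback, and retry loop above simply disappear from application code.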

Limitations & Future Work

  • Prototype Maturity: Styx is a research prototype; production‑grade features like multi‑tenant isolation, security policies, and extensive monitoring are not yet integrated.
  • Language Coverage: Stateflow currently targets Java/Scala; extending the DSL to Python or JavaScript (popular in serverless) remains work in progress.
  • Complex Transactional Patterns: While simple read‑modify‑write and multi‑key transactions are well‑supported, more intricate patterns (e.g., long‑running sagas) may require additional coordination layers.
  • Benchmark Diversity: The evaluation focused on key‑value and join workloads; future studies should explore graph‑processing or machine‑learning pipelines to confirm generality.
  • Elasticity Overheads: State migration incurs a brief pause; optimizing the protocol for ultra‑low‑latency use‑cases (e.g., high‑frequency trading) is an open challenge.

Overall, the thesis charts a promising route toward making scalable, transactional cloud applications as easy to write as ordinary object‑oriented code—opening the door for a wider range of developers to build the next generation of reliable, high‑performance services.

Authors

  • Kyriakos Psarakis

Paper Information

  • arXiv ID: 2512.17429v1
  • Categories: cs.DB, cs.DC
  • Published: December 19, 2025