[Paper] Spark Policy Toolkit: Semantic Contracts and Scalable Execution for Policy Learning in Spark
Source: arXiv - 2604.25061v1
Overview
The paper introduces Spark Policy Toolkit, a set of Spark‑native primitives that make it possible to train and serve custom decision‑policy models at the scale of modern data‑lake workloads. By eliminating costly row‑wise Python inference and driver‑side candidate collection, the toolkit preserves the exact semantics of a policy while delivering multi‑million‑row‑per‑second throughput on a 40‑node Databricks cluster.
Key Contributions
- Two new Spark primitives:
  - Vectorized inference via `mapInPandas` and `mapInArrow`, enabling partition‑level, batch‑wise scoring that avoids per‑row Python UDF overhead.
  - Collect‑less split search that evaluates split candidates directly on executors, removing the need to materialize large candidate sets on the driver.
- Fixed‑input semantic contract that guarantees identical per‑row score vectors, split decisions, and final policy outputs as long as the input rows, feature order, treatment vocab, preprocessing manifest, and split boundaries stay unchanged.
- Comprehensive evaluation framework covering baseline ladders, backend parity checks, scale‑up split‑search experiments, synthetic and real‑world (Hillstrom) end‑to‑end policy preservation, missing‑value stress tests, and adversarial failure catalogs.
- Performance benchmarks showing up to 7.23 M rows/s inference with `mapInArrow`, and robust split‑search validity for candidate sets ranging from 10 to 1 000 features at 124 k rows.
- Empirical guidance on when to prefer `mapInArrow` vs. `mapInPandas` (18 of 24 backend‑ablation settings favor Arrow, 6 favor Pandas), establishing that the optimal backend is workload‑dependent.
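To make the vectorized-inference primitive concrete, here is a minimal sketch of a partition-level scorer in the shape that Spark's `mapInPandas` expects: a generator consuming an iterator of pandas DataFrames and yielding scored DataFrames, so no per-row Python call is ever made. The feature names, treatment vocabulary, and linear "policy model" are illustrative assumptions, not the toolkit's actual API.

```python
from typing import Iterator

import numpy as np
import pandas as pd

# Hypothetical fixed feature order and treatment vocabulary; under the paper's
# semantic contract these must stay unchanged for results to be reproducible.
FEATURES = ["f0", "f1", "f2"]
TREATMENTS = ["control", "treat"]

def score_partition(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    """Partition-level scorer: consumes an iterator of pandas DataFrames
    (one per Arrow batch) and yields scored DataFrames in bulk."""
    # Illustrative linear policy weights, shape (features, treatments);
    # a real pipeline would broadcast trained model weights to executors.
    weights = np.array([[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]])
    for pdf in batches:
        scores = pdf[FEATURES].to_numpy() @ weights  # (rows, treatments)
        out = pdf.copy()
        for j, t in enumerate(TREATMENTS):
            out[f"score_{t}"] = scores[:, j]
        out["decision"] = [TREATMENTS[k] for k in scores.argmax(axis=1)]
        yield out
```

In a Spark job the same function would be passed as `df.mapInPandas(score_partition, schema=...)`; the body itself needs no Spark-specific code, which is what makes local testing easy.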
Methodology
- Semantic Contract Definition – The authors formalize a “fixed‑input lock” that ties together all components of a policy pipeline (raw rows, feature ordering, treatment encoding, preprocessing steps, and split boundaries). If any of these change, the contract is broken and results may drift.
- Primitive Implementation –
mapInPandas/mapInArrowreceive an entire partition as a Pandas DataFrame or Arrow Table, run the model’s inference in a vectorized fashion, and emit a new DataFrame with score vectors.- The collect‑less split search replaces the classic driver‑side
collect()of candidate rows with an executor‑sidemapPartitionsthat scores each candidate locally and returns only the best split decision.
- Experimental Design – The authors construct a “baseline ladder” (from naïve row‑wise UDFs to the new primitives) and run a battery of tests:
- Throughput (rows per second) across varying data sizes (10 M–50 M rows).
- Scale (number of features F from 10 to 1 000).
- Semantic preservation under perturbations (repartition, coalesce, shuffle, column order changes).
- Stress tests for missing values and quantile boundary sensitivity.
- Adversarial scenarios (e.g., malformed manifests) to catalog failure modes.
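The collect-less split search described above can be sketched as a two-step aggregate pattern: executors compute per-candidate sufficient statistics over their partitions, and the driver only merges those small aggregates, never the rows themselves. The function names and the squared-error criterion below are illustrative assumptions standing in for the paper's candidate-scoring logic, and plain Python lists stand in for Spark partitions.

```python
from functools import reduce
from typing import Dict, Iterable, List, Tuple

Key = Tuple[float, str]

def partition_stats(rows: List[float], candidates: List[float]) -> Dict[Key, tuple]:
    """Executor-side step (the mapPartitions analogue): per-candidate
    sufficient statistics (count, sum, sum of squares) for each split side."""
    stats: Dict[Key, tuple] = {}
    for t in candidates:
        left = [x for x in rows if x <= t]
        right = [x for x in rows if x > t]
        for side, xs in (("L", left), ("R", right)):
            stats[(t, side)] = (len(xs), sum(xs), sum(x * x for x in xs))
    return stats

def merge_stats(a: Dict[Key, tuple], b: Dict[Key, tuple]) -> Dict[Key, tuple]:
    """Driver-side reduce: element-wise sums; size is O(candidates), never O(rows)."""
    return {k: tuple(x + y for x, y in zip(a.get(k, (0, 0, 0)), b.get(k, (0, 0, 0))))
            for k in set(a) | set(b)}

def best_split(partitions: Iterable[List[float]], candidates: List[float]) -> float:
    """Pick the threshold minimizing total within-side squared error,
    using only the merged per-candidate aggregates."""
    stats = reduce(merge_stats, (partition_stats(p, candidates) for p in partitions))
    def sse(c, s, ss):  # sum of squared deviations from the side's mean
        return ss - s * s / c if c else 0.0
    return min(candidates,
               key=lambda t: sse(*stats[(t, "L")]) + sse(*stats[(t, "R")]))
```

Because only the fixed-size statistics dictionary ever crosses to the driver, the pattern avoids the driver-memory blowup that a `collect()` of candidate rows would cause.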
Results & Findings
| Metric | mapInPandas | mapInArrow |
|---|---|---|
| Throughput (10 M rows) | 4.72 M rows/s | 4.72 M rows/s |
| Throughput (50 M rows) | 5.31 M rows/s | 7.23 M rows/s |
| Split‑search validity | Works up to F = 500 | Works up to F = 1 000 |
| Backend win‑rate (24 settings) | 6 wins | 18 wins |
Key Takeaways
- Vectorized inference eliminates the Python‑UDF bottleneck, delivering 5‑7× speed‑ups over traditional row‑wise approaches.
- Collect‑less split search scales gracefully; the driver never becomes a choke point, enabling policy learning on datasets that would otherwise overflow driver memory.
- Enforcing the fixed‑input contract eliminates drift: all six tested repartition/shuffle perturbations produce identical policy signatures once the lock is in place, whereas they diverge without it.
- The performance advantage of Arrow vs. Pandas is workload‑dependent; for some data types (e.g., complex nested structs) Pandas still wins.
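One way to picture the fixed-input lock that prevents the drift above is as a deterministic content signature over every component the contract freezes. This is a minimal sketch, assuming a hash-based lock; the paper's actual mechanism and field names may differ, and `policy_signature` is a hypothetical helper.

```python
import hashlib
import json

def policy_signature(feature_order, treatment_vocab,
                     preprocessing_manifest, split_boundaries) -> str:
    """Hypothetical 'fixed-input lock': a deterministic digest over the
    components the contract ties together (feature order, treatment
    vocabulary, preprocessing manifest, split boundaries). If any
    component changes, the signature changes and the run can be
    flagged as potentially non-reproducible."""
    payload = json.dumps(
        {
            "features": list(feature_order),        # order-sensitive by design
            "treatments": list(treatment_vocab),
            "preprocessing": preprocessing_manifest,
            "splits": [list(b) for b in split_boundaries],
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Comparing the signature recorded at training time against the one computed at serving time gives a cheap guard: identical signatures mean the lock holds; a mismatch means some contract component drifted.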
Practical Implications
- Production‑ready policy pipelines – Teams can now embed sophisticated, per‑row decision policies (e.g., credit‑risk scoring, personalized recommendation rules) directly into Spark jobs without sacrificing latency or correctness.
- Cost savings – By keeping candidate evaluation on executors, memory pressure on the driver is dramatically reduced, allowing smaller driver instances and fewer Spark job restarts.
- Simplified engineering – The semantic contract gives data engineers a clear guarantee: as long as the input schema and preprocessing manifest stay stable, downstream policy runs are reproducible.
- Framework integration – The primitives are built on standard Spark APIs (`mapInPandas`, `mapInArrow`), so they can be dropped into existing pipelines with minimal code changes and run on Databricks, EMR, or self‑managed Spark clusters.
- Performance tuning guidance – The paper's backend‑ablation results give practitioners a decision tree: start with `mapInArrow`; if you hit data‑type incompatibilities or memory spikes, fall back to `mapInPandas`.
Limitations & Future Work
- Fixed‑input lock rigidity – The contract assumes no change in feature order or preprocessing manifest; in highly dynamic feature stores this may require additional version‑control tooling.
- Arrow compatibility – Certain complex data types (e.g., variable‑length binary blobs) still perform better with Pandas, limiting the universal applicability of Arrow.
- Scalability beyond 40 workers – Experiments stop at a 40‑node Databricks cluster; the authors note that network‑topology effects and executor‑side memory fragmentation could surface at larger scales.
- Extending to streaming – The current work focuses on batch pipelines; future research could explore how the semantic contract and collect‑less split search translate to Structured Streaming workloads.
Bottom line: Spark Policy Toolkit bridges the gap between the expressive power of custom policy learning and the performance demands of production‑scale Spark, giving developers a practical, semantics‑preserving path to deploy intelligent decision systems at data‑lake speed.
Authors
- Zeyu Bai
Paper Information
- arXiv ID: 2604.25061v1
- Categories: cs.DC, cs.DB, cs.LG, cs.PF, eess.SY
- Published: April 27, 2026