[Paper] Spark Policy Toolkit: Semantic Contracts and Scalable Execution for Policy Learning in Spark

Published: April 27, 2026
5 min read

Source: arXiv - 2604.25061v1

Overview

The paper introduces the Spark Policy Toolkit, a set of Spark‑native primitives that make it possible to train and serve custom decision‑policy models at the scale of modern data‑lake workloads. By eliminating costly row‑wise Python inference and driver‑side candidate collection, the toolkit preserves the exact semantics of a policy while delivering multi‑million‑row‑per‑second throughput on a 40‑node Databricks cluster.

Key Contributions

  • Two new Spark primitives:
    1. Vectorized inference via mapInPandas and mapInArrow, enabling partition‑level, batch‑wise scoring that replaces per‑row Python UDF calls with Arrow‑backed batch transfer (a minimal sketch follows this list).
    2. Collect‑less split search that evaluates split candidates directly on executors, removing the need to materialize large candidate sets on the driver.
  • Fixed‑input semantic contract that guarantees identical per‑row score vectors, split decisions, and final policy outputs as long as the input rows, feature order, treatment vocab, preprocessing manifest, and split boundaries stay unchanged.
  • Comprehensive evaluation framework covering baseline ladders, backend parity checks, scale‑up split‑search experiments, synthetic and real‑world (Hillstrom) end‑to‑end policy preservation, missing‑value stress tests, and adversarial failure catalogs.
  • Performance benchmarks showing up to 7.23 M rows/s inference with mapInArrow, plus split‑search validity across candidate sets of 10 to 1 000 features on 124 k rows.
  • Empirical guidance on when to prefer mapInArrow vs. mapInPandas (18/24 backend‑ablation settings favor Arrow, 6 favor Pandas), establishing that the optimal backend is workload‑dependent.
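
To make the first primitive concrete, here is a minimal, self‑contained sketch of partition‑level vectorized scoring with mapInPandas. The toy model, feature names, and output schema are illustrative stand‑ins, not the toolkit's actual API.

```python
from typing import Iterator

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical fixed feature order -- under the paper's contract, this
# ordering must stay identical between training and serving.
FEATURES = ["f0", "f1", "f2"]
df = spark.createDataFrame([(0.1, 0.2, 0.3), (0.4, 0.5, 0.6)], FEATURES)

class ToyPolicyModel:
    """Stand-in for a trained policy model (illustrative only)."""
    def predict(self, X: np.ndarray) -> np.ndarray:
        return X.sum(axis=1)  # placeholder scoring rule

def score_batches(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    model = ToyPolicyModel()  # loaded once per Python worker, not once per row
    for pdf in batches:
        # Vectorized inference over the whole Arrow-transferred batch.
        pdf["score"] = model.predict(pdf[FEATURES].to_numpy())
        yield pdf

scored = df.mapInPandas(score_batches, "f0 double, f1 double, f2 double, score double")
scored.show()
```

The mapInArrow variant has the same shape but hands the function pyarrow.RecordBatch objects instead of Pandas DataFrames (and requires Spark 3.3+).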

Methodology

  1. Semantic Contract Definition – The authors formalize a “fixed‑input lock” that ties together all components of a policy pipeline (raw rows, feature ordering, treatment encoding, preprocessing steps, and split boundaries). If any of these change, the contract is broken and results may drift.
  2. Primitive Implementation
    • mapInPandas/mapInArrow receive an entire partition as a Pandas DataFrame or Arrow Table, run the model’s inference in a vectorized fashion, and emit a new DataFrame with score vectors.
    • The collect‑less split search replaces the classic driver‑side collect() of candidate rows with an executor‑side mapPartitions pass that scores each candidate locally and returns only the best split decision (sketched after this list).
  3. Experimental Design – The authors construct a “baseline ladder” (from naïve row‑wise UDFs to the new primitives) and run a battery of tests:
    • Throughput (rows per second) across varying data sizes (10 M–50 M rows).
    • Scale (number of features F from 10 to 1 000).
    • Semantic preservation under perturbations (repartition, coalesce, shuffle, column order changes).
    • Stress tests for missing values and quantile boundary sensitivity.
    • Adversarial scenarios (e.g., malformed manifests) to catalog failure modes.
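
The collect‑less idea in step 2 can be illustrated with a sketch that continues from the Spark session above: each partition emits only small per‑candidate statistics, so candidate rows never travel to the driver. The thresholds, column names, and variance‑reduction gain below are simplified placeholders, not the paper's actual split criterion.

```python
import numpy as np

# Hypothetical candidate thresholds over a single feature "f0".
THRESHOLDS = np.linspace(0.0, 1.0, 11)

# Toy training data with a label column (illustrative only).
train = spark.createDataFrame(
    [(float(i % 10) / 10.0, float(i % 2)) for i in range(1000)], ["f0", "label"])

def partial_stats(rows):
    """Per-partition statistics for every candidate: tiny, fixed-size output."""
    vals = [(r["f0"], r["label"]) for r in rows]
    if not vals:
        return
    x, y = map(np.array, zip(*vals))
    for t in THRESHOLDS:
        left = x <= t
        # (n_left, sum_left, n_total, sum_total) for this partition only
        yield float(t), np.array([left.sum(), y[left].sum(), len(y), y.sum()])

def gain(n_l, s_l, n, s):
    """Placeholder variance-reduction gain (not the paper's criterion)."""
    n_r = n - n_l
    if n_l == 0 or n_r == 0:
        return float("-inf")
    return s_l**2 / n_l + (s - s_l)**2 / n_r - s**2 / n

totals = (train.rdd
          .mapPartitions(partial_stats)
          .reduceByKey(lambda a, b: a + b)  # merge partials per candidate
          .collect())                       # only len(THRESHOLDS) tuples move
best_threshold, _ = max(totals, key=lambda kv: gain(*kv[1]))
```

Only a handful of fixed‑size tuples cross the network per partition, regardless of row count, which is what keeps the driver out of the critical path.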

Results & Findings

| Metric | mapInPandas | mapInArrow |
| --- | --- | --- |
| Throughput (10 M rows) | 4.72 M rows/s | 4.72 M rows/s |
| Throughput (50 M rows) | 5.31 M rows/s | 7.23 M rows/s |
| Split‑search validity | Works up to F = 500 | Works up to F = 1 000 |
| Backend win‑rate (24 settings) | 6 wins | 18 wins |

Key takeaways

  • Vectorized inference eliminates the Python‑UDF bottleneck, delivering 5‑7× speed‑ups over traditional row‑wise approaches.
  • Collect‑less split search scales gracefully; the driver never becomes a choke point, enabling policy learning on datasets that would otherwise overflow driver memory.
  • Enforcing the fixed‑input contract eliminates drift: all six tested repartition/shuffle perturbations produce identical policy signatures once the lock is in place, whereas they diverge without it (a toy signature check follows this list).
  • The performance advantage of Arrow vs. Pandas is workload‑dependent; for some data types (e.g., complex nested structs) Pandas still wins.
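
One way to picture the fixed‑input lock is a signature check: hash the contract components and an order‑independent view of the outputs, then compare hashes before and after a perturbation. The helpers below are a hedged sketch with assumed field names, not the toolkit's API; they reuse the scored DataFrame from the first sketch.

```python
import hashlib
import json

def contract_signature(feature_order, treatment_vocab, manifest, split_boundaries):
    """Deterministic digest of everything the fixed-input contract pins down."""
    payload = json.dumps({
        "features": feature_order,
        "treatments": treatment_vocab,
        "preprocessing": manifest,
        "splits": split_boundaries,
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def output_signature(scored_df, key_col="f0", score_col="score"):
    """Digest of outputs, sorted by a stable key so partitioning cannot matter.

    Collecting here is acceptable because this is an offline validation step,
    not the serving path; rounding absorbs benign floating-point noise.
    """
    rows = scored_df.select(key_col, score_col).orderBy(key_col).collect()
    blob = json.dumps([[r[key_col], round(r[score_col], 12)] for r in rows])
    return hashlib.sha256(blob.encode()).hexdigest()

# Under the lock, a repartition must not change the output signature.
assert output_signature(scored) == output_signature(scored.repartition(8))
```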

Practical Implications

  • Production‑ready policy pipelines – Teams can now embed sophisticated, per‑row decision policies (e.g., credit‑risk scoring, personalized recommendation rules) directly into Spark jobs without sacrificing latency or correctness.
  • Cost savings – Keeping candidate evaluation on executors dramatically reduces memory pressure on the driver, allowing smaller driver instances and fewer Spark job restarts.
  • Simplified engineering – The semantic contract provides a clear contract for data engineers: as long as the input schema and preprocessing manifest stay stable, downstream model updates are guaranteed to be reproducible.
  • Framework integration – The primitives are built on standard Spark APIs (mapInPandas, mapInArrow), meaning they can be dropped into existing pipelines with minimal code changes and work on Databricks, EMR, or self‑managed Spark clusters.
  • Performance tuning guidance – The paper’s backend‑ablation results give practitioners a decision tree: start with mapInArrow; if you hit data‑type incompatibilities or memory spikes, fall back to mapInPandas (a small dispatch sketch follows this list).
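
That decision tree reduces to a small dispatch wrapper, sketched below with the toy scorer from earlier (FEATURES and score_batches); the use_arrow flag and function names are assumptions for illustration. A try/except fallback is less useful in practice because Spark evaluates lazily, so a backend failure surfaces at action time rather than when the transformation is declared.

```python
import pyarrow as pa

def score_arrow(batches):
    """Arrow-native variant of the scorer: pyarrow.RecordBatch in and out."""
    for batch in batches:
        pdf = batch.to_pandas()
        pdf["score"] = pdf[FEATURES].sum(axis=1)  # placeholder scoring rule
        yield pa.RecordBatch.from_pandas(pdf)

def scored_frame(df, use_arrow=True,
                 schema="f0 double, f1 double, f2 double, score double"):
    """Pick the backend per workload; Arrow won 18 of 24 ablation settings."""
    if use_arrow:
        return df.mapInArrow(score_arrow, schema)   # preferred default
    return df.mapInPandas(score_batches, schema)    # fallback for awkward dtypes
```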

Limitations & Future Work

  • Fixed‑input lock rigidity – The contract assumes no change in feature order or preprocessing manifest; in highly dynamic feature stores this may require additional version‑control tooling.
  • Arrow compatibility – Certain complex data types (e.g., variable‑length binary blobs) still perform better with Pandas, limiting the universal applicability of Arrow.
  • Scalability beyond 40 workers – Experiments stop at a 40‑node Databricks cluster; the authors note that network‑topology effects and executor‑side memory fragmentation could surface at larger scales.
  • Extending to streaming – The current work focuses on batch pipelines; future research could explore how the semantic contract and collect‑less split search translate to Structured Streaming workloads.

Bottom line: Spark Policy Toolkit bridges the gap between the expressive power of custom policy learning and the performance demands of production‑scale Spark, giving developers a practical, semantics‑preserving path to deploy intelligent decision systems at data‑lake speed.

Authors

  • Zeyu Bai

Paper Information

  • arXiv ID: 2604.25061v1
  • Categories: cs.DC, cs.DB, cs.LG, cs.PF, eess.SY
  • Published: April 27, 2026