[Paper] PrismaDV: Automated Task-Aware Data Unit Test Generation
Source: arXiv - 2604.21765v1
Overview
The paper introduces PrismaDV, a novel AI‑driven system that automatically creates task‑aware data unit tests. Unlike existing tools that only check whether a dataset looks “reasonable” in isolation, PrismaDV inspects the actual downstream code that consumes the data, extracts the implicit assumptions the code makes, and synthesizes executable tests that surface data bugs exactly where they would break real applications. The authors also propose SIFTA, a lightweight prompt‑optimization loop that continuously refines PrismaDV’s test‑generation prompts using the few execution results it observes.
Key Contributions
- Task‑aware test generation: Combines static analysis of downstream code with data profiling to infer concrete data assumptions (e.g., column types, value ranges, relational constraints).
- PrismaDV architecture: A modular AI pipeline (code‑analysis → assumption inference → test synthesis) that produces runnable data‑unit‑test scripts.
- SIFTA framework: “Selective Informative Feedback for Task Adaptation” – a prompt‑tuning loop that leverages scarce execution outcomes to improve test relevance over time.
- New benchmarks: Two curated suites covering 60 real‑world tasks across five heterogeneous datasets, released publicly for reproducibility.
- Empirical superiority: Demonstrates consistent gains over both task‑agnostic baselines (e.g., Great Expectations, Deequ) and prior task‑aware attempts, measured by fault‑detection precision and downstream task impact.
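To make the "task-aware vs. task-agnostic" distinction concrete, the toy snippet below (column names and checks are assumed for illustration, not taken from the paper) shows rows that pass a generic, dataset-in-isolation check while violating assumptions that only become visible from the downstream code.

```python
import pandas as pd

# Rows that a task-agnostic check would accept but a task-aware check rejects.
df = pd.DataFrame({
    "order_id": [1, 2, 2],   # duplicate key
    "age": [34, -1, 27],     # negative value
})

# Task-agnostic view: no nulls, plausible dtypes -> the data "looks reasonable".
print(df["age"].notna().all(), df["order_id"].notna().all())  # True True

# Task-aware view: the downstream code joins on order_id and bins age,
# so uniqueness and non-negativity are implicit requirements it relies on.
print(df["order_id"].is_unique)   # False -> the join would silently duplicate rows
print((df["age"] >= 0).all())     # False -> the age binning would misbehave
```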
Methodology
- Downstream Code Analysis – PrismaDV parses the Python (or SQL) code that loads and processes the dataset, building an abstract syntax tree (AST) and extracting data‑access patterns (e.g., column selections, joins, aggregations).
- Dataset Profiling – Simultaneously, a lightweight profiler computes statistics (null ratios, histograms, functional dependencies) for each column.
- Assumption Inference – A large‑language model (LLM) is prompted with the combined code‑access map and profile summary. The model outputs a set of implicit assumptions (e.g., `age` must be a non‑negative integer, `order_id` is unique).
- Test Synthesis – Another LLM module translates each assumption into an executable unit test (e.g., using `pytest` or `unittest`). Tests include data mutation (injecting violations) and assertions that the downstream code raises expected errors or produces incorrect outputs (an illustrative sketch follows the pipeline description below).
- SIFTA Prompt Optimization – After running a batch of generated tests, PrismaDV collects the few "informative" outcomes (tests that cause failures in the downstream task). These outcomes are fed back to a prompt‑optimizer that adjusts the LLM prompts to focus on the most impactful assumptions, iterating until test quality plateaus.
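A minimal sketch of what the first two stages might look like is shown below; the function names, the use of Python's `ast` module, and the specific profiled statistics are assumptions made for illustration, since the summary does not include PrismaDV's implementation.

```python
import ast
import pandas as pd

def extract_column_accesses(source: str) -> set[str]:
    """Collect string keys from subscripts such as data["age"] or data[["age", "order_id"]]."""
    columns: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Subscript):
            key = node.slice
            keys = key.elts if isinstance(key, ast.List) else [key]
            for k in keys:
                if isinstance(k, ast.Constant) and isinstance(k.value, str):
                    columns.add(k.value)
    return columns

def profile_columns(df: pd.DataFrame, columns: set[str]) -> dict:
    """Cheap per-column statistics that serve as context for assumption inference."""
    return {
        c: {
            "dtype": str(df[c].dtype),
            "null_ratio": float(df[c].isna().mean()),
            "n_unique": int(df[c].nunique()),
        }
        for c in columns
        if c in df.columns
    }

task_code = 'features = data[["age", "order_id"]]\nadults = data[data["age"] >= 18]'
accessed = extract_column_accesses(task_code)
print(accessed)  # {'age', 'order_id'} (set order may vary)

data = pd.DataFrame({"age": [34, 27, None], "order_id": [1, 2, 3]})
print(profile_columns(data, accessed))
```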
The whole pipeline is orchestrated automatically, requiring only the dataset and the entry‑point script of the downstream task.
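The following is a hypothetical example of the kind of `pytest` test the synthesis stage could emit for the two assumptions above; the loader, column names, and downstream stand‑in are assumptions made for illustration, not the authors' generated code.

```python
import pandas as pd

def load_orders() -> pd.DataFrame:
    """Stand-in for the downstream task's real data loader."""
    return pd.DataFrame({"order_id": [1, 2, 3], "age": [34, 27, 45]})

def test_age_is_a_non_negative_integer():
    df = load_orders()
    assert pd.api.types.is_integer_dtype(df["age"])
    assert (df["age"] >= 0).all()

def test_order_id_is_unique():
    df = load_orders()
    assert df["order_id"].is_unique

def test_injected_violation_changes_downstream_output():
    # Mutation-style check: inject a row that violates both assumptions and
    # verify that a stand-in for the downstream aggregation produces a
    # different (i.e. corrupted) result.
    clean = load_orders()
    corrupted = pd.concat(
        [clean, pd.DataFrame({"order_id": [3], "age": [-5]})], ignore_index=True
    )
    assert not clean.groupby("order_id")["age"].mean().equals(
        corrupted.groupby("order_id")["age"].mean()
    )
```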
Results & Findings
- Fault detection: PrismaDV’s tests caught 23 % more data‑related bugs than the best task‑agnostic baseline and 12 % more than a prior task‑aware system, measured across the 60 tasks.
- Downstream impact: When realistic data errors were injected, tests generated by PrismaDV predicted a 71 % drop in downstream model accuracy, whereas generic tests flagged only 38 % of those cases.
- SIFTA effectiveness: Prompt‑tuned versions of PrismaDV outperformed manually crafted prompts by 9 % in bug‑catching precision and required ≈30 % fewer test executions to converge.
- Scalability: End‑to‑end generation for a typical ETL pipeline (≈200 KB of code, 5 M rows) completed in under 5 minutes on a single GPU‑enabled workstation.
Practical Implications
- Data‑pipeline CI/CD: Teams can plug PrismaDV into their continuous integration pipelines to automatically generate and run data‑unit tests whenever schema changes or new data sources are added.
- Model reliability: By surfacing data assumptions that directly affect model performance, developers can pre‑emptively guard against silent degradation in production ML services.
- Reduced manual QA: Data engineers spend less time writing bespoke validation scripts; PrismaDV produces ready‑to‑run tests that align with the actual business logic.
- Prompt‑as‑a‑service: SIFTA demonstrates a low‑overhead way to keep LLM‑driven tools tuned to a specific codebase without massive labeled data, a pattern that can be reused for other AI‑assisted DevOps tasks.
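The sketch below illustrates the selective‑feedback pattern this reusable idea refers to, as described in the summary; the test generator, execution check, and prompt‑refinement components are deliberately stubbed and are not the authors' implementation.

```python
from typing import Callable

def sifta_style_loop(
    generate_tests: Callable[[str], list[str]],      # stubbed LLM test generator
    exposes_failure: Callable[[str], bool],          # True if a test breaks the downstream task
    refine_prompt: Callable[[str, list[str]], str],  # folds informative outcomes into the prompt
    prompt: str,
    max_rounds: int = 3,
) -> str:
    """Keep only the informative (failure-exposing) outcomes as feedback each round."""
    for _ in range(max_rounds):
        tests = generate_tests(prompt)
        informative = [t for t in tests if exposes_failure(t)]
        if not informative:  # no new signal: adaptation has plateaued
            break
        prompt = refine_prompt(prompt, informative)
    return prompt

# Toy usage with stand-in components.
print(sifta_style_loop(
    generate_tests=lambda p: [f"{p} -> check_nulls", f"{p} -> check_uniqueness"],
    exposes_failure=lambda t: "uniqueness" in t,
    refine_prompt=lambda p, info: p + " +focus-on-keys",
    prompt="base prompt",
))
```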
Limitations & Future Work
- Language support: Current implementation focuses on Python and SQL; extending to Scala/Spark or Java‑based pipelines will require additional parsers.
- Assumption completeness: The LLM may miss subtle domain‑specific constraints (e.g., regulatory rules) that are not evident from code or basic profiling.
- Feedback sparsity: SIFTA relies on a small number of informative test outcomes; in highly stable pipelines where few tests fail, prompt adaptation may stall.
- Future directions: The authors plan to integrate static type‑checking information, explore multimodal LLMs for richer code‑data reasoning, and evaluate PrismaDV on streaming data scenarios.
Authors
- Hao Chen
- Arnab Phani
- Sebastian Schelter
Paper Information
- arXiv ID: 2604.21765v1
- Categories: cs.LG, cs.SE
- Published: April 23, 2026