I built a data-contract validator in pure Python (no pandas, no PyYAML) and it caught a 30% revenue ghost
Source: Dev.to
A few months ago I spent the better part of a day chasing a bug that turned out not to be a bug at all. A downstream dashboard showed revenue had jumped 30% overnight. No deploys, no schema changes, nothing in the logs. After far too long I found it: an upstream system had started sending a total column that no longer equaled subtotal + tax. The pipeline didn’t crash. The data just lied, quietly, and everything downstream believed it.
That’s the thing about data bugs. They rarely throw exceptions. A status field grows a new typo’d value. A join key starts producing orphans. A nullable column that was “never actually null in practice” suddenly is. None of it crashes anything — it just rots the numbers people make decisions on.
So I built DataPact: a small framework for writing down what your data is supposed to look like, and then enforcing it. It’s a data quality and data-contract validation tool, and the whole thing runs on the Python standard library. No pandas, no PyYAML, no network calls.
Live demo report: https://hajirufai.github.io/datapact/report.html
Landing page: https://hajirufai.github.io/datapact/
Source: https://github.com/hajirufai/datapact
The idea: contracts, not assertions scattered everywhere
Most teams already validate data — but it’s usually a pile of ad-hoc assert df["x"].notna().all() lines buried in notebooks and DAGs. Nobody can answer “what are the rules for the orders table?” without grepping three repos.
A data contract flips that. You write the rules down in one declarative document — column types, null rules, ranges, allowed sets, regexes, cross-column math, referential integrity — version it in git, and let producers and consumers share it. DataPact then validates any batch against that contract and tells you, precisely, what broke.
Here’s a contract in DataPact’s YAML-lite format:
name: orders
version: 1.0
strictness: lenient
columns:
  - name: order_id
    type: int
    nullable: false
    checks:
      - kind: column_values_unique
        severity: error
  - name: status
    type: str
    checks:
      - kind: column_values_in_set
        kwargs: { values: [new, paid, shipped, refunded] }
expectations:
  - kind: multicolumn_sum_to_equal
    kwargs: { columns: [subtotal, tax], total_column: total, tolerance: 0.01 }
That last expectation is the exact rule that would have caught my 30% revenue ghost. subtotal + tax must equal total, within a cent.
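To make the rule concrete, here is a minimal stdlib-only sketch of a sum-to-equal check. The function name `rows_breaking_sum` and the sample rows are made up for illustration; this is not DataPact's implementation, only the arithmetic the expectation describes.

```python
def rows_breaking_sum(rows, columns, total_column, tolerance=0.01):
    """Return the indexes of rows where sum(columns) differs from total_column."""
    breaches = []
    for index, row in enumerate(rows):
        expected = sum(float(row[name]) for name in columns)
        actual = float(row[total_column])
        # Compare with an absolute tolerance so float noise does not count as a breach.
        if abs(expected - actual) > tolerance:
            breaches.append(index)
    return breaches


# Hypothetical sample data: values arrive as strings, as they would from a CSV.
orders = [
    {"subtotal": "100.00", "tax": "8.00", "total": "108.00"},  # consistent
    {"subtotal": "100.00", "tax": "8.00", "total": "140.40"},  # total inflated by 30%
    {"subtotal": "19.99", "tax": "1.60", "total": "21.59"},    # consistent
]
breaches = rows_breaking_sum(orders, ["subtotal", "tax"], "total")
print(breaches)
```

Only the second row is reported. A production check would probably use `decimal.Decimal` for money instead of floats, but the shape of the rule stays the same.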
“Zero dependencies” wasn’t a vanity thing
I want to be honest about why this is stdlib-only, because it sounds like a flex and it mostly isn’t. Two real reasons:
First, a lot of data platforms are locked down. You can’t always pip install half of PyPI on the box where the pipeline runs. A validation tool that drops in with nothing but Python is genuinely easier to adopt than one that drags pandas + pyarrow + a YAML parser behind it.
Second, I wanted to actually understand the problem instead of gluing libraries together. Writing my own YAML reader and type-inference ladder taught me more about the messy reality of “what type is this column” than any wrapper would have.
The downside is I had to write a YAML parser. Which brings me to the most annoying bug of the whole project.
The escape-sequence bug that broke every email
DataPact ships its own tiny YAML reader — a strict subset: maps, lists, scalars, comments, quotes, flow lists. No arbitrary-object deserialization, which is a nice security property for free.
My email validation regex in the contract looked like this:
checks:
  - kind: column_values_match_regex
    kwargs:
      pattern: "^[^@ ]+@[^@ ]+\\.[^@ ]+$"
When I ran it, every single email failed, including obviously valid ones. 14 out of 14, 100%. My first naive parser just stripped the surrounding quotes and handed back the raw string — so \\. stayed as a literal backslash-backslash-dot. In the compiled regex that means “a literal backslash followed by any character,” and no email on earth has a backslash in it.
The fix was to make the parser do what real YAML does: process escape sequences inside double-quoted strings, while leaving single-quoted strings literal.
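The effect is easy to reproduce with nothing but `re`. The sketch below is one way to write such an unescape step, not DataPact's actual code: `ESCAPE_TABLE` and `unescape_double_quoted` are hypothetical names, and passing unknown escapes through untouched is a simplification (a stricter reader could reject them).

```python
import re

# Hypothetical escape table for this sketch; a real reader may support more.
ESCAPE_TABLE = {"n": "\n", "t": "\t", "r": "\r", '"': '"', "\\": "\\", "/": "/", "0": "\0"}


def unescape_double_quoted(body: str) -> str:
    """Resolve backslash escapes in the body of a double-quoted scalar."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body) and body[i + 1] in ESCAPE_TABLE:
            out.append(ESCAPE_TABLE[body[i + 1]])
            i += 2  # consume the backslash and the escaped character
        else:
            out.append(ch)  # ordinary character, or an escape this sketch does not know
            i += 1
    return "".join(out)


# The characters between the quotes in the YAML file, exactly as written.
raw = r"^[^@ ]+@[^@ ]+\\.[^@ ]+$"
fixed = unescape_double_quoted(raw)

naive_hit = re.match(raw, "ada@example.com")    # pattern demands a literal backslash
fixed_hit = re.match(fixed, "ada@example.com")  # pattern demands a literal dot
print(naive_hit is None, fixed_hit is not None)
```

With the raw string, the regex engine sees `\\.` and requires a backslash in the address, so nothing matches. After unescaping, it sees `\.` and a perfectly ordinary address passes.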
_ESCAPES = {"n": "\n", "t": "\t", "r": "\r", '"': '"', "\\": "\\", "/": "/", "0": "\0"}
def _unescape_double(s: str) -> str:
    out, i = [], 0
    while i < [...]

[...] --> B[Dataset + type inference]
C[Contract: YAML / JSON / builder] --> D[Validation Engine]
B --> D
D --> E[ValidationReport]
E --> F[CLI exit code]
E --> G[HTML report]
E --> H[guard / raise]
B --> P[Profiler] --> C
Sources normalize CSV, JSON, JSONL, SQLite and plain lists-of-dicts into one Dataset view.
Expectations are a registry of pure functions — one per check kind. There are 23 of them across column-level (not_null, unique, between, in_set, match_regex, mean_between…), table-level (row_count_between, compound_columns_unique…) and cross-column (a > b, sum_to_equal, referential integrity).
The engine runs every expectation, applies strictness rules for unexpected columns, and builds a structured ValidationReport.
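A registry of pure check functions plus an engine loop can be sketched in a few dozen lines. Everything here is illustrative: `REGISTRY`, `expectation`, `Result` and `run_checks` are hypothetical names, the kind strings only imitate the naming style of the contract shown earlier, and the real engine also handles strictness and builds a richer report.

```python
from dataclasses import dataclass

REGISTRY = {}  # check kind -> function (hypothetical, not DataPact's internals)


def expectation(kind):
    """Register a check function under a kind name."""
    def register(func):
        REGISTRY[kind] = func
        return func
    return register


@dataclass
class Result:
    kind: str
    success: bool
    message: str


@expectation("column_values_not_null")
def not_null(rows, column):
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return missing == 0, f"{missing} null value(s) in {column!r}"


@expectation("column_values_in_set")
def in_set(rows, column, values):
    unexpected = sorted({row[column] for row in rows if row[column] not in values})
    return not unexpected, f"unexpected values in {column!r}: {unexpected}"


def run_checks(rows, checks):
    """Run every check and collect all results; never stop at the first failure."""
    results = []
    for check in checks:
        ok, message = REGISTRY[check["kind"]](rows, **check.get("kwargs", {}))
        results.append(Result(check["kind"], ok, message))
    return results


rows = [{"status": "paid"}, {"status": "shiped"}]  # second value is a typo
checks = [
    {"kind": "column_values_not_null", "kwargs": {"column": "status"}},
    {"kind": "column_values_in_set",
     "kwargs": {"column": "status", "values": ["new", "paid", "shipped", "refunded"]}},
]
results = run_checks(rows, checks)
for result in results:
    print(result.kind, result.success, result.message)
```

Because each check is a pure function of the rows and its arguments, adding a new kind means writing one function and one decorator line, and the engine never needs to change.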
Every column-level check supports a mostly= tolerance, so you can say “this should be non-null in at least 99% of rows” instead of demanding perfection — real data is messy and a single bad row shouldn’t always fail a 10-million-row batch.
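A tolerance like that boils down to comparing a pass fraction against a threshold. This is a sketch under assumptions: `passes_mostly` is a hypothetical helper, and treating an empty column as passing is my choice here, not something the tool documents.

```python
def passes_mostly(values, predicate, mostly=1.0):
    """True when at least `mostly` (a fraction from 0 to 1) of values satisfy predicate."""
    values = list(values)
    if not values:
        return True  # assumption: an empty column has nothing to violate
    good = sum(1 for value in values if predicate(value))
    return good / len(values) >= mostly


def is_present(value):
    return value is not None


# Hypothetical column: 1,000 emails, 5 of them missing (99.5% present).
emails = ["a@example.com"] * 995 + [None] * 5

strict = passes_mostly(emails, is_present)                # demands 100%
tolerant = passes_mostly(emails, is_present, mostly=0.99)  # demands at least 99%
print(strict, tolerant)
```

The strict call fails because five rows are null, while the 99% version passes with room to spare.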
Using it: three ways
As a library, validating against a contract file:
import datapact as dp

report = dp.validate("orders.csv", dp.load_contract("orders.yaml"))
print(report.success, report.passed, report.failed)
for r in report.results:
    if not r.success:
        print(r.expectation.label(), "→", r.message)
As a pipeline gate, with a decorator that raises before bad data escapes:
from datapact import guard, DataContractError

@guard(contract)
def load_orders():
    return fetch_rows_from_somewhere()

try:
    rows = load_orders()
except DataContractError as exc:
    alert(exc.report)  # the full report is attached to the exception
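If you are curious how a gate like this can work, the pattern is a decorator factory that validates the return value before the caller ever sees it. The sketch below uses hypothetical stand-ins (`ContractError`, `guard_with`, `no_negative_subtotals`) and a plain list of failure messages in place of a full report; it shows the pattern, not DataPact's `guard`.

```python
import functools


class ContractError(Exception):
    """Hypothetical stand-in for a contract exception; carries the failed checks."""

    def __init__(self, failures):
        super().__init__(f"{len(failures)} contract check(s) failed")
        self.failures = failures


def guard_with(validator):
    """Decorator factory: validate a function's return value before handing it on."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            rows = func(*args, **kwargs)
            failures = validator(rows)
            if failures:
                raise ContractError(failures)  # bad data never reaches the caller
            return rows
        return wrapper
    return decorate


def no_negative_subtotals(rows):
    """Toy validator: return one message per offending row."""
    return [f"row {i}: negative subtotal" for i, row in enumerate(rows) if row["subtotal"] < 0]


@guard_with(no_negative_subtotals)
def load_orders():
    return [{"subtotal": 10.0}, {"subtotal": -5.0}]


try:
    load_orders()
    caught = None
except ContractError as exc:
    caught = exc.failures
print(caught)
```

Attaching the failures to the exception is what lets the caller alert on specifics instead of just knowing that something went wrong.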
As a CI check, where a contract breach fails the build like a unit test:
datapact validate orders.csv --contract orders.yaml --fail-on error
echo $? # 1 on breach
That --fail-on flag is the part I’m most happy with. It makes data quality a gate, not a dashboard nobody looks at. My GitHub Actions workflow actually runs two jobs: one proves the clean sample passes, and one proves the dirty sample fails — because a gate that never rejects anything is worse than no gate at all.
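The logic behind a fail-on threshold is small enough to sketch. The severity levels below are an assumption for illustration (the post only shows `error`), and `exit_code` is a hypothetical helper, not the CLI's real code path.

```python
# Assumed severity ladder; only "error" appears in the examples above.
SEVERITY_RANK = {"info": 0, "warning": 1, "error": 2}


def exit_code(failed_severities, fail_on="error"):
    """Return 1 when any failed check is at or above the fail_on level, else 0."""
    threshold = SEVERITY_RANK[fail_on]
    breached = any(SEVERITY_RANK[s] >= threshold for s in failed_severities)
    return 1 if breached else 0


clean = exit_code([])                                    # nothing failed
warn_only = exit_code(["warning"])                       # below the default threshold
strict_warn = exit_code(["warning"], fail_on="warning")  # threshold lowered
broken = exit_code(["warning", "error"])                 # an error-level breach
print(clean, warn_only, strict_warn, broken)
```

A CLI entry point would pass that value to `sys.exit`. Checking both the clean and the broken case mirrors the two-job idea: the gate has to prove it can say no.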
The report is the part people actually see
A validation result is only useful if a human can read it. So every run renders to a single self-contained HTML file — no external assets, no JavaScript framework — with failures sorted to the top and a full data profile (null rates, distinct counts, distributions) underneath.
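As a rough illustration of "self-contained with failures first", here is a stdlib-only renderer. `render_report` and the result dictionaries are hypothetical; the real report is far richer, but the two ideas that matter (inline styles only, and a stable sort on success) fit in a few lines.

```python
import html


def render_report(results):
    """Render check results as one self-contained HTML page, failures first."""
    # False sorts before True, and sorted() is stable, so failures rise to the top
    # while keeping their original relative order.
    ordered = sorted(results, key=lambda r: r["success"])
    rows = "\n".join(
        "<tr class='{cls}'><td>{label}</td><td>{status}</td><td>{message}</td></tr>".format(
            cls="pass" if r["success"] else "fail",
            label=html.escape(r["label"]),
            status="PASS" if r["success"] else "FAIL",
            message=html.escape(r["message"]),  # data values must never inject markup
        )
        for r in ordered
    )
    return (
        "<!doctype html><html><head><meta charset='utf-8'><title>Validation report</title>"
        "<style>.fail{background:#fde8e8}.pass{background:#e8f7e8}</style></head>"
        f"<body><table>\n{rows}\n</table></body></html>"
    )


page = render_report([
    {"label": "order_id unique", "success": True, "message": "ok"},
    {"label": "status in set", "success": False, "message": "unexpected value <shiped>"},
])
print(page)
```

Escaping every label and message matters more than it looks: the report displays the very values that failed validation, so they are exactly the strings you should not trust.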
I built the live demo report from a deliberately dirty orders file with ten injected problems: a duplicate order ID, a null email, a malformed email, an unknown status value, a bad date format, a negative subtotal, a quantity of 40, an invalid country code, and three rows where the totals don’t add up. DataPact catches all ten — 41 expectations, 31 passed, 10 failed — and the report lays them out so you can see exactly which rows and values broke each rule.
Where it fits
DataPact is rule-based contract validation. It answers one question well: “does this batch obey the rules we agreed on?” That’s deliberately different from statistical drift detection (is the distribution shifting?) or ETL orchestration (move the data around). It complements both — you’d run DataPact as the gate between an extract step and a load step, or in CI on your fixtures.
It’s about 3,000 lines of pure Python with 141 tests, all stdlib unittest, no test dependencies either. If you’ve ever lost an afternoon to data that lied to you, give it a look — and if you find a rule it can’t express yet, that’s exactly the kind of issue I want to see.