Testing Data Pipelines: What to Validate and When
Source: Dev.to

Ask an application developer how they test their code and they’ll describe unit tests, integration tests, CI/CD pipelines, and coverage metrics. Ask a data engineer the same question and the most common answer is: “we check the dashboard.”
Data pipelines are software. They have inputs, logic, and outputs. They can have bugs. They can break silently. And unlike application bugs that trigger error pages, data bugs produce numbers that look plausible—until someone makes a business decision based on them.
Pipelines Are Software — They Need Tests
The bar for data‑pipeline testing shouldn’t be lower than for application code. If anything, it should be higher. Application bugs are usually visible (broken UI, failed request). Data bugs are invisible (wrong aggregation, missing rows, stale values) and their impact compounds over time.
Yet most data teams have no automated tests. They rely on manual spot‑checks, analyst complaints, and hope. Testing a pipeline means catching problems before they reach consumers, not after.
The Testing Pyramid for Data
Borrow the testing pyramid from software engineering and adapt it for data:
| Layer | Focus | Characteristics |
|---|---|---|
| Base | Schema and contract tests | Fast, cheap, run on every pipeline execution. Verify that the output schema matches what consumers expect (columns, data types, constraints). |
| Middle | Data validation tests | Check the values in the output (uniqueness, non‑nulls, ranges, referential integrity). |
| Top | Regression and integration tests | Compare today’s output to historical patterns (row counts, totals, distributions). |
Run more tests at the base (they’re cheap and fast) and fewer at the top (they’re expensive but comprehensive).
Schema and Contract Tests
Schema tests are the simplest and most impactful place to start. After every pipeline run, verify:
- Column existence – Every expected column is present.
- Data types – Columns have their expected types.
- Not‑null constraints – Required columns contain no nulls.
- Uniqueness – Primary‑key columns have no duplicates.
```sql
-- Example schema and contract tests

-- Check for unexpected nulls
SELECT COUNT(*) AS null_count
FROM orders
WHERE order_id IS NULL OR customer_id IS NULL;

-- Check for duplicates
SELECT order_id, COUNT(*) AS cnt
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;
```
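The SQL above covers nulls and duplicates; the first two bullets (column existence and data types) can be asserted against the catalog itself. A minimal sketch in Python using sqlite3's `PRAGMA table_info` — the table name and expected schema here are illustrative, not from the article:

```python
import sqlite3

# Expected contract for the orders table (illustrative)
EXPECTED_SCHEMA = {
    "order_id": "INTEGER",
    "customer_id": "INTEGER",
    "order_total": "REAL",
}

def check_schema(conn: sqlite3.Connection, table: str, expected: dict) -> list:
    """Return a list of human-readable schema violations (empty list = pass)."""
    # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
    actual = {row[1]: row[2].upper()
              for row in conn.execute(f"PRAGMA table_info({table})")}
    problems = []
    for col, dtype in expected.items():
        if col not in actual:
            problems.append(f"missing column: {col}")
        elif actual[col] != dtype:
            problems.append(f"{col}: expected {dtype}, got {actual[col]}")
    return problems

# Demo: a table that changed a type and dropped a column
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id TEXT)")
print(check_schema(conn, "orders", EXPECTED_SCHEMA))
```

In a real pipeline the same idea applies against your warehouse's information schema; a non-empty result should fail the run before consumers see the table.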

Runtime Data Validation
Schema tests verify structure. Data‑validation tests verify content. Run these after every pipeline execution, before marking the job as successful:
- Range checks – Numeric values fall within expected bounds (e.g., order total should not be negative).
- Referential integrity – Foreign keys reference existing records.
- Freshness checks – The most recent event timestamp is within the expected window.
- Volume checks – Row counts stay within historical norms (e.g., ±20% of the trailing 7‑day average).
- Custom business rules – Domain‑specific assertions such as “every invoice must have at least one line item” or “no employee should have a start date in the future.”
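Each of these validations reduces to a query that counts violating rows, run after the load and before the job is marked successful. A hedged sketch in Python with sqlite3 — the table, bounds, and freshness window are illustrative:

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_total REAL, created_at TEXT)")
now = datetime(2024, 1, 15, 12, 0)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 25.0, (now - timedelta(hours=1)).isoformat()),
     (2, -5.0, (now - timedelta(hours=2)).isoformat())],  # negative total: should fail
)

# Each check is (name, SQL whose result counts violations; 0 = pass)
checks = [
    ("range: no negative totals",
     "SELECT COUNT(*) FROM orders WHERE order_total < 0"),
    ("freshness: newest row < 24h old",
     f"SELECT CASE WHEN MAX(created_at) >= "
     f"'{(now - timedelta(hours=24)).isoformat()}' THEN 0 ELSE 1 END FROM orders"),
]

failures = [name for name, sql in checks if conn.execute(sql).fetchone()[0] > 0]
print(failures)
```

If `failures` is non-empty, the run should be marked failed (or at least alerted on) rather than silently published.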
Regression and Anomaly Detection
Regression tests compare today’s output to historical baselines:
- Aggregate comparison – Compare key metrics (total revenue, row count, distinct customer count) against the previous run. Flag deviations beyond a threshold (e.g., ±15%).
- Distribution checks – Compare the distribution of categorical columns (status values, country codes) against historical norms. A sudden spike in “unknown” status may indicate a source‑schema change.
- Trend analysis – Monitor metrics over time to detect gradual drifts that could signal upstream data quality issues.
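An aggregate comparison can be as simple as a relative-difference check between today's metrics and yesterday's baseline. A minimal sketch — the ±15% threshold mirrors the example above, and the metric values are invented for illustration:

```python
def deviates(today: float, baseline: float, threshold: float = 0.15) -> bool:
    """True if today's value differs from the baseline by more than the threshold."""
    if baseline == 0:
        return today != 0  # avoid division by zero; any change from zero is flagged
    return abs(today - baseline) / abs(baseline) > threshold

# Yesterday's run vs. today's (illustrative numbers)
baseline = {"total_revenue": 100_000.0, "row_count": 5_000, "distinct_customers": 1_200}
today = {"total_revenue": 101_500.0, "row_count": 3_900, "distinct_customers": 1_180}

flagged = [metric for metric in baseline if deviates(today[metric], baseline[metric])]
print(flagged)  # row_count dropped 22%, beyond the 15% threshold
```

In practice the baseline would come from a metrics store rather than a literal, and the threshold is something to tune per metric as you learn its normal variance.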
By layering tests from fast schema checks up to comprehensive regression checks, data teams can catch bugs early, maintain confidence in their pipelines, and avoid costly downstream decisions based on faulty data.
Track Metrics Over Time
Daily checks can miss slow problems: a gradual decline in row count over weeks may indicate a leak that no single run flags. Regression tests are the most expensive layer to maintain because they require historical baselines and threshold tuning. Start simple (row count within ±20% of yesterday's) and refine the thresholds as you learn what "normal" looks like.
What to Do Next
Add three tests to your most critical pipeline today:
- Uniqueness check on the primary key.
- Null check on required columns.
- Row‑count comparison against yesterday’s output.
Run them after every pipeline execution. These three tests alone will catch the majority of data problems before they reach consumers.
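Assuming yesterday's row count is available somewhere (here just a variable), all three starter tests fit in a few lines. A hedged sketch with sqlite3 and made-up data that deliberately violates the first two checks:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 10), (2, 11), (2, 12), (3, None)])  # duplicate key + null

yesterday_count = 4  # would normally come from a metrics store

# 1. Uniqueness check on the primary key
dupes = conn.execute(
    "SELECT COUNT(*) FROM "
    "(SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1)"
).fetchone()[0]
# 2. Null check on required columns
nulls = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL").fetchone()[0]
# 3. Row-count comparison against yesterday (±20%)
today_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

results = {
    "unique_pk": dupes == 0,
    "no_nulls": nulls == 0,
    "row_count_stable": abs(today_count - yesterday_count) / yesterday_count <= 0.2,
}
print(results)
```

Wire a script like this into the final step of the pipeline so a failing check fails the job, not the dashboard.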
