Schema Validation Passed - So Why Did My Pipeline Fail?

Published: December 27, 2025 at 04:37 AM EST
7 min read
Source: Dev.to

Schema validation does one job really well

Schema validation does one job really well: it checks if your data file is parseable.

This passes every schema validator alive:

{
  "user_id": "12345",
  "email": "test@example.com",
  "created_date": "2025-12-26"
}

Looks good, right? The JSON is valid. The CSV has the right number of columns. The XML tags are closed properly.

But here’s what schema validation doesn’t care about:

  • Whether user_id should actually be a number (you’re storing it as a string)
  • Whether created_date is really a date, or just a string that looks like one
  • Whether the file has only headers and no data rows
  • Whether a column you’re counting on actually exists
  • Whether email values are suddenly changing from user@domain.com to "N/A" or null

Your validator checks the shape. It doesn’t check if the shape makes sense.
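
To make that concrete, here's a minimal sketch using Python's jsonschema package (the schema and payload are illustrative, mirroring the example above). The validator raises nothing, because the payload matches the declared shape exactly:

from jsonschema import validate

schema = {
    "type": "object",
    "properties": {
        "user_id": {"type": "string"},
        "email": {"type": "string"},
        "created_date": {"type": "string"},
    },
    "required": ["user_id", "email", "created_date"],
}

payload = {
    "user_id": "12345",            # numeric ID stored as a string: passes
    "email": "test@example.com",   # could just as easily be "N/A": passes
    "created_date": "2025-12-26",  # any string passes, date or not
}

validate(instance=payload, schema=schema)  # no exception raised
print("schema validation passed")

Every question in the list above goes unasked. The validator only confirms that three string fields are present.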

Real‑World: The Column Rename That Cost 6 Hours

Here’s a scenario that happens in production more often than you’d think:

Your vendor sends you a CSV every day. Your pipeline imports it into a database. Downstream dashboards depend on it. For months, everything works.

Then one morning, a column name changes.

  • Maybe it was customer_name. Now it’s full_name.
  • Maybe order_date became date_order.

Your validation passes. The file parses. The schema check says “all good.”

But your transformation code? It’s looking for customer_name. It doesn’t find it. Your pipeline either fails hard or silently drops that column, and your dashboard now shows incomplete data for an entire day.

“In an enterprise environment you are usually not in control of the data sources. Column renames manifest as missing columns in your expected schema and a new column at the same time. The pipeline cannot resolve this issue and will fail.” – Reddit engineer

Schema validation saw

  • Valid file structure ✓
  • All columns present ✓
  • No parse errors ✓

What it missed

  • The column you’re depending on is gone ✗
  • A new, unexpected column appeared ✗


The Silent Killers: Issues That Pass Validation Every Time

1. Headers‑Only Files (The Truncation Trap)

Your vendor sends a CSV with only column headers and zero data rows. Maybe the system crashed mid‑export. Maybe someone hit “export template” by accident.

  • Validation: “Does this parse?” → Yes. Headers are valid. Columns are correct.
  • Reality: When you load this into your data warehouse with a truncate‑before‑copy strategy, you delete all your data and replace it with nothing.

“A vendor sent a file with headers only from a truncate pre‑copy script that passes schema validation.” – Engineer

Fix: Add a file‑size or row‑count check. Headers‑only files are usually a few hundred bytes; real data files are much larger.

2. Type Mismatches That Slip Through

Your schema says age should be a number, but the file contains:

age
25
30
"unknown"
35

Most validators will treat the column as a string (the “safe” choice). Your downstream system expects an integer, leading to:

  • Type‑conversion errors downstream
  • Silent casting of "unknown" to 0 or NULL
  • Broken aggregations (e.g., you can’t average strings)

The file is perfectly valid; the data isn’t.

3. Date‑Format Chaos

Your schema expects ISO‑8601 dates, but the vendor’s system switched regions and now sends:

12/25/2025
26-12-2025
2025.12.26

All are valid date representations, but each one needs a different parser.

  • Schema validation: “It looks like a string. It’s a valid string. Ship it.”
  • Your pipeline: “What the hell is 26-12-2025?”

Fix: Standardize date formats upstream, or add flexible parsing with explicit validation.
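
A minimal sketch of that "flexible parsing with explicit validation," using only the standard library: try a short, explicit list of known formats and fail loudly on anything else instead of guessing. The format list is an assumption; trim it to what your vendor has actually agreed to send.

from datetime import datetime

KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y", "%Y.%m.%d"]

def parse_date(value: str) -> datetime:
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

print(parse_date("2025-12-26"))  # ISO-8601
print(parse_date("12/25/2025"))  # US month-first
print(parse_date("26-12-2025"))  # day-first
# parse_date("26/13/2025") raises instead of silently guessing

One caveat: if your sources mix month-first and day-first values with the same separator, the list order silently decides which format wins for ambiguous strings like 01/02/2025. That is exactly the ambiguity worth standardizing away upstream.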

4. The Null Tsunami

A column suddenly fills with NULL values, or worse, with placeholder strings:

email
user1@example.com
user2@example.com
"N/A"
"unknown"
null

Your schema says “emails are present.” Technically true, but 40% of records now contain garbage. Downstream analytics churn out meaningless metrics, and no one sees an error.

  • No validation error
  • No parsing failure
  • Just bad data that corrupts everything downstream

“High null or duplicate record ratios silently corrupt downstream dashboards and analytics without obvious error signals.” – Data engineer

Why Do These Issues Slip Through?

Schema validation is deterministic and intentionally narrow. It’s like a bouncer checking your ID at a club:

  • “Is this a real ID?” → Yes
  • “Does it look tampered with?” → No
  • “Are you actually the person I think you are?” → That’s up to you.

In other words, validation guarantees that the file fits the shape you described, but it can’t guarantee that the content makes sense for your business logic.

Takeaways

  1. Validate the data, not just the schema. Add sanity checks (row counts, value ranges, domain constraints).
  2. Treat schema validation as the first line of defense, not the only one. Layer additional quality checks downstream.
  3. Automate alerts for anomalies (sudden spikes in nulls, unexpected column names, out‑of‑range values).
  4. Invest in observability—log schema mismatches, data‑quality metrics, and monitor them in real time.

Validation checks

  • Syntax correctness
  • Expected column presence
  • Basic type structure

Validation does NOT check

  • Whether columns are actually used downstream
  • Whether values make sense for your business logic
  • Whether unexpected changes happened
  • Whether file size suggests truncation
  • Whether data quality degraded

This isn’t a flaw in validation – it’s by design. You can’t know every business rule, context, or dependency ahead of time.
But you can catch the common issues before they blow up your pipeline.


The Overkill Approach: Enterprise Data Validation

If you’re running a massive data operation, there are heavy‑duty tools:

  • Great Expectations – Python, comprehensive, mature
  • dbt‑expectations – if you use dbt, highly recommended
  • dlt – data‑load tool, handles schema evolution
  • Airbyte – SaaS, out‑of‑the‑box validation

These are powerful. They let you define expectations such as:

  • “This column should never be NULL”
  • “Percentages should be 0‑100”
  • “user_id should be unique”
  • “Dates should be within reasonable bounds”
  • “These categorical fields should only have these values”

But they also require:

  • Setup time (30 min – days)
  • Ongoing maintenance (as your schema changes)
  • Infrastructure (especially dbt)
  • Team coordination (who writes the expectations?)

For a solo engineer, a small team, or a one‑off vendor integration, that’s often overkill.

The Middle Ground: Lightweight Pre‑Ingestion Checks

There’s a sweet spot between “nothing” and an “enterprise platform”: quick, deterministic checks right before you ingest. Think of it as a health check before you let data into your system.

Check 1 – Schema Diff

Expected columns: [user_id, email, created_date]
Actual columns:   [user_id, email, creation_date] ← Different name!
Status: MISMATCH

Detects column renames, missing columns, or surprise new columns. Takes seconds.
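
Here's roughly what that looks like in Python; the file name and expected column list are placeholders for illustration.

import csv

EXPECTED = {"user_id", "email", "created_date"}

with open("daily_export.csv", newline="") as f:
    actual = set(next(csv.reader(f), []))  # header row only

missing = EXPECTED - actual      # columns the pipeline depends on but can't find
unexpected = actual - EXPECTED   # columns that appeared without warning

if missing or unexpected:
    print(f"MISMATCH - missing: {sorted(missing)}, unexpected: {sorted(unexpected)}")
else:
    print("Schema diff: OK")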

Check 2 – File Size / Row Count

File size: 342 bytes (headers only?)
Row count: 0
Status: WARNING – File has headers but no data

Detects truncations, empty exports, or failed syncs.
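
A sketch of the same guard, assuming a CSV on local disk and a deliberately conservative size threshold:

import csv
import os

path = "daily_export.csv"               # placeholder path
size_bytes = os.path.getsize(path)

with open(path, newline="") as f:
    reader = csv.reader(f)
    next(reader, None)                  # skip the header row
    data_rows = sum(1 for _ in reader)  # count only data rows

if data_rows == 0 or size_bytes < 1024:
    print(f"WARNING - {size_bytes} bytes, {data_rows} data rows: headers-only or truncated export")
else:
    print(f"OK - {data_rows} rows, {size_bytes} bytes")

Run it before any truncate-before-copy step, so an empty export never gets the chance to wipe the target table.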

Check 3 – Type and Value Validation

Column: age
Expected: numeric
Actual values: 25, 30, "unknown", 35
Status: TYPE MISMATCH in row 3
Value "unknown" is not a number

Detects type mismatches and garbage values.
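
A sketch for one declared-numeric column; the column list and file name are assumptions:

import csv

NUMERIC_COLUMNS = ["age"]

with open("daily_export.csv", newline="") as f:
    for row_num, row in enumerate(csv.DictReader(f), start=1):
        for col in NUMERIC_COLUMNS:
            value = (row.get(col) or "").strip()
            try:
                float(value)  # empty or non-numeric values fail the cast
            except ValueError:
                print(f"TYPE MISMATCH in row {row_num}: {col}={value!r} is not a number")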

Check 4 – Null and Outlier Detection

Column: email
Nulls: 12% (expected < 1%)
Status: WARNING – High null rate

Detects unexpected null spikes and potential data‑quality issues.
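
And a sketch of the null/placeholder check; the placeholder set and the 1% threshold are illustrative assumptions, not a standard.

import csv

PLACEHOLDERS = {"", "null", "none", "n/a", "unknown"}
THRESHOLD = 0.01                         # expected < 1% bad values

with open("daily_export.csv", newline="") as f:
    rows = list(csv.DictReader(f))

bad = sum(1 for r in rows if (r.get("email") or "").strip().lower() in PLACEHOLDERS)
rate = bad / len(rows) if rows else 1.0  # an empty file counts as fully bad

status = "WARNING - high null/placeholder rate" if rate > THRESHOLD else "OK"
print(f"{status}: email column is {rate:.0%} null or placeholder")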

What This Approach Does NOT Solve

This won’t catch…

  • Business‑logic failures – e.g., “total revenue is negative.” A check can flag it, but it can’t know if it’s intentional.
  • Cross‑table inconsistencies – e.g., “User IDs in this file don’t match our existing user database.” You need database context.
  • Semantic drift – e.g., “We changed what active_user means, and now our metrics are wrong.” The data looks fine; the definition changed.
  • Deep anomalies – e.g., “This month’s sales are 3× normal, but the numbers look valid.” You need analysis, not just validation.

These belong to monitoring, alerting, and investigation, not validation.

This IS Good For…

  • Early catches – Stopping broken data at the door before it corrupts dashboards.
  • Debugging speed – When something breaks, this tells you which file broke and why, in seconds.
  • Peace of mind – You know when you’re sending clean data downstream.
  • Vendor reliability – Quickly spotting when a vendor changed formats without telling you.

“I often run into structural or data‑quality issues that I need to gracefully handle… I store all raw data for reprocessing purposes. After corrections, I automatically reprocess raw data for failures.”
— One data engineer

Translation: they catch errors, fix them at the source, then re‑ingest. Fast validation would save them hours.

Your Next Move

When you receive a data file and need to ask:

  • “Why did this break?”
  • “Which rows are problematic?”
  • “Is this safe to ingest?”

You have three options:

  1. Set up enterprise tooling (if you have the time and scale)
  2. Do it manually (if you like debugging at 3 AM)
  3. Use a lightweight check (if you want answers in seconds)

If you choose option 3, you can upload your file to DatumInt right now. Detective D will scan it, flag issues, explain what went wrong, and give you a clear picture of whether it’s safe to ingest. No infrastructure. No setup. Just answers.

The shape of your data is valid.
The reality of your data? That’s where the problems hide.


One Last Thing

Next time someone tells you “schema validation passed,” ask the follow‑up question:

“But did you check what actually changed?”

That question saves pipelines.

Have you been burned by data that passed validation but broke your pipeline? The scenario matters. Comment below or reach out—I’m collecting real stories because validation tools should be built on real failures, not guesses.
