Mapping Karrot’s Data: How We Built Column-Level Lineage

Published: December 4, 2025 at 10:25 PM EST
4 min read

Source: 당근마켓 Tech Blog

The problem: when you can’t see how data flows

At Karrot, many different teams create derived tables in BigQuery. Data engineers, analysts, product managers — lots of people transform data into the shape they need.

But we had a major problem: it was very hard to answer questions like

  • Which tables affect which other tables?
  • Which upstream sources does this table ultimately depend on?

Because the overall flow of data wasn’t visible, building on existing tables or troubleshooting issues was often painful.

Case 1: The nightmare of cascading failures

A single table’s pipeline failed, but that table was referenced by many other tables, causing a chain reaction of failures. We had no quick way to see which tables were impacted, so we manually dug through queries one by one. It took hours to understand the blast radius and fix the issue.

Case 2: The ripple effect of a MySQL schema change

When we wanted to drop a single column from a MySQL table, we had no reliable way to answer, “Which tables and columns will break if we delete this?” Determining the impact took a lot of time.

These experiences made it clear that we needed a system that could reliably trace and manage how data flows. That system is data lineage.

Why column‑level lineage?

Once we decided to build lineage, the next question was: “How granular should we go?” Should we only track table‑level lineage or go all the way to column‑level?

The limits of table‑level lineage

Table‑level lineage is relatively easy to build. BigQuery’s INFORMATION_SCHEMA.JOBS view records each query job’s referenced tables and its destination table, which can be used to infer dependencies between tables (sketched after this list). However, it isn’t enough for day‑to‑day work because:

  • It doesn’t tell you which specific columns in downstream tables are affected.
  • It can’t answer the impact of deleting or changing a single column.
  • It hides the propagation of sensitive fields like PII.

In other words, table‑level lineage alone is too coarse‑grained.
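
For context, here is a minimal sketch of the JOBS‑based table‑level extraction mentioned above, assuming Python, the google-cloud-bigquery client, and jobs in the region-us location — all illustrative assumptions, not necessarily Karrot’s setup:

from google.cloud import bigquery

client = bigquery.Client()

# Each completed query job records its destination table and the
# tables it read from, which is enough to infer table-level edges.
sql = """
SELECT
  destination_table.dataset_id AS dest_dataset,
  destination_table.table_id   AS dest_table,
  ref.dataset_id               AS src_dataset,
  ref.table_id                 AS src_table
FROM `region-us`.INFORMATION_SCHEMA.JOBS,
  UNNEST(referenced_tables) AS ref
WHERE job_type = 'QUERY'
  AND state = 'DONE'
  AND destination_table.table_id IS NOT NULL
"""

# Collect (source -> destination) table pairs as a set of edges.
edges = {
    (f"{row.src_dataset}.{row.src_table}",
     f"{row.dest_dataset}.{row.dest_table}")
    for row in client.query(sql).result()
}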

The value of column‑level lineage

Tracking lineage at the column level enables:

  1. Fine‑grained impact analysis – e.g., modifying users.email instantly shows downstream columns such as user_stats.email or marketing.user_email (see the sketch after this list).
  2. Data security & PII tracking – follow where a PII column originates and where it flows across the warehouse.
  3. Faster root‑cause analysis – quickly trace a downstream data quality issue back to its source column.
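
As a toy illustration of the impact analysis in point 1, here is a breadth‑first walk over column‑level edges. The table and column names are hypothetical, and this generic traversal is a sketch rather than Karrot’s implementation:

from collections import deque

# Column-level lineage edges: (source column, derived column).
# All names are hypothetical examples.
EDGES = [
    ("users.email", "user_stats.email"),
    ("users.email", "marketing.user_email"),
    ("marketing.user_email", "campaigns.target_email"),
]

def downstream(column, edges):
    """Return every column directly or transitively derived from `column`."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)
    seen, queue = set(), deque([column])
    while queue:
        for nxt in children.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(downstream("users.email", EDGES)))
# ['campaigns.target_email', 'marketing.user_email', 'user_stats.email']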

The remaining question was: How can we extract such detailed column‑level lineage reliably?

Evaluating approaches for lineage extraction

We considered several approaches, focusing on two criteria:

  • Accurate extraction at the column level.
  • Ability to parse queries from every team at Karrot, regardless of style variations.

1) BigQuery INFORMATION_SCHEMA‑based approach

The first candidate was to build lineage directly from BigQuery’s built‑in INFORMATION_SCHEMA metadata. Its limitations:

  1. No column‑level tracking – BigQuery’s built‑in metadata doesn’t provide fine‑grained column dependencies.
  2. No view‑level lineage – Views are not recorded as referenced objects; only the underlying base tables appear, making it hard to understand pipelines built on top of views.

Thus, we couldn’t obtain the needed detail solely from BigQuery metadata.

2) Using an open‑source framework like OpenLineage

Karrot’s teams execute queries via many mechanisms (Airflow, cron jobs, notebooks, internal batch systems, etc.). To adopt OpenLineage we would have needed to:

  • Install a client library or plugin in every execution environment.
  • Instrument each job to emit lineage events.

From a central data team’s perspective, this posed challenges:

  • Requiring many teams to add new dependencies to their workflows.
  • Risking interference with each team’s business logic and deployment patterns.
  • Continuously verifying that new pipelines were correctly instrumented.

OpenLineage is powerful, but the integration effort and our immediate need for column‑level parsing made it less suitable.

Our choice: SQL parsing

Karrot’s main data warehouse is BigQuery, and almost all DW‑related workloads run there. Instead of instrumenting every executor, we decided to parse the query logs stored in BigQuery.

In other words, we pull all queries from BigQuery’s job metadata and extract lineage directly from the SQL. This approach is simpler and more robust for our environment.

What do we parse?

Most data work at Karrot is done in BigQuery using SQL. BigQuery stores every executed query in INFORMATION_SCHEMA.JOBS, which includes fields such as the following (fetching them is sketched after this list):

  • query – the raw SQL text.
  • creation_time, end_time, user_email, etc.
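
A sketch of that fetch step, again assuming Python, google-cloud-bigquery, and the region-us location (the one‑day window is an arbitrary illustration):

from google.cloud import bigquery

client = bigquery.Client()

# Pull the raw SQL text of recent, completed query jobs; each
# statement becomes one input to the lineage parser.
rows = client.query("""
SELECT query, user_email, creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE job_type = 'QUERY'
  AND state = 'DONE'
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
""").result()

queries = [row.query for row in rows]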

We use the query field as the input for our SQL parser, which identifies source tables, destination tables, and column‑level mappings.
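
The post doesn’t name the parser, but as an illustration, an open‑source SQL parser such as sqlglot can recover tables and output‑column mappings from a BigQuery statement. The query below is a made‑up example:

import sqlglot
from sqlglot import exp

sql = """
CREATE OR REPLACE TABLE analytics.user_stats AS
SELECT u.id AS user_id, u.email
FROM production.users AS u
"""

statement = sqlglot.parse_one(sql, read="bigquery")

# Every table referenced in the statement (destination and sources).
tables = sorted({f"{t.db}.{t.name}" for t in statement.find_all(exp.Table)})
print(tables)  # ['analytics.user_stats', 'production.users']

# Each output column of the SELECT, paired with the source column it wraps.
for projection in statement.find(exp.Select).expressions:
    source = projection.find(exp.Column)
    print(projection.alias_or_name, "<-", source.sql() if source else None)
# user_id <- u.id
# email <- u.email

A production parser also has to resolve columns through CTEs, subqueries, and SELECT * (sqlglot ships a dedicated sqlglot.lineage helper for this). Mappings like these can then be written to a lineage table and queried, for example: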

-- Example: querying the column-level mappings the parser produced
-- (parsed_lineage is an illustrative output table)
SELECT
  src_table,
  src_column,
  dest_table,
  dest_column
FROM
  parsed_lineage
WHERE
  job_id = @job_id;

