The Time Our Pipeline Processed the Same Day’s Data 47 Times
Source: Dev.to
I noticed something odd in our Airflow logs on Monday morning. Our daily data pipeline had run multiple times over the weekend instead of once per day.
Not just a few extra runs – 47 executions, all processing the same day’s data: December 3rd.
Each run showed as successful. No errors. No alerts. Just the same date being processed over and over.
Here’s what happened and what I learned about retry logic that I wish I’d known sooner.
How I Found It
Monday morning, I was reviewing our weekend pipeline runs as part of my routine checks. Our Airflow dashboard showed an unusual pattern – our main transformation DAG had executed far more times than it should have.
Looking closer, I saw the DAG had run 47 times between Saturday morning and Monday, even though we schedule it once per day at 2 AM.
What caught my attention: every single run was processing December 3rd’s data. Not December 4th, 5th, or 6th – just December 3rd, repeatedly.
All runs showed as successful (green status). No failed tasks. The logs showed normal processing – read data, transform it, write to warehouse, mark complete.
The Investigation
I checked the obvious things first:
- **Was someone manually triggering reruns?** No. The audit logs showed all runs were automatic, triggered by the scheduler.
- **Had the source data changed?** No. The S3 timestamps showed December 3rd's data hadn't been modified since it was originally created.
- **Was there a scheduler configuration issue?** The schedule looked correct: daily at 2 AM.
Then I noticed something in the run history. The pattern started on Saturday. Our pipeline ran at 2 AM (normal), then again at 4 AM, 6 AM, 8 AM… every two hours through the weekend.
That’s when I realized: these weren’t scheduled runs. These were retries.
The Background
The previous Friday, we’d deployed a new analytics feature – calculating average transaction values by customer segment. Marketing wanted to track premium customer behavior separately from regular customers.
The code had been tested thoroughly. We ran it against sample data from the past week. All tests passed. We deployed Friday afternoon.
What we didn’t test: weekend data patterns.
The Root Cause
Our pipeline used Airflow’s execution_date to determine which data partition to process:
```python
execution_date = context['execution_date']
data_date = execution_date.strftime('%Y-%m-%d')
s3_path = f"s3://bucket/data/date={data_date}/"
```
Pipeline steps (sketched below):
- Read data from S3
- Transform and validate records
- Calculate daily metrics
- Write to warehouse
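For context, here's a rough sketch of a daily DAG wired this way – the DAG id, schedule details, and stub callables are illustrative placeholders, not our production code:

```python
from airflow import DAG
from airflow.operators.python import PythonOperator
import pendulum

# Stub callables standing in for the real pipeline steps.
def read_from_s3(**_): ...
def transform_and_validate(**_): ...
def calculate_daily_metrics(**_): ...
def write_to_warehouse(**_): ...

with DAG(
    dag_id="daily_transform",                 # hypothetical name
    schedule="0 2 * * *",                     # once per day at 2 AM (schedule_interval on older Airflow)
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    read = PythonOperator(task_id="read_from_s3", python_callable=read_from_s3)
    transform = PythonOperator(task_id="transform_and_validate", python_callable=transform_and_validate)
    metrics = PythonOperator(task_id="calculate_daily_metrics", python_callable=calculate_daily_metrics)
    load = PythonOperator(task_id="write_to_warehouse", python_callable=write_to_warehouse)

    read >> transform >> metrics >> load
```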
Step 3 is where things broke on weekends.
Our new metric calculated “average transaction value per customer segment”:
```python
# Calculate average for our premium customer segment
target_customers = df[df['customer_segment'] == 'premium']
total_value = target_customers['amount'].sum()
customer_count = target_customers['customer_id'].nunique()
avg_value = total_value / customer_count
```
This worked fine on the weekdays we'd tested against – but not on the first production Saturday:
| Date | Premium customers | Result |
|---|---|---|
| Dec 3 (Wed) | 8,500 | Success |
| Dec 4 (Thu) | 7,200 | Success |
| Dec 5 (Fri) | 6,800 | Success |
| Dec 6 (Sat) | 0 | Failure |
Our premium segment was entirely B2B customers – business accounts, enterprise clients. They don’t transact on weekends because the businesses are closed.
On Saturday we had 0 premium customers, which caused a division‑by‑zero error:
```python
customer_count = target_customers['customer_id'].nunique()  # Returns 0
avg_value = total_value / 0                                  # Division by zero error
```
The task failed, and Airflow scheduled a retry. The retry logic we wrote looked like this:
```python
if task_instance.try_number > 1:
    # If this is a retry, process the last successful date
    # to avoid reprocessing potentially corrupted data
    last_successful = get_last_successful_date()
    data_date = last_successful
else:
    data_date = execution_date.strftime('%Y-%m-%d')
```
The intention was good: if a task fails partway through processing, don’t reprocess potentially corrupted data; instead, go back to the last known good date.
What actually happened:
1. December 6 processing failed (division by zero).
2. A retry was triggered; `try_number` was now 2, so the code fetched `last_successful` = December 3.
3. The retry processed December 3's data (which had premium transactions).
4. The calculation succeeded, and Airflow marked the run for December 6 as complete.
The same thing happened for December 7 (Sunday) and continued through the weekend until I stopped it Monday morning.
The Impact
The immediate problem was duplicate data. We’d loaded December 3’s transactions into our warehouse 47 times.
- Our deduplication logic caught most of it – we used transaction IDs as primary keys, so the database simply overwrote the same records (roughly the upsert pattern sketched below).
- However, not all downstream reports deduplicated. Some aggregation tables counted each load as new data. For a few hours on Monday morning, our dashboards showed December 3 with 47× the normal transaction volume.
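For illustration only, that idempotent-write pattern looks roughly like this, assuming a Postgres-compatible warehouse and made-up table/column names (our actual loader is different):

```python
from sqlalchemy import Column, Date, MetaData, Numeric, String, Table, create_engine
from sqlalchemy.dialects.postgresql import insert

engine = create_engine("postgresql://warehouse")  # placeholder connection string
metadata = MetaData()

# Hypothetical table: transaction_id as primary key is what makes repeated
# loads of the same rows overwrite instead of duplicate.
transactions = Table(
    "transactions", metadata,
    Column("transaction_id", String, primary_key=True),
    Column("customer_id", String),
    Column("amount", Numeric),
    Column("transaction_date", Date),
)

def load_rows(rows):
    stmt = insert(transactions).values(rows)
    stmt = stmt.on_conflict_do_update(
        index_elements=["transaction_id"],
        set_={
            "customer_id": stmt.excluded.customer_id,
            "amount": stmt.excluded.amount,
            "transaction_date": stmt.excluded.transaction_date,
        },
    )
    with engine.begin() as conn:
        conn.execute(stmt)
```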
The bigger problem: the bug exposed a flaw in our retry strategy that could silently corrupt historical data whenever a weekend (or any period with missing data for a segment) caused a task to fail.
Takeaways
- Test on real weekend data – especially when new logic depends on data that may be absent on certain days.
- Retry logic should be idempotent and avoid re‑processing old dates unless you’re absolutely sure it won’t cause duplication.
- Guard against division‑by‑zero (or any operation that can fail on empty datasets) with explicit checks.
- Add alerts for unusual retry patterns (e.g., many retries in a short window) so they're caught early – see the sketch after this list for one way to wire this up.
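For that last point, here's a minimal sketch of one option in Airflow 2.x using `on_retry_callback` – the threshold and alert channel are placeholders (it just logs a warning here; in practice it would page someone):

```python
import logging

# Fire an alert when a task keeps retrying, even if it eventually "succeeds".
def alert_on_excessive_retries(context):
    ti = context["task_instance"]
    if ti.try_number >= 2:  # arbitrary threshold; tune for your pipeline
        logging.warning(
            "%s.%s is on retry attempt %s for the %s run",
            ti.dag_id, ti.task_id, ti.try_number, context["ds"],
        )

default_args = {
    "retries": 3,                                   # illustrative values
    "on_retry_callback": alert_on_excessive_retries,
}
```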
By fixing the division‑by‑zero guard and simplifying the retry to re‑process the same execution_date (or to skip the run entirely when there’s no data), we prevented further duplicate loads and restored confidence in our pipeline.
The Problem
We had no data for December 6th or 7th. The pipeline thought it had processed those dates successfully (because it processed December 3rd instead), so it moved on to December 8th.
We skipped two days of weekend data without realizing it until a business user asked why our weekend sales reports were blank.
The Fix
I fixed two things:
1️⃣ Immediate bug – handle zero‑count scenarios in calculations
```python
target_customers = df[df['customer_segment'] == 'premium']
customer_count = target_customers['customer_id'].nunique()

if customer_count > 0:
    avg_value = target_customers['amount'].sum() / customer_count
else:
    # No customers in this segment – set to NULL rather than failing
    avg_value = None
```
2️⃣ Retry logic – removed it entirely
```python
# Always process the execution date, regardless of retry count
data_date = execution_date.strftime('%Y-%m-%d')
```
Key insight: Retries should reprocess the same data, not fall back to old data. If there’s a real data problem, retrying won’t help. If it’s a transient issue, retrying the same operation will work.
I also documented the expected weekend pattern in the code:
```python
# Weekend data note: Premium segment (B2B) has zero weekend activity.
# This is expected behavior – record NULL for weekend metrics.
```
What I Learned
- Test with realistic data patterns. We only tested weekday data because it was convenient. We should also test weekend, holiday, and month‑end data – all the edge cases.
- Retry logic needs careful thought. Assuming “last successful date” is a safe fallback was wrong. Retries must reprocess the same data, not different data.
- Division by zero is common in analytics. Whenever you calculate averages or ratios, handle the zero‑count case explicitly instead of letting it fail.
- Monitor successful runs, not just failures. All our alerts focused on failures. These runs succeeded, so we had no alerts. The issue was only caught by manually reviewing logs.
- Execution date vs. data date. Airflow's execution date is the logical date the run is *for*, not necessarily when it actually runs – and with retry logic like ours, the data you end up processing can be different again. Keep the run date and the data date separate and explicit in your code (see the sketch below).
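A minimal sketch of what that separation can look like, with an assumed task-callable name – the point is simply that both dates are explicit and visible in the logs:

```python
import logging

def process_partition(**context):
    run_for_date = context["ds"]   # the date this DAG run is *for* (Airflow's logical date)
    data_date = run_for_date       # the partition we actually read – never silently swap this
    logging.info("run_for=%s processing_partition=%s", run_for_date, data_date)
    s3_path = f"s3://bucket/data/date={data_date}/"
    # ... read, transform, and load as usual ...
```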
The Aftermath
After the fix, the pipeline handled weekend data normally:
- Saturday: processed December 13th – Premium metrics = NULL (expected). Success.
- Sunday: processed December 14th – Premium metrics = NULL (expected). Success.
- No retries. No duplicate processing.
I backfilled the missing December 6th and 7th data manually and added a test case for weekend scenarios to our test suite.
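That test looks roughly like this – a pytest-style sketch with a hypothetical `calculate_segment_average` helper wrapping the guarded calculation from the fix above:

```python
import pandas as pd

def calculate_segment_average(df, segment):
    """Average transaction value for a segment; None when the segment is empty."""
    target = df[df['customer_segment'] == segment]
    count = target['customer_id'].nunique()
    return target['amount'].sum() / count if count > 0 else None

def test_weekend_has_no_premium_customers():
    # Simulates a weekend partition: only regular activity, no premium rows.
    weekend_df = pd.DataFrame({
        'customer_id': ['c1', 'c2'],
        'customer_segment': ['regular', 'regular'],
        'amount': [10.0, 20.0],
    })
    assert calculate_segment_average(weekend_df, 'premium') is None

def test_weekday_premium_average():
    weekday_df = pd.DataFrame({
        'customer_id': ['p1', 'p2'],
        'customer_segment': ['premium', 'premium'],
        'amount': [100.0, 300.0],
    })
    assert calculate_segment_average(weekday_df, 'premium') == 200.0
```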
- Total time debugging: ~3 hours
- Time spent fixing missing weekend data: ~2 hours
Lesson learned: Always test edge cases, especially predictable ones like weekends.
Discussion
Have you deployed code on a Friday that broke over the weekend? Or had retry logic that made things worse instead of better? I’d love to hear how others handle data‑quality validation for metrics with variable data patterns.
Connect with me on LinkedIn or check out my portfolio.
Thanks for reading! Follow for more practical data‑engineering stories and lessons from production systems.
