I Built a Zero-Downtime Database Migration Pipeline (PostgreSQL to Aurora)

Published: December 3, 2025 at 10:22 PM EST
5 min read
Source: Dev.to

The Problem

Last year, I inherited a project running on self‑managed PostgreSQL. The database had grown to 500 GB, and we were spending too much time on maintenance: patching, backups, replication issues, the usual headaches. The decision was made to move to Aurora PostgreSQL for managed operations, better scalability, and native AWS integration.

The catch? This was a production database serving 50,000 daily active users. Any downtime meant lost revenue and angry customers. The business gave us a maintenance window of… zero minutes. No pressure.

What I Tried First (And Why It Didn’t Work)

My initial thought was simple: pg_dump and pg_restore during a quiet period. Classic approach, right?

# The naive approach
pg_dump -Fc production_db > backup.dump
pg_restore -d aurora_target backup.dump

For a 500 GB database, this would take roughly 4–6 hours depending on network and instance sizes. That’s 4–6 hours of new writes landing on the source that the dump would never capture. Unacceptable.

I also looked at logical replication native to PostgreSQL, but setting up publication/subscription across AWS accounts with proper security controls was turning into a monster. Plus, the operational tooling around it was basically DIY.

Then I found AWS DMS. It handles the heavy lifting: full load plus change data capture in a managed service. The challenge was building everything around it to make migrations repeatable and safe.

The Solution

I built a complete migration framework with four major components:

  • Terraform modules for all AWS infrastructure
  • Python automation scripts for validation and cutover
  • GitHub Actions workflows for CI/CD
  • CloudWatch monitoring for full observability

Architecture Overview

The blue‑green strategy works like this: DMS performs a full load of all existing data, then switches to CDC mode to capture ongoing changes. Both databases stay in sync until we’re ready to cut over.
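
If you want to watch that full-load-to-CDC handoff from a script, a minimal boto3 sketch looks something like this (the task ARN below is a placeholder, not from our setup):

import boto3

# Minimal sketch: poll a DMS replication task and report whether it is still
# in full load or has moved on to ongoing replication (CDC).
# The task ARN is a placeholder.
TASK_ARN = "arn:aws:dms:us-east-1:123456789012:task:EXAMPLE"

dms = boto3.client("dms")

def task_progress(task_arn: str) -> dict:
    """Return task status plus full-load progress for a DMS replication task."""
    resp = dms.describe_replication_tasks(
        Filters=[{"Name": "replication-task-arn", "Values": [task_arn]}],
        WithoutSettings=True,
    )
    task = resp["ReplicationTasks"][0]
    stats = task.get("ReplicationTaskStats", {})
    return {
        "status": task["Status"],                      # e.g. "running"
        "full_load_percent": stats.get("FullLoadProgressPercent", 0),
        "tables_loaded": stats.get("TablesLoaded", 0),
        "tables_errored": stats.get("TablesErrored", 0),
    }

if __name__ == "__main__":
    print(task_progress(TASK_ARN))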

Terraform Infrastructure

I created modular Terraform for reproducibility across environments. Here’s how the DMS module looks:

module "dms" {
  source = "./modules/dms"

  project     = "db-migration"
  environment = "prod"

  subnet_ids         = var.private_subnet_ids
  security_group_ids = [module.networking.dms_security_group_id]

  replication_instance_class = "dms.r5.4xlarge"
  multi_az                   = true

  source_db_host     = var.source_db_host
  source_db_username = var.source_db_username
  source_db_password = var.source_db_password

  target_db_host     = module.aurora.cluster_endpoint
  target_db_username = var.aurora_master_username
  target_db_password = var.aurora_master_password
}

The DMS module creates:

  • Replication instance with appropriate sizing
  • Source and target endpoints with proper SSL configuration
  • Replication task with CDC enabled
  • CloudWatch alarms for monitoring lag and errors
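
That last alarm is the one I leaned on hardest during cutover. The module builds it in Terraform, but as a rough boto3 sketch (the identifiers and SNS topic here are placeholders, not our actual resources) the lag alarm amounts to:

import boto3

# Rough boto3 equivalent of the module's CDC-lag alarm. CDCLatencyTarget is the
# delay, in seconds, between a change committing on the source and being applied
# on the target. All identifiers below are placeholders.
cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="dms-cdc-latency-target-high",
    Namespace="AWS/DMS",
    MetricName="CDCLatencyTarget",
    Dimensions=[
        {"Name": "ReplicationInstanceIdentifier", "Value": "db-migration-prod"},
        {"Name": "ReplicationTaskIdentifier", "Value": "db-migration-prod-task"},
    ],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=30,                    # seconds of lag before we get paged
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:db-migration-alerts"],
    TreatMissingData="breaching",    # no metric usually means the task stopped
)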

Validation Scripts

Before any cutover, you need confidence that the data matches. I wrote a Python validation tool that checks multiple dimensions:

# Quick validation (uses table statistics for fast estimates)
python validation.py --quick

# Full validation (exact counts, checksums, sequence values)
python validation.py --full

# Just check DMS status
python validation.py --dms

The full validation performs:

| Check | Description |
| --- | --- |
| Row Counts | Compare exact row counts between source and target |
| Checksums | MD5 hash of sample data from each table |
| Sequences | Verify sequence values are synchronized |
| Primary Keys | Ensure all tables have PKs (required for CDC) |
| DMS Status | Task running, replication lag below threshold |

Snippet from the checksum validation:

def calculate_checksum(self, table: str, columns: list, limit: int = 1000) -> str:
    """Calculate MD5 checksum of sample rows."""
    cols = ", ".join(columns)
    query = f"""
        SELECT md5(string_agg(row_hash, '' ORDER BY row_hash))
        FROM (
            SELECT md5(ROW({cols})::text) as row_hash
            FROM {table}
            ORDER BY {columns[0]}
            LIMIT {limit}
        ) t
    """
    result = self.execute_query(query)
    return result[0][0] if result else None
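
The row-count check is less clever but just as important. Stripped down, it looks roughly like this (the connection strings are placeholders):

import psycopg2

# Stripped-down row-count comparison between source and target.
# Connection strings are placeholders.
SOURCE_DSN = "host=source-db dbname=production_db user=validator"
TARGET_DSN = "host=aurora-cluster dbname=production_db user=validator"

def row_count(dsn: str, table: str) -> int:
    """Exact row count for one table."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(f"SELECT count(*) FROM {table}")
        return cur.fetchone()[0]

def compare_counts(tables: list) -> list:
    """Return tables whose source and target counts differ."""
    mismatches = []
    for table in tables:
        src, tgt = row_count(SOURCE_DSN, table), row_count(TARGET_DSN, table)
        if src != tgt:
            mismatches.append((table, src, tgt))
    return mismatches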

The Cutover Process

Cutover is where things get nerve‑wracking. I built a multi‑phase process with automatic rollback capability at each stage:

| Phase | Action | Rollback Available |
| --- | --- | --- |
| 1 | Pre‑validation (verify DMS, row counts) | Yes |
| 2 | Wait for sync (CDC lag under threshold) | Yes |
| 3 | Drain connections (terminate source connections) | Yes |
| 4 | Final sync (wait for remaining changes) | Yes |
| 5 | Stop replication | Manual only |
| 6 | Post‑validation | Manual only |
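
Phase 3 is the only step that actively touches the source: it terminates application connections so no new writes sneak in after the final sync. Simplified (the real script filters sessions more carefully, and the role names here are placeholders):

import psycopg2

# Simplified connection-drain step (phase 3): terminate application backends on
# the source so no further writes land before cutover. Role names are placeholders.
DRAIN_SQL = """
    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE datname = %s
      AND pid <> pg_backend_pid()
      AND usename NOT IN ('postgres', 'dms_user')  -- keep replication/admin sessions
"""

def drain_connections(dsn: str, dbname: str) -> int:
    """Terminate application connections to the source database."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(DRAIN_SQL, (dbname,))
        return cur.rowcount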

The cutover script saves state to JSON after each phase, so you can resume if something fails:

# Always do a dry run first
python cutover.py --dry-run

# Execute when ready
python cutover.py --execute

# Resume from saved state if interrupted
python cutover.py --execute --resume
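
The resume flag works because every completed phase is written to a small JSON state file. A minimal sketch of that behaviour (file name and phase handling are illustrative, not the actual script):

import json
from pathlib import Path

# Minimal sketch of save/resume: each completed phase is recorded in a JSON file,
# and --resume simply skips phases already recorded. Names are illustrative.
STATE_FILE = Path("cutover_state.json")
PHASES = ["pre_validation", "wait_for_sync", "drain_connections",
          "final_sync", "stop_replication", "post_validation"]

def load_state() -> dict:
    return json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {"completed": []}

def save_state(state: dict) -> None:
    STATE_FILE.write_text(json.dumps(state, indent=2))

def run_phase(phase: str) -> None:
    print(f"running {phase}")            # stand-in for the real phase logic

def run_cutover(resume: bool = False) -> None:
    state = load_state() if resume else {"completed": []}
    for phase in PHASES:
        if phase in state["completed"]:
            continue                      # already done on a previous run
        run_phase(phase)
        state["completed"].append(phase)
        save_state(state)                 # persist progress after every phase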

GitHub Actions Integration

Everything is automated through GitHub Actions. The cutover workflow requires manual approval for production:

jobs:
  approval:
    name: Approve Cutover
    runs-on: ubuntu-latest
    if: github.event.inputs.mode == 'execute'
    environment: prod-cutover  # Requires manual approval
    steps:
      - name: Cutover Approved
        run: echo "Cutover approved"

  cutover:
    name: Database Cutover
    needs: [approval]
    # ... actual cutover steps

The workflow pulls credentials from AWS Secrets Manager, runs the cutover script, uploads state artifacts for auditing, and sends SNS notifications on failure.
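
The Secrets Manager and SNS pieces are plain boto3 calls; stripped to their essentials they look roughly like this (the secret name and topic ARN are placeholders):

import json
import boto3

# Stripped-down versions of the workflow's credential fetch and failure alert.
# Secret name and topic ARN are placeholders, not our real resources.
secrets = boto3.client("secretsmanager")
sns = boto3.client("sns")

def get_db_credentials(secret_id: str = "db-migration/prod/aurora") -> dict:
    """Fetch the target database credentials stored as a JSON secret."""
    resp = secrets.get_secret_value(SecretId=secret_id)
    return json.loads(resp["SecretString"])

def notify_failure(message: str,
                   topic_arn: str = "arn:aws:sns:us-east-1:123456789012:db-migration-alerts") -> None:
    """Publish a cutover-failure notification."""
    sns.publish(TopicArn=topic_arn, Subject="Cutover failed", Message=message)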

Results

The migration completed successfully with these metrics:

| Metric | Value |
| --- | --- |
| Total data migrated | 512 GB |
| Migration time (full load) | 3 hours 22 minutes |
| CDC lag during cutover | 2.1 seconds |
| Application downtime | 0 seconds |
| Data validation errors | 0 |

Post‑migration, we observed:

  • 40% reduction in read latency (Aurora read replicas)
  • Zero time spent on database maintenance
  • Automatic backups and point‑in‑time recovery

Lessons Learned

  1. Test your validation scripts thoroughly. I initially had a bug where the checksum query didn’t handle NULL values correctly. Caught it in staging, thankfully.
  2. Size your DMS instance appropriately. We started with a dms.r5.2xlarge and hit CPU limits during full load. Upgrading to 4xlarge cut migration time in half.
  3. Monitor CDC lag obsessively. I set up CloudWatch alarms for lag exceeding 30 seconds. During our migration, we had a spike to 45 seconds when someone ran a batch job on the source. Knowing about it immediately let us delay cutover until it settled.
  4. Have a rollback plan that you’ve actually tested. We kept the source PostgreSQL running for 48 hours post‑cutover. When a minor bug surfaced (unrelated to the migration), having the option to rollback gave everyone peace of mind while we investigated.
  5. Communicate more than you think is necessary. We sent updates every hour.