Brex Database Disaster Recovery
Introduction
Brex is a financial‑operating‑system platform that provides corporate cards, expense management, travel, bill‑pay and banking services. At a recent AWS FSI Meetup, Brex engineering managers and team members discussed how they used Amazon Aurora Global Database to improve resiliency and support international expansion.
Importance of Disaster Recovery
The team highlighted the need to prepare the infrastructure for disaster scenarios, focusing on the data layer. Their stack primarily uses PostgreSQL with PgBouncer and read replicas for both application and analytical workloads. Previously, the disaster‑recovery (DR) process was manual and time‑consuming.
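As a rough sketch of this pre‑migration topology (hostnames, database names and credentials below are illustrative, not Brex's actual values), writes go through PgBouncer in front of the PostgreSQL primary, while read‑only application and analytical queries target a replica endpoint:

```python
# Hypothetical sketch of the pre-migration topology. Hostnames, ports,
# and credentials are illustrative placeholders.
import psycopg2

# PgBouncer listens on its own port (6432 by default) in front of the primary.
writer = psycopg2.connect(
    host="pgbouncer.internal.example.com", port=6432,
    dbname="payments", user="app", password="***",
)

# Analytical and read-only application queries go to a replica endpoint.
reader = psycopg2.connect(
    host="replica.internal.example.com", port=5432,
    dbname="payments", user="app_ro", password="***",
)

with writer, writer.cursor() as cur:
    cur.execute("INSERT INTO ledger (amount) VALUES (%s)", (42,))

with reader, reader.cursor() as cur:
    cur.execute("SELECT count(*) FROM ledger")
    print(cur.fetchone())
```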
Goals for a DR Solution
- Implement a warm DR solution to reduce both Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
- Define acceptable RTO/RPO values by analysing metrics and current capabilities, validated through extensive testing (see the replication‑lag sketch after this list).
- Ensure applications can tolerate any additional latency or data loss during a failover.
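Once on Aurora Global Database, cross‑region replication lag directly bounds the achievable RPO. The talk did not show code, so the sketch below is a hedged illustration of sampling the `AuroraGlobalDBReplicationLag` CloudWatch metric for a hypothetical secondary cluster:

```python
# Hedged sketch: estimate achievable RPO by sampling the
# AuroraGlobalDBReplicationLag metric (milliseconds) for a secondary
# cluster. The cluster identifier is hypothetical.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

now = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="AuroraGlobalDBReplicationLag",
    Dimensions=[{"Name": "DBClusterIdentifier", "Value": "payments-secondary"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,                      # 5-minute buckets
    Statistics=["Average", "Maximum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])
```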
Choosing Amazon Aurora Global Database
Aurora Global Database offered the required features with minimal changes to the existing architecture, allowing the use of a secondary region when needed.
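The presentation did not include API calls, but attaching an existing regional Aurora cluster to a new global cluster is a small RDS operation. A minimal sketch, with hypothetical identifiers throughout:

```python
# Hedged sketch: wrap an existing regional Aurora cluster in a global
# cluster, then add a secondary region. All identifiers are hypothetical.
import boto3

rds_primary = boto3.client("rds", region_name="us-east-1")

# Create the global cluster around the existing primary.
rds_primary.create_global_cluster(
    GlobalClusterIdentifier="payments-global",
    SourceDBClusterIdentifier="arn:aws:rds:us-east-1:123456789012:cluster:payments",
)

# Attach a secondary cluster in another region; Aurora handles
# storage-level replication between the regions.
rds_secondary = boto3.client("rds", region_name="us-west-2")
rds_secondary.create_db_cluster(
    DBClusterIdentifier="payments-secondary",
    GlobalClusterIdentifier="payments-global",
    Engine="aurora-postgresql",
)
```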
Current Implementation Caveats
- A custom DNS endpoint was used for read‑only workloads, serving both application and analytical queries (see the DNS sketch after this list).
- Migrating from PostgreSQL to Aurora risked downtime for connected applications.
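The custom read‑only endpoint pattern can be pictured as a short‑TTL DNS record that tooling repoints during a switchover. A hedged sketch using Route 53, with a hypothetical hosted zone and record names:

```python
# Hedged sketch: repoint a custom read-only CNAME at a new reader endpoint.
# Hosted-zone ID and record names are hypothetical.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0123456789ABCDEF",
    ChangeBatch={
        "Comment": "Point read-only traffic at the Aurora reader endpoint",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "db-ro.internal.example.com",
                "Type": "CNAME",
                "TTL": 30,  # short TTL so clients pick up changes quickly
                "ResourceRecords": [
                    {"Value": "payments.cluster-ro-abc123.us-east-1.rds.amazonaws.com"}
                ],
            },
        }],
    },
)
```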
Migration Challenges & Approach
The team focused on automation to minimise manual steps:
- Built a Temporal workflow to run automated jobs that validate each migration step and prepare the target environment (a sketch follows this list).
- Performed a controlled switchover to Aurora Global after the workflow confirmed the database status.
- Confined downtime to the short window (2–3 minutes) imposed on the AWS side during promotion, using it to adjust endpoints and client connections.
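Brex did not publish their workflow code, but a Temporal workflow along these lines (Python SDK, hypothetical activity names) illustrates the validate‑then‑switch pattern:

```python
# Hedged sketch of a Temporal migration workflow (Python SDK). Activity
# names and the overall shape are hypothetical.
from datetime import timedelta
from temporalio import workflow

@workflow.defn
class AuroraMigrationWorkflow:
    @workflow.run
    async def run(self, db_name: str, dry_run: bool = False) -> str:
        timeout = timedelta(minutes=10)

        # Validate each precondition before touching production traffic.
        await workflow.execute_activity(
            "validate_replica_in_sync", db_name, start_to_close_timeout=timeout
        )
        await workflow.execute_activity(
            "prepare_target_environment", db_name, start_to_close_timeout=timeout
        )
        if dry_run:
            return "dry-run complete: all checks passed"

        # Controlled switchover only once the database status is confirmed.
        await workflow.execute_activity(
            "promote_aurora_replica", db_name, start_to_close_timeout=timeout
        )
        await workflow.execute_activity(
            "repoint_pgbouncer", db_name, start_to_close_timeout=timeout
        )
        return "switchover complete"
```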
Using Temporal Workflows for Automation
Before Migration
- Application → PgBouncer → PostgreSQL primary & replica
Migration Process
- Created an Aurora read replica of the existing database with zero downtime (see the sketch after this list).
- Promoted the replica to a global writer endpoint.
- Updated PgBouncer to point to the Aurora global writer.
- Optionally, provisioned additional clusters for multi‑region setups.
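In RDS API terms, the replica‑then‑promote path could look like the following; identifiers are hypothetical, and the exact calls Brex used were not shown:

```python
# Hedged sketch of the replica-then-promote migration path: create an
# Aurora read replica of the RDS PostgreSQL instance, wait for it to
# catch up, then promote it. Identifiers are hypothetical.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Aurora cluster that replicates from the existing RDS PostgreSQL primary.
rds.create_db_cluster(
    DBClusterIdentifier="payments-aurora",
    Engine="aurora-postgresql",
    ReplicationSourceIdentifier="arn:aws:rds:us-east-1:123456789012:db:payments-pg",
)
rds.create_db_instance(
    DBInstanceIdentifier="payments-aurora-1",
    DBClusterIdentifier="payments-aurora",
    DBInstanceClass="db.r6g.2xlarge",
    Engine="aurora-postgresql",
)

# Once replication lag reaches zero, detach and promote the cluster.
# This is the step that incurs the short downtime window.
rds.promote_read_replica_db_cluster(DBClusterIdentifier="payments-aurora")
```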
Additional Tools & Processes
- Flux – kept Kubernetes clusters in sync via GitOps. The workflow generated Flux pull‑requests ahead of time and merged them after manual verification.
- Terraform – templated the creation of Aurora global clusters and managed reader instances.
- Internal CLI – provided self‑service commands for teams to trigger failover (unplanned outage) or switchover (planned maintenance) of Aurora clusters.
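A self‑service CLI of that sort might wrap the managed global‑database operations. A hedged sketch using argparse and the RDS switchover/failover APIs (the command shape and identifiers are hypothetical):

```python
# Hedged sketch of an internal self-service CLI: switchover for planned
# maintenance, failover (data loss allowed) for unplanned outages.
import argparse
import boto3

def main() -> None:
    parser = argparse.ArgumentParser(description="Aurora global DR helper")
    parser.add_argument("action", choices=["switchover", "failover"])
    parser.add_argument("--global-cluster", required=True)
    parser.add_argument("--target-cluster-arn", required=True)
    args = parser.parse_args()

    rds = boto3.client("rds")
    if args.action == "switchover":
        # Planned, zero-data-loss promotion of the secondary region.
        rds.switchover_global_cluster(
            GlobalClusterIdentifier=args.global_cluster,
            TargetDbClusterIdentifier=args.target_cluster_arn,
        )
    else:
        # Unplanned outage: promote the secondary even if some recent
        # writes have not replicated (bounded by the measured RPO).
        rds.failover_global_cluster(
            GlobalClusterIdentifier=args.global_cluster,
            TargetDbClusterIdentifier=args.target_cluster_arn,
            AllowDataLoss=True,
        )

if __name__ == "__main__":
    main()
```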
Improving Workflow Performance
- Initial end‑to‑end automation took ~15 minutes.
- Introduced parallel steps (e.g., fetching credentials, creating Flux PRs) to cut runtime to ~10 minutes (see the sketch after this list).
- Added a dry‑run flag for non‑destructive testing.
- Final optimisation, including pre‑created Flux PRs and reduced Git operations, brought the total time down to 3 minutes.
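The parallelisation idea translates naturally to Temporal. A hedged sketch (hypothetical activity names) that fans out independent preparation steps with asyncio.gather instead of running them sequentially:

```python
# Hedged sketch: run independent preparation steps concurrently, the kind
# of change that cut the workflow runtime. Activity names are hypothetical.
import asyncio
from datetime import timedelta
from temporalio import workflow

@workflow.defn
class PrepareMigrationWorkflow:
    @workflow.run
    async def run(self, db_name: str) -> None:
        timeout = timedelta(minutes=5)

        # These steps do not depend on each other, so fan them out.
        await asyncio.gather(
            workflow.execute_activity(
                "fetch_credentials", db_name, start_to_close_timeout=timeout
            ),
            workflow.execute_activity(
                "create_flux_pr", db_name, start_to_close_timeout=timeout
            ),
            workflow.execute_activity(
                "provision_terraform_cluster", db_name,
                start_to_close_timeout=timeout,
            ),
        )
```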
Lessons Learned
- Thorough testing in staging is essential before production migration.
- Automation reduces human error and enables repeatable processes across many databases.
- Dry‑run migrations help validate the workflow without impacting production.
- Iterative improvements—migrating a few databases each week—allowed the team to refine the process continuously.
This article summarises the key points from the Brex engineering team’s presentation on disaster‑recovery automation using Amazon Aurora, Terraform, Flux, and Temporal.