Brex Database Disaster Recovery
Introduction
Brex is a financial‑operating‑system platform that provides corporate cards, expense management, travel, bill‑pay and banking services. At a recent AWS FSI Meetup, Brex engineering managers and team members discussed how they used Amazon Aurora Global Database to improve resiliency and support international expansion.
Importance of Disaster Recovery
The team highlighted the need to prepare the infrastructure for disaster scenarios, focusing on the data layer. Their stack primarily uses PostgreSQL with PgBouncer and read replicas for both application and analytical workloads. Previously, the disaster‑recovery (DR) process was manual and time‑consuming.
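As a rough sketch of this pre‑migration topology (hostnames, database names and credentials below are illustrative, not Brex's actual values), writes go through PgBouncer in front of the PostgreSQL primary, while read‑only application and analytical queries target a replica endpoint:

```python
# Hypothetical sketch of the pre-migration topology. Hostnames, ports,
# and credentials are illustrative placeholders.
import psycopg2

# PgBouncer listens on its own port (6432 by default) in front of the primary.
writer = psycopg2.connect(
    host="pgbouncer.internal.example.com", port=6432,
    dbname="payments", user="app", password="***",
)

# Analytical and read-only application queries go to a replica endpoint.
reader = psycopg2.connect(
    host="replica.internal.example.com", port=5432,
    dbname="payments", user="app_ro", password="***",
)

with writer, writer.cursor() as cur:
    cur.execute("INSERT INTO ledger (amount) VALUES (%s)", (42,))

with reader, reader.cursor() as cur:
    cur.execute("SELECT count(*) FROM ledger")
    print(cur.fetchone())
```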
Goals for a DR Solution
- Implement a warm DR solution to reduce both Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
- Define acceptable RTO/RPO values by analysing metrics and current capabilities, validated through extensive testing (see the replication‑lag sketch after this list).
- Ensure applications can tolerate any additional latency or data loss during a failover.
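Once on Aurora Global Database, cross‑region replication lag directly bounds the achievable RPO. The talk did not show code, so the sketch below is a hedged illustration of sampling the `AuroraGlobalDBReplicationLag` CloudWatch metric for a hypothetical secondary cluster:

```python
# Hedged sketch: estimate achievable RPO by sampling the
# AuroraGlobalDBReplicationLag metric (milliseconds) for a secondary
# cluster. The cluster identifier is hypothetical.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

now = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="AuroraGlobalDBReplicationLag",
    Dimensions=[{"Name": "DBClusterIdentifier", "Value": "payments-secondary"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,                      # 5-minute buckets
    Statistics=["Average", "Maximum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])
```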
Choosing Amazon Aurora Global Database
Aurora Global Database offered the required features with minimal changes to the existing architecture, allowing the use of a secondary region when needed.
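The presentation did not include API calls, but attaching an existing regional Aurora cluster to a new global cluster is a small RDS operation. A minimal sketch, with hypothetical identifiers throughout:

```python
# Hedged sketch: wrap an existing regional Aurora cluster in a global
# cluster, then add a secondary region. All identifiers are hypothetical.
import boto3

rds_primary = boto3.client("rds", region_name="us-east-1")

# Create the global cluster around the existing primary.
rds_primary.create_global_cluster(
    GlobalClusterIdentifier="payments-global",
    SourceDBClusterIdentifier="arn:aws:rds:us-east-1:123456789012:cluster:payments",
)

# Attach a secondary cluster in another region; Aurora handles
# storage-level replication between the regions.
rds_secondary = boto3.client("rds", region_name="us-west-2")
rds_secondary.create_db_cluster(
    DBClusterIdentifier="payments-secondary",
    GlobalClusterIdentifier="payments-global",
    Engine="aurora-postgresql",
)
```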
Current Implementation Caveats
- A custom DNS endpoint was used for read‑only workloads, serving both application and analytical queries (see the DNS sketch after this list).
- Migrating from PostgreSQL to Aurora risked downtime for connected applications.
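The custom read‑only endpoint pattern can be pictured as a short‑TTL DNS record that tooling repoints during a switchover. A hedged sketch using Route 53, with a hypothetical hosted zone and record names:

```python
# Hedged sketch: repoint a custom read-only CNAME at a new reader endpoint.
# Hosted-zone ID and record names are hypothetical.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0123456789ABCDEF",
    ChangeBatch={
        "Comment": "Point read-only traffic at the Aurora reader endpoint",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "db-ro.internal.example.com",
                "Type": "CNAME",
                "TTL": 30,  # short TTL so clients pick up changes quickly
                "ResourceRecords": [
                    {"Value": "payments.cluster-ro-abc123.us-east-1.rds.amazonaws.com"}
                ],
            },
        }],
    },
)
```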
Migration Challenges & Approach
The team focused on automation to minimise manual steps:
- Built a Temporal workflow to run automated jobs that validate each migration step and prepare the target environment (a sketch follows this list).
- Performed a controlled switchover to Aurora Global after the workflow confirmed the database status.
- Confined downtime to the short window (2–3 minutes) imposed on the AWS side during promotion, using it to adjust endpoints and client connections.
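Brex did not publish their workflow code, but a Temporal workflow along these lines (Python SDK, hypothetical activity names) illustrates the validate‑then‑switch pattern:

```python
# Hedged sketch of a Temporal migration workflow (Python SDK). Activity
# names and the overall shape are hypothetical.
from datetime import timedelta
from temporalio import workflow

@workflow.defn
class AuroraMigrationWorkflow:
    @workflow.run
    async def run(self, db_name: str, dry_run: bool = False) -> str:
        timeout = timedelta(minutes=10)

        # Validate each precondition before touching production traffic.
        await workflow.execute_activity(
            "validate_replica_in_sync", db_name, start_to_close_timeout=timeout
        )
        await workflow.execute_activity(
            "prepare_target_environment", db_name, start_to_close_timeout=timeout
        )
        if dry_run:
            return "dry-run complete: all checks passed"

        # Controlled switchover only once the database status is confirmed.
        await workflow.execute_activity(
            "promote_aurora_replica", db_name, start_to_close_timeout=timeout
        )
        await workflow.execute_activity(
            "repoint_pgbouncer", db_name, start_to_close_timeout=timeout
        )
        return "switchover complete"
```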
Using Temporal Workflows for Automation
Before Migration
- Application → PgBouncer → PostgreSQL primary & replica
Migration Process
- Created an Aurora read replica of the existing database with zero downtime (see the sketch after this list).
- Promoted the replica to a global writer endpoint.
- Updated PgBouncer to point to the Aurora global writer.
- Optionally, provisioned additional clusters for multi‑region setups.
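In RDS API terms, the replica‑then‑promote path could look like the following; identifiers are hypothetical, and the exact calls Brex used were not shown:

```python
# Hedged sketch of the replica-then-promote migration path: create an
# Aurora read replica of the RDS PostgreSQL instance, wait for it to
# catch up, then promote it. Identifiers are hypothetical.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Aurora cluster that replicates from the existing RDS PostgreSQL primary.
rds.create_db_cluster(
    DBClusterIdentifier="payments-aurora",
    Engine="aurora-postgresql",
    ReplicationSourceIdentifier="arn:aws:rds:us-east-1:123456789012:db:payments-pg",
)
rds.create_db_instance(
    DBInstanceIdentifier="payments-aurora-1",
    DBClusterIdentifier="payments-aurora",
    DBInstanceClass="db.r6g.2xlarge",
    Engine="aurora-postgresql",
)

# Once replication lag reaches zero, detach and promote the cluster.
# This is the step that incurs the short downtime window.
rds.promote_read_replica_db_cluster(DBClusterIdentifier="payments-aurora")
```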
Additional Tools & Processes
- Flux – kept Kubernetes clusters in sync via GitOps. The workflow generated Flux pull‑requests ahead of time and merged them after manual verification.
- Terraform – templated the creation of Aurora global clusters and managed reader instances.
- Internal CLI – provided self‑service commands for teams to trigger failover (unplanned outage) or switchover (planned maintenance) of Aurora clusters.
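A self‑service CLI of that sort might wrap the managed global‑database operations. A hedged sketch using argparse and the RDS switchover/failover APIs (the command shape and identifiers are hypothetical):

```python
# Hedged sketch of an internal self-service CLI: switchover for planned
# maintenance, failover (data loss allowed) for unplanned outages.
import argparse
import boto3

def main() -> None:
    parser = argparse.ArgumentParser(description="Aurora global DR helper")
    parser.add_argument("action", choices=["switchover", "failover"])
    parser.add_argument("--global-cluster", required=True)
    parser.add_argument("--target-cluster-arn", required=True)
    args = parser.parse_args()

    rds = boto3.client("rds")
    if args.action == "switchover":
        # Planned, zero-data-loss promotion of the secondary region.
        rds.switchover_global_cluster(
            GlobalClusterIdentifier=args.global_cluster,
            TargetDbClusterIdentifier=args.target_cluster_arn,
        )
    else:
        # Unplanned outage: promote the secondary even if some recent
        # writes have not replicated (bounded by the measured RPO).
        rds.failover_global_cluster(
            GlobalClusterIdentifier=args.global_cluster,
            TargetDbClusterIdentifier=args.target_cluster_arn,
            AllowDataLoss=True,
        )

if __name__ == "__main__":
    main()
```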
Improving Workflow Performance
- Initial end‑to‑end automation took ~15 minutes.
- Introduced parallel steps (e.g., fetching credentials, creating Flux PRs) to cut runtime to ~10 minutes (see the sketch after this list).
- Added a dry‑run flag for non‑destructive testing.
- Final optimisation, including pre‑created Flux PRs and reduced Git operations, brought the total time down to 3 minutes.
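The parallelisation idea translates naturally to Temporal. A hedged sketch (hypothetical activity names) that fans out independent preparation steps with asyncio.gather instead of running them sequentially:

```python
# Hedged sketch: run independent preparation steps concurrently, the kind
# of change that cut the workflow runtime. Activity names are hypothetical.
import asyncio
from datetime import timedelta
from temporalio import workflow

@workflow.defn
class PrepareMigrationWorkflow:
    @workflow.run
    async def run(self, db_name: str) -> None:
        timeout = timedelta(minutes=5)

        # These steps do not depend on each other, so fan them out.
        await asyncio.gather(
            workflow.execute_activity(
                "fetch_credentials", db_name, start_to_close_timeout=timeout
            ),
            workflow.execute_activity(
                "create_flux_pr", db_name, start_to_close_timeout=timeout
            ),
            workflow.execute_activity(
                "provision_terraform_cluster", db_name,
                start_to_close_timeout=timeout,
            ),
        )
```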
Lessons Learned
- Thorough testing in staging is essential before production migration.
- Automation reduces human error and enables repeatable processes across many databases.
- Dry‑run migrations help validate the workflow without impacting production.
- Iterative improvements—migrating a few databases each week—allowed the team to refine the process continuously.
This article summarises the key points from the Brex engineering team’s presentation on disaster‑recovery automation using Amazon Aurora, Terraform, Flux, and Temporal.