When One DNS Record Broke the Internet
Source: Dev.to
At 3 AM Eastern on October 20, 2025, a Ring doorbell in suburban Ohio went dark. Simultaneously, a Robinhood trader in Manhattan watched his Bitcoin position freeze mid‑transaction. In London, taxpayers discovered HMRC’s Government Gateway, which serves 50 million users, had vanished. Across trading floors, boardrooms, and data centres worldwide, a single question crystallised: How did one DNS record take down so much of the internet?
1. What Happened?
| Time (PDT) | Event |
|---|---|
| 11:48 PM (Oct 19) | Two automated processes inside AWS’s internal DNS management attempted to update the same record simultaneously. The resulting race condition left an empty DNS entry for dynamodb.us-east-1.amazonaws.com – the digital equivalent of erasing a phone number from the directory while someone is dialing it. |
| 12:38 AM | Engineers identified the DNS issue, roughly 50 minutes after it began. |
| 2:25 AM | The DynamoDB DNS record was restored. |
| ≈ 2:00 PM | Full recovery of all dependent services, after ≈ 15 hours of outage. |
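To make the failure mode concrete, here is a deliberately simplified Python sketch (not AWS’s actual DNS automation; the names, IPs, and in‑memory “zone” are hypothetical) of how two unsynchronised read‑modify‑write jobs can leave a record empty: each snapshots the record, computes a full replacement from that snapshot, and writes it back, so whichever write lands last silently discards the other’s work.

```python
import threading
import time

# Hypothetical in-memory "zone": record name -> set of IP addresses.
NAME = "dynamodb.us-east-1.amazonaws.com"
zone = {NAME: {"198.51.100.10"}}           # the current (old) IP

def replace_ip(new_ip: str, delay: float) -> None:
    """Automation A: swap the old IP for a new one, working from a snapshot."""
    snapshot = set(zone[NAME])             # read: {"198.51.100.10"}
    time.sleep(delay)                      # the write lands later than planned
    zone[NAME] = (snapshot - {"198.51.100.10"}) | {new_ip}

def retire_old_ip(delay: float) -> None:
    """Automation B: retire the old IP, working from its own (stale) snapshot."""
    snapshot = set(zone[NAME])             # read: still {"198.51.100.10"}
    time.sleep(delay)
    # Writes the *whole* record back from the stale snapshot, minus the retired IP.
    # A's new IP was never in this snapshot, so it is silently dropped.
    zone[NAME] = snapshot - {"198.51.100.10"}

a = threading.Thread(target=replace_ip, args=("203.0.113.7", 0.1))
b = threading.Thread(target=retire_old_ip, args=(0.3,))
a.start(); b.start(); a.join(); b.join()

print(zone[NAME])   # set() -- an empty record: resolvers now get no answer at all
```

At any scale the remedy is the same: serialise the writers, or use compare‑and‑swap so that a write based on a stale snapshot is rejected rather than applied.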
1.1 Technical Cascade
- DynamoDB’s endpoint stopped resolving → every application trying to connect was, in effect, dialing a number that no longer existed.
- EC2’s DropletWorkflow Manager (DWFM) could not maintain leases on the physical servers that host instances → healthy hosts were marked unhealthy.
- New instances launched without network connectivity.
- Load balancers failed health checks.
- CloudWatch stopped logging metrics.
- Lambda functions hung.
- Security token validation broke.
The corrupted state propagated across thousands of inter‑connected services, extending the recovery window well beyond the DNS fix itself.
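One practical lesson from the “Lambda functions hung” symptom is that default client timeout and retry settings can keep requests waiting far longer than your own callers will tolerate once an endpoint stops resolving. A minimal sketch, assuming boto3 and a hypothetical orders table, of bounding that wait so the failure surfaces quickly instead of propagating as a hang:

```python
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

# Hypothetical limits -- tune to your own latency budget.
bounded = Config(
    connect_timeout=2,                       # seconds to establish a connection
    read_timeout=2,                          # seconds to wait for a response
    retries={"max_attempts": 3, "mode": "standard"},
)

dynamodb = boto3.client("dynamodb", region_name="us-east-1", config=bounded)

def fetch_order(order_id: str):
    try:
        resp = dynamodb.get_item(
            TableName="orders",              # hypothetical table name
            Key={"order_id": {"S": order_id}},
        )
        return resp.get("Item")
    except (BotoCoreError, ClientError) as exc:
        # Fail fast: return a degraded/cached answer instead of hanging upstream callers.
        print(f"DynamoDB unavailable ({exc.__class__.__name__}); serving degraded response")
        return None
```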
2. Business Impact
- Household names went dark: Snapchat, Reddit, Robinhood, Coinbase, Amazon retail, United Airlines (check‑in), Ring doorbells, numerous banking services.
- Geographic reach: more than 17 million individual outage reports from users in 60+ countries.
2.1 Direct Financial Losses
| Source | Figure |
|---|---|
| Parametrix (cloud‑insurance monitor) | $500 – $650 million in direct U.S. losses from this outage |
| Gartner (2014 benchmark) | $5,600 per minute of downtime (enterprise average) |
| Ponemon Institute (latest benchmark) | More than $9,000 per minute for large organisations |
“The actual figure for any given organization depends heavily on industry vertical, organization size, and business model.” – author’s note
2.2 Indirect Costs
- Trust erosion – PwC research: 32 % of customers abandon a brand after a single bad experience.
- Insurance gaps – Most cyber policies trigger only after 8+ hours of downtime. CyberCube estimated potential claims of $38 million – $581 million, yet many firms discovered exposure far exceeding coverage.
- Innovation stalls – Engineering teams diverted from roadmaps to fire‑fighting, accruing technical debt.
- Reputation liability – In an always‑on economy, downtime becomes a competitive disadvantage; resilience is now a market differentiator.
3. Government & Public‑Sector Fallout
3.1 United Kingdom
- HMRC’s Government Gateway (50 million users) went dark.
- Major banks (Lloyds, Bank of Scotland, Halifax) experienced simultaneous failures.
- Dame Meg Hillier, Chair of the UK Treasury Committee, asked in Parliament:
“Why are seemingly key parts of our IT infrastructure hosted abroad when a data centre in Virginia can take down British tax services?”
- 41 active AWS contracts across UK government departments total £1.11 billion (source: Tussell).
- HMRC contract alone: up to £350 million (Dec 2023 – Nov 2026).
“Why are so many critical UK institutions, from HMRC to major banks, dependent on a data centre on the east coast of the US?” – Mark Boost, CEO, Civo
4. Why US‑EAST‑1 Is the Epicentre
- Oldest & busiest AWS region – handles an estimated 35 %–40 % of global AWS traffic (industry analysts).
- Located in Northern Virginia, nicknamed “Data Center Alley”, the world’s highest concentration of data centres.
4.1 Historical US‑EAST‑1 Outages
| Date | Cause | Impact |
|---|---|---|
| February 2017 | Human error during S3 maintenance | Global S3 latency, downstream service disruptions |
| November 2020 | Power outage & network switch failure | Partial loss of EC2, RDS, and Lambda in the region |
| December 2022 | DNS misconfiguration affecting Route 53 | DNS resolution failures for multiple services |
| July 2024 | Network congestion & throttling | Elevated latency for CloudFront and API Gateway |
| October 2025 | Race condition in internal DNS → empty DynamoDB record | 15‑hour global outage affecting millions of users |
Pattern: The majority of large‑scale AWS incidents originate in US‑EAST‑1, underscoring a single‑point‑of‑failure risk for any architecture that relies heavily on this region.
5. Takeaways & Recommendations
- Diversify regional dependencies – Deploy critical services across multiple AWS regions (or multi‑cloud).
- Implement DNS resilience – Use secondary DNS providers, health‑checked CNAME failover, and automated verification of DNS updates (a minimal verification sketch follows this list).
- Design for graceful degradation – Circuit‑breaker patterns, fallback data stores, and read‑replica strategies can keep core functionality alive when a single service fails.
- Audit cloud‑insurance coverage – Ensure policies trigger at realistic downtime thresholds and cover indirect losses (reputation, regulatory penalties).
- Conduct regular chaos engineering drills – Simulate DNS failures, region‑wide outages, and dependent‑service loss to validate recovery processes.
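As a concrete starting point for the “automated verification of DNS updates” item above, a small resolver check can alert the moment a critical name stops returning addresses. This sketch assumes the third‑party dnspython package; the record list is illustrative:

```python
import dns.exception
import dns.resolver   # pip install dnspython

# Illustrative list -- substitute the endpoints your own stack depends on.
CRITICAL_RECORDS = [
    "dynamodb.us-east-1.amazonaws.com",
]

def resolves(name: str) -> bool:
    """Return True if the record currently resolves to at least one address."""
    try:
        # Raises if the name is missing, has no A records, or the lookup times out.
        dns.resolver.resolve(name, "A", lifetime=5)
        return True
    except dns.exception.DNSException:
        return False

for record in CRITICAL_RECORDS:
    if not resolves(record):
        # Wire this into whatever alerting you already run.
        print(f"ALERT: {record} returned no addresses")
```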
TL;DR
A race condition in AWS’s internal DNS system erased the dynamodb.us-east-1.amazonaws.com record, triggering a 15‑hour, worldwide cascade that cost $500 – $650 million in direct losses and exposed massive single‑point‑of‑failure risks in the US‑EAST‑1 region. The incident highlights the urgent need for multi‑region architectures, DNS hardening, and robust cloud‑insurance strategies to safeguard both commercial and public‑sector services.
Five Major Outages in Eight Years, All from the Same Region
In the February 2017 incident, for example, a command entered during debugging took significant portions of the internet down with it, affecting services such as Netflix, Slack, and Amazon’s own retail operations.
Yet companies continue concentrating workloads there. Why?
- Legacy decisions – Existing architectures were built before alternatives existed.
- Lower latency for East‑Coast users – Proximity to major population centers.
- Feature availability – Some services are only offered in certain regions.
- False comfort of “multi‑AZ deployments.”
The Problem: Multi‑AZ Doesn’t Protect Against Regional Failures
Availability Zones (AZs) within the same region share foundational infrastructure.
When that infrastructure fails—DNS, DynamoDB, Kinesis—your multi‑AZ architecture fails together.
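For contrast, here is what region‑level (rather than AZ‑level) fallback can look like in application code. The sketch assumes the data is already replicated to a second region, for example via DynamoDB global tables; the table name, key, and region list are hypothetical:

```python
from typing import Optional

import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

REGIONS = ["us-east-1", "us-west-2"]          # primary first, replica second
_cfg = Config(connect_timeout=2, read_timeout=2, retries={"max_attempts": 2})

def get_item_with_regional_fallback(table: str, key: dict) -> Optional[dict]:
    """Try each region in order; a regional failure degrades to the next region."""
    for region in REGIONS:
        client = boto3.client("dynamodb", region_name=region, config=_cfg)
        try:
            return client.get_item(TableName=table, Key=key).get("Item")
        except (BotoCoreError, ClientError) as exc:
            print(f"{region} unavailable ({exc.__class__.__name__}); trying next region")
    return None                               # total failure: serve cached/degraded data

item = get_item_with_regional_fallback("orders", {"order_id": {"S": "12345"}})
```

The point is not the specific client but the shape: a bounded attempt against the primary region, then a deliberate, tested path to a replica instead of an indefinite hang.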
The Counterargument: Why Concentration Also Enables Resilience
- Scale & investment – AWS spends billions annually on infrastructure, employs thousands of security engineers, and operates at a level most enterprises could never afford internally.
- Uptime record – Despite headline‑grabbing outages, AWS maintains a five‑year rolling uptime average of 99.95 %, exceeding what most organizations achieve with on‑premises data centers.
Costs of Fragmentation
- Multi‑cloud architectures are complex to operate and expensive to maintain.
- Data synchronization across providers creates consistency challenges.
- Different APIs require different expertise.
- The operational overhead of managing three cloud providers may exceed the resilience benefits for many organizations.
These are legitimate arguments. The question isn’t whether concentration has benefits—it clearly does—but whether the systemic risks now outweigh them, and whether market forces alone can address those risks.
A Possible Contributor to Increasing Outage Frequency: Changes in AWS’s Engineering Workforce
Corey Quinn, former AWS employee and current industry analyst at The Duckbill Group, has written extensively about this issue in The Register.
- Between 2022 – 2024, AWS experienced over 27,000 layoffs.
- Internal documents show 69‑81 % “regretted attrition”—employees the company wanted to retain but lost.
“You can hire a bunch of very smart people who will explain how DNS works at a deep technical level, but the one thing you can’t hire for is the person who remembers that when DNS starts getting wonky, check that seemingly unrelated system in the corner, because it has historically played a contributing role to some outages of yesteryear.” – Corey Quinn
Caveats
- Former employees may have incomplete information or personal grievances.
- AWS doesn’t publicly disclose engineering headcount or expertise distribution.
- Correlation between workforce changes and outage patterns doesn’t prove causation.
Political and Regulatory Response to the October 2025 Outage
| Actor | Statement / Action | Implication |
|---|---|---|
| Sen. Elizabeth Warren (U.S.) | “If a company can break the entire internet, they are too big. Period. It’s time to break up Big Tech.” (X) | Highlights growing bipartisan concern about concentration risk and national‑security implications. |
| Competition and Markets Authority (CMA) (UK) | Concluded a multi‑year investigation; found AWS and Microsoft hold 30‑40 % of UK cloud‑spending each; recommended “strategic market status” under the Digital Markets, Competition and Consumers Act 2024. | Allows regulators to impose legally binding conduct requirements; acknowledges lock‑in effects (≤1 % annual provider switching). |
Resilience Doesn’t Require an Unlimited Budget – It Requires Strategic Thinking
1. Tiered Approach
Not every system needs a multi‑region active‑active architecture.
| Workload | Recommended Topology |
|---|---|
| Revenue‑generating transaction systems | Active‑active multi‑region |
| Internal dashboards | Active‑passive or single‑region |
2. Design for Observability
- You can’t fix what you can’t see.
- Implement cross‑region monitoring, replication‑lag tracking, and synthetic transactions to detect problems before customers do.
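A synthetic transaction does not require a monitoring suite to get started; even a scheduled script that exercises one health endpoint per region and ships the result as a metric will catch a regional brown‑out before customers report it. A minimal sketch with hypothetical endpoints:

```python
import time
import urllib.request

# Hypothetical per-region health endpoints for your own application.
PROBES = {
    "us-east-1": "https://api-us-east-1.example.com/health",
    "eu-west-1": "https://api-eu-west-1.example.com/health",
}

def run_probe(region: str, url: str) -> None:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            ok = (resp.status == 200)
    except OSError:                           # DNS failure, timeout, connection refused
        ok = False
    latency_ms = (time.monotonic() - start) * 1000
    # Ship these values to your metrics system; alert on failures or latency spikes.
    print(f"{region}: ok={ok} latency={latency_ms:.0f}ms")

for region, url in PROBES.items():
    run_probe(region, url)
```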
3. Test Relentlessly
- Monthly game days.
- Chaos‑engineering experiments (see the DNS‑blackout sketch after this list).
- Unannounced failover tests.
- Document every discovered issue, fix it, then test again.
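One way to make a simulated DNS failure concrete inside a test suite is to monkeypatch name resolution so lookups for the dependency under test fail, then assert that the service degrades instead of crashing. In this sketch, pytest is assumed as the runner, and the orders module, place_order function, and expected status are hypothetical stand‑ins for your own code:

```python
import socket

def test_order_service_survives_dns_blackout(monkeypatch):
    """Chaos-style test: every DNS lookup for the DynamoDB endpoint fails."""
    real_getaddrinfo = socket.getaddrinfo

    def broken_getaddrinfo(host, *args, **kwargs):
        if "dynamodb" in str(host):           # blackhole only the dependency under test
            raise socket.gaierror("simulated empty DNS record")
        return real_getaddrinfo(host, *args, **kwargs)

    monkeypatch.setattr(socket, "getaddrinfo", broken_getaddrinfo)

    # Hypothetical application code under test.
    from orders import place_order
    result = place_order(order_id="12345")

    # The assertion is about behaviour, not availability: degrade, don't crash.
    assert result.status == "queued_for_retry"
```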
4. Build Multi‑Region Capabilities Incrementally
- Start with active‑passive failover for critical systems.
- Define clear Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO).
- Graduate to active‑active only when justified by business impact.
The Reality Behind the Numbers
AWS’s 99.95 % five‑year uptime sounds impressive until you work the numbers: 99.95 % allows roughly 4.4 hours of downtime per year, so the October 2025 incident alone burned through more than three years’ worth of that error budget in fifteen hours (the arithmetic is sketched below).
- SLA credits may refund a slice of the bill, but fifteen hours of downtime also means financial loss, customer‑trust erosion, and operational disruption that can’t be “invoiced” away.
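The arithmetic behind that claim, assuming a 365‑day year:

```python
# 99.95% availability leaves a 0.05% error budget.
HOURS_PER_YEAR = 365 * 24                                  # 8,760 hours
allowed_downtime = HOURS_PER_YEAR * (1 - 0.9995)           # ~4.38 hours per year

outage_hours = 15
print(f"{allowed_downtime:.2f} hours of downtime allowed per year")
print(f"{outage_hours / allowed_downtime:.1f} years of error budget spent in one incident")  # ~3.4
```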
“The cloud isn’t a metaphor. It’s fiber‑optic cables under the Atlantic. It’s cooling systems in Northern Virginia. It’s two automated processes racing to update the same DNS record at 11:48 PM on a Saturday night.”
Bottom Line
- Buildings fail. So do the systems we build inside them.
- The question isn’t if the next outage will happen—it’s whether you’ll be ready when it does.
References
- AWS Official Post‑Event Summary (October 2025)
- Parametrix Economic Estimate
- UK CMA Cloud Investigation – Final Decision (July 2025) – https://gov.uk/cma-cases/cloud-services-market-investigation
- UK Government AWS Contracts (Tussell data) – referenced in The Register, 29 Oct 2025
- Gartner Downtime Cost Study (2014) – https://blogs.gartner.com
- PwC Customer Experience Report – https://pwc.com
- TeleGeography Analysis (70 % claim disputed) – https://cardinalnews.org
Disclaimer:
The views expressed in this article are my own and do not represent those of my employer. All AWS outage data is sourced from official AWS post‑event summaries, industry reports from Parametrix and CyberCube, CMA investigation findings, and verified news coverage. Economic impact estimates are based on published industry methodologies and should be understood as approximations given the complexity of measuring distributed economic effects.