Day 00 - Prelude
Source: Dev.to
Introduction
If you’ve ever hit any of these:
- Two requests update the same thing at the same time (race conditions)
- Retries create duplicate effects (double emails, double charges, double writes)
- Rate limits / quotas are inconsistent under load
- Ordering matters per customer/resource, but events arrive “whenever”
- “It worked locally” but breaks under real traffic
…you’re in distributed systems territory.
And you don’t need a “massive distributed system” for this to be true. Even with a single server, concurrent requests + retries + partial failures can create the same class of problems. If you later add replicas, the pain gets amplified fast.
So the topic isn’t how many servers you have. It’s coordination – per‑key coordination.
Per‑key coordination means: for one specific thing (like an order or a user), there’s one place that decides what happens.
That sentence sounds obvious… until you meet the moment where it stops being optional.
The moment it becomes real
Imagine a button: Pay.
On the happy path it’s boring:
click → charge → 200 OK → “Paid.”
The failure path is where your system reveals what it actually believes.
- The user clicks Pay.
- The server charges the card.
- Something goes wrong before the user sees success – a timeout, a network hiccup, a crash, the client giving up early, a proxy retrying, a job‑runner retrying later… pick your favorite.
The point is: the system did some work, but the outside world didn’t get a clean “done” signal.
Now the intent arrives again.
You’re no longer debugging payments. You’re debugging this question:
Did we already do the thing? If yes, what should we do now?
That question doesn’t live only in checkout flows. It shows up when you:
- send an email
- create a subscription
- increment a quota
- finalize an order
- apply a state transition
- update a profile
- accept an invite
This is why production bugs can feel haunted. The code looks fine. Tests pass. Logs look normal. Yet outcomes are wrong—because the system answers “did it already happen?” inconsistently.
“Maybe” Is the Most Expensive State
At some point your system can’t confidently say “yes” or “no.” It can only say “maybe.”
“Maybe” is expensive because it forces two bad choices:
| Choice | Consequence |
|---|---|
| Do it again | Duplicates (double charge, double email, double write) |
| Refuse to do it | Missed work (no email, stale state, inconsistent outcomes) |
The frustrating part is that “maybe” isn’t rare. It shows up in normal reality:
- concurrent requests
- retries
- webhook redeliveries
- at‑least‑once job processing
- crashes between “side effect happened” and “response delivered”
So the fix usually isn’t “add another condition.”
The fix is to introduce a consistent place where decisions get made.
The Missing Lego Brick
When the weirdness clusters around one thing—an order, a user, a tenant, or a resource—the shape of the fix is usually the same:
One Coordinator per Key
A single place that can say:
- “I’ve already seen this requestId; don’t apply it twice.”
- “For this order, state transitions happen in order.”
- “For this tenant, quotas are enforced consistently.”
- “Only one worker can hold this lock right now.”
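To make that shape concrete, the list above maps onto a small interface. This is purely an illustrative sketch; the names (KeyCoordinator, applyOnce, and so on) are hypothetical and not taken from any library:

```ts
// Illustrative only: a hypothetical contract for "one coordinator per key".
type OrderState = "created" | "paid" | "shipped" | "cancelled";

interface KeyCoordinator {
  // "I've already seen this requestId; don't apply it twice."
  applyOnce(requestId: string, action: () => Promise<void>): Promise<"applied" | "duplicate">;

  // "For this order, state transitions happen in order."
  transition(to: OrderState): Promise<"ok" | "rejected">;

  // "For this tenant, quotas are enforced consistently."
  consumeQuota(amount: number): Promise<"ok" | "over_limit">;

  // "Only one worker can hold this lock right now."
  acquireLock(holderId: string, ttlMs: number): Promise<"acquired" | "held_elsewhere">;
}
```

The interesting question for the rest of the series is where an implementation like this lives, so that it stays correct under retries and concurrency.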
In this series I’ll use Cloudflare Durable Objects as the coordination primitive so we can focus on the patterns, not the plumbing.
Why This Series Uses Durable Objects
You can solve coordination problems on any major cloud. The question isn’t “can I do it on AWS/GCP?” — you can.
The real question is how many moving parts you need to make it correct, and how easy it is to reason about under retries and concurrency.
Coordination bugs rarely stem from missing features. They arise when the system is forced to answer “Did it already happen?” but has no single place that can answer consistently for a given key.
Durable Objects appear in this series because they provide a very direct building‑block for that pattern:
one key → one stateful place to decide, with storage attached.
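As a rough sketch of how that addressing works in a Worker (the binding name ORDER_COORDINATOR and the orderId query parameter are assumptions for illustration, not anything prescribed):

```ts
// A Worker routing every request for the same order to the same Durable Object.
export interface Env {
  ORDER_COORDINATOR: DurableObjectNamespace;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);
    const orderId = url.searchParams.get("orderId");
    if (!orderId) {
      return new Response("missing orderId", { status: 400 });
    }

    // Same name -> same id -> same object instance: one stateful place per key.
    const id = env.ORDER_COORDINATOR.idFromName(`order:${orderId}`);
    const stub = env.ORDER_COORDINATOR.get(id);
    return stub.fetch(request);
  },
};
```

Because idFromName is deterministic, every request that derives the same key lands on the same instance.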
What You Usually End Up Building on AWS/GCP (and Why It’s Easy to Get Wrong)
If you try to recreate “one place per key decides” using common cloud primitives, you typically end up building a small distributed system of its own:
Core Components
- Stateless compute – Lambda, Cloud Run, etc.
- Database with conditional writes / transactions – DynamoDB, Spanner, PostgreSQL
- Cache/lock service (optional) – Redis for counters or distributed locks
- Queue / workflow layer (optional) – SQS, Pub/Sub, Step Functions, Cloud Tasks
The Glue Problem
Each piece works fine on its own, but the real challenge is the glue that ties them together:
- Ordering & idempotency – multiple services must agree on the order of operations and ensure actions are idempotent.
- Cross‑service retries – retries need to be coordinated across service boundaries to avoid duplicate work.
- Partial failures – you must handle scenarios where some components succeed while others fail.
- Lock management – locks require TTLs, renewal, fencing, and careful failure handling.
- Debugging complexity – tracing “which service saw what, in what order?” quickly becomes a nightmare.
You can absolutely make it work, but the coordination logic ends up spread across infrastructure decisions rather than being confined to application code.
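For a feel of what one of those pieces looks like in isolation, here is a hedged sketch of the dedupe-record piece as a DynamoDB conditional write (the table name, key shape, and helper name are made up for illustration):

```ts
import {
  DynamoDBClient,
  PutItemCommand,
  ConditionalCheckFailedException,
} from "@aws-sdk/client-dynamodb";

const dynamo = new DynamoDBClient({});

// Record "we have handled this requestId" at most once.
// Table and attribute names are illustrative assumptions.
async function markHandled(requestId: string): Promise<"first_time" | "duplicate"> {
  try {
    await dynamo.send(
      new PutItemCommand({
        TableName: "idempotency-keys",
        Item: { pk: { S: `request:${requestId}` } },
        ConditionExpression: "attribute_not_exists(pk)",
      })
    );
    return "first_time";
  } catch (err) {
    if (err instanceof ConditionalCheckFailedException) {
      return "duplicate"; // an earlier attempt already recorded this requestId
    }
    throw err;
  }
}
```

The write itself is the easy part; the glue problem is everything around it: what happens if the process crashes after recording the key but before the side effect runs, who expires stale records, and how the other services see a consistent answer.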
What Durable Objects Are (and What You Get by Default)
A Durable Object is a stateful instance addressed by an ID (or a name‑derived key) that combines compute with persistent storage.
Three properties that matter for coordination
- Requests for the same key go to the same object – gives you a natural “home” for decisions about order:123 or tenant:acme.
- Single‑threaded execution per object – you can write coordination logic without reinventing locks inside your own code.
- Storage is attached to the object – the place that decides can also remember what it decided (dedupe keys, current state, counters, queue state).
Patterns you can implement
- Idempotency / deduplication (request‑ID sets)
- Single‑writer ordering per key
- Per‑key rate limits / quotas
- Per‑key queues
- Stampede protection (one refresh, many wait)
All of these can be built without assembling a separate coordination stack.
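As a sketch of the first pattern on that list, idempotency inside a Durable Object can look roughly like this, using the classic fetch-based class API (the class name, header, and storage-key scheme are assumptions; the real implementations come later in the series):

```ts
// Illustrative sketch: dedupe by request ID inside a Durable Object.
export class OrderCoordinator {
  state: DurableObjectState;

  constructor(state: DurableObjectState, env: unknown) {
    this.state = state;
  }

  async fetch(request: Request): Promise<Response> {
    const requestId = request.headers.get("Idempotency-Key");
    if (!requestId) {
      return new Response("missing Idempotency-Key", { status: 400 });
    }

    // Execution per object is single-threaded, so this check-then-act
    // doesn't race with another request for the same key.
    const previous = await this.state.storage.get<string>(`seen:${requestId}`);
    if (previous) {
      return new Response(previous, { status: 200 }); // replay the earlier outcome
    }

    // ... do the actual work for this order here (charge, transition, etc.) ...
    const result = "done";

    await this.state.storage.put(`seen:${requestId}`, result);
    return new Response(result, { status: 200 });
  }
}
```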
The real benefit
- Durable Objects give you a single, stateful, single‑threaded endpoint per key, removing the need to stitch together multiple services just to achieve reliable coordination.
This makes the system easier to reason about, test, and debug—especially under retries and concurrency.
The Benefit isn’t “Cloudflare vs AWS”
Durable Objects reduce coordination from “a system‑design problem across multiple services” to “a local decision inside one keyed instance.”
That’s why they’re a great teaching tool for these patterns: we can spend the series on ordering, deduplication, rate limits, per‑key queues, stampede protection, and sharding—rather than wiring and operationalizing the coordination stack.
When not to use Durable Objects
Durable Objects aren’t a universal default. Skip them when:
- A clean database transaction already solves the correctness problem.
- Your workload is purely stateless.
- The primary requirement is global querying/analytics across many keys.
Guiding principle: Use the simplest tool that can make the decision consistently.
What I’m Going to Build Over 30 Days
The goal is to end with a toolbox: the same coordination idea applied as repeatable patterns.
| Days | Focus | Topics |
|---|---|---|
| 1‑3 | Primitive “click” | Mapping keys to coordinators, choosing keys safely, and understanding what is and isn’t durable. |
| 4‑6 | Real‑world “Pay button” | Deduping retries (idempotency), enforcing ordering (single‑writer per key), and keeping quotas consistent (rate limiting). |
| 7‑30 | Shipping‑time patterns | Per‑key queues, locks, stampede protection, hot keys, sharding, and the trade‑offs that decide when DO is the right tool versus a database, Redis, or a queue. |
I’ll keep it flexible on purpose: if a topic turns out to be more useful than planned, I’ll spend more time there.
Posting Schedule (holiday break)
This is a 30‑post run. No posts will be published on Dec 24, 25, 29, 30, 31 and Jan 1, 2.
Planned Map (subject to change)
I’ll keep this updated as days slip.
| Day | Topic |
|---|---|
| Day 01 | One Key, One Coordinator – The primitive: route by key → one stateful place to decide. |
| Day 02 | Key Design = Partitioning – Good keys isolate; bad keys collide (and create hot spots). |
| Day 03 | What’s Actually Durable? – Memory vs. storage vs. “what survives” (and what doesn’t). |
| Day 04 | Single‑Writer per Key – Enforce ordering and avoid races by serializing per key. |
| Day 05 | Idempotency: Dedupe Retries – Turn “maybe” into “already handled” with request IDs. |
| Day 06 | Rate Limiting per Key – Consistent quotas even when you scale and retries happen. |
| Day 07 | Weekly Recap #1 + Cheatsheet – The first “pattern index” you can bookmark. |
| Day 08 | A Per‑Key Queue – Queue work per key to control order and throughput. |
| Day 09 | Locks per Resource (and When Not to) – When you truly need mutual exclusion, and the foot‑guns. |
| Day 10 | Debounce/Throttle per Key – Collapse bursts into one decision. |
| Day 11 | Stampede Protection (“Single Flight”) – One refresh runs; everyone else waits. |
| Day 12 | Consistent Counters per Key – Quotas, usage, and “exactly‑once‑ish” counting. |
| Day 13 | Leader per Key (Coordinator Role) – When one instance must orchestrate steps for a key. |
| Day 14 | Weekly Recap #2 + Pattern Index – Consolidate: dedupe, ordering, queues, locks, stampedes. |
| Day 15 | Handling At‑Least‑Once Delivery – Designing for duplicates as a normal case. |
| Day 16 | Webhooks: Redelivery Without Panic – Make webhook handlers safe under retries. |
| Day 17 | Sagas per Key (Multi‑Step Workflows) – A simple saga state machine you can reason about. |
| Day 18 | Backpressure per Key – Protect correctness when load spikes. |
| Day 19 | Hot Keys: Symptoms and Triage – How to recognize and mitigate before it melts. |
| Day 20 | Sharding a Hot Key – Split one key into many without losing correctness. |
| Day 21 | Weekly Recap #3 + Failure‑Modes Checklist – The “what breaks in prod” list. |
| Day 22 | Observability That Helps Coordination Bugs – Logs/metrics/tracing for “did it already happen?” |
| Day 23 | Testing Retries, Races, and Ordering – A harness to reproduce the “haunted” bugs. |
| Day 24 | Anti‑Patterns (What Not To Do) – Mistakes that create invisible correctness debt. |
| Day 25 | DO vs DB vs Redis vs Queues (Honest Trade‑offs) – How to choose the simplest correct tool. |
| Day 26 | Multi‑Tenant Boundaries – Isolation, fairness, and per‑tenant abuse prevention. |
| Day 27 | Coordinating Fan‑Out (Realtime / Rooms) – When many clients depend on one key’s truth. |
| Day 28 | Composition: Building a “Coordination Kit” – Combine patterns instead of rewriting them. |
| Day 29 | A Small Capstone Demo – A realistic flow that uses multiple patterns together. |
| Day 30 | Final Index + Learning Paths – Where to start depending on your problem (retries, ordering, quotas, hot keys). |
How to Follow Along
The code lives in a single GitHub repo:
Pillin/Durable‑Objects‑30days
Stay curious and ship.