Day 00 - Prelude
Source: Dev.to
Introduction
If you’ve ever hit any of these:
- Two requests update the same thing at the same time (race conditions)
- Retries create duplicate effects (double emails, double charges, double writes)
- Rate limits / quotas are inconsistent under load
- Ordering matters per customer/resource, but events arrive “whenever”
- “It worked locally” but breaks under real traffic
…you’re in distributed systems territory.
And you don’t need a “massive distributed system” for this to be true. Even with a single server, concurrent requests + retries + partial failures can create the same class of problems. If you later add replicas, the pain gets amplified fast.
So the topic isn’t how many servers you have. It’s coordination – per‑key coordination.
Per‑key coordination means: for one specific thing (like an order or a user), there’s one place that decides what happens.
That sentence sounds obvious… until you meet the moment where it stops being optional.
The moment it becomes real
Imagine a button: Pay.
On the happy path it’s boring:
click → charge → 200 OK → “Paid.”
The failure path is where your system reveals what it actually believes.
- The user clicks Pay.
- The server charges the card.
- Something goes wrong before the user sees success – a timeout, a network hiccup, a crash, the client giving up early, a proxy retrying, a job‑runner retrying later… pick your favorite.
The point is: the system did some work, but the outside world didn’t get a clean “done” signal.
Now the intent arrives again.
You’re no longer debugging payments. You’re debugging this question:
Did we already do the thing? If yes, what should we do now?
That question doesn’t live only in checkout flows. It shows up when you:
- send an email
- create a subscription
- increment a quota
- finalize an order
- apply a state transition
- update a profile
- accept an invite
This is why production bugs can feel haunted. The code looks fine. Tests pass. Logs look normal. Yet outcomes are wrong—because the system answers “did it already happen?” inconsistently.
“Maybe” Is the Most Expensive State
At some point your system can’t confidently say “yes” or “no.” It can only say “maybe.”
“Maybe” is expensive because it forces two bad choices:
| Choice | Consequence |
|---|---|
| Do it again | Duplicates (double charge, double email, double write) |
| Refuse to do it | Missed work (no email, stale state, inconsistent outcomes) |
The frustrating part is that “maybe” isn’t rare. It shows up in normal reality:
- concurrent requests
- retries
- webhook redeliveries
- at‑least‑once job processing
- crashes between “side effect happened” and “response delivered”
So the fix usually isn’t “add another condition.”
The fix is to introduce a consistent place where decisions get made.
The Missing Lego Brick
When the weirdness clusters around one thing—an order, a user, a tenant, or a resource—the shape of the fix is usually the same:
One Coordinator per Key
A single place that can say:
- “I’ve already seen this requestId; don’t apply it twice.”
- “For this order, state transitions happen in order.”
- “For this tenant, quotas are enforced consistently.”
- “Only one worker can hold this lock right now.”
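To make that shape concrete, the list above maps onto a small interface. This is purely an illustrative sketch; the names (KeyCoordinator, applyOnce, and so on) are hypothetical and not taken from any library:

```ts
// Illustrative only: a hypothetical contract for "one coordinator per key".
type OrderState = "created" | "paid" | "shipped" | "cancelled";

interface KeyCoordinator {
  // "I've already seen this requestId; don't apply it twice."
  applyOnce(requestId: string, action: () => Promise<void>): Promise<"applied" | "duplicate">;

  // "For this order, state transitions happen in order."
  transition(to: OrderState): Promise<"ok" | "rejected">;

  // "For this tenant, quotas are enforced consistently."
  consumeQuota(amount: number): Promise<"ok" | "over_limit">;

  // "Only one worker can hold this lock right now."
  acquireLock(holderId: string, ttlMs: number): Promise<"acquired" | "held_elsewhere">;
}
```

The interesting question for the rest of the series is where an implementation like this lives, so that it stays correct under retries and concurrency.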
In this series I’ll use Cloudflare Durable Objects as the coordination primitive so we can focus on the patterns, not the plumbing.
Why This Series Uses Durable Objects
You can solve coordination problems on any major cloud. The question isn’t “can I do it on AWS/GCP?” — you can.
The real question is how many moving parts you need to make it correct, and how easy it is to reason about under retries and concurrency.
Coordination bugs rarely stem from missing features. They arise when the system is forced to answer “Did it already happen?” but has no single place that can answer consistently for a given key.
Durable Objects appear in this series because they provide a very direct building‑block for that pattern:
one key → one stateful place to decide, with storage attached.
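As a rough sketch of how that addressing works in a Worker (the binding name ORDER_COORDINATOR and the orderId query parameter are assumptions for illustration, not anything prescribed):

```ts
// A Worker routing every request for the same order to the same Durable Object.
export interface Env {
  ORDER_COORDINATOR: DurableObjectNamespace;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);
    const orderId = url.searchParams.get("orderId");
    if (!orderId) {
      return new Response("missing orderId", { status: 400 });
    }

    // Same name -> same id -> same object instance: one stateful place per key.
    const id = env.ORDER_COORDINATOR.idFromName(`order:${orderId}`);
    const stub = env.ORDER_COORDINATOR.get(id);
    return stub.fetch(request);
  },
};
```

Because idFromName is deterministic, every request that derives the same key lands on the same instance.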
What You Usually End Up Building on AWS/GCP (and Why It’s Easy to Get Wrong)
If you try to recreate “one place per key decides” using common cloud primitives, you typically end up building a small distributed system of its own:
Core Components
- Stateless compute – Lambda, Cloud Run, etc.
- Database with conditional writes / transactions – DynamoDB, Spanner, PostgreSQL
- Cache/lock service (optional) – Redis for counters or distributed locks
- Queue / workflow layer (optional) – SQS, Pub/Sub, Step Functions, Cloud Tasks
The Glue Problem
Each piece works fine on its own, but the real challenge is the glue that ties them together:
- Ordering & idempotency – multiple services must agree on the order of operations and ensure actions are idempotent.
- Cross‑service retries – retries need to be coordinated across service boundaries to avoid duplicate work.
- Partial failures – you must handle scenarios where some components succeed while others fail.
- Lock management – locks require TTLs, renewal, fencing, and careful failure handling.
- Debugging complexity – tracing “which service saw what, in what order?” quickly becomes a nightmare.
You can absolutely make it work, but the coordination logic ends up spread across infrastructure decisions rather than being confined to application code.
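For a feel of what one of those pieces looks like in isolation, here is a hedged sketch of the dedupe-record piece as a DynamoDB conditional write (the table name, key shape, and helper name are made up for illustration):

```ts
import {
  DynamoDBClient,
  PutItemCommand,
  ConditionalCheckFailedException,
} from "@aws-sdk/client-dynamodb";

const dynamo = new DynamoDBClient({});

// Record "we have handled this requestId" at most once.
// Table and attribute names are illustrative assumptions.
async function markHandled(requestId: string): Promise<"first_time" | "duplicate"> {
  try {
    await dynamo.send(
      new PutItemCommand({
        TableName: "idempotency-keys",
        Item: { pk: { S: `request:${requestId}` } },
        ConditionExpression: "attribute_not_exists(pk)",
      })
    );
    return "first_time";
  } catch (err) {
    if (err instanceof ConditionalCheckFailedException) {
      return "duplicate"; // an earlier attempt already recorded this requestId
    }
    throw err;
  }
}
```

The write itself is the easy part; the glue problem is everything around it: what happens if the process crashes after recording the key but before the side effect runs, who expires stale records, and how the other services see a consistent answer.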
What Durable Objects Are (and What You Get by Default)
A Durable Object is a stateful instance addressed by an ID (or a name‑derived key) that combines compute with persistent storage.
Three properties that matter for coordination
- Requests for the same key go to the same object – gives you a natural “home” for decisions about order:123 or tenant:acme.
- Single‑threaded execution per object – you can write coordination logic without reinventing locks inside your own code.
- Storage is attached to the object – the place that decides can also remember what it decided (dedupe keys, current state, counters, queue state).
Patterns you can implement
- Idempotency / deduplication (request‑ID sets)
- Single‑writer ordering per key
- Per‑key rate limits / quotas
- Per‑key queues
- Stampede protection (one refresh, many wait)
All of these can be built without assembling a separate coordination stack.
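As a sketch of the first pattern on that list, idempotency inside a Durable Object can look roughly like this, using the classic fetch-based class API (the class name, header, and storage-key scheme are assumptions; the real implementations come later in the series):

```ts
// Illustrative sketch: dedupe by request ID inside a Durable Object.
export class OrderCoordinator {
  state: DurableObjectState;

  constructor(state: DurableObjectState, env: unknown) {
    this.state = state;
  }

  async fetch(request: Request): Promise<Response> {
    const requestId = request.headers.get("Idempotency-Key");
    if (!requestId) {
      return new Response("missing Idempotency-Key", { status: 400 });
    }

    // Execution per object is single-threaded, so this check-then-act
    // doesn't race with another request for the same key.
    const previous = await this.state.storage.get<string>(`seen:${requestId}`);
    if (previous) {
      return new Response(previous, { status: 200 }); // replay the earlier outcome
    }

    // ... do the actual work for this order here (charge, transition, etc.) ...
    const result = "done";

    await this.state.storage.put(`seen:${requestId}`, result);
    return new Response(result, { status: 200 });
  }
}
```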
The real benefit
- Durable Objects give you a single, stateful, single‑threaded endpoint per key, removing the need to stitch together multiple services just to achieve reliable coordination.
This makes the system easier to reason about, test, and debug—especially under retries and concurrency.
The Benefit isn’t “Cloudflare vs AWS”
Durable Objects reduce coordination from “a system‑design problem across multiple services” to “a local decision inside one keyed instance.”
That’s why they’re a great teaching tool for these patterns: we can spend the series on ordering, deduplication, rate limits, per‑key queues, stampede protection, and sharding—rather than wiring and operationalizing the coordination stack.
When not to use Durable Objects
Durable Objects aren’t a universal default. Skip them when:
- A clean database transaction already solves the correctness problem.
- Your workload is purely stateless.
- The primary requirement is global querying/analytics across many keys.
Guiding principle: Use the simplest tool that can make the decision consistently.
What I’m Going to Build Over 30 Days
The goal is to end with a toolbox: the same coordination idea applied as repeatable patterns.
| Days | Focus | Topics |
|---|---|---|
| 1‑3 | Primitive “click” | Mapping keys to coordinators, choosing keys safely, and understanding what is and isn’t durable. |
| 4‑6 | Real‑world “Pay button” | Deduping retries (idempotency), enforcing ordering (single‑writer per key), and keeping quotas consistent (rate limiting). |
| 7‑30 | Shipping‑time patterns | Per‑key queues, locks, stampede protection, hot keys, sharding, and the trade‑offs that decide when DO is the right tool versus a database, Redis, or a queue. |
I’ll keep it flexible on purpose: if a topic turns out to be more useful than planned, I’ll spend more time there.
Posting Schedule (holiday break)
This is a 30‑post run. No posts will be published on Dec 24, 25, 29, 30, 31 and Jan 1, 2.
Planned Map (subject to change)
I’ll keep this updated as days slip.
| Day | Topic |
|---|---|
| Day 01 | One Key, One Coordinator – The primitive: route by key → one stateful place to decide. |
| Day 02 | Key Design = Partitioning – Good keys isolate; bad keys collide (and create hot spots). |
| Day 03 | What’s Actually Durable? – Memory vs. storage vs. “what survives” (and what doesn’t). |
| Day 04 | Single‑Writer per Key – Enforce ordering and avoid races by serializing per key. |
| Day 05 | Idempotency: Dedupe Retries – Turn “maybe” into “already handled” with request IDs. |
| Day 06 | Rate Limiting per Key – Consistent quotas even when you scale and retries happen. |
| Day 07 | Weekly Recap #1 + Cheatsheet – The first “pattern index” you can bookmark. |
| Day 08 | A Per‑Key Queue – Queue work per key to control order and throughput. |
| Day 09 | Locks per Resource (and When Not to) – When you truly need mutual exclusion, and the foot‑guns. |
| Day 10 | Debounce/Throttle per Key – Collapse bursts into one decision. |
| Day 11 | Stampede Protection (“Single Flight”) – One refresh runs; everyone else waits. |
| Day 12 | Consistent Counters per Key – Quotas, usage, and “exactly‑once‑ish” counting. |
| Day 13 | Leader per Key (Coordinator Role) – When one instance must orchestrate steps for a key. |
| Day 14 | Weekly Recap #2 + Pattern Index – Consolidate: dedupe, ordering, queues, locks, stampedes. |
| Day 15 | Handling At‑Least‑Once Delivery – Designing for duplicates as a normal case. |
| Day 16 | Webhooks: Redelivery Without Panic – Make webhook handlers safe under retries. |
| Day 17 | Sagas per Key (Multi‑Step Workflows) – A simple saga state machine you can reason about. |
| Day 18 | Backpressure per Key – Protect correctness when load spikes. |
| Day 19 | Hot Keys: Symptoms and Triage – How to recognize and mitigate before it melts. |
| Day 20 | Sharding a Hot Key – Split one key into many without losing correctness. |
| Day 21 | Weekly Recap #3 + Failure‑Modes Checklist – The “what breaks in prod” list. |
| Day 22 | Observability That Helps Coordination Bugs – Logs/metrics/tracing for “did it already happen?” |
| Day 23 | Testing Retries, Races, and Ordering – A harness to reproduce the “haunted” bugs. |
| Day 24 | Anti‑Patterns (What Not To Do) – Mistakes that create invisible correctness debt. |
| Day 25 | DO vs DB vs Redis vs Queues (Honest Trade‑offs) – How to choose the simplest correct tool. |
| Day 26 | Multi‑Tenant Boundaries – Isolation, fairness, and per‑tenant abuse prevention. |
| Day 27 | Coordinating Fan‑Out (Realtime / Rooms) – When many clients depend on one key’s truth. |
| Day 28 | Composition: Building a “Coordination Kit” – Combine patterns instead of rewriting them. |
| Day 29 | A Small Capstone Demo – A realistic flow that uses multiple patterns together. |
| Day 30 | Final Index + Learning Paths – Where to start depending on your problem (retries, ordering, quotas, hot keys). |
How to Follow Along
The code lives in a single GitHub repo:
Pillin/Durable‑Objects‑30days
Stay curious and ship.