Webhooks at Scale: Designing an Idempotent, Replay-Safe, and Observable Webhook System
Source: Dev.to
Introduction
Webhooks look easy until your system processes the same payment three times, drops a critical event, and you can’t prove what actually happened. This article is a production‑grade deep dive into building a webhook ingestion system that survives retries, replays, out‑of‑order delivery, provider bugs, and your own future self.
Most webhook providers promise:
- at‑least‑once delivery
- retries on failure
- signed payloads
What they don’t promise:
- ordering
- uniqueness
- consistency
- sane retry behavior
Reality: webhooks are an unreliable distributed queue that you do not control. Treat them as such.
Typical failure modes:
- Duplicate events processed twice
- Provider retries for hours after success
- Events arriving out of order
- Partial failures mid‑processing
- Clock skew breaking signatures
- Silent drops with no audit trail
A correct design assumes all of these happen daily.
Architecture Overview
Webhook Provider
│
│ POST /webhook
▼
Ingress Layer (Fast, Stateless)
│
│ enqueue
▼
Persistent Event Store
│
│ dedupe + order
▼
Event Processor
│
│ side effects
▼
Domain Services
Key Principle
Never do business logic in the webhook handler.
Webhook endpoints must:
- Verify the signature
- Persist the raw payload
- Return a 2xx response
Anything else belongs downstream.
Minimal webhook handler (Node/Express)
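app.post('/webhook', async (req, res) => {
  try {
    verifySignature(req);      // assumed to throw (e.g. with statusCode 401) on a bad signature
    await storeRawEvent(req);  // persist the raw payload before acknowledging
    res.status(200).end();
  } catch (err) {
    // Always answer: an uncaught rejection in an async Express 4 handler leaves the request hanging.
    res.status(err.statusCode ?? 500).end();
  }
});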
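If your endpoint regularly takes more than a second or two to respond, expect provider timeouts and therefore retries. Exact timeouts vary by provider, so check the documentation and keep the handler well under the limit.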
What to store
- All request headers
- Raw request body
- Reception timestamp
- Provider event ID (if any)
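The list above maps naturally onto a single stored record. The following is a minimal sketch, not a definitive schema: `buildRawEventRecord`, the field names, and the `x-event-id` header are all assumptions for illustration. It also assumes the raw body is available as a Buffer (in Express, for example, via `express.raw({ type: 'application/json' })`).

```javascript
// Hypothetical helper: builds the record to persist for each incoming webhook.
// Assumes `rawBody` is a Buffer holding the exact bytes the provider sent.
function buildRawEventRecord(headers, rawBody, receivedAt = new Date()) {
  return {
    headers: { ...headers },               // all request headers (Node lowercases the names)
    rawBody: rawBody.toString('base64'),   // exact bytes, encoded so they survive text storage
    receivedAt: receivedAt.toISOString(),  // reception timestamp from our own clock
    // Assumed header name; providers differ, and some send the ID only in the body.
    providerEventId: headers['x-event-id'] ?? null,
  };
}

const record = buildRawEventRecord(
  { 'content-type': 'application/json', 'x-event-id': 'evt_123' },
  Buffer.from('{"type":"payment.succeeded"}'),
  new Date('2024-01-01T00:00:00Z')
);
console.log(record.receivedAt); // 2024-01-01T00:00:00.000Z
```

Storing the body as the original bytes rather than as parsed JSON keeps later signature re-verification and replay possible.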
Idempotency
If your system is not idempotent, retries become data corruption.
Wrong approaches
- “We’ll check if status already changed” ❌
- “We’ll trust provider event IDs” ❌
Correct approach
Create your own idempotency key:
const key = hash(`${provider}:${eventType}:${externalObjectId}`);
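Persist the key with a unique constraint. If the insert fails with a uniqueness violation, treat the event as a duplicate and skip it safely. Note that this key treats every event of the same type for the same object as one event; if an object can legitimately emit the same event type more than once, include a version or sequence field in the key as well.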
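As a rough illustration of the key-plus-unique-constraint idea, here is a minimal sketch. SHA-256 stands in for the unspecified `hash`, and the `ProcessedKeys` class is a hypothetical in-memory substitute for a database table with a unique index; `handleOnce` and the event field names are likewise assumptions.

```javascript
const crypto = require('node:crypto');

// Builds the key from the same three fields as the snippet above, using SHA-256.
function idempotencyKey(provider, eventType, externalObjectId) {
  return crypto
    .createHash('sha256')
    .update(`${provider}:${eventType}:${externalObjectId}`)
    .digest('hex');
}

// In-memory stand-in for a table with a UNIQUE constraint on the key column.
// In a real database the insert itself must enforce uniqueness atomically.
class ProcessedKeys {
  constructor() {
    this.keys = new Set();
  }
  // Returns true if the key was newly inserted, false if it already existed.
  insert(key) {
    if (this.keys.has(key)) return false;
    this.keys.add(key);
    return true;
  }
}

function handleOnce(store, event, applyEffect) {
  const key = idempotencyKey(event.provider, event.type, event.objectId);
  if (!store.insert(key)) return 'duplicate'; // already seen: skip safely
  applyEffect(event);
  return 'processed';
}

const store = new ProcessedKeys();
const event = { provider: 'acme', type: 'payment.succeeded', objectId: 'pay_42' };
let charges = 0;
const first = handleOnce(store, event, () => { charges += 1; });
const second = handleOnce(store, event, () => { charges += 1; });
console.log(first, second, charges); // processed duplicate 1
```

In a real system the key insert and the domain change would normally share one database transaction, so that a failure while applying the effect does not leave the key recorded with nothing done.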
Ordering
Providers do not guarantee ordering. Never assume:
- Event A arrives before event B
- Timestamps are monotonic
Strategy
Model events as state transitions and reject invalid transitions:
if (!isValidTransition(currentState, nextEvent)) {
  logAndIgnore();
}
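This makes the system tolerant of out-of-order delivery, because only valid state changes are applied. Since the raw event is already persisted, an event that was ignored for arriving too early can still be replayed once its predecessor has been processed.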
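One way to express the transition check is a lookup table. This is only a sketch: the payment lifecycle, state names, and event names below are invented for illustration, and `applyEvent` is a hypothetical wrapper around the `isValidTransition` check shown above.

```javascript
// Hypothetical payment lifecycle: state -> { eventType: nextState }.
const TRANSITIONS = {
  pending:    { 'payment.authorized': 'authorized', 'payment.failed': 'failed' },
  authorized: { 'payment.captured': 'captured', 'payment.failed': 'failed' },
  captured:   { 'payment.refunded': 'refunded' },
  failed:     {},
  refunded:   {},
};

function isValidTransition(currentState, eventType) {
  return Object.hasOwn(TRANSITIONS[currentState] ?? {}, eventType);
}

// Applies the event if it is a legal transition; otherwise leaves the state unchanged.
function applyEvent(currentState, eventType) {
  if (!isValidTransition(currentState, eventType)) {
    return { state: currentState, applied: false };
  }
  return { state: TRANSITIONS[currentState][eventType], applied: true };
}

// "captured" arrives before "authorized": it is ignored, not applied.
const early = applyEvent('pending', 'payment.captured');
const inOrder = applyEvent('pending', 'payment.authorized');
// A redelivered "captured" after capture is also a no-op.
const repeat = applyEvent('captured', 'payment.captured');
console.log(early.applied, inOrder.state, repeat.applied); // false authorized false
```

A side benefit of the table is that duplicates and stale events fall out of the same check, since neither is a legal transition from the current state.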
Transactional Outbox Pattern
Databases are transactional; external APIs are not. Use the outbox pattern:
- Write the domain change and an outbox record in the same transaction.
- Commit the transaction.
- An asynchronous worker reads pending outbox records and executes side effects.
- Mark the outbox record as done.
Benefits
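- Side effects are attempted only for committed changes, so a crash cannot leave a domain change without a record of its pending side effect.
- Combined with idempotent workers, this prevents double emails, double charges, and partial failures.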
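The four outbox steps can be sketched with an in-memory stand-in for the database. Everything here is illustrative: `db`, `markOrderPaid`, `runOutboxWorker`, and the `send_receipt` effect are hypothetical names, and a real implementation would perform both writes inside one SQL transaction.

```javascript
// In-memory stand-in for a database with an orders table and an outbox table.
const db = { orders: new Map(), outbox: [] };

// Steps 1-2: write the domain change and the outbox record together, then commit.
// Here the two writes simply happen back to back; a real database transaction
// is what makes them atomic.
function markOrderPaid(orderId) {
  db.orders.set(orderId, { id: orderId, status: 'paid' });
  db.outbox.push({ id: `outbox-${orderId}`, type: 'send_receipt', orderId, status: 'pending' });
}

// Steps 3-4: a worker executes pending side effects and marks them done.
function runOutboxWorker(sendEffect) {
  for (const row of db.outbox) {
    if (row.status !== 'pending') continue;
    try {
      sendEffect(row);      // external call; may throw
      row.status = 'done';
    } catch (err) {
      // Leave the row pending so the next worker run retries it.
    }
  }
}

markOrderPaid('o1');
const sent = [];
let attempts = 0;
const flakySend = (row) => {
  attempts += 1;
  if (attempts === 1) throw new Error('mail server down');
  sent.push(row.orderId);
};
runOutboxWorker(flakySend); // first attempt fails, row stays pending
runOutboxWorker(flakySend); // retry succeeds
runOutboxWorker(flakySend); // nothing left to do
console.log(sent.length, db.outbox[0].status); // 1 done
```

If a worker crashes after the side effect but before marking the row done, the effect runs again on the next pass, which is why the side effect itself should also be idempotent.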
Common Mistakes (Signature Verification)
- Parsing JSON before verification.
- Ignoring header casing.
- Using the system clock blindly.
Best practices
- Verify against the raw body.
- Allow a small clock skew when checking timestamps.
- Fail closed: if verification fails, reject the request and do not process or retry it internally.
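The signature-verification practices above can be sketched as follows. The scheme is an assumption, not any particular provider's: an HMAC-SHA256 over `timestamp.rawBody`, hex encoded, with a 300-second tolerance. The names `sign`, `isSignatureValid`, and `TOLERANCE_SECONDS` are hypothetical; check your provider's documentation for the real format.

```javascript
const crypto = require('node:crypto');

const TOLERANCE_SECONDS = 300; // assumed clock-skew allowance

// Assumed scheme: hex(HMAC-SHA256(secret, `${timestamp}.` + rawBody)).
function sign(secret, timestamp, rawBody) {
  return crypto
    .createHmac('sha256', secret)
    .update(`${timestamp}.`)
    .update(rawBody) // raw bytes, never re-serialized JSON
    .digest('hex');
}

function isSignatureValid({ secret, rawBody, timestamp, signature, now = Date.now() }) {
  const ts = Number(timestamp);
  if (!Number.isFinite(ts)) return false;
  // Reject stale or far-future timestamps, allowing a small clock skew.
  if (Math.abs(now / 1000 - ts) > TOLERANCE_SECONDS) return false;

  const expected = Buffer.from(sign(secret, timestamp, rawBody), 'hex');
  const received = Buffer.from(String(signature), 'hex');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (expected.length !== received.length) return false;
  return crypto.timingSafeEqual(expected, received);
}

const secret = 'test_secret';
const body = Buffer.from('{"id":"evt_1"}');
const timestamp = '1700000000';
const signature = sign(secret, timestamp, body);
const now = 1700000060 * 1000; // 60 seconds after the timestamp

console.log(isSignatureValid({ secret, rawBody: body, timestamp, signature, now })); // true
console.log(isSignatureValid({ secret, rawBody: Buffer.from('{"id":"evt_2"}'), timestamp, signature, now })); // false
```

The function returns a boolean and never throws on malformed input, so the caller can fail closed with a single check.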
Observability & Auditing
You need to answer three questions for every webhook:
- Did we receive it?
- Did we process it?
- What did it change?
Minimum requirements
- Event ID traceable across logs.
- Processing status persisted.
- Dead‑letter queue for failures.
If you can’t answer these three questions in under 5 minutes, your system is blind. Missing any of the minimum requirements above can lead to bugs you can neither diagnose nor undo.
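A minimal sketch of persisted processing status plus a dead-letter queue might look like this. The in-memory `statusByEventId` map and `deadLetters` array stand in for durable storage, and `processWithAudit`, `recordStatus`, and `MAX_ATTEMPTS` are hypothetical names chosen for illustration.

```javascript
// In-memory stand-ins for a status table and a dead-letter queue.
const statusByEventId = new Map();
const deadLetters = [];
const MAX_ATTEMPTS = 3; // assumed retry budget

function recordStatus(eventId, patch) {
  const prev = statusByEventId.get(eventId) ?? { eventId, attempts: 0 };
  statusByEventId.set(eventId, { ...prev, ...patch });
}

// `handler` returns a description of what it changed, or throws on failure.
function processWithAudit(event, handler) {
  recordStatus(event.id, { status: 'received' }); // Did we receive it?
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      const changes = handler(event);
      // Did we process it, and what did it change?
      recordStatus(event.id, { status: 'processed', attempts: attempt, changes });
      return;
    } catch (err) {
      recordStatus(event.id, { status: 'failed', attempts: attempt, lastError: err.message });
    }
  }
  deadLetters.push(event); // retries exhausted: park the event for inspection
  recordStatus(event.id, { status: 'dead_lettered' });
}

processWithAudit({ id: 'evt_1' }, () => ['order o1: pending -> paid']);
processWithAudit({ id: 'evt_2' }, () => { throw new Error('db timeout'); });

console.log(statusByEventId.get('evt_1').status); // processed
console.log(statusByEventId.get('evt_2').status, deadLetters.length); // dead_lettered 1
```

Keying every status row by the event ID is what lets a single lookup answer all three questions for a given webhook.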
Conclusion
Webhooks are not callbacks; they are untrusted, replayable messages. Once you treat them as such—storing raw payloads, enforcing idempotency, handling out‑of‑order delivery, and using an outbox for side effects—they become boring, reliable infrastructure. And boring infrastructure is the goal.