Line of Defense: Three Systems, Not One
Source: Dev.to
The Three Mechanisms
“Rate limiting” is often used as a catch‑all for anything that rejects or slows down requests. In reality there are three distinct mechanisms, each protecting against a different failure mode and each asking a different question.
| Mechanism | Question it asks | What it protects |
|---|---|---|
| Load shedding | “Is this server healthy enough to handle any request?” | The server protects itself |
| Rate limiting | “Is this caller sending too many requests?” | The system is protected from abusive callers |
| Adaptive throttling | “Is the downstream struggling right now?” | Downstream services are protected from this server |
A rate limiter won’t save you when your server is OOM‑ing — every user is within their quota, but the server is dying.
Load shedding won’t stop one customer from consuming 80 % of your capacity — total concurrency is fine, the distribution is unfair.
Neither will prevent you from hammering a downstream service that’s already struggling.
These are complementary systems. Treating them as one thing—or building only one of the three—leaves gaps that appear exactly when you need protection most.
Layer 1 – Load Shedding
Protects this server from itself.
- Is memory pressure too high?
- Are there too many concurrent requests?
- Did a downstream just return RESOURCE_EXHAUSTED?
If any of these are true, reject immediately—doesn’t matter who the user is or what the request is. The building is at capacity.
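The checks above can be sketched as a minimal in-process gate. This is a sketch under assumptions: the class name, threshold, and lock-based counter are illustrative, and a real implementation would also watch memory/GC pressure.

```python
import threading

class LoadShedder:
    """Minimal load-shedding gate: one counter check per request.
    Illustrative sketch; the threshold is arbitrary."""

    def __init__(self, max_concurrent=1000):
        self.max_concurrent = max_concurrent
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_enter(self):
        # Cheap check, done before auth or parsing: is the building full?
        with self._lock:
            if self._in_flight >= self.max_concurrent:
                return False  # shed the request immediately
            self._in_flight += 1
            return True

    def leave(self):
        # Called when the request finishes, success or failure.
        with self._lock:
            self._in_flight -= 1
```

The entire cost of a shed request is one lock acquisition and one comparison, which is what makes it safe to run first.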
Layer 2 – Rate Limiting
Protects the system from abusive users.
- Is this specific user, API key, or IP address sending more than its allowed share?
This is the classic rate limiter—per‑user counters, sliding windows, token buckets.
Layer 3 – Adaptive Throttling
Protects downstream services from this server.
- The server tracks its success rate when calling each downstream.
- If 20 % of calls to the payment service are failing, it starts probabilistically dropping 20 % of outbound calls—giving the payment service breathing room to recover.
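orig_marker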
Why the Order Matters
- Load shedding runs at the highest priority—before authentication, before request parsing, before anything else.
- If rate limiting (Layer 2) runs first, the server spends CPU checking Redis counters, computing sliding‑window math, and doing per‑user lookups. Then it reaches Layer 1, which says “actually the server is dying, reject everything.” All that work was wasted.
- Load shedding is cheap—one atomic counter check or a GC‑flag read. It takes microseconds. Rate limiting might require a Redis round‑trip. Run the cheap check first.
- Analogy – Think of a nightclub: the fire marshal at the door (load shedding) doesn’t check IDs. “Building is at capacity. Nobody gets in.” Only if the building isn’t full does the bouncer (rate limiter) check your guest list.
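One way to wire the three layers in cost order is a single handler that consults each in turn. The interfaces here are illustrative duck typing, not a specific framework; any objects with these methods would do.

```python
def handle_request(shedder, limiter, throttle, user_id, call_downstream):
    """Run the three layers cheapest-first. Illustrative wiring only."""
    # Layer 1: load shedding -- one counter check, microseconds.
    if not shedder.try_enter():
        return "503 overloaded"
    try:
        # Layer 2: rate limiting -- per-user lookup (often a Redis trip).
        if not limiter.allow(user_id):
            return "429 too many requests"
        # Layer 3: adaptive throttling -- protect the downstream.
        if not throttle.should_send():
            return "503 downstream recovering"
        ok = call_downstream()
        throttle.record(ok)     # feed the outcome back to Layer 3
        return "200" if ok else "502"
    finally:
        shedder.leave()
```

Note that the expensive per-user work never runs when Layer 1 has already decided the server is dying, which is exactly the ordering argument above.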
Illustrative Scenarios
| Situation | What Layer 1 Does | What Layer 2 Does | What Layer 3 Does |
|---|---|---|---|
| Bad deployment – new ML model eats 3× memory | Detects GC pressure spike, starts shedding | Blind – every user is within limit | Blind – downstream fine |
| One customer spikes 10× – migration script bug | May eventually catch if overall concurrency exceeds limit | Catches immediately – per‑user counter crosses threshold | Blind – downstream fine |
| Downstream payment service degrades – returns RESOURCE_EXHAUSTED on 40 % of calls | Reactive backoff on those responses | Blind – users are within limits | Probabilistically drops outbound calls to give the service room |
| DDoS – thousands of IPs, each moderate traffic | Catches total concurrency spike | Catches per‑IP limits (if set) | Blind – inbound problem |
| Slow dependency – DB query goes from 5 ms to 2 s | Sees concurrent request count spike toward limit | Blind – users are within limits | May not see errors (slow responses aren’t errors) |
| Both Layer 1 & Layer 2 fail | — | — | Still prevents cascade into downstream services |
Takeaway: No single layer handles everything. They are complementary, not redundant. If one layer fails, the others still provide protection.
Rate Limiting Is Not One Tool – It’s Two
| Approach | What It Does | Caller Experience |
|---|---|---|
| Rejection | Returns 429 Too Many Requests. The request is over the limit and is rejected. | Caller must handle the error. |
| Delay (queueing) | Holds the request in a queue and releases it when the rate allows. The request is delayed, not rejected. | Caller sees a slower response but no error. |
Both achieve the same goal—enforcing a rate—but they provide completely different experiences.
The key question: When do you reject, and when do you delay?
- Reject when an external connection is being held open (e.g., a user’s HTTP connection).
- Delay when you can safely buffer the request and release it later without breaking the client’s expectations.
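Both behaviors can come from the same pacing mechanism, differing only in what happens when the next slot is in the future. This is a sketch under assumptions: the class and method names are illustrative, and a real delaying limiter would also bound how long a request may wait.

```python
import time

class PacedLimiter:
    """One rate, two behaviors: `try_acquire` rejects immediately,
    `acquire` sleeps until the next slot is due. Illustrative sketch."""

    def __init__(self, rate):
        self.interval = 1.0 / rate       # seconds between permitted calls
        self.next_free = time.monotonic()

    def try_acquire(self):
        # Reject mode: if our slot is in the future, fail fast (429-style).
        now = time.monotonic()
        if now < self.next_free:
            return False
        self.next_free = max(self.next_free, now) + self.interval
        return True

    def acquire(self):
        # Delay mode: claim the next slot and sleep until it arrives.
        now = time.monotonic()
        wait = self.next_free - now
        self.next_free = max(self.next_free, now) + self.interval
        if wait > 0:
            time.sleep(wait)
```

A public API endpoint would call `try_acquire` and return 429 on `False`; a background job calling a quota-limited service would call `acquire` and simply run slower.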
TL;DR
- Load shedding → protects the server itself.
- Rate limiting → protects the system from abusive callers.
- Adaptive throttling → protects downstream services.
Run them in order (load shedding → rate limiting → adaptive throttling) to maximize efficiency and resilience.
Rate‑Limiting: Reject or Delay?
The rule is simple:
Reject when the caller is holding a connection open while it waits.
Delay when you can afford to wait.
Below are common situations that illustrate why the choice matters.
1. Connection‑Pool Exhaustion
“You’re holding that connection — which means a thread, a socket, memory.
Delay 500 users and you’ve exhausted your connection pool. Now legitimate users who are under the limit can’t get a connection.
Your rate limiter just caused an outage for good users by being too nice to bad ones.
Reject fast. Free the connection. Let the client’s retry logic handle it.”
2. External API Rate Limits (e.g., Stripe)
“Delay when your own system needs the request to succeed.
You’re calling Stripe’s payment API. You know their limit: 100 req/s.
The 101st request doesn’t need to fail — it just needs to wait 10 ms for the next second’s budget.
If you reject it instead, you need retry logic, back‑off timers, dead‑letter queues, monitoring for the retries — an entire infrastructure to handle a problem that ‘just wait’ solves.”
3. Public API Burst Traffic
“Your public API gets a burst from a customer. Reject. Return 429 instantly.
The customer’s SDK has built‑in retry with exponential back‑off. Your server processes the rejection in microseconds and moves on.
If you delayed instead, 500 connections stay open, the connection pool starves, and everyone experiences an outage.”
4. Bulk Email Sends (SendGrid)
“You’re sending 50 000 marketing emails through SendGrid.
Delay. SendGrid allows 500 req/s. Queue all 50 000, drip them at 500 /s → takes 100 s, every email delivered.
If you rejected instead, 49 500 emails bounce in the first second. You’d then need a dead‑letter queue and retry scheduling for a problem that ‘wait your turn’ solves completely.”
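The “queue and drip” approach here can be as simple as a paced loop: 50 000 items at 500/s is about 100 seconds of sleeping between sends. A minimal sketch, where `send` stands in for the real API call (one email, one record, one request):

```python
import time

def drip_send(items, send, rate):
    """Pace a bulk job at `rate` calls/sec instead of rejecting overflow.
    `send` is a placeholder for the real per-item API call."""
    interval = 1.0 / rate
    for item in items:
        start = time.monotonic()
        send(item)
        # Sleep off whatever time remains in this item's slot,
        # so the long-run rate never exceeds `rate`.
        elapsed = time.monotonic() - start
        if elapsed < interval:
            time.sleep(interval - elapsed)
```

Every item is eventually delivered, the provider never sees a spike, and no retry or dead-letter infrastructure is needed.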
5. gRPC Internal Traffic
“Your gRPC server receives internal traffic from an upstream service. Reject. Return RESOURCE_EXHAUSTED.
The upstream’s adaptive throttler (Layer 3 on their side) sees the error and automatically backs off. The system self‑heals.
If you delayed instead, the upstream’s gRPC deadline expires while its request sits in your queue. Timeout errors are worse than clean rejections — the upstream can’t tell ‘server is slow’ from ‘I’m being rate‑limited’.”
6. Batch Job Scraping a Partner API
“A batch job scrapes 10 000 records from a partner API nightly.
Delay. Partner allows 50 req/s. Pace it perfectly → 3.3 min, all requests succeed, partner never sees a spike.
If you rejected instead, 9 950 requests fail immediately, retry logic fires, and you hammer the partner for 20 min instead of a clean 3‑minute crawl.”
7. User‑Facing Payment Endpoint
“A user calls your payment endpoint during checkout. Reject.
The user sees a button that says ‘Pay Now’. A 200 ms rejection with a ‘please try again’ message is infinitely better than a 5‑second delay where they think the page froze, hit refresh, and trigger a duplicate payment.”
TL;DR
| Situation | Action | Why |
|---|---|---|
| Caller just needs a free connection | Reject (e.g., 429, RESOURCE_EXHAUSTED) | Frees resources instantly; client can retry with back‑off |
| You can afford to wait for quota or pacing | Delay (queue, sleep, token bucket) | Guarantees successful processing without extra retry infrastructure |
| External service has a known rate limit | Delay until budget is available | Avoids unnecessary failures and downstream retry storms |
| User‑experience is latency‑sensitive | Reject quickly with a clear message | Prevents UI hangs and duplicate actions |