Line of Defense: Three Systems, Not One

Published: February 27, 2026 at 11:50 PM EST
8 min read
Source: Dev.to

The Three Mechanisms

“Rate limiting” is often used as a catch‑all for anything that rejects or slows down requests. In reality there are three distinct mechanisms, each protecting against a different failure mode and each asking a different question.

| Mechanism | Question it asks | What it protects |
| --- | --- | --- |
| Load shedding | "Is this server healthy enough to handle any request?" | The server protects itself |
| Rate limiting | "Is this caller sending too many requests?" | The system is protected from abusive callers |
| Adaptive throttling | "Is the downstream struggling right now?" | Downstream services are protected from this server |

A rate limiter won’t save you when your server is OOM‑ing — every user is within their quota, but the server is dying.
Load shedding won’t stop one customer from consuming 80 % of your capacity — total concurrency is fine, the distribution is unfair.
Neither will prevent you from hammering a downstream service that’s already struggling.

These are complementary systems. Treating them as one thing—or building only one of the three—leaves gaps that appear exactly when you need protection most.

Layer 1 – Load Shedding

Protects this server from itself.

  • Is memory pressure too high?
  • Are there too many concurrent requests?
  • Did a downstream just return RESOURCE_EXHAUSTED?

If any of these are true, reject immediately—doesn’t matter who the user is or what the request is. The building is at capacity.
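The checks above can be sketched as a tiny admission gate. This is a minimal in-process sketch, not a production implementation: `MAX_CONCURRENT` is an illustrative number, and a real server would also consult memory-pressure or GC signals.

```python
import threading

MAX_CONCURRENT = 3  # illustrative capacity limit; real servers use far larger values

_in_flight = 0
_lock = threading.Lock()

def try_admit():
    """Layer 1 gate: admit the request only if the server has capacity.

    Cheap by design: one lock-protected counter check, no I/O, no per-user
    lookup. Returns False to shed the request immediately.
    """
    global _in_flight
    with _lock:
        if _in_flight >= MAX_CONCURRENT:
            return False  # shed: the building is at capacity
        _in_flight += 1
        return True

def release():
    """Call when the request finishes, freeing one slot."""
    global _in_flight
    with _lock:
        _in_flight -= 1
```

Note that the gate never looks at who the caller is; that question belongs to Layer 2.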

Layer 2 – Rate Limiting

Protects the system from abusive users.

  • Is this specific user, API key, or IP address sending more than its allowed share?

This is the classic rate limiter—per‑user counters, sliding windows, token buckets.
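As a concrete example, here is a minimal single-process token bucket. In practice the counters would usually live in a shared store such as Redis (one bucket per user, API key, or IP); this sketch keeps everything in memory to show the mechanism.

```python
import time

class TokenBucket:
    """Per-caller token bucket: refills at `rate` tokens/sec, bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        """Refill based on elapsed time, then spend one token if available."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the per-caller limit: respond with 429
```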

Layer 3 – Adaptive Throttling

Protects downstream services from this server.

  • The server tracks its success rate when calling each downstream.
  • If 20 % of calls to the payment service are failing, it starts probabilistically dropping 20 % of outbound calls—giving the payment service breathing room to recover.
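The probabilistic drop described above can be sketched as follows. This is a simplified illustration: production implementations often derive the drop probability from request and accept counts (as in the client-side throttling described in the Google SRE book), but the proportional idea is the same.

```python
import random
from collections import deque

class AdaptiveThrottle:
    """Client-side throttle for one downstream: drop outbound calls in
    proportion to the downstream's recent failure rate."""

    def __init__(self, window=100):
        self.outcomes = deque(maxlen=window)  # True = success, sliding window

    def record(self, success):
        """Record the outcome of a completed call to the downstream."""
        self.outcomes.append(success)

    def should_send(self):
        """Probabilistically drop: a 20%-failing downstream sees ~20% fewer calls."""
        if not self.outcomes:
            return True
        failure_rate = 1 - sum(self.outcomes) / len(self.outcomes)
        return random.random() >= failure_rate
```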

Why the Order Matters

  1. Load shedding runs at the highest priority—before authentication, before request parsing, before anything else.
  2. If rate limiting (Layer 2) runs first, the server spends CPU checking Redis counters, computing sliding‑window math, and doing per‑user lookups. Then it reaches Layer 1, which says “actually the server is dying, reject everything.” All that work was wasted.
  3. Load shedding is cheap—one atomic counter check or a GC‑flag read. It takes microseconds. Rate limiting might require a Redis round‑trip. Run the cheap check first.
  4. Analogy – Think of a nightclub: the fire marshal at the door (load shedding) doesn’t check IDs. “Building is at capacity. Nobody gets in.” Only if the building isn’t full does the bouncer (rate limiter) check your guest list.
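The fire-marshal-before-bouncer ordering can be sketched as a handler that runs the cheapest check first. The check functions here are hypothetical stubs standing in for the real layers, not actual middleware:

```python
# Hypothetical stubs; in a real server these would be the atomic counter
# check, the Redis-backed rate limiter, and the adaptive throttle.
def server_has_capacity():
    return True

def within_user_quota(user):
    return user != "abuser"

class StubThrottle:
    def should_send(self):
        return True

downstream_throttle = StubThrottle()

def handle(request):
    """Run the three layers in order, cheapest and broadest first."""
    if not server_has_capacity():               # Layer 1: microseconds, no lookups
        return (503, "overloaded")
    if not within_user_quota(request["user"]):  # Layer 2: per-user counters
        return (429, "rate limited")
    if not downstream_throttle.should_send():   # Layer 3: protect the downstream
        return (503, "downstream unavailable")
    return (200, "ok")
```

When Layer 1 rejects, no per-user work has been spent on the doomed request.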

Illustrative Scenarios

| Situation | What Layer 1 Does | What Layer 2 Does | What Layer 3 Does |
| --- | --- | --- | --- |
| Bad deployment – new ML model eats 3× memory | Detects GC pressure spike, starts shedding | Blind – every user is within limits | Blind – downstream fine |
| One customer spikes 10× – migration script bug | May eventually catch it if overall concurrency exceeds the limit | Catches it immediately – per-user counter crosses threshold | Blind – downstream fine |
| Downstream payment service degrades – returns RESOURCE_EXHAUSTED on 40 % of calls | Reactive backoff on those responses | Blind – users are within limits | Probabilistically drops outbound calls to give the service room |
| DDoS – thousands of IPs, each with moderate traffic | Catches the total concurrency spike | Catches per-IP limits (if set) | Blind – inbound problem |
| Slow dependency – DB query goes from 5 ms to 2 s | Sees concurrent request count spike toward the limit | Blind – users are within limits | May not see errors (slow responses aren't errors) |
| Both Layer 1 & Layer 2 fail | – | – | Still prevents a cascade into downstream services |

Takeaway: No single layer handles everything. They are complementary, not redundant. If one layer fails, the others still provide protection.

Rate Limiting Is Not One Tool – It’s Two

| Approach | What it does | Caller experience |
| --- | --- | --- |
| Rejection | Returns 429 Too Many Requests; the request is over the limit and is rejected. | Caller must handle the error. |
| Delay (queueing) | Holds the request in a queue and releases it when the rate allows; the request is delayed, not rejected. | Caller sees a slower response but no error. |

Both achieve the same goal—enforcing a rate—but they provide completely different experiences.

The key question: When do you reject, and when do you delay?

  • Reject when an external connection is being held open (e.g., a user’s HTTP connection).
  • Delay when you can safely buffer the request and release it later without breaking the client’s expectations.
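Both behaviors can live on one limiter, exposed as two methods. A minimal sketch using simple interval pacing (names like `Pacer` are illustrative, not a real library API):

```python
import time

class Pacer:
    """One rate limit, two modes: try_acquire rejects instantly,
    acquire sleeps until the next slot is free."""

    def __init__(self, rate):
        self.interval = 1.0 / rate
        self.next_slot = time.monotonic()

    def try_acquire(self):
        """Reject mode: non-blocking, for requests holding open connections."""
        now = time.monotonic()
        if now < self.next_slot:
            return False  # caller gets a 429 and retries with back-off
        self.next_slot = now + self.interval
        return True

    def acquire(self):
        """Delay mode: block until the rate allows, for bufferable work."""
        wait = self.next_slot - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        self.next_slot = max(self.next_slot, time.monotonic()) + self.interval
```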

TL;DR

  • Load shedding → protects the server itself.
  • Rate limiting → protects the system from abusive callers.
  • Adaptive throttling → protects downstream services.

Run them in order (load shedding → rate limiting → adaptive throttling) to maximize efficiency and resilience.

Rate‑Limiting: Reject or Delay?

The rule is simple:

Reject when the caller is holding a connection open.
Delay when you can afford to wait.

Below are common situations that illustrate why the choice matters.

1. Connection‑Pool Exhaustion

“You’re holding that connection — which means a thread, a socket, memory.
Delay 500 users and you’ve exhausted your connection pool. Now legitimate users who are under the limit can’t get a connection.
Your rate limiter just caused an outage for good users by being too nice to bad ones.
Reject fast. Free the connection. Let the client’s retry logic handle it.”

2. External API Rate Limits (e.g., Stripe)

“Delay when your own system needs the request to succeed.
You’re calling Stripe’s payment API. You know their limit: 100 req/s.
The 101st request doesn’t need to fail — it just needs to wait 10 ms for the next second’s budget.
If you reject it instead, you need retry logic, back‑off timers, dead‑letter queues, monitoring for the retries — an entire infrastructure to handle a problem that ‘just wait’ solves.”

3. Public API Burst Traffic

“Your public API gets a burst from a customer. Reject. Return 429 instantly.
The customer’s SDK has built‑in retry with exponential back‑off. Your server processes the rejection in microseconds and moves on.
If you delayed instead, 500 connections stay open, the connection pool starves, and everyone experiences an outage.”

4. Bulk Email Sends (SendGrid)

“You’re sending 50 000 marketing emails through SendGrid.
Delay. SendGrid allows 500 req/s. Queue all 50 000, drip them at 500 /s → takes 100 s, every email delivered.
If you rejected instead, 49 500 emails bounce in the first second. You’d then need a dead‑letter queue and retry scheduling for a problem that ‘wait your turn’ solves completely.”
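The drip loop above is a few lines of code. In this sketch `send` and `sleep` are injected placeholders (the real versions would be the SendGrid client call and `time.sleep`); the arithmetic matches the text: 50 000 items at 500/s is 100 s of pacing.

```python
def drip(items, rate, send, sleep):
    """Release `items` at `rate` per second; every item is delivered,
    no retries or dead-letter queues needed."""
    interval = 1.0 / rate
    for item in items:
        send(item)
        sleep(interval)  # pace to stay under the provider's limit
```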

5. gRPC Internal Traffic

“Your gRPC server receives internal traffic from an upstream service. Reject. Return RESOURCE_EXHAUSTED.
The upstream’s adaptive throttler (Layer 3 on their side) sees the error and automatically backs off. The system self‑heals.
If you delayed instead, the upstream’s gRPC deadline expires while its request sits in your queue. Timeout errors are worse than clean rejections — the upstream can’t tell ‘server is slow’ from ‘I’m being rate‑limited’.”

6. Batch Job Scraping a Partner API

“A batch job scrapes 10 000 records from a partner API nightly.
Delay. Partner allows 50 req/s. Pace it perfectly → 3.3 min, all requests succeed, partner never sees a spike.
If you rejected instead, 9 950 requests fail immediately, retry logic fires, and you hammer the partner for 20 min instead of a clean 3‑minute crawl.”

7. User‑Facing Payment Endpoint

“A user calls your payment endpoint during checkout. Reject.
The user sees a button that says ‘Pay Now’. A 200 ms rejection with a ‘please try again’ message is infinitely better than a 5‑second delay where they think the page froze, hit refresh, and trigger a duplicate payment.”

TL;DR

| Situation | Action | Why |
| --- | --- | --- |
| Caller is holding a connection open | Reject (e.g., 429, RESOURCE_EXHAUSTED) | Frees resources instantly; client can retry with back-off |
| You can afford to wait for quota or pacing | Delay (queue, sleep, token bucket) | Guarantees successful processing without extra retry infrastructure |
| External service has a known rate limit | Delay until budget is available | Avoids unnecessary failures and downstream retry storms |
| User experience is latency-sensitive | Reject quickly with a clear message | Prevents UI hangs and duplicate actions |