Line of Defense: Three Systems, Not One

Published: February 27, 2026 at 11:50 PM EST
8 min read
Source: Dev.to

The Three Mechanisms

“Rate limiting” is often used as a catch‑all for anything that rejects or slows down requests. In reality there are three distinct mechanisms, each protecting against a different failure mode and each asking a different question.

| Mechanism | Question it asks | What it protects |
| --- | --- | --- |
| Load shedding | "Is this server healthy enough to handle any request?" | The server protects itself |
| Rate limiting | "Is this caller sending too many requests?" | The system is protected from abusive callers |
| Adaptive throttling | "Is the downstream struggling right now?" | Downstream services are protected from this server |

A rate limiter won’t save you when your server is OOM‑ing — every user is within their quota, but the server is dying.
Load shedding won’t stop one customer from consuming 80 % of your capacity — total concurrency is fine, the distribution is unfair.
Neither will prevent you from hammering a downstream service that’s already struggling.

These are complementary systems. Treating them as one thing—or building only one of the three—leaves gaps that appear exactly when you need protection most.

Layer 1 – Load Shedding

Protects this server from itself.

  • Is memory pressure too high?
  • Are there too many concurrent requests?
  • Did a downstream just return RESOURCE_EXHAUSTED?

If any of these are true, reject immediately—doesn’t matter who the user is or what the request is. The building is at capacity.
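The checks above can be sketched as a tiny admission gate. This is a minimal in-process sketch, not a production implementation: `MAX_CONCURRENT` is an illustrative number, and a real server would also consult memory-pressure or GC signals.

```python
import threading

MAX_CONCURRENT = 3  # illustrative capacity limit; real servers use far larger values

_in_flight = 0
_lock = threading.Lock()

def try_admit():
    """Layer 1 gate: admit the request only if the server has capacity.

    Cheap by design: one lock-protected counter check, no I/O, no per-user
    lookup. Returns False to shed the request immediately.
    """
    global _in_flight
    with _lock:
        if _in_flight >= MAX_CONCURRENT:
            return False  # shed: the building is at capacity
        _in_flight += 1
        return True

def release():
    """Call when the request finishes, freeing one slot."""
    global _in_flight
    with _lock:
        _in_flight -= 1
```

Note that the gate never looks at who the caller is; that question belongs to Layer 2.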

Layer 2 – Rate Limiting

Protects the system from abusive users.

  • Is this specific user, API key, or IP address sending more than its allowed share?

This is the classic rate limiter—per‑user counters, sliding windows, token buckets.
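As a concrete example, here is a minimal single-process token bucket. In practice the counters would usually live in a shared store such as Redis (one bucket per user, API key, or IP); this sketch keeps everything in memory to show the mechanism.

```python
import time

class TokenBucket:
    """Per-caller token bucket: refills at `rate` tokens/sec, bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        """Refill based on elapsed time, then spend one token if available."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the per-caller limit: respond with 429
```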

Layer 3 – Adaptive Throttling

Protects downstream services from this server.

  • The server tracks its success rate when calling each downstream.
  • If 20 % of calls to the payment service are failing, it starts probabilistically dropping 20 % of outbound calls—giving the payment service breathing room to recover.
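The probabilistic drop described above can be sketched as follows. This is a simplified illustration: production implementations often derive the drop probability from request and accept counts (as in the client-side throttling described in the Google SRE book), but the proportional idea is the same.

```python
import random
from collections import deque

class AdaptiveThrottle:
    """Client-side throttle for one downstream: drop outbound calls in
    proportion to the downstream's recent failure rate."""

    def __init__(self, window=100):
        self.outcomes = deque(maxlen=window)  # True = success, sliding window

    def record(self, success):
        """Record the outcome of a completed call to the downstream."""
        self.outcomes.append(success)

    def should_send(self):
        """Probabilistically drop: a 20%-failing downstream sees ~20% fewer calls."""
        if not self.outcomes:
            return True
        failure_rate = 1 - sum(self.outcomes) / len(self.outcomes)
        return random.random() >= failure_rate
```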

Why the Order Matters

  1. Load shedding runs at the highest priority—before authentication, before request parsing, before anything else.
  2. If rate limiting (Layer 2) runs first, the server spends CPU checking Redis counters, computing sliding‑window math, and doing per‑user lookups. Then it reaches Layer 1, which says “actually the server is dying, reject everything.” All that work was wasted.
  3. Load shedding is cheap—one atomic counter check or a GC‑flag read. It takes microseconds. Rate limiting might require a Redis round‑trip. Run the cheap check first.
  4. Analogy – Think of a nightclub: the fire marshal at the door (load shedding) doesn’t check IDs. “Building is at capacity. Nobody gets in.” Only if the building isn’t full does the bouncer (rate limiter) check your guest list.
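The fire-marshal-before-bouncer ordering can be sketched as a handler that runs the cheapest check first. The check functions here are hypothetical stubs standing in for the real layers, not actual middleware:

```python
# Hypothetical stubs; in a real server these would be the atomic counter
# check, the Redis-backed rate limiter, and the adaptive throttle.
def server_has_capacity():
    return True

def within_user_quota(user):
    return user != "abuser"

class StubThrottle:
    def should_send(self):
        return True

downstream_throttle = StubThrottle()

def handle(request):
    """Run the three layers in order, cheapest and broadest first."""
    if not server_has_capacity():               # Layer 1: microseconds, no lookups
        return (503, "overloaded")
    if not within_user_quota(request["user"]):  # Layer 2: per-user counters
        return (429, "rate limited")
    if not downstream_throttle.should_send():   # Layer 3: protect the downstream
        return (503, "downstream unavailable")
    return (200, "ok")
```

When Layer 1 rejects, no per-user work has been spent on the doomed request.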

Illustrative Scenarios

| Situation | What Layer 1 Does | What Layer 2 Does | What Layer 3 Does |
| --- | --- | --- | --- |
| Bad deployment – new ML model eats 3× memory | Detects GC pressure spike, starts shedding | Blind – every user is within limits | Blind – downstream fine |
| One customer spikes 10× – migration script bug | May eventually catch it if overall concurrency exceeds the limit | Catches it immediately – per-user counter crosses threshold | Blind – downstream fine |
| Downstream payment service degrades – returns RESOURCE_EXHAUSTED on 40 % of calls | Reactive backoff on those responses | Blind – users are within limits | Probabilistically drops outbound calls to give the service room |
| DDoS – thousands of IPs, each with moderate traffic | Catches the total concurrency spike | Catches per-IP limits (if set) | Blind – inbound problem |
| Slow dependency – DB query goes from 5 ms to 2 s | Sees concurrent request count spike toward the limit | Blind – users are within limits | May not see errors (slow responses aren't errors) |
| Both Layer 1 & Layer 2 fail | – | – | Still prevents a cascade into downstream services |

Takeaway: No single layer handles everything. They are complementary, not redundant. If one layer fails, the others still provide protection.

Rate Limiting Is Not One Tool – It’s Two

| Approach | What it does | Caller experience |
| --- | --- | --- |
| Rejection | Returns 429 Too Many Requests; the request is over the limit and is rejected. | Caller must handle the error. |
| Delay (queueing) | Holds the request in a queue and releases it when the rate allows; the request is delayed, not rejected. | Caller sees a slower response but no error. |

Both achieve the same goal—enforcing a rate—but they provide completely different experiences.

The key question: When do you reject, and when do you delay?

  • Reject when an external connection is being held open (e.g., a user’s HTTP connection).
  • Delay when you can safely buffer the request and release it later without breaking the client’s expectations.
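Both behaviors can live on one limiter, exposed as two methods. A minimal sketch using simple interval pacing (names like `Pacer` are illustrative, not a real library API):

```python
import time

class Pacer:
    """One rate limit, two modes: try_acquire rejects instantly,
    acquire sleeps until the next slot is free."""

    def __init__(self, rate):
        self.interval = 1.0 / rate
        self.next_slot = time.monotonic()

    def try_acquire(self):
        """Reject mode: non-blocking, for requests holding open connections."""
        now = time.monotonic()
        if now < self.next_slot:
            return False  # caller gets a 429 and retries with back-off
        self.next_slot = now + self.interval
        return True

    def acquire(self):
        """Delay mode: block until the rate allows, for bufferable work."""
        wait = self.next_slot - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        self.next_slot = max(self.next_slot, time.monotonic()) + self.interval
```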

TL;DR

  • Load shedding → protects the server itself.
  • Rate limiting → protects the system from abusive callers.
  • Adaptive throttling → protects downstream services.

Run them in order (load shedding → rate limiting → adaptive throttling) to maximize efficiency and resilience.

Rate‑Limiting: Reject or Delay?

The rule is simple:

Reject when the caller is holding a connection open.
Delay when you can afford to wait.

Below are common situations that illustrate why the choice matters.

1. Connection‑Pool Exhaustion

“You’re holding that connection — which means a thread, a socket, memory.
Delay 500 users and you’ve exhausted your connection pool. Now legitimate users who are under the limit can’t get a connection.
Your rate limiter just caused an outage for good users by being too nice to bad ones.
Reject fast. Free the connection. Let the client’s retry logic handle it.”

2. External API Rate Limits (e.g., Stripe)

“Delay when your own system needs the request to succeed.
You’re calling Stripe’s payment API. You know their limit: 100 req/s.
The 101st request doesn’t need to fail — it just needs to wait 10 ms for the next second’s budget.
If you reject it instead, you need retry logic, back‑off timers, dead‑letter queues, monitoring for the retries — an entire infrastructure to handle a problem that ‘just wait’ solves.”

3. Public API Burst Traffic

“Your public API gets a burst from a customer. Reject. Return 429 instantly.
The customer’s SDK has built‑in retry with exponential back‑off. Your server processes the rejection in microseconds and moves on.
If you delayed instead, 500 connections stay open, the connection pool starves, and everyone experiences an outage.”

4. Bulk Email Sends (SendGrid)

“You’re sending 50 000 marketing emails through SendGrid.
Delay. SendGrid allows 500 req/s. Queue all 50 000, drip them at 500 /s → takes 100 s, every email delivered.
If you rejected instead, 49 500 emails bounce in the first second. You’d then need a dead‑letter queue and retry scheduling for a problem that ‘wait your turn’ solves completely.”
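The drip loop above is a few lines of code. In this sketch `send` and `sleep` are injected placeholders (the real versions would be the SendGrid client call and `time.sleep`); the arithmetic matches the text: 50 000 items at 500/s is 100 s of pacing.

```python
def drip(items, rate, send, sleep):
    """Release `items` at `rate` per second; every item is delivered,
    no retries or dead-letter queues needed."""
    interval = 1.0 / rate
    for item in items:
        send(item)
        sleep(interval)  # pace to stay under the provider's limit
```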

5. gRPC Internal Traffic

“Your gRPC server receives internal traffic from an upstream service. Reject. Return RESOURCE_EXHAUSTED.
The upstream’s adaptive throttler (Layer 3 on their side) sees the error and automatically backs off. The system self‑heals.
If you delayed instead, the upstream’s gRPC deadline expires while its request sits in your queue. Timeout errors are worse than clean rejections — the upstream can’t tell ‘server is slow’ from ‘I’m being rate‑limited’.”

6. Batch Job Scraping a Partner API

“A batch job scrapes 10 000 records from a partner API nightly.
Delay. Partner allows 50 req/s. Pace it perfectly → 3.3 min, all requests succeed, partner never sees a spike.
If you rejected instead, 9 950 requests fail immediately, retry logic fires, and you hammer the partner for 20 min instead of a clean 3‑minute crawl.”

7. User‑Facing Payment Endpoint

“A user calls your payment endpoint during checkout. Reject.
The user sees a button that says ‘Pay Now’. A 200 ms rejection with a ‘please try again’ message is infinitely better than a 5‑second delay where they think the page froze, hit refresh, and trigger a duplicate payment.”

TL;DR

| Situation | Action | Why |
| --- | --- | --- |
| Caller is holding a connection open | Reject (e.g., 429, RESOURCE_EXHAUSTED) | Frees resources instantly; client can retry with back-off |
| You can afford to wait for quota or pacing | Delay (queue, sleep, token bucket) | Guarantees successful processing without extra retry infrastructure |
| External service has a known rate limit | Delay until budget is available | Avoids unnecessary failures and downstream retry storms |
| User experience is latency-sensitive | Reject quickly with a clear message | Prevents UI hangs and duplicate actions |