API Rate Limits & Throttling: What's Actually Happening and How to Fix It

Published: February 17, 2026 at 02:56 PM EST
11 min read
Source: Dev.to

Rate limiting is the #1 reason AI API calls fail in production

It isn’t a bug – it’s the provider protecting their infrastructure. This guide explains what’s happening, how to read the signals, and how to stop it from breaking your app.


The Scenario

Your app has been running fine for weeks. Then, on a Monday morning, some users start seeing errors. The errors come and go – sometimes the same question works on the second try.

Your logs are full of this:

HTTP 429 — Too Many Requests

You’re being rate‑limited. And if you handle it wrong, you’ll make it worse.


What Is Rate Limiting?

Think of a highway on‑ramp with a traffic light. When too many cars try to merge at once, the light turns red and lets them through one at a time. Nobody’s banned from the highway – they just have to wait their turn.

AI providers (OpenAI, Anthropic, Google, …) work the same way. When too many requests arrive, they start telling some customers: “Slow down.”

That’s a rate limit. It isn’t an error in your code; it’s the provider saying: “I can handle your request, just not right now.”

| Term | What It Means |
|---|---|
| Rate limit | Maximum number of requests allowed in a time window |
| Throttling | The provider actively slowing down or rejecting your requests |
| 429 status code | The HTTP response that means “too many requests” |
| Quota | Your total allocation (per minute, per day, or per month) |

The Three Types of Rate Limits

Most people think there’s only one rate limit. In reality there are three, and they trigger independently.

| Type | What It Limits | Example Limit | How You Hit It |
|---|---|---|---|
| Requests per minute (RPM) | Number of API calls | 60 RPM | Sending too many questions, even short ones |
| Tokens per minute (TPM) | Total tokens processed | 90 000 TPM | Sending fewer requests, but each one is huge (long documents, big prompts) |
| Tokens per day (TPD) | Daily token budget | 1 000 000 TPD | Sustained high usage over hours |

Important: You can hit TPM while staying under RPM. A single request with a 50 000‑token document eats more than half your minute’s budget. You only sent one request – but you’re already throttled. Always check your provider’s current documentation for exact limits – they change frequently and vary by tier.
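A quick back-of-the-envelope check makes this concrete. The numbers below mirror the example limits in the table above — your actual limits will differ by provider and tier:

```python
# Sketch: why one large request can exhaust a TPM budget
# even while RPM usage stays near zero.
TPM_LIMIT = 90_000          # example tokens-per-minute limit from the table
document_tokens = 50_000    # one request carrying a 50 000-token document

used_pct = document_tokens / TPM_LIMIT * 100
print(f"One request used {used_pct:.0f}% of the minute's token budget")

# A second similar request in the same minute would blow past the limit:
print(document_tokens * 2 > TPM_LIMIT)  # True
```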


How to Read a 429 Error

When you get rate‑limited, the provider doesn’t just say “no.” It tells you when to try again. Most people ignore this information.

The Response Headers

HTTP/1.1 429 Too Many Requests
retry-after: 2
x-ratelimit-limit-requests: 60
x-ratelimit-remaining-requests: 0
x-ratelimit-reset-requests: 12s
x-ratelimit-limit-tokens: 90000
x-ratelimit-remaining-tokens: 0
x-ratelimit-reset-tokens: 28s

| Header | What It Tells You |
|---|---|
| retry-after | Seconds to wait before trying again. Use this number. |
| x-ratelimit-limit-requests | Your RPM cap |
| x-ratelimit-remaining-requests | How many requests you have left this window |
| x-ratelimit-reset-requests | When your request limit resets |
| x-ratelimit-limit-tokens | Your TPM cap |
| x-ratelimit-remaining-tokens | How many tokens you have left this window |
| x-ratelimit-reset-tokens | When your token limit resets |

Example: You get a 429 and the retry-after header says 2. That means: wait 2 seconds and try again. Not 0 seconds, not 30 seconds – exactly 2 seconds. The provider is literally telling you the answer.
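Honoring that header is one small helper. A minimal sketch — the plain dict of lowercase header names is illustrative; adapt it to whatever header object your HTTP client returns:

```python
def wait_for_retry(headers: dict) -> float:
    """Return how many seconds to wait after a 429, based on its headers.

    `headers` is assumed to be a dict of lowercase header names.
    """
    retry_after = headers.get("retry-after")
    if retry_after is not None:
        return float(retry_after)  # the provider told us the exact answer
    return 1.0                     # conservative default when the header is absent

# The 429 from the example above:
headers = {"retry-after": "2", "x-ratelimit-remaining-requests": "0"}
print(wait_for_retry(headers))  # 2.0
```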


Status Codes: Which Errors to Retry

Not every error is a rate limit. Here’s a simple rule:

| Code | Meaning | Retry? | What to Do |
|---|---|---|---|
| 429 | Too Many Requests | Yes | Wait and retry with backoff |
| 500 | Server Error | Once | Try once more, then check the provider’s status page |
| 503 | Service Unavailable | Yes | Provider is overloaded – wait and retry |
| 400 | Bad Request | No | Your request is malformed – fix your code |
| 401 | Unauthorized | No | API key is invalid or expired – fix it |
| 403 | Forbidden | No | Key lacks permission for this model or action |

The key rule: Only retry on 429, 500, and 503. Everything else means something is wrong on your end – retrying won’t help.
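That rule fits in a few lines. A sketch of a retry predicate, with the retry-500-only-once behavior from the table above:

```python
RETRYABLE = {429, 500, 503}

def should_retry(status: int, attempt: int, max_attempts: int = 5) -> bool:
    """Decide whether to retry, given the status code and which attempt this is.

    attempt=1 means the first retry. 500 is retried only once; 429 and 503
    are retried up to max_attempts; everything else is a bug on our end.
    """
    if status == 500 and attempt > 1:
        return False  # already retried once — check the provider's status page
    return status in RETRYABLE and attempt <= max_attempts

print(should_retry(429, 1))  # True  — rate limited, back off and retry
print(should_retry(400, 1))  # False — fix the request instead
print(should_retry(500, 2))  # False — 500 gets exactly one retry
```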


The Retry Problem (And Why Most Teams Make It Worse)

The Retry Storm

Request fails (429)
  → Code immediately retries
    → Also fails (429) — still in the same window
      → Code retries again
        → Also fails
          → 3 users are now each retrying 5 times
            → 15 requests where there were 3
              → Rate limit is now 5× worse

This is called a retry storm. Your retry logic creates more traffic, which causes more 429s, which causes more retries – a death spiral.

| Retry Approach | What Happens | Result |
|---|---|---|
| No retry | User sees an error | Bad UX, but no damage |
| Immediate retry | Same request hits the same limit | Retry storm – makes it worse |
| Fixed delay (e.g., 1 s) | All retries fire at the same time | Thundering herd – same problem |
| Exponential backoff | Wait 1 s, 2 s, 4 s, 8 s … | Spreads load, gives limits time to reset |
| Exponential backoff + jitter | Same as above + random 0‑1 s added | Prevents synchronized retries across users |

The Right Way: Exponential Backoff with Jitter

Instead of retrying immediately (which makes things worse), wait a little longer each time:

  1. First retry: wait ~1 second
  2. Second retry: wait ~2 seconds
  3. Third retry: wait ~4 seconds
  4. Keep doubling up to a maximum of 5 retries

If it’s still failing, stop and show the user a helpful error.

Add jitter: add a small random delay (e.g., 0‑1 second) to each wait so that multiple users don’t all retry at the exact same moment.

That’s it – double the wait each time, sprinkle in a pinch of randomness, and give up after five attempts.
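The steps above can be sketched in a single wrapper. `make_request` is a hypothetical zero-argument callable that returns an object with a `.status` attribute — swap in your actual client call:

```python
import random
import time

def call_with_backoff(make_request, max_retries: int = 5):
    """Retry on 429/500/503 with exponential backoff plus 0-1 s of jitter.

    Waits roughly 1 s, 2 s, 4 s, 8 s, 16 s between attempts, then gives up.
    """
    for attempt in range(max_retries + 1):
        resp = make_request()
        if resp.status not in (429, 500, 503):
            return resp
        if attempt == max_retries:
            break
        wait = 2 ** attempt + random.random()  # ~1 s, ~2 s, ~4 s, ... + jitter
        time.sleep(wait)
    raise RuntimeError("Still rate limited after retries — show the user a helpful error")
```

In a real app, also read `retry-after` from the response and use `max(retry_after, wait)` so you never retry earlier than the provider asked.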


Preventing Rate Limits Before They Happen

Three strategies, in order of impact:

  1. Queue requests – control how fast your app calls the API instead of letting every user hit it directly.
  2. Cache responses – avoid duplicate calls for the same prompt or data.
  3. Shrink prompts – send only the context each request actually needs, since TPM limits count every token.

Implement these, and you’ll dramatically reduce the chance of hitting rate limits in the first place.

1. Request Queuing

Without a queue, every user hits the API directly. With a queue, your app controls the flow.

WITHOUT QUEUE:
  User A ──→ API
  User B ──→ API → 100 simultaneous calls → 429s
  User C ──→ API
  …
  User Z ──→ API

WITH QUEUE:
  User A ──┐
  User B ──┤
  User C ──┼──→ Queue ──→ 10 requests/sec ──→ API → No 429s
  …       │
  User Z ──┘
  • Users A and B get instant responses.
  • User Z waits a few seconds.
  • Nobody gets an error.

The queue absorbs the traffic spike and releases it at a rate the API can handle.
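A minimal version of such a queue — this sketch releases callers at a fixed rate using a shared lock; a production app would likely use async workers and per-tenant queues instead:

```python
import threading
import time

class RequestQueue:
    """Release calls at a fixed rate so the API never sees a spike."""

    def __init__(self, rate_per_sec: float):
        self.interval = 1.0 / rate_per_sec
        self.lock = threading.Lock()
        self.next_slot = time.monotonic()

    def submit(self, fn):
        # Reserve the next available time slot, then wait for it.
        with self.lock:
            now = time.monotonic()
            self.next_slot = max(self.next_slot, now)
            wait = self.next_slot - now
            self.next_slot += self.interval
        if wait > 0:
            time.sleep(wait)  # each caller waits its turn instead of erroring
        return fn()

queue = RequestQueue(rate_per_sec=10)  # at most ~10 requests/sec reach the API
```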


2. Caching

If 200 users ask “How do I reset my password?” in one day — why call the API 200 times?

| Strategy | How It Works | Best For |
|---|---|---|
| Exact match | Same question → cached answer | FAQs, common queries |
| Semantic cache | Similar questions → cached answer | Support bots, knowledge bases |
| TTL‑based | Cache expires after X minutes | Data that changes periodically |

Example: 200 identical questions per day.

Without cache: 200 API calls.
With cache: 1 API call + 199 cache hits → rate‑limit usage drops by 99.5 %.
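An exact-match cache with a TTL is only a few lines. A sketch — the `call_api` callable stands in for your real API wrapper:

```python
import time

class TTLCache:
    """Exact-match response cache whose entries expire after a TTL."""

    def __init__(self, ttl_seconds: float = 600):
        self.ttl = ttl_seconds
        self.store = {}  # prompt -> (answer, stored_at)

    def get_or_call(self, prompt: str, call_api):
        entry = self.store.get(prompt)
        if entry is not None:
            answer, stored_at = entry
            if time.monotonic() - stored_at < self.ttl:
                return answer              # cache hit — no API call
        answer = call_api(prompt)          # cache miss — one real call
        self.store[prompt] = (answer, time.monotonic())
        return answer

# The 200-identical-questions scenario from above:
cache = TTLCache(ttl_seconds=600)
calls = 0
def fake_api(prompt):
    global calls
    calls += 1
    return f"answer to {prompt!r}"

for _ in range(200):
    cache.get_or_call("How do I reset my password?", fake_api)
print(calls)  # 1 — the other 199 requests were cache hits
```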


3. Smaller Prompts

TPM limits are about total tokens. A 10 000‑token request eats 100× more budget than a 100‑token request.

| Optimization | Token Savings |
|---|---|
| Send only relevant chunks, not full docs | 30‑60 % |
| Shorter system prompts | 10‑20 % |
| Summarize long docs with a cheap model first | 50‑70 % |

Monitoring: What to Watch

Don’t wait for users to report 429s. Watch these numbers:

| Metric | Warning | Critical | Action |
|---|---|---|---|
| RPM usage % | 70 % of limit | 90 % of limit | Enable queuing or caching |
| TPM usage % | 70 % of limit | 90 % of limit | Optimize prompt sizes |
| 429 count / hour | Any | 10+ per hour | Check for retry storms |
| Retry rate | 5 % of requests | 15 % of requests | Back‑off isn’t aggressive enough |
| P95 response time | 5 s | 15 s | Rate‑limit delays hitting UX |
| Daily token spend | 70 % of TPD | 90 % of TPD | Will run out of daily quota |
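The RPM and TPM usage rows can be computed straight from the rate-limit headers shown earlier. A sketch using those illustrative header names and the 70 %/90 % thresholds:

```python
def usage_alerts(headers: dict) -> list:
    """Derive warning/critical alerts from a response's rate-limit headers.

    `headers` is assumed to be a dict of lowercase header names.
    """
    alerts = []
    for kind in ("requests", "tokens"):
        limit = int(headers[f"x-ratelimit-limit-{kind}"])
        remaining = int(headers[f"x-ratelimit-remaining-{kind}"])
        used_pct = (limit - remaining) / limit * 100
        if used_pct >= 90:
            alerts.append(f"CRITICAL: {kind} at {used_pct:.0f}%")
        elif used_pct >= 70:
            alerts.append(f"WARNING: {kind} at {used_pct:.0f}%")
    return alerts

print(usage_alerts({
    "x-ratelimit-limit-requests": "60", "x-ratelimit-remaining-requests": "12",
    "x-ratelimit-limit-tokens": "90000", "x-ratelimit-remaining-tokens": "50000",
}))  # ['WARNING: requests at 80%']
```

Checking these on every successful response (not just on 429s) is what lets you throttle yourself before the provider does.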

Enterprise: The Noisy‑Neighbor Problem

One enterprise customer runs a batch job — 500 requests in a minute. Your shared API key gets rate‑limited. Now every customer is affected.

| Problem | Solution |
|---|---|
| One customer blocks everyone | Per‑tenant rate limiting – enforce limits per customer before hitting the API |
| Real‑time chat delayed by batch jobs | Priority queues – chat requests go before batch jobs |
| Shared key runs out of quota | Separate API keys – different keys for different customers or use‑cases |
| Unpredictable usage spikes | Batch vs. real‑time separation – batch jobs use a different key with lower priority |

Troubleshooting Checklist

When 429s start showing up, work through this in order:

| Step | What to Check |
|---|---|
| 1 | x‑ratelimit‑remaining‑requests and x‑ratelimit‑remaining‑tokens — which limit did you hit? |
| 2 | Is it RPM or TPM? Too many requests, or too many tokens per request? |
| 3 | Check for retry storms — is your retry count multiplying the problem? |
| 4 | Check the retry‑after header — are you waiting the recommended time? |
| 5 | Is one user or tenant consuming a disproportionate quota? |
| 6 | Prompt sizes — did someone add a huge system prompt or send large documents? |
| 7 | Duplicate requests — is the frontend sending the same request multiple times? |
| 8 | Tier — did you recently exceed a billing threshold that changes your limits? |
| 9 | Provider status page — any capacity issues on the provider side? |
| 10 | Time of day — peak hours (US business hours) have tighter effective limits. |

Common Patterns Quick Reference

| Symptom | Likely Cause | Fix |
|---|---|---|
| 429s for everyone at once | Shared rate limit exhausted | Per‑tenant limits or request queue |
| 429s for one customer only | That customer is sending too much | Per‑customer throttling |
| 429s only during peak hours | Hitting RPM at high‑traffic times | Queue + cache |
| 429s after deploying new feature | New feature sends more or larger requests | Audit token usage |
| 429s that get worse over time | Retry storm | Exponential backoff + jitter |
| 429s on token limit but low RPM | Sending very large prompts | Reduce context and prompt size |
| Intermittent 429s, no pattern | Hovering near the limit | Add 20 % buffer below your limit |
| 429s after a billing change | Tier downgrade reduced limits | Check provider dashboard for current tier |

The Bottom Line

Rate limits aren’t bugs—they’re a feature of every AI API. The difference between a junior and senior engineer:

  • Junior: “The API is broken, it keeps returning errors.”

  • Senior: “We’re hitting our TPM limit during peak hours. I’m adding a request queue with exponential backoff and caching frequent queries. That should keep us under 70 % utilization.”

The takeaways:

  • Know your limits.

  • Monitor your usage.

  • Retry smart, not fast.

When in doubt, check the response headers—the answer is usually right there.
