API Rate Limits & Throttling: What's Actually Happening and How to Fix It
Source: Dev.to
Rate limiting is the #1 reason AI API calls fail in production
It isn’t a bug – it’s the provider protecting their infrastructure. This guide explains what’s happening, how to read the signals, and how to stop it from breaking your app.
The Scenario
Your app has been running fine for weeks. Then, on a Monday morning, some users start seeing errors. The errors come and go – sometimes the same question works on the second try.
Your logs are full of this:
HTTP 429 — Too Many Requests
You’re being rate‑limited. And if you handle it wrong, you’ll make it worse.
What Is Rate Limiting?
Think of a highway on‑ramp with a traffic light. When too many cars try to merge at once, the light turns red and lets them through one at a time. Nobody’s banned from the highway – they just have to wait their turn.
AI providers (OpenAI, Anthropic, Google, …) work the same way. When too many requests arrive, they start telling some customers: “Slow down.”
That’s a rate limit. It isn’t an error in your code; it’s the provider saying: “I can handle your request, just not right now.”
| Term | What It Means |
|---|---|
| Rate limit | Maximum number of requests allowed in a time window |
| Throttling | The provider actively slowing down or rejecting your requests |
| 429 status code | The HTTP response that means “too many requests” |
| Quota | Your total allocation (per minute, per day, or per month) |
The Three Types of Rate Limits
Most people think there’s only one rate limit. In reality there are three, and they trigger independently.
| Type | What It Limits | Example Limit | How You Hit It |
|---|---|---|---|
| Requests per minute (RPM) | Number of API calls | 60 RPM | Sending too many questions, even short ones |
| Tokens per minute (TPM) | Total tokens processed | 90,000 TPM | Sending fewer requests, but each one is huge (long documents, big prompts) |
| Tokens per day (TPD) | Daily token budget | 1,000,000 TPD | Sustained high usage over hours |
Important: You can hit TPM while staying under RPM. A single request with a 50,000‑token document eats more than half your minute’s budget. You only sent one request – but you’re already throttled. Always check your provider’s current documentation for exact limits – they change frequently and vary by tier.
How to Read a 429 Error
When you get rate‑limited, the provider doesn’t just say “no.” It tells you when to try again. Most people ignore this information.
The Response Headers
HTTP/1.1 429 Too Many Requests
retry-after: 2
x-ratelimit-limit-requests: 60
x-ratelimit-remaining-requests: 0
x-ratelimit-reset-requests: 12s
x-ratelimit-limit-tokens: 90000
x-ratelimit-remaining-tokens: 0
x-ratelimit-reset-tokens: 28s
| Header | What It Tells You |
|---|---|
| retry-after | Seconds to wait before trying again. Use this number. |
| x-ratelimit-limit-requests | Your RPM cap |
| x-ratelimit-remaining-requests | How many requests you have left this window |
| x-ratelimit-reset-requests | When your request limit resets |
| x-ratelimit-limit-tokens | Your TPM cap |
| x-ratelimit-remaining-tokens | How many tokens you have left this window |
| x-ratelimit-reset-tokens | When your token limit resets |
Example: You get a 429 and the retry-after header says 2. That means: wait 2 seconds and try again. Not 0 seconds, not 30 seconds – exactly 2 seconds. The provider is literally telling you the answer.
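A minimal sketch of reading those headers in Python. The header names follow the OpenAI‑style convention shown above; other providers may use different names, so treat these keys as an assumption and check your provider’s docs:

```python
def parse_rate_limit_headers(headers: dict) -> dict:
    """Extract retry timing and remaining budget from a 429 response's headers.

    Header names are assumed to follow the OpenAI-style convention;
    adjust the keys for your provider.
    """
    return {
        "retry_after": float(headers.get("retry-after", 0)),
        "remaining_requests": int(headers.get("x-ratelimit-remaining-requests", 0)),
        "remaining_tokens": int(headers.get("x-ratelimit-remaining-tokens", 0)),
    }

# Example: the 429 response shown above
headers = {
    "retry-after": "2",
    "x-ratelimit-remaining-requests": "0",
    "x-ratelimit-remaining-tokens": "0",
}
info = parse_rate_limit_headers(headers)
print(info["retry_after"])  # the number of seconds the provider told you to wait
```

Checking which `remaining_*` value hit zero also tells you whether RPM or TPM is the limit you ran into.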
Status Codes: Which Errors to Retry
Not every error is a rate limit. Here’s a simple rule:
| Code | Meaning | Retry? | What to Do |
|---|---|---|---|
| 429 | Too Many Requests | Yes | Wait and retry with backoff |
| 500 | Server Error | Once | Try once more, then check the provider’s status page |
| 503 | Service Unavailable | Yes | Provider is overloaded – wait and retry |
| 400 | Bad Request | No | Your request is malformed – fix your code |
| 401 | Unauthorized | No | API key is invalid or expired – fix it |
| 403 | Forbidden | No | Key lacks permission for this model or action |
The key rule: Only retry on 429, 500, and 503. Everything else means something is wrong on your end – retrying won’t help.
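That rule fits in a few lines. A sketch of the decision logic, with the “retry 500 only once” nuance from the table:

```python
# Status codes worth retrying: rate limits and transient server errors.
RETRYABLE = {429, 500, 503}

def should_retry(status_code: int, attempt: int, max_retries: int = 5) -> bool:
    """Retry only transient errors; 500 gets a single extra attempt."""
    if status_code == 500:
        return attempt < 1  # try once more, then check the provider's status page
    return status_code in RETRYABLE and attempt < max_retries
```

Everything outside `RETRYABLE` (400, 401, 403) fails fast, so a bug in your request can’t turn into a retry loop.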
The Retry Problem (And Why Most Teams Make It Worse)
The Retry Storm
Request fails (429)
→ Code immediately retries
→ Also fails (429) — still in the same window
→ Code retries again
→ Also fails
→ 3 users are now each retrying 5 times
→ 15 retries where there were 3 requests
→ Traffic is now 5× worse
This is called a retry storm. Your retry logic creates more traffic, which causes more 429s, which causes more retries – a death spiral.
| Retry Approach | What Happens | Result |
|---|---|---|
| No retry | User sees an error | Bad UX, but no damage |
| Immediate retry | Same request hits the same limit | Retry storm – makes it worse |
| Fixed delay (e.g., 1 s) | All retries fire at the same time | Thundering herd – same problem |
| Exponential backoff | Wait 1 s, 2 s, 4 s, 8 s … | Spreads load, gives limits time to reset |
| Exponential backoff + jitter | Same as above + random 0‑1 s added | Prevents synchronized retries across users |
The Right Way: Exponential Backoff with Jitter
Instead of retrying immediately (which makes things worse), wait a little longer each time:
- First retry: wait ~1 second
- Second retry: wait ~2 seconds
- Third retry: wait ~4 seconds
- Keep doubling up to a maximum of 5 retries
If it’s still failing, stop and show the user a helpful error.
Add jitter: add a small random delay (e.g., 0‑1 second) to each wait so that multiple users don’t all retry at the exact same moment.
That’s it – double the wait each time, sprinkle in a pinch of randomness, and give up after five attempts.
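A sketch of that loop in Python. `make_request` is a hypothetical placeholder for your API call, assumed to return a status code and a payload; in real code, prefer the `retry-after` header over the computed delay when the provider supplies one:

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with jitter: ~1s, ~2s, ~4s, ~8s... plus 0-1s of randomness."""
    return min(cap, base * (2 ** attempt)) + random.uniform(0, 1)

def call_with_retries(make_request, max_retries: int = 5):
    """make_request() is a placeholder returning (status_code, payload)."""
    for attempt in range(max_retries + 1):
        status, payload = make_request()
        if status != 429:
            return payload
        if attempt < max_retries:
            time.sleep(backoff_delay(attempt))
    # Still rate-limited after all attempts: stop and surface a friendly error.
    raise RuntimeError("Rate limited after retries; show the user a helpful message")
```

The jitter is the part most teams skip, and it’s what prevents every client from retrying at the exact same instant.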
Preventing Rate Limits Before They Happen
Three strategies, in order of impact:
- Batch & chunk – send larger payloads less frequently (e.g., combine multiple user queries into one request when possible).
- Cache responses – avoid duplicate calls for the same prompt or data.
- Monitor & adapt – read the retry-after and rate‑limit headers, and dynamically throttle your own request queue based on the provider’s signals.
Implement these, and you’ll dramatically reduce the chance of hitting rate limits in the first place.
1. Request Queuing
Without a queue, every user hits the API directly. With a queue, your app controls the flow.
WITHOUT QUEUE:
User A ──→ API
User B ──→ API → 100 simultaneous calls → 429s
User C ──→ API
…
User Z ──→ API
WITH QUEUE:
User A ──┐
User B ──┤
User C ──┼──→ Queue ──→ 10 requests/sec ──→ API → No 429s
… │
User Z ──┘
- Users A and B get instant responses.
- User Z waits a few seconds.
- Nobody gets an error.
The queue absorbs the traffic spike and releases it at a rate the API can handle.
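A minimal single‑threaded sketch of that idea: requests pile up in a queue, and a drain loop releases them no faster than a fixed rate. A production version would run this in a worker thread or async task; the names here are illustrative, not from any library:

```python
import time
from collections import deque

class PacedQueue:
    """Release queued requests at a fixed rate (e.g., 10 per second)."""

    def __init__(self, max_per_second: float):
        self.interval = 1.0 / max_per_second
        self.pending = deque()
        self.last_release = 0.0

    def submit(self, request):
        """Users enqueue instead of hitting the API directly."""
        self.pending.append(request)

    def drain(self, send):
        """Send all pending requests, spacing them out to respect the rate."""
        while self.pending:
            wait = self.interval - (time.monotonic() - self.last_release)
            if wait > 0:
                time.sleep(wait)
            send(self.pending.popleft())
            self.last_release = time.monotonic()
```

The spike still happens on your side of the queue; the API only ever sees the steady drip.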
2. Caching
If 200 users ask “How do I reset my password?” in one day — why call the API 200 times?
| Strategy | How It Works | Best For |
|---|---|---|
| Exact match | Same question → cached answer | FAQs, common queries |
| Semantic cache | Similar questions → cached answer | Support bots, knowledge bases |
| TTL‑based | Cache expires after X minutes | Data that changes periodically |
Example: 200 identical questions per day.
Without cache: 200 API calls.
With cache: 1 API call + 199 cache hits → rate‑limit usage drops by 99.5 %.
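The simplest version, an exact‑match cache with a TTL, is a few lines of Python. This sketch keys on the raw prompt string; a semantic cache would key on an embedding instead:

```python
import time

class TTLCache:
    """Exact-match cache: identical prompts reuse the stored answer until it expires."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store = {}  # prompt -> (answer, stored_at)

    def get(self, prompt):
        entry = self.store.get(prompt)
        if entry is None:
            return None
        answer, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            del self.store[prompt]  # expired: force a fresh API call
            return None
        return answer

    def put(self, prompt, answer):
        self.store[prompt] = (answer, time.monotonic())
```

Usage: check the cache before calling the API, and `put` the answer afterward. The 200th “How do I reset my password?” of the day never leaves your server.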
3. Smaller Prompts
TPM limits are about total tokens. A 10,000‑token request eats 100× more budget than a 100‑token request.
| Optimization | Token Savings |
|---|---|
| Send only relevant chunks, not full docs | 30‑60 % |
| Shorter system prompts | 10‑20 % |
| Summarize long docs with a cheap model first | 50‑70 % |
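One way to enforce “only relevant chunks” is a hard token budget on the context you attach. This sketch uses a rough 4‑characters‑per‑token heuristic, which is an assumption; in practice, count with your provider’s tokenizer:

```python
def trim_chunks(chunks, token_budget, chars_per_token=4):
    """Keep only as many context chunks as fit in the token budget.

    chars_per_token is a rough heuristic, not an exact tokenizer --
    use your provider's tokenizer for real budgets.
    """
    kept, used = [], 0
    for chunk in chunks:
        cost = len(chunk) // chars_per_token + 1  # rough token estimate
        if used + cost > token_budget:
            break
        kept.append(chunk)
        used += cost
    return kept
```

Assuming chunks arrive sorted by relevance, this drops the least relevant material first and caps the TPM cost of every request.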
Monitoring: What to Watch
Don’t wait for users to report 429s. Watch these numbers:
| Metric | Warning | Critical | Action |
|---|---|---|---|
| RPM usage % | 70 % of limit | 90 % of limit | Enable queuing or caching |
| TPM usage % | 70 % of limit | 90 % of limit | Optimize prompt sizes |
| 429 count / hour | Any | 10+ per hour | Check for retry storms |
| Retry rate | 5 % of requests | 15 % of requests | Back‑off isn’t aggressive enough |
| P95 response time | 5 s | 15 s | Rate‑limit delays hitting UX |
| Daily token spend | 70 % of TPD | 90 % of TPD | Will run out of daily quota |
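The utilization rows in that table reduce to one comparison. A sketch of the classification, using the 70 %/90 % thresholds above:

```python
def usage_status(used: int, limit: int, warn: float = 0.70, critical: float = 0.90) -> str:
    """Classify utilization against the warning/critical thresholds in the table."""
    ratio = used / limit
    if ratio >= critical:
        return "critical"
    if ratio >= warn:
        return "warning"
    return "ok"
```

Feed it your current RPM, TPM, or daily token counts against the limits from the response headers, and alert before users ever see a 429.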
Enterprise: The Noisy‑Neighbor Problem
One enterprise customer runs a batch job — 500 requests in a minute. Your shared API key gets rate‑limited. Now every customer is affected.
| Problem | Solution |
|---|---|
| One customer blocks everyone | Per‑tenant rate limiting – enforce limits per customer before hitting the API |
| Real‑time chat delayed by batch jobs | Priority queues – chat requests go before batch jobs |
| Shared key runs out of quota | Separate API keys – different keys for different customers or use‑cases |
| Unpredictable usage spikes | Batch vs. real‑time separation – batch jobs use a different key with lower priority |
Troubleshooting Checklist
When 429s start showing up, work through this in order:
| Step | What to Check |
|---|---|
| 1 | x‑ratelimit‑remaining‑requests and x‑ratelimit‑remaining‑tokens — which limit did you hit? |
| 2 | Is it RPM or TPM? Too many requests or too many tokens per request? |
| 3 | Check for retry storms — is your retry count multiplying the problem? |
| 4 | Check retry‑after header — are you waiting the recommended time? |
| 5 | Is one user or tenant consuming a disproportionate quota? |
| 6 | Prompt sizes — did someone add a huge system prompt or send large documents? |
| 7 | Duplicate requests — is the frontend sending the same request multiple times? |
| 8 | Tier — did you recently exceed a billing threshold that changes your limits? |
| 9 | Provider status page — any capacity issues on the provider side? |
| 10 | Time of day — peak hours (US business hours) have tighter effective limits. |
Common Patterns Quick Reference
| Symptom | Likely Cause | Fix |
|---|---|---|
| 429s for everyone at once | Shared rate limit exhausted | Per‑tenant limits or request queue |
| 429s for one customer only | That customer is sending too much | Per‑customer throttling |
| 429s only during peak hours | Hitting RPM at high‑traffic times | Queue + cache |
| 429s after deploying new feature | New feature sends more or larger requests | Audit token usage |
| 429s that get worse over time | Retry storm | Exponential backoff + jitter |
| 429s on token limit but low RPM | Sending very large prompts | Reduce context and prompt size |
| Intermittent 429s, no pattern | Hovering near the limit | Add 20 % buffer below your limit |
| 429s after a billing change | Tier downgrade reduced limits | Check provider dashboard for current tier |
The Bottom Line
Rate limits aren’t bugs—they’re a feature of every AI API. The difference between a junior and senior engineer:
- Junior: “The API is broken, it keeps returning errors.”
- Senior: “We’re hitting our TPM limit during peak hours. I’m adding a request queue with exponential backoff and caching frequent queries. That should keep us under 70 % utilization.”
- Know your limits.
- Monitor your usage.
- Retry smart, not fast.
When in doubt, check the response headers—the answer is usually right there.