3 Hours Wasted on asyncio Pitfalls That Almost Took Down Production
Source: Dev.to
Incident Overview
Last Friday at 5 PM, right as I was about to close my laptop and sneak out, the alert channel exploded: the online data collection service's timeout rate had spiked to 40%, and every downstream report was blank. I checked the logs and found that the crawler processing thousands of URLs was still using the old synchronous requests library, fetching them one by one. Each request averaged 1.2 seconds, so a full round took nearly 20 minutes, while the business requirement was completion within 5 minutes. The only thought that crossed my mind: rewrite it with asyncio for concurrency and deploy before leaving.
That decision led me into three major pitfalls, and I almost wrecked the service. Below are the hard-earned lessons; I hope they save you the same three hours.
Understanding asyncio
The core of asyncio is the event loop plus coroutines.
- The event loop is a constantly polling scheduler.
- Each coroutine is a task that can voluntarily pause and hand back control.
When a coroutine is waiting for a network response (I/O), the event loop immediately switches to another ready coroutine, keeping the CPU from spinning idle.
The biggest difference from traditional multithreading is that asyncio uses cooperative scheduling within a single thread, avoiding thread‑switching overhead and GIL contention. It especially shines in network‑request‑heavy scenarios.
Typical pattern:
async def coro():
    await async_io_operation()
Gather multiple coroutines with asyncio.gather(). The total duration depends on the slowest task, not the sum of all tasks.
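A minimal sketch of that behavior, using asyncio.sleep to stand in for a 1-second network wait:

import asyncio
import time

async def work(n):
    await asyncio.sleep(1)  # stand-in for 1-second I/O; awaiting hands control back to the loop
    return n

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(*(work(i) for i in range(10)))
    elapsed = time.perf_counter() - start
    print(results, f"{elapsed:.2f}s")  # ten 1-second tasks finish in roughly 1 second, not 10

asyncio.run(main())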
Naïve implementation (what not to do)
import asyncio
import requests  # synchronous library, don't use it in a coroutine!

async def fetch(url):
    # Anti-pattern: calling the synchronous requests library directly inside a coroutine
    resp = requests.get(url, timeout=5)  # this call blocks the entire thread!
    return resp.status_code

async def main():
    urls = ["https://httpbin.org/delay/1"] * 10
    tasks = [fetch(url) for url in urls]
    results = await asyncio.gather(*tasks)
    print(results)

asyncio.run(main())
Running this shows all requests are still sequential. requests.get() is a blocking call; while it waits for the network it never yields control back to the event loop, so only one coroutine runs at a time. The event loop becomes effectively useless.
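As an aside: if you truly cannot swap out a synchronous library, one standard escape hatch is to push the blocking call onto a worker thread with asyncio.to_thread (Python 3.9+). This is only a sketch of that option, not the fix I shipped:

import asyncio
import requests

async def fetch(url):
    # Run the blocking requests call in a worker thread so the event loop stays free
    resp = await asyncio.to_thread(requests.get, url, timeout=5)
    return resp.status_code

async def main():
    urls = ["https://httpbin.org/delay/1"] * 10
    results = await asyncio.gather(*(fetch(u) for u in urls))
    print(results)

asyncio.run(main())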
Correct async HTTP client
import asyncio
import aiohttp

async def fetch(session, url):
    # aiohttp's async request: while awaiting, control is handed back to the event loop
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as resp:
        return await resp.text()

async def main():
    urls = ["https://httpbin.org/delay/1"] * 10
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        print(f"Completed {len(results)} requests")

asyncio.run(main())
Now the event loop truly runs the ten 1‑second requests concurrently, finishing in just over 1 second instead of 10 seconds. In production the crawler went from 20 minutes to under 2 minutes.
Handling exceptions and concurrency limits
When the URL list grew to several hundred, occasional timeouts or DNS failures made asyncio.gather() raise as soon as the first task failed, and the results of the entire batch were lost.
async def fetch_with_sem(sem, session, url):
    async with sem:  # cap concurrency so we don't exhaust file descriptors in one burst
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                return url, await resp.text()
        except Exception as e:
            return url, e  # return the exception object so the caller's isinstance check works

async def main():
    urls = [...]  # a few hundred URLs
    sem = asyncio.Semaphore(50)  # limit concurrency to avoid OS or server-side limits
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_with_sem(sem, session, url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)  # key!
        for url, content in results:
            if isinstance(content, Exception):
                print(f"{url} failed: {content}")
            else:
                process(content)  # downstream processing
- return_exceptions=True makes gather() return exception objects in the results instead of aborting the whole batch.
- A semaphore caps concurrent connections (e.g., 50) to avoid exhausting file descriptors or hitting remote rate limits.
- Swallowing exceptions and optionally retrying them yields a stable service.
Practical tips
- Never call time.sleep() inside a coroutine – it blocks the entire thread. Use await asyncio.sleep() instead.
- Beware of unlimited concurrency – spawning thousands of connections at once can exhaust file descriptor limits or trigger server rate limiting. Use asyncio.Semaphore or connection-pool limits.
- Implement back-off and retries – transient network issues are normal. Combine return_exceptions=True with exponential backoff retries for robust production-grade code (a sketch follows this list).
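A minimal sketch of that back-off pattern, reusing the semaphore-guarded fetch from above; the helper name fetch_with_retry and the retry parameters are my own illustration, not the exact code I shipped:

import asyncio
import aiohttp

async def fetch_with_retry(sem, session, url, attempts=3, base_delay=0.5):
    # Hypothetical helper: attempt count and delays are illustrative
    for attempt in range(attempts):
        async with sem:
            try:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                    resp.raise_for_status()
                    return url, await resp.text()
            except Exception as e:
                if attempt == attempts - 1:
                    return url, e  # out of retries: hand the exception back to the caller
        # exponential backoff between attempts: 0.5s, 1s, 2s, ...
        await asyncio.sleep(base_delay * (2 ** attempt))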
Conclusion
Rewriting a synchronous I/O-bound service with asyncio is one of the most satisfying optimizations you can make. But these pitfalls can easily turn it into a nightmare if you're not careful. I lost three hours and very nearly a quiet production Friday. I hope this post saves you from the same fate.