I got rate-limited scraping 100 pages. Here's what actually worked
Source: Dev.to
Background
I needed product data from an e-commerce site – just the name, price, and availability. Their API required an enterprise plan ($500 / month), so I decided to scrape the public pages instead.
My first run was impatient: I sent requests as fast as possible and got rate‑limited on page 47, losing all the data and having to start over.
First attempt
import requests
from bs4 import BeautifulSoup
for page in range(1, 101):
    response = requests.get(f'https://example.com/products?page={page}')
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract data...
Result: banned at page 47, zero data collected.
What actually worked
1. Add random delays
import time
import random
time.sleep(random.uniform(2, 5))  # 2–5 second delays
2. Rotate user agents
import random
import requests
user_agents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...',
# Add 3–4 more
]
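url = 'https://example.com/products?page=1'  # a listing page, as in the first attempt
# Pick a user agent at random for each request
headers = {'User-Agent': random.choice(user_agents)}
response = requests.get(url, headers=headers)
3. Save progress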
import json
with open('progress.json', 'w') as f:
    json.dump({'last_page': page, 'data': results}, f)
If the scraper crashes, you can restart from the last saved page instead of starting from page 1.
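The save step only helps if the next run reads the file back. Here is a minimal sketch of that resume logic, assuming the same progress.json layout as above. The helper names load_progress and save_progress are mine, not from the original script, and the write-to-a-temp-file-then-rename step is an extra precaution I added so that a crash mid-write does not leave a corrupt file.

```python
import json
import os

PROGRESS_FILE = 'progress.json'  # same file name as in the snippet above


def load_progress(path=PROGRESS_FILE):
    """Return (last_page, data), or (0, []) if nothing has been saved yet."""
    if not os.path.exists(path):
        return 0, []
    with open(path) as f:
        saved = json.load(f)
    return saved['last_page'], saved['data']


def save_progress(last_page, data, path=PROGRESS_FILE):
    """Write to a temporary file first, then swap it into place.

    os.replace overwrites the destination in one step, so an interrupted
    write leaves the previous progress file untouched.
    """
    tmp_path = path + '.tmp'
    with open(tmp_path, 'w') as f:
        json.dump({'last_page': last_page, 'data': data}, f)
    os.replace(tmp_path, path)
```

On startup, something like `last_page, results = load_progress()` followed by `for page in range(last_page + 1, 101):` would pick up from the page after the last one saved.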
Results
- Scraping slowly, with random delays and rotating user agents, avoided bans.
- User‑agent rotation matters because many sites check this header.
- Saving progress every 10–20 pages prevents total data loss.
- The second run completed all 100 pages in about 15 minutes, compared with the roughly 2 minutes the fast run would have needed if it had not been blocked.
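For reference, here is a rough sketch of how the three techniques might fit into one loop. It is not my exact script: the scrape function, its fetch and parse parameters, and the SAVE_EVERY constant are illustrative names, and the user-agent strings are the same truncated placeholders as above. The fetch and sleep functions are passed in so the loop can be tried without hitting a real site.

```python
import json
import random
import time

USER_AGENTS = [
    # Truncated placeholders, as in the list above; use full strings in practice.
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...',
]
SAVE_EVERY = 10  # hypothetical setting, within the 10–20 page range


def scrape(fetch, parse, first_page=1, last_page=100, results=None,
           sleep=time.sleep, progress_path='progress.json'):
    """Fetch pages one at a time with a random user agent and a random delay.

    fetch(page, headers) should return the page HTML as a string;
    parse(html) should return a list of extracted items.
    """
    results = list(results or [])
    for page in range(first_page, last_page + 1):
        headers = {'User-Agent': random.choice(USER_AGENTS)}
        results.extend(parse(fetch(page, headers)))

        # Save periodically, and always after the final page.
        if page % SAVE_EVERY == 0 or page == last_page:
            with open(progress_path, 'w') as f:
                json.dump({'last_page': page, 'data': results}, f)

        # Wait 2–5 seconds before the next request (no wait after the last).
        if page < last_page:
            sleep(random.uniform(2, 5))
    return results
```

With requests, fetch could be as simple as `lambda page, headers: requests.get(f'https://example.com/products?page={page}', headers=headers).text`, and parse would hold the BeautifulSoup extraction.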
For larger jobs I now use tools like ParseForge that handle throttling and rotation automatically, but the above approach works well for smaller projects.