Scraper worked on my laptop. Deployed to server and got instant 403s.
Source: Dev.to
What broke
The target site was checking the User-Agent header. My laptop sent requests with a normal browser user agent because I was using Playwright for something else and had set it globally in my profile.
The server, a fresh Ubuntu install, used the default Python requests User-Agent:
python-requests/2.31.0
The site rejected that and returned 403 Forbidden for every request.
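One way to confirm what a machine will send, without touching the target site, is to ask requests for its default User-Agent. This is a minimal sketch, assuming requests is installed; the version in the string depends on the installed release, so it may not read 2.31.0.

```python
import requests

# The string requests uses when no User-Agent header is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # e.g. python-requests/<installed version>

# A new Session carries that same default header.
session = requests.Session()
print(session.headers["User-Agent"] == default_ua)
```

Running this on both the laptop and the server is a quick way to spot a difference in the library's defaults before looking anywhere else.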
Fixed it
Added a custom User-Agent to the request headers:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}

response = requests.get('https://example.com/products', headers=headers)

if response.status_code == 200:
    # Parse the data
    products = response.json()
else:
    print(f"Failed: {response.status_code}")

With this change the site started returning 200 OK again.
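If the scraper makes more than one request, it may be easier to set the header once on a Session instead of passing it on every call. The sketch below is one possible variation on the fix, not what the original script did. The name BROWSER_UA is made up for illustration, and the request is only prepared, never sent, so it runs without network access.

```python
import requests

# Same browser-like string as in the fix above.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

session = requests.Session()
# Every request made through this session now carries the custom header.
session.headers.update({"User-Agent": BROWSER_UA})

# Build (but do not send) a request to see the headers that would go out.
prepared = session.prepare_request(
    requests.Request("GET", "https://example.com/products")
)
print(prepared.headers["User-Agent"])
```

After this, session.get(...) can be used wherever the script called requests.get(...).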
Other things that sometimes matter
Besides User-Agent, some sites also check:
Referer header – they may require a valid referer to allow the request.
headers = {
    'User-Agent': 'Mozilla/5.0...',
    'Referer': 'https://example.com/'
}
Accept headers – real browsers send a variety of accept headers.
headers = {
    'User-Agent': 'Mozilla/5.0...',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    # Only advertise 'br' if the brotli package is installed; without it requests cannot decode a Brotli body
    'Accept-Encoding': 'gzip, deflate, br'
}
Most of the time, setting a proper User-Agent is enough. When it isn’t, adding these additional headers usually resolves the issue.
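The header sets above can be folded into one small function so the scraper builds them in a single place. This is a rough sketch; browser_headers is a hypothetical helper, not part of requests, and it leaves 'br' out of Accept-Encoding on the assumption that the brotli package is not installed on the server.

```python
def browser_headers(user_agent, referer=None):
    """Return a dict of browser-like headers (hypothetical helper)."""
    headers = {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        # 'br' is omitted: requests only decodes Brotli when the brotli package is installed.
        "Accept-Encoding": "gzip, deflate",
    }
    # Only send a Referer when the caller supplies one.
    if referer is not None:
        headers["Referer"] = referer
    return headers


headers = browser_headers(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    referer="https://example.com/",
)
for name, value in headers.items():
    print(f"{name}: {value}")
```

The resulting dict can be passed as headers=headers to requests.get, the same way as in the fix.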
Tip: Always check response.status_code before parsing the response. This prevents trying to parse an error page (e.g., a 403) as JSON and encountering confusing parsing errors.
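The tip can be wrapped in a small function. This is a sketch only: parse_json_if_ok is a made-up name, and it assumes the response object exposes status_code and json() the way a requests response does. It also catches the case where a site answers 200 with a body that is not JSON, which the original script did not handle.

```python
def parse_json_if_ok(response):
    """Return parsed JSON for a 200 response, else None (hypothetical helper)."""
    if response.status_code != 200:
        print(f"Failed: {response.status_code}")
        return None
    try:
        return response.json()
    except ValueError:
        # requests raises a ValueError subclass when the body is not valid JSON,
        # for example when a block page is served with status 200.
        print("Got 200 but the body was not valid JSON")
        return None
```

With requests this would be called as products = parse_json_if_ok(requests.get(url, headers=headers)), followed by a check for None before using the result.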