The Anti-Bot Detection Checklist I Use Before Every Scraping Project
Source: Dev.to
Every scraping project I take on starts with this checklist. Not because I’m paranoid, but because I’ve learned the hard way that production scrapers fail silently. They return 200 OK with garbage data, or they get rate-limited so gradually you don’t notice for days. This is the systematic approach I’ve refined over 50+ scraping projects.

Before writing a single line of code, check what you’re up against:
Check CDN and headers
curl -I https://target-site.com
Look for these common protection headers:
X-Engine: akamai-html-protection
X-Served-By: DataDome
cf-ray: Cloudflare
X-Bot-Status: blocked
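The header check above can be automated. Here is a minimal sketch, assuming the signals listed in this checklist are the ones a site actually exposes (many sites hide or rename them). `guessProtection` and `SIGNALS` are hypothetical names, not part of any library.

```javascript
// Hypothetical helper: guess the protection vendor from response headers.
// The signals mirror the header list in this checklist; real sites may differ.
const SIGNALS = [
  { vendor: 'Cloudflare', test: (name) => name === 'cf-ray' },
  { vendor: 'DataDome', test: (name, value) => /datadome/i.test(`${name} ${value}`) },
  { vendor: 'Akamai', test: (name, value) => /akamai/i.test(`${name} ${value}`) },
];

// headers: a plain object of header name -> value.
// Returns the list of vendors whose signal matched (possibly empty).
function guessProtection(headers) {
  const found = new Set();
  for (const [rawName, rawValue] of Object.entries(headers)) {
    const name = rawName.toLowerCase(); // header names are case-insensitive
    const value = String(rawValue);
    for (const signal of SIGNALS) {
      if (signal.test(name, value)) found.add(signal.vendor);
    }
  }
  return [...found];
}

// Usage with the global fetch (Node 18+), commented out to avoid a network call:
// const res = await fetch('https://target-site.com', { method: 'HEAD' });
// console.log(guessProtection(Object.fromEntries(res.headers)));
```

An empty result does not mean the site is unprotected. It only means none of these particular signals showed up in the headers.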
Common protection platforms:
Cloudflare → Look for the cf-ray header and __cf_bm or cf_clearance cookies (the older __cfduid cookie was retired in 2021)
DataDome → Look for datadome in headers or scripts
PerimeterX → Look for _pxff cookies
Akamai → Look for akamai-html-protection headers

curl https://target-site.com/robots.txt | grep -v "^#"
Don’t take robots.txt as gospel, but it’s a good signal. If the site explicitly disallows your use case, that’s a red flag.

Some sites are fully static (fast, easy). Others render everything with JavaScript (need Playwright/Puppeteer). Check:

// Quick check - fetch raw HTML vs rendered content
// If they differ significantly, you need JS rendering
// Uses the global fetch (Node 18+), wrapped in an async function so await works
(async () => {
  const html = await fetch('https://target.com').then(r => r.text());
  const hasAngularVueReact = /ng-app|vue|react|NEXT_DATA/i.test(html);
  console.log('Needs JS rendering:', hasAngularVueReact);
})();
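A bare keyword match on "vue" or "react" can fire on pages that merely mention those words. A slightly stricter heuristic combines framework markers with a check on how much visible text the raw HTML carries. This is a sketch under assumptions: the marker list reflects common framework defaults (such as `id="root"` or `<app-root>`), the 200-character threshold is arbitrary, and `visibleTextLength` and `looksClientRendered` are hypothetical names.

```javascript
// Rough count of visible text: drop scripts, styles, and tags, then collapse whitespace.
function visibleTextLength(html) {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, ' ')
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim().length;
}

// Hypothetical heuristic: a framework mount point plus very little visible text
// suggests the content is filled in by JavaScript after load.
function looksClientRendered(html, minTextLength = 200) {
  const frameworkMarkers =
    /__NEXT_DATA__|ng-app|<app-root|id="(root|app|__next|__nuxt)"/i;
  const hasMarker = frameworkMarkers.test(html);
  const thinContent = visibleTextLength(html) < minTextLength;
  return hasMarker && thinContent;
}

// Usage (commented out to avoid a network call):
// const html = await fetch('https://target.com').then(r => r.text());
// console.log('Needs JS rendering:', looksClientRendered(html));
```

Requiring both conditions keeps server-rendered framework pages, which ship their content in the initial HTML, from being flagged.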
const USER_AGENTS = [
  // Chrome on macOS
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  // Edge on Windows
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0',
  // Firefox on Linux
  'Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0',
  // Add 10-15 more realistic user agents
];

// Pick one user agent at random for each request
function randomUA() {
  return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}
Never use a single UA string. Rotate through 10+ realistic ones.

async function scrapeWithRetry(url, maxRetries = 3) {
  for (let i = 0; i […]

[…] !data[f]);

  if (missing.length > 0) {
    console.warn('Missing fields:', missing.join(', '));
    return false;
  }

  if (data.price && typeof data.price !== 'number') {
    console.warn('Invalid price type');
    return false;
  }

  return true;
}
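Field validation like this can also be driven by a small schema, so adding a field means adding one line. A minimal sketch follows, with the assumptions stated up front: `PRODUCT_SCHEMA`, its field names other than `price`, and `findProblems` are all hypothetical and would need to match your own parsed data.

```javascript
// Hypothetical schema: field name -> expected result of typeof.
const PRODUCT_SCHEMA = { title: 'string', price: 'number', url: 'string' };

// Returns a list of human-readable problems; an empty list means the record passed.
function findProblems(data, schema = PRODUCT_SCHEMA) {
  const problems = [];
  for (const [field, expectedType] of Object.entries(schema)) {
    const value = data[field];
    if (value === undefined || value === null || value === '') {
      problems.push(`missing: ${field}`);
    } else if (typeof value !== expectedType) {
      problems.push(`wrong type: ${field} (expected ${expectedType}, got ${typeof value})`);
    } else if (expectedType === 'number' && !Number.isFinite(value)) {
      problems.push(`not finite: ${field}`);
    }
  }
  return problems;
}

// Usage:
// const problems = findProblems({ title: 'Widget', price: '9.99' });
// if (problems.length > 0) console.warn('Invalid record:', problems.join('; '));
```

Unlike a plain truthiness check, this treats a price of 0 as present, and it reports every problem at once rather than stopping at the first.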
Set up automated health checks that alert you when your scraper starts returning garbage:

// Run this every hour
async function healthCheck() {
  const testUrl = 'https://target-site.com/product-page';
  const result = await scrape(testUrl);

  const blockType = detectBlock(result);
  if (blockType) {
    sendAlert(`Scraper blocked by ${blockType}!`);
    return false;
  }

  if (!validateData(result.parsed)) {
    sendAlert('Scraper returning invalid data!');
    return false;
  }

  return true;
}
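The health check leans on `detectBlock`, so it is worth showing one way such a function might look, together with a retry wrapper that backs off when a block is detected. This is a sketch, not a definitive implementation. It assumes a result shaped like `{ status, body }`, and the status codes, phrases, 500-character threshold, and delay values are illustrative guesses. `backoffDelay` and the injected `fetchPage` parameter are hypothetical.

```javascript
// Hypothetical block detector: returns a label for the block type, or null.
// The signals below are common ones, not an exhaustive list.
function detectBlock(result) {
  const { status, body = '' } = result;
  if (status === 429) return 'rate-limit';
  if (status === 403) return 'forbidden';
  if (/captcha|verify you are human|access denied/i.test(body)) return 'challenge-page';
  if (status === 200 && body.trim().length < 500) return 'suspiciously-short-body';
  return null;
}

// Exponential backoff with jitter: baseMs * 2^attempt, plus up to 25% random extra.
function backoffDelay(attempt, baseMs = 1000, random = Math.random) {
  const exponential = baseMs * 2 ** attempt;
  return Math.round(exponential * (1 + 0.25 * random()));
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// fetchPage is injected: any async function that resolves to { status, body }.
async function scrapeWithRetry(url, fetchPage, maxRetries = 3, baseMs = 1000) {
  let lastBlock = null;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const result = await fetchPage(url);
    lastBlock = detectBlock(result);
    if (!lastBlock) return result;
    // Wait before the next attempt, but not after the final one
    if (attempt < maxRetries - 1) await sleep(backoffDelay(attempt, baseMs));
  }
  throw new Error(`Blocked after ${maxRetries} attempts: ${lastBlock}`);
}
```

The body-phrase check can misfire on legitimate pages that embed a CAPTCHA widget, so tune the patterns against real responses from your target before trusting the alerts.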
This is the most overlooked step. Store every response as raw HTML before parsing:

async function scrapeAndStore(url) {
  const response = await fetch(url);
  const raw = await response.text();

  // Store raw for debugging
  await db.rawResponses.insert({
    url,
    raw_html: raw,
    timestamp: new Date(),
    status: response.status
  });

  // Then parse
  const parsed = parseHTML(raw);
  return parsed;
}
When your parser breaks (and it will), you’ll thank yourself for the raw data.

A production-ready scraper isn’t just code, it’s a system:

Monitoring → Alerting → Health Checks → Data Validation → Backup Parser
     ↑            ↑            ↑               ↑
Residential Proxies ──────── Sticky Sessions ──── Error Handling
If you only implement three things from this list:
Residential proxies (biggest win)
Block detection (prevents silent failures)
Store raw HTML (enables debugging)

Everything else is incremental improvement.

Questions about specific anti-bot systems? I’ve dealt with all of them, so drop a comment.