Scrapy HTTP Cache: The Complete Beginner's Guide (Stop Hammering Websites)
When I First Started Building Spiders
I used to test my spiders by running them over and over.
Each time I tweaked a selector, I’d run the spider again, hit the website again, and download the same pages again.
After my 50th run in one day, I realized something: I was being a terrible internet citizen. I was hammering some poor website with hundreds of requests just because I couldn’t write a CSS selector properly.
Then I discovered Scrapy’s HTTP cache – a game‑changer. Now when I test my spiders, they fetch pages once and reuse the cached responses. Faster testing. No guilt. No getting blocked.
Let me show you how to use caching properly.
What Is HTTP Cache?
Think of HTTP cache like a photocopy machine for webpages.
Without cache
- Run spider → Downloads page
- Fix selector → Run again → Downloads same page again
- Fix another thing → Run again → Downloads same page again
You’re downloading the exact same page multiple times. Wasteful, slow, and annoying to the website.
With cache
- Run spider → Downloads page → Saves a copy
- Fix selector → Run again → Uses saved copy (no download!)
- Fix another thing → Run again → Still using saved copy
You download once, test infinite times. The website only sees one request.
Enabling Cache (The One‑Liner)
Add this to your settings.py:
HTTPCACHE_ENABLED = True
That’s it. Scrapy now caches everything.
Run your spider:
scrapy crawl myspider
First run: downloads pages normally. Check your project folder – you’ll see a new .scrapy/httpcache/myspider/ directory where cached pages live.
Run it again:
scrapy crawl myspider
This time it’s lightning fast. No actual HTTP requests; everything comes from cache.
How Cache Works (The Simple Explanation)
- First request – Spider asks for a URL.
- Cache checks – “Do I have this page already?”
- Cache miss – No, it doesn’t.
- Download – Fetch from the website.
- Store – Save the response to cache.
- Return – Give the response to the spider.
Next time you request the same URL:
- Request – Spider asks for the same URL.
- Cache checks – “Do I have this page?”
- Cache hit – Yes!
- Return – Give the cached response (no download!).
Simple. Fast. Efficient.
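To make the flow concrete, here is a toy sketch of the idea in plain Python. It is not Scrapy's actual HttpCacheMiddleware, just the request/check/store/return cycle described above:
# Toy illustration of the cache flow – NOT Scrapy's real middleware.
cache = {}  # maps a request key to a previously downloaded response

def fetch(request, download):
    key = (request['method'], request['url'])
    if key in cache:               # cache hit: return the saved copy, no network
        return cache[key]
    response = download(request)   # cache miss: actually hit the website
    cache[key] = response          # store the copy for next time
    return response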
Basic Cache Settings
How Long to Keep Cache
By default, the cache never expires. You can set an expiration time:
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 86400 # 24 hours
After 24 hours, cached pages are re‑downloaded.
When to use expiration
- Scraping news sites (content changes daily)
- Product prices (change frequently)
- Any dynamic content
When NOT to use expiration
- Development (you want pages to stay cached)
- Scraping static content
- Historical data that doesn’t change
Where to Store Cache
Default location: .scrapy/httpcache/. Change it with:
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'my_custom_cache'
Now the cache lives in my_custom_cache/.
Ignore Certain Status Codes
Don’t cache error pages:
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_IGNORE_HTTP_CODES = [404, 500, 502, 503]
404s and 500s won’t be cached – you don’t want to cache broken pages.
Cache Policies (Two Flavors)
Scrapy provides two cache policies: DummyPolicy and RFC2616Policy.
DummyPolicy (The Simple One)
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
- Caches everything.
- Never checks if the cache is fresh.
- Never revalidates.
Use when: testing, offline development, or when you want to “replay” scrapes exactly.
RFC2616Policy (The Smart One)
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'
- Respects HTTP caching headers (Cache-Control, max-age, etc.).
- Revalidates when needed.
Use when: running production scrapers, respecting website caching rules, needing up‑to‑date data, or simply being a good internet citizen.
Real Example: Development vs. Production
Development Setup (Cache Everything)
# settings.py
# Development: cache everything forever
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
HTTPCACHE_DIR = '.dev_cache'
HTTPCACHE_EXPIRATION_SECS = 0 # Never expire
Perfect for testing – download once, test forever.
Production Setup (Smart Caching)
# settings.py
# Production: respect HTTP caching rules
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'
HTTPCACHE_DIR = '.prod_cache'
HTTPCACHE_EXPIRATION_SECS = 3600 # 1 hour
HTTPCACHE_IGNORE_HTTP_CODES = [404, 500, 502, 503]
Respects website rules and updates when needed.
Practical Workflow
Step 1: Enable Cache for Development
# settings.py
HTTPCACHE_ENABLED = True
That's all you need for now – the other options (policy, directory, expiration) can be tuned later.
Step 2: First Run (Populate Cache)
scrapy crawl myspider
This downloads all pages and caches them.
Step 3: Develop with Cache
Now you can run your spider hundreds of times without hitting the website:
# Run it again
scrapy crawl myspider
# Fix selector
# Run again
scrapy crawl myspider
# Fix another thing
# Run again
scrapy crawl myspider
All runs are instant because they are served from the cache.
Step 4: Clear Cache When Needed
When the website’s structure changes or you need fresh data:
rm -rf .scrapy/httpcache/
Then run the spider again to re‑populate the cache with fresh pages.
Per‑Request Cache Control
You can disable caching for specific requests:
def parse(self, response):
    # This request won't be cached
    yield scrapy.Request(
        'https://example.com/dynamic',
        callback=self.parse_dynamic,
        meta={'dont_cache': True}
    )
Useful when some pages must be fresh while others can be cached.
Advanced: Storage Backends
Scrapy provides two storage backends for the HTTP cache: Filesystem (default) and DBM.
Filesystem (Default)
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Pros
- Easy to inspect (just open files)
- Works everywhere
- Simple
Cons
- Many small files
- Slower with thousands of pages
- Takes more disk space
DBM (Database)
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.DbmCacheStorage'
Pros
- Faster with lots of pages
- Fewer files
- More efficient
Cons
- Harder to inspect
- Which DBM module is used depends on your platform
- More complex
Tip: For most projects, stick with the Filesystem backend; it’s simpler.
Debugging with Cache
See What’s Cached
ls -R .scrapy/httpcache/
You'll see a folder per spider and, inside it, nested folders for each cached request. Each request folder contains:
- request_body – the request that was made
- request_headers – headers sent
- response_body – HTML received
- response_headers – response headers
- meta – metadata
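If you want to peek at a cached page outside of Scrapy, you can read the response_body files directly. A minimal sketch, assuming the default filesystem backend layout and HTTPCACHE_GZIP turned off; 'myspider' is just this guide's example spider name:
# Print the first bytes of every cached response body for inspection.
# Assumes FilesystemCacheStorage and no HTTPCACHE_GZIP compression.
from pathlib import Path

cache_root = Path('.scrapy/httpcache/myspider')
for body_file in cache_root.glob('*/*/response_body'):
    print(body_file.parent.name, body_file.read_bytes()[:80])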
Check If a Request Was Cached
Scrapy logs cache hits:
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com> (referer: None) ['cached']
The ['cached'] suffix indicates a cache hit.
Without a cache hit the log looks like:
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com> (referer: None)
Common Gotchas
Gotcha #1: Cache Persists Between Spider Runs
The cache survives across runs. To force fresh data, clear it manually:
rm -rf .scrapy/httpcache/
Or set an expiration time (see above).
Gotcha #2: Different Spiders Share the Same Cache Directory
If you have multiple spiders in one project, they share .scrapy/httpcache/, but each spider gets its own sub‑folder:
.scrapy/httpcache/
    spider1/
    spider2/
    spider3/
Gotcha #3: POST Requests Are Cached Too (Under the Default Policy)
The default DummyPolicy caches every request, including POSTs. The cache key (the request fingerprint) includes the URL, method, and body, so re-running an identical form submission is served from cache:
# This identical POST will come from cache on later runs
yield scrapy.FormRequest(
    'https://example.com/search',
    formdata={'query': 'test'}
)
POST requests are usually not idempotent, so if a submission must hit the live server every time, opt out with meta={'dont_cache': True}.
Gotcha #4: Redirects Are Cached Too
If a URL redirects, the 30x response and the final page are both cached, so subsequent runs replay the redirect from cache without touching the website:
https://example.com → https://www.example.com
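If you need to know whether a (possibly cached) response arrived via a redirect, Scrapy's RedirectMiddleware records the original URLs in the redirect_urls meta key. A small sketch:
def parse(self, response):
    # redirect_urls lists the URLs we were redirected from, cached or not
    redirected_from = response.meta.get('redirect_urls', [])
    if redirected_from:
        self.logger.info('Redirected %s -> %s', redirected_from[0], response.url)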
Real‑World Scenarios
Scenario 1: Testing Selectors
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0 # never expire
Run once to populate the cache, then tweak selectors all day without hitting the site.
Scenario 2: Scraping Historical Data
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
HTTPCACHE_EXPIRATION_SECS = 0 # keep forever
Perfect for data that never changes (e.g., old articles).
Scenario 3: Production Scraper
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'
HTTPCACHE_EXPIRATION_SECS = 1800 # 30 minutes
Respects HTTP caching rules and refreshes after 30 minutes—a balanced approach.
Scenario 4: Offline Development
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_IGNORE_MISSING = True # skip pages that aren't cached
Your spider will only use cached pages; requests that aren't in the cache are ignored instead of downloaded – ideal for working on a plane.
Tips Nobody Tells You
- Version control the cache (or a subset) when collaborating so everyone works with the same data.
- Combine policies: use DummyPolicy for static pages and RFC2616Policy for dynamic ones by overriding HTTPCACHE_POLICY per spider (see the sketch after this list).
- Monitor cache size: periodically run du -sh .scrapy/httpcache/ and clean old entries to avoid disk bloat.
- Use HTTPCACHE_IGNORE_HTTP_CODES to prevent caching error pages (e.g., 404, 500).
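For the "combine policies" tip, one way to do it is to override HTTPCACHE_POLICY in each spider's custom_settings. A sketch with hypothetical spider names:
# Hypothetical spiders showing per-spider cache policies via custom_settings.
import scrapy

class ArchiveSpider(scrapy.Spider):
    name = 'archive'  # static pages: cache everything, never revalidate
    custom_settings = {
        'HTTPCACHE_ENABLED': True,
        'HTTPCACHE_POLICY': 'scrapy.extensions.httpcache.DummyPolicy',
    }

class PricesSpider(scrapy.Spider):
    name = 'prices'  # dynamic pages: respect the site's caching headers
    custom_settings = {
        'HTTPCACHE_ENABLED': True,
        'HTTPCACHE_POLICY': 'scrapy.extensions.httpcache.RFC2616Policy',
    }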
Tip #1: Use Cache for CI/CD
In continuous integration you don’t want to hit real websites. Use the Scrapy HTTP cache:
# settings.py for CI/CD
HTTPCACHE_ENABLED = True
HTTPCACHE_IGNORE_MISSING = True # Uncached pages are ignored, never downloaded
Pre-populate the cache in your repo. Tests run against cached pages – fast, reliable, and never touching the real site.
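In the CI job itself you might then run the spider purely against that cache; the spider name and output file below are placeholders:
# Run against the pre-populated cache only – uncached pages are ignored, not downloaded
scrapy crawl myspider -s HTTPCACHE_IGNORE_MISSING=True -o test_output.json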
Tip #2: Share Cache Between Developers
Commit the cache folder to version control:
git add .scrapy/httpcache/
git commit -m "Add test cache"
Now everyone on the team uses the same cached pages for testing, giving consistent results.
Tip #3: Different Cache for Different Environments
# settings.py
import os
HTTPCACHE_ENABLED = True
if os.getenv('ENV') == 'production':
    HTTPCACHE_DIR = '.prod_cache'
    HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'
else:
    HTTPCACHE_DIR = '.dev_cache'
    HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
Separate cache for dev and prod – best of both worlds.
Tip #4: Compress Cache to Save Space
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_GZIP = True # Compress cached responses
Saves tons of disk space, especially with large pages.
Complete Example Spider
A production‑ready spider with smart caching:
# spider.py
import scrapy
class SmartCacheSpider(scrapy.Spider):
    name = 'smartcache'
    start_urls = ['https://example.com/products']

    custom_settings = {
        'HTTPCACHE_ENABLED': True,
        'HTTPCACHE_POLICY': 'scrapy.extensions.httpcache.RFC2616Policy',
        'HTTPCACHE_EXPIRATION_SECS': 3600,
        'HTTPCACHE_IGNORE_HTTP_CODES': [404, 500, 502, 503],
        'HTTPCACHE_GZIP': True,
        'HTTPCACHE_DIR': '.product',
    }

    def parse(self, response):
        # Parsing logic goes here
        pass