Scrapy HTTP Cache: The Complete Beginner's Guide (Stop Hammering Websites)
When I First Started Building Spiders
I used to test my spiders by running them over and over.
Each time I tweaked a selector, I’d run the spider again, hit the website again, and download the same pages again.
After my 50th run in one day, I realized something: I was being a terrible internet citizen. I was hammering some poor website with hundreds of requests just because I couldn’t write a CSS selector properly.
Then I discovered Scrapy’s HTTP cache – a game‑changer. Now when I test my spiders, they fetch pages once and reuse the cached responses. Faster testing. No guilt. No getting blocked.
Let me show you how to use caching properly.
What Is HTTP Cache?
Think of HTTP cache like a photocopy machine for webpages.
Without cache
- Run spider → Downloads page
- Fix selector → Run again → Downloads same page again
- Fix another thing → Run again → Downloads same page again
You’re downloading the exact same page multiple times. Wasteful, slow, and annoying to the website.
With cache
- Run spider → Downloads page → Saves a copy
- Fix selector → Run again → Uses saved copy (no download!)
- Fix another thing → Run again → Still using saved copy
You download once, test infinite times. The website only sees one request.
Enabling Cache (The One‑Liner)
Add this to your settings.py:
HTTPCACHE_ENABLED = True
That’s it. Scrapy now caches everything.
Run your spider:
scrapy crawl myspider
First run: downloads pages normally. Check your project folder – you’ll see a new .scrapy/httpcache/myspider/ directory where cached pages live.
Run it again:
scrapy crawl myspider
This time it’s lightning fast. No actual HTTP requests; everything comes from cache.
How Cache Works (The Simple Explanation)
- First request – Spider asks for a URL.
- Cache checks – “Do I have this page already?”
- Cache miss – No, it doesn’t.
- Download – Fetch from the website.
- Store – Save the response to cache.
- Return – Give the response to the spider.
Next time you request the same URL:
- Request – Spider asks for the same URL.
- Cache checks – “Do I have this page?”
- Cache hit – Yes!
- Return – Give the cached response (no download!).
Simple. Fast. Efficient.
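To make the flow concrete, here is a toy sketch of the idea in plain Python. It is not Scrapy's actual HttpCacheMiddleware, just the request/check/store/return cycle described above:
# Toy illustration of the cache flow – NOT Scrapy's real middleware.
cache = {}  # maps a request key to a previously downloaded response

def fetch(request, download):
    key = (request['method'], request['url'])
    if key in cache:               # cache hit: return the saved copy, no network
        return cache[key]
    response = download(request)   # cache miss: actually hit the website
    cache[key] = response          # store the copy for next time
    return response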
Basic Cache Settings
How Long to Keep Cache
By default, the cache never expires. You can set an expiration time:
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 86400 # 24 hours
After 24 hours, cached pages are re‑downloaded.
When to use expiration
- Scraping news sites (content changes daily)
- Product prices (change frequently)
- Any dynamic content
When NOT to use expiration
- Development (you want pages to stay cached)
- Scraping static content
- Historical data that doesn’t change
Where to Store Cache
Default location: .scrapy/httpcache/. Change it with:
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'my_custom_cache'
Now the cache lives in my_custom_cache/.
Ignore Certain Status Codes
Don’t cache error pages:
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_IGNORE_HTTP_CODES = [404, 500, 502, 503]
404s and 500s won’t be cached – you don’t want to cache broken pages.
Cache Policies (Two Flavors)
Scrapy provides two cache policies: DummyPolicy and RFC2616Policy.
DummyPolicy (The Simple One)
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
- Caches everything.
- Never checks if the cache is fresh.
- Never revalidates.
Use when: testing, offline development, or when you want to “replay” scrapes exactly.
RFC2616Policy (The Smart One)
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'
- Respects HTTP caching headers (Cache-Control, max-age, etc.).
- Revalidates when needed.
Use when: running production scrapers, respecting website caching rules, needing up‑to‑date data, or simply being a good internet citizen.
Real Example: Development vs. Production
Development Setup (Cache Everything)
# settings.py
# Development: cache everything forever
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
HTTPCACHE_DIR = '.dev_cache'
HTTPCACHE_EXPIRATION_SECS = 0 # Never expire
Perfect for testing – download once, test forever.
Production Setup (Smart Caching)
# settings.py
# Production: respect HTTP caching rules
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'
HTTPCACHE_DIR = '.prod_cache'
HTTPCACHE_EXPIRATION_SECS = 3600 # 1 hour
HTTPCACHE_IGNORE_HTTP_CODES = [404, 500, 502, 503]
Respects website rules and updates when needed.
Practical Workflow
Step 1: Enable Cache for Development
# settings.py
HTTPCACHE_ENABLED = True
That's all you need for now – the other options (policy, directory, expiration) can be tuned later.
Step 2: First Run (Populate Cache)
scrapy crawl myspider
This downloads all pages and caches them.
Step 3: Develop with Cache
Now you can run your spider hundreds of times without hitting the website:
# Run it again
scrapy crawl myspider
# Fix selector
# Run again
scrapy crawl myspider
# Fix another thing
# Run again
scrapy crawl myspider
All runs are instant because they are served from the cache.
Step 4: Clear Cache When Needed
When the website’s structure changes or you need fresh data:
rm -rf .scrapy/httpcache/
Then run the spider again to re‑populate the cache with fresh pages.
Per‑Request Cache Control
You can disable caching for specific requests:
def parse(self, response):
    # This request won't be cached
    yield scrapy.Request(
        'https://example.com/dynamic',
        callback=self.parse_dynamic,
        meta={'dont_cache': True}
    )
Useful when some pages must be fresh while others can be cached.
Advanced: Storage Backends
Scrapy provides two storage backends for the HTTP cache: Filesystem (default) and DBM.
Filesystem (Default)
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Pros
- Easy to inspect (just open files)
- Works everywhere
- Simple
Cons
- Many small files
- Slower with thousands of pages
- Takes more disk space
DBM (Database)
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.DbmCacheStorage'
Pros
- Faster with lots of pages
- Fewer files
- More efficient
Cons
- Harder to inspect
- Which DBM module is used depends on your platform
- More complex
Tip: For most projects, stick with the Filesystem backend; it’s simpler.
Debugging with Cache
See What’s Cached
ls -R .scrapy/httpcache/
You'll see a folder per spider and, inside it, nested folders for each cached request. Each request folder contains:
- request_body – the request that was made
- request_headers – headers sent
- response_body – HTML received
- response_headers – response headers
- meta – metadata
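If you want to peek at a cached page outside of Scrapy, you can read the response_body files directly. A minimal sketch, assuming the default filesystem backend layout and HTTPCACHE_GZIP turned off; 'myspider' is just this guide's example spider name:
# Print the first bytes of every cached response body for inspection.
# Assumes FilesystemCacheStorage and no HTTPCACHE_GZIP compression.
from pathlib import Path

cache_root = Path('.scrapy/httpcache/myspider')
for body_file in cache_root.glob('*/*/response_body'):
    print(body_file.parent.name, body_file.read_bytes()[:80])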
Check If a Request Was Cached
Scrapy logs cache hits:
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com> (referer: None) ['cached']
The ['cached'] suffix indicates a cache hit.
Without a cache hit the log looks like:
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com> (referer: None)
Common Gotchas
Gotcha #1: Cache Persists Between Spider Runs
The cache survives across runs. To force fresh data, clear it manually:
rm -rf .scrapy/httpcache/
Or set an expiration time (see above).
Gotcha #2: Different Spiders Share the Same Cache Directory
If you have multiple spiders in one project, they share .scrapy/httpcache/, but each spider gets its own sub‑folder:
.scrapy/httpcache/
    spider1/
    spider2/
    spider3/
Gotcha #3: POST Requests Are Cached Too (Under the Default Policy)
The default DummyPolicy caches every request, including POSTs. The cache key (the request fingerprint) includes the URL, method, and body, so re-running an identical form submission is served from cache:
# This identical POST will come from cache on later runs
yield scrapy.FormRequest(
    'https://example.com/search',
    formdata={'query': 'test'}
)
POST requests are usually not idempotent, so if a submission must hit the live server every time, opt out with meta={'dont_cache': True}.
Gotcha #4: Redirects Are Cached Too
If a URL redirects, the 30x response and the final page are both cached, so subsequent runs replay the redirect from cache without touching the website:
https://example.com → https://www.example.com
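If you need to know whether a (possibly cached) response arrived via a redirect, Scrapy's RedirectMiddleware records the original URLs in the redirect_urls meta key. A small sketch:
def parse(self, response):
    # redirect_urls lists the URLs we were redirected from, cached or not
    redirected_from = response.meta.get('redirect_urls', [])
    if redirected_from:
        self.logger.info('Redirected %s -> %s', redirected_from[0], response.url)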
Real‑World Scenarios
Scenario 1: Testing Selectors
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0 # never expire
Run once to populate the cache, then tweak selectors all day without hitting the site.
Scenario 2: Scraping Historical Data
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
HTTPCACHE_EXPIRATION_SECS = 0 # keep forever
Perfect for data that never changes (e.g., old articles).
Scenario 3: Production Scraper
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'
HTTPCACHE_EXPIRATION_SECS = 1800 # 30 minutes
Respects HTTP caching rules and refreshes after 30 minutes—a balanced approach.
Scenario 4: Offline Development
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_IGNORE_MISSING = True # skip pages that aren't cached
Your spider will only use cached pages; requests that aren't in the cache are ignored instead of downloaded – ideal for working on a plane.
Tips Nobody Tells You
- Version control the cache (or a subset) when collaborating so everyone works with the same data.
- Combine policies: use DummyPolicy for static pages and RFC2616Policy for dynamic ones by overriding HTTPCACHE_POLICY per spider (see the sketch after this list).
- Monitor cache size: periodically run du -sh .scrapy/httpcache/ and clean old entries to avoid disk bloat.
- Use HTTPCACHE_IGNORE_HTTP_CODES to prevent caching error pages (e.g., 404, 500).
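For the "combine policies" tip, one way to do it is to override HTTPCACHE_POLICY in each spider's custom_settings. A sketch with hypothetical spider names:
# Hypothetical spiders showing per-spider cache policies via custom_settings.
import scrapy

class ArchiveSpider(scrapy.Spider):
    name = 'archive'  # static pages: cache everything, never revalidate
    custom_settings = {
        'HTTPCACHE_ENABLED': True,
        'HTTPCACHE_POLICY': 'scrapy.extensions.httpcache.DummyPolicy',
    }

class PricesSpider(scrapy.Spider):
    name = 'prices'  # dynamic pages: respect the site's caching headers
    custom_settings = {
        'HTTPCACHE_ENABLED': True,
        'HTTPCACHE_POLICY': 'scrapy.extensions.httpcache.RFC2616Policy',
    }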
Tip #1: Use Cache for CI/CD
In continuous integration you don’t want to hit real websites. Use the Scrapy HTTP cache:
# settings.py for CI/CD
HTTPCACHE_ENABLED = True
HTTPCACHE_IGNORE_MISSING = True # Uncached pages are ignored, never downloaded
Pre-populate the cache in your repo. Tests run against cached pages – fast, reliable, and never touching the real site.
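In the CI job itself you might then run the spider purely against that cache; the spider name and output file below are placeholders:
# Run against the pre-populated cache only – uncached pages are ignored, not downloaded
scrapy crawl myspider -s HTTPCACHE_IGNORE_MISSING=True -o test_output.json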
Tip #2: Share Cache Between Developers
Commit the cache folder to version control:
git add .scrapy/httpcache/
git commit -m "Add test cache"
Now everyone on the team uses the same cached pages for testing, giving consistent results.
Tip #3: Different Cache for Different Environments
# settings.py
import os
HTTPCACHE_ENABLED = True
if os.getenv('ENV') == 'production':
    HTTPCACHE_DIR = '.prod_cache'
    HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'
else:
    HTTPCACHE_DIR = '.dev_cache'
    HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
Separate cache for dev and prod – best of both worlds.
Tip #4: Compress Cache to Save Space
# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_GZIP = True # Compress cached responses
Saves tons of disk space, especially with large pages.
Complete Example Spider
A production‑ready spider with smart caching:
# spider.py
import scrapy
class SmartCacheSpider(scrapy.Spider):
    name = 'smartcache'
    start_urls = ['https://example.com/products']

    custom_settings = {
        'HTTPCACHE_ENABLED': True,
        'HTTPCACHE_POLICY': 'scrapy.extensions.httpcache.RFC2616Policy',
        'HTTPCACHE_EXPIRATION_SECS': 3600,
        'HTTPCACHE_IGNORE_HTTP_CODES': [404, 500, 502, 503],
        'HTTPCACHE_GZIP': True,
        'HTTPCACHE_DIR': '.product',
    }

    def parse(self, response):
        # Parsing logic goes here
        pass