Scrapy HTTP Cache: The Complete Beginner's Guide (Stop Hammering Websites)

Published: December 26, 2025 at 01:51 PM EST
7 min read
Source: Dev.to

When I First Started Building Spiders

I used to test my spiders by running them over and over.
Each time I tweaked a selector, I’d run the spider again, hit the website again, and download the same pages again.

After my 50th run in one day, I realized something: I was being a terrible internet citizen. I was hammering some poor website with hundreds of requests just because I couldn’t write a CSS selector properly.

Then I discovered Scrapy’s HTTP cache – a game‑changer. Now when I test my spiders, they fetch pages once and reuse the cached responses. Faster testing. No guilt. No getting blocked.

Let me show you how to use caching properly.

What Is HTTP Cache?

Think of HTTP cache like a photocopy machine for webpages.

Without cache

  • Run spider → Downloads page
  • Fix selector → Run again → Downloads same page again
  • Fix another thing → Run again → Downloads same page again

You’re downloading the exact same page multiple times. Wasteful, slow, and annoying to the website.

With cache

  • Run spider → Downloads page → Saves a copy
  • Fix selector → Run again → Uses saved copy (no download!)
  • Fix another thing → Run again → Still using saved copy

You download once, test infinite times. The website only sees one request.

Enabling Cache (The One‑Liner)

Add this to your settings.py:

HTTPCACHE_ENABLED = True

That’s it. Scrapy now caches everything.

Run your spider:

scrapy crawl myspider

First run: downloads pages normally. Check your project folder – you’ll see a new .scrapy/httpcache/myspider/ directory where cached pages live.

Run it again:

scrapy crawl myspider

This time it’s lightning fast. No actual HTTP requests; everything comes from cache.

How Cache Works (The Simple Explanation)

  1. First request – Spider asks for a URL.
  2. Cache checks – “Do I have this page already?”
  3. Cache miss – No, it doesn’t.
  4. Download – Fetch from the website.
  5. Store – Save the response to cache.
  6. Return – Give the response to the spider.

Next time you request the same URL:

  1. Request – Spider asks for the same URL.
  2. Cache checks – “Do I have this page?”
  3. Cache hit – Yes!
  4. Return – Give the cached response (no download!).

Simple. Fast. Efficient.
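
If it helps to see that flow as code, here is a tiny conceptual sketch. It is an illustration only, not Scrapy's actual implementation; the real logic lives in the HttpCacheMiddleware and is driven entirely by settings.

# Conceptual sketch of the flow above - not Scrapy's real middleware.
def fetch(url, cache, download):
    cached = cache.get(url)       # "Do I have this page already?"
    if cached is not None:        # cache hit
        return cached             # reuse the saved copy, no download
    response = download(url)      # cache miss: fetch from the website
    cache[url] = response         # store the response for next time
    return response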

Basic Cache Settings

How Long to Keep Cache

By default, the cache never expires. You can set an expiration time:

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 86400  # 24 hours

After 24 hours, cached pages are re‑downloaded.

When to use expiration

  • Scraping news sites (content changes daily)
  • Product prices (change frequently)
  • Any dynamic content

When NOT to use expiration

  • Development (you want pages to stay cached)
  • Scraping static content
  • Historical data that doesn’t change

Where to Store Cache

Default location: .scrapy/httpcache/. Change it with:

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'my_custom_cache'

Now the cache lives in my_custom_cache/.

Ignore Certain Status Codes

Don’t cache error pages:

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_IGNORE_HTTP_CODES = [404, 500, 502, 503]

404s and 500s won’t be cached – you don’t want to cache broken pages.

Cache Policies (Two Flavors)

Scrapy provides two cache policies: DummyPolicy and RFC2616Policy.

DummyPolicy (The Simple One)

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'

  • Caches everything.
  • Never checks if the cache is fresh.
  • Never revalidates.

Use when: testing, offline development, or when you want to “replay” scrapes exactly.

RFC2616Policy (The Smart One)

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'

  • Respects HTTP caching headers (Cache‑Control, max‑age, etc.).
  • Revalidates when needed.

Use when: running production scrapers, respecting website caching rules, needing up‑to‑date data, or simply being a good internet citizen.
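
If you're curious what the policy is reacting to, you can log the caching headers yourself from a parse callback. This is purely an illustration; the policy evaluates these headers for you, and the snippet below is a hypothetical callback, not part of the policy's API.

# Inside a spider: peek at the headers RFC2616Policy bases its decisions on.
def parse(self, response):
    cache_control = response.headers.get('Cache-Control', b'').decode()
    self.logger.info('Cache-Control for %s: %s', response.url, cache_control)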

Real Example: Development vs. Production

Development Setup (Cache Everything)

# settings.py

# Development: cache everything forever
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
HTTPCACHE_DIR = '.dev_cache'
HTTPCACHE_EXPIRATION_SECS = 0  # Never expire

Perfect for testing – download once, test forever.

Production Setup (Smart Caching)

# settings.py

# Production: respect HTTP caching rules
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'
HTTPCACHE_DIR = '.prod_cache'
HTTPCACHE_EXPIRATION_SECS = 3600  # 1 hour
HTTPCACHE_IGNORE_HTTP_CODES = [404, 500, 502, 503]

Respects website rules and updates when needed.

Practical Workflow

Step 1: Enable Cache for Development

# settings.py
HTTPCACHE_ENABLED = True

From here you can tweak the other options (policy, directory, expiration, and so on) to suit your workflow.

Step 2: First Run (Populate Cache)

scrapy crawl myspider

This downloads all pages and caches them.

Step 3: Develop with Cache

Now you can run your spider hundreds of times without hitting the website:

# Run it again
scrapy crawl myspider

# Fix selector
# Run again
scrapy crawl myspider

# Fix another thing
# Run again
scrapy crawl myspider

All runs are instant because they are served from the cache.

Step 4: Clear Cache When Needed

When the website’s structure changes or you need fresh data:

rm -rf .scrapy/httpcache/

Then run the spider again to re‑populate the cache with fresh pages.
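
If you only want to reset one spider's cache and leave the others alone, a small helper like this also works. It assumes the default .scrapy/httpcache/ layout, and 'myspider' is a placeholder name.

# clear_one_cache.py - wipe a single spider's cache, leave the rest untouched.
import shutil
from pathlib import Path

spider_cache = Path('.scrapy/httpcache/myspider')  # placeholder spider name
if spider_cache.exists():
    shutil.rmtree(spider_cache)
    print(f'Cleared {spider_cache}')
else:
    print('Nothing to clear')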

Per‑Request Cache Control

You can disable caching for specific requests:

def parse(self, response):
    # This request won't be cached
    yield scrapy.Request(
        'https://example.com/dynamic',
        callback=self.parse_dynamic,
        meta={'dont_cache': True}
    )

Useful when some pages must be fresh while others can be cached.

Advanced: Storage Backends

Scrapy provides two storage backends for the HTTP cache: Filesystem (default) and DBM.

Filesystem (Default)

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Pros

  • Easy to inspect (just open files)
  • Works everywhere
  • Simple

Cons

  • Many small files
  • Slower with thousands of pages
  • Takes more disk space

DBM (Database)

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.DbmCacheStorage'

Pros

  • Faster with lots of pages
  • Fewer files
  • More efficient

Cons

  • Harder to inspect
  • Database‑specific issues
  • More complex

Tip: For most projects, stick with the Filesystem backend; it’s simpler.

Debugging with Cache

See What’s Cached

ls -R .scrapy/httpcache/

You’ll see a folder for each request. Inside each folder you’ll find:

  • request_body – the request that was made
  • request_headers – headers sent
  • response_body – HTML received
  • response_headers – response headers
  • meta – metadata
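
To peek at one of those entries from Python, something like the sketch below works. It assumes the default filesystem backend with HTTPCACHE_GZIP left off, and 'myspider' is a placeholder spider name.

# inspect_cache.py - print the first cached response body found for a spider.
from pathlib import Path

cache_root = Path('.scrapy/httpcache/myspider')   # placeholder spider name

for entry in sorted(cache_root.glob('*/*')):      # one directory per cached request
    body = (entry / 'response_body').read_bytes()
    print(entry.name, '-', len(body), 'bytes')
    print(body[:200])                             # first bytes of the cached HTML
    break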

Check If a Request Was Cached

Scrapy logs cache hits:

[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com> (referer: None) ['cached']

The ['cached'] suffix indicates a cache hit.

Without a cache hit the log looks like:

[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com> (referer: None)
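
You can also check the crawl stats instead of reading logs. Recent Scrapy versions record httpcache counters in the stats; the exact keys below ('httpcache/hit', 'httpcache/miss') are an assumption, so verify them against the stats dump your version prints at the end of a crawl.

# Print cache hit/miss counters when the spider closes.
import scrapy

class CacheStatsSpider(scrapy.Spider):
    name = 'cachestats'
    start_urls = ['https://example.com']

    def parse(self, response):
        pass

    def closed(self, reason):
        stats = self.crawler.stats.get_stats()
        # 'httpcache/hit' and 'httpcache/miss' are assumed stat keys - check
        # your version's end-of-crawl stats dump.
        self.logger.info('cache hits: %s', stats.get('httpcache/hit', 0))
        self.logger.info('cache misses: %s', stats.get('httpcache/miss', 0))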

Common Gotchas

Gotcha #1: Cache Persists Between Spider Runs

The cache survives across runs. To force fresh data, clear it manually:

rm -rf .scrapy/httpcache/

Or set an expiration time (see above).

Gotcha #2: Different Spiders Share the Same Cache Directory

If you have multiple spiders in one project, they share .scrapy/httpcache/, but each spider gets its own sub‑folder:

.scrapy/httpcache/
    spider1/
    spider2/
    spider3/

Gotcha #3: Don’t Assume POST Requests Bypass the Cache

Whether a POST request is cached depends on the policy. With DummyPolicy, the cache key is the request fingerprint, which includes the method and the body, so repeating an identical form submission can be answered from the cache:

# With DummyPolicy, re-running this identical request may be served from cache
yield scrapy.FormRequest(
    'https://example.com/search',
    formdata={'query': 'test'}
)

Under RFC2616Policy, POST responses are rarely cached because they seldom carry the caching headers the policy looks for. If a request must always reach the website, mark it with meta={'dont_cache': True} (see “Per‑Request Cache Control” above).

Gotcha #4: Redirects Are Cached Too

If a URL redirects, both the redirect response and the final page are cached, so subsequent runs replay the whole chain from the cache without touching the network:

https://example.com → https://www.example.com

Real‑World Scenarios

Scenario 1: Testing Selectors

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0   # never expire

Run once to populate the cache, then tweak selectors all day without hitting the site.

Scenario 2: Scraping Historical Data

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
HTTPCACHE_EXPIRATION_SECS = 0   # keep forever

Perfect for data that never changes (e.g., old articles).

Scenario 3: Production Scraper

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'
HTTPCACHE_EXPIRATION_SECS = 1800   # 30 minutes

Respects HTTP caching rules and refreshes after 30 minutes—a balanced approach.

Scenario 4: Offline Development

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_IGNORE_MISSING = True   # ignore requests that aren’t in the cache

Your spider will only use cached pages; requests that aren’t in the cache are ignored instead of downloaded. Ideal for working on a plane.

Tips Nobody Tells You

  • Version control the cache (or a subset) when collaborating to ensure everyone works with the same data.
  • Combine policies: use DummyPolicy for static pages and RFC2616Policy for dynamic ones by overriding HTTPCACHE_POLICY per‑spider (see the sketch after this list).
  • Monitor cache size: periodically run du -sh .scrapy/httpcache/ and clean old entries to avoid disk bloat.
  • Use HTTPCACHE_IGNORE_HTTP_CODES to prevent caching error pages (e.g., 404, 500).
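
A minimal sketch of that per‑spider override using custom_settings; the spider names and URLs are placeholders, not anything from a real project.

# Two hypothetical spiders in one project, each choosing its own cache policy.
import scrapy

class StaticPagesSpider(scrapy.Spider):
    name = 'static_pages'
    start_urls = ['https://example.com/archive']
    custom_settings = {
        'HTTPCACHE_ENABLED': True,
        'HTTPCACHE_POLICY': 'scrapy.extensions.httpcache.DummyPolicy',
    }

    def parse(self, response):
        pass

class LivePricesSpider(scrapy.Spider):
    name = 'live_prices'
    start_urls = ['https://example.com/prices']
    custom_settings = {
        'HTTPCACHE_ENABLED': True,
        'HTTPCACHE_POLICY': 'scrapy.extensions.httpcache.RFC2616Policy',
    }

    def parse(self, response):
        pass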

Tip #1: Use Cache for CI/CD

In continuous integration you don’t want to hit real websites. Use the Scrapy HTTP cache:

# settings.py for CI/CD
HTTPCACHE_ENABLED = True
HTTPCACHE_IGNORE_MISSING = True  # Don't download pages that aren't already cached

Pre‑populate the cache in your repo. Tests run against cached pages – fast and reliable.

Tip #2: Share Cache Between Developers

Commit the cache folder to version control:

git add .scrapy/httpcache/
git commit -m "Add test cache"

Now everyone on the team uses the same cached pages for testing, giving consistent results.

Tip #3: Different Cache for Different Environments

# settings.py
import os

HTTPCACHE_ENABLED = True

if os.getenv('ENV') == 'production':
    HTTPCACHE_DIR = '.prod_cache'
    HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.RFC2616Policy'
else:
    HTTPCACHE_DIR = '.dev_cache'
    HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'

Separate cache for dev and prod – best of both worlds.

Tip #4: Compress Cache to Save Space

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_GZIP = True  # Compress cached responses

Saves tons of disk space, especially with large pages.

Complete Example Spider

A production‑ready spider with smart caching:

# spider.py
import scrapy

class SmartCacheSpider(scrapy.Spider):
    name = 'smartcache'
    start_urls = ['https://example.com/products']

    custom_settings = {
        'HTTPCACHE_ENABLED': True,
        'HTTPCACHE_POLICY': 'scrapy.extensions.httpcache.RFC2616Policy',
        'HTTPCACHE_EXPIRATION_SECS': 3600,
        'HTTPCACHE_IGNORE_HTTP_CODES': [404, 500, 502, 503],
        'HTTPCACHE_GZIP': True,
        'HTTPCACHE_DIR': '.product'
    }

    def parse(self, response):
        # parsing logic here
        pass