Scrapy Requests and Responses: The Complete Beginner's Guide (With Secrets the Docs Don't Tell You)

Published: December 23, 2025 at 02:47 AM EST
6 min read
Source: Dev.to

1. The Basics

  • Request – An object that says “I want to visit this URL”.
  • Response – An object that contains what the website sent back (HTML, JSON, etc.).

Think of web scraping like a conversation:

Request:  "Hey website, can you show me this page?"
Response: "Sure, here's the HTML!"

2. What Scrapy does behind the scenes

A minimal spider:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Do something with response
        pass

What really happens

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url=url, callback=self.parse)

Scrapy automatically creates a Request for each URL in start_urls and delivers the resulting Response to the parse callback.

3. Creating Requests Manually

You can build requests yourself to control every detail:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.Request(
            url='https://example.com',
            callback=self.parse,
            method='GET',
            headers={'User-Agent': 'My Custom Agent'},
            cookies={'session': 'abc123'},
            meta={'page_num': 1},
            dont_filter=False,
            priority=0,
        )

    def parse(self, response):
        # Process response
        pass

3.1 Request Parameters (quick reference)

  • url – Target URL (required).
  • callback – Function that will receive the Response. Defaults to parse.
  • method – HTTP method (GET, POST, PUT, DELETE, …). Default: GET.
  • body – Raw request body (useful for POST, PUT).
  • headers – Custom request headers.
  • cookies – Cookies to send with the request.
  • meta – Arbitrary dict passed along to the Response (response.meta). Great for sharing data between callbacks.
  • dont_filter – If True, Scrapy will not filter this URL as a duplicate.
  • priority – Integer priority; higher values are processed first (default = 0).

Examples

# 1️⃣ Simple URL
yield scrapy.Request(url='https://example.com/products')

# 2️⃣ Custom callback
yield scrapy.Request(
    url='https://example.com/products',
    callback=self.parse_products,
)

def parse_products(self, response):
    # Handle response here
    pass

# 3️⃣ POST request with JSON body
yield scrapy.Request(
    url='https://example.com/api',
    method='POST',
    body='{"key": "value"}',
    headers={'Content-Type': 'application/json'},
)

# 4️⃣ Custom headers
yield scrapy.Request(
    url='https://example.com',
    headers={
        'User-Agent': 'Mozilla/5.0',
        'Accept': 'text/html',
        'Referer': 'https://google.com',
    },
)

# 5️⃣ Cookies
yield scrapy.Request(
    url='https://example.com',
    cookies={'session_id': '12345', 'user': 'john'},
)

# 6️⃣ Passing data via meta
yield scrapy.Request(
    url='https://example.com/details',
    meta={'product_name': 'Widget', 'price': 29.99},
    callback=self.parse_details,
)

def parse_details(self, response):
    name = response.meta['product_name']
    price = response.meta['price']
    # Do something with name & price

# 7️⃣ Bypass duplicate filter
yield scrapy.Request(
    url='https://example.com',
    dont_filter=True,
)

# 8️⃣ Prioritise a request
yield scrapy.Request(
    url='https://example.com/important',
    priority=10,   # processed before priority 0 requests
)
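
A note on example 3️⃣: if you work with JSON APIs a lot, Scrapy 1.8+ also ships scrapy.http.JsonRequest, which serializes a Python dict into the request body and sets the JSON Content-Type header for you. A minimal sketch, using the same hypothetical API endpoint (parse_api is a made-up callback name):

from scrapy.http import JsonRequest

yield JsonRequest(
    url='https://example.com/api',
    data={'key': 'value'},        # serialized into a JSON body for you
    callback=self.parse_api,      # hypothetical callback
)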

4. The Response Object

When a request finishes, Scrapy passes a Response to the callback. Here’s what you typically get:

def parse(self, response):
    # Basic attributes
    url      = response.url               # Final URL (after redirects)
    body     = response.body              # Raw bytes
    text     = response.text              # Decoded string (uses the response's encoding)
    status   = response.status            # HTTP status code (200, 404, …)
    headers  = response.headers           # Response headers (case-insensitive dict)

    # Links back to the request
    request  = response.request           # The original Request object
    meta     = response.meta              # Meta dict passed from the request
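
One thing to know about response.status: by default the HttpError middleware drops non-2xx responses before they ever reach your callback. If you want to inspect, say, a 404 page, you can allow specific codes per request with the handle_httpstatus_list meta key. A minimal sketch (the URL and callback name are placeholders):

yield scrapy.Request(
    url='https://example.com/maybe-missing',
    meta={'handle_httpstatus_list': [404]},   # let 404 responses reach the callback
    callback=self.parse_maybe_missing,        # hypothetical callback
)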

4.1 Selecting data

# CSS selectors (most readable)
titles = response.css('h1.title::text').getall()
first_title = response.css('h1.title::text').get()

# XPath selectors (more powerful)
titles = response.xpath('//h1[@class="title"]/text()').getall()

4.2 Following links

# Manual way (verbose)
next_page = response.css('a.next::attr(href)').get()
if next_page:
    full_url = response.urljoin(next_page)
    yield scrapy.Request(full_url, callback=self.parse)

# Preferred way – `response.follow`
next_page = response.css('a.next::attr(href)').get()
if next_page:
    yield response.follow(next_page, callback=self.parse)

# You can even pass a selector directly (no .get() needed):
for href in response.css('a.next::attr(href)'):
    yield response.follow(href, callback=self.parse)

# Or iterate over all <a> tags:
for link in response.css('a'):
    yield response.follow(link, callback=self.parse_page)

response.follow() automatically:

  • Handles relative URLs (urljoin internally).
  • Extracts the href attribute when you give it a selector.
  • Creates the Request object for you (including default callback).
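
On Scrapy 2.0 and newer there is also response.follow_all(), which builds a Request for every matching link in one call. A minimal sketch, assuming an a.next selector like the examples above:

def parse(self, response):
    # Follow every matching link in one go (Scrapy 2.0+)
    yield from response.follow_all(css='a.next', callback=self.parse)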

5. Debugging & Introspection

Sometimes you need to peek at the original request that produced a response (especially after redirects).

def parse(self, response):
    # Original request data
    original_url     = response.request.url
    original_headers = response.request.headers
    original_meta    = response.request.meta

    # Log useful info
    self.logger.info(f'Requested: {original_url}')
    self.logger.info(f'Got back: {response.url}')   # May differ after redirects
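
If the default RedirectMiddleware followed any redirects along the way, it also records the intermediate URLs in the request meta under the redirect_urls key. A small sketch of how that could look inside the same callback:

def parse(self, response):
    # redirect_urls is set by the built-in RedirectMiddleware when redirects occurred
    redirect_chain = response.meta.get('redirect_urls', [])
    if redirect_chain:
        self.logger.info(f'Redirected via: {redirect_chain} -> {response.url}')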

TL;DR

  • Requests are fully configurable objects (url, method, headers, cookies, meta, priority, …).
  • Responses give you everything you need to extract data (url, body, text, status, headers, plus a back‑reference to the original request).
  • Use response.follow() for clean, concise link‑following logic.
  • Leverage meta to pass data between callbacks, and priority/dont_filter to control crawl order and duplicate handling.

Armed with these details, you can move beyond the basics and write robust, efficient Scrapy spiders that do exactly what you need—no hidden surprises. Happy crawling!

Scrapy Quick‑Reference Cheat Sheet

Below is a quick-reference collection of useful Scrapy patterns you can drop straight into your own spiders.

1. Working with Response Headers

def parse(self, response):
    # Get all headers
    all_headers = response.headers

    # Get a specific header
    content_type = response.headers.get('Content-Type')

    # Check cookies the server sent back
    cookies = response.headers.getlist('Set-Cookie')

    # Useful for debugging blocks
    server = response.headers.get('Server')
    self.logger.info(f'Server type: {server}')

2. Preserving meta Across Redirects

def start_requests(self):
    yield scrapy.Request(
        'https://example.com/redirect',
        meta={'important': 'data'},   # custom meta data
        callback=self.parse
    )

def parse(self, response):
    # Even after a redirect, the meta dict is still there!
    data = response.meta['important']

    # The final URL may be different
    self.logger.info(f'Ended up at: {response.url}')
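
If you would rather see the redirect response itself instead of following it, the built-in middlewares honour two meta keys: dont_redirect and handle_httpstatus_list. A hedged sketch (the callback name is made up):

yield scrapy.Request(
    'https://example.com/redirect',
    meta={
        'dont_redirect': True,                 # RedirectMiddleware leaves this request alone
        'handle_httpstatus_list': [301, 302],  # let the 3xx response reach the callback
    },
    callback=self.parse_redirect_response,     # hypothetical callback
)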

3. Controlling Crawl Order with priority

def parse_listing(self, response):
    # High priority for product pages (process first)
    for product in response.css('.product'):
        url = product.css('a::attr(href)').get()
        yield response.follow(
            url,
            callback=self.parse_product,
            priority=10               # higher number → earlier processing
        )

    # Low priority for pagination (process later)
    next_page = response.css('.next::attr(href)').get()
    if next_page:
        yield response.follow(
            next_page,
            callback=self.parse_listing,
            priority=0                # default priority
        )

Tip: Use higher priorities for “must‑have” pages and lower ones for pagination or auxiliary content.

4. Submitting Forms – FormRequest

a) Simple POST request

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login'

    def start_requests(self):
        yield scrapy.FormRequest(
            url='https://example.com/login',
            formdata={
                'username': 'myuser',
                'password': 'mypass'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        if 'Welcome' in response.text:
            self.logger.info('Login successful!')
        else:
            self.logger.error('Login failed!')

b) Auto‑fill a form from the page (from_response)

class LoginSpider(scrapy.Spider):
    name = 'login'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        # Automatically locate the form, keep hidden fields (e.g., CSRF)
        # and submit the supplied data.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={
                'username': 'myuser',
                'password': 'mypass'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        # Now you're logged in – continue crawling.
        yield response.follow('/dashboard', callback=self.parse_dashboard)

What FormRequest.from_response() does for you

  1. Finds the first <form> element (or the one matching formname/formid).
  2. Extracts all form fields, preserving hidden inputs (e.g., CSRF tokens).
  3. Overwrites the fields you provide in formdata.
  4. Submits the request.
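
If the page contains more than one form, you can tell from_response() which one to use via formname, formid, formnumber, formcss, or formxpath. A minimal sketch, where the id login-form is purely hypothetical:

yield scrapy.FormRequest.from_response(
    response,
    formid='login-form',            # hypothetical id – select the form explicitly
    formdata={'username': 'myuser', 'password': 'mypass'},
    callback=self.after_login,
)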

5. Pagination with meta (keeping track of the page number)

import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/products?page=1',
            meta={'page': 1},
            callback=self.parse
        )

    def parse(self, response):
        page = response.meta['page']
        self.logger.info(f'Scraping page {page}')

        # Scrape products on the current page
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
                'page': page
            }

        # Follow the next page, incrementing the page counter
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(
                next_page,
                meta={'page': page + 1},
                callback=self.parse
            )
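
Since Scrapy 1.7 there is also cb_kwargs, which passes values straight to the callback as keyword arguments and keeps meta free for middleware-level data. The same pagination could be sketched like this (the spider name is a hypothetical variant):

import scrapy

class ProductSpiderKwargs(scrapy.Spider):
    name = 'products_kwargs'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/products?page=1',
            cb_kwargs={'page': 1},          # delivered as a keyword argument
            callback=self.parse,
        )

    def parse(self, response, page):
        self.logger.info(f'Scraping page {page}')
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(
                next_page,
                cb_kwargs={'page': page + 1},
                callback=self.parse,
            )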

6. Chaining Requests – From a listing to a detail page

import scrapy

class DetailSpider(scrapy.Spider):
    name = 'details'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        """Scrape product listings and queue detail pages."""
        for product in response.css('.product'):
            item = {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }

            detail_url = product.css('a::attr(href)').get()
            yield response.follow(
                detail_url,
                callback=self.parse_detail,
                meta={'item': item}          # pass the partially‑filled item forward
            )

    def parse_detail(self, response):
        """Enrich the item with data from the detail page."""
        item = response.meta['item']
        item['description'] = response.css('.description::text').get()
        item['rating'] = response.css('.rating::text').get()
        item['reviews'] = response.css('.reviews::text').get()
        yield item