Scrapy Requests and Responses: The Complete Beginner's Guide (With Secrets the Docs Don't Tell You)
Source: Dev.to
1. The Basics
| Concept | What it means in Scrapy |
|---|---|
| Request | An object that says “I want to visit this URL”. |
| Response | An object that contains what the website sent back (HTML, JSON, etc.). |
Think of web scraping like a conversation:
Request: "Hey website, can you show me this page?"
Response: "Sure, here's the HTML!"
2. What Scrapy does behind the scenes
A minimal spider:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Do something with the response
        pass
What really happens
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url=url, callback=self.parse)
Scrapy automatically creates a Request for each URL in start_urls and hands each resulting Response to parse.
3. Creating Requests Manually
You can build requests yourself to control every detail:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.Request(
            url='https://example.com',
            callback=self.parse,
            method='GET',
            headers={'User-Agent': 'My Custom Agent'},
            cookies={'session': 'abc123'},
            meta={'page_num': 1},
            dont_filter=False,
            priority=0,
        )

    def parse(self, response):
        # Process the response
        pass
3.1 Request Parameters (quick reference)
| Parameter | Description |
|---|---|
| `url` | Target URL (required). |
| `callback` | Function that will receive the Response. Defaults to `parse`. |
| `method` | HTTP method (GET, POST, PUT, DELETE, …). Default: GET. |
| `body` | Raw request body (useful for POST, PUT). |
| `headers` | Custom request headers. |
| `cookies` | Cookies to send with the request. |
| `meta` | Arbitrary dict passed to the Response (`response.meta`). Great for sharing data between callbacks. |
| `dont_filter` | If `True`, Scrapy will not filter this URL as a duplicate. |
| `priority` | Integer priority; higher values are processed first (default = 0). |
Examples
# 1️⃣ Simple URL
yield scrapy.Request(url='https://example.com/products')

# 2️⃣ Custom callback
yield scrapy.Request(
    url='https://example.com/products',
    callback=self.parse_products,
)

def parse_products(self, response):
    # Handle the response here
    pass

# 3️⃣ POST request with JSON body
yield scrapy.Request(
    url='https://example.com/api',
    method='POST',
    body='{"key": "value"}',
    headers={'Content-Type': 'application/json'},
)

# 4️⃣ Custom headers
yield scrapy.Request(
    url='https://example.com',
    headers={
        'User-Agent': 'Mozilla/5.0',
        'Accept': 'text/html',
        'Referer': 'https://google.com',
    },
)

# 5️⃣ Cookies
yield scrapy.Request(
    url='https://example.com',
    cookies={'session_id': '12345', 'user': 'john'},
)

# 6️⃣ Passing data via meta
yield scrapy.Request(
    url='https://example.com/details',
    meta={'product_name': 'Widget', 'price': 29.99},
    callback=self.parse_details,
)

def parse_details(self, response):
    name = response.meta['product_name']
    price = response.meta['price']
    # Do something with name & price

# 7️⃣ Bypass the duplicate filter
yield scrapy.Request(
    url='https://example.com',
    dont_filter=True,
)

# 8️⃣ Prioritise a request
yield scrapy.Request(
    url='https://example.com/important',
    priority=10,  # processed before priority-0 requests
)
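For JSON APIs like example 3️⃣, newer Scrapy versions also ship `scrapy.http.JsonRequest`, which serialises a dict into the body and sets the Content-Type header for you (the method defaults to POST when `data` is given). A minimal sketch:

from scrapy.http import JsonRequest

yield JsonRequest(
    url='https://example.com/api',
    data={'key': 'value'},  # dict is JSON-encoded into the request body
)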
4. The Response Object
When a request finishes, Scrapy passes a Response to the callback. Here’s what you typically get:
def parse(self, response):
    # Basic attributes
    url = response.url          # Final URL (after redirects)
    body = response.body        # Raw bytes
    text = response.text        # Decoded string (uses the encoding declared by the response)
    status = response.status    # HTTP status code (200, 404, …)
    headers = response.headers  # Response headers (case-insensitive dict)

    # Links back to the request
    request = response.request  # The original Request object
    meta = response.meta        # Meta dict passed from the request
4.1 Selecting data
# CSS selectors (most readable)
titles = response.css('h1.title::text').getall()
first_title = response.css('h1.title::text').get()
# XPath selectors (more powerful)
titles = response.xpath('//h1[@class="title"]/text()').getall()
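Selectors also support regex extraction via `.re()` and `.re_first()` (part of the underlying parsel API), handy when the extracted text needs cleanup. A small sketch, assuming a hypothetical `span.price` element:

# Regex on top of a selector: returns strings, not selectors
price = response.css('span.price::text').re_first(r'[\d.]+')   # first match, or None
all_prices = response.css('span.price::text').re(r'[\d.]+')    # list of all matches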
4.2 Following links
# Manual way (verbose)
next_page = response.css('a.next::attr(href)').get()
if next_page:
    full_url = response.urljoin(next_page)
    yield scrapy.Request(full_url, callback=self.parse)

# Preferred way – `response.follow`
next_page = response.css('a.next::attr(href)').get()
if next_page:
    yield response.follow(next_page, callback=self.parse)

# You can even pass a selector directly:
yield response.follow(
    response.css('a.next')[0],
    callback=self.parse,
)

# Or iterate over all <a> tags:
for link in response.css('a'):
    yield response.follow(link, callback=self.parse_page)
`response.follow()` automatically:
- Handles relative URLs (`urljoin` internally).
- Extracts the `href` attribute when you give it a selector.
- Creates the `Request` object for you (including the default `callback`).
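On Scrapy 2.0 and newer there is also `response.follow_all()`, which builds a Request for every matched link in one call; its `css=`/`xpath=` shortcuts turn the iteration example above into a one-liner:

# Follow every matching link at once (Scrapy 2.0+)
yield from response.follow_all(css='a.next', callback=self.parse)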
5. Debugging & Introspection
Sometimes you need to peek at the original request that produced a response (especially after redirects).
def parse(self, response):
    # Original request data
    original_url = response.request.url
    original_headers = response.request.headers
    original_meta = response.request.meta

    # Log useful info
    self.logger.info(f'Requested: {original_url}')
    self.logger.info(f'Got back: {response.url}')  # May differ after redirects
TL;DR
- Requests are fully configurable objects (`url`, `method`, `headers`, `cookies`, `meta`, `priority`, …).
- Responses give you everything you need to extract data (`url`, `body`, `text`, `status`, `headers`, plus a back-reference to the original request).
- Use `response.follow()` for clean, concise link-following logic.
- Leverage `meta` to pass data between callbacks, and `priority`/`dont_filter` to control crawl order and duplicate handling.
Armed with these details, you can move beyond the basics and write robust, efficient Scrapy spiders that do exactly what you need—no hidden surprises. Happy crawling!
Scrapy Quick‑Reference Cheat Sheet
Below is a collection of useful Scrapy patterns, organised for quick reference.
1. Working with Response Headers
def parse(self, response):
    # Get all headers
    all_headers = response.headers

    # Get a specific header
    content_type = response.headers.get('Content-Type')

    # Check cookies the server sent back
    cookies = response.headers.getlist('Set-Cookie')

    # Useful for debugging blocks
    server = response.headers.get('Server')
    self.logger.info(f'Server type: {server}')
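A practical use of the Content-Type header is deciding how to parse the body. A minimal sketch (note that header values come back as bytes, and `response.json()` requires Scrapy 2.2+):

def parse(self, response):
    content_type = response.headers.get('Content-Type', b'')
    if b'application/json' in content_type:
        yield response.json()  # parse the body as JSON (Scrapy 2.2+)
    else:
        yield {'title': response.css('title::text').get()}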
2. Preserving meta Across Redirects
def start_requests(self):
    yield scrapy.Request(
        'https://example.com/redirect',
        meta={'important': 'data'},  # custom meta data
        callback=self.parse
    )

def parse(self, response):
    # Even after a redirect, the meta dict is still there!
    data = response.meta['important']

    # The final URL may be different
    self.logger.info(f'Ended up at: {response.url}')
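If you also want to see the redirect chain itself, Scrapy's built-in RedirectMiddleware records it in meta under the `redirect_urls` key:

def parse(self, response):
    # Populated by RedirectMiddleware; empty if no redirect happened
    chain = response.meta.get('redirect_urls', [])
    if chain:
        self.logger.info(f'Redirected via {chain} -> {response.url}')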
3. Controlling Crawl Order with priority
def parse_listing(self, response):
    # High priority for product pages (process first)
    for product in response.css('.product'):
        url = product.css('a::attr(href)').get()
        yield response.follow(
            url,
            callback=self.parse_product,
            priority=10  # higher number → earlier processing
        )

    # Low priority for pagination (process later)
    next_page = response.css('.next::attr(href)').get()
    if next_page:
        yield response.follow(
            next_page,
            callback=self.parse_listing,
            priority=0  # default priority
        )
Tip: Use higher priorities for “must‑have” pages and lower ones for pagination or auxiliary content.
4. Submitting Forms – FormRequest
a) Simple POST request
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login'

    def start_requests(self):
        yield scrapy.FormRequest(
            url='https://example.com/login',
            formdata={
                'username': 'myuser',
                'password': 'mypass'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        if 'Welcome' in response.text:
            self.logger.info('Login successful!')
        else:
            self.logger.error('Login failed!')
b) Auto‑fill a form from the page (from_response)
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login'
    start_urls = ['https://example.com/login']

    def parse(self, response):
        # Automatically locate the form, keep hidden fields (e.g., CSRF)
        # and submit the supplied data.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={
                'username': 'myuser',
                'password': 'mypass'
            },
            callback=self.after_login
        )

    def after_login(self, response):
        # Now you're logged in – continue crawling.
        yield response.follow('/dashboard', callback=self.parse_dashboard)
What FormRequest.from_response() does for you
- Finds the first `<form>` element (or the one matching `formname`/`formid`).
- Extracts all form fields, preserving hidden inputs (e.g., CSRF tokens).
- Overwrites the fields you provide in `formdata`.
- Submits the request.
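When a page contains several forms, you can point `from_response()` at the right one with its `formname`, `formid`, or `formnumber` arguments. A short sketch, assuming a hypothetical form with id="login-form":

yield scrapy.FormRequest.from_response(
    response,
    formid='login-form',  # matches <form id="login-form"> (hypothetical id)
    formdata={'username': 'myuser', 'password': 'mypass'},
    callback=self.after_login
)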
5. Pagination with meta (keeping track of the page number)
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/products?page=1',
            meta={'page': 1},
            callback=self.parse
        )

    def parse(self, response):
        page = response.meta['page']
        self.logger.info(f'Scraping page {page}')

        # Scrape products on the current page
        for product in response.css('.product'):
            yield {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get(),
                'page': page
            }

        # Follow the next page, incrementing the page counter
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(
                next_page,
                meta={'page': page + 1},
                callback=self.parse
            )
6. Chaining Requests – From a listing to a detail page
import scrapy

class DetailSpider(scrapy.Spider):
    name = 'details'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        """Scrape product listings and queue detail pages."""
        for product in response.css('.product'):
            item = {
                'name': product.css('h2::text').get(),
                'price': product.css('.price::text').get()
            }
            detail_url = product.css('a::attr(href)').get()
            yield response.follow(
                detail_url,
                callback=self.parse_detail,
                meta={'item': item}  # pass the partially-filled item forward
            )

    def parse_detail(self, response):
        """Enrich the item with data from the detail page."""
        item = response.meta['item']
        item['description'] = response.css('.description::text').get()
        item['rating'] = response.css('.rating::text').get()
        item['reviews'] = response.css('.reviews::text').get()
        yield item
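One closing note: since Scrapy 1.7, `cb_kwargs` is the recommended way to pass data between callbacks; the values arrive as named arguments instead of travelling through the shared `meta` dict. A minimal variant of the two callbacks above:

def parse(self, response):
    for product in response.css('.product'):
        item = {'name': product.css('h2::text').get()}
        yield response.follow(
            product.css('a::attr(href)').get(),
            callback=self.parse_detail,
            cb_kwargs={'item': item}  # delivered as a keyword argument
        )

def parse_detail(self, response, item):
    item['description'] = response.css('.description::text').get()
    yield item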