The Modern Scrapy Developer's Guide (Part 1): Building Your First Spider
Source: Dev.to
Scrapy Can Feel Daunting – But It Doesn’t Have To
It’s a massive, powerful framework, and the documentation can be overwhelming for a newcomer. Where do you even begin?
In this definitive guide, we’ll walk you through, step by step, how to build a real, multi‑page crawling spider. You’ll go from an empty folder to a clean JSON file of structured data in about 15 minutes. We’ll use modern async/await Python and cover:
- Project setup
- Finding selectors
- Following links (crawling)
- Saving your data
We’ll build a Scrapy spider that crawls the “Fantasy” category on books.toscrape.com, follows the “Next” button to crawl every page in that category, follows the link for every book, and scrapes the name, price, and URL from all 48 books, saving the result to a clean books.json file.
The Final Spider We’ll Build
# tutorial/spiders/books.py
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["toscrape.com"]

    # Starting URL (first page of the Fantasy category)
    start_urls = [
        "https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html"
    ]

    # ------------------------------------------------------------------
    # Async entry point (the modern replacement for start_requests) –
    # Scrapy calls this automatically when the spider starts
    # ------------------------------------------------------------------
    async def start(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse_listpage)

    # ------------------------------------------------------------------
    # Parse a category list page, follow book links and pagination
    # ------------------------------------------------------------------
    async def parse_listpage(self, response):
        # 1️⃣ Extract all book detail page URLs on the current list page
        product_urls = response.css("article.product_pod h3 a::attr(href)").getall()
        for url in product_urls:
            # `response.follow` correctly joins relative URLs
            yield response.follow(url, callback=self.parse_book)

        # 2️⃣ Follow the “Next” button, if it exists
        next_page_url = response.css("li.next a::attr(href)").get()
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse_listpage)

    # ------------------------------------------------------------------
    # Parse an individual book page and yield the scraped data
    # ------------------------------------------------------------------
    async def parse_book(self, response):
        yield {
            "name": response.css("h1::text").get(),
            "price": response.css("p.price_color::text").get(),
            "url": response.url,
        }
1️⃣ Project Setup
Prerequisite: Python 3.x installed.
We’ll use a virtual environment to keep dependencies isolated. You can use the standard venv + pip workflow or the modern uv tool.
Create a Project Folder & Virtual Environment
# Create a new folder
mkdir scrapy_project
cd scrapy_project
# Option 1: Standard venv + pip
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Option 2: Using uv (fast, modern alternative)
uv init
Install Scrapy
# Option 1: pip
pip install scrapy
# Option 2: uv
uv add scrapy
# (uv creates and manages .venv for you – run tools with `uv run`, e.g. `uv run scrapy`)
2️⃣ Generate the Scrapy Project Boilerplate
# The '.' tells Scrapy to create the project in the current folder
scrapy startproject tutorial .
You’ll now see a tutorial/ package and a scrapy.cfg file. The tutorial/ folder contains all the project logic.
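For orientation, the generated layout typically looks like this (exact contents can vary slightly between Scrapy versions):

scrapy.cfg              # deploy/config file
tutorial/
    __init__.py
    items.py            # item definitions (covered in Part 2)
    middlewares.py
    pipelines.py        # item pipelines (covered in Part 2)
    settings.py         # project-wide settings
    spiders/
        __init__.py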
Generate the First Spider
# Creates tutorial/spiders/books.py
scrapy genspider books toscrape.com
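The generated file is just a stub; it should look roughly like this (we replace it completely in step 5️⃣):

# tutorial/spiders/books.py – genspider stub (approximate)
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["toscrape.com"]
    start_urls = ["https://toscrape.com"]

    def parse(self, response):
        pass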
3️⃣ Adjust Project Settings
Open tutorial/settings.py and make the following changes.
Disable robots.txt (test site only)
# By default Scrapy obeys robots.txt – turn it off for this demo site
ROBOTSTXT_OBEY = False
Speed Up Crawling (test site only)
# Increase concurrency and remove download delay
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 0
⚠️ Warning: These settings are safe for the test site toscrape.com. When scraping real websites, always respect the target’s robots.txt and use polite concurrency/delay values.
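As a rough illustration (not a one-size-fits-all recipe), politer settings for a real target might look like the snippet below. All four are standard Scrapy settings, but the numbers are assumptions you should tune per site:

# settings.py – politer defaults for real-world crawling (illustrative values)
ROBOTSTXT_OBEY = True               # honour the site's robots.txt
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # keep per-domain pressure low
DOWNLOAD_DELAY = 1                  # seconds between requests to the same domain
AUTOTHROTTLE_ENABLED = True         # let Scrapy adapt its crawl speed automatically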
4️⃣ Explore the Site with scrapy shell
The Scrapy shell is perfect for discovering CSS selectors.
Open the shell on the Fantasy category page
scrapy shell https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html
You now have a response object you can query.
Find all book links
>>> response.css("article.product_pod h3 a::attr(href)").getall()
[
    '../../../../the-host_979/index.html',
    '../../../../the-hunted_978/index.html',
    # …
]
Find the “Next” page link (pagination)
>>> response.css("li.next a::attr(href)").get()
'page-2.html'
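That href is relative to the current page. You can preview the absolute URL it resolves to with response.urljoin, which applies the same joining logic response.follow uses in the spider:

>>> response.urljoin("page-2.html")
'https://books.toscrape.com/catalogue/category/books/fantasy_19/page-2.html'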
Open a shell on a single book page to get data selectors
scrapy shell https://books.toscrape.com/catalogue/the-host_979/index.html
>>> response.css("h1::text").get()
'The Host'
>>> response.css("p.price_color::text").get()
'£25.82'
Now you have all the selectors you need for the spider.
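One optional refinement: the price arrives as a string like '£25.82'. If you’d rather store a number, a small cleanup inside parse_book could look like this (a sketch, not required for this tutorial):

# Optional sketch: turn the price string into a float
raw_price = response.css("p.price_color::text").get()        # e.g. '£25.82'
price = float(raw_price.replace("£", "")) if raw_price else None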
5️⃣ Write the Spider
Replace the boilerplate in tutorial/spiders/books.py with the final spider code shown at the top of this guide (the async version). Save the file.
6️⃣ Run the Spider & Export to JSON
scrapy crawl books -o books.json
Scrapy will crawl all pages in the Fantasy category, follow each book link, extract the name, price, and URL, and write the results to books.json.
You should end up with a clean JSON file containing 48 entries, e.g.:
[
  {
    "name": "The Host",
    "price": "£25.82",
    "url": "https://books.toscrape.com/catalogue/the-host_979/index.html"
  },
  {
    "name": "The Hunted",
    "price": "£23.45",
    "url": "https://books.toscrape.com/catalogue/the-hunted_978/index.html"
  }
  // …
]
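A quick way to sanity-check the export is to load it back with Python’s json module and count the entries:

# verify_output.py – quick sanity check on books.json
import json

with open("books.json", encoding="utf-8") as f:
    books = json.load(f)

print(len(books))          # should print 48 for the Fantasy category
print(books[0]["name"])    # first scraped title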
🎉 You Did It!
You now have a fully functional, async Scrapy spider that:
- Starts from a category page
- Follows pagination automatically
- Visits each product page
- Extracts structured data
- Saves everything to a tidy JSON file
Feel free to experiment—add more fields, store data in a database, or adapt the spider for a different site. Happy crawling!
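For example, to add an availability field you could return to the scrapy shell, find a selector on a book page, and extend parse_book. The extra selector below is an assumption about the demo site’s markup – verify it in the shell before relying on it:

# Sketch: parse_book with one hypothetical extra field (selector unverified)
async def parse_book(self, response):
    availability_parts = response.css("p.instock.availability::text").getall()
    yield {
        "name": response.css("h1::text").get(),
        "price": response.css("p.price_color::text").get(),
        # assumed extra field: cleaned-up availability text
        "availability": " ".join(part.strip() for part in availability_parts).strip(),
        "url": response.url,
    }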
Scrapy Spider Overview
Below is a minimal async Scrapy spider that crawls a books catalogue, follows pagination, and extracts basic information from each product page.
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = [
        "https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html"
    ]

    # ------------------------------------------------------------------
    # 1️⃣ Spider entry point – called once when the spider starts.
    # ------------------------------------------------------------------
    async def start(self):
        # Yield the initial request(s); each response is handled by `parse_listpage`.
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse_listpage)

    # ------------------------------------------------------------------
    # 2️⃣ Parse the *category* (list) page.
    # ------------------------------------------------------------------
    async def parse_listpage(self, response):
        # 1️⃣ Get all product URLs from the current page.
        product_urls = response.css("article.product_pod h3 a::attr(href)").getall()

        # 2️⃣ Follow each product URL and send the response to `parse_book`.
        for url in product_urls:
            yield response.follow(url, callback=self.parse_book)

        # 3️⃣ Locate the “Next” page link (if any).
        next_page_url = response.css("li.next a::attr(href)").get()

        # 4️⃣ If a next page exists, follow it and recurse back to this method.
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse_listpage)

    # ------------------------------------------------------------------
    # 3️⃣ Parse the *product* (book) page.
    # ------------------------------------------------------------------
    async def parse_book(self, response):
        # Yield a dictionary containing the data we want to export.
        yield {
            "name": response.css("h1::text").get(),
            "price": response.css("p.price_color::text").get(),
            "url": response.url,
        }
Note:
response.follow automatically resolves relative URLs (e.g., page-2.html), so you don’t need to build full URLs yourself.
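For comparison, the manual equivalent would be to join the URL yourself with response.urljoin before creating the request – functionally the same thing response.follow does for you:

# Manual equivalent of response.follow – join the relative URL yourself
next_page_url = response.css("li.next a::attr(href)").get()
if next_page_url:
    absolute_url = response.urljoin(next_page_url)  # 'page-2.html' -> full URL
    yield scrapy.Request(absolute_url, callback=self.parse_listpage)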
Running the Spider
- Open a terminal at the project root.
- Execute the spider:
scrapy crawl books
You’ll see Scrapy start up and, in the logs, all 48 items being scraped.
Exporting the Data
Scrapy’s built‑in Feed Exporter makes saving results trivial. Use the -o (output) flag to write the scraped items to a file:
scrapy crawl books -o books.json
Running the spider with this command creates a books.json file in the project root, containing the 48 items in a clean, structured JSON format.
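Two related export details worth knowing: lowercase -o appends to an existing file (so re-running can leave you with an invalid JSON array), while -O overwrites it. You can also configure the export once in settings.py via the FEEDS setting instead of passing a flag on every run. A minimal sketch:

# settings.py – export configuration (roughly equivalent to passing -O books.json)
FEEDS = {
    "books.json": {
        "format": "json",
        "overwrite": True,
    },
}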
What You’ve Learned
- Set up a modern, async Scrapy project.
- Located CSS selectors for the data you need.
- Followed links and handled pagination automatically.
- Exported scraped data with a single command.
This is just the beginning!
💬 TALK: Stuck on this Scrapy code? Ask the maintainers and 5k+ developers in our Discord.
▶️ WATCH: This post was based on our video—watch the full walkthrough on our YouTube channel.
📩 READ: Want more? In Part 2 we’ll cover Scrapy Items and Pipelines. Subscribe to the Extract newsletter so you don’t miss it.