The Modern Scrapy Developer's Guide (Part 1): Building Your First Spider
Source: Dev.to
Scrapy Can Feel Daunting – But It Doesn’t Have To
It’s a massive, powerful framework, and the documentation can be overwhelming for a newcomer. Where do you even begin?
In this definitive guide, we’ll walk you through, step by step, how to build a real, multi‑page crawling spider. You’ll go from an empty folder to a clean JSON file of structured data in about 15 minutes. We’ll use modern async/await Python and cover:
- Project setup
- Finding selectors
- Following links (crawling)
- Saving your data
We’ll build a Scrapy spider that crawls the “Fantasy” category on books.toscrape.com, follows the “Next” button to crawl every page in that category, follows the link for every book, and scrapes the name, price, and URL from all 48 books, saving the result to a clean books.json file.
The Final Spider We’ll Build
# tutorial/spiders/books.py
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["toscrape.com"]

    # Starting URL (first page of the Fantasy category)
    start_urls = [
        "https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html"
    ]

    # ------------------------------------------------------------------
    # Async entry point (the modern replacement for start_requests) –
    # Scrapy calls this automatically when the spider starts
    # ------------------------------------------------------------------
    async def start(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse_listpage)

    # ------------------------------------------------------------------
    # Parse a category list page, follow book links and pagination
    # ------------------------------------------------------------------
    async def parse_listpage(self, response):
        # 1️⃣ Extract all book detail page URLs on the current list page
        product_urls = response.css("article.product_pod h3 a::attr(href)").getall()
        for url in product_urls:
            # `response.follow` correctly joins relative URLs
            yield response.follow(url, callback=self.parse_book)

        # 2️⃣ Follow the “Next” button, if it exists
        next_page_url = response.css("li.next a::attr(href)").get()
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse_listpage)

    # ------------------------------------------------------------------
    # Parse an individual book page and yield the scraped data
    # ------------------------------------------------------------------
    async def parse_book(self, response):
        yield {
            "name": response.css("h1::text").get(),
            "price": response.css("p.price_color::text").get(),
            "url": response.url,
        }
1️⃣ Project Setup
Prerequisite: Python 3.x installed.
We’ll use a virtual environment to keep dependencies isolated. You can use the standard venv + pip workflow or the modern uv tool.
Create a Project Folder & Virtual Environment
# Create a new folder
mkdir scrapy_project
cd scrapy_project
# Option 1: Standard venv + pip
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# Option 2: Using uv (fast, modern alternative)
uv init
Install Scrapy
# Option 1: pip
pip install scrapy
# Option 2: uv
uv add scrapy
# (uv creates and manages .venv for you – run tools with `uv run`, e.g. `uv run scrapy`)
2️⃣ Generate the Scrapy Project Boilerplate
# The '.' tells Scrapy to create the project in the current folder
scrapy startproject tutorial .
You’ll now see a tutorial/ package and a scrapy.cfg file. The tutorial/ folder contains all the project logic.
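For orientation, the generated layout typically looks like this (exact contents can vary slightly between Scrapy versions):

scrapy.cfg              # deploy/config file
tutorial/
    __init__.py
    items.py            # item definitions (covered in Part 2)
    middlewares.py
    pipelines.py        # item pipelines (covered in Part 2)
    settings.py         # project-wide settings
    spiders/
        __init__.py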
Generate the First Spider
# Creates tutorial/spiders/books.py
scrapy genspider books toscrape.com
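The generated file is just a stub; it should look roughly like this (we replace it completely in step 5️⃣):

# tutorial/spiders/books.py – genspider stub (approximate)
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["toscrape.com"]
    start_urls = ["https://toscrape.com"]

    def parse(self, response):
        pass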
3️⃣ Adjust Project Settings
Open tutorial/settings.py and make the following changes.
Disable robots.txt (test site only)
# By default Scrapy obeys robots.txt – turn it off for this demo site
ROBOTSTXT_OBEY = False
Speed Up Crawling (test site only)
# Increase concurrency and remove download delay
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 0
⚠️ Warning: These settings are safe for the test site toscrape.com. When scraping real websites, always respect the target’s robots.txt and use polite concurrency/delay values.
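As a rough illustration (not a one-size-fits-all recipe), politer settings for a real target might look like the snippet below. All four are standard Scrapy settings, but the numbers are assumptions you should tune per site:

# settings.py – politer defaults for real-world crawling (illustrative values)
ROBOTSTXT_OBEY = True               # honour the site's robots.txt
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # keep per-domain pressure low
DOWNLOAD_DELAY = 1                  # seconds between requests to the same domain
AUTOTHROTTLE_ENABLED = True         # let Scrapy adapt its crawl speed automatically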
4️⃣ Explore the Site with scrapy shell
The Scrapy shell is perfect for discovering CSS selectors.
Open the shell on the Fantasy category page
scrapy shell https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html
You now have a response object you can query.
Find all book links
>>> response.css("article.product_pod h3 a::attr(href)").getall()
[
    '../../../../the-host_979/index.html',
    '../../../../the-hunted_978/index.html',
    # …
]
Find the “Next” page link (pagination)
>>> response.css("li.next a::attr(href)").get()
'page-2.html'
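That href is relative to the current page. You can preview the absolute URL it resolves to with response.urljoin, which applies the same joining logic response.follow uses in the spider:

>>> response.urljoin("page-2.html")
'https://books.toscrape.com/catalogue/category/books/fantasy_19/page-2.html'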
Open a shell on a single book page to get data selectors
scrapy shell https://books.toscrape.com/catalogue/the-host_979/index.html
>>> response.css("h1::text").get()
'The Host'
>>> response.css("p.price_color::text").get()
'£25.82'
Now you have all the selectors you need for the spider.
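One optional refinement: the price arrives as a string like '£25.82'. If you’d rather store a number, a small cleanup inside parse_book could look like this (a sketch, not required for this tutorial):

# Optional sketch: turn the price string into a float
raw_price = response.css("p.price_color::text").get()        # e.g. '£25.82'
price = float(raw_price.replace("£", "")) if raw_price else None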
5️⃣ Write the Spider
Replace the boilerplate in tutorial/spiders/books.py with the final spider code shown at the top of this guide (the async version). Save the file.
6️⃣ Run the Spider & Export to JSON
scrapy crawl books -o books.json
Scrapy will crawl all pages in the Fantasy category, follow each book link, extract the name, price, and URL, and write the results to books.json.
You should end up with a clean JSON file containing 48 entries, e.g.:
[
  {
    "name": "The Host",
    "price": "£25.82",
    "url": "https://books.toscrape.com/catalogue/the-host_979/index.html"
  },
  {
    "name": "The Hunted",
    "price": "£23.45",
    "url": "https://books.toscrape.com/catalogue/the-hunted_978/index.html"
  }
  // …
]
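A quick way to sanity-check the export is to load it back with Python’s json module and count the entries:

# verify_output.py – quick sanity check on books.json
import json

with open("books.json", encoding="utf-8") as f:
    books = json.load(f)

print(len(books))          # should print 48 for the Fantasy category
print(books[0]["name"])    # first scraped title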
🎉 You Did It!
You now have a fully functional, async Scrapy spider that:
- Starts from a category page
- Follows pagination automatically
- Visits each product page
- Extracts structured data
- Saves everything to a tidy JSON file
Feel free to experiment—add more fields, store data in a database, or adapt the spider for a different site. Happy crawling!
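For example, to add an availability field you could return to the scrapy shell, find a selector on a book page, and extend parse_book. The extra selector below is an assumption about the demo site’s markup – verify it in the shell before relying on it:

# Sketch: parse_book with one hypothetical extra field (selector unverified)
async def parse_book(self, response):
    availability_parts = response.css("p.instock.availability::text").getall()
    yield {
        "name": response.css("h1::text").get(),
        "price": response.css("p.price_color::text").get(),
        # assumed extra field: cleaned-up availability text
        "availability": " ".join(part.strip() for part in availability_parts).strip(),
        "url": response.url,
    }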
Scrapy Spider Overview
Below is a minimal async Scrapy spider that crawls a books catalogue, follows pagination, and extracts basic information from each product page.
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = [
        "https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html"
    ]

    # ------------------------------------------------------------------
    # 1️⃣ Spider entry point – called once when the spider starts.
    # ------------------------------------------------------------------
    async def start(self):
        # Yield the initial request(s); each response is handled by `parse_listpage`.
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse_listpage)

    # ------------------------------------------------------------------
    # 2️⃣ Parse the *category* (list) page.
    # ------------------------------------------------------------------
    async def parse_listpage(self, response):
        # 1️⃣ Get all product URLs from the current page.
        product_urls = response.css("article.product_pod h3 a::attr(href)").getall()

        # 2️⃣ Follow each product URL and send the response to `parse_book`.
        for url in product_urls:
            yield response.follow(url, callback=self.parse_book)

        # 3️⃣ Locate the “Next” page link (if any).
        next_page_url = response.css("li.next a::attr(href)").get()

        # 4️⃣ If a next page exists, follow it and recurse back to this method.
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse_listpage)

    # ------------------------------------------------------------------
    # 3️⃣ Parse the *product* (book) page.
    # ------------------------------------------------------------------
    async def parse_book(self, response):
        # Yield a dictionary containing the data we want to export.
        yield {
            "name": response.css("h1::text").get(),
            "price": response.css("p.price_color::text").get(),
            "url": response.url,
        }
Note:
response.follow automatically resolves relative URLs (e.g., page-2.html), so you don’t need to build full URLs yourself.
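For comparison, the manual equivalent would be to join the URL yourself with response.urljoin before creating the request – functionally the same thing response.follow does for you:

# Manual equivalent of response.follow – join the relative URL yourself
next_page_url = response.css("li.next a::attr(href)").get()
if next_page_url:
    absolute_url = response.urljoin(next_page_url)  # 'page-2.html' -> full URL
    yield scrapy.Request(absolute_url, callback=self.parse_listpage)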
Running the Spider
- Open a terminal at the project root.
- Execute the spider:
scrapy crawl books
You’ll see Scrapy start up and, in the logs, all 48 items being scraped.
Exporting the Data
Scrapy’s built‑in Feed Exporter makes saving results trivial. Use the -o (output) flag to write the scraped items to a file:
scrapy crawl books -o books.json
Running the spider with this command creates a books.json file in the project root, containing the 48 items in a clean, structured JSON format.
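Two related export details worth knowing: lowercase -o appends to an existing file (so re-running can leave you with an invalid JSON array), while -O overwrites it. You can also configure the export once in settings.py via the FEEDS setting instead of passing a flag on every run. A minimal sketch:

# settings.py – export configuration (roughly equivalent to passing -O books.json)
FEEDS = {
    "books.json": {
        "format": "json",
        "overwrite": True,
    },
}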
What You’ve Learned
- Set up a modern, async Scrapy project.
- Located CSS selectors for the data you need.
- Followed links and handled pagination automatically.
- Exported scraped data with a single command.
This is just the beginning!
💬 TALK: Stuck on this Scrapy code? Ask the maintainers and 5k+ developers in our Discord.
▶️ WATCH: This post was based on our video—watch the full walkthrough on our YouTube channel.
📩 READ: Want more? In Part 2 we’ll cover Scrapy Items and Pipelines. Subscribe to the Extract newsletter so you don’t miss it.