The Modern Scrapy Developer's Guide (Part 1): Building Your First Spider

Published: 1 month ago (December 17, 2025, 02:41 GMT+8)

8 min read

Source: Dev.to

Scrapy Can Feel Daunting – But It Doesn’t Have To

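It is a large, powerful framework, and its documentation can overwhelm a newcomer. Where do you even start?

In this definitive guide, we will walk step by step through building a real spider that crawls multiple pages. In about 15 minutes you will go from an empty folder to a structured JSON file. We will use modern async/await Python and cover:

Project setup
Finding selectors
Following links (crawling)
Saving data

We will build a Scrapy spider that crawls the "Fantasy" category on books.toscrape.com, follows the "Next" button through every page of the category, visits each book's detail page, scrapes the name, price, and URL, and saves the results to a clean books.json file.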

The Final Spider We Will Build

# tutorial/spiders/books.py
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["toscrape.com"]

    # Starting URL (first page of the Fantasy category)
    start_urls = [
        "https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html"
    ]

    # ------------------------------------------------------------------
    # Async entry point: Scrapy 2.13+ calls start() to get the first requests
    # ------------------------------------------------------------------
    async def start(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse_listpage)

    # ------------------------------------------------------------------
    # Parse a category list page, follow book links and pagination
    # ------------------------------------------------------------------
    async def parse_listpage(self, response):
        # 1️⃣ Extract all book detail page URLs on the current list page
        product_urls = response.css("article.product_pod h3 a::attr(href)").getall()
        for url in product_urls:
            # `response.follow` correctly joins relative URLs
            yield response.follow(url, callback=self.parse_book)

        # 2️⃣ Follow the “Next” button, if it exists
        next_page_url = response.css("li.next a::attr(href)").get()
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse_listpage)

    # ------------------------------------------------------------------
    # Parse an individual book page and yield the scraped data
    # ------------------------------------------------------------------
    async def parse_book(self, response):
        yield {
            "name": response.css("h1::text").get(),
            "price": response.css("p.price_color::text").get(),
            "url": response.url,
        }
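To make the control flow of the spider above easier to follow, here is a minimal sketch of the same crawl pattern without Scrapy. The `FAKE_SITE` dictionary and the `crawl` function are hypothetical stand-ins for the real site and for Scrapy's scheduler; only the "follow book links, then follow Next" logic mirrors the spider.

```python
from collections import deque

# Hypothetical in-memory "site": each list page has book links and an optional next page.
FAKE_SITE = {
    "index.html": {"books": ["book-a", "book-b"], "next": "page-2.html"},
    "page-2.html": {"books": ["book-c"], "next": None},
}


def crawl(start_page):
    """Visit list pages in order, collecting one item per book link."""
    items = []
    queue = deque([start_page])
    while queue:
        page = FAKE_SITE[queue.popleft()]
        # Step 1: one "detail request" per book link on the page.
        for book in page["books"]:
            items.append({"name": book})
        # Step 2: follow the "Next" link if the page has one.
        if page["next"]:
            queue.append(page["next"])
    return items


print(len(crawl("index.html")))  # 3
```

Unlike this sequential sketch, Scrapy schedules requests concurrently, so the order of items in a real run is not guaranteed.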

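1️⃣ Project Setup

Prerequisite: Python 3.x installed.
We will use a virtual environment to isolate dependencies. You can use the standard venv + pip workflow or the modern uv tool.

Create the project folder and virtual environment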

# Create a new folder
mkdir scrapy_project
cd scrapy_project

# Option 1: Standard venv + pip
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

# Option 2: Using uv (fast, modern alternative)
uv init

Install Scrapy

# Option 1: pip
pip install scrapy

# Option 2: uv
uv add scrapy
# (uv creates and manages .venv itself; run commands through it with `uv run`, or activate .venv as in Option 1)
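If you are unsure whether the environment is set up correctly, a quick check like the following can help. This is only a sketch using the standard library; the function names `in_virtualenv` and `scrapy_installed` are hypothetical helpers, not part of Scrapy.

```python
import importlib.util
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    return sys.prefix != sys.base_prefix


def scrapy_installed():
    """Return True when the `scrapy` package can be found for import."""
    return importlib.util.find_spec("scrapy") is not None


print("virtualenv active:", in_virtualenv())
print("scrapy importable:", scrapy_installed())
```

Run it with the same interpreter you plan to use for the project, for example `python check_env.py` after activating `.venv` (the file name is arbitrary).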

2️⃣ Generate the Scrapy Project Template

# '.' tells Scrapy to create the project in the current folder
scrapy startproject tutorial .

You will now see a tutorial/ package and a scrapy.cfg file. The tutorial/ folder holds all of the project logic.

Generate Your First Spider

# Creates tutorial/spiders/books.py
scrapy genspider books toscrape.com

3️⃣ Adjust the Project Settings

Open tutorial/settings.py and make the following changes.

Disable robots.txt (test sites only)

# By default Scrapy obeys robots.txt – turn it off for this demo site
ROBOTSTXT_OBEY = False

Speed up the crawl (test sites only)

# Increase concurrency and remove download delay
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 0

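⚠️ Warning: These settings are safe for the toscrape.com test site. When crawling real websites, always respect the target site's robots.txt and use appropriate concurrency and delay values.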
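For contrast, a more conservative configuration for a real site might look like the sketch below. The setting names are standard Scrapy settings, but the values are illustrative assumptions and should be tuned for each target.

```python
# tutorial/settings.py (illustrative values for a real site; tune them per target)
ROBOTSTXT_OBEY = True                 # respect the site's robots.txt
CONCURRENT_REQUESTS_PER_DOMAIN = 2    # limit parallel requests to one domain
DOWNLOAD_DELAY = 1.0                  # seconds to wait between requests to the same site
AUTOTHROTTLE_ENABLED = True           # let Scrapy adapt the delay to server latency
```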

4️⃣ Explore the Site with `scrapy shell`

The Scrapy shell is ideal for discovering CSS selectors.

Open the shell on the Fantasy category page

scrapy shell https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html

You now have a response object you can query.

Find all book links

>>> response.css("article.product_pod h3 a::attr(href)").getall()
[
    '../../../the-host_979/index.html',
    '../../../the-hunted_978/index.html',
    # …
]

Find the "Next" link (pagination)

>>> response.css("li.next a::attr(href)").get()
'page-2.html'

Open the shell on a single book page to get the data selectors

scrapy shell https://books.toscrape.com/catalogue/the-host_979/index.html

>>> response.css("h1::text").get()
'The Host'

>>> response.css("p.price_color::text").get()
'£25.82'

You now have every selector the spider needs.
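The price selector returns a string such as '£25.82', currency symbol included. If you later want a numeric value, a small helper along these lines could convert it. This is a sketch only; `parse_price` is a hypothetical helper and is not used by the spider in this guide.

```python
from decimal import Decimal


def parse_price(raw):
    """Convert a price string such as '£25.82' to a Decimal, or None if empty."""
    if not raw:
        return None
    # Keep only ASCII digits and the decimal point; this drops the currency symbol.
    cleaned = "".join(ch for ch in raw if ch in "0123456789.")
    return Decimal(cleaned) if cleaned else None


print(parse_price("£25.82"))  # 25.82
```

Decimal is used instead of float so that prices keep their exact two-digit representation.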

5️⃣ Write the Spider

Replace the template code in tutorial/spiders/books.py with the final spider code (the async version) shown at the top of this guide. Save the file.

6️⃣ Run the Spider & Export to JSON

scrapy crawl books -o books.json

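Scrapy will crawl every page of the Fantasy category, follow each book's link, extract the name, price, and URL, and write the results to books.json.

You should end up with a clean JSON file containing 48 items, for example: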

[
  {
    "name": "The Host",
    "price": "£25.82",
    "url": "https://books.toscrape.com/catalogue/the-host_979/index.html"
  },
  {
    "name": "The Hunted",
    "price": "£23.45",
    "url": "https://books.toscrape.com/catalogue/the-hunted_978/index.html"
  }
  // …
]
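One caveat: -o appends to an existing file, so running the command twice against the same books.json can leave a file that is no longer valid JSON. Using -O (capital letter) overwrites the file instead. As an alternative to the command-line flag, the export can be declared in the settings file. The sketch below uses Scrapy's FEEDS setting; the option values are assumptions chosen for this tutorial.

```python
# tutorial/settings.py
FEEDS = {
    "books.json": {
        "format": "json",
        "encoding": "utf8",   # write "£" directly instead of an escape sequence
        "overwrite": True,    # replace the file on each run instead of appending
    },
}
```

With this in place, a plain `scrapy crawl books` writes books.json without any -o or -O flag.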

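🎉 You Did It!

You now have a fully working async Scrapy spider that:

Starts from a category page
Follows pagination automatically
Visits every product page
Extracts structured data
Saves everything to a clean JSON file

Feel free to experiment: add more fields, store the data in a database, or adapt the spider to another site. Happy scraping!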

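Scrapy Spider Overview

Below is a minimal async Scrapy spider that crawls a book catalogue, follows pagination, and extracts basic information from each product page.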

import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html"]

    # ------------------------------------------------------------------
    # 1️⃣  Spider entry point – called once when the spider starts.
    # ------------------------------------------------------------------
    async def start(self):
        # Yield one request per start URL; each response is handled by `parse_listpage`.
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse_listpage)

    # ------------------------------------------------------------------
    # 2️⃣  Parse the *category* (list) page.
    # ------------------------------------------------------------------
    async def parse_listpage(self, response):
        # 1️⃣  Get all product URLs from the current page.
        product_urls = response.css("article.product_pod h3 a::attr(href)").getall()

        # 2️⃣  Follow each product URL and send the response to `parse_book`.
        for url in product_urls:
            yield response.follow(url, callback=self.parse_book)

        # 3️⃣  Locate the “Next” page link (if any).
        next_page_url = response.css("li.next a::attr(href)").get()

        # 4️⃣  If a next page exists, follow it and recurse back to this method.
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse_listpage)

    # ------------------------------------------------------------------
    # 3️⃣  Parse the *product* (book) page.
    # ------------------------------------------------------------------
    async def parse_book(self, response):
        # Yield a dictionary containing the data we want to export.
        yield {
            "name":  response.css("h1::text").get(),
            "price": response.css("p.price_color::text").get(),
            "url":   response.url,
        }

Note: response.follow resolves relative URLs (for example, page-2.html) automatically, so you do not need to build the full URL yourself.
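To see what that resolution produces, the standard library's urljoin applies roughly the same rules to a page URL and a relative link. The sketch below uses the category URL from this guide; the second relative path is an assumed example of a book link that climbs three directories.

```python
from urllib.parse import urljoin

# URL of the list page the relative links were found on (from this guide).
base = "https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html"

# A pagination link: resolved against the directory of the current page.
print(urljoin(base, "page-2.html"))
# https://books.toscrape.com/catalogue/category/books/fantasy_19/page-2.html

# Each "../" climbs one directory before the rest of the path is appended.
print(urljoin(base, "../../../the-host_979/index.html"))
# https://books.toscrape.com/catalogue/the-host_979/index.html
```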

Run the Spider

Open a terminal in the project root.
Run the spider:

scrapy crawl books

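You will see Scrapy start up, and the log will show all 48 items being scraped.

Export the Data

Scrapy's built-in feed exports make saving results simple. Use the -o (output) flag to write the scraped items to a file: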

scrapy crawl books -o books.json

Running the spider with this command creates a books.json file in the project root containing the 48 items as clean, structured JSON.
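A quick way to confirm the export looks right is to load it and check that every item has the three fields the spider yields. The following is a minimal sketch; `check_export` is a hypothetical helper, and the second entry in `sample` is made-up data that shows an incomplete item.

```python
import json

REQUIRED_FIELDS = ("name", "price", "url")


def check_export(json_text):
    """Parse exported JSON text and return (item_count, incomplete_items)."""
    items = json.loads(json_text)
    # An item is incomplete if a field is absent or its value is None or empty.
    incomplete = [
        item for item in items
        if any(not item.get(field) for field in REQUIRED_FIELDS)
    ]
    return len(items), incomplete


sample = """[
  {"name": "The Host", "price": "£25.82",
   "url": "https://books.toscrape.com/catalogue/the-host_979/index.html"},
  {"name": null, "price": "£23.45", "url": "https://example.com/made-up"}
]"""

count, incomplete = check_export(sample)
print(f"{count} items, {len(incomplete)} incomplete")  # 2 items, 1 incomplete
```

To check the real export, pass the file contents instead of `sample`, for example `check_export(open("books.json", encoding="utf-8").read())`.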

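What You Learned

Set up a modern, async Scrapy project.
Found CSS selectors for the data you need.
Followed links and handled pagination automatically.
Exported the scraped data with a single command.

This is just the beginning!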

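💬 TALK: Stuck on this Scrapy code? Ask the maintainers and 5k+ developers in our Discord.

▶️ WATCH: This article is based on our video. Watch the full walkthrough on our YouTube channel.

📩 READ: Want more? In Part 2 we will cover Scrapy Items and Pipelines. Subscribe to the Extract newsletter so you don't miss it.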
