How to Use rs-trafilatura with Scrapy

Published: April 3, 2026, 22:23 GMT+8

Source: Dev.to

Installation

pip install rs-trafilatura scrapy

Configure the pipeline

Add the pipeline to your Scrapy project's settings.py:

ITEM_PIPELINES = {
    "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,
}

Basic spider

Your spider needs to yield items that contain the response body (bytes) and the URL:

import scrapy

class ContentSpider(scrapy.Spider):
    name = "content"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {
            "url": response.url,
            "body": response.body,  # raw bytes; rs-trafilatura detects the encoding automatically
        }

        # Keep following links
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, self.parse)

The pipeline looks for a body (bytes) or html (string) field, runs the extraction, and stores the result under item["extraction"].
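The field lookup described above can be pictured with a small stand-in. This is only a sketch under stated assumptions: find_html_source is a hypothetical helper and not part of rs-trafilatura, and both the precedence of body over html and the pass-through of items with neither field are assumed rather than confirmed.

```python
from typing import Optional, Union


def find_html_source(item: dict) -> Optional[Union[bytes, str]]:
    """Return the raw page content a pipeline could extract from, if any.

    Hypothetical helper: it mirrors the described behaviour (bytes under
    "body", or a string under "html") but is not the library's own code.
    """
    body = item.get("body")
    if isinstance(body, (bytes, bytearray)) and body:
        return bytes(body)  # assumed to take precedence over "html"
    html = item.get("html")
    if isinstance(html, str) and html:
        return html
    return None  # assumed: with neither field there is nothing to extract


bytes_item = {"url": "https://example.com", "body": b"<html><p>Hi</p></html>"}
text_item = {"url": "https://example.com", "html": "<html><p>Hi</p></html>"}
empty_item = {"url": "https://example.com"}

print(type(find_html_source(bytes_item)).__name__)
print(type(find_html_source(text_item)).__name__)
print(find_html_source(empty_item))
```

Yielding response.body keeps the raw bytes, so encoding detection is left to the extractor; yielding response.text under html would hand over an already decoded string instead.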

Example extraction result

{
    "url": "https://example.com/blog/post",
    "body": "...",
    "extraction": {
        "title": "Blog Post Title",
        "author": "John Doe",
        "date": "2026-01-15T00:00:00+00:00",
        "main_content": "The full extracted text...",
        "content_markdown": "# Blog Post Title\n\nThe full extracted text...",
        "page_type": "article",
        "extraction_quality": 0.95,
        "language": "en",
        "sitename": "Example Blog",
        "description": "A blog post about..."
    }
}
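To show how the nested result might be consumed downstream, here is a minimal sketch that flattens an item shaped like the example above into a single-level row. The summarize function and its output keys are hypothetical names chosen for illustration; only the input field names come from the example.

```python
from datetime import datetime


def summarize(item: dict) -> dict:
    """Flatten an item shaped like the example above into a single-level row."""
    ext = item.get("extraction", {})
    date_raw = ext.get("date")
    # The example date is ISO 8601 with a UTC offset, which fromisoformat accepts.
    published = datetime.fromisoformat(date_raw) if date_raw else None
    return {
        "url": item.get("url"),
        "title": ext.get("title"),
        "author": ext.get("author"),
        "published_year": published.year if published else None,
        "word_count": len((ext.get("main_content") or "").split()),
        "quality": ext.get("extraction_quality"),
    }


item = {
    "url": "https://example.com/blog/post",
    "extraction": {
        "title": "Blog Post Title",
        "author": "John Doe",
        "date": "2026-01-15T00:00:00+00:00",
        "main_content": "The full extracted text...",
        "extraction_quality": 0.95,
    },
}

row = summarize(item)
print(row)
```

Using .get throughout means an item without an extraction key still produces a row, with None in the missing columns.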

Enable Markdown output

Add the following to settings.py:

RS_TRAFILATURA_MARKDOWN = True

When enabled, item["extraction"]["content_markdown"] contains GitHub-flavored Markdown.
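One possible use of the Markdown field is writing each page to its own .md file. The sketch below assumes the item shape shown earlier; markdown_filename and save_markdown are hypothetical helpers, and the URL-to-filename scheme is an arbitrary choice for illustration.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse


def markdown_filename(url: str) -> str:
    """Build a filesystem-safe .md name from the URL path (hypothetical scheme)."""
    path = urlparse(url).path.strip("/") or "index"
    slug = re.sub(r"[^A-Za-z0-9]+", "-", path).strip("-").lower()
    return f"{slug or 'index'}.md"


def save_markdown(item: dict, out_dir: Path) -> Optional[Path]:
    """Write content_markdown to out_dir and return the path, or None if absent."""
    markdown = item.get("extraction", {}).get("content_markdown")
    if not markdown:
        return None  # Markdown output not enabled, or nothing was extracted
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / markdown_filename(item["url"])
    target.write_text(markdown, encoding="utf-8")
    return target


item = {
    "url": "https://example.com/blog/post",
    "extraction": {"content_markdown": "# Blog Post Title\n\nThe full extracted text..."},
}

# A temporary directory keeps the demonstration from leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_markdown(item, Path(tmp))
    saved_name = saved.name
    saved_text = saved.read_text(encoding="utf-8")

print(saved_name)
```

Two URLs whose paths differ only in punctuation would map to the same filename under this scheme, so a real crawl would likely need a collision-safe name such as a hash of the URL.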

Route items by page type

You can add a custom pipeline that routes items according to the page_type field:

# myproject/pipelines.py
class PageTypeRouter:
    def process_item(self, item, spider):
        ext = item.get("extraction", {})
        page_type = ext.get("page_type", "article")

        if page_type == "product":
            save_product(item)
        elif page_type == "forum":
            save_forum_post(item)
        elif page_type == "article":
            save_article(item)
        else:
            save_generic(item)

        return item

Set the pipeline order in settings.py:

ITEM_PIPELINES = {
    "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,
    "myproject.pipelines.PageTypeRouter": 400,
}
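Scrapy runs item pipelines in ascending order of their integer values, so the extractor at 300 runs before the router at 400 and the router can rely on item["extraction"] being present. The short sketch below only illustrates that ordering rule with plain Python; it does not call Scrapy itself.

```python
ITEM_PIPELINES = {
    "myproject.pipelines.PageTypeRouter": 400,
    "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,
}

# Lower values run first, regardless of the order the keys are written in.
execution_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for position, path in enumerate(execution_order, start=1):
    print(position, path)
```

Leaving gaps between the numbers makes it easy to slot another pipeline in between later without renumbering the rest.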

Filter low-quality extractions

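# myproject/pipelines.py
from scrapy.exceptions import DropItem

# Assumed cutoff (hypothetical value): tune it for your own crawl.
MIN_QUALITY = 0.5

class QualityFilter:
    def process_item(self, item, spider):
        ext = item.get("extraction", {})
        quality = ext.get("extraction_quality", 0)

        if quality < MIN_QUALITY:
            # DropItem stops the item from reaching later pipelines.
            raise DropItem(f"Low extraction quality: {quality}")

        return item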
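Before settling on a cutoff, it can help to check how many items a given threshold would drop. The sketch below uses made-up items and a hypothetical split_by_quality helper; the 0.5 cutoff is an illustrative assumption, not a value recommended by rs-trafilatura.

```python
def split_by_quality(items, threshold):
    """Partition items into (kept, dropped) by extraction_quality."""
    kept, dropped = [], []
    for item in items:
        # Items without an extraction are treated as quality 0.
        quality = item.get("extraction", {}).get("extraction_quality", 0)
        (kept if quality >= threshold else dropped).append(item)
    return kept, dropped


# Made-up sample items for illustration only.
sample = [
    {"url": "https://example.com/a", "extraction": {"extraction_quality": 0.95}},
    {"url": "https://example.com/b", "extraction": {"extraction_quality": 0.30}},
    {"url": "https://example.com/c"},  # no extraction at all
]

kept, dropped = split_by_quality(sample, threshold=0.5)
print(len(kept), "kept;", len(dropped), "dropped")
```

Running this over a sample of real crawl output at a few different thresholds gives a rough sense of how aggressive a filter would be before it is wired into the pipeline.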

rs-trafilatura GitHub repository:
Rust crate:
Scrapy documentation:
Benchmark details: (GitHub, Zenodo)
