How to Use rs-trafilatura with Scrapy
Published: April 3, 2026, 22:23 (GMT+8)
2 min read
Original article: Dev.to
Installation
pip install rs-trafilatura scrapy
Configure the pipeline
Add the pipeline to your Scrapy project's settings.py:
ITEM_PIPELINES = {
    "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,
}
Basic spider
Your spider needs to yield items that contain the response body (bytes) and the URL:
import scrapy


class ContentSpider(scrapy.Spider):
    name = "content"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {
            "url": response.url,
            "body": response.body,  # raw bytes; rs-trafilatura detects the encoding automatically
        }
        # Keep following links
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, self.parse)
The pipeline detects a body (bytes) or html (string) field, runs the extraction, and stores the result under item["extraction"].
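To make the two accepted item shapes concrete, here is a minimal sketch of a helper you could use to check an item before it reaches the pipeline. The function name `pick_html_source` is hypothetical, and checking `body` before `html` is an assumption; this is not rs-trafilatura's actual detection code.

```python
from typing import Optional, Union


def pick_html_source(item: dict) -> Optional[Union[bytes, str]]:
    """Return the raw page content an item carries, or None if it has neither field.

    Hypothetical helper: it mirrors the documented contract (a bytes `body`
    or a string `html`), not the pipeline's real implementation.
    """
    body = item.get("body")
    if isinstance(body, (bytes, bytearray)):
        return bytes(body)
    html = item.get("html")
    if isinstance(html, str):
        return html
    # Neither field is present in a usable form.
    return None
```

An item such as `{"url": response.url, "html": response.text}` would therefore be an alternative to the bytes-based item yielded by the spider, at the cost of letting Scrapy decode the page instead of rs-trafilatura.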
Example extraction result
{
    "url": "https://example.com/blog/post",
    "body": "...",
    "extraction": {
        "title": "Blog Post Title",
        "author": "John Doe",
        "date": "2026-01-15T00:00:00+00:00",
        "main_content": "The full extracted text...",
        "content_markdown": "# Blog Post Title\n\nThe full extracted text...",
        "page_type": "article",
        "extraction_quality": 0.95,
        "language": "en",
        "sitename": "Example Blog",
        "description": "A blog post about..."
    }
}
Enable Markdown output
Add the following to settings.py:
RS_TRAFILATURA_MARKDOWN = True
Once enabled, item["extraction"]["content_markdown"] contains GitHub-flavored Markdown.
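Because the Markdown output depends on a setting, downstream code may want to tolerate its absence. The following is a minimal sketch, assuming the field names shown in the example result; the helper name `best_text` is hypothetical and not part of rs-trafilatura.

```python
def best_text(item: dict) -> str:
    """Prefer the Markdown rendering and fall back to the plain extracted text.

    Hypothetical helper: returns an empty string when the item has no
    extraction or when both text fields are missing or empty.
    """
    ext = item.get("extraction") or {}
    return ext.get("content_markdown") or ext.get("main_content") or ""
```

This keeps exporters working whether or not RS_TRAFILATURA_MARKDOWN is turned on.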
Route items by page type
You can add a custom pipeline that routes items based on the page_type field:
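# myproject/pipelines.py
# save_product, save_forum_post, save_article and save_generic are not defined
# here; they are assumed to be your own storage functions.


class PageTypeRouter:
    def process_item(self, item, spider):
        ext = item.get("extraction", {})
        # Treat items without a page_type as articles
        page_type = ext.get("page_type", "article")
        if page_type == "product":
            save_product(item)
        elif page_type == "forum":
            save_forum_post(item)
        elif page_type == "article":
            save_article(item)
        else:
            save_generic(item)
        return item
Set the pipeline order in settings.py: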
ITEM_PIPELINES = {
    "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,
    "myproject.pipelines.PageTypeRouter": 400,
}
Filter low-quality extractions
# myproject/pipelines.py
import scrapy


class QualityFilter:
    def process_item(self, item, spider):
        ext = item.get("extraction", {})
        quality = ext.get("extraction_quality", 0)
        if quality ...

- rs-trafilatura GitHub repository:
- Rust crate:
- Scrapy documentation:
- Benchmark details: (GitHub, Zenodo)
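The QualityFilter example in this article breaks off at the quality check. One way such a filter might be completed is sketched below. The cutoff `MIN_QUALITY = 0.5` is an assumed placeholder, not a value from the original post, and dropping the item with Scrapy's `DropItem` exception is an assumption about what the author intended. The fallback class exists only so the sketch runs without Scrapy installed.

```python
# myproject/pipelines.py
try:
    from scrapy.exceptions import DropItem
except ImportError:
    class DropItem(Exception):
        """Stand-in so this sketch can run without Scrapy installed."""


# Assumed cutoff; the original value is not recoverable from the source.
MIN_QUALITY = 0.5


class QualityFilter:
    def process_item(self, item, spider):
        ext = item.get("extraction", {})
        # Items without a score are treated as quality 0 and dropped.
        quality = ext.get("extraction_quality", 0)
        if quality < MIN_QUALITY:
            raise DropItem(
                f"Low extraction quality ({quality:.2f}): {item.get('url')}"
            )
        return item
```

To take effect, a filter like this would need an ITEM_PIPELINES entry with a priority above 300, for example "myproject.pipelines.QualityFilter": 350, so that it runs after RsTrafilaturaPipeline has populated item["extraction"].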