How to Use rs-trafilatura with Scrapy

Published: April 3, 2026 at 10:23 AM EDT
Source: Dev.to

Installation

pip install rs-trafilatura scrapy

Configure the pipeline

Add the pipeline to your Scrapy project’s settings.py:

ITEM_PIPELINES = {
    "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,
}

Basic spider

Your spider yields items with the response body (bytes) and URL:

import scrapy

class ContentSpider(scrapy.Spider):
    name = "content"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield {
            "url": response.url,
            "body": response.body,  # raw bytes — rs-trafilatura auto‑detects encoding
        }

        # Follow links
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, self.parse)

The pipeline detects a body (bytes) or html (string) field, runs extraction, and adds the results under item["extraction"].
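As a rough illustration of the behavior described above, the sketch below mimics the field detection in plain Python. It is not the library's implementation: `attach_extraction` and `fake_extract` are hypothetical stand-ins, and checking `body` before `html` is an assumption.

```python
from typing import Any, Callable, Dict, Union

Html = Union[bytes, str]


def attach_extraction(
    item: Dict[str, Any],
    extract: Callable[[Html], Dict[str, Any]],
) -> Dict[str, Any]:
    """Run `extract` on the item's page source and store the result.

    Hypothetical helper: looks for a bytes "body" first, then a string
    "html", and leaves the item untouched when neither is present.
    """
    source = None
    if isinstance(item.get("body"), bytes):
        source = item["body"]
    elif isinstance(item.get("html"), str):
        source = item["html"]

    if source is not None:
        item["extraction"] = extract(source)
    return item


def fake_extract(source: Html) -> Dict[str, Any]:
    """Stand-in extractor used only to make the sketch runnable."""
    return {"main_content": f"{len(source)} units of input"}
```

Passing the extractor in as a callable keeps the sketch independent of any particular extraction API.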

Extraction result example

{
    "url": "https://example.com/blog/post",
    "body": "...",
    "extraction": {
        "title": "Blog Post Title",
        "author": "John Doe",
        "date": "2026-01-15T00:00:00+00:00",
        "main_content": "The full extracted text...",
        "content_markdown": "# Blog Post Title\n\nThe full extracted text...",
        "page_type": "article",
        "extraction_quality": 0.95,
        "language": "en",
        "sitename": "Example Blog",
        "description": "A blog post about..."
    }
}
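Downstream code can read these fields like any nested dict. A small sketch, where `summarize` is a hypothetical helper and the field names are taken from the example above:

```python
def summarize(item):
    """Flatten an item into a small dict suitable for logging or CSV export."""
    ext = item.get("extraction", {})
    return {
        "url": item.get("url"),
        "title": ext.get("title"),
        "date": ext.get("date"),
        # Guard against a missing or empty main_content field.
        "words": len((ext.get("main_content") or "").split()),
        "quality": ext.get("extraction_quality"),
    }
```

Using `.get()` throughout means an item that skipped extraction yields a row of empty values instead of raising `KeyError`.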

Enable Markdown output

Add to settings.py:

RS_TRAFILATURA_MARKDOWN = True

When enabled, item["extraction"]["content_markdown"] contains GitHub‑Flavored Markdown.
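One possible use of the Markdown field is to write each page to its own file. The pipeline below is a minimal sketch: `MarkdownWriter`, the `markdown_out` directory, and the hashed file names are all assumptions, not part of rs-trafilatura.

```python
import hashlib
from pathlib import Path


class MarkdownWriter:
    """Hypothetical pipeline: save each item's Markdown to its own file."""

    def __init__(self, out_dir="markdown_out"):
        self.out_dir = Path(out_dir)

    def process_item(self, item, spider):
        markdown = item.get("extraction", {}).get("content_markdown")
        if markdown:
            self.out_dir.mkdir(parents=True, exist_ok=True)
            # Hash the URL to get a stable, filesystem-safe file name.
            name = hashlib.sha256(item["url"].encode("utf-8")).hexdigest()[:16]
            (self.out_dir / f"{name}.md").write_text(markdown, encoding="utf-8")
        return item
```

To use something like this, register it in ITEM_PIPELINES with a number higher than the extraction pipeline's 300 so that it runs after extraction.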

Routing items by page type

You can add a custom pipeline that routes items based on the page_type field:

# myproject/pipelines.py
class PageTypeRouter:
    def process_item(self, item, spider):
        ext = item.get("extraction", {})
        page_type = ext.get("page_type", "article")

        if page_type == "product":
            save_product(item)
        elif page_type == "forum":
            save_forum_post(item)
        elif page_type == "article":
            save_article(item)
        else:
            save_generic(item)

        return item

Set the pipeline order in settings.py so that the router runs after extraction (lower numbers run first):

ITEM_PIPELINES = {
    "rs_trafilatura.scrapy.RsTrafilaturaPipeline": 300,
    "myproject.pipelines.PageTypeRouter": 400,
}
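The router calls four helpers that the article does not define. A minimal sketch of them, assuming one JSON Lines file per page type; the `routed_items` directory, the file names, and the record layout are illustrative choices only:

```python
import json
from pathlib import Path

OUT_DIR = Path("routed_items")  # assumed output location


def _append_jsonl(filename, item):
    """Append the URL and extraction fields as one JSON line.

    The raw "body" bytes are left out because they are not JSON-serializable.
    """
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    record = {"url": item.get("url"), **item.get("extraction", {})}
    with open(OUT_DIR / filename, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def save_product(item):
    _append_jsonl("products.jsonl", item)


def save_forum_post(item):
    _append_jsonl("forum_posts.jsonl", item)


def save_article(item):
    _append_jsonl("articles.jsonl", item)


def save_generic(item):
    _append_jsonl("generic.jsonl", item)
```

In a real project these would more likely write to a database or queue; flat files just keep the sketch self-contained.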

Filtering low‑quality extractions

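# myproject/pipelines.py
from scrapy.exceptions import DropItem

# Assumed cutoff: the article's original value was lost, so tune this yourself.
MIN_QUALITY = 0.5

class QualityFilter:
    def process_item(self, item, spider):
        ext = item.get("extraction", {})
        quality = ext.get("extraction_quality", 0)

        # Drop items whose extraction quality falls below the cutoff
        if quality < MIN_QUALITY:
            raise DropItem(f"Low extraction quality ({quality}): {item.get('url')}")

        return item

  • rs‑trafilatura GitHub repository:
  • Rust crate:
  • Scrapy documentation:
  • Benchmark details: (GitHub, Zenodo)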