How to Use rs-trafilatura with crawl4ai

Published: April 3, 2026 at 10:19 AM EDT
Source: Dev.to


Installation

pip install rs-trafilatura crawl4ai

If this is your first time with crawl4ai, install the Playwright browsers as well:

python -m playwright install chromium

Basic usage with RsTrafilaturaStrategy

import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from rs_trafilatura.crawl4ai import RsTrafilaturaStrategy

async def main():
    strategy = RsTrafilaturaStrategy()
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)

    data = json.loads(result.extracted_content)
    item = data[0]

    print(f"Title: {item['title']}")
    print(f"Page type: {item['page_type']}")
    print(f"Quality: {item['extraction_quality']}")
    print(f"Content: {item['main_content'][:200]}")

asyncio.run(main())

The extracted_content field is a JSON string containing an array with a single item, the extraction result. crawl4ai serialises it automatically, so you only need to pass it through json.loads().
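Because extracted_content arrives as a string, it can be worth guarding the parse before indexing into it. The following is a minimal sketch, not part of either library: `first_item` is a hypothetical helper name, and the idea that the field may be empty or None when extraction fails is an assumption.

```python
import json
from typing import Any, Optional


def first_item(extracted_content: Optional[str]) -> Optional[dict[str, Any]]:
    """Return the first extraction result, or None if nothing usable is present."""
    # Assumption: a failed extraction may leave the field empty or None.
    if not extracted_content:
        return None
    try:
        data = json.loads(extracted_content)
    except json.JSONDecodeError:
        return None
    # The strategy is described as emitting a JSON array with a single item.
    if isinstance(data, list) and data and isinstance(data[0], dict):
        return data[0]
    return None
```

With this in place, `item = first_item(result.extracted_content)` can be checked against None before any field is read.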

Extraction result fields

| Field | Description |
| --- | --- |
| title | Page title |
| author | Author name (if detected) |
| date | Publication date (ISO 8601) |
| main_content | Clean extracted text |
| content_markdown | Markdown output (if enabled) |
| page_type | One of: article, forum, product, collection, listing, documentation, service |
| extraction_quality | Confidence score from 0.0 to 1.0 |
| language | Detected language |
| sitename | Site name |
| description | Meta description |
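If you pass results further down a pipeline, a typed record can be easier to work with than a raw dict. This is a sketch under assumptions: the `Extraction` class and `to_extraction` function are hypothetical names, only a subset of the fields is modelled, and optional fields are assumed to be missing or null when not detected.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Any, Optional


@dataclass
class Extraction:
    title: Optional[str]
    author: Optional[str]
    published: Optional[date]
    main_content: str
    page_type: Optional[str]
    extraction_quality: float


def to_extraction(item: dict[str, Any]) -> Extraction:
    """Convert one parsed result dict into a typed record."""
    raw_date = item.get("date")
    published = None
    if isinstance(raw_date, str) and raw_date:
        try:
            # Accepts "2026-04-03" as well as timestamps like "2026-04-03T10:19:00".
            published = datetime.fromisoformat(raw_date).date()
        except ValueError:
            # Unparseable dates are dropped rather than raised.
            published = None
    return Extraction(
        title=item.get("title"),
        author=item.get("author"),
        published=published,
        main_content=item.get("main_content") or "",
        page_type=item.get("page_type"),
        extraction_quality=float(item.get("extraction_quality") or 0.0),
    )
```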

Getting Markdown alongside plain text

# Imports are the same as in the basic example above.
async def main():
    strategy = RsTrafilaturaStrategy(output_markdown=True)
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)

    data = json.loads(result.extracted_content)
    markdown = data[0]["content_markdown"]
    print(markdown)

asyncio.run(main())

The Markdown is GitHub‑Flavored, preserving headings, lists, tables, bold/italic, code blocks, and links.
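Since content_markdown is only populated when Markdown output is enabled, downstream code may want a single accessor that falls back to plain text. A small sketch, with `best_text` as a hypothetical helper name and the assumption that a disabled or failed Markdown conversion leaves the field missing, null, or empty:

```python
from typing import Any


def best_text(item: dict[str, Any]) -> str:
    """Prefer the Markdown rendering, falling back to the plain-text content."""
    markdown = item.get("content_markdown")
    if markdown:
        return markdown
    # Fall back to main_content; return an empty string if that is absent too.
    return item.get("main_content") or ""
```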

Tuning precision vs. recall

  • Stricter filtering (less noise, may miss some content):

    strategy = RsTrafilaturaStrategy(favor_precision=True)
  • More inclusive (captures more content, may include boilerplate):

    strategy = RsTrafilaturaStrategy(favor_recall=True)
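If the choice between the two modes comes from configuration, one option is to map a mode name to constructor keyword arguments so that only one flag is ever set. This is a sketch: `tuning_kwargs` and the mode names are hypothetical, and treating "balanced" as the library's defaults is an assumption.

```python
def tuning_kwargs(mode: str = "balanced") -> dict[str, bool]:
    """Map a mode name to keyword arguments for RsTrafilaturaStrategy."""
    if mode == "precision":
        return {"favor_precision": True}
    if mode == "recall":
        return {"favor_recall": True}
    if mode == "balanced":
        # Assumption: passing neither flag leaves the default behaviour in place.
        return {}
    raise ValueError(f"unknown mode: {mode!r}")
```

It would be used as `RsTrafilaturaStrategy(**tuning_kwargs("precision"))`.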

Concurrency example

async def main():
    strategy = RsTrafilaturaStrategy(output_markdown=True)
    config = CrawlerRunConfig(extraction_strategy=strategy)

    urls = [
        "https://example.com/blog/post-1",
        "https://example.com/products/widget",
        "https://example.com/docs/getting-started",
        "https://forum.example.com/thread/123",
    ]

    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url=url, config=config)
            data = json.loads(result.extracted_content)
            item = data[0]
            print(f"[{item['page_type']}] {item['title']} (quality: {item['extraction_quality']:.2f})")

asyncio.run(main())

Each page is classified and extracted with the appropriate profile (product pages get JSON‑LD fallback, forum threads treat comments as content, docs pages have sidebars removed). The extraction runs in a separate thread per page, so it does not block the async crawl loop.
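Note that the loop above awaits one page at a time. If you want the fetches themselves to overlap, a common asyncio pattern is `asyncio.gather` with a semaphore to cap the number of pages in flight. The sketch below uses a stand-in `fake_fetch` coroutine in place of `crawler.arun` so it runs without a browser; `crawl_concurrently` and `max_concurrency` are hypothetical names, and whether a single AsyncWebCrawler instance copes well with overlapping `arun` calls is an assumption to verify against the crawl4ai documentation.

```python
import asyncio
from typing import Awaitable, Callable


async def crawl_concurrently(
    urls: list[str],
    fetch: Callable[[str], Awaitable[dict]],
    max_concurrency: int = 4,
) -> list[dict]:
    """Run fetch(url) for every URL, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> dict:
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


async def fake_fetch(url: str) -> dict:
    # Stand-in for: result = await crawler.arun(url=url, config=config)
    await asyncio.sleep(0.01)
    return {"url": url, "page_type": "article"}


results = asyncio.run(
    crawl_concurrently(["https://example.com/a", "https://example.com/b"], fake_fetch)
)
```

In a real crawl, `fetch` would wrap `crawler.arun` and parse `extracted_content` before returning.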

Hybrid pipeline with LLM fallback

from crawl4ai.extraction_strategy import LLMExtractionStrategy

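async def extract_with_fallback(crawler, url, config, llm_config, min_quality):
    # Sketch under assumptions: the caller supplies min_quality (the cutoff) and
    # llm_config (a CrawlerRunConfig built around LLMExtractionStrategy).
    result = await crawler.arun(url=url, config=config)
    data = json.loads(result.extracted_content)
    item = data[0]

    if item["extraction_quality"] < min_quality:
        # Low confidence: re-run the page with the LLM strategy instead.
        result = await crawler.arun(url=url, config=llm_config)
        return json.loads(result.extracted_content)

    return item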

Resources

  • rs‑trafilatura source
  • Rust crate
  • crawl4ai repository
  • WCEB benchmark
  • Benchmark data (Zenodo)