How to Use rs-trafilatura with crawl4ai

Published: April 3, 2026 at 10:19 AM EDT

Source: Dev.to

Installation

pip install rs-trafilatura crawl4ai

If this is your first time with crawl4ai, install the Playwright browsers as well:

python -m playwright install chromium

Basic usage with `RsTrafilaturaStrategy`

import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from rs_trafilatura.crawl4ai import RsTrafilaturaStrategy

async def main():
    strategy = RsTrafilaturaStrategy()
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)

    data = json.loads(result.extracted_content)
    item = data[0]

    print(f"Title: {item['title']}")
    print(f"Page type: {item['page_type']}")
    print(f"Quality: {item['extraction_quality']}")
    print(f"Content: {item['main_content'][:200]}")

asyncio.run(main())

The `extracted_content` field holds a JSON array containing a single item, the extraction result. crawl4ai serialises it automatically, so you only need to call `json.loads()` on it.
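A small helper can make that parsing step more defensive. The sketch below assumes `extracted_content` may be `None` or an empty array when a crawl fails, which the article does not state; `first_item` is a hypothetical name, not part of either library.

```python
import json


def first_item(extracted_content):
    """Return the first extraction result as a dict, or None if nothing usable.

    Assumption: a failed crawl may leave extracted_content as None, an empty
    string, or an empty JSON array.
    """
    if not extracted_content:
        return None
    data = json.loads(extracted_content)
    # The strategy is described as returning a one-element JSON array.
    if isinstance(data, list) and data:
        return data[0]
    return None
```

With this in place, `item = first_item(result.extracted_content)` can replace the `json.loads(...)` and `data[0]` pair, followed by an `if item is None` check.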

Extraction result fields

| Field | Description |
| --- | --- |
| `title` | Page title |
| `author` | Author name (if detected) |
| `date` | Publication date (ISO 8601) |
| `main_content` | Clean extracted text |
| `content_markdown` | Markdown output (if enabled) |
| `page_type` | `article`, `forum`, `product`, `collection`, `listing`, `documentation`, `service` |
| `extraction_quality` | Confidence score from 0.0 to 1.0 |
| `language` | Detected language |
| `sitename` | Site name |
| `description` | Meta description |
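If you prefer attribute access over dictionary lookups, the fields above can be mirrored in a small dataclass. This is only a sketch: `ExtractionResult` and `from_item` are hypothetical names, and the defaults (for example `None` for a missing author) are assumptions rather than documented behaviour.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ExtractionResult:
    # Field names follow the table above; defaults are assumptions.
    title: Optional[str] = None
    author: Optional[str] = None
    date: Optional[str] = None
    main_content: str = ""
    content_markdown: Optional[str] = None
    page_type: Optional[str] = None
    extraction_quality: float = 0.0
    language: Optional[str] = None
    sitename: Optional[str] = None
    description: Optional[str] = None

    @classmethod
    def from_item(cls, item: dict) -> "ExtractionResult":
        """Build a result from one parsed item, ignoring unknown keys."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in item.items() if k in known})
```

Ignoring unknown keys keeps the wrapper from breaking if the extraction result contains fields not listed in the table.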

Getting Markdown alongside plain text

# Reuses the imports from the basic example above
async def main():
    strategy = RsTrafilaturaStrategy(output_markdown=True)
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)

    data = json.loads(result.extracted_content)
    markdown = data[0]["content_markdown"]
    print(markdown)

asyncio.run(main())

The output is GitHub-Flavored Markdown and preserves headings, lists, tables, bold/italic, code blocks, and links.
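Because `content_markdown` is only populated when Markdown output is enabled, downstream code may want to fall back to the plain text. A minimal sketch, assuming the key is either missing or empty in that case; `best_body` is a hypothetical helper name.

```python
def best_body(item):
    """Return (text, file_extension), preferring Markdown when it is present.

    Assumption: content_markdown is absent, None, or empty when Markdown
    output was not enabled on the strategy.
    """
    markdown = item.get("content_markdown")
    if markdown:
        return markdown, "md"
    return item.get("main_content", ""), "txt"
```

The returned extension can be used to pick a file name when saving each page, for example `page.md` or `page.txt`.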

Tuning precision vs. recall

Stricter filtering (less noise, may miss some content):

strategy = RsTrafilaturaStrategy(favor_precision=True)

More inclusive (captures more content, may include boilerplate):
```
strategy = RsTrafilaturaStrategy(favor_recall=True)
```
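If the choice between the two modes comes from configuration, it can help to map a mode name to constructor arguments in one place. The sketch below is an illustration only: `strategy_options` and the mode names are hypothetical, and treating an empty argument set as a "balanced" default is an assumption.

```python
def strategy_options(mode="balanced"):
    """Map a mode name to keyword arguments for RsTrafilaturaStrategy.

    Only the two flags shown in the article are used; "balanced" passes no
    flags and relies on the library defaults (an assumption).
    """
    options = {
        "balanced": {},
        "precision": {"favor_precision": True},
        "recall": {"favor_recall": True},
    }
    if mode not in options:
        raise ValueError(f"unknown mode: {mode!r}")
    # Return a copy so callers cannot mutate the shared mapping.
    return dict(options[mode])
```

Usage would then look like `RsTrafilaturaStrategy(**strategy_options("precision"))`.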

Concurrency example

async def main():
    strategy = RsTrafilaturaStrategy(output_markdown=True)
    config = CrawlerRunConfig(extraction_strategy=strategy)

    urls = [
        "https://example.com/blog/post-1",
        "https://example.com/products/widget",
        "https://example.com/docs/getting-started",
        "https://forum.example.com/thread/123",
    ]

    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url=url, config=config)
            data = json.loads(result.extracted_content)
            item = data[0]
            print(f"[{item['page_type']}] {item['title']} (quality: {item['extraction_quality']:.2f})")

asyncio.run(main())

Each page is classified and extracted with the appropriate profile: product pages get a JSON-LD fallback, forum threads treat comments as content, and docs pages have their sidebars removed. Extraction runs in a separate thread per page, so it does not block the async crawl loop. Note that the loop above still awaits each URL in turn; only the extraction step is offloaded.

Hybrid pipeline with LLM fallback

from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def extract_with_fallback(crawler, url, config):
    result = await crawler.arun(url=url, config=config)
    data = json.loads(result.extracted_content)
    item = data[0]

    if item["extraction_quality"]
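The snippet above stops at the quality check. One way such a fallback could be completed is sketched below, with the crawl steps injected as coroutine functions so the routing logic stays independent of any particular LLM configuration. The threshold of 0.8, the parameter names `fast_extract` and `llm_extract`, and the returned dictionary shape are all assumptions, not values from the article.

```python
import asyncio

# Assumption: placeholder cut-off; tune it against your own pages.
QUALITY_THRESHOLD = 0.8


async def extract_with_fallback(url, fast_extract, llm_extract,
                                threshold=QUALITY_THRESHOLD):
    """Try the fast extractor first and use the LLM only for weak results.

    fast_extract and llm_extract are hypothetical coroutine functions that
    take a URL and return one parsed item (a dict).
    """
    item = await fast_extract(url)
    # A missing score is treated as low confidence.
    if item.get("extraction_quality", 0.0) < threshold:
        return {"source": "llm", "item": await llm_extract(url)}
    return {"source": "rs-trafilatura", "item": item}
```

In practice `fast_extract` would wrap `crawler.arun` with the `RsTrafilaturaStrategy` config, and `llm_extract` would wrap it with a config built around `LLMExtractionStrategy`, so the slower path only runs for pages that score below the threshold.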

Resources

rs-trafilatura source
Rust crate
crawl4ai repository
WCEB benchmark
Benchmark data (Zenodo)
