How to Use rs-trafilatura with crawl4ai
Source: Dev.to
Installation
pip install rs-trafilatura crawl4ai
If this is your first time with crawl4ai, install the Playwright browsers as well:
python -m playwright install chromium
Basic usage with RsTrafilaturaStrategy
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from rs_trafilatura.crawl4ai import RsTrafilaturaStrategy

async def main():
    strategy = RsTrafilaturaStrategy()
    config = CrawlerRunConfig(extraction_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        data = json.loads(result.extracted_content)
        item = data[0]
        print(f"Title: {item['title']}")
        print(f"Page type: {item['page_type']}")
        print(f"Quality: {item['extraction_quality']}")
        print(f"Content: {item['main_content'][:200]}")

asyncio.run(main())

The extracted_content field is a JSON array with a single item containing the extraction result. Crawl4ai serialises it automatically; you just need to json.loads() it.
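
Because data[0] raises if the array is empty or the field was never filled in, a small guard can make the parsing step more forgiving. The sketch below is a minimal, standard-library-only helper; the name first_item is hypothetical, and it assumes (not confirmed by the article) that a failed crawl may leave extracted_content empty or None.

```python
import json
from typing import Any, Optional


def first_item(extracted_content: Optional[str]) -> Optional[dict[str, Any]]:
    """Return the first extraction result, or None if nothing usable is present."""
    if not extracted_content:
        # Assumption: a failed crawl may leave extracted_content empty or None.
        return None
    try:
        data = json.loads(extracted_content)
    except json.JSONDecodeError:
        return None
    # The article describes a JSON array holding a single result object.
    if isinstance(data, list) and data and isinstance(data[0], dict):
        return data[0]
    return None
```

In the example above, item = first_item(result.extracted_content) could stand in for the json.loads() and data[0] lines, followed by a check for None before printing.
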
Extraction result fields
| Field | Description |
|---|---|
| title | Page title |
| author | Author name (if detected) |
| date | Publication date (ISO 8601) |
| main_content | Clean extracted text |
| content_markdown | Markdown output (if enabled) |
| page_type | article, forum, product, collection, listing, documentation, service |
| extraction_quality | 0.0 – 1.0 confidence score |
| language | Detected language |
| sitename | Site name |
| description | Meta description |
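
If you would rather work with attributes than dictionary keys, the fields in the table can be mirrored in a small dataclass. This is only a sketch: ExtractionItem and from_dict are hypothetical names that are not part of rs-trafilatura, and the defaults are assumptions for fields that may be absent.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class ExtractionItem:
    """Hypothetical typed view of one extraction result; names follow the table."""
    title: Optional[str] = None
    author: Optional[str] = None
    date: Optional[str] = None
    main_content: str = ""
    content_markdown: Optional[str] = None
    page_type: Optional[str] = None
    extraction_quality: float = 0.0
    language: Optional[str] = None
    sitename: Optional[str] = None
    description: Optional[str] = None

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "ExtractionItem":
        # Keep only keys the dataclass declares, so unexpected keys do not raise.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in raw.items() if k in known})
```

With this in place, ExtractionItem.from_dict(data[0]).page_type reads the same value as data[0]["page_type"], and missing keys fall back to the defaults.
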
Getting Markdown alongside plain text
strategy = RsTrafilaturaStrategy(output_markdown=True)
config = CrawlerRunConfig(extraction_strategy=strategy)
# Run this inside an async function, with the imports from the basic example.
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com", config=config)
    data = json.loads(result.extracted_content)
    markdown = data[0]["content_markdown"]

The Markdown is GitHub-Flavored, preserving headings, lists, tables, bold/italic, code blocks, and links.
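
Since content_markdown is only filled in when Markdown output is enabled, downstream code may want one place that decides which text to use. A minimal sketch, with best_text as a hypothetical helper name:

```python
def best_text(item: dict) -> str:
    """Prefer Markdown when present and non-empty, else fall back to plain text."""
    return item.get("content_markdown") or item.get("main_content") or ""
```

Calling best_text(data[0]) then works whether or not the strategy was created with output_markdown=True.
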
Tuning precision vs. recall
Stricter filtering (less noise, may miss some content):
strategy = RsTrafilaturaStrategy(favor_precision=True)
More inclusive (captures more content, may include boilerplate):
strategy = RsTrafilaturaStrategy(favor_recall=True)
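
If the choice between the two modes comes from configuration, one option is to map a mode name to keyword arguments so that only one flag is ever set. This is a sketch under assumptions: tuning_kwargs and the mode names are hypothetical, and "balanced" simply means passing neither flag.

```python
def tuning_kwargs(mode: str = "balanced") -> dict[str, bool]:
    """Map a mode name to keyword arguments for RsTrafilaturaStrategy."""
    if mode == "precision":
        return {"favor_precision": True}
    if mode == "recall":
        return {"favor_recall": True}
    if mode == "balanced":
        # Neither flag set: the strategy's default behaviour.
        return {}
    raise ValueError(f"unknown mode: {mode!r}")
```

The result can be unpacked into the constructor, for example RsTrafilaturaStrategy(**tuning_kwargs("recall")).
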
Concurrency example
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from rs_trafilatura.crawl4ai import RsTrafilaturaStrategy

async def main():
    strategy = RsTrafilaturaStrategy(output_markdown=True)
    config = CrawlerRunConfig(extraction_strategy=strategy)
    urls = [
        "https://example.com/blog/post-1",
        "https://example.com/products/widget",
        "https://example.com/docs/getting-started",
        "https://forum.example.com/thread/123",
    ]
    async with AsyncWebCrawler() as crawler:
        # Each URL is awaited in turn; the same strategy and config are reused.
        for url in urls:
            result = await crawler.arun(url=url, config=config)
            data = json.loads(result.extracted_content)
            item = data[0]
            print(f"[{item['page_type']}] {item['title']} (quality: {item['extraction_quality']:.2f})")

asyncio.run(main())

Each page is classified and extracted with the appropriate profile (product pages get JSON-LD fallback, forum threads treat comments as content, docs pages have sidebars removed). The extraction runs in a separate thread per page, so it does not block the async crawl loop.
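
The loop above awaits each URL in turn. To have several crawls in flight at once, one common asyncio pattern is gather with a semaphore that caps concurrency. The sketch below is library-agnostic: crawl_all, crawl_one, and max_concurrency are hypothetical names, and crawl_one stands for whatever coroutine fetches and extracts a single URL.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def crawl_all(
    urls: list[str],
    crawl_one: Callable[[str], Awaitable[T]],
    max_concurrency: int = 4,
) -> list[T]:
    """Run crawl_one over urls concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> T:
        # The semaphore blocks here once max_concurrency crawls are running.
        async with semaphore:
            return await crawl_one(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))
```

With crawl4ai, crawl_one could be something like lambda url: crawler.arun(url=url, config=config), assuming the crawler tolerates overlapping arun calls. crawl4ai also documents an arun_many method for batches, which may be the simpler choice.
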
Hybrid pipeline with LLM fallback
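
The idea of a hybrid pipeline is presumably to keep the fast extractor as the default and call an LLM only when extraction_quality looks weak. The sketch below shows just that decision, independent of any library. The names with_fallback, fast_extract, and llm_extract are hypothetical, and the 0.5 threshold is an illustrative assumption, not a value from the article.

```python
from typing import Any, Awaitable, Callable

Item = dict[str, Any]


async def with_fallback(
    url: str,
    fast_extract: Callable[[str], Awaitable[Item]],
    llm_extract: Callable[[str], Awaitable[Item]],
    min_quality: float = 0.5,  # hypothetical threshold, tune for your data
) -> Item:
    """Return the fast result if its quality is acceptable, else the LLM result."""
    item = await fast_extract(url)
    # A missing score is treated as 0.0, which triggers the fallback.
    if item.get("extraction_quality", 0.0) >= min_quality:
        return item
    return await llm_extract(url)
```

Passing the two extractors in as callables keeps the threshold logic easy to test without a browser or an API key. The article's own version, built on crawl4ai, follows.
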
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def extract_with_fallback(crawler, url, config):
    result = await crawler.arun(url=url, config=config)
    data = json.loads(result.extracted_content)
    item = data[0]
    if item["extraction_quality"]

Resources
- rs‑trafilatura source:
- Rust crate:
- crawl4ai repository:
- WCEB benchmark:
- Benchmark data (Zenodo): (replace with actual DOI)