How to Use rs-trafilatura with Firecrawl
Source: Dev.to
Introduction
Firecrawl is an API service for scraping web pages. It handles JavaScript rendering, anti‑bot bypass, and rate limiting — you send it a URL, it returns the page content. By default, Firecrawl returns Markdown, but if you request the raw HTML you can run rs‑trafilatura on it for page‑type‑aware extraction with quality scoring. This is useful when you need structured metadata (title, author, date, page type) or when you want to know how confident the extraction is.
Installation
pip install rs-trafilatura firecrawl
You also need a Firecrawl API key.
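Hard-coding the key in source makes it easy to leak. One way to avoid that is to read it from the environment. This is a minimal sketch: the variable name FIRECRAWL_API_KEY and the helper load_api_key are assumptions made for this example, not something the article specifies.

```python
import os


def load_api_key(env_var: str = "FIRECRAWL_API_KEY") -> str:
    """Return the API key stored in the environment, or raise a clear error."""
    # Assumption: the key was exported under this (hypothetical) variable name.
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} before running the scraper")
    return key
```

The returned string can then be passed as the api_key argument when constructing FirecrawlApp.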
Basic Usage
from firecrawl import FirecrawlApp
from rs_trafilatura.firecrawl import extract_firecrawl_result
app = FirecrawlApp(api_key="fc-your-api-key")
# Request HTML format (required for rs-trafilatura)
result = app.scrape("https://example.com/blog/post", formats=["html"])
# Extract with rs-trafilatura
extracted = extract_firecrawl_result(result)
print(f"Title: {extracted.title}")
print(f"Author: {extracted.author}")
print(f"Date: {extracted.date}")
print(f"Page type: {extracted.page_type}")
print(f"Quality: {extracted.extraction_quality:.2f}")
print(f"Content: {extracted.main_content[:200]}")
The key is formats=["html"]: this tells Firecrawl to return the raw HTML alongside any other formats. Without it, you only get Markdown, and rs‑trafilatura has nothing to extract from.
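Because extraction depends on the HTML being present, it can help to check for it before calling the adapter. The helper below is a hypothetical sketch: get_html is not part of either library, and it assumes the scrape result exposes the HTML as an html attribute (or an "html" key on legacy dicts). The SimpleNamespace objects stand in for real Firecrawl results.

```python
from types import SimpleNamespace
from typing import Any, Optional


def get_html(result: Any) -> Optional[str]:
    """Return non-empty HTML from a Document-like object or a dict, else None."""
    # Assumption: the HTML lives under "html" in both result shapes.
    if isinstance(result, dict):
        html = result.get("html")
    else:
        html = getattr(result, "html", None)
    if isinstance(html, str) and html.strip():
        return html
    return None


# Stand-ins for scrape results (not real Firecrawl objects).
with_html = SimpleNamespace(html="<html><body><p>Hello</p></body></html>", markdown="Hello")
markdown_only = SimpleNamespace(markdown="Hello")
legacy_dict = {"html": "<p>Hi</p>"}

for candidate in (with_html, markdown_only, legacy_dict):
    if get_html(candidate) is None:
        print("No HTML: re-scrape with formats=['html']")
    else:
        print("HTML present: safe to extract")
```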
Page‑Type Differences
| Page type | Firecrawl output | rs‑trafilatura advantage |
|---|---|---|
| Product pages | May include navigation, filters, and “related products” sections. | Recognises the page type and extracts just the product description, falling back to JSON‑LD structured data when needed. |
| Forums | Treats the entire page as content. | Identifies user posts and excludes voting controls, user profile panels, and moderation UI. |
| Service pages | May over‑extract or under‑extract multi‑section layouts. | Multi‑candidate merge handles hero, features, testimonials, and pricing sections. |
Quality Score
Firecrawl doesn’t provide a confidence metric. rs‑trafilatura adds an extraction_quality field (0.0 – 1.0) so you can flag unreliable extractions.
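Flagging can be as simple as comparing the score against a cut-off. The sketch below assumes a threshold of 0.5, which is an arbitrary starting point rather than a documented recommendation. needs_review is a hypothetical helper, and the SimpleNamespace objects stand in for real extraction results.

```python
from types import SimpleNamespace

REVIEW_THRESHOLD = 0.5  # assumed cut-off; tune it on your own pages


def needs_review(extracted, threshold: float = REVIEW_THRESHOLD) -> bool:
    """True when the extraction_quality score falls below the threshold."""
    return extracted.extraction_quality < threshold


# Stand-ins for extraction results (only the fields used here).
good = SimpleNamespace(title="Release notes", extraction_quality=0.91)
poor = SimpleNamespace(title="Login wall", extraction_quality=0.22)

for item in (good, poor):
    status = "review" if needs_review(item) else "ok"
    print(f"{item.title}: {item.extraction_quality:.2f} -> {status}")
```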
Comparing Markdown Outputs
result = app.scrape("https://example.com", formats=["html", "markdown"])
# Firecrawl's own Markdown
firecrawl_markdown = result.markdown
# rs‑trafilatura extraction
extracted = extract_firecrawl_result(result, output_markdown=True)
rs_markdown = extracted.content_markdown
rs_quality = extracted.extraction_quality
print(f"Firecrawl markdown: {len(firecrawl_markdown)} chars")
print(f"rs‑trafilatura markdown: {len(rs_markdown)} chars")
print(f"Extraction quality: {rs_quality:.2f}")
Batch Scraping
Firecrawl supports batch scraping. Combine it with rs‑trafilatura for structured extraction at scale:
from firecrawl import FirecrawlApp
from rs_trafilatura.firecrawl import extract_firecrawl_result
app = FirecrawlApp(api_key="fc-your-api-key")
urls = [
    "https://example.com/products/widget",
    "https://example.com/docs/getting-started",
    "https://example.com/blog/announcement",
    "https://forum.example.com/thread/help",
]
batch = app.batch_scrape(urls, formats=["html"])
for doc in batch.data:
    extracted = extract_firecrawl_result(doc)
    print(f"[{extracted.page_type}] {extracted.title} (quality: {extracted.extraction_quality:.2f})")
Note: The batch API returns a result object with a .data attribute containing a list of Document objects. The extract_firecrawl_result adapter handles both Document objects (v4) and legacy dicts (v1).
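Once a batch has been extracted, a per-type summary shows quickly where extraction is weak. The following is a sketch under stated assumptions: summarize_by_page_type is a hypothetical helper, and the SimpleNamespace objects stand in for the results collected inside the loop above, carrying only page_type and extraction_quality.

```python
from collections import defaultdict
from statistics import mean
from types import SimpleNamespace


def summarize_by_page_type(results):
    """Map each page_type to (count, mean extraction_quality)."""
    scores = defaultdict(list)
    for item in results:
        scores[item.page_type].append(item.extraction_quality)
    return {ptype: (len(vals), mean(vals)) for ptype, vals in scores.items()}


# Stand-ins for the objects produced inside the batch loop.
extracted_batch = [
    SimpleNamespace(page_type="product", extraction_quality=0.8),
    SimpleNamespace(page_type="article", extraction_quality=0.9),
    SimpleNamespace(page_type="product", extraction_quality=0.6),
]

for ptype, (count, avg) in sorted(summarize_by_page_type(extracted_batch).items()):
    print(f"{ptype}: {count} page(s), mean quality {avg:.2f}")
```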
Extraction Options
# Stricter filtering — less noise
extracted = extract_firecrawl_result(result, favor_precision=True)
# More inclusive — captures more content
extracted = extract_firecrawl_result(result, favor_recall=True)
# Get Markdown output
extracted = extract_firecrawl_result(result, output_markdown=True)
Result Fields
extract_firecrawl_result returns an ExtractResult with the following attributes:
- title, author, date: structured metadata
- main_content: clean extracted text
- content_markdown: GFM Markdown (when enabled)
- page_type: article, forum, product, collection, listing, documentation, service
- extraction_quality: 0.0 – 1.0 confidence score
- language, sitename, description: additional metadata
- images: extracted image data with src, alt, caption
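To store or log these fields, one option is to copy them into a plain dict and serialize it as JSON. This is a minimal sketch: to_record and RECORD_FIELDS are hypothetical names, the attribute names come from the field list above, and the SimpleNamespace object stands in for a real ExtractResult.

```python
import json
from types import SimpleNamespace

# Attribute names taken from the field list above.
RECORD_FIELDS = (
    "title", "author", "date", "page_type",
    "extraction_quality", "language", "sitename", "description",
)


def to_record(extracted) -> dict:
    """Copy the listed metadata attributes into a dict; missing ones become None."""
    return {name: getattr(extracted, name, None) for name in RECORD_FIELDS}


# Stand-in for an extraction result.
sample = SimpleNamespace(
    title="Getting started",
    author=None,
    date="2024-01-15",
    page_type="documentation",
    extraction_quality=0.87,
    language="en",
)

# default=str covers values such as datetime objects that JSON cannot encode.
line = json.dumps(to_record(sample), default=str)
print(line)
```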