How to Use rs-trafilatura with Firecrawl

Published: April 3, 2026 at 10:22 AM EDT

Source: Dev.to

Introduction

Firecrawl is an API service for scraping web pages. It handles JavaScript rendering, anti‑bot bypass, and rate limiting: you send it a URL, and it returns the page content. By default, Firecrawl returns Markdown, but if you also request the page HTML you can run rs‑trafilatura on it for page‑type‑aware extraction with quality scoring. This is useful when you need structured metadata (title, author, date, page type) or when you want to know how confident the extraction is.

Installation

pip install rs-trafilatura firecrawl

You also need a Firecrawl API key.

Basic Usage

from firecrawl import FirecrawlApp
from rs_trafilatura.firecrawl import extract_firecrawl_result

app = FirecrawlApp(api_key="fc-your-api-key")

# Request HTML format (required for rs-trafilatura)
result = app.scrape("https://example.com/blog/post", formats=["html"])

# Extract with rs-trafilatura
extracted = extract_firecrawl_result(result)

print(f"Title: {extracted.title}")
print(f"Author: {extracted.author}")
print(f"Date: {extracted.date}")
print(f"Page type: {extracted.page_type}")
print(f"Quality: {extracted.extraction_quality:.2f}")
print(f"Content: {extracted.main_content[:200]}")

The key is formats=["html"]: it tells Firecrawl to include the page HTML in the response, alongside any other formats you request. Without it, you only get Markdown, and rs‑trafilatura has nothing to extract from.
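
Since the adapter depends on HTML being present, it may help to check for it before extracting. The following is a minimal sketch, not part of either library: has_html is a hypothetical helper, and it assumes the HTML is exposed as an html attribute on Document objects or an "html" key on legacy dicts.

```python
from types import SimpleNamespace


def has_html(result) -> bool:
    """Return True if a scrape result carries non-empty HTML.

    Assumption: Document objects expose `.html` and legacy dict
    responses use an "html" key. This helper is illustrative only.
    """
    if isinstance(result, dict):
        html = result.get("html")
    else:
        html = getattr(result, "html", None)
    return bool(html and html.strip())


# Stand-ins for real Firecrawl responses, so the sketch runs offline.
with_html = SimpleNamespace(html="<html><body><p>Hi</p></body></html>")
markdown_only = {"markdown": "# Hi"}

print(has_html(with_html))      # True
print(has_html(markdown_only))  # False
```

A check like this lets you skip or re-scrape a URL instead of passing an empty document to the extractor.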

Page‑Type Differences

Page type	Firecrawl output	rs‑trafilatura advantage
Product pages	May include navigation, filters, and “related products” sections.	Recognises the page type and extracts just the product description, falling back to JSON‑LD structured data when needed.
Forums	Treats the entire page as content.	Identifies user posts and excludes voting controls, user profile panels, and moderation UI.
Service pages	May over‑extract or under‑extract multi‑section layouts.	Multi‑candidate merge handles hero, features, testimonials, and pricing sections.

Quality Score

Firecrawl doesn’t provide a confidence metric. rs‑trafilatura adds an extraction_quality field (0.0 – 1.0) so you can flag unreliable extractions.
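
One way to act on the score is to split results at a cutoff. This is a rough sketch: split_by_quality is a hypothetical helper, the 0.5 threshold is an arbitrary assumption to tune against your own data, and the SimpleNamespace objects only stand in for real extraction results.

```python
from types import SimpleNamespace


def split_by_quality(results, min_quality=0.5):
    """Partition results into (accepted, flagged) by extraction_quality.

    The default threshold is an illustrative assumption, not a value
    recommended by rs-trafilatura.
    """
    accepted, flagged = [], []
    for item in results:
        if item.extraction_quality >= min_quality:
            accepted.append(item)
        else:
            flagged.append(item)
    return accepted, flagged


# Stand-in results; real ones would come from extract_firecrawl_result.
results = [
    SimpleNamespace(title="Good article", extraction_quality=0.92),
    SimpleNamespace(title="Messy landing page", extraction_quality=0.31),
]

accepted, flagged = split_by_quality(results)
print([r.title for r in accepted])  # ['Good article']
print([r.title for r in flagged])   # ['Messy landing page']
```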

Comparing Markdown Outputs

result = app.scrape("https://example.com", formats=["html", "markdown"])

# Firecrawl's own Markdown
firecrawl_markdown = result.markdown

# rs‑trafilatura extraction
extracted = extract_firecrawl_result(result, output_markdown=True)
rs_markdown = extracted.content_markdown
rs_quality = extracted.extraction_quality

print(f"Firecrawl markdown: {len(firecrawl_markdown)} chars")
print(f"rs‑trafilatura markdown: {len(rs_markdown)} chars")
print(f"Extraction quality: {rs_quality:.2f}")
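
With both outputs in hand, you could pick one per page based on the quality score. The sketch below is one possible policy, not something either library prescribes: choose_markdown and the 0.6 cutoff are assumptions for illustration.

```python
def choose_markdown(firecrawl_markdown, rs_markdown, rs_quality, min_quality=0.6):
    """Prefer rs-trafilatura's Markdown when its quality score is high enough.

    Returns a (source_name, markdown) tuple. Falls back to Firecrawl's
    Markdown when the rs-trafilatura output is empty or scores below
    `min_quality` (an illustrative threshold).
    """
    if rs_markdown and rs_quality >= min_quality:
        return "rs-trafilatura", rs_markdown
    return "firecrawl", firecrawl_markdown or ""


# Example values standing in for real scrape and extraction output.
print(choose_markdown("# Page\nnav | footer", "# Page", 0.85))
# ('rs-trafilatura', '# Page')
print(choose_markdown("# Page\nnav | footer", "# Page", 0.20))
# ('firecrawl', '# Page\nnav | footer')
```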

Batch Scraping

Firecrawl supports batch scraping. Combine it with rs‑trafilatura for structured extraction at scale:

from firecrawl import FirecrawlApp
from rs_trafilatura.firecrawl import extract_firecrawl_result

app = FirecrawlApp(api_key="fc-your-api-key")

urls = [
    "https://example.com/products/widget",
    "https://example.com/docs/getting-started",
    "https://example.com/blog/announcement",
    "https://forum.example.com/thread/help",
]

batch = app.batch_scrape(urls, formats=["html"])

for doc in batch.data:
    extracted = extract_firecrawl_result(doc)
    print(f"[{extracted.page_type}] {extracted.title} (quality: {extracted.extraction_quality:.2f})")

Note: The batch API returns a result object with a .data attribute containing a list of Document objects. The extract_firecrawl_result adapter handles both Document objects (v4) and legacy dicts (v1).
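
For larger batches, a summary is often more useful than one line per page. A minimal sketch, assuming each extracted item has the page_type and extraction_quality attributes shown above; summarize_batch is a hypothetical helper and the sample items are stand-ins.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_batch(extracted_items):
    """Count page types and compute the mean extraction quality."""
    items = list(extracted_items)
    counts = Counter(item.page_type for item in items)
    if items:
        mean_quality = sum(i.extraction_quality for i in items) / len(items)
    else:
        mean_quality = 0.0
    return counts, mean_quality


# Stand-ins for the results produced inside the batch loop.
items = [
    SimpleNamespace(page_type="product", extraction_quality=1.0),
    SimpleNamespace(page_type="article", extraction_quality=0.5),
    SimpleNamespace(page_type="article", extraction_quality=0.75),
]

counts, mean_quality = summarize_batch(items)
print(dict(counts))            # {'product': 1, 'article': 2}
print(f"{mean_quality:.2f}")   # 0.75
```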

Extraction Options

# Stricter filtering — less noise
extracted = extract_firecrawl_result(result, favor_precision=True)

# More inclusive — captures more content
extracted = extract_firecrawl_result(result, favor_recall=True)

# Get Markdown output
extracted = extract_firecrawl_result(result, output_markdown=True)
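
These options can be combined into a simple retry policy: try precision mode first, then fall back to recall mode if the score is low. This is only a sketch of one approach. extract_with_fallback is hypothetical, the threshold is arbitrary, and whether the two modes actually score differently on a given page is an assumption. The extractor is passed in as a parameter so the example runs with a fake; in practice you would pass extract_firecrawl_result.

```python
from types import SimpleNamespace


def extract_with_fallback(result, extract, min_quality=0.5):
    """Try precision mode first; retry in recall mode if quality is low.

    `extract` is any callable accepting favor_precision / favor_recall
    keyword arguments. Returns whichever attempt scored higher.
    """
    precise = extract(result, favor_precision=True)
    if precise.extraction_quality >= min_quality:
        return precise
    inclusive = extract(result, favor_recall=True)
    return max(precise, inclusive, key=lambda r: r.extraction_quality)


def fake_extract(result, favor_precision=False, favor_recall=False):
    """Fake extractor for the demo: precision scores 0.3, recall 0.7."""
    mode = "precision" if favor_precision else "recall"
    quality = 0.3 if favor_precision else 0.7
    return SimpleNamespace(mode=mode, extraction_quality=quality)


best = extract_with_fallback(object(), fake_extract)
print(best.mode)  # recall
```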

Result Fields

extract_firecrawl_result returns an ExtractResult with the following attributes:

title, author, date — structured metadata
main_content — clean extracted text
content_markdown — GFM Markdown (when enabled)
page_type — article, forum, product, collection, listing, documentation, service
extraction_quality — 0.0 – 1.0 confidence score
language, sitename, description — additional metadata
images — extracted image data with src, alt, caption
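
To store results, the scalar fields can be copied into a plain dict and serialized. A small sketch under stated assumptions: to_record is a hypothetical helper, it uses only the attribute names listed above, and it skips images because their exact structure is not described here. The sample object is a stand-in for a real ExtractResult.

```python
import json
from types import SimpleNamespace

# Scalar attribute names taken from the field list above.
FIELDS = (
    "title", "author", "date", "page_type",
    "extraction_quality", "language", "sitename", "description",
)


def to_record(extracted):
    """Copy scalar metadata fields into a dict; missing fields become None."""
    return {name: getattr(extracted, name, None) for name in FIELDS}


# Stand-in for a real ExtractResult.
sample = SimpleNamespace(
    title="Launch notes",
    author="Jane Doe",
    date="2026-04-03",
    page_type="article",
    extraction_quality=0.88,
)

record = to_record(sample)
# default=str guards against non-JSON types such as datetime objects.
print(json.dumps(record, default=str, indent=2))
```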
