How to Use rs-trafilatura with Firecrawl

Published: April 3, 2026 at 10:22 AM EDT
Source: Dev.to

Introduction

Firecrawl is an API service for scraping web pages. It handles JavaScript rendering, anti‑bot bypass, and rate limiting: you send it a URL and it returns the page content. By default, Firecrawl returns Markdown, but if you also request the HTML you can run rs‑trafilatura on it for page‑type‑aware extraction with quality scoring. This is useful when you need structured metadata (title, author, date, page type) or when you want to know how confident the extraction is.

Installation

pip install rs-trafilatura firecrawl

You also need a Firecrawl API key.

Basic Usage

from firecrawl import FirecrawlApp
from rs_trafilatura.firecrawl import extract_firecrawl_result

app = FirecrawlApp(api_key="fc-your-api-key")

# Request HTML format (required for rs-trafilatura)
result = app.scrape("https://example.com/blog/post", formats=["html"])

# Extract with rs-trafilatura
extracted = extract_firecrawl_result(result)

print(f"Title: {extracted.title}")
print(f"Author: {extracted.author}")
print(f"Date: {extracted.date}")
print(f"Page type: {extracted.page_type}")
print(f"Quality: {extracted.extraction_quality:.2f}")
print(f"Content: {extracted.main_content[:200]}")

The key is formats=["html"], which tells Firecrawl to return the HTML alongside any other formats you request. Without it, you only get Markdown, and rs‑trafilatura has nothing to extract from.
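
Since a response without HTML leaves nothing to extract, it can help to check for the payload before calling the adapter. The helper below is a minimal sketch, not part of either library: get_html is a hypothetical name, and it assumes the HTML is exposed as an html attribute on Document objects or as an "html" key on legacy dicts, which may differ between Firecrawl SDK versions.

```python
from types import SimpleNamespace
from typing import Any, Optional


def get_html(result: Any) -> Optional[str]:
    """Return the HTML payload of a Firecrawl result, or None if absent.

    Assumption: Document objects expose `.html` and legacy dicts use an
    "html" key. Verify this against the SDK version you have installed.
    """
    if isinstance(result, dict):
        html = result.get("html")
    else:
        html = getattr(result, "html", None)
    # Treat empty or whitespace-only strings as missing.
    if isinstance(html, str) and html.strip():
        return html
    return None


# Stand-in results so the sketch runs without an API key.
with_html = SimpleNamespace(html="<html><body><p>Hello</p></body></html>")
markdown_only = {"markdown": "# Hello"}

print(get_html(with_html) is not None)      # True
print(get_html(markdown_only) is not None)  # False
```

If get_html returns None, the likely cause is that the scrape was made without "html" in formats.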

Page‑Type Differences

| Page type | Firecrawl output | rs‑trafilatura advantage |
| --- | --- | --- |
| Product pages | May include navigation, filters, and “related products” sections. | Recognises the page type and extracts just the product description, falling back to JSON‑LD structured data when needed. |
| Forums | Treats the entire page as content. | Identifies user posts and excludes voting controls, user profile panels, and moderation UI. |
| Service pages | May over‑extract or under‑extract multi‑section layouts. | Multi‑candidate merge handles hero, features, testimonials, and pricing sections. |

Quality Score

Firecrawl doesn’t provide a confidence metric. rs‑trafilatura adds an extraction_quality field (0.0 – 1.0) so you can flag unreliable extractions.
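
One way to act on the score is to split results at a threshold and send the low-scoring ones for review or a retry. The sketch below uses a stand-in dataclass instead of a real extraction result so it runs on its own; FakeExtraction and split_by_quality are hypothetical names, and the 0.5 cutoff is an arbitrary assumption rather than a recommended value.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class FakeExtraction:
    """Stand-in exposing only the fields this sketch reads."""
    title: str
    extraction_quality: float


def split_by_quality(
    items: Iterable[FakeExtraction], threshold: float = 0.5
) -> Tuple[List[FakeExtraction], List[FakeExtraction]]:
    """Partition extractions into (trusted, flagged) by quality score."""
    trusted: List[FakeExtraction] = []
    flagged: List[FakeExtraction] = []
    for item in items:
        if item.extraction_quality >= threshold:
            trusted.append(item)
        else:
            flagged.append(item)
    return trusted, flagged


samples = [
    FakeExtraction("Clean article", 0.91),
    FakeExtraction("Cookie wall", 0.22),
    FakeExtraction("Borderline listing", 0.50),
]
trusted, flagged = split_by_quality(samples)
print([e.title for e in trusted])  # ['Clean article', 'Borderline listing']
print([e.title for e in flagged])  # ['Cookie wall']
```

The function only reads extraction_quality, so it should work unchanged on real extraction results. A sensible threshold depends on your pages and is best chosen by inspecting a sample.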

Comparing Markdown Outputs

result = app.scrape("https://example.com", formats=["html", "markdown"])

# Firecrawl's own Markdown
firecrawl_markdown = result.markdown

# rs‑trafilatura extraction
extracted = extract_firecrawl_result(result, output_markdown=True)
rs_markdown = extracted.content_markdown
rs_quality = extracted.extraction_quality

print(f"Firecrawl markdown: {len(firecrawl_markdown)} chars")
print(f"rs‑trafilatura markdown: {len(rs_markdown)} chars")
print(f"Extraction quality: {rs_quality:.2f}")

Batch Scraping

Firecrawl supports batch scraping. Combine it with rs‑trafilatura for structured extraction at scale:

from firecrawl import FirecrawlApp
from rs_trafilatura.firecrawl import extract_firecrawl_result

app = FirecrawlApp(api_key="fc-your-api-key")

urls = [
    "https://example.com/products/widget",
    "https://example.com/docs/getting-started",
    "https://example.com/blog/announcement",
    "https://forum.example.com/thread/help",
]

batch = app.batch_scrape(urls, formats=["html"])

for doc in batch.data:
    extracted = extract_firecrawl_result(doc)
    print(f"[{extracted.page_type}] {extracted.title} (quality: {extracted.extraction_quality:.2f})")

Note: The batch API returns a result object with a .data attribute containing a list of Document objects. The extract_firecrawl_result adapter handles both Document objects (v4) and legacy dicts (v1).
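
After a batch run, a per-page-type summary gives a quick view of where extraction is weakest. This is a minimal sketch with stand-in objects: summarize_by_page_type is a hypothetical helper that only reads the page_type and extraction_quality attributes described in this post.

```python
from collections import defaultdict
from statistics import mean
from types import SimpleNamespace
from typing import Any, Dict, Iterable


def summarize_by_page_type(extractions: Iterable[Any]) -> Dict[str, Dict[str, float]]:
    """Group extractions by page type and report count and mean quality."""
    groups = defaultdict(list)
    for item in extractions:
        groups[item.page_type].append(item.extraction_quality)
    return {
        page_type: {"count": len(scores), "mean_quality": mean(scores)}
        for page_type, scores in groups.items()
    }


# Stand-ins for the objects returned by the adapter.
sample = [
    SimpleNamespace(page_type="product", extraction_quality=0.9),
    SimpleNamespace(page_type="article", extraction_quality=1.0),
    SimpleNamespace(page_type="article", extraction_quality=0.5),
]
summary = summarize_by_page_type(sample)
for page_type, stats in summary.items():
    print(f"{page_type}: n={stats['count']}, mean quality={stats['mean_quality']:.2f}")
```

In the batch loop above, you could collect each extracted object into a list and pass that list to the helper.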

Extraction Options

# Stricter filtering — less noise
extracted = extract_firecrawl_result(result, favor_precision=True)

# More inclusive — captures more content
extracted = extract_firecrawl_result(result, favor_recall=True)

# Get Markdown output
extracted = extract_firecrawl_result(result, output_markdown=True)
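
These options can be combined with the quality score, for example by trying the strict setting first and retrying with the inclusive one when the score is low. The sketch below is one possible policy, not something the library prescribes: extract_with_fallback, fake_extract, and the 0.5 threshold are all hypothetical, and the source does not say that favor_recall will score higher on any given page. The extractor is passed in as a parameter so the example runs without either library installed.

```python
from types import SimpleNamespace
from typing import Any, Callable


def extract_with_fallback(
    result: Any,
    extract: Callable[..., Any],
    min_quality: float = 0.5,
) -> Any:
    """Try precision first; retry with recall if the score is too low.

    Returns whichever attempt has the higher extraction_quality.
    """
    strict = extract(result, favor_precision=True)
    if strict.extraction_quality >= min_quality:
        return strict
    loose = extract(result, favor_recall=True)
    if loose.extraction_quality > strict.extraction_quality:
        return loose
    return strict


def fake_extract(result: Any, favor_precision: bool = False, favor_recall: bool = False) -> Any:
    """Fake extractor for demonstration; the scores are made up."""
    if favor_precision:
        return SimpleNamespace(extraction_quality=0.3, mode="precision")
    return SimpleNamespace(extraction_quality=0.7, mode="recall")


chosen = extract_with_fallback({"html": "<p>hi</p>"}, fake_extract)
print(chosen.mode)  # recall
```

With the real adapter you would pass extract_firecrawl_result as the extract argument.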

Result Fields

extract_firecrawl_result returns an ExtractResult with the following attributes:

  • title, author, date: structured metadata
  • main_content: clean extracted text
  • content_markdown: GFM Markdown (when enabled)
  • page_type: article, forum, product, collection, listing, documentation, or service
  • extraction_quality: 0.0 – 1.0 confidence score
  • language, sitename, description: additional metadata
  • images: extracted image data with src, alt, caption
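
To store results, the scalar fields can be copied into a plain dict and written as one JSON line per page. This is a minimal sketch using a stand-in object: to_record and FIELDS are hypothetical names, the attribute names are the ones listed above, and default=str is used because the exact Python types of fields such as date are not stated here.

```python
import json
from types import SimpleNamespace
from typing import Any, Dict

# Scalar fields listed above; images is left out to keep records small.
FIELDS = (
    "title", "author", "date", "page_type",
    "extraction_quality", "language", "sitename", "description",
)


def to_record(extracted: Any) -> Dict[str, Any]:
    """Copy the listed attributes into a dict, using None when missing."""
    return {name: getattr(extracted, name, None) for name in FIELDS}


# Stand-in for an extraction result.
sample = SimpleNamespace(
    title="Hello",
    author=None,
    date="2026-04-03",
    page_type="article",
    extraction_quality=0.92,
    language="en",
    sitename="Example",
    description="A post",
)

# default=str is a fallback for values json cannot serialize directly.
line = json.dumps(to_record(sample), ensure_ascii=False, default=str)
print(line)
```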
