How to Use rs-trafilatura with spider-rs

Published: April 3, 2026 at 10:21 AM EDT
3 min read
Source: Dev.to

Introduction

spider is a high‑performance async web crawler written in Rust. It discovers, fetches, and queues URLs — but content extraction is left to you. rs‑trafilatura slots in as the extraction layer, providing page‑type‑aware content extraction with quality scoring on every crawled page.

Adding the dependencies

Add both crates to your Cargo.toml:

[dependencies]
rs-trafilatura = { version = "0.2", features = ["spider"] }
spider = "2"
tokio = { version = "1", features = ["full"] }

The spider feature flag enables rs_trafilatura::spider_integration, which provides convenience functions that accept spider’s Page type directly.

Simple extraction (crawl then process)

use spider::website::Website;
use rs_trafilatura::spider_integration::extract_page;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");
    website.crawl().await;

    for page in website.get_pages().into_iter().flatten() {
        match extract_page(&page) {
            Ok(result) => {
                println!(
                    "[{}] {} (confidence: {:.2})",
                    result.metadata.page_type.unwrap_or_default(),
                    result.metadata.title.unwrap_or_default(),
                    result.extraction_quality,
                );
                println!("  Content: {} chars", result.content_text.len());
            }
            Err(e) => eprintln!("  Extraction failed: {e}"),
        }
    }
}

extract_page takes a &Page and returns a Result: Ok carries the extraction result, Err the extraction error. The page URL is automatically passed to the classifier for page‑type detection.

Streaming extraction (process pages as they arrive)

use spider::website::Website;
use rs_trafilatura::spider_integration::extract_page;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");
    let mut rx = website.subscribe(0).unwrap();

    let handle = tokio::spawn(async move {
        let mut count = 0;
        while let Ok(page) = rx.recv().await {
            if let Ok(result) = extract_page(&page) {
                count += 1;
                println!(
                    "[{count}] {} → {} ({:.2})",
                    page.get_url(),
                    result.metadata.page_type.unwrap_or_default(),
                    result.extraction_quality,
                );
            }
        }
        println!("Extracted {count} pages");
    });

    website.crawl().await;
    website.unsubscribe();
    let _ = handle.await;
}

Each page is extracted in the spawned task as soon as spider fetches it. Extraction takes ~44 ms per page, so it easily keeps up with typical crawl rates.
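As a rough back-of-the-envelope check on that figure, the sketch below converts a per-page extraction time into an upper bound on pages per second. The helper name `max_pages_per_sec` and the idea of counting parallel workers are illustrative assumptions, not part of spider or rs-trafilatura; real throughput will vary with page size and hardware.

```rust
use std::time::Duration;

/// Upper bound on pages per second that `workers` extraction tasks can sustain,
/// assuming each page costs `per_page` of processing time.
/// Hypothetical helper for illustration only.
fn max_pages_per_sec(per_page: Duration, workers: u32) -> f64 {
    f64::from(workers) / per_page.as_secs_f64()
}
```

With the 44 ms figure quoted above, a single worker tops out at roughly 22 pages per second, so a crawl that fetches fewer pages than that per second should not back up the extraction task.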

Fine‑grained control with extract_page_with_options

use rs_trafilatura::{Options, spider_integration::extract_page_with_options};
use rs_trafilatura::page_type::PageType;

let options = Options {
    output_markdown: true,          // Get GFM Markdown output
    include_images: true,           // Extract image metadata
    favor_precision: true,         // Stricter filtering
    page_type: Some(PageType::Product), // Force page type
    ..Options::default()
};

let result = extract_page_with_options(&page, &options)?;

if let Some(md) = &result.content_markdown {
    println!("Markdown:\n{}", md);
}

for img in &result.images {
    println!("Image: {} (hero: {})", img.src, img.is_hero);
}

If you provide a URL in the options, it takes precedence over the page URL for classification; otherwise the page URL is used automatically.
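The precedence rule can be pictured with a small stand-alone sketch. `SketchOptions`, its `url` field, and `classification_url` are hypothetical stand-ins written for this illustration; the real Options struct may name or type the field differently.

```rust
/// Stand-in for the options struct; only the field relevant here is modelled.
/// The field name `url` is an assumption for illustration.
#[derive(Default)]
struct SketchOptions {
    url: Option<String>,
}

/// Pick the URL used for classification: an explicit option wins,
/// otherwise fall back to the URL of the crawled page.
fn classification_url<'a>(options: &'a SketchOptions, page_url: &'a str) -> &'a str {
    options.url.as_deref().unwrap_or(page_url)
}
```

With default options the page URL is returned unchanged; setting the option overrides it, which is handy when a page was fetched through a redirect or proxy and you want classification to see the canonical address.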

Filtering by extraction quality

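// Example threshold (an assumption); tune it for your crawl.
const MIN_QUALITY: f64 = 0.5;

for page in website.get_pages().into_iter().flatten() {
    let url = page.get_url().to_string();
    let result = extract_page(&page)?;

    // extraction_quality is an f64 in 0.0–1.0, so compare it to a threshold.
    if result.extraction_quality < MIN_QUALITY {
        // Add your filtering logic here, e.g. skip or re-queue the page.
        eprintln!("Low-confidence extraction for {url}");
    }
}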
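One possible shape for the filtering step is to split results into kept and rejected sets against a minimum score. The sketch below is self-contained and uses a stand-in `Scored` struct rather than the crate's real result type; the struct, the function name `split_by_quality`, and any threshold you pass in are assumptions for illustration.

```rust
/// Minimal stand-in for an extraction result; the real type has more fields.
#[derive(Debug, Clone, PartialEq)]
struct Scored {
    url: String,
    extraction_quality: f64,
}

/// Split results into (kept, rejected) using a minimum quality threshold.
/// Results scoring exactly `min_quality` are kept.
fn split_by_quality(results: Vec<Scored>, min_quality: f64) -> (Vec<Scored>, Vec<Scored>) {
    results
        .into_iter()
        .partition(|r| r.extraction_quality >= min_quality)
}
```

Keeping the rejected set around, rather than dropping it, makes it easy to log low-confidence URLs or retry them with different options.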

Result fields

| Field | Type | Description |
|---|---|---|
| content_markdown | Option<String> | GFM Markdown (when enabled) |
| content_html | Option<String> | Extracted content as HTML |
| metadata.title | Option<String> | Page title |
| metadata.author | Option<String> | Author name |
| metadata.date | Option<String> | Publication date |
| metadata.page_type | Option<PageType> | Detected page type |
| extraction_quality | f64 | 0.0–1.0 confidence score |
| images | Vec<Image> | Image URLs, alt text, captions |
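Because most of these fields are Option values, consumers need a fallback for pages where a field is missing. The sketch below shows one way to do that with a stand-in `Summary` struct that mirrors a few fields from the table; it is not the crate's real type, and the fallback strings are arbitrary choices.

```rust
/// Stand-in mirroring a few fields from the table above (not the crate's real type).
struct Summary {
    title: Option<String>,
    author: Option<String>,
    date: Option<String>,
    extraction_quality: f64,
}

/// Render a one-line description, substituting placeholders for missing metadata.
fn describe(s: &Summary) -> String {
    format!(
        "{} by {} ({}), quality {:.2}",
        s.title.as_deref().unwrap_or("<untitled>"),
        s.author.as_deref().unwrap_or("unknown"),
        s.date.as_deref().unwrap_or("undated"),
        s.extraction_quality,
    )
}
```

Borrowing with as_deref avoids cloning the strings and leaves the original result intact for later processing.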

Comparison with spider_transformations

spider ships with its own spider_transformations crate that can convert pages to Markdown or plain text. It works, but it is a basic readability‑style extractor lacking:

  • ML page‑type classification
  • Type‑specific extraction profiles (forum comment handling, multi‑section merge, JSON‑LD fallback)
  • Extraction quality scoring
  • Structured metadata extraction from JSON‑LD, Open Graph, and Dublin Core

rs‑trafilatura provides all of these features. For article‑heavy crawls, spider_transformations may be sufficient; for crawls that encounter diverse page types, rs‑trafilatura yields substantially better results.
