How to Use rs-trafilatura with spider-rs

Published: 1 month ago (April 3, 2026 at 10:21 AM EDT)

Source: Dev.to

Introduction

spider is a high‑performance async web crawler written in Rust. It discovers, fetches, and queues URLs — but content extraction is left to you. rs‑trafilatura slots in as the extraction layer, providing page‑type‑aware content extraction with quality scoring on every crawled page.

Adding the dependencies

Add both crates to your Cargo.toml:

[dependencies]
rs-trafilatura = { version = "0.2", features = ["spider"] }
spider = "2"
tokio = { version = "1", features = ["full"] }

The spider feature flag enables rs_trafilatura::spider_integration, which provides convenience functions that accept spider’s Page type directly.

Simple extraction (crawl then process)

use spider::website::Website;
use rs_trafilatura::spider_integration::extract_page;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");
    website.crawl().await;

    for page in website.get_pages().into_iter().flatten() {
        match extract_page(&page) {
            Ok(result) => {
                println!(
                    "[{}] {} (confidence: {:.2})",
                    result.metadata.page_type.unwrap_or_default(),
                    result.metadata.title.unwrap_or_default(),
                    result.extraction_quality,
                );
                println!("  Content: {} chars", result.content_text.len());
            }
            Err(e) => eprintln!("  Extraction failed: {e}"),
        }
    }
}

extract_page takes a &Page and returns a Result, so a failed extraction surfaces as an Err you can handle per page instead of aborting the crawl. The page URL is automatically passed to the classifier for page‑type detection.
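Because each page yields its own Result, a crawl‑then‑process loop often wants a summary of how many extractions succeeded. The helper below is a minimal sketch of that idea; `tally` is a hypothetical name, and it is generic over the success and error types so it makes no assumptions about rs‑trafilatura's actual result or error types.

```rust
/// Counts successful and failed outcomes.
/// Generic over `T` and `E`, so it works with any per-page `Result`.
fn tally<T, E>(outcomes: impl IntoIterator<Item = Result<T, E>>) -> (usize, usize) {
    let mut ok = 0;
    let mut failed = 0;
    for outcome in outcomes {
        match outcome {
            Ok(_) => ok += 1,
            Err(_) => failed += 1,
        }
    }
    (ok, failed)
}
```

In the loop above, something like `tally(website.get_pages().into_iter().flatten().map(|page| extract_page(page)))` should give the two counts, assuming you do not need the individual results afterwards.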

Streaming extraction (process pages as they arrive)

use spider::website::Website;
use rs_trafilatura::spider_integration::extract_page;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");
    let mut rx = website.subscribe(0).unwrap();

    let handle = tokio::spawn(async move {
        let mut count = 0;
        while let Ok(page) = rx.recv().await {
            if let Ok(result) = extract_page(&page) {
                count += 1;
                println!(
                    "[{count}] {} → {} ({:.2})",
                    page.get_url(),
                    result.metadata.page_type.unwrap_or_default(),
                    result.extraction_quality,
                );
            }
        }
        println!("Extracted {count} pages");
    });

    website.crawl().await;
    website.unsubscribe();
    let _ = handle.await;
}

Each page is extracted in the spawned task as soon as spider fetches it. Extraction takes ~44 ms per page, so it easily keeps up with typical crawl rates.
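To make the "keeps up" claim concrete: at roughly 44 ms per page, a single sequential extraction task tops out at about 22 pages per second. The sketch below just encodes that arithmetic; the function names and the crawl rates you compare against are illustrative assumptions, not measurements from spider.

```rust
/// Upper bound on pages per second for one sequential extraction task.
fn max_pages_per_second(ms_per_page: f64) -> f64 {
    1000.0 / ms_per_page
}

/// True when a single extraction task can keep pace with the crawler.
/// `crawl_pages_per_second` is whatever rate you observe for your own crawl.
fn keeps_up(ms_per_page: f64, crawl_pages_per_second: f64) -> bool {
    max_pages_per_second(ms_per_page) >= crawl_pages_per_second
}
```

If your crawl regularly fetches pages faster than that bound, one option is to move extraction onto several worker tasks rather than a single subscriber loop.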

Fine‑grained control with `extract_page_with_options`

use rs_trafilatura::{Options, spider_integration::extract_page_with_options};
use rs_trafilatura::page_type::PageType;

let options = Options {
    output_markdown: true,          // Get GFM Markdown output
    include_images: true,           // Extract image metadata
    favor_precision: true,         // Stricter filtering
    page_type: Some(PageType::Product), // Force page type
    ..Options::default()
};

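// `page` is a spider Page taken from one of the loops above; `?` needs an
// enclosing function that returns a compatible Result.
let result = extract_page_with_options(&page, &options)?;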

if let Some(md) = &result.content_markdown {
    println!("Markdown:\n{}", md);
}

for img in &result.images {
    println!("Image: {} (hero: {})", img.src, img.is_hero);
}

If you provide a URL in the options, it takes precedence over the page URL for classification; otherwise the page URL is used automatically.
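The precedence rule described here amounts to "explicit override first, page URL as the fallback". The snippet below is only an illustration of that rule, not the library's code; `classification_url` and its parameter names are hypothetical.

```rust
/// Picks the URL used for classification: an explicit override wins,
/// otherwise the crawled page's own URL is used.
fn classification_url<'a>(override_url: Option<&'a str>, page_url: &'a str) -> &'a str {
    override_url.unwrap_or(page_url)
}
```

This matters mostly when the fetched URL is not representative of the content, for example after a redirect through a tracking or proxy domain.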

Filtering by extraction quality

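// Assumed cutoff for illustration only; tune it against your own crawl.
const MIN_QUALITY: f64 = 0.5;

for page in website.get_pages().into_iter().flatten() {
    let url = page.get_url().to_string();
    let result = extract_page(&page)?;

    // extraction_quality is an f64 in 0.0–1.0, so compare it to a threshold.
    if result.extraction_quality < MIN_QUALITY {
        eprintln!("Low-quality extraction, skipping: {url}");
        continue;
    }
    // Keep or process the page here
}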
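If you would rather keep the low‑scoring pages for a retry or a manual review than drop them, the results can be split in one pass. This is a self‑contained sketch: `Extracted` is a hypothetical stand‑in that carries only a URL and the quality score, and the threshold you pass in is your own choice.

```rust
/// Stand-in for the parts of an extraction result used here (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Extracted {
    url: String,
    extraction_quality: f64, // 0.0–1.0 confidence score
}

/// Splits results into (kept, rejected) around a minimum quality threshold.
/// Results scoring exactly `min_quality` are kept.
fn split_by_quality(results: Vec<Extracted>, min_quality: f64) -> (Vec<Extracted>, Vec<Extracted>) {
    results
        .into_iter()
        .partition(|r| r.extraction_quality >= min_quality)
}
```

The rejected list is a natural place to look first when tuning the threshold, since it shows which page types the score penalizes on your site.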

Result fields

| Field | Type | Description |
| --- | --- | --- |
| `content_markdown` | `Option<String>` | GFM Markdown (when enabled) |
| `content_html` | `Option<String>` | Extracted content as HTML |
| `metadata.title` | `Option<String>` | Page title |
| `metadata.author` | `Option<String>` | Author name |
| `metadata.date` | `Option<String>` | Publication date |
| `metadata.page_type` | `Option<PageType>` | Detected page type |
| `extraction_quality` | `f64` | 0.0–1.0 confidence score |
| `images` | `Vec<Image>` | Image URLs, alt text, captions |

Comparison with `spider_transformations`

spider ships with its own spider_transformations crate that can convert pages to Markdown or plain text. It works, but it is a basic readability‑style extractor lacking:

- ML page‑type classification
- Type‑specific extraction profiles (forum comment handling, multi‑section merge, JSON‑LD fallback)
- Extraction quality scoring
- Structured metadata extraction from JSON‑LD, Open Graph, and Dublin Core

rs‑trafilatura provides all of these features. For article‑heavy crawls, spider_transformations may be sufficient; for crawls that encounter diverse page types, rs‑trafilatura yields substantially better results.
