How to Use rs-trafilatura with spider-rs
Source: Dev.to
Introduction
spider is a high‑performance async web crawler written in Rust. It discovers, fetches, and queues URLs — but content extraction is left to you. rs‑trafilatura slots in as the extraction layer, providing page‑type‑aware content extraction with quality scoring on every crawled page.
Adding the dependencies
Add both crates to your Cargo.toml:
[dependencies]
rs-trafilatura = { version = "0.2", features = ["spider"] }
spider = "2"
tokio = { version = "1", features = ["full"] }
The spider feature flag enables rs_trafilatura::spider_integration, which provides convenience functions that accept spider’s Page type directly.
Simple extraction (crawl then process)
use spider::website::Website;
use rs_trafilatura::spider_integration::extract_page;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");
    // scrape() crawls and keeps each fetched page so get_pages() has content;
    // crawl() on its own only records the visited links.
    website.scrape().await;

    for page in website.get_pages().into_iter().flatten() {
        match extract_page(&page) {
            Ok(result) => {
                println!(
                    "[{}] {} (confidence: {:.2})",
                    result.metadata.page_type.unwrap_or_default(),
                    result.metadata.title.unwrap_or_default(),
                    result.extraction_quality,
                );
                // chars().count() counts characters; len() would count bytes.
                println!(" Content: {} chars", result.content_text.chars().count());
            }
            Err(e) => eprintln!(" Extraction failed: {e}"),
        }
    }
}
extract_page takes a &Page and returns a Result. The page URL is automatically passed to the classifier for page‑type detection.
Streaming extraction (process pages as they arrive)
use spider::website::Website;
use rs_trafilatura::spider_integration::extract_page;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com");
    let mut rx = website.subscribe(0).unwrap();

    let handle = tokio::spawn(async move {
        let mut count = 0;
        while let Ok(page) = rx.recv().await {
            if let Ok(result) = extract_page(&page) {
                count += 1;
                println!(
                    "[{count}] {} → {} ({:.2})",
                    page.get_url(),
                    result.metadata.page_type.unwrap_or_default(),
                    result.extraction_quality,
                );
            }
        }
        println!("Extracted {count} pages");
    });

    website.crawl().await;
    website.unsubscribe();
    let _ = handle.await;
}
Each page is extracted in the spawned task as soon as spider fetches it. Extraction takes ~44 ms per page, so it easily keeps up with typical crawl rates.
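As a rough back-of-the-envelope check on that claim, the per-page latency can be turned into a throughput ceiling for a single extraction task. The sketch below uses only the Rust standard library; max_pages_per_second and keeps_up are hypothetical helper names, not part of rs-trafilatura or spider.

```rust
use std::time::Duration;

/// Upper bound on pages one extraction task can process per second,
/// given the average extraction time per page.
fn max_pages_per_second(per_page: Duration) -> f64 {
    1.0 / per_page.as_secs_f64()
}

/// True when a single extraction task can keep pace with the crawler.
/// `crawl_rate_pages_per_sec` is whatever rate you measure for your crawl.
fn keeps_up(per_page: Duration, crawl_rate_pages_per_sec: f64) -> bool {
    max_pages_per_second(per_page) >= crawl_rate_pages_per_sec
}
```

With the ~44 ms figure, max_pages_per_second(Duration::from_millis(44)) comes out at roughly 22.7 pages per second. A crawl that delivers pages faster than that would likely need more than one extraction task.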
Fine‑grained control with extract_page_with_options
use rs_trafilatura::{Options, spider_integration::extract_page_with_options};
use rs_trafilatura::page_type::PageType;

let options = Options {
    output_markdown: true,              // Get GFM Markdown output
    include_images: true,               // Extract image metadata
    favor_precision: true,              // Stricter filtering
    page_type: Some(PageType::Product), // Force page type
    ..Options::default()
};

let result = extract_page_with_options(&page, &options)?;

if let Some(md) = &result.content_markdown {
    println!("Markdown:\n{}", md);
}
for img in &result.images {
    println!("Image: {} (hero: {})", img.src, img.is_hero);
}
If you provide a URL in the options, it takes precedence over the page URL for classification; otherwise the page URL is used automatically.
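The precedence rule can be modelled in a few lines. This is only an illustration of the behaviour described here, not the crate's actual implementation; classification_url and its parameters are hypothetical names.

```rust
/// Pick the URL handed to the page-type classifier: an explicit override
/// wins, otherwise fall back to the URL the page was fetched from.
fn classification_url<'a>(override_url: Option<&'a str>, page_url: &'a str) -> &'a str {
    override_url.unwrap_or(page_url)
}
```

Calling classification_url(None, page_url) returns the page URL unchanged, while passing Some(other) returns the override.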
Filtering by extraction quality
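// Example cutoff (an assumption, not a library default); tune it for your crawl.
const MIN_QUALITY: f64 = 0.5;

for page in website.get_pages().into_iter().flatten() {
    let url = page.get_url().to_string();
    let result = extract_page(&page)?;
    if result.extraction_quality < MIN_QUALITY {
        // Add your filtering logic here, e.g. skip or log the page
        eprintln!("Low-quality extraction for {url}: {:.2}", result.extraction_quality);
    }
}
Result fields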
| Field | Type | Description |
|---|---|---|
| content_markdown | Option<String> | GFM Markdown (when enabled) |
| content_html | Option<String> | Extracted content as HTML |
| metadata.title | Option<String> | Page title |
| metadata.author | Option<String> | Author name |
| metadata.date | Option<String> | Publication date |
| metadata.page_type | Option<PageType> | Detected page type |
| extraction_quality | f64 | 0.0–1.0 confidence score |
| images | Vec<Image> | Image URLs, alt text, captions |
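Because extraction_quality is a plain f64 and the metadata fields are Option values, post-processing mostly comes down to a threshold check and a fallback for missing data. The sketch below shows one way to do that with the standard library only. PageSummary is a hypothetical stand-in that copies a few of the fields above; it is not the result type rs-trafilatura returns, and partition_by_quality and describe are illustrative helpers.

```rust
/// Hypothetical stand-in holding just the fields this example needs;
/// the real result type comes from rs-trafilatura.
#[derive(Debug, Clone, PartialEq)]
struct PageSummary {
    url: String,
    title: Option<String>,
    extraction_quality: f64,
}

/// Split pages into (kept, rejected) around a minimum quality score.
fn partition_by_quality(
    pages: Vec<PageSummary>,
    min_quality: f64,
) -> (Vec<PageSummary>, Vec<PageSummary>) {
    pages
        .into_iter()
        .partition(|p| p.extraction_quality >= min_quality)
}

/// One-line report that tolerates a missing title.
fn describe(page: &PageSummary) -> String {
    format!(
        "{} | {} | {:.2}",
        page.url,
        page.title.as_deref().unwrap_or("(untitled)"),
        page.extraction_quality
    )
}
```

Keeping the rejected pages rather than dropping them makes it easier to inspect what the threshold filters out and adjust it.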
Comparison with spider_transformations
spider ships with its own spider_transformations crate that can convert pages to Markdown or plain text. It works, but it is a basic readability‑style extractor lacking:
- ML page‑type classification
- Type‑specific extraction profiles (forum comment handling, multi‑section merge, JSON‑LD fallback)
- Extraction quality scoring
- Structured metadata extraction from JSON‑LD, Open Graph, and Dublin Core
rs‑trafilatura provides all of these features. For article‑heavy crawls, spider_transformations may be sufficient; for crawls that encounter diverse page types, rs‑trafilatura yields substantially better results.
Links
- rs-trafilatura:
- Python package:
- spider:
- Benchmark: