The Ultimate Ruby Scraping Stack: From Nokogiri to Ferrum

Published: March 7, 2026 at 07:17 PM EST
3 min read
Source: Dev.to

The Decision Tree

  1. Does the page return HTML directly? → Use Nokogiri.
  2. Is it a JavaScript Single Page App (SPA)? → Check the Network Tab for an API.
  3. Is the data hidden behind complex JS/User Interaction? → Use Ferrum.
  4. Are you scraping thousands of pages? → Use Kimurai.
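Step 2 deserves emphasis: many SPAs load their data from a JSON endpoint you can call directly, skipping the browser entirely. A minimal stdlib-only sketch — the endpoint URL and payload shape below are hypothetical; copy the real ones from your browser's Network tab:

```ruby
require 'json'
# require 'net/http'  # uncomment for the real request

# Hypothetical payload shape -- inspect the actual response in the Network tab.
sample_body = '{"products":[{"name":"Widget","price":9.99},{"name":"Gadget","price":19.99}]}'

# In production you would hit the endpoint directly, e.g.:
#   sample_body = Net::HTTP.get(URI("https://store.example.com/api/products?page=1"))

items = JSON.parse(sample_body).fetch("products").map do |p|
  { name: p["name"], price: p["price"] }
end

items.each { |i| puts "#{i[:name]}: #{i[:price]}" }
```

No HTML parsing, no browser, and the server hands you clean structured data.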

Level 1: The Speed King (HTTP + Nokogiri)

If the data is in the source code (View Source), don’t overcomplicate it. Nokogiri is a C‑extension based parser that is incredibly fast.

The Stack: HTTP (gem) + Nokogiri

require 'http'
require 'nokogiri'

response = HTTP.get("https://news.ycombinator.com/")
doc = Nokogiri::HTML(response.to_s)

doc.css('.titleline > a').each do |link|
  puts "#{link.text}: #{link['href']}"
end

Why it wins: It uses almost no RAM and can process hundreds of pages per minute.
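Because parsing is rarely the bottleneck at this level, the network fetches are what you parallelize. A stdlib-only worker-pool skeleton — the fetch-and-parse step is stubbed out so the sketch stays self-contained:

```ruby
urls  = (1..20).map { |i| "https://example.com/page/#{i}" }
queue = Queue.new
urls.each { |u| queue << u }

results = Queue.new
workers = 5.times.map do
  Thread.new do
    loop do
      url = queue.pop(true) rescue break   # non-blocking pop; thread exits when the queue drains
      # Real code would do: doc = Nokogiri::HTML(HTTP.get(url).to_s)
      results << url                        # stand-in for the parsed data
    end
  end
end
workers.each(&:join)

puts "processed #{results.size} pages"
```

Five threads keep five requests in flight at once, which is usually plenty before you graduate to a framework like Kimurai.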

Level 2: The Modern Headless Choice (Ferrum)

If you must use a browser (to click buttons or wait for Vue/React to render), stop using Selenium. It’s slow and requires a clunky “WebDriver” middleman.

Use Ferrum, which talks directly to Chrome via the Chrome DevTools Protocol (CDP).

require "ferrum"

browser = Ferrum::Browser.new(headless: true)
browser.goto("https://example.com/dynamic-charts")

# Let the JS app finish its network activity before reading the DOM
browser.network.wait_for_idle
# (Note: at_css looks up immediately -- it does not wait for the element)

puts browser.at_css(".price-display").text
browser.quit

Why it wins: Faster than Selenium, easier to install on Linux (just needs Chromium), and offers fine‑grained control over network and headers.

Level 3: High‑Volume Orchestration (Kimurai)

When building a full‑scale crawler that needs proxies, rotating User‑Agents, and multi‑threading, use a framework instead of reinventing the wheel.

Kimurai brings “Scrapy‑like” power to Ruby.

require 'kimurai'

class MySpider < Kimurai::Base
  @name = "ecommerce_spider"
  @engine = :mechanize # or :ferrum
  @start_urls = ["https://store.com/products"]

  def parse(response, url:, data: {})
    response.css(".product-card").each do |product|
      # Process data here
    end
  end
end

MySpider.crawl!

Pro‑Tips for the Serious Scraper

Use XPath when CSS falls short

Nokogiri supports XPath, which can match on things CSS selectors can't, such as an element's text. To find a button by its label:

doc.xpath("//button[contains(text(), 'Submit')]")

Identity Management

Always set a realistic User-Agent. Servers may block the default Ruby/Faraday agents.

user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36..."
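Real browsers send more than just a User-Agent, so pairing it with a couple of companion headers makes the request look more plausible. A sketch (the header values are illustrative; with the http gem from Level 1 the request would look like the commented line):

```ruby
HEADERS = {
  "User-Agent"      => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " \
                       "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  "Accept"          => "text/html,application/xhtml+xml",
  "Accept-Language" => "en-US,en;q=0.9"
}.freeze

# With the http gem (not executed here):
#   response = HTTP.headers(HEADERS).get("https://example.com")

puts HEADERS["User-Agent"]
```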

Persistence

Stream data directly to CSV or JSONL instead of printing to the console, so you don’t lose progress if the script crashes.

require 'csv'

CSV.open("data.csv", "ab") do |csv|
  csv << [title, price, url]
end
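For JSONL the same idea applies: one JSON object per line, flushed as you go, so a crash loses at most the record in flight. A self-contained sketch (using a Tempfile here so the demo cleans up after itself; in a real run you'd open a fixed path in append mode):

```ruby
require 'json'
require 'tempfile'

# Append one JSON object per line, flushing immediately so a crash
# loses at most the current record.
def append_record(io, record)
  io.puts(record.to_json)
  io.flush
end

file = Tempfile.new(["products", ".jsonl"])
append_record(file, name: "Widget", price: 9.99)
append_record(file, name: "Gadget", price: 19.99)

# Reading back is just line-by-line parsing
file.rewind
records = file.each_line.map { |line| JSON.parse(line, symbolize_names: true) }
puts "restored #{records.length} records"
```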

The Ethics Check

  • Check robots.txt – respect the Crawl-delay.
  • Don’t DDoS – use sleep(rand(1..3)) between requests to mimic human pacing.
  • Prefer an API – if a JSON API exists, use it; it’s better for everyone.
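That sleep tip is worth wrapping in a small helper so every request path gets jitter by default (the tiny bounds in the demo call are just to keep it fast; use roughly 1..3 seconds against real sites):

```ruby
# Pause for a random interval within the given bounds; returns the delay used.
def polite_pause(min: 1.0, max: 3.0)
  delay = rand(min..max)
  sleep(delay)
  delay
end

# Demo with tiny bounds -- real crawls should use the defaults above.
puts polite_pause(min: 0.01, max: 0.05).round(3)
```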

Summary

  • Static? Use Nokogiri.
  • Dynamic? Use Ferrum.
  • Massive? Use Kimurai.
  • Smart? Find the hidden API.

What’s the hardest site you’ve ever tried to scrape? Let’s solve it in the comments! 👇
