The Ultimate Ruby Scraping Stack: From Nokogiri to Ferrum

Published: March 7, 2026 at 07:17 PM EST
3 min read
Source: Dev.to

The Decision Tree

  1. Does the page return HTML directly? → Use Nokogiri.
  2. Is it a JavaScript Single Page App (SPA)? → Check the Network Tab for an API.
  3. Is the data hidden behind complex JS/User Interaction? → Use Ferrum.
  4. Are you scraping thousands of pages? → Use Kimurai.
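Step 2 deserves emphasis: many SPAs load their data from a JSON endpoint you can call directly, skipping the browser entirely. A minimal stdlib-only sketch — the endpoint URL and payload shape below are hypothetical; copy the real ones from your browser's Network tab:

```ruby
require 'json'
# require 'net/http'  # uncomment for the real request

# Hypothetical payload shape -- inspect the actual response in the Network tab.
sample_body = '{"products":[{"name":"Widget","price":9.99},{"name":"Gadget","price":19.99}]}'

# In production you would hit the endpoint directly, e.g.:
#   sample_body = Net::HTTP.get(URI("https://store.example.com/api/products?page=1"))

items = JSON.parse(sample_body).fetch("products").map do |p|
  { name: p["name"], price: p["price"] }
end

items.each { |i| puts "#{i[:name]}: #{i[:price]}" }
```

No HTML parsing, no browser, and the server hands you clean structured data.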

Level 1: The Speed King (HTTP + Nokogiri)

If the data is in the source code (View Source), don’t overcomplicate it. Nokogiri is a C‑extension based parser that is incredibly fast.

The Stack: HTTP (gem) + Nokogiri

require 'http'
require 'nokogiri'

response = HTTP.get("https://news.ycombinator.com/")
doc = Nokogiri::HTML(response.to_s)

doc.css('.titleline > a').each do |link|
  puts "#{link.text}: #{link['href']}"
end

Why it wins: It uses almost no RAM and can process hundreds of pages per minute.
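Because parsing is rarely the bottleneck at this level, the network fetches are what you parallelize. A stdlib-only worker-pool skeleton — the fetch-and-parse step is stubbed out so the sketch stays self-contained:

```ruby
urls  = (1..20).map { |i| "https://example.com/page/#{i}" }
queue = Queue.new
urls.each { |u| queue << u }

results = Queue.new
workers = 5.times.map do
  Thread.new do
    loop do
      url = queue.pop(true) rescue break   # non-blocking pop; thread exits when the queue drains
      # Real code would do: doc = Nokogiri::HTML(HTTP.get(url).to_s)
      results << url                        # stand-in for the parsed data
    end
  end
end
workers.each(&:join)

puts "processed #{results.size} pages"
```

Five threads keep five requests in flight at once, which is usually plenty before you graduate to a framework like Kimurai.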

Level 2: The Modern Headless Choice (Ferrum)

If you must use a browser (to click buttons or wait for Vue/React to render), stop using Selenium. It’s slow and requires a clunky “WebDriver” middleman.

Use Ferrum, which talks directly to Chrome via the Chrome DevTools Protocol (CDP).

require "ferrum"

browser = Ferrum::Browser.new(headless: true)
browser.goto("https://example.com/dynamic-charts")

# Let the JS app finish its network activity before reading the DOM
browser.network.wait_for_idle
# (Note: at_css looks up immediately -- it does not wait for the element)

puts browser.at_css(".price-display").text
browser.quit

Why it wins: Faster than Selenium, easier to install on Linux (just needs Chromium), and offers fine‑grained control over network and headers.

Level 3: High‑Volume Orchestration (Kimurai)

When building a full‑scale crawler that needs proxies, rotating User‑Agents, and multi‑threading, use a framework instead of reinventing the wheel.

Kimurai brings “Scrapy‑like” power to Ruby.

require 'kimurai'

class MySpider < Kimurai::Base
  @name = "ecommerce_spider"
  @engine = :mechanize # or :ferrum
  @start_urls = ["https://store.com/products"]

  def parse(response, url:, data: {})
    response.css(".product-card").each do |product|
      # Process data here
    end
  end
end

MySpider.crawl!

Pro‑Tips for the Serious Scraper

Use XPath when CSS falls short

Nokogiri supports XPath, which can match on things CSS selectors can't, such as an element's text. To find a button by its label:

doc.xpath("//button[contains(text(), 'Submit')]")

Identity Management

Always set a realistic User-Agent. Servers may block the default Ruby/Faraday agents.

user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36..."
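Real browsers send more than just a User-Agent, so pairing it with a couple of companion headers makes the request look more plausible. A sketch (the header values are illustrative; with the http gem from Level 1 the request would look like the commented line):

```ruby
HEADERS = {
  "User-Agent"      => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " \
                       "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  "Accept"          => "text/html,application/xhtml+xml",
  "Accept-Language" => "en-US,en;q=0.9"
}.freeze

# With the http gem (not executed here):
#   response = HTTP.headers(HEADERS).get("https://example.com")

puts HEADERS["User-Agent"]
```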

Persistence

Stream data directly to CSV or JSONL instead of printing to the console, so you don’t lose progress if the script crashes.

require 'csv'

CSV.open("data.csv", "ab") do |csv|
  csv << [title, price, url]
end
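For JSONL the same idea applies: one JSON object per line, flushed as you go, so a crash loses at most the record in flight. A self-contained sketch (using a Tempfile here so the demo cleans up after itself; in a real run you'd open a fixed path in append mode):

```ruby
require 'json'
require 'tempfile'

# Append one JSON object per line, flushing immediately so a crash
# loses at most the current record.
def append_record(io, record)
  io.puts(record.to_json)
  io.flush
end

file = Tempfile.new(["products", ".jsonl"])
append_record(file, name: "Widget", price: 9.99)
append_record(file, name: "Gadget", price: 19.99)

# Reading back is just line-by-line parsing
file.rewind
records = file.each_line.map { |line| JSON.parse(line, symbolize_names: true) }
puts "restored #{records.length} records"
```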

The Ethics Check

  • Check robots.txt – respect the Crawl-delay.
  • Don’t DDoS – use sleep(rand(1..3)) between requests to mimic human pacing.
  • Prefer an API – if a JSON API exists, use it; it’s better for everyone.
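That sleep tip is worth wrapping in a small helper so every request path gets jitter by default (the tiny bounds in the demo call are just to keep it fast; use roughly 1..3 seconds against real sites):

```ruby
# Pause for a random interval within the given bounds; returns the delay used.
def polite_pause(min: 1.0, max: 3.0)
  delay = rand(min..max)
  sleep(delay)
  delay
end

# Demo with tiny bounds -- real crawls should use the defaults above.
puts polite_pause(min: 0.01, max: 0.05).round(3)
```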

Summary

  • Static? Use Nokogiri.
  • Dynamic? Use Ferrum.
  • Massive? Use Kimurai.
  • Smart? Find the hidden API.

What’s the hardest site you’ve ever tried to scrape? Let’s solve it in the comments! 👇
