The Ultimate Ruby Scraping Stack: From Nokogiri to Ferrum
Source: Dev.to

The Decision Tree
- Does the page return HTML directly? → Use Nokogiri.
- Is it a JavaScript Single Page App (SPA)? → Check the Network Tab for an API.
- Is the data hidden behind complex JS/User Interaction? → Use Ferrum.
- Are you scraping thousands of pages? → Use Kimurai.
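When the tree points you at the Network tab, the payoff is usually a JSON endpoint you can hit directly and skip HTML parsing entirely. A minimal stdlib-only sketch, where the endpoint URL and payload shape are hypothetical stand-ins for whatever you find in the Network tab:

```ruby
require "json"
require "net/http" # used for the real fetch, shown commented out below

# A hypothetical SPA's hidden endpoint often returns something like:
sample_body = <<~JSON
  {"products": [{"name": "Widget", "price": 9.99}]}
JSON

# In practice you'd fetch it from the URL you spotted in the Network tab:
# sample_body = Net::HTTP.get(URI("https://store.com/api/v1/products"))

data = JSON.parse(sample_body)
data["products"].each do |p|
  puts "#{p['name']}: $#{p['price']}"
end
```

No browser, no parser, no brittle selectors: when a JSON API exists, this is almost always the fastest and most stable option.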
Level 1: The Speed King (HTTP + Nokogiri)
If the data is in the source code (View Source), don’t overcomplicate it. Nokogiri is a C‑extension based parser that is incredibly fast.
The Stack: HTTP (gem) + Nokogiri
require 'http'
require 'nokogiri'
response = HTTP.get("https://news.ycombinator.com/")
doc = Nokogiri::HTML(response.body)
doc.css('.titleline > a').each do |link|
  puts "#{link.text}: #{link['href']}"
end
Why it wins: It uses almost no RAM and can process hundreds of pages per minute.
Level 2: The Modern Headless Choice (Ferrum)
If you must use a browser (to click buttons or wait for Vue/React to render), stop using Selenium. It’s slow and requires a clunky “WebDriver” middleman.
Use Ferrum, which talks directly to Chrome via the Chrome DevTools Protocol (CDP).
require "ferrum"
browser = Ferrum::Browser.new(headless: true)
browser.goto("https://example.com/dynamic-charts")
# Wait for network traffic to settle
browser.network.wait_for_idle
# Or wait for a specific element: browser.at_css(".data-loaded")
puts browser.at_css(".price-display").text
browser.quit
Why it wins: Faster than Selenium, easier to install on Linux (just needs Chromium), and offers fine‑grained control over network and headers.
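That network control is worth showing. The sketch below uses the header and request-interception APIs from the ferrum gem's README to send custom headers and skip heavy assets; the URL and header values are placeholders, and the whole thing is guarded so it degrades gracefully where the gem or a Chrome binary isn't installed:

```ruby
# Guard: only run where the ferrum gem is available.
begin
  require "ferrum"
  ferrum_available = true
rescue LoadError
  ferrum_available = false
end

if ferrum_available
  browser = Ferrum::Browser.new(headless: true)

  # Send custom headers with every request
  browser.headers.set(
    "User-Agent"      => "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language" => "en-US,en;q=0.9"
  )

  # Intercept requests and abort image/CSS downloads to speed up crawls
  browser.network.intercept
  browser.on(:request) do |request|
    if request.match?(/\.(png|jpe?g|gif|css)$/)
      request.abort
    else
      request.continue
    end
  end

  browser.goto("https://example.com")
  browser.quit
end
```

Aborting assets you don't need can cut page-load time dramatically on image-heavy sites.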
Level 3: High‑Volume Orchestration (Kimurai)
When building a full‑scale crawler that needs proxies, rotating User‑Agents, and multi‑threading, use a framework instead of reinventing the wheel.
Kimurai brings “Scrapy‑like” power to Ruby.
class MySpider < Kimurai::Base
  @name = "ecommerce_spider"
  @engine = :mechanize # or :ferrum
  @start_urls = ["https://store.com/products"]

  def parse(response, url:, data: {})
    response.css(".product-card").each do |product|
      # Process each product card here
    end
  end
end
MySpider.crawl!
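The proxies, delays, and User-Agent rotation mentioned above live in a per-spider `@config` hash. A sketch based on the configuration keys shown in the kimurai gem's README (the UA string and URL are placeholders), guarded so the file still loads where the gem isn't installed:

```ruby
# Guard: only define the spider where the kimurai gem is available.
begin
  require "kimurai"
  kimurai_available = true
rescue LoadError
  kimurai_available = false
end

if kimurai_available
  class PoliteSpider < Kimurai::Base
    @name = "polite_spider"
    @engine = :mechanize
    @start_urls = ["https://store.com/products"]
    @config = {
      # Pin (or rotate) the User-Agent for this spider
      user_agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
      # Sleep a random 1-3 seconds before every request
      before_request: { delay: 1..3 }
    }

    def parse(response, url:, data: {})
      response.css(".product-card").each { |card| puts card.text }
    end
  end
end
```

The framework handles the throttling and identity management for you, which is exactly the wheel you don't want to reinvent.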
Pro‑Tips for the Serious Scraper
Use XPath When CSS Falls Short
Nokogiri supports XPath alongside CSS selectors, and XPath can express queries that CSS can't, such as matching on text content. To find a button by its text:
doc.xpath("//button[contains(text(), 'Submit')]")
Identity Management
Always set a realistic User-Agent. Servers may block the default Ruby/Faraday agents.
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36..."
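Here's how you'd attach that header to a request. This sketch uses Ruby's stdlib Net::HTTP so it has no gem dependencies (the URL is a placeholder); the `http` gem equivalent is a one-liner with its `.headers` method:

```ruby
require "net/http"
require "uri"

user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"

uri = URI("https://example.com/")
request = Net::HTTP::Get.new(uri)
# Override Net::HTTP's default "Ruby" User-Agent with a realistic one
request["User-Agent"] = user_agent

# Send it like this (skipped here to keep the sketch offline):
# response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }

puts request["User-Agent"]
```

Without that override, Net::HTTP announces itself as "Ruby", which many servers block on sight.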
Persistence
Stream data directly to CSV or JSONL instead of printing to the console, so you don’t lose progress if the script crashes.
require 'csv'
CSV.open("data.csv", "ab") do |csv|
  csv << [title, price, url]
end
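JSONL (one JSON object per line) works the same way and handles nested data better than CSV. A stdlib-only sketch with made-up records:

```ruby
require "json"

# Hypothetical scraped records
records = [
  { title: "Widget", price: 9.99, url: "https://store.com/widget" },
  { title: "Gadget", price: 19.99, url: "https://store.com/gadget" }
]

# Append mode, one JSON object per line: a crash loses at most one record
File.open("data.jsonl", "a") do |f|
  records.each { |r| f.puts(r.to_json) }
end

# Reading it back is just as easy
restored = File.readlines("data.jsonl").map { |line| JSON.parse(line) }
```

Because each line is independent, a half-written final line after a crash can simply be skipped on reload.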
The Ethics Check
- Check robots.txt and respect the Crawl-delay directive.
- Don't DDoS: use sleep(rand(1..3)) to mimic human pacing.
- Prefer an API: if a JSON API exists, use it; it's better for everyone.
Summary
- Static? Use Nokogiri.
- Dynamic? Use Ferrum.
- Massive? Use Kimurai.
- Smart? Find the hidden API.
What’s the hardest site you’ve ever tried to scrape? Let’s solve it in the comments! 👇