I Built 23 Free Web Scrapers on Apify — Here is What I Learned
Source: Dev.to
Building in public is one thing, but building scrapers in public is a whole different beast. Over the last few months I’ve developed and released 23 free web scrapers on the Apify platform. From Amazon and TikTok to Google Maps and LinkedIn, I’ve touched almost every corner of the web where data lives.
If you’re an indie dev, a data enthusiast, or someone looking to break into the world of web automation, this is my story of why I built them, the technical hurdles I faced, and what it’s actually like to maintain a fleet of scrapers in 2026.
Why I Started
Most developers start building scrapers for a specific project. I started because I saw a gap.
- While there are plenty of enterprise‑grade scraping solutions, many indie developers, students, and small researchers just need a quick, reliable way to get data without a $200/month subscription.
- I wanted to build a “Swiss Army Knife” of data‑extraction tools. By releasing them for free on the Apify Store, I wasn’t just building tools; I was building a portfolio and a reputation. In the world of Scraper‑as‑a‑Service, your best marketing is a tool that actually works.
Goals
- Master the art of scraping – you don’t really know how a site works until you try to automate it.
- Help the community – data shouldn’t be gated by technical complexity.
- Explore the ecosystem – Apify handles the infrastructure, so I could focus 100% on the logic.
The Most Popular Scrapers
Out of the 23 actors I’ve built, five have consistently dominated the charts in terms of usage. Below is a quick rundown of each, the problem they solve, and the technical challenges that made them interesting.
1. Amazon Product Scraper (the “OG”)
Use case: Price monitoring, competitor analysis, market research.
What it extracts:
- ASIN
- Title
- Price
- Ratings & reviews
- BSR (Best‑Seller Rank)
The Challenge
Amazon is a master of A/B testing. On any given day you might see three different versions of a product page. Some put the price in an element with a specific CSS class; others hide it inside a “Buy Box” iframe.
The Lesson
Instead of brittle CSS selectors, I learned to target the JSON blobs hidden in the page. Look for the script that registers the product state:
```js
window.P.register('twister-js-init-dpx-data', {...})
```
Parsing that JSON is far more stable than hunting for the right CSS selector.
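To make the idea concrete, here’s a minimal sketch of the JSON-blob approach. The tag id and payload shape below are invented for illustration; the real Amazon blob looks different, but the technique is the same: find the embedded JSON, parse it, ignore the DOM.

```typescript
// Sketch: extract an embedded JSON blob instead of scraping the DOM.
// The "product-state" id and field names are hypothetical.
const html = `<html><body>
<script id="product-state" type="application/json">
{"asin":"B000TEST00","title":"Example Widget","price":{"amount":19.99,"currency":"USD"}}
</script>
</body></html>`;

// Grab the raw JSON between the script tags, then parse it.
const blob = html.match(
  /<script id="product-state" type="application\/json">([\s\S]*?)<\/script>/
)?.[1];
if (!blob) throw new Error("product-state blob not found");

const state = JSON.parse(blob);
console.log(state.title, state.price.amount); // Example Widget 19.99
```

When Amazon shuffles its CSS classes, a blob like this usually keeps the same keys, which is exactly why it survives A/B tests that break selectors.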
2. Google Maps Business Scraper
Use case: Pull business names, addresses, phone numbers, ratings, and contact info.
The Innovation
Most users wanted more than just the Google Maps data—they wanted to contact the businesses. I added an “Include Website” option. When enabled, the scraper follows the business’s website link and attempts to find:
- Email addresses
- Social‑media profiles
Technical Hurdle
Scraping 1,000 different websites is harder than scraping one big site like Google. Every site has its own anti‑bot measures. I implemented a recursive crawler that:
- Searches for “Contact Us” and “About” pages
- Strictly limits depth to avoid “spider traps”
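The depth-limiting logic can be sketched as a plain breadth-first walk. The link map here is an in-memory stand-in for a hypothetical site (a real crawler would fetch pages over HTTP), but the depth cap and visited-set are exactly what keeps you out of spider traps:

```typescript
// Sketch: depth-limited crawl hunting for "Contact Us" / "About" pages.
// The link map below mocks a site; real code would fetch and parse each page.
const links: Record<string, string[]> = {
  "/": ["/about", "/products", "/contact-us"],
  "/about": ["/team"],
  "/products": ["/products?page=2"], // start of a potential spider trap
  "/contact-us": [],
  "/team": [],
  "/products?page=2": ["/products?page=3"],
  "/products?page=3": [],
};

function crawl(start: string, maxDepth: number): string[] {
  const seen = new Set<string>([start]);
  let frontier = [start];
  for (let depth = 0; depth < maxDepth; depth++) {
    const next: string[] = [];
    for (const page of frontier) {
      for (const link of links[page] ?? []) {
        if (!seen.has(link)) { // visited-set prevents loops
          seen.add(link);
          next.push(link);
        }
      }
    }
    frontier = next; // hard depth cap: stop expanding after maxDepth levels
  }
  return [...seen].filter((p) => /contact|about/.test(p));
}

console.log(crawl("/", 2)); // → [ '/about', '/contact-us' ]
```

With `maxDepth` of 2, the paginated `/products?page=3` trap is never reached, while the contact pages are found on the first hop.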
3. TikTok Profile Scraper
Use case: Collect profile data and the first 30 videos for analytics, trend spotting, or influencer outreach.
The Breakthrough
TikTok’s internal structure changes weekly, and a browser‑based scraper constantly hits “Verify you are human” sliders.
I discovered the __UNIVERSAL_DATA_FOR_REHYDRATION__ script tag. When a TikTok profile loads, the server sends a massive JSON object containing the profile data and the first 30 videos.
...
Result: Parsing this JSON instead of the DOM made the scraper 10× more stable and significantly faster—turning a “Browser” problem into a “JSON” problem.
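The extraction itself is a one-regex job. The script-tag id below is the real one TikTok uses; the JSON fields are simplified placeholders, since the actual payload is much deeper:

```typescript
// Sketch: read profile data from the rehydration blob.
// The field names here are simplified; the real payload is deeply nested.
const page = `<script id="__UNIVERSAL_DATA_FOR_REHYDRATION__" type="application/json">
{"userInfo":{"user":{"uniqueId":"example_user","followerCount":1234}},"itemList":[{"id":"v1"},{"id":"v2"}]}
</script>`;

const m = page.match(
  /<script id="__UNIVERSAL_DATA_FOR_REHYDRATION__"[^>]*>([\s\S]*?)<\/script>/
);
if (!m) throw new Error("rehydration blob not found");

const data = JSON.parse(m[1]);
console.log(data.userInfo.user.uniqueId, data.itemList.length); // example_user 2
```

No browser, no slider captchas: one GET request and one `JSON.parse`.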
4. LinkedIn Jobs Scraper (the “final boss”)
Use case: Pull 100+ job postings in minutes without triggering a login wall.
The Strategy
While others struggled with complex browser automation, I focused on a human‑mimicry implementation using Playwright.
- Fingerprinting: Used Crawlee’s built‑in fingerprint rotation (headers, screen resolutions, WebGL signatures).
- Scrolling: Simulated variable‑speed scrolls, pausing occasionally as if a human is reading the job description.
Result: A reliable scraper that can harvest large job feeds without being blocked.
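The variable-speed scrolling boils down to generating jittered step sizes and pauses. This is a standalone sketch of the pacing logic (the numbers are illustrative tuning, not LinkedIn-specific constants); in practice each entry would drive a Playwright `mouse.wheel` call followed by a wait:

```typescript
// Sketch: plan human-like scroll steps (pixel size + pause in ms).
// In real code, each entry feeds a Playwright scroll action and a wait.
function humanScrollPlan(totalPx: number): { step: number; pauseMs: number }[] {
  const plan: { step: number; pauseMs: number }[] = [];
  let scrolled = 0;
  while (scrolled < totalPx) {
    const step = 200 + Math.floor(Math.random() * 400); // variable speed
    // ~20% of the time, pause longer, as if reading a job description.
    const pauseMs = Math.random() < 0.2
      ? 1500 + Math.floor(Math.random() * 2000)
      : 100 + Math.floor(Math.random() * 300);
    plan.push({ step, pauseMs });
    scrolled += step;
  }
  return plan;
}

const plan = humanScrollPlan(3000);
console.log(plan.length, "scroll steps");
```

Anti-bot systems look for perfectly even scroll cadence; randomizing both the distance and the dwell time breaks that signature.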
5. Shopify Store Scraper
Use case: Dropshippers and e‑commerce researchers who need the full product catalog and store theme details.
The Trick
Most Shopify stores expose a /products.json endpoint. It’s often hidden or paginated, but it provides perfectly structured data.
Workflow:
- Detect whether the site runs on Shopify.
- Hit the `/products.json` endpoint directly.
- Skip heavy page rendering, saving minutes per store.
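Because the endpoint is paginated, the whole catalog pull reduces to generating a list of URLs. A sketch, assuming the commonly used `limit`/`page` query parameters (250 items per page is the usual ceiling, though individual stores may behave differently):

```typescript
// Sketch: build paginated /products.json URLs for a Shopify store.
// Assumes the common limit/page parameters; per-store behavior may vary.
function productJsonUrls(store: string, pages: number): string[] {
  return Array.from(
    { length: pages },
    (_, i) => `${store}/products.json?limit=250&page=${i + 1}`
  );
}

console.log(productJsonUrls("https://example-store.myshopify.com", 3));
```

Each response is clean JSON, so there is no HTML parsing at all; you stop requesting once a page comes back with an empty `products` array.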
My Stack After 23 Iterations
If you aren’t using Crawlee, you’re playing on hard mode. It’s the engine behind all my scrapers and handles the boring stuff—request retries, proxy rotation, and session management—so I can focus on the parsing logic.
| Component | When to Use | Why |
|---|---|---|
| CheerioCrawler | Static or hydrated pages | Light, fast, uses ~1/10th of the RAM |
| PlaywrightCrawler | Dynamic pages that require JS execution | Handles complex interactions, heavy lifting |
| Crawlee (core) | All scrapers | Unified API for retries, proxies, sessions, and scaling |
The Eternal Debate: Cheerio vs. Playwright
- Cheerio (Static/Hydrated) – Use whenever possible. Most modern sites “hydrate” their data into a JSON object inside a `<script>` tag. Find that tag and you don’t need a browser.
- Playwright (Dynamic) – Use only when the page literally won’t show data until a button is clicked or a script runs. It’s your friend for truly dynamic content.
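A cheap way to settle the debate per site is a one-request probe: fetch the page without a browser and check whether a hydration blob is already present. The heuristic below is my own simplification (`__NEXT_DATA__` is the script id Next.js sites use; other frameworks have their own markers):

```typescript
// Sketch: probe static HTML for a hydration blob.
// If one exists, CheerioCrawler is enough; otherwise fall back to Playwright.
function needsBrowser(html: string): boolean {
  const hasHydrationBlob =
    /<script[^>]*type="application\/json"[^>]*>/.test(html) ||
    html.includes("__NEXT_DATA__"); // Next.js marker; other frameworks vary
  return !hasHydrationBlob;
}

console.log(needsBrowser('<script type="application/json">{"a":1}</script>')); // false
```

Running this probe once per target, before committing to a crawler type, saves the ~10× RAM cost of a browser on sites that never needed one.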
Anti‑Bot Countermeasures in 2026
Simple IP rotation isn’t enough anymore. Anti‑bot services like Cloudflare and Akamai inspect the TLS fingerprint—the way your computer “shakes hands” with the server.
| Technique | Description | Typical Targets |
|---|---|---|
| Residential Proxies | Appear as traffic from home Wi‑Fi networks | LinkedIn, Amazon |
| Header Order | Browsers send headers in a very specific order; mismatched order can raise suspicion | Almost any high‑security site |
| Browser Fingerprinting | Rotate screen resolution, WebGL signatures, user‑agent strings, etc. | All major platforms |
| Rate Limiting & Random Delays | Mimic human pacing to avoid detection | TikTok, Google Maps |
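The last row of the table is the easiest to implement yourself. A sketch of jittered pacing (the base/jitter numbers are illustrative, not tuned values for any particular site):

```typescript
// Sketch: jittered delay between requests to mimic human pacing.
// Base and jitter values are illustrative; tune per target site.
function jitteredDelayMs(baseMs: number, jitterMs: number): number {
  return baseMs + Math.floor(Math.random() * jitterMs);
}

async function politePause(baseMs = 2000, jitterMs = 3000): Promise<void> {
  const ms = jitteredDelayMs(baseMs, jitterMs);
  await new Promise((resolve) => setTimeout(resolve, ms));
}

console.log(jitteredDelayMs(2000, 3000), "ms until next request");
```

A fixed `sleep(2000)` between requests is itself a fingerprint; the jitter is what makes the traffic shape look organic.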
Final Thoughts
After building and maintaining 23 free scrapers, my stack has become very opinionated, but it works:
- Crawlee for the heavy lifting (retries, proxies, sessions)
- Cheerio for speed on static/hydrated pages
- Playwright for the occasional dynamic nightmare
If you’re starting out, focus on stable data sources (JSON blobs, hidden APIs) before resorting to full browser automation. And always respect the target site’s robots.txt and terms of service—scraping responsibly builds a healthier ecosystem for everyone.
Happy scraping! 🚀
Canvas Fingerprinting
Browsers render graphics differently based on your OS and GPU. Tools like Crawlee help spoof these so every request looks like it’s coming from a unique, “real” machine.
Building the Scraper
The scraper itself is the easy part. Maintenance is where the real work happens.
The “Tuesday” Problem
Big‑tech companies often push updates on Tuesdays. I’ve woken up many Wednesday mornings to find five scrapers broken because a single CSS class changed from price-value to p-val.
The Solution
- Build for failure – wrap parsers in `try/catch` blocks and use detailed logging.
- Use a Sentinel pattern: the scraper regularly checks if it’s still finding the “core” fields (e.g., Price or Title). If the “missing field” rate exceeds 20%, trigger an alert.
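The Sentinel check is just a ratio over a batch. A minimal sketch (the `Item` shape and 20% threshold mirror the description above; the alert here is a log line, where real code would page you):

```typescript
// Sketch: sentinel check — alert when core fields go missing too often.
interface Item { title?: string; price?: number }

function missingFieldRate(items: Item[]): number {
  const missing = items.filter(
    (i) => i.title === undefined || i.price === undefined
  );
  return items.length === 0 ? 0 : missing.length / items.length;
}

const batch: Item[] = [
  { title: "A", price: 10 },
  { title: "B" },             // price selector broke for this item
  { title: "C", price: 12 },
  {},                         // both core fields missing
];

const rate = missingFieldRate(batch);
if (rate > 0.2) {
  // Real code would fire a webhook/alert instead of logging.
  console.warn(`ALERT: ${Math.round(rate * 100)}% of items missing core fields`);
}
```

The point of the threshold is tolerance: one malformed listing shouldn’t wake you up, but a selector rename that nukes half the batch should.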
Why I Release These for Free
| Benefit | Explanation |
|---|---|
| Lead Magnet | A “free” scraper acts as a business card. Dozens of companies have reached out for custom integrations or private versions after seeing the code quality. |
| Apify Platform Credits | Users still pay for the compute and proxies they consume, which fuels the ecosystem and brings more paying users. |
| Portfolio Effect | When I apply for contracts I can say, “I maintain 23 scrapers with 10,000+ monthly runs.” That proof of scale is invaluable. |
Scraping in 2026 → AI
I now use LLMs (e.g., GPT‑4o) for “fallback” parsing: when my hand‑written selectors come up empty, the raw HTML goes to the model, which extracts the structured fields instead.
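The shape of that fallback looks roughly like this. Everything here is a sketch: `llmExtract` is a stub standing in for a real model call (which would send the HTML plus a schema prompt to an API), and the regexes are toy selectors:

```typescript
// Sketch: selector-first parsing with an LLM fallback.
// llmExtract is a stub; a real version would call a model API with the HTML.
type Product = { title: string; price: number };

function parseWithSelectors(html: string): Product | null {
  const title = html.match(/<h1[^>]*>([^<]+)<\/h1>/)?.[1];
  const price = html.match(/data-price="([\d.]+)"/)?.[1];
  return title && price ? { title, price: Number(price) } : null;
}

async function llmExtract(html: string): Promise<Product> {
  // Placeholder for a GPT-4o call returning structured JSON.
  return { title: "recovered-by-llm", price: 0 };
}

async function parseProduct(html: string): Promise<Product> {
  // Fast, deterministic path first; the (slow, paid) model only on failure.
  return parseWithSelectors(html) ?? (await llmExtract(html));
}
```

The economics matter: selectors cost nothing per page, so the LLM should only see the small fraction of pages where they break, not the whole crawl.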
My Stack
- Language: TypeScript (type safety is non‑negotiable for complex parsers)
- Framework: Crawlee
Libraries
- `cheerio` – lightning‑fast HTML parsing
- `playwright` – heavy‑duty browser automation
- `got-scraping` – HTTP requests that mimic real browsers
Platform
Apify – hosting, scheduling, and proxy rotation.
Reflections
Building 23 scrapers taught me more about web architecture than years of standard web development. It’s a constant cat‑and‑mouse game, but there’s something incredibly satisfying about turning the messy, unstructured web into a clean CSV file.
“The web is the world’s largest database, but it’s a database with a terrible API. Web scraping is the bridge that fixes that.”
See the Scrapers in Action
You can find the whole collection here:
👉 [My Apify Store Profile](https://apify.com/store)
Whether you need to monitor Amazon prices, find leads on Google Maps, or track TikTok trends, these tools are ready for you.
What’s Next?
Probably 23 more. The demand for data isn’t slowing down, and as long as there are websites, there will be a need for people who know how to (respectfully) scrape them.
I’m an indie developer focusing on web automation and data extraction. If you found this useful, follow me for more technical deep dives into the world of automation!