Why I ditched cloud scrapers and built a local-first Reddit tool

Published: December 18, 2025, 7:05 PM EST
3 min read
Source: Dev.to

The cloud scraper trap

I tried a bunch of cloud‑based tools: monitoring services, browser extensions that phone home to some server, and even a few Python scripts running on my VPS.

They all had the same problem: Reddit blocks server IPs—aggressively. My VPS got blocked within 5 minutes of running a simple scraper. Rotating proxies and residential IPs didn’t help; Reddit kept catching on. Every few weeks I’d get emails from my monitoring tool saying “we’re experiencing issues with Reddit.”

The obvious solution

A friend said offhand, “Why don’t you just run it on your computer?”

I had objections: distribution is harder, recurring billing and usage tracking are more complex. But if the app runs from my laptop, Reddit sees my home IP—just a normal person browsing. No detection to evade. It just works.

What I built

Reddit Toolbox: a Python + PyQt6 desktop app with SQLite for storage.

The core is embarrassingly simple:

import requests

def scrape_subreddit(name, limit=100):
    url = f"https://www.reddit.com/r/{name}.json?limit={limit}"

    # That's it. Just a GET request from the user's home IP.
    response = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
    }, timeout=10)

    if response.status_code == 200:
        return response.json()['data']['children']
    else:
        # Fall back to RSS if the JSON endpoint is blocked
        return scrape_via_rss(name)

The RSS fallback is key. Sometimes Reddit blocks JSON for certain patterns but leaves RSS open. Having both means it rarely fails completely.
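
The post doesn't show scrape_via_rss itself, but a minimal version can be built on the standard library alone, since each subreddit exposes an Atom feed at /r/<name>/.rss. This is a sketch under that assumption; the returned field names are my own choice, not necessarily what the app stores:

import xml.etree.ElementTree as ET
import requests

ATOM = "{http://www.w3.org/2005/Atom}"

def scrape_via_rss(name):
    # Reddit serves an Atom feed per subreddit; it often keeps working
    # when the .json endpoint starts returning errors.
    url = f"https://www.reddit.com/r/{name}/.rss"
    response = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
    }, timeout=10)
    response.raise_for_status()

    root = ET.fromstring(response.content)
    posts = []
    for entry in root.findall(f"{ATOM}entry"):
        link = entry.find(f"{ATOM}link")
        posts.append({
            'title': entry.findtext(f"{ATOM}title", default=''),
            'url': link.get('href') if link is not None else '',
            'updated': entry.findtext(f"{ATOM}updated", default=''),
        })
    return posts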

The features that actually matter

1. Batch scraping with filters

Paste 5 subreddit names, scrape 200 posts each in ~10 seconds. Filter by (a sketch of the filter logic follows the list):

  • Max comment count (e.g., ≤ 8)
  • Min score (filter out down‑voted content)
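
Applied to the JSON children returned by scrape_subreddit above, both filters are one comparison each. num_comments and score are Reddit's actual field names; the function name and defaults are just a sketch:

def filter_posts(posts, max_comments=8, min_score=1):
    # 'posts' is the list of children from scrape_subreddit();
    # each child wraps the real fields under 'data'.
    return [
        p for p in posts
        if p['data']['num_comments'] <= max_comments
        and p['data']['score'] >= min_score
    ]

# Example: low-competition threads that aren't downvoted
leads = filter_posts(scrape_subreddit('python', limit=200))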

2. Right‑click AI replies

Generate a starting draft for a reply (always heavily rewritten).
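
The post doesn't say which model or provider sits behind this, so treat the following as purely illustrative: a draft generator wired to the OpenAI Python client, with the prompt, model name, and function all placeholders.

from openai import OpenAI  # assumption: any chat-completion provider would work here

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def draft_reply(post_title, post_body):
    # Produce a rough first draft only; the user rewrites it before posting.
    prompt = (
        "Draft a short, helpful Reddit reply to this post.\n"
        f"Title: {post_title}\n\nBody: {post_body}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content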

3. User analysis

Before DMing someone, check their history: account age, karma, active subreddits—a quick sanity check.
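
Everything needed for that sanity check is available from Reddit's public JSON endpoints (about.json for age and karma, the comments listing for activity). This is a sketch of that shape, not the app's actual code, and the returned dict is my own choice:

from collections import Counter
from datetime import datetime, timezone
import requests

HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

def analyze_user(username):
    # Account age and karma come from the public about.json endpoint.
    about = requests.get(
        f"https://www.reddit.com/user/{username}/about.json",
        headers=HEADERS, timeout=10,
    ).json()['data']

    created = datetime.fromtimestamp(about['created_utc'], tz=timezone.utc)
    age_days = (datetime.now(timezone.utc) - created).days

    # Recent comments show which subreddits they're actually active in.
    comments = requests.get(
        f"https://www.reddit.com/user/{username}/comments.json?limit=100",
        headers=HEADERS, timeout=10,
    ).json()['data']['children']
    active_subs = Counter(c['data']['subreddit'] for c in comments)

    return {
        'age_days': age_days,
        'link_karma': about['link_karma'],
        'comment_karma': about['comment_karma'],
        'top_subreddits': active_subs.most_common(5),
    }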

The monetization question

With web apps you control logins, feature gates, and server‑side limits. With a desktop app the user has the binary and can do whatever.

I considered DRM, license keys, and hardware fingerprinting, but realized that the kind of person who would crack a $15/mo tool was never going to pay anyway.

So I kept it simple: the app checks subscription status once per session via an API call to Supabase. If it fails, it falls back to a free tier (15 scrapes/day). Could someone bypass this? Sure, but the people who need it for real work are happy to pay.
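
The interesting part is that the check fails open: any error degrades the session to the free tier instead of locking the user out. Here's a sketch of that shape, with a placeholder Supabase project URL, table, and columns rather than the real schema:

import requests

SUPABASE_URL = "https://YOUR-PROJECT.supabase.co"   # placeholder
SUPABASE_ANON_KEY = "public-anon-key"               # placeholder
FREE_TIER = {"plan": "free", "scrapes_per_day": 15}

def check_subscription(license_key):
    # Hypothetical table and column names; the real schema isn't public.
    try:
        resp = requests.get(
            f"{SUPABASE_URL}/rest/v1/subscriptions",
            params={"license_key": f"eq.{license_key}", "select": "plan,scrapes_per_day"},
            headers={
                "apikey": SUPABASE_ANON_KEY,
                "Authorization": f"Bearer {SUPABASE_ANON_KEY}",
            },
            timeout=5,
        )
        if resp.status_code == 200:
            rows = resp.json()
            if rows:
                return rows[0]
    except (requests.RequestException, ValueError):
        pass  # offline, blocked, or an unexpected response
    # Anything short of a confirmed subscription falls back to the free tier.
    return FREE_TIER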

Trade‑offs I accepted

  • No cross‑device sync – data lives on one machine.
  • Manual updates – working on an auto‑updater but not there yet.
  • Zero telemetry – no insight into how people actually use it, which feels nice.

Results

  • Zero support tickets about blocking, compared to daily issues with the cloud tools.
  • App size ~50 MB (vs. 150 MB+ for an Electron equivalent).
  • Users appreciate not needing a login to try it. One emailed: “Finally, a tool that doesn’t want my email first.” That made my week.

When local‑first makes sense

Local‑first isn’t a fit for everything. You still need a server for:

  • Real‑time collaboration
  • Multi‑device sync
  • Social features

But for single‑user tools that talk to APIs actively fighting scrapers, a local‑first approach is worth considering.

The tool is called Reddit Toolbox. A free tier is available if you want to try it.

Happy to answer questions about PyQt, the architecture, or why I now have strong opinions about User‑Agent strings.
