Building the Ultimate Reddit Scraper: A Full-Featured, API-Free Data Collection Suite
TL;DR
I built a complete Reddit scraper suite that requires zero API keys. It includes a Streamlit dashboard, a REST API for integration with tools like Grafana and Metabase, a plugin system for post‑processing, scheduled scraping, notifications, and more. Best of all—it’s completely open source.
🔗 GitHub:
The Problem
If you’ve ever tried to scrape Reddit data for analysis, research, or personal projects, you’ve likely encountered:
- Reddit’s API is heavily rate‑limited (especially after the 2023 API changes)
- API keys require approval and are increasingly restricted
- Existing scrapers are often single‑purpose – scrape posts or comments, not both
- No easy way to visualize or analyze the data after scraping
- Running scrapes manually is tedious; automation is needed
The Solution: Universal Reddit Scraper Suite
After weeks of development, I created a full‑featured scraper that offers:
| Feature | What It Does |
|---|---|
| 📊 Full Scraping | Posts, comments, images, videos, galleries—everything |
| 🚫 No API Keys | Uses Reddit’s public JSON endpoints and mirrors |
| 📈 Web Dashboard | Beautiful 7‑tab Streamlit UI for analysis |
| 🚀 REST API | Connect Metabase, Grafana, DuckDB, and more |
| 🔌 Plugin System | Extensible post‑processing (sentiment analysis, deduplication, keywords) |
| 📅 Scheduled Scraping | Cron‑style automation |
| 📧 Notifications | Discord & Telegram alerts when scrapes complete |
| 🐳 Docker Ready | One command to deploy anywhere |
Architecture Deep Dive
How It Works Without API Keys
Instead of Reddit’s official (and restricted) API, the scraper leverages:
- Public JSON endpoints – Every Reddit page has a `.json` suffix that returns structured data.
- Multiple mirror fallbacks – When one source is rate‑limited, the scraper automatically rotates through alternatives.
```python
MIRRORS = [
    "https://old.reddit.com/",
    "https://redlib.catsarch.com/",
    "https://redlib.vsls.cz/",
    "https://r.nf/",
    "https://libreddit.northboot.xyz/",
    "https://redlib.tux.pizza/",
]
```
If a source fails, the next mirror is tried automatically—no manual intervention required.
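Conceptually, the fallback logic boils down to trying each mirror in order until one responds. Here's a minimal sketch of the idea (illustrative only: the helper name, the headers, and the exact URL scheme each mirror accepts are assumptions, not the project's actual code):

```python
import requests

# Subset of the MIRRORS list shown above
MIRRORS = [
    "https://old.reddit.com/",
    "https://redlib.catsarch.com/",
    "https://r.nf/",
]

def fetch_subreddit_json(subreddit: str, limit: int = 100) -> dict:
    """Try each mirror in turn until one returns usable JSON."""
    headers = {"User-Agent": "reddit-scraper-demo/0.1"}
    last_error = None
    for mirror in MIRRORS:
        url = f"{mirror}r/{subreddit}/new.json?limit={limit}"
        try:
            resp = requests.get(url, headers=headers, timeout=10)
            if resp.status_code == 200:
                return resp.json()  # Reddit-style listing: {"data": {"children": [...]}}
            last_error = f"HTTP {resp.status_code} from {mirror}"
        except requests.RequestException as exc:
            last_error = str(exc)  # network error: move on to the next mirror
    raise RuntimeError(f"All mirrors failed; last error: {last_error}")
```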
The Core Scraping Engine
The scraper operates in three modes:
Full Mode – Complete package

```bash
python main.py --mode full --limit 100
```

Scrapes posts, downloads all media (images, videos, galleries), and fetches comments with full thread hierarchy.

History Mode – Fast metadata‑only

```bash
python main.py --mode history --limit 500
```

Builds a dataset of post metadata without downloading media.

Monitor Mode – Live watching

```bash
python main.py --mode monitor
```

Continuously checks for new posts every 5 minutes, ideal for tracking breaking news or trending discussions.
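Under the hood, monitor mode is essentially a polling loop that de-duplicates by post ID. A rough sketch, reusing the hypothetical fetch_subreddit_json helper from the earlier mirror example (the real implementation may differ):

```python
import time

POLL_INTERVAL = 5 * 60  # seconds, matching the "every 5 minutes" behaviour described above

def monitor(subreddit: str) -> None:
    seen_ids = set()
    while True:
        listing = fetch_subreddit_json(subreddit, limit=25)
        for child in listing["data"]["children"]:
            post = child["data"]
            if post["id"] not in seen_ids:
                seen_ids.add(post["id"])
                print(f"New post: {post['title']}")  # the real tool would persist and/or notify here
        time.sleep(POLL_INTERVAL)
```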
The Dashboard Experience
The 7‑tab Streamlit dashboard makes data exploration intuitive.
Overview Tab
- Total posts and comments
- Cumulative score across all posts
- Media post breakdown
- Posts‑over‑time chart
- Top 10 posts by score
Analytics Tab
- Sentiment Analysis – VADER‑based scoring for the entire dataset
- Keyword Cloud – Most frequently used terms
- Best Posting Times – Data‑driven insights on optimal engagement windows
Search Tab
Full‑text search with filters for:
- Minimum score
- Post type (text, image, video, gallery, link)
- Author
- Custom sorting
Comments Analysis
- View top‑scoring comments
- Identify most active commenters
- Track comment patterns over time
Scraper Controls
Start new scrapes directly from the dashboard. Configurable options include:
- Target subreddit/user
- Post limits
- Mode (full/history)
- Media and comment toggles
Job History
Observability into every scrape job:
- Status (running, completed, failed)
- Duration metrics
- Post/comment/media counts
- Error logging
Integrations
Pre‑configured instructions for connecting to:
- Metabase
- Grafana
- DreamFactory
- DuckDB
The Plugin Architecture
A simple yet powerful system for extensible post‑processing.
```python
class Plugin:
    """Base class for all plugins."""

    name = "base"
    description = "Base plugin"
    enabled = True

    def process_posts(self, posts):
        return posts

    def process_comments(self, comments):
        return comments
```
Built‑in Plugins
Sentiment Tagger
Adds VADER sentiment scores and labels to posts and comments.
```python
class SentimentTagger(Plugin):
    name = "sentiment_tagger"
    description = "Adds sentiment scores and labels to posts"

    def process_posts(self, posts):
        for post in posts:
            text = f"{post.get('title', '')} {post.get('selftext', '')}"
            score, label = analyze_sentiment(text)
            post['sentiment_score'] = score
            post['sentiment_label'] = label
        return posts
```
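The analyze_sentiment helper it calls isn't shown in the article. One plausible implementation, assuming the vaderSentiment package and the conventional ±0.05 compound-score thresholds (the original project may wire VADER up differently):

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

_analyzer = SentimentIntensityAnalyzer()

def analyze_sentiment(text):
    """Return (compound score, label) for a piece of text."""
    compound = _analyzer.polarity_scores(text)["compound"]
    if compound >= 0.05:
        label = "positive"
    elif compound <= -0.05:
        label = "negative"
    else:
        label = "neutral"
    return compound, label
```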
Deduplicator
Removes duplicate posts that may appear across multiple scraping sessions.
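Within a single batch this can be as simple as keying on the Reddit post ID; cross-session de-duplication would additionally check IDs already stored in the database. A sketch of the in-batch case (illustrative, assuming each post dict carries the id field from Reddit's JSON):

```python
class Deduplicator(Plugin):
    name = "deduplicator"
    description = "Drops posts that have already been seen"

    def process_posts(self, posts):
        seen, unique = set(), []
        for post in posts:
            post_id = post.get("id")
            if post_id and post_id in seen:
                continue  # duplicate within this batch: skip it
            seen.add(post_id)
            unique.append(post)
        return unique
```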
Keyword Extractor
Pulls out the most significant terms from scraped content for trend analysis.
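A bare-bones version of the idea is a stop-word-filtered frequency count over titles and self-text (the real plugin may use something more sophisticated, such as TF-IDF):

```python
import re
from collections import Counter

STOPWORDS = {"the", "and", "for", "that", "this", "with", "are", "was", "you", "have"}

def top_keywords(posts, n=20):
    """Return the n most common non-stop-word terms across all posts."""
    words = []
    for post in posts:
        text = f"{post.get('title', '')} {post.get('selftext', '')}".lower()
        words += [w for w in re.findall(r"[a-z']{3,}", text) if w not in STOPWORDS]
    return Counter(words).most_common(n)
```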
Creating Your Own Plugin
Add a new Python file to the `plugins/` directory:

```python
from plugins import Plugin

class MyCustomPlugin(Plugin):
    name = "my_plugin"
    description = "Does something cool"
    enabled = True

    def process_posts(self, posts):
        # Your logic here
        return posts
```
Enable plugins during scraping:

```bash
python main.py --mode full --plugins sentiment_tagger,deduplicator
```
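The article doesn't show how plugins are discovered; a common pattern is to import every module in the plugins/ package and instantiate the Plugin subclasses whose names were requested. A sketch of that approach (not necessarily the project's actual loader):

```python
import importlib
import pkgutil

import plugins
from plugins import Plugin

def load_plugins(enabled_names):
    """Import every module in the plugins package and return the requested Plugin instances."""
    for _, module_name, _ in pkgutil.iter_modules(plugins.__path__):
        importlib.import_module(f"plugins.{module_name}")
    return {
        cls.name: cls()
        for cls in Plugin.__subclasses__()
        if cls.name in enabled_names and cls.enabled
    }
```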
REST API for External Integrations
Run the API server:

```bash
python main.py --api
```
- API base URL:
- Documentation:
Key Endpoints
| Endpoint | Description |
|---|---|
| `GET /posts` | List posts with filters (subreddit, limit, offset) |
| `GET /comments` | List comments |
| `GET /subreddits` | All scraped subreddits |
| `GET /jobs` | Job history |
| `GET /query?sql=...` | Raw SQL queries for power users |
| `GET /grafana/query` | Grafana‑compatible time‑series data |
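As a quick example, here's how you might pull recent posts into a script with requests. The base URL and the JSON response shape are assumptions; adjust them to your deployment:

```python
import requests

BASE_URL = "http://localhost:8000"  # assumption: replace with your API base URL

resp = requests.get(
    f"{BASE_URL}/posts",
    params={"subreddit": "delhi", "limit": 25, "offset": 0},
    timeout=10,
)
resp.raise_for_status()
for post in resp.json():  # assumes the endpoint returns a JSON array of post objects
    print(post.get("title"), post.get("score"))
```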
Real‑World Integration: Grafana Dashboard
- Install the JSON API or Infinity plugin in Grafana.
- Add a datasource pointing to the scraper's API base URL.
- Use the `/grafana/query` endpoint for time‑series panels, e.g.:

```sql
SELECT date(created_utc) AS time, COUNT(*) AS posts
FROM posts
GROUP BY date(created_utc);
```
You now have a real‑time dashboard tracking Reddit activity.
Scheduled Scraping & Notifications
Automation Made Easy
Set up recurring scrapes with cron‑style scheduling:
```bash
# Scrape every 60 minutes
python main.py --schedule delhi --every 60
```
Custom options:

```bash
python main.py --schedule delhi --every 30 --mode full --limit 50
```
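Recurring jobs like this map naturally onto the schedule library; a minimal sketch of the pattern (not necessarily how the project implements it, and run_scrape here is just a placeholder):

```python
import time
import schedule

def run_scrape():
    # Placeholder: in the real tool this would kick off the full scrape pipeline.
    print("Scraping r/delhi ...")

schedule.every(60).minutes.do(run_scrape)  # mirrors `--every 60`

while True:
    schedule.run_pending()
    time.sleep(1)
```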
Get Notified
Configure Discord or Telegram alerts when scrapes complete.
```bash
export DISCORD_WEBHOOK_URL="https://discord.com/api/webhooks/..."
export TELEGRAM_BOT_TOKEN="123456:ABC..."
export TELEGRAM_CHAT_ID="987654321"
```
The scraper will send summary notifications to the chosen platform.
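The Discord side of this only requires a webhook POST; a sketch of what the notification step might look like (the message text is illustrative):

```python
import os
import requests

def notify_discord(summary: str) -> None:
    webhook_url = os.environ.get("DISCORD_WEBHOOK_URL")
    if not webhook_url:
        return  # notifications are optional: skip silently if not configured
    requests.post(webhook_url, json={"content": summary}, timeout=10)

notify_discord("✅ Scrape complete: 100 posts, 250 comments")
```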
Dry Run Mode: Test Before You Commit
Simulate a scrape without persisting any data:
```bash
python main.py --mode full --limit 50 --dry-run
```
Sample output:

```
🧪 DRY RUN MODE - No data will be saved
🧪 DRY RUN COMPLETE!
📊 Would scrape: 100 posts
💬 Would fetch: 250 comments
```