Building the Ultimate Reddit Scraper: A Full-Featured, API-Free Data Collection Suite
TL;DR
I built a complete Reddit scraper suite that requires zero API keys. It includes a Streamlit dashboard, a REST API for integration with tools like Grafana and Metabase, a plugin system for post‑processing, scheduled scraping, notifications, and more. Best of all—it’s completely open source.
🔗 GitHub:
The Problem
If you’ve ever tried to scrape Reddit data for analysis, research, or personal projects, you’ve likely encountered:
- Reddit’s API is heavily rate‑limited (especially after the 2023 API changes)
- API keys require approval and are increasingly restricted
- Existing scrapers are often single‑purpose – scrape posts or comments, not both
- No easy way to visualize or analyze the data after scraping
- Running scrapes manually is tedious; automation is needed
The Solution: Universal Reddit Scraper Suite
After weeks of development, I created a full‑featured scraper that offers:
| Feature | What It Does |
|---|---|
| 📊 Full Scraping | Posts, comments, images, videos, galleries—everything |
| 🚫 No API Keys | Uses Reddit’s public JSON endpoints and mirrors |
| 📈 Web Dashboard | Beautiful 7‑tab Streamlit UI for analysis |
| 🚀 REST API | Connect Metabase, Grafana, DuckDB, and more |
| 🔌 Plugin System | Extensible post‑processing (sentiment analysis, deduplication, keywords) |
| 📅 Scheduled Scraping | Cron‑style automation |
| 📧 Notifications | Discord & Telegram alerts when scrapes complete |
| 🐳 Docker Ready | One command to deploy anywhere |
Architecture Deep Dive
How It Works Without API Keys
Instead of Reddit’s official (and restricted) API, the scraper leverages:
- Public JSON endpoints – Every Reddit page has a `.json` suffix that returns structured data.
- Multiple mirror fallbacks – When one source is rate‑limited, the scraper automatically rotates through alternatives.
```python
MIRRORS = [
    "https://old.reddit.com/",
    "https://redlib.catsarch.com/",
    "https://redlib.vsls.cz/",
    "https://r.nf/",
    "https://libreddit.northboot.xyz/",
    "https://redlib.tux.pizza/",
]
```
If a source fails, the next mirror is tried automatically—no manual intervention required.
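Conceptually, the fallback logic boils down to trying each mirror in order until one responds. Here's a minimal sketch of the idea (illustrative only: the helper name, the headers, and the exact URL scheme each mirror accepts are assumptions, not the project's actual code):

```python
import requests

# Subset of the MIRRORS list shown above
MIRRORS = [
    "https://old.reddit.com/",
    "https://redlib.catsarch.com/",
    "https://r.nf/",
]

def fetch_subreddit_json(subreddit: str, limit: int = 100) -> dict:
    """Try each mirror in turn until one returns usable JSON."""
    headers = {"User-Agent": "reddit-scraper-demo/0.1"}
    last_error = None
    for mirror in MIRRORS:
        url = f"{mirror}r/{subreddit}/new.json?limit={limit}"
        try:
            resp = requests.get(url, headers=headers, timeout=10)
            if resp.status_code == 200:
                return resp.json()  # Reddit-style listing: {"data": {"children": [...]}}
            last_error = f"HTTP {resp.status_code} from {mirror}"
        except requests.RequestException as exc:
            last_error = str(exc)  # network error: move on to the next mirror
    raise RuntimeError(f"All mirrors failed; last error: {last_error}")
```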
The Core Scraping Engine
The scraper operates in three modes:
Full Mode – Complete package

```bash
python main.py --mode full --limit 100
```

Scrapes posts, downloads all media (images, videos, galleries), and fetches comments with full thread hierarchy.

History Mode – Fast metadata‑only

```bash
python main.py --mode history --limit 500
```

Builds a dataset of post metadata without downloading media.

Monitor Mode – Live watching

```bash
python main.py --mode monitor
```

Continuously checks for new posts every 5 minutes, ideal for tracking breaking news or trending discussions.
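Under the hood, monitor mode is essentially a polling loop that de-duplicates by post ID. A rough sketch, reusing the hypothetical fetch_subreddit_json helper from the earlier mirror example (the real implementation may differ):

```python
import time

POLL_INTERVAL = 5 * 60  # seconds, matching the "every 5 minutes" behaviour described above

def monitor(subreddit: str) -> None:
    seen_ids = set()
    while True:
        listing = fetch_subreddit_json(subreddit, limit=25)
        for child in listing["data"]["children"]:
            post = child["data"]
            if post["id"] not in seen_ids:
                seen_ids.add(post["id"])
                print(f"New post: {post['title']}")  # the real tool would persist and/or notify here
        time.sleep(POLL_INTERVAL)
```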
The Dashboard Experience
The 7‑tab Streamlit dashboard makes data exploration intuitive.
Overview Tab
- Total posts and comments
- Cumulative score across all posts
- Media post breakdown
- Posts‑over‑time chart
- Top 10 posts by score
Analytics Tab
- Sentiment Analysis – VADER‑based scoring for the entire dataset
- Keyword Cloud – Most frequently used terms
- Best Posting Times – Data‑driven insights on optimal engagement windows
Search Tab
Full‑text search with filters for:
- Minimum score
- Post type (text, image, video, gallery, link)
- Author
- Custom sorting
Comments Analysis
- View top‑scoring comments
- Identify most active commenters
- Track comment patterns over time
Scraper Controls
Start new scrapes directly from the dashboard. Configurable options include:
- Target subreddit/user
- Post limits
- Mode (full/history)
- Media and comment toggles
Job History
Observability into every scrape job:
- Status (running, completed, failed)
- Duration metrics
- Post/comment/media counts
- Error logging
Integrations
Pre‑configured instructions for connecting to:
- Metabase
- Grafana
- DreamFactory
- DuckDB
The Plugin Architecture
A simple yet powerful system for extensible post‑processing.
```python
class Plugin:
    """Base class for all plugins."""

    name = "base"
    description = "Base plugin"
    enabled = True

    def process_posts(self, posts):
        return posts

    def process_comments(self, comments):
        return comments
```
Built‑in Plugins
Sentiment Tagger
Adds VADER sentiment scores and labels to posts and comments.
```python
class SentimentTagger(Plugin):
    name = "sentiment_tagger"
    description = "Adds sentiment scores and labels to posts"

    def process_posts(self, posts):
        for post in posts:
            text = f"{post.get('title', '')} {post.get('selftext', '')}"
            score, label = analyze_sentiment(text)
            post['sentiment_score'] = score
            post['sentiment_label'] = label
        return posts
```
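The analyze_sentiment helper it calls isn't shown in the article. One plausible implementation, assuming the vaderSentiment package and the conventional ±0.05 compound-score thresholds (the original project may wire VADER up differently):

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

_analyzer = SentimentIntensityAnalyzer()

def analyze_sentiment(text):
    """Return (compound score, label) for a piece of text."""
    compound = _analyzer.polarity_scores(text)["compound"]
    if compound >= 0.05:
        label = "positive"
    elif compound <= -0.05:
        label = "negative"
    else:
        label = "neutral"
    return compound, label
```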
Deduplicator
Removes duplicate posts that may appear across multiple scraping sessions.
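Within a single batch this can be as simple as keying on the Reddit post ID; cross-session de-duplication would additionally check IDs already stored in the database. A sketch of the in-batch case (illustrative, assuming each post dict carries the id field from Reddit's JSON):

```python
class Deduplicator(Plugin):
    name = "deduplicator"
    description = "Drops posts that have already been seen"

    def process_posts(self, posts):
        seen, unique = set(), []
        for post in posts:
            post_id = post.get("id")
            if post_id and post_id in seen:
                continue  # duplicate within this batch: skip it
            seen.add(post_id)
            unique.append(post)
        return unique
```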
Keyword Extractor
Pulls out the most significant terms from scraped content for trend analysis.
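A bare-bones version of the idea is a stop-word-filtered frequency count over titles and self-text (the real plugin may use something more sophisticated, such as TF-IDF):

```python
import re
from collections import Counter

STOPWORDS = {"the", "and", "for", "that", "this", "with", "are", "was", "you", "have"}

def top_keywords(posts, n=20):
    """Return the n most common non-stop-word terms across all posts."""
    words = []
    for post in posts:
        text = f"{post.get('title', '')} {post.get('selftext', '')}".lower()
        words += [w for w in re.findall(r"[a-z']{3,}", text) if w not in STOPWORDS]
    return Counter(words).most_common(n)
```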
Creating Your Own Plugin
Add a new Python file to the `plugins/` directory:

```python
from plugins import Plugin

class MyCustomPlugin(Plugin):
    name = "my_plugin"
    description = "Does something cool"
    enabled = True

    def process_posts(self, posts):
        # Your logic here
        return posts
```
Enable plugins during scraping:

```bash
python main.py --mode full --plugins sentiment_tagger,deduplicator
```
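The article doesn't show how plugins are discovered; a common pattern is to import every module in the plugins/ package and instantiate the Plugin subclasses whose names were requested. A sketch of that approach (not necessarily the project's actual loader):

```python
import importlib
import pkgutil

import plugins
from plugins import Plugin

def load_plugins(enabled_names):
    """Import every module in the plugins package and return the requested Plugin instances."""
    for _, module_name, _ in pkgutil.iter_modules(plugins.__path__):
        importlib.import_module(f"plugins.{module_name}")
    return {
        cls.name: cls()
        for cls in Plugin.__subclasses__()
        if cls.name in enabled_names and cls.enabled
    }
```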
REST API for External Integrations
Run the API server:

```bash
python main.py --api
```
- API base URL:
- Documentation:
Key Endpoints
| Endpoint | Description |
|---|---|
| `GET /posts` | List posts with filters (subreddit, limit, offset) |
| `GET /comments` | List comments |
| `GET /subreddits` | All scraped subreddits |
| `GET /jobs` | Job history |
| `GET /query?sql=...` | Raw SQL queries for power users |
| `GET /grafana/query` | Grafana‑compatible time‑series data |
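As a quick example, here's how you might pull recent posts into a script with requests. The base URL and the JSON response shape are assumptions; adjust them to your deployment:

```python
import requests

BASE_URL = "http://localhost:8000"  # assumption: replace with your API base URL

resp = requests.get(
    f"{BASE_URL}/posts",
    params={"subreddit": "delhi", "limit": 25, "offset": 0},
    timeout=10,
)
resp.raise_for_status()
for post in resp.json():  # assumes the endpoint returns a JSON array of post objects
    print(post.get("title"), post.get("score"))
```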
Real‑World Integration: Grafana Dashboard
- Install the JSON API or Infinity plugin in Grafana.
- Add a datasource pointing to the scraper's API base URL.
- Use the `/grafana/query` endpoint for time‑series panels, e.g.:

```sql
SELECT date(created_utc) AS time, COUNT(*) AS posts
FROM posts
GROUP BY date(created_utc);
```
You now have a real‑time dashboard tracking Reddit activity.
Scheduled Scraping & Notifications
Automation Made Easy
Set up recurring scrapes with cron‑style scheduling:
```bash
# Scrape every 60 minutes
python main.py --schedule delhi --every 60
```
Custom options:

```bash
python main.py --schedule delhi --every 30 --mode full --limit 50
```
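Recurring jobs like this map naturally onto the schedule library; a minimal sketch of the pattern (not necessarily how the project implements it, and run_scrape here is just a placeholder):

```python
import time
import schedule

def run_scrape():
    # Placeholder: in the real tool this would kick off the full scrape pipeline.
    print("Scraping r/delhi ...")

schedule.every(60).minutes.do(run_scrape)  # mirrors `--every 60`

while True:
    schedule.run_pending()
    time.sleep(1)
```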
Get Notified
Configure Discord or Telegram alerts when scrapes complete.
```bash
export DISCORD_WEBHOOK_URL="https://discord.com/api/webhooks/..."
export TELEGRAM_BOT_TOKEN="123456:ABC..."
export TELEGRAM_CHAT_ID="987654321"
```
The scraper will send summary notifications to the chosen platform.
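The Discord side of this only requires a webhook POST; a sketch of what the notification step might look like (the message text is illustrative):

```python
import os
import requests

def notify_discord(summary: str) -> None:
    webhook_url = os.environ.get("DISCORD_WEBHOOK_URL")
    if not webhook_url:
        return  # notifications are optional: skip silently if not configured
    requests.post(webhook_url, json={"content": summary}, timeout=10)

notify_discord("✅ Scrape complete: 100 posts, 250 comments")
```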
Dry Run Mode: Test Before You Commit
Simulate a scrape without persisting any data:
```bash
python main.py --mode full --limit 50 --dry-run
```
Sample output:

```
🧪 DRY RUN MODE - No data will be saved
🧪 DRY RUN COMPLETE!
📊 Would scrape: 100 posts
💬 Would fetch: 250 comments
```