Building the Ultimate Reddit Scraper: A Full-Featured, API-Free Data Collection Suite

Published: 1 month ago (December 14, 2025, 08:30 GMT+8)

7 min read

Source: Dev.to

TL;DR

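I built a complete Reddit scraper suite that requires zero API keys. It includes a Streamlit dashboard, a REST API for integrating with tools such as Grafana and Metabase, an extensible plugin system, scheduled scraping, notifications, and more. Best of all, it is fully open source.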

🔗 GitHub:

The Problem

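If you have ever tried to scrape Reddit data for analysis, research, or a personal project, you have probably run into these problems:

Reddit's API is heavily rate-limited (especially after the 2023 API changes)
API keys require approval, and the restrictions keep tightening
Existing scrapers tend to do one thing only: posts or comments, but not both
Once the data is scraped, there is no convenient way to visualize or analyze it
Running scrapes by hand is tedious, so automation is badly needed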

The Solution: Universal Reddit Scraper Suite

After several weeks of development, I ended up with a full-featured scraper that provides:

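Feature	What It Does
📊 Full Scraping	Posts, comments, images, videos, galleries: everything
🚫 No API Keys	Uses Reddit's public JSON endpoints and mirror sites
📈 Web Dashboard	A polished 7-tab Streamlit UI for analysis
🚀 REST API	Connects to Metabase, Grafana, DuckDB, and more
🔌 Plugin System	Extensible post-processing (sentiment analysis, deduplication, keyword extraction)
📅 Scheduled Scraping	Cron-style automated scheduling
📧 Notifications	Discord and Telegram alerts when a scrape completes
🐳 Docker Ready	Deploy in any environment with a single command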

Architecture Deep Dive

How It Works Without API Keys

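Instead of Reddit's official (and restricted) API, the scraper relies on:

Public JSON endpoints – Every Reddit page has a .json variant that returns structured data.
Multiple mirror fallbacks – When one source is rate-limited, the scraper automatically switches to another mirror.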

MIRRORS = [
    "https://old.reddit.com/",
    "https://redlib.catsarch.com/",
    "https://redlib.vsls.cz/",
    "https://r.nf/",
    "https://libreddit.northboot.xyz/",
    "https://redlib.tux.pizza/",
]

If one source fails, the system automatically tries the next mirror, with no manual intervention.
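Here is a minimal sketch of how such a fallback loop might look. It is not the project's actual code: `fetch_with_fallback`, the injected `fetch` callable, and the example path are names made up for illustration, and the mirror list is shortened.

```python
from typing import Callable, Optional, Sequence

# Shortened copy of the mirror list, just for this sketch.
MIRRORS = [
    "https://old.reddit.com/",
    "https://redlib.catsarch.com/",
]


def fetch_with_fallback(
    path: str,
    fetch: Callable[[str], dict],
    mirrors: Sequence[str] = MIRRORS,
) -> dict:
    """Try each mirror in order and return the first successful response.

    `fetch` is a hypothetical callable that takes a URL and either returns
    parsed data or raises an exception (timeout, rate limit, bad status).
    """
    last_error: Optional[Exception] = None
    for base in mirrors:
        url = base.rstrip("/") + "/" + path.lstrip("/")
        try:
            return fetch(url)
        except Exception as exc:  # a real scraper would catch narrower errors
            last_error = exc
    raise RuntimeError(f"all {len(mirrors)} mirrors failed") from last_error
```

Passing the HTTP call in as a parameter keeps the retry logic independent of any particular client library, and makes it easy to exercise with a fake `fetch` that fails on purpose.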

The Core Scraping Engine

The scraper offers three modes:

Full Mode – Complete package

python main.py --mode full --limit 100

Scrapes posts, downloads all media (images, videos, galleries), and fetches the full comment tree.

History Mode – Fast metadata‑only

python main.py --mode history --limit 500

Builds a dataset of post metadata only, without downloading media.

Monitor Mode – Live watching

python main.py --mode monitor

Checks for new posts every 5 minutes, which makes it a good fit for tracking breaking news or trending discussions.
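To make the idea concrete, a polling loop of this kind could be sketched as follows. This is an illustration under assumptions, not the scraper's real implementation: `poll_new_posts`, `monitor`, `fetch_latest`, and `on_new` are hypothetical names, and each post is assumed to be a dict with an `id` key.

```python
import time
from typing import Callable, Iterable, List, Optional, Set


def poll_new_posts(
    fetch_latest: Callable[[], Iterable[dict]],
    seen_ids: Set[str],
) -> List[dict]:
    """Return posts whose id has not been seen before, and remember them."""
    fresh = []
    for post in fetch_latest():
        if post["id"] not in seen_ids:
            seen_ids.add(post["id"])
            fresh.append(post)
    return fresh


def monitor(
    fetch_latest: Callable[[], Iterable[dict]],
    on_new: Callable[[dict], None],
    interval_seconds: int = 300,       # 5 minutes, as described above
    max_cycles: Optional[int] = None,  # None means run until interrupted
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll at a fixed interval and hand each unseen post to `on_new`."""
    seen_ids: Set[str] = set()
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        for post in poll_new_posts(fetch_latest, seen_ids):
            on_new(post)
        cycle += 1
        if max_cycles is None or cycle < max_cycles:
            sleep(interval_seconds)
```

Tracking seen ids in a set is the simplest way to avoid reporting the same post twice when consecutive polls overlap.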

The Dashboard Experience

The 7-tab Streamlit dashboard makes data exploration intuitive.

Overview Tab

Total number of posts and comments
Cumulative score across all posts
Proportion of media posts
Posts-over-time chart
Top 10 posts by score

Analytics Tab

Sentiment Analysis – VADER-based sentiment scoring
Keyword Cloud – Word cloud of the most frequent terms
Best Posting Times – Data-driven insight into the best times to post
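As a rough illustration of what a "best posting times" metric could compute, the sketch below averages post scores per UTC hour. It assumes each post is a dict with `created_utc` as Unix epoch seconds and a numeric `score`; the function name is hypothetical and the dashboard's real calculation may differ.

```python
from collections import defaultdict
from datetime import datetime, timezone


def average_score_by_hour(posts):
    """Map each UTC hour (0-23) to the mean score of posts created in that hour."""
    totals = defaultdict(lambda: [0, 0])  # hour -> [score sum, post count]
    for post in posts:
        hour = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc).hour
        totals[hour][0] += post["score"]
        totals[hour][1] += 1
    return {hour: total / count for hour, (total, count) in sorted(totals.items())}


if __name__ == "__main__":
    sample = [
        {"created_utc": 0, "score": 10},     # 00:00 UTC
        {"created_utc": 1800, "score": 20},  # 00:30 UTC
        {"created_utc": 3600, "score": 5},   # 01:00 UTC
    ]
    print(average_score_by_hour(sample))
```

Hours with very few posts produce noisy averages, so a real analysis would probably want a minimum sample size per hour.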

Search Tab

Full-text search with the following filters:

Minimum score
Post type (text, image, video, gallery, link)
Author
Custom sort order
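The filters listed above could be combined along these lines. This is only a sketch over in-memory dicts: the field names (`title`, `selftext`, `score`, `post_type`, `author`) and the function `search_posts` are assumptions for illustration, and the dashboard may well run its search in the database instead.

```python
def search_posts(posts, query="", min_score=0, post_type=None, author=None, sort_key="score"):
    """Filter posts by text, score, type, and author, then sort descending."""
    needle = query.lower()
    matches = []
    for post in posts:
        text = f"{post.get('title', '')} {post.get('selftext', '')}".lower()
        if needle and needle not in text:
            continue  # text does not contain the search phrase
        if post.get("score", 0) < min_score:
            continue
        if post_type is not None and post.get("post_type") != post_type:
            continue
        if author is not None and post.get("author") != author:
            continue
        matches.append(post)
    return sorted(matches, key=lambda p: p.get(sort_key, 0), reverse=True)
```

Every filter is optional, so calling the function with only `query` behaves like a plain substring search.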

Comments Analysis

View the highest-scoring comments
Identify the most active commenters
Track how comment activity changes over time

Scraper Controls

Start a new scrape directly from the dashboard. Configurable options include:

Target subreddit / user
Post limit
Mode (full/history)
Media and comment toggles

Job History

Observability for every scrape job:

Status (running, completed, failed)
Duration metrics
Post/comment/media counts
Error logs

Integrations

Preconfigured instructions are provided for connecting:

Metabase
Grafana
DreamFactory
DuckDB

The Plugin Architecture

A simple yet powerful system for extensible post-processing.

class Plugin:
    """Base class for all plugins."""
    name = "base"
    description = "Base plugin"
    enabled = True

    def process_posts(self, posts):
        return posts

    def process_comments(self, comments):
        return comments
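One way a scrape could apply a list of such plugins is to chain them, feeding each plugin's output into the next. The sketch below repeats a trimmed copy of the base class so it runs on its own; `run_plugins` and `MinScoreFilter` are hypothetical names, not part of the project.

```python
class Plugin:
    """Trimmed copy of the base class above, repeated so the sketch is self-contained."""
    name = "base"
    enabled = True

    def process_posts(self, posts):
        return posts


class MinScoreFilter(Plugin):
    """Hypothetical example plugin: drops posts below a score threshold."""
    name = "min_score_filter"

    def __init__(self, threshold=1):
        self.threshold = threshold

    def process_posts(self, posts):
        return [p for p in posts if p.get("score", 0) >= self.threshold]


def run_plugins(plugins, posts):
    """Pass posts through each enabled plugin in order."""
    for plugin in plugins:
        if plugin.enabled:
            posts = plugin.process_posts(posts)
    return posts
```

Because every plugin takes a list and returns a list, the order of the chain matters: a filter placed before a sentiment tagger saves work, while placing it after lets the filter use the sentiment fields.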

Built‑in Plugins

Sentiment Tagger

Adds VADER sentiment scores and labels to posts and comments.

class SentimentTagger(Plugin):
    name = "sentiment_tagger"
    description = "Adds sentiment scores and labels to posts"

    def process_posts(self, posts):
        for post in posts:
            text = f"{post.get('title', '')} {post.get('selftext', '')}"
            score, label = analyze_sentiment(text)
            post['sentiment_score'] = score
            post['sentiment_label'] = label
        return posts

Deduplicator

Removes duplicate posts that can appear across multiple scrapes.
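The article does not show this plugin's code, but order-preserving deduplication is commonly written like the sketch below. It assumes posts are dicts identified by an `id` key; the function name and the choice to keep posts without an id are illustrative decisions, not the project's.

```python
def deduplicate(posts, key="id"):
    """Keep the first post seen for each key value, preserving input order."""
    seen = set()
    unique = []
    for post in posts:
        marker = post.get(key)
        if marker is None:
            unique.append(post)  # no identifier to compare, so keep it
            continue
        if marker in seen:
            continue
        seen.add(marker)
        unique.append(post)
    return unique
```

Keeping the first occurrence means an earlier scrape's copy wins. If fresher scores matter more, iterating over the input in reverse would keep the latest copy instead.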

Keyword Extractor

Extracts key terms from the scraped content for trend analysis.
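A very simple stand-in for keyword extraction is plain word counting with a stopword list, sketched below. The built-in plugin may use something more sophisticated; `top_keywords` and the tiny `STOPWORDS` set are assumptions made for the example.

```python
import re
from collections import Counter

# Deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for", "on"}


def top_keywords(posts, n=10):
    """Return the n most frequent non-stopword terms as (word, count) pairs."""
    counts = Counter()
    for post in posts:
        text = f"{post.get('title', '')} {post.get('selftext', '')}".lower()
        # Words of two or more ASCII letters; apostrophes allowed inside.
        for word in re.findall(r"[a-z][a-z']+", text):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(n)
```

Running this per day or per week and comparing the results gives a crude but serviceable view of which terms are trending.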

Creating Your Own Plugin

Add a new Python file to the plugins/ directory:

from plugins import Plugin

class MyCustomPlugin(Plugin):
    name = "my_plugin"
    description = "Does something cool"
    enabled = True

    def process_posts(self, posts):
        # Your logic here
        return posts

Enable plugins when scraping:

python main.py --mode full --plugins sentiment_tagger,deduplicator

REST API for External Integrations

Start the API server:

python main.py --api

API base URL:
Docs:

Key Endpoints

Endpoint	Description
`GET /posts`	List posts, with optional filters (subreddit, limit, offset)
`GET /comments`	List comments
`GET /subreddits`	All scraped subreddits
`GET /jobs`	Job history
`GET /query?sql=...`	Raw SQL queries for power users
`GET /grafana/query`	Grafana-compatible time-series endpoint
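For clients calling these endpoints, the main detail to get right is URL-encoding the query parameters, especially the SQL string. The helpers below are a sketch: the function names are hypothetical, and `http://api.example` in the usage lines is a placeholder rather than the real base URL, which is whatever address the API server reports on startup.

```python
from urllib.parse import urlencode


def build_posts_url(base_url, subreddit=None, limit=None, offset=None):
    """Build a GET /posts URL, including only the filters that were given."""
    params = {"subreddit": subreddit, "limit": limit, "offset": offset}
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = base_url.rstrip("/") + "/posts"
    return f"{url}?{query}" if query else url


def build_query_url(base_url, sql):
    """Build a GET /query URL with the SQL statement safely URL-encoded."""
    return base_url.rstrip("/") + "/query?" + urlencode({"sql": sql})


if __name__ == "__main__":
    # Placeholder host, for illustration only.
    print(build_posts_url("http://api.example", subreddit="delhi", limit=50))
    print(build_query_url("http://api.example", "SELECT COUNT(*) FROM posts"))
```

An endpoint that accepts raw SQL is convenient for dashboards, but it is best kept on a trusted network.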

Real‑World Integration: Grafana Dashboard

Install the JSON API or Infinity plugin in Grafana.
Add a data source pointing to the API base URL.
Use the /grafana/query endpoint in a time-series panel, for example:

SELECT date(created_utc) AS time, COUNT(*) AS posts
FROM posts
GROUP BY date(created_utc);

This gives you a live dashboard for monitoring Reddit activity.
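To see what that query returns, it can be tried against a throwaway in-memory SQLite table. The schema here is an assumption made for the sketch: the real `posts` table has more columns, and `created_utc` is assumed to be stored as ISO-8601 text. If the column held Unix epoch seconds instead, SQLite would need `date(created_utc, 'unixepoch')`.

```python
import sqlite3

QUERY = """
SELECT date(created_utc) AS time, COUNT(*) AS posts
FROM posts
GROUP BY date(created_utc);
"""


def daily_post_counts(rows):
    """Load (id, created_utc) rows into a temporary table and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical two-column schema, just enough for the query.
        conn.execute("CREATE TABLE posts (id TEXT PRIMARY KEY, created_utc TEXT)")
        conn.executemany("INSERT INTO posts VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [
        ("a", "2025-12-13 08:00:00"),
        ("b", "2025-12-13 21:30:00"),
        ("c", "2025-12-14 00:05:00"),
    ]
    for day, count in daily_post_counts(sample):
        print(day, count)
```

Each result row is a (date, count) pair, which is the shape a time-series panel expects: one time column and one value column.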

Scheduled Scraping & Notifications

Automation Made Easy

Set up recurring scrapes with cron-style scheduling:

# Scrape every 60 minutes
python main.py --schedule delhi --every 60

Custom options:

python main.py --schedule delhi --every 30 --mode full --limit 50
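Conceptually, an `--every N` option comes down to a loop that runs a job and then waits out the rest of the interval. The following is a sketch of that idea, not the project's scheduler: `run_every` and its parameters are hypothetical, and `sleep` and `clock` are injectable only so the loop can be tested without real waiting.

```python
import time
from typing import Callable, Optional


def run_every(
    job: Callable[[], None],
    minutes: float,
    max_runs: Optional[int] = None,  # None means run until interrupted
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Run `job` at a fixed interval and return how many times it ran.

    The job's own runtime is subtracted from each wait, so a slow
    scrape does not push later runs further and further back.
    """
    interval = minutes * 60
    runs = 0
    while max_runs is None or runs < max_runs:
        started = clock()
        job()
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break
        elapsed = clock() - started
        sleep(max(0.0, interval - elapsed))
    return runs
```

If a scrape takes longer than the interval, the wait clamps to zero and the next run starts immediately rather than overlapping with the previous one.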

Get Notified

Configure Discord or Telegram to send an alert when a scrape completes.

export DISCORD_WEBHOOK_URL="https://discord.com/api/webhooks/..."
export TELEGRAM_BOT_TOKEN="123456:ABC..."
export TELEGRAM_CHAT_ID="987654321"

The scraper sends a summary notification to the configured platforms.
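For reference, both services accept a simple JSON POST: a Discord webhook takes a `content` field, and the Telegram Bot API's `sendMessage` method takes `chat_id` and `text`. The sketch below only builds the requests from the environment variables above; the function names are hypothetical, and how the scraper actually formats and sends its summary is not shown in this article.

```python
import json
import os
import urllib.request

JSON_HEADERS = {"Content-Type": "application/json"}


def build_discord_request(webhook_url, summary):
    """Build a POST request that would post `summary` to a Discord webhook."""
    body = json.dumps({"content": summary}).encode("utf-8")
    return urllib.request.Request(webhook_url, data=body, headers=JSON_HEADERS, method="POST")


def build_telegram_request(bot_token, chat_id, summary):
    """Build a POST request for the Telegram Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    body = json.dumps({"chat_id": chat_id, "text": summary}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=JSON_HEADERS, method="POST")


def pending_notifications(summary, env=os.environ):
    """Return one request per platform that is configured in `env`."""
    requests = []
    if env.get("DISCORD_WEBHOOK_URL"):
        requests.append(build_discord_request(env["DISCORD_WEBHOOK_URL"], summary))
    if env.get("TELEGRAM_BOT_TOKEN") and env.get("TELEGRAM_CHAT_ID"):
        requests.append(
            build_telegram_request(env["TELEGRAM_BOT_TOKEN"], env["TELEGRAM_CHAT_ID"], summary)
        )
    return requests
```

Each request can then be sent with `urllib.request.urlopen(request)` or translated to any other HTTP client. Platforms whose variables are unset are simply skipped.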

Dry Run Mode: Test Before You Commit

Simulate a scrape without saving any data:

python main.py --mode full --limit 50 --dry-run

Sample output:

🧪 DRY RUN MODE - No data will be saved
🧪 DRY RUN COMPLETE!
📊 Would scrape: 100 posts
💬 Would fetch: 250 comments
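A dry-run switch like this usually amounts to a boolean flag that guards the write step. Here is a minimal sketch with `argparse`. The flag names mirror the commands in this article, but the defaults, `parse_args`, `finish_scrape`, and the `save` callback are assumptions for illustration rather than the project's code.

```python
import argparse


def parse_args(argv):
    """Parse a subset of the CLI flags used in this article."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", choices=["full", "history", "monitor"], default="full")
    parser.add_argument("--limit", type=int, default=100)  # default is an assumption
    parser.add_argument("--dry-run", action="store_true")  # exposed as args.dry_run
    return parser.parse_args(argv)


def finish_scrape(posts, comments, dry_run, save):
    """Persist results unless this is a dry run; return a one-line summary."""
    if dry_run:
        return f"Would scrape: {len(posts)} posts, would fetch: {len(comments)} comments"
    save(posts, comments)
    return f"Saved {len(posts)} posts and {len(comments)} comments"
```

Keeping the guard in a single place, right before the save, means the rest of the pipeline runs exactly as it would in a real scrape, which is what makes a dry run a meaningful test.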