I Built a Python Probe That Scrapes Zenn's Hidden Stats API and Commits PV Numbers to SQLite Every Hour (and Crashes on Purpose)
Source: Dev.to
If you ship technical posts on Zenn, you already know the pain: the dashboard shows pageviews, but there is no public stats endpoint and no export button. After three weeks of running this probe, I have an hourly PV time-series for every article in a 40 KB SQLite file, a Discord ping when a post breaks 500 views, and zero manual checking.
By the end of this article you will be able to: (1) call the same private JSON endpoint Zenn’s own dashboard uses, (2) write a probe that raises and dies on the first sign of trouble instead of logging a warning and continuing with garbage, and (3) run it from GitHub Actions so a crash auto-restarts on the next cron tick. Everything below is copy-paste runnable.
Why the /api/me/articles endpoint beats scraping HTML
My first version parsed the public article page with BeautifulSoup and read the “❤️ N” badge. It survived for nine days, then Zenn shipped a CSS change, the selector returned None, and my KPI table filled with 0s for two days before I noticed. The numbers looked plausible, which is the worst failure mode.
The fix was to stop scraping HTML entirely. When you open your Zenn dashboard with DevTools → Network open, the page fetches https://zenn.dev/api/me/articles?page=1. That JSON includes page_view_count, liked_count, comments_count, and published_at per article — the exact fields the dashboard renders. It needs your session cookie (connect.sid), which you grab once from DevTools → Application → Cookies and store as a secret.
The key design decision: this probe never swallows an error. A missing field, an empty list, an HTTP 401, or a PV count that went down since last run all raise an exception. The probe is meant to be killed and restarted, not to limp along producing fake KPI data. The recovery layer is the scheduler, not a try/except: pass.
The probe: zenn_probe.py that commits to SQLite or dies
Here is the full collector. It fetches every page, validates the shape of the response, and writes a row per article per run. If anything is off, it throws — and because the GitHub Actions job is marked failed, the next hourly run is a clean retry.
# zenn_probe.py
import os
import sqlite3
import sys
from datetime import datetime, timezone
import requests
ZENN_API = "https://zenn.dev/api/me/articles"
DB_PATH = os.environ.get("KPI_DB", "kpi.sqlite3")
SESSION_COOKIE = os.environ["ZENN_SESSION"] # KeyError on purpose if unset
class ProbeError(RuntimeError):
"""Raised when the response is structurally untrustworthy."""
def fetch_all_articles() -> list[dict]:
headers = {
"Cookie": f"connect.sid={SESSION_COOKIE}",
"User-Agent": "kpi-probe/1.0 (+personal-analytics)",
}
articles, page = [], 1
while True:
resp = requests.get(f"{ZENN_API}?page={page}", headers=headers, timeout=10)
if resp.status_code == 401:
raise ProbeError("401: ZENN_SESSION expired — refresh connect.sid")
resp.raise_for_status()
payload = resp.json()
batch = payload.get("articles")
if batch is None:
raise ProbeError(f"no 'articles' key on page {page}: {list(payload)[:5]}")
if not batch:
break
articles.extend(batch)
if not payload.get("next_page"):
break
page += 1
if page > 50: # guard against an infinite next_page loop
raise ProbeError("pagination exceeded 50 pages — aborting")
if not articles:
raise ProbeError("zero articles returned — auth or API shape changed")
return articles
def validate(article: dict) -> dict:
required = ("id", "slug", "title", "page_view_count", "liked_count")
missing = [k for k in required if k not in article]
if missing:
raise ProbeError(f"missing fields {missing} on {article.get('slug', '?')}")
pv = article["page_view_count"]
if not isinstance(pv, int) or pv None:
conn.execute(
"""
CREATE TABLE IF NOT EXISTS pv_history (
slug TEXT NOT NULL,
title TEXT NOT NULL,
page_views INTEGER NOT NULL,
likes INTEGER NOT NULL,
captured_at TEXT NOT NULL,
PRIMARY KEY (slug, captured_at)
)
"""
)
def last_pv(conn: sqlite3.Connection, slug: str) -> int | None:
row = conn.execute(
"SELECT page_views FROM pv_history WHERE slug=? "
"ORDER BY captured_at DESC LIMIT 1",
(slug,),
).fetchone()
return row[0] if row else None
def commit_run() -> None:
now = datetime.now(timezone.utc).isoformat(timespec="seconds")
articles = [validate(a) for a in fetch_all_articles()]
conn = sqlite3.connect(DB_PATH)
try:
init_db(conn)
for a in articles:
prev = last_pv(conn, a["slug"])
pv = a["page_view_count"]
# Zenn PV is monotonic. A drop means cache/garbage — fail loud.
if prev is not None and pv {pv}, refusing commit")
conn.execute(
"INSERT OR IGNORE INTO pv_history VALUES (?,?,?,?,?)",
(a["slug"], a["title"], pv, a["liked_count"], now),
)
conn.commit()
finally:
conn.close()
print(f"committed {len(articles)} articles at {now}")
if __name__ == "__main__":
try:
commit_run()
except ProbeError as exc:
print(f"PROBE FAILED: {exc}", file=sys.stderr)
sys.exit(1) # non-zero -> GitHub Actions marks the run failed
Enter fullscreen mode
Exit fullscreen mode
The three validations that have actually saved me, ranked by how often they fired in three weeks:
The monotonic PV check (`pv str | None: conn = sqlite3.connect(DB_PATH) try: rows = conn.execute(DIFF_SQL).fetchall() finally: conn.close()
movers = [r for r in rows if r[3] > 0]
crossed = [r for r in rows if r[2] >= ALERT_AT > r[2] - r[3]]
if not movers and not crossed:
return None # stay silent, no Discord spam
lines = ["**Zenn PV — last hour**"]
for slug, title, pv, delta in movers[:5]:
lines.append(f"• +{delta} → {pv:,} | {title[:40]}")
for slug, title, pv, _ in crossed:
lines.append(f"🎉 **{title[:40]}** crossed {ALERT_AT} PV ({pv:,})")
return "\n".join(lines)
def main() -> None: report = build_report() if report is None: print(“no movement — skipping Discord”) return resp = requests.post(WEBHOOK, json={“content”: report}, timeout=10) resp.raise_for_status() print(“posted report to Discord”)
if name == “main”: main()
Enter fullscreen mode
Exit fullscreen mode
The `ROW_NUMBER() OVER (PARTITION BY slug ...)` trick is the part people miss. Because the table stores a full snapshot every hour, `rn = 1` is the newest sample and `rn = 2` is the previous one, so the self-join gives you the exact hourly delta without storing any derived state. When I first wrote this with a `GROUP BY slug, MAX(captured_at)` it silently joined the wrong rows whenever two snapshots landed in the same second — the window function fixed it in one line.
Wiring it to GitHub Actions so a crash self-heals
The last piece is the scheduler that turns `sys.exit(1)` into free retries. This workflow runs hourly, commits the updated SQLite file back to the repo, and — critically — the probe failing does **not** corrupt the DB because the failed run never reaches `conn.commit()`.
.github/workflows/zenn-probe.yml
name: zenn-pv-probe on: schedule: - cron: “7 * * * *” # :07 every hour, off the busy :00 spike workflow_dispatch: {}
jobs: probe: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: actions/setup-python@v5 with: python-version: “3.12” - run: pip install requests - name: collect PV env: ZENN_SESSION: ${{ secrets.ZENN_SESSION }} KPI_DB: kpi.sqlite3 run: python zenn_probe.py - name: report to discord if: success() env: DISCORD_WEBHOOK: ${{ secrets.DISCORD_WEBHOOK }} run: python zenn_report.py - name: persist db if: success() run: | git config user.name kpi-bot git config user.email kpi-bot@users.noreply.github.com git add kpi.sqlite3 git diff —cached —quiet || git commit -m “pv snapshot” git push
Enter fullscreen mode
Exit fullscreen mode
One sharp edge that cost me an afternoon: I originally scheduled the cron at `0 * * * *`. GitHub's shared scheduler is heavily oversubscribed on the top of the hour, and my runs drifted 5–15 minutes late or got dropped entirely. Moving to `7 * * * *` made the firing time reliable to within a minute. If you run anything on GitHub cron, **never use `:00`.**
The second edge: committing a binary SQLite file to git grows history forever. After three weeks my `.git` was 6 MB for a 40 KB DB. If you keep this running for months, push the DB to a release asset or an S3 bucket instead of committing it — or squash the history periodically.
What three weeks of data actually told me
The payoff was not the automation — it was the shape of the curve. My top post gained 80% of its lifetime PV in the first 36 hours, then flatlined. Before I had hourly data I assumed Zenn traffic accumulated steadily; it does not. That single insight changed when I cross-post to Dev.to (now: within 12 hours, while the Zenn curve is still climbing, not three days later when it's already dead).
The probe is ~120 lines, costs nothing on GitHub's free tier (24 runs/day is well under the 2,000-minute limit), and the fail-loud design means I have never once had to ask "is this number real?" If you write on Zenn and care about what's working, an hour of setup buys you a KPI table you can actually trust — and `EXPLAIN`-able SQL beats squinting at a dashboard every time.
If you want to go deeper on the SQLite window-function patterns and CI observability behind this, *Designing Data-Intensive Applications* is the one book I keep reopening for the time-series modeling chapters.