I built a local screen reader that reads your screen aloud — no cloud, no API keys

Published: April 11, 2026 at 08:30 AM EDT
Source: Dev.to

What it does

  • 🖱️ Draw a rectangle on any part of your screen
  • 📸 Snapshot that region every N seconds
  • 🔍 Pixel‑diff check – skips frames where nothing changed
  • 🧠 LightOnOCR-2-1B reads the text (runs on AMD GPU via ROCm)
  • 🗣️ Kokoro-82M speaks it through your speakers (runs on CPU)
🖥️ screen → 🔍 diff → 🧠 OCR → ✨ clean text → 🗣️ TTS → 🔊 speaker
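
The flow above can be sketched in Python as a single per-frame step. This is a minimal sketch, not sttts's actual code: `process_frame`, `clean_text`, and the injected `changed`, `ocr`, and `speak` callables are hypothetical names, and the cleanup rules (re-joining words hyphenated across line breaks, collapsing whitespace) are only an assumption about what the "clean text" stage might do.

```python
import re
from typing import Any, Callable, Optional


def clean_text(raw: str) -> str:
    """Hypothetical cleanup: re-join hyphenated line breaks and collapse whitespace."""
    text = re.sub(r"-\n(\w)", r"\1", raw)  # "hyphen-\nated" -> "hyphenated"
    text = re.sub(r"\s+", " ", text)       # newlines and runs of spaces -> one space
    return text.strip()


def process_frame(
    frame: Any,
    prev_frame: Optional[Any],
    *,
    changed: Callable[[Any, Any], bool],
    ocr: Callable[[Any], str],
    speak: Callable[[str], None],
) -> Optional[str]:
    """Run one screen -> diff -> OCR -> clean -> TTS step.

    Returns the cleaned text, or None when the frame was skipped as unchanged.
    """
    # diff: skip OCR entirely when nothing changed since the previous snapshot
    if prev_frame is not None and not changed(prev_frame, frame):
        return None
    text = clean_text(ocr(frame))  # OCR, then tidy the raw output
    if text:
        speak(text)  # TTS -> speaker
    return text
```

Passing the stages in as callables keeps the loop testable without a GPU or an audio device. The hyphen rule is deliberately naive: it would also merge a genuine compound such as "well-" / "known" split across two lines.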

Auto page‑turn (the killer feature)

Draw a second rectangle over any button on screen. After TTS finishes speaking and the screen stays idle, sttts automatically clicks it. I use this with Kindle for PC – it reads the entire book hands‑free, turning pages automatically.

# Draw OCR region, then draw the next‑page button
uv run python capture.py --next-btn -i 2
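
The page-turn rule can be sketched as two small pieces: a check that TTS has finished and the screen has stayed idle long enough, and an `xdotool` call that clicks the center of the button rectangle using xdotool's command chaining (`mousemove X Y click 1`). This is an illustrative sketch, not sttts's implementation; `ready_to_turn`, `click_command`, `turn_page`, and the 3-second idle default are hypothetical.

```python
import subprocess


def ready_to_turn(tts_busy: bool, last_change: float, now: float,
                  idle_seconds: float = 3.0) -> bool:
    """True once speech has finished and the screen has been still for idle_seconds."""
    return not tts_busy and (now - last_change) >= idle_seconds


def click_command(rect: dict) -> list[str]:
    """Build an xdotool command that left-clicks the center of a rectangle.

    rect uses mss-style keys (left, top, width, height); that layout is an assumption.
    """
    x = rect["left"] + rect["width"] // 2
    y = rect["top"] + rect["height"] // 2
    # xdotool chains commands: move the pointer, then press mouse button 1
    return ["xdotool", "mousemove", str(x), str(y), "click", "1"]


def turn_page(rect: dict) -> None:
    """Click the next-page button (needs xdotool and an X11 session)."""
    subprocess.run(click_command(rect), check=True)
```

In a capture loop, `last_change` would be refreshed with `time.monotonic()` whenever the diff check fires, and `turn_page` called the first time `ready_to_turn(...)` returns True.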

Models used

Both models download automatically from Hugging Face on first run. No API keys, no subscriptions.
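
On the TTS side, the `kokoro` package's `KPipeline` yields audio chunks at 24 kHz, which `sounddevice` can play while blocking until each chunk ends; blocking like this is one simple way to know when speech has finished. This is a sketch, not sttts's code: `lang_code="a"` (American English) and `voice="af_heart"` are assumed choices, and `speak`, `_load_pipeline`, and `_play_blocking` are hypothetical names. The pipeline and player are injectable so the loop can run without the model.

```python
from functools import lru_cache
from typing import Callable, Optional

import numpy as np

SAMPLE_RATE = 24_000  # Kokoro-82M produces 24 kHz audio


@lru_cache(maxsize=None)
def _load_pipeline(lang_code: str = "a"):
    """Load Kokoro once; weights are fetched from Hugging Face on first use."""
    from kokoro import KPipeline  # imported lazily so the model is optional here
    return KPipeline(lang_code=lang_code)


def _play_blocking(audio: np.ndarray, samplerate: int) -> None:
    import sounddevice as sd
    sd.play(audio, samplerate)
    sd.wait()  # return only when this chunk has finished playing


def speak(text: str, pipeline: Optional[Callable] = None,
          play: Callable[[np.ndarray, int], None] = _play_blocking,
          voice: str = "af_heart") -> int:
    """Speak text chunk by chunk; returns the number of audio chunks played."""
    if not text.strip():
        return 0
    if pipeline is None:
        pipeline = _load_pipeline()
    chunks = 0
    # KPipeline yields (graphemes, phonemes, audio) for each text segment
    for _graphemes, _phonemes, audio in pipeline(text, voice=voice):
        play(np.asarray(audio, dtype=np.float32), SAMPLE_RATE)
        chunks += 1
    return chunks
```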

Smart idle detection

Pixel‑level diff comparison means OCR and TTS only fire when something actually changed. Reading a static page? Silent. New content loaded? Speaks immediately.

# Only trigger OCR when >1% of pixels changed
uv run python capture.py --diff-threshold 1.0
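
One way to compute that percentage is to compare the current and previous frames with NumPy and count the pixels where any channel differs. This is a minimal sketch of the idea, assuming frames arrive as height x width x channels arrays (which is what `np.asarray` of an `mss` screenshot gives, in BGRA order); `changed_percent`, `should_ocr`, and the `tolerance` parameter are hypothetical, not sttts's own code.

```python
from typing import Optional

import numpy as np


def changed_percent(prev: Optional[np.ndarray], curr: np.ndarray,
                    tolerance: int = 0) -> float:
    """Percentage of pixels that differ between two frames.

    A pixel counts as changed if any channel moved by more than `tolerance`.
    """
    if prev is None or prev.shape != curr.shape:
        return 100.0  # first frame or resized region: treat as fully changed
    # widen to int16 so uint8 subtraction cannot wrap around
    diff = np.abs(prev.astype(np.int16) - curr.astype(np.int16))
    changed = diff > tolerance
    if changed.ndim == 3:
        changed = changed.any(axis=-1)  # collapse color channels per pixel
    return float(changed.mean() * 100.0)


def should_ocr(prev: Optional[np.ndarray], curr: np.ndarray,
               threshold_percent: float = 1.0) -> bool:
    """Mirror of --diff-threshold: OCR only when more than threshold_percent changed."""
    return changed_percent(prev, curr) > threshold_percent
```

A nonzero `tolerance` would ignore small color shifts; the post does not say whether sttts applies anything similar.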

Quick start

# Install system dependencies (Debian/Ubuntu; slop and xdotool need an X11 session)
sudo apt-get install -y slop xdotool libportaudio2 libsndfile1

# Install uv (a fast Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository and run
git clone https://github.com/paradisecy/sttts
cd sttts
uv sync
uv run python capture.py

Use cases

  • 📖 Hands‑free ebook reading (Kindle, ePub readers, PDFs)
  • 📊 Spoken updates for financial dashboards
  • ♿ Accessibility tool for apps lacking native screen‑reader support
  • 💻 Listen to terminal output or logs while working
  • 🌐 Hear any webpage without a browser extension

Tech stack

  • Python 3.13
  • PyTorch 2.8 + ROCm 6.3 (AMD GPU)
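
On ROCm builds of PyTorch, AMD GPUs are exposed through the same `torch.cuda` API and the `"cuda"` device string, and `torch.version.hip` is set while `torch.version.cuda` is None. A minimal device-selection sketch under that assumption might look like this; the function names are hypothetical, not taken from sttts, and Kokoro stays on CPU as described above.

```python
def pick_ocr_device() -> str:
    """Return "cuda" when a GPU is usable (ROCm or CUDA build), else "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # ROCm builds reuse torch.cuda, so AMD GPUs also show up as "cuda"
    return "cuda" if torch.cuda.is_available() else "cpu"


def gpu_backend() -> str:
    """Describe which GPU stack this PyTorch build targets."""
    try:
        import torch
    except ImportError:
        return "none (torch not installed)"
    if getattr(torch.version, "hip", None):
        return f"ROCm/HIP {torch.version.hip}"
    if torch.version.cuda:
        return f"CUDA {torch.version.cuda}"
    return "CPU-only build"


TTS_DEVICE = "cpu"  # Kokoro-82M runs on CPU in this setup
```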

Key libraries

  • mss – fast screen capture
  • transformers – loading and running the OCR model
  • kokoro – TTS model
  • sounddevice – audio playback
  • slop + xdotool – region selection and mouse clicks
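
Region selection and capture can be sketched with `slop`, whose `-f` option formats the selected rectangle (here as `%x %y %w %h`), and `mss`, whose `grab` accepts a dict with `left`, `top`, `width`, and `height` keys. The helper names below are hypothetical; this is not sttts's actual `capture.py`.

```python
import subprocess


def parse_slop_geometry(output: str) -> dict:
    """Turn slop's "%x %y %w %h" output into an mss-style region dict."""
    x, y, w, h = (int(v) for v in output.split())
    return {"left": x, "top": y, "width": w, "height": h}


def select_region() -> dict:
    """Let the user drag a rectangle; check=True raises if slop fails or is cancelled."""
    result = subprocess.run(
        ["slop", "-f", "%x %y %w %h"],
        capture_output=True, text=True, check=True,
    )
    return parse_slop_geometry(result.stdout)


def grab(sct, region: dict):
    """Snapshot the region as a height x width x 4 (BGRA) NumPy array."""
    import numpy as np
    return np.asarray(sct.grab(region))
```

In use, one `mss.mss()` instance would be opened once (for example `with mss.mss() as sct:`) and `grab(sct, region)` called every N seconds, rather than reopening the capture backend for each snapshot.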

⭐ GitHub: