A supervisor-tree library for building predictable and resilient programs
Source: Dev.to
Claim
Battle‑tested idea, not AI‑slop.
I designed the architecture and wrote most of the implementation myself (>80%). AI assistance was used mainly for creating unit tests and standalone utilities.
I’m releasing Runsmith, an Erlang/OTP‑style supervisor‑tree framework for Python services and systems composed of multiple long‑running programs.
What Runsmith Provides
- Worker abstraction – each unit becomes a worker with an explicit finite‑state‑machine (FSM) lifecycle.
- Supervisor tree – monitors every worker continuously, detecting stalls and timeouts as well as crashes, and confines restarts to the failed unit so the rest of the system keeps running.
- Rich concurrency models – workers can run in threads, coroutines, or custom execution backends, not just separate processes.
- Fine‑grained health probes – failures are detected via constraint violations or health checks, not only abnormal process exits.
- Nested fault domains – an Erlang/OTP‑style supervisor tree enables hierarchical fault isolation.
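To make the first two points concrete, here is a minimal sketch of a worker with an explicit FSM lifecycle and a supervisor that restarts only the failed unit. It does not use Runsmith's API: `State`, `TRANSITIONS`, `Worker`, and `Supervisor` are hypothetical names chosen for illustration, and the set of states is an assumption.

```python
from enum import Enum, auto


class State(Enum):
    INIT = auto()
    RUNNING = auto()
    FAILED = auto()
    STOPPED = auto()


# Allowed lifecycle transitions (assumed for this sketch); anything else is rejected.
TRANSITIONS = {
    State.INIT: {State.RUNNING},
    State.RUNNING: {State.FAILED, State.STOPPED},
    State.FAILED: {State.RUNNING},  # a restart
    State.STOPPED: set(),
}


class Worker:
    """Hypothetical worker with an explicit FSM lifecycle."""

    def __init__(self, name):
        self.name = name
        self.state = State.INIT
        self.restarts = 0

    def transition(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"{self.name}: {self.state.name} -> {new_state.name} not allowed"
            )
        self.state = new_state


class Supervisor:
    """Hypothetical supervisor that restarts only workers in FAILED."""

    def __init__(self, workers):
        self.workers = {w.name: w for w in workers}

    def start_all(self):
        for w in self.workers.values():
            w.transition(State.RUNNING)

    def check(self):
        # Restart failed workers; healthy workers are left untouched.
        restarted = []
        for w in self.workers.values():
            if w.state is State.FAILED:
                w.transition(State.RUNNING)
                w.restarts += 1
                restarted.append(w.name)
        return restarted


camera = Worker("camera")
web = Worker("web")
sup = Supervisor([camera, web])
sup.start_all()
camera.transition(State.FAILED)  # simulate a crash in one unit
restarted = sup.check()          # only the camera worker is restarted
```

Because every state change goes through `transition`, an illegal move such as `RUNNING -> INIT` fails loudly instead of leaving a stray flag behind.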
Origin Story
I built the backend for a safety‑protection camera system used in manufacturing plants, where any downtime is unacceptable. The system comprised several processes that needed to run indefinitely and recover from failures without affecting each other:
- Web app – serves the HTTP API and Server‑Sent Events streams.
- Algorithm worker – runs computer‑vision inference on incoming frames.
- Camera controller – interacts with the camera device library and polls frames.
- Background task runner – executes scheduled jobs such as periodic data vacuuming.
- ONVIF service
During development I encountered issues such as:
- The algorithm worker stalling mid‑inference due to third‑party driver failures.
- The FastAPI web app event loop becoming starved because of poorly written synchronous code.
My first implementation was a “messy soup” of state flags, retry logic, watchdogs, and probes. It worked but was hard to maintain and reason about. I needed a framework where supervision is a first‑class concept and fault isolation is structural rather than bolted on.
Runsmith is the result of that need—a unified structure for modeling long‑running, stateful function units.
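A stalled worker like the one described above never exits, so a monitor that only watches for crashes will not notice it. One common way to catch this is a heartbeat probe, sketched below. This is an illustration of the general technique, not Runsmith's implementation: `HeartbeatProbe` and `FakeClock` are hypothetical names, and the injectable clock exists only to keep the example deterministic.

```python
import time


class HeartbeatProbe:
    """Hypothetical probe: unhealthy if no heartbeat arrived within
    `timeout` seconds, even though the worker is still alive."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_beat = clock()

    def beat(self):
        # Called by the worker each time it makes progress.
        self.last_beat = self.clock()

    def healthy(self):
        return (self.clock() - self.last_beat) <= self.timeout


class FakeClock:
    """Manually advanced clock so the example needs no real waiting."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


clock = FakeClock()
probe = HeartbeatProbe(timeout=5.0, clock=clock)

clock.advance(3.0)
ok_while_beating = probe.healthy()  # 3 s since the last beat, within the timeout

clock.advance(3.0)
stalled = not probe.healthy()       # 6 s without a beat: treated as a stall

probe.beat()
recovered = probe.healthy()         # a fresh beat makes the worker healthy again
```

In a real service the worker would call `beat()` from its main loop (for example, once per processed frame) and a supervisor would poll `healthy()` on a timer.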
Comparison with supervisord
No, Runsmith is not supervisord.
- supervisord is an OS‑level process control daemon that manages external programs by PID and static configuration.
- Runsmith is an in‑process, programmable Python library where the supervised unit is a typed worker with an explicit lifecycle.
Advantages over supervisord
- Rich concurrency models – workers can run in threads, coroutines, or custom backends, not just separate OS processes.
- Fine‑grained health probes – failures are detected via constraint violations and health checks, not only abnormal exits.
- Supervisor‑tree architecture – supports nested fault domains for hierarchical fault isolation, mirroring Erlang/OTP’s approach.
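The nested fault domains in the last point can be sketched as a small tree in which a supervisor restarts a failed child locally and escalates to its own parent only once the child's restart budget is spent. This follows the general Erlang/OTP idea and is not Runsmith's API: `Node`, `max_restarts`, and `report_failure` are hypothetical names, and the escalation policy is an assumption.

```python
class Node:
    """Hypothetical tree node: a leaf worker or a supervisor with children."""

    def __init__(self, name, children=(), max_restarts=2):
        self.name = name
        self.children = list(children)
        self.parent = None
        self.max_restarts = max_restarts
        self.restart_counts = {}  # child name -> restarts done by this supervisor
        self.starts = 0
        for child in self.children:
            child.parent = self

    def start(self):
        # Starting a supervisor (re)starts its whole subtree and resets budgets.
        self.starts += 1
        self.restart_counts = {}
        for child in self.children:
            child.start()

    def report_failure(self, child):
        """Restart `child` locally, or escalate if its budget is spent."""
        count = self.restart_counts.get(child.name, 0)
        if count < self.max_restarts:
            self.restart_counts[child.name] = count + 1
            child.start()
            return f"{self.name} restarted {child.name}"
        if self.parent is None:
            return f"{self.name} gave up on {child.name}"
        # Budget exhausted: this whole fault domain is reported as failed.
        return self.parent.report_failure(self)


inference = Node("inference")
camera = Node("camera")
vision = Node("vision", [inference, camera], max_restarts=1)
web = Node("web")
root = Node("root", [vision, web])
root.start()

log = [
    vision.report_failure(inference),  # handled inside the vision domain
    vision.report_failure(inference),  # budget spent, escalates to root
]
```

In this run the second failure restarts the whole `vision` subtree, while `web` sits in a sibling fault domain and is started only once.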