Show HN: RunAnywhere – Faster AI Inference on Apple Silicon
Source: Hacker News
Introduction
Hi HN, we’re Sanchit and Shubham (YC W26). We built MetalRT, a fast inference engine for Apple Silicon. Across LLMs, speech‑to‑text, and text‑to‑speech, it beats llama.cpp, Apple’s MLX, Ollama, and sherpa‑onnx on every modality we tested. It uses custom Metal shaders and has no framework overhead.
We’ve also open‑sourced RCLI, the fastest end‑to‑end voice AI pipeline on Apple Silicon: mic to spoken response, entirely on‑device, with no cloud or API keys.
Getting Started
# Install via Homebrew
brew tap RunanywhereAI/rcli https://github.com/RunanywhereAI/RCLI.git
brew install rcli
# Set up models (≈1 GB download)
rcli setup
# Run interactive mode (push‑to‑talk)
rcli
Or install with a single script:
curl -fsSL https://raw.githubusercontent.com/RunanywhereAI/RCLI/main/install.sh | bash
Benchmarks
LLM Decoding
| Model | Tokens/s | vs mlx‑lm | vs llama.cpp |
|---|---|---|---|
| Qwen3‑0.6B | 658 | 1.19× faster (552) | 1.67× faster (295) |
| Qwen3‑4B | 186 | 1.09× faster (170) | 2.14× faster (87) |
| LFM2.5‑1.2B | 570 | 1.12× faster (509) | 1.53× faster (372) |
Time‑to‑first‑token: 6.6 ms.
Speech‑to‑Text (STT)
- 70 seconds of audio transcribed in 101 ms → 714× real‑time, 4.6× faster than mlx‑whisper.
Text‑to‑Speech (TTS)
- 178 ms synthesis, 2.8× faster than mlx‑audio and sherpa‑onnx.
Motivation
Demoing on‑device AI is easy; shipping it is brutal. Voice is the hardest test because it chains STT → LLM → TTS sequentially, and any slow stage hurts the user experience. Most teams fall back to cloud APIs not because local models are bad, but because local inference infrastructure adds latency.
The core challenge is latency compounding: three models in series can easily exceed 600 ms, which feels broken. Every stage must be fast, run on a single device, and avoid network round‑trips.
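To make the compounding concrete, here is a back‑of‑the‑envelope budget for one serial voice turn. The stage numbers below are illustrative assumptions for this sketch, not RunAnywhere measurements:

```shell
# Hypothetical per-stage latencies (ms) for a serial STT -> LLM -> TTS turn.
stt_ms=250        # transcribe the user's utterance
llm_ttft_ms=200   # LLM time-to-first-token
tts_ms=180        # synthesize the first chunk of audio

# Stages run back-to-back, so perceived latency is the plain sum.
total_ms=$((stt_ms + llm_ttft_ms + tts_ms))
echo "perceived response latency: ${total_ms} ms"   # 630 ms
```

Even with each stage individually "fast", the serial sum lands past the ~600 ms mark the post calls broken, which is why every stage has to shrink at once.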
Technical Approach
We went straight to Metal:
- Custom GPU compute shaders for quantized matmul, attention, and activation, compiled ahead of time.
- Zero allocations during inference – all memory is pre‑allocated at init.
- A single unified engine (MetalRT) handles LLM, STT, and TTS natively on Apple Silicon, avoiding the graph schedulers, runtime dispatchers, and memory managers that other engines layer on top of the GPU.
MetalRT is the first engine to handle all three modalities natively on Apple Silicon.
Resources
- LLM benchmarks:
- Speech benchmarks:
- Voice pipeline optimizations:
- RAG optimizations:
Open‑Source Project
- Repository: (MIT license)
- Features:
- Three concurrent threads with lock‑free ring buffers
- Double‑buffered TTS
- 38 macOS actions by voice
- Local RAG (~4 ms over 5 K+ chunks)
- 20 hot‑swappable models
- Full‑screen TUI with per‑op latency readouts
- Falls back to llama.cpp when MetalRT isn’t installed
Demo
Watch the demo video:
Discussion Prompt
What would you build if on‑device AI were genuinely as fast as cloud?
(86 points, 23 comments)