Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon
Source: Hacker News
Overview
We’re Sanchit and Shubham (YC W26). We built MetalRT, a fast inference engine for Apple Silicon that accelerates LLMs, speech‑to‑text (STT), and text‑to‑speech (TTS). MetalRT outperforms llama.cpp, Apple MLX, Ollama, and sherpa‑onnx across every modality we tested, thanks to custom Metal shaders and zero‑allocation inference.
We also open‑sourced RCLI, the fastest end‑to‑end voice‑AI pipeline on Apple Silicon. It runs entirely on‑device—from microphone input to spoken response—without any cloud calls or API keys.
Getting Started
# Install via Homebrew
brew tap RunanywhereAI/rcli https://github.com/RunanywhereAI/RCLI.git
brew install rcli
# Set up models (≈1 GB download)
rcli setup
# Run the interactive mode (push‑to‑talk)
rcli
Alternatively, install with a single script:
curl -fsSL https://raw.githubusercontent.com/RunanywhereAI/RCLI/main/install.sh | bash
Benchmarks
LLM Decoding
| Model | Tokens/s (MetalRT) | Tokens/s (Apple MLX) | Tokens/s (llama.cpp) |
|---|---|---|---|
| Qwen3‑0.6B | 658 | 552 | 295 |
| Qwen3‑4B | 186 | 170 | 87 |
| LFM2.5‑1.2B | 570 | 509 | 372 |
Time‑to‑first‑token: 6.6 ms (MetalRT; not reported for the other engines).
MetalRT is 1.67× faster than llama.cpp and 1.19× faster than Apple MLX (using the same model files).
Speech‑to‑Text (STT)
- 70 seconds of audio transcribed in 101 ms (≈693× real‑time).
- 4.6× faster than mlx‑whisper.
Text‑to‑Speech (TTS)
- Synthesis latency: 178 ms.
- 2.8× faster than mlx‑audio and sherpa‑onnx.
Technical Details
Why Metal?
Most inference engines insert layers of abstraction—graph schedulers, runtime dispatchers, memory managers—between the model and the GPU. MetalRT eliminates these layers:
- Custom Metal compute shaders for quantized matrix multiplication, attention, and activation, compiled ahead of time.
- Zero‑allocation inference: all memory is pre‑allocated at initialization, avoiding allocations during execution.
- Unified engine: a single runtime handles LLM, STT, and TTS, removing the need to stitch separate runtimes together.
Voice Pipeline Optimizations
- Three concurrent threads with lock‑free ring buffers.
- Double‑buffered TTS for continuous playback.
- 38 macOS voice actions, local RAG (~4 ms over 5,000+ chunks).
- 20 hot‑swappable models.
- Full‑screen TUI displaying per‑operation latency.
- Automatic fallback to llama.cpp when MetalRT isn’t available.
Open‑Source Project
- Repository: (MIT license)
- Demo video:
Further Reading
- LLM benchmarks:
- Speech benchmarks:
- Voice pipeline details:
- RAG optimizations:
Discussion Prompt
What would you build if on‑device AI were genuinely as fast as cloud?