Show HN: RunAnywhere – Faster AI Inference on Apple Silicon

Published: (March 10, 2026 at 01:14 PM EDT)
3 min read

Source: Hacker News

Introduction

Hi HN, we’re Sanchit and Shubham (YC W26). We built MetalRT, a fast inference engine for Apple Silicon. Across LLMs, speech‑to‑text, and text‑to‑speech, MetalRT beats llama.cpp, Apple’s MLX, Ollama, and sherpa‑onnx on every modality we tested. It uses custom Metal shaders and has no framework overhead.

We’ve also open‑sourced RCLI, the fastest end‑to‑end voice AI pipeline on Apple Silicon: mic to spoken response, entirely on‑device, with no cloud or API keys.

Getting Started

# Install via Homebrew
brew tap RunanywhereAI/rcli https://github.com/RunanywhereAI/RCLI.git
brew install rcli

# Set up models (≈1 GB download)
rcli setup

# Run interactive mode (push‑to‑talk)
rcli

Or install with a single script:

curl -fsSL https://raw.githubusercontent.com/RunanywhereAI/RCLI/main/install.sh | bash

Benchmarks

LLM Decoding

Model          Tokens/s   vs mlx‑lm             vs llama.cpp
Qwen3‑0.6B     658        1.19× faster (552)    1.67× faster (295)
Qwen3‑4B       186        1.09× faster (170)    2.14× faster (87)
LFM2.5‑1.2B    570        1.12× faster (509)    1.53× faster (372)

Time‑to‑first‑token: 6.6 ms

Speech‑to‑Text (STT)

  • 70 seconds of audio transcribed in 101 ms – 714× real‑time, 4.6× faster than mlx‑whisper.

Text‑to‑Speech (TTS)

  • 178 ms synthesis, 2.8× faster than mlx‑audio and sherpa‑onnx.

Motivation

Demoing on‑device AI is easy; shipping it is brutal. Voice is the hardest test because it chains STT → LLM → TTS sequentially, and any slow stage hurts the user experience. Most teams fall back to cloud APIs not because local models are bad, but because local inference infrastructure adds latency.

The core challenge is latency compounding: three models in series can easily exceed 600 ms, which feels broken. Every stage must be fast, run on a single device, and avoid network round‑trips.

Technical Approach

We went straight to Metal:

  • Custom GPU compute shaders for quantized matmul, attention, and activation, compiled ahead of time.
  • Zero allocations during inference – all memory is pre‑allocated at init.
  • A single unified engine (MetalRT) handles LLM, STT, and TTS natively on Apple Silicon, avoiding the graph schedulers, runtime dispatchers, and memory managers that other engines layer on top of the GPU.

MetalRT is the first engine to handle all three modalities natively on Apple Silicon.

Resources

  • LLM benchmarks:
  • Speech benchmarks:
  • Voice pipeline optimizations:
  • RAG optimizations:

Open‑Source Project

  • Repository: (MIT license)
  • Features:
    • Three concurrent threads with lock‑free ring buffers
    • Double‑buffered TTS
    • 38 macOS actions by voice
    • Local RAG (~4 ms over 5 K+ chunks)
    • 20 hot‑swappable models
    • Full‑screen TUI with per‑op latency readouts
    • Falls back to llama.cpp when MetalRT isn’t installed

Demo

Watch the demo video:

Discussion Prompt

What would you build if on‑device AI were genuinely as fast as cloud?

Comments

(86 points, 23 comments)

