Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon
Source: Hacker News
Overview
We’re Sanchit and Shubham (YC W26). We built MetalRT, a fast inference engine for Apple Silicon that accelerates LLMs, speech‑to‑text (STT), and text‑to‑speech (TTS). MetalRT outperforms llama.cpp, Apple MLX, Ollama, and sherpa‑onnx across every modality we tested, thanks to custom Metal shaders and zero‑allocation inference.
We also open‑sourced RCLI, the fastest end‑to‑end voice‑AI pipeline on Apple Silicon. It runs entirely on‑device—from microphone input to spoken response—without any cloud calls or API keys.
Getting Started
# Install via Homebrew
brew tap RunanywhereAI/rcli https://github.com/RunanywhereAI/RCLI.git
brew install rcli
# Set up models (≈1 GB download)
rcli setup
# Run the interactive mode (push‑to‑talk)
rcli
Alternatively, install with a single script:
curl -fsSL https://raw.githubusercontent.com/RunanywhereAI/RCLI/main/install.sh | bash
Benchmarks
LLM Decoding
| Model | Tokens/s (MetalRT) | Tokens/s (Apple MLX) | Tokens/s (llama.cpp) |
|---|---|---|---|
| Qwen3‑0.6B | 658 | 552 | 295 |
| Qwen3‑4B | 186 | 170 | 87 |
| LFM2.5‑1.2B | 570 | 509 | 372 |
Time‑to‑first‑token: 6.6 ms (MetalRT; not reported for the other engines).
MetalRT is 1.67× faster than llama.cpp and 1.19× faster than Apple MLX (using the same model files).
Speech‑to‑Text (STT)
- 70 seconds of audio transcribed in 101 ms (≈693× real‑time).
- 4.6× faster than mlx‑whisper.
Text‑to‑Speech (TTS)
- Synthesis latency: 178 ms.
- 2.8× faster than mlx‑audio and sherpa‑onnx.
Technical Details
Why Metal?
Most inference engines insert layers of abstraction—graph schedulers, runtime dispatchers, memory managers—between the model and the GPU. MetalRT eliminates these layers:
- Custom Metal compute shaders for quantized matrix multiplication, attention, and activation, compiled ahead of time.
- Zero‑allocation inference: all memory is pre‑allocated at initialization, avoiding allocations during execution.
- Unified engine: a single runtime handles LLM, STT, and TTS, removing the need to stitch separate runtimes together.
Voice Pipeline Optimizations
- Three concurrent threads with lock‑free ring buffers.
- Double‑buffered TTS for continuous playback.
- 38 macOS voice actions, local RAG (~4 ms over 5,000+ chunks).
- 20 hot‑swappable models.
- Full‑screen TUI displaying per‑operation latency.
- Automatic fallback to llama.cpp when MetalRT isn’t available.
Open‑Source Project
- Repository: (MIT license)
- Demo video:
Further Reading
- LLM benchmarks:
- Speech benchmarks:
- Voice pipeline details:
- RAG optimizations:
Discussion Prompt
What would you build if on‑device AI were genuinely as fast as cloud?