Launch HN: RunAnywhere (YC W26) – Faster AI Inference on Apple Silicon

Published: March 10, 2026 at 01:14 PM EDT

Source: Hacker News

Overview

We’re Sanchit and Shubham (YC W26). We built MetalRT, a fast inference engine for Apple Silicon that accelerates LLMs, speech‑to‑text (STT), and text‑to‑speech (TTS). MetalRT outperforms llama.cpp, Apple MLX, Ollama, and sherpa‑onnx across every modality we tested, thanks to custom Metal shaders and zero‑allocation inference.

We also open‑sourced RCLI, the fastest end‑to‑end voice‑AI pipeline on Apple Silicon. It runs entirely on‑device—from microphone input to spoken response—without any cloud calls or API keys.


Getting Started

# Install via Homebrew
brew tap RunanywhereAI/rcli https://github.com/RunanywhereAI/RCLI.git
brew install rcli

# Set up models (≈1 GB download)
rcli setup

# Run the interactive mode (push‑to‑talk)
rcli

Alternatively, install with a single script:

curl -fsSL https://raw.githubusercontent.com/RunanywhereAI/RCLI/main/install.sh | bash

Benchmarks

LLM Decoding

| Model | MetalRT (tokens/s) | Apple MLX (tokens/s) | llama.cpp (tokens/s) |
|---|---|---|---|
| Qwen3‑0.6B | 658 | 552 | 295 |
| Qwen3‑4B | 186 | 170 | 87 |
| LFM2.5‑1.2B | 570 | 509 | 372 |

Time‑to‑first‑token: 6.6 ms.

MetalRT is 1.67× faster than llama.cpp and 1.19× faster than Apple MLX (using the same model files).
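As a sanity check, the per‑model ratios can be recomputed directly from the decode table above (a throwaway sketch; the numbers are copied from the table, and the tuple is rounded to two decimals):

```python
# Tokens/s from the decode table: (MetalRT, Apple MLX, llama.cpp).
table = {
    "Qwen3-0.6B": (658, 552, 295),
    "Qwen3-4B": (186, 170, 87),
    "LFM2.5-1.2B": (570, 509, 372),
}

# Per-model speedup of MetalRT over (MLX, llama.cpp).
speedups = {
    model: (round(rt / mlx, 2), round(rt / lcpp, 2))
    for model, (rt, mlx, lcpp) in table.items()
}
```

Note that the per‑model ratios vary (e.g. Qwen3‑0.6B works out to 1.19× over MLX and 2.23× over llama.cpp); the headline figures presumably summarize a broader set of runs.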

Speech‑to‑Text (STT)

  • 70 seconds of audio transcribed in 101 ms (≈714× real‑time).
  • 4.6× faster than mlx‑whisper.

Text‑to‑Speech (TTS)

  • Synthesis latency: 178 ms.
  • 2.8× faster than mlx‑audio and sherpa‑onnx.

Technical Details

Why Metal?

Most inference engines insert layers of abstraction—graph schedulers, runtime dispatchers, memory managers—between the model and the GPU. MetalRT eliminates these layers:

  • Custom Metal compute shaders for quantized matrix multiplication, attention, and activation, compiled ahead of time.
  • Zero‑allocation inference: all memory is pre‑allocated at initialization, avoiding allocations during execution.
  • Unified engine: a single runtime handles LLM, STT, and TTS, removing the need to stitch separate runtimes together.
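The zero‑allocation idea can be sketched in a few lines: every buffer the decode loop touches is sized for the maximum sequence length at initialization, and each step only writes into those pre‑existing slots. This is an illustrative Python/NumPy sketch of the pattern, not MetalRT's actual implementation; the class and buffer names are made up:

```python
import numpy as np

class PreallocatedDecoder:
    """Sketch of zero-allocation decoding: all buffers are allocated
    once at init; the hot loop performs only in-place writes."""

    def __init__(self, max_seq_len: int, n_heads: int, head_dim: int, vocab: int):
        # One-time allocations at initialization.
        self.kv_cache = np.zeros((2, max_seq_len, n_heads, head_dim), dtype=np.float16)
        self.logits = np.zeros(vocab, dtype=np.float32)
        self.pos = 0

    def step(self, k, v, logits_out):
        # Hot path: write into pre-allocated slots, never allocate.
        self.kv_cache[0, self.pos] = k
        self.kv_cache[1, self.pos] = v
        np.copyto(self.logits, logits_out)
        self.pos += 1
        return int(self.logits.argmax())

dec = PreallocatedDecoder(max_seq_len=4, n_heads=2, head_dim=8, vocab=16)
k = np.ones((2, 8), dtype=np.float16)
v = np.ones((2, 8), dtype=np.float16)
fake_logits = np.zeros(16, dtype=np.float32)
fake_logits[7] = 1.0
tok = dec.step(k, v, fake_logits)
```

The payoff is that the steady‑state decode loop never touches the allocator, so per‑token latency stays flat instead of jittering with allocation and garbage‑collection pauses.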

Voice Pipeline Optimizations

  • Three concurrent threads with lock‑free ring buffers.
  • Double‑buffered TTS for continuous playback.
  • 38 macOS voice actions; local RAG retrieval in ~4 ms over 5,000+ chunks.
  • 20 hot‑swappable models.
  • Full‑screen TUI displaying per‑operation latency.
  • Automatic fallback to llama.cpp when MetalRT isn’t available.
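The audio path can be approximated with a single‑producer/single‑consumer ring buffer: because each index is written by exactly one thread, the queue needs no lock. This is a simplified Python sketch of the pattern under that SPSC assumption (RCLI's actual buffers are native code; the capacity and names here are illustrative):

```python
import threading

class SPSCRingBuffer:
    """Single-producer/single-consumer ring buffer. The producer thread
    only writes `head`, the consumer only writes `tail`, so no lock is
    needed (Python's int updates are effectively atomic under the GIL;
    a native implementation would use atomics with acquire/release)."""

    def __init__(self, capacity: int):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.head = 0  # written by producer only
        self.tail = 0  # written by consumer only

    def push(self, item) -> bool:
        nxt = (self.head + 1) % self.capacity
        if nxt == self.tail:          # full: drop frame or apply backpressure
            return False
        self.buf[self.head] = item
        self.head = nxt               # publish only after the write lands
        return True

    def pop(self):
        if self.tail == self.head:    # empty
            return None
        item = self.buf[self.tail]
        self.tail = (self.tail + 1) % self.capacity
        return item

# Audio-capture thread pushes frames; the STT thread pops them.
rb = SPSCRingBuffer(capacity=8)
frames = [f"frame-{i}" for i in range(5)]

producer = threading.Thread(target=lambda: [rb.push(f) for f in frames])
producer.start()
producer.join()

received = []
while (f := rb.pop()) is not None:
    received.append(f)
```

The same structure chains capture → STT → LLM → TTS without any stage ever blocking on a mutex, which is what keeps end‑to‑end latency predictable.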

Open‑Source Project

  • Repository: (MIT license)
  • Demo video:

Further Reading

  • LLM benchmarks:
  • Speech benchmarks:
  • Voice pipeline details:
  • RAG optimizations:

Discussion Prompt

What would you build if on‑device AI were genuinely as fast as cloud?
