Show HN: RunAnywhere – Faster AI Inference on Apple Silicon

Published: (March 10, 2026 at 01:14 PM EDT)
3 min read

Source: Hacker News

Introduction

Hi HN, we’re Sanchit and Shubham (YC W26). We built MetalRT, a fast inference engine for Apple Silicon. Across LLMs, speech‑to‑text, and text‑to‑speech, MetalRT beats llama.cpp, Apple’s MLX, Ollama, and sherpa‑onnx on every modality we tested. It uses custom Metal shaders and has no framework overhead.

We’ve also open‑sourced RCLI, the fastest end‑to‑end voice AI pipeline on Apple Silicon: mic to spoken response, entirely on‑device, with no cloud or API keys.

Getting Started

# Install via Homebrew
brew tap RunanywhereAI/rcli https://github.com/RunanywhereAI/RCLI.git
brew install rcli

# Set up models (≈1 GB download)
rcli setup

# Run interactive mode (push‑to‑talk)
rcli

Or install with a single script:

curl -fsSL https://raw.githubusercontent.com/RunanywhereAI/RCLI/main/install.sh | bash

Benchmarks

LLM Decoding

Model          Tokens/s   vs mlx‑lm             vs llama.cpp
Qwen3‑0.6B     658        1.19× faster (552)    1.67× faster (295)
Qwen3‑4B       186        1.09× faster (170)    2.14× faster (87)
LFM2.5‑1.2B    570        1.12× faster (509)    1.53× faster (372)

Time‑to‑first‑token: 6.6 ms

Speech‑to‑Text (STT)

  • 70 seconds of audio transcribed in 101 ms – 714× real‑time, 4.6× faster than mlx‑whisper.

Text‑to‑Speech (TTS)

  • 178 ms synthesis, 2.8× faster than mlx‑audio and sherpa‑onnx.

Motivation

Demoing on‑device AI is easy; shipping it is brutal. Voice is the hardest test because it chains STT → LLM → TTS sequentially, and any slow stage hurts the user experience. Most teams fall back to cloud APIs not because local models are bad, but because local inference infrastructure adds latency.

The core challenge is latency compounding: three models in series can easily exceed 600 ms, which feels broken. Every stage must be fast, run on a single device, and avoid network round‑trips.

Technical Approach

We went straight to Metal:

  • Custom GPU compute shaders for quantized matmul, attention, and activation, compiled ahead of time.
  • Zero allocations during inference – all memory is pre‑allocated at init.
  • A single unified engine (MetalRT) handles LLM, STT, and TTS natively on Apple Silicon, avoiding the graph schedulers, runtime dispatchers, and memory managers that other engines layer on top of the GPU.

MetalRT is the first engine to handle all three modalities natively on Apple Silicon.

Resources

  • LLM benchmarks:
  • Speech benchmarks:
  • Voice pipeline optimizations:
  • RAG optimizations:

Open‑Source Project

  • Repository: (MIT license)
  • Features:
    • Three concurrent threads with lock‑free ring buffers
    • Double‑buffered TTS
    • 38 macOS actions by voice
    • Local RAG (~4 ms over 5 K+ chunks)
    • 20 hot‑swappable models
    • Full‑screen TUI with per‑op latency readouts
    • Falls back to llama.cpp when MetalRT isn’t installed

Demo

Watch the demo video:

Discussion Prompt

What would you build if on‑device AI were genuinely as fast as cloud?

Comments

(86 points, 23 comments)

