TorchAO vs ONNX Runtime: 8-bit Quantization Benchmark
Source: Dev.to
I ran the same 8‑bit quantized Llama 3.2 1B model through TorchAO and ONNX Runtime, expecting ONNX to dominate like it usually does for mobile inference. TorchAO finished a 512‑token generation in 4.2 seconds, while ONNX Runtime took 6.8 seconds.
That’s a 38% reduction in latency (ONNX Runtime took about 62% longer) on identical hardware with the same quantization scheme. Here’s what actually happened when I tried to replicate the “ONNX is always faster” wisdom from half the blog posts out there.

The Setup Nobody Talks About: Why Quantization Method Matters More Than Framework
Most benchmarks compare frameworks but ignore that quantization calibration is where you win or lose. I used W8A8 (8‑bit weights, 8‑bit activations) on Llama 3.2 1B because it’s small enough to profile thoroughly but large enough to show real inference patterns.
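To make the calibration point concrete, here is a minimal, framework-neutral sketch of how an 8-bit scale and zero-point are derived from calibration statistics. The function name and the symmetric/asymmetric split are my illustration, not code from either framework:

```python
def calibrate_int8(values, symmetric=False):
    """Derive scale s and zero-point z for 8-bit quantization
    from the observed calibration range (min/max of values)."""
    lo, hi = min(values), max(values)
    if symmetric:
        # Weights are typically quantized symmetrically: z = 0,
        # with [-max|x|, +max|x|] mapped onto [-127, 127].
        s = max(abs(lo), abs(hi)) / 127.0
        return s, 0
    # Activations are often quantized asymmetrically: [lo, hi]
    # is mapped onto the unsigned range [0, 255].
    s = (hi - lo) / 255.0
    z = round(-lo / s)
    return s, z

# Example: activation values observed during a calibration pass
s, z = calibrate_int8([-0.5, 0.1, 1.5])
```

Two runs with different calibration data produce different (s, z) pairs for the same model, which is exactly why calibration quality can swamp any framework-level difference.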
Here’s the quantization formula both frameworks implement, where $s$ is the scale and $z$ is the zero-point:
$$ x_q = \text{round}\left(\frac{x}{s}\right) + z $$
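Applied end to end, the formula and its inverse look like the sketch below (plain Python, my own illustration). Clamping to the representable int8 range is the detail that introduces saturation error when the calibration range is too narrow:

```python
def quantize(x, s, z, qmin=-128, qmax=127):
    """x_q = round(x / s) + z, clamped to the signed int8 range."""
    return max(qmin, min(qmax, round(x / s) + z))

def dequantize(x_q, s, z):
    """Inverse mapping: x is reconstructed as s * (x_q - z)."""
    return s * (x_q - z)

# Round-trip: for values inside the representable range, the
# reconstruction error is bounded by half the scale, s / 2.
s, z = 0.05, 0
x = 1.234
x_q = quantize(x, s, z)      # round(24.68) = 25
x_hat = dequantize(x_q, s, z)  # 0.05 * 25 = 1.25
```

That s/2 error bound is why the choice of scale (i.e., calibration) dominates accuracy: a scale twice as large doubles the worst-case rounding error everywhere.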