How does Taalas “print” an LLM onto a chip?
Source: Hacker News
A startup called Taalas, recently released an ASIC chip running Llama 3.1 8B (3/6‑bit quant) at an inference rate of 17 000 tokens per second. They claim it is 10× cheaper in ownership cost than GPU‑based inference systems, consumes 10× less electricity, and is about 10× faster than state‑of‑the‑art inference.
I dug into their blog, LocalLLaMA discussions, and hardware concepts to understand how a model can be “printed” onto a chip. Below is a summary of what I learned.
Basics
Taalas is a 2.5‑year‑old company and this is its first chip. The chip is a fixed‑function ASIC (Application‑Specific Integrated Circuit) – think of it like a CD‑ROM or a printed book: it holds one model and cannot be rewritten.
How NVIDIA GPUs process LLMs (the memory bottleneck)
LLMs consist of sequential layers. For example, Llama 3.1 8B has 32 layers, each containing large weight matrices (the model’s knowledge).
- A prompt is tokenised and converted into an embedding vector.
- On a GPU, the vector enters the compute cores.
- The GPU fetches Layer 1 weights from VRAM/HBM, performs matrix multiplication, and writes the intermediate activations back to VRAM.
- It then fetches Layer 2 weights, repeats the multiplication, stores the result, and so on through all 32 layers to generate a single token.
- To generate the next token, the whole 32‑layer pass is repeated.
This constant shuttling of data between compute units and external memory creates a memory‑bandwidth bottleneck (often called the “Von Neumann bottleneck” or “memory wall”), adding latency and consuming significant energy.
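The per‑token loop described above can be sketched in toy Python (hypothetical names and tiny dimensions; real frameworks fuse, batch, and parallelize these steps, but the layer‑by‑layer weight fetch is the same):

```python
import numpy as np

N_LAYERS = 32   # Llama 3.1 8B depth
D_MODEL = 64    # toy hidden size (4096 in the real model)

rng = np.random.default_rng(0)
# "HBM": weights live in external memory, one matrix per layer
hbm_weights = [rng.standard_normal((D_MODEL, D_MODEL)) * 0.01
               for _ in range(N_LAYERS)]

def generate_token(activation: np.ndarray) -> np.ndarray:
    """One full forward pass = one token. Each layer's weights must be
    fetched from external memory before its matmul can run."""
    for layer in range(N_LAYERS):
        w = hbm_weights[layer]                # memory-bandwidth-bound fetch
        activation = np.tanh(activation @ w)  # compute (fast, often idle)
        # on a GPU, the intermediate activation round-trips through VRAM here
    return activation

x = rng.standard_normal(D_MODEL)
out = generate_token(x)  # the entire 32-layer loop repeats for every token
```

The point of the sketch: the matmuls are cheap relative to the `hbm_weights[layer]` fetches, so throughput is bounded by memory bandwidth, not by arithmetic.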
Breaking the wall
Taalas eliminates this bottleneck by engraving the 32 layers directly onto silicon. The model’s weights become physical transistors etched into the chip.
Image: The Taalas Way (illustration of weights hard‑wired on chip)
They also claim to have invented a hardware scheme that stores 4‑bit data and performs the associated multiplication with a single transistor – referred to here as their “magic multiplier” (see the EE Times article).
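The hardware details of that multiplier are not public. Purely to fix ideas, here is the arithmetic a 4‑bit weight multiply performs, in software (hypothetical helper names; the scale factor and rounding scheme are standard quantization practice, not Taalas specifics):

```python
def quantize_4bit(w: float, scale: float) -> int:
    """Map a real-valued weight to a signed 4-bit integer in [-8, 7]."""
    q = round(w / scale)
    return max(-8, min(7, q))

def dequant_mul(q: int, scale: float, x: float) -> float:
    """The multiply the hardware must implement:
    (4-bit weight * scale) * activation."""
    return (q * scale) * x

scale = 0.1
q = quantize_4bit(0.34, scale)      # 0.34 / 0.1 rounds to 3
result = dequant_mul(q, scale, 2.0) # 0.6 instead of the exact 0.68
```

In a GPU this is an integer multiply plus a rescale; the claim is that Taalas collapses the stored 4‑bit value and its multiplication into a single transistor‑level structure.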
When an input vector arrives:
- It flows into the transistors that implement Layer 1.
- Multiplication occurs via the magic multiplier.
- Instead of writing the result to external memory, the electrical signal streams directly to the transistors of Layer 2 (through pipeline registers).
- This pipeline continues through all layers until the final output token is produced.
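In software terms, the steps above replace a loop that round‑trips through shared memory with a pipeline in which each stage feeds the next directly. A minimal sketch (toy dimensions; closures stand in for weights baked into silicon):

```python
from functools import reduce
import numpy as np

N_LAYERS = 32
D = 16
rng = np.random.default_rng(1)

def make_layer(w: np.ndarray):
    """Weights are captured in the closure -- 'hard-wired', never fetched."""
    return lambda x: np.tanh(x @ w)

layers = [make_layer(rng.standard_normal((D, D)) * 0.1)
          for _ in range(N_LAYERS)]

def pipelined_forward(x: np.ndarray) -> np.ndarray:
    # each stage's output streams straight into the next stage;
    # nothing is written to external memory between layers
    return reduce(lambda acc, layer: layer(acc), layers, x)

y = pipelined_forward(rng.standard_normal(D))
```

A hardware pipeline has a further benefit the sketch cannot show: once stage 1 hands its result to stage 2, it can immediately start on the next token, so up to 32 tokens (one per stage) are in flight at once.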
On‑chip memory usage
The chip does not use external DRAM/HBM. It includes a modest amount of on‑chip SRAM for:
- KV cache – temporary storage for the context window of an ongoing conversation.
- LoRA adapters – lightweight fine‑tuning parameters.
SRAM is chosen because mixing DRAM and logic on the same die is costly and complex, and SRAM is not subject to the current DRAM supply‑chain constraints.
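The KV cache that this SRAM holds is the per‑conversation state of attention: the key and value vectors of every token seen so far, per layer. A minimal sketch of that data structure (toy dimensions, hypothetical class and method names):

```python
import numpy as np

class KVCache:
    """Per-conversation store: keys/values for every past token, per layer."""
    def __init__(self, n_layers: int, d_head: int):
        self.keys = [[] for _ in range(n_layers)]
        self.values = [[] for _ in range(n_layers)]
        self.d_head = d_head

    def append(self, layer: int, k: np.ndarray, v: np.ndarray) -> None:
        """Called once per generated token at each layer."""
        self.keys[layer].append(k)
        self.values[layer].append(v)

    def attend(self, layer: int, q: np.ndarray) -> np.ndarray:
        """Softmax attention of a new query over all cached positions."""
        K = np.stack(self.keys[layer])    # (seq_len, d_head)
        V = np.stack(self.values[layer])
        scores = K @ q / np.sqrt(self.d_head)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V

cache = KVCache(n_layers=32, d_head=8)
rng = np.random.default_rng(2)
cache.append(0, rng.standard_normal(8), rng.standard_normal(8))
ctx = cache.attend(0, rng.standard_normal(8))
```

Because the cache grows linearly with context length, the amount of on‑chip SRAM is what bounds the context window such a chip can serve.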
Cost of custom chips
Fabricating a dedicated chip for each model is expensive, but Taalas mitigates this by:
- Designing a base chip with a generic grid of logic gates and transistors.
- Customising only the top two mask layers to encode a specific model’s weights.
This yields a less optimized chip than a fully custom design, but it is far faster and cheaper to produce, since only the top mask layers change per model. The Llama 3.1 8B implementation took about two months to develop—a relatively quick turnaround in the custom‑chip world.
Outlook
For developers running local models on laptops without powerful GPUs, Taalas’s approach promises a path toward affordable, high‑performance inference hardware. If mass‑produced, such ASICs could dramatically lower the cost and energy footprint of running large language models.