How Does Taalas 'Print' an LLM onto a Chip?
Source: Hacker News
A startup called Taalas recently released an ASIC chip that runs Llama 3.1 8B (3/6‑bit quantization) at roughly 17,000 tokens per second, enough text to fill about 30 A4 pages every second. The company claims roughly 10× lower total cost of ownership than GPU‑based inference, 10× lower power consumption, and 10× higher speed than state‑of‑the‑art GPU inference.
The key idea is that the model’s weights are hard‑wired onto the silicon, eliminating the need to fetch large weight matrices from external memory for each token. Below is a breakdown of how this works compared with conventional GPU inference.
Basics
Taalas, a 2.5‑year‑old company, has built its first chip as a fixed‑function ASIC (Application‑Specific Integrated Circuit). Much like a CD‑ROM or a printed book, the chip holds a single model and cannot be re‑programmed for a different one.
How do NVIDIA GPUs process tokens? (Inefficiency 101)
LLMs consist of sequential layers; for example, Llama 3.1 8B has 32 layers. Each layer contains large weight matrices that encode the model’s knowledge.
- Prompt → embeddings – The user’s input is turned into a vector of numbers.
- Layer‑by‑layer computation – On a GPU, the input vector is sent to the compute cores. For each layer:
  - The layer’s weights are fetched from VRAM/HBM (the GPU’s external memory).
  - Matrix multiplication is performed.
  - The resulting activations are written back to VRAM.
- Token generation – This 32‑layer pass produces a single token; generating the next token repeats the entire sequence.
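The per‑token loop above can be sketched in plain Python. This is a toy model with made‑up sizes and random weights, not a real GPU kernel; `fetch_from_vram` is a hypothetical stand‑in for the memory round trip:

```python
import random

N_LAYERS = 32   # Llama 3.1 8B has 32 transformer layers
HIDDEN = 8      # toy hidden size; the real model uses 4096

random.seed(0)
# Stand-in for weight matrices living in external VRAM/HBM.
vram_weights = [
    [[random.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for _ in range(HIDDEN)]
    for _ in range(N_LAYERS)
]

def fetch_from_vram(layer_idx):
    """Models the costly round trip to external memory: on a real GPU,
    every layer's weights cross the memory bus for every single token."""
    return vram_weights[layer_idx]

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def generate_one_token(x):
    for layer_idx in range(N_LAYERS):
        w = fetch_from_vram(layer_idx)  # 1. fetch weights from VRAM/HBM
        x = matvec(w, x)                # 2. matrix multiplication
        # 3. activations are conceptually written back to VRAM here
    return x  # final activations -> logits -> one sampled token

out = generate_one_token([1.0] * HIDDEN)
print(len(out))
```

The point of the sketch is the inner loop: the compute itself is cheap compared with streaming every layer's weights in from external memory, once per generated token.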
Because the GPU constantly shuttles data between the compute cores and external memory, the memory bus becomes a latency and energy bottleneck, often referred to as the von Neumann bottleneck or the memory wall.
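A back‑of‑the‑envelope calculation shows why this bus dominates single‑stream decoding. The bandwidth figure below is an assumed H100‑class value, not a number from the article:

```python
params = 8e9                  # Llama 3.1 8B parameters
bytes_per_weight = 2          # fp16 (assumption)
weight_bytes = params * bytes_per_weight   # ~16 GB read per token
hbm_bandwidth = 3.35e12       # ~3.35 TB/s, H100-class HBM (assumption)

# At batch size 1, every decoded token must stream all weights through
# the memory bus, so bandwidth alone caps decoding throughput:
seconds_per_token = weight_bytes / hbm_bandwidth
max_tokens_per_second = 1 / seconds_per_token
print(round(max_tokens_per_second))  # ~209 tokens/s upper bound
```

Under these assumptions a single decode stream cannot exceed a few hundred tokens per second regardless of how fast the compute cores are, which is the gap a weights‑in‑silicon design aims to close.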
Breaking the wall!
Taalas eliminates the memory wall by engraving all 32 layers directly onto the chip: the model’s weights become physical transistors etched into the silicon.
The company also claims to have invented a hardware scheme that stores 4‑bit data and performs the associated multiplication using a single transistor (referred to here as the “magic multiplier”).
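How one transistor can both store a 4‑bit weight and perform the multiplication is not public, but its effect can be modeled in software: the weight becomes a constant baked into the function rather than a value read from memory. This is a conceptual sketch, not Taalas’s actual scheme:

```python
# 4-bit signed weights cover the integer range [-8, 7].
def make_hardwired_multiplier(w4):
    """Return a function with the 4-bit weight 'etched in' as a constant,
    the software analogue of a weight fixed in silicon."""
    assert -8 <= w4 <= 7, "must fit in 4 signed bits"
    return lambda x: w4 * x  # no memory lookup: the weight IS the circuit

# A hard-wired dot product: one fixed multiplier per weight.
weights = [3, -5, 7, -2]
multipliers = [make_hardwired_multiplier(w) for w in weights]

def hardwired_dot(xs):
    return sum(m(x) for m, x in zip(multipliers, xs))

print(hardwired_dot([1, 1, 1, 1]))  # 3 - 5 + 7 - 2 = 3
```

Because the weight never leaves the "circuit," there is nothing to fetch; the trade‑off is that changing a weight means fabricating a new chip.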
When a prompt arrives:
- It is converted into a vector.
- The vector flows through the transistors that implement Layer 1, where the magic multiplier performs the multiplication.
- Instead of writing the result to external RAM, the electrical signal streams directly to the transistors of Layer 2 (via pipeline registers).
- This pipeline continues through all layers until the final output token is produced.
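The steps above form a classic hardware pipeline: because each layer feeds the next directly through pipeline registers, a new token can enter Layer 1 while earlier tokens are still in later layers. A simplified simulation (four stages instead of 32, and a trivial `x + i` in place of each layer's real computation):

```python
N_LAYERS = 4  # small for illustration; the real chip pipelines all 32 layers

def layer_stage(i, x):
    """Fixed-function stage: the weights are part of the circuit, so the
    only state handed to the next stage is the activation itself."""
    return x + i  # stand-in for the layer's hard-wired computation

def pipeline_step(stages, new_input):
    """Advance every in-flight activation one layer and admit one new input.
    stages[i] holds the activation in the pipeline register after layer i."""
    finished = stages[-1]  # whatever leaves the last layer is a finished pass
    for i in range(N_LAYERS - 1, 0, -1):
        stages[i] = None if stages[i - 1] is None else layer_stage(i, stages[i - 1])
    stages[0] = None if new_input is None else layer_stage(0, new_input)
    return finished

stages = [None] * N_LAYERS
inputs = [10, 20, 30]
outputs = []
for t in range(len(inputs) + N_LAYERS):
    new = inputs[t] if t < len(inputs) else None  # drain after inputs run out
    done = pipeline_step(stages, new)
    if done is not None:
        outputs.append(done)
print(outputs)  # [16, 26, 36]: each input passed through all four stages
```

Note that three inputs are in flight at once: once the pipeline is full, a finished result emerges every step, which is where the throughput advantage comes from.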
So, they don’t use any RAM?
The chip does not rely on external DRAM/HBM. It uses a modest amount of on‑chip SRAM for:
- KV cache – temporary storage for the context window of an ongoing conversation.
- LoRA adapters – lightweight fine‑tuning parameters.
SRAM is chosen because mixing DRAM with logic gates is costly and complex, and SRAM is not subject to the current supply‑chain constraints affecting DRAM.
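On‑chip SRAM is scarce, which makes the KV‑cache footprint the binding constraint on context length. A rough estimate using Llama 3.1 8B's published architecture (32 layers, 8 KV heads of dimension 128 under grouped‑query attention); the byte width is an assumption, since the chip's actual KV format is not public:

```python
n_layers, n_kv_heads, head_dim = 32, 8, 128   # Llama 3.1 8B (GQA)
bytes_per_elem = 1                            # assume 8-bit KV quantization

# A K vector and a V vector are stored per layer for every context token.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
print(kv_bytes_per_token)   # 65536 bytes = 64 KiB per token

context = 8192
total_mib = kv_bytes_per_token * context / 2**20
print(total_mib)            # 512.0 MiB for an 8K-token context
```

Hundreds of MiB is an enormous SRAM budget, which suggests why KV cache (and aggressive KV quantization) is the main consumer of the chip's on‑die memory.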
Isn’t fabricating a custom chip for every model super expensive?
In principle, yes. However, Taalas designs a base chip with a massive, generic grid of logic gates. To map a specific model, they only need to customize the top two metal layers/masks, which is far cheaper and faster than designing a chip from scratch.
- Development time for the Llama 3.1 8B implementation: ≈ 2 months.
- In the fast‑moving AI landscape, this is relatively quick, though still slower than software‑only updates.
The approach promises a path toward mass‑produced, ultra‑fast inference hardware, which could be a game‑changer for users running local models without high‑end GPUs.
References
- Taalas blog
- EE Times article on Taalas’s “magic multiplier”