How I Built a Chrome Extension That Runs Llama, DeepSeek, and Mistral Locally Using WebGPU (No Ollama, No Server)
Source: Dev.to
Why this project?
So far we’ve only seen WebGPU‑based LLM demos as GitHub repos or standalone sites. This is the first Chrome extension that brings the same experience to users who prefer “install‑and‑go” in their browser—no dev setup, no API keys, no server.
My motivation
- Privacy concerns – Cloud AI services send every prompt to remote servers. I wanted something that stays on my machine.
- Complexity of “local AI” – Tools like Ollama are great, but they require terminal commands, model downloads, and often a permissive work laptop. I needed something my non‑technical friends (or even my mom) could use.
- Cost – $20 / month adds up for occasional use (grammar fixes, summarising docs, occasional coding help). I wanted a free, private alternative.
What I built
A Chrome extension that runs LLM inference inside the browser—no server, no Ollama, no Docker, no nonsense. Just install and chat.
Three main benefits
| Benefit | What it means |
|---|---|
| Privacy | Messages and model weights never leave the browser. |
| Cost | After a one‑time model download, inference is free (no API calls). |
| Offline | Once cached, the model works on a plane, in the subway, or anywhere without internet. |
Trade‑off: Only smaller, quantised models (e.g., Llama‑Quant, SmolLM, Phi, DeepSeek‑R1 distilled variants) are feasible. For everyday drafting, summarising, and coding assistance, they’re more than enough.
High‑level architecture
Next.js front‑end
│
└─ useChat (Vercel AI SDK)
│
└─ BrowserAIChatTransport ← custom transport
│
├─ selects provider & model (Zustand store)
├─ obtains language model (ModelManager)
├─ (optional) wraps reasoning models with extractReasoningMiddleware
└─ calls streamText → UIMessageStream
- UI: Same code works as a regular web app or as a Chrome side‑panel extension (static export).
- Backend: The “backend” is your GPU via WebGPU, not a Node server.
- Transport: useChat only cares about a transport that takes messages and returns a stream, so the UI is oblivious to the underlying provider.
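The transport seam can be sketched as follows. The interface and class names here are illustrative stand-ins, not the real AI SDK types — the actual BrowserAIChatTransport calls streamText against a WebGPU-backed model and returns a UIMessageStream:

```typescript
// Minimal sketch of the transport seam (names are illustrative, not the
// real AI SDK types): the chat UI only needs something that accepts
// messages and yields a stream of text chunks.
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

interface ChatTransportLike {
  sendMessages(messages: ChatMessage[]): Iterable<string>;
}

// A stand-in transport; the real one would run local inference instead.
class EchoTransport implements ChatTransportLike {
  *sendMessages(messages: ChatMessage[]): Iterable<string> {
    const last = messages[messages.length - 1];
    yield "echo: ";
    yield last.content;
  }
}

// The UI layer consumes the stream without knowing which provider
// (WebLLM, Transformers.js, Prompt API) sits behind the transport.
function collect(transport: ChatTransportLike, messages: ChatMessage[]): string {
  let out = "";
  for (const chunk of transport.sendMessages(messages)) out += chunk;
  return out;
}
```

Because the UI depends only on this narrow seam, swapping providers never touches the chat components.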
Provider support
| Provider | Description | Best use‑case |
|---|---|---|
| WebLLM (MLC) | WebGPU‑based, supports larger models (Llama 3.2, Qwen, DeepSeek R1). | Fast inference on a decent GPU. |
| Transformers.js | CPU via WASM, smaller footprint. | Light models like SmolLM. |
| Browser AI (Prompt API) | Chrome’s built‑in Gemini Nano. | No download required; works out‑of‑the‑box. |
All providers implement the same LanguageModelV3 interface. The ModelManager:
- Instantiates the correct adapter.
- Caches model instances (so switching tabs doesn’t re‑download).
- Emits progress callbacks for the UI.
Model IDs are stored in a single models module, filtered for low VRAM and tagged for “supports reasoning” or “supports vision.” This lets both the transport and UI know each model’s capabilities.
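The registry-plus-manager pattern might look like this sketch. All model IDs, VRAM numbers, and provider names below are hypothetical placeholders, not the extension's real catalogue:

```typescript
// Sketch of a capability-tagged model registry plus a caching manager
// (entries are hypothetical examples, not the real model list).
type Capability = "reasoning" | "vision";

interface ModelEntry {
  id: string;
  provider: "webllm" | "transformersjs" | "browser-ai";
  vramMB: number;
  capabilities: Capability[];
}

const MODELS: ModelEntry[] = [
  { id: "llama-3.2-1b-q4", provider: "webllm", vramMB: 1200, capabilities: [] },
  { id: "deepseek-r1-distill-1.5b", provider: "webllm", vramMB: 1800, capabilities: ["reasoning"] },
  { id: "smollm-360m", provider: "transformersjs", vramMB: 400, capabilities: [] },
];

// Both the transport and the UI query capabilities from this one place.
const supportsReasoning = (id: string) =>
  MODELS.some((m) => m.id === id && m.capabilities.includes("reasoning"));

const lowVram = (budgetMB: number) => MODELS.filter((m) => m.vramMB <= budgetMB);

// Caching manager: the same (provider, id) pair always yields the same
// instance, so switching tabs never re-instantiates or re-downloads.
class ModelManager {
  private cache = new Map<string, object>();
  instantiations = 0;

  getModel(provider: string, id: string): object {
    const key = `${provider}/${id}`;
    let model = this.cache.get(key);
    if (!model) {
      this.instantiations++;
      model = { provider, id }; // real code constructs the provider adapter here
      this.cache.set(key, model);
    }
    return model;
  }
}
```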
Model download & initialization
Model weights can run to gigabytes, and a blank screen is bad UX. I created a useModelInitialization hook that:
- Checks the cache (availability() === "available").
- If the model is missing, triggers a minimal streamText call to start the download.
- Pipes progress updates to the UI.
Progress can come from two sources:
- The model manager’s callback.
- streamText’s data-modelDownloadProgress events.
Both streams are merged into a single progress bar for a smooth experience.
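The merge itself can be sketched as a tiny monotonic reducer (my own simplification, not the extension's actual code): both sources feed the same setter, and taking the maximum keeps the bar from jumping backwards when the two sources disagree.

```typescript
// Merge multiple progress sources (0..1) into one monotonic value.
// Whichever source reports first advances the bar; stale or lagging
// reports from the other source are ignored.
function createProgressMerger(onChange: (p: number) => void) {
  let current = 0;
  return (reported: number) => {
    if (reported > current) {
      current = reported;
      onChange(current);
    }
  };
}
```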
Handling reasoning models
Models like DeepSeek R1 emit a `<think>…</think>` block before the final answer. I wanted to expose that “thought process” in the UI.
- The AI SDK’s extractReasoningMiddleware parses those tags.
- On the UI side, each message part is inspected:
  - If it’s a reasoning part → render a collapsible reasoning component.
  - Otherwise → render normal text.
Thus the same stream can produce two different displays.
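The per-part dispatch boils down to a switch on the part type. The part shapes below are simplified (real AI SDK message parts carry more fields), and the string output stands in for rendering a React component:

```typescript
// Sketch of the per-part dispatch: one stream carries both reasoning
// and text parts, and the renderer branches on the type tag.
type MessagePart =
  | { type: "reasoning"; text: string }
  | { type: "text"; text: string };

function renderPart(part: MessagePart): string {
  switch (part.type) {
    case "reasoning":
      return `[collapsible] ${part.text}`; // real code renders a collapsible component
    case "text":
      return part.text;
  }
}
```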
Code snippets
Transport implementation (simplified)
```ts
// BrowserAIChatTransport.ts
const baseModel = modelManager.getModel(provider, modelId);

const model = isReasoningModel(modelId)
  ? wrapLanguageModel({
      model: baseModel,
      middleware: extractReasoningMiddleware({
        tagName: "think",
        startWithReasoning: true,
      }),
    })
  : baseModel;

const result = streamText({
  model,
  messages: modelMessages,
  ...streamOptions, // only pass options the provider actually supports
});

return result.toUIMessageStream();
```
Options handling
One gotcha: not every model supports every option (e.g., topP, presencePenalty). The transport only forwards options that (a) the current provider supports and (b) are explicitly set. I learned that the hard way.
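A defensive filter for this might look like the sketch below. The per-provider capability lists are made up for illustration; only keys the provider supports and that the user actually set survive:

```typescript
// Sketch of defensive option forwarding (capability lists are
// illustrative, not the real provider matrices).
type StreamOptions = Partial<{ temperature: number; topP: number; presencePenalty: number }>;

const SUPPORTED: Record<string, (keyof StreamOptions)[]> = {
  webllm: ["temperature", "topP"],
  transformersjs: ["temperature"],
};

function pickSupportedOptions(provider: string, opts: StreamOptions): StreamOptions {
  const allowed = SUPPORTED[provider] ?? [];
  const out: StreamOptions = {};
  for (const key of allowed) {
    // Skip unset keys so provider defaults still apply.
    if (opts[key] !== undefined) out[key] = opts[key];
  }
  return out;
}
```

The result spreads safely into the streamText call, so an unsupported option can never reach a provider that would reject it.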
Takeaways
- WebGPU makes in‑browser LLM inference practical for modest‑size models.
- A single transport abstraction lets the same UI talk to multiple backends without code duplication.
- Progress handling is crucial for a good user experience when downloading large model files.
- Reasoning middleware provides a neat way to surface a model’s internal thought process.
- Chrome extensions can serve as a convenient distribution channel for private, offline AI tools.
What’s next?
- Add support for vision‑enabled models (e.g., OCR, image captioning).
- Experiment with quantisation tricks to squeeze larger models into lower‑VRAM devices.
- Polish the UI/UX for mobile Chrome (side‑panel vs. full‑screen).
Feel free to try the extension at noaibills.app and open an issue if you run into anything!
Overview
- Static export – Next.js builds with output: "export" and drops everything into extension/ui/. The side‑panel loads ui/index.html.
- CSP issues – Chrome extensions disallow inline scripts. A post‑build script extracts every inline `<script>` block from the HTML, saves the code as separate files, and rewrites the HTML to reference those files.
- WASM loading – transformers.js needs ONNX Runtime WASM files, which can’t be fetched from a CDN inside an extension. The build script copies them into extension/transformers/ and declares them in web_accessible_resources.
Result: one codebase, one build process.
- Development runs at localhost:3000.
- Production builds a Chrome extension.
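The inline-script extraction step can be sketched as a small transform. This is a regex-based simplification (it only handles attribute-less script tags; the real build script may use a proper HTML parser), and the file-naming scheme is my own:

```typescript
// Sketch of the post-build CSP fix: pull each inline <script> body out
// into its own file and rewrite the tag to reference it, since Chrome
// extensions disallow inline scripts.
function extractInlineScripts(html: string): { html: string; files: Record<string, string> } {
  const files: Record<string, string> = {};
  let n = 0;
  const rewritten = html.replace(
    /<script>([\s\S]*?)<\/script>/g, // simplification: attribute-less tags only
    (_match, body: string) => {
      const name = `inline-${n++}.js`;
      files[name] = body; // real script writes this next to index.html
      return `<script src="${name}"></script>`;
    }
  );
  return { html: rewritten, files };
}
```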
Persisting Conversations
I wanted chats to survive tab closes and browser restarts, so I used Dexie (a thin IndexedDB wrapper) with a simple schema:
| Field | Type |
|---|---|
| id | string (conversation id) |
| title | string |
| model | string |
| provider | string |
| createdAt | Date |
| messages | array of message objects |
When a user selects a conversation from history, the app rehydrates everything—including the model that was used—so the user can continue exactly where they left off.
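Logic-wise, the record and the rehydration step look like the sketch below. Dexie wires this to IndexedDB in the real app; here an in-memory Map stands in for the table, and the function names are hypothetical:

```typescript
// Sketch of the conversation record (matching the schema above) and the
// rehydration step; a Map stands in for the Dexie/IndexedDB table.
interface StoredMessage { role: "user" | "assistant"; content: string }

interface Conversation {
  id: string;
  title: string;
  model: string;
  provider: string;
  createdAt: Date;
  messages: StoredMessage[];
}

const table = new Map<string, Conversation>(); // stand-in for db.conversations

function rehydrate(id: string): { model: string; provider: string; messages: StoredMessage[] } | undefined {
  const conv = table.get(id);
  if (!conv) return undefined;
  // Restoring model + provider alongside the messages lets the user
  // continue with exactly the setup the conversation started with.
  return { model: conv.model, provider: conv.provider, messages: conv.messages };
}
```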
Migration from legacy storage
Older versions stored data in localStorage. On first load the app:
- Detects legacy data.
- Bulk‑inserts it into IndexedDB.
- Deletes the old localStorage entry.
No chat history is lost.
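The migration can be sketched as a one-time, idempotent pass. The storage key and record shape below are hypothetical, and a Map plus a callback stand in for localStorage and the Dexie bulk insert:

```typescript
// Sketch of the one-time legacy migration: read the old localStorage
// blob, bulk-insert it into the new store, then delete the legacy entry
// so the migration never runs twice.
interface LegacyChat { id: string; messages: string[] }

function migrateLegacy(
  legacy: Map<string, string>,              // stand-in for localStorage
  insertAll: (chats: LegacyChat[]) => void  // stand-in for a Dexie bulk insert
): boolean {
  const raw = legacy.get("chat-history");   // hypothetical legacy key
  if (!raw) return false;                   // nothing to migrate
  insertAll(JSON.parse(raw) as LegacyChat[]);
  legacy.delete("chat-history");            // idempotent: next load finds nothing
  return true;
}
```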
Architecture Highlights
- Single‑transport pattern – Adding a new provider is just “wire up the adapter, add the model IDs.” The UI stays untouched.
- Browser limitations – CSP, WASM loading, and storage quotas are all solvable with the right build scripts; just allocate time for them.
- Progress feedback – Users will wait for a 2 GB download if they see a progress bar. A blank screen leads to abandonment.
Use‑Case Positioning
- Local AI is sufficient for most everyday tasks (drafts, summaries, quick coding questions).
- It’s not a replacement for large cloud models like GPT‑4, but a 3 B‑parameter model running locally handles ~80 % of routine text work.
Target audience
- Organizations with strict data‑privacy policies that block cloud AI and cannot install desktop tools such as Ollama or LMStudio.
- Teams needing quick drafts, grammar checks, or basic reasoning without API costs or internet dependency.
For tasks requiring real‑time knowledge or deep reasoning, cloud models remain the better choice.
Takeaways for Builders
If you’re building something similar, the patterns below should generalize to any in‑browser runtime:
- Static export with post‑build fixes for CSP and WASM.
- Model manager + single‑transport abstraction for providers.
- IndexedDB (via Dexie) for persistent conversation storage and migration from legacy formats.
Give it a try: noaibills.app
Feel free to reach out with questions or feedback—I’d love to hear from you!