How I Built a Chrome Extension That Runs Llama, DeepSeek, and Mistral Locally Using WebGPU (No Ollama, No Server)
Source: Dev.to
Why this project?
So far we’ve only seen WebGPU‑based LLM demos as GitHub repos or standalone sites. This is the first Chrome extension that brings the same experience to users who prefer “install‑and‑go” in their browser—no dev setup, no API keys, no server.
My motivation
- Privacy concerns – Cloud AI services send every prompt to remote servers. I wanted something that stays on my machine.
- Complexity of “local AI” – Tools like Ollama are great, but they require terminal commands, model downloads, and often a permissive work laptop. I needed something my non‑technical friends (or even my mom) could use.
- Cost – $20 / month adds up for occasional use (grammar fixes, summarising docs, occasional coding help). I wanted a free, private alternative.
What I built
A Chrome extension that runs LLM inference inside the browser—no server, no Ollama, no Docker, no nonsense. Just install and chat.
Three main benefits
| Benefit | What it means |
|---|---|
| Privacy | Messages and model weights never leave the browser. |
| Cost | After a one‑time model download, inference is free (no API calls). |
| Offline | Once cached, the model works on a plane, in the subway, or anywhere without internet. |
Trade‑off: Only smaller, quantised models (e.g., Llama‑Quant, SmolLM, Phi, DeepSeek‑R1 distilled variants) are feasible. For everyday drafting, summarising, and coding assistance, they’re more than enough.
High‑level architecture
Next.js front‑end
│
└─ useChat (Vercel AI SDK)
│
└─ BrowserAIChatTransport ← custom transport
│
├─ selects provider & model (Zustand store)
├─ obtains language model (ModelManager)
├─ (optional) wraps reasoning models with extractReasoningMiddleware
└─ calls streamText → UIMessageStream
- UI: Same code works as a regular web app or as a Chrome side‑panel extension (static export).
- Backend: The “backend” is your GPU via WebGPU, not a Node server.
- Transport: useChat only cares about a transport that takes messages and returns a stream, so the UI is oblivious to the underlying provider.
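The transport seam can be sketched as follows. The interface and class names here are illustrative stand-ins, not the real AI SDK types — the actual BrowserAIChatTransport calls streamText against a WebGPU-backed model and returns a UIMessageStream:

```typescript
// Minimal sketch of the transport seam (names are illustrative, not the
// real AI SDK types): the chat UI only needs something that accepts
// messages and yields a stream of text chunks.
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

interface ChatTransportLike {
  sendMessages(messages: ChatMessage[]): Iterable<string>;
}

// A stand-in transport; the real one would run local inference instead.
class EchoTransport implements ChatTransportLike {
  *sendMessages(messages: ChatMessage[]): Iterable<string> {
    const last = messages[messages.length - 1];
    yield "echo: ";
    yield last.content;
  }
}

// The UI layer consumes the stream without knowing which provider
// (WebLLM, Transformers.js, Prompt API) sits behind the transport.
function collect(transport: ChatTransportLike, messages: ChatMessage[]): string {
  let out = "";
  for (const chunk of transport.sendMessages(messages)) out += chunk;
  return out;
}
```

Because the UI depends only on this narrow seam, swapping providers never touches the chat components.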
Provider support
| Provider | Description | Best use‑case |
|---|---|---|
| WebLLM (MLC) | WebGPU‑based, supports larger models (Llama 3.2, Qwen, DeepSeek R1). | Fast inference on a decent GPU. |
| Transformers.js | CPU via WASM, smaller footprint. | Light models like SmolLM. |
| Browser AI (Prompt API) | Chrome’s built‑in Gemini Nano. | No download required; works out‑of‑the‑box. |
All providers implement the same LanguageModelV3 interface. The ModelManager:
- Instantiates the correct adapter.
- Caches model instances (so switching tabs doesn’t re‑download).
- Emits progress callbacks for the UI.
Model IDs are stored in a single models module, filtered for low VRAM and tagged for “supports reasoning” or “supports vision.” This lets both the transport and UI know each model’s capabilities.
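The registry-plus-manager pattern might look like this sketch. All model IDs, VRAM numbers, and provider names below are hypothetical placeholders, not the extension's real catalogue:

```typescript
// Sketch of a capability-tagged model registry plus a caching manager
// (entries are hypothetical examples, not the real model list).
type Capability = "reasoning" | "vision";

interface ModelEntry {
  id: string;
  provider: "webllm" | "transformersjs" | "browser-ai";
  vramMB: number;
  capabilities: Capability[];
}

const MODELS: ModelEntry[] = [
  { id: "llama-3.2-1b-q4", provider: "webllm", vramMB: 1200, capabilities: [] },
  { id: "deepseek-r1-distill-1.5b", provider: "webllm", vramMB: 1800, capabilities: ["reasoning"] },
  { id: "smollm-360m", provider: "transformersjs", vramMB: 400, capabilities: [] },
];

// Both the transport and the UI query capabilities from this one place.
const supportsReasoning = (id: string) =>
  MODELS.some((m) => m.id === id && m.capabilities.includes("reasoning"));

const lowVram = (budgetMB: number) => MODELS.filter((m) => m.vramMB <= budgetMB);

// Caching manager: the same (provider, id) pair always yields the same
// instance, so switching tabs never re-instantiates or re-downloads.
class ModelManager {
  private cache = new Map<string, object>();
  instantiations = 0;

  getModel(provider: string, id: string): object {
    const key = `${provider}/${id}`;
    let model = this.cache.get(key);
    if (!model) {
      this.instantiations++;
      model = { provider, id }; // real code constructs the provider adapter here
      this.cache.set(key, model);
    }
    return model;
  }
}
```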
Model download & initialization
Model weights can run to gigabytes, and a blank screen is bad UX. I created a useModelInitialization hook that:
- Checks the cache (availability() === "available").
- If the model is missing, triggers a minimal streamText call to start the download.
- Pipes progress updates to the UI.
Progress can come from two sources:
- The model manager’s callback.
- streamText’s data-modelDownloadProgress events.
Both streams are merged into a single progress bar for a smooth experience.
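The merge itself can be sketched as a tiny monotonic reducer (my own simplification, not the extension's actual code): both sources feed the same setter, and taking the maximum keeps the bar from jumping backwards when the two sources disagree.

```typescript
// Merge multiple progress sources (0..1) into one monotonic value.
// Whichever source reports first advances the bar; stale or lagging
// reports from the other source are ignored.
function createProgressMerger(onChange: (p: number) => void) {
  let current = 0;
  return (reported: number) => {
    if (reported > current) {
      current = reported;
      onChange(current);
    }
  };
}
```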
Handling reasoning models
Models like DeepSeek R1 emit a `<think>…</think>` block before the final answer. I wanted to expose that “thought process” in the UI.
- The AI SDK’s extractReasoningMiddleware parses those tags.
- On the UI side, each message part is inspected:
  - If it’s a reasoning part → render a collapsible reasoning component.
  - Otherwise → render normal text.
Thus the same stream can produce two different displays.
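The per-part dispatch boils down to a switch on the part type. The part shapes below are simplified (real AI SDK message parts carry more fields), and the string output stands in for rendering a React component:

```typescript
// Sketch of the per-part dispatch: one stream carries both reasoning
// and text parts, and the renderer branches on the type tag.
type MessagePart =
  | { type: "reasoning"; text: string }
  | { type: "text"; text: string };

function renderPart(part: MessagePart): string {
  switch (part.type) {
    case "reasoning":
      return `[collapsible] ${part.text}`; // real code renders a collapsible component
    case "text":
      return part.text;
  }
}
```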
Code snippets
Transport implementation (simplified)
```ts
// BrowserAIChatTransport.ts
const baseModel = modelManager.getModel(provider, modelId);

const model = isReasoningModel(modelId)
  ? wrapLanguageModel({
      model: baseModel,
      middleware: extractReasoningMiddleware({
        tagName: "think",
        startWithReasoning: true,
      }),
    })
  : baseModel;

const result = streamText({
  model,
  messages: modelMessages,
  ...streamOptions, // only pass options the provider actually supports
});

return result.toUIMessageStream();
```
Options handling
One gotcha: not every model supports every option (e.g., topP, presencePenalty). The transport only forwards options that (a) the current provider supports and (b) are explicitly set. I learned that the hard way.
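A defensive filter for this might look like the sketch below. The per-provider capability lists are made up for illustration; only keys the provider supports and that the user actually set survive:

```typescript
// Sketch of defensive option forwarding (capability lists are
// illustrative, not the real provider matrices).
type StreamOptions = Partial<{ temperature: number; topP: number; presencePenalty: number }>;

const SUPPORTED: Record<string, (keyof StreamOptions)[]> = {
  webllm: ["temperature", "topP"],
  transformersjs: ["temperature"],
};

function pickSupportedOptions(provider: string, opts: StreamOptions): StreamOptions {
  const allowed = SUPPORTED[provider] ?? [];
  const out: StreamOptions = {};
  for (const key of allowed) {
    // Skip unset keys so provider defaults still apply.
    if (opts[key] !== undefined) out[key] = opts[key];
  }
  return out;
}
```

The result spreads safely into the streamText call, so an unsupported option can never reach a provider that would reject it.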
Takeaways
- WebGPU makes in‑browser LLM inference practical for modest‑size models.
- A single transport abstraction lets the same UI talk to multiple backends without code duplication.
- Progress handling is crucial for a good user experience when downloading large model files.
- Reasoning middleware provides a neat way to surface a model’s internal thought process.
- Chrome extensions can serve as a convenient distribution channel for private, offline AI tools.
What’s next?
- Add support for vision‑enabled models (e.g., OCR, image captioning).
- Experiment with quantisation tricks to squeeze larger models into lower‑VRAM devices.
- Polish the UI/UX for mobile Chrome (side‑panel vs. full‑screen).
Feel free to try the extension at noaibills.app and open an issue if you run into anything!
Overview
- Static export – Next.js builds with output: "export" and drops everything into extension/ui/. The side‑panel loads ui/index.html.
- CSP issues – Chrome extensions disallow inline scripts. A post‑build script extracts every inline `<script>` block from the HTML, saves the code as separate files, and rewrites the HTML to reference those files.
- WASM loading – transformers.js needs ONNX Runtime WASM files, which can’t be fetched from a CDN inside an extension. The build script copies them into extension/transformers/ and declares them in web_accessible_resources.
Result: one codebase, one build process.
- Development runs at localhost:3000.
- Production builds a Chrome extension.
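The inline-script extraction step can be sketched as a small transform. This is a regex-based simplification (it only handles attribute-less script tags; the real build script may use a proper HTML parser), and the file-naming scheme is my own:

```typescript
// Sketch of the post-build CSP fix: pull each inline <script> body out
// into its own file and rewrite the tag to reference it, since Chrome
// extensions disallow inline scripts.
function extractInlineScripts(html: string): { html: string; files: Record<string, string> } {
  const files: Record<string, string> = {};
  let n = 0;
  const rewritten = html.replace(
    /<script>([\s\S]*?)<\/script>/g, // simplification: attribute-less tags only
    (_match, body: string) => {
      const name = `inline-${n++}.js`;
      files[name] = body; // real script writes this next to index.html
      return `<script src="${name}"></script>`;
    }
  );
  return { html: rewritten, files };
}
```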
Persisting Conversations
I wanted chats to survive tab closes and browser restarts, so I used Dexie (a thin IndexedDB wrapper) with a simple schema:
| Field | Type |
|---|---|
| id | string (conversation id) |
| title | string |
| model | string |
| provider | string |
| createdAt | Date |
| messages | array of message objects |
When a user selects a conversation from history, the app rehydrates everything—including the model that was used—so the user can continue exactly where they left off.
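Logic-wise, the record and the rehydration step look like the sketch below. Dexie wires this to IndexedDB in the real app; here an in-memory Map stands in for the table, and the function names are hypothetical:

```typescript
// Sketch of the conversation record (matching the schema above) and the
// rehydration step; a Map stands in for the Dexie/IndexedDB table.
interface StoredMessage { role: "user" | "assistant"; content: string }

interface Conversation {
  id: string;
  title: string;
  model: string;
  provider: string;
  createdAt: Date;
  messages: StoredMessage[];
}

const table = new Map<string, Conversation>(); // stand-in for db.conversations

function rehydrate(id: string): { model: string; provider: string; messages: StoredMessage[] } | undefined {
  const conv = table.get(id);
  if (!conv) return undefined;
  // Restoring model + provider alongside the messages lets the user
  // continue with exactly the setup the conversation started with.
  return { model: conv.model, provider: conv.provider, messages: conv.messages };
}
```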
Migration from legacy storage
Older versions stored data in localStorage. On first load the app:
- Detects legacy data.
- Bulk‑inserts it into IndexedDB.
- Deletes the old localStorage entry.
No chat history is lost.
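The migration can be sketched as a one-time, idempotent pass. The storage key and record shape below are hypothetical, and a Map plus a callback stand in for localStorage and the Dexie bulk insert:

```typescript
// Sketch of the one-time legacy migration: read the old localStorage
// blob, bulk-insert it into the new store, then delete the legacy entry
// so the migration never runs twice.
interface LegacyChat { id: string; messages: string[] }

function migrateLegacy(
  legacy: Map<string, string>,              // stand-in for localStorage
  insertAll: (chats: LegacyChat[]) => void  // stand-in for a Dexie bulk insert
): boolean {
  const raw = legacy.get("chat-history");   // hypothetical legacy key
  if (!raw) return false;                   // nothing to migrate
  insertAll(JSON.parse(raw) as LegacyChat[]);
  legacy.delete("chat-history");            // idempotent: next load finds nothing
  return true;
}
```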
Architecture Highlights
- Single‑transport pattern – Adding a new provider is just “wire up the adapter, add the model IDs.” The UI stays untouched.
- Browser limitations – CSP, WASM loading, and storage quotas are all solvable with the right build scripts; just allocate time for them.
- Progress feedback – Users will wait for a 2 GB download if they see a progress bar. A blank screen leads to abandonment.
Use‑Case Positioning
- Local AI is sufficient for most everyday tasks (drafts, summaries, quick coding questions).
- It’s not a replacement for large cloud models like GPT‑4, but a 3 B‑parameter model running locally handles ~80 % of routine text work.
Target audience
- Organizations with strict data‑privacy policies that block cloud AI and cannot install desktop tools such as Ollama or LMStudio.
- Teams needing quick drafts, grammar checks, or basic reasoning without API costs or internet dependency.
For tasks requiring real‑time knowledge or deep reasoning, cloud models remain the better choice.
Takeaways for Builders
If you’re building something similar, the patterns below should generalize to any in‑browser runtime:
- Static export with post‑build fixes for CSP and WASM.
- Model manager + single‑transport abstraction for providers.
- IndexedDB (via Dexie) for persistent conversation storage and migration from legacy formats.
Give it a try: noaibills.app
Feel free to reach out with questions or feedback—I’d love to hear from you!