Transformers.js v4 Preview: Now Available on NPM!
Source: Hugging Face Blog
Table of Contents
- Performance & Runtime Improvements
- Repository Restructuring
- PNPM Workspaces
- Modular Class Structure
- Examples Repository
- Prettier
- Formatting and Consistency
- New Models and Architectures
- New Build System
- Standalone Tokenizers.js Library
- Miscellaneous Improvements
- Acknowledgements
We’re excited to announce that Transformers.js v4 (preview) is now available on npm! After nearly a year of development (we started in March 2025 🤯), we’re finally ready for you to test it out. Previously, users had to install v4 directly from source via GitHub; now it’s as simple as running a single command:
npm i @huggingface/transformers@next
We’ll continue publishing v4 releases under the next tag on npm until the full release, so expect regular updates!
Performance & Runtime Improvements
The biggest change is the adoption of a new WebGPU Runtime, completely rewritten in C++. We worked closely with the ONNX Runtime team to test this runtime across our ~200 supported model architectures, as well as many new v4‑exclusive architectures.
In addition to better operator support (for performance, accuracy, and coverage), the new WebGPU runtime lets the same transformers.js code run in a wide variety of JavaScript environments—including browsers, server‑side runtimes, and desktop applications. That means you can now run WebGPU‑accelerated models directly in Node, Bun, and Deno!
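Because the same code now runs across browsers and server-side runtimes, a common first step is checking whether WebGPU is actually available before choosing a device. Here is a minimal, runtime-agnostic sketch; `hasWebGPU` is a hypothetical helper, and Transformers.js performs its own detection internally:

```javascript
// Hypothetical helper: detect WebGPU availability across browsers,
// Node, Bun, and Deno. (Transformers.js does its own detection internally.)
function hasWebGPU() {
  // Browsers and Deno expose WebGPU on `navigator.gpu`; recent Node
  // versions also define `globalThis.navigator`.
  const nav = globalThis.navigator;
  return Boolean(nav && "gpu" in nav);
}

// Pick a device string accordingly, e.g. for a pipeline's `device` option.
const device = hasWebGPU() ? "webgpu" : "wasm";
console.log(device);
```

A check like this is useful for graceful fallback: the same application can request WebGPU where it exists and fall back to WASM elsewhere.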

We’ve proven that it’s possible to run state‑of‑the‑art AI models 100% locally in the browser. Now we’re focused on performance: making these models run as fast as possible, even in resource‑constrained environments. This required rethinking our export strategy, especially for large language models: new models are re‑implemented operation‑by‑operation, leveraging specialized ONNX Runtime Contrib Operators such as:
- com.microsoft.GroupQueryAttention
- com.microsoft.MatMulNBits
- com.microsoft.QMoE
These operators maximize performance. For example, by adopting the com.microsoft.MultiHeadAttention operator we achieved ~4× speed‑up for BERT‑based embedding models.

This update also enables full offline support by caching WASM files locally in the browser, allowing users to run Transformers.js applications without an internet connection after the initial download.
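Conceptually, the offline support follows a cache‑then‑network pattern: each WASM file is fetched once, stored, and served locally thereafter. Below is a minimal sketch of that pattern, not the library's actual implementation; the real version uses the browser's Cache API, while this one substitutes an in‑memory Map so it runs in any runtime:

```javascript
// Sketch of the cache-then-network pattern used for WASM files.
// Assumption: the real library uses the browser Cache API; a Map stands in here.
const wasmCache = new Map();

async function fetchWithCache(url, fetchImpl = fetch) {
  if (wasmCache.has(url)) {
    // Served locally: works offline after the initial download.
    return wasmCache.get(url);
  }
  const buffer = await fetchImpl(url).then((res) => res.arrayBuffer());
  wasmCache.set(url, buffer);
  return buffer;
}
```

After the first successful fetch, every subsequent request is satisfied from the cache, which is what makes fully offline operation possible.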
Repository Restructuring
Developing a new major version gave us the opportunity to invest in the codebase and tackle long‑overdue refactoring efforts.
PNPM Workspaces
Until now, the GitHub repository served as our npm package. That worked while the repository exposed a single library, but we needed a more flexible structure for future sub‑packages that depend heavily on the Transformers.js core (e.g., library‑specific implementations or smaller utilities).
We therefore converted the repository to a monorepo using pnpm workspaces. This allows us to ship smaller packages that depend on @huggingface/transformers without the overhead of maintaining separate repositories.
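For reference, a pnpm monorepo of this shape is declared with a single workspace file at the repository root; the package paths below are illustrative, not the repository's actual layout:

```yaml
# pnpm-workspace.yaml (illustrative layout)
packages:
  - "packages/*" # e.g. the core library and future sub-packages
```

Sub‑packages inside the workspace can then depend on the core via pnpm's workspace protocol (`"@huggingface/transformers": "workspace:*"`), so they always build against the local copy rather than a published version.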
Modular Class Structure
Another major refactor targeted the ever‑growing models.js file. In v3, all available models were defined in a single file spanning over 8,000 lines, making maintenance difficult. For v4 we split this into smaller, focused modules with a clear distinction between:
- Utility functions
- Core logic
- Model‑specific implementations
The new structure improves readability and makes it much easier to add new models. Developers can now focus on model‑specific logic without navigating through thousands of unrelated lines of code.
Examples Repository
In v3, many Transformers.js example projects lived directly in the main repository. For v4 we’ve moved them to a dedicated examples repository. This keeps the core library clean and makes it easier for users to find and contribute examples without sifting through the main codebase.
Prettier
We updated the Prettier configuration and reformatted all files to follow a consistent style.
Formatting and Consistency
All files in the repository now use a single, shared Prettier configuration. This ensures consistent formatting throughout the codebase, with all future PRs automatically following the same style. No more debates about formatting—Prettier handles it all, keeping the code clean and readable for everyone.
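A shared Prettier setup like this typically amounts to a single config file at the repository root plus a format script run in CI; the options below are illustrative, not the project's actual settings:

```yaml
# .prettierrc (illustrative options, not the project's actual settings)
printWidth: 100
singleQuote: false
trailingComma: "all"
```

With the config committed, `prettier --check .` in CI is enough to guarantee every PR matches the shared style.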
New Models and Architectures
Thanks to our new export strategy and ONNX Runtime’s expanding support for custom operators, we’ve added many new models and architectures to Transformers.js v4. These include popular models such as:
- GPT‑OSS
- Chatterbox
- GraniteMoeHybrid
- LFM2‑MoE
- HunYuanDenseV1
- Apertus
- Olmo3
- FalconH1
- Youtu‑LLM
Many of these required us to implement support for advanced architectural patterns, including:
- Mamba (state‑space models)
- Multi‑head Latent Attention (MLA)
- Mixture of Experts (MoE)
All of these models are compatible with WebGPU, allowing users to run them directly in the browser or server‑side JavaScript environments with hardware acceleration.
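As a usage sketch, running one of these models looks the same as any other Transformers.js pipeline. The model id below is hypothetical, and the `device`/`dtype` choices depend on your hardware:

```javascript
// Hypothetical usage sketch: the model id is illustrative, and running it
// requires `@huggingface/transformers@next` to be installed.
async function generate(prompt) {
  const { pipeline } = await import("@huggingface/transformers");
  const generator = await pipeline(
    "text-generation",
    "onnx-community/example-model", // illustrative model id
    {
      device: "webgpu", // hardware acceleration in browser or Node/Bun/Deno
      dtype: "q4f16", // quantized weights with fp16 activations
    },
  );
  return generator(prompt, { max_new_tokens: 64 });
}
```

The same call works unchanged in the browser and in server‑side runtimes; only the `device` choice (and available memory) differs.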
New Build System
We’ve migrated our build system from Webpack to esbuild, and the results have been incredible:
- Build time: reduced from ~2 seconds to ~200 ms (≈10× faster)
- Bundle size: average reduction of ~10 % across all builds
- transformers.web.js: now 53 % smaller, leading to faster downloads and quicker startup times for users
Standalone Tokenizers.js Library
A frequent request from users was to extract the tokenization logic into a separate library. With v4, that’s exactly what we’ve done.
@huggingface/tokenizers is a complete refactor of the tokenization logic, designed to work seamlessly across browsers and server‑side runtimes. At just 8.8 kB (gzipped) with zero dependencies, it’s incredibly lightweight while remaining fully type‑safe.
Example
```javascript
import { Tokenizer } from "@huggingface/tokenizers";

// Load from Hugging Face Hub
const modelId = "HuggingFaceTB/SmolLM3-3B";
const tokenizerJson = await fetch(
  `https://huggingface.co/${modelId}/resolve/main/tokenizer.json`,
).then((res) => res.json());
const tokenizerConfig = await fetch(
  `https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`,
).then((res) => res.json());

// Create tokenizer
const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);

// Tokenize text
const tokens = tokenizer.tokenize("Hello World");
// ['Hello', 'ĠWorld']

const encoded = tokenizer.encode("Hello World");
// { ids: [9906, 4435], tokens: ['Hello', 'ĠWorld'], ... }
```
This separation keeps the core of Transformers.js focused and lean while offering a versatile, standalone tool that any WebML project can use independently.
Miscellaneous Improvements
We’ve made several quality‑of‑life improvements across the library:
- Dynamic pipeline types that adapt based on inputs, providing better developer experience and type safety.
- Enhanced logging for more control and clearer feedback during model execution.
- Support for larger models exceeding 8B parameters. In our tests, we ran GPT‑OSS 20B (q4f16) at ~60 tokens per second on an M4 Pro Max.
Acknowledgements
We want to extend our heartfelt thanks to everyone who contributed to this major release, especially:
- The ONNX Runtime team for their incredible work on the new WebGPU runtime and their support throughout development.
- All external contributors and early testers.