Transformers.js v4 Preview: Now Available on NPM!
Source: Hugging Face Blog
Table of Contents
- Performance & Runtime Improvements
- Repository Restructuring
- PNPM Workspaces
- Modular Class Structure
- Examples Repository
- Prettier
- Formatting and Consistency
- New Models and Architectures
- New Build System
- Standalone Tokenizers.js Library
- Miscellaneous Improvements
- Acknowledgements
We’re excited to announce that Transformers.js v4 (preview) is now available on npm! After nearly a year of development (we started in March 2025 🤯), we’re finally ready for you to test it out. Previously, users had to install v4 directly from source via GitHub; now it’s as simple as running a single command:
npm i @huggingface/transformers@next
We’ll continue publishing v4 releases under the next tag on npm until the full release, so expect regular updates!
Performance & Runtime Improvements
The biggest change is the adoption of a new WebGPU Runtime, completely rewritten in C++. We worked closely with the ONNX Runtime team to test this runtime across our ~200 supported model architectures, as well as many new v4‑exclusive architectures.
In addition to better operator support (for performance, accuracy, and coverage), the new WebGPU runtime lets the same transformers.js code run in a wide variety of JavaScript environments—including browsers, server‑side runtimes, and desktop applications. That means you can now run WebGPU‑accelerated models directly in Node, Bun, and Deno!
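Because the same code now runs across browsers and server-side runtimes, a common first step is checking whether WebGPU is actually available before choosing a device. Here is a minimal, runtime-agnostic sketch; `hasWebGPU` is a hypothetical helper, and Transformers.js performs its own detection internally:

```javascript
// Hypothetical helper: detect WebGPU availability across browsers,
// Node, Bun, and Deno. (Transformers.js does its own detection internally.)
function hasWebGPU() {
  // Browsers and Deno expose WebGPU on `navigator.gpu`; recent Node
  // versions also define `globalThis.navigator`.
  const nav = globalThis.navigator;
  return Boolean(nav && "gpu" in nav);
}

// Pick a device string accordingly, e.g. for a pipeline's `device` option.
const device = hasWebGPU() ? "webgpu" : "wasm";
console.log(device);
```

A check like this is useful for graceful fallback: the same application can request WebGPU where it exists and fall back to WASM elsewhere.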

We’ve proven that it’s possible to run state‑of‑the‑art AI models 100% locally in the browser. Now we’re focused on performance: making these models run as fast as possible, even in resource‑constrained environments. This required rethinking our export strategy, especially for large language models: new models are re‑implemented operation‑by‑operation, leveraging specialized ONNX Runtime Contrib Operators such as:
- com.microsoft.GroupQueryAttention
- com.microsoft.MatMulNBits
- com.microsoft.QMoE
These operators maximize performance. For example, by adopting the com.microsoft.MultiHeadAttention operator we achieved ~4× speed‑up for BERT‑based embedding models.

This update also enables full offline support by caching WASM files locally in the browser, allowing users to run Transformers.js applications without an internet connection after the initial download.
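Conceptually, the offline support follows a cache‑then‑network pattern: each WASM file is fetched once, stored, and served locally thereafter. Below is a minimal sketch of that pattern, not the library's actual implementation; the real version uses the browser's Cache API, while this one substitutes an in‑memory Map so it runs in any runtime:

```javascript
// Sketch of the cache-then-network pattern used for WASM files.
// Assumption: the real library uses the browser Cache API; a Map stands in here.
const wasmCache = new Map();

async function fetchWithCache(url, fetchImpl = fetch) {
  if (wasmCache.has(url)) {
    // Served locally: works offline after the initial download.
    return wasmCache.get(url);
  }
  const buffer = await fetchImpl(url).then((res) => res.arrayBuffer());
  wasmCache.set(url, buffer);
  return buffer;
}
```

After the first successful fetch, every subsequent request is satisfied from the cache, which is what makes fully offline operation possible.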
Repository Restructuring
Developing a new major version gave us the opportunity to invest in the codebase and tackle long‑overdue refactoring efforts.
PNPM Workspaces
Until now, the GitHub repository served as our npm package. That worked while the repository exposed a single library, but we needed a more flexible structure for future sub‑packages that depend heavily on the Transformers.js core (e.g., library‑specific implementations or smaller utilities).
We therefore converted the repository to a monorepo using pnpm workspaces. This allows us to ship smaller packages that depend on @huggingface/transformers without the overhead of maintaining separate repositories.
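For reference, a pnpm monorepo of this shape is declared with a single workspace file at the repository root; the package paths below are illustrative, not the repository's actual layout:

```yaml
# pnpm-workspace.yaml (illustrative layout)
packages:
  - "packages/*" # e.g. the core library and future sub-packages
```

Sub‑packages inside the workspace can then depend on the core via pnpm's workspace protocol (`"@huggingface/transformers": "workspace:*"`), so they always build against the local copy rather than a published version.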
Modular Class Structure
Another major refactor targeted the ever‑growing models.js file. In v3, all available models were defined in a single file spanning over 8,000 lines, making maintenance difficult. For v4 we split this into smaller, focused modules with a clear distinction between:
- Utility functions
- Core logic
- Model‑specific implementations
The new structure improves readability and makes it much easier to add new models. Developers can now focus on model‑specific logic without navigating through thousands of unrelated lines of code.
Examples Repository
In v3, many Transformers.js example projects lived directly in the main repository. For v4 we’ve moved them to a dedicated examples repository. This keeps the core library clean and makes it easier for users to find and contribute examples without sifting through the main codebase.
Prettier
We updated the Prettier configuration and reformatted all files to follow a consistent style.
Formatting and Consistency
All files in the repository now use a single, shared Prettier configuration. This ensures consistent formatting throughout the codebase, with all future PRs automatically following the same style. No more debates about formatting—Prettier handles it all, keeping the code clean and readable for everyone.
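A shared Prettier setup like this typically amounts to a single config file at the repository root plus a format script run in CI; the options below are illustrative, not the project's actual settings:

```yaml
# .prettierrc (illustrative options, not the project's actual settings)
printWidth: 100
singleQuote: false
trailingComma: "all"
```

With the config committed, `prettier --check .` in CI is enough to guarantee every PR matches the shared style.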
New Models and Architectures
Thanks to our new export strategy and ONNX Runtime’s expanding support for custom operators, we’ve added many new models and architectures to Transformers.js v4. These include popular models such as:
- GPT‑OSS
- Chatterbox
- GraniteMoeHybrid
- LFM2‑MoE
- HunYuanDenseV1
- Apertus
- Olmo3
- FalconH1
- Youtu‑LLM
Many of these required us to implement support for advanced architectural patterns, including:
- Mamba (state‑space models)
- Multi‑head Latent Attention (MLA)
- Mixture of Experts (MoE)
All of these models are compatible with WebGPU, allowing users to run them directly in the browser or server‑side JavaScript environments with hardware acceleration.
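As a usage sketch, running one of these models looks the same as any other Transformers.js pipeline. The model id below is hypothetical, and the `device`/`dtype` choices depend on your hardware:

```javascript
// Hypothetical usage sketch: the model id is illustrative, and running it
// requires `@huggingface/transformers@next` to be installed.
async function generate(prompt) {
  const { pipeline } = await import("@huggingface/transformers");
  const generator = await pipeline(
    "text-generation",
    "onnx-community/example-model", // illustrative model id
    {
      device: "webgpu", // hardware acceleration in browser or Node/Bun/Deno
      dtype: "q4f16", // quantized weights with fp16 activations
    },
  );
  return generator(prompt, { max_new_tokens: 64 });
}
```

The same call works unchanged in the browser and in server‑side runtimes; only the `device` choice (and available memory) differs.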
New Build System
We’ve migrated our build system from Webpack to esbuild, and the results have been incredible:
- Build time: reduced from ~2 seconds to ~200 ms (≈10× faster)
- Bundle size: average reduction of ~10 % across all builds
- transformers.web.js: now 53 % smaller, leading to faster downloads and quicker startup times for users
Standalone Tokenizers.js Library
A frequent request from users was to extract the tokenization logic into a separate library. With v4, that’s exactly what we’ve done.
@huggingface/tokenizers is a complete refactor of the tokenization logic, designed to work seamlessly across browsers and server‑side runtimes. At just 8.8 kB (gzipped) with zero dependencies, it’s incredibly lightweight while remaining fully type‑safe.
Example
```javascript
import { Tokenizer } from "@huggingface/tokenizers";

// Load from Hugging Face Hub
const modelId = "HuggingFaceTB/SmolLM3-3B";
const tokenizerJson = await fetch(
  `https://huggingface.co/${modelId}/resolve/main/tokenizer.json`,
).then((res) => res.json());
const tokenizerConfig = await fetch(
  `https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`,
).then((res) => res.json());

// Create tokenizer
const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);

// Tokenize text
const tokens = tokenizer.tokenize("Hello World");
// ['Hello', 'ĠWorld']

const encoded = tokenizer.encode("Hello World");
// { ids: [9906, 4435], tokens: ['Hello', 'ĠWorld'], ... }
```
This separation keeps the core of Transformers.js focused and lean while offering a versatile, standalone tool that any WebML project can use independently.
Miscellaneous Improvements
We’ve made several quality‑of‑life improvements across the library:
- Dynamic pipeline types that adapt based on inputs, providing better developer experience and type safety.
- Enhanced logging for more control and clearer feedback during model execution.
- Support for larger models exceeding 8B parameters. In our tests, we ran GPT‑OSS 20B (q4f16) at ~60 tokens per second on an M4 Pro Max.
Acknowledgements
We want to extend our heartfelt thanks to everyone who contributed to this major release, especially:
- The ONNX Runtime team for their incredible work on the new WebGPU runtime and their support throughout development.
- All external contributors and early testers.