Kreuzberg v4.0.0-RC.8 is Available

Published: December 15, 2025 at 08:06 AM EST

Source: Dev.to

Announcement

Kreuzberg v4.0.0‑rc.8 is now available on all channels. The final v4.0.0 release is scheduled for early 2026, a few weeks from now.

What is Kreuzberg?

Kreuzberg is a document‑intelligence toolkit that extracts text, metadata, tables, images, and structured data from 56+ file formats.

v1‑v3: pure Python implementation.
v4: complete rewrite in Rust (2024 edition) with native bindings for multiple languages, delivering zero‑cost abstractions, memory safety, and native performance.

Language Bindings

Language / Runtime	Binding Type
Rust	Native library
Python	PyO3 native bindings
TypeScript (Node.js)	NAPI‑RS native bindings
TypeScript (Deno / Browser / Edge)	WebAssembly
Ruby	Magnus FFI
Java 25+	Panama Foreign Function & Memory API
C#	P/Invoke
Go	cgo bindings

Upcoming bindings

PHP
Elixir (via Rustler, with Erlang & Gleam interop)

Installation Options

CLI – install via cargo or Homebrew.
HTTP REST API server (Axum).
Model Context Protocol (MCP) server for Claude Desktop / Continue.dev.
Docker images (publicly available).

Architectural Improvements

Zero‑copy operations using Rust’s ownership model.
True async concurrency with the Tokio runtime (no GIL).
Streaming parsers for constant‑memory processing of multi‑GB files.
SIMD‑accelerated text processing for token reduction and string ops.
Memory‑safe FFI boundaries for all language bindings.
Trait‑based plugin system for extensibility.

v3 → v4 Comparison

Feature	v3 (Python)	v4 (Rust)
Core Language	Pure Python	Rust (2024 edition)
Supported Formats	30‑40 (via Pandoc)	56+ (native parsers)
Language Bindings	Python only	7 languages (Rust, Python, TS, Ruby, Java, Go, C#)
Dependencies	Requires Pandoc (system binary)	Zero system dependencies
Embeddings	Not supported	FastEmbed with ONNX (3 presets + custom)
Semantic Chunking	External library	Built‑in (text + markdown‑aware)
Token Reduction	TF‑IDF based	Enhanced with 3 configurable modes
Language Detection	Optional (fast‑langdetect)	Built‑in (68 languages)
Keyword Extraction	Optional (KeyBERT)	Built‑in (YAKE + RAKE)
OCR Backends	Tesseract/EasyOCR/PaddleOCR	Same + tighter integration
Plugin System	Limited extractor registry	Full trait‑based system (4 plugin types)
Page Tracking	Character‑based indices	Byte‑based indices with O(1) lookup
Servers	REST API (Litestar)	HTTP (Axum) + MCP + MCP‑SSE
Installation Size	~100 MB base	16‑31 MB complete
Memory Model	Python heap	RAII with streaming
Concurrency	asyncio (GIL‑limited)	Tokio work‑stealing

From Pandoc to Native Parsers

v3 limitations (Pandoc):

System‑level dependency and installation overhead.
Subprocess spawn for every document.
No streaming support.
Limited metadata extraction.
~500 MB installation footprint.

v4 advantages (native Rust parsers):

Zero external dependencies.
Direct parsing with full control over extraction.
Substantially richer metadata (e.g., DOCX properties, section structure, styles).
Streaming support for massive files (tested on multi‑GB XML).
Example: PPTX extractor now streams gigabyte‑scale presentations with constant memory usage.
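
The constant-memory idea behind streaming parsers can be illustrated in Python with the standard library's incremental XML parser. This is a sketch of the general technique, not Kreuzberg's Rust implementation; the count_text_chars helper and the <p> tag name are assumptions chosen for illustration.

```python
import io
import xml.etree.ElementTree as ET


def count_text_chars(stream, tag="p"):
    """Sum the text length of every <tag> element without loading the whole file."""
    total = 0
    # iterparse yields each element as soon as its closing tag has been read.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == tag:
            total += len(elem.text or "")
        # Release the element's text and children. The root still keeps empty
        # stubs, so a production parser would also detach them from the parent.
        elem.clear()
    return total


# Hypothetical in-memory sample; a real input would be an open file handle.
sample = io.BytesIO(b"<doc><p>alpha</p><p>beta</p><note>skip</note></doc>")
total_chars = count_text_chars(sample)
```

Because each element is processed and cleared as it arrives, peak memory depends on the largest single element rather than on the file size.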

Expanded Format Support

Newly supported legacy formats:

.doc (Word 97‑2003)
.ppt (PowerPoint 97‑2003)
.xls (Excel 97‑2003)
.eml (Email)
.msg (Outlook)

Academic/technical formats:

LaTeX (.tex)
BibTeX (.bib)
Typst (.typ)
JATS XML (scientific articles)
DocBook XML
FictionBook (.fb2)
OPML (.opml)

Improved Office support:

Excel binary/macros (.xlsb, .xlsm)
Richer metadata extraction from DOCX/PPTX/XLSX
Full table extraction from presentations
Image extraction with deduplication
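
The announcement does not say how image deduplication is implemented. One common approach is to hash each image's bytes and keep only the first occurrence; the sketch below assumes exact byte-for-byte duplicates, and dedupe_images is a hypothetical helper, not part of Kreuzberg's API.

```python
import hashlib


def dedupe_images(images):
    """Drop exact byte-for-byte duplicates, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for data in images:
        digest = hashlib.sha256(data).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(data)
    return unique


# Placeholder byte strings stand in for real image payloads.
unique_images = dedupe_images([b"logo-bytes", b"chart-bytes", b"logo-bytes"])
```

Content hashing only catches identical files; visually similar images with different encodings would need perceptual hashing instead.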

New Features for RAG & LLM Workflows

Embeddings (FastEmbed)

ONNX Runtime acceleration.
Presets: fast (384 d), balanced (512 d), quality (768 d/1024 d).
Custom ONNX model support.
Local generation (no external API calls, no rate limits).
Automatic model download & caching.

Example (Python)

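import kreuzberg
from kreuzberg import ExtractionConfig, EmbeddingConfig, EmbeddingModelType

# Select the "balanced" preset and normalize the resulting vectors.
config = ExtractionConfig(
    embeddings=EmbeddingConfig(
        model=EmbeddingModelType.preset("balanced"),
        normalize=True
    )
)

# pdf_bytes holds the raw bytes of a PDF (loading it is not shown here).
result = kreuzberg.extract_bytes(pdf_bytes, config=config)
# result.embeddings contains a vector for each chunk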

Semantic Text Chunking (Built‑in)

Structure‑aware chunking respecting document semantics.
Two strategies:
1. Generic – whitespace/punctuation aware.
2. Markdown – preserves headings, lists, code blocks, tables.
Configurable chunk size & overlap.
Unicode‑safe (handles CJK, emojis).
Automatic mapping of chunks to pages with byte‑accurate offsets.
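
To make chunk size, overlap, and byte-accurate offsets concrete, here is a minimal Python sketch of the generic strategy. It is an illustration under stated assumptions, not Kreuzberg's algorithm: chunk_text, max_chars, and overlap are hypothetical names, sizes are counted in code points, and chunks break at the last space in the window when one is available.

```python
def chunk_text(text, max_chars=20, overlap=5):
    """Split text into overlapping chunks and report UTF-8 byte offsets for each."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be non-negative and smaller than max_chars")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer the last space in the window, as long as the chunk still
            # advances past the overlap region.
            space = text.rfind(" ", start + 1, end)
            if space > start + overlap:
                end = space
        # Slicing a Python str never splits a code point, so the byte offsets
        # always land on valid UTF-8 boundaries. (Re-encoding the prefix is
        # quadratic overall; acceptable for a sketch.)
        byte_start = len(text[:start].encode("utf-8"))
        byte_end = byte_start + len(text[start:end].encode("utf-8"))
        chunks.append({"text": text[start:end], "byte_start": byte_start, "byte_end": byte_end})
        if end == len(text):
            break
        start = end - overlap
    return chunks


sample = "日本語のテキスト and emoji 🎉 mixed with English words"
raw = sample.encode("utf-8")
chunks = chunk_text(sample, max_chars=20, overlap=5)
```

Slicing raw[byte_start:byte_end] and decoding it reproduces each chunk's text exactly, which is the property byte offsets are meant to guarantee. Splitting on code points does not protect multi-code-point grapheme clusters, so a production chunker would need extra care there.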

Byte‑Accurate Page Tracking (Breaking Change)

v3: Character‑based indices, which stop matching byte offsets as soon as the text contains multi‑byte UTF‑8 characters.
v4: Byte‑based indices (byte_start / byte_end), which address the UTF‑8 encoded text directly.
O(1) lookup: “which page contains byte offset X?”
Per‑page content extraction and page markers (e.g., --- Page 5 ---).
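
The difference between character and byte indices, and the page lookup question, can be shown in a few lines of Python. This is a sketch under assumptions: page_starts and page_for_offset are hypothetical names, and the lookup uses binary search (O(log n)) rather than whatever O(1) structure Kreuzberg uses internally, which the announcement does not describe.

```python
from bisect import bisect_right

# "naïve café 🎉" written with escapes so the code points are unambiguous.
text = "na\u00efve caf\u00e9 \U0001F389"
char_index = text.index("\U0001F389")                 # position in code points
byte_index = len(text[:char_index].encode("utf-8"))   # position in UTF-8 bytes

# Hypothetical byte offsets at which pages 1, 2, and 3 begin.
page_starts = [0, 120, 300]


def page_for_offset(starts, offset):
    """Return the 1-based page whose byte range contains the given byte offset."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    # Offsets past the final page start are attributed to the last page.
    return bisect_right(starts, offset)


page = page_for_offset(page_starts, 299)
```

Here the emoji sits at character index 11 but byte index 13, because the two accented letters each take two bytes in UTF-8. Code migrating from v3 that slices Python strings with v4 offsets would need to slice the encoded bytes instead.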

Enhanced Token Reduction

Three configurable modes to reduce LLM context size:

Mode	Approx. Reduction
Light	~15 %
Moderate	~30 %
Aggressive	~50 %

Implemented with TF‑IDF sentence scoring, position‑aware weighting, language‑specific stop‑word filtering, and SIMD acceleration.
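
A minimal Python sketch of TF-IDF sentence scoring follows, treating each sentence as a document. It is not Kreuzberg's implementation: reduce_text, keep_ratio, and the tiny STOPWORDS set are hypothetical, and position-aware weighting and SIMD acceleration are left out.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are language-specific and larger.
STOPWORDS = {"the", "a", "is", "of", "and", "to", "in", "it", "from", "then"}


def tokenize(sentence):
    """Lowercase, split into alphanumeric tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", sentence.lower()) if w not in STOPWORDS]


def reduce_text(text, keep_ratio=0.5):
    """Keep the highest-scoring sentences, preserving their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Document frequency: in how many sentences does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = sum(count * math.log((1 + n) / (1 + df[term])) for term, count in tf.items())
        # Normalize by length so long sentences are not favored automatically.
        scores.append(score / (len(tokens) or 1))
    keep = max(1, round(n * keep_ratio))
    top = sorted(sorted(range(n), key=lambda i: scores[i], reverse=True)[:keep])
    return " ".join(sentences[i] for i in top)


sample = (
    "Kreuzberg extracts text from many formats. "
    "The text is then split into chunks. "
    "It is what it is. "
    "Token reduction drops low-scoring sentences."
)
reduced = reduce_text(sample, keep_ratio=0.5)
```

A keep_ratio of 0.5 roughly corresponds to the Aggressive row in the table above, though the real modes may combine several techniques.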

Language Detection (Built‑in)

Supports 68 languages with confidence scores.
Handles mixed‑language documents.
ISO 639‑1 & ISO 639‑3 codes.
Configurable confidence thresholds.

Keyword Extraction (Built‑in)

YAKE – unsupervised, language‑independent.
RAKE – fast statistical method.
Configurable n‑grams (1‑3 words).
Relevance scoring with language‑specific stopwords.
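
For readers unfamiliar with RAKE, the following Python sketch shows the core idea: split text into candidate phrases at stop words and punctuation, then score each phrase by summing degree/frequency over its words. The rake_keywords function and the small STOPWORDS set are illustrative assumptions, not Kreuzberg's API or its exact scoring.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list; real lists are language-specific.
STOPWORDS = {"and", "from", "the", "for", "a", "of", "to", "in", "is"}


def rake_keywords(text, max_words=3):
    """Return (phrase, score) pairs sorted by descending RAKE score."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    phrases, current = [], []
    for tok in tokens:
        # Stop words and punctuation end the current candidate phrase.
        if tok in STOPWORDS or not tok.isalnum():
            if current:
                phrases.append(current)
                current = []
        else:
            current.append(tok)
    if current:
        phrases.append(current)
    phrases = [p for p in phrases if len(p) <= max_words]

    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    scored = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))


keywords = rake_keywords(
    "Kreuzberg extracts text and tables from scanned documents, "
    "and semantic chunking prepares the text for retrieval."
)
```

The max_words parameter mirrors the 1 to 3 word n-gram range mentioned above. Words that appear in long phrases get a high degree, so multi-word phrases tend to outrank isolated words.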

Plugin System

Four extensible plugin types:

DocumentExtractor – custom file‑format handlers.
OcrBackend – integrate custom OCR engines (including Python models).
PostProcessor – data transformation & enrichment.
Validator – pre‑extraction validation.

Plugins are defined in Rust and work across all language bindings; Python/TypeScript can provide thread‑safe callbacks into the Rust core.
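
The shape of a trait-based plugin can be approximated in Python with a Protocol and a registry. Everything in this sketch is hypothetical (the method names, the dict-based result, the REGISTRY list); it shows the pattern of a post-processing plugin, not Kreuzberg's actual binding API.

```python
from typing import Protocol


class PostProcessor(Protocol):
    """Hypothetical interface: transform or enrich an extraction result."""

    def name(self) -> str: ...

    def process(self, result: dict) -> dict: ...


class WordCount:
    """Example plugin that adds a word count to the result."""

    def name(self) -> str:
        return "word_count"

    def process(self, result: dict) -> dict:
        enriched = dict(result)  # copy so the input is not mutated
        enriched["word_count"] = len(result["content"].split())
        return enriched


REGISTRY: list[PostProcessor] = []


def register(plugin: PostProcessor) -> None:
    REGISTRY.append(plugin)


def run_post_processors(result: dict) -> dict:
    """Apply every registered plugin in registration order."""
    for plugin in REGISTRY:
        result = plugin.process(result)
    return result


register(WordCount())
output = run_post_processors({"content": "three short words"})
```

In the real system the equivalent of run_post_processors lives in the Rust core, and a Python or TypeScript plugin would be invoked through a callback across the FFI boundary.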

Production‑Ready Servers

HTTP REST API – Axum server with OpenAPI documentation.
MCP Server – Direct integration with Claude Desktop, Continue.dev, and other MCP clients.
MCP‑SSE Transport (RC.8) – Server‑Sent Events for environments without WebSocket support.

All server modes share the same feature set: extraction, batch processing, and caching.

For more details, refer to the official Kreuzberg documentation and release notes.