Kreuzberg v4.0.0-RC.8 is Available

Published: December 15, 2025 at 08:06 AM EST
Source: Dev.to

Announcement

Kreuzberg v4.0.0‑rc.8 is now available on all channels. The final v4.0.0 release is scheduled for early 2026, a few weeks from now.

What is Kreuzberg?

Kreuzberg is a document‑intelligence toolkit that extracts text, metadata, tables, images, and structured data from 56+ file formats.

  • v1‑v3: pure Python implementation.
  • v4: complete rewrite in Rust (2024 edition) with native bindings for multiple languages, delivering zero‑cost abstractions, memory safety, and native performance.

Language Bindings

| Language / Runtime | Binding Type |
| --- | --- |
| Rust | Native library |
| Python | PyO3 native bindings |
| TypeScript (Node.js) | NAPI‑RS native bindings |
| TypeScript (Deno / Browser / Edge) | WebAssembly |
| Ruby | Magnus FFI |
| Java 25+ | Panama Foreign Function & Memory API |
| C# | P/Invoke |
| Go | cgo bindings |

Upcoming bindings

  • PHP
  • Elixir (via Rustler, with Erlang & Gleam interop)

Installation Options

  • CLI – install via cargo or Homebrew.
  • HTTP REST API server (Axum).
  • Model Context Protocol (MCP) server for Claude Desktop / Continue.dev.
  • Docker images (publicly available).

Architectural Improvements

  • Zero‑copy operations using Rust’s ownership model.
  • True async concurrency with the Tokio runtime (no GIL).
  • Streaming parsers for constant‑memory processing of multi‑GB files.
  • SIMD‑accelerated text processing for token reduction and string ops.
  • Memory‑safe FFI boundaries for all language bindings.
  • Trait‑based plugin system for extensibility.

v3 → v4 Comparison

| Feature | v3 (Python) | v4 (Rust) |
| --- | --- | --- |
| Core Language | Pure Python | Rust (2024 edition) |
| Supported Formats | 30‑40 (via Pandoc) | 56+ (native parsers) |
| Language Bindings | Python only | 7 languages (Rust, Python, TS, Ruby, Java, Go, C#) |
| Dependencies | Requires Pandoc (system binary) | Zero system dependencies |
| Embeddings | Not supported | FastEmbed with ONNX (3 presets + custom) |
| Semantic Chunking | External library | Built‑in (text + markdown‑aware) |
| Token Reduction | TF‑IDF based | Enhanced with 3 configurable modes |
| Language Detection | Optional (fast‑langdetect) | Built‑in (68 languages) |
| Keyword Extraction | Optional (KeyBERT) | Built‑in (YAKE + RAKE) |
| OCR Backends | Tesseract/EasyOCR/PaddleOCR | Same + tighter integration |
| Plugin System | Limited extractor registry | Full trait‑based system (4 plugin types) |
| Page Tracking | Character‑based indices | Byte‑based indices with O(1) lookup |
| Servers | REST API (Litestar) | HTTP (Axum) + MCP + MCP‑SSE |
| Installation Size | ~100 MB base | 16‑31 MB complete |
| Memory Model | Python heap | RAII with streaming |
| Concurrency | asyncio (GIL‑limited) | Tokio work‑stealing |

From Pandoc to Native Parsers

v3 limitations (Pandoc):

  • System‑level dependency and installation overhead.
  • Subprocess spawn for every document.
  • No streaming support.
  • Limited metadata extraction.
  • ~500 MB installation footprint.

v4 advantages (native Rust parsers):

  • Zero external dependencies.
  • Direct parsing with full control over extraction.
  • Substantially richer metadata (e.g., DOCX properties, section structure, styles).
  • Streaming support for massive files (tested on multi‑GB XML).
  • Example: PPTX extractor now streams gigabyte‑scale presentations with constant memory usage.
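
To make the constant‑memory idea concrete, here is a minimal sketch of streaming XML parsing using Python's standard library. It is not Kreuzberg's Rust parser; the function name iter_element_text and the flat <doc><p>…</p></doc> layout are assumptions chosen for illustration. The point is that each element is handled and then discarded, so memory use does not grow with file size.

```python
import io
import xml.etree.ElementTree as ET


def iter_element_text(source, tag):
    """Yield the text of each <tag> element without building the whole tree.

    Assumes the target elements are direct children of the root, so that
    clearing the root drops every element that has already been processed.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield "".join(elem.itertext()).strip()
            root.clear()  # release processed children to keep memory flat


# Usage with an in-memory stand-in for a very large file.
xml_bytes = b"<doc><p>first</p><p>second <b>bold</b></p><p>third</p></doc>"
paragraphs = list(iter_element_text(io.BytesIO(xml_bytes), "p"))
```

For deeply nested documents the same pattern applies, but the clearing step has to target the right ancestor rather than the root.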

Expanded Format Support

Newly supported legacy formats:

  • .doc (Word 97‑2003)
  • .ppt (PowerPoint 97‑2003)
  • .xls (Excel 97‑2003)
  • .eml (Email)
  • .msg (Outlook)

Academic/technical formats:

  • LaTeX (.tex)
  • BibTeX (.bib)
  • Typst (.typ)
  • JATS XML (scientific articles)
  • DocBook XML
  • FictionBook (.fb2)
  • OPML (.opml)

Improved Office support:

  • Excel binary/macros (.xlsb, .xlsm)
  • Richer metadata extraction from DOCX/PPTX/XLSX
  • Full table extraction from presentations
  • Image extraction with deduplication
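
The post does not say how image deduplication is implemented. One common approach, sketched below under that assumption, is to hash each image's raw bytes and keep only the first occurrence of each digest. The function name dedupe_images is hypothetical.

```python
import hashlib


def dedupe_images(images):
    """Return images with byte-identical duplicates removed, order preserved."""
    seen = set()
    unique = []
    for data in images:
        digest = hashlib.sha256(data).digest()  # content fingerprint
        if digest not in seen:
            seen.add(digest)
            unique.append(data)
    return unique


# A logo repeated on every slide collapses to a single entry.
logo, chart = b"\x89PNG-logo", b"\x89PNG-chart"
unique_images = dedupe_images([logo, chart, logo, logo])
```

Exact hashing only catches byte‑identical copies; re‑encoded or resized variants would need a perceptual hash instead.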

New Features for RAG & LLM Workflows

Embeddings (FastEmbed)

  • ONNX Runtime acceleration.
  • Presets: fast (384 dimensions), balanced (512 dimensions), quality (768 or 1024 dimensions).
  • Custom ONNX model support.
  • Local generation (no external API calls, no rate limits).
  • Automatic model download & caching.

Example (Python)

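from pathlib import Path

import kreuzberg
from kreuzberg import ExtractionConfig, EmbeddingConfig, EmbeddingModelType

# Enable local embedding generation with the "balanced" preset.
config = ExtractionConfig(
    embeddings=EmbeddingConfig(
        model=EmbeddingModelType.preset("balanced"),
        normalize=True
    )
)

# "document.pdf" is a placeholder path; substitute your own file.
pdf_bytes = Path("document.pdf").read_bytes()

result = kreuzberg.extract_bytes(pdf_bytes, config=config)
# result.embeddings contains a vector for each chunk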
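
As a side note on normalize=True: normalized embeddings are conventionally scaled to unit length, which makes cosine similarity a plain dot product. The sketch below illustrates that property in pure Python. It does not call Kreuzberg, and the helper names and toy two‑dimensional vectors are assumptions for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


query = l2_normalize([3.0, 4.0])
chunk_a = l2_normalize([6.0, 8.0])    # same direction as the query
chunk_b = l2_normalize([4.0, -3.0])   # orthogonal to the query

sim_a = dot(query, chunk_a)
sim_b = dot(query, chunk_b)
```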

Semantic Text Chunking (Built‑in)

  • Structure‑aware chunking respecting document semantics.
  • Two strategies:
    1. Generic – whitespace/punctuation aware.
    2. Markdown – preserves headings, lists, code blocks, tables.
  • Configurable chunk size & overlap.
  • Unicode‑safe (handles CJK, emojis).
  • Automatic mapping of chunks to pages with byte‑accurate offsets.
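
To show how chunk size, overlap, and byte offsets fit together, here is a rough sketch of a whitespace‑aware chunker. It is not Kreuzberg's implementation: the Chunk class, the chunk_text function, and the parameter names max_chars and overlap are hypothetical. Python strings are indexed by code point, so a slice never cuts through a multi‑byte UTF‑8 character; the byte offsets are computed separately.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    byte_start: int  # offset into the UTF-8 encoding of the full text
    byte_end: int


def chunk_text(text, max_chars=40, overlap=10):
    """Split text into overlapping chunks, preferring whitespace boundaries."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Break at the last space in the window, but only far enough in
            # that the next chunk still starts after the current one.
            split = text.rfind(" ", start + overlap + 1, end)
            if split != -1:
                end = split
        piece = text[start:end]
        byte_start = len(text[:start].encode("utf-8"))
        byte_end = byte_start + len(piece.encode("utf-8"))
        chunks.append(Chunk(piece, byte_start, byte_end))
        if end == len(text):
            break
        start = end - overlap
    return chunks


sample = "Kreuzberg extracts text from PDFs. 日本語のテキストも扱えます。 Emojis 🎉 are fine too."
chunks = chunk_text(sample, max_chars=40, overlap=10)
```

Slicing sample.encode("utf-8") with a chunk's byte_start and byte_end recovers exactly that chunk's text. A production chunker would also respect grapheme clusters and avoid starting an overlap mid‑word.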

Byte‑Accurate Page Tracking (Breaking Change)

  • v3: Character‑based indices (incorrect for UTF‑8 multi‑byte characters).
  • v4: Byte‑based indices (byte_start / byte_end) – correct for all string operations.
  • O(1) lookup: “which page contains byte offset X?”
  • Per‑page content extraction and page markers (e.g., --- Page 5 ---).
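
The difference between character and byte indices is easy to see in a small example. The sketch below builds a table of page start offsets and looks up the page for a given byte offset. The function names are hypothetical, and the lookup uses binary search, which is O(log n) rather than the O(1) structure the release describes, so treat it as an illustration of the indexing model only.

```python
from bisect import bisect_right


def build_page_starts(pages):
    """Return the starting byte offset of each page in the joined UTF-8 text."""
    starts, offset = [], 0
    for page in pages:
        starts.append(offset)
        offset += len(page.encode("utf-8"))
    return starts


def page_for_byte(starts, byte_offset):
    """Return the 1-based page number containing byte_offset."""
    if byte_offset < 0:
        raise ValueError("byte_offset must be non-negative")
    return bisect_right(starts, byte_offset)


pages = ["Grüße aus Berlin. ", "日本語のページ。", "Last page."]
text = "".join(pages)
data = text.encode("utf-8")
starts = build_page_starts(pages)

# Byte offsets slice the encoded bytes correctly ...
page_two = data[starts[1]:starts[2]].decode("utf-8")
# ... but applying the same numbers to the str yields the wrong span.
wrong_span = text[starts[1]:starts[2]]
```

Here the text has 36 characters but 54 bytes, so any offset past the first non‑ASCII character differs between the two schemes. This is why code migrating from v3 must not reuse character indices as byte indices.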

Enhanced Token Reduction

Three configurable modes to reduce LLM context size:

| Mode | Approx. Reduction |
| --- | --- |
| Light | ~15 % |
| Moderate | ~30 % |
| Aggressive | ~50 % |

Implemented with TF‑IDF sentence scoring, position‑aware weighting, language‑specific stop‑word filtering, and SIMD acceleration.
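
The combination of TF‑IDF sentence scoring, position weighting, and stop‑word filtering can be sketched in a few lines. This is a simplified illustration, not Kreuzberg's algorithm: the stop‑word list, the 0.25 position bonus, and the keep_ratio parameter (which loosely plays the role of the reduction mode) are all assumptions.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be language-specific.
STOPWORDS = {"a", "an", "and", "is", "it", "of", "the", "this", "to", "with"}


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def reduce_sentences(sentences, keep_ratio=0.5):
    """Keep the highest-scoring sentences, returned in their original order."""
    n = len(sentences)
    if n == 0:
        return []
    tokenized = [
        [w for w in re.findall(r"\w+", s.lower()) if w not in STOPWORDS]
        for s in sentences
    ]
    # Document frequency: in how many sentences does each word appear?
    df = Counter(w for toks in tokenized for w in set(toks))
    scores = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        tfidf = sum(
            (count / len(toks)) * (math.log((1 + n) / (1 + df[w])) + 1.0)
            for w, count in tf.items()
        )
        position_weight = 1.0 + 0.25 * (1.0 - i / n)  # favor earlier sentences
        scores.append(tfidf * position_weight)
    keep = max(1, math.ceil(n * keep_ratio))
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:keep]
    return [sentences[i] for i in sorted(top)]


doc = ("Kreuzberg extracts text from documents. It is fast. "
       "The Rust core streams large files. This is nice.")
kept = reduce_sentences(split_sentences(doc), keep_ratio=0.5)
```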

Language Detection (Built‑in)

  • Supports 68 languages with confidence scores.
  • Handles mixed‑language documents.
  • ISO 639‑1 & ISO 639‑3 codes.
  • Configurable confidence thresholds.

Keyword Extraction (Built‑in)

  • YAKE – unsupervised, language‑independent.
  • RAKE – fast statistical method.
  • Configurable n‑grams (1‑3 words).
  • Relevance scoring with language‑specific stopwords.
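
For readers unfamiliar with RAKE, the core idea fits in a short function: split the text into candidate phrases at stop words and punctuation, score each word by degree divided by frequency, and sum the word scores per phrase. The sketch below is a simplified RAKE‑style scorer, not Kreuzberg's implementation. The stop‑word list and the rake_keywords name are assumptions, and dropping phrases longer than max_words before scoring is a simplification.

```python
import re
from collections import defaultdict

# Minimal illustrative stop-word list.
STOPWORDS = {"a", "an", "and", "are", "from", "in", "is", "of", "the", "to", "with"}


def rake_keywords(text, max_words=3, top_k=5):
    """Return up to top_k (phrase, score) pairs, highest score first."""
    phrases = []
    # Punctuation ends a candidate phrase; so does any stop word.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    phrases = [p for p in phrases if len(p) <= max_words]

    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, including itself

    word_score = {w: degree[w] / freq[w] for w in freq}
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


keywords = rake_keywords(
    "The parser extracts tables from scanned documents. "
    "Scanned documents are processed with optical character recognition."
)
```

Longer phrases made of rare words score highest, which is why RAKE tends to surface multi‑word technical terms.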

Plugin System

Four extensible plugin types:

  1. DocumentExtractor – custom file‑format handlers.
  2. OcrBackend – integrate custom OCR engines (including Python models).
  3. PostProcessor – data transformation & enrichment.
  4. Validator – pre‑extraction validation.

Plugins are defined in Rust and work across all language bindings; Python/TypeScript can provide thread‑safe callbacks into the Rust core.
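
To give a feel for how such plugin types compose, here is a purely illustrative Python sketch of a validator and a post‑processor wrapped around an extraction step. None of these names or signatures come from Kreuzberg's API: Result, Validator, PostProcessor, the two example plugins, and run_pipeline are all hypothetical, and the "extraction" is a stand‑in that just decodes UTF‑8.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Result:
    content: str
    metadata: dict = field(default_factory=dict)


class Validator(Protocol):
    def validate(self, data: bytes) -> None: ...


class PostProcessor(Protocol):
    def process(self, result: Result) -> Result: ...


class MaxSizeValidator:
    """Reject inputs above a size limit before any extraction work happens."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes

    def validate(self, data: bytes) -> None:
        if len(data) > self.max_bytes:
            raise ValueError(f"input exceeds {self.max_bytes} bytes")


class WordCountPostProcessor:
    """Enrich the result with a simple word count."""

    def process(self, result: Result) -> Result:
        result.metadata["word_count"] = len(result.content.split())
        return result


def run_pipeline(data, validators, post_processors):
    for validator in validators:
        validator.validate(data)
    result = Result(content=data.decode("utf-8"))  # stand-in for extraction
    for processor in post_processors:
        result = processor.process(result)
    return result


result = run_pipeline(
    b"hello plugin world",
    [MaxSizeValidator(1024)],
    [WordCountPostProcessor()],
)
```

Structural typing via Protocol mirrors the trait‑based idea: any object with the right method can be plugged in without inheriting from a base class.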

Production‑Ready Servers

  • HTTP REST API – Axum server with OpenAPI documentation.
  • MCP Server – Direct integration with Claude Desktop, Continue.dev, and other MCP clients.
  • MCP‑SSE Transport (RC.8) – Server‑Sent Events for environments without WebSocket support.

All server modes share the same feature set: extraction, batch processing, and caching.
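
For context on the SSE transport: Server‑Sent Events are plain text frames sent over a long‑lived HTTP response, and MCP messages are JSON‑RPC 2.0 payloads. The sketch below shows how a message could be framed as an SSE event. The format_sse helper, the "message" event name, and the payload contents are illustrative assumptions, not Kreuzberg's server code.

```python
import json
from typing import Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Frame a payload as one Server-Sent Event.

    Each line of the payload gets its own "data:" field, and a blank line
    terminates the event.
    """
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    for line in data.splitlines() or [""]:
        lines.append(f"data: {line}")
    return "\n".join(lines) + "\n\n"


payload = json.dumps({"jsonrpc": "2.0", "id": 1, "result": {"ok": True}})
frame = format_sse(payload, event="message")
```

Because the framing is ordinary HTTP text, it passes through proxies and environments where WebSocket upgrades are blocked.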


For more details, refer to the official Kreuzberg documentation and release notes.
