Kreuzberg v4.0.0-RC.8 is Available
Source: Dev.to
Announcement
Kreuzberg v4.0.0‑rc.8 is now available on all channels. The final v4.0.0 release is scheduled for the beginning of next year (in a few weeks).
What is Kreuzberg?
Kreuzberg is a document‑intelligence toolkit that extracts text, metadata, tables, images, and structured data from 56+ file formats.
- v1‑v3: pure Python implementation.
- v4: complete rewrite in Rust (2024 edition) with native bindings for multiple languages, delivering zero‑cost abstractions, memory safety, and native performance.
Language Bindings
| Language / Runtime | Binding Type |
|---|---|
| Rust | Native library |
| Python | PyO3 native bindings |
| TypeScript (Node.js) | NAPI‑RS native bindings |
| TypeScript (Deno / Browser / Edge) | WebAssembly |
| Ruby | Magnus FFI |
| Java 25+ | Panama Foreign Function & Memory API |
| C# | P/Invoke |
| Go | cgo bindings |
Upcoming bindings
- PHP
- Elixir (via Rustler, with Erlang & Gleam interop)
Installation Options
- CLI – install via `cargo` or Homebrew.
- HTTP REST API server (Axum).
- Model Context Protocol (MCP) server for Claude Desktop / Continue.dev.
- Docker images (publicly available).
Architectural Improvements
- Zero‑copy operations using Rust’s ownership model.
- True async concurrency with the Tokio runtime (no GIL).
- Streaming parsers for constant‑memory processing of multi‑GB files.
- SIMD‑accelerated text processing for token reduction and string ops.
- Memory‑safe FFI boundaries for all language bindings.
- Trait‑based plugin system for extensibility.
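The constant-memory idea behind streaming parsers can be illustrated in Python, independently of Kreuzberg's Rust internals. The sketch below uses the standard library's `xml.etree.ElementTree.iterparse` and releases each element once it has been read. The function name `iter_paragraph_text` and the `p` tag are assumptions made for illustration, not part of Kreuzberg.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator


def iter_paragraph_text(stream: BinaryIO, tag: str = "p") -> Iterator[str]:
    """Yield the text of each <tag> element without keeping the full tree populated."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == tag:
            yield "".join(elem.itertext())
            # Release the element's text and children. The emptied element
            # stays attached to its parent, so memory grows only by small
            # empty shells rather than by the document's content.
            elem.clear()


# Hypothetical in-memory input; a real caller would pass an open file.
sample = io.BytesIO(b"<doc><p>first</p><p>second <b>bold</b></p></doc>")
paragraphs = list(iter_paragraph_text(sample))
```

Because the generator yields one paragraph at a time, a caller can process a very large file without ever holding all of its text in memory.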
v3 → v4 Comparison
| Feature | v3 (Python) | v4 (Rust) |
|---|---|---|
| Core Language | Pure Python | Rust (2024 edition) |
| Supported Formats | 30‑40 (via Pandoc) | 56+ (native parsers) |
| Language Bindings | Python only | 7 languages (Rust, Python, TS, Ruby, Java, Go, C#) |
| Dependencies | Requires Pandoc (system binary) | Zero system dependencies |
| Embeddings | Not supported | FastEmbed with ONNX (3 presets + custom) |
| Semantic Chunking | External library | Built‑in (text + markdown‑aware) |
| Token Reduction | TF‑IDF based | Enhanced with 3 configurable modes |
| Language Detection | Optional (fast‑langdetect) | Built‑in (68 languages) |
| Keyword Extraction | Optional (KeyBERT) | Built‑in (YAKE + RAKE) |
| OCR Backends | Tesseract/EasyOCR/PaddleOCR | Same + tighter integration |
| Plugin System | Limited extractor registry | Full trait‑based system (4 plugin types) |
| Page Tracking | Character‑based indices | Byte‑based indices with O(1) lookup |
| Servers | REST API (Litestar) | HTTP (Axum) + MCP + MCP‑SSE |
| Installation Size | ~100 MB base | 16‑31 MB complete |
| Memory Model | Python heap | RAII with streaming |
| Concurrency | asyncio (GIL‑limited) | Tokio work‑stealing |
From Pandoc to Native Parsers
v3 limitations (Pandoc):
- System‑level dependency and installation overhead.
- Subprocess spawn for every document.
- No streaming support.
- Limited metadata extraction.
- ~500 MB installation footprint.
v4 advantages (native Rust parsers):
- Zero external dependencies.
- Direct parsing with full control over extraction.
- Substantially richer metadata (e.g., DOCX properties, section structure, styles).
- Streaming support for massive files (tested on multi‑GB XML).
- Example: PPTX extractor now streams gigabyte‑scale presentations with constant memory usage.
Expanded Format Support
Newly supported legacy formats:
- `.doc` (Word 97‑2003)
- `.ppt` (PowerPoint 97‑2003)
- `.xls` (Excel 97‑2003)
- `.eml` (Email)
- `.msg` (Outlook)
Academic/technical formats:
- LaTeX (`.tex`)
- BibTeX (`.bib`)
- Typst (`.typ`)
- JATS XML (scientific articles)
- DocBook XML
- FictionBook (`.fb2`)
- OPML (`.opml`)
Improved Office support:
- Excel binary/macros (`.xlsb`, `.xlsm`)
- Richer metadata extraction from DOCX/PPTX/XLSX
- Full table extraction from presentations
- Image extraction with deduplication
New Features for RAG & LLM Workflows
Embeddings (FastEmbed)
- ONNX Runtime acceleration.
- Presets: fast (384 d), balanced (512 d), quality (768 d/1024 d).
- Custom ONNX model support.
- Local generation (no external API calls, no rate limits).
- Automatic model download & caching.
Example (Python)
import kreuzberg
from kreuzberg import ExtractionConfig, EmbeddingConfig, EmbeddingModelType

# Request embeddings using the "balanced" preset, with normalized vectors.
config = ExtractionConfig(
    embeddings=EmbeddingConfig(
        model=EmbeddingModelType.preset("balanced"),
        normalize=True,
    )
)

# pdf_bytes holds the raw bytes of a PDF loaded elsewhere.
result = kreuzberg.extract_bytes(pdf_bytes, config=config)
# result.embeddings contains a vector for each chunk
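To show why normalized vectors are convenient for retrieval, here is a minimal sketch in plain Python. It assumes that `normalize=True` scales each vector to unit L2 length, which is the usual convention but is not confirmed by the announcement. The helper names and the two-dimensional vectors are illustrative only.

```python
import math
from typing import Sequence


def l2_normalize(vector: Sequence[float]) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy 2-d vectors standing in for real 384- to 1024-dimensional embeddings.
chunk_vec = l2_normalize([3.0, 4.0])
query_vec = l2_normalize([4.0, 3.0])

# For unit-length vectors the dot product equals the cosine similarity.
similarity = dot(chunk_vec, query_vec)
```

With unit-length vectors, ranking chunks against a query needs only dot products, with no per-comparison division by vector norms.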
Semantic Text Chunking (Built‑in)
- Structure‑aware chunking respecting document semantics.
- Two strategies:
  - Generic – whitespace/punctuation aware.
  - Markdown – preserves headings, lists, code blocks, tables.
- Configurable chunk size & overlap.
- Unicode‑safe (handles CJK, emojis).
- Automatic mapping of chunks to pages with byte‑accurate offsets.
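The interaction between chunk size, overlap, and byte offsets can be sketched as follows. This is a simplified stand-in for the generic strategy, not Kreuzberg's implementation: it measures size in Unicode code points, prefers to cut at a space, and records UTF-8 byte offsets for each chunk. `Chunk`, `chunk_text`, `max_chars`, and `overlap` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    byte_start: int  # offset into the UTF-8 encoding of the full text
    byte_end: int


def chunk_text(text: str, max_chars: int = 40, overlap: int = 10) -> list[Chunk]:
    """Split text into overlapping chunks, cutting at spaces where possible."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")
    chunks: list[Chunk] = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer the last space inside the window, as long as the chunk
            # still advances past the overlap region.
            cut = text.rfind(" ", start + 1, end)
            if cut > start + overlap:
                end = cut
        piece = text[start:end]
        # Slicing a str works on code points, so no character is ever split.
        byte_start = len(text[:start].encode("utf-8"))
        byte_end = byte_start + len(piece.encode("utf-8"))
        chunks.append(Chunk(piece, byte_start, byte_end))
        if end == len(text):
            break
        start = end - overlap
    return chunks


sample = "Kreuzberg extracts text from 文档 and PDFs, then splits it into overlapping chunks."
chunks = chunk_text(sample, max_chars=30, overlap=8)
```

Slicing the encoded text with a chunk's `byte_start` and `byte_end` returns exactly that chunk's text, which is the property byte-accurate offsets are meant to guarantee.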
Byte‑Accurate Page Tracking (Breaking Change)
- v3: Character‑based indices (incorrect for UTF‑8 multi‑byte characters).
- v4: Byte‑based indices (`byte_start`/`byte_end`) – correct for all string operations.
- O(1) lookup: “which page contains byte offset X?”
- Per‑page content extraction and page markers (e.g., `--- Page 5 ---`).
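The difference between character and byte indices is easy to see on text with multi-byte characters. The sketch below builds a table of page start offsets in bytes and answers "which page contains byte offset X?". It uses a binary search (`bisect`, O(log n)) as a simple stand-in for the O(1) lookup the release describes, and the names `page_byte_starts` and `page_for_byte` are hypothetical.

```python
from bisect import bisect_right

# Three toy pages; the first two contain multi-byte UTF-8 characters.
pages = ["Café menu\n", "日本語のページ\n", "Plain ASCII page\n"]
content = "".join(pages)
data = content.encode("utf-8")

# Byte offset at which each page begins in the encoded content.
page_byte_starts: list[int] = []
offset = 0
for page in pages:
    page_byte_starts.append(offset)
    offset += len(page.encode("utf-8"))


def page_for_byte(byte_offset: int) -> int:
    """Return the 1-based number of the page containing byte_offset."""
    if not 0 <= byte_offset < len(data):
        raise IndexError("byte offset outside the document")
    # Number of page starts that are <= byte_offset.
    return bisect_right(page_byte_starts, byte_offset)


# The third page starts at character 18 but at byte 33.
char_index = content.index("Plain")
byte_index = data.index(b"Plain")
```

A consumer that used the character index 18 to slice the encoded bytes would land in the middle of the second page, which is the kind of error byte-based indices avoid.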
Enhanced Token Reduction
Three configurable modes to reduce LLM context size:
| Mode | Approx. Reduction |
|---|---|
| Light | ~15 % |
| Moderate | ~30 % |
| Aggressive | ~50 % |
Implemented with TF‑IDF sentence scoring, position‑aware weighting, language‑specific stop‑word filtering, and SIMD acceleration.
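A rough sketch of TF-IDF sentence scoring with a position weight is shown below. It is a simplified illustration, not Kreuzberg's algorithm: each sentence is treated as a "document" for the IDF term, the first sentence gets a small bonus, and the lowest-scoring sentences are dropped. The stop-word list, the 1.2 bonus, and the `KEEP_RATIO` mapping from mode to the share of sentences kept are all assumptions; the reduction percentages in the table refer to context size, not sentence counts.

```python
import math
import re
from collections import Counter

# Tiny English stop-word list, for illustration only.
STOPWORDS = {"a", "an", "and", "in", "is", "it", "of", "the", "this", "to"}

# Assumed mapping from mode to the share of sentences kept.
KEEP_RATIO = {"light": 0.85, "moderate": 0.70, "aggressive": 0.50}


def select_sentences(text: str, mode: str = "moderate") -> list[str]:
    """Keep the highest-scoring sentences, returned in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    tokenized = [
        [w for w in re.findall(r"\w+", s.lower()) if w not in STOPWORDS]
        for s in sentences
    ]
    total = len(sentences)
    # Number of sentences in which each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for index, tokens in enumerate(tokenized):
        counts = Counter(tokens)
        # Term frequency within the sentence times a smoothed IDF.
        tfidf = sum(
            (count / len(tokens)) * (math.log(total / doc_freq[term]) + 1.0)
            for term, count in counts.items()
        )
        position_weight = 1.2 if index == 0 else 1.0
        scores.append(tfidf * position_weight)

    keep = max(1, math.ceil(total * KEEP_RATIO[mode]))
    top = sorted(range(total), key=lambda i: scores[i], reverse=True)[:keep]
    return [sentences[i] for i in sorted(top)]


sample = (
    "Kreuzberg extracts text from documents. It is fast. "
    "The Rust core parses PDF, DOCX, and XLSX files natively. This is it."
)
reduced = select_sentences(sample, mode="aggressive")
```

Sentences made up entirely of stop words score zero here, so they are the first to be dropped as the mode becomes more aggressive.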
Language Detection (Built‑in)
- Supports 68 languages with confidence scores.
- Handles mixed‑language documents.
- ISO 639‑1 & ISO 639‑3 codes.
- Configurable confidence thresholds.
Keyword Extraction (Built‑in)
- YAKE – unsupervised, language‑independent.
- RAKE – fast statistical method.
- Configurable n‑grams (1‑3 words).
- Relevance scoring with language‑specific stopwords.
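For readers unfamiliar with RAKE, the following is a compact sketch of the classic algorithm in plain Python: candidate phrases are the runs of words between stop words and punctuation, each word is scored as degree divided by frequency, and a phrase scores the sum of its word scores. The stop-word list and the function name `rake_keywords` are illustrative assumptions, and Kreuzberg's built-in implementation may differ in its details.

```python
import re
from collections import defaultdict

# Tiny English stop-word list, for illustration only.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "with"}


def rake_keywords(text: str, max_words: int = 3, top_k: int = 5) -> list[tuple[str, float]]:
    """Return up to top_k (phrase, score) pairs, best first."""
    phrases: list[tuple[str, ...]] = []
    # Punctuation ends a fragment; stop words end a phrase within a fragment.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current: list[str] = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    # Enforce the configurable n-gram limit (1 to max_words words).
    phrases = [p for p in phrases if len(p) <= max_words]

    freq: dict[str, int] = defaultdict(int)
    degree: dict[str, int] = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself

    word_score = {w: degree[w] / freq[w] for w in freq}
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    # Sort by score descending, then alphabetically for stable output.
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


sample = "A toolkit with keyword extraction and language detection. Keyword extraction is built in."
keywords = rake_keywords(sample)
```

Multi-word phrases tend to outrank single words under this scoring, because each word's degree grows with the length of the phrases it appears in.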
Plugin System
Four extensible plugin types:
- DocumentExtractor – custom file‑format handlers.
- OcrBackend – integrate custom OCR engines (including Python models).
- PostProcessor – data transformation & enrichment.
- Validator – pre‑extraction validation.
Plugins are defined in Rust and work across all language bindings; Python/TypeScript can provide thread‑safe callbacks into the Rust core.
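To make the roles of two of these plugin types concrete, here is a self-contained Python sketch of a validator and a post-processor wired into a toy pipeline. Every name in it (`Extraction`, `Validator`, `PostProcessor`, `run_pipeline`, and the method signatures) is hypothetical and chosen for illustration; it does not reproduce Kreuzberg's actual plugin API or registration calls.

```python
from dataclasses import dataclass, field
from typing import Protocol, Sequence


@dataclass
class Extraction:
    """Hypothetical stand-in for an extraction result."""
    content: str
    metadata: dict[str, str] = field(default_factory=dict)


class Validator(Protocol):
    def validate(self, data: bytes, mime_type: str) -> None: ...


class PostProcessor(Protocol):
    def process(self, result: Extraction) -> Extraction: ...


class MaxSizeValidator:
    """Reject inputs above a size limit before extraction starts."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes

    def validate(self, data: bytes, mime_type: str) -> None:
        if len(data) > self.max_bytes:
            raise ValueError(f"{mime_type} input exceeds {self.max_bytes} bytes")


class WordCountProcessor:
    """Enrich the result with a word count."""

    def process(self, result: Extraction) -> Extraction:
        result.metadata["word_count"] = str(len(result.content.split()))
        return result


def run_pipeline(
    data: bytes,
    mime_type: str,
    validators: Sequence[Validator],
    processors: Sequence[PostProcessor],
) -> Extraction:
    for validator in validators:
        validator.validate(data, mime_type)
    # Placeholder for the real extraction step: decode the bytes as text.
    result = Extraction(content=data.decode("utf-8"))
    for processor in processors:
        result = processor.process(result)
    return result


result = run_pipeline(
    b"hello world", "text/plain", [MaxSizeValidator(1024)], [WordCountProcessor()]
)
```

The point of the sketch is the ordering: validators run before any extraction work is done, and post-processors transform the result afterwards.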
Production‑Ready Servers
- HTTP REST API – Axum server with OpenAPI documentation.
- MCP Server – Direct integration with Claude Desktop, Continue.dev, and other MCP clients.
- MCP‑SSE Transport (RC.8) – Server‑Sent Events for environments without WebSocket support.
All server modes share the same feature set: extraction, batch processing, and caching.
For more details, refer to the official Kreuzberg documentation and release notes.