The Perfect Extraction: Unlocking Unstructured Data with Docling + LangExtract 🚀
Source: Dev.to
The Structural Foundation: IBM Docling 📑
The first challenge in any extraction pipeline is converting “messy” formats into machine‑readable data without losing structural metadata. Docling is an open‑source toolkit that streamlines this process, turning unstructured files into JSON or Markdown that LLMs can easily digest.
Unlike traditional OCR, which can be slow and error‑prone, Docling uses specialized computer‑vision models like DocLayNet for layout analysis and TableFormer for recovering complex table structures. It identifies headers, list items, and even equations while maintaining their hierarchical relationships.
How to start with Docling
from docling.document_converter import DocumentConverter
source = "https://arxiv.org/pdf/2408.09869" # PDF path or URL
converter = DocumentConverter()
result = converter.convert(source)
# Export to Markdown for LLM readiness
print(result.document.export_to_markdown())
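To see why Markdown output is "LLM-ready", consider how a recovered table serializes. The sketch below builds the Markdown by hand from hypothetical cell data to mimic the kind of structure-preserving output Docling emits; it is an illustration of the format, not Docling's API.

```python
# Illustrative: serialize a recovered table to Markdown, the kind of
# structure-preserving text an LLM can consume directly.
# The cell data here is hypothetical, not produced by Docling.
header = ["Quarter", "Revenue", "Growth"]
rows = [
    ["Q1", "$1.2M", "4%"],
    ["Q2", "$1.5M", "25%"],
]

def to_markdown_table(header, rows):
    """Render header + rows as a Markdown pipe table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join("---" for _ in header) + "|",
    ]
    lines += ["| " + " | ".join(r) + " |" for r in rows]
    return "\n".join(lines)

print(to_markdown_table(header, rows))
```

Unlike flat OCR text, this representation keeps rows and columns aligned with their headers, which is exactly the hierarchy Docling's TableFormer recovers.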
The Semantic Engine: Google’s LangExtract 🧠
Once you have clean text, you need a way to pull out specific, structured information. LangExtract is a Python library designed to transform raw text into rigorously structured data based on user‑defined schemas and few‑shot examples.
Its defining feature is Precise Source Grounding, which maps every extracted entity to its exact character offsets in the original text. This is critical for sensitive domains like healthcare (clinical notes) or legal services, where every data point must be auditable.
Setting up a LangExtract task
import langextract as lx
# 1. Define the extraction rules
prompt = "Extract characters and their emotional states."
# 2. Provide few‑shot examples for schema enforcement
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks?",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"},
            )
        ],
    )
]
# 3. Run the extraction
result = lx.extract(
    text_or_documents="Lady Juliet gazed longingly at the stars...",
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)
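Because each extraction carries character offsets, grounding can be verified by slicing the source text. The snippet below uses a hardcoded offset standing in for the `char_interval` LangExtract attaches to each extraction, so it runs without an API call; treat the dict layout as illustrative.

```python
# Illustrative: verify that an extraction's offsets point at its text.
# The offsets are hardcoded stand-ins for the char_interval that
# LangExtract records for each extracted entity.
text = "Lady Juliet gazed longingly at the stars..."

extraction = {
    "extraction_class": "character",
    "extraction_text": "Lady Juliet",
    "char_interval": {"start_pos": 0, "end_pos": 11},
}

span = extraction["char_interval"]
grounded = text[span["start_pos"]:span["end_pos"]]

# The grounded slice must match the extracted text exactly;
# this equality is what makes every data point auditable.
assert grounded == extraction["extraction_text"]
print(grounded)
```

This check is the essence of Precise Source Grounding: any reviewer can reproduce an extraction from the original text alone.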
Achieving 100% Traceability: The Integrated Pipeline 🔍
The true magic happens when you combine these two tools. LangExtract operates only on raw text strings, so feeding it documents directly forces manual file conversion that discards layout and provenance. By using Docling as the front end, you can parse diverse formats into a rich, unified representation that retains page numbers and bounding boxes.
This integration creates a seamless pipeline in which semantic data extracted by LangExtract can be mapped back, through Docling's metadata, to its exact physical location on a PDF page, providing 100% traceability both in text and visually.
Conceptual Integrated Workflow
# Conceptual: Using Docling for provenance‑aware extraction
from docling.document_converter import DocumentConverter
import langextract as lx
# Step 1: Convert with Docling to preserve metadata
converter = DocumentConverter()
conv_result = converter.convert("report.pdf")
text = conv_result.document.export_to_text()
# Step 2: Extract with LangExtract
result = lx.extract(text_or_documents=text, ...)
# Step 3: Map offsets back to Docling's page/bbox metadata
# (Conceptual integration for visual auditability)
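Step 3 can be made concrete with a small helper. The page spans below are hypothetical stand-ins for the per-item provenance Docling records during conversion; the point is the offset arithmetic, not the exact field names.

```python
from bisect import bisect_right

# Hypothetical provenance table: (page_number, start_char, end_char)
# standing in for the page/bbox metadata Docling attaches per item.
page_spans = [
    (1, 0, 1200),
    (2, 1200, 2650),
    (3, 2650, 4000),
]

def page_for_offset(offset, spans):
    """Return the page whose character range contains the offset."""
    starts = [start for _, start, _ in spans]
    idx = bisect_right(starts, offset) - 1
    if idx < 0 or offset >= spans[idx][2]:
        raise ValueError(f"offset {offset} is outside the document")
    return spans[idx][0]

# An entity LangExtract grounded at character 1500 maps to page 2.
print(page_for_offset(1500, page_spans))
```

In a real pipeline, the spans would come from Docling's conversion result and the offsets from LangExtract's grounded extractions, closing the loop from semantic entity to physical page location.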
Production Benefits and Industry Impact 📈
- RAG & Graph‑RAG: High‑recall, structured output is ideal for feeding Knowledge Graphs or advanced Retrieval‑Augmented Generation systems.
- Auditability: Interactive HTML visualizations let human‑in‑the‑loop reviewers click an extracted entity and see it highlighted directly in the original context.
- Domain Adaptability: The pipeline can be adapted for radiology reports (RadExtract), financial summaries, or resume parsing without expensive model fine‑tuning.
Conclusion: The Future of Document Intelligence ✨
By uniting Docling’s structural layout analysis with LangExtract’s grounded semantic reasoning, developers can move past “fragmented” extractions. This synergy turns unstructured documents into “structured gold” with a complete, verifiable audit trail for every data point.
The Pipeline Metaphor
Docling is a meticulous librarian who takes a pile of loose, unnumbered pages and organizes them into a bound book with a detailed table of contents. LangExtract is the expert researcher who reads that book, highlighting every vital fact with a neon marker and leaving a precise bookmark that points exactly to the sentence used as proof. Without the librarian, the researcher’s desk is a mess; without the researcher, the librarian’s work is just an organized pile of unread information.