Beyond OCR: Building a Truly Multimodal Local RAG Pipeline

Published: March 11, 2026 at 03:49 AM EDT

8 min read

Source: Dev.to

The Problem with Classic RAG Pipelines

If you’ve ever tried to build a document chatbot over a collection of scanned reports, technical manuals, or mixed‑content PDFs, you’ve probably run into the same wall: classic RAG pipelines are essentially blind.

They extract text, chunk it, embed it, and retrieve it — but the moment your document contains a scanned table, a wiring diagram, or an annotated chart, that information either gets mangled by OCR or vanishes entirely. The retrieved context is impoverished, and your chatbot’s answers reflect that.

Ask it about the diagram on page 12, and it will confidently summarise the paragraph next to it, which is arguably worse than saying nothing at all.

A Better Way

Instead of treating documents as bags of text, treat them the way a human would: read the page as a whole, visuals included. For pages that contain native, selectable text, you don’t have to choose between precision and visual understanding—you can have both.

The standard pipeline — OCR → chunking → embedding → vector search → LLM — was designed for text‑native documents. When applied to rich, heterogeneous content, it breaks down in predictable ways:

Scanned tables lose their structure and become an unreadable string of values.
Technical diagrams are reduced to a handful of disconnected labels.
Spatial relationships (captions, callouts, annotations) are destroyed.
Charts and graphs lose all their meaning once flattened to text.

The root problem is that OCR reduces a two‑dimensional, semantically rich object (a page) to a one‑dimensional stream of characters. You can’t recover what was never captured—it’s like describing a painting by reading the label on the frame: technically accurate, entirely useless.
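To make the loss concrete, here is a small illustration in plain Python. The table contents are invented for the example, and `flatten` only mimics what a naive text extraction might produce; it is not how any particular OCR engine works. The same rows rendered as Markdown keep their row and column boundaries.

```python
# Hypothetical table contents, invented purely for illustration.
table = [
    ["Part", "Voltage", "Current"],
    ["R1", "5 V", "10 mA"],
    ["R2", "12 V", "2 mA"],
]


def flatten(rows):
    """Mimic a naive extraction: every cell lands in one 1-D stream of text."""
    return " ".join(cell for row in rows for cell in row)


def to_markdown(rows):
    """Render the same rows as a Markdown table, keeping row/column boundaries."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |"]
    lines.append("|" + "|".join(" --- " for _ in header) + "|")
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


flat = flatten(table)
markdown = to_markdown(table)
print(flat)
print(markdown)
```

In the flattened string, nothing tells you whether "5 V 10 mA" is one cell, two cells, or four, which is exactly the structure a retriever or LLM would need.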

Complementary Approaches: Native Text + Vision‑Language Models

Native text extraction and Vision‑Language Models (VLMs) are not competing approaches—they are complementary. Each fills the gaps left by the other.

Approach	Strengths
Native text (via PyMuPDF)	• Exact, faithful representation of characters • Negligible compute cost • No model in the loop, so no risk of hallucination
Vision‑Language Models	• Understand layout, visual semantics, and spatial relationships • Capture tables, diagrams, charts, and images that pure text extraction misses

Recommended Pipeline

Hybrid pages (native text + visual elements)
- Use native extraction for the prose.
- Apply a VLM only to the visual components (tables, diagrams, charts, images).
Text‑only pages
- Use native extraction alone; no VLM call is needed.
Fully scanned pages
- Let the VLM handle the entire page, as there is no native text to extract.

By combining both methods, you obtain a complete, accurate representation of the document’s content.
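The routing described above, including pages that carry only native text, can be sketched as a small pure function. `Strategy`, `choose_strategy`, and the `min_chars` threshold are illustrative names and values, not part of any library; the default of 100 characters is an assumption you would tune on your own documents.

```python
from enum import Enum


class Strategy(Enum):
    HYBRID = "native text + VLM on visuals"
    TEXT_ONLY = "native text only"
    FULL_VLM = "VLM on the whole page"


def choose_strategy(native_text: str, has_visuals: bool, min_chars: int = 100) -> Strategy:
    """Pick an extraction strategy for one page.

    `min_chars` is an assumed threshold for "enough native text".
    """
    has_text = len(native_text.strip()) > min_chars
    if has_text and has_visuals:
        return Strategy.HYBRID
    if has_text:
        return Strategy.TEXT_ONLY
    # Little or no native text: treat the page as scanned.
    return Strategy.FULL_VLM
```

Keeping the decision separate from the extraction calls makes it easy to unit-test the routing without a PDF or a running model.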

The Combined Pipeline

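import fitz               # PyMuPDF
from pdf2image import convert_from_path   # requires Poppler to be installed
import ollama

# Load the PDF and render each page as an image
doc = fitz.open("document.pdf")
page_images = convert_from_path("document.pdf", dpi=200)

for i, page in enumerate(doc):
    native_text = page.get_text()
    # The VLM helpers below expect a file path, so save the rendered page to disk
    image_path = f"page_{i + 1}.png"
    page_images[i].save(image_path)
    # Pass `page` and `image_path` to the processing function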
Detecting Visuals

def page_has_visuals(page) -> bool:
    """
    Return ``True`` if the page contains any images or vector drawings.
    """
    images = page.get_images()
    drawings = page.get_drawings()
    return bool(images) or bool(drawings)
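One caveat with this check: vector drawings also include rules, table borders, and underlines, so many text-only pages will count as "visual". A possible refinement, sketched here under assumptions, is to require that visuals cover a minimum share of the page. `covered_fraction`, `has_significant_visuals`, and the 5% default are hypothetical, and the geometry is deliberately rough.

```python
def covered_fraction(page_rect, boxes):
    """Rough share of the page area covered by `boxes`.

    Rectangles are (x0, y0, x1, y1) tuples. Overlapping boxes are counted
    twice, so the result is capped at 1.0; this is a heuristic, not exact
    geometry.
    """
    px0, py0, px1, py1 = page_rect
    page_area = max((px1 - px0) * (py1 - py0), 1e-9)
    total = 0.0
    for x0, y0, x1, y1 in boxes:
        # Clip each box to the page, then add its area.
        w = max(0.0, min(x1, px1) - max(x0, px0))
        h = max(0.0, min(y1, py1) - max(y0, py0))
        total += w * h
    return min(total / page_area, 1.0)


def has_significant_visuals(page_rect, boxes, min_fraction=0.05):
    """True if visuals cover at least `min_fraction` of the page (assumed default: 5%)."""
    return covered_fraction(page_rect, boxes) >= min_fraction


# With PyMuPDF, the inputs could be gathered roughly like this (assumption,
# check the key names against your PyMuPDF version):
#   boxes = [tuple(info["bbox"]) for info in page.get_image_info()]
#   boxes += [tuple(d["rect"]) for d in page.get_drawings()]
#   has_significant_visuals(tuple(page.rect), boxes)
```

A single hairline across the page then falls below the threshold, while a chart or photo of reasonable size still triggers the VLM.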

VLM Helpers (using Ollama)

def describe_visuals(image_path: str) -> str:
    """
    Describe only the non‑textual elements on a page.

    The prompt asks the model to focus on tables, diagrams, charts,
    images, and graphs, transcribing tables in Markdown/JSON and
    ignoring plain‑text paragraphs.
    """
    prompt = """Focus only on non‑textual elements on this page:
    - Tables, diagrams, charts, images, graphs
    - Describe their content and what they convey
    - For tables, transcribe their content in Markdown or JSON
    - Ignore plain‑text paragraphs
    - If there are no visual elements, say so briefly."""
    
    response = ollama.chat(
        model="llama3.2-vision",
        messages=[{"role": "user", "content": prompt, "images": [image_path]}],
    )
    return response["message"]["content"]


def describe_full_page(image_path: str) -> str:
    """
    Exhaustively describe a fully scanned page, including text,
    tables, diagrams, charts, and graphs.
    """
    prompt = """Describe this document page exhaustively:
    - If you see a table: transcribe its full content in a structured way
    - If you see a diagram or chart: describe its elements and relationships
    - If you see text: transcribe it faithfully
    - If you see a graph: describe the data and visible trends
    Be precise and thorough."""
    
    response = ollama.chat(
        model="llama3.2-vision",
        messages=[{"role": "user", "content": prompt, "images": [image_path]}],
    )
    return response["message"]["content"]

Page‑Level Processing

def process_page(page, page_image_path: str) -> str:
    """
    Decide how to extract information from a page:

    * If the page has both substantial native text and visuals,
      combine the exact text with a VLM description of the visuals.
    * If it contains only text, return the native extraction.
    * If it is a scanned image without reliable text, let the VLM
      describe the entire page.
    """
    native_text = page.get_text().strip()
    has_text = len(native_text) > 100          # heuristic: enough prose?
    has_visuals = page_has_visuals(page)

    if has_text and has_visuals:
        # Best of both worlds: precise text + VLM for visuals
        visual_desc = describe_visuals(page_image_path)
        return (
            f"## Extracted text\n{native_text}\n\n"
            f"## Visual elements\n{visual_desc}"
        )
    elif has_text:
        # Text‑only page: no VLM needed
        return native_text
    else:
        # Fully scanned page: VLM takes over entirely
        return describe_full_page(page_image_path)

Benefits of the Combined Approach

Benefit	How It’s Achieved
Exact recall	Queries for a specific article number or technical specification match the native text verbatim.
Semantic recall	Queries like “the heat‑flow diagram” or “the comparison table” match the VLM’s description.
Structural fidelity	Tables are indexed as structured Markdown/JSON, not as a garbled sequence of cell values.
Compute efficiency	The VLM runs only when visual elements are present, keeping ingestion time reasonable.

Going Further: ColPali (Visual‑Only Retrieval)

For the most demanding use cases, ColPali takes a fundamentally different approach: it embeds document pages directly as images, without any intermediate text representation. Text queries are embedded into the same vector space as the page images, and retrieval ranks pages by how closely their embeddings match the query's.

from colpali_engine.models import ColPali, ColPaliProcessor

model = ColPali.from_pretrained("vidore/colpali-v1.2")
processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.2")
# ... further ingestion / retrieval code ...

The snippet above shows the basic model loading; the full ingestion pipeline would involve converting each page to an image, feeding it through processor, and storing the resulting embeddings for later similarity search.
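For the similarity search itself, ColPali represents each page and each query as a set of vectors (one per image patch or query token) and compares them with a ColBERT-style late interaction, often called MaxSim. The sketch below shows that scoring rule in plain Python on toy 2-D vectors. The function names and the embeddings are invented for illustration; a real pipeline would use the tensors produced by the model and a batched implementation.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, page_vecs):
    """Late-interaction score: for each query vector, take its best-matching
    page vector (max dot product), then sum those maxima."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(query_vecs, pages):
    """Return page ids sorted by descending score. `pages` maps id -> list of vectors."""
    scores = {pid: maxsim_score(query_vecs, vecs) for pid, vecs in pages.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy 2-D embeddings, invented for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_3": [[0.9, 0.1], [0.2, 0.8]],
    "page_7": [[0.5, 0.5], [0.4, 0.4]],
}
ranking = rank_pages(query, pages)
print(ranking)
```

Because every query token looks for its own best patch, a page can score well when different regions of it answer different parts of the query.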

TL;DR

Native text extraction – precise, hallucination‑free prose.
Vision‑Language Models – understanding of tables, diagrams, charts, and other visuals.
Combine them – a robust RAG pipeline that works on heterogeneous PDFs, improves recall, and stays computationally efficient.

ColPali‑v1.2

Both page images and text queries are embedded directly; retrieval happens in the visual embedding space.

The benefit is that nothing is lost to an intermediate text representation: the layout itself is part of the index. ColPali consistently ranks among the best‑performing models on document‑retrieval benchmarks, particularly for visually complex pages. It can also be combined with the hybrid approach above: use ColPali for retrieval, then pass the retrieved page image plus its extracted text to the LLM for generation.

Example: Storing Page Descriptions with Chroma

from langchain_community.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings

# Initialize the embedding model
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Create a Chroma vector store that persists to ./db
vectorstore = Chroma(
    embedding_function=embeddings,
    persist_directory="./db",
)

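# Index each combined page description.
# `page_description`, `filename`, and `page_num` come from your ingestion loop
# (for example, the string returned by process_page for each page).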
vectorstore.add_texts(
    texts=[page_description],
    metadatas=[{"source": filename, "page": page_num}],
)
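When indexing many pages, it can help to build the texts, metadata, and IDs in one place before calling the vector store. The helper below is a minimal sketch: `build_records`, the file name, and the ID format are hypothetical choices, not something Chroma or LangChain require.

```python
def build_records(filename, page_descriptions):
    """Turn per-page descriptions into parallel lists for a vector store.

    `page_descriptions` maps 1-based page numbers to the combined description.
    Empty descriptions are skipped. IDs are derived from the file name and
    page number, so the same page always gets the same ID.
    """
    texts, metadatas, ids = [], [], []
    for page_num in sorted(page_descriptions):
        description = page_descriptions[page_num].strip()
        if not description:
            continue
        texts.append(description)
        metadatas.append({"source": filename, "page": page_num})
        ids.append(f"{filename}#page={page_num}")
    return texts, metadatas, ids


texts, metadatas, ids = build_records(
    "manual.pdf",  # hypothetical file name
    {2: "## Extracted text\n...", 1: "Title page", 3: "   "},
)
# These lists can then be passed in one call, e.g.
#   vectorstore.add_texts(texts=texts, metadatas=metadatas, ids=ids)
```

Stable IDs should let a re-ingested page replace its earlier entry instead of creating a duplicate, but verify that behaviour against the Chroma and LangChain versions you run.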

Practical Tips for VLM‑Powered Document Retrieval

Chunk at the page level
A page is a natural semantic unit for a VLM. Splitting mid‑page breaks the visual context the model needs to produce a coherent description.
Keep the original images
Store the source page image alongside its description. When a page is retrieved, you can pass the image directly to the LLM as additional context—especially useful for complex visuals that are hard to describe fully in text.
Tailor VLM prompts to document type
A technical schematic, a financial report, and a product datasheet each warrant different prompting strategies. Investing in prompt templates per document category pays off in description quality.
Request structured output for tables
When a page contains tabular data, explicitly ask the VLM to output Markdown or JSON. This preserves structure in a way that plain prose cannot, and makes the indexed content far easier for the LLM to reason over.
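The last two tips can be combined in a small prompt builder. This is only a sketch: the category names, the instruction wording, and `build_prompt` are hypothetical and would need tuning against your own documents and model.

```python
# Hypothetical per-category instructions; adapt the wording to your documents.
PROMPT_TEMPLATES = {
    "schematic": "List every component, label, and connection you can see.",
    "financial_report": "Transcribe each table as Markdown and state the units and periods.",
    "datasheet": "Extract specification tables as Markdown and note any footnotes.",
}

DEFAULT_INSTRUCTIONS = "Describe tables, diagrams, charts, and images on this page."
TABLE_RULE = "For tables, output Markdown or JSON only, one table per block."


def build_prompt(doc_type: str) -> str:
    """Compose a VLM prompt for a document category, falling back to a generic one."""
    instructions = PROMPT_TEMPLATES.get(doc_type, DEFAULT_INSTRUCTIONS)
    return f"{instructions}\n{TABLE_RULE}"


print(build_prompt("schematic"))
```

The returned string can be used as the `content` of the user message in the VLM helpers, in place of a single hard-coded prompt.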

Why Combine Native Text and VLMs?

The classic OCR‑based RAG pipeline was never designed for visually rich documents. The solution isn’t to replace text extraction with a VLM—it’s to use both in concert:

Native text provides precision and reliability.
VLM adds visual understanding.

Together, they produce page descriptions that are richer than either could achieve alone—think of it as hiring both a speed‑reader and an art critic and having them share a desk.

End‑to‑End Local Stack

Combined with a fully local stack, this approach gives you a document chatbot that can reason over:

Tables
Diagrams
Charts
Mixed content

All without any data leaving your infrastructure. The tooling is mature, the models are capable, and the entire pipeline runs on commodity hardware. There’s no reason to settle for text‑only anymore.