What Changed When Our Research Pipeline Hit a PDF Wall (Production Case Study)
Source: Dev.to
March 12, 2025 – Incident Overview
The PDF‑ingestion pipeline for a document‑heavy product hit a hard limit: nightly batches that previously finished in three hours began spilling into the business day, causing timeouts, missed SLAs, and angry support tickets.
- Product: Live production feature used by legal teams to search across contracts, scanned exhibits, and technical manuals.
- Impact: Lost user trust and a blocked roadmap that depended on faster, more reliable document understanding.
Discovery
We traced the outage to two linked problems:
- Brittle retrieval layer – failed on scanned PDFs with complex layouts.
- Orchestration scheme – treated every file as “same weight” during processing.
The existing pipeline used an off‑the‑shelf OCR + embedding flow that worked for plain text but degraded quickly on mixed‑layout documents (tables, figures, two‑column scans). The result was:
- High false‑negative rates for entity extraction.
- Queue backlog.
What we needed
A system that could do more than keyword matching: a repeatable, evidence‑driven research path for each document that combined:
- Layout‑aware parsing,
- Citation‑style provenance,
- Prioritized re‑processing.
That led us to evaluate specialized tooling focused on long‑form analysis. The team agreed we needed a dedicated Deep Research Tool to run a programmatic, document‑first investigation at scale without manually curating source lists.
Example failure
A legal brief processed through the old flow returned this error fragment in the logs:
```
Error: embedding failure - token overflow at pipeline.step.embed(4500 tokens)
```
Context: OCR produced repeated headers and malformed table output that corrupted the tokenizer input.
This concrete error drove two decisions:
- Limit token inputs via smart chunking.
- Move heavy context reasoning off to a deeper research layer that could synthesize across multiple pages before producing final extractions.
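The first decision can be sketched as a token‑budgeted splitter. This is a minimal illustration, not the production code: the whitespace "tokenizer" and the 4,000‑token budget are stand‑ins for the real tokenizer and its limit.

```python
# Token-budgeted chunking sketch. The whitespace "tokenizer" and the
# 4000-token budget are illustrative stand-ins for the real tokenizer
# and limit; the point is to never hand the embedder an oversized input.

MAX_TOKENS = 4000

def chunk_text(text, max_tokens=MAX_TOKENS):
    """Split text into chunks of at most max_tokens whitespace tokens."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```

With a hard cap like this in front of the embedder, the 4,500‑token overflow above becomes structurally impossible rather than merely unlikely.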
Implementation
Phase 1 – Stabilization (Week 1)
Goal: Introduce deterministic chunking and prioritize files.
- Inserted a lightweight pre‑processor that extracted page‑level layout metadata and classified pages as text‑first, table‑first, or image‑first.
- Routed documents down different micro‑pipelines.
```python
# routify.py – determine micro-pipeline based on layout
def route_document(layout_stats):
    if layout_stats['table_density'] > 0.3:
        return 'table_pipeline'
    if layout_stats['image_coverage'] > 0.25:
        return 'image_pipeline'
    return 'text_pipeline'

# used in orchestration
pipeline = route_document(extract_layout_stats(pdf_blob))
```
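The `extract_layout_stats` helper is not shown in the snippet; a minimal sketch of what it returns, assuming per‑page element areas have already been measured by the layout parser, might look like this:

```python
# Hypothetical sketch of the layout-stats step. Real layout parsing would
# come from an OCR/layout library; here we assume per-page element areas
# are already measured, and we only aggregate them into the two ratios
# that route_document() reads.

def extract_layout_stats(pages):
    """pages: list of dicts with 'table_area', 'image_area', 'page_area'."""
    total = sum(p["page_area"] for p in pages) or 1
    return {
        "table_density": sum(p["table_area"] for p in pages) / total,
        "image_coverage": sum(p["image_area"] for p in pages) / total,
    }
```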
Phase 2 – Deep‑Reasoning Overlay (Weeks 2‑3)
- Prototyped a research‑style assistant that could plan a reading strategy for each document batch (identify tables to extract, pages to OCR at higher quality, sections to prioritize for citation).
- Integrated a third‑party component that acts like an AI Research Assistant to orchestrate multi‑step passes: read → plan → extract → verify.
- Not a blind LLM call – it ran a plan, logged decisions, and attached provenance to every extracted fact.
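The control flow of those multi‑step passes can be sketched as follows; the extraction function is a placeholder for the real assistant call, and the shape of the loop (decision log plus per‑fact provenance) is the point.

```python
# Sketch of the read -> plan -> extract -> verify loop, with a decision
# log and per-fact provenance. extract_fn stands in for the real
# assistant call; only the control flow and bookkeeping are shown.

def run_research_pass(doc_id, pages, extract_fn):
    log, facts = [], []
    log.append(("plan", f"{doc_id}: {len(pages)} pages queued"))
    for page_no, text in enumerate(pages, start=1):
        for fact in extract_fn(text):
            # every extracted fact carries its provenance
            facts.append({"fact": fact, "source": (doc_id, page_no)})
            log.append(("extract", f"{doc_id} p{page_no}: {fact}"))
    return facts, log
```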
Friction discovered: The assistant sometimes conflated citations across documents when references were vague.
Fix:
- Added document‑scoped prefixes to all internal identifiers.
- Enforced a strict evidence threshold (two independent page‑level matches required for a claim to be promoted).
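Both fixes combine into a small gate, sketched here: claims are keyed by a document‑scoped identifier so vague references can never be conflated across documents, and a claim is promoted only when it has matches on at least two distinct pages.

```python
from collections import defaultdict

# Sketch of the evidence gate: a claim is promoted only when supported
# by matches on at least two distinct pages of the same document. Keys
# are document-scoped ("doc_id:claim") so identical claim text in two
# different documents can never be conflated.

def promote_claims(matches, min_pages=2):
    """matches: iterable of (doc_id, claim, page_no) tuples."""
    pages_for = defaultdict(set)
    for doc_id, claim, page_no in matches:
        pages_for[f"{doc_id}:{claim}"].add(page_no)
    return {key for key, pages in pages_for.items() if len(pages) >= min_pages}
```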
Phase 3 – Scale & Resilience (Weeks 4‑6)
- Replaced the synchronous single‑worker model with an async worker pool that scheduled deep passes only when fast‑path heuristics failed.
- Decreased worker contention and reserved heavy‑reasoning passes for worst‑case documents.
- Exposed a “deep audit” endpoint that let engineers replay reasoning steps for any extraction – critical during debugging.
```bash
# trigger a deep-research replay for a document id
curl -X POST "https://internal.api/research/replay" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"document_id":"doc-20250312-47","mode":"full-audit"}'
```
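The scheduling policy behind the async pool can be sketched like this. The two pass functions are placeholders for the real pipeline stages; the semaphore is how heavy reasoning is kept from starving the pool, as described above.

```python
import asyncio

# Sketch of the tiered scheduling policy: try the cheap fast path first
# and queue a deep-research pass only when it fails (returns None). The
# semaphore caps concurrent deep passes so heavy reasoning cannot starve
# the worker pool. Both pass functions are placeholders.

async def process(doc, fast_pass, deep_pass, deep_slots):
    result = fast_pass(doc)
    if result is not None:
        return ("fast", result)
    async with deep_slots:          # reserved capacity for worst cases
        return ("deep", await deep_pass(doc))

async def run_batch(docs, fast_pass, deep_pass, max_deep=2):
    deep_slots = asyncio.Semaphore(max_deep)
    return await asyncio.gather(
        *(process(d, fast_pass, deep_pass, deep_slots) for d in docs)
    )
```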
Quality‑gate configuration
```yaml
# gates.yml – quality gate sample
fields:
  - name: counterparty
    required: true
    min_confidence: 0.85
  - name: effective_date
    required: false
    min_confidence: 0.70
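Enforcing those gates reduces to a small check over the extraction output, sketched below. The gate list mirrors the sample above; parsing the YAML itself is omitted for brevity.

```python
# Sketch of enforcing the quality gates: each extracted field must meet
# its min_confidence, and required fields must be present at all. The
# gate list mirrors the gates.yml sample; YAML parsing is omitted.

GATES = [
    {"name": "counterparty", "required": True, "min_confidence": 0.85},
    {"name": "effective_date", "required": False, "min_confidence": 0.70},
]

def passes_gates(extraction, gates=GATES):
    """extraction: dict mapping field name -> confidence score."""
    for gate in gates:
        conf = extraction.get(gate["name"])
        if conf is None:
            if gate["required"]:
                return False
            continue
        if conf < gate["min_confidence"]:
            return False
    return True
```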
Decision point:
- Option A: Expand the cheap embedding layer to handle noisy OCR.
- Option B: Adopt a dedicated deep‑research overlay and keep the cheap layer limited.
We chose Option B because expanding embeddings would have multiplied processing time across the entire corpus. Isolating complexity to problematic documents gave a better ROI. The overlay leveraged a specialized Deep Research AI mode that supported stepwise plans and stronger provenance, matching our need for reproducible, auditable extraction.
Results
After six weeks of incremental rollout:
| Metric | Before | After |
|---|---|---|
| Nightly batch completion time | > 3 h (many spilling into business day) | ≤ 3 h for 95 % of jobs |
| Deep‑research path usage | N/A | 5 % of jobs (problematic docs) |
| Queue backlog | Persistent, causing SLA misses | Eliminated |
| Manual re‑processes | High | Dropped dramatically |
| Extraction reliability | Fragile (frequent failures) | Stable – verifiable extractions via multi‑pass workflow |
Key takeaways:
- Deterministic chunking + layout‑aware routing removed the bulk of token‑overflow errors.
- The deep‑research overlay provided a safety net for the hardest documents without sacrificing overall throughput.
- Provenance‑rich audit logs gave engineers confidence during post‑mortems and satisfied compliance requirements.
TL;DR
- Problem: PDF pipeline overloaded, causing timeouts and SLA breaches.
- Solution: Layout‑aware pre‑processing, smart routing, and a deep‑research overlay with provenance.
- Outcome: 95 % of nightly batches finish under three hours, backlog cleared, and extraction reliability transformed from fragile to stable.
Reasoning
- The team reduced false negatives on entity extraction by a significant margin after introducing layout‑aware routing and deep passes.
- Operational cost stayed efficient: heavy passes were targeted, so average compute per document increased only modestly while overall throughput improved.
Trade‑offs were explicit. The deep‑research overlay added latency for a minority of documents and required more engineering oversight during early rollout. It also increased complexity in the debugging workflow, which we mitigated by adding the replay and audit endpoints shown above.
Qualitative ROI Summary
The architecture moved from “best‑effort extraction” to a “tiered confidence pipeline” that delivered reproducible results and much clearer debugging signals. The real lever was separating fast heuristics from slow, evidence‑heavy reasoning—this pattern is the core of modern document‑AI workflows and the reason teams often adopt a research‑style orchestration layer rather than pushing every document through a single model.
Practical takeaway
If your document pipeline fails at scale:
- Add layout‑aware routing.
- Isolate heavy reasoning into a targeted research pass.
- Require provenance for promoted facts.
These moves keep average costs low and results reliable.
Teams building similar features should evaluate solutions that offer:
- Multi‑pass planning
- Document‑scoped provenance
- Configurable quality gates
Those capabilities often make the difference between brittle extraction and a system engineers can trust in production.