비전 LLMs도 PDF 파서… 차트·그래프 읽어 RAG 지원
Source: 데이터 사이언스 companion in 기업 문서 인텔리전스, the series that builds an enterprise RAG system from four bricks. 기사 5 (문서 파싱) built the parser with PyMuPDF (fitz), which reads the words on a page. This companion swaps the engine for a vision LLM that reads the page as an image, so it gives you the words plus the one thing the text parsers cannot, the content of the pictures.
이 동반자는 기사 5(문서 파싱) 내부 Part II(네 개의 브릭)에서 다른 파싱 엔진을 사용한다는 위치에 있습니다 – 이미지 저자 제공
Show a PDF parser a chart and it sees an empty box. The text engines, native or cloud or local, all find the words on a page and put them in searchable tables. A chart has no words, so to every one of them the region is blank, and to a retrieval system it does not exist.
A vision model is different. It looks at the page the way a person would. Ask it for the text and it gives you the text and the tables, just like the others. Show it a chart and it tells you what the chart says, in plain words you can search. That last part is what the others can’t do.
The catch: it is slower, costs more, and reads numbers off a chart only roughly. It is also only as good as the model you pick. gpt-4.1 reads a chart that the cheaper gpt-4o-mini half-misses. So you don’t use it everywhere. You save it for the pages that are mostly pictures, where the other parsers come back empty.
1. 비전 모델이 할 수 있는 유일한 일: 이미지를 검색 가능하게 만들기
Start with the reason this parser exists at all. The textual engines turn a page into the relational tables from the earlier articles, but a figure defeats them: they return a chart as a bounding box in image_df with maybe a stray axis label. There is no text in a chart, so to OCR and to a layout model the region is empty, and to a retrieval system it does not exist.
OCR와 레이아웃은 박스를 반환하고, 비전 파서는 텍스트를 작성해 검색할 수 있게 합니다 – 이미지 저자 제공
A vision model reads the picture. Below are three figures pulled straight out of two PDFs: the Transformer diagrams from Attention Is All You Need (Vaswani et al. 2017) and the commodity-price charts from the World Bank Commodity Markets Outlook (April 2026 issue). Each figure sits next to the one-sentence description gpt-4.1 wrote for it. Source documents and licensing details are listed at the end of the article.
각 추출된 이미지는 텍스트 검색과 매치할 수 있는 한 문장 설명을 받습니다 – 이미지 저자 제공
The price chart is now a sentence: commodity price indices by sector, falling since their 2022 peak. A user searching for “commodity price index since 2022” can now hit that page. Before, there was nothing on it to match.
Here is the argument in its sharpest form. Picture a satellite image of a parking lot. It has no text at all. OCR finds nothing, layout finds one box, and to a retrieval system the image does not exist. A vision model writes “aerial view of a parking lot, roughly half full, around forty cars”. Now a search for parking occupancy finds it. That sentence is the parse, and only a vision model can produce it. OCR and layout cannot, by definition, because there were never any characters to read.
2. 텍스트와 표도 파싱합니다, 다른 엔진과 동일하게
The figure is the unique part, but a parser that only read pictures would be useless. A vision model reads the text and the tables too, and not worse than the textual engines on clean material. We pointed parse_page_vision at page 30 of the NIST 사이버보안 프레임워크, the Framework Core table, and asked for markdown. It returned the table columns intact, merged cells handled (the Function name sits on the first row of its block and the continuation rows leave it blank).
다른 엔진들이 재구성한 동일한 4열 표, 이미지에서 직접 읽힙니다 – 이미지 저자 제공
This is the same cell structure Docling and Azure produce from the same page in the two previous articles: they emit markdown tables too, so the format is not what sets vision apart. The vision model never built a table object; it read the grid off the picture and wrote markdown (it returns HTML just as well). So the claim from the lead holds: it is a parser, returning the reusable model the others return, plus the figures they cannot.
3. 모델이 중요합니다: gpt-4o-mini가 gpt-4.1이 읽는 차트를 놓칩니다
How good the parse is depends heavily on the model, and the gap shows precisely where it counts, on the figures. We ran the same CMO chart page through gpt-4o-mini and gpt-4.1.
두 모델 모두 페이지 텍스트와 표를 읽었습니다. 차트에서는 저렴한 모델이 절반 정도만 찾습니다 – 이미지 저자 제공
gpt-4o-mini found three of the six charts and labelled two of them as tables. gpt-4.1 found all six and transcribed their axes down to the month, including the policy-uncertainty and temperature-anomaly charts the smaller model missed. Both read the page text and the NIST table correctly. The weaker model fell down on the pictures, the one thing you brought vision in to do. So with this parser the model is part of the quality, not just a latency and cost knob: a cheaper vision model degrades gracefully on text and badly on figures.
4. 솔직한 교환: 정확성과 비용
None of this is free, and the catch is worth naming plainly. It is not that vision “isn’ t really parsing”, because it is. It is that the parse is less exact and costs more per page.
텍스트와 표에서는 동일하게, 비전은 이미지만 읽습니다. 정확성과 비용이 가격이라는 교환의 일부입니다 – 이미지 저자 제공
Two costs stand out.
Exactness, with two faces: The values it reads off a curve are approximate: the shape and the gist are right, a specific tick can be off, so treat a transcribed number as a lead, not a fact. Worse, it can silently omit an element, a row of a table or one chart in a panel, the way gpt-4o-mini dropped half the charts in section 3. That is a completeness problem, a kind of hallucination by omission, and a deterministic parser never has it: when fitz or Docling reads a table, no row goes missing.
비전은 차트의 형태를 복구하지만 정확한 값은 제공하지 않음; 전사된 숫자를 검증용 단서로 취급하십시오 – 이미지 저자 제공
Cost: Each page is a large image and a model call, billed per page, with