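Making PDF images searchable for RAG, and cutting cost by not reading all of them
Source: Towards Data Science. A companion to Enterprise Document Intelligence, a series that builds an enterprise RAG system on four bricks. Article 5 (document parsing) left us with one table, image_df, which locates every picture in the PDF but reads none of them. This part builds the reading toolbox: a cost‑ordered cascade (cheap filter, type check, classic OCR, vision model) that turns the few images worth paying for into searchable text.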
where this companion sits: it extends Article 5 (document parsing), inside Part II (the four bricks), reading the images the parser only located – Image by author
The parsing brick gives you image_df: one row per image in the PDF, with its page, its bounding box, its size, a content hash. That locates every picture. It does not say what any of them shows. For retrieval, that is the same as not having them: a bounding box is not something a user can search, and the image’s text slot, the place a description would live, is empty.
The reflex is to throw a vision model at every image and be done. That is the wrong default. A real document is full of images that carry nothing a reader would ever search for: the company logo in every page header, a horizontal rule drawn as a 2-pixel-tall picture, a bullet glyph, a decorative banner. Captioning those with a vision LLM is paying a model to describe a logo three hundred times.
So the job splits in two. First, the methods that turn an image into text, and what each one costs: a cheap filter, a type check, classic OCR, a vision model. Second, which images are actually worth spending on in a given run. That second half is driven by context. A body line that reads “Figure 3 below shows…” is the cue to read that figure with a vision model, and not its neighbours; the question being asked narrows it further. This article lays down the methods and shows what each returns, ordered by cost. Choosing which images to pay for, per document and per query, is adaptive parsing, and it has its own article (Article 10). Here we build the toolbox.
one extracted image in, a searchable description out, paying the cheapest method that can read it – Image by author
1. Most images are not worth a model call
The first step spends nothing. Before any OCR or vision call, a cheap filter looks at signals already in image_df plus a couple of pixel statistics, and drops the images with no retrieval value:
- Too small. An image whose shortest side is a few dozen pixels, or whose total area is below a small floor, is an icon or a bullet, not a figure. A size threshold removes most of them.
- The wrong shape. A picture that is very long and very thin is a rule or a divider, not content. An aspect‑ratio guard catches those.
- Repeated everywhere. The same content hash on most pages of the document is chrome: a header logo, a footer mark, a watermark. Counting how many pages an image hash appears on flags it as decoration, not information.
is_worth_analyzing applies these size and shape rules per image, and flag_worth_analyzing first derives the per‑page repeat frequency from the content hash, then adds a worth_analyzing column to image_df. Both live in docintel.parsing.pdf.images. The thresholds are deliberately loose: a false keep costs one model call later, a false drop loses content with no trace, so when in doubt the filter keeps the image. Flat, contentless images that are too big to fail the size test (a solid colour panel, say) are not forced through here; they are caught one step later as decorative and skipped just the same.
In: image_df (+ per‑image pixel stats). Out: the same table with a worth_analyzing flag.
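As a rough illustration of the three rules, here is a minimal sketch. It is not the docintel implementation: the ImageRow fields, the function names passes_size_and_shape and flag_rows, and every threshold are assumptions chosen for the example, and plain Python objects stand in for rows of image_df.

```python
from collections import defaultdict
from dataclasses import dataclass

# Illustrative thresholds (assumptions for this sketch, kept deliberately loose).
MIN_SIDE_PX = 32        # shortest side below this: icon or bullet
MIN_AREA_PX = 2_500     # total area below this: too small to be a figure
MAX_ASPECT = 20.0       # long side / short side above this: rule or divider
MAX_PAGE_SHARE = 0.5    # hash present on more than this share of pages: chrome


@dataclass
class ImageRow:
    """Hypothetical stand-in for one row of image_df."""
    page: int
    width: int
    height: int
    content_hash: str


def passes_size_and_shape(row: ImageRow) -> bool:
    """Size and aspect-ratio rules, applied to one image."""
    short, long_ = sorted((row.width, row.height))
    if short < MIN_SIDE_PX or row.width * row.height < MIN_AREA_PX:
        return False
    return long_ / short <= MAX_ASPECT


def flag_rows(rows: list[ImageRow], n_pages: int) -> list[bool]:
    """Return one worth-analyzing flag per row, in input order."""
    # Count the distinct pages each content hash appears on.
    pages_by_hash: dict[str, set[int]] = defaultdict(set)
    for row in rows:
        pages_by_hash[row.content_hash].add(row.page)

    flags = []
    for row in rows:
        share = len(pages_by_hash[row.content_hash]) / n_pages
        # A one-page document cannot show repetition, so never drop on it there.
        repeated = n_pages > 1 and share > MAX_PAGE_SHARE
        flags.append(passes_size_and_shape(row) and not repeated)
    return flags


rows = [
    ImageRow(1, 120, 40, "logo"),
    ImageRow(2, 120, 40, "logo"),
    ImageRow(3, 120, 40, "logo"),
    ImageRow(1, 16, 16, "bullet"),
    ImageRow(2, 1200, 40, "banner"),
    ImageRow(3, 640, 480, "fig3"),
]
for row, keep in zip(rows, flag_rows(rows, n_pages=3)):
    print(row.content_hash, keep)
```

In this toy document only fig3 is kept. The logo passes the size and shape rules but is dropped because its hash appears on every page, which is the one rule that needs the whole table, not a single row.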
On a typical report, this alone removes the large majority of images before a single model runs. What’s left is the handful that actually carry meaning.
2. What kind of image is it?
The images that survive the filter are not all read the same way. A screenshot of a table is text: classic OCR reads it cheaply and exactly. A line chart is not text at all; its meaning is in the axes and the trend, and only a vision model can put that into words. Sending the chart to OCR returns a few stray axis labels; sending the screenshot to a vision model pays chart prices for something OCR does for free.
So the second step classifies each kept image into one type:
- decorative: a blank or near‑uniform panel. Skip.
- text: a screenshot, a scanned region, a table rendered as an image. Reads with OCR.
- chart/diagram/photo: the meaning is visual. Reads with a vision model.
classify_image returns one ImageType from cheap pixel signals: how much the pixels vary, how saturated they are, how much of the image is near‑white background, how dense its edges are. A near‑uniform panel is decorative. The test there is worth dwelling on, because the obvious version is wrong: you cannot detect a blank panel by counting its colours. A real “all‑black” or “all‑white” region is never pixel‑perfect; sensor noise and JPEG compression give it hundreds of near‑identical colours, so a colour count sails right past it. What stays near zero on a blank panel, noise and all, is the dispersion of the pixel values, their standard deviation. Low dispersion means blank, whatever the colour count, so that is the signal. Black ink on a white page, near‑zero saturation with real stroke structure, is text. A saturated, full‑bleed image with no white margins is a photo. Everything else, every uncertain case, falls through to chart.
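To make the dispersion argument concrete, the sketch below computes the four signals on a small grid of RGB tuples and applies the same decision order. It is an illustration under stated assumptions, not the library's classify_image: ImageKind, pixel_signals, classify_pixels, and all cut-off values are hypothetical, and pure Python lists stand in for real image arrays.

```python
import colorsys
import random
from enum import Enum
from statistics import pstdev


class ImageKind(Enum):  # hypothetical stand-in for the library's ImageType
    DECORATIVE = "decorative"
    TEXT = "text"
    PHOTO = "photo"
    CHART = "chart"


def luminance(pixel):
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def pixel_signals(grid):
    """Cheap statistics over a grid (list of rows) of 8-bit RGB tuples."""
    flat = [p for row in grid for p in row]
    lum = [luminance(p) for p in flat]
    sat = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)[1] for r, g, b in flat]
    # Edge density: share of horizontal neighbours with a sharp luminance jump.
    jumps = [abs(luminance(a) - luminance(b)) > 60
             for row in grid for a, b in zip(row, row[1:])]
    return {
        "std": pstdev(lum),
        "saturation": sum(sat) / len(sat),
        "white": sum(v > 240 for v in lum) / len(lum),
        "edges": sum(jumps) / max(len(jumps), 1),
        "colours": len(set(flat)),  # reported only to show why it misleads
    }


def classify_pixels(grid):
    s = pixel_signals(grid)
    if s["std"] < 8:  # low dispersion: blank panel, whatever its colour count
        return ImageKind.DECORATIVE
    if s["saturation"] < 0.08 and s["white"] > 0.5 and s["edges"] > 0.02:
        return ImageKind.TEXT  # dark ink on white with real stroke structure
    if s["saturation"] > 0.35 and s["white"] < 0.02:
        return ImageKind.PHOTO  # saturated and full-bleed
    return ImageKind.CHART  # every uncertain case falls through


# A "black" panel with mild noise: many distinct colours, tiny dispersion.
rng = random.Random(0)
noisy_black = [[tuple(rng.randint(0, 6) for _ in range(3)) for _ in range(64)]
               for _ in range(64)]
stats = pixel_signals(noisy_black)
print(stats["colours"], round(stats["std"], 2), classify_pixels(noisy_black))
```

The noisy panel reports hundreds of distinct colours, so a colour‑count rule would treat it as content, while its luminance standard deviation stays far below the cut‑off and the sketch marks it decorative.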
Notice what is not in that list: a step that decides “this looks like a logo”. That is on purpose, and it is the same lesson as the blank panel. A logo can be two flat colours, a black wordmark on white, or a full‑colour gradient with soft edges. Counting colours catches the first and misses the second, and worse, the two‑colour test also catches a bilevel scan of real text you wanted to read. Appearance does not tell you it is a logo. Behaviour does: a logo is chrome because it repeats, the same mark in every page header. That signal already ran, back in the filter, which drops an image whose content hash recurs across pages no matter how many colours it has. A logo that appears only once, a mark on a cover page, is not worth a special case; it gets read like anything else, a wordmark falling to free OCR, a graphic to a single vision call. The rule throughout is the same: skip only what you are sure is empty or chrome, and read everything else, because a wrong skip loses content silently.
That fall‑through to chart is the other important design choice. Classifying a chart against a diagram against a photo on cheap signals alone is not reliable, so the classifier does not try to be clever: it only diverts an image to cheap OCR when it is confident the image is clean monochrome text, and sends everything else to the vision model, which reads charts, diagrams, photos, and any text they happen to contain. The bias is asymmetric on purpose. A missed OCR shortcut costs one vision call; OCR run on a diagram returns a handful of stray axis labels and nonsense. So when in doubt, the classifier pays for vision.
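The routing that follows from this bias fits in a few lines. The sketch below is only an illustration: read_image, the string labels, and the stub readers are assumptions made for the example, and a real pipeline would plug in an actual OCR engine and vision model.

```python
from typing import Callable, Optional

Reader = Callable[[bytes], str]


def read_image(image: bytes, image_type: str,
               ocr: Reader, vision: Reader) -> Optional[str]:
    """Route one kept image to the cheapest reader allowed for its type."""
    if image_type == "decorative":
        return None           # sure it is empty: skip, no call at all
    if image_type == "text":
        return ocr(image)     # confident clean monochrome text: cheap path
    # chart, diagram, photo, or any label we do not recognise: pay for vision
    return vision(image)


# Stub readers for the example; real ones would call an OCR engine or a model.
calls = []


def fake_ocr(image: bytes) -> str:
    calls.append("ocr")
    return "ocr text"


def fake_vision(image: bytes) -> str:
    calls.append("vision")
    return "vision description"


for label in ["decorative", "text", "chart", "unknown"]:
    print(label, "->", read_image(b"", label, fake_ocr, fake_vision))
```

Only the explicit text label reaches OCR. Anything else that is not decorative, including a label the router has never seen, goes to the vision reader, so a classification mistake costs one extra model call and does not silently lose content.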