It Is Not 91% Accuracy

Published: June 18, 2026, 10:00 PM GMT+9
7 min read
Source: Dev.to

The April 2026 New York Times commission of Oumi to test Google’s AI Overviews against the SimpleQA benchmark produced two numbers that were widely reported and one that mostly was not. The widely reported numbers: 85% accuracy on Gemini 2 in the AI Overview slot, 91% on Gemini 3. Roughly one in ten answers wrong, in headlines from TechSpot, Futurism, Newsweek, BigGo, TechRepublic, Breitbart, Computing.co.uk, Newsbytes, Algorythmic, and DigitalToday.

The number that mostly didn’t make the headlines, but should have: among the answers the benchmark scored as correct, Oumi tracked how often the AI Overview’s stated claim was actually supported by the source it cited, and the unsupported rate grew between the model upgrades — 37% of correct answers ungrounded on Gemini 2, 56% on Gemini 3. The model got more accurate; its summaries got less faithful to what their citations actually said.

That is the part of the story that I want to spend most of this essay on, because once you sit with it for a moment it stops looking like a quirk of one analysis and starts looking like the shape of the entire AI‑search class of product. The 9% error number is interesting; the source‑claim divergence is structural; and the trust‑budget the interface establishes against either of them is the thing that determines whether your week of casually reading AI‑summarised search results was useful or actively misleading.

The arithmetic is unkind. SimpleQA is OpenAI’s 4,326‑question benchmark of short fact‑seeking questions, each constructed to have a single time‑stable answer that two independent annotators agreed on, and each filtered through a third annotator on a thousand‑question subset for additional QA. It is a clean benchmark — almost cruelly so. The questions are not the kind of thing your laptop’s AI search receives in a normal day. SimpleQA asks “Who was the second‑place finisher in the 1992 IOC presidential election?” and your laptop is asked to compare two pairs of trail‑running shoes that were released last quarter.

The benchmark is not load-bearing on the realism front. It is load-bearing on the "can the model retrieve a fact that it has the data for" front.

Google’s response to the analysis was that real users don’t ask SimpleQA‑shaped questions; their internal benchmarking, on more representative queries, produces different (better, in their telling) numbers. That’s a defensible point, and at the same time the standalone Gemini 3 hallucination rate Google itself disclosed in their pushback was around 28% — measured on Google’s own internal benchmark, not SimpleQA, so the two numbers don’t subtract cleanly. The directional point survives: grounding is doing real work, and the 9% on SimpleQA is the residual after RAG has already suppressed a substantial fraction of standalone failure.

The 9% that remains is what’s left after the work is done — the residual failures that grounding cannot fix because they don’t live inside the model’s pretraining; they live in the seam between the model and the index it’s allowed to consult.

There are four obvious places to look for the seam, and the Oumi analysis and the surrounding industry literature taken together implicate all of them.

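| Failure stage | What goes wrong | Example | Caught by grounding? |
|---|---|---|---|
| Query interpretation / branching | The natural-language question is parsed into the wrong sub-questions, which never recompose the question as a whole. | "Did this drug interact with that one in the trial?" branches into "what did this drug do?" and "what did the other drug do?", and the interaction question is never asked. | No |
| Source ranking | The search engine returns documents that are popular rather than authoritative. | A Reddit comment thread ranks above the manufacturer's spec sheet for a question about a specific manufacturer spec. | No |
| Fact compilation | The model picks the modal (majority) claim found across sources instead of the correct one. | Three of five blogs say the protein is X, the truth is Y, so the AI Overview answers X. | Partial: depends on search-engine quality and reranking |
| Post-processing / smoothing | The fluent generator rephrases the cited claim into a form the actual citation does not support. | Of the 91% of SimpleQA answers Gemini 3 got right, 56% show a gap between claim and citation, up from 37% of Gemini 2's 85% correct answers. | No: this is the gap that grounding does not catch. |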
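The fact-compilation row is easy to reproduce in miniature. The sketch below is purely illustrative: `modal_claim` and the `retrieved` list are hypothetical names invented here, and nothing in the Oumi analysis says AI Overviews literally takes a majority vote. It only shows why picking the most common claim returns X when three of five sources are wrong.

```python
from collections import Counter


def modal_claim(claims):
    """Return the claim asserted by the largest number of sources.

    Hypothetical stand-in for a compilation step that favours agreement
    between sources over the authority of any single source.
    """
    return Counter(claims).most_common(1)[0][0]


# Five retrieved sources: three blogs repeat the wrong value "X",
# two sources carry the correct value "Y".
retrieved = ["X", "X", "X", "Y", "Y"]
truth = "Y"

answer = modal_claim(retrieved)
print(f"answer={answer}, correct={answer == truth}")
```

Better retrieval or reranking changes what ends up in `retrieved`, which is why the table marks this stage as only partially addressed.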

That last row is where the source-claim divergence number comes from. The model is grounded on real documents, retrieves them in a sensible-looking order, and then rewrites the answer in a way that sounds authoritative and confident but doesn't faithfully match the document it cites. The 56% rate is a share of the correct answers: among the 91 in 100 that scored as right under SimpleQA, about 51 (56% of 91) had a gap between the headline claim and the citation chain. The headline claim was right enough; the citation underneath wasn't faithful to what the source actually said.
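A quick way to see what the two percentages imply together is to convert them to counts per 100 questions. The sketch below assumes, as the essay states, that the unsupported rate is measured only over answers scored correct; the function name `split_per_100` is made up for illustration, and the rounded counts are derived here rather than quoted from Oumi.

```python
def split_per_100(accuracy_pct, unsupported_pct):
    """Split 100 benchmark questions into rounded counts.

    accuracy_pct:    share of answers scored correct (e.g. 91)
    unsupported_pct: share of those correct answers whose cited source
                     does not support the claim (e.g. 56)
    Returns (correct, correct_but_unsupported, correct_and_supported).
    """
    correct = accuracy_pct
    unsupported = correct * unsupported_pct / 100
    supported = correct - unsupported
    return round(correct), round(unsupported), round(supported)


for name, acc, unsup in [("Gemini 2", 85, 37), ("Gemini 3", 91, 56)]:
    correct, unsupported, supported = split_per_100(acc, unsup)
    print(f"{name}: {correct} correct, {unsupported} unsupported, "
          f"{supported} correct and supported")
```

Under that assumption, the count of answers that are both correct and backed by their citation drops from roughly 54 per 100 on Gemini 2 to roughly 40 per 100 on Gemini 3, even though headline accuracy rises.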

It is worth running the comparison suggested by the source piece I'm reading, because it is the most useful frame I've seen for thinking about the trust part of this. Major diagnostic errors at a Swiss teaching hospital, comparing antemortem clinical diagnoses against autopsy findings, ran 30% in 1972, 18% in 1982, and 14% in 1992. That is a substantial improvement, attributable in the authors' reading to the rise of ultrasonography and endoscopy. Minor diagnostic errors, the same paper found, roughly doubled over the same period: 23% in 1972 to 46% in 1992. More tools brought more granular wrongness alongside less catastrophic wrongness.

None of this is a crisis. It is the rate at which a sophisticated profession running a busy hospital, with consulting peers and second opinions and post‑hoc verification, gets things wrong. The headline number for AI Overviews, 9% on grounded SimpleQA, sits in the same numerical neighbourhood as 1990s‑Swiss‑clinic major error rates. The two numbers aren’t strictly commensurate — clinical diagnosis is multi‑step reasoning across an entire patient encounter, SimpleQA is single‑fact retrieval, and the scoring rubrics are very different — but the comparison is useful as a calibration of where 9% sits in the universe of human‑institution error rates we already accept. It is comparable to a profession with two thousand years of practice, decade‑over‑decade tooling improvements, and explicit error‑catching protocols.

The trouble is that the question of accuracy is not the only one that matters. The Swiss clinicians had three things AI search does not: peer consultation, second‑opinion protocols, and a post‑hoc verification step (the autopsy itself) that turned every individual error into a feedback signal for the institution. AI Overviews has none of these by construction. The user reads the summary, treats it as the answer, and moves on. There is no autopsy.

The 9% errors that get through are not errors that get caught; they are errors that propagate.

Here is where the second number, the 56% source-claim discrepancy, becomes the part of the story that should have been the headline. When a piece of software hands you an answer accompanied by a footnote-style citation marker, the user-experience signal of that interface is "this claim is verified by this source." You can in principle click the link, but the affordance is calibrated for the case where you don't.
