Agentic RAG with GPU‑Resident Top‑K: How My Own CUDA Kernel Stopped Retrieval from Bouncing Off the GPU

Published: June 19, 2026, 09:00 PM GMT+9
9 min read

Source: Towards Data Science

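A tour of a 343‑line CUDA TopK retrieval kernel. The kernel, a CPU oracle, and a benchmark suite show that the standard agentic RAG round‑trip, in which every query travels across the PCIe bus, is the silent killer of the pipeline. By keeping the similarity search resident in device memory, this architecture runs up to 8.6× faster than an optimized CPU baseline, even on a 7‑year‑old GTX 1080.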

This is Part 3 of the “Production‑Grade Agentic Inference” series. Each part removes one kind of redundant work from an agentic LLM pipeline. Part 1 killed redundant prefill. Part 2 killed redundant waiting by letting multiple micro‑agents share one GPU through time‑slicing. Part 3 (this post) keeps RAG retrieval on the GPU with a custom CUDA TopK kernel. Part 4 persists agent state across hand‑offs so the next agent never has the cold‑start problem.

Key Takeaways

The problem: in agentic RAG, every tool call that needs context fires a similarity search. A default pipeline ships the query embedding from the GPU to Python, lets the CPU score N corpus rows and pick the best K, then ships the answer back. That round‑trip is the silent tax. The compute is fine; the travel is the bill. And as we all know, travel is never cheap, wherever you want to go (pun intended!).

The easy fix: upload the corpus to VRAM once, then keep the similarity scoring, the Top‑K selection, and the merge step on the device. Only the tiny per‑query embedding (D floats) and the K results travel across PCIe.
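To make "tiny" concrete, here is a minimal arithmetic sketch of the two kinds of traffic in that fix: the one‑time corpus upload and the per‑query bytes. It assumes float32 embeddings and scores and int32 row indices; the index width and the helper names `corpus_bytes` and `per_query_bytes` are illustrative assumptions, not taken from the repo.

```cpp
#include <cstddef>
#include <cstdint>

// One-time upload: N rows of D float32 values placed in VRAM.
constexpr std::uint64_t corpus_bytes(std::uint64_t n_rows, std::uint64_t dim) {
    return n_rows * dim * sizeof(float);
}

// Per-query PCIe traffic on the resident path:
// one D-float embedding up, then K indices and K scores down.
// Assumption: indices are int32 and scores are float32.
constexpr std::uint64_t per_query_bytes(std::uint64_t dim, std::uint64_t k) {
    return dim * sizeof(float) + k * (sizeof(std::int32_t) + sizeof(float));
}
```

Under these assumptions, the largest sweep point (N = 1M, D = 1024) is a 4,096,000,000‑byte upload that happens once, while each query at K = 8 moves 4,160 bytes.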

The receipts: on the same 7-year-old GTX 1080 used in Parts 1 and 2, the GPU‑resident path runs the retrieval hop up to 8.57× faster than a CPU brute‑force baseline. At K=8 it wins on all 15 sweep configurations (N ∈ {10k, 50k, 100k, 500k, 1M}, D ∈ {384, 768, 1024}) with speedups from 2.43× to 8.57×. At K=32 it wins on 13 of 15 configs, peaking at 7.76×. At K=100 — where the V1 selector intentionally stays simple — the CPU wins on 14 of 15 configs. That last sentence is the honest part (Well, even if I had lied, you could have easily caught it).

The kicker: the wins are not “magic kernel” wins. They are “we stopped shipping the corpus back to host RAM for no reason” wins. It is also exactly the kind of “measure many candidates, report only the best K back to the consumer” decision a 5G base station and your phone have been making every few milliseconds since CSI feedback became a thing.

TL;DR: Default agentic RAG treats the GPU as a serving box and the retrieval as a Python concern. Every tool call ships the query embedding D→H, lets the CPU compute N dot products, sort the candidates, pick the top K, and ship indices and scores H→D. For an agent that calls a vector store ten times per reasoning step, that round‑trip is the dominant cost — not the model, not the embedding, it is the travel. CUDA-TopK-Retrieval keeps the corpus resident on the device, runs scoring + per‑block partial Top‑K + a multi‑way merge entirely on the GPU, and exposes a tiny C++ orchestrator API (upload_corpus_rowmajor once, search_resident per query). The host‑touching bytes per query collapse to one D‑length embedding up and 2K results down. On a GTX 1080, across a 45‑config sweep, the GPU‑resident path beats the CPU‑round‑trip baseline on all 15 K=8 configs (2.43× to 8.57×, peaking at N=1M, D=1024) and on 13 of 15 K=32 configs (the two losses are both at the smallest N=10k for D=384 and D=768, where the round‑trip itself is already cheap; big‑N K=32 wins climb to 7.76×). At K=100 the V1 kernel deliberately stays simple — single‑lane-per‑block bubble sort with a serial merge — and the CPU wins on 14 of 15 configs; that ceiling is the article’s honest punchline and a clean setup for Part 4.

GitHub Repo: https://github.com/AnubhabBanerjee/cuda-topk-retrieval

(Quick confession before we start: I came at this from a 5G/6G RAN engineering background. Beam selection at a base station looks shockingly close to RAG Top‑K — the UE scores a codebook of candidate beams by received power and reports the best handful back over the air. There is a whole section on that below — section 8 — but it is also why this kernel exists in the shape it does.)

Architecture mental model — keep this open while you read.

agent.embed(query) → cudaMemcpy H→D (D floats) → row_dot_scores_kernel → partial_topk_block_kernel (P blocks) → merge_partial_topk_kernel → cudaMemcpy D→H (K indices + K scores)

Everything below is just commentary on one part of that line.
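The three device stages in that line can be mirrored on the CPU as a small reference, in the spirit of the CPU oracle mentioned earlier. This is a sketch under stated assumptions, not the author's kernels: the function names borrow the stage names from the pipeline, but the bodies, the `Hit` type, the tie‑breaking rule, the contiguous slicing of rows into blocks, and the use of `std::partial_sort` in place of the V1 bubble sort are all illustrative choices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Hit = std::pair<float, int>;  // (score, row index)

// Stage 1: one dot product per corpus row (row-major, n x d).
std::vector<float> row_dot_scores(const std::vector<float>& corpus,
                                  const std::vector<float>& query,
                                  std::size_t n, std::size_t d) {
    assert(corpus.size() == n * d && query.size() == d);
    std::vector<float> scores(n, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < d; ++j)
            scores[i] += corpus[i * d + j] * query[j];
    return scores;
}

// Descending score; ties broken by ascending row index (an assumption).
bool better(const Hit& a, const Hit& b) {
    return a.first > b.first || (a.first == b.first && a.second < b.second);
}

// Keep the best k hits of a candidate list, ordered best-first.
std::vector<Hit> keep_best(std::vector<Hit> hits, std::size_t k) {
    const std::size_t keep = std::min(k, hits.size());
    std::partial_sort(hits.begin(),
                      hits.begin() + static_cast<std::ptrdiff_t>(keep),
                      hits.end(), better);
    hits.resize(keep);
    return hits;
}

// Stage 2: each "block" selects its own top k from a contiguous slice of rows.
std::vector<Hit> partial_topk_block(const std::vector<float>& scores,
                                    std::size_t begin, std::size_t end,
                                    std::size_t k) {
    std::vector<Hit> hits;
    for (std::size_t i = begin; i < end; ++i)
        hits.emplace_back(scores[i], static_cast<int>(i));
    return keep_best(std::move(hits), k);
}

// Stage 3: merge the per-block survivors into the global top k.
std::vector<Hit> merge_partial_topk(const std::vector<std::vector<Hit>>& partials,
                                    std::size_t k) {
    std::vector<Hit> all;
    for (const auto& p : partials) all.insert(all.end(), p.begin(), p.end());
    return keep_best(std::move(all), k);
}

// Whole pipeline: score, select per block, merge.
std::vector<Hit> topk_reference(const std::vector<float>& corpus,
                                const std::vector<float>& query,
                                std::size_t n, std::size_t d,
                                std::size_t k, std::size_t blocks) {
    assert(blocks > 0);
    const std::vector<float> scores = row_dot_scores(corpus, query, n, d);
    const std::size_t chunk = (n + blocks - 1) / blocks;  // rows per block
    std::vector<std::vector<Hit>> partials;
    for (std::size_t b = 0; b < blocks; ++b) {
        const std::size_t begin = std::min(n, b * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        partials.push_back(partial_topk_block(scores, begin, end, k));
    }
    return merge_partial_topk(partials, k);
}
```

The design point the sketch illustrates is that stage 2 shrinks N scores to at most P × K candidates, so the merge in stage 3 only ever sees a small list, and only K (score, index) pairs need to leave the device.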

[Figure: CUDA TopK retrieval overview]

1. Confession: every RAG step in an agent is a tiny PCIe road trip

In Part 2 of this series, we successfully isolated our LLM agent’s inference loop, keeping token generation running hot and fast on the device. We designed a system which avoids stalling. But the moment we give that agent a tool to search an external knowledge base—the core of any multi‑hop Retrieval‑Augmented Generation (RAG) pipeline—we silently destroy all that hard‑won performance and we hit the wall.

If you have ever wired an “agentic” pipeline to a vector store through a Python retriever, here is what really happens on each tool call (with a little intentional dramatization):

You: “Agent, find me the five chunks most relevant to ‘how do I claim deduction under section 80C?’”

Agent: “Sure, embedding the query on the GPU. ✅”

Agent: “Now shipping the query embedding to the host.” (cudaMemcpy D→H, about 1,024 floats)

Python retriever: “Got it. NumPy loop. N dot products. argpartition. Top‑5.” (The CPU scores half a million corpus rows one at a time while a 9 TFLOP GPU watches.)

Python retriever: “Done. Here are the indices and scores.”

Agent: “Cool. Bouncing them back to the GPU now.” (cudaMemcpy H→D, 10 numbers)

Agent: “Ready. What was the question again?”

The agent has a perfectly good GPU. The corpus is sitting in 4 GB of VRAM. The query embedding was already on the GPU, since we just generated it there. And then, on every single retrieval hop, we ship the query back to the host, do brute‑force similarity in NumPy / FAISS‑on‑CPU / a hand‑rolled loop, and ship the answer back.
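For reference, the host‑side work in that round‑trip can be sketched roughly as follows: N dot products, then a partition step that plays the role of NumPy's argpartition, then a sort of only the K survivors. This is an illustrative stand‑in for a default retriever, not the benchmark's actual CPU baseline; the function name `cpu_topk_indices` and the row‑major layout are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Brute-force host retrieval: score every row, then select the k best indices.
std::vector<int> cpu_topk_indices(const std::vector<float>& corpus,
                                  const std::vector<float>& query,
                                  std::size_t n, std::size_t d, std::size_t k) {
    assert(corpus.size() == n * d && query.size() == d && k <= n);

    // N dot products, one per corpus row.
    std::vector<float> scores(n);
    for (std::size_t i = 0; i < n; ++i) {
        const auto row = corpus.begin() + static_cast<std::ptrdiff_t>(i * d);
        scores[i] = std::inner_product(query.begin(), query.end(), row, 0.0f);
    }

    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    auto by_score = [&scores](int a, int b) {
        return scores[static_cast<std::size_t>(a)] > scores[static_cast<std::size_t>(b)];
    };

    // Partition so the k highest-scoring indices come first (unordered),
    // which avoids sorting all n candidates.
    std::nth_element(order.begin(),
                     order.begin() + static_cast<std::ptrdiff_t>(k),
                     order.end(), by_score);
    order.resize(k);

    // Sort only the k survivors so results come back best-first.
    std::sort(order.begin(), order.end(), by_score);
    return order;
}
```

Nothing here is slow in isolation; the cost the article is pointing at is that this runs on the host, so the query has to leave the GPU first and the K results have to be shipped back afterwards.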

Your GPU’s utility meter: spends most of the retrieval step idle. Your PCIe bus: gets a workout it did not sign up for. Your agent’s tool‑call latency: dominated by something that is neither the model nor the embedding. That is the joke.

That is also the dirty secret of every agentic RAG demo that scales past the toy “ten chunks in memory” stage. The retrieval hop bounces off the GPU and back, every time, and the bigger the corpus, the worse the tax. At a million rows of 1024‑dim embeddings, the round‑trip alone — not even the scoring, yes, just the round‑trip — eats most of the budget of the retrieval step itself.

CUDA-TopK-Retrieval is what happens when you decide the round‑trip is optional and you would rather write 343 lines of CUDA than let the agent vacation through host RAM every time it wants a neighbor.

Now imagine the real workload behind this. It is not “five chunks for one question.” It is multiple specialized micro‑agents — each one running its own RAG hops, each one needing Top‑K against the same corpus, each one currently paying its own PCIe bill on every tool call.

Part 1 killed redundant prefill. Part 2 killed redundant waiting — how multiple micro‑agents share one GPU through time‑slicing. Part 3 (this post) keeps RAG retrieval on the GPU with a custom CUDA TopK kernel. Part 4 persists agent state across hand‑offs so the next agent never has the cold‑start problem.
