[Paper] TableCache: Primary Foreign Key Guided KV Cache Precomputation for Low Latency Text-to-SQL
Source: arXiv - 2601.08743v1
Overview
Text‑to‑SQL systems let users ask natural‑language questions and get SQL queries back. Modern approaches rely on large language models (LLMs) that need the whole database schema in the prompt, which makes the prompt long and slows down the “prefill” stage of inference. TableCache shows how to pre‑compute and reuse the key‑value (KV) cache for individual tables, so that repeated queries touching the same tables can share that work and return answers faster.
Key Contributions
- Offline KV‑cache precomputation for each database table, preserving primary‑foreign‑key relationships.
- Table Trie data structure that enables O(1) lookup of the right combination of table caches during inference.
- Cache management & query reranking that selects the most cache‑friendly query ordering to boost hit rates.
- Parallel loading pipeline that overlaps model inference with cache fetches, reducing idle GPU time.
- Empirical demonstration of up to 3.62× reduction in Time‑to‑First‑Token (TTFT) with < 1 % drop in SQL accuracy.
Methodology
- Table Representation Extraction – For every table, the LLM processes a short “table description” (column names, types, primary/foreign keys) once and stores the resulting KV cache (the per‑layer attention key/value tensors that would normally be recomputed for every prompt).
- Preserving Relationships – When a table references another via a foreign key, the child table’s cache is built after the parent’s, so the model sees the relational context in the same order it would during a full prompt (see the first sketch after this list).
- Table Trie Index – All possible table‑order prefixes are inserted into a trie. At runtime, the system walks the trie according to the tables required by the user query and retrieves the matching pre‑computed KV slices without re‑running prefill (see the second sketch after this list).
- Cache Management
  - Reranking: Given a user query, the engine evaluates a few table orderings (e.g., alphabetical, schema‑dependency order) and picks the one that maximizes cache reuse (also illustrated in the second sketch after this list).
  - Loading Pipeline: While the model is decoding the first token, a background thread streams the needed KV blocks into place from GPU memory or host RAM, overlapping I/O with compute (see the third sketch after this list).
- Integration with Existing Engines – TableCache plugs into SGLang/vLLM by replacing the usual “prefill” step with a lookup‑plus‑copy operation, leaving the rest of the generation pipeline untouched.
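The build step can be pictured as follows. This first sketch is illustrative rather than the paper’s implementation: it assumes a Hugging Face causal LM (gpt2 as a small stand‑in), a hypothetical two‑table schema, and conditions each child table on one parent’s cached context so the stored key/value tensors match what a full prompt would produce.

```python
# Sketch: precompute a KV cache per table, building parents before children
# so a child's cache is conditioned on its parent's description.
# Model, schema, and helper names are illustrative assumptions.
import copy
from graphlib import TopologicalSorter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # small stand-in for the decoder-only LLMs used in the paper
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

# Hypothetical schema: table -> (description text, parent tables it references)
SCHEMA = {
    "users":  ("TABLE users(id INT PRIMARY KEY, name TEXT)", []),
    "orders": ("TABLE orders(id INT PRIMARY KEY, "
               "user_id INT REFERENCES users(id))", ["users"]),
}

def precompute_table_caches(schema):
    """Encode each table description once, in FK dependency order, and
    store the resulting past_key_values for reuse at query time."""
    # Topological order guarantees parents are cached before their children.
    deps = {tbl: set(parents) for tbl, (_, parents) in schema.items()}
    order = list(TopologicalSorter(deps).static_order())

    caches = {}
    for table in order:
        description, parents = schema[table]
        ids = tokenizer(description, return_tensors="pt").input_ids
        # Condition on the first parent's cache (copied so sibling tables
        # don't extend the same cache object); a real system would handle
        # multi-parent prefixes via the trie index.
        past = copy.deepcopy(caches[parents[0]]) if parents else None
        with torch.no_grad():
            out = model(ids, past_key_values=past, use_cache=True)
        caches[table] = out.past_key_values  # prefix cache ending at this table
    return order, caches

if __name__ == "__main__":
    order, caches = precompute_table_caches(SCHEMA)
    print("build order:", order)
```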
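The second sketch shows one possible shape for the trie index and the cache‑aware reranking, again as an assumption‑laden illustration rather than the paper’s code: trie nodes are keyed by table name, each node can hold a KV‑cache handle for the table‑order prefix leading to it, and the reranker simply prefers the candidate ordering with the longest cached prefix.

```python
# Sketch: a "table trie" indexing precomputed KV caches by table-order prefix,
# plus a reranker that picks the ordering with the longest cached prefix.
from dataclasses import dataclass, field
from itertools import permutations
from typing import Any, Dict, List, Optional, Sequence, Tuple


@dataclass
class TrieNode:
    children: Dict[str, "TrieNode"] = field(default_factory=dict)
    kv_cache: Optional[Any] = None  # handle to the precomputed KV blocks


class TableTrie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, table_prefix: Sequence[str], kv_cache: Any) -> None:
        """Store the KV cache computed for this exact ordered table prefix."""
        node = self.root
        for table in table_prefix:
            node = node.children.setdefault(table, TrieNode())
        node.kv_cache = kv_cache

    def longest_cached_prefix(self, tables: Sequence[str]) -> Tuple[int, Optional[Any]]:
        """Walk the trie along `tables`; return (#tables covered, cache handle)."""
        node, best_len, best_cache = self.root, 0, None
        for i, table in enumerate(tables, start=1):
            node = node.children.get(table)
            if node is None:
                break
            if node.kv_cache is not None:
                best_len, best_cache = i, node.kv_cache
        return best_len, best_cache


def rerank_orderings(trie: TableTrie, tables: List[str]) -> List[str]:
    """Pick the table ordering whose prefix reuses the most cached work."""
    candidates = permutations(tables) if len(tables) <= 4 else [tuple(sorted(tables))]
    return list(max(candidates, key=lambda c: trie.longest_cached_prefix(c)[0]))


if __name__ == "__main__":
    trie = TableTrie()
    trie.insert(["users"], "kv<users>")
    trie.insert(["users", "orders"], "kv<users,orders>")
    best = rerank_orderings(trie, ["orders", "users"])
    print(best, trie.longest_cached_prefix(best))
    # -> ['users', 'orders'] (2, 'kv<users,orders>')
```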
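The third sketch approximates the loading pipeline with a producer/consumer pair: a background thread fetches the next table’s KV blocks while the current step computes. The load and compute functions are simulated with sleeps; they stand in for host‑to‑GPU copies and prefill/decode work, and are not the paper’s actual pipeline.

```python
# Sketch: overlap KV-block loading with computation using a prefetch thread.
import queue
import threading
import time


def load_kv_blocks(table: str) -> str:
    time.sleep(0.05)          # simulated host-RAM -> GPU copy
    return f"kv<{table}>"


def compute_with(kv: str) -> None:
    time.sleep(0.05)          # simulated prefill/decode work


def pipelined_inference(tables):
    """Prefetch table i+1 while computing with table i."""
    prefetched: "queue.Queue[str]" = queue.Queue(maxsize=1)

    def prefetcher():
        for table in tables:
            prefetched.put(load_kv_blocks(table))

    threading.Thread(target=prefetcher, daemon=True).start()
    for _ in tables:
        kv = prefetched.get()  # ready (or nearly ready) thanks to the overlap
        compute_with(kv)


if __name__ == "__main__":
    start = time.time()
    pipelined_inference(["users", "orders", "items", "payments"])
    print(f"pipelined: {time.time() - start:.2f}s")  # ~0.25 s vs ~0.40 s serial
```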
Results & Findings
| Metric | Baseline (vLLM) | TableCache | Change |
|---|---|---|---|
| TTFT (average) | 1.84 s | 0.51 s | 3.62× |
| End‑to‑end latency (90th percentile) | 3.2 s | 1.1 s | 2.9× |
| Exact‑match SQL accuracy | 92.3 % | 91.8 % | –0.5 % |
| Cache hit rate (per query) | N/A | 78 % | — |
- The speed gains are most pronounced for workloads where the same handful of tables appear in many queries (e.g., dashboards, reporting tools).
- Accuracy loss is negligible because, when the table ordering matches the cached prefix, the cached KV states are identical to those a full prompt would produce; the tiny drop stems from occasional mismatches in table ordering.
- Memory overhead is modest: storing KV caches for 200 tables of a typical enterprise schema consumes roughly 2 GB on a 40 GB GPU (a rough back‑of‑envelope check follows).
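As a sanity check on that figure, the arithmetic below uses assumed (not paper‑stated) numbers: a LLaMA‑7B‑like decoder with 32 layers, 32 KV heads, and head dimension 128 in fp16, and roughly 20 tokens per table description.

```python
# Back-of-envelope check of the reported ~2 GB for 200 cached tables,
# under assumed (not paper-stated) model dimensions and description lengths.
layers, kv_heads, head_dim = 32, 32, 128   # LLaMA-7B-like decoder (assumption)
bytes_per_value = 2                        # fp16
tokens_per_table = 20                      # short schema description (assumption)
num_tables = 200

# Per token: keys + values across all layers and heads.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
total_gb = kv_bytes_per_token * tokens_per_table * num_tables / 1024**3
print(f"{kv_bytes_per_token / 1024:.0f} KiB/token -> {total_gb:.1f} GiB total")
# ~512 KiB per token -> ~2.0 GiB for 200 tables
```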
Practical Implications
- Faster interactive analytics – Developers can embed LLM‑driven query assistants in BI tools and keep response times under a second, improving user experience.
- Cost reduction – For long, schema‑heavy prompts, prefill dominates GPU compute; reusing KV caches cuts those cycles, lowering cloud GPU bills.
- Scalable multi‑tenant services – A single LLM instance can serve many customers whose schemas overlap (e.g., SaaS platforms) by sharing table caches across tenants.
- Simplified prompt engineering – Since the schema portion of the context is served from the cache rather than rewritten into every prompt, developers can keep the hand‑authored prompt short and focus on natural‑language intent.
- Compatibility – TableCache works as a drop‑in layer for any transformer‑based decoder that exposes KV caches (e.g., LLaMA, Mistral), making it easy to adopt in existing pipelines.
Limitations & Future Work
- Schema churn – Adding, dropping, or altering tables requires recomputing the affected caches; the current system assumes relatively static schemas.
- Cache size vs. GPU memory – Very large catalogs (thousands of tables) may exceed GPU memory, necessitating smarter eviction or hierarchical storage (CPU‑RAM → GPU).
- Query diversity – For ad‑hoc queries that involve many rarely used tables, cache hit rates drop and the benefit diminishes.
- Future directions suggested by the authors include: dynamic cache updating for evolving schemas, hierarchical caching for multi‑database environments, and extending the approach to other LLM‑driven code generation tasks (e.g., API call synthesis).
Authors
- Jinbo Su
- Yuxuan Hu
- Cuiping Li
- Hong Chen
- Jia Li
- Lintao Ma
- Jing Zhang
Paper Information
- arXiv ID: 2601.08743v1
- Categories: cs.CL, cs.AI
- Published: January 13, 2026