Caching AI Responses in a Desktop App — Don't Pay Twice for the Same Question

Published: May 4, 2026 at 07:48 PM EDT
2 min read
Source: Dev.to

If a user closes the AI diagnosis overlay and reopens it, should you call Gemini again?

No. Cache the result. Same input → same output. No reason to burn rate‑limit quota.

The problem

Without caching

  1. User clicks Diagnose on error line 847.
  2. Gemini responds in ~3 seconds.
  3. User closes the overlay.
  4. User reopens the overlay.
  5. Gemini is called again → another 3 seconds and another request.

With caching

Steps 4‑5 become instant, with zero API calls.

Cache key: hash the input

The same log context produces the same hash, which maps to the same cached result.

use std::collections::HashMap;
use sha2::{Sha256, Digest};

pub struct DiagnosisCache {
    entries: HashMap<String, CacheEntry>,
    max_size: usize,
}

#[derive(Clone)]
pub struct CacheEntry {
    pub result: String,
    pub created_at: std::time::Instant,
}

impl DiagnosisCache {
    pub fn new(max_size: usize) -> Self {
        Self { entries: HashMap::new(), max_size }
    }

    pub fn key(context: &str) -> String {
        let mut hasher = Sha256::new();
        hasher.update(context.as_bytes());
        format!("{:x}", hasher.finalize())
    }

    pub fn get(&self, key: &str) -> Option<&CacheEntry> {
        self.entries.get(key)
    }

    pub fn insert(&mut self, key: String, result: String) {
        // Evict the oldest entry when at capacity
        if self.entries.len() >= self.max_size {
            if let Some(oldest_key) = self.entries
                .iter()
                .min_by_key(|(_, v)| v.created_at)
                .map(|(k, _)| k.clone())
            {
                self.entries.remove(&oldest_key);
            }
        }

        self.entries.insert(key, CacheEntry {
            result,
            created_at: std::time::Instant::now(),
        });
    }
}
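
To sanity-check the two properties the command relies on (deterministic keys, oldest-first eviction), here is a minimal test sketch against the struct above; the contexts and capacity are made up:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn same_context_same_key() {
        // Identical log context always hashes to the same key.
        assert_eq!(
            DiagnosisCache::key("error at line 847"),
            DiagnosisCache::key("error at line 847")
        );
    }

    #[test]
    fn evicts_oldest_at_capacity() {
        let mut cache = DiagnosisCache::new(1);
        cache.insert("a".into(), "first".into());
        cache.insert("b".into(), "second".into());
        // Capacity is 1, so "a" was evicted to make room for "b".
        assert!(cache.get("a").is_none());
        assert!(cache.get("b").is_some());
    }
}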

Using it in the command

#[tauri::command]
pub async fn diagnose(
    context: String,
    api_key: String,
    cache: tauri::State<'_, std::sync::Mutex<DiagnosisCache>>,
) -> Result<String, String> { // String error type is an assumption; call_gemini must match
    let key = DiagnosisCache::key(&context);

    // Check cache first
    {
        let cache = cache.lock().unwrap();
        if let Some(entry) = cache.get(&key) {
            return Ok(entry.result.clone()); // instant
        }
    }

    // Cache miss — call Gemini
    let result = call_gemini(&context, &api_key).await?;

    // Store result
    {
        let mut cache = cache.lock().unwrap();
        cache.insert(key, result.clone());
    }

    Ok(result)
}
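
One detail worth keeping: each lock().unwrap() lives inside its own { } block, so the MutexGuard is dropped before the .await. Holding a std::sync::Mutex guard across an await point would make the future non-Send, and the async command would not compile.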

Cache size

A capacity of 50 entries is enough for a typical session. Log lines change constantly, so a cache entry older than a couple of hours is rarely useful. You can clear the cache on app restart or add a TTL if desired (a sketch follows the registration snippet below).

// Register in main.rs
.manage(std::sync::Mutex::new(DiagnosisCache::new(50)))
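
If you want the TTL mentioned above, a minimal sketch is below: a lookup that treats stale entries as misses. The get_fresh name and the two-hour cutoff are illustrative choices, not from the original code.

use std::time::Duration;

// Hypothetical cutoff; tune to how long a diagnosis stays relevant.
const DIAGNOSIS_TTL: Duration = Duration::from_secs(2 * 60 * 60);

impl DiagnosisCache {
    // Like get(), but treats entries older than the TTL as cache misses.
    pub fn get_fresh(&self, key: &str) -> Option<&CacheEntry> {
        self.entries
            .get(key)
            .filter(|entry| entry.created_at.elapsed() < DIAGNOSIS_TTL)
    }
}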

Result

  • First diagnosis: ~3 seconds.
  • Subsequent identical diagnoses: instant.

Rate‑limit usage is cut dramatically for users who re‑examine the same errors.

Links

  • Hiyoko PDF Vault →
  • X →