When 5 Minutes Isn't Enough: Moving AI Ingestion from Sync to Async (And Saving 99% Compute)

Published: February 12, 2026 at 10:38 PM EST
3 min read
Source: Dev.to

Background

In a previous post I introduced Synapse, the AI system I built for my wife that uses a Knowledge Graph to give her LLM a “deep memory.” Early demos showed the graph updating in about 50 seconds after a chat ended, but real‑world usage quickly exposed a fundamental flaw.

The Problem

During 45‑minute chat sessions with dozens of messages, the “End Session” button would spin for minutes and eventually crash. The root cause wasn’t a simple timeout bug—it was the architecture.

Initial Synchronous Implementation

  1. Convex (Orchestrator) → triggers an HTTP POST to my Python backend.
  2. FastAPI (Brain) → calls Graphiti + Gemini to process the text.
  3. FastAPI waits for the result and returns it.
  4. Convex saves the result to the database.

This is a classic synchronous request‑reply pattern.
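The failure mode can be reproduced with a minimal sketch. This is not the Synapse code — `process_session` is a hypothetical stand-in for the Graphiti + Gemini pipeline, and the action limit is scaled down to fractions of a second — but it shows why the blocking pattern breaks once the work outlasts the orchestrator's execution limit:

```python
import asyncio

ACTION_LIMIT_S = 0.1  # stands in for Convex's 5-10 minute action limit


async def process_session(duration_s: float) -> str:
    # Stand-in for the heavy Graphiti + Gemini ingestion work.
    await asyncio.sleep(duration_s)
    return "graph-updated"


async def end_session_sync(duration_s: float) -> str:
    # Synchronous request-reply: the orchestrator blocks on the full
    # result, so the entire pipeline must finish inside the limit.
    return await asyncio.wait_for(process_session(duration_s), timeout=ACTION_LIMIT_S)


# A short session fits inside the limit; a long one hits the timeout.
print(asyncio.run(end_session_sync(0.01)))  # graph-updated
try:
    asyncio.run(end_session_sync(0.5))  # "12-18 minute" session, scaled down
except asyncio.TimeoutError:
    print("timeout: action limit exceeded")
```

Short conversations succeed; anything longer than the limit raises a timeout no matter how healthy the backend is, which is exactly the behavior the traces later confirmed.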

Why it failed: Convex Actions have a hard execution limit (5–10 minutes depending on the plan). Short conversations finished in 1–2 minutes, but larger sessions required 12–18 minutes, far exceeding the limit.

The Cascade of Failures

  • Added exponential‑backoff retries on Convex actions.
  • Each retry started a new background process while the previous one kept running, doubling token usage and creating “zombie” jobs.
  • The user still saw an error, and the backend was overwhelmed.

Diagnosis

OpenTelemetry traces (sent to Axiom) showed that ingestion wasn’t failing—it was simply slow, consistently taking 12–18 minutes for large sessions.

Switching to an Async Polling Architecture

When a task exceeds the time a client or server is willing to wait, the request must be decoupled from the response.

New Flow

  1. Convex sends POST /ingest.
  2. FastAPI immediately returns 202 Accepted with a jobId (≈ 300 ms).
  3. FastAPI launches the heavy processing in a background task (asyncio.create_task).
  4. Convex sleeps, then polls the job status every few minutes.
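The decoupled flow can be sketched with plain asyncio and an in-memory job store. The names here (`ingest`, `get_status`, `JOBS`) are illustrative, not the actual Synapse API, and the timings are scaled down from minutes to fractions of a second:

```python
import asyncio
import uuid

JOBS: dict[str, dict] = {}  # jobId -> {"status": ..., "result": ...}


async def run_ingestion(job_id: str, text: str) -> None:
    # Stand-in for the 12-18 minute Graphiti + Gemini pipeline.
    await asyncio.sleep(0.05)
    JOBS[job_id] = {"status": "done", "result": f"graph updated from {len(text)} chars"}


async def ingest(text: str) -> str:
    # Returns immediately (the "202 Accepted" step) and launches the
    # heavy work as a background task.
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"status": "processing", "result": None}
    asyncio.create_task(run_ingestion(job_id, text))
    return job_id


async def get_status(job_id: str) -> dict:
    # The cheap endpoint the orchestrator polls.
    return JOBS[job_id]


async def main() -> None:
    job_id = await ingest("45 minutes of chat transcript")
    print((await get_status(job_id))["status"])  # processing
    await asyncio.sleep(0.1)                     # orchestrator sleeps, then polls
    print((await get_status(job_id))["status"])  # done


asyncio.run(main())
```

The key property: no call ever blocks longer than a status lookup, so the orchestrator's execution limit stops mattering regardless of how long ingestion takes.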

Polling Strategy

  • Switched from exponential to linear backoff.
  • Schedule: check after 5 minutes, then after 10 minutes, then every 10 minutes thereafter.
  • Reduces unnecessary load and noise on the server.
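The schedule above can be expressed as a small helper. `poll_delays` is an illustrative function (not code from the project), reading the schedule as a list of waits between polls: 5 minutes before the first check, then 10 minutes before each subsequent one:

```python
def poll_delays(max_checks: int) -> list[int]:
    """Return the wait in minutes before each poll attempt:
    5 before the first check, then 10 between later checks."""
    return [5 if attempt == 0 else 10 for attempt in range(max_checks)]


print(poll_delays(4))  # [5, 10, 10, 10]
```

Compared with exponential backoff (5, 10, 20, 40, ...), the linear schedule stays aligned with the expected 12–18 minute job duration instead of drifting toward hour-long gaps.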

Resource Usage Comparison

| Scenario | Action Time | Total Billed Compute | Token Waste |
| --- | --- | --- | --- |
| Synchronous | 5 min (blocking) → timeout → retry (another 5 min) | ~10–15 min | High (duplicate processing) |
| Async Polling | Trigger ≈ 300 ms, Poll ≈ 300 ms, Final fetch ≈ 300 ms | < 2 seconds | Minimal |

We went from wasting ~10 minutes of compute per job to under 2 seconds of active execution time, while eliminating duplicate processing.
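A quick back-of-the-envelope check of the headline number, using the figures from the table (three ~300 ms actions versus roughly 10 minutes of blocked compute):

```python
# Rough savings estimate based on the figures reported above.
sync_compute_s = 10 * 60   # ~10 min of billed, blocking action time
async_compute_s = 3 * 0.3  # trigger + poll + final fetch, ~300 ms each

savings = 1 - async_compute_s / sync_compute_s
print(f"active compute reduced by {savings:.2%}")
```

That ratio is where the "saving 99% compute" in the title comes from.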

Lessons Learned

  • AI tasks are inherently slow. A “fast” LLM call can be 30 seconds; a “deep” knowledge‑graph update can be 15 minutes.
  • Don’t just increase timeouts. Decouple request and response to keep the system resilient and cost‑effective.
  • Linear backoff for polling matches the expected duration of long‑running jobs and reduces server chatter.

Code Repository

The implementation of this async request‑reply pattern is available in the following repositories:

Call for Feedback

I’m interested in how others handle long‑running LLM tasks. Feel free to reach out on X or LinkedIn.
