How I Cut API Latency by 95% by Fixing One Hidden Loop ❤️🔥
Source: Dev.to
The Trigger
“Hey, the organization dashboard feels really slow for some customers. Like… 3–4 seconds slow.”
I sighed 🥴 – another performance ticket.
- Hit the endpoint locally → everything looked fine.
- Checked staging → “Probably their network,” I thought, and closed the tab.
But the complaints kept coming.
The Symptom
- No errors, no crashes.
- No alarming CPU spikes.
- Service technically healthy.
Yet production latency had crept from a respectable 200 ms to an uncomfortable 3–4 seconds during peak hours.
The code for this endpoint had been touched recently. It was clean, idiomatic Go – the kind you skim in a review and immediately trust. The previous developer was solid – no obvious mistakes, no red flags.
Still, something felt off.
Digging In with Tracing
I opened SigNoz, filtered by that endpoint, and clicked into a trace from one of the slowest requests.
What I found wasn’t a bug.
It was a pattern – silently strangling our database.
The Pattern: N+1 Queries
If you’ve never heard of it, here’s the short version:
Your code runs one query to get a list of items, then loops through that list to run N additional queries – one per item.
The Go Code (looks perfectly fine)
```go
// Fetch all workspaces for an organization
workspaces, err := repo.GetWorkspaces(orgID)
if err != nil {
    return err
}

// For each workspace, fetch its storage stats
for _, ws := range workspaces {
    storage, err := repo.GetWorkspaceStorage(ws.ID) // 🔴 This is the problem
    if err != nil {
        return err
    }
    // ... do something with storage
}
```
It reads beautifully, but under the hood it translates to:
| Query | SQL (simplified) |
|---|---|
| #1 | SELECT * FROM workspaces WHERE org_id = 123 (gets all 100 workspaces) |
| #2 | SELECT used_bytes FROM workspace_storage_view WHERE workspace_id = 1 |
| #3 | SELECT used_bytes FROM workspace_storage_view WHERE workspace_id = 2 |
| … | … |
| #101 | SELECT used_bytes FROM workspace_storage_view WHERE workspace_id = 100 |
- Each individual query is fast – ~8–10 ms.
- But 200+ queries at ~10 ms each ≈ 2,000 ms of pure, sequential database time. 🥳
Dead Ends
1️⃣ Logs
- Filtered logs for the slow endpoint.
- Every query completed in 8–12 ms.
- Logs weren’t lying; they were just telling the wrong story.
2️⃣ Basic Metrics
- CPU: normal.
- Only anomaly: API latency.
I was stuck 😔
The Breakthrough: Tracing the Request
- Filter by the problematic endpoint.
- Load the waterfall view.
A healthy service trace looks boring:
HTTP handler → short
Go application logic → tiny sliver
But my trace showed:
```
[DB] SELECT * FROM workspaces WHERE org_id = ?                               (15 ms)
[DB] SELECT used_bytes FROM workspace_storage_view WHERE workspace_id = 1     (9 ms)
…
[DB] SELECT used_bytes FROM workspace_storage_view WHERE workspace_id = 100   (9 ms)
```
Scrolling down revealed 217 database queries – the smoking gun 🚨.
What the Trace Made Impossible to Ignore
- Query Count Explosion – a 40× outlier.
- Latency Waterfall – application code ~45 ms, DB time ~2 s.
- Pattern – classic N+1.
The N+1 problem doesn’t show up in slow‑query logs because no single query is slow. You need traces.
I stared at the trace for a full minute: 217 queries.
The developer who wrote the code wasn’t careless; they tested with only 5 workspaces.
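A quick aside on where those per-query spans come from: SigNoz ingests OpenTelemetry data, so the database layer has to be instrumented for each query to show up as its own span. Here is a minimal sketch of hand-instrumenting a repository method with the standard go.opentelemetry.io/otel API – the `Repo` type, the simplified signature, and the exporter setup are illustrative assumptions, not the exact code from this service:

```go
package repo

import (
    "context"
    "database/sql"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

// One tracer per package; spans land in whatever backend (SigNoz here)
// your OpenTelemetry exporter is configured to ship to.
var tracer = otel.Tracer("workspace-repo")

type Repo struct {
    db *sql.DB
}

// GetWorkspaceStorage records one span per call – which is exactly why an
// N+1 loop shows up as a wall of identical sibling spans in the waterfall.
func (r *Repo) GetWorkspaceStorage(ctx context.Context, workspaceID int64) (int64, error) {
    ctx, span := tracer.Start(ctx, "db.GetWorkspaceStorage")
    defer span.End()
    span.SetAttributes(attribute.Int64("workspace.id", workspaceID))

    var usedBytes int64
    err := r.db.QueryRowContext(ctx,
        "SELECT used_bytes FROM workspace_storage_view WHERE workspace_id = $1",
        workspaceID,
    ).Scan(&usedBytes)
    return usedBytes, err
}
```

One span per repository call is all it takes: call it 217 times in a loop and the trace makes the pattern impossible to miss.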
The Real Culprit
```go
workspaces, _ := repo.GetWorkspacesByOrg(ctx, orgID)

for _, ws := range workspaces {
    used, _ := repo.GetWorkspaceStorageUsed(ctx, ws.ID)
    total += used
}
```
- One query to fetch the list.
- One query per workspace inside the loop.
In production the assumption “few workspaces” silently collapsed.
The Fix
Two Deliberate Changes
- Stop querying inside a loop.
- Fetch all needed data in a single query (or at least batch it).
Before (conceptual)
```go
// For each workspace, fetch its storage stats (N+1!)
for _, ws := range workspaces {
    storage, _ := repo.GetWorkspaceStorage(ws.ID)
    // …
}
```
After (conceptual)
```go
// Fetch workspaces
workspaces, _ := repo.GetWorkspacesByOrg(ctx, orgID)

// Fetch all workspace storage usage in ONE query
storages, _ := repo.GetAllWorkspaceStoragesByOrg(ctx, orgID)

// Map storages to workspaces and process
```
The code change took about 20 minutes and the impact was immediate.
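Fleshed out, the batched call is one query that joins the organization's workspaces against the storage view and returns a map keyed by workspace ID. This is a simplified sketch over database/sql – the method name mirrors the conceptual snippet above, not the exact production code:

```go
import (
    "context"
    "database/sql"
)

type Repo struct {
    db *sql.DB
}

// GetAllWorkspaceStoragesByOrg fetches storage usage for every workspace in
// the organization with a single query, instead of one query per workspace.
func (r *Repo) GetAllWorkspaceStoragesByOrg(ctx context.Context, orgID int64) (map[int64]int64, error) {
    rows, err := r.db.QueryContext(ctx, `
        SELECT w.id, COALESCE(s.used_bytes, 0)
        FROM workspaces w
        LEFT JOIN workspace_storage_view s ON s.workspace_id = w.id
        WHERE w.org_id = $1`, orgID)
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    usage := make(map[int64]int64) // workspace ID → used bytes
    for rows.Next() {
        var wsID, used int64
        if err := rows.Scan(&wsID, &used); err != nil {
            return nil, err
        }
        usage[wsID] = used
    }
    return usage, rows.Err()
}
```

The service layer then just sums the values from that single result set, so the request issues two queries total no matter how many workspaces the org has.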
Why the Database View Mattered 🥳
We already had a materialized‑style view:
```sql
CREATE OR REPLACE VIEW workspace_storage_view AS
SELECT
    workspace_id,
    COALESCE(SUM(size), 0) AS used_bytes
FROM files
WHERE type != 'folder'
GROUP BY workspace_id;
```
- The view pre‑aggregates the heavy `SUM(size)` over millions of rows.
- Without it, the “fix” would have simply moved the expensive aggregation into the per‑request query, still killing performance.
With the view, aggregation happens once, and the application only reads the ready‑made results.
Takeaways
- N+1 queries are invisible in logs and basic metrics because each query is fast.
- Distributed tracing shines a light on hidden patterns.
- Always question loops that hit the database per iteration.
- Use batch queries or pre‑aggregated views to keep latency low.
The moment I stopped guessing and followed a single request end‑to‑end, the problem became undeniable.
Deployed It
Then I opened the monitoring dashboard and watched.
The latency line for the endpoint fell off a cliff.
Metric Comparison
| Metric | Before | After | Improvement |
|---|---|---|---|
| Queries per request | 217 | 12 | ~18× fewer |
| Avg response time | ~2.8 s | ~80 ms | 35× faster |
| P95 latency | ~4.2 s | ~120 ms | 35× better |
| DB CPU usage | ~65 % | ~12 % | 82 % reduction |
I refreshed the endpoint for our largest customer:
- 78 ms.
- Refreshed again – 82 ms.
I sat there for five minutes just watching the graphs. No spikes.
The Best Metric of All
“Hey, whatever you did yesterday the dashboard complaints stopped completely. Users are happy again.”
I didn’t tell him it was one structural query change. I just said:
“Fixed a thing.” 😁 (I did tell him afterwards, by the way.)
Fixing the N+1 problem felt great. Watching latency drop from 3 seconds to ~80 ms was an instant dopamine hit.
What I Learned
I used to think clean code was enough: readable loops, proper error handling, separation of concerns. But the real lesson here wasn’t about JOINs or database views.
Databases don’t think in loops. They think in sets.
Now I test with realistic data volumes, and before any loop that touches the database I ask:
“Can I do this once instead of N times?”
Because in the end, the fix was a single query change.
Best Habits to Avoid N+1
- Don’t query the database inside a loop. Ever. (Very few legitimate exceptions.)
- Think in sets, not rows.
  - Databases excel at set operations – `JOIN`s, `IN` clauses, bulk reads.
  - Translate “I need data for each of these items” → “I need all of this data at once.”
- Enable query logging in development.
  ```go
  // Go with GORM (gorm.io/gorm, gorm.io/driver/postgres, gorm.io/gorm/logger)
  db, err := gorm.Open(postgres.Open(dsn), &gorm.Config{
      Logger: logger.Default.LogMode(logger.Info), // Shows all queries
  })
  ```
  With `sqlx` or `database/sql`, wrap your queries with logging middleware and make it part of your local dev setup (see the sketch after this list).
- Set query‑count budgets for endpoints.
  Example budget: detail page (single resource) → ≤ 3 queries.
- Write tests that assert query counts.
  ```go
  func TestGetOrgStorageUsage_QueryCount(t *testing.T) {
      // Setup: a test org with 200 workspaces
      queryLog := &QueryLogger{}
      service := NewWorkspaceService(queryLog)

      service.GetOrgStorageUsage(context.Background(), orgID)

      if queryLog.Count() > 2 {
          t.Errorf("expected ≤ 2 queries, got %d", queryLog.Count())
      }
  }
  ```
- Profile with production‑scale data. A loop that looks harmless with 5 rows explodes with 500 – leave a marker on risky spots, e.g. `// ⚠️ N+1 risk: loops over workspaces`.
- Use APM tools to catch what you miss.
  - Set alerts for “queries per request > threshold.”
- Learn your ORM’s data‑fetching patterns.
  Prefer `IN` clauses over loops: `SELECT * FROM table WHERE id IN (?, ?, ?);`
- Make “query count” part of code reviews.
  - Does this loop make database calls?
- When in doubt, measure.
  Track query count per request and response latency.
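For the “logging middleware” and “assert query counts” habits above, here is the kind of helper I mean – a minimal sketch, not a specific library. The `QueryLogger` name matches the test snippet; the fields and methods are illustrative:

```go
import (
    "context"
    "database/sql"
    "log"
    "sync/atomic"
)

// QueryLogger counts (and optionally logs) every query issued through it.
// In dev you read the log; in tests you assert on Count().
type QueryLogger struct {
    db      *sql.DB
    count   int64
    verbose bool
}

func NewQueryLogger(db *sql.DB, verbose bool) *QueryLogger {
    return &QueryLogger{db: db, verbose: verbose}
}

func (q *QueryLogger) Count() int64 {
    return atomic.LoadInt64(&q.count)
}

// QueryContext wraps sql.DB.QueryContext, bumping the counter per call.
func (q *QueryLogger) QueryContext(ctx context.Context, query string, args ...any) (*sql.Rows, error) {
    n := atomic.AddInt64(&q.count, 1)
    if q.verbose {
        log.Printf("query #%d: %s", n, query)
    }
    return q.db.QueryContext(ctx, query, args...)
}

// QueryRowContext wraps sql.DB.QueryRowContext the same way.
func (q *QueryLogger) QueryRowContext(ctx context.Context, query string, args ...any) *sql.Row {
    n := atomic.AddInt64(&q.count, 1)
    if q.verbose {
        log.Printf("query #%d: %s", n, query)
    }
    return q.db.QueryRowContext(ctx, query, args...)
}
```

Hand it to the repository instead of a raw `*sql.DB`: in development you get a line per query, and in tests you can assert `Count()` stays within the endpoint’s budget.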
These habits won’t eliminate every performance problem, but they’ll catch the majority of N+1 issues.
The 10 % That Slip Through
If you’ve handled N+1 differently, or think there’s a better approach, I’d love to hear it.
Happy reading. Go check your query logs — I’ll wait. ❤️