We Started with Lambdas. Here's What Broke.
Source: Dev.to
Lambdas seemed perfect for AI workloads: single‑purpose functions, automatic scaling, pay only for what you use. We built 7 of them before realizing our mistake.
First Lambda – Document Summarizer
```typescript
import { APIGatewayProxyHandler } from 'aws-lambda';
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

export const handler: APIGatewayProxyHandler = async (event) => {
  try {
    const { document } = JSON.parse(event.body || '{}');

    const response = await openai.chat.completions.create({
      model: 'gpt-4',
      messages: [
        {
          role: 'system',
          content: 'Summarize the following document in 2-3 sentences.'
        },
        {
          role: 'user',
          content: document
        }
      ],
      max_tokens: 150
    });

    return {
      statusCode: 200,
      body: JSON.stringify({
        summary: response.choices[0].message.content
      })
    };
  } catch (error) {
    return {
      statusCode: 500,
      // caught values are `unknown` in strict TypeScript, so narrow first
      body: JSON.stringify({ error: (error as Error).message })
    };
  }
};
```
Clean. Simple. It worked great… until it didn’t.
The 29‑Second Wall
Our first major problem hit when we built an agent that could analyze complex documents. The agent needed to:
- Extract text from the document
- Analyze for key themes
- Generate tags
- Create a summary
- Suggest related assets
Each step took 3–7 seconds. Total runtime: ~25 seconds. Within Lambda’s 15‑minute limit, right?
Wrong.
```
2024-02-15 14:32:18 START RequestId: abc-123-def
2024-02-15 14:32:18 Calling OpenAI for document analysis...
2024-02-15 14:32:25 Analysis complete, generating tags...
2024-02-15 14:32:32 OpenAI inference still running...
2024-02-15 14:32:47 ERROR Task timed out after 29.00 seconds
```
API Gateway has a 29‑second timeout – not Lambda. If you expose the function through API Gateway (which you probably are), you hit the wall at 29 seconds.
When this timeout hits:
- The client receives a 504 Gateway Timeout
- Lambda keeps running and burns money
- OpenAI/Bedrock calls finish but results are lost
- Users see failed requests
- You are charged for the full Lambda execution time
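The failure mode above can be sketched in plain TypeScript. This is an illustration, not AWS code: `gatewayInvoke` stands in for API Gateway racing the handler against a hard deadline, and `slowInference` stands in for a model call that outlives it. Timings are shortened for the demo.

```typescript
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Stands in for a slow AI call that outlives the gateway timeout.
async function slowInference(): Promise<string> {
  await sleep(50); // imagine 45 seconds
  return 'summary text';
}

// Stands in for API Gateway: race the handler against a hard deadline.
async function gatewayInvoke(
  handler: () => Promise<string>,
  timeoutMs: number
): Promise<{ status: number; body: string }> {
  const timeout: Promise<never> = sleep(timeoutMs).then(() => {
    throw new Error('504 Gateway Timeout');
  });
  timeout.catch(() => {}); // swallow the late rejection when the handler wins
  try {
    const body = await Promise.race([handler(), timeout]);
    return { status: 200, body };
  } catch {
    // The client sees a 504, but slowInference() keeps running to completion;
    // its result is simply thrown away -- and you still pay for the runtime.
    return { status: 504, body: '' };
  }
}
```

The key point is in the `catch` branch: the timeout resolves the race, but nothing cancels the in‑flight work.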
We lost 30% of our complex‑agent requests to timeouts. Users thought our AI was broken. It wasn’t – it was just slow.
Streaming? Not from Lambda
Our users wanted real‑time chat responses, like ChatGPT’s streaming interface. We tried to implement streaming:
```typescript
export const handler: APIGatewayProxyHandler = async (event) => {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [/* ... */],
    stream: true
  });

  // This is where it breaks
  for await (const chunk of stream) {
    // How do you stream through API Gateway?
    // You can't.
  }
};
```
API Gateway buffers the entire Lambda response before sending it to the client. There’s no way to stream partial data; the client won’t see anything until the Lambda finishes.
Work‑around: WebSockets, but that means:
- Separate WebSocket API Gateway
- Connection management
- Message routing
- State tracking
- Much more complexity
We tried it; the code ballooned to 3× the size for a simple streaming response.
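To make the buffering problem concrete, here is a small standalone sketch (no AWS APIs involved): `modelChunks` simulates a streaming model response, `bufferedDelivery` behaves like API Gateway (one flush at the very end), and `streamedDelivery` is what users actually want.

```typescript
// Simulates a model emitting output incrementally.
async function* modelChunks(): AsyncGenerator<string> {
  for (const chunk of ['The ', 'quick ', 'summary.']) {
    yield chunk; // a real stream would yield tokens as the model produces them
  }
}

// What API Gateway effectively does: collect everything, then send once.
async function bufferedDelivery(onData: (s: string) => void): Promise<void> {
  let full = '';
  for await (const chunk of modelChunks()) full += chunk;
  onData(full); // the client sees nothing until this single flush
}

// What a streaming UI needs: forward each chunk as it arrives.
async function streamedDelivery(onData: (s: string) => void): Promise<void> {
  for await (const chunk of modelChunks()) onData(chunk);
}
```

With buffering the client gets exactly one callback; with streaming it gets one per chunk, which is what makes a chat UI feel alive.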
Cold Starts from Hell
AI SDKs are heavy. Here’s what we imported:
```typescript
import OpenAI from 'openai';                                            // 2.1 MB
import { BedrockRuntimeClient } from '@aws-sdk/client-bedrock-runtime'; // 1.8 MB
import Anthropic from '@anthropic-ai/sdk';                              // 1.9 MB
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';              // 1.2 MB
import PDFParse from 'pdf-parse';                                       // 0.9 MB
```
Total bundle size: ~8 MB
Cold‑start time: 8–12 seconds
When a Lambda hasn’t run for 5+ minutes, AWS creates a new container. Container startup + code initialization = users wait 10+ seconds for the first response.
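One mitigation worth sketching (we show the pattern generically; `loadHeavyClient` below is a stand‑in for a dynamic `await import('openai')`): initialize heavy SDK clients lazily and cache the promise at module scope, so only the first invocation of a warm container pays the load cost, and code paths that never touch a given SDK never load it at all.

```typescript
type Loader<T> = () => Promise<T>;

// Wrap a loader so it runs at most once per container lifetime.
function lazy<T>(loader: Loader<T>): Loader<T> {
  let cached: Promise<T> | undefined;
  return () => {
    // First call triggers the load; later calls reuse the same promise,
    // so warm invocations skip initialization entirely.
    if (!cached) cached = loader();
    return cached;
  };
}

let loads = 0;

// Stand-in for a heavy dynamic import; in a real Lambda this is where
// the multi-megabyte SDK bundle would get parsed and initialized.
const getClient = lazy(async () => {
  loads++;
  return { name: 'heavy-ai-client' };
});
```

This doesn’t eliminate cold starts, but it keeps each function’s startup cost proportional to what it actually uses.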
Because our AI functions were used sporadically, cold starts happened constantly:
- Document analysis: ~20 requests / hour
- Image classification: 5–10 requests / hour
- Content generation: 1–2 requests / hour
Each function went cold multiple times per day. Users would upload a document, wait ~12 seconds, and think the platform was broken.
We tried Provisioned Concurrency. It helped but cost ≈ $50 / month per function just to keep them warm. For 7 functions that’s ≈ $350 / month before processing a single request.
No Shared State
Multi‑turn conversations were impossible. Here’s what we attempted:
```typescript
// Turn 1: User asks about a document
export const chatHandler: APIGatewayProxyHandler = async (event) => {
  const { message, conversationId } = JSON.parse(event.body || '{}');

  // Get conversation history from DynamoDB
  const history = await getConversationHistory(conversationId);

  const response = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [
      ...history,
      { role: 'user', content: message }
    ]
  });

  // Save new messages to DynamoDB
  await saveMessage(conversationId, 'user', message);
  await saveMessage(conversationId, 'assistant', response.choices[0].message.content);

  return {
    statusCode: 200,
    body: JSON.stringify({ response: response.choices[0].message.content })
  };
};
```
Every request required:
- DynamoDB read to fetch conversation history
- AI inference
- Two DynamoDB writes to persist the exchange
For a 3‑turn conversation that’s 3 reads + 6 writes, adding noticeable latency and cost.
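The arithmetic above generalizes to a one‑line helper: with no shared state, every turn costs one history read plus two writes (the user message and the assistant reply), so the operation count grows linearly with conversation length.

```typescript
// DynamoDB operations required for an n-turn conversation when every
// Lambda invocation must reload and re-persist the full exchange.
function dynamoOpsForConversation(turns: number): { reads: number; writes: number } {
  return {
    reads: turns,      // one history fetch per turn
    writes: 2 * turns, // user message + assistant reply per turn
  };
}
```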
Cost Spikes That Hurt
Lambda billing is per‑millisecond, but AI inference has unpredictable latency:
- Simple questions: 2–3 seconds
- Complex analysis: 15–25 seconds
- Code generation: 10–30 seconds
- Image analysis: 5–20 seconds
Cost breakdown for one expensive month
```
Document Summarizer:  1,200 requests × 8 s avg   = 2.7 h = $180
Image Classifier:       800 requests × 12 s avg  = 2.7 h = $180
Content Generator:      400 requests × 18 s avg  = 2.0 h = $135
Chat Agent:           2,000 requests × 15 s avg  = 8.3 h = $560
Tag Suggester:        3,000 requests × 5 s avg   = 4.2 h = $280
PDF Analyzer:           200 requests × 22 s avg  = 1.2 h = $80
Report Builder:         100 requests × 35 s avg  = 1.0 h = $65
---------------------------------------------------------------
Total:                                                   $1,480
```
We were paying Lambda compute costs for AI “thinking” time. A 20‑second GPT‑4 call that actually uses only 50 ms of CPU still costs us for the full 20 seconds of Lambda runtime.
Compare that to a long‑running container that can handle multiple requests while a single AI call is processing – far more cost‑efficient.
The worst part? Peak usage amplified the problem. During business hours we had 50+ concurrent Lambda executions waiting for AI responses. Each one burned money while the actual compute happened on OpenAI’s servers. It felt like paying for a taxi stuck in traffic – you’re paying for time, not progress.
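The "taxi stuck in traffic" problem is easy to put in numbers. Using the figures from above (a 20‑second billed invocation doing roughly 50 ms of local work), the fraction of billed time that is actual compute is tiny:

```typescript
// Fraction of billed Lambda time that is actual local compute,
// as opposed to waiting on a remote model API.
function computeUtilization(cpuMs: number, billedMs: number): number {
  return cpuMs / billedMs;
}

// 50 ms of CPU inside a 20,000 ms billed invocation:
const utilization = computeUtilization(50, 20_000); // 0.0025, i.e. 0.25%
```

At 0.25% utilization, a long‑running container that interleaves many in‑flight requests over the same idle wait is the obvious alternative.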
Multi‑Turn Agent Loops
The final straw was building an agent that could help users organize their assets. The workflow:
- User: “Help me organize my product photos.”
- Agent: Analyzes available photos, asks clarifying questions.
- User: Provides criteria.
- Agent: Suggests folder structure.
- User: Approves or requests changes.
- Agent: Executes the organization.
Each step was a separate Lambda invocation. The state management looked like this:
```typescript
// Step 1: Initial request (Lambda invocation #1)
await saveToDynamoDB(sessionId, {
  step: 'analyzing',
  photos: userPhotos,
  status: 'in_progress'
});

// Step 2: Agent response (invocation #2 — reloads everything from scratch)
const session = await getFromDynamoDB(sessionId);
const result = await openai.chat.completions.create(/* ... */);
await saveToDynamoDB(sessionId, {
  ...session,
  step: 'awaiting_criteria',
  analysis: result
});

// Step 3: User provides criteria (invocation #3 — reloads again)
const nextSession = await getFromDynamoDB(sessionId);
// ... and so on
```
By step 6 we had 12+ DynamoDB operations, 6 Lambda invocations, and a conversation context that was getting expensive to load each time.
The user experience was clunky because every step required a new HTTP request. No persistent connection, no real‑time updates, no streaming—just request‑response cycles that felt broken compared to ChatGPT.
“This feels like software from 2010,” said our head of product after trying the workflow once. He wasn’t wrong.
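What we were effectively hand‑rolling across those DynamoDB writes was a state machine. Here is a minimal sketch of it in one place; the step names mirror the snippet above, but the transition table is illustrative, not our production schema.

```typescript
type Step =
  | 'analyzing'
  | 'awaiting_criteria'
  | 'suggesting'
  | 'awaiting_approval'
  | 'executing'
  | 'done';

// Linear happy-path transitions for the photo-organization workflow.
const next: Record<Step, Step> = {
  analyzing: 'awaiting_criteria',
  awaiting_criteria: 'suggesting',
  suggesting: 'awaiting_approval',
  awaiting_approval: 'executing',
  executing: 'done',
  done: 'done',
};

interface Session {
  step: Step;
  history: string[]; // reloaded from storage on every Lambda invocation
}

// Each Lambda invocation does roughly this: load, append, advance, save.
function advance(session: Session, event: string): Session {
  return { step: next[session.step], history: [...session.history, event] };
}
```

In a long‑running process this state lives in memory for the life of the conversation; in Lambda, every `advance` call is bracketed by a read and a write.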
The Breaking Point
Our Lambda‑based AI platform had fundamental problems:
- 29‑second timeout killed complex workflows
- No streaming made chat feel broken
- Cold starts added 10+ second delays
- Cost inefficiency from paying for AI wait time
- State‑management complexity made agents painful
- Integration sprawl across 7 different functions
We were spending more time fighting infrastructure than building features. Users complained about slow responses, and our AWS bill kept climbing.
Lambdas: Perfect AI Tools, Terrible AI Agents
Tools are single‑purpose, stateless, and fast:
- Classify an image
- Summarize a document
- Extract text from a PDF
- Generate alt text
Agents are multi‑turn, stateful, and complex:
- Help me organize photos
- Analyze data and create a report
- Chat about my documents
- Build a workflow based on conversation
For tools, Lambda is ideal. For agents, you need persistent connections, shared state, and streaming—something Lambda fights at every step.
What We Built Instead
We created a gateway: a single API endpoint that can handle both tools and agents, with proper streaming, state management, and vendor flexibility.
Architecture overview
- API Gateway → routes to a lightweight Lambda (gateway logic)
- Gateway Lambda → proxies requests to long‑running containers that perform the actual AI processing
This gives us the best of both worlds: serverless scaling for the API layer and persistent connections for AI workloads.
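The routing idea at the heart of the gateway can be sketched in a few lines (the full version comes in the next article; the type and backend names here are illustrative, not our actual API):

```typescript
interface AIRequest {
  kind: 'tool' | 'agent'; // single-shot stateless vs multi-turn stateful
  streaming: boolean;     // does the client need incremental output?
}

type Backend = 'lambda-tool' | 'container-agent';

// Anything multi-turn or streaming needs a persistent process;
// everything else stays on cheap, scale-to-zero functions.
function route(req: AIRequest): Backend {
  if (req.kind === 'agent' || req.streaming) return 'container-agent';
  return 'lambda-tool';
}
```

This split is exactly the tools-versus-agents distinction from the previous section, encoded as a dispatch rule.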
In the next article I’ll walk through the gateway pattern and show how we unified seven different AI Lambdas into one clean API that works with any model provider.
This is part 2 of an 8‑part series on building a production AI platform. You can find the complete code examples at .