We Started with Lambdas. Here's What Broke.
Source: Dev.to
Lambdas seemed perfect for AI workloads: single‑purpose functions, automatic scaling, pay only for what you use. We built 7 of them before realizing our mistake.
First Lambda – Document Summarizer
```typescript
import { APIGatewayProxyHandler } from 'aws-lambda';
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

export const handler: APIGatewayProxyHandler = async (event) => {
  try {
    const { document } = JSON.parse(event.body || '{}');

    const response = await openai.chat.completions.create({
      model: 'gpt-4',
      messages: [
        {
          role: 'system',
          content: 'Summarize the following document in 2-3 sentences.'
        },
        {
          role: 'user',
          content: document
        }
      ],
      max_tokens: 150
    });

    return {
      statusCode: 200,
      body: JSON.stringify({
        summary: response.choices[0].message.content
      })
    };
  } catch (error) {
    return {
      statusCode: 500,
      // caught values are `unknown` in strict TypeScript, so narrow first
      body: JSON.stringify({ error: (error as Error).message })
    };
  }
};
```
Clean. Simple. It worked great… until it didn’t.
The 29‑Second Wall
Our first major problem hit when we built an agent that could analyze complex documents. The agent needed to:
- Extract text from the document
- Analyze for key themes
- Generate tags
- Create a summary
- Suggest related assets
Each step took 3–7 seconds. Total runtime: ~25 seconds. Within Lambda’s 15‑minute limit, right?
Wrong.
```
2024-02-15 14:32:18 START RequestId: abc-123-def
2024-02-15 14:32:18 Calling OpenAI for document analysis...
2024-02-15 14:32:25 Analysis complete, generating tags...
2024-02-15 14:32:32 OpenAI inference still running...
2024-02-15 14:32:47 ERROR Task timed out after 29.00 seconds
```
API Gateway has a 29‑second timeout – not Lambda. If you expose the function through API Gateway (which you probably are), you hit the wall at 29 seconds.
When this timeout hits:
- The client receives a 504 Gateway Timeout
- Lambda keeps running and burns money
- OpenAI/Bedrock calls finish but results are lost
- Users see failed requests
- You are charged for the full Lambda execution time
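The failure mode above can be sketched in plain TypeScript. This is an illustration, not AWS code: `gatewayInvoke` stands in for API Gateway racing the handler against a hard deadline, and `slowInference` stands in for a model call that outlives it. Timings are shortened for the demo.

```typescript
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Stands in for a slow AI call that outlives the gateway timeout.
async function slowInference(): Promise<string> {
  await sleep(50); // imagine 45 seconds
  return 'summary text';
}

// Stands in for API Gateway: race the handler against a hard deadline.
async function gatewayInvoke(
  handler: () => Promise<string>,
  timeoutMs: number
): Promise<{ status: number; body: string }> {
  const timeout: Promise<never> = sleep(timeoutMs).then(() => {
    throw new Error('504 Gateway Timeout');
  });
  timeout.catch(() => {}); // swallow the late rejection when the handler wins
  try {
    const body = await Promise.race([handler(), timeout]);
    return { status: 200, body };
  } catch {
    // The client sees a 504, but slowInference() keeps running to completion;
    // its result is simply thrown away -- and you still pay for the runtime.
    return { status: 504, body: '' };
  }
}
```

The key point is in the `catch` branch: the timeout resolves the race, but nothing cancels the in‑flight work.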
We lost 30% of our complex‑agent requests to timeouts. Users thought our AI was broken. It wasn’t – it was just slow.
Streaming? Not from Lambda
Our users wanted real‑time chat responses, like ChatGPT’s streaming interface. We tried to implement streaming:
```typescript
export const handler: APIGatewayProxyHandler = async (event) => {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [/* ... */],
    stream: true
  });

  // This is where it breaks
  for await (const chunk of stream) {
    // How do you stream through API Gateway?
    // You can't.
  }
};
```
API Gateway buffers the entire Lambda response before sending it to the client. There’s no way to stream partial data; the client won’t see anything until the Lambda finishes.
Work‑around: WebSockets, but that means:
- Separate WebSocket API Gateway
- Connection management
- Message routing
- State tracking
- Much more complexity
We tried it; the code ballooned to 3× the size for a simple streaming response.
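To make the buffering problem concrete, here is a small standalone sketch (no AWS APIs involved): `modelChunks` simulates a streaming model response, `bufferedDelivery` behaves like API Gateway (one flush at the very end), and `streamedDelivery` is what users actually want.

```typescript
// Simulates a model emitting output incrementally.
async function* modelChunks(): AsyncGenerator<string> {
  for (const chunk of ['The ', 'quick ', 'summary.']) {
    yield chunk; // a real stream would yield tokens as the model produces them
  }
}

// What API Gateway effectively does: collect everything, then send once.
async function bufferedDelivery(onData: (s: string) => void): Promise<void> {
  let full = '';
  for await (const chunk of modelChunks()) full += chunk;
  onData(full); // the client sees nothing until this single flush
}

// What a streaming UI needs: forward each chunk as it arrives.
async function streamedDelivery(onData: (s: string) => void): Promise<void> {
  for await (const chunk of modelChunks()) onData(chunk);
}
```

With buffering the client gets exactly one callback; with streaming it gets one per chunk, which is what makes a chat UI feel alive.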
Cold Starts from Hell
AI SDKs are heavy. Here’s what we imported:
```typescript
import OpenAI from 'openai';                                            // 2.1 MB
import { BedrockRuntimeClient } from '@aws-sdk/client-bedrock-runtime'; // 1.8 MB
import Anthropic from '@anthropic-ai/sdk';                              // 1.9 MB
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';              // 1.2 MB
import PDFParse from 'pdf-parse';                                       // 0.9 MB
```
Total bundle size: ~8 MB
Cold‑start time: 8–12 seconds
When a Lambda hasn’t run for 5+ minutes, AWS creates a new container. Container startup + code initialization = users wait 10+ seconds for the first response.
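One mitigation worth sketching (we show the pattern generically; `loadHeavyClient` below is a stand‑in for a dynamic `await import('openai')`): initialize heavy SDK clients lazily and cache the promise at module scope, so only the first invocation of a warm container pays the load cost, and code paths that never touch a given SDK never load it at all.

```typescript
type Loader<T> = () => Promise<T>;

// Wrap a loader so it runs at most once per container lifetime.
function lazy<T>(loader: Loader<T>): Loader<T> {
  let cached: Promise<T> | undefined;
  return () => {
    // First call triggers the load; later calls reuse the same promise,
    // so warm invocations skip initialization entirely.
    if (!cached) cached = loader();
    return cached;
  };
}

let loads = 0;

// Stand-in for a heavy dynamic import; in a real Lambda this is where
// the multi-megabyte SDK bundle would get parsed and initialized.
const getClient = lazy(async () => {
  loads++;
  return { name: 'heavy-ai-client' };
});
```

This doesn’t eliminate cold starts, but it keeps each function’s startup cost proportional to what it actually uses.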
Because our AI functions were used sporadically, cold starts happened constantly:
- Document analysis: ~20 requests / hour
- Image classification: 5–10 requests / hour
- Content generation: 1–2 requests / hour
Each function went cold multiple times per day. Users would upload a document, wait ~12 seconds, and think the platform was broken.
We tried Provisioned Concurrency. It helped but cost ≈ $50 / month per function just to keep them warm. For 7 functions that’s ≈ $350 / month before processing a single request.
No Shared State
Multi‑turn conversations were impossible. Here’s what we attempted:
```typescript
// Turn 1: User asks about a document
export const chatHandler: APIGatewayProxyHandler = async (event) => {
  const { message, conversationId } = JSON.parse(event.body || '{}');

  // Get conversation history from DynamoDB
  const history = await getConversationHistory(conversationId);

  const response = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [
      ...history,
      { role: 'user', content: message }
    ]
  });

  // Save new messages to DynamoDB
  await saveMessage(conversationId, 'user', message);
  await saveMessage(conversationId, 'assistant', response.choices[0].message.content);

  return {
    statusCode: 200,
    body: JSON.stringify({ response: response.choices[0].message.content })
  };
};
```
Every request required:
- DynamoDB read to fetch conversation history
- AI inference
- Two DynamoDB writes to persist the exchange
For a 3‑turn conversation that’s 3 reads + 6 writes, adding noticeable latency and cost.
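The arithmetic above generalizes to a one‑line helper: with no shared state, every turn costs one history read plus two writes (the user message and the assistant reply), so the operation count grows linearly with conversation length.

```typescript
// DynamoDB operations required for an n-turn conversation when every
// Lambda invocation must reload and re-persist the full exchange.
function dynamoOpsForConversation(turns: number): { reads: number; writes: number } {
  return {
    reads: turns,      // one history fetch per turn
    writes: 2 * turns, // user message + assistant reply per turn
  };
}
```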
Cost Spikes That Hurt
Lambda billing is per‑millisecond, but AI inference has unpredictable latency:
- Simple questions: 2–3 seconds
- Complex analysis: 15–25 seconds
- Code generation: 10–30 seconds
- Image analysis: 5–20 seconds
Cost breakdown for one expensive month
```
Document Summarizer:  1,200 requests × 8 s avg   = 2.7 h = $180
Image Classifier:       800 requests × 12 s avg  = 2.7 h = $180
Content Generator:      400 requests × 18 s avg  = 2.0 h = $135
Chat Agent:           2,000 requests × 15 s avg  = 8.3 h = $560
Tag Suggester:        3,000 requests × 5 s avg   = 4.2 h = $280
PDF Analyzer:           200 requests × 22 s avg  = 1.2 h = $80
Report Builder:         100 requests × 35 s avg  = 1.0 h = $65
---------------------------------------------------------------
Total:                                                   $1,480
```
We were paying Lambda compute costs for AI “thinking” time. A 20‑second GPT‑4 call that actually uses only 50 ms of CPU still costs us for the full 20 seconds of Lambda runtime.
Compare that to a long‑running container that can handle multiple requests while a single AI call is processing – far more cost‑efficient.
The worst part? Peak usage amplified the problem. During business hours we had 50+ concurrent Lambda executions waiting for AI responses. Each one burned money while the actual compute happened on OpenAI’s servers. It felt like paying for a taxi stuck in traffic – you’re paying for time, not progress.
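The "taxi stuck in traffic" problem is easy to put in numbers. Using the figures from above (a 20‑second billed invocation doing roughly 50 ms of local work), the fraction of billed time that is actual compute is tiny:

```typescript
// Fraction of billed Lambda time that is actual local compute,
// as opposed to waiting on a remote model API.
function computeUtilization(cpuMs: number, billedMs: number): number {
  return cpuMs / billedMs;
}

// 50 ms of CPU inside a 20,000 ms billed invocation:
const utilization = computeUtilization(50, 20_000); // 0.0025, i.e. 0.25%
```

At 0.25% utilization, a long‑running container that interleaves many in‑flight requests over the same idle wait is the obvious alternative.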
Multi‑Turn Agent Loops
The final straw was building an agent that could help users organize their assets. The workflow:
- User: “Help me organize my product photos.”
- Agent: Analyzes available photos, asks clarifying questions.
- User: Provides criteria.
- Agent: Suggests folder structure.
- User: Approves or requests changes.
- Agent: Executes the organization.
Each step was a separate Lambda invocation. The state management looked like this:
```typescript
// Step 1: Initial request (Lambda invocation #1)
await saveToDynamoDB(sessionId, {
  step: 'analyzing',
  photos: userPhotos,
  status: 'in_progress'
});

// Step 2: Agent response (invocation #2 — reloads everything from scratch)
const session = await getFromDynamoDB(sessionId);
const result = await openai.chat.completions.create(/* ... */);
await saveToDynamoDB(sessionId, {
  ...session,
  step: 'awaiting_criteria',
  analysis: result
});

// Step 3: User provides criteria (invocation #3 — reloads again)
const nextSession = await getFromDynamoDB(sessionId);
// ... and so on
```
By step 6 we had 12+ DynamoDB operations, 6 Lambda invocations, and a conversation context that was getting expensive to load each time.
The user experience was clunky because every step required a new HTTP request. No persistent connection, no real‑time updates, no streaming—just request‑response cycles that felt broken compared to ChatGPT.
“This feels like software from 2010,” said our head of product after trying the workflow once. He wasn’t wrong.
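What we were effectively hand‑rolling across those DynamoDB writes was a state machine. Here is a minimal sketch of it in one place; the step names mirror the snippet above, but the transition table is illustrative, not our production schema.

```typescript
type Step =
  | 'analyzing'
  | 'awaiting_criteria'
  | 'suggesting'
  | 'awaiting_approval'
  | 'executing'
  | 'done';

// Linear happy-path transitions for the photo-organization workflow.
const next: Record<Step, Step> = {
  analyzing: 'awaiting_criteria',
  awaiting_criteria: 'suggesting',
  suggesting: 'awaiting_approval',
  awaiting_approval: 'executing',
  executing: 'done',
  done: 'done',
};

interface Session {
  step: Step;
  history: string[]; // reloaded from storage on every Lambda invocation
}

// Each Lambda invocation does roughly this: load, append, advance, save.
function advance(session: Session, event: string): Session {
  return { step: next[session.step], history: [...session.history, event] };
}
```

In a long‑running process this state lives in memory for the life of the conversation; in Lambda, every `advance` call is bracketed by a read and a write.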
The Breaking Point
Our Lambda‑based AI platform had fundamental problems:
- 29‑second timeout killed complex workflows
- No streaming made chat feel broken
- Cold starts added 10+ second delays
- Cost inefficiency from paying for AI wait time
- State‑management complexity made agents painful
- Integration sprawl across 7 different functions
We were spending more time fighting infrastructure than building features. Users complained about slow responses, and our AWS bill kept climbing.
Lambdas: Perfect AI Tools, Terrible AI Agents
Tools are single‑purpose, stateless, and fast:
- Classify an image
- Summarize a document
- Extract text from a PDF
- Generate alt text
Agents are multi‑turn, stateful, and complex:
- Help me organize photos
- Analyze data and create a report
- Chat about my documents
- Build a workflow based on conversation
For tools, Lambda is ideal. For agents, you need persistent connections, shared state, and streaming—something Lambda fights at every step.
What We Built Instead
We created a gateway: a single API endpoint that can handle both tools and agents, with proper streaming, state management, and vendor flexibility.
Architecture overview
- API Gateway → routes to a lightweight Lambda (gateway logic)
- Gateway Lambda → proxies requests to long‑running containers that perform the actual AI processing
This gives us the best of both worlds: serverless scaling for the API layer and persistent connections for AI workloads.
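The routing idea at the heart of the gateway can be sketched in a few lines (the full version comes in the next article; the type and backend names here are illustrative, not our actual API):

```typescript
interface AIRequest {
  kind: 'tool' | 'agent'; // single-shot stateless vs multi-turn stateful
  streaming: boolean;     // does the client need incremental output?
}

type Backend = 'lambda-tool' | 'container-agent';

// Anything multi-turn or streaming needs a persistent process;
// everything else stays on cheap, scale-to-zero functions.
function route(req: AIRequest): Backend {
  if (req.kind === 'agent' || req.streaming) return 'container-agent';
  return 'lambda-tool';
}
```

This split is exactly the tools-versus-agents distinction from the previous section, encoded as a dispatch rule.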
In the next article I’ll walk through the gateway pattern and show how we unified seven different AI Lambdas into one clean API that works with any model provider.
This is part 2 of an 8‑part series on building a production AI platform. You can find the complete code examples at .