DEV Track Spotlight: The Art of Embracing Failures in Serverless Architectures (DEV312)
Source: Dev.to
Serverless architectures promise simplicity wrapped in the immense power of distributed systems. But as Anahit Pogosova, AWS Data Hero and Lead Cloud Architect at F‑Secure, reminded us in her DEV312 session, that simplicity is an illusion.
“Serverless managed services are a step up in the abstraction ladder.
They make the underlying infrastructure seem almost invisible, almost magical.
But by using serverless services, we didn’t just magically teleport to a different reality.
We are still living in the very same messy physical world with all its underlying complexities.”
Her session took us on a journey through the hidden pitfalls of distributed systems, armed with real‑world war stories and practical strategies for building resilience.
Watch the full session:
(embed or link to the video here)
The False Sense of Security
The serverless abstraction layer creates a dangerous illusion. When we pick services, connect them together, and watch everything “just work,” we might forget about the distributed‑systems complexity lurking beneath. As Anahit put it:
“A serverless architecture is one in which the failure of the computer you definitely didn’t know was there can render your entire architecture unusable.”
This higher level of abstraction makes spotting potential issues harder because the failures are abstracted away from us too. But those failures didn’t go anywhere—they’re still embedded in the underlying distributed system, waiting to manifest.
A real‑world story
Anahit shared her experience building a near‑real‑time data‑streaming architecture at scale. The setup seemed simple:
- Producer → Amazon Kinesis Data Streams
- Consumer → AWS Lambda (processes the records)
It worked perfectly—until they realized they were losing data and had no idea it was happening.
The three interconnected issues
1. Unconfigured timeouts – The JavaScript SDK’s default timeout is effectively infinite in v3 (it was two minutes in SDK v2). With no sensible timeout configured, requests to Kinesis that hit network glitches simply hung; the producer exhausted its resources waiting on them and became incapable of processing new incoming data.
2. Unhandled partial failures – Batch operations like Kinesis PutRecords aren’t atomic. Part of a batch might succeed while the rest fails (e.g., when hitting shard limits during a traffic spike). The call still returns success, so it’s your responsibility to detect and retry the failed records (see the sketch after this list).
3. Default retry behavior – When Lambda failed to process a bad record, it retried the entire batch indefinitely (until the records expired after 24 hours). One “poison‑pill” record blocked an entire shard, causing cascading data loss as records expired faster than Lambda could catch up.
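To make the partial‑failure point concrete, here’s a minimal sketch using the AWS SDK for JavaScript v3 (this isn’t Anahit’s code – the function name, retry budget, and crude backoff are illustrative): send the batch, inspect the PutRecords response, and re‑send only the entries that were rejected.

```typescript
import {
  KinesisClient,
  PutRecordsCommand,
  PutRecordsRequestEntry,
} from "@aws-sdk/client-kinesis";

const kinesis = new KinesisClient({});

// PutRecords returns HTTP 200 even when some records fail, so the response
// has to be inspected record by record.
async function putRecordsHandlingPartialFailures(
  streamName: string,
  entries: PutRecordsRequestEntry[],
  maxAttempts = 3,
): Promise<void> {
  let pending = entries;

  for (let attempt = 1; attempt <= maxAttempts && pending.length > 0; attempt++) {
    const response = await kinesis.send(
      new PutRecordsCommand({ StreamName: streamName, Records: pending }),
    );

    if (!response.FailedRecordCount) {
      return; // the whole batch was accepted
    }

    // Result entries line up positionally with the request entries;
    // rejected ones carry an ErrorCode such as ProvisionedThroughputExceededException.
    pending = pending.filter((_, i) => response.Records?.[i]?.ErrorCode);

    if (attempt < maxAttempts) {
      // Crude backoff before re-sending (see the jitter helper further down).
      await new Promise((resolve) => setTimeout(resolve, 100 * 2 ** attempt));
    }
  }

  if (pending.length > 0) {
    throw new Error(`${pending.length} record(s) still rejected after ${maxAttempts} attempts`);
  }
}
```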
Anahit called timeouts and retries “hidden super‑powers” because they’re incredibly powerful for resilience—but they can backfire spectacularly if misused.
Best practices for timeouts & retries
- Never blindly trust default timeout values. For AWS SDK requests, configure timeouts that match the service you’re calling and your latency expectations: too long a timeout ties up resources while you wait, too short a timeout triggers premature retries that can overwhelm downstream systems. (A configuration sketch follows Anahit’s quote below.)
“When you go back to your code, please check all the requests that go over the network. Make sure that you know what those timeout values are. Make sure that you are controlling them.” – Anahit
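As a concrete illustration (not code from the session), this is roughly what taking control of those values looks like with the AWS SDK for JavaScript v3 – the millisecond values are placeholders you’d derive from your own latency expectations:

```typescript
import { KinesisClient } from "@aws-sdk/client-kinesis";
// In older SDK v3 releases this handler lives in "@aws-sdk/node-http-handler".
import { NodeHttpHandler } from "@smithy/node-http-handler";

// Explicit, deliberate timeouts instead of whatever the SDK defaults to.
const kinesis = new KinesisClient({
  requestHandler: new NodeHttpHandler({
    connectionTimeout: 1_000, // ms allowed to establish the connection
    requestTimeout: 2_000,    // ms allowed to wait for a response on the socket
  }),
  maxAttempts: 3, // keep the SDK's built-in retries bounded as well
});
```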
- Retries are inherently selfish – a retrying client claims extra capacity from a system that may already be struggling. Poorly implemented retries can amplify small problems into cascading failures that bring entire systems down.
“Retries have brought more distributed systems down than all the other causes together.” – Gregor Hohpe
Key principles for safe retries
- Retry only transient failures – Don’t retry when it won’t help (the downstream system is already overloaded) or when it could do harm (non‑idempotent operations with side effects).
- Set upper limits – Stop retrying when it’s not helping to avoid cascading failures.
- Use exponential backoff with jitter – Back off exponentially between attempts and add random jitter so retries from many clients spread out instead of arriving in synchronized waves; in practice, jitter dramatically increases the chance that a retry succeeds (a sketch follows this list).
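Put together, those principles look something like the helper below – a minimal sketch (the function and parameter names are mine, not from the session) with bounded attempts, a capped exponential delay, and “full jitter”, i.e. sleeping a random duration between zero and the exponential cap:

```typescript
// Generic retry helper: bounded attempts, exponential backoff, full jitter.
async function retryWithJitter<T>(
  operation: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  maxAttempts = 5,
  baseDelayMs = 100,
  maxDelayMs = 10_000,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      // Give up on non-transient errors, or once the attempt budget is spent.
      if (!isTransient(err) || attempt >= maxAttempts) {
        throw err;
      }
      const cap = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
      const sleepMs = Math.random() * cap; // full jitter: anywhere between 0 and the cap
      await new Promise((resolve) => setTimeout(resolve, sleepMs));
    }
  }
}
```

In use, you’d wrap the network call and supply a predicate that recognizes your transient errors, e.g. `retryWithJitter(() => kinesis.send(command), isThrottlingError)` – where `isThrottlingError` is a hypothetical check for throttling or timeout errors in your own code.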
Lambda event‑source mapping – the hidden component
Most developers have never heard of Lambda’s event source mapping, yet it’s critical when using Lambda with Kinesis, DynamoDB Streams, or similar sources. This hidden component reads records, batches them, and invokes your Lambda function.
By default, if Lambda fails to process a batch, it retries indefinitely until the records expire (24 hours by default for Kinesis). One bad record creates a “poison pill” that blocks the entire shard, causing:
- Useless invocations you’re still paying for
- Reprocessing of the same data repeatedly
- Complete shard blockage while retries continue
- Cascading data loss as records expire faster than Lambda can catch up
Configuring event‑source mapping
| Parameter | Purpose | Default |
|---|---|---|
| MaximumRetryAttempts | Limit how many times a failed batch is retried | -1 (infinite) |
| MaximumRecordAge | Discard records older than this age instead of retrying them | -1 (no limit) |
| BisectBatchOnFunctionError | Split failed batches in half to isolate bad records | false |
| DestinationConfig | Route metadata about failed records to SQS or SNS for analysis | none |
| ParallelizationFactor | Process each shard with multiple concurrent invocations (watch Lambda concurrency limits) | 1 |
“Whatever you do, please do not go with the defaults.” – Anahit
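For illustration, here’s roughly what overriding those defaults looks like when wiring a Kinesis event source with the AWS CDK – the stream, function, and dead‑letter queue are assumed to exist elsewhere in the stack, and the specific values are placeholders rather than recommendations:

```typescript
import { Duration } from "aws-cdk-lib";
import * as lambda from "aws-cdk-lib/aws-lambda";
import { KinesisEventSource, SqsDlq } from "aws-cdk-lib/aws-lambda-event-sources";
import * as kinesis from "aws-cdk-lib/aws-kinesis";
import * as sqs from "aws-cdk-lib/aws-sqs";

// Illustrative resources defined elsewhere in the same stack.
declare const stream: kinesis.Stream;
declare const consumerFn: lambda.Function;
declare const failureQueue: sqs.Queue;

consumerFn.addEventSource(
  new KinesisEventSource(stream, {
    startingPosition: lambda.StartingPosition.LATEST,
    retryAttempts: 5,                    // MaximumRetryAttempts: stop retrying eventually
    maxRecordAge: Duration.hours(1),     // MaximumRecordAge: drop records older than this
    bisectBatchOnError: true,            // BisectBatchOnFunctionError: isolate the poison pill
    onFailure: new SqsDlq(failureQueue), // DestinationConfig: keep failed-record metadata for analysis
    parallelizationFactor: 2,            // mind your Lambda concurrency limits
  }),
);
```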
Capacity limits – the reality of “infinite” scalability
Serverless promises scalability, but we often mistake that for infinite scalability. The reality: we share resources with everyone else, and service limits prevent any single user from monopolizing capacity.
- Kinesis shards – writes are limited to 1 MiB or 1,000 records per second per shard.
- Lambda concurrency – Default 1,000 concurrent executions per account/region.
If you hit these limits, requests are throttled and fail. For example, a Kinesis stream with 100 shards using a parallelization factor of 10 consumes your entire Lambda concurrency quota, potentially causing unrelated Lambda functions in the same account to fail.
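One common mitigation – not something covered in the session itself – is to give the stream consumer reserved concurrency, which caps its slice of the shared pool so a runaway backlog can’t starve unrelated functions. A minimal CDK sketch with illustrative names and values:

```typescript
import { Stack } from "aws-cdk-lib";
import * as lambda from "aws-cdk-lib/aws-lambda";

declare const stack: Stack; // the stack that owns the rest of the resources

// Reserved concurrency both guarantees and caps this function's slice of the
// account/region-wide pool (1,000 by default), so the stream consumer can't
// take every available execution slot.
const streamConsumer = new lambda.Function(stack, "StreamConsumer", {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: "index.handler",
  code: lambda.Code.fromAsset("lambda"), // illustrative handler asset
  reservedConcurrentExecutions: 200,
});
```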
Embrace failure as reality
In distributed systems, everything fails—all the time. Designing with that mindset—proper timeouts, bounded retries, and thoughtful configuration—turns serverless from a fragile illusion into a resilient, production‑ready architecture.
Best Practices Recap
- Plan for timeouts – Never trust defaults. Set explicit timeout values that suit your use case.
- Implement safe retries – Only retry transient failures, enforce limits, and use exponential backoff with jitter.
- Handle partial failures – Batch operations aren’t atomic; detect and retry the failed portions.
- Know your service limits – Understand capacity constraints and throttling behavior.
- Configure event‑source mapping – Don’t rely on default settings for Lambda with Kinesis/DynamoDB Streams.
- Be paranoid (in a good way) – As Martin Kleppmann says: “In distributed systems, suspicion, pessimism, and paranoia pay off.”
- Closing advice from Anahit – “Distributed systems and architectures are hard, but they can teach us a valuable skill – to embrace the chaos of the real world. Each failure is an opportunity to do things better, to make our systems even more resilient.”
- Reminder from Dr. Werner Vogels – “Everything fails, all the time.” The best thing we can do is stay calm and be prepared when those failures happen.
About This Post
This post is part of DEV Track Spotlight, a series that highlights the incredible sessions from the AWS re:Invent 2025 Developer Community (DEV) track.
DEV Track Overview
- 60 unique sessions
- 93 speakers from the AWS Community (AWS Heroes, Community Builders, User Group Leaders) plus AWS and Amazon staff
Topics Covered
| Category | Highlights |
|---|---|
| 🤖 GenAI & Agentic AI | Multi‑agent systems, Strands Agents SDK, Amazon Bedrock |
| 🛠️ Developer Tools | Kiro, Kiro CLI, Amazon Q Developer, AI‑driven development |
| 🔒 Security | AI agent security, container security, automated remediation |
| 🏗️ Infrastructure | Serverless, containers, edge computing, observability |
| ⚡ Modernization | Legacy app transformation, CI/CD, feature flags |
| 📊 Data | Amazon Aurora DSQL, real‑time processing, vector databases |
Each post in this series dives deep into one session, sharing:
- Key insights
- Practical takeaways
- Links to the full recordings
Whether you attended re:Invent or are catching up remotely, these sessions showcase the best of our developer community—real code, real demos, and real learnings.
Follow along as we spotlight these amazing sessions and celebrate the speakers who made the DEV track what it was!