DEV Track Spotlight: The Art of Embracing Failures in Serverless Architectures (DEV312)
Source: Dev.to
Serverless architectures promise simplicity wrapped in the immense power of distributed systems. But as Anahit Pogosova, AWS Data Hero and Lead Cloud Architect at F‑Secure, reminded us in her DEV312 session, that simplicity is an illusion.
“Serverless managed services are a step up in the abstraction ladder.
They make the underlying infrastructure seem almost invisible, almost magical.
But by using serverless services, we didn’t just magically teleport to a different reality.
We are still living in the very same messy physical world with all its underlying complexities.”
Her session took us on a journey through the hidden pitfalls of distributed systems, armed with real‑world war stories and practical strategies for building resilience.
Watch the full session:
(embed or link to the video here)
The False Sense of Security
The serverless abstraction layer creates a dangerous illusion. When we pick services, connect them together, and watch everything “just work,” we might forget about the distributed‑systems complexity lurking beneath. As Anahit put it:
“A serverless architecture is one in which the failure of the computer you definitely didn’t know was there can render your entire architecture unusable.”
This higher level of abstraction makes spotting potential issues harder because the failures are abstracted away from us too. But those failures didn’t go anywhere—they’re still embedded in the underlying distributed system, waiting to manifest.
A real‑world story
Anahit shared her experience building a near‑real‑time data‑streaming architecture at scale. The setup seemed simple:
- Producer → Amazon Kinesis Data Streams
- Consumer → AWS Lambda (processes the records)
It worked perfectly—until they realized they were losing data and had no idea it was happening.
The three interconnected issues
1. Unconfigured timeouts – The JavaScript SDK’s default timeout is effectively infinite in v3 (it was two minutes in SDK v2). With no sensible timeout configured, requests to Kinesis that hit network glitches simply hung; the producer exhausted its resources waiting on them and became incapable of processing new incoming data.
2. Unhandled partial failures – Batch operations like Kinesis PutRecords aren’t atomic. Part of a batch might succeed while the rest fails (e.g., when hitting shard limits during a traffic spike). The call still returns success, so it’s your responsibility to detect and retry the failed records (see the sketch after this list).
3. Default retry behavior – When Lambda failed to process a bad record, it retried the entire batch indefinitely (until the records expired after 24 hours). One “poison‑pill” record blocked an entire shard, causing cascading data loss as records expired faster than Lambda could catch up.
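To make the partial‑failure point concrete, here’s a minimal sketch using the AWS SDK for JavaScript v3 (this isn’t Anahit’s code – the function name, retry budget, and crude backoff are illustrative): send the batch, inspect the PutRecords response, and re‑send only the entries that were rejected.

```typescript
import {
  KinesisClient,
  PutRecordsCommand,
  PutRecordsRequestEntry,
} from "@aws-sdk/client-kinesis";

const kinesis = new KinesisClient({});

// PutRecords returns HTTP 200 even when some records fail, so the response
// has to be inspected record by record.
async function putRecordsHandlingPartialFailures(
  streamName: string,
  entries: PutRecordsRequestEntry[],
  maxAttempts = 3,
): Promise<void> {
  let pending = entries;

  for (let attempt = 1; attempt <= maxAttempts && pending.length > 0; attempt++) {
    const response = await kinesis.send(
      new PutRecordsCommand({ StreamName: streamName, Records: pending }),
    );

    if (!response.FailedRecordCount) {
      return; // the whole batch was accepted
    }

    // Result entries line up positionally with the request entries;
    // rejected ones carry an ErrorCode such as ProvisionedThroughputExceededException.
    pending = pending.filter((_, i) => response.Records?.[i]?.ErrorCode);

    if (attempt < maxAttempts) {
      // Crude backoff before re-sending (see the jitter helper further down).
      await new Promise((resolve) => setTimeout(resolve, 100 * 2 ** attempt));
    }
  }

  if (pending.length > 0) {
    throw new Error(`${pending.length} record(s) still rejected after ${maxAttempts} attempts`);
  }
}
```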
Anahit called timeouts and retries “hidden super‑powers” because they’re incredibly powerful for resilience—but they can backfire spectacularly if misused.
Best practices for timeouts & retries
- Never blindly trust default timeout values. For AWS SDK requests, configure timeouts that match the service you’re calling and your latency expectations: too long a timeout ties up resources while you wait, too short a timeout triggers premature retries that can overwhelm downstream systems. (A configuration sketch follows Anahit’s quote below.)
“When you go back to your code, please check all the requests that go over the network. Make sure that you know what those timeout values are. Make sure that you are controlling them.” – Anahit
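As a concrete illustration (not code from the session), this is roughly what taking control of those values looks like with the AWS SDK for JavaScript v3 – the millisecond values are placeholders you’d derive from your own latency expectations:

```typescript
import { KinesisClient } from "@aws-sdk/client-kinesis";
// In older SDK v3 releases this handler lives in "@aws-sdk/node-http-handler".
import { NodeHttpHandler } from "@smithy/node-http-handler";

// Explicit, deliberate timeouts instead of whatever the SDK defaults to.
const kinesis = new KinesisClient({
  requestHandler: new NodeHttpHandler({
    connectionTimeout: 1_000, // ms allowed to establish the connection
    requestTimeout: 2_000,    // ms allowed to wait for a response on the socket
  }),
  maxAttempts: 3, // keep the SDK's built-in retries bounded as well
});
```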
- Retries are inherently selfish – a retrying client claims extra capacity from a system that may already be struggling. Poorly implemented retries can amplify small problems into cascading failures that bring entire systems down.
“Retries have brought more distributed systems down than all the other causes together.” – Gregor Hohpe
Key principles for safe retries
- Retry only transient failures – Don’t retry when it won’t help (the downstream system is already overloaded) or when it could do harm (non‑idempotent operations with side effects).
- Set upper limits – Stop retrying when it’s not helping to avoid cascading failures.
- Use exponential backoff with jitter – Back off exponentially between attempts and add random jitter so retries from many clients spread out instead of arriving in synchronized waves; in practice, jitter dramatically increases the chance that a retry succeeds (a sketch follows this list).
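Put together, those principles look something like the helper below – a minimal sketch (the function and parameter names are mine, not from the session) with bounded attempts, a capped exponential delay, and “full jitter”, i.e. sleeping a random duration between zero and the exponential cap:

```typescript
// Generic retry helper: bounded attempts, exponential backoff, full jitter.
async function retryWithJitter<T>(
  operation: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  maxAttempts = 5,
  baseDelayMs = 100,
  maxDelayMs = 10_000,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      // Give up on non-transient errors, or once the attempt budget is spent.
      if (!isTransient(err) || attempt >= maxAttempts) {
        throw err;
      }
      const cap = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
      const sleepMs = Math.random() * cap; // full jitter: anywhere between 0 and the cap
      await new Promise((resolve) => setTimeout(resolve, sleepMs));
    }
  }
}
```

In use, you’d wrap the network call and supply a predicate that recognizes your transient errors, e.g. `retryWithJitter(() => kinesis.send(command), isThrottlingError)` – where `isThrottlingError` is a hypothetical check for throttling or timeout errors in your own code.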
Lambda event‑source mapping – the hidden component
Most developers have never heard of Lambda’s event source mapping, yet it’s critical when using Lambda with Kinesis, DynamoDB Streams, or similar sources. This hidden component reads records, batches them, and invokes your Lambda function.
By default, if Lambda fails to process a batch, it retries indefinitely until the records expire (24 hours by default for Kinesis). One bad record creates a “poison pill” that blocks the entire shard, causing:
- Useless invocations you’re still paying for
- Reprocessing of the same data repeatedly
- Complete shard blockage while retries continue
- Cascading data loss as records expire faster than Lambda can catch up
Configuring event‑source mapping
| Parameter | Purpose | Default |
|---|---|---|
| MaximumRetryAttempts | Limit how many times a failed batch is retried | -1 (infinite) |
| MaximumRecordAge | Discard records older than this age instead of retrying them | -1 (no limit) |
| BisectBatchOnFunctionError | Split failed batches in half to isolate bad records | false |
| DestinationConfig | Route metadata about failed records to SQS or SNS for analysis | none |
| ParallelizationFactor | Process each shard with multiple concurrent invocations (watch Lambda concurrency limits) | 1 |
“Whatever you do, please do not go with the defaults.” – Anahit
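For illustration, here’s roughly what overriding those defaults looks like when wiring a Kinesis event source with the AWS CDK – the stream, function, and dead‑letter queue are assumed to exist elsewhere in the stack, and the specific values are placeholders rather than recommendations:

```typescript
import { Duration } from "aws-cdk-lib";
import * as lambda from "aws-cdk-lib/aws-lambda";
import { KinesisEventSource, SqsDlq } from "aws-cdk-lib/aws-lambda-event-sources";
import * as kinesis from "aws-cdk-lib/aws-kinesis";
import * as sqs from "aws-cdk-lib/aws-sqs";

// Illustrative resources defined elsewhere in the same stack.
declare const stream: kinesis.Stream;
declare const consumerFn: lambda.Function;
declare const failureQueue: sqs.Queue;

consumerFn.addEventSource(
  new KinesisEventSource(stream, {
    startingPosition: lambda.StartingPosition.LATEST,
    retryAttempts: 5,                    // MaximumRetryAttempts: stop retrying eventually
    maxRecordAge: Duration.hours(1),     // MaximumRecordAge: drop records older than this
    bisectBatchOnError: true,            // BisectBatchOnFunctionError: isolate the poison pill
    onFailure: new SqsDlq(failureQueue), // DestinationConfig: keep failed-record metadata for analysis
    parallelizationFactor: 2,            // mind your Lambda concurrency limits
  }),
);
```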
Capacity limits – the reality of “infinite” scalability
Serverless promises scalability, but we often mistake that for infinite scalability. The reality: we share resources with everyone else, and service limits prevent any single user from monopolizing capacity.
- Kinesis shards – writes are limited to 1 MiB or 1,000 records per second per shard.
- Lambda concurrency – Default 1,000 concurrent executions per account/region.
If you hit these limits, requests are throttled and fail. For example, a Kinesis stream with 100 shards using a parallelization factor of 10 consumes your entire Lambda concurrency quota, potentially causing unrelated Lambda functions in the same account to fail.
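One common mitigation – not something covered in the session itself – is to give the stream consumer reserved concurrency, which caps its slice of the shared pool so a runaway backlog can’t starve unrelated functions. A minimal CDK sketch with illustrative names and values:

```typescript
import { Stack } from "aws-cdk-lib";
import * as lambda from "aws-cdk-lib/aws-lambda";

declare const stack: Stack; // the stack that owns the rest of the resources

// Reserved concurrency both guarantees and caps this function's slice of the
// account/region-wide pool (1,000 by default), so the stream consumer can't
// take every available execution slot.
const streamConsumer = new lambda.Function(stack, "StreamConsumer", {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: "index.handler",
  code: lambda.Code.fromAsset("lambda"), // illustrative handler asset
  reservedConcurrentExecutions: 200,
});
```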
Embrace failure as reality
In distributed systems, everything fails—all the time. Designing with that mindset—proper timeouts, bounded retries, and thoughtful configuration—turns serverless from a fragile illusion into a resilient, production‑ready architecture.
Best Practices Recap
- Plan for timeouts – Never trust defaults. Set explicit timeout values that suit your use case.
- Implement safe retries – Only retry transient failures, enforce limits, and use exponential backoff with jitter.
- Handle partial failures – Batch operations aren’t atomic; detect and retry the failed portions.
- Know your service limits – Understand capacity constraints and throttling behavior.
- Configure event‑source mapping – Don’t rely on default settings for Lambda with Kinesis/DynamoDB Streams.
- Be paranoid (in a good way) – As Martin Kleppmann says: “In distributed systems, suspicion, pessimism, and paranoia pay off.”
- Closing advice from Anahit – “Distributed systems and architectures are hard, but they can teach us a valuable skill – to embrace the chaos of the real world. Each failure is an opportunity to do things better, to make our systems even more resilient.”
- Reminder from Dr. Werner Vogels – “Everything fails, all the time.” The best thing we can do is stay calm and be prepared when those failures happen.
About This Post
This post is part of DEV Track Spotlight, a series that highlights the incredible sessions from the AWS re:Invent 2025 Developer Community (DEV) track.
DEV Track Overview
- 60 unique sessions
- 93 speakers from the AWS Community (AWS Heroes, Community Builders, User Group Leaders) plus AWS and Amazon staff
Topics Covered
| Category | Highlights |
|---|---|
| 🤖 GenAI & Agentic AI | Multi‑agent systems, Strands Agents SDK, Amazon Bedrock |
| 🛠️ Developer Tools | Kiro, Kiro CLI, Amazon Q Developer, AI‑driven development |
| 🔒 Security | AI agent security, container security, automated remediation |
| 🏗️ Infrastructure | Serverless, containers, edge computing, observability |
| ⚡ Modernization | Legacy app transformation, CI/CD, feature flags |
| 📊 Data | Amazon Aurora DSQL, real‑time processing, vector databases |
Each post in this series dives deep into one session, sharing:
- Key insights
- Practical takeaways
- Links to the full recordings
Whether you attended re:Invent or are catching up remotely, these sessions showcase the best of our developer community—real code, real demos, and real learnings.
Follow along as we spotlight these amazing sessions and celebrate the speakers who made the DEV track what it was!