Is your monitoring testing strategy chaos?

Published: January 8, 2026 at 06:28 AM EST
4 min read
Source: Dev.to

Introduction

Nowadays many cloud implementations use serverless architectures (e.g., AWS Lambda and API Gateway) to deliver microservices or other business-logic functionality without managing servers.
This pattern is mature, and we have a wealth of tools and approaches to ensure that our serverless code performs as expected. We can develop and test locally, use pipelines to deploy, and minimise the risk of releasing non-functioning code.

When I work with teams I recommend best practices such as:

  • Deploying Lambdas via CI/CD.
  • Setting log-retention periods.
  • Enabling monitoring to capture errors, failures, or timeouts (see the CLI sketch below).
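
The last two items can be wired up from the CLI. The sketch below is illustrative only; the function name, retention period, alarm threshold, and SNS topic ARN are placeholders, not values from any real deployment:

# Keep the function's logs for 30 days (log groups follow the /aws/lambda/<function> convention)
aws logs put-retention-policy \
  --log-group-name /aws/lambda/my-function \
  --retention-in-days 30

# Alarm on any error in a one-minute window and notify a (placeholder) SNS topic
aws cloudwatch put-metric-alarm \
  --alarm-name my-function-errors \
  --namespace AWS/Lambda \
  --metric-name Errors \
  --dimensions Name=FunctionName,Value=my-function \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 1 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:eu-west-1:123456789012:my-alerts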

Testing code functionality (happy‑path testing) is straightforward, but testing that our monitoring actually captures the events we care about—and that alarms fire when issues are detected—can be more complex.

We should also test the unhappy path: how our system behaves when errors occur. In regulated industries, testing must be reproducible and auditable, which is difficult if we introduce errors manually. Adding test hooks directly into the code (e.g., if TEST then …) pollutes business logic and should be avoided.

Chaos Engineering can help. I’ve previously written about using AWS Fault Injection Simulator (FIS) to deliver “Chaos Engineering as a Service”. While it’s easy to apply chaos to servers (e.g., SSH in, add CPU load, throttle the network), doing so for serverless workloads is less obvious. That changed at re:Invent 2024, when AWS announced new FIS capabilities for Lambda.


New Lambda‑FIS Capabilities

FIS now supports three ways to perturb a Lambda function:

  1. Delay the start of the function.
  2. Force the function to generate an error.
  3. Modify the response returned by the function.

To use these capabilities you must perform four setup steps:

  1. Add a Lambda Layer that lets FIS interact with the Lambda runtime.

  2. Create an S3 bucket for passing configuration and runtime data between FIS and the layer.

  3. Add environment variables to the Lambda configuration:

    AWS_FIS_CONFIGURATION_LOCATION   # S3 bucket (and optional prefix)
    AWS_LAMBDA_EXEC_WRAPPER          # Executable within the layer, e.g. /opt/aws-fis/bootstrap

  4. Grant the Lambda’s execution role permission to read and list the bucket’s contents.

For full details see the AWS documentation.
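
As a rough sketch, steps 2 and 3 might look like this from the AWS CLI. The bucket name, function name, and layer ARN are placeholders, and the exact AWS_FIS_CONFIGURATION_LOCATION format should be checked against the AWS documentation:

# Bucket used by FIS and the layer to exchange configuration and runtime data (placeholder name)
aws s3 mb s3://my-fis-config-bucket

# Attach the FIS extension layer and set the two environment variables.
# Note: --environment replaces any existing variables, so include them all.
aws lambda update-function-configuration \
  --function-name my-function \
  --layers arn:aws:lambda:eu-west-1:123456789012:layer:aws-fis-extension:1 \
  --environment "Variables={AWS_FIS_CONFIGURATION_LOCATION=arn:aws:s3:::my-fis-config-bucket/FisConfigs/,AWS_LAMBDA_EXEC_WRAPPER=/opt/aws-fis/bootstrap}"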

Defining an Experiment Template

Once Lambdas are configured with the FIS layer, you define what to test via an experiment template. A template consists of several components; the two we care about are:

  • Targets – identify the AWS resources to test (e.g., a specific Lambda function or a set of functions matching a tag).
  • Actions – specify the perturbation to apply (delay, error injection, or response modification).

Example Template (YAML)

Description: "Inject latency, errors, and 4xx responses into tagged Lambdas"
Targets:
  MyLambdas:
    ResourceType: "aws:lambda:function"
    ResourceTags:
      my-chaos-tag: "true"
    SelectionMode: "ALL"
Actions:
  # Parameters shown are illustrative; see the AWS FIS documentation for the
  # full parameter list supported by each Lambda action.
  Delay:
    ActionId: "aws:lambda:invocation-add-delay"
    Parameters:
      duration: "PT30S"
    Targets:
      Functions: "MyLambdas"
  Error:
    ActionId: "aws:lambda:invocation-error"
    Parameters:
      duration: "PT2M"
    Targets:
      Functions: "MyLambdas"
  Response:
    ActionId: "aws:lambda:invocation-http-integration-response"
    Parameters:
      duration: "PT2M"
      statusCode: "404"
    Targets:
      Functions: "MyLambdas"

You can then run the experiment against all Lambdas that carry the tag my-chaos-tag=true.
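
Assuming the template has already been created (for example via CloudFormation, as in the repository described later), starting a run from the CLI might look like this; looking the template up by its description is just one convenient option:

# Find the template by its description (assumes it already exists)
TEMPLATE_ID=$(aws fis list-experiment-templates \
  --query "experimentTemplates[?description=='Inject latency, errors, and 4xx responses into tagged Lambdas'].id" \
  --output text)

# Start the experiment and capture its ID
EXPERIMENT_ID=$(aws fis start-experiment \
  --experiment-template-id "$TEMPLATE_ID" \
  --query "experiment.id" \
  --output text)

# Check how the run is progressing
aws fis get-experiment --id "$EXPERIMENT_ID" --query "experiment.state"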

Expected Monitoring Results

After enabling monitoring (e.g., CloudWatch dashboards), you should see patterns similar to those summarised below.

Baseline (no chaos)

+-------------------+-------------------+-------------------+
| Invocations       | Duration (ms)     | Error Count       |
+-------------------+-------------------+-------------------+
| steady line       | flat line         | zero              |
+-------------------+-------------------+-------------------+

During the Experiment

+-------+--------------------------+-----------------------------------------------------+
| Time  | Event                    | Observed Metric Change                              |
+-------+--------------------------+-----------------------------------------------------+
| 09:45 | Delay injected           | Dip in Invocations; spikes in Duration and Latency  |
| 09:55 | Error injected           | Spike in Error Count                                |
| 10:10 | Response changed to 4xx  | Spike in 4xx Error Count                            |
+-------+--------------------------+-----------------------------------------------------+

These changes appear automatically—no code modifications or manual infrastructure tweaks are required—providing a repeatable, auditable experiment.
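
To sanity-check the numbers behind the dashboard, the same metrics can be pulled from the CLI. The function name and time window below are placeholders chosen to bracket the example timeline above:

# Sum of Lambda errors in 5-minute buckets around the experiment window
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Errors \
  --dimensions Name=FunctionName,Value=my-function \
  --start-time 2026-01-08T09:40:00Z \
  --end-time 2026-01-08T10:20:00Z \
  --period 300 \
  --statistics Sum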

Repository & Deployment

I’ve created a GitHub repository that contains everything you need to try this yourself:

  • CloudFormation template that deploys:
    • An example Lambda (pre‑configured with the FIS layer).
    • An API Gateway to invoke the Lambda.
    • A sample CloudWatch dashboard.
    • A FIS experiment template.
  • Instructions for deployment and usage.

🔗 Repository:

When the stack finishes, it outputs the API Gateway URL.
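
Deploying the stack and reading that output from the CLI might look roughly like this; the template file name, stack name, and output key are assumptions rather than values taken from the repository:

# Deploy the example stack (file and stack names are placeholders)
aws cloudformation deploy \
  --template-file template.yaml \
  --stack-name fis-lambda-demo \
  --capabilities CAPABILITY_IAM

# Read the API Gateway URL from the stack outputs (output key is a placeholder)
aws cloudformation describe-stacks \
  --stack-name fis-lambda-demo \
  --query "Stacks[0].Outputs[?OutputKey=='ApiUrl'].OutputValue" \
  --output text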

Simple Load‑Generator Script

#!/usr/bin/env bash
# Continuously call the API Gateway to generate baseline traffic.
# Pass the API Gateway URL from the stack outputs as the first argument.
API_URL="${1:?Usage: $0 <api-gateway-url>}"

while :; do
    curl -s -o /dev/null "$API_URL" &
    sleep 0.5
done

  1. Run the script (passing the API Gateway URL) to establish a baseline.
  2. Start the FIS experiment template.
  3. Observe the dashboard changes described above.

Conclusion

When it comes to monitoring serverless workloads, we should adopt a formal, repeatable testing approach rather than relying on “it’ll be OK” assumptions. Using AWS FIS together with the Lambda layer lets us:

  • Inject latency, errors, and response changes without touching application code.
  • Produce auditable, reproducible chaos experiments.
  • Validate that our monitoring, alerts, and dashboards react as expected.

Give it a try—your compliance auditors (and your on‑call engineers) will thank you!

Introducing Chaos Engineering into Your Testing Process

With the Lambda-specific tests, we can move away from manually tinkering with configuration (or intrusive if TEST then … code blocks) and adopt an approach where chaos engineering is an integral part of our testing process.

Why Take This Approach?

  • Validate our monitoring – Ensure that our dashboards and alerts show us when real issues occur.
  • Audit our resilience – Provide stakeholders with repeatable, documented evidence that our monitoring approach is robust and fit for purpose.
  • Streamline our code – Keep the code focused on business value and reduce unit‑testing overhead.

Embracing chaos lets us demonstrate that our monitoring works and gives teams the visibility they need, when they need it, instead of being woken up at 3:00 AM on a Sunday.

So go ahead, introduce chaos to your testing—your team will thank you for it!
