AWS re:Invent 2025 - Chaos & Continuity: Using Gen AI to improve humanitarian workload resilience

Published: December 4, 2025 at 10:48 PM EST

Source: Dev.to

Introduction

Mike George, Principal Solutions Architect at AWS, works primarily with nonprofit and public‑sector organizations. In this session he explains how to plan for failure so that humanitarian workloads remain operational during disasters. He introduces five key principles for cloud resilience and demonstrates an agentic resilience advisor built with Amazon Bedrock and the Strands Agents SDK. The advisor can automatically assess workloads, generate architecture diagrams, provide observability recommendations, and create operational runbooks based on defined recovery time objective (RTO) and recovery point objective (RPO) targets.

The SEEMS Principles for Cloud Resilience

The acronym SEEMS is a mnemonic for the five categories of failure to consider when designing resilient cloud workloads.

Single Points of Failure

Identify components that have no redundancy.
Ask: If this component fails, does the entire workload go down?
Design for redundancy (multiple instances, AZs, regions) where appropriate.
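
The redundancy question above can be turned into a mechanical check over a workload inventory. The following is a minimal sketch, not something shown in the session: the `Component` type, its fields, and the sample workload are hypothetical, and "redundant" is assumed to mean at least two instances spread over at least two Availability Zones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    instances: int           # number of running copies
    availability_zones: int  # number of distinct AZs those copies span


def single_points_of_failure(components: list[Component]) -> list[str]:
    """Return names of components lacking instance-level or AZ-level redundancy."""
    return [
        c.name
        for c in components
        if c.instances < 2 or c.availability_zones < 2
    ]


# Hypothetical inventory used only for illustration.
workload = [
    Component("web", instances=3, availability_zones=3),
    Component("database", instances=1, availability_zones=1),
    Component("cache", instances=2, availability_zones=1),
]
flagged = single_points_of_failure(workload)
print(flagged)
```

Here the database is flagged because it has a single instance, and the cache because both of its instances sit in one AZ.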

Excessive Load

Ensure the workload can handle traffic spikes and sustained high demand.
Verify service quotas, auto‑scaling policies, and capacity limits.
Consider:
- What could overwhelm this component?
- Can work be discarded when it will never complete?
- Does the component exhibit bimodal behavior (normal vs. failure mode)?
- How does it scale under load?
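
One way to act on the "can work be discarded" question is load shedding: reject new work when a queue is full, and skip queued work whose caller has already given up. The sketch below is illustrative only; `SheddingQueue`, `Request`, and the deadline convention are assumptions, not part of the session.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    payload: str
    deadline: float  # absolute time (seconds) after which the caller has given up


class SheddingQueue:
    """Bounded queue that rejects new work when full and drops expired work."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self._items: deque[Request] = deque()

    def offer(self, request: Request) -> bool:
        """Accept the request, or return False to shed it when at capacity."""
        if len(self._items) >= self.max_size:
            return False
        self._items.append(request)
        return True

    def take(self, now: float) -> Optional[Request]:
        """Return the next request whose deadline has not passed, else None."""
        while self._items:
            request = self._items.popleft()
            if request.deadline > now:
                return request
            # Deadline already passed: nobody is waiting, so skip this request.
        return None


queue = SheddingQueue(max_size=2)
accepted = [
    queue.offer(Request("a", deadline=5.0)),
    queue.offer(Request("b", deadline=20.0)),
    queue.offer(Request("c", deadline=20.0)),  # queue is full, so this is shed
]
next_request = queue.take(now=10.0)  # "a" expired at t=5, so "b" is returned
```

Rejecting early keeps the component in its normal mode under overload instead of letting a growing backlog push every request past its deadline.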

Excessive Latency

Evaluate how the workload behaves when a downstream dependency is slow.
Ask: What happens if this component or a downstream service experiences high latency?
Design timeouts, retries, and fallback mechanisms.
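
The combination of timeouts, retries, and fallbacks can be sketched as follows. This is a minimal illustration under stated assumptions: the wrapped `operation` is expected to enforce its own timeout and raise `TimeoutError`, and the helper name `call_with_retries` and its parameters are hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    fallback: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try operation up to `attempts` times, then return the fallback result."""
    for attempt in range(attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == attempts - 1:
                break
            # Exponential backoff with full jitter to avoid synchronized retries.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    return fallback()


calls = {"count": 0}


def flaky_lookup() -> str:
    """Simulated dependency that times out twice before answering."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("dependency too slow")
    return "fresh"


def always_slow() -> str:
    """Simulated dependency that never answers in time."""
    raise TimeoutError("dependency too slow")


no_sleep = lambda _seconds: None  # skip real waiting in this demonstration
result = call_with_retries(flaky_lookup, fallback=lambda: "cached", sleep=no_sleep)
degraded = call_with_retries(always_slow, fallback=lambda: "cached", sleep=no_sleep)
```

Capping the number of attempts matters as much as retrying: unbounded retries against a slow dependency add load exactly when it can least absorb it.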

Misconfiguration and Bugs

Implement robust CI/CD pipelines and automation to avoid manual changes in production.
Ensure the ability to roll back or shift traffic away from a faulty deployment.
Use guardrails, policy enforcement, and automated testing to prevent operator errors.
Track expirations (certificates, credentials) and keep them up‑to‑date.
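
Expiration tracking lends itself to a simple scheduled check. The sketch below assumes an inventory that maps a name to an expiry timestamp; the function name, the 30-day warning window, and the sample entries are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def expiring_soon(
    expirations: dict[str, datetime],
    now: datetime,
    warning_window: timedelta = timedelta(days=30),
) -> list[str]:
    """Return names that expire within the window or already have, soonest first."""
    cutoff = now + warning_window
    due = [(expires, name) for name, expires in expirations.items() if expires <= cutoff]
    return [name for _, name in sorted(due)]


now = datetime(2025, 12, 4, tzinfo=timezone.utc)
inventory = {
    "api-tls-cert": datetime(2025, 12, 20, tzinfo=timezone.utc),  # inside the window
    "db-password": datetime(2026, 6, 1, tzinfo=timezone.utc),     # not due yet
    "signing-key": datetime(2025, 11, 30, tzinfo=timezone.utc),   # already expired
}
to_rotate = expiring_soon(inventory, now)
print(to_rotate)
```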

Shared Fate

Reduce blast radius by decoupling workloads that share critical resources.
Ask: If a shared database or service fails, which workloads are impacted?
Consider smaller, isolated deployments and avoid large, monolithic changes.
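
The blast-radius question can be answered from a dependency map by following dependents transitively. This is a minimal sketch; the `blast_radius` helper and the workload names are hypothetical and serve only to illustrate the idea.

```python
def blast_radius(dependencies: dict[str, set[str]], failed: str) -> set[str]:
    """Return every workload that directly or transitively depends on `failed`."""
    impacted: set[str] = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for workload, needs in dependencies.items():
            if current in needs and workload not in impacted:
                impacted.add(workload)
                frontier.append(workload)
    return impacted


# Each key depends on the resources in its value set (illustrative names).
dependencies = {
    "donations-api": {"shared-db"},
    "volunteer-portal": {"shared-db", "auth"},
    "reporting": {"donations-api"},
    "status-page": set(),
}
impacted = blast_radius(dependencies, "shared-db")
print(sorted(impacted))
```

In this example a failure of the shared database reaches three of the four workloads, including `reporting`, which never touches the database directly.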

Applying the SEEMS Checklist

When reviewing a workload, walk through the following questions for each SEEMS category:

Single Points of Failure – Is the architecture redundant? What fails over?
Excessive Load – What can overwhelm the component? Are quotas sufficient? How does scaling behave?
Excessive Latency – How does the system react to slow downstream services?
Misconfiguration and Bugs – Can deployments be rolled back automatically? Are guardrails in place?
Shared Fate – How tightly coupled are the components? What is the blast radius of a failure?

Mental Model: High Availability vs. Disaster Recovery

| Aspect | High Availability (HA) | Disaster Recovery (DR) |
| --- | --- | --- |
| Goal | Keep the application running despite anticipated failures (e.g., AZ loss). | Restore operations after unanticipated or severe failures. |
| Scope | Built‑in redundancy, automatic failover, continuous operation. | Defined RTO/RPO, backup/restore processes, possibly a different region. |
| Approach | Design for failure; expect and mitigate common issues. | Plan for recovery; test and improve restoration procedures. |
| Continuous Improvement | Ongoing CI/CD, resilience testing, and automation. | Regular DR drills, post‑mortems, and updates to recovery plans. |

Resilience is not a one‑time effort; it requires continuous testing, automation, and refinement of both HA and DR strategies.
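
To make RTO and RPO concrete, a recovery plan can be compared against its targets. The sketch below rests on two simplifying assumptions: worst-case data loss equals the interval between backups, and the restore duration comes from a measured DR drill. The `RecoveryPlan` type and the sample numbers are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryPlan:
    backup_interval: timedelta   # time between successive backups
    restore_duration: timedelta  # measured time to restore and resume service


def meets_targets(plan: RecoveryPlan, rto: timedelta, rpo: timedelta) -> dict[str, bool]:
    """Compare a plan against RTO/RPO, taking worst-case loss as the backup interval."""
    return {
        "rpo_met": plan.backup_interval <= rpo,
        "rto_met": plan.restore_duration <= rto,
    }


plan = RecoveryPlan(backup_interval=timedelta(hours=24), restore_duration=timedelta(hours=2))
result = meets_targets(plan, rto=timedelta(hours=4), rpo=timedelta(hours=1))
print(result)
```

Nightly backups restore quickly enough for a four-hour RTO here, but they cannot satisfy a one-hour RPO, which points toward more frequent backups or continuous replication.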

Continuous Improvement

CI/CD Automation – Deploy changes reliably and roll back quickly.
Resilience Testing – Inject failures (chaos engineering) to validate HA and DR.
Team Training – Test not only the system but also the operational processes and personnel.
Feedback Loops – Use observability data to identify weak points and iterate on designs.
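
Failure injection can start very small, for example by wrapping a dependency call so that a configurable fraction of calls fail. This is a toy sketch of the idea, not a substitute for a managed fault-injection service; `with_fault_injection` and `InjectedFault` are hypothetical names.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during an experiment."""


def with_fault_injection(
    operation: Callable[[], T], failure_rate: float, rng: random.Random
) -> Callable[[], T]:
    """Wrap operation so that roughly `failure_rate` of calls raise InjectedFault."""

    def wrapped() -> T:
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return operation()

    return wrapped


rng = random.Random(42)  # seeded so the experiment is repeatable
fetch = with_fault_injection(lambda: "ok", failure_rate=0.3, rng=rng)

outcomes = []
for _ in range(1000):
    try:
        outcomes.append(fetch())
    except InjectedFault:
        outcomes.append("fault")

observed_rate = outcomes.count("fault") / len(outcomes)
```

Running the same experiment against code that uses timeouts, retries, and fallbacks shows whether those mechanisms actually hide the injected failures from callers.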

Generative AI for Resilience Advisory

Mike demonstrated an agentic resilience advisor that leverages:

Amazon Bedrock – Provides foundation models for natural‑language understanding and generation.
Strands Agents SDK – Orchestrates tool usage and workflow execution.

Advisor Capabilities

| Tool | Purpose |
| --- | --- |
| `use AWS` | Interacts with AWS APIs to gather workload configuration and health data. |
| `calculate letter grade` | Scores the resilience posture against the SEEMS criteria. |
| `AWS documentation MCP server` | Retrieves best‑practice guidance from AWS docs. |

The advisor can:

Analyze a workload against the SEEMS principles.
Generate architecture diagrams in Mermaid syntax, e.g.:

graph LR
    A[User] --> B[API Gateway]
    B --> C[Lambda Function]
    C --> D[Amazon DynamoDB]
    D --> E[Backup S3 Bucket]

Provide observability recommendations (metrics, logs, traces).
Create operational runbooks that specify recovery steps based on target RTO/RPO.

By automating these assessments, organizations can quickly identify resilience gaps and obtain actionable guidance without manual deep‑dive analyses.
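
The session summary does not describe how the letter-grade tool computes its score, so the following is only a guess at what such a tool might look like. It assumes each SEEMS category receives a score from 0 to 100, averages them with equal weight, and maps the average onto conventional grade thresholds. The function name, category keys, and thresholds are all assumptions, and the wiring that would expose it to an agent as a tool is omitted.

```python
SEEMS_CATEGORIES = (
    "single_points_of_failure",
    "excessive_load",
    "excessive_latency",
    "misconfiguration_and_bugs",
    "shared_fate",
)


def calculate_letter_grade(scores: dict[str, int]) -> str:
    """Average per-category scores (0-100) and map the result to a letter."""
    missing = [c for c in SEEMS_CATEGORIES if c not in scores]
    if missing:
        raise ValueError(f"missing scores for: {', '.join(missing)}")
    average = sum(scores[c] for c in SEEMS_CATEGORIES) / len(SEEMS_CATEGORIES)
    # Assumed thresholds; a real advisor could weight categories differently.
    for threshold, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if average >= threshold:
            return letter
    return "F"


grade = calculate_letter_grade({
    "single_points_of_failure": 60,
    "excessive_load": 85,
    "excessive_latency": 75,
    "misconfiguration_and_bugs": 90,
    "shared_fate": 70,
})
print(grade)
```

Keeping the arithmetic in a deterministic function, instead of asking the model to produce a grade, makes the score reproducible across runs.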
