AWS re:Invent 2025 - Chaos & Continuity: Using Gen AI to improve humanitarian workload resilience

Published: December 4, 2025 at 10:48 PM EST
4 min read
Source: Dev.to

Introduction

Mike George, Principal Solutions Architect at AWS, works primarily with nonprofit and public‑sector organizations. In this session he explains how to plan for failure so that humanitarian workloads remain operational during disasters. He introduces five key principles for cloud resilience and demonstrates an agentic resilience advisor built with Amazon Bedrock and the Strands Agents SDK. The advisor can automatically assess workloads, generate architecture diagrams, provide observability recommendations, and create operational runbooks based on defined recovery time objective (RTO) and recovery point objective (RPO) targets.

The SEEMS Principles for Cloud Resilience

The acronym SEEMS is a mnemonic for the five categories of failure to consider when designing resilient cloud workloads.

Single Points of Failure

  • Identify components that have no redundancy.
  • Ask: If this component fails, does the entire workload go down?
  • Design for redundancy (multiple instances, AZs, regions) where appropriate.
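The redundancy question above can be turned into a simple inventory check. The following is a minimal sketch, assuming a hand-maintained inventory; the `Component` type, its fields, and the "at least two instances across at least two AZs" rule are illustrative assumptions, not something prescribed in the session.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """Hypothetical inventory record for one part of a workload."""
    name: str
    instances: int            # running copies of the component
    availability_zones: int   # distinct AZs those copies span


def single_points_of_failure(components):
    """Return names of components that lack redundancy in instances or AZs."""
    return [
        c.name
        for c in components
        if c.instances < 2 or c.availability_zones < 2
    ]


# Example inventory (made-up values for illustration).
inventory = [
    Component("api", instances=3, availability_zones=3),
    Component("database", instances=1, availability_zones=1),
    Component("worker", instances=2, availability_zones=1),
]
print(single_points_of_failure(inventory))  # ['database', 'worker']
```

Here `worker` is flagged even though it has two instances, because both run in the same AZ and would fail together.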

Excessive Load

  • Ensure the workload can handle traffic spikes and sustained high demand.
  • Verify service quotas, auto‑scaling policies, and capacity limits.
  • Consider:
    • What could overwhelm this component?
    • Can work be discarded when it will never complete?
    • Does the component exhibit bimodal behavior (normal vs. failure mode)?
    • How does it scale under load?
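One way to act on the "can work be discarded?" question is load shedding: reject new requests when a queue is full, and skip queued requests whose deadline has already passed. The sketch below is an illustration under assumed names (`SheddingQueue`, `offer`, `take`); it is not taken from the session.

```python
import time
from collections import deque


class SheddingQueue:
    """Bounded FIFO queue that rejects work when full and drops expired work."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._items = deque()  # (deadline, payload) pairs

    def offer(self, payload, deadline):
        """Accept the request unless the queue is full; return True if accepted."""
        if len(self._items) >= self.max_size:
            return False  # shed load instead of queueing without bound
        self._items.append((deadline, payload))
        return True

    def take(self, now=None):
        """Return the next payload whose deadline has not passed, or None."""
        now = time.monotonic() if now is None else now
        while self._items:
            deadline, payload = self._items.popleft()
            if deadline > now:
                return payload
            # Deadline passed: the caller has given up, so skip the work.
        return None


queue = SheddingQueue(max_size=2)
accepted = [
    queue.offer("a", deadline=5.0),
    queue.offer("b", deadline=20.0),
    queue.offer("c", deadline=20.0),  # rejected: queue is full
]
first = queue.take(now=10.0)   # "a" expired at 5.0, so "b" is returned
second = queue.take(now=10.0)  # nothing left
print(accepted, first, second)
```

Rejecting early keeps latency bounded for the requests that are accepted, which helps avoid the bimodal behavior mentioned above.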

Excessive Latency

  • Evaluate how the workload behaves when a downstream dependency is slow.
  • Ask: What happens if this component or a downstream service experiences high latency?
  • Design timeouts, retries, and fallback mechanisms.
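The timeout, retry, and fallback pattern can be sketched as follows. This is a minimal illustration, assuming the wrapped operation enforces its own timeout and signals it by raising `TimeoutError`; the function name `call_with_retries` and its parameters are hypothetical.

```python
import random
import time


def call_with_retries(operation, fallback, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Try `operation` up to `attempts` times, then return `fallback()`.

    `operation` is expected to raise TimeoutError when the dependency is slow.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == attempts - 1:
                break  # retries exhausted, fall through to the fallback
            # Exponential backoff with full jitter, so that many clients
            # retrying at once do not hit the dependency in lockstep.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    return fallback()


def slow_dependency():
    # Stand-in for a real call that timed out.
    raise TimeoutError("dependency took too long")


print(call_with_retries(slow_dependency, fallback=lambda: "cached response", base_delay=0.01))
```

Capping the number of attempts matters as much as the backoff: unbounded retries against a slow dependency add load exactly when it can least absorb it.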

Misconfiguration and Bugs

  • Implement robust CI/CD pipelines and automation to avoid manual changes in production.
  • Ensure the ability to roll back or shift traffic away from a faulty deployment.
  • Use guardrails, policy enforcement, and automated testing to prevent operator errors.
  • Track expirations (certificates, credentials) and keep them up‑to‑date.
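Expiration tracking lends itself to a small scheduled check. The sketch below assumes expiry dates have already been collected into a dictionary; the item names, dates, and the 30-day warning window are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def expiring_soon(expirations, now, warn_within=timedelta(days=30)):
    """Return (name, days_left) for items expiring within the window, soonest first.

    A negative `days_left` means the item has already expired.
    """
    due = [
        (name, (expires - now).days)
        for name, expires in expirations.items()
        if expires - now <= warn_within
    ]
    return sorted(due, key=lambda item: item[1])


# Hypothetical inventory of certificates and credentials.
now = datetime(2025, 12, 1, tzinfo=timezone.utc)
inventory = {
    "api-tls-cert": datetime(2025, 12, 15, tzinfo=timezone.utc),
    "db-password": datetime(2026, 6, 1, tzinfo=timezone.utc),
    "signing-key": datetime(2025, 11, 28, tzinfo=timezone.utc),
}
print(expiring_soon(inventory, now))  # [('signing-key', -3), ('api-tls-cert', 14)]
```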

Shared Fate

  • Reduce blast radius by decoupling workloads that share critical resources.
  • Ask: If a shared database or service fails, which workloads are impacted?
  • Consider smaller, isolated deployments and avoid large, monolithic changes.
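The blast-radius question can be answered mechanically once dependencies are written down. The following sketch assumes a hypothetical map from each workload to the resources it depends on, and walks that map to find everything affected, directly or transitively, by one failure.

```python
def blast_radius(dependencies, failed):
    """Return the sorted workloads that directly or transitively depend on `failed`."""
    impacted = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for workload, deps in dependencies.items():
            if current in deps and workload not in impacted:
                impacted.add(workload)
                frontier.append(workload)  # its own dependents are affected too
    return sorted(impacted)


# Hypothetical workloads and the resources each one depends on.
dependencies = {
    "donations-api": {"shared-db"},
    "volunteer-portal": {"shared-db", "auth"},
    "reporting": {"donations-api"},
    "static-site": set(),
}
print(blast_radius(dependencies, "shared-db"))
# ['donations-api', 'reporting', 'volunteer-portal']
```

In this example `reporting` never touches the database itself, yet it shares fate with it through `donations-api`.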

Applying the SEEMS Checklist

When reviewing a workload, walk through the following questions for each SEEMS category:

  • Single Points of Failure – Is the architecture redundant? What fails over?
  • Excessive Load – What can overwhelm the component? Are quotas sufficient? How does scaling behave?
  • Excessive Latency – How does the system react to slow downstream services?
  • Misconfiguration and Bugs – Can deployments be rolled back automatically? Are guardrails in place?
  • Shared Fate – How tightly coupled are the components? What is the blast radius of a failure?

Mental Model: High Availability vs. Disaster Recovery

| Aspect | High Availability (HA) | Disaster Recovery (DR) |
| --- | --- | --- |
| Goal | Keep the application running despite anticipated failures (e.g., AZ loss). | Restore operations after unanticipated or severe failures. |
| Scope | Built‑in redundancy, automatic failover, continuous operation. | Defined RTO/RPO, backup/restore processes, possibly a different region. |
| Approach | Design for failure; expect and mitigate common issues. | Plan for recovery; test and improve restoration procedures. |
| Continuous Improvement | Ongoing CI/CD, resilience testing, and automation. | Regular DR drills, post‑mortems, and updates to recovery plans. |

Resilience is not a one‑time effort; it requires continuous testing, automation, and refinement of both HA and DR strategies.
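RTO and RPO targets become useful when they are compared against what the current setup can deliver. As a rough sketch, assuming the worst-case data loss equals the backup interval and that restore time has been measured in a DR drill, a check might look like this (`RecoveryPlan` and `recovery_gaps` are hypothetical names):

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryPlan:
    backup_interval: timedelta        # worst-case data loss if failure hits just before a backup
    measured_restore_time: timedelta  # e.g. observed in the most recent DR drill


def recovery_gaps(plan, rto, rpo):
    """Return which targets ("RPO", "RTO") the plan currently misses."""
    gaps = []
    if plan.backup_interval > rpo:
        gaps.append("RPO")
    if plan.measured_restore_time > rto:
        gaps.append("RTO")
    return gaps


# Made-up numbers: nightly backups, two-hour restore.
plan = RecoveryPlan(
    backup_interval=timedelta(hours=24),
    measured_restore_time=timedelta(hours=2),
)
print(recovery_gaps(plan, rto=timedelta(hours=4), rpo=timedelta(hours=1)))  # ['RPO']
```

The restore meets the four-hour RTO, but nightly backups cannot satisfy a one-hour RPO, so the backup frequency is the gap to close.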

Continuous Improvement

  • CI/CD Automation – Deploy changes reliably and roll back quickly.
  • Resilience Testing – Inject failures (chaos engineering) to validate HA and DR.
  • Team Training – Test not only the system but also the operational processes and personnel.
  • Feedback Loops – Use observability data to identify weak points and iterate on designs.
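Failure injection can start very small. The sketch below wraps a function so that a configurable fraction of calls raise an error, which is one simple way to exercise retry and fallback paths in a test environment. The helper `inject_failures` is hypothetical; managed fault-injection services work at the infrastructure level rather than in application code.

```python
import random


def inject_failures(func, failure_rate, rng):
    """Wrap `func` so that roughly `failure_rate` of calls raise ConnectionError."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)
    return wrapper


def lookup_shelter(shelter_id):
    # Stand-in for a real downstream call.
    return {"id": shelter_id, "status": "open"}


# A seeded generator makes the experiment repeatable.
chaotic_lookup = inject_failures(lookup_shelter, failure_rate=0.2, rng=random.Random(42))

failures = 0
for i in range(1000):
    try:
        chaotic_lookup(i)
    except ConnectionError:
        failures += 1
print(f"{failures} of 1000 calls failed")
```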

Generative AI for Resilience Advisory

Mike demonstrated an agentic resilience advisor that leverages:

  • Amazon Bedrock – Provides foundation models for natural‑language understanding and generation.
  • Strands Agents SDK – Orchestrates tool usage and workflow execution.

Advisor Capabilities

| Tool | Purpose |
| --- | --- |
| use AWS | Interacts with AWS APIs to gather workload configuration and health data. |
| calculate letter grade | Scores the resilience posture against the SEEMS criteria. |
| AWS documentation MCP server | Retrieves best‑practice guidance from AWS docs. |

The advisor can:

  • Analyze a workload against the SEEMS principles.
  • Generate architecture diagrams in Mermaid syntax, e.g.:
```mermaid
graph LR
    A[User] --> B[API Gateway]
    B --> C[Lambda Function]
    C --> D[Amazon DynamoDB]
    D --> E[Backup S3 Bucket]
```
  • Provide observability recommendations (metrics, logs, traces).
  • Create operational runbooks that specify recovery steps based on target RTO/RPO.

By automating these assessments, organizations can quickly identify resilience gaps and obtain actionable guidance without manual deep‑dive analyses.
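The session summary does not describe how the letter-grade tool computes its score. Purely as an illustration of what such a tool could look like, the sketch below averages one 0-100 score per SEEMS category and maps the result to a letter; the category keys, the equal weighting, and the grade thresholds are all assumptions. In an agent framework, a function like this would be registered as a tool so the model can call it.

```python
SEEMS_CATEGORIES = (
    "single_points_of_failure",
    "excessive_load",
    "excessive_latency",
    "misconfiguration_and_bugs",
    "shared_fate",
)


def calculate_letter_grade(scores):
    """Average per-category scores (0-100) and map the result to a letter grade.

    Hypothetical scoring rule: equal weights and conventional 90/80/70/60 cutoffs.
    """
    missing = [c for c in SEEMS_CATEGORIES if c not in scores]
    if missing:
        raise ValueError(f"missing scores for: {', '.join(missing)}")
    average = sum(scores[c] for c in SEEMS_CATEGORIES) / len(SEEMS_CATEGORIES)
    for threshold, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if average >= threshold:
            return letter
    return "F"


# Made-up assessment: the average is 78, which maps to "C".
assessment = {
    "single_points_of_failure": 95,
    "excessive_load": 80,
    "excessive_latency": 70,
    "misconfiguration_and_bugs": 85,
    "shared_fate": 60,
}
print(calculate_letter_grade(assessment))  # C
```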
