AWS ECS Service Task Recycle
Source: Dev.to
Overview
This solution provides controlled task recycling for ECS services by:
- Stopping tasks one at a time instead of parallel replacement
- Waiting for service stability between each task replacement
- Optionally maintaining service state by temporarily increasing capacity
- Applying a configurable wait time between task replacements
Features
- Sequential Task Recycling – Stops and replaces tasks one by one
- Service Stability – Waits for a stable state after each task replacement
- Capacity Management – Optional temporary capacity increase to maintain availability
- Autoscaling Support – Handles services with Application Auto Scaling
- Flexible Authentication – Multiple AWS credential methods via the AWSSession module
- Email Notifications – Optional SMTP notifications on completion
- CloudFormation Deployment – Infrastructure as code with automated deployment
- Zero Retries – EventInvokeConfig set to 0 retry attempts
- Comprehensive Logging – Detailed CloudWatch logs for monitoring
Architecture
Lambda Function (Python 3.13)
├── Event-driven execution
├── AWSSession.py (AWS authentication)
├── Notification.py (Email notifications)
└── input.json (Configuration)
Prerequisites
- Python 3.13+
- AWS CLI configured
- IAM permissions for ECS and Application Auto Scaling
- SMTP server (optional, for notifications)
Installation
1. Clone Repository
cd aws-ecs-service-task-recycle
2. Configure Settings
Edit input.json with your configuration:
{
"awsCredentials": {
"region_name": "us-east-1"
},
"smtpCredentials": {
"host": "smtp.example.com",
"port": "587",
"username": "user@example.com",
"password": "password",
"from_email": "noreply@example.com"
},
"emailNotification": {
"email_subject": "ECS Service Task Recycle Completed",
"subject_prefix": "AWS ECS",
"to": ["admin@example.com"]
}
}
3. Deploy CloudFormation Stack
chmod +x cloudformation_deploy.sh lambda_build.sh
./cloudformation_deploy.sh
Usage
Lambda Event Parameters
{
"cluster_name": "my-ecs-cluster",
"service_name": "my-service",
"maintain_service_state": true,
"wait_time": 30
}
Parameters
| Parameter | Required | Default | Description |
|---|---|---|---|
| cluster_name | Yes | – | ECS cluster name |
| service_name | Yes | – | ECS service name |
| maintain_service_state | No | true | Temporarily increase capacity by 1 |
| wait_time | No | 30 | Seconds to wait between task replacements |
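The defaults in the table can be applied with a small validation step before any ECS call is made. The following is a minimal sketch, not the function's actual code: the names RecycleRequest and parse_event are hypothetical, and only the four parameters and their defaults come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecycleRequest:
    """Validated Lambda event (hypothetical container type)."""
    cluster_name: str
    service_name: str
    maintain_service_state: bool = True
    wait_time: int = 30


def parse_event(event: dict) -> RecycleRequest:
    """Check the required keys and apply the documented defaults."""
    missing = [key for key in ("cluster_name", "service_name") if not event.get(key)]
    if missing:
        raise ValueError(f"missing required parameter(s): {', '.join(missing)}")
    wait_time = int(event.get("wait_time", 30))
    if wait_time < 0:
        raise ValueError("wait_time must be non-negative")
    return RecycleRequest(
        cluster_name=event["cluster_name"],
        service_name=event["service_name"],
        maintain_service_state=bool(event.get("maintain_service_state", True)),
        wait_time=wait_time,
    )
```

Rejecting a malformed event up front avoids starting a recycle that would fail halfway through.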
Invoke Lambda Function
AWS CLI
aws lambda invoke \
--function-name ecs-task-recycle-function \
--payload '{"cluster_name":"my-cluster","service_name":"my-service","maintain_service_state":true,"wait_time":30}' \
response.json
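Python (boto3)
The same invocation can be made from Python. This is a sketch that assumes a boto3 Lambda client is passed in; the helper name invoke_recycle is hypothetical, while the function name and payload keys are taken from the CLI example above.

```python
import json


def invoke_recycle(lambda_client, cluster_name, service_name,
                   maintain_service_state=True, wait_time=30,
                   function_name="ecs-task-recycle-function"):
    """Invoke the recycle function synchronously and return (status, body)."""
    payload = {
        "cluster_name": cluster_name,
        "service_name": service_name,
        "maintain_service_state": maintain_service_state,
        "wait_time": wait_time,
    }
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # wait for the function to finish
        Payload=json.dumps(payload).encode("utf-8"),
    )
    # The response payload is a stream that has to be read explicitly.
    return response["StatusCode"], response["Payload"].read().decode("utf-8")
```

A client can be created with boto3.client("lambda"). Because a recycle can run for many minutes, a synchronous caller will probably need a longer read timeout than the client default, for example through botocore's Config(read_timeout=...).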
AWS Console
- Navigate to Lambda → Functions → ecs-task-recycle-function
- Open the Test tab → Create test event
- Add the event JSON and click Test
How It Works
Process Flow
- Get Current State – Retrieve service configuration and running tasks
- Increase Capacity (if maintain_service_state=true) – Add +1 to desired count
- Wait for Stability – Ensure the new task is running
- Recycle Tasks – For each old task:
  - Stop the task
  - Wait for replacement task to start
  - Wait for service stability
  - Sleep for the configured wait_time
- Restore Capacity – Return to the original desired count
- Send Notification – Email report (if configured)
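The flow above can be sketched with the boto3 ECS client as follows. This is an illustration under stated assumptions, not the project's code: recycle_tasks is a hypothetical name, the ecs argument is assumed to be boto3.client("ecs"), notification is left out, and the services_stable waiter stands in for both "wait for replacement" and "wait for stability".

```python
import time


def recycle_tasks(ecs, cluster, service, maintain_service_state=True,
                  wait_time=30, sleep=time.sleep):
    """Stop the running tasks of a service one at a time; return how many."""
    def wait_stable():
        # Blocks until the service has a single deployment and
        # runningCount == desiredCount, or raises after the waiter gives up.
        ecs.get_waiter("services_stable").wait(cluster=cluster, services=[service])

    svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
    original_count = svc["desiredCount"]
    # list_tasks returns at most 100 ARNs per call; pagination is omitted here.
    old_tasks = ecs.list_tasks(cluster=cluster, serviceName=service,
                               desiredStatus="RUNNING")["taskArns"]

    if maintain_service_state:
        ecs.update_service(cluster=cluster, service=service,
                           desiredCount=original_count + 1)
        wait_stable()

    for task_arn in old_tasks:
        ecs.stop_task(cluster=cluster, task=task_arn,
                      reason="Sequential task recycle")
        wait_stable()
        sleep(wait_time)

    if maintain_service_state:
        ecs.update_service(cluster=cluster, service=service,
                           desiredCount=original_count)
    return len(old_tasks)
```

One caveat of this simplification: the waiter polls immediately, so it may observe the service before ECS has registered the stopped task. A real implementation may need to confirm that the old task has left the RUNNING state before waiting for stability.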
Example Scenario
Service with 3 tasks
Initial State: 3 tasks running
↓
Increase to 4 tasks (maintain availability)
↓
Stop task 1 → Wait stable → Sleep 30s
↓
Stop task 2 → Wait stable → Sleep 30s
↓
Stop task 3 → Wait stable → Sleep 30s
↓
Restore to 3 tasks
↓
Complete
Configuration
AWS Credentials (input.json)
Multiple authentication methods are supported:
{
"awsCredentials": {
"region_name": "us-east-1",
"profile_name": "my-profile",
"role_arn": "arn:aws:iam::123456789012:role/MyRole",
"access_key": "AKIAIOSFODNN7EXAMPLE",
"secret_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
"session_token": "token"
}
}
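How AWSSession.py chooses between these keys is not shown here. As a rough sketch, the static keys could be translated into keyword arguments for boto3.Session like this. The helper name session_kwargs is hypothetical; the right-hand names are real boto3.Session parameters.

```python
def session_kwargs(aws_credentials: dict) -> dict:
    """Map awsCredentials keys from input.json to boto3.Session arguments."""
    mapping = {
        "region_name": "region_name",
        "profile_name": "profile_name",
        "access_key": "aws_access_key_id",
        "secret_key": "aws_secret_access_key",
        "session_token": "aws_session_token",
    }
    # Only pass keys that are present and non-empty, so boto3 falls back
    # to its default credential chain (for example the Lambda execution role).
    return {
        boto_key: aws_credentials[cfg_key]
        for cfg_key, boto_key in mapping.items()
        if aws_credentials.get(cfg_key)
    }
```

A session would then be built with boto3.Session(**session_kwargs(config["awsCredentials"])). The role_arn key is deliberately not mapped: assuming a role needs a separate STS assume_role call, which this sketch does not cover.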
SMTP Configuration (Optional)
{
"smtpCredentials": {
"host": "smtp.gmail.com",
"port": "587",
"username": "user@gmail.com",
"password": "app-password",
"from_email": "noreply@example.com"
}
}
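For reference, a notification built from these settings might look like the sketch below, using only the Python standard library. It is not the contents of Notification.py: the function names are hypothetical, the "prefix: subject" format is an assumption, and STARTTLS is assumed because port 587 is used in the examples.

```python
import smtplib
from email.message import EmailMessage


def build_message(smtp_cfg: dict, notify_cfg: dict, body: str) -> EmailMessage:
    """Assemble the completion email from the input.json sections."""
    msg = EmailMessage()
    # Assumed subject layout; the real module may combine these differently.
    msg["Subject"] = f'{notify_cfg["subject_prefix"]}: {notify_cfg["email_subject"]}'
    msg["From"] = smtp_cfg["from_email"]
    msg["To"] = ", ".join(notify_cfg["to"])
    msg.set_content(body)
    return msg


def send_message(smtp_cfg: dict, msg: EmailMessage) -> None:
    """Send over SMTP with STARTTLS; port is a string in input.json."""
    with smtplib.SMTP(smtp_cfg["host"], int(smtp_cfg["port"]), timeout=30) as server:
        server.starttls()
        server.login(smtp_cfg["username"], smtp_cfg["password"])
        server.send_message(msg)
```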
IAM Permissions
Required permissions (included in the CloudFormation template):
{
"Effect": "Allow",
"Action": [
"ecs:DescribeServices",
"ecs:UpdateService",
"ecs:ListTasks",
"ecs:StopTask",
"ecs:DescribeTasks",
"application-autoscaling:DescribeScalableTargets",
"application-autoscaling:RegisterScalableTarget"
],
"Resource": "*"
}
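The two Application Auto Scaling actions suggest that the function inspects, and possibly adjusts, the service's scalable target so that the temporary +1 task is not scaled back in. The document does not describe that logic, so the following is only a guess at its shape. The helper name raise_max_capacity is hypothetical, and aas is assumed to be boto3.client("application-autoscaling").

```python
def raise_max_capacity(aas, cluster, service, needed):
    """Ensure MaxCapacity >= needed.

    Returns the original MaxCapacity so the caller can restore it later,
    or None if the service is not registered as a scalable target.
    """
    resource_id = f"service/{cluster}/{service}"
    dimension = "ecs:service:DesiredCount"
    targets = aas.describe_scalable_targets(
        ServiceNamespace="ecs",
        ResourceIds=[resource_id],
        ScalableDimension=dimension,
    )["ScalableTargets"]
    if not targets:
        return None
    target = targets[0]
    if target["MaxCapacity"] < needed:
        # Re-registering an existing target updates its capacity limits.
        aas.register_scalable_target(
            ServiceNamespace="ecs",
            ResourceId=resource_id,
            ScalableDimension=dimension,
            MinCapacity=target["MinCapacity"],
            MaxCapacity=needed,
        )
    return target["MaxCapacity"]
```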
CloudFormation Resources
- Lambda Function: Python 3.13 runtime, 900 s timeout, 256 MB memory
- IAM Role: Execution role with ECS and Auto Scaling permissions
- EventInvokeConfig: MaximumRetryAttempts set to 0
- CloudWatch Logs: 7‑day retention
Monitoring
CloudWatch Logs
aws logs tail /aws/lambda/ecs-task-recycle-function --follow
Key Log Messages
- Starting task recycle for {cluster}/{service}
- Original desired count: X, tasks: Y
- Recycling task N/M: {task_arn}
- Task N recycled, waiting Xs
- Task recycle completed successfully
Troubleshooting
Service Not Stabilizing
- Increase waiter MaxAttempts in code (default: 40)
- Check ECS service health and task definitions
- Verify target‑group health checks
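For the first item, boto3 waiters accept a WaiterConfig argument. The sketch below assumes the function uses the built-in services_stable waiter, whose defaults are a 15 s delay and 40 attempts. The helper names are hypothetical.

```python
def max_wait_seconds(delay=15, max_attempts=40):
    """Upper bound on how long a single waiter call can block."""
    return delay * max_attempts


def wait_until_stable(ecs, cluster, service, delay=15, max_attempts=40):
    """Wait for the service to stabilize with explicit waiter settings."""
    ecs.get_waiter("services_stable").wait(
        cluster=cluster,
        services=[service],
        WaiterConfig={"Delay": delay, "MaxAttempts": max_attempts},
    )
```

With the defaults, one wait can block for up to 600 s, which is already two thirds of the 900 s Lambda limit. Raising MaxAttempts much further therefore only helps if the function still finishes within a single invocation.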
Timeout Errors
- Confirm the Lambda timeout is set to 900 s, the maximum Lambda allows
- Reduce the number of tasks recycled per invocation or decrease wait_time
Authentication Failures
- Verify IAM role permissions
- Check AWS credentials in input.json
- Ensure Lambda execution role is correct
Best Practices
- Test in Non‑Production – Always test with non‑critical services first.
- Monitor CloudWatch – Watch logs during the first execution.
- Adjust Wait Time – Tune based on application startup time.
- Use Maintain State – Enable for production services.
- Schedule Wisely – Run during low‑traffic periods.
Comparison with Force Deployment
| Feature | Force Deployment | Task Recycle |
|---|---|---|
| Task Replacement | Parallel | Sequential |
| Service Disruption | Higher | Lower |
| Completion Time | Faster | Slower |
| Control | Limited | Configurable |
| Wait Between Tasks | No | Yes |
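For comparison, a force deployment is a single UpdateService call. A minimal sketch, assuming ecs is boto3.client("ecs") (the helper name is hypothetical):

```python
def force_new_deployment(ecs, cluster, service):
    """Ask ECS to replace all tasks using the service's own deployment settings."""
    response = ecs.update_service(
        cluster=cluster,
        service=service,
        forceNewDeployment=True,
    )
    # The new deployment appears alongside the old one until rollout completes.
    return response["service"]["deployments"]
```

With a force deployment, the pace of replacement is governed by the service's deployment configuration (minimumHealthyPercent and maximumPercent) rather than by an explicit per-task wait.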
Security Considerations
- Lambda execution role follows the principle of least privilege.
- No hard‑coded credentials in code.
- SMTP credentials stored in input.json (use Secrets Manager in production).
- CloudWatch logs provide an audit trail.
- EventInvokeConfig prevents retry storms.
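Moving the SMTP credentials to Secrets Manager could look like the sketch below. It assumes the secret holds a JSON object with the same keys as smtpCredentials. The helper name and the secret ID are hypothetical, and secrets_client is assumed to be boto3.client("secretsmanager").

```python
import json

REQUIRED_SMTP_KEYS = ("host", "port", "username", "password", "from_email")


def load_smtp_credentials(secrets_client, secret_id):
    """Fetch SMTP settings from a JSON secret instead of input.json."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    credentials = json.loads(response["SecretString"])
    missing = [key for key in REQUIRED_SMTP_KEYS if key not in credentials]
    if missing:
        raise KeyError(f"secret is missing key(s): {', '.join(missing)}")
    return credentials
```

The Lambda execution role would also need secretsmanager:GetSecretValue on that secret, which is not part of the permission list shown earlier.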
Cost Optimization
- Lambda execution time is at least (number of tasks × wait_time) seconds, plus the time each replacement takes to stabilize.
- CloudWatch Logs: 7‑day retention keeps log storage cost low.
- No extra AWS service charges.
- Consider scheduling during off‑peak hours to reduce indirect costs.
Limitations
- Maximum Lambda execution time: 15 minutes.
- Suitable for services with fewer than 20 tasks at a 30 s wait time, so that a run fits within the 15‑minute limit.
- Requires a stable service for the waiter to succeed.
- No automatic rollback mechanism on failure.
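A quick way to check whether a service fits within one invocation is to estimate the run time before starting. The sketch below is a rough model, not a measurement: stabilize_seconds is an assumed per-step stabilization time (15 s here, one waiter poll interval), and real services often take longer.

```python
LAMBDA_MAX_SECONDS = 900  # 15-minute Lambda limit


def estimated_runtime(task_count, wait_time=30, stabilize_seconds=15,
                      maintain_service_state=True):
    """Estimate total seconds: one stabilize + sleep per task, plus scale-up."""
    per_task = stabilize_seconds + wait_time
    scale_up = stabilize_seconds if maintain_service_state else 0
    return task_count * per_task + scale_up


def fits_in_one_invocation(task_count, **kwargs):
    """True if the estimate stays within the Lambda limit."""
    return estimated_runtime(task_count, **kwargs) <= LAMBDA_MAX_SECONDS
```

Under these assumptions, 19 tasks are estimated at 870 s and fit, while 20 tasks are estimated at 915 s and do not.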
Contributing
Contributions are welcome! When contributing:
- Test changes thoroughly.
- Update documentation.
- Follow the existing code style.
- Add appropriate error handling.