Stop Managing EKS Add-ons by Hand
Source: Dev.to
Originally published on graycloudarch.com.
The problem
I was preparing to upgrade a production EKS cluster to v1.32 when I discovered that four core add‑ons were running versions incompatible with the new EKS release:
| Add‑on | Install mode | Current version | Issue |
|---|---|---|---|
| VPC CNI | Self‑managed (auto‑installed) | ❌ | Incompatible |
| CoreDNS | Self‑managed (auto‑installed) | ❌ | Incompatible |
| kube‑proxy | Self‑managed (auto‑installed) | ❌ | Incompatible |
| Metrics Server | Installed via kubectl apply -f | ❌ | Incompatible |
There was no version pinning, no history of changes, and no way to test the upgrade safely.
That’s when I decided to stop managing EKS add‑ons by hand.
Two categories of EKS add‑ons
| Category | Who manages it? | What you get |
|---|---|---|
| Self‑managed | You – you install, update, and ensure compatibility. AWS won’t help troubleshoot. | Full control, but you must manually verify compatibility on every EKS release. |
| EKS‑managed | AWS – lifecycle, version compatibility, security patches, and support are handled for you. | Simpler upgrades; AWS publishes a compatible version for each EKS release. |
If you created a cluster without explicitly enabling managed add‑ons, the VPC CNI, CoreDNS, and kube‑proxy are currently self‑managed.
The fix is straightforward: migrate them to EKS‑managed.
Metrics Server, however, is a plain kubectl‑installed resource and isn’t managed by anything.
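To find the add‑on version AWS publishes for a given Kubernetes release, one option is the AWS provider's aws_eks_addon_version data source. This is a minimal sketch rather than part of the module described in this post: the variable target_kubernetes_version and the output names are hypothetical, and it assumes the AWS provider is already configured.

```hcl
# Hypothetical input: the Kubernetes version you plan to upgrade to.
variable "target_kubernetes_version" {
  description = "Kubernetes version to look up compatible add-on versions for, e.g. \"1.32\""
  type        = string
}

# Most recent VPC CNI add-on version that is compatible with the target release.
data "aws_eks_addon_version" "vpc_cni_latest" {
  addon_name         = "vpc-cni"
  kubernetes_version = var.target_kubernetes_version
  most_recent        = true
}

# Without most_recent, the data source returns the default version for that release.
data "aws_eks_addon_version" "vpc_cni_default" {
  addon_name         = "vpc-cni"
  kubernetes_version = var.target_kubernetes_version
}

output "vpc_cni_latest_version" {
  value = data.aws_eks_addon_version.vpc_cni_latest.version
}

output "vpc_cni_default_version" {
  value = data.aws_eks_addon_version.vpc_cni_default.version
}
```

Copying the reported value into a pinned variable, rather than wiring the data source straight into the add‑on resource, keeps version changes visible in a PR.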
One Terraform module for all add‑ons
I built a single eks-addons Terraform module that manages everything in one place.
Managed by AWS (EKS‑managed)
| Add‑on | Purpose |
|---|---|
| VPC CNI | Pod networking |
| EBS CSI Driver | Persistent volumes (added while I was at it) |
| CoreDNS | DNS resolution |
| kube‑proxy | Network proxy |
Managed by Helm (Helm‑managed)
| Add‑on | Purpose |
|---|---|
| Metrics Server | Resource metrics for kubectl top and HPA |
| Reloader | Auto‑restart pods when ConfigMaps or Secrets change |
Why a single module?
All add‑ons share the same dependency – the EKS cluster. Consolidating them means:
- One terragrunt apply deploys everything.
- One terraform plan shows drift across all add‑ons.
- One PR updates any version.
Core Terraform for an EKS‑managed add‑on
resource "aws_eks_addon" "vpc_cni" {
  count = var.enable_vpc_cni ? 1 : 0

  cluster_name  = var.cluster_name
  addon_name    = "vpc-cni"
  addon_version = var.vpc_cni_version

  resolve_conflicts_on_create = "OVERWRITE"
  resolve_conflicts_on_update = "OVERWRITE"
  preserve                    = true
}
Two important flags
- resolve_conflicts_on_create / resolve_conflicts_on_update = "OVERWRITE": Terraform is the source of truth. When the add‑on is created or updated, conflicting manual changes in the cluster are overwritten.
- preserve = true: If the resource is removed from Terraform, the add‑on remains in the cluster. This acts as a safety net during refactoring.
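The resource above reads three input variables. A minimal sketch of how they could be declared follows. The defaults, descriptions, and the version-format check are assumptions, not the module's actual definitions. The regex assumes the usual vX.Y.Z-eksbuild.N shape of EKS add‑on versions.

```hcl
variable "cluster_name" {
  description = "Name of the existing EKS cluster the add-ons are installed into"
  type        = string
}

variable "enable_vpc_cni" {
  description = "Whether this module manages the VPC CNI add-on"
  type        = bool
  default     = true # assumed default
}

variable "vpc_cni_version" {
  description = "Pinned VPC CNI add-on version"
  type        = string

  # Assumed guard: reject values that do not look like an EKS add-on version
  # so a typo fails at plan time instead of at apply time.
  validation {
    condition     = can(regex("^v[0-9]+\\.[0-9]+\\.[0-9]+-eksbuild\\.[0-9]+$", var.vpc_cni_version))
    error_message = "vpc_cni_version must look like vX.Y.Z-eksbuild.N."
  }
}
```

Leaving the version variable without a default forces every stack to pin a version explicitly.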
EBS CSI Driver – extra IAM work
The EBS CSI Driver needs IAM permissions to create/attach EBS volumes. The recommended pattern is IRSA (IAM Roles for Service Accounts).
# IAM role for the driver
resource "aws_iam_role" "ebs_csi" {
  count = var.enable_ebs_csi ? 1 : 0
  name  = "${var.cluster_name}-ebs-csi-driver"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = var.oidc_provider_arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "${var.oidc_provider}:sub" = "system:serviceaccount:kube-system:ebs-csi-controller-sa"
          "${var.oidc_provider}:aud" = "sts.amazonaws.com"
        }
      }
    }]
  })
}

# Attach the AWS‑managed policy
resource "aws_iam_role_policy_attachment" "ebs_csi" {
  count      = var.enable_ebs_csi ? 1 : 0
  role       = aws_iam_role.ebs_csi[0].name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy"
}
No credentials in pods, automatic rotation, and a clean audit trail in CloudTrail.
IRSA is the right pattern for any workload that needs to call AWS APIs from inside Kubernetes.
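The role only takes effect once the add‑on's service account uses it. A sketch of how the EBS CSI add‑on could be tied to the role above through the service_account_role_arn argument of aws_eks_addon follows. The variable ebs_csi_version is a hypothetical name, and the conflict and preserve settings are assumed to mirror the VPC CNI resource.

```hcl
resource "aws_eks_addon" "ebs_csi" {
  count = var.enable_ebs_csi ? 1 : 0

  cluster_name  = var.cluster_name
  addon_name    = "aws-ebs-csi-driver"
  addon_version = var.ebs_csi_version # hypothetical variable name

  # Annotates the driver's service account with the IRSA role defined above.
  service_account_role_arn = aws_iam_role.ebs_csi[0].arn

  resolve_conflicts_on_create = "OVERWRITE"
  resolve_conflicts_on_update = "OVERWRITE"
  preserve                    = true

  # Make sure the policy is attached before the driver starts calling EC2 APIs.
  depends_on = [aws_iam_role_policy_attachment.ebs_csi]
}
```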
Migrating the Metrics Server
1️⃣ Delete the existing kubectl‑installed resources
kubectl delete deployment metrics-server -n kube-system
kubectl delete service metrics-server -n kube-system
kubectl delete apiservice v1beta1.metrics.k8s.io
2️⃣ Install the Helm‑managed version via Terraform
resource "helm_release" "metrics_server" {
  count = var.enable_metrics_server ? 1 : 0

  name       = "metrics-server"
  repository = "https://kubernetes-sigs.github.io/metrics-server/"
  chart      = "metrics-server"
  version    = var.metrics_server_chart_version
  namespace  = "kube-system"

  values = [yamlencode({
    replicas = 2
    args = [
      "--kubelet-preferred-address-types=InternalIP",
      "--kubelet-insecure-tls"
    ]
    podDisruptionBudget = {
      enabled      = true
      minAvailable = 1
    }
  })]
}
Expected downtime: 2‑3 minutes (only kubectl top is unavailable). Running applications are not affected.
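Reloader is the other Helm‑managed add‑on in the module, and it can follow the same shape. The post does not show that resource, so treat this as a sketch: the variable names, the dedicated namespace, and the use of the Stakater chart repository with chart defaults are all assumptions.

```hcl
resource "helm_release" "reloader" {
  count = var.enable_reloader ? 1 : 0 # hypothetical toggle

  name       = "reloader"
  repository = "https://stakater.github.io/stakater-charts"
  chart      = "reloader"
  version    = var.reloader_chart_version # hypothetical variable name

  # Assumed placement: a dedicated namespace created by Helm if missing.
  namespace        = "reloader"
  create_namespace = true
}
```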
CI/CD gotcha
Our GitHub Actions workflow looks for modified terragrunt.hcl files to decide which stacks to deploy. When I changed files under common/modules/eks-addons/, the workflow triggered but found no stacks (no terragrunt.hcl changed), so nothing ran.
Solution: Deploy module changes manually.
cd workloads-nonprod/us-east-1/cluster-name/eks-addons
terragrunt init
terragrunt plan # Should show ~10 resources to add
terragrunt apply
Verify everything is healthy
# Check EKS‑managed add‑on status
for addon in vpc-cni aws-ebs-csi-driver coredns kube-proxy; do
  aws eks describe-addon \
    --cluster-name <cluster-name> \
    --addon-name "$addon" \
    --query "addon.status" \
    --output text
done
You should see ACTIVE for each add‑on.
Takeaway
- Self‑managed add‑ons require you to track versions, compatibility, and security patches.
- EKS‑managed add‑ons let AWS handle the lifecycle, giving you confidence during upgrades.
- Consolidating all add‑ons (AWS‑managed, Helm‑managed, and custom) into a single Terraform module simplifies drift detection, version upgrades, and CI/CD integration.
Now the cluster can be upgraded to EKS 1.32 without the previous add‑on compatibility headaches.
kubectl top nodes
Before
- Three add‑ons running in self‑managed mode
- One add‑on installed by kubectl
- No version history
- No drift detection
After
- All six add‑ons defined in code with pinned versions
- terraform plan shows immediately if anything drifts from the declared state
- Rollback is simply git revert + terragrunt apply
EKS Cluster Upgrade Checklist
- Update the four version strings in the Terragrunt config.
- Open a PR.
- Merge – the upgrade is applied automatically.
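For illustration, the four version strings could sit in the stack's terragrunt.hcl as sketched below. The include and dependency layout, the root.hcl file name, the ../eks path, the cluster_name output, and every variable name except vpc_cni_version and metrics_server_chart_version are assumptions. The version values are placeholders, not real releases.

```hcl
# workloads-nonprod/us-east-1/cluster-name/eks-addons/terragrunt.hcl (hypothetical contents)

include "root" {
  path = find_in_parent_folders("root.hcl") # assumes the root config is named root.hcl
}

terraform {
  source = "${get_repo_root()}/common/modules/eks-addons"
}

dependency "eks" {
  config_path = "../eks" # hypothetical sibling stack that creates the cluster
}

inputs = {
  cluster_name = dependency.eks.outputs.cluster_name # hypothetical output name

  # The four EKS-managed add-on versions to bump for a cluster upgrade.
  # Placeholders only: replace with versions published for the target release.
  vpc_cni_version    = "vX.Y.Z-eksbuild.N"
  coredns_version    = "vX.Y.Z-eksbuild.N"
  kube_proxy_version = "vX.Y.Z-eksbuild.N"
  ebs_csi_version    = "vX.Y.Z-eksbuild.N"

  # Helm chart versions are pinned the same way.
  metrics_server_chart_version = "X.Y.Z"
}
```

Because the pins live in terragrunt.hcl, editing them changes a file the workflow already watches, so the stack is picked up without the manual step needed for module-only changes.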
The cluster upgrade I was dreading took about 30 minutes instead of a day of manual compatibility checking.