Stop Managing EKS Add-ons by Hand

Published: April 5, 2026 at 12:53 PM EDT

Source: Dev.to

Originally published on graycloudarch.com.

The problem

I was preparing to upgrade a production EKS cluster to v1.32 when I discovered that four core add‑ons were running versions incompatible with the new EKS release:

Add‑on	Install mode	Current version	Issue
VPC CNI	Self‑managed (auto‑installed)	❌	Incompatible
CoreDNS	Self‑managed (auto‑installed)	❌	Incompatible
kube‑proxy	Self‑managed (auto‑installed)	❌	Incompatible
Metrics Server	Installed via `kubectl apply -f`	❌	Incompatible

There was no version pinning, no history of changes, and no way to test the upgrade safely.
That’s when I decided to stop managing EKS add‑ons by hand.

Two categories of EKS add‑ons

Category	Who manages it?	What you get
Self‑managed	You – you install, update, and ensure compatibility. AWS won’t help troubleshoot.	Full control, but you must manually verify compatibility on every EKS release.
EKS‑managed	AWS – lifecycle, version compatibility, security patches, and support are handled for you.	Simpler upgrades; AWS publishes a compatible version for each EKS release.

If you created a cluster without explicitly enabling managed add‑ons, the VPC CNI, CoreDNS, and kube‑proxy are currently self‑managed.

The fix is straightforward: migrate them to EKS‑managed.
Metrics Server, however, is a plain kubectl‑installed resource and isn’t managed by anything.
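
A quick way to see which of these add‑ons EKS is already managing is to ask the EKS API. This is a minimal sketch rather than the author's tooling: report_addon_modes is a hypothetical helper name, and it assumes the AWS CLI is installed and authenticated. An add‑on missing from the list-addons output is either self‑managed or not installed at all.

```shell
#!/usr/bin/env sh
# Hypothetical helper (not from the original post): report whether each core
# add-on is registered with EKS as a managed add-on.
report_addon_modes() {
  cluster="$1"
  # list-addons only returns add-ons that EKS manages; normalise tabs and
  # newlines to spaces and pad with spaces so whole names can be matched.
  managed=" $(aws eks list-addons --cluster-name "$cluster" \
    --query 'addons' --output text | tr '\t\n' '  ') "
  for addon in vpc-cni coredns kube-proxy; do
    case "$managed" in
      *" $addon "*) echo "$addon: EKS-managed" ;;
      *)            echo "$addon: not EKS-managed (self-managed or absent)" ;;
    esac
  done
}

# Usage (the cluster name is a placeholder):
#   report_addon_modes my-cluster
```

Running this before and after the migration gives a simple record of what changed.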

One Terraform module for all add‑ons

I built a single eks-addons Terraform module that manages everything in one place.

Managed by AWS (EKS‑managed)

Add‑on	Purpose
VPC CNI	Pod networking
EBS CSI Driver	Persistent volumes (added while I was at it)
CoreDNS	DNS resolution
kube‑proxy	Network proxy

Managed by Helm (Helm‑managed)

Add‑on	Purpose
Metrics Server	Resource metrics for `kubectl top` and HPA
Reloader	Auto‑restart pods when ConfigMaps or Secrets change

Why a single module?
All add‑ons share the same dependency – the EKS cluster. Consolidating them means:
One terragrunt apply deploys everything.
One terraform plan shows drift across all add‑ons.
One PR updates any version.

Core Terraform for an EKS‑managed add‑on

resource "aws_eks_addon" "vpc_cni" {
  count = var.enable_vpc_cni ? 1 : 0

  cluster_name                 = var.cluster_name
  addon_name                   = "vpc-cni"
  addon_version                = var.vpc_cni_version
  resolve_conflicts_on_create  = "OVERWRITE"
  resolve_conflicts_on_update  = "OVERWRITE"
  preserve                     = true
}

Two important flags

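resolve_conflicts_on_create / resolve_conflicts_on_update = "OVERWRITE" – Terraform is the source of truth; any conflicting manual changes in the cluster are overwritten on the next apply. These are the two arguments used in the resource above.
preserve = true – If the resource is removed from Terraform, the add‑on software keeps running in the cluster; EKS just stops managing it. This acts as a safety net during refactoring.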

EBS CSI Driver – extra IAM work

The EBS CSI Driver needs IAM permissions to create/attach EBS volumes. The recommended pattern is IRSA (IAM Roles for Service Accounts).

# IAM role for the driver
resource "aws_iam_role" "ebs_csi" {
  count = var.enable_ebs_csi ? 1 : 0
  name  = "${var.cluster_name}-ebs-csi-driver"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = var.oidc_provider_arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "${var.oidc_provider}:sub" = "system:serviceaccount:kube-system:ebs-csi-controller-sa"
          "${var.oidc_provider}:aud" = "sts.amazonaws.com"
        }
      }
    }]
  })
}

# Attach the AWS‑managed policy
resource "aws_iam_role_policy_attachment" "ebs_csi" {
  count      = var.enable_ebs_csi ? 1 : 0
  role       = aws_iam_role.ebs_csi[0].name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy"
}

No credentials in pods, automatic rotation, and a clean audit trail in CloudTrail.
IRSA is a well‑established pattern for any workload that needs to call AWS APIs from inside Kubernetes.
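
Once the role exists, it is worth confirming that the driver's service account actually points at it. The sketch below is an illustration under assumptions: check_irsa_annotation is a hypothetical helper, and it assumes the role ARN was handed to the add‑on (for example through the service_account_role_arn argument of aws_eks_addon, which the snippets above do not show), so that the service account carries the eks.amazonaws.com/role-arn annotation.

```shell
#!/usr/bin/env sh
# Hypothetical helper (not from the original post): compare the IRSA
# annotation on the EBS CSI controller's service account with the role ARN
# you expect Terraform to have created.
check_irsa_annotation() {
  expected_arn="$1"
  # Dots inside the annotation key must be escaped in kubectl's jsonpath.
  actual_arn="$(kubectl get serviceaccount ebs-csi-controller-sa \
    --namespace kube-system \
    --output jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}')"
  if [ "$actual_arn" = "$expected_arn" ]; then
    echo "ok: service account is annotated with $actual_arn"
  else
    echo "mismatch: expected $expected_arn, found '${actual_arn:-none}'" >&2
    return 1
  fi
}

# Usage (the ARN is a placeholder):
#   check_irsa_annotation "arn:aws:iam::<account-id>:role/<cluster-name>-ebs-csi-driver"
```

The service account name and namespace match the ones in the trust policy's sub condition; if they differ, the role cannot be assumed.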

Migrating the Metrics Server

1️⃣ Delete the existing `kubectl`‑installed resources

kubectl delete deployment metrics-server -n kube-system
kubectl delete service metrics-server -n kube-system
kubectl delete apiservice v1beta1.metrics.k8s.io
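
Helm refuses to install over objects it does not own, so it can help to confirm the old objects are really gone before the Terraform apply. This is a hedged sketch, not part of the original workflow: both function names are hypothetical, and it only checks the three objects deleted above. The manifest that was originally applied may have created other objects (for example a service account or RBAC rules) that this does not cover.

```shell
#!/usr/bin/env sh
# Hypothetical pre-flight check (not from the original post): list any of the
# objects deleted in step 1 that still exist.
remaining_metrics_server_objects() {
  kubectl get deployment/metrics-server service/metrics-server \
    --namespace kube-system --ignore-not-found --output name
  kubectl get apiservice/v1beta1.metrics.k8s.io \
    --ignore-not-found --output name
}

# Fails (non-zero exit) if anything from step 1 is still present.
assert_metrics_server_removed() {
  remaining="$(remaining_metrics_server_objects)"
  if [ -n "$remaining" ]; then
    echo "still present:" >&2
    echo "$remaining" >&2
    return 1
  fi
  echo "old Metrics Server objects are gone"
}

# Usage:
#   assert_metrics_server_removed && terragrunt apply
```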

2️⃣ Install the Helm‑managed version via Terraform

resource "helm_release" "metrics_server" {
  count       = var.enable_metrics_server ? 1 : 0
  name        = "metrics-server"
  repository  = "https://kubernetes-sigs.github.io/metrics-server/"
  chart       = "metrics-server"
  version     = var.metrics_server_chart_version
  namespace   = "kube-system"

  values = [yamlencode({
    replicas = 2
    args = [
      "--kubelet-preferred-address-types=InternalIP",
      "--kubelet-insecure-tls"
    ]
    podDisruptionBudget = {
      enabled      = true
      minAvailable = 1
    }
  })]
}

Expected downtime: 2‑3 minutes. During that window kubectl top returns errors and HPAs that rely on resource metrics cannot read them; running applications are not otherwise affected.
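
To know when that window has closed, you can wait for the new deployment and the metrics API to report ready. A minimal sketch, assuming the Helm release produces a deployment named metrics-server in kube-system (which follows from the release and chart names above, but is worth verifying in your cluster); wait_for_metrics_server is a hypothetical helper name.

```shell
#!/usr/bin/env sh
# Hypothetical helper (not from the original post): block until the Helm-
# managed Metrics Server is rolled out and the metrics API is available.
wait_for_metrics_server() {
  wait_timeout="${1:-180s}"
  # First the deployment itself, then the aggregated API it serves.
  kubectl rollout status deployment/metrics-server \
    --namespace kube-system --timeout="$wait_timeout" &&
  kubectl wait --for=condition=Available \
    apiservice/v1beta1.metrics.k8s.io --timeout="$wait_timeout"
}

# Usage:
#   wait_for_metrics_server 180s && kubectl top nodes
```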

CI/CD gotcha

Our GitHub Actions workflow looks for modified terragrunt.hcl files to decide which stacks to deploy. When I changed files under common/modules/eks-addons/, the workflow triggered but found no stacks (no terragrunt.hcl changed), so nothing ran.

Solution: Deploy module changes manually.

cd workloads-nonprod/us-east-1/cluster-name/eks-addons
terragrunt init
terragrunt plan   # Should show ~10 resources to add
terragrunt apply
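
A longer‑term alternative to deploying by hand is to teach the workflow which stacks consume a changed module. The sketch below is one possible approach, not what the original workflow does: both function names are hypothetical, and it assumes each stack's terragrunt.hcl references its module with a path containing modules/<name>. The match is a plain substring search, so a module named eks-addons would also match eks-addons-v2.

```shell
#!/usr/bin/env sh
# Hypothetical helpers (not from the original post).

# Read changed file paths on stdin and print the names of the modules under
# common/modules/ that they belong to, one per line.
changed_modules() {
  sed -n 's#^common/modules/\([^/]*\)/.*#\1#p' | sort -u
}

# Print the directory of every stack whose terragrunt.hcl mentions the module.
stacks_using_module() {
  module_name="$1"
  repo_root="${2:-.}"
  find "$repo_root" -type f -name terragrunt.hcl \
    ! -path '*/.terragrunt-cache/*' \
    -exec grep -l -- "modules/${module_name}" {} + \
    | while IFS= read -r file; do dirname "$file"; done \
    | sort -u
}

# Usage (origin/main as the comparison branch is an assumption):
#   git diff --name-only origin/main...HEAD | changed_modules |
#     while IFS= read -r module; do stacks_using_module "$module"; done
```

The resulting directory list can then be fed to the same deploy step the workflow already runs for modified terragrunt.hcl files.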

Verify everything is healthy

# Check EKS‑managed add‑on status
CLUSTER_NAME="<cluster-name>"   # replace with your cluster name
for addon in vpc-cni aws-ebs-csi-driver coredns kube-proxy; do
  aws eks describe-addon \
    --cluster-name "$CLUSTER_NAME" \
    --addon-name "$addon" \
    --query "addon.status" \
    --output text
done

You should see ACTIVE for each add‑on. To confirm that Metrics Server is serving data again, run:

kubectl top nodes

Takeaway

Self‑managed add‑ons require you to track versions, compatibility, and security patches yourself.
EKS‑managed add‑ons let AWS handle the lifecycle, giving you confidence during upgrades.
Consolidating all add‑ons (EKS‑managed and Helm‑managed) into a single Terraform module simplifies drift detection, version upgrades, and CI/CD integration.

Now the cluster can be upgraded to EKS 1.32 without the previous add‑on compatibility headaches.

Before

Three add‑ons auto‑installed and running in self‑managed mode
One add‑on (Metrics Server) installed by kubectl
No version history
No drift detection

After

All six add‑ons defined in code with pinned versions
terraform plan shows immediately if anything drifts from the declared state
Rollback is simply git revert + terragrunt apply

EKS Cluster Upgrade Checklist

Update the four version strings in the Terragrunt config.
Open a PR.
Merge – the upgrade is applied automatically.
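
For the first step, the EKS API can tell you which add‑on versions it publishes for a target Kubernetes release. A minimal sketch, assuming an authenticated AWS CLI: print_default_addon_versions is a hypothetical helper, and picking the version AWS marks as the default is only one reasonable starting point, not necessarily the version the author pinned.

```shell
#!/usr/bin/env sh
# Hypothetical helper (not from the original post): print the default
# EKS-managed add-on version for a given Kubernetes release.
print_default_addon_versions() {
  k8s_version="$1"
  for addon in vpc-cni aws-ebs-csi-driver coredns kube-proxy; do
    # Keep only the version flagged as default for this Kubernetes release.
    version="$(aws eks describe-addon-versions \
      --kubernetes-version "$k8s_version" \
      --addon-name "$addon" \
      --query 'addons[0].addonVersions[?compatibilities[0].defaultVersion] | [0].addonVersion' \
      --output text)"
    echo "$addon = $version"
  done
}

# Usage:
#   print_default_addon_versions 1.32
```

The four values map onto the four version strings in the Terragrunt config.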

The cluster upgrade I was dreading took about 30 minutes instead of a day of manual compatibility checking.

Running into EKS add‑on management problems? Reach out—this is the kind of operational work I do for platform teams.
