Stop Managing EKS Add-ons by Hand

Published: (April 5, 2026 at 12:53 PM EDT)
Source: Dev.to

Originally published on graycloudarch.com.

The problem

I was preparing to upgrade a production EKS cluster to v1.32 when I discovered that four core add‑ons were running versions incompatible with the new EKS release:

| Add‑on | Install mode | Current version | Issue |
| --- | --- | --- | --- |
| VPC CNI | Self‑managed (auto‑installed) | | Incompatible |
| CoreDNS | Self‑managed (auto‑installed) | | Incompatible |
| kube‑proxy | Self‑managed (auto‑installed) | | Incompatible |
| Metrics Server | Installed via kubectl apply -f | | Incompatible |

There was no version pinning, no history of changes, and no way to test the upgrade safely.
That’s when I decided to stop managing EKS add‑ons by hand.

Two categories of EKS add‑ons

| Category | Who manages it? | What you get |
| --- | --- | --- |
| Self‑managed | You – you install, update, and ensure compatibility. AWS won’t help troubleshoot. | Full control, but you must manually verify compatibility on every EKS release. |
| EKS‑managed | AWS – lifecycle, version compatibility, security patches, and support are handled for you. | Simpler upgrades; AWS publishes a compatible version for each EKS release. |

If you created a cluster without explicitly enabling managed add‑ons, the VPC CNI, CoreDNS, and kube‑proxy are currently self‑managed.

The fix is straightforward: migrate them to EKS‑managed.
Metrics Server, however, is a plain kubectl‑installed resource and isn’t managed by anything.
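
One way to confirm which core add‑ons are still self‑managed is to compare them against what EKS reports. This is a minimal sketch, not part of the original setup: the function name is hypothetical, and it relies on aws eks list-addons returning only add‑ons that EKS manages.

```shell
# Print which core add-ons are NOT registered as EKS-managed.
# `aws eks list-addons` only lists add-ons that EKS manages, so anything
# missing from its output is self-managed (or not installed at all).
unmanaged_core_addons() {
  local cluster="$1" managed addon
  managed="$(aws eks list-addons --cluster-name "$cluster" \
    --query 'addons[]' --output text)"
  for addon in vpc-cni coredns kube-proxy; do
    # Unquoted $managed collapses the tab-separated output into
    # space-separated words for the pattern match below.
    case " $(echo $managed) " in
      *" $addon "*) ;;              # already EKS-managed
      *) echo "$addon" ;;           # candidate for migration
    esac
  done
}

# Example: unmanaged_core_addons "<cluster-name>"
```

An empty result means all three are already EKS‑managed.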

One Terraform module for all add‑ons

I built a single eks-addons Terraform module that manages everything in one place.

Managed by AWS (EKS‑managed)

| Add‑on | Purpose |
| --- | --- |
| VPC CNI | Pod networking |
| EBS CSI Driver | Persistent volumes (added while I was at it) |
| CoreDNS | DNS resolution |
| kube‑proxy | Network proxy |

Managed by Helm (Helm‑managed)

| Add‑on | Purpose |
| --- | --- |
| Metrics Server | Resource metrics for kubectl top and HPA |
| Reloader | Auto‑restart pods when ConfigMaps or Secrets change |

Why a single module?
All add‑ons share the same dependency – the EKS cluster. Consolidating them means:

  • One terragrunt apply deploys everything.
  • One terraform plan shows drift across all add‑ons.
  • One PR updates any version.

Core Terraform for an EKS‑managed add‑on

resource "aws_eks_addon" "vpc_cni" {
  count = var.enable_vpc_cni ? 1 : 0

  cluster_name                 = var.cluster_name
  addon_name                   = "vpc-cni"
  addon_version                = var.vpc_cni_version
  resolve_conflicts_on_create  = "OVERWRITE"
  resolve_conflicts_on_update  = "OVERWRITE"
  preserve                     = true
}

Two important flags

  • resolve_conflicts_on_create and resolve_conflicts_on_update = "OVERWRITE" – Terraform is the source of truth; any manual changes to the add‑on in the cluster are overwritten on the next apply.
  • preserve = true – If the resource is removed from Terraform, the add‑on's workloads keep running in the cluster; EKS simply stops managing them. This acts as a safety net during refactoring.
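
To pick a value for var.vpc_cni_version, you can ask EKS which add‑on versions it publishes for the target Kubernetes release. This is a small sketch with a hypothetical function name; it only wraps the aws eks describe-addon-versions call and prints one version per line.

```shell
# List the add-on versions EKS publishes as compatible with a given
# Kubernetes version, one per line.
compatible_addon_versions() {
  local addon="$1" k8s_version="$2"
  aws eks describe-addon-versions \
    --addon-name "$addon" \
    --kubernetes-version "$k8s_version" \
    --query 'addons[].addonVersions[].addonVersion' \
    --output text | tr '\t' '\n'
}

# Example: compatible_addon_versions vpc-cni 1.32
```

The same call works for coredns, kube-proxy, and aws-ebs-csi-driver.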

EBS CSI Driver – extra IAM work

The EBS CSI Driver needs IAM permissions to create/attach EBS volumes. The recommended pattern is IRSA (IAM Roles for Service Accounts).

# IAM role for the driver
resource "aws_iam_role" "ebs_csi" {
  count = var.enable_ebs_csi ? 1 : 0
  name  = "${var.cluster_name}-ebs-csi-driver"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = var.oidc_provider_arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "${var.oidc_provider}:sub" = "system:serviceaccount:kube-system:ebs-csi-controller-sa"
          "${var.oidc_provider}:aud" = "sts.amazonaws.com"
        }
      }
    }]
  })
}

# Attach the AWS‑managed policy
resource "aws_iam_role_policy_attachment" "ebs_csi" {
  count      = var.enable_ebs_csi ? 1 : 0
  role       = aws_iam_role.ebs_csi[0].name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy"
}

With IRSA there are no long‑lived credentials in pods, the temporary credentials rotate automatically, and every API call leaves a clean audit trail in CloudTrail.
IRSA is the right pattern for any workload that needs to call AWS APIs from inside Kubernetes.
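
After the apply, it can be worth checking that the driver's service account actually points at the role. The sketch below assumes the role ARN was passed to the add‑on (for example through the service_account_role_arn argument of aws_eks_addon, which is not shown above), so that the service account carries the eks.amazonaws.com/role-arn annotation. The function name is hypothetical.

```shell
# Check that a service account carries the expected IRSA role annotation.
check_irsa_annotation() {
  local namespace="$1" service_account="$2" expected_arn="$3" actual_arn
  # Dots inside the annotation key must be escaped in jsonpath.
  actual_arn="$(kubectl get serviceaccount "$service_account" -n "$namespace" \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}')"
  if [ "$actual_arn" = "$expected_arn" ]; then
    echo "ok: $service_account uses $actual_arn"
  else
    echo "mismatch: $service_account has '${actual_arn:-<none>}'"
    return 1
  fi
}

# Example (placeholders are yours to fill in):
# check_irsa_annotation kube-system ebs-csi-controller-sa \
#   "arn:aws:iam::<account-id>:role/<cluster-name>-ebs-csi-driver"
```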

Migrating the Metrics Server

1️⃣ Delete the existing kubectl‑installed resources

kubectl delete deployment metrics-server -n kube-system
kubectl delete service metrics-server -n kube-system
kubectl delete apiservice v1beta1.metrics.k8s.io

2️⃣ Install the Helm‑managed version via Terraform

resource "helm_release" "metrics_server" {
  count       = var.enable_metrics_server ? 1 : 0
  name        = "metrics-server"
  repository  = "https://kubernetes-sigs.github.io/metrics-server/"
  chart       = "metrics-server"
  version     = var.metrics_server_chart_version
  namespace   = "kube-system"

  values = [yamlencode({
    replicas = 2
    args = [
      "--kubelet-preferred-address-types=InternalIP",
      "--kubelet-insecure-tls"
    ]
    podDisruptionBudget = {
      enabled      = true
      minAvailable = 1
    }
  })]
}

Expected downtime: 2‑3 minutes. During that window kubectl top is unavailable and HPAs that rely on resource metrics cannot make new scaling decisions; running applications are otherwise not affected.
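
To know when that window has closed, you can wait on the new deployment and on the metrics API registration. This is a sketch with a hypothetical function name and an arbitrary default timeout; it assumes the Helm release creates a deployment named metrics-server in kube-system, matching the release name used above.

```shell
# Block until Metrics Server is rolled out and its API is serving,
# then print node metrics as a final smoke test.
wait_for_metrics_server() {
  local timeout="${1:-180s}"
  kubectl rollout status deployment/metrics-server -n kube-system --timeout="$timeout" &&
    kubectl wait --for=condition=Available apiservice/v1beta1.metrics.k8s.io --timeout="$timeout" &&
    kubectl top nodes
}

# Example: wait_for_metrics_server 300s
```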

CI/CD gotcha

Our GitHub Actions workflow looks for modified terragrunt.hcl files to decide which stacks to deploy. When I changed files under common/modules/eks-addons/, the workflow triggered but found no stacks (no terragrunt.hcl changed), so nothing ran.

Solution: Deploy module changes manually.

cd workloads-nonprod/us-east-1/cluster-name/eks-addons
terragrunt init
terragrunt plan   # Should show ~10 resources to add
terragrunt apply
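
A possible alternative to deploying by hand is to let the workflow map changed module files to the stacks that use them. The sketch below is not the workflow from this project. Both function names are hypothetical, and it assumes each stack's terragrunt.hcl references its module through a path containing modules/<name>.

```shell
# Print the unique module names touched by a list of changed file paths
# (one path per line on stdin), based on the common/modules/ layout.
changed_modules() {
  sed -n 's|^common/modules/\([^/]*\)/.*|\1|p' | sort -u
}

# Print the directory of every terragrunt.hcl under a root that mentions
# the given module, skipping Terragrunt's cache directories.
stacks_using_module() {
  local module="$1" root="${2:-.}" file
  find "$root" -name terragrunt.hcl ! -path '*/.terragrunt-cache/*' \
    -exec grep -l "modules/${module}" {} + |
    while IFS= read -r file; do dirname "$file"; done | sort -u
}

# Example (the base branch name is an assumption):
# git diff --name-only origin/main...HEAD | changed_modules |
#   while IFS= read -r m; do stacks_using_module "$m"; done
```

The substring match is deliberately loose, so a module named eks-addons would also match a stack that references eks-addons-extra. Tighten the pattern if your module names overlap.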

Verify everything is healthy

# Check EKS‑managed add‑on status
CLUSTER_NAME="<cluster-name>"   # replace with your cluster name
for addon in vpc-cni aws-ebs-csi-driver coredns kube-proxy; do
  aws eks describe-addon \
    --cluster-name "$CLUSTER_NAME" \
    --addon-name "$addon" \
    --query "addon.status" \
    --output text
done

You should see ACTIVE for each add‑on.
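
If you want this check to gate a pipeline rather than be read by eye, a variant that returns a non‑zero status is handy. This is a minimal sketch with a hypothetical function name, wrapping the same describe-addon call.

```shell
# Print "<addon> <status>" for each add-on and fail if any is not ACTIVE.
check_addons_active() {
  local cluster="$1" failed=0 addon status
  shift
  for addon in "$@"; do
    status="$(aws eks describe-addon \
      --cluster-name "$cluster" \
      --addon-name "$addon" \
      --query 'addon.status' \
      --output text)"
    printf '%s\t%s\n' "$addon" "$status"
    [ "$status" = "ACTIVE" ] || failed=1
  done
  return "$failed"
}

# Example:
# check_addons_active "<cluster-name>" vpc-cni aws-ebs-csi-driver coredns kube-proxy
```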

Takeaway

  • Self‑managed add‑ons require you to track versions, compatibility, and security patches.
  • EKS‑managed add‑ons let AWS handle the lifecycle, giving you confidence during upgrades.
  • Consolidating all add‑ons (AWS‑managed, Helm‑managed, and custom) into a single Terraform module simplifies drift detection, version upgrades, and CI/CD integration.

Now the cluster can be upgraded to EKS 1.32 without the previous add‑on compatibility headaches.

To confirm that Metrics Server is serving metrics again, run:

kubectl top nodes

Before

  • Three add‑ons running in self‑managed mode
  • One add‑on installed by kubectl
  • No version history
  • No drift detection

After

  • All six add‑ons defined in code with pinned versions
  • terraform plan shows immediately if anything drifts from the declared state
  • Rollback is simply git revert + terragrunt apply

EKS Cluster Upgrade Checklist

  1. Update the four version strings in the Terragrunt config.
  2. Open a PR.
  3. Merge – the upgrade is applied automatically.

The cluster upgrade I was dreading took about 30 minutes instead of a day of manual compatibility checking.

Running into EKS add‑on management problems? Reach out—this is the kind of operational work I do for platform teams.
