KubernetesEKSDevOps

Zero-Downtime EKS Upgrades: A Complete Guide

January 2025•12 min read

Upgrading Amazon EKS clusters across multiple major versions while maintaining 100% uptime requires careful planning, API compatibility validation, and coordinated execution. This guide shares lessons learned from performing 10+ major EKS upgrades.

The Challenge

EKS version upgrades can be deceptively complex. While AWS handles the control plane upgrade seamlessly, the real challenges emerge from:

Deprecated Kubernetes APIs that break existing workloads
Helm charts using outdated API versions
Node group rollouts affecting running pods
Add-on version compatibility requirements
Custom controllers and operators with version dependencies

Pre-Upgrade Phase

1. API Deprecation Audit with Pluto

Before any upgrade, run Pluto to scan your cluster for deprecated APIs:

api-audit.sh

# Install Pluto
brew install FairwindsOps/tap/pluto

# Scan live cluster
pluto detect-helm -o wide

# Scan Helm releases for target version
pluto detect-helm --target-versions k8s=v1.29

# Scan local manifests
pluto detect-files -d ./manifests/

Pluto identifies resources using deprecated APIs and tells you which version they'll be removed in. Address all critical findings before proceeding.

2. Helm Chart Migration

Update Helm charts to use current API versions. Common migrations include:

Old API	New API	Removed In
`networking.k8s.io/v1beta1`	`networking.k8s.io/v1`	v1.22
`policy/v1beta1 PodDisruptionBudget`	`policy/v1`	v1.25
`batch/v1beta1 CronJob`	`batch/v1`	v1.25
`autoscaling/v2beta2 HPA`	`autoscaling/v2`	v1.26

3. Add-on Compatibility Matrix

Document your current add-on versions and verify compatibility with the target EKS version:

CoreDNS - Check AWS recommended version
kube-proxy - Must match minor Kubernetes version
VPC CNI - Review changelog for breaking changes
EBS CSI Driver - Verify snapshot CRD compatibility

Upgrade Execution

1. Control Plane Upgrade

The control plane upgrade is straightforward but requires preparation:

upgrade-control-plane.sh

# Check current version
aws eks describe-cluster --name my-cluster \
  --query 'cluster.version'

# Upgrade control plane (one minor version at a time)
aws eks update-cluster-version \
  --name my-cluster \
  --kubernetes-version 1.29

# Monitor upgrade progress
aws eks describe-update \
  --name my-cluster \
  --update-id <update-id>

2. Node Group Rolling Update

For managed node groups, use the rolling update strategy:

upgrade-nodes.sh

# Update node group with rolling strategy
aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --kubernetes-version 1.29

# Or with Terraform
resource "aws_eks_node_group" "main" {
  # ... other config

  update_config {
    max_unavailable_percentage = 25
  }
}

3. Karpenter Considerations

If using Karpenter for node provisioning, update your NodePool configuration:

karpenter-nodepool.yaml

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
      nodeClassRef:
        name: default
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h  # 30 days

Post-Upgrade Validation

After the upgrade, validate cluster health:

validate.sh

# Verify cluster version
kubectl version --short

# Check all nodes are Ready
kubectl get nodes

# Verify system pods
kubectl get pods -n kube-system

# Check all deployments have available replicas
kubectl get deployments -A | grep -v "1/1\|2/2\|3/3"

# Validate ingress is working
curl -I https://your-app.example.com

# Run smoke tests
./scripts/smoke-tests.sh

Lessons Learned

Always upgrade one minor version at a time. Skipping versions can cause unexpected issues.
Test in non-production first. Maintain staging clusters at the same version as production.
Schedule upgrades during low-traffic periods. Even with zero-downtime strategies, reduce risk.
Have rollback procedures ready. Know how to revert if critical issues emerge.
Document everything. Create runbooks for repeatable upgrade processes.

Conclusion

Zero-downtime EKS upgrades are achievable with proper planning and execution. The key is thorough pre-upgrade validation, especially API deprecation audits with tools like Pluto, and a systematic approach to node group updates.

By maintaining upgrade runbooks and testing procedures, you can confidently keep your clusters current with the latest Kubernetes features and security patches.

Amar Sattaur

Staff DevOps Engineer with experience performing 10+ major EKS version upgrades with zero downtime.

All Posts Next Post