Zero-Downtime EKS Upgrades: A Complete Guide
Upgrading Amazon EKS clusters across multiple Kubernetes minor versions while maintaining 100% uptime requires careful planning, API compatibility validation, and coordinated execution. This guide shares lessons learned from performing 10+ major EKS upgrades.
The Challenge
EKS version upgrades can be deceptively complex. While AWS handles the control plane upgrade seamlessly, the real challenges emerge from:
- Deprecated Kubernetes APIs that break existing workloads
- Helm charts using outdated API versions
- Node group rollouts affecting running pods
- Add-on version compatibility requirements
- Custom controllers and operators with version dependencies
Pre-Upgrade Phase
1. API Deprecation Audit with Pluto
Before any upgrade, run Pluto to scan your cluster for deprecated APIs:
# Install Pluto
brew install FairwindsOps/tap/pluto
# Scan Helm releases in the live cluster
pluto detect-helm -o wide
# Scan Helm releases against the target version
pluto detect-helm --target-versions k8s=v1.29
# Scan local manifests
pluto detect-files -d ./manifests/
Pluto identifies resources using deprecated APIs and tells you which version they'll be removed in. Address all critical findings before proceeding.
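To make "address all critical findings" enforceable in a pipeline, the scan can be wrapped in a gate. The following is a minimal sketch, not a definitive implementation: `pluto_gate` is a hypothetical helper name, and it assumes Pluto's documented exit codes (2 when deprecated APIs are found, 3 when removed APIs are found), which are worth confirming against the Pluto version you have installed.

```shell
#!/usr/bin/env bash
# Sketch: gate an upgrade on the result of a Pluto scan.
# Assumption: Pluto exits 2 for deprecated APIs and 3 for removed APIs.
# pluto_gate is a hypothetical helper, not part of Pluto itself.
pluto_gate() {
  local target="$1"   # target Kubernetes version, e.g. v1.29
  local rc=0

  # Run the same scan as above and capture the exit code without aborting.
  pluto detect-helm --target-versions "k8s=${target}" -o wide || rc=$?

  case "$rc" in
    0) echo "OK: no deprecated or removed APIs for ${target}" ;;
    2) echo "WARN: deprecated APIs found; plan migrations" ;;
    3) echo "BLOCK: APIs removed in ${target} are still in use" ;;
    *) echo "ERROR: pluto exited with code ${rc}" ;;
  esac
  return "$rc"
}
```

Calling `pluto_gate v1.29` as a pipeline step fails the job on any non-zero code; a team that tolerates deprecations but not removals could fail only when the code is 3.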
2. Helm Chart Migration
Update Helm charts to use current API versions. Common migrations include:
| Old API | New API | Removed In |
|---|---|---|
| networking.k8s.io/v1beta1 Ingress | networking.k8s.io/v1 | v1.22 |
| policy/v1beta1 PodDisruptionBudget | policy/v1 | v1.25 |
| batch/v1beta1 CronJob | batch/v1 | v1.25 |
| autoscaling/v2beta2 HPA | autoscaling/v2 | v1.26 |
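For chart templates and raw manifests in a repository, a quick text search for the old API versions in the table can complement Pluto. This is a rough sketch under stated assumptions: `find_old_api_versions` is a hypothetical helper, it only looks at `.yaml` and `.yml` files, and it does not match quoted `apiVersion` values or understand Helm templating.

```shell
#!/usr/bin/env bash
# Sketch: list manifest lines that still use an apiVersion from the table above.
# find_old_api_versions is a hypothetical helper; output is "file:line:text".
find_old_api_versions() {
  local dir="$1"
  grep -rnE --include='*.yaml' --include='*.yml' \
    '^[[:space:]]*apiVersion:[[:space:]]*(networking\.k8s\.io/v1beta1|policy/v1beta1|batch/v1beta1|autoscaling/v2beta2)[[:space:]]*$' \
    "$dir"
}
```

Running `find_old_api_versions ./manifests/` prints each offending line, and exits non-zero when nothing matches. Changing the `apiVersion` string is not always enough: the Ingress schema, for example, changed between `v1beta1` and `v1`, so each hit still needs a manual review.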
3. Add-on Compatibility Matrix
Document your current add-on versions and verify compatibility with the target EKS version:
- CoreDNS - Check AWS recommended version
- kube-proxy - Keep on the same Kubernetes minor version as the cluster; it must not be newer than the control plane
- VPC CNI - Review changelog for breaking changes
- EBS CSI Driver - Verify snapshot CRD compatibility
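For add-ons installed as EKS managed add-ons, the versions AWS publishes for a given Kubernetes version can be listed with the AWS CLI. A minimal sketch, assuming the managed add-on names `coredns`, `kube-proxy`, `vpc-cni`, and `aws-ebs-csi-driver`; `list_addon_versions` is a hypothetical wrapper name.

```shell
#!/usr/bin/env bash
# Sketch: list the add-on versions published for a target Kubernetes version.
# list_addon_versions is a hypothetical wrapper around the AWS CLI.
list_addon_versions() {
  local addon="$1"        # e.g. coredns
  local k8s_version="$2"  # e.g. 1.29
  aws eks describe-addon-versions \
    --addon-name "$addon" \
    --kubernetes-version "$k8s_version" \
    --query 'addons[].addonVersions[].addonVersion' \
    --output text
}
```

For example, `list_addon_versions coredns 1.29` prints the candidate CoreDNS versions, which can be compared with the version currently installed on the cluster. Self-managed add-ons are not covered by this call and need checking against their own release notes.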
Upgrade Execution
1. Control Plane Upgrade
The control plane upgrade is straightforward but requires preparation:
# Check current version
aws eks describe-cluster --name my-cluster \
  --query 'cluster.version'
# Upgrade control plane (one minor version at a time)
aws eks update-cluster-version \
  --name my-cluster \
  --kubernetes-version 1.29
# Monitor upgrade progress
aws eks describe-update \
  --name my-cluster \
  --update-id <update-id>
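Rather than re-running `describe-update` by hand, the monitoring step can be put in a polling loop. This is a sketch, not a definitive implementation: `wait_for_update` is a hypothetical helper, and the 30-second default interval is an arbitrary assumption.

```shell
#!/usr/bin/env bash
# Sketch: poll an EKS update until it reaches a terminal status.
# wait_for_update is a hypothetical helper; the default interval is an assumption.
wait_for_update() {
  local cluster="$1"
  local update_id="$2"
  local interval="${3:-30}"   # seconds between polls
  local status

  while true; do
    # Stop if the CLI call itself fails (bad credentials, wrong update ID, ...).
    status="$(aws eks describe-update \
      --name "$cluster" \
      --update-id "$update_id" \
      --query 'update.status' \
      --output text)" || return 1

    case "$status" in
      Successful)
        echo "update ${update_id}: Successful"
        return 0 ;;
      Failed|Cancelled)
        echo "update ${update_id}: ${status}" >&2
        return 1 ;;
      *)
        # Still InProgress: wait and poll again.
        sleep "$interval" ;;
    esac
  done
}
```

A call such as `wait_for_update my-cluster <update-id>` blocks until the control plane update finishes, so the node group steps only start after a successful result.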
2. Node Group Rolling Update
For managed node groups, use the rolling update strategy:
# Update node group with rolling strategy
aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --kubernetes-version 1.29
# Or with Terraform
resource "aws_eks_node_group" "main" {
  # ... other config
  update_config {
    max_unavailable_percentage = 25
  }
}
3. Karpenter Considerations
If using Karpenter for node provisioning, update your NodePool configuration:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
      nodeClassRef:
        name: default
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h # 30 days
Post-Upgrade Validation
After the upgrade, validate cluster health:
# Verify client and server versions (the --short flag was removed in kubectl 1.28)
kubectl version
# Check all nodes are Ready
kubectl get nodes
# Verify system pods
kubectl get pods -n kube-system
# Rough check for deployments missing replicas (hides those at 1/1, 2/2, or 3/3)
kubectl get deployments -A | grep -v "1/1\|2/2\|3/3"
# Validate ingress is working
curl -I https://your-app.example.com
# Run smoke tests
./scripts/smoke-tests.sh
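The `grep -v` filter above only recognizes healthy deployments with one to three replicas, so larger deployments show up as false positives. A minimal sketch of a stricter check compares desired and available replicas directly; `unavailable_deployments` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print deployments whose available replicas differ from the desired count.
# unavailable_deployments is a hypothetical helper name.
unavailable_deployments() {
  kubectl get deployments -A --no-headers \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,WANT:.spec.replicas,AVAIL:.status.availableReplicas' \
    | awk '{
        # kubectl prints <none> when availableReplicas is unset; treat it as 0.
        avail = ($4 == "<none>") ? 0 : $4
        if ($3 + 0 != avail + 0)
          print $1 "/" $2, "available=" avail, "desired=" $3
      }'
}
```

An empty result means every deployment has all of its desired replicas available, including deployments that are intentionally scaled to zero.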
Lessons Learned
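- Always upgrade one minor version at a time. EKS only moves the control plane one minor version per update, so plan a multi-version jump as a series of single-version upgrades and validate after each one.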
- Test in non-production first. Maintain staging clusters at the same version as production.
- Schedule upgrades during low-traffic periods. Even with zero-downtime strategies, reduce risk.
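- Have rollback procedures ready. An EKS control plane cannot be downgraded after an upgrade, so know in advance how you will revert workloads, add-ons, and node groups if critical issues emerge.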
- Document everything. Create runbooks for repeatable upgrade processes.
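The one-minor-version-at-a-time rule from the list above is easy to encode in a runbook script. The sketch below is illustrative only: `upgrade_path` is a hypothetical helper, and it assumes versions are written as MAJOR.MINOR (for example 1.26) with no patch component.

```shell
#!/usr/bin/env bash
# Sketch: print each intermediate minor version between two Kubernetes versions.
# upgrade_path is a hypothetical helper; inputs must look like MAJOR.MINOR.
upgrade_path() {
  local from="$1"
  local to="$2"
  local major="${from%%.*}"
  local from_minor="${from#*.}"
  local to_minor="${to#*.}"

  if [ "${to%%.*}" != "$major" ] || [ "$to_minor" -le "$from_minor" ]; then
    echo "target must be a later minor version of the same major version" >&2
    return 1
  fi

  # Emit one line per upgrade step, in order.
  local m=$((from_minor + 1))
  while [ "$m" -le "$to_minor" ]; do
    echo "${major}.${m}"
    m=$((m + 1))
  done
}
```

For example, `upgrade_path 1.26 1.29` prints 1.27, 1.28, and 1.29 on separate lines, and each line corresponds to one full pass through the pre-upgrade, execution, and validation phases.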
Conclusion
Zero-downtime EKS upgrades are achievable with proper planning and execution. The key is thorough pre-upgrade validation, especially API deprecation audits with tools like Pluto, and a systematic approach to node group updates.
By maintaining upgrade runbooks and testing procedures, you can confidently keep your clusters current with the latest Kubernetes features and security patches.
Amar Sattaur
Staff DevOps Engineer with experience performing 10+ major EKS version upgrades with zero downtime.