Capstone Overview
This lab pulls together every Phase D topic into a single incident response scenario. You'll fix four independent failures that have stacked up in ShopEKS:
- ArgoCD target revision: stale branch reference
- PDB misconfiguration: drain permanently blocked
- Karpenter over-consolidation: pods evicted under load
- Pod Identity wrong role: ECR pull unauthorized
Working through a stacked incident requires systematic triage: fix the most disruptive issue first, verify it, then move to the next.
Systematic Triage Framework
1. Identify blast radius: what is down? (pods, namespace, full cluster?)
2. Check recent changes: ArgoCD sync history, Karpenter event log
3. Isolate root cause: one failure at a time
4. Fix + verify: apply, confirm pods healthy
5. Move to next failure
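The ordering rule behind these steps (most disruptive first, skipping anything already verified) can be sketched in a few lines of Python. The `Failure` type, the `BLAST_RADIUS` ranking, and the scopes used in the example are hypothetical and exist only for illustration; they are not part of the lab tooling.

```python
from dataclasses import dataclass

# Hypothetical ranking for illustration: a higher number is more disruptive.
BLAST_RADIUS = {"pod": 1, "namespace": 2, "cluster": 3}


@dataclass
class Failure:
    name: str
    scope: str                    # one of the BLAST_RADIUS keys
    verified_fixed: bool = False  # set once the fix has been confirmed


def triage_order(failures):
    """Return the unfixed failures, most disruptive first.

    Python's sort is stable, so failures with the same scope keep
    the order in which they were listed.
    """
    pending = [f for f in failures if not f.verified_fixed]
    return sorted(pending, key=lambda f: BLAST_RADIUS[f.scope], reverse=True)


if __name__ == "__main__":
    # The scopes below are assumed, not taken from the lab.
    stacked = [
        Failure("argocd-stale-revision", "namespace"),
        Failure("pdb-blocking-drain", "cluster"),
        Failure("wrong-role", "pod", verified_fixed=True),
    ]
    for failure in triage_order(stacked):
        print(failure.name)
```

Re-running `triage_order` after each verified fix gives the next failure to work on, which mirrors step 5.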
ArgoCD: Stale targetRevision
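ArgoCD stores the target Git ref in the Application spec (spec.source.targetRevision). If that branch is deleted, ArgoCD can no longer resolve the revision: the Application reports a ComparisonError, its sync status becomes Unknown, and syncing stops. Fix: update targetRevision to a branch that exists (main).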
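As a minimal sketch of that fix, the snippet below detects a stale reference and builds a JSON merge patch for the Application. It assumes a single-source Application (the ref lives at `spec.source.targetRevision`); the function names and the example branch names are hypothetical.

```python
import json


def needs_fix(app: dict, existing_branches: set) -> bool:
    """True when the Application tracks a branch that no longer exists."""
    return app["spec"]["source"]["targetRevision"] not in existing_branches


def target_revision_patch(new_revision: str) -> str:
    """Build a JSON merge patch that sets spec.source.targetRevision."""
    return json.dumps({"spec": {"source": {"targetRevision": new_revision}}})


if __name__ == "__main__":
    # Example Application fragment; the branch name is made up.
    app = {"spec": {"source": {"targetRevision": "feature/old-branch"}}}
    if needs_fix(app, existing_branches={"main"}):
        # The patch could then be applied with, for example:
        #   kubectl -n argocd patch application <app-name> --type merge -p '<patch>'
        # (namespace and application name depend on your setup)
        print(target_revision_patch("main"))
```

If the Application manifest is itself managed from Git, the same change would also need to be committed there, otherwise the next sync of the parent may revert it.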
PDB: Blocking Drain
Setting minAvailable: N, where N equals the total replica count, leaves zero allowed disruptions, so every eviction request is rejected: all replicas are always "needed." This blocks node drains, EKS upgrades, and Karpenter consolidation until the PDB is changed. Fix: minAvailable: replicas - 1, so that one pod at a time can be evicted.
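The arithmetic is simple enough to check directly. This sketch mirrors how allowed disruptions follow from the healthy pod count and minAvailable; the function name is hypothetical, and it only covers an integer minAvailable (not percentages or maxUnavailable).

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Pods that may be evicted right now without violating the PDB."""
    return max(0, healthy_pods - min_available)


if __name__ == "__main__":
    replicas = 3
    # minAvailable equal to replicas: nothing can ever be evicted.
    print(allowed_disruptions(replicas, min_available=replicas))
    # minAvailable of replicas - 1: one pod at a time can be evicted.
    print(allowed_disruptions(replicas, min_available=replicas - 1))
```

Note that with `replicas - 1`, a single unhealthy pod already brings the allowed disruptions back to zero, so a drain can still stall until that pod recovers.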
Karpenter: Over-aggressive Consolidation
A consolidationPolicy of WhenEmptyOrUnderutilized with a short consolidateAfter (e.g. 5 minutes) keeps disrupting and rescheduling pods while load fluctuates. This causes tail latency spikes and connection resets. Fix for stateful/production pools: switch consolidationPolicy to WhenEmpty and increase consolidateAfter to 30 minutes or more.
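A small lint-style check can flag the risky combination before it reaches the cluster. The sketch below takes the `disruption` block of a NodePool as a plain dict; the function names, the warning texts, and the 30-minute threshold default are assumptions for illustration, and the duration parser handles only simple `s`/`m`/`h` values.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def duration_seconds(value: str) -> int:
    """Parse simple durations such as '30s', '5m', or '1h30m'."""
    parts = re.findall(r"(\d+)([smh])", value)
    if not parts or "".join(n + u for n, u in parts) != value:
        raise ValueError(f"unsupported duration: {value!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in parts)


def consolidation_warnings(disruption: dict, min_after: str = "30m") -> list:
    """Return warnings for a NodePool disruption block (empty list if none)."""
    warnings = []
    if disruption.get("consolidationPolicy") == "WhenEmptyOrUnderutilized":
        warnings.append("underutilized nodes are consolidated; consider WhenEmpty")
    after = disruption.get("consolidateAfter")
    # 'Never' disables consolidation, so there is nothing to compare.
    if after not in (None, "Never"):
        if duration_seconds(after) < duration_seconds(min_after):
            warnings.append(f"consolidateAfter {after} is shorter than {min_after}")
    return warnings


if __name__ == "__main__":
    risky = {"consolidationPolicy": "WhenEmptyOrUnderutilized", "consolidateAfter": "5m"}
    for warning in consolidation_warnings(risky):
        print(warning)
```

Whether 30 minutes is the right floor depends on how quickly load changes for a given pool; the parameter is there so the threshold can be tuned per environment.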
Pod Identity: Wrong Role
The ServiceAccount annotation eks.amazonaws.com/role-arn must point to a role that has ecr:GetAuthorizationToken and the repository-level pull permissions (ecr:BatchGetImage, ecr:GetDownloadUrlForLayer, ecr:BatchCheckLayerAvailability). In this scenario, a wrong role ARN causes ImagePullBackOff on pods that use that ServiceAccount. Fix: update the annotation to the correct role ARN.
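Before pointing the annotation at a role, it can help to confirm that the role's policy actually grants the pull actions. The following is a rough sketch, not a full IAM evaluator: it only looks at `Allow` statements and action names (with `*` and `?` wildcards) and ignores `Deny` statements, resources, and conditions. The function name and the example policy are hypothetical.

```python
from fnmatch import fnmatchcase

# Actions needed to authenticate to ECR and pull image layers.
REQUIRED_ECR_ACTIONS = (
    "ecr:GetAuthorizationToken",
    "ecr:BatchGetImage",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchCheckLayerAvailability",
)


def missing_ecr_actions(policy: dict) -> list:
    """List required ECR actions that no Allow statement in the policy covers."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    allowed = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        allowed.extend(actions)
    # IAM action names are case-insensitive, so compare in lowercase.
    return [
        required
        for required in REQUIRED_ECR_ACTIONS
        if not any(fnmatchcase(required.lower(), pattern.lower()) for pattern in allowed)
    ]


if __name__ == "__main__":
    # Made-up policy document for illustration.
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["ecr:GetAuthorizationToken", "ecr:BatchGet*"], "Resource": "*"}
        ],
    }
    print(missing_ecr_actions(policy))
```

An empty result only means the action names are present; a repository-scoped `Resource` or an explicit `Deny` can still block the pull.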
Further Reading
EKS Troubleshooting