Phase B synthesis
Real production incidents rarely announce themselves one at a time. A deploy touches three components, an engineer fat-fingers a request value, and a node is tainted for maintenance — all within the same five-minute window. The ability to triage multiple failures in parallel, prioritize by impact, and apply targeted fixes is what separates a capable platform engineer from a panicking one.
The diagnostic framework
When multiple things are broken, work through a standard triage order:
1. Impact assessment — which failure affects the most users right now?
2. Symptom grouping — cluster symptoms into categories (probe, scheduling, config)
3. Independent fixes — each fix should be isolated and applied separately
4. Verification — confirm each fix before moving to the next
Failure class 1: Probe typo causing constant restarts
A liveness probe with path /heathz instead of /healthz causes the kubelet to send an HTTP GET to a nonexistent path. The server returns 404. The kubelet counts this as a probe failure. After failureThreshold failures, the container is killed and restarted. You see increasing RESTARTS in kubectl get pods.
Triage signal: a climbing RESTARTS count on a pod that keeps returning to Running; the container starts successfully and is killed later by the probe, rather than crashing immediately on startup.
kubectl describe pod <api-pod> | grep -A5 "Liveness:"
# Liveness: http-get http://:80/heathz ...
Fix: correct the probe path in the Deployment spec, then kubectl apply. A rolling update replaces pods with the corrected probe.
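A minimal sketch of the corrected probe block, assuming the port (80) from the describe output above; periodSeconds and failureThreshold here are illustrative defaults, not values taken from the incident:

livenessProbe:
  httpGet:
    path: /healthz   # was /heathz
    port: 80
  periodSeconds: 10      # probe every 10s (illustrative)
  failureThreshold: 3    # restart after 3 consecutive failures (illustrative)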
Failure class 2: Over-requested CPU causing Pending pods
Pods remain in Pending when the scheduler cannot find a node with enough unreserved CPU. With 2 nodes at 2000m allocatable each (4000m total) and 2 worker pods each requesting 2000m (4000m needed), the workers would claim every millicore in the cluster; because system pods already reserve CPU on each node, neither worker fits anywhere.
Triage signal: kubectl get pods shows Pending; kubectl describe pod <worker-pod> shows Insufficient cpu.
Nodes: 2 × 2000m = 4000m
Requested: 2 × 2000m = 4000m ← zero headroom for kube-system pods
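To check this arithmetic against live data, compare each node's allocatable CPU with what is already requested (the placeholder follows the document's <...> convention):

kubectl describe node <node> | grep -A8 "Allocated resources"
# Compare the cpu Requests row against the node's 2000m allocatable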
Fix: reduce resources.requests.cpu to 600m per pod. Total: 2 × 600m = 1200m, well within budget.
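A sketch of the reduced request, assuming the standard container resources block in the worker Deployment:

resources:
  requests:
    cpu: 600m    # was 2000m; leaves headroom for kube-system pods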
Failure class 3: Missing toleration on a tainted node
A DaemonSet that should run on every node will skip tainted nodes unless it carries a matching toleration. Node sim-node-1 carries the taint env=prod:NoSchedule, so the logger pod is never placed there; the DaemonSet controller excludes the node from its desired count, and one node runs no logger at all.
Triage signal: kubectl get daemonset logger shows DESIRED=1, READY=1 on a two-node cluster; kubectl describe node sim-node-1 shows the taint.
Fix:
spec:
  template:
    spec:
      tolerations:
      - key: env
        operator: Equal
        value: prod
        effect: NoSchedule
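After applying, confirm the DaemonSet converges on both nodes (logger is the DaemonSet name used above):

kubectl rollout status daemonset/logger
kubectl get daemonset logger   # DESIRED and READY should both read 2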
Fix order recommendation
| Priority | Failure | Rationale |
|----------|---------|-----------|
| 1 | Probe typo (api) | Users get 502 errors — most visible |
| 2 | CPU over-request (worker) | Background jobs failing silently |
| 3 | DaemonSet toleration (logger) | Log gaps — important but not user-facing |
Apply each fix to the appropriate resource separately. With multi-document YAML you can apply all three in one command — but test the manifests first in a dry run:
kubectl apply -f fixes.yaml --dry-run=client
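A layout sketch of fixes.yaml; the full specs are elided, and the resource names (api, worker, logger) are the ones assumed throughout this section:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
# ... spec with the corrected livenessProbe path (failure class 1)
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker
# ... spec with the reduced resources.requests.cpu (failure class 2)
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: logger
# ... spec with the env=prod toleration (failure class 3)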
After stabilization
Once all three fixes are live:
- Write a postmortem — document the timeline, each root cause, and which monitoring alert caught (or missed) each failure.
- Add alerts — liveness probe path correctness is not directly alertable, but a rising restart count is. Set an alert on kube_pod_container_status_restarts_total > 5 in 5m (see the rule sketch after this list).
- Add admission checks — a validating webhook or OPA policy can reject probes pointing to nonexistent paths, and reject pod specs with requests that saturate an entire node.
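A sketch of that restart alert as a Prometheus rule, assuming kube-state-metrics exports the metric; the group name, alert name, and labels are illustrative:

groups:
- name: pod-stability
  rules:
  - alert: PodRestartingFrequently
    expr: increase(kube_pod_container_status_restarts_total[5m]) > 5
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.namespace }}/{{ $labels.pod }} restarted more than 5 times in 5m"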
Further reading