
CKA Troubleshooting Scenarios

CKA Troubleshooting Domain — 30% of the exam

Use these cards like short exam drills: read the problem, try to list what you would check, then reveal investigation steps and the solution. In the real CKA, you work on a live cluster; here the focus is the diagnostic story—symptoms, commands, root cause, and fix.

1. Broken kubelet

Beginner

What you see: A worker node shows NotReady. The scheduler avoids it and pods are not placed on this node.

Investigation

Confirm node state and kubelet health from the control plane and on the node.

kubectl get nodes
kubectl describe node worker-1

SSH to the worker and inspect the kubelet service:

ssh worker-1
sudo systemctl status kubelet
sudo journalctl -u kubelet -n 50 --no-pager

Root cause: The kubelet service is stopped (or failed to start).

Fix: Start kubelet and enable it on boot.

sudo systemctl start kubelet
sudo systemctl enable kubelet

2. Static pod not running

Beginner

What you see: A static pod for a monitoring agent should run on every node, but on worker-2 the mirror pod never appears (or stays failed).

Investigation

Static pods are defined by files on the node. Verify the manifest directory and kubelet configuration.

  • List files under the static pod path (often /etc/kubernetes/manifests/ on kubeadm clusters).
  • Check kubelet config for staticPodPath if the path is non-standard.
  • Validate YAML syntax; kubelet logs often mention parse errors.

ls -la /etc/kubernetes/manifests/
sudo grep -R staticPodPath /var/lib/kubelet/config.yaml /etc/kubernetes/kubelet.conf 2>/dev/null
sudo journalctl -u kubelet -n 80 --no-pager | tail -40

Root cause: The manifest file has a YAML syntax error—for example a missing colon after a key—so kubelet refuses to create the pod.

Fix: Correct the manifest on disk. Kubelet watches the directory and will recreate the static pod automatically.

sudo nano /etc/kubernetes/manifests/monitoring-agent.yaml
# Fix the YAML (e.g. ensure key: value pairs are valid)
sudo systemctl restart kubelet   # only if needed; usually not
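
For reference, a minimal valid static pod manifest has this shape; the name and image below are placeholders for the scenario's monitoring agent, not the actual file contents:

```yaml
# /etc/kubernetes/manifests/monitoring-agent.yaml (sketch; image is a placeholder)
apiVersion: v1
kind: Pod
metadata:
  name: monitoring-agent
  namespace: kube-system
spec:
  containers:
    - name: agent
      image: nginx:1.25   # placeholder image
      resources: {}
```

Note every key is followed by a colon and values are consistently indented; those are the two mistakes kubelet parse errors most often point at.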

3. API server certificate expired

Intermediate

What you see: kubectl fails with: Unable to connect to the server: x509: certificate has expired or is not yet valid.

Investigation

Inspect certificate dates on the control plane node and use kubeadm’s built-in check.

openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates
kubeadm certs check-expiration

Compare notAfter with the current time. If the serving certificate is past expiry, clients will fail TLS verification when they connect to the API server.
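
The expiry comparison can be scripted if you want a quick yes/no; a small sketch, assuming GNU date and the default kubeadm certificate path:

```shell
#!/bin/sh
# Sketch: return 0 if the certificate at $1 is still valid, nonzero if expired.
# Assumes GNU date (-d) and openssl are available.
cert_valid() {
  end=$(openssl x509 -in "$1" -noout -enddate | cut -d= -f2) || return 1
  [ "$(date -d "$end" +%s)" -gt "$(date +%s)" ]
}

# Usage on the control plane (kubeadm default path):
# cert_valid /etc/kubernetes/pki/apiserver.crt && echo valid || echo expired
```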

Root cause: The API server serving certificate has expired.

Fix: Renew the apiserver certificate with kubeadm, then restart the API server (static pod or systemd unit depending on install).

kubeadm certs renew apiserver
# If kube-apiserver is a static pod, moving the manifest aside and back can restart it;
# or delete the pod mirror so kubelet recreates it.
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sleep 3
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/

If the client certificate embedded in /etc/kubernetes/admin.conf has also expired, renew it too (kubeadm certs renew admin.conf) and copy the refreshed file over ~/.kube/config.

4. DNS not resolving

Beginner

What you see: Pods reach external IPs, but nslookup kubernetes.default (or similar) fails inside pods.

Investigation

Check CoreDNS (or kube-dns) pods and logs in kube-system.

kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl get pods -n kube-system -l app.kubernetes.io/name=coredns
kubectl logs -n kube-system deploy/coredns --tail=80

If a specific pod is crashing, describe it and read events.

kubectl describe pod -n kube-system -l k8s-app=kube-dns

Root cause: CoreDNS pods are in CrashLoopBackOff because the loop plugin detected a forwarding loop. Typically the node's /etc/resolv.conf points at a local stub resolver (e.g. systemd-resolved's 127.0.0.53), so queries CoreDNS forwards upstream come straight back to it, and CoreDNS exits by design.

Fix: Edit the CoreDNS ConfigMap so forward points at a reachable upstream resolver (or fix the node's resolv.conf), then restart the CoreDNS pods. Deleting the loop plugin line only disables detection and leaves the loop in place, so treat that as a last resort.

kubectl edit configmap coredns -n kube-system
# In the Corefile block, change `forward . /etc/resolv.conf` to a real upstream, e.g. `forward . 8.8.8.8`
kubectl delete pods -n kube-system -l k8s-app=kube-dns
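
For comparison, a healthy kubeadm-style CoreDNS ConfigMap looks roughly like the sketch below; the upstream address is illustrative, and your cluster's Corefile may carry extra plugins:

```yaml
# Sketch of a working CoreDNS ConfigMap (kubeadm-style defaults, trimmed)
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        forward . 8.8.8.8   # a reachable upstream, not an address that loops back
        loop                # keep this: it detects loops, it does not cause them
        cache 30
        reload
    }
```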

5. NetworkPolicy blocking traffic

Intermediate

What you see: After a new NetworkPolicy was applied, a frontend pod can no longer reach the backend Service—even though endpoints and selectors looked fine before.

Investigation

List policies in the namespace, read rules, and test connectivity from a pod.

kubectl get networkpolicy -n app
kubectl describe networkpolicy deny-all-ingress -n app
kubectl exec -n app deploy/frontend -- curl -sS -m 2 http://backend.app.svc.cluster.local

Remember: some CNI implementations enforce default-deny once an ingress policy selects pods, unless another policy explicitly allows traffic.

📂 Repo reference: See k8s/labs/security/deny-all-ingress.yaml and allow-ingress.yaml for ready-made deny/allow policy examples. Also see deny-from-other-namespaces.yaml for namespace isolation.
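
For reference, a default-deny ingress policy of the kind those repo files implement generally looks like this sketch (the actual repo files may differ in names and namespace):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
  namespace: app
spec:
  podSelector: {}      # empty selector: applies to every pod in the namespace
  policyTypes:
    - Ingress          # Ingress listed with no rules, so all ingress is denied
```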

Root cause: A default-deny ingress NetworkPolicy was added, but no policy allows traffic from the frontend pods to the backend pods.

Fix: Add an allow rule with matching labels (and namespace if cross-namespace). Example based on k8s/labs/security/allow-ingress.yaml:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: app
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080

Adjust labels, ports, and add namespaceSelector if the client lives in another namespace.

6. PVC not binding

Beginner

What you see: A pod stays Pending. Events mention the persistent volume claim is not bound.

Investigation

kubectl get pvc -A
kubectl get pv
kubectl describe pvc data-vol -n dev

Look for StorageClass mismatches, missing provisioner, or quota issues in the PVC events.

Root cause: The PVC requests storageClassName: fast, but no StorageClass named fast exists (or no default can satisfy it).

Fix: Create the fast StorageClass, or recreate the PVC with an existing class (e.g. standard). Note that spec.storageClassName is immutable once a PVC exists, so an in-place patch will be rejected.

kubectl get storageclass
# Option A: create StorageClass `fast` for your provisioner
# Option B: delete the PVC and re-apply it with storageClassName: standard
kubectl delete pvc data-vol -n dev
# ...then re-apply the corrected PVC manifest
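
The Option A StorageClass might look like the following; the provisioner is environment-specific, and the value below is a placeholder for a local-path provisioner:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: rancher.io/local-path   # placeholder; use your cluster's provisioner
volumeBindingMode: WaitForFirstConsumer
```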

7. Deployment not rolling out

Intermediate

What you see: kubectl rollout status deployment/web hangs. Only one of three new pods becomes available.

Investigation

kubectl get pods -l app=web -o wide
kubectl describe deployment web
kubectl get rs -l app=web
kubectl describe rs web-7d9c8f4b5c

Check new ReplicaSet pod events for ImagePullBackOff, CrashLoopBackOff, or probe failures.

Root cause: New pods fail because the image tag is wrong—often ImagePullBackOff—so the rollout cannot reach desired availability.

Fix: Point the deployment to a valid image, or undo the rollout.

kubectl set image deployment/web web=nginx:1.25
kubectl rollout status deployment/web

# Or revert the bad change
kubectl rollout undo deployment/web

8. Node disk pressure evictions

Intermediate

What you see: Pods disappear from a node with eviction messages about low ephemeral storage.

Investigation

kubectl describe node worker-3 | sed -n '/Conditions/,/Addresses/p'

Look for DiskPressure = True. On the node, check filesystem usage.

df -h
sudo du -sh /var/lib/containerd/* 2>/dev/null | sort -h | tail

Root cause: Container logs, layers, and images consumed too much local ephemeral storage; the kubelet evicts pods to protect the node.

Fix: Free space (prune unused images), tune log rotation, and cap per-pod ephemeral storage where appropriate.

sudo crictl rmi --prune
# Configure kubelet/container runtime log rotation per your distro
# In pod specs, set resources.limits.ephemeral-storage
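
The ephemeral-storage cap mentioned above lives under the container's resources; a fragment with illustrative values:

```yaml
# Pod spec fragment capping local scratch usage (values are illustrative)
spec:
  containers:
    - name: app
      image: nginx:1.25
      resources:
        requests:
          ephemeral-storage: "500Mi"
        limits:
          ephemeral-storage: "1Gi"   # kubelet evicts the pod if usage exceeds this
```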

9. Scheduler not running

Intermediate

What you see: New pods remain Pending. Events include messages like no nodes available to schedule pods even though nodes are Ready.

Investigation

Verify control plane pods, especially the scheduler, in kube-system (or static manifests on the control plane node).

kubectl get pods -n kube-system | grep scheduler

If nothing is running, SSH to the control plane and list static pod manifests:

ls -la /etc/kubernetes/manifests/

Root cause: The scheduler static pod manifest was removed or corrupted, so no component is assigning pods to nodes.

Fix: Restore /etc/kubernetes/manifests/kube-scheduler.yaml from backup or official documentation for your Kubernetes version, then let kubelet recreate the pod.

sudo cp /root/backup/kube-scheduler.yaml /etc/kubernetes/manifests/kube-scheduler.yaml
sudo chmod 600 /etc/kubernetes/manifests/kube-scheduler.yaml

10. RBAC permission denied

Beginner

What you see: A developer cannot list pods in namespace dev. API error: User dev-user cannot list resource pods in namespace dev.

Investigation

kubectl auth can-i list pods -n dev --as=dev-user
kubectl get rolebindings,clusterrolebindings -n dev
kubectl describe rolebinding dev-readonly -n dev

Confirm which Role or ClusterRole is referenced and whether that Role exists.

kubectl get role -n dev

Root cause: A RoleBinding points to a Role name that does not exist in the namespace (or has no matching rules).

Fix: Create the missing Role with the right verbs/resources, or correct the RoleBinding’s roleRef.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: dev
  name: pod-reader
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
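
To make the Role take effect, the dev-readonly binding's roleRef must point at it; a sketch using the names from this scenario:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dev-readonly
  namespace: dev
subjects:
  - kind: User
    name: dev-user
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```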

Exam tip: For troubleshooting items, start from symptoms → kubectl describe / events → narrow to one component (kubelet, DNS, scheduler, storage, policy, RBAC). Write the minimal fix command or manifest edit the question asks for.