Use these cards like short exam drills: read the problem, try to list what you would check,
then reveal investigation steps and the solution. In the real CKA, you work
on a live cluster; here the focus is the diagnostic story—symptoms, commands, root cause, and fix.
1. Broken kubelet
Beginner
What you see: A worker node shows NotReady. The scheduler avoids it and pods are not placed on this node.
Investigation
Confirm node state and kubelet health from the control plane and on the node.
kubectl get nodes
kubectl describe node worker-1
SSH to the worker and inspect the kubelet service:
sudo systemctl status kubelet
sudo journalctl -u kubelet --no-pager -n 100
Root cause: The static pod manifest has a YAML syntax error, for example a missing colon after a key, so kubelet refuses to create the pod.
Fix: Correct the manifest on disk. Kubelet watches the directory and will recreate the static pod automatically.
sudo nano /etc/kubernetes/manifests/monitoring-agent.yaml
# Fix the YAML (e.g. ensure key: value pairs are valid)
sudo systemctl restart kubelet # only if needed; usually not
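For reference, a minimal valid static pod manifest looks like the following; the container name, image, and command are illustrative, not taken from the exercise:

```yaml
# Hypothetical static pod manifest for /etc/kubernetes/manifests/monitoring-agent.yaml.
# A line such as "image busybox:1.36" (colon missing after the key) makes the
# whole file unparsable and kubelet will skip it.
apiVersion: v1
kind: Pod
metadata:
  name: monitoring-agent
spec:
  containers:
  - name: agent
    image: busybox:1.36
    command: ["sh", "-c", "sleep infinity"]
```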
3. API server certificate expired
Intermediate
What you see: kubectl fails with: Unable to connect to the server: x509: certificate has expired or is not yet valid.
Investigation
Inspect certificate dates on the control plane node and use kubeadm’s built-in check.
sudo openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates
sudo kubeadm certs check-expiration
Compare notAfter with the current time. If the API server certificate is past its expiry, clients verifying it will fail the TLS handshake.
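The expiry check itself can be rehearsed without a cluster; this sketch mints a throwaway self-signed certificate and reads its validity window with the same openssl invocation you would run against the real apiserver.crt (file names here are arbitrary):

```shell
# Local rehearsal: generate a short-lived self-signed cert, then print its
# notBefore/notAfter dates exactly as you would for apiserver.crt.
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/demo.key -out /tmp/demo.crt \
  -days 1 -subj "/CN=demo-apiserver"
openssl x509 -in /tmp/demo.crt -noout -dates
```

If notAfter is in the past, any client that verifies this certificate will reject the TLS handshake.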
Root cause: The API server serving certificate has expired.
Fix: Renew the apiserver certificate with kubeadm, then restart the API server (static pod or systemd unit depending on install).
kubeadm certs renew apiserver
# If kube-apiserver runs as a static pod, moving the manifest aside and back restarts it;
# or delete the mirror pod so kubelet recreates it.
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
sleep 3
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
If the client certificate embedded in admin.conf has also expired, renew it as well (kubeadm certs renew admin.conf); kubeadm renews the server certificates in place, so no manifest edits are needed.
4. DNS not resolving
Beginner
What you see: Pods reach external IPs, but nslookup kubernetes.default (or similar) fails inside pods.
Investigation
Check CoreDNS (or kube-dns) pods and logs in kube-system.
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl get pods -n kube-system -l app.kubernetes.io/name=coredns
kubectl logs -n kube-system deploy/coredns --tail=80
If a specific pod is crashing, describe it and read events.
kubectl describe pod -n kube-system -l k8s-app=kube-dns
Root cause: CoreDNS pods are in CrashLoopBackOff because the loop plugin detected a forwarding loop (typically forward . /etc/resolv.conf on a node whose resolv.conf points back at a local resolver), and CoreDNS exits by design when it finds one.
Fix: Edit the CoreDNS ConfigMap and point the forward upstreams at a real resolver (or, in a lab, remove the loop line), then restart the CoreDNS pods.
kubectl edit configmap coredns -n kube-system
# In the Corefile block, remove the `loop` plugin line or fix upstreams
kubectl delete pods -n kube-system -l k8s-app=kube-dns
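For orientation, a default-style Corefile (the Corefile key inside the coredns ConfigMap) looks roughly like this; the loop line is the plugin that halts CoreDNS when forwarding loops back to itself:

```
.:53 {
    errors
    health
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf   # must reach a real upstream, not CoreDNS itself
    cache 30
    loop                         # detects forwarding loops and exits the process
    reload
    loadbalance
}
```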
5. NetworkPolicy blocking traffic
Intermediate
What you see: After a new NetworkPolicy was applied, a frontend pod can no longer reach the backend Service—even though endpoints and selectors looked fine before.
Investigation
List policies in the namespace, read their rules, and test connectivity from a pod.
kubectl get networkpolicy -n <namespace>
kubectl describe networkpolicy <policy-name> -n <namespace>
kubectl exec -n <namespace> <frontend-pod> -- wget -qO- --timeout=2 http://<backend-service>:<port>
Remember: once any ingress NetworkPolicy selects a pod, all ingress traffic to that pod is denied unless some policy explicitly allows it (on CNI plugins that enforce NetworkPolicy; a CNI without policy support silently ignores these objects).
📂 Repo reference: See k8s/labs/security/deny-all-ingress.yaml and allow-ingress.yaml for ready-made deny/allow policy examples. Also see deny-from-other-namespaces.yaml for namespace isolation.
Root cause: A default-deny ingress NetworkPolicy was added, but no policy allows traffic from the frontend pods to the backend pods.
Fix: Add an allow rule with matching labels (and namespace if cross-namespace). Example based on k8s/labs/security/allow-ingress.yaml:
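A minimal allow policy could look like the following; the labels, port, and policy name are assumptions for illustration, not taken from the repo file:

```yaml
# Sketch only: adjust podSelector labels and port to match your workloads.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend          # the pods being protected
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend     # the pods allowed in
    ports:
    - protocol: TCP
      port: 8080
```

For cross-namespace traffic, add a namespaceSelector alongside (or combined with) the podSelector in the from block.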
Root cause: Container logs, layers, and images consumed too much local ephemeral storage; the kubelet evicts pods to protect the node.
Fix: Free space (prune unused images), tune log rotation, and cap per-pod ephemeral storage where appropriate.
sudo crictl rmi --prune
# Configure kubelet/container runtime log rotation per your distro
# In pod specs, set resources.limits.ephemeral-storage
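Capping scratch space in the pod spec looks like this; the pod name, image, and sizes are illustrative:

```yaml
# Sketch: bound a container's local scratch usage so one pod cannot fill the node.
apiVersion: v1
kind: Pod
metadata:
  name: log-writer
spec:
  containers:
  - name: app
    image: busybox:1.36
    command: ["sh", "-c", "sleep 3600"]
    resources:
      requests:
        ephemeral-storage: "500Mi"
      limits:
        ephemeral-storage: "1Gi"   # kubelet evicts the pod if usage exceeds this
```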
9. Scheduler not running
Intermediate
What you see: New pods remain Pending even though nodes are Ready. kubectl describe pod shows no scheduling events at all; with no scheduler running, nothing even records a failure reason.
Investigation
Verify control plane pods, especially the scheduler, in kube-system (or static manifests on the control plane node).
kubectl get pods -n kube-system | grep -E 'scheduler|kube-scheduler'
If nothing is running, SSH to the control plane and list static pod manifests:
ls -la /etc/kubernetes/manifests/
Root cause: The scheduler static pod manifest was removed or corrupted, so no component is assigning pods to nodes.
Fix: Restore /etc/kubernetes/manifests/kube-scheduler.yaml from backup or official documentation for your Kubernetes version, then let kubelet recreate the pod.
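If no backup exists, the manifest can be rebuilt from the reference docs for your version. A simplified sketch of a kubeadm-style kube-scheduler static pod follows; the image tag and flag set vary by release, so treat this as a starting point rather than a drop-in file:

```yaml
# Simplified kube-scheduler static pod manifest (kubeadm layout assumed).
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-scheduler
    image: registry.k8s.io/kube-scheduler:v1.30.0   # match your cluster version
    command:
    - kube-scheduler
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    volumeMounts:
    - name: kubeconfig
      mountPath: /etc/kubernetes/scheduler.conf
      readOnly: true
  volumes:
  - name: kubeconfig
    hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
```

Once the file is in place, kubelet recreates the pod within seconds; confirm with kubectl get pods -n kube-system and check that Pending pods start scheduling.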