Interactive guide: architecture, symptoms, checks, and recovery paths for API Server, Scheduler, Controller Manager, and etcd.
Every control plane component coordinates through the API Server. Cluster state lives in etcd; controllers and the scheduler read and write desired state via the API.
Interaction rule: the scheduler and controllers never talk to etcd directly in a standard cluster; they go through the API Server, which validates and persists every change.
The API Server serves on port 6443 (or the configured secure port).

| Symptom | Check | Diagnose | Fix direction |
|---|---|---|---|
| kubectl hangs / TLS errors | Certs, connectivity to VIP/LB | OpenSSL / apiserver logs | Rotate certs, fix SANs, fix LB |
| 500s / timeout on mutating API | etcd health, disk | etcdctl, apiserver logs | Restore quorum, free disk |
| Pod CrashLoop (static pod) | describe, node logs | OOMKilled, args, mounts | Increase memory, fix flags |
| “Forbidden” for system accounts | RBAC objects | auth can-i, audit | Repair RoleBindings |
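The "fix SANs" direction in the table above can be verified with openssl. A minimal offline sketch: it generates a throwaway certificate just to demonstrate SAN inspection (the file paths and SAN values are made up); against a live cluster you would fetch the serving cert from the VIP/LB on 6443 with `openssl s_client` instead.

```shell
# Offline sketch: inspect a certificate's SANs and validity window.
# The throwaway cert stands in for the apiserver serving cert; against
# a live endpoint you would obtain it with:
#   openssl s_client -connect <vip-or-lb>:6443 </dev/null | openssl x509
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/apiserver.key -out /tmp/apiserver.crt \
  -subj "/CN=kube-apiserver" \
  -addext "subjectAltName=DNS:kubernetes,DNS:kubernetes.default,IP:10.96.0.1" \
  2>/dev/null

# A kubectl TLS failure against the VIP usually means the VIP/LB name
# is missing from this SAN list:
openssl x509 -in /tmp/apiserver.crt -noout -ext subjectAltName
openssl x509 -in /tmp/apiserver.crt -noout -dates
```

If the load balancer's hostname or IP is absent from the SANs, regenerate the certificate with the missing name rather than disabling verification.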
```shell
kubectl get --raw /healthz
kubectl get --raw /livez
kubectl get --raw /readyz
```
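While the API Server restarts, these endpoints flap, so a single failed probe is not proof of an outage. A small retry wrapper is a sketch of how to poll before concluding anything; `wait_healthy` and its output strings are illustrative, and in practice the probe command would be `kubectl get --raw /readyz`.

```shell
# Sketch: retry a health probe before declaring the component down.
# The probe command is passed as arguments; in practice:
#   wait_healthy 10 kubectl get --raw /readyz
wait_healthy() {
  local tries=$1; shift
  local i
  for i in $(seq 1 "$tries"); do
    if "$@" >/dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep 1
  done
  echo "still unhealthy after $tries attempt(s)"
  return 1
}

wait_healthy 3 true           # stand-in probe that always succeeds
wait_healthy 2 false || true  # stand-in probe that always fails
```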
```shell
# systemd control plane (many kubeadm clusters)
sudo journalctl -u kube-apiserver -f

# Pod-based control plane
kubectl logs -n kube-system kube-apiserver-<node-name> -f
```
```shell
crictl ps | grep kube-apiserver
ss -tlnp | grep 6443
```
`kubectl describe pod -n kube-system kube-apiserver-control-plane` often shows the exit reason: OOM, mount failures, or bad flags.
```
Name:         kube-apiserver-control-plane
Namespace:    kube-system
...
State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
...
Events:
  Warning  BackOff  12s (x15)  kubelet  Back-off restarting failed container kube-apiserver
```
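Exit code 137 above is 128 + 9: the container was killed with SIGKILL, which for a memory-limited Pod usually means OOMKilled. A small triage helper encodes the common codes (the function name and messages are illustrative, not a kubectl feature):

```shell
# Sketch: map common container exit codes to likely causes.
# Codes of 128 + N mean the process died from signal N.
explain_exit() {
  case "$1" in
    0)   echo "0: clean exit" ;;
    1)   echo "1: application error; check logs and flags" ;;
    137) echo "137: SIGKILL (128+9); usually OOMKilled, check memory limits" ;;
    139) echo "139: SIGSEGV (128+11); crash in the binary" ;;
    143) echo "143: SIGTERM (128+15); normal shutdown request" ;;
    *)   echo "$1: unrecognized; check container logs" ;;
  esac
}

explain_exit 137
```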
Use events and Pod details—scheduling failures are usually explicit in Events.
```shell
kubectl get events -A --sort-by=.lastTimestamp | tail -40
kubectl describe pod <pod> -n <ns>
```
```shell
kubectl get pods -n kube-system | grep scheduler
kubectl logs -n kube-system kube-scheduler-<node-name> --tail=100
```
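Scheduling failures show up as `FailedScheduling` events, and the event message names the unsatisfied predicate. An offline sketch, where a canned line (pod name and message are made up) stands in for live `kubectl get events -A` output so the filter can be demonstrated without a cluster:

```shell
# Offline sketch: isolate scheduler failures from event output.
# The canned line mimics one row of `kubectl get events -A`.
events='default   2m   Warning   FailedScheduling   pod/web-0   0/3 nodes are available: 3 Insufficient cpu.'

# The message after the pod name states exactly which predicate failed.
echo "$events" | grep FailedScheduling
```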
If affinity rules cannot be satisfied, relax `requiredDuringScheduling` constraints or fix node labels. Correlate the Deployment → ReplicaSet → Pod graph with events; controller-manager logs show sync errors, quota denials, or API errors.
```shell
kubectl describe deployment <name> -n <ns>
kubectl get rs -n <ns> -l app=<label>
kubectl logs -n kube-system kube-controller-manager-<node-name> --tail=200
```
Check the Pods behind a workload with `kubectl get pods -l app=...`, and read controller-manager logs for sync failures. Note that `maxUnavailable` may throttle creation during a surge.

For etcd, set endpoints and TLS flags to match your cluster. Example with TLS:
```shell
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS="https://127.0.0.1:2379"
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/apiserver-etcd-client.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/apiserver-etcd-client.key

etcdctl endpoint health
etcdctl endpoint status -w table
etcdctl alarm list
```
To recover, run `etcdctl snapshot restore` into clean data directories for each member. Confirm `endpoint health`, then start the API Server and validate with `kubectl get nodes`. See etcd backup & restore for the dedicated walkthrough.
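The "restore quorum" fix direction rests on etcd's majority rule: an N-member cluster needs floor(N/2) + 1 healthy members, so it tolerates floor((N-1)/2) failures. A quick sketch of the arithmetic (the helper function is illustrative):

```shell
# Quorum arithmetic for an N-member etcd cluster:
# quorum = floor(N/2) + 1, tolerated failures = N - quorum.
quorum() { echo $(( $1 / 2 + 1 )); }

for n in 1 3 5; do
  q=$(quorum "$n")
  echo "$n member(s): quorum $q, tolerates $((n - q)) failure(s)"
done
```

This is why control planes use odd member counts: going from 3 to 4 members raises quorum without tolerating any additional failures.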
| Component | Commands |
|---|---|
| API Server | kubectl get --raw /healthz \| /livez \| /readyz; journalctl -u kube-apiserver; kubectl logs -n kube-system kube-apiserver-<node>; ss -tlnp \| grep 6443 |
| Scheduler | kubectl get events; kubectl describe pod; kubectl logs -n kube-system kube-scheduler-<node> |
| Controller Manager | kubectl describe deploy/rs; kubectl logs -n kube-system kube-controller-manager-<node> |
| etcd | etcdctl endpoint health; etcdctl endpoint status; etcdctl alarm list; snapshot restore workflow |
| General | kubectl get pods -n kube-system; crictl ps; node and disk checks on control plane |
Start with `kubectl get events` and `kubectl describe`, then narrow to the failing component. Distinguish `/livez` (the process is running) from `/readyz` (it is ready to accept traffic); the difference matters during startup and under load.