Interactive cheat sheet: copy-ready kubectl, node, and etcd commands with quick “when to use” guidance.
Use these first when the control plane feels slow, workloads fail cluster-wide, or you need a baseline snapshot before deeper debugging.
kubectl cluster-info
Shows the Kubernetes control plane URL and (if installed) the cluster DNS service address.
When to use: Quick sanity check that your kubeconfig points at a live API and core add-ons advertise endpoints.
kubectl get nodes -o wide
Lists nodes with internal/external IPs, OS image, kernel, container runtime, and readiness.
When to use: Spot NotReady nodes, wrong zones, or capacity skew before blaming applications.
kubectl get componentstatuses
Legacy view of scheduler, controller-manager, and etcd health (where the API still exposes ComponentStatus).
When to use: Familiar smoke test on older clusters; treat as supplemental because the API type is deprecated.
kubectl get --raw /healthz
Hits the API server’s legacy health endpoint; returns ok when minimally healthy.
When to use: Fast binary check from any machine with kubeconfig access—good for monitors and curl-style probes.
kubectl get --raw /livez
Kubernetes live probe endpoint: verifies the process is alive (separate from full readiness).
When to use: Distinguish “API process up” from “ready to serve traffic” during upgrades or crash loops.
kubectl get --raw /readyz
Readiness-style check including registered informers and critical post-start hooks.
When to use: When kubectl errors or webhooks time out—confirms the apiserver finished warming up.
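The health endpoints above also accept query parameters; a quick sketch of the verbose forms:

```shell
# List each individual check (name plus pass/fail) instead of a bare "ok".
kubectl get --raw '/livez?verbose'
kubectl get --raw '/readyz?verbose'
# Temporarily exclude a known-failing check while you investigate, e.g. etcd:
kubectl get --raw '/readyz?exclude=etcd'
```

The verbose output is the fastest way to see which specific subsystem (etcd, informers, post-start hooks) is holding readiness back.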
kubectl top nodes
Shows CPU and memory usage per node (requires Metrics Server or compatible metrics API).
When to use: Suspected node pressure, noisy neighbors, or scheduling failures tied to resource starvation.
kubectl get events -A --sort-by='.lastTimestamp'
Cluster-wide events ordered by most recent activity across all namespaces.
When to use: After unexplained restarts or FailedScheduling storms—surfaces control-plane and workload signals in one stream.
kubectl cluster-info dump --output-directory=/tmp/dump
Exports logs, descriptions, and configs for nodes and control-plane pods into a directory tree.
When to use: Opening vendor support cases or offline analysis; can be large and sensitive—scrub before sharing.
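The commands in this section can be strung into a small baseline script; a sketch, with arbitrary output paths and filenames:

```shell
#!/usr/bin/env sh
# Hypothetical baseline snapshot before deeper debugging (adjust paths to taste).
OUT=/tmp/baseline-$(date +%Y%m%d-%H%M%S)
mkdir -p "$OUT"
kubectl cluster-info                              > "$OUT/cluster-info.txt"
kubectl get nodes -o wide                         > "$OUT/nodes.txt"
kubectl get --raw '/readyz?verbose'               > "$OUT/readyz.txt" 2>&1
kubectl top nodes                                 > "$OUT/top-nodes.txt" 2>&1
kubectl get events -A --sort-by='.lastTimestamp'  > "$OUT/events.txt"
echo "Baseline written to $OUT"
```

Capturing the same files on every incident makes before/after comparison trivial.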
componentstatuses is deprecated; prefer /livez, /readyz, and cloud-provider health signals on supported distros.
Drill into workload pods: observe state, stream logs, exec shells, copy files, and attach ephemeral debug containers.
kubectl get pods -o wide
Pods in the current namespace with node assignment, IPs, and readiness.
When to use: First step for CrashLoopBackOff, ImagePullBackOff, or pods stuck Pending.
kubectl describe pod <name>
Events, conditions, volumes, tolerations, and last status transitions for one pod.
When to use: Decode scheduling failures, probe failures, mount errors, and quota denials from Kubernetes events.
kubectl logs <pod>
kubectl logs <pod> -c <container>
kubectl logs <pod> --previous
Current container logs; a specific container in a multi-container pod; or logs from the last crashed instance.
When to use: Application errors, startup panics, or post-OOM investigation via --previous.
kubectl logs <pod> -f
Streams stdout/stderr from the default or selected container like tail -f.
When to use: Reproducing intermittent bugs or watching rollout behavior in real time.
kubectl logs -l app=myapp --all-containers
Aggregates logs from every pod matching a label selector, including sidecars.
When to use: Microservices with many replicas where you need a single combined trace (may be noisy).
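To tame the noise from aggregated selector logs, the --prefix and --tail flags help; app=myapp is an assumed label:

```shell
# Label every line with its source pod/container and cap output per container.
kubectl logs -l app=myapp --all-containers --prefix --tail=50
```

--prefix makes interleaved replica output attributable without a separate log pipeline.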
kubectl exec -it <pod> -- /bin/sh
Interactive shell inside a running container (use /bin/bash if available).
When to use: Inspect files, DNS resolution, or local ports when logs are insufficient—requires shell in the image.
kubectl exec <pod> -- env
Prints environment variables visible to the workload (ConfigMaps, Secrets, downward API).
When to use: Verify injected configuration without redeploying or editing manifests.
kubectl port-forward pod/<name> 8080:80
Forwards a local TCP port to a pod port through the API server tunnel.
When to use: Hit admin endpoints, metrics, or debug ports that are not exposed via Services.
kubectl cp <pod>:/path /local/path
Copies files or directories between a container filesystem and your workstation.
When to use: Grab heap dumps, config snapshots, or core files without SSH to the node.
kubectl debug <pod> --image=busybox --target=<container>
Attaches an ephemeral debug container sharing process namespaces with an existing container (cluster/feature dependent).
When to use: Minimal distroless images where kubectl exec has no shell—inspect processes and filesystem from a toolkit image.
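A fuller sketch of the debug workflow, including the node-level variant:

```shell
# Ephemeral container sharing the target container's process namespace
# (requires EphemeralContainers support on the cluster):
kubectl debug <pod> -it --image=busybox --target=<container> -- sh
# Node-level variant: runs a pod on the node with the host filesystem mounted at /host.
kubectl debug node/<node> -it --image=busybox
```

The node variant is useful when you need host-level inspection but SSH to the node is unavailable.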
Move from Kubernetes objects to the node: kubelet health, drains, and container runtime introspection.
kubectl describe node <name>
Capacity, allocatable resources, taints, conditions (MemoryPressure, DiskPressure, PIDPressure), and recent events.
When to use: Pods evicted or stuck Pending on specific nodes; correlate with cloud instance status.
kubectl top node <name>
CPU and memory utilization for a single node (Metrics API).
When to use: Confirm hot nodes before cordon/drain or autoscaling changes.
kubectl cordon <node>
kubectl uncordon <node>
Marks the node unschedulable (cordon) or restores scheduling (uncordon).
When to use: Maintenance windows—cordon stops new pods; uncordon after fixes without deleting workloads.
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
Evicts workloads (respecting PDBs where possible) so the node can be serviced safely.
When to use: OS patches, hardware swaps, or kubelet upgrades—always review PDBs and local EmptyDir data first.
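Cordon, drain, and uncordon combine into a typical maintenance flow; a sketch, with an assumed 5-minute eviction timeout:

```shell
# Review PodDisruptionBudgets first, then:
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=5m
# ...patch, reboot, or service the node...
kubectl uncordon <node>
```

Cordoning first guarantees no new pods land on the node while the drain evicts the existing ones.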
systemctl status kubelet
journalctl -u kubelet --since "30 min ago"
Shows kubelet service state and recent logs from systemd’s journal.
When to use: Node NotReady, sandbox creation errors, or CNI plug-in failures reported by kubelet.
crictl ps
crictl logs <container-id>
crictl inspect <container-id>
Lists containers managed by the CRI runtime, tails their logs, and dumps low-level JSON state.
When to use: When kubelet reports sandbox or image errors but kubectl logs are empty or pods are stuck creating.
crictl images
crictl rmi --prune
Lists images present on the node; prunes unused images to reclaim disk.
When to use: ImagePullBackOff due to disk pressure, or manual cleanup before a large rollout.
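Before pruning, it helps to confirm disk pressure is real; a sketch, assuming containerd-style paths (they vary by distro and runtime):

```shell
# Check filesystem usage under the runtime and kubelet directories (run on the node):
df -h /var/lib/containerd /var/lib/kubelet
# Then reclaim space from unreferenced images:
crictl rmi --prune
```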
Trace Services to backends, validate DNS, test pod-to-service connectivity, and inspect policies and kube-dns.
kubectl get svc -o wide
ClusterIP, NodePort, LoadBalancer addresses, ports, and selectors for Services.
When to use: Clients cannot reach a service—confirm types, ports, and external IPs.
kubectl get endpoints <svc>
Shows IP:port tuples registered for a Service name (Endpoints object).
When to use: Traffic drops after deploy—empty endpoints mean selectors or readiness probes are wrong.
kubectl get endpointslices
Scalable view of endpoints grouped by Service with topology hints.
When to use: Large Services or dual-stack clusters where classic Endpoints are truncated or hard to read.
kubectl exec -it <pod> -- nslookup kubernetes.default
Resolves the in-cluster API Service DNS name from inside a workload.
When to use: Intermittent DNS failures—validates CoreDNS (or kube-dns) reachability from app namespaces.
kubectl exec -it <pod> -- wget -qO- <svc>:<port>
# or, if curl is present:
kubectl exec -it <pod> -- curl -sS http://<svc>:<port>/
HTTP/TCP checks from a pod to a Service DNS name or IP on a given port.
When to use: Prove east-west connectivity independent of ingress or external load balancers.
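When no existing pod has wget or curl, a throwaway client pod works; net-test is an arbitrary name:

```shell
# One-shot busybox pod, deleted on exit; -T 2 is busybox wget's connect timeout.
kubectl run net-test --rm -it --image=busybox --restart=Never -- \
  wget -qO- -T 2 http://<svc>.<namespace>.svc.cluster.local:<port>/
```

Using the full service DNS name also exercises CoreDNS in the same step.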
kubectl get networkpolicy -A
Lists NetworkPolicies and their namespaces—shows whether default-deny or scoped rules exist.
When to use: Sudden timeouts after enabling policies—compare allowed CIDRs, ports, and pod selectors.
kubectl get ingress -A
Ingress resources with hosts, classes, and backend Services.
When to use: North-south routing issues—pairs with controller logs and cloud LB health checks.
iptables -t nat -L KUBE-SERVICES
Lists kube-proxy–managed NAT rules for ClusterIP translation (iptables mode).
When to use: Rare low-level debugging when IPVS/iptables modes mismatch or rules are stale—requires root.
kubectl logs -n kube-system -l k8s-app=kube-dns
CoreDNS (legacy label kube-dns) pod logs for the DNS deployment.
When to use: NXDOMAIN loops, upstream forwarder errors, or high DNS latency cluster-wide.
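Two common follow-ups when CoreDNS logs look quiet; the Corefile edit is a temporary change you should revert:

```shell
# Tail all CoreDNS replicas with per-pod prefixes:
kubectl logs -n kube-system -l k8s-app=kube-dns --prefix -f
# Enable per-query logging by adding the `log` plugin to the Corefile
# (high volume on busy clusters; remove it when done):
kubectl -n kube-system edit configmap coredns
```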
Follow volumes from PersistentVolumeClaims through binding to mounts inside running pods.
kubectl get pv
kubectl get pvc
Cluster-scoped PersistentVolumes and namespace-scoped claims with status and capacity.
When to use: Pods stuck ContainerCreating—check whether claims are Bound or Pending.
kubectl describe pv <name>
kubectl describe pvc <name>
Provisioner details, access modes, volume handle, events, and finalizers.
When to use: CSI provisioning failures, snapshot restore issues, or wrong storage class selection.
kubectl get storageclass
Available classes, default annotation, reclaim policy, and volume binding mode.
When to use: PVCs stay Pending—ensure a default exists or the claim requests a valid class.
kubectl get volumeattachments
Maps volumes to nodes for attach/detach operations performed by CSI/external attacher.
When to use: Multi-attach errors or volumes stuck attached after node failure.
kubectl exec -it <pod> -- df -h
kubectl exec -it <pod> -- ls /mount/path
Shows mounted filesystems and lists expected paths inside the container.
When to use: Verify subPath mounts, read-only flags, or disk usage inside the workload.
kubectl get events -A --field-selector involvedObject.kind=PersistentVolumeClaim
Events tied to PVC objects—provisioning, binding, and resize messages.
When to use: Faster than scanning all events when only storage binding is suspect.
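A quick filter for claims that never bound; the awk column position assumes default kubectl table output (NAMESPACE NAME STATUS ...):

```shell
# List only PVCs whose STATUS is not Bound, across all namespaces.
kubectl get pvc -A --no-headers | awk '$3 != "Bound"'
```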
Check the volume sections of kubectl describe pod to see which claim backs each mount and whether mount propagation is requested.
Answer “can this identity perform this action?” and inspect roles, bindings, and legacy policy objects.
kubectl auth can-i <verb> <resource>
kubectl auth can-i <verb> <resource> --as=<user>
Boolean check for the current context or an impersonated user/service account.
When to use: CI/CD failures with “forbidden”, or debugging least-privilege ServiceAccounts.
kubectl auth can-i --list
Enumerates allowed verbs/resources for the active context (subject to aggregation rules).
When to use: Auditing overly broad ClusterRoles or comparing dev vs prod kubeconfig permissions.
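Impersonation extends these checks to ServiceAccounts and subresources; a sketch using the canonical system:serviceaccount:<ns>:<name> form:

```shell
# Can this ServiceAccount create Deployments in its namespace?
kubectl auth can-i create deployments -n <ns> --as=system:serviceaccount:<ns>:<sa>
# Can it exec into pods? (exec is modeled as create on the pods/exec subresource)
kubectl auth can-i create pods --subresource=exec -n <ns> --as=system:serviceaccount:<ns>:<sa>
```

This reproduces exactly what the apiserver's authorizer will decide, without deploying anything.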
kubectl get clusterroles
kubectl get clusterrolebindings
Global roles and who (users, groups, SAs) is bound to them cluster-wide.
When to use: Platform-wide incidents—find wildcard rules or unexpected admin bindings.
kubectl get roles -n <ns>
kubectl get rolebindings -n <ns>
Namespace-scoped RBAC objects listing local permissions.
When to use: Team namespaces where only certain workloads fail authorization checks.
kubectl describe clusterrolebinding <name>
Subjects attached to a ClusterRoleBinding and the referenced role name.
When to use: Tracing which Group or ServiceAccount gained cluster-admin via a binding.
kubectl get psp
kubectl get podsecuritypolicy
Legacy PodSecurityPolicy objects (removed in Kubernetes 1.25+; only on older clusters).
When to use: Historical clusters still enforcing PSP—plan migration toward Pod Security Admission/Standards (PSA/PSS) or a policy engine.
kubectl label nodes <node> --list
Prints labels on a node (often used with RBAC or admission that matches node metadata).
When to use: Debugging NodeRestriction, custom schedulers, or storage topology labels.
Use kubectl auth can-i with --subresource for exec/log/portforward checks when debugging admission webhooks.
Operate etcd from control-plane hosts with etcdctl; always prefer snapshots before invasive changes.
export ETCDCTL_API=3
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  <subcommand>
Typical kubeadm-style paths—replace with your distro’s real certificate locations.
When to use: Any etcdctl call on a secured cluster; without certs the client will fail TLS handshake.
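To avoid retyping the TLS flags on every call, etcdctl also reads them from environment variables; same kubeadm-style paths, adjust for your distro:

```shell
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key
# Every subsequent etcdctl call in this shell picks up the TLS config:
etcdctl endpoint health
```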
etcdctl endpoint health --cluster
Reports health for each member—latency and error strings if a peer is down.
When to use: API timeouts or leader election storms—validates etcd quorum before touching apiserver flags.
etcdctl endpoint status --cluster -w table
Shows member IDs, versions, DB size, raft index, and leader role in tabular form.
When to use: Capacity planning, defragmentation decisions, or version skew across members.
etcdctl alarm list
Lists active alarms such as NOSPACE or CORRUPT.
When to use: Writes fail with alarm errors—often disk full or data integrity issues.
etcdctl get / --prefix --keys-only | head
Samples key names under the root prefix (read-only reconnaissance).
When to use: Confirm connectivity and keyspace layout—avoid broad reads on huge clusters in production peaks.
etcdctl snapshot save /tmp/etcd-backup.db
Creates a consistent point-in-time snapshot file of etcd data.
When to use: Before upgrades, restores, or manual key edits—store off-node with encryption.
etcdctl snapshot status /tmp/etcd-backup.db -w table
Validates snapshot integrity, revision, and total keys before trusting a file.
When to use: After backup jobs or copying files across sites—detect corruption early.
etcdctl snapshot restore /tmp/etcd-backup.db --data-dir=/var/lib/etcd-restored
Rebuilds etcd data directory from a snapshot for disaster recovery workflows.
When to use: Controlled restore procedures—requires stopping etcd members and coordinated cluster rebuild per runbook.
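The save/status pair combines into a simple backup routine; a sketch assuming the TLS environment from earlier in this section and an arbitrary /var/backups path:

```shell
#!/usr/bin/env sh
# Hypothetical backup routine: snapshot, verify integrity, then move off-node.
TS=$(date +%Y%m%d-%H%M%S)
FILE=/var/backups/etcd-$TS.db
etcdctl snapshot save "$FILE"
# Refuse to keep an unverifiable snapshot:
etcdctl snapshot status "$FILE" -w table || exit 1
# Encrypt and copy the file off-node per your runbook.
```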
Tune log volume, focus on warnings, correlate events to objects, and know on-disk log locations for node forensics.
kubectl logs <pod> --tail=100
Returns only the last N log lines to avoid dumping huge files.
When to use: Large long-running pods where full logs overwhelm terminals or CI artifacts.
kubectl logs <pod> --since=1h
Restricts output to logs produced within the last hour (supports other durations).
When to use: Incidents with known start times—skip cold-start noise from days ago.
kubectl logs -n kube-system <component-pod>
Fetch logs from control-plane or add-on pods running in kube-system.
When to use: Scheduler, controller-manager, or CNI pod errors surfaced as cluster events.
journalctl -u kubelet
journalctl -u containerd
Systemd journals for kubelet and the container runtime service.
When to use: Sandbox or image pull errors beneath Kubernetes—complements crictl output.
kubectl get events --sort-by='.metadata.creationTimestamp'
Chronological event stream for the current namespace (add -A for all).
When to use: Reconstruct timelines of scaling, probe failures, or image pulls.
kubectl get events --field-selector type=Warning
Filters to Warning events only—surfaces abnormal transitions faster.
When to use: Noisy namespaces; add --watch to follow new warnings live during rollouts.
kubectl get events --field-selector involvedObject.kind=Pod,involvedObject.name=<pod>
Events scoped to a single pod name.
When to use: Deep dive on one failing pod without scrolling unrelated objects.
/var/log/pods/<namespace>_<pod-name>_<uid>/<container>/<n>.log
/var/log/containers/<pod-name>_<namespace>_<container>-<container-id>.log
Symlinked JSON log files written by the container runtime for kubelet/CRI.
When to use: kubectl log streaming breaks or apiserver is down—tail files directly with elevated access.
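When the apiserver is unreachable, the on-disk files can be tailed directly on the node; a sketch using a glob for the container-id suffix:

```shell
# Run on the node; requires root because the runtime owns these files.
sudo tail -f /var/log/containers/<pod-name>_<namespace>_<container>-*.log
```

The entries are JSON lines written by the CRI runtime, so a JSON-aware pager helps for heavy logs.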