
Kubelet & Node Troubleshooting

Interactive reference for node health, kubelet failures, the container runtime, resource pressure, and join issues — with practical commands and diagnosis flows.

Overview: node architecture & API communication

A Kubernetes node runs user workloads. The control plane schedules Pods onto nodes; on each node, several daemons keep that node healthy and able to run containers.

kubelet Registers the node with the API server, watches Pod specs assigned to this node, talks to the container runtime via CRI, and reports node/Pod status and metrics.
Container runtime (containerd) Pulls images, creates sandboxes and containers, and manages low-level lifecycle. kubelet does not talk to Docker directly on modern clusters — it uses the CRI (often containerd or CRI-O).
kube-proxy Maintains network rules (iptables/IPVS or similar) so Services reach the right Pods on this node. Not the same as kubelet, but part of the “node agent” picture for networking.

How kubelet talks to the API server

kubelet authenticates to the API server over TLS; on kubeadm clusters its client credentials live in /etc/kubernetes/kubelet.conf. It watches for Pods bound to its node and reports health back through NodeStatus updates and Lease heartbeats. If that channel breaks (network trouble, expired certificates, or a wedged kubelet), heartbeats stop and the control plane eventually marks the node NotReady.

Node conditions (common)

Condition True means Typical checks
Ready Node is healthy enough to accept new Pods (kubelet is up, runtime OK, networking OK). kubectl describe node → Conditions; kubelet & runtime service status on the node.
MemoryPressure Node is low on memory; kubelet may evict Pods. free -m, cgroup/memory, kubelet eviction logs.
DiskPressure Disk space or inodes are tight (often image/container layers or logs). df -h, df -i, image cleanup with crictl.
PIDPressure Too many processes / PIDs; kubelet may throttle or evict. Process count, sysctl kernel.pid_max, workload churn.
NetworkUnavailable Network plugin has not configured the node network (varies by CNI). CNI pods, CNI logs, node routes and interfaces.

Quick checks (control plane)

kubectl get nodes
kubectl get nodes -o wide
kubectl describe node <node-name>
Start with get nodes for Ready/NotReady, then describe node for Conditions, capacity, taints, and recent events.
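
As a quick sketch (the node names and output below are invented for illustration), you can filter the get nodes output to surface only unhealthy nodes:

```shell
# Hypothetical output captured from `kubectl get nodes` (illustrative only)
nodes_output='NAME      STATUS     ROLES           AGE   VERSION
cp-1      Ready      control-plane   30d   v1.29.4
worker-1  NotReady   <none>          30d   v1.29.4
worker-2  Ready      <none>          30d   v1.29.4'

# Print only nodes whose STATUS column is not exactly "Ready"
echo "$nodes_output" | awk 'NR > 1 && $2 != "Ready" { print $1, $2 }'
```

In a live cluster, pipe `kubectl get nodes` straight into the awk filter.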

Node NotReady: causes & diagnosis

When a node shows NotReady, the scheduler should stop placing new Pods there (unless tolerations override). Existing Pods may keep running, but the control plane cannot rely on kubelet heartbeats.

Common causes

Typical culprits: kubelet stopped or crash-looping, the container runtime (containerd/CRI-O) down, expired kubelet client certificates, a blocked network path to the API server, and severe resource pressure (memory, disk, PIDs).

Diagnosis flow

  1. From the control plane: kubectl get nodes and kubectl describe node <name>. Read Conditions (Ready, MemoryPressure, …) and Events at the bottom.
  2. On the node (SSH or console): systemctl status kubelet and systemctl status containerd (or crio if using CRI-O).
  3. Recent kubelet logs:
    journalctl -u kubelet --since "10 min ago" --no-pager

Sample: NotReady in kubectl describe node

Conditions:
  Type                 Status  LastHeartbeatTime                 Reason                       Message
  ----                 ------  -----------------                 ------                       -------
  Ready                False   Mon, 05 Apr 2026 10:22:11 +0000   KubeletNotReady              container runtime is down...
  MemoryPressure       False   Mon, 05 Apr 2026 10:21:50 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory
  DiskPressure         False   Mon, 05 Apr 2026 10:21:50 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  ...
Events:
  Warning  NodeNotReady  kubelet  Node is not ready

Fix steps (match cause to command)

Cause  Remediation (examples)
kubelet failed / wedged
sudo systemctl restart kubelet
sudo systemctl status kubelet
containerd (or CRI) not running
sudo systemctl restart containerd
sudo systemctl status containerd
Certificates expired (kubeadm clusters)
sudo kubeadm certs check-expiration
sudo kubeadm certs renew all
# Then restart control plane components and kubelet per your install docs
Network to API server Verify /etc/kubernetes/kubelet.conf server URL, DNS, firewall, and routes; from node: curl -k https://<apiserver>:6443/version (with correct certs if needed).
After cert renewal, you may need to copy updated kubeconfig files to nodes or restart static pods — follow your distribution’s kubeadm documentation.

Kubelet failures: config, logs, bootstrap

Service will not start

sudo systemctl status kubelet -l
sudo journalctl -u kubelet -b --no-pager | tail -80

Common log error phrases

Message pattern  What to check
unable to load bootstrap kubeconfig TLS bootstrap: path to bootstrap kubeconfig, token, and API server reachability; file permissions.
failed to run kubelet Often invalid flags, bad config.yaml, or cgroup/runtime mismatch — read the next lines of the log for the underlying error.
node ... not found Node object missing in API (name mismatch, cluster reset, or RBAC/bootstrap timing). Confirm --hostname-override / cloud provider node name alignment.
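
A quick triage step is to grep recent kubelet logs for the phrases above. The log excerpt below is invented for illustration:

```shell
# Hypothetical kubelet journal excerpt (illustrative only)
log='Apr 05 10:22:01 worker-1 kubelet[812]: E0405 unable to load bootstrap kubeconfig: stat /etc/kubernetes/bootstrap-kubelet.conf: no such file
Apr 05 10:22:02 worker-1 kubelet[812]: I0405 Attempting to sync node with API server
Apr 05 10:22:03 worker-1 kubelet[812]: E0405 failed to run Kubelet: misconfiguration: kubelet cgroup driver: "systemd" differs from runtime cgroup driver: "cgroupfs"'

# Surface only the known error phrases from the table above
echo "$log" | grep -E 'unable to load bootstrap kubeconfig|failed to run [Kk]ubelet|node .* not found'
```

On a real node, replace the sample variable with `journalctl -u kubelet --since "10 min ago" --no-pager`.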

TLS bootstrap issues

During bootstrap, kubelet uses a bootstrap kubeconfig and token to request a signed client cert. Failures usually mean wrong token, wrong API address, clock skew, or CSR approval not happening.

# Follow logs during bootstrap
sudo journalctl -u kubelet -f
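
If bootstrap stalls waiting for a signed certificate, check for Pending CSRs on the control plane with kubectl get csr and approve them with kubectl certificate approve <name>. A sketch (the CSR names and output below are hypothetical):

```shell
# Hypothetical output from `kubectl get csr` (illustrative only)
csrs='NAME        AGE   SIGNERNAME                                    REQUESTOR                 CONDITION
csr-8x2kp   2m    kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:abcdef   Pending
csr-w9qtn   30m   kubernetes.io/kube-apiserver-client-kubelet   system:node:worker-1      Approved,Issued'

# List Pending CSRs; approve each with: kubectl certificate approve <name>
echo "$csrs" | awk '$NF == "Pending" { print $1 }'
```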

Flags & configuration troubleshooting

kubelet --version
sudo journalctl -u kubelet -f
Prefer fixing config.yaml and systemd drop-ins over ad-hoc CLI flags; kubeadm and most installers manage flags via config.
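
On kubeadm clusters, `systemctl cat kubelet` shows the systemd drop-ins that assemble the final command line. A sketch of extracting the active config path from a drop-in (the contents below are hypothetical but follow the usual kubeadm layout):

```shell
# Hypothetical drop-in contents, as shown by `systemctl cat kubelet` (illustrative)
dropin='[Service]
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"'

# Extract the path kubelet actually loads its configuration from
echo "$dropin" | grep -o -- '--config=[^" ]*' | cut -d= -f2
```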

Container runtime (containerd / CRI-O)

kubelet speaks CRI to the runtime. If the runtime socket is wrong or the daemon is down, Pods cannot start and the node may go NotReady.

containerd not running

sudo systemctl status containerd
sudo systemctl restart containerd
sudo crictl info

crictl info should show CRI version and a healthy response when the socket is correct.

Image pull at the runtime layer

sudo crictl pull docker.io/library/nginx:alpine
sudo crictl images
Image pull errors in kubectl describe pod often mirror what you will see when pulling the same reference with crictl pull on the node.

Socket path (containerd)

Default CRI socket is often /run/containerd/containerd.sock. kubelet must be configured to use the same socket your runtime exposes (see kubelet or cri plugin config).

ls -l /run/containerd/containerd.sock
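
crictl reads its endpoint from the command line or from /etc/crictl.yaml. Pinning it there keeps manual debugging aligned with what kubelet uses; a sketch for containerd (adjust the socket path to match your runtime):

```yaml
# /etc/crictl.yaml: point crictl at the same CRI socket kubelet uses
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
timeout: 10
```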

CRI-O (alternative commands)

sudo systemctl status crio
sudo crictl --runtime-endpoint unix:///var/run/crio/crio.sock version

Socket path may be /var/run/crio/crio.sock depending on version and OS packaging.

Debugging with crictl

sudo crictl ps -a
sudo crictl pods
sudo crictl logs <container-id>
sudo crictl inspect <container-id>

Use crictl inspect for JSON detail (mounts, labels, sandbox) when kubelet or the CNI reports sandbox errors.
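
For example, when a container keeps dying, the status block of the inspect output carries the exit state and reason. The JSON below is a trimmed, invented excerpt showing the kind of field you can extract:

```shell
# Hypothetical trimmed excerpt of `crictl inspect <container-id>` (illustrative)
inspect='{"status":{"id":"3f2a","state":"CONTAINER_EXITED","exitCode":137,"reason":"OOMKilled"}}'

# Pull out the exit state and reason for a dying container
echo "$inspect" | grep -o '"state":"[^"]*"\|"reason":"[^"]*"'
```

On a real node, pipe `sudo crictl inspect <container-id>` through the same filter (or use jq if available).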

Resource pressure: disk, memory, PID, eviction

kubelet monitors node resources and sets pressure conditions. When thresholds are crossed, it can evict Pods to protect the node.

DiskPressure

Disk space or inodes running out, usually image layers, container writable layers, or logs. Check df -h and df -i, then reclaim space (for example, prune unused images with crictl rmi --prune) and trim logs under /var/log/pods.

MemoryPressure

Available memory below the eviction threshold. Check free -m and look for eviction activity in journalctl -u kubelet; kubectl top pods -A (requires metrics-server) helps find heavy consumers.

PIDPressure

Process IDs nearly exhausted. Compare the process count (ps -e | wc -l) with sysctl kernel.pid_max and look for fork-heavy or churn-heavy workloads.

Eviction signals & soft/hard thresholds

Signal Soft threshold Hard threshold
memory.available Eviction after grace period if pressure persists; Pods terminated if node still starved. Immediate eviction pressure when available memory falls below hard limit.
nodefs.available / imagefs.available Soft: throttle + eventual eviction with grace period. Hard: stronger eviction / refusal to accept new Pods depending on state.
pid.available Soft: similar grace-period behavior for PID scarcity. Hard: more aggressive reaction to low PIDs.

Exact percentages and behavior depend on kubelet version and your KubeletConfiguration.

Configure eviction in kubelet config

Edit /var/lib/kubelet/config.yaml (or the file referenced by your kubelet service) under evictionHard, evictionSoft, evictionSoftGracePeriod, and evictionMinimumReclaim. Example pattern:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
evictionSoft:
  memory.available: "500Mi"
evictionSoftGracePeriod:
  memory.available: "1m30s"
Test threshold changes in non-production first. Overly aggressive values cause unnecessary evictions; overly loose values risk node instability.
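
Changes to config.yaml only take effect after a kubelet restart. To verify what the running kubelet actually loaded, you can query its /configz endpoint through the API server proxy. The JSON below is an invented excerpt of such a response, showing a scriptable sanity check:

```shell
# Restart kubelet after editing config.yaml, then fetch the live config:
#   sudo systemctl restart kubelet
#   kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz"
# Hypothetical excerpt of the /configz response (invented for illustration)
configz='{"kubeletconfig":{"evictionHard":{"memory.available":"200Mi","nodefs.available":"10%","imagefs.available":"15%"},"evictionSoft":{"memory.available":"500Mi"}}}'

# Confirm the hard eviction thresholds match what was configured
echo "$configz" | grep -o '"evictionHard":{[^}]*}'
```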

Node join failures (kubeadm join)

Worker nodes run kubeadm join to obtain cluster credentials and start kubelet against the control plane. Failures usually fall into token, TLS trust, network, or “already joined” categories.

Common failure modes

Typical failures: an expired or mistyped bootstrap token, a wrong --discovery-token-ca-cert-hash (TLS trust), the API server unreachable on port 6443 (firewall, DNS, routing), and leftover state from a previous join attempt (existing files under /etc/kubernetes or old kubelet certificates).

Diagnosis

# On control plane: list tokens (requires appropriate access)
sudo kubeadm token list

# From the joining node: test API reachability
nc -zv <control-plane-host> 6443
# or
curl -vk https://<control-plane-host>:6443/version

If TCP fails, fix networking before re-running join. If TLS fails, verify CA hash and server certificate.

Reset and rejoin

On a failed worker, clean local kubeadm state and join again with a fresh token:

sudo kubeadm reset -f
sudo systemctl restart kubelet

# On control plane: create a new token (example)
sudo kubeadm token create --print-join-command

Copy the printed kubeadm join ... line and run it on the worker.

Certificate issues during join

A mismatched --discovery-token-ca-cert-hash or significant clock skew between the node and the control plane makes certificate validation fail during discovery. Regenerate the whole join command on the control plane (kubeadm token create --print-join-command) rather than hand-assembling the hash, and verify time sync (chrony/ntpd) on both machines.

Pre-flight check failures (examples)

Symptom  Direction
Port 10250 / 10248 / swap / br_netfilter kubeadm preflight checks enforce host prerequisites: load required kernel modules (e.g. br_netfilter), disable swap (or explicitly allow it with failSwapOn: false in KubeletConfiguration), and open the required firewall ports per the docs.
Container runtime not running Start containerd or crio and confirm CRI socket before join.
Hostname / MAC conflicts Ensure unique node name and stable network identity for the cloud provider or kubeadm.
Keep a short runbook: kubeadm token list → nc -zv → journalctl -u kubelet → kubeadm reset + new join command.