
Kubelet & Node Troubleshooting

Interactive reference for node health, kubelet failures, the container runtime, resource pressure, and join issues — with practical commands and diagnosis flows.

Overview: node architecture & API communication

A Kubernetes node runs user workloads. The control plane schedules Pods onto nodes; on each node, several daemons keep that node healthy and able to run containers.

kubelet Registers the node with the API server, watches Pod specs assigned to this node, talks to the container runtime via CRI, and reports node/Pod status and metrics.
Container runtime (containerd) Pulls images, creates sandboxes and containers, and manages low-level lifecycle. kubelet does not talk to Docker directly on modern clusters — it uses the CRI (often containerd or CRI-O).
kube-proxy Maintains network rules (iptables/IPVS or similar) so Services reach the right Pods on this node. Not the same as kubelet, but part of the “node agent” picture for networking.

How kubelet talks to the API server

kubelet authenticates to the API server over TLS; on kubeadm clusters its client credentials live in /etc/kubernetes/kubelet.conf. It watches for Pods bound to its node and reports health back through NodeStatus updates and Lease heartbeats. If that channel breaks (network trouble, expired certificates, or a wedged kubelet), heartbeats stop and the control plane eventually marks the node NotReady.

Node conditions (common)

Condition True means Typical checks
Ready Node is healthy enough to accept new Pods (kubelet is up, runtime OK, networking OK). kubectl describe node → Conditions; kubelet & runtime service status on the node.
MemoryPressure Node is low on memory; kubelet may evict Pods. free -m, cgroup/memory, kubelet eviction logs.
DiskPressure Disk space or inodes are tight (often image/container layers or logs). df -h, df -i, image cleanup with crictl.
PIDPressure Too many processes / PIDs; kubelet may throttle or evict. Process count, sysctl kernel.pid_max, workload churn.
NetworkUnavailable Network plugin has not configured the node network (varies by CNI). CNI pods, CNI logs, node routes and interfaces.

Quick checks (control plane)

kubectl get nodes
kubectl get nodes -o wide
kubectl describe node <node-name>
Start with get nodes for Ready/NotReady, then describe node for Conditions, capacity, taints, and recent events.
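
As a quick sketch (the node names and output below are invented for illustration), you can filter the get nodes output to surface only unhealthy nodes:

```shell
# Hypothetical output captured from `kubectl get nodes` (illustrative only)
nodes_output='NAME      STATUS     ROLES           AGE   VERSION
cp-1      Ready      control-plane   30d   v1.29.4
worker-1  NotReady   <none>          30d   v1.29.4
worker-2  Ready      <none>          30d   v1.29.4'

# Print only nodes whose STATUS column is not exactly "Ready"
echo "$nodes_output" | awk 'NR > 1 && $2 != "Ready" { print $1, $2 }'
```

In a live cluster, pipe `kubectl get nodes` straight into the awk filter.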

Node NotReady: causes & diagnosis

When a node shows NotReady, the scheduler should stop placing new Pods there (unless tolerations override). Existing Pods may keep running, but the control plane cannot rely on kubelet heartbeats.

Common causes

Typical culprits: kubelet stopped or crash-looping, the container runtime (containerd/CRI-O) down, expired kubelet client certificates, a blocked network path to the API server, and severe resource pressure (memory, disk, PIDs).

Diagnosis flow

  1. From the control plane: kubectl get nodes and kubectl describe node <name>. Read Conditions (Ready, MemoryPressure, …) and Events at the bottom.
  2. On the node (SSH or console): systemctl status kubelet and systemctl status containerd (or crio if using CRI-O).
  3. Recent kubelet logs:
    journalctl -u kubelet --since "10 min ago" --no-pager

Sample: NotReady in kubectl describe node

Conditions:
  Type                 Status  LastHeartbeatTime                 Reason                       Message
  ----                 ------  -----------------                 ------                       -------
  Ready                False   Mon, 05 Apr 2026 10:22:11 +0000   KubeletNotReady              container runtime is down...
  MemoryPressure       False   Mon, 05 Apr 2026 10:21:50 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory
  DiskPressure         False   Mon, 05 Apr 2026 10:21:50 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  ...
Events:
  Warning  NodeNotReady  kubelet  Node is not ready

Fix steps (match cause to command)

Cause  Remediation (examples)
kubelet failed / wedged
sudo systemctl restart kubelet
sudo systemctl status kubelet
containerd (or CRI) not running
sudo systemctl restart containerd
sudo systemctl status containerd
Certificates expired (kubeadm clusters)
sudo kubeadm certs check-expiration
sudo kubeadm certs renew all
# Then restart control plane components and kubelet per your install docs
Network to API server Verify /etc/kubernetes/kubelet.conf server URL, DNS, firewall, and routes; from node: curl -k https://<apiserver>:6443/version (with correct certs if needed).
After cert renewal, you may need to copy updated kubeconfig files to nodes or restart static pods — follow your distribution’s kubeadm documentation.

Kubelet failures: config, logs, bootstrap

Service will not start

sudo systemctl status kubelet -l
sudo journalctl -u kubelet -b --no-pager | tail -80

Common log error phrases

Message pattern  What to check
unable to load bootstrap kubeconfig TLS bootstrap: path to bootstrap kubeconfig, token, and API server reachability; file permissions.
failed to run kubelet Often invalid flags, bad config.yaml, or cgroup/runtime mismatch — read the next lines of the log for the underlying error.
node ... not found Node object missing in API (name mismatch, cluster reset, or RBAC/bootstrap timing). Confirm --hostname-override / cloud provider node name alignment.
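
A quick triage step is to grep recent kubelet logs for the phrases above. The log excerpt below is invented for illustration:

```shell
# Hypothetical kubelet journal excerpt (illustrative only)
log='Apr 05 10:22:01 worker-1 kubelet[812]: E0405 unable to load bootstrap kubeconfig: stat /etc/kubernetes/bootstrap-kubelet.conf: no such file
Apr 05 10:22:02 worker-1 kubelet[812]: I0405 Attempting to sync node with API server
Apr 05 10:22:03 worker-1 kubelet[812]: E0405 failed to run Kubelet: misconfiguration: kubelet cgroup driver: "systemd" differs from runtime cgroup driver: "cgroupfs"'

# Surface only the known error phrases from the table above
echo "$log" | grep -E 'unable to load bootstrap kubeconfig|failed to run [Kk]ubelet|node .* not found'
```

On a real node, replace the sample variable with `journalctl -u kubelet --since "10 min ago" --no-pager`.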

TLS bootstrap issues

During bootstrap, kubelet uses a bootstrap kubeconfig and token to request a signed client cert. Failures usually mean wrong token, wrong API address, clock skew, or CSR approval not happening.

# Follow logs during bootstrap
sudo journalctl -u kubelet -f
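
If bootstrap stalls waiting for a signed certificate, check for Pending CSRs on the control plane with kubectl get csr and approve them with kubectl certificate approve <name>. A sketch (the CSR names and output below are hypothetical):

```shell
# Hypothetical output from `kubectl get csr` (illustrative only)
csrs='NAME        AGE   SIGNERNAME                                    REQUESTOR                 CONDITION
csr-8x2kp   2m    kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:abcdef   Pending
csr-w9qtn   30m   kubernetes.io/kube-apiserver-client-kubelet   system:node:worker-1      Approved,Issued'

# List Pending CSRs; approve each with: kubectl certificate approve <name>
echo "$csrs" | awk '$NF == "Pending" { print $1 }'
```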

Flags & configuration troubleshooting

kubelet --version
sudo journalctl -u kubelet -f
Prefer fixing config.yaml and systemd drop-ins over ad-hoc CLI flags; kubeadm and most installers manage flags via config.
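
On kubeadm clusters, `systemctl cat kubelet` shows the systemd drop-ins that assemble the final command line. A sketch of extracting the active config path from a drop-in (the contents below are hypothetical but follow the usual kubeadm layout):

```shell
# Hypothetical drop-in contents, as shown by `systemctl cat kubelet` (illustrative)
dropin='[Service]
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"'

# Extract the path kubelet actually loads its configuration from
echo "$dropin" | grep -o -- '--config=[^" ]*' | cut -d= -f2
```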

Container runtime (containerd / CRI-O)

kubelet speaks CRI to the runtime. If the runtime socket is wrong or the daemon is down, Pods cannot start and the node may go NotReady.

containerd not running

sudo systemctl status containerd
sudo systemctl restart containerd
sudo crictl info

crictl info should show CRI version and a healthy response when the socket is correct.

Image pull at the runtime layer

sudo crictl pull docker.io/library/nginx:alpine
sudo crictl images
Image pull errors in kubectl describe pod often mirror what you will see when pulling the same reference with crictl pull on the node.

Socket path (containerd)

Default CRI socket is often /run/containerd/containerd.sock. kubelet must be configured to use the same socket your runtime exposes (see kubelet or cri plugin config).

ls -l /run/containerd/containerd.sock
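
crictl reads its endpoint from the command line or from /etc/crictl.yaml. Pinning it there keeps manual debugging aligned with what kubelet uses; a sketch for containerd (adjust the socket path to match your runtime):

```yaml
# /etc/crictl.yaml: point crictl at the same CRI socket kubelet uses
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
timeout: 10
```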

CRI-O (alternative commands)

sudo systemctl status crio
sudo crictl --runtime-endpoint unix:///var/run/crio/crio.sock version

Socket path may be /var/run/crio/crio.sock depending on version and OS packaging.

Debugging with crictl

sudo crictl ps -a
sudo crictl pods
sudo crictl logs <container-id>
sudo crictl inspect <container-id>

Use crictl inspect for JSON detail (mounts, labels, sandbox) when kubelet or the CNI reports sandbox errors.
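
For example, when a container keeps dying, the status block of the inspect output carries the exit state and reason. The JSON below is a trimmed, invented excerpt showing the kind of field you can extract:

```shell
# Hypothetical trimmed excerpt of `crictl inspect <container-id>` (illustrative)
inspect='{"status":{"id":"3f2a","state":"CONTAINER_EXITED","exitCode":137,"reason":"OOMKilled"}}'

# Pull out the exit state and reason for a dying container
echo "$inspect" | grep -o '"state":"[^"]*"\|"reason":"[^"]*"'
```

On a real node, pipe `sudo crictl inspect <container-id>` through the same filter (or use jq if available).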

Resource pressure: disk, memory, PID, eviction

kubelet monitors node resources and sets pressure conditions. When thresholds are crossed, it can evict Pods to protect the node.

DiskPressure

Disk space or inodes running out, usually image layers, container writable layers, or logs. Check df -h and df -i, then reclaim space (for example, prune unused images with crictl rmi --prune) and trim logs under /var/log/pods.

MemoryPressure

Available memory below the eviction threshold. Check free -m and look for eviction activity in journalctl -u kubelet; kubectl top pods -A (requires metrics-server) helps find heavy consumers.

PIDPressure

Process IDs nearly exhausted. Compare the process count (ps -e | wc -l) with sysctl kernel.pid_max and look for fork-heavy or churn-heavy workloads.

Eviction signals & soft/hard thresholds

Signal Soft threshold Hard threshold
memory.available Eviction after grace period if pressure persists; Pods terminated if node still starved. Immediate eviction pressure when available memory falls below hard limit.
nodefs.available / imagefs.available Soft: throttle + eventual eviction with grace period. Hard: stronger eviction / refusal to accept new Pods depending on state.
pid.available Soft: similar grace-period behavior for PID scarcity. Hard: more aggressive reaction to low PIDs.

Exact percentages and behavior depend on kubelet version and your KubeletConfiguration.

Configure eviction in kubelet config

Edit /var/lib/kubelet/config.yaml (or the file referenced by your kubelet service) under evictionHard, evictionSoft, evictionSoftGracePeriod, and evictionMinimumReclaim. Example pattern:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
evictionSoft:
  memory.available: "500Mi"
evictionSoftGracePeriod:
  memory.available: "1m30s"
Test threshold changes in non-production first. Overly aggressive values cause unnecessary evictions; overly loose values risk node instability.
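
Changes to config.yaml only take effect after a kubelet restart. To verify what the running kubelet actually loaded, you can query its /configz endpoint through the API server proxy. The JSON below is an invented excerpt of such a response, showing a scriptable sanity check:

```shell
# Restart kubelet after editing config.yaml, then fetch the live config:
#   sudo systemctl restart kubelet
#   kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz"
# Hypothetical excerpt of the /configz response (invented for illustration)
configz='{"kubeletconfig":{"evictionHard":{"memory.available":"200Mi","nodefs.available":"10%","imagefs.available":"15%"},"evictionSoft":{"memory.available":"500Mi"}}}'

# Confirm the hard eviction thresholds match what was configured
echo "$configz" | grep -o '"evictionHard":{[^}]*}'
```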

Node join failures (kubeadm join)

Worker nodes run kubeadm join to obtain cluster credentials and start kubelet against the control plane. Failures usually fall into token, TLS trust, network, or “already joined” categories.

Common failure modes

Typical failures: an expired or mistyped bootstrap token, a wrong --discovery-token-ca-cert-hash (TLS trust), the API server unreachable on port 6443 (firewall, DNS, routing), and leftover state from a previous join attempt (existing files under /etc/kubernetes or old kubelet certificates).

Diagnosis

# On control plane: list tokens (requires appropriate access)
sudo kubeadm token list

# From the joining node: test API reachability
nc -zv <control-plane-host> 6443
# or
curl -vk https://<control-plane-host>:6443/version

If TCP fails, fix networking before re-running join. If TLS fails, verify CA hash and server certificate.

Reset and rejoin

On a failed worker, clean local kubeadm state and join again with a fresh token:

sudo kubeadm reset -f
sudo systemctl restart kubelet

# On control plane: create a new token (example)
sudo kubeadm token create --print-join-command

Copy the printed kubeadm join ... line and run it on the worker.

Certificate issues during join

A mismatched --discovery-token-ca-cert-hash or significant clock skew between the node and the control plane makes certificate validation fail during discovery. Regenerate the whole join command on the control plane (kubeadm token create --print-join-command) rather than hand-assembling the hash, and verify time sync (chrony/ntpd) on both machines.

Pre-flight check failures (examples)

Symptom  Direction
Port 10250 / 10248 / swap / br_netfilter kubeadm preflight checks enforce host prerequisites: load required kernel modules (e.g. br_netfilter), disable swap (or explicitly allow it with failSwapOn: false in KubeletConfiguration), and open the required firewall ports per the docs.
Container runtime not running Start containerd or crio and confirm CRI socket before join.
Hostname / MAC conflicts Ensure unique node name and stable network identity for the cloud provider or kubeadm.
Keep a short runbook: kubeadm token list → nc -zv → journalctl -u kubelet → kubeadm reset + new join command.