
Kubernetes etcd Backup & Restore

Essential guide for protecting your cluster state

etcd's Role in Kubernetes Cluster
[Diagram: the Kubernetes control plane. kubectl clients send API requests to kube-apiserver, the control plane gateway, which validates state and persists it in etcd, the distributed key-value store and source of truth. The controller-manager and kube-scheduler also read and write state through the API server. Backups are created with etcdctl snapshot save (producing a .db file) and copied to off-cluster storage such as S3 or NFS; disaster recovery uses etcdctl snapshot restore on that .db file.]

Key points:
  • etcd stores all cluster state (Pods, Services, Secrets, ConfigMaps, etc.)
  • Only kube-apiserver communicates with etcd directly
Complete Backup Process Flow
[Flowchart: backup process]

Step 1: Find the etcd endpoint — extract it from the etcd.yaml manifest
Step 2: Create a snapshot — etcdctl snapshot save /tmp/backup.db
Step 3: Verify the snapshot — etcdctl snapshot status backup.db; if the check fails, inspect the errors and retry
Step 4: Copy to storage — S3, NFS, or a backup server; the cluster state is now preserved

Best practices:
  • Take daily backups
  • Back up before cluster upgrades and major changes
  • Test the restore procedure regularly
  • Automate with a CronJob
  • Encrypt backup files
  • Monitor backup size
  • Retention: 7-30 days

Required files:
  • etcd.yaml manifest (/etc/kubernetes/manifests/)
  • Certificate files: ca.crt, server.crt, server.key
Complete Restore Process Flow
[Flowchart: restore process]

Step 1: Stop kube-apiserver — move its manifest to /tmp/
Step 2: Verify the API server is down — wait for the pod to be removed
Step 3: Restore the snapshot — etcdctl snapshot restore backup.db
Step 4: Update the etcd config — point it at the restored data-dir
Step 5: Wait for etcd to restart — the static pod restarts automatically
Step 6: Restore the API server — move its manifest back
Then verify cluster health: kubectl get nodes, kubectl get pods

⚠️ Critical warnings:
  • Always test in non-production first
  • Stop the API server before restoring
  • Back up the current state first
  • Use a fresh data-dir
  • Verify the snapshot before restoring
  • Expect downtime
  • Document all changes

etcd.yaml changes required:
  --data-dir=/var/lib/etcd-new
  --name=restored
  --initial-cluster=restored=https://...
  --initial-cluster-state=new

⏱️ Time estimate: total downtime of 5-15 minutes, depending on data size.

Verify commands:
  kubectl get nodes
  kubectl get pods -A
  kubectl cluster-info
  docker ps | grep etcd (on containerd-based clusters, use crictl ps instead)
  docker ps | grep apiserver
Before Disaster vs After Recovery
✅ Before Disaster (Healthy Cluster)

  • 📊 10 Deployments running
  • 🔧 50 Pods across 3 nodes
  • 🌐 15 Services exposed
  • 📦 8 ConfigMaps, 5 Secrets
  • 🔐 3 ServiceAccounts with RBAC
  • 💾 6 PersistentVolumeClaims
  • 📋 Custom namespace configurations

Backup taken: 2026-03-20 02:00
🔄 After Disaster & Recovery

  • 📊 10 Deployments restored
  • 🔧 50 Pods being recreated
  • 🌐 15 Services restored
  • 📦 8 ConfigMaps, 5 Secrets intact
  • 🔐 3 ServiceAccounts with RBAC restored
  • 💾 6 PVCs restored (data preserved)
  • 📋 Namespace configs recovered

Cluster state restored from backup

What Gets Restored?

✅ Included in etcd Backup

  • All Pods, Deployments, Services
  • ConfigMaps and Secrets
  • RBAC roles and bindings
  • Namespaces and resource quotas
  • PersistentVolumeClaims metadata
  • Ingress and NetworkPolicy rules
  • Custom Resource Definitions (CRDs)

❌ NOT Included in Backup

  • Container runtime state
  • Actual data in Persistent Volumes
  • Downloaded container images
  • Pod logs and metrics
  • Node-local data (kubelet state)
  • CNI plugin configurations
  • External load balancer IPs
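Everything in the "included" list above lives under a /registry/<resource-type>/... key prefix in etcd, so you can get a quick inventory of what a backup will capture by counting keys per type. The helper below is a small sketch that summarizes the output of etcdctl get /registry --prefix --keys-only; the endpoint variable and certificate paths in the commented usage are the kubeadm defaults used elsewhere in this guide and may differ in your cluster.

```shell
# Summarize etcd key counts per Kubernetes resource type.
# Input (stdin): one key per line, e.g. /registry/pods/default/web-0
# Output: "<count> <resource-type>" pairs, most common first.
summarize_registry_keys() {
  grep '^/registry/' | cut -d/ -f3 | sort | uniq -c | sort -rn
}

# Usage against a live cluster (paths are kubeadm defaults):
# sudo ETCDCTL_API=3 etcdctl \
#   --endpoints="$ETCD_ENDPOINTS" \
#   --cacert=/etc/kubernetes/pki/etcd/ca.crt \
#   --cert=/etc/kubernetes/pki/etcd/server.crt \
#   --key=/etc/kubernetes/pki/etcd/server.key \
#   get /registry --prefix --keys-only | summarize_registry_keys
```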

⏰ Recovery Time Considerations

  • etcd restore: 1-5 minutes (depends on backup size)
  • Control plane restart: 2-3 minutes
  • Pod recreation: 5-15 minutes (depends on image availability)
  • Service stabilization: 2-5 minutes
  • Total RTO (Recovery Time Objective): 10-30 minutes
  • RPO (Recovery Point Objective): Last backup time (e.g., up to 24 hours for daily backups)

⚠️ Post-Recovery Actions Required

  • Verify all Pods are running and healthy
  • Check Service endpoints and connectivity
  • Test application functionality end-to-end
  • Verify PVC bindings to Persistent Volumes
  • Review and restore any data created after last backup
  • Update DNS records if needed (LoadBalancer IPs may change)
  • Notify stakeholders of recovery completion
  • Document incident and recovery process
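The first post-recovery check above ("verify all Pods are running and healthy") can be partially automated. The helper below is a minimal sketch that counts pods whose STATUS column is neither Running nor Completed, given the output of kubectl get pods -A --no-headers; a count of zero is a first signal that recovery has converged, though it does not replace end-to-end application testing.

```shell
# Count pods that are not yet Running or Completed.
# Input (stdin): output of `kubectl get pods -A --no-headers`,
# whose columns are: NAMESPACE NAME READY STATUS RESTARTS AGE.
count_unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { n++ } END { print n+0 }'
}

# Usage against a live cluster:
# kubectl get pods -A --no-headers | count_unhealthy_pods
```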
Backup Process
1
Find etcd Endpoint
Extract the etcd client URL from the pod manifest
# Extract etcd endpoint
export ETCD_ENDPOINTS=$(grep -oP \
  '(?<=--advertise-client-urls=)\S+' \
  /etc/kubernetes/manifests/etcd.yaml)

# Verify
echo $ETCD_ENDPOINTS
# Output: https://172.31.28.251:2379
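The lookbehind pattern above requires GNU grep (-P enables Perl-compatible regexes). You can sanity-check it offline against a sample manifest line before running it on a control plane node; the fragment below is illustrative, not your actual etcd.yaml.

```shell
# Sanity-check the extraction pattern against a sample manifest line
# (the IP is illustrative; your etcd.yaml will differ).
sample_manifest='    - --advertise-client-urls=https://172.31.28.251:2379'

ETCD_ENDPOINTS=$(printf '%s\n' "$sample_manifest" | \
  grep -oP '(?<=--advertise-client-urls=)\S+')

echo "$ETCD_ENDPOINTS"
# Output: https://172.31.28.251:2379
```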
2
Create Snapshot
Use etcdctl to save a snapshot of the cluster state
sudo ETCDCTL_API=3 /usr/bin/etcdctl \
  --endpoints="$ETCD_ENDPOINTS" \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /tmp/etcd_backup.db
3
Verify Snapshot
Confirm the backup file is valid
sudo ETCDCTL_API=3 /usr/bin/etcdctl \
  snapshot status /tmp/etcd_backup.db \
  --write-out=table

# Output:
# +----------+---------+--------+-------------+
# |   HASH   | VERSION |  SIZE  | TOTAL KEYS  |
# +----------+---------+--------+-------------+
# | 123abcd  |  3.5.0  | 5.1 MB |    3500     |
# +----------+---------+--------+-------------+
4
Store Safely
Move backup to secure off-cluster storage
# Copy to backup location
sudo cp /tmp/etcd_backup.db \
  /backup/etcd-$(date +%Y%m%d-%H%M%S).db

# Or upload to cloud storage
# aws s3 cp /tmp/etcd_backup.db \
#   s3://backups/etcd/
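With timestamped filenames like the one above, enforcing the 7-30 day retention recommended earlier is a one-liner with find. The function below is a sketch; the /backup directory and etcd-*.db pattern are assumptions that should match your own naming scheme, and -delete is destructive, so dry-run with -print first.

```shell
# Prune snapshot files older than a retention window.
# $1: backup directory, $2: retention in days.
# The etcd-*.db pattern is an assumption; match it to your naming scheme.
prune_old_backups() {
  find "$1" -name 'etcd-*.db' -type f -mtime +"$2" -print -delete
}

# Typical cron/CronJob usage (directory and window are assumptions):
# prune_old_backups /backup 30
```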
Restore Process
1
Stop API Server
Prevent writes during restore by moving the manifest
# Move kube-apiserver manifest
sudo mv \
  /etc/kubernetes/manifests/kube-apiserver.yaml \
  /tmp/kube-apiserver.yaml

# Wait for the apiserver container to stop.
# Note: kubectl itself talks to the API server, so once it is
# down kubectl commands will fail; check the container runtime:
sudo crictl ps | grep kube-apiserver
# (or: docker ps | grep apiserver on Docker-based clusters)
2
Restore Snapshot
Extract backup to new data directory
sudo ETCDCTL_API=3 /usr/bin/etcdctl \
  snapshot restore /tmp/etcd_backup.db \
  --data-dir=/var/lib/etcd-new \
  --name=master-restored

# Note: No certs needed for restore
# (local operation)
3
Update etcd Config
Point etcd to restored data directory
# Edit manifest:
# /etc/kubernetes/manifests/etcd.yaml

# Update these lines:
- --data-dir=/var/lib/etcd-new
- --name=master-restored
- --initial-cluster=master-restored=\
    https://172.31.28.251:2380
- --initial-cluster-state=new

# Also update the etcd-data hostPath volume
# to /var/lib/etcd-new so the container
# mounts the restored directory

# Save and exit
# etcd pod restarts automatically
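If you script these manifest edits instead of editing by hand, a sed substitution per flag works; the helper below is a sketch, shown against a sample fragment so it can be checked offline (the real file is /etc/kubernetes/manifests/etcd.yaml, and the paths are the values used in this guide).

```shell
# Rewrite one etcd command-line flag in a manifest stream.
# $1: flag name (e.g. data-dir), $2: new value; manifest text on stdin.
update_etcd_flag() {
  sed "s|--$1=[^ ]*|--$1=$2|"
}

# Offline check against a sample manifest line:
printf '%s\n' '    - --data-dir=/var/lib/etcd' | \
  update_etcd_flag data-dir /var/lib/etcd-new
# Output:     - --data-dir=/var/lib/etcd-new
```

For a real edit you would run the manifest through this once per flag and write the result back atomically (e.g. to a temp file, then mv into place), since the kubelet watches the manifests directory.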
4
Restart API Server
Restore kube-apiserver and verify cluster
# Move manifest back
sudo mv /tmp/kube-apiserver.yaml \
  /etc/kubernetes/manifests/kube-apiserver.yaml

# Wait for pods to come up
kubectl get pods -n kube-system

# Verify cluster state
kubectl get nodes
kubectl get pods -A
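After moving the manifest back, the API server can take a couple of minutes to come up, so the verify commands above will fail at first. A small retry helper avoids hand-polling; this is a generic sketch, and the attempt count and delay in the usage line are arbitrary.

```shell
# Retry a command until it succeeds or the attempt budget runs out.
# $1: max attempts, $2: delay between attempts (seconds), rest: command.
wait_until() {
  wu_attempts=$1; wu_delay=$2; shift 2
  for wu_i in $(seq "$wu_attempts"); do
    "$@" && return 0
    sleep "$wu_delay"
  done
  return 1
}

# Usage after restoring the kube-apiserver manifest:
# wait_until 30 10 kubectl get nodes
```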
Important Notes:

• Always test restore procedures in non-production environments first

• Backup regularly: before upgrades, major changes, and on a daily schedule

• Store backups in secure, off-cluster locations with proper retention policies

• Document your cluster's specific certificate paths and etcd endpoint