
Kubernetes etcd Backup & Restore

Essential guide for protecting your cluster state

etcd's Role in Kubernetes Cluster
[Diagram: the Kubernetes control plane. kubectl clients send API requests to kube-apiserver, the control plane gateway, which validates state and persists it in etcd, the distributed key-value store and source of truth. The controller-manager and kube-scheduler also read and write state through the API server. Backups are created with etcdctl snapshot save (producing a .db file) and copied to off-cluster storage such as S3 or NFS; disaster recovery uses etcdctl snapshot restore on that .db file.]

Key points:
  • etcd stores all cluster state (Pods, Services, Secrets, ConfigMaps, etc.)
  • Only kube-apiserver communicates with etcd directly
Complete Backup Process Flow
[Flowchart: backup process]

Step 1: Find the etcd endpoint — extract it from the etcd.yaml manifest
Step 2: Create a snapshot — etcdctl snapshot save /tmp/backup.db
Step 3: Verify the snapshot — etcdctl snapshot status backup.db; if the check fails, inspect the errors and retry
Step 4: Copy to storage — S3, NFS, or a backup server; the cluster state is now preserved

Best practices:
  • Take daily backups
  • Back up before cluster upgrades and major changes
  • Test the restore procedure regularly
  • Automate with a CronJob
  • Encrypt backup files
  • Monitor backup size
  • Retention: 7-30 days

Required files:
  • etcd.yaml manifest (/etc/kubernetes/manifests/)
  • Certificate files: ca.crt, server.crt, server.key
Complete Restore Process Flow
[Flowchart: restore process]

Step 1: Stop kube-apiserver — move its manifest to /tmp/
Step 2: Verify the API server is down — wait for the pod to be removed
Step 3: Restore the snapshot — etcdctl snapshot restore backup.db
Step 4: Update the etcd config — point it at the restored data-dir
Step 5: Wait for etcd to restart — the static pod restarts automatically
Step 6: Restore the API server — move its manifest back
Then verify cluster health: kubectl get nodes, kubectl get pods

⚠️ Critical warnings:
  • Always test in non-production first
  • Stop the API server before restoring
  • Back up the current state first
  • Use a fresh data-dir
  • Verify the snapshot before restoring
  • Expect downtime
  • Document all changes

etcd.yaml changes required:
  --data-dir=/var/lib/etcd-new
  --name=restored
  --initial-cluster=restored=https://...
  --initial-cluster-state=new

⏱️ Time estimate: total downtime of 5-15 minutes, depending on data size.

Verify commands:
  kubectl get nodes
  kubectl get pods -A
  kubectl cluster-info
  docker ps | grep etcd (on containerd-based clusters, use crictl ps instead)
  docker ps | grep apiserver
Before Disaster vs After Recovery
✅ Before Disaster (Healthy Cluster)

  • 📊 10 Deployments running
  • 🔧 50 Pods across 3 nodes
  • 🌐 15 Services exposed
  • 📦 8 ConfigMaps, 5 Secrets
  • 🔐 3 ServiceAccounts with RBAC
  • 💾 6 PersistentVolumeClaims
  • 📋 Custom namespace configurations

Backup taken: 2026-03-20 02:00
🔄 After Disaster & Recovery

  • 📊 10 Deployments restored
  • 🔧 50 Pods being recreated
  • 🌐 15 Services restored
  • 📦 8 ConfigMaps, 5 Secrets intact
  • 🔐 3 ServiceAccounts with RBAC restored
  • 💾 6 PVCs restored (data preserved)
  • 📋 Namespace configs recovered

Cluster state restored from backup

What Gets Restored?

✅ Included in etcd Backup

  • All Pods, Deployments, Services
  • ConfigMaps and Secrets
  • RBAC roles and bindings
  • Namespaces and resource quotas
  • PersistentVolumeClaims metadata
  • Ingress and NetworkPolicy rules
  • Custom Resource Definitions (CRDs)

❌ NOT Included in Backup

  • Container runtime state
  • Actual data in Persistent Volumes
  • Downloaded container images
  • Pod logs and metrics
  • Node-local data (kubelet state)
  • CNI plugin configurations
  • External load balancer IPs
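Everything in the "included" list above lives under a /registry/<resource-type>/... key prefix in etcd, so you can get a quick inventory of what a backup will capture by counting keys per type. The helper below is a small sketch that summarizes the output of etcdctl get /registry --prefix --keys-only; the endpoint variable and certificate paths in the commented usage are the kubeadm defaults used elsewhere in this guide and may differ in your cluster.

```shell
# Summarize etcd key counts per Kubernetes resource type.
# Input (stdin): one key per line, e.g. /registry/pods/default/web-0
# Output: "<count> <resource-type>" pairs, most common first.
summarize_registry_keys() {
  grep '^/registry/' | cut -d/ -f3 | sort | uniq -c | sort -rn
}

# Usage against a live cluster (paths are kubeadm defaults):
# sudo ETCDCTL_API=3 etcdctl \
#   --endpoints="$ETCD_ENDPOINTS" \
#   --cacert=/etc/kubernetes/pki/etcd/ca.crt \
#   --cert=/etc/kubernetes/pki/etcd/server.crt \
#   --key=/etc/kubernetes/pki/etcd/server.key \
#   get /registry --prefix --keys-only | summarize_registry_keys
```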

⏰ Recovery Time Considerations

  • etcd restore: 1-5 minutes (depends on backup size)
  • Control plane restart: 2-3 minutes
  • Pod recreation: 5-15 minutes (depends on image availability)
  • Service stabilization: 2-5 minutes
  • Total RTO (Recovery Time Objective): 10-30 minutes
  • RPO (Recovery Point Objective): Last backup time (e.g., up to 24 hours for daily backups)

⚠️ Post-Recovery Actions Required

  • Verify all Pods are running and healthy
  • Check Service endpoints and connectivity
  • Test application functionality end-to-end
  • Verify PVC bindings to Persistent Volumes
  • Review and restore any data created after last backup
  • Update DNS records if needed (LoadBalancer IPs may change)
  • Notify stakeholders of recovery completion
  • Document incident and recovery process
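The first post-recovery check above ("verify all Pods are running and healthy") can be partially automated. The helper below is a minimal sketch that counts pods whose STATUS column is neither Running nor Completed, given the output of kubectl get pods -A --no-headers; a count of zero is a first signal that recovery has converged, though it does not replace end-to-end application testing.

```shell
# Count pods that are not yet Running or Completed.
# Input (stdin): output of `kubectl get pods -A --no-headers`,
# whose columns are: NAMESPACE NAME READY STATUS RESTARTS AGE.
count_unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { n++ } END { print n+0 }'
}

# Usage against a live cluster:
# kubectl get pods -A --no-headers | count_unhealthy_pods
```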
Backup Process
1
Find etcd Endpoint
Extract the etcd client URL from the pod manifest
# Extract etcd endpoint
export ETCD_ENDPOINTS=$(grep -oP \
  '(?<=--advertise-client-urls=)\S+' \
  /etc/kubernetes/manifests/etcd.yaml)

# Verify
echo $ETCD_ENDPOINTS
# Output: https://172.31.28.251:2379
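The lookbehind pattern above requires GNU grep (-P enables Perl-compatible regexes). You can sanity-check it offline against a sample manifest line before running it on a control plane node; the fragment below is illustrative, not your actual etcd.yaml.

```shell
# Sanity-check the extraction pattern against a sample manifest line
# (the IP is illustrative; your etcd.yaml will differ).
sample_manifest='    - --advertise-client-urls=https://172.31.28.251:2379'

ETCD_ENDPOINTS=$(printf '%s\n' "$sample_manifest" | \
  grep -oP '(?<=--advertise-client-urls=)\S+')

echo "$ETCD_ENDPOINTS"
# Output: https://172.31.28.251:2379
```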
2
Create Snapshot
Use etcdctl to save a snapshot of the cluster state
sudo ETCDCTL_API=3 /usr/bin/etcdctl \
  --endpoints="$ETCD_ENDPOINTS" \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /tmp/etcd_backup.db
3
Verify Snapshot
Confirm the backup file is valid
sudo ETCDCTL_API=3 /usr/bin/etcdctl \
  snapshot status /tmp/etcd_backup.db \
  --write-out=table

# Output:
# +----------+---------+--------+-------------+
# |   HASH   | VERSION |  SIZE  | TOTAL KEYS  |
# +----------+---------+--------+-------------+
# | 123abcd  |  3.5.0  | 5.1 MB |    3500     |
# +----------+---------+--------+-------------+
4
Store Safely
Move backup to secure off-cluster storage
# Copy to backup location
sudo cp /tmp/etcd_backup.db \
  /backup/etcd-$(date +%Y%m%d-%H%M%S).db

# Or upload to cloud storage
# aws s3 cp /tmp/etcd_backup.db \
#   s3://backups/etcd/
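With timestamped filenames like the one above, enforcing the 7-30 day retention recommended earlier is a one-liner with find. The function below is a sketch; the /backup directory and etcd-*.db pattern are assumptions that should match your own naming scheme, and -delete is destructive, so dry-run with -print first.

```shell
# Prune snapshot files older than a retention window.
# $1: backup directory, $2: retention in days.
# The etcd-*.db pattern is an assumption; match it to your naming scheme.
prune_old_backups() {
  find "$1" -name 'etcd-*.db' -type f -mtime +"$2" -print -delete
}

# Typical cron/CronJob usage (directory and window are assumptions):
# prune_old_backups /backup 30
```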
Restore Process
1
Stop API Server
Prevent writes during restore by moving the manifest
# Move kube-apiserver manifest
sudo mv \
  /etc/kubernetes/manifests/kube-apiserver.yaml \
  /tmp/kube-apiserver.yaml

# Wait for the apiserver container to stop.
# Note: kubectl itself talks to the API server, so once it is
# down kubectl commands will fail; check the container runtime:
sudo crictl ps | grep kube-apiserver
# (or: docker ps | grep apiserver on Docker-based clusters)
2
Restore Snapshot
Extract backup to new data directory
sudo ETCDCTL_API=3 /usr/bin/etcdctl \
  snapshot restore /tmp/etcd_backup.db \
  --data-dir=/var/lib/etcd-new \
  --name=master-restored

# Note: No certs needed for restore
# (local operation)
3
Update etcd Config
Point etcd to restored data directory
# Edit manifest:
# /etc/kubernetes/manifests/etcd.yaml

# Update these lines:
- --data-dir=/var/lib/etcd-new
- --name=master-restored
- --initial-cluster=master-restored=\
    https://172.31.28.251:2380
- --initial-cluster-state=new

# Also update the etcd-data hostPath volume
# to /var/lib/etcd-new so the container
# mounts the restored directory

# Save and exit
# etcd pod restarts automatically
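If you script these manifest edits instead of editing by hand, a sed substitution per flag works; the helper below is a sketch, shown against a sample fragment so it can be checked offline (the real file is /etc/kubernetes/manifests/etcd.yaml, and the paths are the values used in this guide).

```shell
# Rewrite one etcd command-line flag in a manifest stream.
# $1: flag name (e.g. data-dir), $2: new value; manifest text on stdin.
update_etcd_flag() {
  sed "s|--$1=[^ ]*|--$1=$2|"
}

# Offline check against a sample manifest line:
printf '%s\n' '    - --data-dir=/var/lib/etcd' | \
  update_etcd_flag data-dir /var/lib/etcd-new
# Output:     - --data-dir=/var/lib/etcd-new
```

For a real edit you would run the manifest through this once per flag and write the result back atomically (e.g. to a temp file, then mv into place), since the kubelet watches the manifests directory.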
4
Restart API Server
Restore kube-apiserver and verify cluster
# Move manifest back
sudo mv /tmp/kube-apiserver.yaml \
  /etc/kubernetes/manifests/kube-apiserver.yaml

# Wait for pods to come up
kubectl get pods -n kube-system

# Verify cluster state
kubectl get nodes
kubectl get pods -A
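After moving the manifest back, the API server can take a couple of minutes to come up, so the verify commands above will fail at first. A small retry helper avoids hand-polling; this is a generic sketch, and the attempt count and delay in the usage line are arbitrary.

```shell
# Retry a command until it succeeds or the attempt budget runs out.
# $1: max attempts, $2: delay between attempts (seconds), rest: command.
wait_until() {
  wu_attempts=$1; wu_delay=$2; shift 2
  for wu_i in $(seq "$wu_attempts"); do
    "$@" && return 0
    sleep "$wu_delay"
  done
  return 1
}

# Usage after restoring the kube-apiserver manifest:
# wait_until 30 10 kubectl get nodes
```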
Important Notes:

• Always test restore procedures in non-production environments first

• Backup regularly: before upgrades, major changes, and on a daily schedule

• Store backups in secure, off-cluster locations with proper retention policies

• Document your cluster's specific certificate paths and etcd endpoint