homelab-docs/infrastructure/PROXMOX-RECOVERY-GUIDE.md

Proxmox Recovery Guide

Detailed procedures for recovering Proxmox VE installations, VMs, and containers from various failure scenarios.

Overview

This guide covers recovery procedures for Proxmox VE environments, specifically:

  • Proxmox node failures (hardware issues, corruption, etc.)
  • VM/Container restoration
  • Cluster recovery
  • Configuration restoration

Backup Strategy

What to Backup

1. Proxmox Configuration

# Backup Proxmox configs
tar -czf /backup/proxmox-etc-$(date +%Y%m%d).tar.gz /etc/pve/

# Backup network configuration
cp /etc/network/interfaces /backup/interfaces.$(date +%Y%m%d)

# Backup storage configuration
pvesm status > /backup/storage-status.$(date +%Y%m%d).txt
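A backup you cannot read is no backup, so it is worth verifying that these archives extract cleanly at creation time. A minimal sketch (using a throwaway directory rather than the real /etc/pve, so it is safe to run anywhere):

```shell
# Build a scratch config tree and archive it, mimicking the /etc/pve backup
workdir=$(mktemp -d)
mkdir -p "$workdir/pve/qemu-server"
echo "memory: 4096" > "$workdir/pve/qemu-server/100.conf"

tar -czf "$workdir/pve-config.tar.gz" -C "$workdir" pve/

# Verify: list the archive without extracting; a corrupt archive makes tar fail
if tar -tzf "$workdir/pve-config.tar.gz" > "$workdir/contents.txt"; then
    echo "archive OK: $(wc -l < "$workdir/contents.txt") entries"
else
    echo "archive CORRUPT" >&2
fi
```

The same `tar -tzf` check can be dropped into the automated backup script below to catch a failed archive the night it happens rather than during a recovery.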

2. VM/Container Backups

# Backup all VMs/containers
vzdump --all --mode snapshot --compress zstd --storage [backup-storage]

# Backup specific VM
vzdump VMID --mode snapshot --compress zstd --storage [backup-storage]

# Backup to network location
vzdump VMID --dumpdir /mnt/backup --mode snapshot --compress zstd

3. Boot Configuration

# Backup boot sector (MBR; on GPT disks the sfdisk dump below is the authoritative record)
dd if=/dev/sda of=/backup/mbr-backup.img bs=512 count=1

# Backup partition table
sfdisk -d /dev/sda > /backup/partition-table.$(date +%Y%m%d).txt

Automated Backup Script

Create /usr/local/bin/backup-proxmox.sh:

#!/bin/bash
# Automated Proxmox backup script

BACKUP_DIR="/mnt/backup/proxmox"
DATE=$(date +%Y%m%d-%H%M%S)
RETENTION_DAYS=30

# Create backup directory
mkdir -p "$BACKUP_DIR/$DATE"

# Backup Proxmox configuration
tar -czf "$BACKUP_DIR/$DATE/pve-config.tar.gz" /etc/pve/ 2>/dev/null

# Backup network config
cp /etc/network/interfaces "$BACKUP_DIR/$DATE/interfaces"

# Backup storage config
pvesm status > "$BACKUP_DIR/$DATE/storage-status.txt"

# Backup firewall rules
iptables-save > "$BACKUP_DIR/$DATE/iptables-rules"

# List all VMs and containers
qm list > "$BACKUP_DIR/$DATE/vm-list.txt"
pct list > "$BACKUP_DIR/$DATE/ct-list.txt"

# Backup VM configs (excluding disks)
for vm in $(qm list | awk '{if(NR>1) print $1}'); do
    qm config $vm > "$BACKUP_DIR/$DATE/vm-$vm-config.txt"
done

# Backup container configs
for ct in $(pct list | awk '{if(NR>1) print $1}'); do
    pct config $ct > "$BACKUP_DIR/$DATE/ct-$ct-config.txt"
done

# Remove old backups (depth-limited so only dated subdirectories are pruned, never $BACKUP_DIR itself)
find "$BACKUP_DIR" -mindepth 1 -maxdepth 1 -type d -mtime +$RETENTION_DAYS -exec rm -rf {} +

echo "Backup completed: $BACKUP_DIR/$DATE"
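The config-dump loops in the script rely on `awk '{if(NR>1) print $1}'` to drop the header row that `qm list` and `pct list` print before the VMID column. The behavior is easy to confirm against canned output:

```shell
# Canned qm list output: header row plus two VMs
fake_qm_list='VMID NAME   STATUS  MEM(MB) BOOTDISK(GB) PID
100  web    running 4096    32.00        1234
101  db     stopped 8192    64.00        0'

# NR>1 skips the header; $1 is the VMID column
ids=$(echo "$fake_qm_list" | awk '{if(NR>1) print $1}')
echo "$ids"   # 100 and 101, one per line
```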

Set up cron job:

# Daily backup at 2 AM
0 2 * * * /usr/local/bin/backup-proxmox.sh
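The retention step in scripts like the one above deserves care: an unconstrained `find "$BACKUP_DIR" -type d -mtime +N -exec rm -rf {} +` also matches nested directories and can match the backup root itself. Limiting depth confines deletion to the dated top-level directories. A safe-to-run sketch in a scratch directory:

```shell
backup_dir=$(mktemp -d)
retention_days=30

# Simulate one expired and one recent dated backup directory
mkdir -p "$backup_dir/20240101-020000" "$backup_dir/20991231-020000"
touch -d "40 days ago" "$backup_dir/20240101-020000"

# -mindepth 1 -maxdepth 1: only direct children, never $backup_dir itself
find "$backup_dir" -mindepth 1 -maxdepth 1 -type d -mtime +"$retention_days" \
    -exec rm -rf {} +

ls "$backup_dir"   # only the recent directory remains
```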

Recovery Scenarios

Scenario 1: Single VM/Container Recovery

Symptoms:

  • VM won't start
  • VM corrupted
  • Accidental deletion

Recovery Procedure:

1. From Proxmox Backup

# List available backups
ls -lh /var/lib/vz/dump/

# Restore VM from backup
qmrestore /var/lib/vz/dump/vzdump-qemu-VMID-DATE.vma.zst NEW_VMID \
    --storage local-lvm

# Restore container from backup
pct restore NEW_CTID /var/lib/vz/dump/vzdump-lxc-CTID-DATE.tar.zst \
    --storage local-lvm

# Start restored VM/container
qm start NEW_VMID
pct start NEW_CTID

2. From External Backup Location

# Mount backup location if needed
mount /dev/sdX1 /mnt/backup

# Or mount network share
mount -t nfs backup-server:/backups /mnt/backup

# Restore from external location
qmrestore /mnt/backup/vzdump-qemu-VMID.vma.zst NEW_VMID \
    --storage local-lvm

3. Restore to Different Storage

# List available storage
pvesm status

# Restore to specific storage
qmrestore /path/to/backup.vma.zst NEW_VMID --storage [storage-name]

Scenario 2: Proxmox Node Complete Failure

Symptoms:

  • Hardware failure (motherboard, CPU, RAM)
  • Disk controller failure
  • Proxmox installation corrupted

Recovery Options:

Option A: Reinstall Proxmox and Restore VMs

1. Reinstall Proxmox VE

# Boot from Proxmox ISO
# Follow installation wizard
# Configure same network settings as before
# Configure same hostname

# After installation, update system
apt update && apt full-upgrade

2. Restore Network Configuration

# Copy backed up network config
scp backup-server:/backup/interfaces /etc/network/interfaces

# Restart networking
systemctl restart networking

3. Configure Storage

# Recreate storage configurations
# Web UI: Datacenter → Storage → Add

# Or via command line
pvesm add dir backup --path /mnt/backup --content backup
pvesm add nfs shared-storage --server NFS_IP --export /export/path --content images,backup

4. Restore VMs/Containers

# Copy backups if needed
scp -r backup-server:/backups/* /var/lib/vz/dump/

# Restore each VM
for backup in /var/lib/vz/dump/vzdump-qemu-*.vma.zst; do
    VMID=$(basename "$backup" | grep -oP '\d+' | head -1)
    echo "Restoring VM $VMID..."
    qmrestore "$backup" "$VMID" --storage local-lvm
done

# Restore each container
for backup in /var/lib/vz/dump/vzdump-lxc-*.tar.zst; do
    CTID=$(basename "$backup" | grep -oP '\d+' | head -1)
    echo "Restoring CT $CTID..."
    pct restore "$CTID" "$backup" --storage local-lvm
done
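The loops above derive the VMID/CTID from the vzdump filename, which follows the `vzdump-qemu-<VMID>-<timestamp>` (or `vzdump-lxc-...`) convention. Since the timestamp also contains digit runs, only the first run should be taken; a quick check on a sample name:

```shell
backup="/var/lib/vz/dump/vzdump-qemu-105-2024_01_15-02_00_01.vma.zst"

# Strip the directory, then take the digit run right after the type prefix;
# a bare grep -oP '\d+' would also emit every digit group in the timestamp
vmid=$(basename "$backup" | grep -oP '^vzdump-(qemu|lxc)-\K\d+')

echo "$vmid"   # 105
```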

Option B: Disk Recovery (If disks are intact)

1. Boot from Proxmox Live ISO

# Don't install - boot to rescue mode

2. Mount Proxmox System Disk

# Identify system disk
lsblk

# Mount root filesystem
mkdir /mnt/pve-root
mount /dev/sdX3 /mnt/pve-root  # Adjust partition number

# Mount boot partition
mount /dev/sdX2 /mnt/pve-root/boot/efi

3. Chroot into System

# Mount proc, sys, dev
mount -t proc proc /mnt/pve-root/proc
mount -t sysfs sys /mnt/pve-root/sys
mount -o bind /dev /mnt/pve-root/dev
mount -t devpts devpts /mnt/pve-root/dev/pts

# Chroot
chroot /mnt/pve-root

# Try to repair
proxmox-boot-tool refresh
update-grub
update-initramfs -u

# Exit chroot
exit

# Unmount and reboot
umount -R /mnt/pve-root
reboot

Scenario 3: ZFS Pool Recovery

Symptoms:

  • ZFS pool degraded
  • Missing or failed disk in ZFS mirror/RAID

Recovery Procedure:

1. Check Pool Status

# Check ZFS pool health
zpool status

# Example output showing degraded pool:
# pool: rpool
#  state: DEGRADED
# scan: scrub in progress since...

2. Replace Failed Disk in ZFS Mirror

# Identify failed disk
zpool status rpool

# Replace disk (assuming /dev/sdb failed, replacing with /dev/sdc)
zpool replace rpool /dev/sdb /dev/sdc

# Monitor resilvering progress
watch zpool status rpool

3. Import Pool from Backup Disks

# If pool is not automatically imported
zpool import

# Import specific pool
zpool import rpool

# Force import if needed (use cautiously)
zpool import -f rpool

4. Scrub Pool After Recovery

# Start scrub to verify data integrity
zpool scrub rpool

# Monitor scrub progress
zpool status

Scenario 4: LVM Recovery

Symptoms:

  • LVM volume group issues
  • Corrupted LVM metadata
  • Missing physical volumes

Recovery Procedure:

1. Scan for Volume Groups

# Scan for all volume groups
vgscan

# Activate all volume groups
vgchange -ay

2. Restore LVM Metadata

# LVM automatically backs up metadata to /etc/lvm/archive/

# List available metadata backups
ls -lh /etc/lvm/archive/

# Restore from backup
vgcfgrestore pve -f /etc/lvm/archive/pve_XXXXX.vg

# Activate volume group
vgchange -ay pve

3. Recover from Failed Disk

# Remove failed physical volume from volume group
vgreduce pve /dev/sdX

# Add new physical volume
pvcreate /dev/sdY
vgextend pve /dev/sdY

# Move data from old to new disk (if old disk still readable)
pvmove /dev/sdX /dev/sdY
vgreduce pve /dev/sdX

Scenario 5: Cluster Node Recovery

Symptoms:

  • Node removed from cluster
  • Cluster quorum lost
  • Split-brain scenario

Recovery Procedure:

1. Check Cluster Status

# Check cluster status
pvecm status

# Check quorum
pvecm nodes

2. Restore Single Node from Cluster

# If node was removed from cluster and you want to use it standalone

# Stop cluster services
systemctl stop pve-cluster
systemctl stop corosync

# Start in local mode
pmxcfs -l

# Remove cluster configuration
rm /etc/pve/corosync.conf
rm -rf /etc/corosync/*

# Restart services
killall pmxcfs
systemctl start pve-cluster

3. Rejoin Node to Cluster

# On the node to be rejoined
pvecm add CLUSTER_NODE_IP

# Enter cluster network information when prompted
# Node will rejoin cluster and sync configuration

4. Recover Lost Quorum (Emergency Only)

# If majority of cluster nodes are down and you need to continue
# WARNING: This can cause split-brain if other nodes come back

# Set expected votes to current online nodes
pvecm expected 1

# This allows single node to have quorum temporarily

Scenario 6: Configuration Recovery Without Backups

If /etc/pve/ is lost but the VM/container disks are intact:

1. Identify Existing VMs/Containers

# List LVM volumes
lvs

# List ZFS datasets
zfs list -t all

# VM disks typically in:
# LVM: pve/vm-XXX-disk-Y
# ZFS: rpool/data/vm-XXX-disk-Y

2. Recreate VM Configuration

# Create new VM with same VMID
qm create VMID --name "recovered-vm" --memory 4096 --cores 2

# Attach existing disk (LVM example)
qm set VMID --scsi0 local-lvm:vm-VMID-disk-0

# For ZFS
qm set VMID --scsi0 local-zfs:vm-VMID-disk-0

# Set other options as needed
qm set VMID --net0 virtio,bridge=vmbr0
qm set VMID --boot c --bootdisk scsi0

# Try to start VM
qm start VMID

3. Recreate Container Configuration

# Containers are stored in /var/lib/vz/ or ZFS dataset
# Check for rootfs

# Create container pointing to existing rootfs
pct create CTID /var/lib/vz/template/cache/[template].tar.gz \
    --rootfs local-lvm:vm-CTID-disk-0 \
    --hostname recovered-ct \
    --memory 2048

# Start container
pct start CTID

Tools and Commands

Essential Proxmox Commands

VM Management:

# List all VMs
qm list

# Show VM config
qm config VMID

# Start/stop VM
qm start VMID
qm stop VMID
qm shutdown VMID

# Clone VM
qm clone VMID NEW_VMID --name new-vm-name

# Migrate VM (in cluster)
qm migrate VMID TARGET_NODE

Container Management:

# List all containers
pct list

# Show container config
pct config CTID

# Start/stop container
pct start CTID
pct stop CTID
pct shutdown CTID

# Enter container
pct enter CTID

Storage Management:

# List storage
pvesm status

# Add storage
pvesm add [type] [storage-id] [options]

# Scan for storage
pvesm scan [type]

Backup/Restore:

# Create backup
vzdump VMID --mode snapshot --compress zstd

# Restore backup
qmrestore /path/to/backup.vma.zst NEW_VMID

# List backups
pvesh get /nodes/NODE/storage/STORAGE/content --content backup

Diagnostic Commands

# Check Proxmox version
pveversion -v

# Check system resources
pvesh get /nodes/NODE/status

# Check recent tasks
pvesh get /nodes/NODE/tasks

# Check logs
journalctl -u pve-cluster
journalctl -u pvedaemon
journalctl -u pveproxy

# Check disk health
smartctl -a /dev/sdX

# Check network
ip addr
ip route

Recovery Tools

SystemRescue CD:

  • Boot from SystemRescue ISO
  • Access to ZFS, LVM, and filesystem tools
  • Can mount and repair Proxmox installations

Proxmox Live ISO:

  • Boot without installing
  • Can mount existing installations
  • Repair bootloader and configurations

TestDisk/PhotoRec:

  • Recover deleted files
  • Repair partition tables

Preventive Measures

Regular Maintenance

1. Daily Checks

# Check cluster/node status
pvecm status

# Check VM/CT status
qm list
pct list

# Check storage health
pvesm status

2. Weekly Tasks

# Update Proxmox
apt update && apt dist-upgrade

# Check for failed systemd services
systemctl --failed

# Review logs
journalctl -p err -b

3. Monthly Tasks

# Test backup restore
qmrestore [backup] 999 --storage local-lvm
qm start 999
# Verify VM boots correctly
qm stop 999
qm destroy 999

# Check disk health
for disk in /dev/sd?; do smartctl -H $disk; done

# Check ZFS scrub
zpool scrub rpool
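The monthly checks above can be scheduled rather than run by hand. A hedged crontab sketch (day, time, pool name, and disk glob are placeholders to adapt; cron mails command output to root by default):

```shell
# Monthly ZFS scrub on the 1st at 03:00
0 3 1 * * /sbin/zpool scrub rpool

# Monthly SMART health report on the 2nd at 03:00
0 3 2 * * for d in /dev/sd?; do /usr/sbin/smartctl -H "$d"; done
```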

Backup Best Practices

1. 3-2-1 Backup Strategy

  • 3 copies of data
  • 2 different media types
  • 1 off-site copy
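Whichever media the copies land on, each copy should be verifiable, since checksums catch silent corruption introduced while copying. A minimal sketch using sha256sum on a scratch file (the filename and "offsite" directory are stand-ins for real backup archives and real second media):

```shell
workdir=$(mktemp -d)
echo "fake backup payload" > "$workdir/vzdump-qemu-100.vma.zst"

# Record a checksum next to the backup at creation time
( cd "$workdir" && sha256sum vzdump-qemu-100.vma.zst > SHA256SUMS )

# Copy to the "second media" (here just another directory) and verify there
mkdir "$workdir/offsite"
cp "$workdir/vzdump-qemu-100.vma.zst" "$workdir/SHA256SUMS" "$workdir/offsite/"
( cd "$workdir/offsite" && sha256sum -c SHA256SUMS )
```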

2. Automated Backups

  • Schedule regular VM/CT backups
  • Backup Proxmox configuration
  • Test restore procedures regularly

3. Documentation

  • Keep network diagrams updated
  • Document IP allocations
  • Maintain runbooks for common tasks
  • Store documentation off-site

Monitoring Setup

1. Setup Email Alerts

# Configure postfix for email
apt install postfix

# Test email
echo "Test" | mail -s "Proxmox Alert Test" your@email.com

2. Monitor Resources

  • Set up monitoring for CPU, RAM, disk usage
  • Alert on high resource consumption
  • Monitor backup job success/failure

3. Health Checks

# Create health check script
cat > /usr/local/bin/health-check.sh << 'EOF'
#!/bin/bash
# Proxmox Health Check

# Check cluster status
if ! pvecm status &>/dev/null; then
    echo "WARNING: Cluster status check failed"
fi

# Check storage (skip the header row; warn on any storage whose status is not active)
pvesm status | awk 'NR>1 && $3 != "active"' | grep -q . && echo "WARNING: Storage issue detected"

# Check for failed VMs
qm list | grep stopped && echo "INFO: Stopped VMs detected"

# Check system load (awk handles the float comparison without needing bc)
LOAD=$(awk '{print $1}' /proc/loadavg)
if awk -v l="$LOAD" 'BEGIN { exit !(l > 8) }'; then
    echo "WARNING: High system load: $LOAD"
fi

# Check disk space
df -h | awk '$5 ~ /^9[0-9]%/ || $5 ~ /^100%/ {print "WARNING: Disk space low on " $6 ": " $5}'
EOF

chmod +x /usr/local/bin/health-check.sh

# Add to crontab without clobbering existing entries
( crontab -l 2>/dev/null; echo "*/15 * * * * /usr/local/bin/health-check.sh | mail -s 'Proxmox Health Alert' your@email.com" ) | crontab -
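The df filter in the health-check script only fires at 90% usage and above; it can be exercised against canned `df -h` output to confirm the pattern behaves as intended:

```shell
# Fake df -h output: header plus three mounts at varying usage
fake_df='Filesystem Size Used Avail Use% Mounted on
/dev/sda3  100G  50G   50G  50%  /
/dev/sdb1  100G  95G    5G  95%  /mnt/backup
/dev/sdc1  100G 100G    0G 100%  /mnt/full'

# Same pattern as the script: warn at 90-99% and at 100%
warnings=$(echo "$fake_df" | awk '$5 ~ /^9[0-9]%$/ || $5 ~ /^100%$/ {print "WARNING: Disk space low on " $6 ": " $5}')
echo "$warnings"
```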

Emergency Contacts

Proxmox Resources

Hardware Support

  • Document hardware vendor support contacts
  • Keep warranty information accessible
  • Maintain spare parts inventory

Recovery Time Objectives

| Scenario | Target Recovery Time | Notes |
|---|---|---|
| Single VM restore | 30 minutes | From local backup |
| Complete node rebuild | 4-8 hours | Including OS reinstall |
| ZFS pool recovery | 1-6 hours | Depends on resilvering time |
| Cluster rejoin | 1-2 hours | Network reconfiguration |
| Full disaster recovery | 24-48 hours | From off-site backups |

Recent Recovery Events

Event Log Template

Date: YYYY-MM-DD
Affected System: [Proxmox node/VM/CT]
Issue: [Description]
Resolution: [Steps taken]
Downtime: [Duration]
Lessons Learned: [Improvements for next time]


Last Updated: 2025-12-13
Version: 1.0