# Proxmox Recovery Guide

Detailed procedures for recovering Proxmox VE installations, VMs, and containers from various failure scenarios.

## Table of Contents
- [Overview](#overview)
- [Backup Strategy](#backup-strategy)
- [Recovery Scenarios](#recovery-scenarios)
- [Tools and Commands](#tools-and-commands)
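- [Preventive Measures](#preventive-measures)
- [Emergency Contacts](#emergency-contacts)
- [Recovery Time Objectives](#recovery-time-objectives)
- [Recent Recovery Events](#recent-recovery-events)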

## Overview

This guide covers recovery procedures for Proxmox VE environments, specifically:
- Proxmox node failures (hardware issues, corruption, etc.)
- VM/Container restoration
- Cluster recovery
- Configuration restoration

## Backup Strategy

### What to Backup

**1. Proxmox Configuration**
```bash
# Backup Proxmox configs
tar -czf /backup/proxmox-etc-$(date +%Y%m%d).tar.gz /etc/pve/

# Backup network configuration
cp /etc/network/interfaces /backup/interfaces.$(date +%Y%m%d)

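# Record storage status (the storage definitions themselves live in
# /etc/pve/storage.cfg, which the /etc/pve/ archive above already contains)
pvesm status > /backup/storage-status.$(date +%Y%m%d).txt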
```

**2. VM/Container Backups**
```bash
# Backup all VMs/containers
vzdump --all --mode snapshot --compress zstd --storage [backup-storage]

# Backup specific VM
vzdump VMID --mode snapshot --compress zstd --storage [backup-storage]

# Backup to network location
vzdump VMID --dumpdir /mnt/backup --mode snapshot --compress zstd
```
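
Archives copied to an external or network location are only useful if they arrive intact. The following is a minimal sketch, not part of Proxmox itself, that records SHA-256 checksums next to the archives and verifies them later. The function names `write_manifest` and `verify_manifest` and the manifest file name `SHA256SUMS` are illustrative assumptions.

```bash
# write_manifest DIR: hash every vzdump-* file in DIR into DIR/SHA256SUMS
write_manifest() {
    local dir="$1"
    ( cd "$dir" && sha256sum vzdump-* > SHA256SUMS )
}

# verify_manifest DIR: succeed only if every listed file still matches its hash
verify_manifest() {
    local dir="$1"
    ( cd "$dir" && sha256sum --check --quiet SHA256SUMS )
}

# Example (paths are placeholders):
#   write_manifest /mnt/backup     # right after vzdump finishes
#   verify_manifest /mnt/backup    # before relying on the archives for a restore
```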

**3. Boot Configuration**
```bash
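# Backup boot sector (first 512 bytes; only meaningful on legacy BIOS/MBR disks,
# on GPT/UEFI systems the partition table dump below is the useful part)
dd if=/dev/sda of=/backup/mbr-backup.img bs=512 count=1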

# Backup partition table
sfdisk -d /dev/sda > /backup/partition-table.$(date +%Y%m%d).txt
```

### Automated Backup Script

Create `/usr/local/bin/backup-proxmox.sh`:

```bash
#!/bin/bash
# Automated Proxmox backup script

BACKUP_DIR="/mnt/backup/proxmox"
DATE=$(date +%Y%m%d-%H%M%S)
RETENTION_DAYS=30

# Create backup directory
mkdir -p "$BACKUP_DIR/$DATE"

# Backup Proxmox configuration
tar -czf "$BACKUP_DIR/$DATE/pve-config.tar.gz" /etc/pve/ 2>/dev/null

# Backup network config
cp /etc/network/interfaces "$BACKUP_DIR/$DATE/interfaces"

# Backup storage config
pvesm status > "$BACKUP_DIR/$DATE/storage-status.txt"

# Backup firewall rules
iptables-save > "$BACKUP_DIR/$DATE/iptables-rules"

# List all VMs and containers
qm list > "$BACKUP_DIR/$DATE/vm-list.txt"
pct list > "$BACKUP_DIR/$DATE/ct-list.txt"

# Backup VM configs (excluding disks)
for vm in $(qm list | awk '{if(NR>1) print $1}'); do
    qm config $vm > "$BACKUP_DIR/$DATE/vm-$vm-config.txt"
done

# Backup container configs
for ct in $(pct list | awk '{if(NR>1) print $1}'); do
    pct config $ct > "$BACKUP_DIR/$DATE/ct-$ct-config.txt"
done

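# Remove old backups (only dated directories directly under BACKUP_DIR,
# never BACKUP_DIR itself)
find "$BACKUP_DIR" -mindepth 1 -maxdepth 1 -type d -mtime +"$RETENTION_DAYS" -exec rm -rf {} +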

echo "Backup completed: $BACKUP_DIR/$DATE"
```

Set up cron job:
```bash
# Daily backup at 2 AM
0 2 * * * /usr/local/bin/backup-proxmox.sh
```
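
A scheduled job can fail silently, so it may be worth checking each backup run after the fact. The sketch below assumes the file names written by the script above (`pve-config.tar.gz`, `interfaces`, `storage-status.txt`, `vm-list.txt`, `ct-list.txt`). The function name `check_backup_set` is hypothetical.

```bash
# check_backup_set DIR: report missing, empty, or unreadable files from one run.
check_backup_set() {
    local dir="$1" rc=0 f
    # These files should always have content
    for f in pve-config.tar.gz interfaces storage-status.txt; do
        if [ ! -s "$dir/$f" ]; then
            echo "MISSING or empty: $dir/$f"
            rc=1
        fi
    done
    # The guest lists may legitimately be empty, but they should exist
    for f in vm-list.txt ct-list.txt; do
        if [ ! -e "$dir/$f" ]; then
            echo "MISSING: $dir/$f"
            rc=1
        fi
    done
    # Confirm the config archive can be read end to end
    if [ -s "$dir/pve-config.tar.gz" ] && ! tar -tzf "$dir/pve-config.tar.gz" >/dev/null 2>&1; then
        echo "CORRUPT: $dir/pve-config.tar.gz"
        rc=1
    fi
    return "$rc"
}

# Example: check_backup_set /mnt/backup/proxmox/20250101-020000
```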

## Recovery Scenarios

### Scenario 1: Single VM/Container Recovery

**Symptoms:**
- VM won't start
- VM corrupted
- Accidental deletion

**Recovery Procedure:**

**1. From Proxmox Backup**
```bash
# List available backups
ls -lh /var/lib/vz/dump/

# Restore VM from backup
qmrestore /var/lib/vz/dump/vzdump-qemu-VMID-DATE.vma.zst NEW_VMID \
    --storage local-lvm

# Restore container from backup
pct restore NEW_CTID /var/lib/vz/dump/vzdump-lxc-CTID-DATE.tar.zst \
    --storage local-lvm

# Start restored VM/container
qm start NEW_VMID
pct start NEW_CTID
```
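
When several archives exist for one guest, picking the newest by hand is error-prone. This is a small sketch that assumes the `vzdump-<type>-<ID>-<timestamp>` naming shown above and zstd-compressed archives. The helper name `latest_backup` is an illustration, not a Proxmox command.

```bash
# latest_backup DIR TYPE ID: print the newest archive for one guest.
# TYPE is "qemu" or "lxc". The timestamp in the file name sorts correctly as
# plain text, so a lexical sort is enough.
latest_backup() {
    local dir="$1" type="$2" id="$3"
    ls -1 "$dir"/vzdump-"$type"-"$id"-*.zst 2>/dev/null | sort | tail -n 1
}

# Example:
#   qmrestore "$(latest_backup /var/lib/vz/dump qemu 100)" NEW_VMID --storage local-lvm
```

The trailing dash after the ID in the pattern keeps guest 100 from matching archives of guest 1001, and the `.zst` suffix skips the log files that sit next to the archives.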

**2. From External Backup Location**
```bash
# Mount backup location if needed
mount /dev/sdX1 /mnt/backup

# Or mount network share
mount -t nfs backup-server:/backups /mnt/backup

# Restore from external location
qmrestore /mnt/backup/vzdump-qemu-VMID.vma.zst NEW_VMID \
    --storage local-lvm
```

**3. Restore to Different Storage**
```bash
# List available storage
pvesm status

# Restore to specific storage
qmrestore /path/to/backup.vma.zst NEW_VMID --storage [storage-name]
```

### Scenario 2: Proxmox Node Complete Failure

**Symptoms:**
- Hardware failure (motherboard, CPU, RAM)
- Disk controller failure
- Proxmox installation corrupted

**Recovery Options:**

**Option A: Reinstall Proxmox and Restore VMs**

**1. Reinstall Proxmox VE**
```bash
# Boot from Proxmox ISO
# Follow installation wizard
# Configure same network settings as before
# Configure same hostname

# After installation, update system
apt update && apt full-upgrade
```

**2. Restore Network Configuration**
```bash
# Copy backed up network config
scp backup-server:/backup/interfaces /etc/network/interfaces

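# Apply the configuration (ifupdown2, the default on current Proxmox VE releases)
ifreload -a
# Without ifupdown2, fall back to: systemctl restart networking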
```

**3. Configure Storage**
```bash
# Recreate storage configurations
# Web UI: Datacenter → Storage → Add

# Or via command line
pvesm add dir backup --path /mnt/backup --content backup
pvesm add nfs shared-storage --server NFS_IP --export /export/path --content images,backup
```

**4. Restore VMs/Containers**
```bash
# Copy backups if needed
scp -r backup-server:/backups/* /var/lib/vz/dump/

# Restore each VM (the VMID is the number right after "vzdump-qemu-";
# matching every digit run would also pick up the date and time)
for backup in /var/lib/vz/dump/vzdump-qemu-*.vma.zst; do
    VMID=$(basename "$backup" | grep -oP '^vzdump-qemu-\K\d+')
    echo "Restoring VM $VMID..."
    qmrestore "$backup" "$VMID" --storage local-lvm
done

# Restore each container
for backup in /var/lib/vz/dump/vzdump-lxc-*.tar.zst; do
    CTID=$(basename "$backup" | grep -oP '^vzdump-lxc-\K\d+')
    echo "Restoring CT $CTID..."
    pct restore "$CTID" "$backup" --storage local-lvm
done
```
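
Before restoring a whole directory of archives, it can help to print the commands first and review them. The following dry-run sketch only echoes the restore commands and runs nothing. The function name `plan_restores` is hypothetical, and it assumes the archive naming used in this guide.

```bash
# plan_restores DIR STORAGE: print (do not run) one restore command per archive.
plan_restores() {
    local dir="$1" storage="$2" f name id
    for f in "$dir"/vzdump-qemu-*.vma.zst "$dir"/vzdump-lxc-*.tar.zst; do
        [ -e "$f" ] || continue      # skip patterns that matched nothing
        name=$(basename "$f")
        id=${name#vzdump-*-}         # drop the "vzdump-qemu-" or "vzdump-lxc-" prefix
        id=${id%%-*}                 # keep only the ID before the next dash
        case "$name" in
            vzdump-qemu-*) echo "qmrestore $f $id --storage $storage" ;;
            vzdump-lxc-*)  echo "pct restore $id $f --storage $storage" ;;
        esac
    done
}

# Example: plan_restores /var/lib/vz/dump local-lvm
```

If several archives exist for the same guest, the plan lists each of them. Only one restore per ID can succeed, so trim the output before running it.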

**Option B: Disk Recovery (If disks are intact)**

**1. Boot from Proxmox Live ISO**
```bash
# Do not run the installer; boot into a rescue or debug shell instead
```

**2. Mount Proxmox System Disk**
```bash
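# Identify system disk and partitions
lsblk

# Mount root filesystem
mkdir -p /mnt/pve-root
# Default LVM install: partition 3 is an LVM physical volume, so activate
# the volume group and mount the root logical volume
vgchange -ay pve
mount /dev/pve/root /mnt/pve-root
# ZFS install instead (assuming the default pool name rpool):
# zpool import -f -R /mnt/pve-root rpool

# Mount the EFI system partition (UEFI systems; adjust partition number)
mount /dev/sdX2 /mnt/pve-root/boot/efi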
```

**3. Chroot into System**
```bash
# Mount proc, sys, dev
mount -t proc proc /mnt/pve-root/proc
mount -t sysfs sys /mnt/pve-root/sys
mount -o bind /dev /mnt/pve-root/dev
mount -t devpts devpts /mnt/pve-root/dev/pts

# Chroot
chroot /mnt/pve-root

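# Try to repair: rebuild the initramfs and GRUB config first, then sync
# kernels to the ESP(s) on systems managed by proxmox-boot-tool
update-initramfs -u -k all
update-grub
proxmox-boot-tool refresh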

# Exit chroot
exit

# Unmount and reboot
umount -R /mnt/pve-root
reboot
```
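
Which repair step matters depends partly on whether the machine boots through UEFI or legacy BIOS. A minimal sketch for checking this follows; the function name `boot_mode` is an illustration. In a rescue environment the result describes how the rescue medium was booted, which is not necessarily how the installed system boots.

```bash
# boot_mode: print "UEFI" if the running kernel was started via EFI firmware,
# otherwise "BIOS". The kernel exposes /sys/firmware/efi only on EFI boots.
boot_mode() {
    if [ -d /sys/firmware/efi ]; then
        echo "UEFI"
    else
        echo "BIOS"
    fi
}

# Example: echo "Booted in $(boot_mode) mode"
```

Inside the chroot, `proxmox-boot-tool status` should additionally show whether any ESPs are managed by the tool.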

### Scenario 3: ZFS Pool Recovery

**Symptoms:**
- ZFS pool degraded
- Missing or failed disk in ZFS mirror/RAID

**Recovery Procedure:**

**1. Check Pool Status**
```bash
# Check ZFS pool health
zpool status

# Example output showing degraded pool:
# pool: rpool
#  state: DEGRADED
# scan: scrub in progress since...
```

**2. Replace Failed Disk in ZFS Mirror**
```bash
# Identify failed disk
zpool status rpool

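# Replace disk (assuming /dev/sdb failed, replacing with /dev/sdc).
# Stable /dev/disk/by-id/ names are safer than sdX names, which can change.
# Note: a boot pool such as rpool sits on a partition of each disk, so the new
# disk also needs the partition layout and boot partition set up first.
zpool replace rpool /dev/sdb /dev/sdc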

# Monitor resilvering progress
watch zpool status rpool
```

**3. Import Pool from Backup Disks**
```bash
# If pool is not automatically imported
zpool import

# Import specific pool
zpool import rpool

# Force import if needed (use cautiously)
zpool import -f rpool
```

**4. Scrub Pool After Recovery**
```bash
# Start scrub to verify data integrity
zpool scrub rpool

# Monitor scrub progress
zpool status
```
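
For scripted checks, the human-readable `zpool status` output is awkward to parse. The sketch below instead filters the tab-separated output of `zpool list -H -o name,health`. The function name `unhealthy_pools` is hypothetical, and it treats every state other than `ONLINE` as a problem, which is an assumption you may want to relax.

```bash
# unhealthy_pools: read "name<TAB>health" lines on stdin, print every pool that
# is not ONLINE, and exit non-zero if at least one such pool was found.
unhealthy_pools() {
    awk -F'\t' '$2 != "ONLINE" { print $1 ": " $2; bad = 1 }
                END { exit (bad ? 1 : 0) }'
}

# Example:
#   zpool list -H -o name,health | unhealthy_pools || echo "A pool needs attention"
```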

### Scenario 4: LVM Recovery

**Symptoms:**
- LVM volume group issues
- Corrupted LVM metadata
- Missing physical volumes

**Recovery Procedure:**

**1. Scan for Volume Groups**
```bash
# Scan for all volume groups
vgscan

# Activate all volume groups
vgchange -ay
```

**2. Restore LVM Metadata**
```bash
# LVM automatically backs up metadata to /etc/lvm/archive/

# List available metadata backups
ls -lh /etc/lvm/archive/

# Restore from backup
vgcfgrestore pve -f /etc/lvm/archive/pve_XXXXX.vg

# Activate volume group
vgchange -ay pve
```
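
Choosing which archive file to restore is the delicate part of this step. This is a small sketch for locating the most recently written archive of a volume group, assuming the usual `<vg>_<number>-<id>.vg` file naming. The function name `newest_vg_archive` is hypothetical.

```bash
# newest_vg_archive VG [DIR]: print the most recently modified metadata archive
# for volume group VG. DIR defaults to /etc/lvm/archive.
newest_vg_archive() {
    local vg="$1" dir="${2:-/etc/lvm/archive}"
    ls -1t "$dir/${vg}"_*.vg 2>/dev/null | head -n 1
}

# Example:
#   f=$(newest_vg_archive pve)
#   grep -m1 description "$f"    # shows which command the archive was taken before
```

The newest file is not always the right one. Each archive records the state before a change, so check its description and pick the last archive taken before the damaging operation.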

**3. Recover from Failed Disk**
```bash
# Add the replacement disk to the volume group
pvcreate /dev/sdY
vgextend pve /dev/sdY

# If the old disk is still readable, move its data, then remove it
pvmove /dev/sdX /dev/sdY
vgreduce pve /dev/sdX

# If the old disk is gone entirely, drop the missing physical volume instead
vgreduce --removemissing pve
```

### Scenario 5: Cluster Node Recovery

**Symptoms:**
- Node removed from cluster
- Cluster quorum lost
- Split-brain scenario

**Recovery Procedure:**

**1. Check Cluster Status**
```bash
# Check cluster status and quorum (see the "Quorate" line)
pvecm status

# List cluster members and their votes
pvecm nodes
```

**2. Separate a Node from the Cluster (Run Standalone)**
```bash
# If the node was removed from the cluster and you want to keep using it standalone

# Stop cluster services
systemctl stop pve-cluster
systemctl stop corosync

# Start in local mode
pmxcfs -l

# Remove cluster configuration
rm /etc/pve/corosync.conf
rm -rf /etc/corosync/*

# Restart services
killall pmxcfs
systemctl start pve-cluster
```

**3. Rejoin Node to Cluster**
```bash
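# On the cluster: remove the stale entry first if the node is still listed
# pvecm delnode NODE_NAME

# On the node to be rejoined (it must hold no guests and no old cluster config)
pvecm add CLUSTER_NODE_IP

# Enter the root password of the existing cluster node when prompted
# The node then joins the cluster and syncs /etc/pve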
```

**4. Recover Lost Quorum (Emergency Only)**
```bash
# If majority of cluster nodes are down and you need to continue
# WARNING: This can cause split-brain if other nodes come back

# Set expected votes to current online nodes
pvecm expected 1

# This allows single node to have quorum temporarily
```
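
Because lowering the expected votes is risky, a script should confirm that quorum is really lost before anyone reaches for it. The sketch below assumes the `Quorate:` line that `pvecm status` prints in its quorum summary. The function name `is_quorate` is an illustration.

```bash
# is_quorate: read `pvecm status` output on stdin and succeed only if it
# contains a "Quorate:" line whose value is "Yes".
is_quorate() {
    awk '$1 == "Quorate:" { found = 1; ok = ($2 == "Yes") }
         END { exit ((found && ok) ? 0 : 1) }'
}

# Example:
#   if pvecm status | is_quorate; then
#       echo "Cluster has quorum, do not change expected votes"
#   fi
```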

### Scenario 6: Configuration Recovery Without Backups

**If /etc/pve/ is lost but VMs/containers intact:**

**1. Identify Existing VMs/Containers**
```bash
# List LVM volumes
lvs

# List ZFS datasets
zfs list -t all

# VM disks typically in:
# LVM: pve/vm-XXX-disk-Y
# ZFS: rpool/data/vm-XXX-disk-Y
```
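
With many volumes, it helps to reduce the listing to the set of guest IDs that still have disks. This sketch relies on the `vm-<ID>-disk-<N>` naming shown above and also accepts `subvol-` and `base-` prefixes, which are assumed here to be the names used for container subvolumes and templates. The function name `guest_ids_from_volumes` is hypothetical.

```bash
# guest_ids_from_volumes: read volume names on stdin (one per line) and print
# each guest ID once, in numeric order. Lines that do not match are ignored.
guest_ids_from_volumes() {
    sed -nE 's/^.*(vm|subvol|base)-([0-9]+)-disk-[0-9]+.*$/\2/p' | sort -n -u
}

# Examples:
#   lvs --noheadings -o lv_name pve | guest_ids_from_volumes
#   zfs list -H -o name | guest_ids_from_volumes
```

Any ID printed here that is missing from `qm list` and `pct list` is a candidate for the configuration rebuild described in the next steps.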

**2. Recreate VM Configuration**
```bash
# Create new VM with same VMID
qm create VMID --name "recovered-vm" --memory 4096 --cores 2

# Attach existing disk (LVM example)
qm set VMID --scsi0 local-lvm:vm-VMID-disk-0

# For ZFS
qm set VMID --scsi0 local-zfs:vm-VMID-disk-0

# Set other options as needed
qm set VMID --net0 virtio,bridge=vmbr0
qm set VMID --boot order=scsi0   # older releases: --boot c --bootdisk scsi0

# Try to start VM
qm start VMID
```

**3. Recreate Container Configuration**
```bash
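# Container volumes live on the same storage as VM disks, e.g.
# LVM-thin: pve/vm-CTID-disk-0, ZFS: rpool/data/subvol-CTID-disk-0

# WARNING: avoid running `pct create` against the existing volume. It unpacks
# a template and may overwrite the data. Write a minimal config by hand instead
# (adjust arch, ostype, size, and unprivileged to match the original container).
cat > /etc/pve/lxc/CTID.conf << 'EOF'
arch: amd64
hostname: recovered-ct
memory: 2048
ostype: debian
rootfs: local-lvm:vm-CTID-disk-0,size=8G
EOF

# Start container
pct start CTID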
```

## Tools and Commands

### Essential Proxmox Commands

**VM Management:**
```bash
# List all VMs
qm list

# Show VM config
qm config VMID

# Start/stop VM
qm start VMID
qm stop VMID
qm shutdown VMID

# Clone VM
qm clone VMID NEW_VMID --name new-vm-name

# Migrate VM (in cluster)
qm migrate VMID TARGET_NODE
```

**Container Management:**
```bash
# List all containers
pct list

# Show container config
pct config CTID

# Start/stop container
pct start CTID
pct stop CTID
pct shutdown CTID

# Enter container
pct enter CTID
```

**Storage Management:**
```bash
# List storage
pvesm status

# Add storage
pvesm add [type] [storage-id] [options]

# Scan for storage
pvesm scan [type]
```

**Backup/Restore:**
```bash
# Create backup
vzdump VMID --mode snapshot --compress zstd

# Restore backup
qmrestore /path/to/backup.vma.zst NEW_VMID

# List backups
pvesh get /nodes/NODE/storage/STORAGE/content --content backup
```

### Diagnostic Commands

```bash
# Check Proxmox version
pveversion -v

# Check system resources
pvesh get /nodes/NODE/status

# List recent tasks (backups, migrations, restores) on the node
pvesh get /nodes/NODE/tasks

# Check logs
journalctl -u pve-cluster
journalctl -u pvedaemon
journalctl -u pveproxy

# Check disk health
smartctl -a /dev/sdX

# Check network
ip addr
ip route
```

### Recovery Tools

**SystemRescue CD:**
- Boot from SystemRescue ISO
- Access to ZFS, LVM, and filesystem tools
- Can mount and repair Proxmox installations

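**Proxmox Installer ISO:**
- "Rescue Boot" (under Advanced Options) boots an existing installation using the kernel from the ISO
- Debug mode provides a shell without installing
- Useful for repairing the bootloader and configuration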

**TestDisk/PhotoRec:**
- Recover deleted files
- Repair partition tables

## Preventive Measures

### Regular Maintenance

**1. Daily Checks**
```bash
# Check cluster/node status
pvecm status

# Check VM/CT status
qm list
pct list

# Check storage health
pvesm status
```

**2. Weekly Tasks**
```bash
# Update Proxmox
apt update && apt dist-upgrade

# Check for failed systemd services
systemctl --failed

# Review logs
journalctl -p err -b
```

**3. Monthly Tasks**
```bash
# Test backup restore
qmrestore [backup] 999 --storage local-lvm
qm start 999
# Verify VM boots correctly
qm stop 999
qm destroy 999

# Check disk health
for disk in /dev/sd?; do smartctl -H $disk; done

# Start a ZFS scrub (check the result later with: zpool status rpool)
zpool scrub rpool
```

### Backup Best Practices

**1. 3-2-1 Backup Strategy**
- 3 copies of data
- 2 different media types
- 1 off-site copy

**2. Automated Backups**
- Schedule regular VM/CT backups
- Backup Proxmox configuration
- Test restore procedures regularly

**3. Documentation**
- Keep network diagrams updated
- Document IP allocations
- Maintain runbooks for common tasks
- Store documentation off-site

### Monitoring Setup

**1. Setup Email Alerts**
```bash
# Configure postfix for email
apt install postfix

# Test email
echo "Test" | mail -s "Proxmox Alert Test" your@email.com
```

**2. Monitor Resources**
- Set up monitoring for CPU, RAM, disk usage
- Alert on high resource consumption
- Monitor backup job success/failure
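
Monitoring backup job success can start with something as simple as checking the age of the newest archive. The following is a minimal sketch under stated assumptions: zstd-compressed `vzdump-*` archives in a single directory and GNU `find`. The function name `backup_is_fresh` and the 26-hour threshold in the example are illustrative choices.

```bash
# backup_is_fresh DIR MAX_HOURS: succeed if DIR holds at least one vzdump
# archive modified within the last MAX_HOURS hours.
backup_is_fresh() {
    local dir="$1" max_hours="$2"
    [ -n "$(find "$dir" -maxdepth 1 -type f -name 'vzdump-*.zst' \
            -mmin "-$((max_hours * 60))" -print -quit 2>/dev/null)" ]
}

# Example (a daily job gets a little slack beyond 24 hours):
#   backup_is_fresh /var/lib/vz/dump 26 || echo "WARNING: no recent backup found"
```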

**3. Health Checks**
```bash
# Create health check script
cat > /usr/local/bin/health-check.sh << 'EOF'
#!/bin/bash
# Proxmox Health Check

# Check cluster status
if ! pvecm status &>/dev/null; then
    echo "WARNING: Cluster status check failed"
fi

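# Check storage (skip the header line; flag any storage not reported as active)
pvesm status | awk 'NR > 1 && $3 != "active" { print; bad = 1 } END { exit !bad }' \
    && echo "WARNING: Storage issue detected"

# Check for stopped VMs
qm list | grep stopped && echo "INFO: Stopped VMs detected"

# Check system load (awk does the comparison, so bc is not required)
LOAD=$(awk '{print $1}' /proc/loadavg)
if awk -v v="$LOAD" 'BEGIN { exit !(v > 8) }'; then
    echo "WARNING: High system load: $LOAD"
fi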

# Check disk space
df -h | awk '$5 ~ /^9[0-9]%/ || $5 ~ /^100%/ {print "WARNING: Disk space low on " $6 ": " $5}'
EOF

chmod +x /usr/local/bin/health-check.sh

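# Add to crontab (append to the existing entries; piping a single line into
# `crontab -` would replace the whole crontab, including the backup job)
(crontab -l 2>/dev/null; echo "*/15 * * * * /usr/local/bin/health-check.sh | mail -s 'Proxmox Health Alert' your@email.com") | crontab -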
```

## Emergency Contacts

### Proxmox Resources
- Proxmox Forums: https://forum.proxmox.com/
- Proxmox Documentation: https://pve.proxmox.com/pve-docs/
- Proxmox Wiki: https://pve.proxmox.com/wiki/

### Hardware Support
- Document hardware vendor support contacts
- Keep warranty information accessible
- Maintain spare parts inventory

## Recovery Time Objectives

| Scenario | Target Recovery Time | Notes |
|----------|---------------------|-------|
| Single VM restore | 30 minutes | From local backup |
| Complete node rebuild | 4-8 hours | Including OS reinstall |
| ZFS pool recovery | 1-6 hours | Depends on resilvering time |
| Cluster rejoin | 1-2 hours | Network reconfiguration |
| Full disaster recovery | 24-48 hours | From off-site backups |

## Recent Recovery Events

### Event Log Template

- **Date:** YYYY-MM-DD
- **Affected System:** [Proxmox node/VM/CT]
- **Issue:** [Description]
- **Resolution:** [Steps taken]
- **Downtime:** [Duration]
- **Lessons Learned:** [Improvements for next time]

---

**Last Updated:** 2025-12-13
**Version:** 1.0