homelab-docs/infrastructure/DISASTER-RECOVERY.md

# Disaster Recovery Plan

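This document outlines recovery procedures for disaster scenarios affecting the home lab and VPS infrastructure: VPS failure, home network outage, Proxmox node failure, storage (OMV) failure, DNS provider failure, and complete loss of the home lab.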

## Table of Contents
- [Emergency Contact Information](#emergency-contact-information)
- [Recovery Time Objectives](#recovery-time-objectives)
- [Backup Locations](#backup-locations)
- [Disaster Scenarios](#disaster-scenarios)
- [Recovery Procedures](#recovery-procedures)
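- [Post-Recovery Checklist](#post-recovery-checklist)
- [Post-Mortem Template](#post-mortem-template)
- [Testing Schedule](#testing-schedule)
- [Document Maintenance](#document-maintenance)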

---

## Emergency Contact Information

### Primary Contacts
| Role | Name | Phone | Email | Availability |
|------|------|-------|-------|--------------|
| Infrastructure Owner | _____________ | _____________ | _____________ | 24/7 |
| Network Admin | _____________ | _____________ | _____________ | Business Hours |
| Backup Contact | _____________ | _____________ | _____________ | 24/7 |

### Service Provider Contacts
| Provider | Service | Support Number | Account ID | Notes |
|----------|---------|----------------|------------|-------|
| VPS Provider | _____________ | _____________ | _____________ | _____________ |
| DNS Provider | _____________ | _____________ | _____________ | _____________ |
| Domain Registrar | _____________ | _____________ | _____________ | _____________ |
| ISP (Home Lab) | _____________ | _____________ | _____________ | _____________ |

---

## Recovery Time Objectives

Define the acceptable downtime (RTO) and the acceptable window of data loss (RPO) for each service tier:

| Tier | Service Type | RTO (Recovery Time Objective) | RPO (Recovery Point Objective) |
|------|-------------|------------------------------|--------------------------------|
| Critical | Public-facing services, Authentication | 1 hour | 15 minutes |
| Important | Internal services, Databases | 4 hours | 1 hour |
| Standard | Development, Testing | 24 hours | 24 hours |
| Low Priority | Monitoring, Logging | 48 hours | 24 hours |
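
As a rough illustration of how the RPO column can be checked in practice, the sketch below compares the age of the newest backup against a tier's RPO. The function names `rpo_status` and `backup_age_seconds` are hypothetical helpers invented for this example, and GNU `stat` is assumed.

```bash
# rpo_status AGE_SECONDS RPO_SECONDS
# Prints "OK" when the newest backup is within the RPO, otherwise "BREACHED".
rpo_status() {
    age="$1"
    rpo="$2"
    if [ "$age" -le "$rpo" ]; then
        echo "OK"
    else
        echo "BREACHED"
    fi
}

# backup_age_seconds FILE
# Prints the seconds elapsed since FILE was last modified (GNU stat assumed).
backup_age_seconds() {
    now=$(date +%s)
    mtime=$(stat -c %Y "$1")
    echo $((now - mtime))
}
```

For the Important tier (RPO of 1 hour), a check might look like `rpo_status "$(backup_age_seconds /path/to/latest-backup)" 3600`.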

---

## Backup Locations

### Primary Backup Location
- **Location**: _____________ (e.g., OMV storage node, external drive)
- **Path**: _____________
- **Retention**: _____________
- **Access Method**: _____________

### Secondary Backup Location (Off-site)
- **Location**: _____________ (e.g., Cloud storage, remote server)
- **Path**: _____________
- **Retention**: _____________
- **Access Method**: _____________

### Backup Schedule
- **Proxmox VMs/Containers**: Daily at _____
- **Configuration Files**: Weekly on _____
- **Critical Data**: Hourly/Daily (must run at least as often as the tier's RPO, e.g., every 15 minutes for Critical services)
- **Off-site Sync**: Daily/Weekly

### Critical Items to Back Up
- [ ] Proxmox VM/Container configurations and disks
- [ ] Pangolin reverse proxy configurations
- [ ] Gerbil tunnel configurations and keys
- [ ] SSL/TLS certificates and keys
- [ ] SSH keys and authorized_keys files
- [ ] Network configuration files
- [ ] DNS zone files (if self-hosted)
- [ ] Database dumps
- [ ] Application data and configurations
- [ ] Documentation and credentials (encrypted)
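
One possible way to archive a configuration directory from the list above is sketched below. It writes a timestamped tarball plus a SHA-256 checksum file so a later restore can be verified. The function name `archive_with_checksum` and the naming scheme are assumptions for illustration, not part of any existing backup tooling, and the destination directory should live outside the source directory.

```bash
# archive_with_checksum SRC_DIR DEST_DIR NAME
# Creates DEST_DIR/NAME-<timestamp>.tar.gz and a matching .sha256 file,
# then prints the archive path.
archive_with_checksum() {
    src="$1"
    dest="$2"
    name="$3"
    stamp=$(date +%Y%m%d-%H%M%S)
    archive="$dest/$name-$stamp.tar.gz"

    mkdir -p "$dest" || return 1

    # -C keeps paths inside the archive relative to the source directory
    tar -czf "$archive" -C "$src" . || return 1

    # Store the checksum next to the archive, using the bare file name so
    # `sha256sum -c` works from inside the backup directory
    (
        cd "$dest" &&
        sha256sum "$(basename "$archive")" > "$(basename "$archive").sha256"
    ) || return 1

    echo "$archive"
}
```

A call such as `archive_with_checksum /etc/letsencrypt /backups ssl-certs` would cover the certificate item; the paths here are placeholders.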

---

## Disaster Scenarios

### Scenario 1: VPS Complete Failure

**Impact**: All public-facing services down, no external access to home lab services

**Recovery Procedure**:
1. **Immediate Actions (0-15 minutes)**
   - Verify VPS is actually down (ping, SSH, web checks)
   - Contact VPS provider support
   - Check VPS provider status page
   - Notify users if necessary

2. **Short-term Mitigation (15-60 minutes)**
   - If hardware failure, request provider rebuild
   - If account issue, resolve with provider
   - Consider spinning up temporary VPS with another provider

3. **VPS Rebuild (1-4 hours)**
   ```bash
   # On new VPS:

   # 1. Update system
   sudo apt update && sudo apt upgrade -y

   # 2. Install Pangolin
   # [Installation commands]

   # 3. Restore Pangolin configuration from backup
   scp backup-server:/backups/pangolin-config.tar.gz .
   sudo tar -xzf pangolin-config.tar.gz -C /

   # 4. Install Gerbil server
   # [Installation commands]

   # 5. Restore Gerbil configuration
   scp backup-server:/backups/gerbil-config.tar.gz .
   sudo tar -xzf gerbil-config.tar.gz -C /

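   # 6. Restore SSL certificates
   # (archive location assumed to follow the pattern above; adjust to your backup layout)
   scp backup-server:/backups/ssl-certs-backup.tar.gz .
   sudo tar -xzf ssl-certs-backup.tar.gz -C /etc/letsencrypt/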

   # 7. Configure firewall
   sudo ufw allow 22/tcp
   sudo ufw allow 80/tcp
   sudo ufw allow 443/tcp
   # Gerbil tunnels are WireGuard-based, so the tunnel port is UDP
   sudo ufw allow [GERBIL_PORT]/udp
   sudo ufw enable

   # 8. Start services
   sudo systemctl enable --now pangolin
   sudo systemctl enable --now gerbil

   # 9. Update DNS A record to new VPS IP
   # [DNS provider steps]

   # 10. Reconnect Gerbil tunnels from home lab
   # [See Gerbil reconnection below]
   ```

4. **Verification**
   - Test all public routes
   - Verify Gerbil tunnels are connected
   - Check SSL certificates are valid
   - Monitor logs for errors
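
The verification steps above could be scripted along the following lines. This is a minimal sketch: `http_ok`, `check_route`, and `cert_valid_for` are hypothetical helper names, and it assumes `curl` and `openssl` are installed on the machine running the checks.

```bash
# http_ok STATUS_CODE
# Succeeds for 2xx and 3xx responses, fails for anything else
# (including curl's "000" when no response was received).
http_ok() {
    case "$1" in
        2??|3??) return 0 ;;
        *) return 1 ;;
    esac
}

# check_route URL
# Fetches only the HTTP status code and reports PASS or FAIL.
check_route() {
    code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$1")
    if http_ok "$code"; then
        echo "PASS $1 ($code)"
    else
        echo "FAIL $1 ($code)"
        return 1
    fi
}

# cert_valid_for HOST DAYS
# Succeeds if the certificate served on HOST:443 does not expire
# within the next DAYS days.
cert_valid_for() {
    echo | openssl s_client -connect "$1:443" -servername "$1" 2>/dev/null \
        | openssl x509 -noout -checkend $(( $2 * 86400 )) >/dev/null
}
```

For example, `check_route https://domain.com && cert_valid_for domain.com 14` would flag a route that is down or a certificate due to expire within two weeks.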

---

### Scenario 2: Home Lab Network Outage

**Impact**: All home lab services unreachable, Gerbil tunnels down

**Recovery Procedure**:
1. **Immediate Actions (0-15 minutes)**
   - Check router/modem status
   - Verify the ISP is not having an outage
   - Check physical connections
   - Reboot router/modem if necessary

2. **ISP Outage (Variable duration)**
   - Contact ISP support
   - Consider failover to mobile hotspot if critical
   - Notify users of expected downtime

3. **Restore Gerbil Tunnels**
   ```bash
   # On each home lab machine with tunnels:

   # 1. Verify local services are running
   systemctl status [service-name]

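   # 2. Test VPS connectivity (-c 4 stops after four probes)
   ping -c 4 [VPS_IP]

   # 3. Restart Gerbil tunnels
   # (quote the glob so systemd, not the shell, expands the unit pattern)
   sudo systemctl restart 'gerbil-tunnel-*'

   # 4. Verify tunnels are connected
   gerbil status

   # 5. Check logs for errors
   journalctl -u 'gerbil-tunnel-*' -n 50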
   ```
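
After an ISP outage the uplink may come back gradually, so it can help to wait for the VPS to become reachable before restarting the tunnels. The `retry` helper below is a hypothetical sketch of that idea, not part of Gerbil or systemd.

```bash
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# sleeping DELAY_SECONDS between attempts.
retry() {
    max="$1"
    delay="$2"
    shift 2
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        if "$@"; then
            return 0
        fi
        echo "Attempt $attempt/$max failed: $*" >&2
        attempt=$((attempt + 1))
        # Only sleep if another attempt will follow
        if [ "$attempt" -le "$max" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

Used with the commands from this scenario, that might read `retry 30 10 ping -c 1 [VPS_IP] && sudo systemctl restart 'gerbil-tunnel-*'`, which gives the link about five minutes to recover.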

---

### Scenario 3: Proxmox Node Failure (DL380p or i5)

**Impact**: All VMs/containers on failed node are down

**Recovery Procedure**:
1. **Immediate Actions (0-30 minutes)**
   - Identify which node has failed
   - Determine cause (power, hardware, network)
   - Check if other cluster nodes are healthy

2. **If Node Can Be Recovered**
   ```bash
   # Try to boot node
   # If successful, check cluster status:
   pvecm status

   # Check VM/Container status
   qm list
   pct list

   # Start critical VMs/Containers
   qm start VMID
   pct start CTID
   ```

3. **If Node Cannot Be Recovered - Migrate Services**
   ```bash
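   # On working node:

   # 0. If `pvecm status` shows the cluster is no longer quorate (typical
   #    when one node of a two-node cluster is lost), /etc/pve is read-only;
   #    lower the expected votes before restoring
   pvecm expected 1

   # 1. Check available resources
   pvesh get /nodes/NODE/status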

   # 2. Restore VMs from backup to working node
   qmrestore /path/to/backup/vzdump-qemu-VMID.vma.zst NEW_VMID --storage local-lvm

   # 3. Restore containers from backup
   pct restore NEW_CTID /path/to/backup/vzdump-lxc-CTID.tar.zst --storage local-lvm

   # 4. Start restored VMs/containers
   qm start NEW_VMID
   pct start NEW_CTID

   # 5. Update internal DNS/documentation with new IPs if changed
   ```

4. **Resource Constraints**
   - If insufficient resources on remaining node:
     - Prioritize critical services only
     - Consider scaling down VM resources temporarily
     - Plan for hardware replacement/repair
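
To make the prioritization above concrete, the sketch below picks which guests to restore when memory on the surviving node is limited. The input format (`<priority> <vmid> <memory_mb>` per line, lower number meaning more critical) and the name `plan_restores` are assumptions made for this example. It is a simple greedy pass: guests that do not fit are skipped and smaller, lower-priority guests may still be selected.

```bash
# plan_restores AVAILABLE_MB
# Reads "<priority> <vmid> <memory_mb>" lines on stdin and prints the VMIDs
# that fit into AVAILABLE_MB, most critical first.
plan_restores() {
    sort -k1,1n -k2,2n | awk -v avail="$1" '
        NF >= 3 && $3 <= avail {
            print $2
            avail -= $3
        }
    '
}
```

Feeding it a small inventory file, e.g. `plan_restores 8000 < restore-plan.txt`, yields the restore order to use with `qmrestore` or `pct restore`.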

---

### Scenario 4: Storage Node (OMV) Failure

**Impact**: Shared storage unavailable, backups inaccessible, data loss risk

**Recovery Procedure**:
1. **Immediate Actions (0-30 minutes)**
   - Verify storage node is down
   - Check if disks are healthy (if node boots)
   - Identify affected services using shared storage

2. **If Disk Failure**
   - Check RAID status (if configured)
   - Replace failed disk
   - Rebuild RAID array
   - Restore from off-site backup if necessary

3. **If Complete Storage Loss**
   ```bash
   # 1. Rebuild OMV on new hardware/disks
   # [OMV installation]

   # 2. Configure network shares
   # [NFS/CIFS setup]

   # 3. Restore data from off-site backup
   rsync -avz backup-location:/backups/ /mnt/storage/

   # 4. Remount shares on Proxmox nodes
   # Update /etc/fstab on each node
   mount -a

   # 5. Verify Proxmox can access storage
   pvesm status
   ```
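
For the RAID status check under "If Disk Failure", one option is to scan `/proc/mdstat` for arrays with a missing member. This sketch assumes Linux software RAID (mdadm); it does not apply to ZFS or hardware RAID controllers, and `degraded_arrays` is a hypothetical helper name.

```bash
# degraded_arrays [MDSTAT_FILE]
# Prints the name of every md array whose member map (e.g. [U_]) contains
# an underscore, which marks a missing or failed member.
degraded_arrays() {
    awk '
        /^md[0-9a-zA-Z_]* :/ { name = $1 }
        /\[[U_]*_[U_]*\]/    { print name }
    ' "${1:-/proc/mdstat}"
}
```

An empty result means every array reports all members up; any name printed is a candidate for `mdadm --detail /dev/<name>`.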

---

### Scenario 5: DNS Provider Failure

**Impact**: Domain not resolving, all services unreachable by domain name

**Recovery Procedure**:
1. **Immediate Actions (0-15 minutes)**
   - Check DNS provider status page
   - Test DNS resolution: `nslookup domain.com`
   - Verify it's a provider issue, not a configuration error

2. **Short-term Mitigation (15-60 minutes)**
   - Share direct IP addresses with users temporarily
   - Set up temporary DNS using Cloudflare (free tier)

3. **Migrate to New DNS Provider**
   ```bash
   # 1. Export zone file from old provider (if possible)

   # 2. Create account with new DNS provider

   # 3. Import zone file or manually create records:
   # A record: domain.com -> VPS_IP
   # A record: *.domain.com -> VPS_IP (if using wildcard)
   # Other records as needed

   # 4. Update nameservers at domain registrar
   # (Propagation depends on the NS/record TTLs; allow up to 24-48 hours)

   # 5. Monitor DNS propagation
   dig domain.com @8.8.8.8
   ```
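
Checking a single resolver can give a misleading picture, so the monitoring step can be extended to compare several public resolvers against the expected VPS address. The helpers `resolve_a` and `check_propagation` below are hypothetical, and the sketch assumes a plain A record (with a CNAME, `dig +short` prints the alias target first).

```bash
# resolve_a DOMAIN RESOLVER
# Prints the first answer RESOLVER returns for DOMAIN's A record.
resolve_a() {
    dig +short A "$1" "@$2" | head -n 1
}

# check_propagation DOMAIN EXPECTED_IP RESOLVER [RESOLVER...]
# Reports OK or PENDING per resolver; fails if any resolver is still pending.
check_propagation() {
    domain="$1"
    expected="$2"
    shift 2
    pending=0
    for resolver in "$@"; do
        answer=$(resolve_a "$domain" "$resolver")
        if [ "$answer" = "$expected" ]; then
            echo "OK $resolver"
        else
            echo "PENDING $resolver (got: ${answer:-no answer})"
            pending=1
        fi
    done
    return "$pending"
}
```

For instance, `check_propagation domain.com [VPS_IP] 8.8.8.8 1.1.1.1 9.9.9.9` queries Google, Cloudflare, and Quad9 in turn.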

---

### Scenario 6: Complete Data Center Loss (Home Lab)

**Impact**: All home lab infrastructure destroyed (fire, flood, etc.)

**Recovery Procedure**:
1. **Immediate Actions**
   - Ensure safety of personnel
   - Contact insurance provider
   - Assess extent of damage
   - Secure remaining equipment

2. **Short-term (Services that must continue)**
   - Move critical services to VPS temporarily
   - Use cloud providers for temporary hosting
   - Restore from off-site backups

3. **Long-term (Infrastructure Rebuild)**
   - Procure replacement hardware
   - Rebuild Proxmox cluster
   - Restore VMs/containers from off-site backups
   - Reconfigure network
   - Re-establish Gerbil tunnels
   - Full testing and verification

---

## Recovery Procedures

### General Recovery Steps

1. **Assess the Situation**
   - Identify what has failed
   - Determine scope of impact
   - Estimate recovery time

2. **Communicate**
   - Notify affected users
   - Update status page if available
   - Keep stakeholders informed

3. **Prioritize**
   - Focus on critical services first
   - Use RTO/RPO objectives
   - Document decisions

4. **Execute Recovery**
   - Follow specific scenario procedures
   - Document all actions taken
   - Keep logs of commands executed

5. **Verify**
   - Test all restored services
   - Check data integrity
   - Monitor for issues

6. **Document**
   - Record what happened
   - Document what worked/didn't work
   - Update this document with lessons learned
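
For the "keep logs of commands executed" step, a small wrapper can timestamp each command and its exit code as it runs. `log_run` is a hypothetical helper sketched for this document; the log path is whatever the operator chooses.

```bash
# log_run LOGFILE COMMAND [ARGS...]
# Appends a UTC-timestamped RUN line, executes COMMAND, then appends an
# EXIT line with the return code. Returns COMMAND's exit status.
log_run() {
    logfile="$1"
    shift
    printf '%s RUN  %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$logfile"
    rc=0
    "$@" || rc=$?
    printf '%s EXIT %s (rc=%s)\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" "$rc" >> "$logfile"
    return "$rc"
}
```

During a recovery this could be used as `log_run ~/dr-incident.log qm start VMID`, and the resulting file feeds directly into the post-mortem timeline.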

---

## Post-Recovery Checklist

After any disaster recovery, complete the following:

### Immediate Post-Recovery (0-24 hours)
- [ ] All critical services are operational
- [ ] All services are monitored
- [ ] Temporary workarounds documented
- [ ] Incident logged with timeline

### Short-term (1-7 days)
- [ ] All services fully restored
- [ ] Performance is normal
- [ ] Backups are running
- [ ] Security review completed
- [ ] Post-mortem meeting scheduled

### Long-term (1-4 weeks)
- [ ] Post-mortem completed
- [ ] Lessons learned documented
- [ ] Disaster recovery plan updated
- [ ] Preventive measures implemented
- [ ] Training updated if needed
- [ ] Backup/monitoring improvements made

---

## Post-Mortem Template

After each disaster recovery event, complete a post-mortem:

**Incident Date**: _____________
**Recovery Completed**: _____________
**Total Downtime**: _____________

### What Happened?
[Detailed description of the incident]

### Timeline
| Time | Event |
|------|-------|
| _____ | _____ |
| _____ | _____ |

### Root Cause
[What caused the failure?]

### What Went Well?
-
-

### What Went Poorly?
-
-

### Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| _______ | _____ | ________ | ______ |

### Improvements to This Plan
[What should be updated in the disaster recovery plan?]

---

## Testing Schedule

Regular disaster recovery testing ensures procedures work when needed:

| Test Type | Frequency | Last Test | Next Test | Status |
|-----------|-----------|-----------|-----------|--------|
| Backup restore test | Quarterly | _________ | _________ | ______ |
| VPS failover drill | Semi-annually | _________ | _________ | ______ |
| Node failure simulation | Annually | _________ | _________ | ______ |
| Full DR scenario | Annually | _________ | _________ | ______ |
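
A backup restore test can start with a cheap integrity check before any full restore is attempted. The sketch below assumes gzip-compressed tar archives with a `.sha256` file stored alongside each one; both that layout and the name `verify_backup` are assumptions of this example, and it does not cover Proxmox `vzdump` archives.

```bash
# verify_backup ARCHIVE
# Checks that ARCHIVE matches its sibling .sha256 file and that tar can
# list its contents. Prints a one-line verdict.
verify_backup() {
    archive="$1"
    dir=$(dirname "$archive")
    base=$(basename "$archive")

    if [ ! -f "$archive.sha256" ]; then
        echo "MISSING CHECKSUM $base"
        return 1
    fi
    if ! ( cd "$dir" && sha256sum -c "$base.sha256" >/dev/null 2>&1 ); then
        echo "CHECKSUM MISMATCH $base"
        return 1
    fi
    if ! tar -tzf "$archive" >/dev/null 2>&1; then
        echo "UNREADABLE $base"
        return 1
    fi
    echo "OK $base"
}
```

Running it over every archive in the backup directory gives a quick pass/fail list to record in the table above.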

---

## Document Maintenance

**Last Updated**: _____________
**Updated By**: _____________
**Next Review Date**: _____________
**Version**: 1.0