# Disaster Recovery Plan

This document outlines procedures for recovering from various disaster scenarios affecting your infrastructure.

## Table of Contents

- [Emergency Contact Information](#emergency-contact-information)
- [Recovery Time Objectives](#recovery-time-objectives)
- [Backup Locations](#backup-locations)
- [Disaster Scenarios](#disaster-scenarios)
- [Recovery Procedures](#recovery-procedures)
- [Post-Recovery Checklist](#post-recovery-checklist)
- [Post-Mortem Template](#post-mortem-template)
- [Testing Schedule](#testing-schedule)
- [Document Maintenance](#document-maintenance)

---

## Emergency Contact Information

### Primary Contacts

| Role | Name | Phone | Email | Availability |
|------|------|-------|-------|--------------|
| Infrastructure Owner | _____________ | _____________ | _____________ | 24/7 |
| Network Admin | _____________ | _____________ | _____________ | Business Hours |
| Backup Contact | _____________ | _____________ | _____________ | 24/7 |

### Service Provider Contacts

| Provider | Service | Support Number | Account ID | Notes |
|----------|---------|----------------|------------|-------|
| VPS Provider | _____________ | _____________ | _____________ | _____________ |
| DNS Provider | _____________ | _____________ | _____________ | _____________ |
| Domain Registrar | _____________ | _____________ | _____________ | _____________ |
| ISP (Home Lab) | _____________ | _____________ | _____________ | _____________ |

---

## Recovery Time Objectives

Define acceptable downtime for each service tier:

| Tier | Service Type | RTO (Recovery Time Objective) | RPO (Recovery Point Objective) |
|------|--------------|-------------------------------|--------------------------------|
| Critical | Public-facing services, Authentication | 1 hour | 15 minutes |
| Important | Internal services, Databases | 4 hours | 1 hour |
| Standard | Development, Testing | 24 hours | 24 hours |
| Low Priority | Monitoring, Logging | 48 hours | 24 hours |

---

## Backup Locations

### Primary Backup Location

- **Location**: _____________ (e.g., OMV storage node, external drive)
- **Path**: _____________
- **Retention**: _____________
- **Access Method**: _____________

### Secondary Backup Location (Off-site)

- **Location**: _____________ (e.g., cloud storage, remote server)
- **Path**: _____________
- **Retention**: _____________
- **Access Method**: _____________

### Backup Schedule

- **Proxmox VMs/Containers**: Daily at _____
- **Configuration Files**: Weekly on _____
- **Critical Data**: Hourly/Daily
- **Off-site Sync**: Daily/Weekly

### Critical Items to Back Up

- [ ] Proxmox VM/Container configurations and disks
- [ ] Pangolin reverse proxy configurations
- [ ] Gerbil tunnel configurations and keys
- [ ] SSL/TLS certificates and keys
- [ ] SSH keys and authorized_keys files
- [ ] Network configuration files
- [ ] DNS zone files (if self-hosted)
- [ ] Database dumps
- [ ] Application data and configurations
- [ ] Documentation and credentials (encrypted)

---

## Disaster Scenarios

### Scenario 1: VPS Complete Failure

**Impact**: All public-facing services down; no external access to home lab services.

**Recovery Procedure**:

1. **Immediate Actions (0-15 minutes)**
   - Verify the VPS is actually down (ping, SSH, web checks)
   - Contact VPS provider support
   - Check the VPS provider's status page
   - Notify users if necessary

2. **Short-term Mitigation (15-60 minutes)**
   - If it is a hardware failure, request a rebuild from the provider
   - If it is an account issue, resolve it with the provider
   - Consider spinning up a temporary VPS with another provider

3. **VPS Rebuild (1-4 hours)**

   ```bash
   # On the new VPS:

   # 1. Update the system
   sudo apt update && sudo apt upgrade -y

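   # Optional sanity check (a sketch; assumes the same backup-server host and
   # archive paths used in the restore steps below, adjust to your environment).
   # Confirm the new VPS can reach the backup server and that the expected
   # archives exist before installing anything:
   ssh backup-server 'ls -lh /backups/pangolin-config.tar.gz /backups/gerbil-config.tar.gz'
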
   # 2. Install Pangolin
   # [Installation commands]

   # 3. Restore Pangolin configuration from backup
   scp backup-server:/backups/pangolin-config.tar.gz .
   sudo tar -xzf pangolin-config.tar.gz -C /

   # 4. Install Gerbil server
   # [Installation commands]

   # 5. Restore Gerbil configuration
   scp backup-server:/backups/gerbil-config.tar.gz .
   sudo tar -xzf gerbil-config.tar.gz -C /

   # 6. Restore SSL certificates
   sudo tar -xzf ssl-certs-backup.tar.gz -C /etc/letsencrypt/

   # 7. Configure the firewall
   sudo ufw allow 22/tcp
   sudo ufw allow 80/tcp
   sudo ufw allow 443/tcp
   sudo ufw allow [GERBIL_PORT]/tcp
   sudo ufw enable

   # 8. Start services
   sudo systemctl enable --now pangolin
   sudo systemctl enable --now gerbil

   # 9. Update the DNS A record to the new VPS IP
   # [DNS provider steps]

   # 10. Reconnect Gerbil tunnels from the home lab
   # [See "Restore Gerbil Tunnels" under Scenario 2 below]
   ```

4. **Verification**
   - Test all public routes
   - Verify Gerbil tunnels are connected
   - Check that SSL certificates are valid
   - Monitor logs for errors

---

### Scenario 2: Home Lab Network Outage

**Impact**: All home lab services unreachable; Gerbil tunnels down.

**Recovery Procedure**:

1. **Immediate Actions (0-15 minutes)**
   - Check router/modem status
   - Verify the ISP is not having an outage
   - Check physical connections
   - Reboot the router/modem if necessary

2. **ISP Outage (Variable duration)**
   - Contact ISP support
   - Consider failover to a mobile hotspot if critical
   - Notify users of expected downtime

3. **Restore Gerbil Tunnels**

   ```bash
   # On each home lab machine with tunnels:

   # 1. Verify local services are running
   systemctl status [service-name]

   # 2. Test VPS connectivity
   ping [VPS_IP]

   # 3. Restart Gerbil tunnels
   sudo systemctl restart gerbil-tunnel-*

   # 4. Verify tunnels are connected
   gerbil status

   # 5. Check logs for errors
   journalctl -u gerbil-tunnel-* -n 50
   ```

---

### Scenario 3: Proxmox Node Failure (DL380p or i5)

**Impact**: All VMs/containers on the failed node are down.

**Recovery Procedure**:

1. **Immediate Actions (0-30 minutes)**
   - Identify which node has failed
   - Determine the cause (power, hardware, network)
   - Check whether the other cluster nodes are healthy

2. **If the Node Can Be Recovered**

   ```bash
   # Try to boot the node. If successful, check cluster status:
   pvecm status

   # Check VM/container status
   qm list
   pct list

   # Start critical VMs/containers
   qm start VMID
   pct start CTID
   ```

3. **If the Node Cannot Be Recovered: Migrate Services**

   ```bash
   # On a working node:

   # 1. Check available resources
   pvesh get /nodes/NODE/status

   # 2. Restore VMs from backup to the working node
   qmrestore /path/to/backup/vzdump-qemu-VMID.vma.zst NEW_VMID --storage local-lvm

   # 3. Restore containers from backup
   pct restore NEW_CTID /path/to/backup/vzdump-lxc-CTID.tar.zst --storage local-lvm

   # 4. Start restored VMs/containers
   qm start NEW_VMID
   pct start NEW_CTID

   # 5. Update internal DNS/documentation with new IPs if they changed
   ```

4. **Resource Constraints**
   - If there are insufficient resources on the remaining node:
     - Prioritize critical services only
     - Consider scaling down VM resources temporarily
     - Plan for hardware replacement/repair

---

### Scenario 4: Storage Node (OMV) Failure

**Impact**: Shared storage unavailable, backups inaccessible, risk of data loss.

**Recovery Procedure**:

1. **Immediate Actions (0-30 minutes)**
   - Verify the storage node is down
   - Check whether the disks are healthy (if the node boots)
   - Identify affected services that use the shared storage

2. **If Disk Failure**
   - Check RAID status if configured (a quick health-check sketch follows this list)
   - Replace the failed disk
   - Rebuild the RAID array
   - Restore from the off-site backup if necessary

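   A minimal health-check sketch for the first bullet, assuming the array is Linux software RAID (mdadm) and smartmontools is installed; `/dev/md0` and `/dev/sda` are placeholder device names to adjust for your hardware.

   ```bash
   # Overall software RAID state (shows degraded or rebuilding arrays)
   cat /proc/mdstat

   # Per-array detail; /dev/md0 is a placeholder array name
   sudo mdadm --detail /dev/md0

   # SMART health summary for an individual disk; /dev/sda is a placeholder
   sudo smartctl -H /dev/sda
   ```
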
3. **If Complete Storage Loss**

   ```bash
   # 1. Rebuild OMV on new hardware/disks
   # [OMV installation]

   # 2. Configure network shares
   # [NFS/CIFS setup]

   # 3. Restore data from the off-site backup
   rsync -avz backup-location:/backups/ /mnt/storage/

   # 4. Remount shares on the Proxmox nodes
   # Update /etc/fstab on each node, then:
   mount -a

   # 5. Verify Proxmox can access the storage
   pvesm status
   ```

---

### Scenario 5: DNS Provider Failure

**Impact**: Domain not resolving; all services unreachable by domain name.

**Recovery Procedure**:

1. **Immediate Actions (0-15 minutes)**
   - Check the DNS provider's status page
   - Test DNS resolution: `nslookup domain.com`
   - Verify it is a provider issue, not a configuration problem

2. **Short-term Mitigation (15-60 minutes)**
   - Share direct IP addresses with users temporarily
   - Set up temporary DNS using Cloudflare (free tier)

3. **Migrate to a New DNS Provider**

   ```bash
   # 1. Export the zone file from the old provider (if possible)

   # 2. Create an account with the new DNS provider

   # 3. Import the zone file or manually create records:
   #    A record: domain.com -> VPS_IP
   #    A record: *.domain.com -> VPS_IP (if using a wildcard)
   #    Other records as needed

   # 4. Update nameservers at the domain registrar
   #    (Propagation can take 24-48 hours)

   # 5. Monitor DNS propagation
   dig domain.com @8.8.8.8
   ```

---

### Scenario 6: Complete Data Center Loss (Home Lab)

**Impact**: All home lab infrastructure destroyed (fire, flood, etc.).

**Recovery Procedure**:

1. **Immediate Actions**
   - Ensure the safety of personnel
   - Contact the insurance provider
   - Assess the extent of damage
   - Secure remaining equipment

2. **Short-term (Services that must continue)**
   - Move critical services to the VPS temporarily
   - Use cloud providers for temporary hosting
   - Restore from off-site backups

3. **Long-term (Infrastructure Rebuild)**
   - Procure replacement hardware
   - Rebuild the Proxmox cluster
   - Restore VMs/containers from off-site backups
   - Reconfigure the network
   - Re-establish Gerbil tunnels
   - Full testing and verification

---

## Recovery Procedures

### General Recovery Steps

1. **Assess the Situation**
   - Identify what has failed
   - Determine the scope of impact
   - Estimate recovery time

2. **Communicate**
   - Notify affected users
   - Update the status page if available
   - Keep stakeholders informed

3. **Prioritize**
   - Focus on critical services first
   - Use the RTO/RPO objectives
   - Document decisions

4. **Execute Recovery**
   - Follow the specific scenario procedures
   - Document all actions taken
   - Keep logs of commands executed

5. **Verify**
   - Test all restored services
   - Check data integrity
   - Monitor for issues

6. **Document**
   - Record what happened
   - Document what worked and what didn't
   - Update this document with lessons learned

---

## Post-Recovery Checklist

After any disaster recovery, complete the following:

### Immediate Post-Recovery (0-24 hours)

- [ ] All critical services are operational (see the verification sketch at the end of this section)
- [ ] All services are monitored
- [ ] Temporary workarounds documented
- [ ] Incident logged with a timeline

### Short-term (1-7 days)

- [ ] All services fully restored
- [ ] Performance is normal
- [ ] Backups are running
- [ ] Security review completed
- [ ] Post-mortem meeting scheduled

### Long-term (1-4 weeks)

- [ ] Post-mortem completed
- [ ] Lessons learned documented
- [ ] Disaster recovery plan updated
- [ ] Preventive measures implemented
- [ ] Training updated if needed
- [ ] Backup/monitoring improvements made

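The following is a minimal verification sketch for the first checklist item, assuming a plain bash environment with `curl` available; the URLs in `SERVICES` are placeholders to replace with your real endpoints.

```bash
#!/usr/bin/env bash
# Post-recovery smoke test: confirm each critical endpoint responds over HTTPS.
# The hostnames below are placeholders; substitute your own service URLs.
set -u

SERVICES=(
  "https://domain.com/"
  "https://service1.domain.com/"
)

for url in "${SERVICES[@]}"; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
  if [ "$code" = "200" ] || [ "$code" = "302" ]; then
    echo "OK   $url (HTTP $code)"
  else
    echo "FAIL $url (HTTP $code)"
  fi
done
```
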
---

## Post-Mortem Template

After each disaster recovery event, complete a post-mortem:

**Incident Date**: _____________
**Recovery Completed**: _____________
**Total Downtime**: _____________

### What Happened?

[Detailed description of the incident]

### Timeline

| Time | Event |
|------|-------|
| _____ | _____ |
| _____ | _____ |

### Root Cause

[What caused the failure?]

### What Went Well?

-
-

### What Went Poorly?

-
-

### Action Items

| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| _______ | _____ | ________ | ______ |

### Improvements to This Plan

[What should be updated in the disaster recovery plan?]

---

## Testing Schedule

Regular disaster recovery testing ensures procedures work when needed:

| Test Type | Frequency | Last Test | Next Test | Status |
|-----------|-----------|-----------|-----------|--------|
| Backup restore test | Quarterly | _________ | _________ | ______ |
| VPS failover drill | Semi-annually | _________ | _________ | ______ |
| Node failure simulation | Annually | _________ | _________ | ______ |
| Full DR scenario | Annually | _________ | _________ | ______ |

---

## Document Maintenance

**Last Updated**: _____________
**Updated By**: _____________
**Next Review Date**: _____________
**Version**: 1.0