# Disaster Recovery Plan
This document outlines procedures for recovering from various disaster scenarios affecting your infrastructure.
## Table of Contents

- Emergency Contact Information
- Recovery Time Objectives
- Backup Locations
- Disaster Scenarios
- Recovery Procedures
- Post-Recovery Checklist
- Post-Mortem Template
- Testing Schedule
- Document Maintenance
## Emergency Contact Information

### Primary Contacts

| Role | Name | Phone | Availability |
|---|---|---|---|
| Infrastructure Owner | _____________ | _____________ | 24/7 |
| Network Admin | _____________ | _____________ | Business Hours |
| Backup Contact | _____________ | _____________ | 24/7 |
### Service Provider Contacts
| Provider | Service | Support Number | Account ID | Notes |
|---|---|---|---|---|
| VPS Provider | _____________ | _____________ | _____________ | _____________ |
| DNS Provider | _____________ | _____________ | _____________ | _____________ |
| Domain Registrar | _____________ | _____________ | _____________ | _____________ |
| ISP (Home Lab) | _____________ | _____________ | _____________ | _____________ |
## Recovery Time Objectives
Define acceptable downtime for each service tier:
| Tier | Service Type | RTO (Recovery Time Objective) | RPO (Recovery Point Objective) |
|---|---|---|---|
| Critical | Public-facing services, Authentication | 1 hour | 15 minutes |
| Important | Internal services, Databases | 4 hours | 1 hour |
| Standard | Development, Testing | 24 hours | 24 hours |
| Low Priority | Monitoring, Logging | 48 hours | 24 hours |
## Backup Locations

### Primary Backup Location
- Location: _____________ (e.g., OMV storage node, external drive)
- Path: _____________
- Retention: _____________
- Access Method: _____________
### Secondary Backup Location (Off-site)
- Location: _____________ (e.g., Cloud storage, remote server)
- Path: _____________
- Retention: _____________
- Access Method: _____________
### Backup Schedule
- Proxmox VMs/Containers: Daily at _____
- Configuration Files: Weekly on _____
- Critical Data: Hourly/Daily
- Off-site Sync: Daily/Weekly
### Critical Items to Backup
- Proxmox VM/Container configurations and disks
- Pangolin reverse proxy configurations
- Gerbil tunnel configurations and keys
- SSL/TLS certificates and keys
- SSH keys and authorized_keys files
- Network configuration files
- DNS zone files (if self-hosted)
- Database dumps
- Application data and configurations
- Documentation and credentials (encrypted)
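The retention side of the schedule above can be automated. The sketch below shows a minimal pruning helper, assuming vzdump-style file names and a 30-day retention window; the dump path and the `backup-server` host in the comment are placeholders, not part of this plan.

```sh
#!/bin/sh
set -eu

# prune_old_dumps DIR DAYS: delete vzdump archives older than DAYS days.
prune_old_dumps() {
  find "$1" -name 'vzdump-*' -type f -mtime +"$2" -delete
}

# Demo against a scratch directory so the sketch is safe to run as-is;
# in production, point it at the real dump path (e.g. /var/lib/vz/dump).
demo=$(mktemp -d)
touch "$demo/vzdump-qemu-100.vma.zst"                   # fresh dump: kept
touch -d '40 days ago' "$demo/vzdump-qemu-101.vma.zst"  # stale dump: pruned
prune_old_dumps "$demo" 30
ls "$demo"

# The off-site sync from the schedule would follow, e.g. (not executed here):
#   rsync -az --delete /var/lib/vz/dump/ backup-server:/backups/proxmox/
```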
## Disaster Scenarios

### Scenario 1: VPS Complete Failure
**Impact:** All public-facing services down, no external access to home lab services

**Recovery Procedure:**

1. **Immediate Actions (0-15 minutes)**
   - Verify VPS is actually down (ping, SSH, web checks)
   - Contact VPS provider support
   - Check VPS provider status page
   - Notify users if necessary

2. **Short-term Mitigation (15-60 minutes)**
   - If hardware failure, request provider rebuild
   - If account issue, resolve with provider
   - Consider spinning up temporary VPS with another provider

3. **VPS Rebuild (1-4 hours)**

   ```bash
   # On new VPS:
   # 1. Update system
   sudo apt update && sudo apt upgrade -y

   # 2. Install Pangolin
   # [Installation commands]

   # 3. Restore Pangolin configuration from backup
   scp backup-server:/backups/pangolin-config.tar.gz .
   sudo tar -xzf pangolin-config.tar.gz -C /

   # 4. Install Gerbil server
   # [Installation commands]

   # 5. Restore Gerbil configuration
   scp backup-server:/backups/gerbil-config.tar.gz .
   sudo tar -xzf gerbil-config.tar.gz -C /

   # 6. Restore SSL certificates
   sudo tar -xzf ssl-certs-backup.tar.gz -C /etc/letsencrypt/

   # 7. Configure firewall
   sudo ufw allow 22/tcp
   sudo ufw allow 80/tcp
   sudo ufw allow 443/tcp
   sudo ufw allow [GERBIL_PORT]/tcp
   sudo ufw enable

   # 8. Start services
   sudo systemctl enable --now pangolin
   sudo systemctl enable --now gerbil

   # 9. Update DNS A record to new VPS IP
   # [DNS provider steps]

   # 10. Reconnect Gerbil tunnels from home lab
   # [See Gerbil reconnection below]
   ```

4. **Verification**
   - Test all public routes
   - Verify Gerbil tunnels are connected
   - Check SSL certificates are valid
   - Monitor logs for errors
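For the certificate check in the verification step, a small helper that turns an expiry date into days remaining makes "valid" measurable. This is a sketch assuming GNU `date`; the `openssl` pipeline in the comment shows where the live expiry string would come from, while the demo uses a synthetic date so it runs offline.

```sh
#!/bin/sh
set -eu

# days_until EXPIRY: days from now until an expiry date string (GNU date).
days_until() {
  expiry=$(date -d "$1" +%s)
  now=$(date +%s)
  echo $(( (expiry - now) / 86400 ))
}

# Live usage would parse the certificate served by the rebuilt VPS, e.g.:
#   openssl s_client -connect domain.com:443 -servername domain.com </dev/null \
#     | openssl x509 -noout -enddate      # prints: notAfter=<date>
# Demo with a synthetic date one year out so the sketch runs offline:
future=$(date -d '365 days' '+%b %d %H:%M:%S %Y GMT')
echo "certificate valid for $(days_until "$future") more days"
```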
### Scenario 2: Home Lab Network Outage

**Impact:** All home lab services unreachable, Gerbil tunnels down

**Recovery Procedure:**

1. **Immediate Actions (0-15 minutes)**
   - Check router/modem status
   - Verify ISP is not having an outage
   - Check physical connections
   - Reboot router/modem if necessary

2. **ISP Outage (variable duration)**
   - Contact ISP support
   - Consider failover to mobile hotspot if critical
   - Notify users of expected downtime

3. **Restore Gerbil Tunnels**

   ```bash
   # On each home lab machine with tunnels:
   # 1. Verify local services are running
   systemctl status [service-name]

   # 2. Test VPS connectivity
   ping [VPS_IP]

   # 3. Restart Gerbil tunnels
   sudo systemctl restart gerbil-tunnel-*

   # 4. Verify tunnels are connected
   gerbil status

   # 5. Check logs for errors
   journalctl -u gerbil-tunnel-* -n 50
   ```
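The tunnel-restart step can be wrapped in a watchdog so tunnels recover without manual action once connectivity returns. This is a sketch, not part of Gerbil itself: the health check and restart commands are injected as strings, and the real-usage comment reuses the unit naming from the steps above as an assumption. The demo calls use stand-in commands so the sketch runs anywhere.

```sh
#!/bin/sh
set -eu

# ensure_tunnel CHECK RESTART: run RESTART only when the health CHECK fails.
# Cron this every few minutes so tunnels come back automatically.
ensure_tunnel() {
  if sh -c "$1" >/dev/null 2>&1; then
    echo "tunnel up"
  else
    sh -c "$2"
    echo "tunnel restarted"
  fi
}

# Real usage (unit name taken from the restart step above; VPS_IP is yours):
#   ensure_tunnel "ping -c1 -W2 VPS_IP" "systemctl restart 'gerbil-tunnel-*'"
# Demo with stand-in commands so the sketch runs anywhere:
ensure_tunnel "false" "true"   # failing check triggers the restart path
ensure_tunnel "true" ":"       # healthy check leaves the tunnel alone
```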
### Scenario 3: Proxmox Node Failure (DL380p or i5)

**Impact:** All VMs/containers on failed node are down

**Recovery Procedure:**

1. **Immediate Actions (0-30 minutes)**
   - Identify which node has failed
   - Determine cause (power, hardware, network)
   - Check if other cluster nodes are healthy

2. **If Node Can Be Recovered**

   ```bash
   # Try to boot node
   # If successful, check cluster status:
   pvecm status

   # Check VM/Container status
   qm list
   pct list

   # Start critical VMs/Containers
   qm start VMID
   pct start CTID
   ```

3. **If Node Cannot Be Recovered - Migrate Services**

   ```bash
   # On working node:
   # 1. Check available resources
   pvesh get /nodes/NODE/status

   # 2. Restore VMs from backup to working node
   qmrestore /path/to/backup/vzdump-qemu-VMID.vma.zst NEW_VMID --storage local-lvm

   # 3. Restore containers from backup
   pct restore NEW_CTID /path/to/backup/vzdump-lxc-CTID.tar.zst --storage local-lvm

   # 4. Start restored VMs/containers
   qm start NEW_VMID
   pct start NEW_CTID

   # 5. Update internal DNS/documentation with new IPs if changed
   ```

4. **Resource Constraints**
   - If insufficient resources on remaining node:
     - Prioritize critical services only
     - Consider scaling down VM resources temporarily
     - Plan for hardware replacement/repair
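When several guests must be restored, the per-archive commands above can be generated from the standard `vzdump-<type>-<id>-...` file names. The sketch below prints the commands instead of executing them (so it is safe to run and review anywhere); the archive paths are examples.

```sh
#!/bin/sh
set -eu

# restore_cmd ARCHIVE: print the Proxmox restore command for a vzdump file,
# deriving the VM/CT ID from the vzdump-<type>-<id>-... file name.
restore_cmd() {
  case "$1" in
    *vzdump-qemu-*)
      id=${1##*vzdump-qemu-}; id=${id%%-*}
      echo "qmrestore $1 $id --storage local-lvm" ;;
    *vzdump-lxc-*)
      id=${1##*vzdump-lxc-}; id=${id%%-*}
      echo "pct restore $id $1 --storage local-lvm" ;;
    *)
      echo "skip: $1 is not a vzdump archive" ;;
  esac
}

restore_cmd /backups/vzdump-qemu-100-2024_01_01-00_00_00.vma.zst
restore_cmd /backups/vzdump-lxc-200-2024_01_01-00_00_00.tar.zst
```

Piping each printed line through `sh` (after review) turns the dry run into the actual restore.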
### Scenario 4: Storage Node (OMV) Failure

**Impact:** Shared storage unavailable, backups inaccessible, data loss risk

**Recovery Procedure:**

1. **Immediate Actions (0-30 minutes)**
   - Verify storage node is down
   - Check if disks are healthy (if node boots)
   - Identify affected services using shared storage

2. **If Disk Failure**
   - Check RAID status (if configured)
   - Replace failed disk
   - Rebuild RAID array
   - Restore from off-site backup if necessary

3. **If Complete Storage Loss**

   ```bash
   # 1. Rebuild OMV on new hardware/disks
   # [OMV installation]

   # 2. Configure network shares
   # [NFS/CIFS setup]

   # 3. Restore data from off-site backup
   rsync -avz backup-location:/backups/ /mnt/storage/

   # 4. Remount shares on Proxmox nodes
   # Update /etc/fstab on each node
   mount -a

   # 5. Verify Proxmox can access storage
   pvesm status
   ```
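After an rsync restore, it is worth spot-checking that restored files actually match the off-site copies. A minimal sketch: compare one file's SHA-256 across the two trees. In a real recovery the two roots would be the backup mount and `/mnt/storage`; the demo uses scratch directories so it runs anywhere.

```sh
#!/bin/sh
set -eu

# verify_restore SRC DST FILE: compare one file's sha256 across two trees.
verify_restore() {
  a=$(sha256sum "$1/$3" | cut -d' ' -f1)
  b=$(sha256sum "$2/$3" | cut -d' ' -f1)
  if [ "$a" = "$b" ]; then echo "OK $3"; else echo "MISMATCH $3"; fi
}

# Demo with scratch directories standing in for backup mount and /mnt/storage:
src=$(mktemp -d); dst=$(mktemp -d)
echo "application data" > "$src/app.conf"
cp "$src/app.conf" "$dst/app.conf"
verify_restore "$src" "$dst" app.conf
```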
### Scenario 5: DNS Provider Failure

**Impact:** Domain not resolving, all services unreachable by domain name

**Recovery Procedure:**

1. **Immediate Actions (0-15 minutes)**
   - Check DNS provider status page
   - Test DNS resolution: `nslookup domain.com`
   - Verify it's a provider issue, not a configuration issue

2. **Short-term Mitigation (15-60 minutes)**
   - Share direct IP addresses with users temporarily
   - Set up temporary DNS using Cloudflare (free tier)

3. **Migrate to New DNS Provider**

   ```bash
   # 1. Export zone file from old provider (if possible)

   # 2. Create account with new DNS provider

   # 3. Import zone file or manually create records:
   #    A record: domain.com -> VPS_IP
   #    A record: *.domain.com -> VPS_IP (if using wildcard)
   #    Other records as needed

   # 4. Update nameservers at domain registrar
   #    (Propagation takes 24-48 hours)

   # 5. Monitor DNS propagation
   dig domain.com @8.8.8.8
   ```
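The single `dig` check in step 5 extends naturally to polling several public resolvers during the 24-48 hour propagation window. This sketch keeps the lookup in an overridable function; the demo stubs it out so the script runs without network access, and the resolver IPs and example domain are placeholders.

```sh
#!/bin/sh
set -eu

# lookup DOMAIN RESOLVER: first A-record answer; override for dry runs.
lookup() { dig +short "$1" @"$2" | head -n1; }

# check_propagation DOMAIN WANT RESOLVER...: report which resolvers
# already serve the new VPS IP after the nameserver change.
check_propagation() {
  domain=$1; want=$2; shift 2
  for ns in "$@"; do
    got=$(lookup "$domain" "$ns")
    if [ "$got" = "$want" ]; then
      echo "$ns OK"
    else
      echo "$ns still serves ${got:-nothing}"
    fi
  done
}

# Real usage: check_propagation domain.com NEW_VPS_IP 8.8.8.8 1.1.1.1 9.9.9.9
# Demo with a stubbed lookup so the sketch runs without network access:
lookup() { echo 203.0.113.10; }
check_propagation domain.com 203.0.113.10 8.8.8.8 1.1.1.1
```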
### Scenario 6: Complete Data Center Loss (Home Lab)

**Impact:** All home lab infrastructure destroyed (fire, flood, etc.)

**Recovery Procedure:**

1. **Immediate Actions**
   - Ensure safety of personnel
   - Contact insurance provider
   - Assess extent of damage
   - Secure remaining equipment

2. **Short-term (Services That Must Continue)**
   - Move critical services to VPS temporarily
   - Use cloud providers for temporary hosting
   - Restore from off-site backups

3. **Long-term (Infrastructure Rebuild)**
   - Procure replacement hardware
   - Rebuild Proxmox cluster
   - Restore VMs/containers from off-site backups
   - Reconfigure network
   - Re-establish Gerbil tunnels
   - Full testing and verification
## Recovery Procedures

### General Recovery Steps

1. **Assess the Situation**
   - Identify what has failed
   - Determine scope of impact
   - Estimate recovery time

2. **Communicate**
   - Notify affected users
   - Update status page if available
   - Keep stakeholders informed

3. **Prioritize**
   - Focus on critical services first
   - Use RTO/RPO objectives
   - Document decisions

4. **Execute Recovery**
   - Follow specific scenario procedures
   - Document all actions taken
   - Keep logs of commands executed

5. **Verify**
   - Test all restored services
   - Check data integrity
   - Monitor for issues

6. **Document**
   - Record what happened
   - Document what worked/didn't work
   - Update this document with lessons learned
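The "keep logs of commands executed" and "record what happened" steps are easier if every action is timestamped as it happens. A minimal sketch of such a helper follows; the log path here is a temp file for the demo, and a real incident would use a durable location of your choosing.

```sh
#!/bin/sh
set -eu

# log_action MESSAGE: append a UTC-timestamped entry to the incident log,
# so the post-mortem timeline can be reconstructed from it afterwards.
LOG=$(mktemp)   # demo path; use a durable location in a real incident
log_action() {
  printf '%s %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$*" >> "$LOG"
}

log_action "VPS confirmed down; opened ticket with provider"
log_action "Started configuration restore"
cat "$LOG"
```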
## Post-Recovery Checklist

After any disaster recovery, complete the following:

### Immediate Post-Recovery (0-24 hours)

- [ ] All critical services are operational
- [ ] All services are monitored
- [ ] Temporary workarounds documented
- [ ] Incident logged with timeline

### Short-term (1-7 days)

- [ ] All services fully restored
- [ ] Performance is normal
- [ ] Backups are running
- [ ] Security review completed
- [ ] Post-mortem meeting scheduled

### Long-term (1-4 weeks)

- [ ] Post-mortem completed
- [ ] Lessons learned documented
- [ ] Disaster recovery plan updated
- [ ] Preventive measures implemented
- [ ] Training updated if needed
- [ ] Backup/monitoring improvements made
## Post-Mortem Template

After each disaster recovery event, complete a post-mortem:

- **Incident Date:** _____________
- **Recovery Completed:** _____________
- **Total Downtime:** _____________

### What Happened?

[Detailed description of the incident]

### Timeline

| Time | Event |
|---|---|
| _____ | _____ |
| _____ | _____ |

### Root Cause

[What caused the failure?]

### What Went Well?

### What Went Poorly?

### Action Items

| Action | Owner | Due Date | Status |
|---|---|---|---|
| _______ | _____ | ________ | ______ |

### Improvements to This Plan

[What should be updated in the disaster recovery plan?]
## Testing Schedule
Regular disaster recovery testing ensures procedures work when needed:
| Test Type | Frequency | Last Test | Next Test | Status |
|---|---|---|---|---|
| Backup restore test | Quarterly | _________ | _________ | ______ |
| VPS failover drill | Semi-annually | _________ | _________ | ______ |
| Node failure simulation | Annually | _________ | _________ | ______ |
| Full DR scenario | Annually | _________ | _________ | ______ |
## Document Maintenance

- **Last Updated:** _____________
- **Updated By:** _____________
- **Next Review Date:** _____________
- **Version:** 1.0