# Disaster Recovery Plan

This document outlines procedures for recovering from various disaster scenarios affecting your infrastructure.

## Table of Contents

- [Emergency Contact Information](#emergency-contact-information)
- [Recovery Time Objectives](#recovery-time-objectives)
- [Backup Locations](#backup-locations)
- [Disaster Scenarios](#disaster-scenarios)
- [Recovery Procedures](#recovery-procedures)
- [Post-Recovery Checklist](#post-recovery-checklist)

---

## Emergency Contact Information

### Primary Contacts

| Role | Name | Phone | Email | Availability |
|------|------|-------|-------|--------------|
| Infrastructure Owner | _____________ | _____________ | _____________ | 24/7 |
| Network Admin | _____________ | _____________ | _____________ | Business Hours |
| Backup Contact | _____________ | _____________ | _____________ | 24/7 |

### Service Provider Contacts

| Provider | Service | Support Number | Account ID | Notes |
|----------|---------|----------------|------------|-------|
| VPS Provider | _____________ | _____________ | _____________ | _____________ |
| DNS Provider | _____________ | _____________ | _____________ | _____________ |
| Domain Registrar | _____________ | _____________ | _____________ | _____________ |
| ISP (Home Lab) | _____________ | _____________ | _____________ | _____________ |

---

## Recovery Time Objectives

Define acceptable downtime for each service tier:

| Tier | Service Type | RTO (Recovery Time Objective) | RPO (Recovery Point Objective) |
|------|--------------|-------------------------------|--------------------------------|
| Critical | Public-facing services, Authentication | 1 hour | 15 minutes |
| Important | Internal services, Databases | 4 hours | 1 hour |
| Standard | Development, Testing | 24 hours | 24 hours |
| Low Priority | Monitoring, Logging | 48 hours | 24 hours |

---

## Backup Locations

### Primary Backup Location

- **Location**: _____________ (e.g., OMV storage node, external drive)
- **Path**: _____________
- **Retention**: _____________
- **Access Method**: _____________

### Secondary Backup Location (Off-site)

- **Location**: _____________ (e.g., cloud storage, remote server)
- **Path**: _____________
- **Retention**: _____________
- **Access Method**: _____________

### Backup Schedule

- **Proxmox VMs/Containers**: Daily at _____
- **Configuration Files**: Weekly on _____
- **Critical Data**: Hourly/Daily
- **Off-site Sync**: Daily/Weekly
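
Once the blanks above are filled in, the schedule can be pinned down as a cron file on the backup host. A sketch only: the times, paths, and storage name below are placeholders to adapt, not a recommended configuration.

```
# /etc/cron.d/homelab-backups - example schedule (placeholder times/paths)
# m  h  dom mon dow  user  command
30 2   *   *   *    root  vzdump --all --mode snapshot --storage backup-nfs
0  3   *   *   0    root  tar -czf /mnt/backups/configs-$(date +\%F).tar.gz /etc
0  4   *   *   *    root  rsync -az /mnt/backups/ offsite:/backups/homelab/
```

Note the escaped `\%` in the `date` format: unescaped `%` is treated as a newline inside crontab command fields.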

### Critical Items to Backup

- [ ] Proxmox VM/Container configurations and disks
- [ ] Pangolin reverse proxy configurations
- [ ] Gerbil tunnel configurations and keys
- [ ] SSL/TLS certificates and keys
- [ ] SSH keys and authorized_keys files
- [ ] Network configuration files
- [ ] DNS zone files (if self-hosted)
- [ ] Database dumps
- [ ] Application data and configurations
- [ ] Documentation and credentials (encrypted)
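
For the file-based items above, one way to feed the off-site sync is a small helper that archives a list of paths into a dated tarball. This is a sketch with placeholder paths and hostnames, not a tested backup system:

```bash
#!/usr/bin/env bash
# Sketch: archive a list of config paths (given relative to /) into a
# dated tarball under a destination directory. All paths are examples.
set -euo pipefail

backup_configs() {
    local dest_dir="$1"; shift
    local archive="${dest_dir}/configs-$(date +%F).tar.gz"
    # -C / keeps stored paths relative to the filesystem root, so a
    # restore is simply: sudo tar -xzf <archive> -C /
    tar -czf "$archive" -C / "$@"
    echo "$archive"
}

# Example invocation (placeholder paths and backup host):
# archive=$(backup_configs /mnt/backups etc/pangolin etc/gerbil etc/letsencrypt)
# rsync -avz "$archive" backup-server:/backups/offsite/
```

Archiving relative to `/` is a deliberate choice here: it makes restores position-independent, at the cost of requiring the paths to be passed without their leading slash.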

---

## Disaster Scenarios

### Scenario 1: VPS Complete Failure

**Impact**: All public-facing services down, no external access to home lab services

**Recovery Procedure**:

1. **Immediate Actions (0-15 minutes)**
   - Verify the VPS is actually down (ping, SSH, web checks)
   - Contact VPS provider support
   - Check the VPS provider's status page
   - Notify users if necessary

2. **Short-term Mitigation (15-60 minutes)**
   - If it is a hardware failure, request a rebuild from the provider
   - If it is an account issue, resolve it with the provider
   - Consider spinning up a temporary VPS with another provider

3. **VPS Rebuild (1-4 hours)**

   ```bash
   # On new VPS:

   # 1. Update system
   sudo apt update && sudo apt upgrade -y

   # 2. Install Pangolin
   # [Installation commands]

   # 3. Restore Pangolin configuration from backup
   scp backup-server:/backups/pangolin-config.tar.gz .
   sudo tar -xzf pangolin-config.tar.gz -C /

   # 4. Install Gerbil server
   # [Installation commands]

   # 5. Restore Gerbil configuration
   scp backup-server:/backups/gerbil-config.tar.gz .
   sudo tar -xzf gerbil-config.tar.gz -C /

   # 6. Restore SSL certificates
   sudo tar -xzf ssl-certs-backup.tar.gz -C /etc/letsencrypt/

   # 7. Configure firewall
   sudo ufw allow 22/tcp
   sudo ufw allow 80/tcp
   sudo ufw allow 443/tcp
   sudo ufw allow [GERBIL_PORT]/tcp
   sudo ufw enable

   # 8. Start services
   sudo systemctl enable --now pangolin
   sudo systemctl enable --now gerbil

   # 9. Update DNS A record to new VPS IP
   # [DNS provider steps]

   # 10. Reconnect Gerbil tunnels from home lab
   # [See Gerbil reconnection below]
   ```

4. **Verification**
   - Test all public routes
   - Verify Gerbil tunnels are connected
   - Check SSL certificates are valid
   - Monitor logs for errors
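
For the certificate check, `openssl` can report how long a restored certificate has left before expiry. A sketch, assuming GNU `date`; the certificate path in the example is a typical Let's Encrypt layout, not necessarily yours:

```bash
#!/usr/bin/env bash
# Print the number of whole days until a PEM certificate expires.
cert_days_left() {
    local cert="$1" end
    # notAfter line looks like: notAfter=Jun  1 12:00:00 2025 GMT
    end=$(openssl x509 -enddate -noout -in "$cert" | cut -d= -f2)
    echo $(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
}

# Example (path assumes a Let's Encrypt layout):
# cert_days_left /etc/letsencrypt/live/domain.com/fullchain.pem
```

A value close to the renewal window (for Let's Encrypt, under 30 days) after a restore suggests the backed-up certificate was stale and should be renewed rather than reused.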

---

### Scenario 2: Home Lab Network Outage

**Impact**: All home lab services unreachable, Gerbil tunnels down

**Recovery Procedure**:

1. **Immediate Actions (0-15 minutes)**
   - Check router/modem status
   - Verify the ISP is not having an outage
   - Check physical connections
   - Reboot router/modem if necessary

2. **ISP Outage (Variable duration)**
   - Contact ISP support
   - Consider failover to a mobile hotspot if critical
   - Notify users of expected downtime

3. **Restore Gerbil Tunnels**

   ```bash
   # On each home lab machine with tunnels:

   # 1. Verify local services are running
   systemctl status [service-name]

   # 2. Test VPS connectivity
   ping [VPS_IP]

   # 3. Restart Gerbil tunnels
   sudo systemctl restart gerbil-tunnel-*

   # 4. Verify tunnels are connected
   gerbil status

   # 5. Check logs for errors
   journalctl -u gerbil-tunnel-* -n 50
   ```
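
Rather than restarting tunnels by hand the moment connectivity returns, the restart can be gated on the VPS answering ping. A sketch; the host, attempt budget, and unit names are placeholders:

```bash
#!/usr/bin/env bash
# Block until a host answers ping, or give up after a number of tries.
# Returns 0 as soon as one ping succeeds, 1 if the budget runs out.
wait_for_host() {
    local host="$1" tries="${2:-30}" i
    for ((i = 0; i < tries; i++)); do
        ping -c 1 -W 2 "$host" >/dev/null 2>&1 && return 0
        sleep 2
    done
    return 1
}

# Example usage (placeholder host and unit glob):
# if wait_for_host [VPS_IP] 60; then
#     sudo systemctl restart gerbil-tunnel-*
# fi
```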

---

### Scenario 3: Proxmox Node Failure (DL380p or i5)

**Impact**: All VMs/containers on the failed node are down

**Recovery Procedure**:

1. **Immediate Actions (0-30 minutes)**
   - Identify which node has failed
   - Determine the cause (power, hardware, network)
   - Check whether the other cluster nodes are healthy

2. **If Node Can Be Recovered**

   ```bash
   # Try to boot the node.
   # If successful, check cluster status:
   pvecm status

   # Check VM/container status
   qm list
   pct list

   # Start critical VMs/containers
   qm start VMID
   pct start CTID
   ```

3. **If Node Cannot Be Recovered - Migrate Services**

   ```bash
   # On working node:

   # 1. Check available resources
   pvesh get /nodes/NODE/status

   # 2. Restore VMs from backup to the working node
   qmrestore /path/to/backup/vzdump-qemu-VMID.vma.zst NEW_VMID --storage local-lvm

   # 3. Restore containers from backup
   pct restore NEW_CTID /path/to/backup/vzdump-lxc-CTID.tar.zst --storage local-lvm

   # 4. Start restored VMs/containers
   qm start NEW_VMID
   pct start NEW_CTID

   # 5. Update internal DNS/documentation with new IPs if changed
   ```
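
The restore commands above need a concrete dump file. A small helper can pick the newest vzdump archive for a guest; a sketch that assumes the standard `vzdump-{qemu,lxc}-<id>-<timestamp>` naming, with the dump directory as a placeholder for your backup storage:

```bash
#!/usr/bin/env bash
# Print the newest vzdump archive for a given VMID/CTID.
# Directory default is an example - point it at your backup storage.
latest_backup() {
    local id="$1" dir="${2:-/mnt/backups/dump}"
    ls -1t "$dir"/vzdump-*-"$id"-* 2>/dev/null | head -n 1
}

# Example usage (placeholder VMID):
# qmrestore "$(latest_backup 100)" 100 --storage local-lvm
```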

4. **Resource Constraints**
   - If there are insufficient resources on the remaining node:
     - Prioritize critical services only
     - Consider scaling down VM resources temporarily
     - Plan for hardware replacement/repair

---

### Scenario 4: Storage Node (OMV) Failure

**Impact**: Shared storage unavailable, backups inaccessible, risk of data loss

**Recovery Procedure**:

1. **Immediate Actions (0-30 minutes)**
   - Verify the storage node is down
   - Check whether the disks are healthy (if the node boots)
   - Identify affected services using shared storage

2. **If Disk Failure**
   - Check RAID status (if configured)
   - Replace the failed disk
   - Rebuild the RAID array
   - Restore from off-site backup if necessary
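
If the array is Linux software RAID, its state can be checked with standard `mdadm` diagnostics; the device names below are examples for your setup:

```bash
# Quick RAID health check (mdadm software RAID; device names are examples)
cat /proc/mdstat                 # overview: [UU] healthy, [U_] degraded
sudo mdadm --detail /dev/md0     # per-disk state of one array
sudo smartctl -H /dev/sda        # SMART health of the suspect disk
```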

3. **If Complete Storage Loss**

   ```bash
   # 1. Rebuild OMV on new hardware/disks
   # [OMV installation]

   # 2. Configure network shares
   # [NFS/CIFS setup]

   # 3. Restore data from off-site backup
   rsync -avz backup-location:/backups/ /mnt/storage/

   # 4. Remount shares on Proxmox nodes
   # Update /etc/fstab on each node
   mount -a

   # 5. Verify Proxmox can access storage
   pvesm status
   ```

---

### Scenario 5: DNS Provider Failure

**Impact**: Domain not resolving, all services unreachable by domain name

**Recovery Procedure**:

1. **Immediate Actions (0-15 minutes)**
   - Check the DNS provider's status page
   - Test DNS resolution: `nslookup domain.com`
   - Verify it is a provider issue, not a configuration error

2. **Short-term Mitigation (15-60 minutes)**
   - Share direct IP addresses with users temporarily
   - Set up temporary DNS using Cloudflare (free tier)

3. **Migrate to New DNS Provider**

   ```bash
   # 1. Export the zone file from the old provider (if possible)

   # 2. Create an account with the new DNS provider

   # 3. Import the zone file or manually create records:
   #    A record: domain.com -> VPS_IP
   #    A record: *.domain.com -> VPS_IP (if using wildcard)
   #    Other records as needed

   # 4. Update nameservers at the domain registrar
   #    (Propagation takes 24-48 hours)

   # 5. Monitor DNS propagation
   dig domain.com @8.8.8.8
   ```
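
Propagation can also be watched across several resolvers at once, using the same `dig` invocation as above; the resolver list is just an example:

```bash
# Compare the A record across a few public resolvers until they agree
for ns in 8.8.8.8 1.1.1.1 9.9.9.9; do
    printf '%-10s %s\n' "$ns" "$(dig +short domain.com @"$ns" | head -n 1)"
done
```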

---

### Scenario 6: Complete Data Center Loss (Home Lab)

**Impact**: All home lab infrastructure destroyed (fire, flood, etc.)

**Recovery Procedure**:

1. **Immediate Actions**
   - Ensure safety of personnel
   - Contact insurance provider
   - Assess the extent of the damage
   - Secure remaining equipment

2. **Short-term (Services that must continue)**
   - Move critical services to the VPS temporarily
   - Use cloud providers for temporary hosting
   - Restore from off-site backups

3. **Long-term (Infrastructure Rebuild)**
   - Procure replacement hardware
   - Rebuild the Proxmox cluster
   - Restore VMs/containers from off-site backups
   - Reconfigure the network
   - Re-establish Gerbil tunnels
   - Full testing and verification

---

## Recovery Procedures

### General Recovery Steps

1. **Assess the Situation**
   - Identify what has failed
   - Determine the scope of impact
   - Estimate recovery time

2. **Communicate**
   - Notify affected users
   - Update the status page if available
   - Keep stakeholders informed

3. **Prioritize**
   - Focus on critical services first
   - Use the RTO/RPO objectives
   - Document decisions

4. **Execute Recovery**
   - Follow the specific scenario procedures
   - Document all actions taken
   - Keep logs of commands executed
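
A low-effort way to keep those command logs is `script` from util-linux, which records a session including its output. The log path below is an example; in a real incident, pick a location that will survive the recovery:

```bash
#!/usr/bin/env bash
# Record recovery commands and their output with script(1) (util-linux).
# The log path is an example - choose one that survives the incident.
LOG="/tmp/dr-$(date +%F).typescript"
script -q -a "$LOG" -c "uname -a"   # wrap a single command
# For a whole session instead:  script -a "$LOG"   (type `exit` to stop)
```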

5. **Verify**
   - Test all restored services
   - Check data integrity
   - Monitor for issues

6. **Document**
   - Record what happened
   - Document what worked and what did not
   - Update this document with lessons learned

---

## Post-Recovery Checklist

After any disaster recovery, complete the following:

### Immediate Post-Recovery (0-24 hours)

- [ ] All critical services are operational
- [ ] All services are monitored
- [ ] Temporary workarounds documented
- [ ] Incident logged with timeline

### Short-term (1-7 days)

- [ ] All services fully restored
- [ ] Performance is normal
- [ ] Backups are running
- [ ] Security review completed
- [ ] Post-mortem meeting scheduled

### Long-term (1-4 weeks)

- [ ] Post-mortem completed
- [ ] Lessons learned documented
- [ ] Disaster recovery plan updated
- [ ] Preventive measures implemented
- [ ] Training updated if needed
- [ ] Backup/monitoring improvements made

---

## Post-Mortem Template

After each disaster recovery event, complete a post-mortem:

**Incident Date**: _____________
**Recovery Completed**: _____________
**Total Downtime**: _____________

### What Happened?

[Detailed description of the incident]

### Timeline

| Time | Event |
|------|-------|
| _____ | _____ |
| _____ | _____ |

### Root Cause

[What caused the failure?]

### What Went Well?

-
-

### What Went Poorly?

-
-

### Action Items

| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| _______ | _____ | ________ | ______ |

### Improvements to This Plan

[What should be updated in the disaster recovery plan?]

---

## Testing Schedule

Regular disaster recovery testing ensures procedures work when needed:

| Test Type | Frequency | Last Test | Next Test | Status |
|-----------|-----------|-----------|-----------|--------|
| Backup restore test | Quarterly | _________ | _________ | ______ |
| VPS failover drill | Semi-annually | _________ | _________ | ______ |
| Node failure simulation | Annually | _________ | _________ | ______ |
| Full DR scenario | Annually | _________ | _________ | ______ |

---

## Document Maintenance

**Last Updated**: _____________
**Updated By**: _____________
**Next Review Date**: _____________
**Version**: 1.0