Disaster Recovery Plan

This document outlines procedures for recovering from disaster scenarios affecting the home lab and VPS infrastructure.

Table of Contents

  • Emergency Contact Information
  • Recovery Time Objectives
  • Backup Locations
  • Disaster Scenarios
  • Recovery Procedures
  • Post-Recovery Checklist
  • Post-Mortem Template
  • Testing Schedule
  • Document Maintenance

Emergency Contact Information

Primary Contacts

| Role | Name | Phone | Email | Availability |
|------|------|-------|-------|--------------|
| Infrastructure Owner | _____________ | _____________ | _____________ | 24/7 |
| Network Admin | _____________ | _____________ | _____________ | Business Hours |
| Backup Contact | _____________ | _____________ | _____________ | 24/7 |

Service Provider Contacts

| Provider | Service | Support Number | Account ID | Notes |
|----------|---------|----------------|------------|-------|
| VPS Provider | _____________ | _____________ | _____________ | _____________ |
| DNS Provider | _____________ | _____________ | _____________ | _____________ |
| Domain Registrar | _____________ | _____________ | _____________ | _____________ |
| ISP (Home Lab) | _____________ | _____________ | _____________ | _____________ |

Recovery Time Objectives

Define acceptable downtime for each service tier:

| Tier | Service Type | RTO (Recovery Time Objective) | RPO (Recovery Point Objective) |
|------|--------------|-------------------------------|--------------------------------|
| Critical | Public-facing services, Authentication | 1 hour | 15 minutes |
| Important | Internal services, Databases | 4 hours | 1 hour |
| Standard | Development, Testing | 24 hours | 24 hours |
| Low Priority | Monitoring, Logging | 48 hours | 24 hours |

Backup Locations

Primary Backup Location

  • Location: _____________ (e.g., OMV storage node, external drive)
  • Path: _____________
  • Retention: _____________
  • Access Method: _____________

Secondary Backup Location (Off-site)

  • Location: _____________ (e.g., Cloud storage, remote server)
  • Path: _____________
  • Retention: _____________
  • Access Method: _____________

Backup Schedule

  • Proxmox VMs/Containers: Daily at _____
  • Configuration Files: Weekly on _____
  • Critical Data: Hourly/Daily
  • Off-site Sync: Daily/Weekly
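
The schedule above can be driven by cron on the relevant hosts. A minimal sketch is below; the script names, paths, and times are placeholders to be adjusted to match the schedule chosen above. Proxmox guest backups are usually scheduled through the Proxmox GUI (Datacenter -> Backup) rather than cron.

    # Example crontab entries (edit with: crontab -e)

    # Weekly configuration backup, Sunday 02:00 (backup-configs.sh is a placeholder script)
    0 2 * * 0   /usr/local/bin/backup-configs.sh >> /var/log/backup-configs.log 2>&1

    # Daily off-site sync, 03:30 (offsite-sync.sh is a placeholder script)
    30 3 * * *  /usr/local/bin/offsite-sync.sh >> /var/log/offsite-sync.log 2>&1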

Critical Items to Backup

  • Proxmox VM/Container configurations and disks
  • Pangolin reverse proxy configurations
  • Gerbil tunnel configurations and keys
  • SSL/TLS certificates and keys
  • SSH keys and authorized_keys files
  • Network configuration files
  • DNS zone files (if self-hosted)
  • Database dumps
  • Application data and configurations
  • Documentation and credentials (encrypted)
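
As a sketch of how these items can be captured, the commands below back up a Proxmox guest with vzdump, archive key configuration directories, and push everything to the off-site location. VMIDs, host names, storage names, and paths are placeholders for the values recorded in the tables above.

    # Back up a VM or container with vzdump (101 and backup-storage are placeholders)
    vzdump 101 --mode snapshot --compress zstd --storage backup-storage

    # Archive configuration files, SSH keys, and certificates (extend the path list as needed)
    sudo tar -czf /backups/configs-$(date +%F).tar.gz \
        /etc/network/interfaces /etc/ssh /etc/letsencrypt

    # Dump a database if one is in use (PostgreSQL example; database name is a placeholder)
    pg_dump -U postgres mydb | gzip > /backups/mydb-$(date +%F).sql.gz

    # Sync the backup directory to the off-site location (host and path are placeholders)
    rsync -avz /backups/ offsite-backup:/backups/homelab/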

Disaster Scenarios

Scenario 1: VPS Complete Failure

Impact: All public-facing services down, no external access to home lab services

Recovery Procedure:

  1. Immediate Actions (0-15 minutes)

    • Verify VPS is actually down (ping, SSH, web checks)
    • Contact VPS provider support
    • Check VPS provider status page
    • Notify users if necessary
  2. Short-term Mitigation (15-60 minutes)

    • If it is a hardware failure, request a rebuild or migration from the provider
    • If it is an account issue, resolve it with the provider
    • Consider spinning up a temporary VPS with another provider
  3. VPS Rebuild (1-4 hours)

    # On new VPS:
    
    # 1. Update system
    sudo apt update && sudo apt upgrade -y
    
    # 2. Install Pangolin
    # [Installation commands]
    
    # 3. Restore Pangolin configuration from backup
    scp backup-server:/backups/pangolin-config.tar.gz .
    sudo tar -xzf pangolin-config.tar.gz -C /
    
    # 4. Install Gerbil server
    # [Installation commands]
    
    # 5. Restore Gerbil configuration
    scp backup-server:/backups/gerbil-config.tar.gz .
    sudo tar -xzf gerbil-config.tar.gz -C /
    
    # 6. Restore SSL certificates
    sudo tar -xzf ssl-certs-backup.tar.gz -C /etc/letsencrypt/
    
    # 7. Configure firewall
    sudo ufw allow 22/tcp
    sudo ufw allow 80/tcp
    sudo ufw allow 443/tcp
    sudo ufw allow [GERBIL_PORT]/udp   # Gerbil tunnels use WireGuard, which runs over UDP
    sudo ufw enable
    
    # 8. Start services
    sudo systemctl enable --now pangolin
    sudo systemctl enable --now gerbil
    
    # 9. Update DNS A record to new VPS IP
    # [DNS provider steps]
    
    # 10. Reconnect Gerbil tunnels from home lab
    # [See Gerbil reconnection below]
    
  4. Verification

    • Test all public routes
    • Verify Gerbil tunnels are connected
    • Check SSL certificates are valid
    • Monitor logs for errors
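
The verification steps above can be scripted so they are quick to repeat. A minimal sketch, assuming the domain, the list of public routes, and the pangolin/gerbil service names match your setup:

    #!/usr/bin/env bash
    # Post-rebuild verification; DOMAIN and the route prefixes are placeholders.
    DOMAIN="example.com"

    # Check that each public route answers over HTTPS
    for route in "" "app." "media."; do
        code=$(curl -s -o /dev/null -w '%{http_code}' "https://${route}${DOMAIN}")
        echo "https://${route}${DOMAIN} -> HTTP ${code}"
    done

    # Check certificate validity and expiry
    echo | openssl s_client -connect "${DOMAIN}:443" -servername "${DOMAIN}" 2>/dev/null \
        | openssl x509 -noout -enddate

    # Scan recent service logs for errors
    journalctl -u pangolin -u gerbil -n 50 --no-pager | grep -iE 'error|fail' \
        || echo "No recent errors logged"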

Scenario 2: Home Lab Network Outage

Impact: All home lab services unreachable, Gerbil tunnels down

Recovery Procedure:

  1. Immediate Actions (0-15 minutes)

    • Check router/modem status
    • Verify ISP is not having outage
    • Check physical connections
    • Reboot router/modem if necessary
  2. ISP Outage (Variable duration)

    • Contact ISP support
    • Consider failover to mobile hotspot if critical
    • Notify users of expected downtime
  3. Restore Gerbil Tunnels

    # On each home lab machine with tunnels:
    
    # 1. Verify local services are running
    systemctl status [service-name]
    
    # 2. Test VPS connectivity
    ping [VPS_IP]
    
    # 3. Restart Gerbil tunnels
    sudo systemctl restart 'gerbil-tunnel-*'
    
    # 4. Verify tunnels are connected
    gerbil status
    
    # 5. Check logs for errors
    journalctl -u 'gerbil-tunnel-*' -n 50
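
To cut down on manual restarts after brief outages, a systemd drop-in can make each tunnel unit restart itself automatically. A sketch, assuming the tunnels run as gerbil-tunnel-* services as above; gerbil-tunnel-web is a placeholder unit name:

    # Add a restart policy for one tunnel unit (repeat per unit)
    sudo mkdir -p /etc/systemd/system/gerbil-tunnel-web.service.d
    printf '[Service]\nRestart=always\nRestartSec=15\n' | \
        sudo tee /etc/systemd/system/gerbil-tunnel-web.service.d/restart.conf > /dev/null

    # Reload systemd and restart the unit so the drop-in takes effect
    sudo systemctl daemon-reload
    sudo systemctl restart gerbil-tunnel-web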
    

Scenario 3: Proxmox Node Failure (DL380p or i5)

Impact: All VMs/containers on failed node are down

Recovery Procedure:

  1. Immediate Actions (0-30 minutes)

    • Identify which node has failed
    • Determine cause (power, hardware, network)
    • Check if other cluster nodes are healthy
  2. If Node Can Be Recovered

    # Try to boot node
    # If successful, check cluster status:
    pvecm status
    
    # Check VM/Container status
    qm list
    pct list
    
    # Start critical VMs/Containers
    qm start VMID
    pct start CTID
    
  3. If Node Cannot Be Recovered - Migrate Services

    # On working node:
    
    # 1. Check available resources
    pvesh get /nodes/NODE/status
    
    # 2. Restore VMs from backup to working node
    qmrestore /path/to/backup/vzdump-qemu-VMID.vma.zst NEW_VMID --storage local-lvm
    
    # 3. Restore containers from backup
    pct restore NEW_CTID /path/to/backup/vzdump-lxc-CTID.tar.zst --storage local-lvm
    
    # 4. Start restored VMs/containers
    qm start NEW_VMID
    pct start NEW_CTID
    
    # 5. Update internal DNS/documentation with new IPs if changed
    
  4. Resource Constraints

    • If the remaining node has insufficient resources:
      • Prioritize critical services only
      • Consider scaling down VM resources temporarily
      • Plan for hardware replacement/repair
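
When several guests have to be brought back at once, the newest backup of each can be located and restored in a loop. A sketch, assuming backups live under /mnt/backups/dump; the VMIDs, CTIDs, and storage name are placeholders:

    #!/usr/bin/env bash
    # Restore the newest backup of each listed guest onto the surviving node.
    BACKUP_DIR="/mnt/backups/dump"
    STORAGE="local-lvm"

    for vmid in 101 102; do
        latest=$(ls -t "${BACKUP_DIR}"/vzdump-qemu-${vmid}-*.vma.zst 2>/dev/null | head -n1)
        # Restore under a new ID in case the old VMID is still registered in the cluster
        [ -n "$latest" ] && qmrestore "$latest" $((vmid + 1000)) --storage "$STORAGE"
    done

    for ctid in 201 202; do
        latest=$(ls -t "${BACKUP_DIR}"/vzdump-lxc-${ctid}-*.tar.zst 2>/dev/null | head -n1)
        [ -n "$latest" ] && pct restore $((ctid + 1000)) "$latest" --storage "$STORAGE"
    done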

Scenario 4: Storage Node (OMV) Failure

Impact: Shared storage unavailable, backups inaccessible, data loss risk

Recovery Procedure:

  1. Immediate Actions (0-30 minutes)

    • Verify storage node is down
    • Check if disks are healthy (if node boots)
    • Identify affected services using shared storage
  2. If Disk Failure

    • Check RAID status (if configured)
    • Replace failed disk
    • Rebuild RAID array
    • Restore from off-site backup if necessary
  3. If Complete Storage Loss

    # 1. Rebuild OMV on new hardware/disks
    # [OMV installation]
    
    # 2. Configure network shares
    # [NFS/CIFS setup]
    
    # 3. Restore data from off-site backup
    rsync -avz backup-location:/backups/ /mnt/storage/
    
    # 4. Remount shares on Proxmox nodes
    # Update /etc/fstab on each node
    mount -a
    
    # 5. Verify Proxmox can access storage
    pvesm status
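
If the share is consumed by Proxmox as NFS storage, it can also be re-registered with pvesm instead of /etc/fstab. A sketch; the storage ID, server address, and export path are placeholders:

    # Re-add the rebuilt OMV export as Proxmox storage
    pvesm add nfs omv-backups --server 192.168.1.50 \
        --export /export/backups --content backup,iso

    # Confirm the storage shows as active
    pvesm status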
    

Scenario 5: DNS Provider Failure

Impact: Domain not resolving, all services unreachable by domain name

Recovery Procedure:

  1. Immediate Actions (0-15 minutes)

    • Check DNS provider status page
    • Test DNS resolution: nslookup domain.com
    • Verify it is a provider issue, not a local configuration problem
  2. Short-term Mitigation (15-60 minutes)

    • Share direct IP addresses with users temporarily
    • Set up temporary DNS using Cloudflare (free tier)
  3. Migrate to New DNS Provider

    # 1. Export zone file from old provider (if possible)
    
    # 2. Create account with new DNS provider
    
    # 3. Import zone file or manually create records:
    # A record: domain.com -> VPS_IP
    # A record: *.domain.com -> VPS_IP (if using wildcard)
    # Other records as needed
    
    # 4. Update nameservers at domain registrar
    # (Propagation takes 24-48 hours)
    
    # 5. Monitor DNS propagation
    dig domain.com @8.8.8.8
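
Propagation can be watched from several public resolvers at once. A small sketch; the domain is a placeholder:

    #!/usr/bin/env bash
    # Compare the A record returned by several public resolvers.
    DOMAIN="domain.com"   # placeholder

    for resolver in 8.8.8.8 1.1.1.1 9.9.9.9; do
        answer=$(dig +short A "$DOMAIN" @"$resolver" | head -n1)
        echo "${resolver}: ${answer:-no answer}"
    done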
    

Scenario 6: Complete Data Center Loss (Home Lab)

Impact: All home lab infrastructure destroyed (fire, flood, etc.)

Recovery Procedure:

  1. Immediate Actions

    • Ensure safety of personnel
    • Contact insurance provider
    • Assess extent of damage
    • Secure remaining equipment
  2. Short-term (Services that must continue)

    • Move critical services to VPS temporarily
    • Use cloud providers for temporary hosting
    • Restore from off-site backups
  3. Long-term (Infrastructure Rebuild)

    • Procure replacement hardware
    • Rebuild Proxmox cluster
    • Restore VMs/containers from off-site backups
    • Reconfigure network
    • Re-establish Gerbil tunnels
    • Full testing and verification

Recovery Procedures

General Recovery Steps

  1. Assess the Situation

    • Identify what has failed
    • Determine scope of impact
    • Estimate recovery time
  2. Communicate

    • Notify affected users
    • Update status page if available
    • Keep stakeholders informed
  3. Prioritize

    • Focus on critical services first
    • Use RTO/RPO objectives
    • Document decisions
  4. Execute Recovery

    • Follow specific scenario procedures
    • Document all actions taken
    • Keep logs of commands executed
  5. Verify

    • Test all restored services
    • Check data integrity
    • Monitor for issues
  6. Document

    • Record what happened
    • Document what worked/didn't work
    • Update this document with lessons learned

Post-Recovery Checklist

After any disaster recovery, complete the following:

Immediate Post-Recovery (0-24 hours)

  • All critical services are operational
  • All services are monitored
  • Temporary workarounds documented
  • Incident logged with timeline

Short-term (1-7 days)

  • All services fully restored
  • Performance is normal
  • Backups are running
  • Security review completed
  • Post-mortem meeting scheduled

Long-term (1-4 weeks)

  • Post-mortem completed
  • Lessons learned documented
  • Disaster recovery plan updated
  • Preventive measures implemented
  • Training updated if needed
  • Backup/monitoring improvements made

Post-Mortem Template

After each disaster recovery event, complete a post-mortem:

Incident Date: _____________
Recovery Completed: _____________
Total Downtime: _____________

What Happened?

[Detailed description of the incident]

Timeline

| Time | Event |
|------|-------|
| _____ | _____ |
| _____ | _____ |

Root Cause

[What caused the failure?]

What Went Well?

What Went Poorly?

Action Items

| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| _______ | _____ | ________ | ______ |

Improvements to This Plan

[What should be updated in the disaster recovery plan?]


Testing Schedule

Regular disaster recovery testing ensures procedures work when needed:

| Test Type | Frequency | Last Test | Next Test | Status |
|-----------|-----------|-----------|-----------|--------|
| Backup restore test | Quarterly | _________ | _________ | ______ |
| VPS failover drill | Semi-annually | _________ | _________ | ______ |
| Node failure simulation | Annually | _________ | _________ | ______ |
| Full DR scenario | Annually | _________ | _________ | ______ |
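
For the quarterly backup restore test, one low-risk approach is to restore the latest backup of a guest to an unused VMID, boot it, verify it, then destroy the copy. A sketch with placeholder IDs, paths, and storage:

    # Restore the newest backup of VM 101 to a scratch VMID and boot it
    latest=$(ls -t /mnt/backups/dump/vzdump-qemu-101-*.vma.zst | head -n1)
    qmrestore "$latest" 9101 --storage local-lvm
    qm start 9101

    # After verifying the guest boots and its data is intact, remove the test copy
    qm stop 9101
    qm destroy 9101 --purge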

Document Maintenance

Last Updated: _____________
Updated By: _____________
Next Review Date: _____________
Version: 1.0