Initial infrastructure documentation - comprehensive homelab reference

Funky (OpenClaw)
2026-02-23 03:42:22 +00:00
commit 0682c79580
169 changed files with 63913 additions and 0 deletions

# Disaster Recovery Plan
This document outlines procedures for recovering from various disaster scenarios affecting your infrastructure.
## Table of Contents
- [Emergency Contact Information](#emergency-contact-information)
- [Recovery Time Objectives](#recovery-time-objectives)
- [Backup Locations](#backup-locations)
- [Disaster Scenarios](#disaster-scenarios)
- [Recovery Procedures](#recovery-procedures)
- [Post-Recovery Checklist](#post-recovery-checklist)
---
## Emergency Contact Information
### Primary Contacts
| Role | Name | Phone | Email | Availability |
|------|------|-------|-------|--------------|
| Infrastructure Owner | _____________ | _____________ | _____________ | 24/7 |
| Network Admin | _____________ | _____________ | _____________ | Business Hours |
| Backup Contact | _____________ | _____________ | _____________ | 24/7 |
### Service Provider Contacts
| Provider | Service | Support Number | Account ID | Notes |
|----------|---------|----------------|------------|-------|
| VPS Provider | _____________ | _____________ | _____________ | _____________ |
| DNS Provider | _____________ | _____________ | _____________ | _____________ |
| Domain Registrar | _____________ | _____________ | _____________ | _____________ |
| ISP (Home Lab) | _____________ | _____________ | _____________ | _____________ |
---
## Recovery Time Objectives
Define acceptable downtime for each service tier:

| Tier | Service Type | RTO (Recovery Time Objective) | RPO (Recovery Point Objective) |
|------|-------------|------------------------------|--------------------------------|
| Critical | Public-facing services, Authentication | 1 hour | 15 minutes |
| Important | Internal services, Databases | 4 hours | 1 hour |
| Standard | Development, Testing | 24 hours | 24 hours |
| Low Priority | Monitoring, Logging | 48 hours | 24 hours |
---
## Backup Locations
### Primary Backup Location
- **Location**: _____________ (e.g., OMV storage node, external drive)
- **Path**: _____________
- **Retention**: _____________
- **Access Method**: _____________
### Secondary Backup Location (Off-site)
- **Location**: _____________ (e.g., Cloud storage, remote server)
- **Path**: _____________
- **Retention**: _____________
- **Access Method**: _____________
### Backup Schedule
- **Proxmox VMs/Containers**: Daily at _____ (see the vzdump sketch below)
- **Configuration Files**: Weekly on _____
- **Critical Data**: Hourly/Daily
- **Off-site Sync**: Daily/Weekly
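The Proxmox line above is usually implemented as a scheduled backup job in the GUI (Datacenter → Backup); as a minimal command-line sketch, an equivalent cron entry might look like the following (the time, storage name, and retention are placeholders):
```bash
# Hedged sketch: nightly vzdump of all guests on this node at 02:00
# "backup-nfs" is a placeholder storage name -- substitute the real backup target
# /etc/cron.d/vzdump-nightly
0 2 * * * root vzdump --all --mode snapshot --compress zstd --storage backup-nfs --quiet 1
```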
### Critical Items to Backup
- [ ] Proxmox VM/Container configurations and disks
- [ ] Pangolin reverse proxy configurations
- [ ] Gerbil tunnel configurations and keys
- [ ] SSL/TLS certificates and keys
- [ ] SSH keys and authorized_keys files
- [ ] Network configuration files
- [ ] DNS zone files (if self-hosted)
- [ ] Database dumps
- [ ] Application data and configurations
- [ ] Documentation and credentials (encrypted)
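A minimal sketch of capturing several of the items above in one dated archive (every path and the `backup-server` host below are placeholders; Pangolin and Gerbil may keep their configuration elsewhere, e.g. in Docker volumes):
```bash
#!/usr/bin/env bash
# Hedged sketch: bundle key config files into one dated archive and copy it off-host.
# All paths and the "backup-server" hostname are placeholders -- adjust to the real layout.
set -euo pipefail
STAMP=$(date +%F)
ARCHIVE="/tmp/config-backup-${STAMP}.tar.gz"
tar -czf "$ARCHIVE" \
  /etc/pangolin \
  /etc/gerbil \
  /etc/letsencrypt \
  /etc/ssh \
  /etc/network/interfaces \
  2>/dev/null || true          # tolerate paths that do not exist on a given host
scp "$ARCHIVE" backup-server:/backups/configs/
```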
---
## Disaster Scenarios
### Scenario 1: VPS Complete Failure
**Impact**: All public-facing services down, no external access to home lab services
**Recovery Procedure**:
1. **Immediate Actions (0-15 minutes)**
- Verify VPS is actually down (ping, SSH, web checks)
- Contact VPS provider support
- Check VPS provider status page
- Notify users if necessary
2. **Short-term Mitigation (15-60 minutes)**
- If it is a hardware failure, request a rebuild from the provider
- If it is an account issue, resolve it with the provider
- Consider spinning up a temporary VPS with another provider
3. **VPS Rebuild (1-4 hours)**
```bash
# On new VPS:
# 1. Update system
sudo apt update && sudo apt upgrade -y
# 2. Install Pangolin
# [Installation commands]
# 3. Restore Pangolin configuration from backup
scp backup-server:/backups/pangolin-config.tar.gz .
sudo tar -xzf pangolin-config.tar.gz -C /
# 4. Install Gerbil server
# [Installation commands]
# 5. Restore Gerbil configuration
scp backup-server:/backups/gerbil-config.tar.gz .
sudo tar -xzf gerbil-config.tar.gz -C /
# 6. Restore SSL certificates
sudo tar -xzf ssl-certs-backup.tar.gz -C /etc/letsencrypt/
# 7. Configure firewall
sudo ufw allow 22/tcp
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw allow [GERBIL_PORT]/tcp
sudo ufw enable
# 8. Start services
sudo systemctl enable --now pangolin
sudo systemctl enable --now gerbil
# 9. Update DNS A record to new VPS IP
# [DNS provider steps]
# 10. Reconnect Gerbil tunnels from home lab
# [See Gerbil reconnection below]
```
4. **Verification** (see the check sketch below)
- Test all public routes
- Verify Gerbil tunnels are connected
- Check SSL certificates are valid
- Monitor logs for errors
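A quick way to script these checks (the domain and the `pangolin`/`gerbil` unit names are placeholders taken from this plan, not confirmed service names):
```bash
# Hedged verification sketch -- domain.com and the unit names are placeholders
curl -sI https://domain.com | head -n 1                    # expect a 2xx/3xx from the proxy
echo | openssl s_client -connect domain.com:443 2>/dev/null \
  | openssl x509 -noout -dates                             # certificate validity window
journalctl -u pangolin -u gerbil --since "15 min ago" --no-pager | tail -n 50
```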
---
### Scenario 2: Home Lab Network Outage
**Impact**: All home lab services unreachable, Gerbil tunnels down
**Recovery Procedure**:
1. **Immediate Actions (0-15 minutes)**
- Check router/modem status
- Verify ISP is not having outage
- Check physical connections
- Reboot router/modem if necessary
2. **ISP Outage (Variable duration)**
- Contact ISP support
- Consider failover to a mobile hotspot if critical services must stay reachable
- Notify users of expected downtime
3. **Restore Gerbil Tunnels**
```bash
# On each home lab machine with tunnels:
# 1. Verify local services are running
systemctl status [service-name]
# 2. Test VPS connectivity
ping [VPS_IP]
# 3. Restart Gerbil tunnels
sudo systemctl restart gerbil-tunnel-*
# 4. Verify tunnels are connected
gerbil status
# 5. Check logs for errors
journalctl -u gerbil-tunnel-* -n 50
```
---
### Scenario 3: Proxmox Node Failure (DL380p or i5)
**Impact**: All VMs/containers on failed node are down
**Recovery Procedure**:
1. **Immediate Actions (0-30 minutes)**
- Identify which node has failed
- Determine cause (power, hardware, network)
- Check if other cluster nodes are healthy
2. **If Node Can Be Recovered**
```bash
# Try to boot node
# If successful, check cluster status:
pvecm status
# Check VM/Container status
qm list
pct list
# Start critical VMs/Containers
qm start VMID
pct start CTID
```
3. **If Node Cannot Be Recovered - Migrate Services**
```bash
# On working node:
# 1. Check available resources
pvesh get /nodes/NODE/status
# 2. Restore VMs from backup to working node
qmrestore /path/to/backup/vzdump-qemu-VMID.vma.zst NEW_VMID --storage local-lvm
# 3. Restore containers from backup
pct restore NEW_CTID /path/to/backup/vzdump-lxc-CTID.tar.zst --storage local-lvm
# 4. Start restored VMs/containers
qm start NEW_VMID
pct start NEW_CTID
# 5. Update internal DNS/documentation with new IPs if changed
```
4. **Resource Constraints**
- If there are insufficient resources on the remaining node:
- Prioritize critical services only
- Consider scaling down VM resources temporarily (see the sketch below)
- Plan for hardware replacement/repair
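A sketch of temporarily scaling guests down so that critical services fit on the surviving node (IDs and sizes below are placeholders):
```bash
# Hedged sketch: shrink restored guests to fit the remaining node's RAM/CPU
qm set VMID --memory 2048 --cores 2      # VM down to 2 GiB RAM / 2 vCPUs (placeholder values)
pct set CTID --memory 1024 --cores 1     # container down to 1 GiB RAM / 1 vCPU
```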
---
### Scenario 4: Storage Node (OMV) Failure
**Impact**: Shared storage unavailable, backups inaccessible, data loss risk
**Recovery Procedure**:
1. **Immediate Actions (0-30 minutes)**
- Verify the storage node is down
- Check whether the disks are healthy (if the node still boots)
- Identify which services depend on the shared storage
2. **If Disk Failure**
- Check RAID status (if configured)
- Replace failed disk
- Rebuild the RAID array (see the mdadm sketch below)
- Restore from off-site backup if necessary
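A hedged sketch of the disk-replacement flow, assuming Linux software RAID (mdadm); if OMV uses ZFS or a hardware controller, substitute that tool's equivalents (array and device names are placeholders):
```bash
# Hedged sketch assuming mdadm software RAID -- /dev/md0 and /dev/sdb1 are placeholders
cat /proc/mdstat                                          # overall array health
sudo mdadm --detail /dev/md0                              # identify the failed member
sudo mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1   # mark it failed and remove it
# physically replace the disk and partition it to match, then:
sudo mdadm /dev/md0 --add /dev/sdb1                       # re-add; rebuild progress shows in /proc/mdstat
```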
3. **If Complete Storage Loss**
```bash
# 1. Rebuild OMV on new hardware/disks
# [OMV installation]
# 2. Configure network shares
# [NFS/CIFS setup]
# 3. Restore data from off-site backup
rsync -avz backup-location:/backups/ /mnt/storage/
# 4. Remount shares on Proxmox nodes
# Update /etc/fstab on each node
mount -a
# 5. Verify Proxmox can access storage
pvesm status
```
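For step 4 in the block above, the share can be remounted via /etc/fstab or registered directly as Proxmox-managed storage; a sketch with placeholder server, export, and storage names:
```bash
# Hedged sketch -- server IP, export path, and storage name are placeholders
# /etc/fstab entry on each Proxmox node:
#   192.168.1.50:/export/storage  /mnt/storage  nfs  defaults,_netdev  0  0
mount -a
# ...or register the export as Proxmox-managed storage instead:
pvesm add nfs omv-storage --server 192.168.1.50 --export /export/storage --content backup,images
```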
---
### Scenario 5: DNS Provider Failure
**Impact**: Domain not resolving, all services unreachable by domain name
**Recovery Procedure**:
1. **Immediate Actions (0-15 minutes)**
- Check DNS provider status page
- Test DNS resolution: `nslookup domain.com`
- Verify it is a provider issue, not a local configuration error
2. **Short-term Mitigation (15-60 minutes)**
- Share direct IP addresses with users temporarily
- Set up temporary DNS using Cloudflare (free tier)
3. **Migrate to New DNS Provider**
```bash
# 1. Export zone file from old provider (if possible)
# 2. Create account with new DNS provider
# 3. Import zone file or manually create records:
# A record: domain.com -> VPS_IP
# A record: *.domain.com -> VPS_IP (if using wildcard)
# Other records as needed
# 4. Update nameservers at domain registrar
# (Propagation takes 24-48 hours)
# 5. Monitor DNS propagation
dig domain.com @8.8.8.8
```
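To watch propagation from several public resolvers at once, a small loop like this can be used (domain.com is the same placeholder as above):
```bash
# Hedged sketch: compare answers from several public resolvers
for ns in 8.8.8.8 1.1.1.1 9.9.9.9; do
  echo "== $ns =="
  dig +short domain.com A @"$ns"
done
```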
---
### Scenario 6: Complete Data Center Loss (Home Lab)
**Impact**: All home lab infrastructure destroyed (fire, flood, etc.)
**Recovery Procedure**:
1. **Immediate Actions**
- Ensure safety of personnel
- Contact insurance provider
- Assess extent of damage
- Secure remaining equipment
2. **Short-term (Services that must continue)**
- Move critical services to the VPS temporarily
- Use cloud providers for temporary hosting
- Restore from off-site backups
3. **Long-term (Infrastructure Rebuild)**
- Procure replacement hardware
- Rebuild Proxmox cluster
- Restore VMs/containers from off-site backups
- Reconfigure network
- Re-establish Gerbil tunnels
- Full testing and verification
---
## Recovery Procedures
### General Recovery Steps
1. **Assess the Situation**
- Identify what has failed
- Determine scope of impact
- Estimate recovery time
2. **Communicate**
- Notify affected users
- Update status page if available
- Keep stakeholders informed
3. **Prioritize**
- Focus on critical services first
- Use RTO/RPO objectives
- Document decisions
4. **Execute Recovery**
- Follow specific scenario procedures
- Document all actions taken
- Keep logs of commands executed (see the `script` sketch after this list)
5. **Verify**
- Test all restored services
- Check data integrity
- Monitor for issues
6. **Document**
- Record what happened
- Document what worked/didn't work
- Update this document with lessons learned
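One simple way to keep the command log mentioned in step 4 is to wrap the recovery shell in `script`; a minimal sketch:
```bash
# Hedged sketch: record every command and its output for the incident timeline
script -a ~/recovery-$(date +%F).log
# ...perform recovery work in the recorded shell; type "exit" to stop recording
```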
---
## Post-Recovery Checklist
After any disaster recovery, complete the following:
### Immediate Post-Recovery (0-24 hours)
- [ ] All critical services are operational
- [ ] All services are monitored
- [ ] Temporary workarounds documented
- [ ] Incident logged with timeline
### Short-term (1-7 days)
- [ ] All services fully restored
- [ ] Performance is normal
- [ ] Backups are running
- [ ] Security review completed
- [ ] Post-mortem meeting scheduled
### Long-term (1-4 weeks)
- [ ] Post-mortem completed
- [ ] Lessons learned documented
- [ ] Disaster recovery plan updated
- [ ] Preventive measures implemented
- [ ] Training updated if needed
- [ ] Backup/monitoring improvements made
---
## Post-Mortem Template
After each disaster recovery event, complete a post-mortem:
**Incident Date**: _____________
**Recovery Completed**: _____________
**Total Downtime**: _____________
### What Happened?
[Detailed description of the incident]
### Timeline
| Time | Event |
|------|-------|
| _____ | _____ |
| _____ | _____ |
### Root Cause
[What caused the failure?]
### What Went Well?
-
-
### What Went Poorly?
-
-
### Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| _______ | _____ | ________ | ______ |
### Improvements to This Plan
[What should be updated in the disaster recovery plan?]
---
## Testing Schedule
Regular disaster recovery testing ensures procedures work when needed:

| Test Type | Frequency | Last Test | Next Test | Status |
|-----------|-----------|-----------|-----------|--------|
| Backup restore test | Quarterly | _________ | _________ | ______ |
| VPS failover drill | Semi-annually | _________ | _________ | ______ |
| Node failure simulation | Annually | _________ | _________ | ______ |
| Full DR scenario | Annually | _________ | _________ | ______ |
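A hedged sketch of what the quarterly backup restore test can look like: restore the latest backup of one guest to a scratch VMID, confirm it boots, then clean up (backup path, VMIDs, and storage are placeholders):
```bash
# Hedged sketch of a restore test -- backup path, VMIDs, and storage name are placeholders
qmrestore /mnt/backups/vzdump-qemu-101-latest.vma.zst 9101 --storage local-lvm
qm start 9101                      # confirm the guest boots and its services come up
# verify the restored guest, then clean up:
qm stop 9101 && qm destroy 9101
```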
---
## Document Maintenance
**Last Updated**: _____________
**Updated By**: _____________
**Next Review Date**: _____________
**Version**: 1.0