Initial infrastructure documentation - comprehensive homelab reference

Funky (OpenClaw)
2026-02-23 03:42:22 +00:00
commit 0682c79580
169 changed files with 63913 additions and 0 deletions

infrastructure/RUNBOOK.md Normal file

@@ -0,0 +1,479 @@
# Infrastructure Runbook
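This runbook provides step-by-step procedures for common operational tasks in this homelab infrastructure. Uppercase tokens such as `VMID`, `CTID`, and `PORT`, and paths beginning with `/path/to/`, are placeholders; replace them before running a command.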
## Table of Contents
- [Pangolin Reverse Proxy Operations](#pangolin-reverse-proxy-operations)
- [Gerbil Tunnel Management](#gerbil-tunnel-management)
- [Proxmox Operations](#proxmox-operations)
- [SSL/TLS Certificate Management](#ssltls-certificate-management)
- [Network Troubleshooting](#network-troubleshooting)
- [Security Procedures](#security-procedures)
- [Backup Operations](#backup-operations)
- [Emergency Contacts](#emergency-contacts)
- [Additional Resources](#additional-resources)
---
## Pangolin Reverse Proxy Operations
### Add a New Route
```bash
# 1. SSH into VPS
ssh user@your-vps-ip
# 2. Edit Pangolin configuration
sudo nano /path/to/pangolin/config.yml
# 3. Add new route configuration
# domain.example.com -> backend:port
# 4. Test configuration
sudo pangolin config test
# 5. Reload Pangolin
sudo systemctl reload pangolin
# OR
sudo pangolin reload
# 6. Verify route is active
curl -I https://domain.example.com
```
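
Step 6 above can be wrapped in a small check that reports a clear UP or DOWN result. This is a minimal sketch, not part of Pangolin: the function names `route_is_up`, `fetch_status`, and `check_route` are hypothetical, and it assumes that any 2xx or 3xx response means the route is live. A route that sits behind authentication may legitimately answer 401 or 403, so adjust the pattern if that applies.

```bash
# Hypothetical helpers; the names are illustrative and not part of Pangolin.

# Succeed for 2xx and 3xx status codes, fail for anything else.
route_is_up() {
  case "$1" in
    2??|3??) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the HTTP status code for a URL; curl prints 000 when it cannot connect.
fetch_status() {
  curl -s -o /dev/null -m 10 -w '%{http_code}' "$1"
}

# Report whether a route answers, and return non-zero when it does not.
check_route() {
  local url=$1 code
  code=$(fetch_status "$url")
  if route_is_up "$code"; then
    echo "UP $url ($code)"
  else
    echo "DOWN $url ($code)"
    return 1
  fi
}
```

After sourcing the functions, `check_route https://domain.example.com` prints a single status line and can be chained with `&&` in a deployment script.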
### Remove a Route
```bash
# 1. Edit configuration and comment out or remove route
sudo nano /path/to/pangolin/config.yml
# 2. Reload Pangolin
sudo systemctl reload pangolin
# 3. Verify route is removed (the request should now fail or return an error status)
curl -I https://domain.example.com
```
### View Pangolin Logs
```bash
# Real-time logs
sudo tail -f /var/log/pangolin/access.log
sudo tail -f /var/log/pangolin/error.log
# Search for specific domain
grep "domain.example.com" /var/log/pangolin/access.log
# Show the last 100 lines of the error log
sudo tail -n 100 /var/log/pangolin/error.log
```
### Restart Pangolin Service
```bash
# Check status
sudo systemctl status pangolin
# Restart
sudo systemctl restart pangolin
# Verify it's running
sudo systemctl is-active pangolin
```
---
## Gerbil Tunnel Management
### Check Active Tunnels
```bash
# On VPS - check listening Gerbil server (sudo lets ss show the owning process)
sudo ss -tlnp | grep gerbil
# On home lab - check active tunnel connections
gerbil status
# OR (pgrep does not match itself, unlike ps aux | grep)
pgrep -af gerbil
```
### Start a Tunnel
```bash
# On home lab machine
gerbil connect --name tunnel-name \
  --local localhost:PORT \
  --remote VPS_IP:REMOTE_PORT \
  --auth-key /path/to/auth.key
# OR start it as a systemd service
sudo systemctl start gerbil-tunnel-name
```
### Stop a Tunnel
```bash
# If running as service
sudo systemctl stop gerbil-tunnel-name
# If running manually
pkill -f "gerbil.*tunnel-name"
```
### Restart a Tunnel
```bash
sudo systemctl restart gerbil-tunnel-name
# Verify tunnel is active
gerbil status tunnel-name
# OR
ss -tn | grep REMOTE_PORT
```
### Debug Tunnel Connection Issues
```bash
# 1. Check if local service is running
curl http://localhost:LOCAL_PORT
# 2. Check if tunnel process is running
pgrep -af gerbil
# 3. Check tunnel logs
journalctl -u gerbil-tunnel-name -n 50
# 4. Test VPS endpoint
# On VPS:
curl http://localhost:REMOTE_PORT
# 5. Check firewall on VPS
sudo ufw status
sudo iptables -L -n | grep REMOTE_PORT
```
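
The first three debugging steps above can be run in order so that the first failing layer is obvious. The sketch below is one possible arrangement, not a Gerbil feature: `run_check` and `diagnose_tunnel` are hypothetical names, and the unit name passed in is whatever your tunnel service is called. Note that `curl -f` treats HTTP error statuses as failures, so a local service that answers 404 on `/` will be reported as failing.

```bash
# Hypothetical helpers; the names are illustrative and not part of Gerbil.

# Run one named check, print PASS or FAIL, and propagate the result.
run_check() {
  local label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS $label"
  else
    echo "FAIL $label"
    return 1
  fi
}

# Walk the tunnel path from the local service outward; stop at the first failure.
diagnose_tunnel() {
  local local_port=$1 unit=$2
  run_check "local service on port $local_port" \
    curl -sf -m 5 "http://localhost:$local_port" || return 1
  run_check "gerbil process running" pgrep -f gerbil || return 1
  run_check "systemd unit $unit active" \
    systemctl is-active --quiet "$unit" || return 1
}
```

Example: `diagnose_tunnel 8080 gerbil-tunnel-name`. If every line prints PASS, continue with the VPS-side checks in steps 4 and 5.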
---
## Proxmox Operations
### Create a New VM
```bash
# Via Proxmox web UI: https://PROXMOX_IP:8006
# Via CLI on Proxmox node:
qm create VMID --name vm-name --memory 2048 --cores 2 --net0 virtio,bridge=vmbr0
# Attach disk
qm set VMID --scsi0 local-lvm:32
# Set boot order
qm set VMID --boot order=scsi0
# Start VM
qm start VMID
```
### Create a New Container (LXC)
```bash
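# Refresh the template index and list available system templates
pveam update
pveam available --section system
# Download a template (TEMPLATE_FILE is the exact file name printed by pveam available)
pveam download local TEMPLATE_FILE
# Create container from the downloaded template
pct create CTID local:vztmpl/TEMPLATE_FILE \
  --hostname ct-name \
  --memory 1024 \
  --cores 2 \
  --net0 name=eth0,bridge=vmbr0,ip=dhcp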
# Start container
pct start CTID
# Enter container
pct enter CTID
```
### Stop/Start VM or Container
```bash
# VM operations
qm stop VMID       # Hard stop (immediate, like cutting power)
qm start VMID      # Start
qm shutdown VMID   # Graceful shutdown (preferred over stop)
qm reboot VMID     # Reboot
qm status VMID     # Check status
# Container operations (same semantics as the VM commands)
pct stop CTID
pct start CTID
pct shutdown CTID
pct reboot CTID
pct status CTID
```
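
Because `qm stop` is a hard stop, a common pattern is to try a graceful shutdown first and force the stop only if the guest does not respond. The following is a minimal sketch of that pattern; `stop_vm_gracefully` is a hypothetical helper name and the 120-second default is an assumption, not a Proxmox recommendation.

```bash
# Hypothetical helper: try a clean shutdown, fall back to a hard stop.
# $1: VM ID, $2: seconds to wait for the guest (assumed default: 120)
stop_vm_gracefully() {
  local vmid=$1 timeout=${2:-120}
  if qm shutdown "$vmid" --timeout "$timeout"; then
    echo "VM $vmid shut down cleanly"
  else
    echo "VM $vmid did not shut down within ${timeout}s, forcing stop"
    qm stop "$vmid"
  fi
}
```

Usage on a Proxmox node: `stop_vm_gracefully VMID 60`.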
### Migrate VM Between Nodes
```bash
# Online migration (VM stays running)
qm migrate VMID target-node --online
# Offline migration
qm migrate VMID target-node
# Confirm the VM state after migration (run on the target node)
qm status VMID
```
### Check Resource Usage
```bash
# Overall cluster resources
pvesh get /cluster/resources
# Specific node resources
pvesh get /nodes/NODE_NAME/status
# VM resource usage
qm status VMID --verbose
# Storage usage
pvesm status
```
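
To turn the storage listing above into a quick alert, the output of `pvesm status` can be filtered for storages that are nearly full. This sketch assumes the usage percentage is the last column of each row, which may differ between versions, so check your own output first. `storage_over_threshold` is a hypothetical name and the 80 percent default is arbitrary.

```bash
# Hypothetical helper: print storages at or above a usage threshold.
# Reads `pvesm status` output on stdin; assumes the percentage is the last column.
# $1: threshold in percent (assumed default: 80)
storage_over_threshold() {
  awk -v limit="${1:-80}" '
    NR > 1 {
      pct = $NF
      sub(/%$/, "", pct)
      if (pct + 0 >= limit + 0) print $1, $NF
    }'
}
```

Usage: `pvesm status | storage_over_threshold 85` prints one line per storage that is at least 85 percent full and nothing otherwise.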
### Backup VM or Container
```bash
# Backup VM
vzdump VMID --storage STORAGE_NAME --mode snapshot
# Backup container
vzdump CTID --storage STORAGE_NAME
# List backups
pvesm list STORAGE_NAME
```
### Restore from Backup
```bash
# Restore VM
qmrestore /path/to/backup/vzdump-qemu-VMID.vma.zst VMID
# Restore container
pct restore CTID /path/to/backup/vzdump-lxc-CTID.tar.zst
```
---
## SSL/TLS Certificate Management
### Request New Let's Encrypt Certificate
```bash
# Install certbot if needed
sudo apt install certbot
# Request certificate (HTTP-01 challenge)
sudo certbot certonly --standalone -d domain.example.com
# Request wildcard certificate (DNS-01 challenge)
sudo certbot certonly --manual --preferred-challenges dns -d "*.example.com"
# Certificates are stored in: /etc/letsencrypt/live/domain.example.com/
```
### Renew Certificates
```bash
# Dry run to test renewal
sudo certbot renew --dry-run
# Renew all certificates
sudo certbot renew
# Renew specific certificate
sudo certbot renew --cert-name domain.example.com
# Set up auto-renewal (check if already configured)
sudo systemctl status certbot.timer
```
### Check Certificate Expiration
```bash
# Check local certificate
sudo certbot certificates
# Check remote certificate
echo | openssl s_client -servername domain.example.com -connect domain.example.com:443 2>/dev/null | openssl x509 -noout -dates
# List expiry dates for all certificates (look for any with fewer than 30 days left)
sudo certbot certificates | grep "Expiry Date"
```
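
The expiry checks above print dates that still have to be compared by eye. A small wrapper can convert the `notAfter=` line from `openssl x509 -noout -enddate` into days remaining and flag certificates below a threshold. This is a sketch under two assumptions: GNU `date` is available (for `date -d`), and the helper names `days_left` and `check_cert` are hypothetical.

```bash
# Hypothetical helpers; they require GNU date for `date -d`.

# Convert an openssl "notAfter=..." line into whole days remaining.
# $1: line printed by `openssl x509 -noout -enddate`
# $2: optional reference time in epoch seconds (defaults to now)
days_left() {
  local end_date=${1#notAfter=}
  local now=${2:-$(date +%s)}
  local end_epoch
  end_epoch=$(date -d "$end_date" +%s) || return 1
  echo $(( (end_epoch - now) / 86400 ))
}

# Warn when a local certificate file has fewer than THRESHOLD days left.
# $1: path to a PEM certificate, $2: threshold in days (assumed default: 30)
check_cert() {
  local cert=$1 threshold=${2:-30} line days
  line=$(openssl x509 -noout -enddate -in "$cert") || return 1
  days=$(days_left "$line") || return 1
  if [ "$days" -lt "$threshold" ]; then
    echo "WARN $cert expires in $days days"
  else
    echo "OK $cert expires in $days days"
  fi
}
```

Usage: `sudo bash -c 'source ./cert-check.sh; check_cert /etc/letsencrypt/live/domain.example.com/fullchain.pem'`, where `cert-check.sh` is a hypothetical file holding the functions.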
### Deploy Certificate to Service
```bash
# Copy certificate to service location
sudo cp /etc/letsencrypt/live/domain.example.com/fullchain.pem /path/to/service/cert.pem
sudo cp /etc/letsencrypt/live/domain.example.com/privkey.pem /path/to/service/key.pem
# Set permissions
sudo chmod 644 /path/to/service/cert.pem
sudo chmod 600 /path/to/service/key.pem
# Reload service
sudo systemctl reload service-name
```
---
## Network Troubleshooting
### Check Network Connectivity
```bash
# Ping test
ping -c 4 8.8.8.8
# DNS resolution
nslookup domain.example.com
dig domain.example.com
# Trace route
traceroute domain.example.com
mtr domain.example.com
```
### Check Open Ports
```bash
# Check listening ports
ss -tlnp
netstat -tlnp
# Check if specific port is open
ss -tlnp | grep :PORT
nc -zv localhost PORT
# Check firewall rules
sudo ufw status numbered
sudo iptables -L -n -v
```
### Test Service Availability
```bash
# HTTP/HTTPS test
curl -I https://domain.example.com
curl -v https://domain.example.com
# Test specific port
nc -zv host PORT
telnet host PORT
# Check service status
sudo systemctl status service-name
```
### Check Network Interface Status
```bash
# List all interfaces
ip addr show
ip link show
# Check interface statistics
ip -s link show eth0
# Restart interface (run both steps as one command when connected over this interface)
sudo ip link set eth0 down && sudo ip link set eth0 up
```
---
## Security Procedures
### Update SSH Key
```bash
# Generate new SSH key (written to ~/.ssh/new_key and ~/.ssh/new_key.pub)
ssh-keygen -t ed25519 -f ~/.ssh/new_key -C "description"
# Copy to server
ssh-copy-id -i ~/.ssh/new_key.pub user@server
# Test new key
ssh -i ~/.ssh/new_key user@server
# Update SSH config
nano ~/.ssh/config
```
### Review Failed Login Attempts
```bash
# Check auth logs (Debian/Ubuntu path and unit name; RHEL-family systems use the sshd unit)
sudo grep "Failed password" /var/log/auth.log
sudo journalctl -u ssh -n 100
# Check fail2ban status (if installed)
sudo fail2ban-client status sshd
```
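
A raw list of failed logins is hard to scan, so it can help to count attempts per source address. The sketch below assumes the standard OpenSSH log wording (`Failed password for ... from ADDRESS port ...`) and the Debian/Ubuntu log path as its default; `failed_logins_by_ip` is a hypothetical helper name.

```bash
# Hypothetical helper: count "Failed password" lines per source address.
# $1: log file to read (assumed default: /var/log/auth.log)
# Output: "<count> <address>" lines, most active source first.
failed_logins_by_ip() {
  grep "Failed password" "${1:-/var/log/auth.log}" \
    | grep -oE 'from [0-9a-fA-F.:]+' \
    | awk '{print $2}' \
    | sort | uniq -c | sort -rn \
    | awk '{print $1, $2}'
}
```

Reading `/var/log/auth.log` usually requires root, so either run the function in a root shell or point it at a copy of the log.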
### Update Firewall Rules
```bash
# Add new rule
sudo ufw allow PORT/tcp
sudo ufw allow from IP_ADDRESS to any port PORT
# Remove rule by specification
sudo ufw delete allow PORT/tcp
# OR remove rule by number
sudo ufw status numbered
sudo ufw delete NUMBER
# Reload firewall
sudo ufw reload
```
### Security Updates
```bash
# Check for updates
sudo apt update
sudo apt list --upgradable
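# Install all pending updates (apt upgrade is not limited to security fixes)
sudo apt upgrade -y
# Restart services that still use outdated libraries
sudo needrestart -r a
# Reboot if required, for example after a kernel update (Debian/Ubuntu flag file)
[ -f /var/run/reboot-required ] && sudo reboot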
```
---
## Backup Operations
### Manual Backup
```bash
# Backup specific VM/Container
vzdump VMID --storage STORAGE_NAME --mode snapshot --compress zstd
# Backup configuration files
tar -czf config-backup-$(date +%Y%m%d).tar.gz /etc/pangolin /etc/gerbil
# Backup to remote location
rsync -avz /path/to/data/ user@backup-server:/path/to/backup/
```
### Verify Backup
```bash
# List backup contents
tar -tzf backup.tar.gz | less
# Check backup integrity
tar -tzf backup.tar.gz > /dev/null && echo "OK" || echo "CORRUPTED"
# Check vzdump backup
cat /path/to/backup/vzdump-qemu-VMID.log
```
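
The integrity check above only proves the archive is readable at the moment it is tested. Recording a checksum at the same time makes it possible to detect later corruption, for example after copying the archive to a remote location. This is a minimal sketch; `verify_archive` is a hypothetical helper name and the `.sha256` suffix is just a convention.

```bash
# Hypothetical helper: test a gzip-compressed tar archive and record its checksum.
# On success, writes ARCHIVE.sha256 next to the archive.
verify_archive() {
  local archive=$1
  if tar -tzf "$archive" >/dev/null 2>&1; then
    sha256sum "$archive" > "$archive.sha256"
    echo "OK $archive"
  else
    echo "CORRUPTED $archive"
    return 1
  fi
}
```

To re-check the archive later, run `sha256sum -c backup.tar.gz.sha256` from the directory in which `verify_archive backup.tar.gz` was originally run, since the checksum file stores the path exactly as it was given.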
### Restore Specific Files
```bash
# Extract specific file from backup
tar -xzf backup.tar.gz path/to/specific/file
# Restore from rsync backup
rsync -avz user@backup-server:/path/to/backup/ /path/to/restore/
```
---
## Emergency Contacts
- Infrastructure Owner: _______________
- Network Administrator: _______________
- VPS Provider Support: _______________
- DNS Provider Support: _______________
## Additional Resources
- Pangolin Documentation: _______________
- Gerbil Documentation: _______________
- Proxmox Documentation: https://pve.proxmox.com/pve-docs/
- Internal Wiki: _______________