Initial infrastructure documentation - comprehensive homelab reference

infrastructure/RUNBOOK.md

# Infrastructure Runbook

This runbook provides step-by-step procedures for common operational tasks in this homelab infrastructure.

## Table of Contents

- [Pangolin Reverse Proxy Operations](#pangolin-reverse-proxy-operations)
- [Gerbil Tunnel Management](#gerbil-tunnel-management)
- [Proxmox Operations](#proxmox-operations)
- [SSL/TLS Certificate Management](#ssltls-certificate-management)
- [Network Troubleshooting](#network-troubleshooting)
- [Security Procedures](#security-procedures)
- [Backup Operations](#backup-operations)
- [Emergency Contacts](#emergency-contacts)
- [Additional Resources](#additional-resources)

---

## Pangolin Reverse Proxy Operations

### Add a New Route

```bash
# 1. SSH into the VPS
ssh user@your-vps-ip

# 2. Edit the Pangolin configuration
sudo nano /path/to/pangolin/config.yml

# 3. Add the new route configuration
#    domain.example.com -> backend:port

# 4. Test the configuration
sudo pangolin config test

# 5. Reload Pangolin
sudo systemctl reload pangolin
# OR
sudo pangolin reload

# 6. Verify the route is active
curl -I https://domain.example.com
```
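
A reload can take a moment to take effect, so a single `curl` right after step 5 may fail even when the route is fine. The sketch below is one way to repeat the verification step a few times before giving up. The `retry` helper is a hypothetical name introduced here for illustration and is not part of Pangolin.

```bash
#!/usr/bin/env bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # Wait between attempts, but not after the final one
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}

# Example (not executed here): poll the new route up to 5 times, 2 s apart
# retry 5 2 curl -fsSI https://domain.example.com
```

The `-f` flag makes `curl` exit non-zero on HTTP errors, which is what lets `retry` tell a working route from an error page.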

### Remove a Route

```bash
# 1. Edit the configuration and comment out or remove the route
sudo nano /path/to/pangolin/config.yml

# 2. Reload Pangolin
sudo systemctl reload pangolin

# 3. Verify the route is removed (the backend should no longer respond)
curl -I https://domain.example.com
```

### View Pangolin Logs

```bash
# Real-time logs
sudo tail -f /var/log/pangolin/access.log
sudo tail -f /var/log/pangolin/error.log

# Search for a specific domain
sudo grep "domain.example.com" /var/log/pangolin/access.log

# Show the last 100 lines of the error log
sudo tail -n 100 /var/log/pangolin/error.log
```

### Restart Pangolin Service

```bash
# Check status
sudo systemctl status pangolin

# Restart
sudo systemctl restart pangolin

# Verify it's running
sudo systemctl is-active pangolin
```

---

## Gerbil Tunnel Management

### Check Active Tunnels

```bash
# On VPS - check the listening Gerbil server
# (sudo is needed for ss to show process names owned by other users)
sudo ss -tlnp | grep gerbil

# On home lab - check active tunnel connections
gerbil status
# OR (pgrep does not match itself, unlike `ps aux | grep`)
pgrep -af gerbil
```

### Start a Tunnel

```bash
# On home lab machine
gerbil connect --name tunnel-name \
  --local localhost:PORT \
  --remote VPS_IP:REMOTE_PORT \
  --auth-key /path/to/auth.key

# OR start as a systemd service
sudo systemctl start gerbil-tunnel-name
```

### Stop a Tunnel

```bash
# If running as a service
sudo systemctl stop gerbil-tunnel-name

# If running manually
pkill -f "gerbil.*tunnel-name"
```

### Restart a Tunnel

```bash
sudo systemctl restart gerbil-tunnel-name

# Verify the tunnel is active
gerbil status tunnel-name
# OR
ss -tn | grep REMOTE_PORT
```

### Debug Tunnel Connection Issues

```bash
# 1. Check if the local service is running
curl http://localhost:LOCAL_PORT

# 2. Check if the tunnel process is running
pgrep -af gerbil

# 3. Check tunnel logs
sudo journalctl -u gerbil-tunnel-name -n 50

# 4. Test the VPS endpoint
# On VPS:
curl http://localhost:REMOTE_PORT

# 5. Check the firewall on the VPS
sudo ufw status
sudo iptables -L -n | grep REMOTE_PORT
```
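
The steps above run from the local service outward, so the first failing step usually points at the broken layer. A minimal sketch of chaining them so the run stops at the first failure is shown below. The `check` helper is a hypothetical name introduced here, and the example commands reuse the placeholders from this section.

```bash
#!/usr/bin/env bash
# check LABEL COMMAND [ARGS...]
# Runs COMMAND quietly and prints "OK   LABEL" or "FAIL LABEL".
# Returns the success or failure of COMMAND so checks can be chained with &&.
check() {
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $label"
  else
    echo "FAIL $label"
    return 1
  fi
}

# Example (not executed here), run on the home lab machine:
# check "local service"  curl -fsS "http://localhost:LOCAL_PORT" &&
# check "tunnel process" pgrep -f gerbil &&
# check "tunnel unit"    systemctl is-active --quiet gerbil-tunnel-name
```

Because each line is joined with `&&`, the last line printed is either the final `OK` or the first `FAIL`.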

---

## Proxmox Operations

### Create a New VM

```bash
# Via Proxmox web UI: https://PROXMOX_IP:8006

# Via CLI on the Proxmox node:
qm create VMID --name vm-name --memory 2048 --cores 2 --net0 virtio,bridge=vmbr0

# Allocate and attach a 32 GB disk on the local-lvm storage
qm set VMID --scsi0 local-lvm:32

# Set boot order
qm set VMID --boot order=scsi0

# Start VM
qm start VMID
```

### Create a New Container (LXC)

```bash
# Refresh the template index and list available templates
pveam update
pveam available

# Download a template; TEMPLATE_FILE is the full file name listed by
# `pveam available` (for example, the ubuntu-22.04-standard entry)
pveam download local TEMPLATE_FILE

# Create container
pct create CTID local:vztmpl/TEMPLATE_FILE \
  --hostname ct-name \
  --memory 1024 \
  --cores 2 \
  --net0 name=eth0,bridge=vmbr0,ip=dhcp

# Start container
pct start CTID

# Enter container
pct enter CTID
```

### Stop/Start VM or Container

```bash
# VM operations
qm stop VMID       # Hard stop (immediate, like pulling the power)
qm start VMID      # Start
qm shutdown VMID   # Graceful shutdown
qm reboot VMID     # Reboot
qm status VMID     # Check status

# Container operations
pct stop CTID      # Hard stop
pct start CTID     # Start
pct shutdown CTID  # Graceful shutdown
pct reboot CTID    # Reboot
pct status CTID    # Check status
```

### Migrate VM Between Nodes

```bash
# Online migration (VM stays running)
qm migrate VMID target-node --online

# Offline migration
qm migrate VMID target-node

# Check migration status
qm status VMID
```

### Check Resource Usage

```bash
# Overall cluster resources
pvesh get /cluster/resources

# Specific node resources
pvesh get /nodes/NODE_NAME/status

# VM resource usage
qm status VMID --verbose

# Storage usage
pvesm status
```

### Backup VM or Container

```bash
# Backup VM
vzdump VMID --storage STORAGE_NAME --mode snapshot

# Backup container
vzdump CTID --storage STORAGE_NAME

# List backups
pvesm list STORAGE_NAME
```

### Restore from Backup

```bash
# Restore VM
qmrestore /path/to/backup/vzdump-qemu-VMID.vma.zst VMID

# Restore container
pct restore CTID /path/to/backup/vzdump-lxc-CTID.tar.zst
```

---

## SSL/TLS Certificate Management

### Request New Let's Encrypt Certificate

```bash
# Install certbot if needed
sudo apt install certbot

# Request certificate (HTTP-01 challenge; --standalone needs port 80 free)
sudo certbot certonly --standalone -d domain.example.com

# Request wildcard certificate (DNS-01 challenge)
sudo certbot certonly --manual --preferred-challenges dns -d "*.example.com"

# Certificates are stored in: /etc/letsencrypt/live/domain.example.com/
```

### Renew Certificates

```bash
# Dry run to test renewal
sudo certbot renew --dry-run

# Renew all certificates
sudo certbot renew

# Renew specific certificate
sudo certbot renew --cert-name domain.example.com

# Check that the auto-renewal timer is already configured and active
sudo systemctl status certbot.timer
```

### Check Certificate Expiration

```bash
# Check local certificates
sudo certbot certificates

# Check remote certificate
echo | openssl s_client -servername domain.example.com -connect domain.example.com:443 2>/dev/null | openssl x509 -noout -dates

# List the expiry date of every local certificate
# (this does not filter; look for entries with less than 30 days left)
sudo certbot certificates | grep "Expiry Date"
```
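
The `grep` above lists every expiry date without filtering. To actually flag certificates inside a 30-day window, the `notAfter` date printed by `openssl x509` can be turned into a day count. The sketch below assumes GNU `date` (for `-d`), and `days_left` is a hypothetical helper name introduced here.

```bash
#!/usr/bin/env bash
# days_left NOT_AFTER [NOW_EPOCH]
# Prints the whole days between NOW_EPOCH (default: current time) and the
# NOT_AFTER date string, e.g. "Jan 31 00:00:00 2030 GMT" as printed by openssl.
# Assumes GNU date; the result is negative if the date is already past.
days_left() {
  local not_after="$1"
  local now="${2:-$(date +%s)}"
  local end
  end="$(date -u -d "$not_after" +%s)" || return 1
  echo $(( (end - now) / 86400 ))
}

# Example (not executed here): warn when fewer than 30 days remain
# end="$(sudo openssl x509 -enddate -noout \
#   -in /etc/letsencrypt/live/domain.example.com/cert.pem | cut -d= -f2)"
# [ "$(days_left "$end")" -lt 30 ] && echo "Renew soon"
```

The optional second argument exists only so the calculation can be checked against a fixed point in time.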

### Deploy Certificate to Service

```bash
# Copy certificate to service location
sudo cp /etc/letsencrypt/live/domain.example.com/fullchain.pem /path/to/service/cert.pem
sudo cp /etc/letsencrypt/live/domain.example.com/privkey.pem /path/to/service/key.pem

# Set permissions
sudo chmod 644 /path/to/service/cert.pem
sudo chmod 600 /path/to/service/key.pem

# Reload service
sudo systemctl reload service-name
```

---

## Network Troubleshooting

### Check Network Connectivity

```bash
# Ping test
ping -c 4 8.8.8.8

# DNS resolution
nslookup domain.example.com
dig domain.example.com

# Trace route
traceroute domain.example.com
mtr domain.example.com
```

### Check Open Ports

```bash
# Check listening ports (sudo shows process names for all users)
sudo ss -tlnp
sudo netstat -tlnp

# Check if a specific port is open
sudo ss -tlnp | grep :PORT
nc -zv localhost PORT

# Check firewall rules
sudo ufw status numbered
sudo iptables -L -n -v
```

### Test Service Availability

```bash
# HTTP/HTTPS test
curl -I https://domain.example.com
curl -v https://domain.example.com

# Test specific port
nc -zv host PORT
telnet host PORT

# Check service status
sudo systemctl status service-name
```

### Check Network Interface Status

```bash
# List all interfaces
ip addr show
ip link show

# Check interface statistics
ip -s link show eth0

# Restart interface
# Run both steps as one command: over SSH, a separate "down" would drop the
# session before "up" could be typed
sudo sh -c 'ip link set eth0 down && ip link set eth0 up'
```

---

## Security Procedures

### Update SSH Key

```bash
# Generate new SSH key (written to ~/.ssh/new_key and ~/.ssh/new_key.pub)
ssh-keygen -t ed25519 -C "description" -f ~/.ssh/new_key

# Copy to server
ssh-copy-id -i ~/.ssh/new_key.pub user@server

# Test new key
ssh -i ~/.ssh/new_key user@server

# Update SSH config
nano ~/.ssh/config
```

### Review Failed Login Attempts

```bash
# Check auth logs
sudo grep "Failed password" /var/log/auth.log
sudo journalctl -u ssh -n 100

# Check fail2ban status (if installed)
sudo fail2ban-client status sshd
```
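
A raw `grep` of failed logins gets long quickly. The sketch below groups the same lines by source address, which makes repeat offenders easy to spot. It assumes the usual OpenSSH log wording (`Failed password for ... from ADDRESS port ...`), and `failed_logins_by_ip` is a hypothetical helper name introduced here.

```bash
#!/usr/bin/env bash
# failed_logins_by_ip
# Reads sshd log lines on stdin and prints "COUNT ADDRESS" per source address,
# highest count first. Lines that are not "Failed password" entries are ignored.
failed_logins_by_ip() {
  sed -nE 's/.*Failed password for .* from ([^ ]+) port .*/\1/p' \
    | sort \
    | uniq -c \
    | sort -rn \
    | awk '{print $1, $2}'
}

# Example (not executed here):
# sudo cat /var/log/auth.log | failed_logins_by_ip | head -n 10
```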

### Update Firewall Rules

```bash
# Add new rule
sudo ufw allow PORT/tcp
sudo ufw allow from IP_ADDRESS to any port PORT

# Remove rule by specification
sudo ufw delete allow PORT/tcp

# OR remove rule by number (list the numbers first)
sudo ufw status numbered
sudo ufw delete NUMBER

# Reload firewall
sudo ufw reload
```

### Security Updates

```bash
# Check for updates
sudo apt update
sudo apt list --upgradable

# Install all pending updates (apt upgrade is not limited to security fixes)
sudo apt upgrade -y

# Restart services that still use outdated libraries
sudo needrestart -r a

# Check whether a reboot is required (for example, after a kernel update)
[ -f /var/run/reboot-required ] && echo "Reboot required"
```

---

## Backup Operations

### Manual Backup

```bash
# Backup specific VM/Container
vzdump VMID --storage STORAGE_NAME --mode snapshot --compress zstd

# Backup configuration files
tar -czf "config-backup-$(date +%Y%m%d).tar.gz" /etc/pangolin /etc/gerbil

# Backup to remote location
rsync -avz /path/to/data/ user@backup-server:/path/to/backup/
```

### Verify Backup

```bash
# List backup contents
tar -tzf backup.tar.gz | less

# Check backup integrity
tar -tzf backup.tar.gz > /dev/null && echo "OK" || echo "CORRUPTED"

# Check vzdump backup
cat /path/to/backup/vzdump-qemu-VMID.log
```
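
The one-line integrity check reports `CORRUPTED` for an archive that is simply missing or empty, which can hide a backup job that never ran. A small sketch that separates the two cases follows. `verify_archive` is a hypothetical helper name introduced here, and it covers gzip-compressed tar archives only.

```bash
#!/usr/bin/env bash
# verify_archive ARCHIVE
# Prints MISSING if the file does not exist or is empty, CORRUPTED if tar
# cannot list it, and OK otherwise. Returns non-zero unless the result is OK.
verify_archive() {
  local archive="$1"
  if [ ! -s "$archive" ]; then
    echo "MISSING"
    return 1
  fi
  if tar -tzf "$archive" >/dev/null 2>&1; then
    echo "OK"
  else
    echo "CORRUPTED"
    return 1
  fi
}

# Example (not executed here): check every config backup in a directory
# for f in /path/to/backup/config-backup-*.tar.gz; do
#   printf '%s: ' "$f"; verify_archive "$f"
# done
```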

### Restore Specific Files

```bash
# Extract specific file from backup
tar -xzf backup.tar.gz path/to/specific/file

# Restore from rsync backup
rsync -avz user@backup-server:/path/to/backup/ /path/to/restore/
```

---

## Emergency Contacts

- Infrastructure Owner: _______________
- Network Administrator: _______________
- VPS Provider Support: _______________
- DNS Provider Support: _______________

## Additional Resources

- Pangolin Documentation: _______________
- Gerbil Documentation: _______________
- Proxmox Documentation: https://pve.proxmox.com/pve-docs/
- Internal Wiki: _______________