Initial infrastructure documentation - comprehensive homelab reference
infrastructure/MONITORING.md
# Monitoring Setup Guide

This guide provides step-by-step instructions for monitoring the homelab infrastructure: basic uptime checks with Uptime Kuma, a full metrics stack with Prometheus and Grafana, log aggregation with Loki, and alerting.

## Table of Contents

- [Monitoring Strategy](#monitoring-strategy)
- [Quick Start: Uptime Kuma](#quick-start-uptime-kuma)
- [Comprehensive Stack: Prometheus + Grafana](#comprehensive-stack-prometheus--grafana)
- [Log Aggregation: Loki](#log-aggregation-loki)
- [Alerting Setup](#alerting-setup)
- [Dashboards](#dashboards)
- [Maintenance](#maintenance)
- [Monitoring Checklist](#monitoring-checklist)
- [Cost Considerations](#cost-considerations)

---

## Monitoring Strategy

### What to Monitor

**Infrastructure Level**:
- VPS: CPU, RAM, disk, network
- Proxmox nodes: CPU, RAM, disk, network
- OMV storage: Disk usage, SMART status
- Network: Bandwidth, connectivity

**Service Level**:
- Service uptime and response time
- HTTP/HTTPS endpoints
- Gerbil tunnel status
- SSL certificate expiration
- Backup job success/failure

**Application Level**:
- Application-specific metrics
- Error rates
- Request rates
- Database performance

### Monitoring Tiers

| Tier | Solution | Complexity | Setup Time | Cost |
|------|----------|------------|------------|------|
| Basic | Uptime Kuma | Low | 30 min | Free |
| Intermediate | Prometheus + Grafana | Medium | 2-4 hours | Free |
| Advanced | Full observability stack | High | 8+ hours | Free/Paid |

---

## Quick Start: Uptime Kuma

**Best for**: Simple uptime monitoring with alerts

### Installation

```bash
# On a Proxmox container or VM
# Create container with Ubuntu/Debian

# Install Docker
curl -fsSL https://get.docker.com | sh

# Run Uptime Kuma
docker run -d --restart=always \
  -p 3001:3001 \
  -v uptime-kuma:/app/data \
  --name uptime-kuma \
  louislam/uptime-kuma:1

# Access at http://HOST_IP:3001
```

### Configuration

1. **Create Admin Account**
   - The first login creates the admin user

2. **Add Monitors**

   **HTTP(S) Monitor**:
   - Monitor Type: HTTP(s)
   - Friendly Name: Service Name
   - URL: https://service.example.com
   - Heartbeat Interval: 60 seconds
   - Retries: 3

   **Ping Monitor**:
   - Monitor Type: Ping
   - Hostname: VPS or node IP
   - Interval: 60 seconds

   **Port Monitor**:
   - Monitor Type: Port
   - Hostname: IP address
   - Port: Service port

3. **Set Up Notifications**
   - Settings → Notifications
   - Add notification method (email, Slack, Discord, etc.)
   - Test notification

4. **Create Status Page** (Optional)
   - Status Pages → Add Status Page
   - Add monitors to display
   - Make public or private

### Monitors to Create

- [ ] VPS SSH (Port 22 or custom)
- [ ] VPS HTTP/HTTPS (Port 80/443)
- [ ] Each public service endpoint
- [ ] Proxmox web interface (each node)
- [ ] OMV web interface
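
A Port monitor boils down to checking whether a TCP connection to a host and port can be opened. Before adding the monitors above, the same check can be run by hand to confirm that each endpoint is reachable from the monitoring host. The sketch below is one way to do that, assuming `bash` (for its `/dev/tcp` pseudo-device) and coreutils `timeout` are installed. `check_port` is a hypothetical helper name and `HOST_IP` a placeholder; neither is part of Uptime Kuma.

```bash
# check_port HOST PORT [TIMEOUT_SECONDS]
# Hypothetical helper: returns success if a TCP connection can be opened.
# Relies on bash's /dev/tcp pseudo-device and coreutils `timeout`.
check_port() {
  timeout "${3:-3}" bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Examples (replace HOST_IP with a real address):
#   check_port HOST_IP 22  && echo "SSH reachable"
#   check_port HOST_IP 443 && echo "HTTPS reachable"
```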

---

## Comprehensive Stack: Prometheus + Grafana

**Best for**: Detailed metrics, trending, and advanced alerting

### Architecture

```
┌─────────────┐     ┌───────────┐     ┌─────────┐
│  Exporters  │────▶│ Prometheus│────▶│ Grafana │
│  (Metrics)  │     │ (Storage) │     │  (UI)   │
└─────────────┘     └───────────┘     └─────────┘
                          │
                          ▼
                   ┌─────────────┐
                   │Alertmanager │
                   └─────────────┘
```

### Installation (Docker Compose)

```bash
# Create monitoring directory
mkdir -p ~/monitoring
cd ~/monitoring

# Create docker-compose.yml
cat > docker-compose.yml <<'EOF'
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      # Change this password before exposing Grafana
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_INSTALL_PLUGINS=grafana-piechart-panel

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    command:
      - '--path.rootfs=/host'
    volumes:
      - '/:/host:ro,rslave'

volumes:
  prometheus-data:
  grafana-data:
EOF

# Create Prometheus config
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          instance: 'monitoring-host'

  # Add more exporters here
  # - job_name: 'proxmox'
  #   static_configs:
  #     - targets: ['proxmox-node-1:9221']
EOF

# Start services (Compose v2 plugin; on legacy installs use docker-compose)
docker compose up -d
```
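
The `--storage.tsdb.retention.time=30d` flag largely determines how much disk the `prometheus-data` volume will need. The Prometheus storage documentation gives a rule of thumb: needed disk space is roughly the retention time in seconds, times the ingested samples per second, times the bytes per sample, with about 1 to 2 bytes per sample on average. A rough sketch of that arithmetic follows. `estimate_tsdb_bytes` is a hypothetical helper, and the figure of 3000 active series is an assumed value for illustration, not a measurement from this setup.

```bash
# estimate_tsdb_bytes SERIES SCRAPE_INTERVAL_S RETENTION_DAYS [BYTES_PER_SAMPLE]
# Rough estimate: retention_seconds * samples_per_second * bytes_per_sample,
# where samples_per_second = SERIES / SCRAPE_INTERVAL_S.
# BYTES_PER_SAMPLE defaults to 2 (the upper end of the 1-2 byte rule of thumb).
estimate_tsdb_bytes() {
  echo $(( $1 * $3 * 86400 * ${4:-2} / $2 ))
}

# Assumed example: 3000 active series, 15s scrape interval, 30d retention
estimate_tsdb_bytes 3000 15 30   # prints 1036800000 (about 1 GB)
```

On a running instance, the actual number of active series can be read with the PromQL query `prometheus_tsdb_head_series`.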

### Access

- **Prometheus**: http://HOST_IP:9090
- **Grafana**: http://HOST_IP:3000 (admin/changeme)

### Configure Grafana

1. **Add Prometheus Data Source**
   - Configuration → Data Sources → Add data source
   - Select Prometheus
   - URL: http://prometheus:9090
   - Click "Save & Test"

2. **Import Dashboards**
   - Dashboard → Import
   - Import these popular dashboards:
     - 1860: Node Exporter Full
     - 10180: Proxmox via Prometheus
     - 763: Disk I/O performance

### Install Exporters

**Node Exporter** (on each host to monitor):
```bash
# Download
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

# Create systemd service
# (sudo tee is used because with "sudo cat > file" the redirect runs as the calling user)
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

# Verify
curl http://localhost:9100/metrics
```
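
The `/metrics` endpoint returns plain text in the Prometheus exposition format: one sample per line as `name value`, plus `#` comment lines carrying help and type information. The full output is long, so for a quick sanity check it can help to pull out a single value. A minimal sketch, assuming a label-free metric such as `node_load1`; `metric_value` is a hypothetical helper name:

```bash
# metric_value NAME
# Reads exposition-format text on stdin and prints the value of the first
# sample whose name is exactly NAME (samples with labels are not matched).
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# Example against a running exporter:
#   curl -s http://localhost:9100/metrics | metric_value node_load1
```

An empty result means the exporter answered but does not expose that metric name.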

**Proxmox VE Exporter**:
```bash
# On Proxmox node
# Note: the project is also published on PyPI as prometheus-pve-exporter;
# if this release asset is unavailable, install it with pip instead.
wget https://github.com/prometheus-pve/prometheus-pve-exporter/releases/latest/download/pve_exporter
chmod +x pve_exporter
sudo mv pve_exporter /usr/local/bin/

# Create config (the quoted 'EOF' keeps any $ in the password literal)
sudo mkdir -p /etc/prometheus
sudo tee /etc/prometheus/pve.yml > /dev/null <<'EOF'
default:
  user: monitoring@pve
  password: your_password
  verify_ssl: false
EOF

# Create systemd service
# Depending on the exporter version, the config path may need to be passed
# as a flag instead of a positional argument; check `pve_exporter --help`.
sudo tee /etc/systemd/system/pve_exporter.service > /dev/null <<'EOF'
[Unit]
Description=Proxmox VE Exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/pve_exporter /etc/prometheus/pve.yml

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable pve_exporter
sudo systemctl start pve_exporter
```

**Blackbox Exporter** (for HTTP/HTTPS probing):
```yaml
# Add to docker-compose.yml (under services:)
  blackbox:
    image: prom/blackbox-exporter:latest
    container_name: blackbox-exporter
    restart: unless-stopped
    ports:
      - "9115:9115"
    volumes:
      - ./blackbox.yml:/etc/blackbox_exporter/config.yml
```

```yaml
# blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []  # Empty list defaults to 2xx
      method: GET
  tcp_connect:
    prober: tcp
    timeout: 5s
```

### Add Scrape Targets

Add to `prometheus.yml` (under `scrape_configs:`):
```yaml
  # VPS Node Exporter
  - job_name: 'vps'
    static_configs:
      - targets: ['VPS_IP:9100']

  # Proxmox Nodes
  # (pve_exporter serves the Proxmox metrics on /pve, not /metrics)
  - job_name: 'proxmox'
    metrics_path: /pve
    params:
      module: [default]
    static_configs:
      - targets: ['PROXMOX_IP_1:9221', 'PROXMOX_IP_2:9221']

  # HTTP Endpoints
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://service1.example.com
          - https://service2.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox:9115
```
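
The `relabel_configs` in the blackbox job rewrite each target so that Prometheus scrapes the exporter rather than the site itself: the original address becomes the `target` query parameter, is copied to the `instance` label, and `__address__` is replaced by `blackbox:9115`. When a probe misbehaves, the resulting request can be reproduced by hand. The sketch below builds that URL. `probe_url` is a hypothetical helper, and the URL is left unencoded for readability (Prometheus URL-encodes the parameter itself).

```bash
# probe_url TARGET [MODULE] [EXPORTER_ADDRESS]
# Prints the URL that the blackbox job requests for one target.
probe_url() {
  echo "http://${3:-blackbox:9115}/probe?module=${2:-http_2xx}&target=$1"
}

probe_url https://service1.example.com
# prints http://blackbox:9115/probe?module=http_2xx&target=https://service1.example.com

# From the Docker host, where port 9115 is published:
#   curl -s "$(probe_url https://service1.example.com http_2xx localhost:9115)" | grep '^probe_success'
```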

---

## Log Aggregation: Loki

**Best for**: Centralized logging from all services

### Installation

Add to `docker-compose.yml` (the services under `services:`, the volume under the existing top-level `volumes:`):
```yaml
  loki:
    image: grafana/loki:latest
    container_name: loki
    restart: unless-stopped
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/local-config.yaml
      - loki-data:/loki

  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    restart: unless-stopped
    volumes:
      - ./promtail-config.yml:/etc/promtail/config.yml
      - /var/log:/var/log
    command: -config.file=/etc/promtail/config.yml

volumes:
  loki-data:
```
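
The Compose snippet mounts `./promtail-config.yml`, so that file has to exist before the containers start. A minimal sketch of such a file, modelled on Promtail's stock example configuration, is shown below. It ships everything matching `/var/log/*log` to the `loki` service and labels it `job="varlogs"`. The positions path and the `system` job name are assumptions and can be changed freely.

```bash
# Create a minimal Promtail config next to docker-compose.yml
cat > promtail-config.yml <<'EOF'
server:
  http_listen_port: 9080
  grpc_listen_port: 0

# Where Promtail records how far it has read each file
positions:
  filename: /tmp/positions.yaml

# Push endpoint of the loki service on the Compose network
clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*log
EOF
```

The `job: varlogs` label is what a `{job="varlogs"}` selector matches in Grafana Explore. `./loki-config.yml` must exist as well; its contents depend on the Loki version, so the example configuration shipped with the release in use is the safest starting point.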

### Configure Loki in Grafana

1. Configuration → Data Sources → Add data source
2. Select Loki
3. URL: http://loki:3100
4. Save & Test

### Query Logs

In Grafana Explore:
```logql
# All logs
{job="varlogs"}

# Filter by service
{job="varlogs"} |= "pangolin"

# Error logs
{job="varlogs"} |= "error"
```

---

## Alerting Setup

### Prometheus Alerting Rules

Create `alerts.yml`. Prometheus only loads it if the file is mounted into the container and listed under `rule_files` in `prometheus.yml`:
```yaml
groups:
  - name: infrastructure
    interval: 30s
    rules:
      # Node down
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} has been down for more than 5 minutes"

      # High CPU
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage is above 80% for 10 minutes"

      # High Memory
      - alert: HighMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low memory on {{ $labels.instance }}"
          description: "Available memory is below 10%"

      # Disk Space
      - alert: LowDiskSpace
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is below 10% on {{ $labels.mountpoint }}"

      # SSL Certificate Expiring
      - alert: SSLCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expiring soon"
          description: "Certificate for {{ $labels.instance }} expires in less than 30 days"
```
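
The `SSLCertExpiringSoon` expression works on Unix timestamps: `probe_ssl_earliest_cert_expiry` is the expiry time in seconds since the epoch, `time()` is the current time, and `86400 * 30` is 30 days expressed in seconds. The same arithmetic can be sketched in shell with a hypothetical `cert_days_left` helper:

```bash
# cert_days_left EXPIRY_EPOCH [NOW_EPOCH]
# Whole days between NOW_EPOCH (default: current time) and the expiry timestamp.
cert_days_left() {
  echo $(( ($1 - ${2:-$(date +%s)}) / 86400 ))
}

# Fixed timestamps so the result is reproducible:
cert_days_left 1702592000 1700000000   # prints 30
```

With exactly 30 days left the expression is not yet true. The alert becomes pending once the remaining time drops below that, and fires after the condition has held for the `for: 1h` window.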

### Alertmanager Configuration

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

receivers:
  - name: 'default'
    email_configs:
      - to: 'your-email@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'username'
        auth_password: 'password'

  # Slack
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
```
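
Neither the rules nor the Alertmanager configuration take effect until Prometheus is told about them. The sketch below appends the two standard top-level sections to `prometheus.yml`: `rule_files` to load `alerts.yml`, and `alerting` to point at an Alertmanager. It assumes that `alerts.yml` is mounted into the Prometheus container at `/etc/prometheus/alerts.yml` and that an Alertmanager container is reachable as `alertmanager:9093` on the Compose network. Neither is part of the `docker-compose.yml` from the installation step, so both would have to be added there.

```bash
# Append rule loading and Alertmanager wiring to prometheus.yml (run once).
# Assumed: alerts.yml is mounted at /etc/prometheus/alerts.yml, and an
# Alertmanager service named "alertmanager" listens on its default port 9093.
cat >> prometheus.yml <<'EOF'

rule_files:
  - /etc/prometheus/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
EOF

# Afterwards restart Prometheus so the change is picked up:
#   docker restart prometheus
```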

---

## Dashboards

### Essential Dashboards

1. **Infrastructure Overview**
   - All nodes status
   - Overall resource utilization
   - Service uptime

2. **VPS Dashboard**
   - CPU, RAM, disk, network
   - Running services
   - Firewall connections

3. **Proxmox Cluster**
   - Cluster health
   - VM/container count and status
   - Resource allocation vs usage

4. **Storage**
   - Disk space trends
   - I/O performance
   - SMART status

5. **Services**
   - Uptime percentage
   - Response times
   - Error rates

6. **Tunnels**
   - Gerbil tunnel status
   - Connection count
   - Bandwidth usage

### Creating Custom Dashboard

1. Grafana → Create → Dashboard
2. Add Panel → Select visualization
3. Write PromQL query
4. Configure thresholds and alerts
5. Save dashboard

---

## Maintenance

### Regular Tasks

**Daily**:
- Review alerts
- Check dashboard for anomalies

**Weekly**:
- Review resource trends
- Check for unused monitors
- Update dashboards

**Monthly**:
- Review and tune alert thresholds
- Clean up old metrics
- Update monitoring stack
- Test alerting

**Quarterly**:
- Review monitoring strategy
- Evaluate new monitoring tools
- Update documentation

### Troubleshooting

**Prometheus not scraping**:
```bash
# Check targets (JSON from the HTTP API; the web UI shows the same
# information at http://localhost:9090/targets)
curl -s http://localhost:9090/api/v1/targets

# Check Prometheus logs
docker logs prometheus
```
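
The `/api/v1/targets` endpoint of the Prometheus HTTP API reports a `health` field (`up`, `down`, or `unknown`) for every active target. A rough way to summarise it without extra tools is sketched below. `summarize_target_health` is a hypothetical helper, and the `grep` pattern assumes the compact JSON that the API returns; where `jq` is installed, a proper JSON query would be more robust.

```bash
# summarize_target_health
# Reads the JSON from /api/v1/targets on stdin and prints one
# "state=count" line per health state found.
summarize_target_health() {
  grep -o '"health":"[a-z]*"' | cut -d'"' -f4 | sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# Example:
#   curl -s http://localhost:9090/api/v1/targets | summarize_target_health
```

Any `down` count points to targets whose `lastError` field in the same response explains why the scrape failed.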

**Grafana dashboard empty**:
- Verify data source connection
- Check time range
- Verify metrics exist in Prometheus

**No alerts firing**:
- Check alerting rules syntax
- Verify Alertmanager connection
- Test alert evaluation

---

## Monitoring Checklist

### Initial Setup
- [ ] Choose monitoring tier (Basic/Intermediate/Advanced)
- [ ] Deploy monitoring stack
- [ ] Install exporters on all hosts
- [ ] Configure Grafana data sources
- [ ] Import/create dashboards
- [ ] Set up alerting
- [ ] Configure notification channels
- [ ] Test alerts

### Monitors to Configure
- [ ] VPS uptime and resources
- [ ] Proxmox node resources
- [ ] OMV storage capacity
- [ ] All public HTTP(S) endpoints
- [ ] Gerbil tunnel status
- [ ] SSL certificate expiration
- [ ] Backup job success
- [ ] Network connectivity
- [ ] Service-specific metrics

### Alerts to Configure
- [ ] Service down (>5 min)
- [ ] High CPU (>80% for 10 min)
- [ ] High memory (>90% for 5 min)
- [ ] Low disk space (<10%)
- [ ] SSL cert expiring (<30 days)
- [ ] Backup failure
- [ ] Tunnel disconnected

---

## Cost Considerations

### Free Tier Options
- **Uptime Kuma**: Fully free, self-hosted
- **Prometheus + Grafana**: Free, self-hosted
- **Grafana Cloud**: Free tier available (limited)

### Paid Options (if needed)
- **Datadog**: $15/host/month
- **New Relic**: $99/month+
- **Better Uptime**: $10/month+

**Recommendation**: Start with free self-hosted tools, upgrade only if needed.

---

**Last Updated**: _____________
**Next Review**: _____________
**Version**: 1.0