656 lines
15 KiB
Markdown
656 lines
15 KiB
Markdown
# Monitoring Setup Guide
|
|
|
|
This guide provides step-by-step instructions for setting up monitoring for your infrastructure.
|
|
|
|
## Table of Contents
|
|
- [Monitoring Strategy](#monitoring-strategy)
|
|
- [Quick Start: Uptime Kuma](#quick-start-uptime-kuma)
|
|
- [Comprehensive Stack: Prometheus + Grafana](#comprehensive-stack-prometheus--grafana)
|
|
- [Log Aggregation: Loki](#log-aggregation-loki)
|
|
- [Alerting Setup](#alerting-setup)
|
|
- [Dashboards](#dashboards)
|
|
- [Maintenance](#maintenance)
|
|
|
|
---
|
|
|
|
## Monitoring Strategy
|
|
|
|
### What to Monitor
|
|
|
|
**Infrastructure Level**:
|
|
- VPS: CPU, RAM, disk, network
|
|
- Proxmox nodes: CPU, RAM, disk, network
|
|
- OMV storage: Disk usage, SMART status
|
|
- Network: Bandwidth, connectivity
|
|
|
|
**Service Level**:
|
|
- Service uptime and response time
|
|
- HTTP/HTTPS endpoints
|
|
- Gerbil tunnel status
|
|
- SSL certificate expiration
|
|
- Backup job success/failure
|
|
|
|
**Application Level**:
|
|
- Application-specific metrics
|
|
- Error rates
|
|
- Request rates
|
|
- Database performance
|
|
|
|
### Monitoring Tiers
|
|
|
|
| Tier | Solution | Complexity | Setup Time | Cost |
|
|
|------|----------|------------|------------|------|
|
|
| Basic | Uptime Kuma | Low | 30 min | Free |
|
|
| Intermediate | Prometheus + Grafana | Medium | 2-4 hours | Free |
|
|
| Advanced | Full observability stack | High | 8+ hours | Free/Paid |
|
|
|
|
---
|
|
|
|
## Quick Start: Uptime Kuma
|
|
|
|
**Best for**: Simple uptime monitoring with alerts
|
|
|
|
### Installation
|
|
|
|
```bash
|
|
# On a Proxmox container or VM
|
|
# Create container with Ubuntu/Debian
|
|
|
|
# Install Docker
|
|
curl -fsSL https://get.docker.com | sh
|
|
|
|
# Run Uptime Kuma
|
|
docker run -d --restart=always \
|
|
-p 3001:3001 \
|
|
-v uptime-kuma:/app/data \
|
|
--name uptime-kuma \
|
|
louislam/uptime-kuma:1
|
|
|
|
# Access at http://HOST_IP:3001
|
|
```
|
|
|
|
### Configuration
|
|
|
|
1. **Create Admin Account**
|
|
- First login creates admin user
|
|
|
|
2. **Add Monitors**
|
|
|
|
**HTTP(S) Monitor**:
|
|
- Monitor Type: HTTP(s)
|
|
- Friendly Name: Service Name
|
|
- URL: https://service.example.com
|
|
- Heartbeat Interval: 60 seconds
|
|
- Retries: 3
|
|
|
|
**Ping Monitor**:
|
|
- Monitor Type: Ping
|
|
- Hostname: VPS or node IP
|
|
- Interval: 60 seconds
|
|
|
|
**Port Monitor**:
|
|
- Monitor Type: Port
|
|
- Hostname: IP address
|
|
- Port: Service port
|
|
|
|
3. **Set Up Notifications**
|
|
- Settings → Notifications
|
|
- Add notification method (email, Slack, Discord, etc.)
|
|
- Test notification
|
|
|
|
4. **Create Status Page** (Optional)
|
|
- Status Pages → Add Status Page
|
|
- Add monitors to display
|
|
- Make public or private
|
|
|
|
### Monitors to Create
|
|
|
|
- [ ] VPS SSH (Port 22 or custom)
|
|
- [ ] VPS HTTP/HTTPS (Port 80/443)
|
|
- [ ] Each public service endpoint
|
|
- [ ] Proxmox web interface (each node)
|
|
- [ ] OMV web interface
|
|
|
|
---
|
|
|
|
## Comprehensive Stack: Prometheus + Grafana
|
|
|
|
**Best for**: Detailed metrics, trending, and advanced alerting
|
|
|
|
### Architecture
|
|
|
|
```
|
|
┌─────────────┐ ┌───────────┐ ┌─────────┐
|
|
│ Exporters │────▶│ Prometheus│────▶│ Grafana │
|
|
│ (Metrics) │ │ (Storage) │ │ (UI) │
|
|
└─────────────┘ └───────────┘ └─────────┘
|
|
│
|
|
▼
|
|
┌─────────────┐
|
|
│Alertmanager │
|
|
└─────────────┘
|
|
```
|
|
|
|
### Installation (Docker Compose)
|
|
|
|
```bash
|
|
# Create monitoring directory
|
|
mkdir -p ~/monitoring
|
|
cd ~/monitoring
|
|
|
|
# Create docker-compose.yml
|
|
cat > docker-compose.yml <<'EOF'
|
|
version: '3.8'
|
|
|
|
services:
|
|
prometheus:
|
|
image: prom/prometheus:latest
|
|
container_name: prometheus
|
|
restart: unless-stopped
|
|
ports:
|
|
- "9090:9090"
|
|
volumes:
|
|
- ./prometheus.yml:/etc/prometheus/prometheus.yml
|
|
- prometheus-data:/prometheus
|
|
command:
|
|
- '--config.file=/etc/prometheus/prometheus.yml'
|
|
- '--storage.tsdb.retention.time=30d'
|
|
|
|
grafana:
|
|
image: grafana/grafana:latest
|
|
container_name: grafana
|
|
restart: unless-stopped
|
|
ports:
|
|
- "3000:3000"
|
|
volumes:
|
|
- grafana-data:/var/lib/grafana
|
|
environment:
|
|
- GF_SECURITY_ADMIN_PASSWORD=changeme
|
|
- GF_INSTALL_PLUGINS=grafana-piechart-panel
|
|
|
|
node-exporter:
|
|
image: prom/node-exporter:latest
|
|
container_name: node-exporter
|
|
restart: unless-stopped
|
|
ports:
|
|
- "9100:9100"
|
|
command:
|
|
- '--path.rootfs=/host'
|
|
volumes:
|
|
- '/:/host:ro,rslave'
|
|
|
|
volumes:
|
|
prometheus-data:
|
|
grafana-data:
|
|
EOF
|
|
|
|
# Create Prometheus config
|
|
cat > prometheus.yml <<'EOF'
|
|
global:
|
|
scrape_interval: 15s
|
|
evaluation_interval: 15s
|
|
|
|
scrape_configs:
|
|
# Prometheus itself
|
|
- job_name: 'prometheus'
|
|
static_configs:
|
|
- targets: ['localhost:9090']
|
|
|
|
# Node exporter (system metrics)
|
|
- job_name: 'node'
|
|
static_configs:
|
|
- targets: ['node-exporter:9100']
|
|
labels:
|
|
instance: 'monitoring-host'
|
|
|
|
# Add more exporters here
|
|
# - job_name: 'proxmox'
|
|
# static_configs:
|
|
# - targets: ['proxmox-node-1:9221']
|
|
EOF
|
|
|
|
# Start services
|
|
docker-compose up -d
|
|
```
|
|
|
|
### Access
|
|
|
|
- **Prometheus**: http://HOST_IP:9090
|
|
- **Grafana**: http://HOST_IP:3000 (admin/changeme)
|
|
|
|
### Configure Grafana
|
|
|
|
1. **Add Prometheus Data Source**
|
|
- Configuration → Data Sources → Add data source
|
|
- Select Prometheus
|
|
- URL: http://prometheus:9090
|
|
- Click "Save & Test"
|
|
|
|
2. **Import Dashboards**
|
|
- Dashboard → Import
|
|
- Import these popular dashboards:
|
|
- 1860: Node Exporter Full
|
|
- 10180: Proxmox via Prometheus
|
|
- 763: Disk I/O performance
|
|
|
|
### Install Exporters
|
|
|
|
**Node Exporter** (on each host to monitor):
|
|
```bash
|
|
# Download
|
|
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
|
|
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
|
|
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
|
|
|
|
# Create systemd service
|
|
sudo cat > /etc/systemd/system/node_exporter.service <<'EOF'
|
|
[Unit]
|
|
Description=Node Exporter
|
|
After=network.target
|
|
|
|
[Service]
|
|
Type=simple
|
|
ExecStart=/usr/local/bin/node_exporter
|
|
|
|
[Install]
|
|
WantedBy=multi-user.target
|
|
EOF
|
|
|
|
sudo systemctl daemon-reload
|
|
sudo systemctl enable node_exporter
|
|
sudo systemctl start node_exporter
|
|
|
|
# Verify
|
|
curl http://localhost:9100/metrics
|
|
```
|
|
|
|
**Proxmox VE Exporter**:
|
|
```bash
|
|
# On Proxmox node
|
|
wget https://github.com/prometheus-pve/prometheus-pve-exporter/releases/latest/download/pve_exporter
|
|
chmod +x pve_exporter
|
|
sudo mv pve_exporter /usr/local/bin/
|
|
|
|
# Create config
|
|
sudo mkdir -p /etc/prometheus
|
|
sudo cat > /etc/prometheus/pve.yml <<EOF
|
|
default:
|
|
user: monitoring@pve
|
|
password: your_password
|
|
verify_ssl: false
|
|
EOF
|
|
|
|
# Create systemd service
|
|
sudo cat > /etc/systemd/system/pve_exporter.service <<'EOF'
|
|
[Unit]
|
|
Description=Proxmox VE Exporter
|
|
After=network.target
|
|
|
|
[Service]
|
|
Type=simple
|
|
ExecStart=/usr/local/bin/pve_exporter /etc/prometheus/pve.yml
|
|
|
|
[Install]
|
|
WantedBy=multi-user.target
|
|
EOF
|
|
|
|
sudo systemctl daemon-reload
|
|
sudo systemctl enable pve_exporter
|
|
sudo systemctl start pve_exporter
|
|
```
|
|
|
|
**Blackbox Exporter** (for HTTP/HTTPS probing):
|
|
```bash
|
|
# Add to docker-compose.yml
|
|
blackbox:
|
|
image: prom/blackbox-exporter:latest
|
|
container_name: blackbox-exporter
|
|
restart: unless-stopped
|
|
ports:
|
|
- "9115:9115"
|
|
volumes:
|
|
- ./blackbox.yml:/etc/blackbox_exporter/config.yml
|
|
```
|
|
|
|
```yaml
|
|
# blackbox.yml
|
|
modules:
|
|
http_2xx:
|
|
prober: http
|
|
timeout: 5s
|
|
http:
|
|
valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
|
|
valid_status_codes: []
|
|
method: GET
|
|
tcp_connect:
|
|
prober: tcp
|
|
timeout: 5s
|
|
```
|
|
|
|
### Add Scrape Targets
|
|
|
|
Add to `prometheus.yml`:
|
|
```yaml
|
|
# VPS Node Exporter
|
|
- job_name: 'vps'
|
|
static_configs:
|
|
- targets: ['VPS_IP:9100']
|
|
|
|
# Proxmox Nodes
|
|
- job_name: 'proxmox'
|
|
static_configs:
|
|
- targets: ['PROXMOX_IP_1:9221', 'PROXMOX_IP_2:9221']
|
|
|
|
# HTTP Endpoints
|
|
- job_name: 'blackbox'
|
|
metrics_path: /probe
|
|
params:
|
|
module: [http_2xx]
|
|
static_configs:
|
|
- targets:
|
|
- https://service1.example.com
|
|
- https://service2.example.com
|
|
relabel_configs:
|
|
- source_labels: [__address__]
|
|
target_label: __param_target
|
|
- source_labels: [__param_target]
|
|
target_label: instance
|
|
- target_label: __address__
|
|
replacement: blackbox:9115
|
|
```
|
|
|
|
---
|
|
|
|
## Log Aggregation: Loki
|
|
|
|
**Best for**: Centralized logging from all services
|
|
|
|
### Installation
|
|
|
|
Add to `docker-compose.yml`:
|
|
```yaml
|
|
loki:
|
|
image: grafana/loki:latest
|
|
container_name: loki
|
|
restart: unless-stopped
|
|
ports:
|
|
- "3100:3100"
|
|
volumes:
|
|
- ./loki-config.yml:/etc/loki/local-config.yaml
|
|
- loki-data:/loki
|
|
|
|
promtail:
|
|
image: grafana/promtail:latest
|
|
container_name: promtail
|
|
restart: unless-stopped
|
|
volumes:
|
|
- ./promtail-config.yml:/etc/promtail/config.yml
|
|
- /var/log:/var/log
|
|
command: -config.file=/etc/promtail/config.yml
|
|
|
|
volumes:
|
|
loki-data:
|
|
```
|
|
|
|
### Configure Loki in Grafana
|
|
|
|
1. Configuration → Data Sources → Add data source
|
|
2. Select Loki
|
|
3. URL: http://loki:3100
|
|
4. Save & Test
|
|
|
|
### Query Logs
|
|
|
|
In Grafana Explore:
|
|
```logql
|
|
# All logs
|
|
{job="varlogs"}
|
|
|
|
# Filter by service
|
|
{job="varlogs"} |= "pangolin"
|
|
|
|
# Error logs
|
|
{job="varlogs"} |= "error"
|
|
```
|
|
|
|
---
|
|
|
|
## Alerting Setup
|
|
|
|
### Prometheus Alerting Rules
|
|
|
|
Create `alerts.yml`:
|
|
```yaml
|
|
groups:
|
|
- name: infrastructure
|
|
interval: 30s
|
|
rules:
|
|
# Node down
|
|
- alert: InstanceDown
|
|
expr: up == 0
|
|
for: 5m
|
|
labels:
|
|
severity: critical
|
|
annotations:
|
|
summary: "Instance {{ $labels.instance }} down"
|
|
description: "{{ $labels.instance }} has been down for more than 5 minutes"
|
|
|
|
# High CPU
|
|
- alert: HighCPU
|
|
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
|
|
for: 10m
|
|
labels:
|
|
severity: warning
|
|
annotations:
|
|
summary: "High CPU on {{ $labels.instance }}"
|
|
description: "CPU usage is above 80% for 10 minutes"
|
|
|
|
# High Memory
|
|
- alert: HighMemory
|
|
expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
|
|
for: 5m
|
|
labels:
|
|
severity: warning
|
|
annotations:
|
|
summary: "Low memory on {{ $labels.instance }}"
|
|
description: "Available memory is below 10%"
|
|
|
|
# Disk Space
|
|
- alert: LowDiskSpace
|
|
expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
|
|
for: 5m
|
|
labels:
|
|
severity: critical
|
|
annotations:
|
|
summary: "Low disk space on {{ $labels.instance }}"
|
|
description: "Disk space is below 10% on {{ $labels.mountpoint }}"
|
|
|
|
# SSL Certificate Expiring
|
|
- alert: SSLCertExpiringSoon
|
|
expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
|
|
for: 1h
|
|
labels:
|
|
severity: warning
|
|
annotations:
|
|
summary: "SSL certificate expiring soon"
|
|
description: "Certificate for {{ $labels.instance }} expires in less than 30 days"
|
|
```
|
|
|
|
### Alertmanager Configuration
|
|
|
|
```yaml
|
|
# alertmanager.yml
|
|
global:
|
|
resolve_timeout: 5m
|
|
|
|
route:
|
|
group_by: ['alertname', 'cluster']
|
|
group_wait: 10s
|
|
group_interval: 10s
|
|
repeat_interval: 12h
|
|
receiver: 'default'
|
|
|
|
receivers:
|
|
- name: 'default'
|
|
email_configs:
|
|
- to: 'your-email@example.com'
|
|
from: 'alertmanager@example.com'
|
|
smarthost: 'smtp.example.com:587'
|
|
auth_username: 'username'
|
|
auth_password: 'password'
|
|
|
|
# Slack
|
|
- name: 'slack'
|
|
slack_configs:
|
|
- api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
|
|
channel: '#alerts'
|
|
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
|
|
```
|
|
|
|
---
|
|
|
|
## Dashboards
|
|
|
|
### Essential Dashboards
|
|
|
|
1. **Infrastructure Overview**
|
|
- All nodes status
|
|
- Overall resource utilization
|
|
- Service uptime
|
|
|
|
2. **VPS Dashboard**
|
|
- CPU, RAM, disk, network
|
|
- Running services
|
|
- Firewall connections
|
|
|
|
3. **Proxmox Cluster**
|
|
- Cluster health
|
|
- VM/container count and status
|
|
- Resource allocation vs usage
|
|
|
|
4. **Storage**
|
|
- Disk space trends
|
|
- I/O performance
|
|
- SMART status
|
|
|
|
5. **Services**
|
|
- Uptime percentage
|
|
- Response times
|
|
- Error rates
|
|
|
|
6. **Tunnels**
|
|
- Gerbil tunnel status
|
|
- Connection count
|
|
- Bandwidth usage
|
|
|
|
### Creating Custom Dashboard
|
|
|
|
1. Grafana → Create → Dashboard
|
|
2. Add Panel → Select visualization
|
|
3. Write PromQL query
|
|
4. Configure thresholds and alerts
|
|
5. Save dashboard
|
|
|
|
---
|
|
|
|
## Maintenance
|
|
|
|
### Regular Tasks
|
|
|
|
**Daily**:
|
|
- Review alerts
|
|
- Check dashboard for anomalies
|
|
|
|
**Weekly**:
|
|
- Review resource trends
|
|
- Check for unused monitors
|
|
- Update dashboards
|
|
|
|
**Monthly**:
|
|
- Review and tune alert thresholds
|
|
- Clean up old metrics
|
|
- Update monitoring stack
|
|
- Test alerting
|
|
|
|
**Quarterly**:
|
|
- Review monitoring strategy
|
|
- Evaluate new monitoring tools
|
|
- Update documentation
|
|
|
|
### Troubleshooting
|
|
|
|
**Prometheus not scraping**:
|
|
```bash
|
|
# Check targets
|
|
curl http://localhost:9090/targets
|
|
|
|
# Check Prometheus logs
|
|
docker logs prometheus
|
|
```
|
|
|
|
**Grafana dashboard empty**:
|
|
- Verify data source connection
|
|
- Check time range
|
|
- Verify metrics exist in Prometheus
|
|
|
|
**No alerts firing**:
|
|
- Check alerting rules syntax
|
|
- Verify Alertmanager connection
|
|
- Test alert evaluation
|
|
|
|
---
|
|
|
|
## Monitoring Checklist
|
|
|
|
### Initial Setup
|
|
- [ ] Choose monitoring tier (Basic/Intermediate/Advanced)
|
|
- [ ] Deploy monitoring stack
|
|
- [ ] Install exporters on all hosts
|
|
- [ ] Configure Grafana data sources
|
|
- [ ] Import/create dashboards
|
|
- [ ] Set up alerting
|
|
- [ ] Configure notification channels
|
|
- [ ] Test alerts
|
|
|
|
### Monitors to Configure
|
|
- [ ] VPS uptime and resources
|
|
- [ ] Proxmox node resources
|
|
- [ ] OMV storage capacity
|
|
- [ ] All public HTTP(S) endpoints
|
|
- [ ] Gerbil tunnel status
|
|
- [ ] SSL certificate expiration
|
|
- [ ] Backup job success
|
|
- [ ] Network connectivity
|
|
- [ ] Service-specific metrics
|
|
|
|
### Alerts to Configure
|
|
- [ ] Service down (>5 min)
|
|
- [ ] High CPU (>80% for 10 min)
|
|
- [ ] High memory (>90% for 5 min)
|
|
- [ ] Low disk space (<10%)
|
|
- [ ] SSL cert expiring (<30 days)
|
|
- [ ] Backup failure
|
|
- [ ] Tunnel disconnected
|
|
|
|
---
|
|
|
|
## Cost Considerations
|
|
|
|
### Free Tier Options
|
|
- **Uptime Kuma**: Fully free, self-hosted
|
|
- **Prometheus + Grafana**: Free, self-hosted
|
|
- **Grafana Cloud**: Free tier available (limited)
|
|
|
|
### Paid Options (if needed)
|
|
- **Datadog**: $15/host/month
|
|
- **New Relic**: $99/month+
|
|
- **Better Uptime**: $10/month+
|
|
|
|
**Recommendation**: Start with free self-hosted tools, upgrade only if needed.
|
|
|
|
---
|
|
|
|
**Last Updated**: _____________
|
|
**Next Review**: _____________
|
|
**Version**: 1.0
|