Monitoring Setup Guide
This guide provides step-by-step instructions for setting up monitoring for your infrastructure.
Table of Contents
- Monitoring Strategy
- Quick Start: Uptime Kuma
- Comprehensive Stack: Prometheus + Grafana
- Log Aggregation: Loki
- Alerting Setup
- Dashboards
- Maintenance
Monitoring Strategy
What to Monitor
Infrastructure Level:
- VPS: CPU, RAM, disk, network
- Proxmox nodes: CPU, RAM, disk, network
- OMV storage: Disk usage, SMART status
- Network: Bandwidth, connectivity
Service Level:
- Service uptime and response time
- HTTP/HTTPS endpoints
- Gerbil tunnel status
- SSL certificate expiration
- Backup job success/failure
Application Level:
- Application-specific metrics
- Error rates
- Request rates
- Database performance
Monitoring Tiers
| Tier | Solution | Complexity | Setup Time | Cost |
|---|---|---|---|---|
| Basic | Uptime Kuma | Low | 30 min | Free |
| Intermediate | Prometheus + Grafana | Medium | 2-4 hours | Free |
| Advanced | Full observability stack | High | 8+ hours | Free/Paid |
Quick Start: Uptime Kuma
Best for: Simple uptime monitoring with alerts
Installation
```bash
# On a Proxmox container or VM
# Create a container with Ubuntu/Debian

# Install Docker
curl -fsSL https://get.docker.com | sh

# Run Uptime Kuma
docker run -d --restart=always \
  -p 3001:3001 \
  -v uptime-kuma:/app/data \
  --name uptime-kuma \
  louislam/uptime-kuma:1

# Access at http://HOST_IP:3001
```
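If you prefer Compose over `docker run`, an equivalent service definition (same image, port, and named volume as above):

```yaml
services:
  uptime-kuma:
    image: louislam/uptime-kuma:1
    container_name: uptime-kuma
    restart: always
    ports:
      - "3001:3001"
    volumes:
      - uptime-kuma:/app/data

volumes:
  uptime-kuma:
```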
Configuration
1. Create Admin Account
   - The first login creates the admin user
2. Add Monitors
   HTTP(S) Monitor:
   - Monitor Type: HTTP(s)
   - Friendly Name: Service Name
   - URL: https://service.example.com
   - Heartbeat Interval: 60 seconds
   - Retries: 3
   Ping Monitor:
   - Monitor Type: Ping
   - Hostname: VPS or node IP
   - Interval: 60 seconds
   Port Monitor:
   - Monitor Type: Port
   - Hostname: IP address
   - Port: Service port
3. Set Up Notifications
   - Settings → Notifications
   - Add a notification method (email, Slack, Discord, etc.)
   - Send a test notification
4. Create Status Page (Optional)
   - Status Pages → Add Status Page
   - Add monitors to display
   - Make it public or private
Monitors to Create
- VPS SSH (Port 22 or custom)
- VPS HTTP/HTTPS (Port 80/443)
- Each public service endpoint
- Proxmox web interface (each node)
- OMV web interface
Comprehensive Stack: Prometheus + Grafana
Best for: Detailed metrics, trending, and advanced alerting
Architecture
```
┌─────────────┐      ┌────────────┐      ┌─────────┐
│  Exporters  │─────▶│ Prometheus │─────▶│ Grafana │
│  (Metrics)  │      │ (Storage)  │      │  (UI)   │
└─────────────┘      └────────────┘      └─────────┘
                           │
                           ▼
                    ┌──────────────┐
                    │ Alertmanager │
                    └──────────────┘
```
Installation (Docker Compose)
```bash
# Create monitoring directory
mkdir -p ~/monitoring
cd ~/monitoring

# Create docker-compose.yml
cat > docker-compose.yml <<'EOF'
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_INSTALL_PLUGINS=grafana-piechart-panel

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    command:
      - '--path.rootfs=/host'
    volumes:
      - '/:/host:ro,rslave'

volumes:
  prometheus-data:
  grafana-data:
EOF

# Create Prometheus config
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          instance: 'monitoring-host'

  # Add more exporters here
  # - job_name: 'proxmox'
  #   static_configs:
  #     - targets: ['proxmox-node-1:9221']
EOF

# Start services
docker-compose up -d
```
Access
- Prometheus: http://HOST_IP:9090
- Grafana: http://HOST_IP:3000 (admin/changeme)
Configure Grafana
1. Add Prometheus Data Source
   - Configuration → Data Sources → Add data source
   - Select Prometheus
   - URL: http://prometheus:9090
   - Click "Save & Test"
2. Import Dashboards
   - Dashboards → Import
   - Import these popular dashboards by ID:
     - 1860: Node Exporter Full
     - 10180: Proxmox via Prometheus
     - 763: Disk I/O performance
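Instead of clicking through the UI, the data source can also be declared as a file using Grafana's provisioning format; a sketch (mount it into the Grafana container under the provisioning directory):

```yaml
# prometheus-datasource.yml
# Mount at /etc/grafana/provisioning/datasources/prometheus-datasource.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

Grafana reads this directory at startup, so the data source survives container re-creation without manual setup.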
Install Exporters
Node Exporter (on each host to monitor):
```bash
# Download
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

# Create systemd service. Note: `sudo cat > file` does not work, because the
# redirection runs in the unprivileged shell -- use `sudo tee` instead.
sudo tee /etc/systemd/system/node_exporter.service >/dev/null <<'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

# Verify
curl http://localhost:9100/metrics
```
Proxmox VE Exporter:
```bash
# On the Proxmox node (as root). prometheus-pve-exporter is distributed as a
# Python package, not a prebuilt binary, so install it with pip. On Debian 12+
# you may need a virtualenv or pipx instead of a system-wide pip install.
apt update && apt install -y python3-pip
pip3 install prometheus-pve-exporter

# Create config. A dedicated monitoring@pve user with the PVEAuditor role is
# safer than root credentials.
mkdir -p /etc/prometheus
cat > /etc/prometheus/pve.yml <<EOF
default:
  user: monitoring@pve
  password: your_password
  verify_ssl: false
EOF
chmod 600 /etc/prometheus/pve.yml

# Create systemd service
cat > /etc/systemd/system/pve_exporter.service <<'EOF'
[Unit]
Description=Proxmox VE Exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/pve_exporter --config.file /etc/prometheus/pve.yml

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable pve_exporter
systemctl start pve_exporter
```
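The PVE exporter serves metrics on port 9221 under /pve and takes the node to query as a `target` URL parameter, so its scrape job typically uses the same relabeling pattern as the Blackbox exporter. A sketch (addresses are placeholders; check your installed exporter version's README for the exact path and parameters):

```yaml
  - job_name: 'pve'
    metrics_path: /pve
    params:
      module: [default]
    static_configs:
      - targets: ['PROXMOX_IP_1']        # Proxmox node to query
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: PROXMOX_IP_1:9221   # where pve_exporter listens
```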
Blackbox Exporter (for HTTP/HTTPS probing):
```yaml
# Add to docker-compose.yml under services:
  blackbox:
    image: prom/blackbox-exporter:latest
    container_name: blackbox-exporter
    restart: unless-stopped
    ports:
      - "9115:9115"
    volumes:
      - ./blackbox.yml:/etc/blackbox_exporter/config.yml
```

```yaml
# blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []
      method: GET
  tcp_connect:
    prober: tcp
    timeout: 5s
```
Add Scrape Targets
Add to prometheus.yml under scrape_configs:
```yaml
  # VPS Node Exporter
  - job_name: 'vps'
    static_configs:
      - targets: ['VPS_IP:9100']

  # Proxmox Nodes
  - job_name: 'proxmox'
    static_configs:
      - targets: ['PROXMOX_IP_1:9221', 'PROXMOX_IP_2:9221']

  # HTTP Endpoints (probed through the Blackbox exporter)
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://service1.example.com
          - https://service2.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox:9115
```
Log Aggregation: Loki
Best for: Centralized logging from all services
Installation
Add to docker-compose.yml under services:
```yaml
  loki:
    image: grafana/loki:latest
    container_name: loki
    restart: unless-stopped
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/local-config.yaml
      - loki-data:/loki

  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    restart: unless-stopped
    volumes:
      - ./promtail-config.yml:/etc/promtail/config.yml
      - /var/log:/var/log
    command: -config.file=/etc/promtail/config.yml
```
and add the volume:
```yaml
volumes:
  loki-data:
```
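The Compose snippet references a promtail-config.yml that is not shown. A minimal sketch that ships /var/log/*log to Loki with the `job=varlogs` label used by the example queries (for loki-config.yml, the defaults shipped in the image at /etc/loki/local-config.yaml are a reasonable starting point):

```yaml
# promtail-config.yml -- minimal sketch; assumes the loki service above
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/*log
```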
Configure Loki in Grafana
- Configuration → Data Sources → Add data source
- Select Loki
- URL: http://loki:3100
- Save & Test
Query Logs
In Grafana Explore:
```logql
# All logs
{job="varlogs"}

# Filter by service
{job="varlogs"} |= "pangolin"

# Error logs
{job="varlogs"} |= "error"
```
Alerting Setup
Prometheus Alerting Rules
Create alerts.yml:
```yaml
groups:
  - name: infrastructure
    interval: 30s
    rules:
      # Node down
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} has been down for more than 5 minutes"

      # High CPU
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage is above 80% for 10 minutes"

      # High memory usage (available memory below 10%)
      - alert: HighMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low memory on {{ $labels.instance }}"
          description: "Available memory is below 10%"

      # Disk space
      - alert: LowDiskSpace
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is below 10% on {{ $labels.mountpoint }}"

      # SSL certificate expiring
      - alert: SSLCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expiring soon"
          description: "Certificate for {{ $labels.instance }} expires in less than 30 days"
```
Alertmanager Configuration
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

receivers:
  # Email
  - name: 'default'
    email_configs:
      - to: 'your-email@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'username'
        auth_password: 'password'

  # Slack
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
```
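Alertmanager is not part of the Compose file shown earlier, and Prometheus only loads rule files it is told about, so both need wiring up. A sketch of the missing glue (service name and mount paths are assumptions; adjust to your layout):

```yaml
# docker-compose.yml -- add under services:
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
```

Then point Prometheus at the rules and at Alertmanager (and mount ./alerts.yml into the Prometheus container alongside prometheus.yml):

```yaml
# prometheus.yml additions
rule_files:
  - /etc/prometheus/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```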
Dashboards
Essential Dashboards
1. Infrastructure Overview
   - All nodes status
   - Overall resource utilization
   - Service uptime
2. VPS Dashboard
   - CPU, RAM, disk, network
   - Running services
   - Firewall connections
3. Proxmox Cluster
   - Cluster health
   - VM/container count and status
   - Resource allocation vs usage
4. Storage
   - Disk space trends
   - I/O performance
   - SMART status
5. Services
   - Uptime percentage
   - Response times
   - Error rates
6. Tunnels
   - Gerbil tunnel status
   - Connection count
   - Bandwidth usage
Creating a Custom Dashboard
1. Grafana → Create → Dashboard
2. Add Panel → Select visualization
3. Write PromQL query
4. Configure thresholds and alerts
5. Save dashboard
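Some starter PromQL queries for custom panels, built on the same node-exporter metrics as the alert rules above:

```promql
# CPU usage %, per instance
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory used %
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Root filesystem used %
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100

# Network receive throughput (bytes/s)
rate(node_network_receive_bytes_total[5m])
```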
Maintenance
Regular Tasks
Daily:
- Review alerts
- Check dashboard for anomalies
Weekly:
- Review resource trends
- Check for unused monitors
- Update dashboards
Monthly:
- Review and tune alert thresholds
- Clean up old metrics
- Update monitoring stack
- Test alerting
Quarterly:
- Review monitoring strategy
- Evaluate new monitoring tools
- Update documentation
Troubleshooting
Prometheus not scraping:
```bash
# Check targets (the /targets web page is HTML; the HTTP API returns JSON)
curl http://localhost:9090/api/v1/targets

# Check Prometheus logs
docker logs prometheus
```
Grafana dashboard empty:
- Verify data source connection
- Check time range
- Verify metrics exist in Prometheus
No alerts firing:
- Check alerting rules syntax
- Verify Alertmanager connection
- Test alert evaluation
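For rule syntax, `promtool check rules alerts.yml` (shipped in the prom/prometheus image) is the authoritative checker. A quick structural sanity check can also be scripted; a sketch (requires PyYAML; the embedded rules string is an example, not your real file):

```python
# Structural sanity check for a Prometheus alert-rule file: every rule
# must have an alert name and an expression. Requires PyYAML.
import yaml

RULES = """
groups:
  - name: infrastructure
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
"""

def check_rules(text):
    """Return a list of problems; an empty list means the structure looks OK."""
    doc = yaml.safe_load(text)
    problems = []
    for group in doc.get("groups", []):
        for rule in group.get("rules", []):
            for key in ("alert", "expr"):
                if key not in rule:
                    problems.append(f"rule missing '{key}': {rule}")
    return problems

print(check_rules(RULES))  # prints [] when every rule has a name and expr
```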
Monitoring Checklist
Initial Setup
- Choose monitoring tier (Basic/Intermediate/Advanced)
- Deploy monitoring stack
- Install exporters on all hosts
- Configure Grafana data sources
- Import/create dashboards
- Set up alerting
- Configure notification channels
- Test alerts
Monitors to Configure
- VPS uptime and resources
- Proxmox node resources
- OMV storage capacity
- All public HTTP(S) endpoints
- Gerbil tunnel status
- SSL certificate expiration
- Backup job success
- Network connectivity
- Service-specific metrics
Alerts to Configure
- Service down (>5 min)
- High CPU (>80% for 10 min)
- High memory (>90% for 5 min)
- Low disk space (<10%)
- SSL cert expiring (<30 days)
- Backup failure
- Tunnel disconnected
Cost Considerations
Free Tier Options
- Uptime Kuma: Fully free, self-hosted
- Prometheus + Grafana: Free, self-hosted
- Grafana Cloud: Free tier available (limited)
Paid Options (if needed)
- Datadog: $15/host/month
- New Relic: $99/month+
- Better Uptime: $10/month+
Recommendation: Start with free self-hosted tools, upgrade only if needed.
Last Updated: _____________ Next Review: _____________ Version: 1.0