Monitoring Setup Guide

This guide provides step-by-step instructions for setting up monitoring for your infrastructure.

Table of Contents

  • Monitoring Strategy
  • Quick Start: Uptime Kuma
  • Comprehensive Stack: Prometheus + Grafana
  • Log Aggregation: Loki
  • Alerting Setup
  • Dashboards
  • Maintenance
  • Troubleshooting
  • Monitoring Checklist
  • Cost Considerations

Monitoring Strategy

What to Monitor

Infrastructure Level:

  • VPS: CPU, RAM, disk, network
  • Proxmox nodes: CPU, RAM, disk, network
  • OMV storage: Disk usage, SMART status
  • Network: Bandwidth, connectivity

Service Level:

  • Service uptime and response time
  • HTTP/HTTPS endpoints
  • Gerbil tunnel status
  • SSL certificate expiration
  • Backup job success/failure

Application Level:

  • Application-specific metrics
  • Error rates
  • Request rates
  • Database performance

Monitoring Tiers

| Tier         | Solution                 | Complexity | Setup Time | Cost      |
|--------------|--------------------------|------------|------------|-----------|
| Basic        | Uptime Kuma              | Low        | 30 min     | Free      |
| Intermediate | Prometheus + Grafana     | Medium     | 2-4 hours  | Free      |
| Advanced     | Full observability stack | High       | 8+ hours   | Free/Paid |

Quick Start: Uptime Kuma

Best for: Simple uptime monitoring with alerts

Installation

# On a Proxmox container or VM
# Create container with Ubuntu/Debian

# Install Docker
curl -fsSL https://get.docker.com | sh

# Run Uptime Kuma
docker run -d --restart=always \
  -p 3001:3001 \
  -v uptime-kuma:/app/data \
  --name uptime-kuma \
  louislam/uptime-kuma:1

# Access at http://HOST_IP:3001
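
If you prefer to manage Uptime Kuma with Docker Compose (as the Prometheus stack later in this guide does), an equivalent compose file is sketched below; the image, port, and volume match the `docker run` command above:

```yaml
services:
  uptime-kuma:
    image: louislam/uptime-kuma:1
    container_name: uptime-kuma
    restart: always
    ports:
      - "3001:3001"
    volumes:
      - uptime-kuma:/app/data

volumes:
  uptime-kuma:
```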

Configuration

  1. Create Admin Account

    • First login creates admin user
  2. Add Monitors

    HTTP(S) Monitor:

    • Monitor Type: HTTP(s)
    • URL: Full service URL (e.g. https://service.example.com)
    • Interval: 60 seconds

    Ping Monitor:

    • Monitor Type: Ping
    • Hostname: VPS or node IP
    • Interval: 60 seconds

    Port Monitor:

    • Monitor Type: Port
    • Hostname: IP address
    • Port: Service port
  3. Set Up Notifications

    • Settings → Notifications
    • Add notification method (email, Slack, Discord, etc.)
    • Test notification
  4. Create Status Page (Optional)

    • Status Pages → Add Status Page
    • Add monitors to display
    • Make public or private

Monitors to Create

  • VPS SSH (Port 22 or custom)
  • VPS HTTP/HTTPS (Port 80/443)
  • Each public service endpoint
  • Proxmox web interface (each node)
  • OMV web interface

Comprehensive Stack: Prometheus + Grafana

Best for: Detailed metrics, trending, and advanced alerting

Architecture

┌─────────────┐     ┌───────────┐     ┌─────────┐
│  Exporters  │────▶│ Prometheus│────▶│ Grafana │
│  (Metrics)  │     │ (Storage) │     │  (UI)   │
└─────────────┘     └───────────┘     └─────────┘
                           │
                           ▼
                    ┌─────────────┐
                    │Alertmanager │
                    └─────────────┘

Installation (Docker Compose)

# Create monitoring directory
mkdir -p ~/monitoring
cd ~/monitoring

# Create docker-compose.yml
cat > docker-compose.yml <<'EOF'
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_INSTALL_PLUGINS=grafana-piechart-panel

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    command:
      - '--path.rootfs=/host'
    volumes:
      - '/:/host:ro,rslave'

volumes:
  prometheus-data:
  grafana-data:
EOF

# Create Prometheus config
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          instance: 'monitoring-host'

  # Add more exporters here
  # - job_name: 'proxmox'
  #   static_configs:
  #     - targets: ['proxmox-node-1:9221']
EOF

# Start services
docker-compose up -d

Access

  • Prometheus: http://HOST_IP:9090
  • Grafana: http://HOST_IP:3000 (log in as admin with the GF_SECURITY_ADMIN_PASSWORD set above, then change it)

Configure Grafana

  1. Add Prometheus Data Source

    • Configuration → Data Sources → Add data source
    • Select Prometheus
    • URL: http://prometheus:9090
    • Click "Save & Test"
  2. Import Dashboards

    • Dashboard → Import
    • Import these popular dashboards:
      • 1860: Node Exporter Full
      • 10180: Proxmox via Prometheus
      • 763: Disk I/O performance

Install Exporters

Node Exporter (on each host to monitor):

# Download
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service >/dev/null <<'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

# Verify
curl http://localhost:9100/metrics

Proxmox VE Exporter:

# On Proxmox node
# prometheus-pve-exporter is distributed as a Python package (PyPI),
# not a prebuilt binary
# (on PVE 8 / Debian 12 you may need pipx or a venv instead of pip3)
sudo apt install -y python3-pip
sudo pip3 install prometheus-pve-exporter

# Create a read-only API user for the exporter
pveum user add monitoring@pve -password 'your_password'
pveum aclmod / -user monitoring@pve -role PVEAuditor

# Create config
sudo mkdir -p /etc/prometheus
sudo tee /etc/prometheus/pve.yml >/dev/null <<EOF
default:
  user: monitoring@pve
  password: your_password
  verify_ssl: false
EOF
sudo chmod 600 /etc/prometheus/pve.yml  # file contains a password

# Create systemd service
sudo tee /etc/systemd/system/pve_exporter.service >/dev/null <<'EOF'
[Unit]
Description=Proxmox VE Exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/pve_exporter --config.file /etc/prometheus/pve.yml

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable pve_exporter
sudo systemctl start pve_exporter

Blackbox Exporter (for HTTP/HTTPS probing):

# Add to docker-compose.yml
  blackbox:
    image: prom/blackbox-exporter:latest
    container_name: blackbox-exporter
    restart: unless-stopped
    ports:
      - "9115:9115"
    volumes:
      - ./blackbox.yml:/etc/blackbox_exporter/config.yml

# blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []
      method: GET
  tcp_connect:
    prober: tcp
    timeout: 5s

Add Scrape Targets

Add to prometheus.yml:

  # VPS Node Exporter
  - job_name: 'vps'
    static_configs:
      - targets: ['VPS_IP:9100']

  # Proxmox Nodes
  - job_name: 'proxmox'
    static_configs:
      - targets: ['PROXMOX_IP_1:9221', 'PROXMOX_IP_2:9221']

  # HTTP Endpoints
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://service1.example.com
          - https://service2.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox:9115

Log Aggregation: Loki

Best for: Centralized logging from all services

Installation

Add to docker-compose.yml:

  loki:
    image: grafana/loki:latest
    container_name: loki
    restart: unless-stopped
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/local-config.yaml
      - loki-data:/loki

  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    restart: unless-stopped
    volumes:
      - ./promtail-config.yml:/etc/promtail/config.yml
      - /var/log:/var/log
    command: -config.file=/etc/promtail/config.yml

volumes:
  loki-data:
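
The compose snippet above mounts `./promtail-config.yml`, which this guide doesn't otherwise show. A minimal sketch that ships everything under /var/log to Loki with the `varlogs` job label (the label the example queries in this guide filter on):

```yaml
# promtail-config.yml — minimal sketch; adjust paths and labels to taste
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*log
```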

Configure Loki in Grafana

  1. Configuration → Data Sources → Add data source
  2. Select Loki
  3. URL: http://loki:3100
  4. Save & Test

Query Logs

In Grafana Explore:

# All logs
{job="varlogs"}

# Filter by service
{job="varlogs"} |= "pangolin"

# Error logs
{job="varlogs"} |= "error"

Alerting Setup

Prometheus Alerting Rules

Create alerts.yml:

groups:
  - name: infrastructure
    interval: 30s
    rules:
      # Node down
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} has been down for more than 5 minutes"

      # High CPU
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage is above 80% for 10 minutes"

      # High Memory
      - alert: HighMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low memory on {{ $labels.instance }}"
          description: "Available memory is below 10%"

      # Disk Space
      - alert: LowDiskSpace
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is below 10% on {{ $labels.mountpoint }}"

      # SSL Certificate Expiring
      - alert: SSLCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expiring soon"
          description: "Certificate for {{ $labels.instance }} expires in less than 30 days"
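
Prometheus only evaluates these rules if the file is listed under `rule_files` and mounted into the container. Assuming the docker-compose layout above and an Alertmanager container reachable as `alertmanager` (the service name is an assumption), the additions would look like:

```yaml
# prometheus.yml additions — load the rules and point at Alertmanager
rule_files:
  - /etc/prometheus/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# docker-compose.yml — mount the rules file into the prometheus service:
#   volumes:
#     - ./alerts.yml:/etc/prometheus/alerts.yml
```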

Alertmanager Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

receivers:
  - name: 'default'
    email_configs:
      - to: 'your-email@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'username'
        auth_password: 'password'

  # Slack
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
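
The docker-compose.yml above doesn't yet run Alertmanager itself; a sketch of the service to add (the container name matters, since Prometheus reaches it by DNS as `alertmanager:9093`):

```yaml
# Add to docker-compose.yml
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
```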

Dashboards

Essential Dashboards

  1. Infrastructure Overview

    • All nodes status
    • Overall resource utilization
    • Service uptime
  2. VPS Dashboard

    • CPU, RAM, disk, network
    • Running services
    • Firewall connections
  3. Proxmox Cluster

    • Cluster health
    • VM/container count and status
    • Resource allocation vs usage
  4. Storage

    • Disk space trends
    • I/O performance
    • SMART status
  5. Services

    • Uptime percentage
    • Response times
    • Error rates
  6. Tunnels

    • Gerbil tunnel status
    • Connection count
    • Bandwidth usage

Creating Custom Dashboard

  1. Grafana → Create → Dashboard
  2. Add Panel → Select visualization
  3. Write PromQL query
  4. Configure thresholds and alerts
  5. Save dashboard
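
For step 3, a few starter PromQL queries built from the node_exporter metrics already scraped above (the same expressions the alert rules use; adjust labels to your instances):

```promql
# CPU usage % per instance
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory used %
100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk used % per mountpoint
100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100
```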

Maintenance

Regular Tasks

Daily:

  • Review alerts
  • Check dashboard for anomalies

Weekly:

  • Review resource trends
  • Check for unused monitors
  • Update dashboards

Monthly:

  • Review and tune alert thresholds
  • Clean up old metrics
  • Update monitoring stack
  • Test alerting

Quarterly:

  • Review monitoring strategy
  • Evaluate new monitoring tools
  • Update documentation

Troubleshooting

Prometheus not scraping:

# Check targets (the API endpoint returns JSON; /targets is an HTML page)
curl http://localhost:9090/api/v1/targets

# Check Prometheus logs
docker logs prometheus

Grafana dashboard empty:

  • Verify data source connection
  • Check time range
  • Verify metrics exist in Prometheus

No alerts firing:

  • Check alerting rules syntax
  • Verify Alertmanager connection
  • Test alert evaluation

Monitoring Checklist

Initial Setup

  • Choose monitoring tier (Basic/Intermediate/Advanced)
  • Deploy monitoring stack
  • Install exporters on all hosts
  • Configure Grafana data sources
  • Import/create dashboards
  • Set up alerting
  • Configure notification channels
  • Test alerts

Monitors to Configure

  • VPS uptime and resources
  • Proxmox node resources
  • OMV storage capacity
  • All public HTTP(S) endpoints
  • Gerbil tunnel status
  • SSL certificate expiration
  • Backup job success
  • Network connectivity
  • Service-specific metrics

Alerts to Configure

  • Service down (>5 min)
  • High CPU (>80% for 10 min)
  • High memory (>90% for 5 min)
  • Low disk space (<10%)
  • SSL cert expiring (<30 days)
  • Backup failure
  • Tunnel disconnected

Cost Considerations

Free Tier Options

  • Uptime Kuma: Fully free, self-hosted
  • Prometheus + Grafana: Free, self-hosted
  • Grafana Cloud: Free tier available (limited)

Paid Options (if needed)

  • Datadog: $15/host/month
  • New Relic: $99/month+
  • Better Uptime: $10/month+

Recommendation: Start with free self-hosted tools, upgrade only if needed.


Last Updated: _____________ Next Review: _____________ Version: 1.0