# Monitoring Setup Guide

This guide provides step-by-step instructions for setting up monitoring for your infrastructure.

## Table of Contents

- [Monitoring Strategy](#monitoring-strategy)
- [Quick Start: Uptime Kuma](#quick-start-uptime-kuma)
- [Comprehensive Stack: Prometheus + Grafana](#comprehensive-stack-prometheus--grafana)
- [Log Aggregation: Loki](#log-aggregation-loki)
- [Alerting Setup](#alerting-setup)
- [Dashboards](#dashboards)
- [Maintenance](#maintenance)
- [Monitoring Checklist](#monitoring-checklist)
- [Cost Considerations](#cost-considerations)

---

## Monitoring Strategy

### What to Monitor

**Infrastructure Level**:
- VPS: CPU, RAM, disk, network
- Proxmox nodes: CPU, RAM, disk, network
- OMV storage: Disk usage, SMART status
- Network: Bandwidth, connectivity

**Service Level**:
- Service uptime and response time
- HTTP/HTTPS endpoints
- Gerbil tunnel status
- SSL certificate expiration
- Backup job success/failure

**Application Level**:
- Application-specific metrics
- Error rates
- Request rates
- Database performance

### Monitoring Tiers

| Tier | Solution | Complexity | Setup Time | Cost |
|------|----------|------------|------------|------|
| Basic | Uptime Kuma | Low | 30 min | Free |
| Intermediate | Prometheus + Grafana | Medium | 2-4 hours | Free |
| Advanced | Full observability stack | High | 8+ hours | Free/Paid |

---

## Quick Start: Uptime Kuma

**Best for**: Simple uptime monitoring with alerts

### Installation

```bash
# On a Proxmox container or VM
# Create container with Ubuntu/Debian

# Install Docker
curl -fsSL https://get.docker.com | sh

# Run Uptime Kuma
docker run -d --restart=always \
  -p 3001:3001 \
  -v uptime-kuma:/app/data \
  --name uptime-kuma \
  louislam/uptime-kuma:1

# Access at http://HOST_IP:3001
```

### Configuration

1. **Create Admin Account**
   - First login creates admin user

2. **Add Monitors**

   **HTTP(S) Monitor**:
   - Monitor Type: HTTP(s)
   - Friendly Name: Service Name
   - URL: https://service.example.com
   - Heartbeat Interval: 60 seconds
   - Retries: 3

   **Ping Monitor**:
   - Monitor Type: Ping
   - Hostname: VPS or node IP
   - Interval: 60 seconds

   **Port Monitor**:
   - Monitor Type: Port
   - Hostname: IP address
   - Port: Service port

3. **Set Up Notifications**
   - Settings → Notifications
   - Add notification method (email, Slack, Discord, etc.)
   - Test notification

4. **Create Status Page** (Optional)
   - Status Pages → Add Status Page
   - Add monitors to display
   - Make public or private

### Monitors to Create

- [ ] VPS SSH (Port 22 or custom)
- [ ] VPS HTTP/HTTPS (Port 80/443)
- [ ] Each public service endpoint
- [ ] Proxmox web interface (each node)
- [ ] OMV web interface

---

## Comprehensive Stack: Prometheus + Grafana

**Best for**: Detailed metrics, trending, and advanced alerting

### Architecture

```
┌─────────────┐     ┌───────────┐     ┌─────────┐
│  Exporters  │────▶│ Prometheus│────▶│ Grafana │
│  (Metrics)  │     │ (Storage) │     │  (UI)   │
└─────────────┘     └───────────┘     └─────────┘
                          │
                          ▼
                   ┌─────────────┐
                   │Alertmanager │
                   └─────────────┘
```

### Installation (Docker Compose)

```bash
# Create monitoring directory
mkdir -p ~/monitoring
cd ~/monitoring

# Create docker-compose.yml
cat > docker-compose.yml <<'EOF'
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_INSTALL_PLUGINS=grafana-piechart-panel

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    command:
      - '--path.rootfs=/host'
    volumes:
      - '/:/host:ro,rslave'

volumes:
  prometheus-data:
  grafana-data:
EOF

# Create Prometheus config
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          instance: 'monitoring-host'

  # Add more exporters here
  # - job_name: 'proxmox'
  #   static_configs:
  #     - targets: ['proxmox-node-1:9221']
EOF

# Start services
docker-compose up -d
```

### Access

- **Prometheus**: http://HOST_IP:9090
- **Grafana**: http://HOST_IP:3000 (admin/changeme; change this password before exposing Grafana)

### Configure Grafana

1. **Add Prometheus Data Source**
   - Configuration → Data Sources → Add data source
   - Select Prometheus
   - URL: http://prometheus:9090
   - Click "Save & Test"

2. **Import Dashboards**
   - Dashboard → Import
   - Import these popular dashboards:
     - 1860: Node Exporter Full
     - 10180: Proxmox via Prometheus
     - 763: Disk I/O performance

### Install Exporters

**Node Exporter** (on each host to monitor):

```bash
# Download
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

# Create systemd service
# (tee runs under sudo; a plain "sudo cat > file" redirect would run as the calling user and fail)
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

# Verify
curl http://localhost:9100/metrics
```

**Proxmox VE Exporter**:

```bash
# On Proxmox node
wget https://github.com/prometheus-pve/prometheus-pve-exporter/releases/latest/download/pve_exporter
chmod +x pve_exporter
sudo mv pve_exporter /usr/local/bin/

# Create config
sudo mkdir -p /etc/prometheus
# Create /etc/prometheus/pve.yml here with the exporter's Proxmox API credentials
# (the file contents are not included in this guide)

# Create systemd service
sudo tee /etc/systemd/system/pve_exporter.service > /dev/null <<'EOF'
[Unit]
Description=Proxmox VE Exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/pve_exporter /etc/prometheus/pve.yml

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable pve_exporter
sudo systemctl start pve_exporter
```

**Blackbox Exporter** (for HTTP/HTTPS probing):

```yaml
# Add to docker-compose.yml (under services:)
  blackbox:
    image: prom/blackbox-exporter:latest
    container_name: blackbox-exporter
    restart: unless-stopped
    ports:
      - "9115:9115"
    volumes:
      - ./blackbox.yml:/etc/blackbox_exporter/config.yml
```

```yaml
# blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []  # Empty list defaults to 2xx
      method: GET
  tcp_connect:
    prober: tcp
    timeout: 5s
```

### Add Scrape Targets

Add to the `scrape_configs` section of `prometheus.yml`:

```yaml
  # VPS Node Exporter
  - job_name: 'vps'
    static_configs:
      - targets: ['VPS_IP:9100']

  # Proxmox Nodes
  - job_name: 'proxmox'
    static_configs:
      - targets: ['PROXMOX_IP_1:9221', 'PROXMOX_IP_2:9221']

  # HTTP Endpoints
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://service1.example.com
          - https://service2.example.com
    relabel_configs:
      # Pass the listed URL to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the scrape itself to the Blackbox Exporter
      - target_label: __address__
        replacement: blackbox:9115
```

---

## Log Aggregation: Loki

**Best for**: Centralized logging from all services

### Installation

Add to `docker-compose.yml`:

```yaml
  loki:
    image: grafana/loki:latest
    container_name: loki
    restart: unless-stopped
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/local-config.yaml
      - loki-data:/loki

  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    restart: unless-stopped
    volumes:
      - ./promtail-config.yml:/etc/promtail/config.yml
      - /var/log:/var/log
    command: -config.file=/etc/promtail/config.yml

volumes:
  loki-data:
```

### Configure Loki in Grafana

1. Configuration → Data Sources → Add data source
2. Select Loki
3. URL: http://loki:3100
4. Save & Test

### Query Logs

In Grafana Explore:

```logql
# All logs
{job="varlogs"}

# Filter by service
{job="varlogs"} |= "pangolin"

# Error logs
{job="varlogs"} |= "error"
```

---

## Alerting Setup

### Prometheus Alerting Rules

Create `alerts.yml`:

```yaml
groups:
  - name: infrastructure
    interval: 30s
    rules:
      # Node down
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} has been down for more than 5 minutes"

      # High CPU
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage is above 80% for 10 minutes"

      # High Memory
      - alert: HighMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low memory on {{ $labels.instance }}"
          description: "Available memory is below 10%"

      # Disk Space
      - alert: LowDiskSpace
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is below 10% on {{ $labels.mountpoint }}"

      # SSL Certificate Expiring
      - alert: SSLCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expiring soon"
          description: "Certificate for {{ $labels.instance }} expires in less than 30 days"
```

### Alertmanager Configuration

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

receivers:
  - name: 'default'
    email_configs:
      - to: 'your-email@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'username'
        auth_password: 'password'

  # Slack
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
```

---

## Dashboards

### Essential Dashboards

1. **Infrastructure Overview**
   - All nodes status
   - Overall resource utilization
   - Service uptime

2. **VPS Dashboard**
   - CPU, RAM, disk, network
   - Running services
   - Firewall connections

3. **Proxmox Cluster**
   - Cluster health
   - VM/container count and status
   - Resource allocation vs usage

4. **Storage**
   - Disk space trends
   - I/O performance
   - SMART status

5. **Services**
   - Uptime percentage
   - Response times
   - Error rates

6. **Tunnels**
   - Gerbil tunnel status
   - Connection count
   - Bandwidth usage

### Creating Custom Dashboard

1. Grafana → Create → Dashboard
2. Add Panel → Select visualization
3. Write PromQL query
4. Configure thresholds and alerts
5. Save dashboard

---

## Maintenance

### Regular Tasks

**Daily**:
- Review alerts
- Check dashboard for anomalies

**Weekly**:
- Review resource trends
- Check for unused monitors
- Update dashboards

**Monthly**:
- Review and tune alert thresholds
- Clean up old metrics
- Update monitoring stack
- Test alerting

**Quarterly**:
- Review monitoring strategy
- Evaluate new monitoring tools
- Update documentation

### Troubleshooting

**Prometheus not scraping**:

```bash
# Check targets (JSON from the HTTP API; the /targets page is the browser UI)
curl http://localhost:9090/api/v1/targets

# Check Prometheus logs
docker logs prometheus
```

**Grafana dashboard empty**:
- Verify data source connection
- Check time range
- Verify metrics exist in Prometheus

**No alerts firing**:
- Check alerting rules syntax
- Verify Alertmanager connection
- Test alert evaluation

---

## Monitoring Checklist

### Initial Setup

- [ ] Choose monitoring tier (Basic/Intermediate/Advanced)
- [ ] Deploy monitoring stack
- [ ] Install exporters on all hosts
- [ ] Configure Grafana data sources
- [ ] Import/create dashboards
- [ ] Set up alerting
- [ ] Configure notification channels
- [ ] Test alerts

### Monitors to Configure

- [ ] VPS uptime and resources
- [ ] Proxmox node resources
- [ ] OMV storage capacity
- [ ] All public HTTP(S) endpoints
- [ ] Gerbil tunnel status
- [ ] SSL certificate expiration
- [ ] Backup job success
- [ ] Network connectivity
- [ ] Service-specific metrics

### Alerts to Configure

- [ ] Service down (>5 min)
- [ ] High CPU (>80% for 10 min)
- [ ] High memory (>90% for 5 min)
- [ ] Low disk space (<10%)
- [ ] SSL cert expiring (<30 days)
- [ ] Backup failure
- [ ] Tunnel disconnected

---

## Cost Considerations

### Free Tier Options

- **Uptime Kuma**: Fully free, self-hosted
- **Prometheus + Grafana**: Free, self-hosted
- **Grafana Cloud**: Free tier available (limited)

### Paid Options (if needed)

- **Datadog**: $15/host/month
- **New Relic**: $99/month+
- **Better Uptime**: $10/month+

**Recommendation**: Start with free self-hosted tools, and upgrade only if needed.

---

**Last Updated**: _____________
**Next Review**: _____________
**Version**: 1.0
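
---

## Appendix: Wiring Alert Rules into the Stack

The Alerting Setup section creates `alerts.yml` and `alertmanager.yml`, but the `docker-compose.yml` and `prometheus.yml` shown in this guide never load the rule file or start Alertmanager. The following is a minimal sketch of one way to connect them, not a definitive setup. It assumes both files sit next to `docker-compose.yml` in `~/monitoring`; the service name `alertmanager` and the in-container path `/etc/prometheus/alerts.yml` are illustrative choices rather than anything the guide specifies.

```yaml
# Additions to prometheus.yml (top level, alongside global: and scrape_configs:)

# Load the alerting rules; this path is where the compose file below mounts them
rule_files:
  - /etc/prometheus/alerts.yml

# Tell Prometheus where to send firing alerts
# ("alertmanager" is the assumed compose service name; 9093 is Alertmanager's default port)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```

```yaml
# Additions to docker-compose.yml (under services:)
  prometheus:
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerts.yml:/etc/prometheus/alerts.yml  # added so the container can read the rule file
      - prometheus-data:/prometheus

  alertmanager:  # assumed service name, matching the target above
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
```

After restarting the stack, the rules should appear on the Prometheus Alerts page (http://HOST_IP:9090/alerts), and the Alertmanager UI should be reachable at http://HOST_IP:9093.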