Monitoring Setup Guide

This guide provides step-by-step instructions for setting up monitoring for your infrastructure.

Table of Contents

  • Monitoring Strategy
  • Quick Start: Uptime Kuma
  • Comprehensive Stack: Prometheus + Grafana
  • Log Aggregation: Loki
  • Alerting Setup
  • Dashboards
  • Maintenance
  • Troubleshooting
  • Monitoring Checklist
  • Cost Considerations

Monitoring Strategy

What to Monitor

Infrastructure Level:

  • VPS: CPU, RAM, disk, network
  • Proxmox nodes: CPU, RAM, disk, network
  • OMV storage: Disk usage, SMART status
  • Network: Bandwidth, connectivity

Service Level:

  • Service uptime and response time
  • HTTP/HTTPS endpoints
  • Gerbil tunnel status
  • SSL certificate expiration
  • Backup job success/failure

Application Level:

  • Application-specific metrics
  • Error rates
  • Request rates
  • Database performance

Monitoring Tiers

| Tier         | Solution                 | Complexity | Setup Time | Cost      |
|--------------|--------------------------|------------|------------|-----------|
| Basic        | Uptime Kuma              | Low        | 30 min     | Free      |
| Intermediate | Prometheus + Grafana     | Medium     | 2-4 hours  | Free      |
| Advanced     | Full observability stack | High       | 8+ hours   | Free/Paid |

Quick Start: Uptime Kuma

Best for: Simple uptime monitoring with alerts

Installation

# On a Proxmox container or VM
# Create container with Ubuntu/Debian

# Install Docker
curl -fsSL https://get.docker.com | sh

# Run Uptime Kuma
docker run -d --restart=always \
  -p 3001:3001 \
  -v uptime-kuma:/app/data \
  --name uptime-kuma \
  louislam/uptime-kuma:1

# Access at http://HOST_IP:3001
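
If you prefer to manage Uptime Kuma with Docker Compose (as the Prometheus stack later in this guide does), an equivalent compose file is sketched below; the image, port, and volume match the `docker run` command above:

```yaml
services:
  uptime-kuma:
    image: louislam/uptime-kuma:1
    container_name: uptime-kuma
    restart: always
    ports:
      - "3001:3001"
    volumes:
      - uptime-kuma:/app/data

volumes:
  uptime-kuma:
```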

Configuration

  1. Create Admin Account

    • First login creates admin user
  2. Add Monitors

    HTTP(S) Monitor:

    • Monitor Type: HTTP(s)
    • URL: Full service URL (e.g. https://service.example.com)
    • Interval: 60 seconds

    Ping Monitor:

    • Monitor Type: Ping
    • Hostname: VPS or node IP
    • Interval: 60 seconds

    Port Monitor:

    • Monitor Type: Port
    • Hostname: IP address
    • Port: Service port
  3. Set Up Notifications

    • Settings → Notifications
    • Add notification method (email, Slack, Discord, etc.)
    • Test notification
  4. Create Status Page (Optional)

    • Status Pages → Add Status Page
    • Add monitors to display
    • Make public or private

Monitors to Create

  • VPS SSH (Port 22 or custom)
  • VPS HTTP/HTTPS (Port 80/443)
  • Each public service endpoint
  • Proxmox web interface (each node)
  • OMV web interface

Comprehensive Stack: Prometheus + Grafana

Best for: Detailed metrics, trending, and advanced alerting

Architecture

┌─────────────┐     ┌───────────┐     ┌─────────┐
│  Exporters  │────▶│ Prometheus│────▶│ Grafana │
│  (Metrics)  │     │ (Storage) │     │  (UI)   │
└─────────────┘     └───────────┘     └─────────┘
                           │
                           ▼
                    ┌─────────────┐
                    │Alertmanager │
                    └─────────────┘

Installation (Docker Compose)

# Create monitoring directory
mkdir -p ~/monitoring
cd ~/monitoring

# Create docker-compose.yml
cat > docker-compose.yml <<'EOF'
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_INSTALL_PLUGINS=grafana-piechart-panel

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    command:
      - '--path.rootfs=/host'
    volumes:
      - '/:/host:ro,rslave'

volumes:
  prometheus-data:
  grafana-data:
EOF

# Create Prometheus config
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          instance: 'monitoring-host'

  # Add more exporters here
  # - job_name: 'proxmox'
  #   static_configs:
  #     - targets: ['proxmox-node-1:9221']
EOF

# Start services
docker-compose up -d

Access

  • Prometheus: http://HOST_IP:9090
  • Grafana: http://HOST_IP:3000 (log in as admin with the GF_SECURITY_ADMIN_PASSWORD set above, then change it)

Configure Grafana

  1. Add Prometheus Data Source

    • Configuration → Data Sources → Add data source
    • Select Prometheus
    • URL: http://prometheus:9090
    • Click "Save & Test"
  2. Import Dashboards

    • Dashboard → Import
    • Import these popular dashboards:
      • 1860: Node Exporter Full
      • 10180: Proxmox via Prometheus
      • 763: Disk I/O performance

Install Exporters

Node Exporter (on each host to monitor):

# Download
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service >/dev/null <<'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

# Verify
curl http://localhost:9100/metrics

Proxmox VE Exporter:

# On Proxmox node
# prometheus-pve-exporter is distributed as a Python package (PyPI),
# not a prebuilt binary
# (on PVE 8 / Debian 12 you may need pipx or a venv instead of pip3)
sudo apt install -y python3-pip
sudo pip3 install prometheus-pve-exporter

# Create a read-only API user for the exporter
pveum user add monitoring@pve -password 'your_password'
pveum aclmod / -user monitoring@pve -role PVEAuditor

# Create config
sudo mkdir -p /etc/prometheus
sudo tee /etc/prometheus/pve.yml >/dev/null <<EOF
default:
  user: monitoring@pve
  password: your_password
  verify_ssl: false
EOF
sudo chmod 600 /etc/prometheus/pve.yml  # file contains a password

# Create systemd service
sudo tee /etc/systemd/system/pve_exporter.service >/dev/null <<'EOF'
[Unit]
Description=Proxmox VE Exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/pve_exporter --config.file /etc/prometheus/pve.yml

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable pve_exporter
sudo systemctl start pve_exporter

Blackbox Exporter (for HTTP/HTTPS probing):

# Add to docker-compose.yml
  blackbox:
    image: prom/blackbox-exporter:latest
    container_name: blackbox-exporter
    restart: unless-stopped
    ports:
      - "9115:9115"
    volumes:
      - ./blackbox.yml:/etc/blackbox_exporter/config.yml

# blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []
      method: GET
  tcp_connect:
    prober: tcp
    timeout: 5s

Add Scrape Targets

Add to prometheus.yml:

  # VPS Node Exporter
  - job_name: 'vps'
    static_configs:
      - targets: ['VPS_IP:9100']

  # Proxmox Nodes
  - job_name: 'proxmox'
    static_configs:
      - targets: ['PROXMOX_IP_1:9221', 'PROXMOX_IP_2:9221']

  # HTTP Endpoints
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://service1.example.com
          - https://service2.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox:9115

Log Aggregation: Loki

Best for: Centralized logging from all services

Installation

Add to docker-compose.yml:

  loki:
    image: grafana/loki:latest
    container_name: loki
    restart: unless-stopped
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/local-config.yaml
      - loki-data:/loki

  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    restart: unless-stopped
    volumes:
      - ./promtail-config.yml:/etc/promtail/config.yml
      - /var/log:/var/log
    command: -config.file=/etc/promtail/config.yml

volumes:
  loki-data:
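
The compose snippet above mounts `./promtail-config.yml`, which this guide doesn't otherwise show. A minimal sketch that ships everything under /var/log to Loki with the `varlogs` job label (the label the example queries in this guide filter on):

```yaml
# promtail-config.yml — minimal sketch; adjust paths and labels to taste
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*log
```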

Configure Loki in Grafana

  1. Configuration → Data Sources → Add data source
  2. Select Loki
  3. URL: http://loki:3100
  4. Save & Test

Query Logs

In Grafana Explore:

# All logs
{job="varlogs"}

# Filter by service
{job="varlogs"} |= "pangolin"

# Error logs
{job="varlogs"} |= "error"

Alerting Setup

Prometheus Alerting Rules

Create alerts.yml:

groups:
  - name: infrastructure
    interval: 30s
    rules:
      # Node down
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} has been down for more than 5 minutes"

      # High CPU
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage is above 80% for 10 minutes"

      # High Memory
      - alert: HighMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low memory on {{ $labels.instance }}"
          description: "Available memory is below 10%"

      # Disk Space
      - alert: LowDiskSpace
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is below 10% on {{ $labels.mountpoint }}"

      # SSL Certificate Expiring
      - alert: SSLCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate expiring soon"
          description: "Certificate for {{ $labels.instance }} expires in less than 30 days"
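
Prometheus only evaluates these rules if the file is listed under `rule_files` and mounted into the container. Assuming the docker-compose layout above and an Alertmanager container reachable as `alertmanager` (the service name is an assumption), the additions would look like:

```yaml
# prometheus.yml additions — load the rules and point at Alertmanager
rule_files:
  - /etc/prometheus/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# docker-compose.yml — mount the rules file into the prometheus service:
#   volumes:
#     - ./alerts.yml:/etc/prometheus/alerts.yml
```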

Alertmanager Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'

receivers:
  - name: 'default'
    email_configs:
      - to: 'your-email@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'username'
        auth_password: 'password'

  # Slack
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
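
The docker-compose.yml above doesn't yet run Alertmanager itself; a sketch of the service to add (the container name matters, since Prometheus reaches it by DNS as `alertmanager:9093`):

```yaml
# Add to docker-compose.yml
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
```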

Dashboards

Essential Dashboards

  1. Infrastructure Overview

    • All nodes status
    • Overall resource utilization
    • Service uptime
  2. VPS Dashboard

    • CPU, RAM, disk, network
    • Running services
    • Firewall connections
  3. Proxmox Cluster

    • Cluster health
    • VM/container count and status
    • Resource allocation vs usage
  4. Storage

    • Disk space trends
    • I/O performance
    • SMART status
  5. Services

    • Uptime percentage
    • Response times
    • Error rates
  6. Tunnels

    • Gerbil tunnel status
    • Connection count
    • Bandwidth usage

Creating Custom Dashboard

  1. Grafana → Create → Dashboard
  2. Add Panel → Select visualization
  3. Write PromQL query
  4. Configure thresholds and alerts
  5. Save dashboard
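
For step 3, a few starter PromQL queries built from the node_exporter metrics already scraped above (the same expressions the alert rules use; adjust labels to your instances):

```promql
# CPU usage % per instance
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory used %
100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk used % per mountpoint
100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100
```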

Maintenance

Regular Tasks

Daily:

  • Review alerts
  • Check dashboard for anomalies

Weekly:

  • Review resource trends
  • Check for unused monitors
  • Update dashboards

Monthly:

  • Review and tune alert thresholds
  • Clean up old metrics
  • Update monitoring stack
  • Test alerting

Quarterly:

  • Review monitoring strategy
  • Evaluate new monitoring tools
  • Update documentation

Troubleshooting

Prometheus not scraping:

# Check targets (the API endpoint returns JSON; /targets is an HTML page)
curl http://localhost:9090/api/v1/targets

# Check Prometheus logs
docker logs prometheus

Grafana dashboard empty:

  • Verify data source connection
  • Check time range
  • Verify metrics exist in Prometheus

No alerts firing:

  • Check alerting rules syntax
  • Verify Alertmanager connection
  • Test alert evaluation

Monitoring Checklist

Initial Setup

  • Choose monitoring tier (Basic/Intermediate/Advanced)
  • Deploy monitoring stack
  • Install exporters on all hosts
  • Configure Grafana data sources
  • Import/create dashboards
  • Set up alerting
  • Configure notification channels
  • Test alerts

Monitors to Configure

  • VPS uptime and resources
  • Proxmox node resources
  • OMV storage capacity
  • All public HTTP(S) endpoints
  • Gerbil tunnel status
  • SSL certificate expiration
  • Backup job success
  • Network connectivity
  • Service-specific metrics

Alerts to Configure

  • Service down (>5 min)
  • High CPU (>80% for 10 min)
  • High memory (>90% for 5 min)
  • Low disk space (<10%)
  • SSL cert expiring (<30 days)
  • Backup failure
  • Tunnel disconnected

Cost Considerations

Free Tier Options

  • Uptime Kuma: Fully free, self-hosted
  • Prometheus + Grafana: Free, self-hosted
  • Grafana Cloud: Free tier available (limited)

Paid Options (if needed)

  • Datadog: $15/host/month
  • New Relic: $99/month+
  • Better Uptime: $10/month+

Recommendation: Start with free self-hosted tools, upgrade only if needed.


Last Updated: _____________ Next Review: _____________ Version: 1.0