Prometheus Alert Investigation - Feb 3, 2026

🔍 Investigation Summary

Time: 1:58 PM CST
Investigator: OpenClaw (Funky)
Scope: 4 hosts showing as DOWN in Prometheus


📊 Findings

1. pve-router (10.0.10.2) - Proxmox Host

Status: ⚠️ Host UP, Monitoring DOWN
Issue: node_exporter not responding on port 9100

✅ ICMP ping: Responding (0.4ms latency)
❌ node_exporter (port 9100): Timeout after 2 seconds

Diagnosis:

  • Host is online and reachable
  • node_exporter service likely stopped or not installed
  • This is your office Proxmox host (i5)
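
The two checks above (ICMP ping, then a 2-second probe of port 9100) can be reproduced with a short sketch like the one below. The function names `classify_target` and `check_target` are made up for illustration, and the sketch assumes Linux `ping` (iputils) and `curl` are available on the machine running it.

```shell
#!/usr/bin/env bash

# classify_target PING_RC EXPORTER_RC
# Turn the two exit codes into the status wording used in this report.
classify_target() {
  local ping_rc=$1 exporter_rc=$2
  if [ "$ping_rc" -ne 0 ]; then
    echo "host down"
  elif [ "$exporter_rc" -ne 0 ]; then
    echo "host up, monitoring down"
  else
    echo "ok"
  fi
}

# check_target HOST
# Ping once, then request /metrics on port 9100 with a 2-second limit.
check_target() {
  local host=$1 ping_rc=0 exporter_rc=0
  ping -c 1 -W 2 "$host" >/dev/null 2>&1 || ping_rc=$?
  curl --silent --fail --max-time 2 --output /dev/null \
    "http://$host:9100/metrics" || exporter_rc=$?
  printf '%s: %s\n' "$host" "$(classify_target "$ping_rc" "$exporter_rc")"
}

# Example (not run automatically): check_target 10.0.10.2
```

A host that drops ICMP would be reported as "host down" by this sketch even if it is up, so the ping result is only a first hint.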

Action Required:

ssh root@10.0.10.2
systemctl status prometheus-node-exporter
systemctl start prometheus-node-exporter
systemctl enable prometheus-node-exporter

2. vps-gaming (51.222.12.162) - OVH Gaming VPS

Status: ⚠️ Host UP, Monitoring DOWN
Issue: node_exporter not responding on port 9100

✅ ICMP ping: Responding (23.8ms latency - normal for OVH Canada)
❌ node_exporter (port 9100): Timeout after 2 seconds

Diagnosis:

  • Host is online (WireGuard VPN likely working)
  • node_exporter either not installed or firewall blocking port 9100
  • Provider: OVH (deadeyeg4ming.vip)
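
To tell the two causes apart, it may help to check on the host itself whether anything is listening on port 9100. A minimal sketch, assuming `ss` (iproute2) and `awk` are present on the VPS; `port_listening` is a hypothetical helper name.

```shell
#!/usr/bin/env bash

# port_listening PORT
# Reads `ss -tln` output on stdin and succeeds if a TCP socket is
# in LISTEN state on PORT (any local address, IPv4 or IPv6).
port_listening() {
  local port=$1
  awk -v p=":$port" '
    $1 == "LISTEN" && $4 ~ (p "$") { found = 1 }
    END { exit !found }
  '
}

# Example (not run automatically):
#   ssh root@51.222.12.162 ss -tln | port_listening 9100 && echo "listening"
```

If the port is listening locally but the scrape still times out, the firewall (or the address the exporter binds to) is the more likely culprit. If nothing is listening, the service is stopped or not installed.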

Action Required:

ssh root@51.222.12.162
# Check if installed
systemctl status prometheus-node-exporter

# If not installed
apt update && apt install prometheus-node-exporter -y

# Check firewall
ufw status
ufw allow 9100/tcp  # If using UFW
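
`ufw allow 9100/tcp` opens the exporter to any source address. On a public VPS it may be preferable to admit only the Prometheus server. The sketch below assumes UFW is in use. `scrape_rule` is a hypothetical helper, and the address passed to it is a placeholder for whatever source address the VPS actually sees scrapes coming from. It prints the rule instead of applying it, so it can be reviewed first.

```shell
#!/usr/bin/env bash

# scrape_rule SCRAPER_IP
# Print a ufw rule that admits only SCRAPER_IP to TCP port 9100.
scrape_rule() {
  printf 'ufw allow from %s to any port 9100 proto tcp\n' "$1"
}

# Example (not run automatically; the address is an assumption):
#   scrape_rule 10.0.10.25
```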

3. OpenClaw Gateway (10.0.10.41) - CT 130

Status: 🔴 Target DOWN, node_exporter missing
Issue: Container reachable but node_exporter not installed

Note: This is the OpenClaw container (me!) - node_exporter should be installed for self-monitoring.

Action Required:

ssh root@10.0.10.41
apt update && apt install prometheus-node-exporter -y
systemctl enable --now prometheus-node-exporter

4. Available Container (10.0.10.42) - CT 131

Status: 🟢 Available for use
Issue: Container available but not yet deployed

Note: This container is available for future use.


🎯 Priority Action Items

Critical (Affects Real Monitoring)

  1. Fix pve-router node_exporter - This is a production Proxmox host
  2. Fix vps-gaming node_exporter - This is your WireGuard VPN endpoint

Low Priority (Game Servers)

  1. Decide on minecraft-forge - Start if needed, or remove from Prometheus config
  2. Decide on minecraft-stoneblock - Start if needed, or remove from Prometheus config

🔧 Quick Fix Commands

For pve-router (10.0.10.2)

ssh root@10.0.10.2 "apt update && apt install prometheus-node-exporter -y && systemctl enable --now prometheus-node-exporter"

For vps-gaming (51.222.12.162)

ssh root@51.222.12.162 "apt update && apt install prometheus-node-exporter -y && systemctl enable --now prometheus-node-exporter && ufw allow 9100/tcp"

Clean Up Prometheus Config (Remove Game Servers)

If you don't want to monitor stopped game servers:

ssh root@10.0.10.25
nano /etc/prometheus/prometheus.yml
# Comment out or remove the minecraft targets (10.0.10.41, 10.0.10.42)
systemctl reload prometheus
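
After hand-editing prometheus.yml it can be worth validating the file before reloading. A sketch, assuming `promtool` (shipped with Prometheus) is installed on 10.0.10.25; `safe_reload` is a hypothetical wrapper name.

```shell
#!/usr/bin/env bash

# safe_reload [CONFIG_PATH]
# Reload Prometheus only if promtool accepts the config file.
safe_reload() {
  local cfg=${1:-/etc/prometheus/prometheus.yml}
  if promtool check config "$cfg"; then
    systemctl reload prometheus
  else
    echo "config invalid, not reloading" >&2
    return 1
  fi
}

# Example (not run automatically): safe_reload
```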

📈 Expected Outcome

After fixes:

  • 2/4 hosts back online (pve-router, vps-gaming)
  • Only real infrastructure monitored
  • No false positive alerts
  • Inbox stays clean

Time to fix: ~5 minutes total


🚨 Current Alert Status

These hosts are NOT firing critical Discord alerts yet because:

  • They're in "pending" state (the down condition has held for less than 2 minutes)
  • Our threshold is 2+ minutes of continuous downtime before an alert fires

If you don't fix them, you'll get Discord alerts in:

  • ~1-2 minutes from now (the 2-minute pending timer is already running)
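
The timing works like this: Prometheus keeps an alert pending from the moment its condition first becomes true until the rule's `for:` duration has elapsed. The arithmetic can be sketched as below. `seconds_until_firing` is a hypothetical helper, timestamps are Unix epoch seconds, and the 120-second default reflects the 2-minute threshold mentioned above.

```shell
#!/usr/bin/env bash

# seconds_until_firing ACTIVE_AT NOW [FOR_SECONDS]
# Seconds left before a pending alert fires; 0 if it should already be firing.
seconds_until_firing() {
  local active_at=$1 now=$2 for_secs=${3:-120}
  local remaining=$(( for_secs - (now - active_at) ))
  if [ "$remaining" -lt 0 ]; then
    remaining=0
  fi
  echo "$remaining"
}

# A target that went down 45 seconds ago has 75 seconds left:
seconds_until_firing 1000 1045 120   # prints 75
```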

Notes

  • pve-router and vps-gaming are real issues - these should be monitored
  • Minecraft servers are probably intentional - you don't run them 24/7
  • Consider removing game servers from Prometheus if you don't want to track them

Let me know if you want me to:

  1. Fix the node_exporters remotely (if I have SSH access)
  2. Remove game servers from Prometheus config
  3. Both!