# Prometheus Alert Investigation - Feb 3, 2026
## 🔍 Investigation Summary
**Time:** 1:58 PM CST
**Investigator:** OpenClaw (Funky)
**Scope:** 4 hosts showing as DOWN in Prometheus

---
## 📊 Findings
### 1. pve-router (10.0.10.2) - Proxmox Host
**Status:** ⚠️ Host UP, Monitoring DOWN
**Issue:** node_exporter not responding on port 9100
```
✅ ICMP ping: Responding (0.4ms latency)
❌ node_exporter (port 9100): Timeout after 2 seconds
```
**Diagnosis:**
- Host is online and reachable
- node_exporter service likely stopped or not installed
- This is your office Proxmox host (i5)
**Action Required:**
```bash
ssh root@10.0.10.2
systemctl status prometheus-node-exporter
systemctl start prometheus-node-exporter
systemctl enable prometheus-node-exporter
```
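To confirm the fix, the port check from the findings can be repeated from the monitoring host. The following is a minimal sketch, not the method used in this investigation: `probe_port` is a hypothetical helper name, and the 2-second default simply mirrors the timeout quoted above. It relies on bash's `/dev/tcp` redirection and `timeout` from coreutils.

```bash
# Sketch (hypothetical helper): print UP or DOWN for a TCP port.
# Usage: probe_port 10.0.10.2 9100
probe_port() {
  local host="$1" port="${2:-9100}" wait_s="${3:-2}"
  # Opening /dev/tcp/HOST/PORT makes bash attempt a TCP connection;
  # timeout aborts the attempt if the host silently drops packets.
  if timeout "$wait_s" bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "UP"
  else
    echo "DOWN"
    return 1
  fi
}
```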
---
### 2. vps-gaming (51.222.12.162) - OVH Gaming VPS
**Status:** ⚠️ Host UP, Monitoring DOWN
**Issue:** node_exporter not responding on port 9100
```
✅ ICMP ping: Responding (23.8ms latency - normal for OVH Canada)
❌ node_exporter (port 9100): Timeout after 2 seconds
```
**Diagnosis:**
- Host is online and answers ping on its public IP (WireGuard VPN likely working, though ping alone does not confirm it)
- node_exporter either not installed or firewall blocking port 9100
- Provider: OVH (deadeyeg4ming.vip)
**Action Required:**
```bash
ssh root@51.222.12.162
# Check if installed
systemctl status prometheus-node-exporter
# If not installed
apt update && apt install prometheus-node-exporter -y
# Check firewall
ufw status
ufw allow 9100/tcp # If using UFW
```
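Before opening the port, it can help to check whether UFW already allows it. This is a rough sketch, assuming the default table layout of `ufw status` output (rule in the first column, action in the second). `ufw_allows_port` is a hypothetical helper. It treats an inactive firewall as not blocking, and it only recognizes bare `PORT` and `PORT/tcp` rules.

```bash
# Sketch (hypothetical helper): read `ufw status` text on stdin and
# succeed if the given port is allowed or the firewall is inactive.
# Usage: ufw status | ufw_allows_port 9100 || ufw allow 9100/tcp
ufw_allows_port() {
  local port="$1"
  awk -v p="$port" '
    /^Status: inactive/ { inactive = 1 }
    # Rule lines are assumed to look like: 9100/tcp  ALLOW  Anywhere
    ($1 == p || $1 == p "/tcp") && $2 == "ALLOW" { allowed = 1 }
    END { if (inactive || allowed) exit 0; exit 1 }
  '
}
```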
---
### 3. OpenClaw Gateway (10.0.10.41) - CT 130
**Status:** ⚠️ Container UP, Monitoring DOWN
**Issue:** Container reachable but node_exporter not installed
**Note:** This is the OpenClaw container (me!) - node_exporter should be installed for self-monitoring.
**Action Required:**
```bash
ssh root@10.0.10.41
apt update && apt install prometheus-node-exporter -y
systemctl enable --now prometheus-node-exporter
```
---
### 4. Available Container (10.0.10.42) - CT 131
**Status:** 🟢 Available for use
**Issue:** Container available but not yet deployed
**Note:** This container is available for future use.

---
## 🎯 Priority Action Items
### Critical (Affects Real Monitoring)
1. **Fix pve-router node_exporter** - This is a production Proxmox host
2. **Fix vps-gaming node_exporter** - This is your WireGuard VPN endpoint
### Low Priority (Game Servers)
3. **Decide on minecraft-forge** - Start if needed, or remove from Prometheus config
4. **Decide on minecraft-stoneblock** - Start if needed, or remove from Prometheus config
---
## 🔧 Quick Fix Commands
### For pve-router (10.0.10.2)
```bash
ssh root@10.0.10.2 "apt update && apt install prometheus-node-exporter -y && systemctl enable --now prometheus-node-exporter"
```
### For vps-gaming (51.222.12.162)
```bash
ssh root@51.222.12.162 "apt update && apt install prometheus-node-exporter -y && systemctl enable --now prometheus-node-exporter && ufw allow 9100/tcp"
```
### Clean Up Prometheus Config (Remove Game Servers)
If you don't want to monitor stopped game servers:
```bash
ssh root@10.0.10.25
nano /etc/prometheus/prometheus.yml
# Comment out or remove the minecraft targets (10.0.10.41, 10.0.10.42)
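# Validate the edited config first; reload only if it parses cleanly
promtool check config /etc/prometheus/prometheus.yml && systemctl reload prometheus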
```
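If you'd rather not hand-edit the file, the commenting-out step could be scripted. This is a minimal sketch, assuming each target sits on its own line in `prometheus.yml` (the real layout may differ). `comment_out_target` is a hypothetical helper that prints the modified config to stdout instead of editing in place, so the result can be diffed before it replaces the original. The match is a literal substring, so pass the full `host:port` string.

```bash
# Sketch (hypothetical helper): print FILE with every line that
# contains TARGET commented out; FILE itself is left untouched.
# Usage: comment_out_target /etc/prometheus/prometheus.yml "10.0.10.41:9100" > /tmp/prometheus.yml.new
comment_out_target() {
  local file="$1" target="$2"
  awk -v t="$target" '
    # Literal substring match; lines that are already comments are skipped.
    index($0, t) && $0 !~ /^[ \t]*#/ { sub(/[^ \t]/, "# &") }
    { print }
  ' "$file"
}
```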
---
## 📈 Expected Outcome
**After fixes:**
- ✅ 2/4 hosts back online (pve-router, vps-gaming)
- ✅ Only real infrastructure monitored
- ✅ No false positive alerts
- ✅ Inbox stays clean
**Time to fix:** ~5 minutes total

---
## 🚨 Current Alert Status
These hosts are **NOT firing critical Discord alerts** yet because:
- Their alerts are still in "pending" state (the down condition has held for less than 2 minutes)
- Our threshold is **2+ minutes** of continuous downtime before triggering
If you don't fix them, you'll get Discord alerts in:
- **~1-2 minutes** from now (the pending timer is already running)
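
The timing above is simple arithmetic: an alert stays pending until its condition has held for the full threshold. This is a small sketch, assuming the 2-minute threshold is a fixed hold period of 120 seconds. `seconds_until_firing` is a hypothetical helper, and the timestamps in the usage comment are made-up values.

```bash
# Sketch (hypothetical helper): seconds left before a pending alert fires.
# Args: epoch second the condition became true, hold period in seconds
# (default 120), and optionally the current epoch second.
# Usage: seconds_until_firing 1000 120 1045   # made-up timestamps
seconds_until_firing() {
  local active_at="$1" for_s="${2:-120}" now="${3:-$(date +%s)}"
  local left=$(( active_at + for_s - now ))
  # Once the hold period has elapsed the alert is already firing.
  if (( left < 0 )); then left=0; fi
  echo "$left"
}
```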
---
## Notes
- pve-router and vps-gaming are **real issues** - these should be monitored
- Minecraft servers are probably **intentional** - you don't run them 24/7
- Consider removing game servers from Prometheus if you don't want to track them
Let me know if you want me to:
1. Fix the node_exporters remotely (if I have SSH access)
2. Remove game servers from Prometheus config
3. Both!