# Prometheus Alert Investigation - Feb 3, 2026

## 🔍 Investigation Summary

**Time:** 1:58 PM CST
**Investigator:** OpenClaw (Funky)
**Scope:** 4 hosts showing as DOWN in Prometheus

---

## 📊 Findings

### 1. pve-router (10.0.10.2) - Proxmox Host
**Status:** ⚠️ Host UP, Monitoring DOWN
**Issue:** node_exporter not responding on port 9100

```
✅ ICMP ping: Responding (0.4ms latency)
❌ node_exporter (port 9100): Timeout after 2 seconds
```

**Diagnosis:**
- Host is online and reachable
- node_exporter service likely stopped or not installed
- This is your office Proxmox host (i5)

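
The two checks above can be reproduced with a short script. This is a minimal sketch, assuming Linux `ping` (iputils, where `-W` is a timeout in seconds) and `curl` are available; the function names `classify_host` and `check_host` are illustrative and not part of the original investigation.

```bash
# Map the two exit codes to a status line like the ones used in this report.
classify_host() {
  local ping_rc=$1 exporter_rc=$2
  if [ "$ping_rc" -ne 0 ]; then
    echo "Host DOWN"
  elif [ "$exporter_rc" -ne 0 ]; then
    echo "Host UP, Monitoring DOWN"
  else
    echo "Host UP, Monitoring UP"
  fi
}

# Run both checks against one host and print its status.
check_host() {
  local host=$1 ping_rc=0 exporter_rc=0
  ping -c 1 -W 2 "$host" >/dev/null 2>&1 || ping_rc=$?
  curl -fsS --max-time 2 "http://$host:9100/metrics" >/dev/null 2>&1 || exporter_rc=$?
  echo "$host: $(classify_host "$ping_rc" "$exporter_rc")"
}

# Example (requires network access to the hosts):
# check_host 10.0.10.2
# check_host 51.222.12.162
```
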
**Action Required:**
```bash
ssh root@10.0.10.2
# Check whether the service exists and is running
systemctl status prometheus-node-exporter
# Start it now and enable it at boot
systemctl start prometheus-node-exporter
systemctl enable prometheus-node-exporter
```

---

### 2. vps-gaming (51.222.12.162) - OVH Gaming VPS
**Status:** ⚠️ Host UP, Monitoring DOWN
**Issue:** node_exporter not responding on port 9100

```
✅ ICMP ping: Responding (23.8ms latency - normal for OVH Canada)
❌ node_exporter (port 9100): Timeout after 2 seconds
```

**Diagnosis:**
- Host is online (WireGuard VPN likely working)
- node_exporter either not installed or firewall blocking port 9100
- Provider: OVH (deadeyeg4ming.vip)

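
One way to narrow down the two causes above is to look at how the connection fails. This is a hedged sketch, assuming the probe is made with `curl`: exit code 7 means the connection could not be established (for example refused, which suggests nothing is listening), while 28 means the operation timed out (more consistent with packets being dropped, for example by a firewall). The function name `explain_curl_rc` is hypothetical, and the mapping is a heuristic rather than proof.

```bash
# Translate a curl exit code into a likely cause (a heuristic, not proof).
explain_curl_rc() {
  case "$1" in
    0)  echo "exporter responding" ;;
    7)  echo "could not connect: service likely stopped or not installed" ;;
    28) echo "timed out: packets likely dropped (firewall or routing)" ;;
    *)  echo "other curl error ($1)" ;;
  esac
}

# Example (requires network access to the host):
# curl -fsS --max-time 2 http://51.222.12.162:9100/metrics >/dev/null
# explain_curl_rc $?
```
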
**Action Required:**
```bash
ssh root@51.222.12.162
# Check if installed
systemctl status prometheus-node-exporter

# If not installed
apt update && apt install prometheus-node-exporter -y

# Check firewall
ufw status
ufw allow 9100/tcp  # If using UFW
```

---

### 3. OpenClaw Gateway (10.0.10.41) - CT 130
**Status:** 🔴 Target DOWN in Prometheus (node_exporter missing)
**Issue:** Container reachable but node_exporter not installed

**Note:** This is the OpenClaw container (me!) - node_exporter should be installed for self-monitoring.

**Action Required:**
```bash
ssh root@10.0.10.41
apt update && apt install prometheus-node-exporter -y
systemctl enable --now prometheus-node-exporter
```

---

### 4. Available Container (10.0.10.42) - CT 131
**Status:** 🟢 Available for use
**Issue:** Container available but not yet deployed

**Note:** This container is available for future use.

---

## 🎯 Priority Action Items

### Critical (Affects Real Monitoring)
1. **Fix pve-router node_exporter** - This is a production Proxmox host
2. **Fix vps-gaming node_exporter** - This is your WireGuard VPN endpoint

### Low Priority (Game Servers)
3. **Decide on minecraft-forge** - Start if needed, or remove from Prometheus config
4. **Decide on minecraft-stoneblock** - Start if needed, or remove from Prometheus config

---

## 🔧 Quick Fix Commands

### For pve-router (10.0.10.2)
```bash
ssh root@10.0.10.2 "apt update && apt install prometheus-node-exporter -y && systemctl enable --now prometheus-node-exporter"
```

### For vps-gaming (51.222.12.162)
```bash
ssh root@51.222.12.162 "apt update && apt install prometheus-node-exporter -y && systemctl enable --now prometheus-node-exporter && ufw allow 9100/tcp"
```

### Clean Up Prometheus Config (Remove Game Servers)
If you don't want to monitor stopped game servers:
```bash
ssh root@10.0.10.25
nano /etc/prometheus/prometheus.yml
# Comment out or remove the minecraft targets (10.0.10.41, 10.0.10.42)
systemctl reload prometheus
```

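Before reloading, it can help to validate the edited file so that a typo is caught first. This is a minimal sketch, assuming `promtool` is installed alongside Prometheus on 10.0.10.25; `safe_reload`, `CHECK_CMD`, and `RELOAD_CMD` are hypothetical names introduced here so the logic can be dry-run without touching the service.

```bash
# Reload Prometheus only when the config passes validation.
# CHECK_CMD and RELOAD_CMD are overridable (assumed names) for dry runs.
safe_reload() {
  local config=${1:-/etc/prometheus/prometheus.yml}
  # Intentionally unquoted so a multi-word command splits into arguments.
  if ${CHECK_CMD:-promtool check config} "$config"; then
    ${RELOAD_CMD:-systemctl reload prometheus}
  else
    echo "Config check failed; not reloading" >&2
    return 1
  fi
}

# Example (run on the Prometheus host after editing the file):
# safe_reload /etc/prometheus/prometheus.yml
```
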
---

## 📈 Expected Outcome

**After fixes:**
- ✅ 2/4 hosts back online (pve-router, vps-gaming)
- ✅ Only real infrastructure monitored
- ✅ No false positive alerts
- ✅ Inbox stays clean

**Time to fix:** ~5 minutes total

---

## 🚨 Current Alert Status

These hosts are **NOT firing critical Discord alerts** yet because:
- They're in the "pending" state (the down condition has held for less than 2 minutes)
- Our threshold is **2+ minutes** of downtime before an alert fires

If you don't fix them, you'll get Discord alerts in:
- **~1-2 minutes** from now (part of the 2-minute window has already elapsed)

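
To see which alerts are currently pending or firing, the Prometheus HTTP API exposes `/api/v1/alerts`. This is a minimal sketch, assuming `jq` is installed, that Prometheus listens on its default port 9090 on 10.0.10.25 (the port is not stated in this report), and that the alerts carry `alertname` and `instance` labels; `summarize_alerts` is a hypothetical helper.

```bash
# Print "<alertname> <instance> <state>" for each active alert in the
# /api/v1/alerts JSON read from stdin. State is "pending" or "firing".
summarize_alerts() {
  jq -r '.data.alerts[] | "\(.labels.alertname) \(.labels.instance) \(.state)"'
}

# Example (PROM_URL is an assumption; 9090 is the Prometheus default port):
# PROM_URL=http://10.0.10.25:9090
# curl -fsS --max-time 5 "$PROM_URL/api/v1/alerts" | summarize_alerts
```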
---

## Notes

- pve-router and vps-gaming are **real issues** - these should be monitored
- Minecraft servers are probably **intentional** - you don't run them 24/7
- Consider removing game servers from Prometheus if you don't want to track them

Let me know if you want me to:
1. Fix the node_exporters remotely (if I have SSH access)
2. Remove game servers from Prometheus config
3. Both!