Initial infrastructure documentation - comprehensive homelab reference
alert-investigation-2026-02-03.md
# Prometheus Alert Investigation - Feb 3, 2026

## 🔍 Investigation Summary

**Time:** 1:58 PM CST
**Investigator:** OpenClaw (Funky)
**Scope:** 4 hosts showing as DOWN in Prometheus

---

## 📊 Findings

### 1. pve-router (10.0.10.2) - Proxmox Host
**Status:** ⚠️ Host UP, Monitoring DOWN
**Issue:** node_exporter not responding on port 9100

```
✅ ICMP ping: Responding (0.4ms latency)
❌ node_exporter (port 9100): Timeout after 2 seconds
```
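
The ping succeeds while port 9100 times out. As a rough sketch of how to tell a silent drop (often a firewall) from an immediate refusal (usually nothing listening), the helper below maps the exit status of a timed connect attempt to a label. The names `classify_probe` and `probe_port` are illustrative only, and the sketch assumes GNU `timeout` and bash's `/dev/tcp` redirection are available on the machine running the check.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of the original investigation.

# Map the exit status of a timed TCP connect to a human-readable label.
classify_probe() {
  case "$1" in
    0)   echo "open" ;;                    # connect succeeded
    124) echo "timeout" ;;                 # GNU timeout hit its limit (packets likely dropped)
    *)   echo "refused-or-unreachable" ;;  # connect failed right away
  esac
}

# Try to open host:port within 2 seconds and print the label.
probe_port() {
  local host="$1" port="$2" rc=0
  timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null || rc=$?
  classify_probe "$rc"
}

# Example (not run here): probe_port 10.0.10.2 9100
```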

**Diagnosis:**
- Host is online and reachable
- node_exporter service likely stopped or not installed
- This is your office Proxmox host (i5)

**Action Required:**
```bash
ssh root@10.0.10.2
# Check whether the exporter service exists and is running
systemctl status prometheus-node-exporter
# Start it now and enable it at boot
systemctl enable --now prometheus-node-exporter
```

---

### 2. vps-gaming (51.222.12.162) - OVH Gaming VPS
**Status:** ⚠️ Host UP, Monitoring DOWN
**Issue:** node_exporter not responding on port 9100

```
✅ ICMP ping: Responding (23.8ms latency - normal for OVH Canada)
❌ node_exporter (port 9100): Timeout after 2 seconds
```

**Diagnosis:**
- Host is online (WireGuard VPN likely working)
- node_exporter is either not installed, or a firewall is blocking port 9100
- Provider: OVH (deadeyeg4ming.vip)

**Action Required:**
```bash
ssh root@51.222.12.162
# Check if installed
systemctl status prometheus-node-exporter

# If not installed
apt update && apt install prometheus-node-exporter -y

# Check firewall
ufw status
ufw allow 9100/tcp  # If using UFW. Note: this opens 9100 to every source address
```
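
Once the service is installed and the port is allowed, one way to confirm the exporter is really serving metrics is to fetch `/metrics` and look for `node_exporter_build_info`, a metric node_exporter publishes about itself. This is a minimal sketch: the function name `has_node_metrics` is made up for illustration, and `curl` is assumed to be installed on the host.

```bash
# Sketch only: has_node_metrics is a hypothetical helper name.
# Succeeds if the text on stdin looks like node_exporter output.
has_node_metrics() {
  grep -q '^node_exporter_build_info'
}

# Example, run on the VPS itself (not executed here):
# curl -fsS --max-time 2 http://127.0.0.1:9100/metrics | has_node_metrics && echo "exporter OK"
```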

---

### 3. OpenClaw Gateway (10.0.10.41) - CT 130
**Status:** 🔴 Target DOWN in Prometheus (node_exporter missing)
**Issue:** Container is reachable, but node_exporter is not installed

**Note:** This is the OpenClaw container (me!) - node_exporter should be installed for self-monitoring.

**Action Required:**
```bash
ssh root@10.0.10.41
apt update && apt install prometheus-node-exporter -y
systemctl enable --now prometheus-node-exporter
```

---

### 4. Available Container (10.0.10.42) - CT 131
**Status:** 🟢 Available for use
**Issue:** Container exists but nothing is deployed on it yet

**Note:** This container is available for future use.

---

## 🎯 Priority Action Items

### Critical (Affects Real Monitoring)
1. **Fix pve-router node_exporter** - This is a production Proxmox host
2. **Fix vps-gaming node_exporter** - This is your WireGuard VPN endpoint

### Low Priority (Game Servers)
3. **Decide on minecraft-forge** - Start if needed, or remove from Prometheus config
4. **Decide on minecraft-stoneblock** - Start if needed, or remove from Prometheus config

---

## 🔧 Quick Fix Commands

### For pve-router (10.0.10.2)
```bash
ssh root@10.0.10.2 "apt update && apt install prometheus-node-exporter -y && systemctl enable --now prometheus-node-exporter"
```

### For vps-gaming (51.222.12.162)
```bash
ssh root@51.222.12.162 "apt update && apt install prometheus-node-exporter -y && systemctl enable --now prometheus-node-exporter && ufw allow 9100/tcp"
```
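
The `ufw allow 9100/tcp` rule above admits any source address, which may be more than a public VPS needs. A narrower variant, sketched here under the assumption that you know the address the Prometheus server appears as from the VPS (the `10.0.10.25` in the example is only a placeholder), limits the port to that one scraper. The helper `ufw_scrape_rule` is hypothetical and only prints the rule so it can be reviewed before being run.

```bash
# Sketch only: ufw_scrape_rule is a hypothetical helper name.
# Print (do not run) a UFW rule that admits a single scraper address to the exporter port.
ufw_scrape_rule() {
  local scraper_ip="$1" port="${2:-9100}"
  printf 'ufw allow from %s to any port %s proto tcp\n' "$scraper_ip" "$port"
}

# Example: review the printed rule, then run it by hand on the VPS.
# ufw_scrape_rule 10.0.10.25
```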

### Clean Up Prometheus Config (Remove Game Servers)
If you don't want to monitor stopped game servers:
```bash
ssh root@10.0.10.25
nano /etc/prometheus/prometheus.yml
# Comment out or remove the minecraft targets (10.0.10.41, 10.0.10.42)
# Validate the edited config before reloading
promtool check config /etc/prometheus/prometheus.yml
systemctl reload prometheus
```
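
Before editing, it can help to see exactly which lines of the config mention the addresses in question. A small sketch, assuming the targets are written as literal IPs in `prometheus.yml` (the helper name `find_targets` is hypothetical):

```bash
# Sketch only: find_targets is a hypothetical helper name.
# Print "line:content" for every line of a config file that mentions one of the given addresses.
find_targets() {
  local file="$1" addr
  shift
  for addr in "$@"; do
    grep -nF -- "$addr" "$file" || true   # no match is not an error
  done
}

# Example on the Prometheus host (not run here):
# find_targets /etc/prometheus/prometheus.yml 10.0.10.41 10.0.10.42
```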

---

## 📈 Expected Outcome

**After fixes:**
- ✅ 2/4 hosts back online (pve-router, vps-gaming)
- ✅ Only real infrastructure monitored
- ✅ No false positive alerts
- ✅ Inbox stays clean

**Time to fix:** ~5 minutes total

---

## 🚨 Current Alert Status

These hosts are **NOT firing critical Discord alerts** yet because:
- They're in the "pending" state (the alert condition has held for less than 2 minutes)
- Our threshold is **2+ minutes** of downtime before an alert triggers

If you don't fix them, expect Discord alerts in **~1-2 minutes** from now (the pending timer is already running).
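
The time left before a pending alert fires can be estimated as the moment it became active, plus the 2-minute threshold, minus the current time. The sketch below does that arithmetic on timestamps given in seconds. `seconds_until_firing` is an illustrative name, and the result is only approximate because Prometheus re-checks alerts at each rule evaluation rather than continuously.

```bash
# Sketch only: seconds_until_firing is a hypothetical helper name.
# Args: active_since (seconds), now (seconds), threshold in seconds (default 120).
seconds_until_firing() {
  local active_since="$1" now="$2" threshold="${3:-120}"
  local remaining=$(( active_since + threshold - now ))
  if (( remaining < 0 )); then
    remaining=0   # threshold already exceeded: the alert should be firing
  fi
  echo "$remaining"
}

# Example: an alert that became active 50 seconds ago has 70 seconds left.
# seconds_until_firing 1000 1050
```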

---

## Notes

- pve-router and vps-gaming are **real issues** - these should be monitored
- The Minecraft servers being down is probably **intentional** - you don't run them 24/7
- Consider removing game servers from Prometheus if you don't want to track them

Let me know if you want me to:
1. Fix the node_exporters remotely (if I have SSH access)
2. Remove game servers from Prometheus config
3. Both!