Initial infrastructure documentation - comprehensive homelab reference

2026-02-23 03:42:22 +00:00
commit 0682c79580
169 changed files with 63913 additions and 0 deletions
--- a/ALERT-REDUCTION-SUMMARY.md
+++ b/ALERT-REDUCTION-SUMMARY.md
@@ -0,0 +1,190 @@
+# Alert Reduction Summary
+
+## What Changed
+
+Your Prometheus alerting has been updated to **drastically reduce notification noise** while still catching critical issues.
+
+### Before (Your Current Inbox)
+- 🔥 **All warnings** → Email inbox spam
+- Multiple alerts per minute
+- Drowning out important messages
+
+### After (New Configuration)
+- ✅ **Only CRITICAL alerts** → Discord notification
+- **WARNING alerts** → Logged in Prometheus UI (no notification)
+- Clean inbox, important alerts still get through
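+
+A minimal sketch of how this split might look in `alertmanager.yml`, assuming the rules label alerts with `severity` and that Alertmanager is 0.25 or newer (the first release with native `discord_configs`). The receiver names and the webhook URL are placeholders, not values from the deployed config:
+
+```yaml
+route:
+  # Default: anything not matched below (e.g. severity=warning) goes to a
+  # receiver with no integrations, so it is never delivered anywhere.
+  receiver: 'null'          # quoted so YAML does not parse it as a null value
+  group_by: ['alertname', 'instance']
+  group_wait: 30s           # Alertmanager's default delay before the first notification
+  routes:
+    - matchers:
+        - severity="critical"
+      receiver: discord-critical   # hypothetical receiver name
+
+receivers:
+  - name: 'null'            # no *_configs: alerts routed here are dropped silently
+  - name: discord-critical
+    discord_configs:
+      - webhook_url: https://discord.com/api/webhooks/<id>/<token>   # placeholder
+```
+
+Warnings still fire in Prometheus and stay visible on its alerts page; they just never reach a receiver that sends anything.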
+
+---
+
+## Updated Alert Thresholds
+
+### CPU Monitoring
+| Severity | Threshold | Duration | Action |
+|----------|-----------|----------|--------|
+| **WARNING** | 80%+ | 5 minutes | Logged only |
+| **CRITICAL** | 95%+ | 5 minutes | **Discord notification** |
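+
+One way to express these two CPU tiers as Prometheus rules, assuming hosts are scraped through node_exporter (which provides `node_cpu_seconds_total`). The group name and the `HighCPUUsage` alert name are illustrative; only `CriticalCPUUsage` is named in this summary:
+
+```yaml
+groups:
+  - name: cpu-alerts          # hypothetical group name
+    rules:
+      - alert: HighCPUUsage   # hypothetical name for the warning tier
+        # Busy % = 100 minus the average idle % across all cores, per instance
+        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
+        for: 5m
+        labels:
+          severity: warning
+      - alert: CriticalCPUUsage
+        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
+        for: 5m
+        labels:
+          severity: critical
+```
+
+Above 95% both rules fire at once; that is harmless here because only the `severity: critical` one is routed to a notification.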
+
+### Memory Monitoring
+| Severity | Threshold | Duration | Action |
+|----------|-----------|----------|--------|
+| **WARNING** | 85%+ | 10 minutes | Logged only |
+| **CRITICAL** | 95%+ | 5 minutes | **Discord notification** |
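+
+The memory tiers could be sketched along the same lines, again assuming node_exporter metrics. "Used" is taken here as total minus `MemAvailable`, which is one common definition and may differ from the deployed rule; the group name and `HighMemoryUsage` are illustrative:
+
+```yaml
+groups:
+  - name: memory-alerts          # hypothetical group name
+    rules:
+      - alert: HighMemoryUsage   # hypothetical name for the warning tier
+        # Percentage of memory that is not available to new workloads
+        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
+        for: 10m
+        labels:
+          severity: warning
+      - alert: CriticalMemoryUsage
+        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 95
+        for: 5m
+        labels:
+          severity: critical
+```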
+
+### Disk Space
+| Severity | Threshold | Duration | Action |
+|----------|-----------|----------|--------|
+| **WARNING** | <15% free | 5 minutes | Logged only |
+| **CRITICAL** | <5% free | 2 minutes | **Discord notification** |
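+
+For disk space, a hedged sketch based on node_exporter's filesystem metrics. The `fstype` filter (skipping tmpfs and overlay mounts) and the `DiskSpaceWarning` name are assumptions; the real rule file may filter mounts differently:
+
+```yaml
+groups:
+  - name: disk-alerts             # hypothetical group name
+    rules:
+      - alert: DiskSpaceWarning   # hypothetical name for the warning tier
+        # Free space as a percentage of filesystem size, per mount point
+        expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} * 100 < 15
+        for: 5m
+        labels:
+          severity: warning
+      - alert: DiskSpaceCritical
+        expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} * 100 < 5
+        for: 2m
+        labels:
+          severity: critical
+```
+
+The shorter `for:` on the critical tier matches the table: a nearly full disk is reported after 2 minutes rather than 5.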
+
+---
+
+## CRITICAL Alerts (Discord Notifications)
+
+You will **ONLY** receive Discord notifications for:
+
+### Infrastructure
+- ❌ **HostDown** - Any host completely unreachable for 2+ minutes
+- ❌ **ProxmoxNodeDown** - Proxmox host down for 2+ minutes
+- ❌ **VPSDown** - VPS (66.63.182.168) unreachable for 2+ minutes
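+
+A reachability rule such as HostDown is typically built on the `up` series that Prometheus records for every scrape target. The sketch below is an assumption about how it might be written, not a copy of the deployed rule; the group name and the annotation text are illustrative:
+
+```yaml
+groups:
+  - name: availability-alerts   # hypothetical group name
+    rules:
+      - alert: HostDown
+        # up is 0 whenever the most recent scrape of a target failed
+        expr: up == 0
+        for: 2m
+        labels:
+          severity: critical
+        annotations:
+          summary: "{{ $labels.instance }} has been unreachable for more than 2 minutes"
+```
+
+ProxmoxNodeDown and VPSDown would presumably narrow the same expression with a `job` or `instance` label matcher.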
+
+### Performance
+- 🔥 **CriticalCPUUsage** - CPU >95% sustained for 5+ minutes
+- 🧠 **CriticalMemoryUsage** - Memory >95% sustained for 5+ minutes
+
+### Storage
+- 💾 **DiskSpaceCritical** - Disk <5% free space for 2+ minutes
+
+### Services
+- 🗄️ **PostgreSQLDown** - Database down for 2+ minutes
+- ⚙️ **PrometheusConfigReloadFailed** - Monitoring system config broken
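+
+The two service alerts might look roughly like this. `prometheus_config_last_reload_successful` is exported by Prometheus itself; `pg_up` assumes the database is monitored with postgres_exporter, which this summary does not confirm. The group name is illustrative:
+
+```yaml
+groups:
+  - name: service-alerts   # hypothetical group name
+    rules:
+      - alert: PostgreSQLDown
+        # Assumes postgres_exporter: pg_up is 0 when it cannot reach the database
+        expr: pg_up == 0
+        for: 2m
+        labels:
+          severity: critical
+      - alert: PrometheusConfigReloadFailed
+        # Drops to 0 when the last config reload was rejected
+        expr: prometheus_config_last_reload_successful == 0
+        labels:
+          severity: critical
+```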
+
+---
+
+## WARNING Alerts (Logged Only)
+
+These alerts are **visible in Prometheus UI** but **do NOT trigger notifications**:
+
+- CPU 80-95% (warning threshold)
+- Memory 85-95%
+- Disk 5-15% free
+- Network interface down
+- High disk I/O wait
+- Home Assistant down
+- n8n automation down
+- Prometheus scrape failures
+- and more...
+
+**You can check them anytime at:** http://10.0.10.25:9090/alerts
+
+---
+
+## Deployment
+
+### Quick Deploy
+```bash
+cd /root/.openclaw/workspace/fred-infrastructure
+./deploy-reduced-alerts.sh
+```
+
+### Manual Steps (if needed)
+```bash
+# 1. Backup existing configs
+ssh root@10.0.10.25 'mkdir -p /etc/prometheus/backups'
+ssh root@10.0.10.25 'cp /etc/prometheus/alertmanager.yml /etc/prometheus/backups/alertmanager.yml.backup'
+ssh root@10.0.10.25 'cp /etc/prometheus/rules/homelab-alerts.yml /etc/prometheus/backups/homelab-alerts.yml.backup'
+
+# 2. Upload new configs
+scp alertmanager-config-updated.yml root@10.0.10.25:/etc/prometheus/alertmanager.yml
+scp prometheus-alert-rules-updated.yml root@10.0.10.25:/etc/prometheus/rules/homelab-alerts.yml
+
+# 3. Reload services
+ssh root@10.0.10.25 'systemctl reload prometheus'
+ssh root@10.0.10.25 'systemctl reload prometheus-alertmanager'
+
+# 4. Verify
+curl http://10.0.10.25:9090/api/v1/rules
+curl http://10.0.10.25:9093/api/v2/status
+```
+
+### Test Critical Alert
+Send a test alert to verify that the Discord webhook works:
+```bash
+curl -X POST -H 'Content-Type: application/json' http://10.0.10.25:9093/api/v2/alerts -d '[
+  {
+    "labels": {
+      "alertname": "TestCriticalAlert",
+      "severity": "critical",
+      "instance": "test:9100"
+    },
+    "annotations": {
+      "summary": "Test alert - please ignore"
+    }
+  }
+]'
+```
+
+You should see this appear in Discord within ~30 seconds.
+
+---
+
+## Expected Results
+
+### Your Email Inbox
+- **Before:** 50+ Prometheus alerts per day
+- **After:** **ZERO** (all notifications moved to Discord)
+
+### Your Discord Server
+- **Only critical issues** that need immediate attention
+- Estimated: 0-2 alerts per day (unless something is actually broken)
+
+### Prometheus UI
+- **All alerts still visible** at http://10.0.10.25:9090/alerts
+- Use this to check warnings when convenient
+
+---
+
+## Monitoring Your Alerts
+
+### Check Active Alerts
+```bash
+# View current alerts
+curl -s http://10.0.10.25:9090/api/v1/alerts | python3 -m json.tool
+
+# View Alertmanager status
+curl -s http://10.0.10.25:9093/api/v2/status | python3 -m json.tool
+```
+
+### Web Interfaces
+- **Prometheus:** http://10.0.10.25:9090/alerts
+- **Alertmanager:** http://10.0.10.25:9093/#/alerts
+
+---
+
+## Rollback (If Needed)
+
+If something goes wrong, restore the backup:
+```bash
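+# Restore the newest backup of each file (a bare glob breaks cp if several backups match)
+ssh root@10.0.10.25 'cp "$(ls -t /etc/prometheus/backups/alertmanager.yml.* | head -n 1)" /etc/prometheus/alertmanager.yml'
+ssh root@10.0.10.25 'cp "$(ls -t /etc/prometheus/backups/homelab-alerts.yml.* | head -n 1)" /etc/prometheus/rules/homelab-alerts.yml'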
+ssh root@10.0.10.25 'systemctl reload prometheus prometheus-alertmanager'
+```
+
+---
+
+## Files Created
+
+- `prometheus-alert-rules-updated.yml` - Updated alert rules (CPU warning at 80%+; warnings logged only)
+- `alertmanager-config-updated.yml` - Only critical → Discord, warnings → null
+- `deploy-reduced-alerts.sh` - Automated deployment script
+- `ALERT-REDUCTION-SUMMARY.md` - This file
+
+---
+
+## Questions?
+
+- **"I want to see warnings occasionally"** → Check http://10.0.10.25:9090/alerts daily
+- **"I need email back"** → We can add it back for critical-only
+- **"80% CPU is too low"** → We can adjust the threshold up/down
+- **"I want alerts in Slack instead"** → Easy to add another webhook
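+
+As a rough illustration of the "email back for critical-only" option, an email integration could sit on the same receiver that already posts to Discord. The receiver name, addresses, and SMTP host are placeholders, and most SMTP servers will also need the `smtp_auth_*` settings:
+
+```yaml
+global:
+  smtp_smarthost: smtp.example.com:587   # placeholder
+  smtp_from: alertmanager@example.com    # placeholder
+
+receivers:
+  - name: discord-critical               # hypothetical receiver name
+    discord_configs:
+      - webhook_url: https://discord.com/api/webhooks/<id>/<token>   # placeholder
+    email_configs:
+      - to: admin@example.com            # placeholder
+```
+
+Because only critical alerts are routed to this receiver, email stays critical-only. A `slack_configs` entry can be added to the receiver in the same way.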
+
+Let me know if you want to tweak anything! 🎯