# Alert Reduction Summary

## What Changed

Your Prometheus alerting has been updated to **drastically reduce notification noise** while still catching critical issues.

### Before (Your Current Inbox)

- 🔥 **All warnings** → Email inbox spam
- Multiple alerts per minute
- Drowning out important messages

### After (New Configuration)

- ✅ **Only CRITICAL alerts** → Discord notification
- **WARNING alerts** → Logged in Prometheus UI (no notification)
- Clean inbox, important alerts still get through

---

## Updated Alert Thresholds

### CPU Monitoring

| Severity | Threshold | Duration | Action |
|----------|-----------|----------|--------|
| **WARNING** | 80%+ | 5 minutes | Logged only |
| **CRITICAL** | 95%+ | 5 minutes | **Discord notification** |

### Memory Monitoring

| Severity | Threshold | Duration | Action |
|----------|-----------|----------|--------|
| **WARNING** | 85%+ | 10 minutes | Logged only |
| **CRITICAL** | 95%+ | 5 minutes | **Discord notification** |

### Disk Space

| Severity | Threshold | Duration | Action |
|----------|-----------|----------|--------|
| **WARNING** | <15% free | 5 minutes | Logged only |
| **CRITICAL** | <5% free | 2 minutes | **Discord notification** |

---

## CRITICAL Alerts (Discord Notifications)

You will **ONLY** receive Discord notifications for:

### Infrastructure

- ❌ **HostDown** - Any host completely unreachable for 2+ minutes
- ❌ **ProxmoxNodeDown** - Proxmox host down for 2+ minutes
- ❌ **VPSDown** - VPS (66.63.182.168) unreachable for 2+ minutes

### Performance

- 🔥 **CriticalCPUUsage** - CPU >95% sustained for 5+ minutes
- 🧠 **CriticalMemoryUsage** - Memory >95% sustained for 5+ minutes

### Storage

- 💾 **DiskSpaceCritical** - Disk <5% free space for 2+ minutes

### Services

- 🗄️ **PostgreSQLDown** - Database down for 2+ minutes
- ⚙️ **PrometheusConfigReloadFailed** - Monitoring system config broken

---

## WARNING Alerts (Logged Only)

These alerts are **visible in the Prometheus UI** but **do NOT trigger notifications**:

- CPU 80-95% (warning threshold)
- Memory 85-95%
- Disk 5-15% free
- Network interface down
- High disk I/O wait
- Home Assistant down
- n8n automation down
- Prometheus scrape failures
- and more...

**You can check them anytime at:** http://10.0.10.25:9090/alerts

---

## Deployment

### Quick Deploy

```bash
cd /root/.openclaw/workspace/fred-infrastructure
./deploy-reduced-alerts.sh
```

### Manual Steps (if needed)

```bash
# 1. Backup existing configs
ssh root@10.0.10.25 'mkdir -p /etc/prometheus/backups'
ssh root@10.0.10.25 'cp /etc/prometheus/alertmanager.yml /etc/prometheus/backups/alertmanager.yml.backup'
ssh root@10.0.10.25 'cp /etc/prometheus/rules/homelab-alerts.yml /etc/prometheus/backups/homelab-alerts.yml.backup'

# 2. Upload new configs
scp alertmanager-config-updated.yml root@10.0.10.25:/etc/prometheus/alertmanager.yml
scp prometheus-alert-rules-updated.yml root@10.0.10.25:/etc/prometheus/rules/homelab-alerts.yml

# 3. Reload services
ssh root@10.0.10.25 'systemctl reload prometheus'
ssh root@10.0.10.25 'systemctl reload prometheus-alertmanager'

# 4. Verify
curl http://10.0.10.25:9090/api/v1/rules
curl http://10.0.10.25:9093/api/v1/status
```

### Test Critical Alert

Send a test alert to verify the Discord webhook works:

```bash
curl -X POST http://10.0.10.25:9093/api/v1/alerts -d '[
  {
    "labels": {
      "alertname": "TestCriticalAlert",
      "severity": "critical",
      "instance": "test:9100"
    },
    "annotations": {
      "summary": "Test alert - please ignore"
    }
  }
]'
```

You should see this appear in Discord within ~30 seconds.
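For reference, the severity-based routing in `alertmanager-config-updated.yml` likely follows this shape (a minimal sketch, not the actual file — the receiver names and webhook URL here are placeholders):

```yaml
route:
  receiver: "null"            # default: warnings and anything unmatched go nowhere
  routes:
    - match:
        severity: critical    # only critical-severity alerts reach Discord
      receiver: discord

receivers:
  - name: "null"              # no notification channel; alerts remain visible in the UI
  - name: discord
    discord_configs:
      - webhook_url: "https://discord.com/api/webhooks/PLACEHOLDER"  # placeholder URL
```

The key idea is that the default route discards everything, and only alerts explicitly labeled `severity: critical` fall through to the Discord receiver.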
---

## Expected Results

### Your Email Inbox

- **Before:** 50+ Prometheus alerts per day
- **After:** **ZERO** (all notifications moved to Discord)

### Your Discord Server

- **Only critical issues** that need immediate attention
- Estimated: 0-2 alerts per day (unless something is actually broken)

### Prometheus UI

- **All alerts still visible** at http://10.0.10.25:9090/alerts
- Use this to check warnings when convenient

---

## Monitoring Your Alerts

### Check Active Alerts

```bash
# View current alerts
curl http://10.0.10.25:9090/api/v1/alerts | python3 -m json.tool

# View Alertmanager status
curl http://10.0.10.25:9093/api/v1/status | python3 -m json.tool
```

### Web Interfaces

- **Prometheus:** http://10.0.10.25:9090/alerts
- **Alertmanager:** http://10.0.10.25:9093/#/alerts

---

## Rollback (If Needed)

If something goes wrong, restore the backups:

```bash
ssh root@10.0.10.25 'cp /etc/prometheus/backups/alertmanager.yml.backup /etc/prometheus/alertmanager.yml'
ssh root@10.0.10.25 'cp /etc/prometheus/backups/homelab-alerts.yml.backup /etc/prometheus/rules/homelab-alerts.yml'
ssh root@10.0.10.25 'systemctl reload prometheus prometheus-alertmanager'
```

---

## Files Created

- `prometheus-alert-rules-updated.yml` - Updated alert rules (CPU 80%+, warnings logged)
- `alertmanager-config-updated.yml` - Only critical → Discord, warnings → null
- `deploy-reduced-alerts.sh` - Automated deployment script
- `ALERT-REDUCTION-SUMMARY.md` - This file

---

## Questions?

- **"I want to see warnings occasionally"** → Check http://10.0.10.25:9090/alerts daily
- **"I need email back"** → We can add it back for critical-only alerts
- **"80% CPU is too low"** → We can adjust the threshold up or down
- **"I want alerts in Slack instead"** → Easy to add another webhook

Let me know if you want to tweak anything! 🎯
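For reference, a critical-severity rule in `prometheus-alert-rules-updated.yml` likely looks something like the sketch below (assuming standard node_exporter metrics — the group name, rule name, and exact expression may differ from the actual file):

```yaml
groups:
  - name: homelab-critical
    rules:
      - alert: CriticalCPUUsage
        # Average CPU busy percentage per host over the last 5 minutes
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 5m                 # must stay above 95% for 5 minutes before firing
        labels:
          severity: critical    # this label is what routes the alert to Discord
        annotations:
          summary: "CPU >95% on {{ $labels.instance }} for 5+ minutes"
```

The `severity: critical` label is the link between the rule files and the Alertmanager routing: only alerts carrying that label trigger a Discord notification.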