Alert Reduction Summary
What Changed
Your Prometheus alerting has been updated to drastically reduce notification noise while still catching critical issues.
Before (Your Current Inbox)
- 🔥 All warnings → Email inbox spam
- Multiple alerts per minute
- Drowning out important messages
After (New Configuration)
- ✅ Only CRITICAL alerts → Discord notification
- WARNING alerts → Logged in Prometheus UI (no notification)
- Clean inbox, important alerts still get through
Updated Alert Thresholds
CPU Monitoring
| Severity | Threshold | Duration | Action |
|---|---|---|---|
| WARNING | 80%+ | 5 minutes | Logged only |
| CRITICAL | 95%+ | 5 minutes | Discord notification |
Memory Monitoring
| Severity | Threshold | Duration | Action |
|---|---|---|---|
| WARNING | 85%+ | 10 minutes | Logged only |
| CRITICAL | 95%+ | 5 minutes | Discord notification |
Disk Space
| Severity | Threshold | Duration | Action |
|---|---|---|---|
| WARNING | <15% free | 5 minutes | Logged only |
| CRITICAL | <5% free | 2 minutes | Discord notification |
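As an illustration of how one row pair from these tables maps onto Prometheus alerting rules, here is a minimal sketch for the CPU thresholds. The actual contents of prometheus-alert-rules-updated.yml are not shown in this summary, so the expression (which assumes node_exporter's `node_cpu_seconds_total` metric), the group name, and the `HighCPUUsage` alert name are assumptions; only `CriticalCPUUsage`, the thresholds, and the durations come from this document.

```shell
#!/bin/sh
# Sketch only: writes an EXAMPLE rules file to a temp path; it does not touch /etc/prometheus.
rules_file="${TMPDIR:-/tmp}/example-cpu-alerts.yml"

cat > "$rules_file" <<'EOF'
groups:
  - name: example-cpu            # hypothetical group name
    rules:
      - alert: HighCPUUsage      # hypothetical name for the WARNING rule
        expr: '100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80'
        for: 5m
        labels:
          severity: warning      # logged only
      - alert: CriticalCPUUsage
        expr: '100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95'
        for: 5m
        labels:
          severity: critical     # routed to Discord
EOF

# Validate the syntax if promtool happens to be installed locally.
if command -v promtool >/dev/null 2>&1; then
  promtool check rules "$rules_file"
else
  echo "promtool not found; skipped validation of $rules_file"
fi
```

The `severity` label is what separates the two actions in the tables: the thresholds live in the rule expressions, while the notification decision is made later by Alertmanager based on that label.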
CRITICAL Alerts (Discord Notifications)
You will ONLY receive Discord notifications for:
Infrastructure
- ❌ HostDown - Any host completely unreachable for 2+ minutes
- ❌ ProxmoxNodeDown - Proxmox host down for 2+ minutes
- ❌ VPSDown - VPS (66.63.182.168) unreachable for 2+ minutes
Performance
- 🔥 CriticalCPUUsage - CPU >95% sustained for 5+ minutes
- 🧠 CriticalMemoryUsage - Memory >95% sustained for 5+ minutes
Storage
- 💾 DiskSpaceCritical - Disk <5% free space for 2+ minutes
Services
- 🗄️ PostgreSQLDown - Database down for 2+ minutes
- ⚙️ PrometheusConfigReloadFailed - Monitoring system config broken
WARNING Alerts (Logged Only)
These alerts are visible in the Prometheus UI but do NOT trigger notifications:
- CPU 80-95% (warning threshold)
- Memory 85-95%
- Disk 5-15% free
- Network interface down
- High disk I/O wait
- Home Assistant down
- n8n automation down
- Prometheus scrape failures
- and more...
You can check them anytime at: http://10.0.10.25:9090/alerts
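The split between notified and logged-only alerts is typically done in Alertmanager routing. The sketch below shows one way this could look; it is not a copy of alertmanager-config-updated.yml. It assumes Alertmanager 0.25 or newer (for the native `discord_configs` receiver), and the receiver names and the webhook URL placeholder are hypothetical.

```shell
#!/bin/sh
# Sketch only: writes an EXAMPLE Alertmanager config to a temp path.
am_file="${TMPDIR:-/tmp}/example-alertmanager.yml"

cat > "$am_file" <<'EOF'
route:
  receiver: "null"               # default: everything not matched below is dropped silently
  group_by: ['alertname', 'instance']
  routes:
    - matchers:
        - severity="critical"
      receiver: discord          # hypothetical receiver name
receivers:
  - name: "null"                 # receiver with no configs, so nothing is sent
  - name: discord
    discord_configs:
      - webhook_url: "https://discord.com/api/webhooks/ID/TOKEN"   # placeholder, not a real webhook
EOF

# Validate the config if amtool happens to be installed locally.
if command -v amtool >/dev/null 2>&1; then
  amtool check-config "$am_file"
else
  echo "amtool not found; skipped validation of $am_file"
fi
```

Warnings still fire in Prometheus and show up on the alerts page; they simply end at a receiver that sends nothing.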
Deployment
Quick Deploy
cd /root/.openclaw/workspace/fred-infrastructure
./deploy-reduced-alerts.sh
Manual Steps (if needed)
# 1. Backup existing configs
ssh root@10.0.10.25 'mkdir -p /etc/prometheus/backups'
ssh root@10.0.10.25 'cp /etc/prometheus/alertmanager.yml /etc/prometheus/backups/alertmanager.yml.backup'
ssh root@10.0.10.25 'cp /etc/prometheus/rules/homelab-alerts.yml /etc/prometheus/backups/homelab-alerts.yml.backup'
# 2. Upload new configs
scp alertmanager-config-updated.yml root@10.0.10.25:/etc/prometheus/alertmanager.yml
scp prometheus-alert-rules-updated.yml root@10.0.10.25:/etc/prometheus/rules/homelab-alerts.yml
# 3. Reload services
ssh root@10.0.10.25 'systemctl reload prometheus'
ssh root@10.0.10.25 'systemctl reload prometheus-alertmanager'
# 4. Verify (Alertmanager 0.27+ removed the v1 API, so its status check uses v2)
curl http://10.0.10.25:9090/api/v1/rules
curl http://10.0.10.25:9093/api/v2/status
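The raw output of the rules endpoint is long, so a quick check that every critical rule listed in this summary was actually loaded can help. The helper below is a hypothetical sketch (`missing_alerts` is not part of the deploy script); it does a plain text match on rule names in the JSON rather than parsing it properly.

```shell
#!/bin/sh
# Critical alert names taken from the "CRITICAL Alerts" list in this summary.
expected_alerts="HostDown ProxmoxNodeDown VPSDown CriticalCPUUsage CriticalMemoryUsage DiskSpaceCritical PostgreSQLDown PrometheusConfigReloadFailed"

# missing_alerts JSON
# Prints the expected names that do not appear as a rule "name" in the given
# rules JSON. Returns non-zero if any name is missing.
missing_alerts() {
  rules_json="$1"
  missing=""
  for name in $expected_alerts; do
    if ! printf '%s' "$rules_json" | grep -Eq "\"name\"[[:space:]]*:[[:space:]]*\"$name\""; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "all expected critical rules are loaded"
}

# Usage (needs network access to the Prometheus host):
#   missing_alerts "$(curl -s http://10.0.10.25:9090/api/v1/rules)"
```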
Test Critical Alert
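Send a test alert to verify that the Discord webhook works (the v2 API requires a JSON content type; the v1 API was removed in Alertmanager 0.27):
curl -X POST -H 'Content-Type: application/json' http://10.0.10.25:9093/api/v2/alerts -d '[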
{
"labels": {
"alertname": "TestCriticalAlert",
"severity": "critical",
"instance": "test:9100"
},
"annotations": {
"summary": "Test alert - please ignore"
}
}
]'
You should see this appear in Discord within ~30 seconds.
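After the test, the alert stays active in Alertmanager until it times out. One way to clear it sooner is to post the same labels again with an `endsAt` timestamp that is not in the future, which Alertmanager treats as resolved. The helper below is a hypothetical sketch that only builds the payload; `build_test_alert` and `ALERTS_URL` are illustrative names, and `ALERTS_URL` should be the same endpoint used for the test alert above.

```shell
#!/bin/sh
# build_test_alert [ENDS_AT]
# Prints the test alert payload. With an RFC 3339 timestamp as ENDS_AT,
# the payload carries "endsAt", which marks the alert as resolved at that time.
build_test_alert() {
  ends_at="${1:-}"
  ends_line=""
  if [ -n "$ends_at" ]; then
    ends_line=", \"endsAt\": \"$ends_at\""
  fi
  cat <<EOF
[{"labels": {"alertname": "TestCriticalAlert", "severity": "critical", "instance": "test:9100"},
  "annotations": {"summary": "Test alert - please ignore"}$ends_line}]
EOF
}

# Usage (needs network access to Alertmanager):
#   now="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
#   curl -X POST -H 'Content-Type: application/json' -d "$(build_test_alert "$now")" "$ALERTS_URL"
```

The labels must match the original test alert exactly, since Alertmanager identifies an alert by its label set.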
Expected Results
Your Email Inbox
- Before: 50+ Prometheus alerts per day
- After: ZERO (all notifications moved to Discord)
Your Discord Server
- Only critical issues that need immediate attention
- Estimated: 0-2 alerts per day (unless something is actually broken)
Prometheus UI
- All alerts still visible at http://10.0.10.25:9090/alerts
- Use this to check warnings when convenient
Monitoring Your Alerts
Check Active Alerts
# View current alerts
curl http://10.0.10.25:9090/api/v1/alerts | python3 -m json.tool
# View Alertmanager status (v2 API; the v1 API was removed in Alertmanager 0.27)
curl http://10.0.10.25:9093/api/v2/status | python3 -m json.tool
Web Interfaces
- Prometheus: http://10.0.10.25:9090/alerts
- Alertmanager: http://10.0.10.25:9093/#/alerts
Rollback (If Needed)
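If something goes wrong, restore the backup. Each glob below must match exactly one backup file; if several match, cp fails because the target is not a directory: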
ssh root@10.0.10.25 'cp /etc/prometheus/backups/alertmanager.yml.* /etc/prometheus/alertmanager.yml'
ssh root@10.0.10.25 'cp /etc/prometheus/backups/homelab-alerts.yml.* /etc/prometheus/rules/homelab-alerts.yml'
ssh root@10.0.10.25 'systemctl reload prometheus prometheus-alertmanager'
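If the backups directory ever holds more than one backup per file (for example, timestamped copies), picking the newest one explicitly avoids an ambiguous glob. This is a minimal sketch meant to be run on the Prometheus host; `latest_backup` is a hypothetical helper, and it relies on `ls -t`, so it assumes backup file names contain no newlines.

```shell
#!/bin/sh
# latest_backup DIR BASENAME
# Prints the most recently modified file matching DIR/BASENAME.* (empty if none).
latest_backup() {
  backup_dir="$1"
  base="$2"
  # ls -t sorts by modification time, newest first.
  ls -t "$backup_dir"/"$base".* 2>/dev/null | head -n 1
}

# Usage on the Prometheus host:
#   src="$(latest_backup /etc/prometheus/backups alertmanager.yml)"
#   [ -n "$src" ] && cp "$src" /etc/prometheus/alertmanager.yml
```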
Files Created
- prometheus-alert-rules-updated.yml - Updated alert rules (CPU 80%+, warnings logged)
- alertmanager-config-updated.yml - Only critical → Discord, warnings → null
- deploy-reduced-alerts.sh - Automated deployment script
- ALERT-REDUCTION-SUMMARY.md - This file
Questions?
- "I want to see warnings occasionally" → Check http://10.0.10.25:9090/alerts daily
- "I need email back" → We can add it back for critical-only
- "80% CPU is too low" → We can adjust the threshold up/down
- "I want alerts in Slack instead" → Easy to add another webhook
Let me know if you want to tweak anything! 🎯