# Alert Reduction Summary
## What Changed
Your Prometheus alerting has been updated to **drastically reduce notification noise** while still catching critical issues.
### Before (Your Current Inbox)
- 🔥 **All warnings** → Email inbox spam
- Multiple alerts per minute
- Drowning out important messages
### After (New Configuration)
- **Only CRITICAL alerts** → Discord notification
- **WARNING alerts** → Logged in Prometheus UI (no notification)
- Clean inbox, important alerts still get through
---
## Updated Alert Thresholds
### CPU Monitoring
| Severity | Threshold | Duration | Action |
|----------|-----------|----------|--------|
| **WARNING** | 80%+ | 5 minutes | Logged only |
| **CRITICAL** | 95%+ | 5 minutes | **Discord notification** |
### Memory Monitoring
| Severity | Threshold | Duration | Action |
|----------|-----------|----------|--------|
| **WARNING** | 85%+ | 10 minutes | Logged only |
| **CRITICAL** | 95%+ | 5 minutes | **Discord notification** |
### Disk Space
| Severity | Threshold | Duration | Action |
|----------|-----------|----------|--------|
| **WARNING** | <15% free | 5 minutes | Logged only |
| **CRITICAL** | <5% free | 2 minutes | **Discord notification** |
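As a rough illustration of how the CPU thresholds above could be expressed as Prometheus rules, the sketch below writes a small rules file and validates it when `promtool` is installed. It assumes node_exporter metrics (`node_cpu_seconds_total`); the warning alert name `HighCPUUsage`, the group name `cpu-sketch`, and the output path are hypothetical, and the real `homelab-alerts.yml` may well differ.

```bash
# Sketch only: writes an example rules file for the CPU thresholds.
# Assumes node_exporter metrics; HighCPUUsage and cpu-sketch are hypothetical names.
write_cpu_rules() {
  local out="$1"
  # Quoted heredoc delimiter keeps {{ ... }} and $labels literal.
  cat > "$out" <<'EOF'
groups:
  - name: cpu-sketch
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) >= 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU at or above 80% on {{ $labels.instance }}"
      - alert: CriticalCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) >= 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CPU at or above 95% on {{ $labels.instance }}"
EOF
  # Validate only when promtool is available on this machine.
  if command -v promtool >/dev/null 2>&1; then
    promtool check rules "$out"
  fi
}
```

Usage: `write_cpu_rules /tmp/cpu-rules-sketch.yml`. With these expressions the warning rule also fires above 95%; since warnings are not notified, the overlap is harmless.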
---
## CRITICAL Alerts (Discord Notifications)
You will **ONLY** receive Discord notifications for:
### Infrastructure
- **HostDown** - Any host completely unreachable for 2+ minutes
- **ProxmoxNodeDown** - Proxmox host down for 2+ minutes
- **VPSDown** - VPS (66.63.182.168) unreachable for 2+ minutes
### Performance
- 🔥 **CriticalCPUUsage** - CPU >95% sustained for 5+ minutes
- 🧠 **CriticalMemoryUsage** - Memory >95% sustained for 5+ minutes
### Storage
- 💾 **DiskSpaceCritical** - Disk <5% free space for 2+ minutes
### Services
- 🗄️ **PostgreSQLDown** - Database down for 2+ minutes
- ⚙️ **PrometheusConfigReloadFailed** - Monitoring system config broken
---
## WARNING Alerts (Logged Only)
These alerts are **visible in Prometheus UI** but **do NOT trigger notifications**:
- CPU 80-95% (warning threshold)
- Memory 85-95%
- Disk 5-15% free
- Network interface down
- High disk I/O wait
- Home Assistant down
- n8n automation down
- Prometheus scrape failures
- and more...
**You can check them anytime at:** http://10.0.10.25:9090/alerts
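
The split between notified and logged-only alerts is usually done in Alertmanager routing. The sketch below writes a minimal config in that spirit and checks it with `amtool` when available. It assumes Alertmanager 0.25 or newer (native `discord_configs`); the receiver names, the grouping, and the placeholder webhook URL are hypothetical, and the actual `alertmanager-config-updated.yml` may be structured differently.

```bash
# Sketch only: critical alerts go to Discord, everything else to an empty receiver.
# Receiver names and the webhook URL are placeholders, not the deployed values.
write_alertmanager_sketch() {
  local out="$1"
  cat > "$out" <<'EOF'
route:
  # Default: anything not matched below goes to a receiver with no integrations.
  receiver: "null"
  group_by: ["alertname", "instance"]
  routes:
    - matchers:
        - 'severity="critical"'
      receiver: "discord"
receivers:
  - name: "null"
  - name: "discord"
    discord_configs:
      # Placeholder; substitute the real webhook URL.
      - webhook_url: "https://discord.com/api/webhooks/WEBHOOK_ID/WEBHOOK_TOKEN"
EOF
  # Validate only when amtool is available on this machine.
  if command -v amtool >/dev/null 2>&1; then
    amtool check-config "$out"
  fi
}
```

Usage: `write_alertmanager_sketch /tmp/alertmanager-sketch.yml`. Warnings stay visible in the Prometheus UI because the rules are still evaluated; only the notification is dropped.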
---
## Deployment
### Quick Deploy
```bash
cd /root/.openclaw/workspace/fred-infrastructure
./deploy-reduced-alerts.sh
```
### Manual Steps (if needed)
```bash
# 1. Backup existing configs
ssh root@10.0.10.25 'mkdir -p /etc/prometheus/backups'
ssh root@10.0.10.25 'cp /etc/prometheus/alertmanager.yml /etc/prometheus/backups/alertmanager.yml.backup'
ssh root@10.0.10.25 'cp /etc/prometheus/rules/homelab-alerts.yml /etc/prometheus/backups/homelab-alerts.yml.backup'
# 2. Upload new configs
scp alertmanager-config-updated.yml root@10.0.10.25:/etc/prometheus/alertmanager.yml
scp prometheus-alert-rules-updated.yml root@10.0.10.25:/etc/prometheus/rules/homelab-alerts.yml
# 3. Reload services
ssh root@10.0.10.25 'systemctl reload prometheus'
ssh root@10.0.10.25 'systemctl reload prometheus-alertmanager'
# 4. Verify
curl http://10.0.10.25:9090/api/v1/rules
curl http://10.0.10.25:9093/api/v2/status  # Alertmanager v2 API; v1 was removed in 0.27
```
### Test Critical Alert
Send a test alert to verify Discord webhook works:
```bash
# Uses the Alertmanager v2 API, which requires a JSON content type
curl -X POST -H 'Content-Type: application/json' http://10.0.10.25:9093/api/v2/alerts -d '[
  {
    "labels": {
      "alertname": "TestCriticalAlert",
      "severity": "critical",
      "instance": "test:9100"
    },
    "annotations": {
      "summary": "Test alert - please ignore"
    }
  }
]'
```
You should see this appear in Discord within ~30 seconds.
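
If nothing shows up in Discord, it can help to confirm that Alertmanager accepted the alert at all. A minimal sketch, assuming the Alertmanager v2 API and a hypothetical helper name `list_alertnames`, that extracts alert names from the JSON response using only `grep`, `sed`, and `sort`:

```bash
# Print the distinct alert names found in an Alertmanager alerts response on stdin.
# list_alertnames is a hypothetical helper, not part of any tool.
list_alertnames() {
  grep -oE '"alertname"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed -E 's/.*:[[:space:]]*"([^"]*)"$/\1/' \
    | sort -u
}
```

Usage: `curl -s http://10.0.10.25:9093/api/v2/alerts | list_alertnames`. If `TestCriticalAlert` is listed but Discord stays silent, the routing or webhook is the more likely culprit.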
---
## Expected Results
### Your Email Inbox
- **Before:** 50+ Prometheus alerts per day
- **After:** **ZERO** (all notifications moved to Discord)
### Your Discord Server
- **Only critical issues** that need immediate attention
- Estimated: 0-2 alerts per day (unless something is actually broken)
### Prometheus UI
- **All alerts still visible** at http://10.0.10.25:9090/alerts
- Use this to check warnings when convenient
---
## Monitoring Your Alerts
### Check Active Alerts
```bash
# View current alerts
curl http://10.0.10.25:9090/api/v1/alerts | python3 -m json.tool
# View Alertmanager status
curl http://10.0.10.25:9093/api/v2/status | python3 -m json.tool
```
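
For a quick tally instead of the full JSON, the Prometheus response can be summarized by severity. This is a sketch under the assumption that every rule sets a `severity` label; `count_by_severity` is a hypothetical helper, and it counts pending as well as firing alerts.

```bash
# Summarize alerts per severity label from a Prometheus /api/v1/alerts response on stdin.
# Prints one "severity=count" line per severity, sorted by name.
count_by_severity() {
  grep -oE '"severity"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed -E 's/.*:[[:space:]]*"([^"]*)"$/\1/' \
    | sort | uniq -c | awk '{print $2 "=" $1}'
}
```

Usage: `curl -s http://10.0.10.25:9090/api/v1/alerts | count_by_severity`.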
### Web Interfaces
- **Prometheus:** http://10.0.10.25:9090/alerts
- **Alertmanager:** http://10.0.10.25:9093/#/alerts
---
## Rollback (If Needed)
If something goes wrong, restore the backup:
```bash
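# Restore the newest backup of each file (a bare glob makes cp fail when several backups match)
ssh root@10.0.10.25 'cp "$(ls -t /etc/prometheus/backups/alertmanager.yml.* | head -n 1)" /etc/prometheus/alertmanager.yml'
ssh root@10.0.10.25 'cp "$(ls -t /etc/prometheus/backups/homelab-alerts.yml.* | head -n 1)" /etc/prometheus/rules/homelab-alerts.yml'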
ssh root@10.0.10.25 'systemctl reload prometheus prometheus-alertmanager'
```
---
## Files Created
- `prometheus-alert-rules-updated.yml` - Updated alert rules (CPU 80%+, warnings logged)
- `alertmanager-config-updated.yml` - Only critical → Discord, warnings → null
- `deploy-reduced-alerts.sh` - Automated deployment script
- `ALERT-REDUCTION-SUMMARY.md` - This file
---
## Questions?
- **"I want to see warnings occasionally"** → Check http://10.0.10.25:9090/alerts daily
- **"I need email back"** → We can add it back for critical-only
- **"80% CPU is too low"** → We can adjust the threshold up/down
- **"I want alerts in Slack instead"** → Easy to add another webhook
Let me know if you want to tweak anything! 🎯