# Alert Reduction Summary

## What Changed

Your Prometheus alerting has been updated to **drastically reduce notification noise** while still catching critical issues.

### Before (Your Current Inbox)
- 🔥 **All warnings** → Email inbox spam
- Multiple alerts per minute
- Drowning out important messages

### After (New Configuration)
- ✅ **Only CRITICAL alerts** → Discord notification
- **WARNING alerts** → Logged in Prometheus UI (no notification)
- Clean inbox, important alerts still get through

---

## Updated Alert Thresholds

### CPU Monitoring
| Severity | Threshold | Duration | Action |
|----------|-----------|----------|--------|
| **WARNING** | 80%+ | 5 minutes | Logged only |
| **CRITICAL** | 95%+ | 5 minutes | **Discord notification** |

### Memory Monitoring
| Severity | Threshold | Duration | Action |
|----------|-----------|----------|--------|
| **WARNING** | 85%+ | 10 minutes | Logged only |
| **CRITICAL** | 95%+ | 5 minutes | **Discord notification** |

### Disk Space
| Severity | Threshold | Duration | Action |
|----------|-----------|----------|--------|
| **WARNING** | <15% free | 5 minutes | Logged only |
| **CRITICAL** | <5% free | 2 minutes | **Discord notification** |

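
As a rough illustration of how the CPU thresholds above could be written as Prometheus alerting rules, the sketch below writes two rules to a temporary file and checks them with `promtool` when it is installed. This is not the content of `prometheus-alert-rules-updated.yml`: the `node_cpu_seconds_total` expression assumes node_exporter metrics, and the `HighCPUUsage` alert name, the group name, and the `RULES_SKETCH` variable are made up for the example.

```bash
#!/usr/bin/env bash
# Sketch only: the real rules live in prometheus-alert-rules-updated.yml.
# Assumes node_exporter metrics; HighCPUUsage and cpu-sketch are hypothetical names.
set -euo pipefail

RULES_SKETCH="$(mktemp)"

# Quoted heredoc delimiter keeps {{ $labels.instance }} literal.
cat > "$RULES_SKETCH" <<'EOF'
groups:
  - name: cpu-sketch
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) >= 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU at or above 80% on {{ $labels.instance }}"
      - alert: CriticalCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) >= 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CPU at or above 95% on {{ $labels.instance }}"
EOF

# Validate the syntax if promtool is available; skip the check otherwise.
if command -v promtool >/dev/null 2>&1; then
  promtool check rules "$RULES_SKETCH"
fi
echo "Rule sketch written to $RULES_SKETCH"
```
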
---

## CRITICAL Alerts (Discord Notifications)

You will **ONLY** receive Discord notifications for:

### Infrastructure
- ❌ **HostDown** - Any host completely unreachable for 2+ minutes
- ❌ **ProxmoxNodeDown** - Proxmox host down for 2+ minutes
- ❌ **VPSDown** - VPS (66.63.182.168) unreachable for 2+ minutes

### Performance
- 🔥 **CriticalCPUUsage** - CPU >95% sustained for 5+ minutes
- 🧠 **CriticalMemoryUsage** - Memory >95% sustained for 5+ minutes

### Storage
- 💾 **DiskSpaceCritical** - Disk <5% free space for 2+ minutes

### Services
- 🗄️ **PostgreSQLDown** - Database down for 2+ minutes
- ⚙️ **PrometheusConfigReloadFailed** - Monitoring system config broken

---

## WARNING Alerts (Logged Only)

These alerts are **visible in Prometheus UI** but **do NOT trigger notifications**:

- CPU 80-95% (warning threshold)
- Memory 85-95%
- Disk 5-15% free
- Network interface down
- High disk I/O wait
- Home Assistant down
- n8n automation down
- Prometheus scrape failures
- and more...

**You can check them anytime at:** http://10.0.10.25:9090/alerts

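
The critical/warning split described above is usually done in the Alertmanager routing tree. A minimal sketch of that idea follows, assuming an Alertmanager version with native Discord support (`discord_configs`, available from 0.25). The actual `alertmanager-config-updated.yml` may differ: the receiver names, the placeholder webhook URL, and the `AM_SKETCH` variable are illustrative only.

```bash
#!/usr/bin/env bash
# Sketch only: the real routing lives in alertmanager-config-updated.yml.
# Receiver names and the webhook URL below are placeholders, not real values.
set -euo pipefail

AM_SKETCH="$(mktemp)"

cat > "$AM_SKETCH" <<'EOF'
route:
  receiver: "null"              # default route: no notification (warnings end up here)
  routes:
    - matchers:
        - severity="critical"   # only critical alerts are sent to Discord
      receiver: discord
receivers:
  - name: "null"                # a receiver without integrations sends nothing
  - name: discord
    discord_configs:
      - webhook_url: "https://discord.com/api/webhooks/EXAMPLE_ID/EXAMPLE_TOKEN"
EOF

# Validate the file if amtool is available; skip the check otherwise.
if command -v amtool >/dev/null 2>&1; then
  amtool check-config "$AM_SKETCH"
fi
echo "Routing sketch written to $AM_SKETCH"
```
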
---

## Deployment

### Quick Deploy
```bash
cd /root/.openclaw/workspace/fred-infrastructure
./deploy-reduced-alerts.sh
```

### Manual Steps (if needed)
```bash
# 1. Backup existing configs
ssh root@10.0.10.25 'mkdir -p /etc/prometheus/backups'
ssh root@10.0.10.25 'cp /etc/prometheus/alertmanager.yml /etc/prometheus/backups/alertmanager.yml.backup'
ssh root@10.0.10.25 'cp /etc/prometheus/rules/homelab-alerts.yml /etc/prometheus/backups/homelab-alerts.yml.backup'

# 2. Upload new configs
scp alertmanager-config-updated.yml root@10.0.10.25:/etc/prometheus/alertmanager.yml
scp prometheus-alert-rules-updated.yml root@10.0.10.25:/etc/prometheus/rules/homelab-alerts.yml

# 3. Reload services
ssh root@10.0.10.25 'systemctl reload prometheus'
ssh root@10.0.10.25 'systemctl reload prometheus-alertmanager'

# 4. Verify
curl http://10.0.10.25:9090/api/v1/rules
# Alertmanager: the v1 API was removed in 0.27, so query the v2 API
curl http://10.0.10.25:9093/api/v2/status
```

### Test Critical Alert
Send a test alert to verify that the Discord webhook works:
```bash
# Uses the Alertmanager v2 API (the v1 API was removed in 0.27)
curl -X POST http://10.0.10.25:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[
  {
    "labels": {
      "alertname": "TestCriticalAlert",
      "severity": "critical",
      "instance": "test:9100"
    },
    "annotations": {
      "summary": "Test alert - please ignore"
    }
  }
]'
```

You should see this appear in Discord within ~30 seconds (the exact delay depends on the route's `group_wait` setting).

---

## Expected Results

### Your Email Inbox
- **Before:** 50+ Prometheus alerts per day
- **After:** **ZERO** (all notifications moved to Discord)

### Your Discord Server
- **Only critical issues** that need immediate attention
- Estimated: 0-2 alerts per day (unless something is actually broken)

### Prometheus UI
- **All alerts still visible** at http://10.0.10.25:9090/alerts
- Use this to check warnings when convenient

---

## Monitoring Your Alerts

### Check Active Alerts
```bash
# View current alerts
curl http://10.0.10.25:9090/api/v1/alerts | python3 -m json.tool

# View Alertmanager status (v2 API; the v1 API was removed in Alertmanager 0.27)
curl http://10.0.10.25:9093/api/v2/status | python3 -m json.tool
```

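
Since warnings no longer notify anyone, a compact listing can be easier to scan than the raw JSON. The sketch below assumes the standard Prometheus `/api/v1/alerts` response shape (`data.alerts[]` with `labels` and `state`). The `list_firing_alerts` helper and the sample payload, including its instance names, are invented for illustration.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: reads a Prometheus /api/v1/alerts response on stdin and
# prints one line per firing alert as "<severity> <alertname> <instance>".
list_firing_alerts() {
  python3 -c '
import json, sys

payload = json.load(sys.stdin)
for alert in payload.get("data", {}).get("alerts", []):
    if alert.get("state") != "firing":
        continue  # skip alerts that are still pending
    labels = alert.get("labels", {})
    print(labels.get("severity", "none"),
          labels.get("alertname", "?"),
          labels.get("instance", "-"))
'
}

# Live usage (needs network access to the Prometheus host):
#   curl -s http://10.0.10.25:9090/api/v1/alerts | list_firing_alerts | grep '^warning'

# Offline demonstration with a hand-written sample response.
sample='{"status":"success","data":{"alerts":[
  {"labels":{"alertname":"HighCPUUsage","severity":"warning","instance":"pve:9100"},"state":"firing"},
  {"labels":{"alertname":"DiskSpaceCritical","severity":"critical","instance":"nas:9100"},"state":"pending"}
]}}'
printf '%s' "$sample" | list_firing_alerts
```
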
### Web Interfaces
- **Prometheus:** http://10.0.10.25:9090/alerts
- **Alertmanager:** http://10.0.10.25:9093/#/alerts

---

## Rollback (If Needed)

If something goes wrong, restore the backup:
```bash
# Each glob must match exactly one backup file, otherwise cp fails;
# if several backups exist, name the file to restore explicitly.
ssh root@10.0.10.25 'cp /etc/prometheus/backups/alertmanager.yml.* /etc/prometheus/alertmanager.yml'
ssh root@10.0.10.25 'cp /etc/prometheus/backups/homelab-alerts.yml.* /etc/prometheus/rules/homelab-alerts.yml'
ssh root@10.0.10.25 'systemctl reload prometheus prometheus-alertmanager'
```

---

## Files Created

- `prometheus-alert-rules-updated.yml` - Updated alert rules (CPU 80%+, warnings logged)
- `alertmanager-config-updated.yml` - Only critical → Discord, warnings → null
- `deploy-reduced-alerts.sh` - Automated deployment script
- `ALERT-REDUCTION-SUMMARY.md` - This file

---

## Questions?

- **"I want to see warnings occasionally"** → Check http://10.0.10.25:9090/alerts daily
- **"I need email back"** → We can add it back for critical-only
- **"80% CPU is too low"** → We can adjust the threshold up/down
- **"I want alerts in Slack instead"** → Easy to add another webhook

Let me know if you want to tweak anything! 🎯