Initial infrastructure documentation - comprehensive homelab reference
# Alert Reduction Summary

## What Changed

Your Prometheus alerting has been updated to **drastically reduce notification noise** while still catching critical issues.

### Before (Your Current Inbox)
- 🔥 **All warnings** → Email inbox spam
- Multiple alerts per minute
- Drowning out important messages

### After (New Configuration)
- ✅ **Only CRITICAL alerts** → Discord notification
- **WARNING alerts** → Logged in Prometheus UI (no notification)
- Clean inbox, important alerts still get through

---

## Updated Alert Thresholds

### CPU Monitoring
| Severity | Threshold | Duration | Action |
|----------|-----------|----------|--------|
| **WARNING** | 80%+ | 5 minutes | Logged only |
| **CRITICAL** | 95%+ | 5 minutes | **Discord notification** |

### Memory Monitoring
| Severity | Threshold | Duration | Action |
|----------|-----------|----------|--------|
| **WARNING** | 85%+ | 10 minutes | Logged only |
| **CRITICAL** | 95%+ | 5 minutes | **Discord notification** |

### Disk Space
| Severity | Threshold | Duration | Action |
|----------|-----------|----------|--------|
| **WARNING** | <15% free | 5 minutes | Logged only |
| **CRITICAL** | <5% free | 2 minutes | **Discord notification** |

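
To illustrate how one warning/critical pair from the tables above might look as Prometheus rules, here is a minimal sketch that writes an example rule group to a scratch file. It assumes node_exporter metrics (`node_cpu_seconds_total`); the `HighCPUUsage` alert name, the group name, and the annotation text are hypothetical, and the real `prometheus-alert-rules-updated.yml` may be written differently.

```bash
#!/usr/bin/env bash
# Sketch only: write an example CPU rule pair to a scratch file for inspection.
# Assumes node_exporter metrics; HighCPUUsage and the group name are made up.
rules_file="$(mktemp)"

# Quoted 'EOF' keeps the shell from expanding {{ $labels.instance }}.
cat > "$rules_file" <<'EOF'
groups:
  - name: homelab-cpu-example
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 80% on {{ $labels.instance }}"
      - alert: CriticalCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CPU above 95% on {{ $labels.instance }}"
EOF

echo "Wrote example rules to $rules_file"
```

With rules shaped like this, a host above 95% CPU matches both expressions, so the warning and the critical alert would fire together. Only the `severity` label decides which one reaches Discord.
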
---

## CRITICAL Alerts (Discord Notifications)

You will **ONLY** receive Discord notifications for:

### Infrastructure
- ❌ **HostDown** - Any host completely unreachable for 2+ minutes
- ❌ **ProxmoxNodeDown** - Proxmox host down for 2+ minutes
- ❌ **VPSDown** - VPS (66.63.182.168) unreachable for 2+ minutes

### Performance
- 🔥 **CriticalCPUUsage** - CPU >95% sustained for 5+ minutes
- 🧠 **CriticalMemoryUsage** - Memory >95% sustained for 5+ minutes

### Storage
- 💾 **DiskSpaceCritical** - Disk <5% free space for 2+ minutes

### Services
- 🗄️ **PostgreSQLDown** - Database down for 2+ minutes
- ⚙️ **PrometheusConfigReloadFailed** - Monitoring system config broken

---

## WARNING Alerts (Logged Only)

These alerts are **visible in Prometheus UI** but **do NOT trigger notifications**:

- CPU 80-95% (warning threshold)
- Memory 85-95%
- Disk 5-15% free
- Network interface down
- High disk I/O wait
- Home Assistant down
- n8n automation down
- Prometheus scrape failures
- and more...

**You can check them anytime at:** http://10.0.10.25:9090/alerts

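
The critical/warning split described above comes down to Alertmanager routing on the `severity` label. The sketch below writes an example routing tree to a scratch file. It is not a copy of `alertmanager-config-updated.yml`: the receiver names and `group_by` labels are hypothetical, the webhook URL is a placeholder, and it assumes an Alertmanager release with native `discord_configs` support (0.25 or later).

```bash
#!/usr/bin/env bash
# Sketch only: example routing where critical goes to Discord and the rest is dropped.
# Receiver names and the webhook URL are placeholders, not the deployed values.
am_file="$(mktemp)"

cat > "$am_file" <<'EOF'
route:
  receiver: "null"                 # default: anything not matched below is not sent anywhere
  group_by: [alertname, instance]
  routes:
    - matchers:
        - severity="critical"
      receiver: discord

receivers:
  - name: "null"                   # no integrations configured, so nothing is delivered
  - name: discord
    discord_configs:
      - webhook_url: "https://discord.com/api/webhooks/REPLACE_ME"
EOF

echo "Wrote example Alertmanager config to $am_file"
```

Because Prometheus evaluates every rule regardless of routing, warnings sent to the empty receiver still show up on the Prometheus alerts page.
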
---

## Deployment

### Quick Deploy
```bash
cd /root/.openclaw/workspace/fred-infrastructure
./deploy-reduced-alerts.sh
```

### Manual Steps (if needed)
```bash
# 1. Backup existing configs
ssh root@10.0.10.25 'mkdir -p /etc/prometheus/backups'
ssh root@10.0.10.25 'cp /etc/prometheus/alertmanager.yml /etc/prometheus/backups/alertmanager.yml.backup'
ssh root@10.0.10.25 'cp /etc/prometheus/rules/homelab-alerts.yml /etc/prometheus/backups/homelab-alerts.yml.backup'

# 2. Upload new configs
scp alertmanager-config-updated.yml root@10.0.10.25:/etc/prometheus/alertmanager.yml
scp prometheus-alert-rules-updated.yml root@10.0.10.25:/etc/prometheus/rules/homelab-alerts.yml

# 3. Reload services
ssh root@10.0.10.25 'systemctl reload prometheus'
ssh root@10.0.10.25 'systemctl reload prometheus-alertmanager'

# 4. Verify (Alertmanager removed its v1 API in 0.27, so use v2)
curl http://10.0.10.25:9090/api/v1/rules
curl http://10.0.10.25:9093/api/v2/status
```

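
Before uploading, it can help to validate both files locally so a typo does not break the running services on reload. The following is a minimal sketch, assuming `promtool` and `amtool` are installed on the machine holding the files and that it is run from the directory containing them. The `check_config` helper is a hypothetical name introduced here, not part of `deploy-reduced-alerts.sh`.

```bash
#!/usr/bin/env bash
# Run a checker against a local config file before uploading it.
# Usage: check_config <file> <tool> [tool args...]
check_config() {
  local file="$1" tool="$2"
  shift 2
  if [ ! -f "$file" ]; then
    echo "missing file: $file" >&2
    return 1
  fi
  if ! command -v "$tool" >/dev/null 2>&1; then
    # Not fatal: the check is simply skipped when the tool is absent.
    echo "skipping $file: $tool is not installed here" >&2
    return 0
  fi
  "$tool" "$@" "$file"
}

# promtool check rules <file> validates rule syntax and PromQL expressions.
check_config prometheus-alert-rules-updated.yml promtool check rules \
  || echo "rule file did not pass; fix it before uploading" >&2

# amtool check-config <file> validates the Alertmanager configuration.
check_config alertmanager-config-updated.yml amtool check-config \
  || echo "Alertmanager config did not pass; fix it before uploading" >&2
```
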
### Test Critical Alert
Send a test alert to verify that the Discord webhook works (the v2 API expects a JSON content type):
```bash
curl -X POST http://10.0.10.25:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[
  {
    "labels": {
      "alertname": "TestCriticalAlert",
      "severity": "critical",
      "instance": "test:9100"
    },
    "annotations": {
      "summary": "Test alert - please ignore"
    }
  }
]'
```

You should see this appear in Discord within ~30 seconds.

---

## Expected Results

### Your Email Inbox
- **Before:** 50+ Prometheus alerts per day
- **After:** **ZERO** (all notifications moved to Discord)

### Your Discord Server
- **Only critical issues** that need immediate attention
- Estimated: 0-2 alerts per day (unless something is actually broken)

### Prometheus UI
- **All alerts still visible** at http://10.0.10.25:9090/alerts
- Use this to check warnings when convenient

---

## Monitoring Your Alerts

### Check Active Alerts
```bash
# View current alerts
curl -s http://10.0.10.25:9090/api/v1/alerts | python3 -m json.tool

# View Alertmanager status (v2 API; v1 was removed in Alertmanager 0.27)
curl -s http://10.0.10.25:9093/api/v2/status | python3 -m json.tool
```

### Web Interfaces
- **Prometheus:** http://10.0.10.25:9090/alerts
- **Alertmanager:** http://10.0.10.25:9093/#/alerts

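
The raw JSON from the Prometheus alerts endpoint gets long once several warnings are active. As a rough sketch, the hypothetical `summarize_alerts` helper below reads that JSON on stdin and prints one line per severity and state with a count. It assumes the standard response shape (`data.alerts[]`, each with `labels` and `state`) and that `python3` is available.

```bash
#!/usr/bin/env bash
# Read the JSON body of Prometheus /api/v1/alerts on stdin and
# print "<severity> <state> <count>" lines, sorted by severity then state.
summarize_alerts() {
  python3 -c '
import json, sys
from collections import Counter

alerts = json.load(sys.stdin)["data"]["alerts"]
# Alerts without a severity label are grouped under "none".
counts = Counter((a["labels"].get("severity", "none"), a["state"]) for a in alerts)
for (severity, state), n in sorted(counts.items()):
    print(f"{severity} {state} {n}")
'
}

# Usage against the live server:
#   curl -s http://10.0.10.25:9090/api/v1/alerts | summarize_alerts
```
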
---

## Rollback (If Needed)

If something goes wrong, restore the backup:
```bash
# Use the exact backup names; a glob such as alertmanager.yml.* makes cp fail if it matches more than one file
ssh root@10.0.10.25 'cp /etc/prometheus/backups/alertmanager.yml.backup /etc/prometheus/alertmanager.yml'
ssh root@10.0.10.25 'cp /etc/prometheus/backups/homelab-alerts.yml.backup /etc/prometheus/rules/homelab-alerts.yml'
ssh root@10.0.10.25 'systemctl reload prometheus prometheus-alertmanager'
```

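
Note that the manual backup step always writes to the same `.backup` name, so a second deployment would overwrite the last known-good copy. One possible alternative is timestamped backups, sketched below. The helper names `backup_file` and `latest_backup` are hypothetical, and the functions would need to run on the Prometheus host itself (for example inside an `ssh` session).

```bash
#!/usr/bin/env bash
# backup_file <file> <backup_dir>: copy a file with a timestamp suffix.
backup_file() {
  local file="$1" dir="$2"
  mkdir -p "$dir"
  cp -p "$file" "$dir/$(basename "$file").$(date +%Y%m%d-%H%M%S)"
}

# latest_backup <name> <backup_dir>: print the newest timestamped backup.
# Only suffixes starting with a digit are considered, so an older
# "<name>.backup" file is ignored. Timestamps sort lexicographically.
latest_backup() {
  local name="$1" dir="$2"
  ls -1 "$dir/$name".[0-9]* 2>/dev/null | sort | tail -n 1
}

# Example, run on the Prometheus host:
#   backup_file /etc/prometheus/alertmanager.yml /etc/prometheus/backups
#   cp "$(latest_backup alertmanager.yml /etc/prometheus/backups)" /etc/prometheus/alertmanager.yml
```
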
---

## Files Created

- `prometheus-alert-rules-updated.yml` - Updated alert rules (CPU 80%+, warnings logged)
- `alertmanager-config-updated.yml` - Only critical → Discord, warnings → null
- `deploy-reduced-alerts.sh` - Automated deployment script
- `ALERT-REDUCTION-SUMMARY.md` - This file

---

## Questions?

- **"I want to see warnings occasionally"** → Check http://10.0.10.25:9090/alerts daily
- **"I need email back"** → We can add it back for critical-only
- **"80% CPU is too low"** → We can adjust the threshold up/down
- **"I want alerts in Slack instead"** → Easy to add another webhook

Let me know if you want to tweak anything! 🎯