# Alert Reduction Summary

## What Changed

Your Prometheus alerting has been updated to **drastically reduce notification noise** while still catching critical issues.

### Before (Your Current Inbox)
- 🔥 **All warnings** → Email inbox spam
- Multiple alerts per minute
- Drowning out important messages

### After (New Configuration)
- ✅ **Only CRITICAL alerts** → Discord notification
- **WARNING alerts** → Logged in Prometheus UI (no notification)
- Clean inbox, important alerts still get through

---

## Updated Alert Thresholds

### CPU Monitoring
| Severity | Threshold | Duration | Action |
|----------|-----------|----------|--------|
| **WARNING** | 80%+ | 5 minutes | Logged only |
| **CRITICAL** | 95%+ | 5 minutes | **Discord notification** |

### Memory Monitoring
| Severity | Threshold | Duration | Action |
|----------|-----------|----------|--------|
| **WARNING** | 85%+ | 10 minutes | Logged only |
| **CRITICAL** | 95%+ | 5 minutes | **Discord notification** |

### Disk Space
| Severity | Threshold | Duration | Action |
|----------|-----------|----------|--------|
| **WARNING** | <15% free | 5 minutes | Logged only |
| **CRITICAL** | <5% free | 2 minutes | **Discord notification** |
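
To make the tables concrete, here is a minimal sketch of how a single reading maps to a severity. `severity_for` and `disk_severity_for` are hypothetical helpers, not part of the deployed rules. They take whole-number percentages, treat the CPU and memory thresholds as inclusive (an assumption), and ignore the Duration column, which in Prometheus is normally expressed with a rule's `for:` field.

```bash
# Hypothetical helper: classify a "percent used" reading against warning/critical thresholds.
# Usage: severity_for <used_pct> <warn_pct> <crit_pct>   (whole numbers only)
# The body runs in a subshell, so its variables do not leak into the caller.
severity_for() (
  used=$1 warn=$2 crit=$3
  if [ "$used" -ge "$crit" ]; then
    echo critical
  elif [ "$used" -ge "$warn" ]; then
    echo warning
  else
    echo ok
  fi
)

# Disk thresholds are stated as "percent free", so the comparison is flipped.
# Usage: disk_severity_for <free_pct>
disk_severity_for() (
  free=$1
  if [ "$free" -lt 5 ]; then
    echo critical
  elif [ "$free" -lt 15 ]; then
    echo warning
  else
    echo ok
  fi
)

severity_for 96 80 95      # CPU at 96%     → critical
severity_for 90 85 95      # memory at 90%  → warning
disk_severity_for 4        # 4% free        → critical
```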

---

## CRITICAL Alerts (Discord Notifications)

You will **ONLY** receive Discord notifications for:

### Infrastructure
- ❌ **HostDown** - Any host completely unreachable for 2+ minutes
- ❌ **ProxmoxNodeDown** - Proxmox host down for 2+ minutes
- ❌ **VPSDown** - VPS (66.63.182.168) unreachable for 2+ minutes

### Performance
- 🔥 **CriticalCPUUsage** - CPU >95% sustained for 5+ minutes
- 🧠 **CriticalMemoryUsage** - Memory >95% sustained for 5+ minutes

### Storage
- 💾 **DiskSpaceCritical** - Disk <5% free space for 2+ minutes

### Services
- 🗄️ **PostgreSQLDown** - Database down for 2+ minutes
- ⚙️ **PrometheusConfigReloadFailed** - Monitoring system config broken

---

## WARNING Alerts (Logged Only)

These alerts are **visible in Prometheus UI** but **do NOT trigger notifications**:

- CPU 80-95% (warning threshold)
- Memory 85-95%
- Disk 5-15% free
- Network interface down
- High disk I/O wait
- Home Assistant down
- n8n automation down
- Prometheus scrape failures
- and more...

**You can check them anytime at:** http://10.0.10.25:9090/alerts

---

## Deployment

### Quick Deploy
```bash
cd /root/.openclaw/workspace/fred-infrastructure
./deploy-reduced-alerts.sh
```

### Manual Steps (if needed)
```bash
# 1. Backup existing configs
ssh root@10.0.10.25 'mkdir -p /etc/prometheus/backups'
ssh root@10.0.10.25 'cp /etc/prometheus/alertmanager.yml /etc/prometheus/backups/alertmanager.yml.backup'
ssh root@10.0.10.25 'cp /etc/prometheus/rules/homelab-alerts.yml /etc/prometheus/backups/homelab-alerts.yml.backup'

# 2. Upload new configs
scp alertmanager-config-updated.yml root@10.0.10.25:/etc/prometheus/alertmanager.yml
scp prometheus-alert-rules-updated.yml root@10.0.10.25:/etc/prometheus/rules/homelab-alerts.yml

# 3. Reload services
ssh root@10.0.10.25 'systemctl reload prometheus'
ssh root@10.0.10.25 'systemctl reload prometheus-alertmanager'

# 4. Verify
curl http://10.0.10.25:9090/api/v1/rules
# Alertmanager's v1 API was removed in 0.27.0; use the v2 endpoint
curl http://10.0.10.25:9093/api/v2/status
```
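
Before uploading in step 2, it may be worth validating both files locally. The sketch below assumes `promtool` (shipped with Prometheus) and `amtool` (shipped with Alertmanager) are installed on the machine you deploy from. `check_with` is a hypothetical wrapper that skips a check with a warning when the tool is missing.

```bash
# Hypothetical wrapper: run a validation tool if it is installed, otherwise warn and continue.
# Usage: check_with <tool> [args...]
check_with() (
  tool=$1
  if ! command -v "$tool" >/dev/null 2>&1; then
    echo "skipped: $tool not installed" >&2
    exit 0   # leaves only this subshell, so the function returns 0
  fi
  "$@"
)

# Print the go-ahead only when both checks pass (or were skipped).
if check_with promtool check rules prometheus-alert-rules-updated.yml &&
   check_with amtool check-config alertmanager-config-updated.yml; then
  echo "configs look valid; safe to continue with the upload"
fi
```

Skipping a missing tool is a convenience choice. If you would rather fail hard, change `exit 0` to `exit 1`.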

### Test Critical Alert
Send a test alert to verify Discord webhook works:
```bash
curl -X POST -H 'Content-Type: application/json' http://10.0.10.25:9093/api/v2/alerts -d '[
  {
    "labels": {
      "alertname": "TestCriticalAlert",
      "severity": "critical",
      "instance": "test:9100"
    },
    "annotations": {
      "summary": "Test alert - please ignore"
    }
  }
]'
```

You should see this appear in Discord within ~30 seconds (the exact delay depends on the route's `group_wait` setting).
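
The same request can also check the other half of the change: a `warning` test alert should show up in Alertmanager but not in Discord. A small sketch follows. `test_alert_payload` is a hypothetical helper that only prints the JSON body, and `TestWarningAlert` is an illustrative name.

```bash
# Hypothetical helper: print a one-alert JSON payload for Alertmanager.
# Usage: test_alert_payload <alertname> <severity>
# Arguments are inserted verbatim, so keep them free of quotes and backslashes.
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","severity":"%s","instance":"test:9100"},' "$1" "$2"
  printf '"annotations":{"summary":"Test alert - please ignore"}}]\n'
}

# Print the payload for a warning-level test alert.
test_alert_payload TestWarningAlert warning
```

To send it, pipe the output into the same `curl` command, using `-d @-` in place of the inline body. If routing works as described, the alert is listed in the Alertmanager UI and nothing is posted to Discord.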

---

## Expected Results

### Your Email Inbox
- **Before:** 50+ Prometheus alerts per day
- **After:** **ZERO** (all notifications moved to Discord)

### Your Discord Server
- **Only critical issues** that need immediate attention
- Estimated: 0-2 alerts per day (unless something is actually broken)

### Prometheus UI
- **All alerts still visible** at http://10.0.10.25:9090/alerts
- Use this to check warnings when convenient

---

## Monitoring Your Alerts

### Check Active Alerts
```bash
# View current alerts
curl http://10.0.10.25:9090/api/v1/alerts | python3 -m json.tool

# View Alertmanager status (the v1 API was removed in Alertmanager 0.27.0)
curl http://10.0.10.25:9093/api/v2/status | python3 -m json.tool
```
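
For a quick tally rather than the full JSON, here is a rough sketch. `count_severity` is a hypothetical helper that counts occurrences of a `severity` label in an API response read from stdin. It relies on the response being compact JSON (no space after the colon) and counts pending as well as firing alerts, so treat it as a convenience, not a parser. The alert names in the sample are illustrative.

```bash
# Hypothetical helper: count alerts carrying a given severity label in JSON read from stdin.
# Usage: ... | count_severity <severity>
count_severity() {
  grep -o "\"severity\":\"$1\"" | wc -l | tr -d ' '
}

# Canned response shaped like the output of Prometheus' /api/v1/alerts.
sample='{"status":"success","data":{"alerts":[{"labels":{"alertname":"HostDown","severity":"critical"},"state":"firing"},{"labels":{"alertname":"HighCPUUsage","severity":"warning"},"state":"pending"}]}}'

printf '%s\n' "$sample" | count_severity critical   # → 1
```

Against the live server this would look like `curl -s http://10.0.10.25:9090/api/v1/alerts | count_severity warning`.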

### Web Interfaces
- **Prometheus:** http://10.0.10.25:9090/alerts
- **Alertmanager:** http://10.0.10.25:9093/#/alerts

---

## Rollback (If Needed)

If something goes wrong, restore the backup:
```bash
# The globs assume exactly one backup per file; if several match, cp fails, so name the file explicitly.
ssh root@10.0.10.25 'cp /etc/prometheus/backups/alertmanager.yml.* /etc/prometheus/alertmanager.yml'
ssh root@10.0.10.25 'cp /etc/prometheus/backups/homelab-alerts.yml.* /etc/prometheus/rules/homelab-alerts.yml'
ssh root@10.0.10.25 'systemctl reload prometheus prometheus-alertmanager'
```
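
If backups are taken more than once, a timestamped name avoids overwriting the previous copy and makes "restore the newest" unambiguous. The sketch below works under that assumption. `backup_config` and `latest_backup` are hypothetical helpers meant to run on the Prometheus host, and the timestamp suffix is a naming choice made here, not something the deploy script is known to do.

```bash
# Hypothetical helper: copy a file into a backup directory with a timestamp suffix.
# Usage: backup_config <file> <backup_dir>
backup_config() (
  src=$1 dir=$2
  mkdir -p "$dir" &&
    cp "$src" "$dir/$(basename "$src").$(date +%Y%m%d-%H%M%S)"
)

# Hypothetical helper: print the newest timestamped backup of <name>, or fail if none exist.
# Usage: latest_backup <backup_dir> <name>
latest_backup() (
  newest=""
  for f in "$1/$2".[0-9]*; do
    [ -e "$f" ] || continue   # the glob matched nothing
    newest=$f                 # glob results are sorted, so the last match is the newest
  done
  [ -n "$newest" ] && printf '%s\n' "$newest"
)
```

A backup would then be `backup_config /etc/prometheus/alertmanager.yml /etc/prometheus/backups`, and a restore would be `cp "$(latest_backup /etc/prometheus/backups alertmanager.yml)" /etc/prometheus/alertmanager.yml` followed by the reload above.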

---

## Files Created

- `prometheus-alert-rules-updated.yml` - Updated alert rules (CPU 80%+, warnings logged)
- `alertmanager-config-updated.yml` - Only critical → Discord, warnings → null
- `deploy-reduced-alerts.sh` - Automated deployment script
- `ALERT-REDUCTION-SUMMARY.md` - This file

---

## Questions?

- **"I want to see warnings occasionally"** → Check http://10.0.10.25:9090/alerts daily
- **"I need email back"** → We can add it back for critical-only
- **"80% CPU is too low"** → We can adjust the threshold up/down
- **"I want alerts in Slack instead"** → Easy to add another webhook

Let me know if you want to tweak anything! 🎯