Alert Reduction Summary

What Changed

Your Prometheus alerting has been updated to drastically reduce notification noise while still catching critical issues.

Before (Your Current Inbox)

  • 🔥 All warnings → Email inbox spam
  • Multiple alerts per minute
  • Drowning out important messages

After (New Configuration)

  • Only CRITICAL alerts → Discord notification
  • WARNING alerts → Logged in Prometheus UI (no notification)
  • Clean inbox, important alerts still get through
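
A rough sketch of how this split might look in alertmanager.yml follows. It is an illustration under assumptions, not the deployed file: the receiver names, the group_by labels, and the webhook URL placeholder are made up here, and the native discord_configs integration assumes Alertmanager 0.25 or newer.

```yaml
# Sketch only: receiver names and the webhook URL are placeholders.
route:
  receiver: "null"                 # default route: anything unmatched is dropped
  group_by: ["alertname", "instance"]
  routes:
    - matchers:
        - severity="critical"      # only critical alerts reach Discord
      receiver: "discord"

receivers:
  - name: "null"                   # no integrations, so no notification is sent
  - name: "discord"
    discord_configs:               # assumes Alertmanager >= 0.25
      - webhook_url: "https://discord.com/api/webhooks/<id>/<token>"
```

Warnings still fire and stay visible in the Prometheus UI; they simply fall through to the empty receiver.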

Updated Alert Thresholds

CPU Monitoring

| Severity | Threshold | Duration | Action |
|----------|-----------|----------|--------|
| WARNING | 80%+ | 5 minutes | Logged only |
| CRITICAL | 95%+ | 5 minutes | Discord notification |

Memory Monitoring

| Severity | Threshold | Duration | Action |
|----------|-----------|----------|--------|
| WARNING | 85%+ | 10 minutes | Logged only |
| CRITICAL | 95%+ | 5 minutes | Discord notification |

Disk Space

| Severity | Threshold | Duration | Action |
|----------|-----------|----------|--------|
| WARNING | <15% free | 5 minutes | Logged only |
| CRITICAL | <5% free | 2 minutes | Discord notification |
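
As a rough sketch, the CPU and disk rows above might translate into Prometheus rules like the ones below. This assumes the hosts export node_exporter metrics (node_cpu_seconds_total, node_filesystem_avail_bytes, node_filesystem_size_bytes). The group name, the two warning rule names, and the fstype filter are illustrative and not taken from the deployed homelab-alerts.yml.

```yaml
groups:
  - name: homelab-thresholds        # hypothetical group name
    rules:
      - alert: HighCPUUsage         # hypothetical name for the warning rule
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning         # logged only
      - alert: CriticalCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 5m
        labels:
          severity: critical        # routed to Discord
      - alert: DiskSpaceWarning     # hypothetical name for the warning rule
        expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} * 100 < 15
        for: 5m
        labels:
          severity: warning
      - alert: DiskSpaceCritical
        expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} * 100 < 5
        for: 2m
        labels:
          severity: critical
```

Written this way, a host above 95% CPU fires both the warning and the critical rule. Only the critical one produces a notification, since the severity label decides the routing.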

CRITICAL Alerts (Discord Notifications)

You will ONLY receive Discord notifications for:

Infrastructure

  • HostDown - Any host completely unreachable for 2+ minutes
  • ProxmoxNodeDown - Proxmox host down for 2+ minutes
  • VPSDown - VPS (66.63.182.168) unreachable for 2+ minutes

Performance

  • 🔥 CriticalCPUUsage - CPU >95% sustained for 5+ minutes
  • 🧠 CriticalMemoryUsage - Memory >95% sustained for 5+ minutes

Storage

  • 💾 DiskSpaceCritical - Disk <5% free space for 2+ minutes

Services

  • 🗄️ PostgreSQLDown - Database down for 2+ minutes
  • ⚙️ PrometheusConfigReloadFailed - Monitoring system config broken
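
For illustration, two of the availability alerts above could be written along these lines. The up and prometheus_config_last_reload_successful metrics are exposed by Prometheus itself. The group name, the job="node" selector, and the 1m duration on the reload rule are assumptions, and the real rules may differ.

```yaml
groups:
  - name: homelab-availability      # hypothetical group name
    rules:
      - alert: HostDown
        expr: up{job="node"} == 0   # job label is an assumption; match your scrape config
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 2+ minutes"
      - alert: PrometheusConfigReloadFailed
        expr: prometheus_config_last_reload_successful == 0
        for: 1m                     # assumed duration; the source does not state one
        labels:
          severity: critical
```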

WARNING Alerts (Logged Only)

These alerts are visible in Prometheus UI but do NOT trigger notifications:

  • CPU 80-95% (warning threshold)
  • Memory 85-95%
  • Disk 5-15% free
  • Network interface down
  • High disk I/O wait
  • Home Assistant down
  • n8n automation down
  • Prometheus scrape failures
  • and more...

You can check them anytime at: http://10.0.10.25:9090/alerts


Deployment

Quick Deploy

cd /root/.openclaw/workspace/fred-infrastructure
./deploy-reduced-alerts.sh

Manual Steps (if needed)

# 1. Backup existing configs
ssh root@10.0.10.25 'mkdir -p /etc/prometheus/backups'
ssh root@10.0.10.25 'cp /etc/prometheus/alertmanager.yml /etc/prometheus/backups/alertmanager.yml.backup'
ssh root@10.0.10.25 'cp /etc/prometheus/rules/homelab-alerts.yml /etc/prometheus/backups/homelab-alerts.yml.backup'

# 2. Upload new configs
scp alertmanager-config-updated.yml root@10.0.10.25:/etc/prometheus/alertmanager.yml
scp prometheus-alert-rules-updated.yml root@10.0.10.25:/etc/prometheus/rules/homelab-alerts.yml

# 3. Reload services
ssh root@10.0.10.25 'systemctl reload prometheus'
ssh root@10.0.10.25 'systemctl reload prometheus-alertmanager'

# 4. Verify (Alertmanager's v1 API was removed in 0.27, so query v2)
curl -s http://10.0.10.25:9090/api/v1/rules
curl -s http://10.0.10.25:9093/api/v2/status

Test Critical Alert

Send a test alert to verify that the Discord webhook works:

curl -X POST http://10.0.10.25:9093/api/v2/alerts -H 'Content-Type: application/json' -d '[
  {
    "labels": {
      "alertname": "TestCriticalAlert",
      "severity": "critical",
      "instance": "test:9100"
    },
    "annotations": {
      "summary": "Test alert - please ignore"
    }
  }
]'

You should see this appear in Discord within ~30 seconds, which is roughly the route's group_wait (30 seconds by default).


Expected Results

Your Email Inbox

  • Before: 50+ Prometheus alerts per day
  • After: ZERO (all notifications moved to Discord)

Your Discord Server

  • Only critical issues that need immediate attention
  • Estimated: 0-2 alerts per day (unless something is actually broken)

Prometheus UI

  • WARNING and CRITICAL alerts both remain visible at http://10.0.10.25:9090/alerts

Monitoring Your Alerts

Check Active Alerts

# View current alerts
curl -s http://10.0.10.25:9090/api/v1/alerts | python3 -m json.tool

# View Alertmanager status (v2 API; v1 was removed in Alertmanager 0.27)
curl -s http://10.0.10.25:9093/api/v2/status | python3 -m json.tool

Web Interfaces

  • Prometheus alerts: http://10.0.10.25:9090/alerts
  • Alertmanager: http://10.0.10.25:9093

Rollback (If Needed)

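If something goes wrong, restore the backups created in the manual steps. If deploy-reduced-alerts.sh names its backups differently, substitute those filenames. Use exact names rather than a wildcard, because cp fails when a glob matches more than one file:

ssh root@10.0.10.25 'cp /etc/prometheus/backups/alertmanager.yml.backup /etc/prometheus/alertmanager.yml'
ssh root@10.0.10.25 'cp /etc/prometheus/backups/homelab-alerts.yml.backup /etc/prometheus/rules/homelab-alerts.yml'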
ssh root@10.0.10.25 'systemctl reload prometheus prometheus-alertmanager'

Files Created

  • prometheus-alert-rules-updated.yml - Updated alert rules (CPU 80%+, warnings logged)
  • alertmanager-config-updated.yml - Only critical → Discord, warnings → null
  • deploy-reduced-alerts.sh - Automated deployment script
  • ALERT-REDUCTION-SUMMARY.md - This file

Questions?

  • "I want to see warnings occasionally" → Check http://10.0.10.25:9090/alerts daily
  • "I need email back" → We can add it back for critical-only
  • "80% CPU is too low" → We can adjust the threshold up/down
  • "I want alerts in Slack instead" → Easy to add another webhook

Let me know if you want to tweak anything! 🎯