Funky (OpenClaw) 0682c79580 Initial infrastructure documentation - comprehensive homelab reference

2026-02-23 03:42:22 +00:00

5.3 KiB

Alert Reduction Summary

What Changed

Your Prometheus alerting has been updated to drastically reduce notification noise while still catching critical issues.

Before (Your Current Inbox)

🔥 All warnings → Email inbox spam
Multiple alerts per minute
Drowning out important messages

After (New Configuration)

✅ Only CRITICAL alerts → Discord notification
WARNING alerts → Logged in Prometheus UI (no notification)
Clean inbox, important alerts still get through
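The routing described above could look roughly like the following in alertmanager.yml. This is a sketch under stated assumptions, not the contents of alertmanager-config-updated.yml: the receiver names, the placeholder webhook URL, and the use of the native `discord_configs` receiver (available from Alertmanager 0.25) are all illustrative. The function only prints the YAML so it can be reviewed or redirected to a scratch file.

```shell
# Print an example Alertmanager routing config (hypothetical, for review only).
print_example_alertmanager_config() {
  cat <<'EOF'
route:
  receiver: "null"            # default route: warnings end up here, no notification
  routes:
    - matchers:
        - severity="critical" # only critical alerts are routed to Discord
      receiver: discord

receivers:
  - name: "null"              # receiver with no integrations drops notifications
  - name: discord
    discord_configs:
      # Placeholder URL (assumption); substitute the real webhook
      - webhook_url: "https://discord.com/api/webhooks/EXAMPLE_ID/EXAMPLE_TOKEN"
EOF
}
```

For example, `print_example_alertmanager_config > /tmp/am-example.yml` followed by `amtool check-config /tmp/am-example.yml` would check the syntax without touching the live config.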

Updated Alert Thresholds

CPU Monitoring

Severity	Threshold	Duration	Action
WARNING	80%+	5 minutes	Logged only
CRITICAL	95%+	5 minutes	Discord notification

Memory Monitoring

Severity	Threshold	Duration	Action
WARNING	85%+	10 minutes	Logged only
CRITICAL	95%+	5 minutes	Discord notification

Disk Space

Severity	Threshold	Duration	Action
WARNING	<15% free	5 minutes	Logged only
CRITICAL	<5% free	2 minutes	Discord notification
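As an illustration of how a threshold row maps to a rule, the CPU row might be written as the pair of Prometheus rules below. This is a sketch, not the contents of prometheus-alert-rules-updated.yml: it assumes node_exporter metrics (`node_cpu_seconds_total`), and the group name and the `HighCPUUsage` warning name are hypothetical. Only `CriticalCPUUsage` appears in this document.

```shell
# Print example CPU alert rules matching the table above (hypothetical rule file).
print_example_cpu_rules() {
  # Quoted heredoc delimiter so that $labels is not expanded by the shell
  cat <<'EOF'
groups:
  - name: example-cpu          # hypothetical group name
    rules:
      - alert: HighCPUUsage    # hypothetical name for the warning-level rule
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning    # logged only
      - alert: CriticalCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 5m
        labels:
          severity: critical   # routed to Discord
        annotations:
          summary: "CPU above 95% on {{ $labels.instance }}"
EOF
}
```

The `severity` label is what Alertmanager routes on, so the warning and critical variants differ only in threshold and label.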

CRITICAL Alerts (Discord Notifications)

You will ONLY receive Discord notifications for:

Infrastructure

❌ HostDown - Any host completely unreachable for 2+ minutes
❌ ProxmoxNodeDown - Proxmox host down for 2+ minutes
❌ VPSDown - VPS (66.63.182.168) unreachable for 2+ minutes

Performance

🔥 CriticalCPUUsage - CPU >95% sustained for 5+ minutes
🧠 CriticalMemoryUsage - Memory >95% sustained for 5+ minutes

Storage

💾 DiskSpaceCritical - Disk <5% free space for 2+ minutes

Services

🗄️ PostgreSQLDown - Database down for 2+ minutes
⚙️ PrometheusConfigReloadFailed - Monitoring system config broken

WARNING Alerts (Logged Only)

These alerts are visible in Prometheus UI but do NOT trigger notifications:

CPU 80-95% (warning threshold)
Memory 85-95%
Disk 5-15% free
Network interface down
High disk I/O wait
Home Assistant down
n8n automation down
Prometheus scrape failures
and more...

You can check them anytime at: http://10.0.10.25:9090/alerts

Deployment

Quick Deploy

cd /root/.openclaw/workspace/fred-infrastructure
./deploy-reduced-alerts.sh

Manual Steps (if needed)

# 1. Backup existing configs
ssh root@10.0.10.25 'mkdir -p /etc/prometheus/backups'
ssh root@10.0.10.25 'cp /etc/prometheus/alertmanager.yml /etc/prometheus/backups/alertmanager.yml.backup'
ssh root@10.0.10.25 'cp /etc/prometheus/rules/homelab-alerts.yml /etc/prometheus/backups/homelab-alerts.yml.backup'

# 2. Upload new configs
scp alertmanager-config-updated.yml root@10.0.10.25:/etc/prometheus/alertmanager.yml
scp prometheus-alert-rules-updated.yml root@10.0.10.25:/etc/prometheus/rules/homelab-alerts.yml

# 3. Reload services
ssh root@10.0.10.25 'systemctl reload prometheus'
ssh root@10.0.10.25 'systemctl reload prometheus-alertmanager'

# 4. Verify (Alertmanager v2 API; the v1 API was removed in Alertmanager 0.27)
curl -s http://10.0.10.25:9090/api/v1/rules
curl -s http://10.0.10.25:9093/api/v2/status
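Before uploading, it may be worth validating both files locally, since a broken rules file is what triggers PrometheusConfigReloadFailed. The helper below is a minimal sketch that wraps `promtool check rules` and `amtool check-config`. The function name is hypothetical, and it assumes both tools are installed on the machine you deploy from; a missing tool is skipped with a message rather than treated as a failure.

```shell
# Validate a rules file and an Alertmanager config before deploying them.
# Returns non-zero if any check that could run reported a problem.
validate_configs() {
  rules_file=$1
  alertmanager_file=$2
  status=0

  if command -v promtool >/dev/null 2>&1; then
    promtool check rules "$rules_file" || status=1
  else
    echo "promtool not found; skipping rules check" >&2
  fi

  if command -v amtool >/dev/null 2>&1; then
    amtool check-config "$alertmanager_file" || status=1
  else
    echo "amtool not found; skipping Alertmanager check" >&2
  fi

  return "$status"
}
```

Usage would be along the lines of `validate_configs prometheus-alert-rules-updated.yml alertmanager-config-updated.yml && echo "safe to upload"`.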

Test Critical Alert

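Send a test alert to verify that the Discord webhook works:

# v2 endpoint; Alertmanager 0.27+ no longer serves /api/v1
curl -X POST -H 'Content-Type: application/json' http://10.0.10.25:9093/api/v2/alerts -d '[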
  {
    "labels": {
      "alertname": "TestCriticalAlert",
      "severity": "critical",
      "instance": "test:9100"
    },
    "annotations": {
      "summary": "Test alert - please ignore"
    }
  }
]'

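You should see this appear in Discord within about 30 seconds. Alertmanager's default group_wait is 30s, so a different value in alertmanager.yml changes this delay.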
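To confirm the other half of the routing (warnings stay silent), the same payload can be sent with a different severity. The helper below is a sketch that only builds the JSON; the function name and the `TestAlert` alert name are hypothetical.

```shell
# Print a one-element alert array for the given severity (default: critical).
build_test_alert() {
  severity=${1:-critical}
  printf '[{"labels":{"alertname":"TestAlert","severity":"%s","instance":"test:9100"},"annotations":{"summary":"Test alert - please ignore"}}]\n' "$severity"
}
```

For example, `build_test_alert warning | curl -s -X POST -H 'Content-Type: application/json' -d @- http://10.0.10.25:9093/api/v2/alerts` should produce an alert that is visible in the Alertmanager UI but, assuming the routing works as described, does not reach Discord.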

Expected Results

Your Email Inbox

Before: 50+ Prometheus alerts per day
After: ZERO (all notifications moved to Discord)

Your Discord Server

Only critical issues that need immediate attention
Estimated: 0-2 alerts per day (unless something is actually broken)

Prometheus UI

All alerts still visible at http://10.0.10.25:9090/alerts
Use this to check warnings when convenient

Monitoring Your Alerts

Check Active Alerts

# View current alerts (-s hides curl's progress meter)
curl -s http://10.0.10.25:9090/api/v1/alerts | python3 -m json.tool

# View Alertmanager status (v2 API; the v1 API was removed in Alertmanager 0.27)
curl -s http://10.0.10.25:9093/api/v2/status | python3 -m json.tool
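Since warnings no longer notify, a compact one-line-per-alert view can be easier to scan than the full JSON. The helper below is a sketch that reads the Prometheus `/api/v1/alerts` response on stdin and prints state, severity, and alert name. The function name is hypothetical, and it assumes `python3` is available, as in the commands above.

```shell
# Summarize a Prometheus /api/v1/alerts response read from stdin.
summarize_alerts() {
  python3 -c '
import json, sys

# The response wraps the alert list in {"data": {"alerts": [...]}}
alerts = json.load(sys.stdin)["data"]["alerts"]
for a in alerts:
    labels = a.get("labels", {})
    print(a.get("state", "?"), labels.get("severity", "none"), labels.get("alertname", "?"))
'
}
```

Usage: `curl -s http://10.0.10.25:9090/api/v1/alerts | summarize_alerts`. Piping the result through `grep warning` shows only the alerts that are logged without notification.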

Web Interfaces

Prometheus: http://10.0.10.25:9090/alerts
Alertmanager: http://10.0.10.25:9093/#/alerts

Rollback (If Needed)

If something goes wrong, restore the backup:

# Each glob must match exactly one file; if several backups exist, name the one to restore explicitly
ssh root@10.0.10.25 'cp /etc/prometheus/backups/alertmanager.yml.* /etc/prometheus/alertmanager.yml'
ssh root@10.0.10.25 'cp /etc/prometheus/backups/homelab-alerts.yml.* /etc/prometheus/rules/homelab-alerts.yml'
ssh root@10.0.10.25 'systemctl reload prometheus prometheus-alertmanager'
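If the backups directory ever holds more than one copy of a file (for instance timestamped backups, which is an assumption and not something this document confirms), the newest one can be selected explicitly. A minimal sketch, with a hypothetical function name, to run on the Prometheus host:

```shell
# Print the most recently modified backup matching <dir>/<prefix>.*
# Prints nothing if no file matches.
latest_backup() {
  dir=$1
  prefix=$2
  # ls -t sorts by modification time, newest first; head keeps the first entry
  ls -t "$dir"/"$prefix".* 2>/dev/null | head -n 1
}
```

For example, `cp "$(latest_backup /etc/prometheus/backups alertmanager.yml)" /etc/prometheus/alertmanager.yml` restores the newest Alertmanager backup. This relies on backup filenames without spaces or newlines.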

Files Created

prometheus-alert-rules-updated.yml - Updated alert rules (CPU 80%+, warnings logged)
alertmanager-config-updated.yml - Only critical → Discord, warnings → null
deploy-reduced-alerts.sh - Automated deployment script
ALERT-REDUCTION-SUMMARY.md - This file

Questions?

"I want to see warnings occasionally" → Check http://10.0.10.25:9090/alerts daily
"I need email back" → We can add it back for critical-only
"80% CPU is too low" → We can adjust the threshold up/down
"I want alerts in Slack instead" → Easy to add another webhook

Let me know if you want to tweak anything! 🎯
