Initial infrastructure documentation - comprehensive homelab reference
infrastructure/INFRASTRUCTURE-TODO.md
@@ -0,0 +1,360 @@
# Infrastructure TODO List

**Created:** 2025-12-29
**Last Updated:** 2025-12-29
**Status:** Active development tasks

This document tracks all incomplete infrastructure tasks and future improvements.

---

## ✅ Completed Items

### 1. Fix Home Assistant Public Domain Access

**Status**: ✅ COMPLETED (2025-12-29)

**What was done**:
1. Updated Caddy to use an HTTPS backend for Home Assistant
2. Added VPS WireGuard IP (10.0.8.1) to Home Assistant's `trusted_proxies`
3. Verified bob.nianticbooks.com is accessible

**Result**: All 5 public domains now working:
- ✅ freddesk.nianticbooks.com → Proxmox
- ✅ bob.nianticbooks.com → Home Assistant
- ✅ ad5m.nianticbooks.com → 3D Printer
- ✅ auth.nianticbooks.com → Authentik SSO
- ✅ bible.nianticbooks.com → Bible reading plan

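
The verification in step 3 can be repeated for all five domains at once. The following is a minimal sketch, assuming `curl` is installed and that any 2xx or 3xx response counts as healthy. That pass rule is an assumption rather than something this setup defines (a service that answers 401 before login would need its own rule), and the function names are illustrative.

```bash
#!/usr/bin/env bash
# Sketch: confirm each public domain answers over HTTPS.
# The subdomain list mirrors the "Result" bullets above.

DOMAINS="freddesk bob ad5m auth bible"

# Assumption: any 2xx or 3xx status means "up" (login redirects included).
is_ok_status() {
  case "$1" in
    2??|3??) return 0 ;;
    *) return 1 ;;
  esac
}

# Print one line per domain: host, HTTP status, verdict.
check_domains() {
  for name in $DOMAINS; do
    host="${name}.nianticbooks.com"
    # curl reports 000 when the connection itself fails.
    code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "https://${host}/")
    if is_ok_status "$code"; then
      echo "${host} ${code} OK"
    else
      echo "${host} ${code} FAIL"
    fi
  done
}

# Only hit the network when explicitly asked, so the file can be sourced.
if [ "${1:-}" = "run" ]; then
  check_domains
fi
```

Saved under a name such as `check-domains.sh` (hypothetical), `bash check-domains.sh run` prints one status line per domain.
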

### 2. Deploy RustDesk ID Server

**Status**: ✅ COMPLETED (2025-12-25)

**What was deployed**:
1. ID Server (hbbs) on main-pve LXC 123 at 10.0.10.23
2. Relay Server (hbbr) on VPS at 66.63.182.168:21117
3. Generated encryption key pair
4. Verified client connectivity

**Result**: RustDesk fully operational
- ✅ ID Server (hbbs): 10.0.10.23 ports 21115, 21116, 21118
- ✅ Relay Server (hbbr): VPS port 21117
- ✅ Public Key: `sfYuCTMHxrA22kukomb/RAKYyUgr8iaMfm/U4CFLfL0=`
- ✅ Client Configuration: ID Server `66.63.182.168`, Key included
- ✅ Version: 1.1.14 (both servers)

**Documentation**:
- SERVICES.md - Service inventory and health checks
- guides/RUSTDESK-DEPLOYMENT-COMPLETE.md - Complete deployment guide

---

## Medium Priority

### 3. Deploy Prometheus + Grafana Monitoring

**Status**: ✅ DISCOVERED - Already deployed (2025-12-29)

**Current State**:
- **Location**: 10.0.10.25 (responding to ping)
- **Grafana**: Port 3000 ✅ Running (redirects to /login)
- **Prometheus**: Port 9090 ✅ Running
- **Deployment Method**: TBD (need to investigate)

**Remaining Configuration Tasks**:
1. Document deployment method (Docker Compose, systemd, VM/Container type)
2. Configure PostgreSQL database on 10.0.10.20 for Grafana (if not already done)
3. Set up Authentik SSO for Grafana (section 6.2 lists Grafana OAuth2 as already configured; confirm and close this out)
4. Configure Prometheus monitoring targets:
   - Proxmox nodes (via node_exporter)
   - VPS (WireGuard tunnel metrics)
   - PostgreSQL
   - Home Assistant
   - Other services
5. Import Grafana dashboards:
   - Proxmox overview
   - PostgreSQL metrics
   - Network metrics
6. Set up alerting (email/Slack)
7. Optionally add Caddy public route

**Priority**: Low-Medium (services running, configuration needed)

**Note**: This was discovered during the infrastructure audit. The basic services are operational, but monitoring targets and dashboards need configuration.

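
The "running" state above can be re-checked at any time through the services' built-in health endpoints: Grafana serves `/api/health` and Prometheus serves `/-/healthy`. The sketch below assumes plain HTTP on the ports listed under Current State and that `curl` is available; the function names are illustrative, not part of either tool.

```bash
#!/usr/bin/env bash
# Sketch: probe the Grafana and Prometheus health endpoints on 10.0.10.25.
# Assumption: both listen on plain HTTP (the TODO does not say either way).

MON_HOST="10.0.10.25"

# Map a service name to its health endpoint URL.
health_url() {
  case "$1" in
    grafana)    echo "http://${MON_HOST}:3000/api/health" ;;
    prometheus) echo "http://${MON_HOST}:9090/-/healthy" ;;
    *) echo "unknown service: $1" >&2; return 1 ;;
  esac
}

# Report whether the endpoint answered with a non-error HTTP status.
check_service() {
  url=$(health_url "$1") || return 1
  # -f makes curl exit non-zero on HTTP 4xx/5xx responses.
  if curl -fsS --max-time 5 -o /dev/null "$url"; then
    echo "$1 healthy"
  else
    echo "$1 unreachable or unhealthy"
    return 1
  fi
}

# Only hit the network when explicitly asked, so the file can be sourced.
if [ "${1:-}" = "run" ]; then
  check_service grafana
  check_service prometheus
fi
```

This only confirms that the processes respond; it says nothing about whether scrape targets or dashboards are configured.
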

---

## Low Priority (Cleanup)

### 4. Remove Deprecated VMs

**Objective**: Reclaim resources from unused services

**Status**: ⏸️ Deferred - Non-critical

#### 4.1 Remove Spoolman VM

**Current State**:
- IP: 10.0.10.71 (allocated but not in use)
- Reason: Bambu printer incompatible, service no longer needed

**Steps**:
1. Check current state and verify no dependencies: `pct status <VMID>` or `qm status <VMID>`
2. Backup if needed: `vzdump <VMID> --storage backup`
3. Stop VM/container: `pct stop <VMID>` or `qm stop <VMID>`
4. Delete: `pct destroy <VMID>` or `qm destroy <VMID>`
5. Remove Pangolin route (if one exists)
6. Update IP-ALLOCATION.md to mark 10.0.10.71 as available
7. Update documentation

**Priority**: Low

**Estimated Time**: 15 minutes

#### 4.2 Remove Authelia VM

**Current State**:
- IP: 10.0.10.112 (allocated but not in use)
- Reason: Replaced by Authentik SSO

**Steps**:
1. Verify Authentik is working for all services
2. Backup Authelia config for reference (if needed)
3. Stop VM/container: `pct stop <VMID>` or `qm stop <VMID>`
4. Delete: `pct destroy <VMID>` or `qm destroy <VMID>`
5. Update IP-ALLOCATION.md to mark 10.0.10.112 as available (or remove from list)
6. Update documentation

**Priority**: Low

**Estimated Time**: 15 minutes

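
The steps in 4.1 and 4.2 leave the choice between `pct` and `qm` open because the guest type is not recorded here. One way to settle it is to look at where Proxmox stores the guest's config: containers live under `/etc/pve/lxc/` and VMs under `/etc/pve/qemu-server/`. The sketch below picks the tool that way and prints the commands without running them. It assumes it is run on the Proxmox node that hosts the guest; the function names are hypothetical, and the storage name `backup` is taken from step 2 of 4.1.

```bash
#!/usr/bin/env bash
# Sketch: choose pct or qm for a guest ID and print a dry-run removal plan.

# Node-local config directories (overridable, which also helps testing).
LXC_DIR="${LXC_DIR:-/etc/pve/lxc}"
QEMU_DIR="${QEMU_DIR:-/etc/pve/qemu-server}"

# Print "pct" for a container, "qm" for a VM; fail if the ID is unknown here.
guest_tool() {
  if [ -f "${LXC_DIR}/$1.conf" ]; then
    echo pct
  elif [ -f "${QEMU_DIR}/$1.conf" ]; then
    echo qm
  else
    echo "no guest with ID $1 on this node" >&2
    return 1
  fi
}

# Print the removal commands in order. Nothing is executed.
removal_plan() {
  tool=$(guest_tool "$1") || return 1
  echo "${tool} status $1"
  echo "vzdump $1 --storage backup"
  echo "${tool} stop $1"
  echo "${tool} destroy $1"
}
```

After sourcing the file, `removal_plan <VMID>` lists the four commands for review. Keeping it as a dry run is deliberate, since `destroy` cannot be undone.
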

---

## Future Enhancements

### 5. n8n + Claude Code Advanced Features

**Objective**: Enhance n8n and Claude Code integration

**Status**: ✅ Basic integration working, advanced features optional

**Remaining Optional Tasks** (from MIGRATION-CHECKLIST.md 6.4):
- [ ] Session management workflow (UUID generation, multi-turn conversations)
- [ ] Slack integration (Slack → n8n → Claude Code → Slack)
- [ ] Tool deployment with `--dangerously-skip-permissions` flag
- [ ] Error handling (network disconnect, invalid commands)
- [ ] Resource monitoring during heavy Claude operations
- [ ] Production hardening:
  - SSH timeout configuration
  - Output length limits
  - Logging for Claude executions
  - Error notifications
  - Optional Caddy route for public n8n access (with Authentik SSO)

**Reference**:
- MIGRATION-CHECKLIST.md section 6.4
- N8N-CLAUDE-STATUS.md

**Priority**: Low (nice-to-have, basic functionality working)

**Estimated Time**: 2-4 hours for each feature

---

### 6. Home Assistant Enhancements

#### 6.1 Configure Local HTTPS Certificates

**Objective**: Use local CA certificates for internal HTTPS access

**Status**: ⏸️ Deferred (CA setup complete, deployment pending)

**Details**:
- CA already set up (HTTPS-SETUP-STATUS.md from 2025-12-06)
- Certificates generated for services
- Need to deploy certificates to Home Assistant and other services

**Steps** (from HTTPS-SETUP-STATUS.md):
1. Copy certificates to Home Assistant:
   ```bash
   scp ~/certs/bob.crt ~/certs/bob.key root@10.0.10.24:/config/ssl/
   ```
2. Update Home Assistant configuration:
   ```yaml
   http:
     ssl_certificate: /config/ssl/bob.crt
     ssl_key: /config/ssl/bob.key
     server_port: 8123
   ```
3. Restart Home Assistant
4. Trust CA on client devices

**Note**: Current setup uses local CA certificate. Public domain uses Caddy with Let's Encrypt.

**Priority**: Low (HTTPS already working with local CA cert)

**Estimated Time**: 30 minutes

#### 6.2 Integrate More Services with Authentik SSO

**Objective**: Single sign-on for additional services

**Status**: 📋 Planned

**Completed**:
- ✅ Proxmox (all 3 hosts)
- ✅ Grafana (OAuth2 configured)

**Not Possible**:
- ❌ n8n (requires Enterprise license for OIDC/SSO)

**Pending**:
- [ ] Home Assistant (complex - requires proxy provider or LDAP)
- [ ] Other services as they're deployed

**Priority**: Low (manual login acceptable for now)

**Estimated Time**: 1-2 hours per service

---

### 7. Backup Strategy Completion

**Objective**: Implement full 3-tier backup system

**Status**: ✅ Tier 1 complete, Tier 2-3 planned

**Current State** (from CLAUDE.md):
- ✅ Tier 1 (Local/OMV NFS): Fully operational
  - PostgreSQL backups: Daily 2:00 AM
  - Proxmox VM/container backups: Daily 2:30 AM
  - Retention: 7 days daily, 4 weeks weekly, 3 months monthly

**Remaining Tiers**:
- [ ] Tier 2: Off-site external drives (manual rotation)
- [ ] Tier 3: Backblaze B2 cloud storage (automated)

**Reference**:
- guides/HOMELAB-BACKUP-STRATEGY.md
- guides/BACKUP-QUICK-START.md

**Priority**: Medium (Tier 1 provides good protection, Tier 2-3 for disaster recovery)

**Estimated Time**: 2-4 hours for Tier 3 cloud setup

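
If the Tier 1 guest backups run through `vzdump`, the retention policy listed under Current State (7 daily, 4 weekly, 3 monthly) corresponds to its `--prune-backups` option. This document does not say how retention is configured today, so the following is only a sketch of that mapping: the storage name `backup` is borrowed from section 4.1, and the function names are made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch: express the Tier 1 retention policy as a vzdump prune string.

# Build the retention string in the key=value,... form --prune-backups takes.
prune_spec() {
  echo "keep-daily=$1,keep-weekly=$2,keep-monthly=$3"
}

# Print (not run) the backup command for one guest ID.
# Assumption: the target storage is named "backup".
backup_cmd() {
  echo "vzdump $1 --storage backup --prune-backups $(prune_spec 7 4 3)"
}
```

Calling `backup_cmd <VMID>` prints the full command for review. The same retention can also be set once on the storage or the backup job instead of per command, which may already be the case here.
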

---

### 8. Monitoring & Alerting

**Objective**: Proactive monitoring of infrastructure health

**Status**: 📋 Planned (Prometheus + Grafana are already running, see item 3; targets and alert rules still need configuration)

**Components**:
- [ ] Service uptime monitoring
- [ ] Resource utilization (CPU, RAM, disk)
- [ ] Network connectivity (WireGuard tunnel status)
- [ ] Backup success/failure alerts
- [ ] Certificate expiration warnings
- [ ] Disk space alerts (OMV storage)

**Alerting Methods**:
- Email
- Slack/Discord webhook
- Home Assistant notifications

**Priority**: Medium (depends on the Prometheus target configuration in item 3)

**Estimated Time**: 2-3 hours (after Prometheus targets are configured)

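
Until alert rules exist in Prometheus, the disk space item can be approximated with a small shell check run from cron. This is a minimal sketch assuming POSIX `df` and `awk`; the threshold and the mount path are placeholders, since the OMV mount point is not given here, and the function names are illustrative.

```bash
#!/usr/bin/env bash
# Sketch: warn when a filesystem crosses a usage threshold.

# Print the used percentage (integer, no % sign) of the filesystem
# that holds the given path.
disk_usage_pct() {
  # -P forces the POSIX one-line-per-filesystem format, so field 5 is
  # the capacity column (breaks if the device name contains spaces).
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage is at or above the threshold.
needs_alert() {
  [ "$1" -ge "$2" ]
}

# Print an ALERT or ok line for one path and threshold.
check_path() {
  pct=$(disk_usage_pct "$1") || return 1
  if needs_alert "$pct" "$2"; then
    echo "ALERT: $1 at ${pct}% (threshold $2%)"
  else
    echo "ok: $1 at ${pct}%"
  fi
}
```

For example, `check_path /path/to/omv-mount 85` (path hypothetical) prints an `ALERT` line once the share is 85% full; the output could then be piped to whichever alerting method is chosen above.
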

---

### 9. Cleanup and Archive Old Documentation

**Objective**: Remove or archive outdated status documents

**Status**: 📋 Pending

**Files to Archive or Update**:
1. **wireguard-setup-progress.md**
   - Status: Outdated (from November 2025)
   - Contains old troubleshooting info that's no longer relevant
   - WireGuard now operational (verified 2025-12-29)
   - Action: Archive to `docs/archive/` or delete

2. **HTTPS-SETUP-STATUS.md**
   - Status: Partially outdated (from December 6, 2025)
   - CA setup complete, but local cert deployment not done
   - Services using Caddy with Let's Encrypt for public access
   - Action: Archive or update with current HTTPS status

3. **N8N-CLAUDE-STATUS.md**
   - Status: Partially outdated
   - Basic integration complete
   - Many "TODO" items that are now optional
   - Action: Archive or consolidate into SERVICES.md

**Priority**: Low

**Estimated Time**: 30 minutes

---

## Documentation Maintenance

### 10. Keep Documentation Updated

**Objective**: Maintain accurate infrastructure documentation

**Regular Tasks**:
- [ ] Update SERVICES.md when services change
- [ ] Update IP-ALLOCATION.md for new devices
- [ ] Update MIGRATION-CHECKLIST.md for completed phases
- [ ] Update INFRASTRUCTURE-TODO.md (this file) as tasks are completed
- [ ] Update CLAUDE.md when architecture changes

**Frequency**: As changes occur

**Priority**: Ongoing

---

## Quick Reference: IP Addresses Still Available

### Reserved but Unused (Available for new services):
- 10.0.10.6-9 (infrastructure expansion)
- 10.0.10.11-12, 10.0.10.14-19 (management)
- 10.0.10.26 (production services)
- 10.0.10.28 (was ESPHome - now runs as HA add-on, IP available)
- 10.0.10.31-39 (IoT devices)
- 10.0.10.41-49 (utility services)

### Now In Use (no longer available):
- 10.0.10.23 (RustDesk ID server - deployed 2025-12-25, see item 2)
- 10.0.10.25 (Prometheus/Grafana - found running 2025-12-29, see item 3)

### To Be Reclaimed (after cleanup):
- 10.0.10.71 (Spoolman - to be removed)
- 10.0.10.112 (Authelia - to be removed)

---

## Notes

- All critical infrastructure is operational (verified 2025-12-29)
- WireGuard tunnel stable and functional
- All 5 public domains working (Home Assistant HTTPS backend fixed 2025-12-29, see item 1)
- PostgreSQL shared database serving multiple services
- Authentik SSO integrated with Proxmox cluster
- Automated backups operational (Tier 1 local/NFS)

**Next High-Value Tasks**:
1. ✅ ~~Fix Home Assistant public domain~~ - COMPLETED
2. ✅ ~~Discover/Document Prometheus + Grafana~~ - COMPLETED
3. ✅ ~~Discover/Document RustDesk~~ - COMPLETED
4. Configure Prometheus monitoring targets and Grafana dashboards
5. Cleanup deprecated VMs (Spoolman, Authelia)

---

**Last Updated**: 2025-12-29
**Updated By**: Fred (with Claude Code)