alerting (skill by hsudatalks)

Real-time alerting and notification system for Univers infrastructure. Use this when you need to monitor system health and service status, and to send proactive alerts when thresholds are exceeded or services fail.

$ npx ai-builder add skill hsudatalks/alerting

Installs to `.claude/skills/alerting/`

# Alerting Skill

This skill provides comprehensive monitoring and alerting capabilities for the Univers infrastructure ecosystem.

## Capabilities

### 1. Real-time Monitoring
- System resource monitoring (CPU, Memory, Disk, Network)
- Service health checks (HTTP endpoints, ports, processes)
- Application-specific metrics (response times, error rates)
- Custom metric collection and aggregation
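
This document does not pin down a format for custom metrics. As an illustrative sketch only, a custom metric definition could look like the following; the `metrics:` schema and its `command`, `interval`, and `aggregate` keys are assumptions, not documented fields of this skill.

```yaml
# Hypothetical custom metric definition. The schema shown here
# (metrics / command / interval / aggregate) is an assumption,
# not a documented format of this skill.
metrics:
  - name: queue_depth
    # Shell command whose numeric stdout becomes the metric value
    command: "redis-cli llen jobs:pending"
    interval: 30s
    aggregate: avg   # how samples are rolled up before rules see them
```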

### 2. Alert Engine
- Threshold-based alerting
- Rate limiting and alert suppression
- Alert escalation policies
- Multi-condition alert rules
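
None of the later rule examples demonstrate a multi-condition rule or an escalation policy, so here is a hedged sketch in the style of this skill's rule format; the `conditions:`/`all:` and `escalate:` keys are illustrative assumptions rather than confirmed fields.

```yaml
# Illustrative only: a multi-condition rule with graduated escalation.
# The conditions/all and escalate keys are assumed, not documented.
name: app-degraded
conditions:
  all:                   # fire only when every condition holds
    - cpu_usage > 80
    - error_rate > 5
duration: 5m
severity: warning
escalate:
  after: 15m             # if unacknowledged for 15 minutes...
  severity: critical     # ...promote the alert and notify again
  actions:
    - type: webhook
      url: https://api.pagerduty.com/incidents
```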

### 3. Notification Channels
- Email notifications with rich formatting
- Slack/Teams integration with actionable messages
- Webhook support for custom integrations
- In-app notifications and banners

### 4. Alert Management
- Alert acknowledgment and resolution
- Alert history and analytics
- Scheduled maintenance windows
- Alert rule testing and validation
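
Scheduled maintenance windows are not shown anywhere else in this document. A plausible configuration sketch follows; the entire `maintenance:` schema is an assumption.

```yaml
# Assumed schema: windows during which matching alerts are suppressed.
maintenance:
  - name: weekly-db-patching
    match: "db-*"               # suppress rules whose names match
    schedule: "Sun 02:00-04:00"
    notify_on_start: true       # optional heads-up when the window opens
```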

### 5. Dashboards and Reports
- Real-time alert status dashboard
- Historical alert trends and analytics
- Service health overview
- Performance metrics visualization

## Common Tasks

### Basic Alert Setup
```bash
# Check system for alert conditions
alert check system

# Monitor specific services
alert monitor services

# Test notification channels
alert test channels
```

### Alert Rule Management
```bash
# List all alert rules
alert rules list

# Add new alert rule
alert rules add cpu-high --threshold 80 --duration 5m

# Update existing rule
alert rules update memory-usage --threshold 90

# Remove alert rule
alert rules remove disk-space-low
```

### Notification Configuration
```bash
# Configure email notifications
alert config email --smtp smtp.example.com --from alerts@example.com

# Configure Slack integration
alert config slack --webhook https://hooks.slack.com/... --channel "#alerts"

# Test notification delivery
alert test email --to admin@example.com
alert test slack --message "Test alert"
```

### Alert Operations
```bash
# View active alerts
alert status

# Acknowledge an alert
alert acknowledge CPU_HIGH_001

# Resolve an alert
alert resolve MEMORY_HIGH_003

# View alert history
alert history --last 24h
```

## Alert Rule Examples

### System Resource Alerts
```yaml
# High CPU Usage
name: cpu-high
condition: cpu_usage > 80
duration: 5m
severity: warning
message: "CPU usage is {{cpu_usage}}% on {{hostname}}"
actions:
  - type: email
    to: ops@example.com
  - type: slack
    channel: "#alerts"

---
# Critical Memory Usage
name: memory-critical
condition: memory_usage > 90
duration: 2m
severity: critical
message: "Critical memory usage: {{memory_usage}}%"
actions:
  - type: webhook
    url: https://api.pagerduty.com/incidents
```

### Service Health Alerts
```yaml
# Service Down
name: service-down
condition: service_health == 0
duration: 1m
severity: critical
message: "{{service_name}} is down on {{hostname}}"
actions:
  - type: email
    to: devops@example.com
  - type: restart
    service: "{{service_name}}"

---
# High Response Time
name: slow-response
condition: response_time > 2000
duration: 3m
severity: warning
message: "{{service_name}} response time: {{response_time}}ms"
actions:
  - type: slack
    channel: "#performance"
```

### Application-Specific Alerts
```yaml
# High Error Rate
name: high-error-rate
condition: error_rate > 5
duration: 5m
severity: warning
message: "{{application}} error rate: {{error_rate}}%"
actions:
  - type: email
    to: dev-team@example.com

---
# Database Connection Issues
name: db-connection-failed
condition: db_connection_status != "healthy"
duration: 30s
severity: critical
message: "Database connection failed for {{application}}"
actions:
  - type: webhook
    url: https://hooks.slack.com/...
```

## Integration Examples

### Univers Services Integration
```bash
# Monitor Univers services
alert monitor univers-services

# Check specific Univers endpoints
alert check endpoint http://localhost:3003/health --service univers-server
alert check endpoint http://localhost:6007 --service univers-ui
alert check endpoint http://localhost:5173 --service univers-web

# Monitor tmux sessions
alert monitor tmux-sessions --alert-if-missing univers-developer
```

### Container Integration
```bash
# Monitor Docker containers
alert monitor containers --include "univers-*"

# Check container health
alert check container univers-server
alert check container univers-ui
```

## Configuration Files

### Alert Rules Configuration
```yaml
# ~/.config/univers/alerting/rules.yaml
rules:
  - name: system-cpu-high
    type: system
    metric: cpu_usage
    operator: ">"
    threshold: 80
    duration: 5m
    severity: warning

  - name: service-unavailable
    type: service
    check: http_status
    target: "http://localhost:3003/health"
    operator: "!="
    threshold: 200
    duration: 1m
    severity: critical
```

### Notification Channels
```yaml
# ~/.config/univers/alerting/channels.yaml
channels:
  email:
    smtp_host: smtp.gmail.com
    smtp_port: 587
    username: alerts@company.com
    password: ${SMTP_PASSWORD}

  slack:
    webhook_url: ${SLACK_WEBHOOK_URL}
    default_channel: "#univers-alerts"

  webhook:
    endpoint: https://api.example.com/alerts
    headers:
      Authorization: "Bearer ${API_TOKEN}"
```
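
The payload delivered to a webhook endpoint is not specified here. When planning a custom integration, the fields below are a reasonable guess at the shape; every key is an assumption about this skill, not a documented contract.

```yaml
# Assumed outbound webhook payload (shown as YAML; presumably sent as JSON).
alert:
  name: cpu-high
  severity: warning
  message: "CPU usage is 85% on web-01"
  triggered_at: "2025-12-16T10:42:00Z"
  status: firing               # or acknowledged / resolved
```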

## Best Practices

1. **Set Meaningful Thresholds**: Avoid alert fatigue by setting realistic thresholds
2. **Use Escalation Policies**: Implement graduated alert escalation
3. **Provide Context**: Include relevant details in alert messages
4. **Test Regularly**: Verify alert rules and notification channels
5. **Document Procedures**: Maintain clear runbooks for common alerts

## Troubleshooting

### Common Issues
- **Missing Notifications**: Check channel configurations and connectivity
- **False Positives**: Review alert thresholds and conditions
- **Alert Storms**: Implement rate limiting and suppression rules (see the sketch after this list)
- **Slow Performance**: Optimize alert check intervals and data collection
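
For the alert-storm case above, a suppression rule might look like this sketch; the skill only states that rate limiting and suppression are supported, so the `suppression:` keys are assumptions.

```yaml
# Assumed suppression / rate-limit config for taming alert storms.
suppression:
  - match: service-down        # apply to a specific rule name
    max_per_hour: 4            # cap repeat notifications
    group_by: hostname         # collapse duplicates per host
    cooldown: 10m              # quiet period after each notification
```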

### Debug Commands
```bash
# Check alert engine status
alert status --verbose

# Test specific rule
alert test-rule cpu-high

# Check notification delivery
alert test-notification email --to test@example.com

# View alert engine logs
alert logs --tail 100
```

## Version History

- v1.0 (2025-12-16): Initial alerting system implementation with basic monitoring, email notifications, and alert rules
