skill by redhat-et
platform-health
Check comprehensive platform health including ArgoCD apps, pods, services, certificates, and resources across the Kagenti platform
Installs: 0
Used in: 1 repo
Updated: 1d ago
$ npx ai-builder add skill redhat-et/platform-health
Installs to .claude/skills/platform-health/
# Platform Health Check Skill

This skill helps you perform comprehensive platform health checks and identify issues quickly.

## When to Use

- After deployments or cluster restarts
- Before making changes (baseline health)
- During incident investigation
- Regular health monitoring
- After running tests
- User requests "check platform" or "is everything working"

## What This Skill Does

1. **Quick Health Overview**: One-command platform status
2. **ArgoCD Apps**: Health and sync status of all applications
3. **Pod Health**: Check pods across all namespaces
4. **Service Accessibility**: Test Gateway routes and certificates
5. **Resource Usage**: CPU/memory consumption
6. **Component-Specific Checks**: Detailed validation per component

## Quick Health Check

### Comprehensive Platform Status

```bash
# Single command for full platform health (includes pytest tests)
./scripts/platform-status.sh

# What it checks:
# ✓ ArgoCD applications (health & sync status)
# ✓ Platform pods (all namespaces)
# ✓ Gateway & certificates
# ✓ Istio mTLS configuration
# ✓ Service accessibility (via Gateway)
# ✓ OAuth authentication
# ✓ Integration tests (pytest)
```

**Expected Output**:

```
=== ArgoCD Applications Status ===
✓ gateway-api: Healthy, Synced
✓ cert-manager: Healthy, Synced
✓ istio-base: Healthy, Synced
...

=== Platform Pods ===
observability   grafana-xxx      2/2   Running
observability   prometheus-xxx   2/2   Running
...

=== Gateway & Certificates ===
✓ external-gateway: Programmed
✓ grafana-cert: Ready
...

=== Integration Tests ===
PASSED tests/validation/test_app_state.py::test_critical_apps
...
```

### Quick Status Commands

```bash
# ArgoCD apps summary
argocd app list --port-forward --port-forward-namespace argocd --grpc-web

# All pods summary
kubectl get pods -A

# Failing pods only
kubectl get pods -A | grep -vE "Running|Completed"

# Service endpoints
kubectl get svc -A

# Gateway status
kubectl get gateway -A

# Certificate status
kubectl get certificate -A
```

## Detailed Health Checks

### 1. ArgoCD Application Health

```bash
# List all apps with health status
argocd app list --port-forward --port-forward-namespace argocd --grpc-web \
  -o json | jq -r '.[] | "\(.metadata.name): \(.status.health.status), \(.status.sync.status)"'

# Check for unhealthy apps
argocd app list --port-forward --port-forward-namespace argocd --grpc-web \
  | grep -E "Degraded|OutOfSync|Unknown|Missing"

# Get details for specific app
argocd app get <app-name> --port-forward --port-forward-namespace argocd --grpc-web

# Check app sync history
argocd app history <app-name> --port-forward --port-forward-namespace argocd --grpc-web
```

**Expected States**:

- **Health**: `Healthy` (✓), `Progressing` (⚠️), `Degraded` (❌), `Missing` (❌)
- **Sync**: `Synced` (✓), `OutOfSync` (⚠️)

**Critical Apps** (must be Healthy):

- gateway-api
- cert-manager
- istio-base, istiod
- tekton-pipelines
- keycloak
- kagenti-operator, kagenti-platform-operator
- kagenti-platform
- kagenti-ui

**Optional Apps** (can be Progressing):

- observability (large images, slow startup)
- kiali
- ollama
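For a quick scripted pass over just the critical applications, the loop below is a minimal sketch: it reuses the same port-forward flags as the commands above and assumes the `argocd` CLI and `jq` are installed and authenticated. It exits nonzero if any critical app is not Healthy and Synced.

```bash
#!/usr/bin/env bash
# Sketch: flag any critical ArgoCD app that is not Healthy/Synced.
# Assumes: argocd CLI reachable via port-forward (as above) and jq installed.
set -euo pipefail

CRITICAL_APPS=(gateway-api cert-manager istio-base istiod tekton-pipelines \
  keycloak kagenti-operator kagenti-platform-operator kagenti-platform kagenti-ui)

failed=0
for app in "${CRITICAL_APPS[@]}"; do
  json=$(argocd app get "$app" --port-forward --port-forward-namespace argocd --grpc-web -o json)
  health=$(echo "$json" | jq -r '.status.health.status')
  sync=$(echo "$json" | jq -r '.status.sync.status')
  if [[ "$health" != "Healthy" || "$sync" != "Synced" ]]; then
    echo "✗ $app: $health, $sync"
    failed=1
  else
    echo "✓ $app: $health, $sync"
  fi
done
exit $failed
```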
### 2. Pod Health by Namespace

```bash
# All pods with status
kubectl get pods -A -o wide

# Pods sorted by restarts
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount' | tail -20

# Pods with issues
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

# Pod resource usage
kubectl top pods -A --sort-by=memory
kubectl top pods -A --sort-by=cpu

# Specific namespace health
kubectl get pods -n observability
kubectl get pods -n keycloak
kubectl get pods -n kagenti-system
```

**Check for these statuses**:

- ❌ **CrashLoopBackOff**: Application crashes on startup
- ❌ **ImagePullBackOff**: Image not available
- ❌ **Error**: Container exited with error
- ⚠️ **Pending**: Waiting for resources or scheduling
- ⚠️ **Init**: Init containers still running
- ✓ **Running**: Pod healthy
- ✓ **Completed**: Job finished successfully

### 3. Service Accessibility

```bash
# Test all platform services via Gateway
for service in grafana prometheus tempo phoenix kiali keycloak kagenti; do
  echo "=== Testing https://$service.localtest.me:9443/ ==="
  curl -k -I -m 5 "https://$service.localtest.me:9443/" 2>&1 | head -3
  echo
done

# Check Gateway status
kubectl get gateway -A
kubectl describe gateway external-gateway -n default

# Check HTTPRoutes
kubectl get httproute -A
kubectl describe httproute <route-name> -n <namespace>

# Check service endpoints (should have IP addresses)
kubectl get endpoints -A | grep -v "<none>"
```

**Expected Results**:

- Grafana: HTTP/2 302 (redirect to /login)
- Prometheus: HTTP/2 302 (OAuth redirect)
- Keycloak: HTTP/2 200
- Kagenti UI: HTTP/2 200

### 4. Certificate Health

```bash
# All certificates status
kubectl get certificate -A

# Check certificate details
kubectl describe certificate <cert-name> -n <namespace>

# Check cert-manager logs for issues
kubectl logs -n cert-manager deployment/cert-manager --tail=50

# Verify certificate expiration
kubectl get certificate -A -o json | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name): expires \(.status.notAfter)"'
```

**Expected State**: All certificates show `Ready=True`

### 5. Istio Service Mesh Health

```bash
# Check Istio components
kubectl get pods -n istio-system

# Verify sidecar injection (should show 2/2 containers)
kubectl get pods -A -o wide | grep "2/2"

# Check mTLS policies
kubectl get peerauthentication -A
kubectl get destinationrule -A

# Istio proxy status
istioctl proxy-status

# Check specific pod mesh config
istioctl x describe pod <pod-name> -n <namespace>
```

### 6. Resource Usage

```bash
# Node resources
kubectl top nodes

# Cluster-wide pod resources
kubectl top pods -A --sort-by=memory | head -20
kubectl top pods -A --sort-by=cpu | head -20

# Namespace resource usage
kubectl top pods -n observability
kubectl top pods -n keycloak
kubectl top pods -n kagenti-system

# Check for resource pressure
kubectl get nodes -o json | jq -r '.items[] | "\(.metadata.name): \(.status.conditions[] | select(.type=="MemoryPressure" or .type=="DiskPressure") | .type)=\(.status)"'
```
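To spot heavy memory consumers without eyeballing the full `kubectl top` output, the snippet below is a rough sketch. It assumes metrics-server is running (so `kubectl top` works) and that memory is reported in `Mi`; the `1500Mi` threshold is an arbitrary example, not a platform limit.

```bash
# Sketch: flag pods using more memory than a threshold.
# Assumes: metrics-server available (kubectl top works) and memory reported in Mi.
# THRESHOLD_MI is an arbitrary example value; tune it for your cluster.
THRESHOLD_MI=1500

kubectl top pods -A --no-headers | awk -v limit="$THRESHOLD_MI" '
  {
    mem = $4                     # e.g. "1623Mi"
    sub(/Mi$/, "", mem)          # strip the Mi suffix
    if (mem + 0 > limit)
      printf "⚠️  %s/%s using %sMi\n", $1, $2, mem
  }'
```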
### 7. Storage Health

```bash
# PersistentVolumes
kubectl get pv

# PersistentVolumeClaims
kubectl get pvc -A

# Check PVC usage via metrics
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100' \
  | python3 -m json.tool
```

## Component-Specific Health Checks

### Observability Stack

```bash
# Prometheus
kubectl get pods -n observability -l app=prometheus
kubectl exec -n observability deployment/grafana -- \
  curl -s http://prometheus.observability.svc:9090/-/ready

# Grafana
kubectl get pods -n observability -l app=grafana
curl -k -I https://grafana.localtest.me:9443/api/health

# Loki
kubectl get pods -n observability -l app=loki
kubectl exec -n observability deployment/grafana -- \
  curl -s http://loki.observability.svc:3100/ready

# Tempo
kubectl get pods -n observability -l app=tempo
kubectl exec -n observability deployment/grafana -- \
  curl -s http://tempo-query-frontend.observability.svc:3100/ready

# Phoenix
kubectl get pods -n observability -l app=phoenix
curl -k -I https://phoenix.localtest.me:9443/

# AlertManager
kubectl get pods -n observability -l app=alertmanager
kubectl exec -n observability deployment/alertmanager -c alertmanager -- \
  wget -qO- http://localhost:9093/-/ready
```

### Authentication & Authorization

```bash
# Keycloak
kubectl get pods -n keycloak -l app=keycloak
kubectl exec -n keycloak statefulset/keycloak -- \
  curl -s http://localhost:8080/health/ready | python3 -m json.tool

# OAuth2-Proxy instances
kubectl get pods -n oauth2-proxy
kubectl get deployment -n oauth2-proxy

# Test Keycloak SSO
curl -k "https://keycloak.localtest.me:9443/realms/master/.well-known/openid-configuration"
```

### Platform Components

```bash
# Kagenti Operator
kubectl get pods -n kagenti-operator
kubectl logs -n kagenti-operator deployment/kagenti-operator --tail=20

# Kagenti Platform Operator
kubectl get pods -n kagenti-platform-operator
kubectl logs -n kagenti-platform-operator deployment/kagenti-platform-operator --tail=20

# Kagenti UI
kubectl get pods -n kagenti-platform -l app=kagenti-ui
curl -k -I https://kagenti.localtest.me:9443/

# Tekton Pipelines
kubectl get pods -n tekton-pipelines
kubectl get pipelineruns -A
```

## Health Check Checklists

### Post-Deployment Health Check

- [ ] All ArgoCD apps Healthy and Synced
- [ ] No pods in CrashLoopBackOff/ImagePullBackOff
- [ ] All services have endpoints
- [ ] All certificates Ready
- [ ] All Gateway routes Programmed
- [ ] Services accessible via browser
- [ ] Integration tests passing
- [ ] No firing critical alerts

### Pre-Change Health Check

- [ ] Capture platform snapshot: `./scripts/capture-platform-snapshot.sh before-change`
- [ ] All critical apps Healthy
- [ ] No existing incidents in TODO_INCIDENTS.md
- [ ] Resource usage within limits
- [ ] Recent Git commits validated

### Incident Investigation Health Check

- [ ] Identify degraded components
- [ ] Check recent events
- [ ] Collect logs from affected pods
- [ ] Query metrics for anomalies
- [ ] Check for correlated failures
- [ ] Review recent changes (Git history)
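For a lighter-weight pass than the full `platform-status.sh`, the sketch below automates a few of the post-deployment items (failing pods, certificates not Ready, gateways not Programmed). It only uses standard `kubectl` and `jq`; treat it as illustrative rather than a replacement for the integration tests.

```bash
# Sketch: quick pass over a few post-deployment checklist items.
# Only covers kubectl-visible checks; run ./scripts/platform-status.sh for the full picture.

echo "== Pods not Running/Completed =="
kubectl get pods -A --no-headers | grep -vE "Running|Completed" || echo "none"

echo "== Certificates not Ready =="
kubectl get certificate -A -o json \
  | jq -r '.items[] | select(([.status.conditions[]? | select(.type=="Ready" and .status=="True")] | length) == 0) | "\(.metadata.namespace)/\(.metadata.name)"'

echo "== Gateways not Programmed =="
kubectl get gateway -A -o json \
  | jq -r '.items[] | select(([.status.conditions[]? | select(.type=="Programmed" and .status=="True")] | length) == 0) | "\(.metadata.namespace)/\(.metadata.name)"'
```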
## Common Health Issues

### Issue: Pods stuck in Pending

```bash
# Check pod description for reason
kubectl describe pod <pod-name> -n <namespace>

# Common causes:
# - Insufficient CPU/memory
# - No nodes matching nodeSelector
# - Unbound PersistentVolumeClaim
```

### Issue: Pods CrashLoopBackOff

```bash
# Check previous logs
kubectl logs <pod-name> -n <namespace> --previous

# Check events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# Common causes:
# - Application error on startup
# - Missing configuration
# - Dependency not available
```

### Issue: Service not accessible

```bash
# Check pod status
kubectl get pods -n <namespace> -l app=<service>

# Check service endpoints
kubectl get endpoints -n <namespace> <service-name>

# Check HTTPRoute
kubectl get httproute -n <namespace>

# Test from inside cluster
kubectl run debug-curl -n <namespace> --image=curlimages/curl --rm -it \
  -- curl http://<service-name>.<namespace>.svc:PORT
```

### Issue: Certificate not Ready

```bash
# Check certificate status
kubectl describe certificate <cert-name> -n <namespace>

# Check cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager

# Common causes:
# - DNS validation failing
# - Rate limit reached
# - Invalid configuration
```

### Issue: High resource usage

```bash
# Find top consumers
kubectl top pods -A --sort-by=memory | head -10
kubectl top pods -A --sort-by=cpu | head -10

# Check for memory leaks
kubectl logs <pod-name> -n <namespace> | grep -i "out of memory"

# Check resource limits
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Limits:"
```

## Automation & Monitoring

### Continuous Health Monitoring

```bash
# Watch pod status
watch -n 5 'kubectl get pods -A | grep -vE "Running|Completed"'

# Watch ArgoCD apps
watch -n 10 'argocd app list --port-forward --port-forward-namespace argocd --grpc-web | grep -vE "Healthy.*Synced"'

# Monitor specific namespace
watch -n 5 'kubectl get pods -n observability'
```

### Scheduled Health Checks

```bash
# Cron job for periodic health checks (local dev)
# Add to crontab: crontab -e
*/15 * * * * /path/to/kagenti-demo-deployment/scripts/platform-status.sh > /tmp/health-$(date +\%Y\%m\%d-\%H\%M).log 2>&1

# Compare snapshots over time
./scripts/capture-platform-snapshot.sh hourly-check
```

## Related Documentation

- [CLAUDE.md Platform Status](../../../CLAUDE.md#monitoring--access) - Monitoring commands
- [scripts/platform-status.sh](../../../scripts/platform-status.sh) - Automated health check
- [TODO_INCIDENTS.md](../../../TODO_INCIDENTS.md) - Active incidents
- [docs/INTEGRATION_TESTS.md](../../../docs/INTEGRATION_TESTS.md) - Test strategy

## Integration with Other Skills

**After health check, if issues found**:

- Use **investigate-incident** skill for RCA
- Use **check-logs** skill to examine error logs
- Use **check-metrics** skill for performance analysis
- Use **check-alerts** skill to see if alerts fired

## Pro Tips

1. **Always baseline first**: Run health check BEFORE making changes
2. **Use platform-status.sh**: Single command for comprehensive check
3. **Capture snapshots**: Use `capture-platform-snapshot.sh` for historical comparison
4. **Check critical apps first**: Focus on gateway-api, istio, keycloak, operators
5. **Look for patterns**: Multiple pods failing often indicates a cluster-wide issue
6. **Check Git history**: Recent commits may explain new issues
7. **Verify after fixes**: Always re-run health check after remediation

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Details
- Type: skill
- Author: redhat-et
- Slug: redhat-et/platform-health
- Created: 4d ago