# Check Metrics Skill
This skill helps you query Prometheus metrics, check resource usage, and analyze platform performance in the Kagenti platform.
## When to Use
- User asks about resource usage (CPU, memory, disk)
- Investigating performance issues
- Checking service health metrics
- After deployments to verify metrics collection
- Analyzing platform capacity and scaling needs
## What This Skill Does
1. **Query Metrics**: Execute PromQL queries against Prometheus
2. **Resource Usage**: Check CPU, memory, disk usage
3. **Service Health**: Verify service metrics and availability
4. **Performance Analysis**: Analyze request rates, latency, errors
5. **Capacity Planning**: Review resource trends
## Examples
### Access Prometheus UI
**Prometheus UI**: Port-forward to access locally
```bash
kubectl port-forward -n observability svc/prometheus 9090:9090 &
# Open http://localhost:9090
```
**Grafana Explore**: https://grafana.localtest.me:9443/explore
- Select **Prometheus** datasource
- Enter PromQL queries
### Query Metrics via CLI
```bash
# Basic query
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=up' | python3 -m json.tool
# Query with time range
# Note: `-v-1H` is BSD/macOS date syntax; on GNU/Linux use $(date -u -d '1 hour ago' +%s)
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query_range' \
  --data-urlencode 'query=rate(container_cpu_usage_seconds_total[5m])' \
  --data-urlencode 'start='$(date -u -v-1H +%s) \
  --data-urlencode 'end='$(date -u +%s) \
  --data-urlencode 'step=60' | python3 -m json.tool
```
## Common PromQL Queries
### Service Health
```promql
# Check if services are up
up{job="kubernetes-pods"}
# Count running pods by namespace
count by (kubernetes_namespace) (up == 1)
# Check deployment replicas
kube_deployment_status_replicas_available
# Check StatefulSet replicas
kube_statefulset_status_replicas_ready
```
### CPU Usage
```promql
# Pod CPU usage (percentage of limit)
sum(rate(container_cpu_usage_seconds_total{container!="",container!="POD"}[5m])) by (namespace, pod, container)
  / sum(container_spec_cpu_quota{container!="",container!="POD"} / container_spec_cpu_period{container!="",container!="POD"}) by (namespace, pod, container) * 100
# Node CPU usage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Top CPU consuming pods
topk(10,
  sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace, pod)
)
```
### Memory Usage
```promql
# Pod memory usage (percentage of limit)
sum(container_memory_working_set_bytes{container!="",container!="POD"}) by (namespace, pod, container)
  / sum(container_spec_memory_limit_bytes{container!="",container!="POD"}) by (namespace, pod, container) * 100
# Pod memory usage in bytes
container_memory_working_set_bytes{container!="",container!="POD"}
# Top memory consuming pods
topk(10,
  sum(container_memory_working_set_bytes{container!=""}) by (namespace, pod)
)
```
### Network Traffic
```promql
# Network receive rate
rate(container_network_receive_bytes_total[5m])
# Network transmit rate
rate(container_network_transmit_bytes_total[5m])
# Total network I/O by pod
sum by (pod) (
  rate(container_network_receive_bytes_total[5m]) +
  rate(container_network_transmit_bytes_total[5m])
)
```
### Disk Usage
```promql
# Filesystem usage percentage
(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100
# PVC usage by namespace
sum by (namespace, persistentvolumeclaim) (
  kubelet_volume_stats_used_bytes
)
# Disk I/O rate
rate(container_fs_writes_bytes_total[5m])
```
### Pod Status
```promql
# Pods not running
kube_pod_status_phase{phase!="Running"}
# Pod restart count
kube_pod_container_status_restarts_total
# Pods waiting (pending)
kube_pod_status_phase{phase="Pending"}
# Pods in crash loop
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}
```
### Request Metrics (if instrumented)
```promql
# Request rate
rate(http_requests_total[5m])
# Error rate
rate(http_requests_total{status=~"5.."}[5m])
# Request latency (p95)
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
)
```
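The raw 5xx rate is hard to interpret without the total request volume. A hedged sketch of an error-ratio query, run through the same Prometheus API used above (assumes your services expose the standard `http_requests_total` counter):
```bash
# Error ratio: fraction of requests returning 5xx over the last 5 minutes
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))' \
  | python3 -m json.tool
```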
## Check Specific Components
### Prometheus Metrics
```bash
# Check Prometheus scrape targets
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://prometheus.observability.svc:9090/api/v1/targets' | python3 -m json.tool
# Prometheus storage size
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://prometheus.observability.svc:9090/api/v1/status/tsdb' | python3 -m json.tool
```
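To surface only failing targets instead of reading the full JSON, a small filter in the same style as the alert-testing snippet below (assumes the standard `activeTargets` shape of the Prometheus targets API):
```bash
# Print job, scrape URL, and last error for every target that is not healthy
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://prometheus.observability.svc:9090/api/v1/targets' | python3 -c "
import sys, json
targets = json.load(sys.stdin)['data']['activeTargets']
for t in targets:
    if t['health'] != 'up':
        print(t['labels'].get('job'), t['scrapeUrl'], t['lastError'])
"
```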
### Grafana Metrics
```promql
# Grafana datasource queries
grafana_datasource_request_total
# Grafana dashboard loads
grafana_page_response_status_total
```
### Keycloak Metrics (if exposed)
```promql
# Keycloak sessions
keycloak_sessions
# Keycloak login failures
keycloak_failed_login_attempts
```
### Istio Metrics
```promql
# Istio requests
istio_requests_total
# Istio request duration
histogram_quantile(0.95,
  rate(istio_request_duration_milliseconds_bucket[5m])
)
# Istio error rate
rate(istio_requests_total{response_code=~"5.."}[5m])
```
## Resource Monitoring via kubectl
### Quick Resource Check
```bash
# Node resources
kubectl top nodes
# Pod resources (all namespaces)
kubectl top pods -A --sort-by=memory
# Pod resources (specific namespace)
kubectl top pods -n observability --sort-by=cpu
# Container resources in pod
kubectl top pod <pod-name> -n <namespace> --containers
```
### Resource Limits and Requests
```bash
# Show resource requests/limits for deployment
kubectl describe deployment <name> -n <namespace> | grep -A 5 "Limits\|Requests"
# Show all pod resource requests
kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources}{"\n"}{end}'
```
## Grafana Dashboards
**Access**: https://grafana.localtest.me:9443/dashboards
**Key Dashboards**:
1. **Kubernetes / Compute Resources / Cluster** - Overall cluster metrics
2. **Kubernetes / Compute Resources / Namespace (Pods)** - Per-namespace pod resources
3. **Kubernetes / Compute Resources / Pod** - Individual pod metrics
4. **Prometheus** - Prometheus self-monitoring
5. **Loki Logs** - Log volume and patterns
6. **Istio Mesh** - Service mesh metrics
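The same dashboards can also be listed from the CLI via Grafana's search API (a sketch assuming the `admin:admin123` credentials used elsewhere in this skill):
```bash
# List dashboards by title and UID
kubectl exec -n observability deployment/grafana -- \
  curl -s -u admin:admin123 'http://localhost:3000/api/search?type=dash-db' | python3 -m json.tool
```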
### Create Custom Queries in Grafana
1. Navigate to **Explore** (compass icon in sidebar)
2. Select **Prometheus** datasource
3. Enter PromQL query
4. Click "Run query"
5. Optionally save to dashboard
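If you prefer the CLI, the same query can be sent through Grafana's datasource proxy API; a minimal sketch, assuming the same admin credentials (look up the real datasource UID first, as shown):
```bash
# Find the Prometheus datasource UID
kubectl exec -n observability deployment/grafana -- \
  curl -s -u admin:admin123 'http://localhost:3000/api/datasources' | python3 -m json.tool
# Proxy a PromQL query through Grafana using that UID
kubectl exec -n observability deployment/grafana -- \
  curl -s -G -u admin:admin123 \
  'http://localhost:3000/api/datasources/proxy/uid/<DATASOURCE_UID>/api/v1/query' \
  --data-urlencode 'query=up' | python3 -m json.tool
```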
## Troubleshooting with Metrics
### Issue: High CPU Usage
```promql
# Find pods using >80% CPU
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace, pod, container)
  / sum(container_spec_cpu_quota / container_spec_cpu_period) by (namespace, pod, container) * 100 > 80
```
### Issue: High Memory Usage
```promql
# Find pods using >80% memory
sum(container_memory_working_set_bytes) by (namespace, pod, container)
  / sum(container_spec_memory_limit_bytes) by (namespace, pod, container) * 100 > 80
```
### Issue: Service Not Responding
```promql
# Check if service endpoints are up
up{job="kubernetes-service-endpoints"}
# Check scrape failures
up == 0
```
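To correlate a down target with the cluster side, cross-check that the Service actually has ready endpoints behind it:
```bash
# Confirm the Service has ready endpoints backing it
kubectl get endpoints <service-name> -n <namespace>
# Check whether the backing pods are Ready
kubectl get pods -n <namespace> -l app=<service-label> -o wide
```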
### Issue: Disk Full
```promql
# Find PVCs >80% full
(kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100 > 80
```
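Cross-check the flagged PVC with kubectl to see its requested size and bound volume:
```bash
# List PVCs with capacity and status
kubectl get pvc -A
# Inspect a specific claim (events often show resize or mount problems)
kubectl describe pvc <pvc-name> -n <namespace>
```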
## Alert Query Testing
When investigating alerts, test the PromQL query:
```bash
# Get alert query from Grafana
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://localhost:3000/api/v1/provisioning/alert-rules' \
  -u admin:admin123 | python3 -c "
import sys, json
rules = json.load(sys.stdin)
alert_uid = 'prometheus-down'  # Change this
rule = next((r for r in rules if r.get('uid') == alert_uid), None)
if rule:
    query = rule['data'][0]['model']['expr']
    print(f'Query: {query}')
"
# Test the query
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/query' \
  --data-urlencode "query=<QUERY_FROM_ABOVE>" | python3 -m json.tool
```
## Metrics Collection Issues
### Check if Metrics Are Being Scraped
```promql
# Check last scrape time
time() - timestamp(up)
# Check scrape duration
scrape_duration_seconds
```
### Verify Metric Exists
```bash
# List all metrics
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://prometheus.observability.svc:9090/api/v1/label/__name__/values' | python3 -m json.tool
# Search for specific metric
kubectl exec -n observability deployment/grafana -- \
  curl -s 'http://prometheus.observability.svc:9090/api/v1/label/__name__/values' | grep "your_metric"
```
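Grepping the full metric-name list can be slow on large instances; the series endpoint lets you match a label selector directly (standard Prometheus HTTP API):
```bash
# List series matching a selector instead of grepping all metric names
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://prometheus.observability.svc:9090/api/v1/series' \
  --data-urlencode 'match[]={__name__=~"your_metric.*"}' | python3 -m json.tool
```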
## Related Documentation
- [PromQL Documentation](https://prometheus.io/docs/prometheus/latest/querying/basics/)
- [Grafana Explore](https://grafana.com/docs/grafana/latest/explore/)
- [Alert Testing Guide](../../../docs/04-observability/ALERT_TESTING_GUIDE.md)
- [CLAUDE.md Monitoring](../../../CLAUDE.md#monitoring--access)
## Pro Tips
1. **Use rate() for counters**: `rate(metric[5m])` instead of raw counter values
2. **Aggregate with by/without**: `sum by (namespace) (metric)` to group metrics
3. **Use recording rules**: For frequently used complex queries (see the sketch after this list)
4. **Set appropriate time ranges**: Use `[5m]` for rate calculations
5. **Test queries in Explore first**: Before adding to dashboards or alerts
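For tip 3, a minimal sketch of a recording rule in Prometheus rule-file syntax; the rule and file names are illustrative, and how the file gets loaded depends on your Prometheus deployment:
```bash
# Write an example rule file; mount/load it according to your Prometheus setup
cat <<'EOF' > recording-rules.yaml
groups:
  - name: kagenti-recording-rules
    rules:
      # Precompute per-namespace CPU usage so dashboards and alerts stay cheap
      - record: namespace:container_cpu_usage_seconds:rate5m
        expr: sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace)
EOF
```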
🤖 Generated with [Claude Code](https://claude.com/claude-code)