Infrastructure metrics collection
Every service has monitoring configured that collects CPU, memory, and disk metrics. The service inventory is documented from CLI output. Database connection pools and Redis are monitored explicitly. The coverage matrix shows no gaps.
Question to ask
"Which service has no monitoring right now?"
Verification guide
Severity: Critical
Check automatically:
Get service inventory from CLI:
Kubernetes:
# List all deployments/services
kubectl get deployments -A -o json | jq '.items[] | {namespace: .metadata.namespace, name: .metadata.name}'
# List statefulsets (databases, etc.)
kubectl get statefulsets -A -o json | jq '.items[] | {namespace: .metadata.namespace, name: .metadata.name}'
Docker Compose:
# List services defined
docker compose config --services
# List running containers
docker ps --format '{{.Names}}'
AWS:
# List EC2 instances
aws ec2 describe-instances --query 'Reservations[].Instances[].{ID:InstanceId,Name:Tags[?Key==`Name`].Value|[0],State:State.Name}'
# List ECS services
aws ecs list-services --cluster CLUSTER_NAME
# List RDS instances
aws rds describe-db-instances --query 'DBInstances[].DBInstanceIdentifier'
# List ElastiCache clusters (Redis)
aws elasticache describe-cache-clusters --query 'CacheClusters[].CacheClusterId'
GCP:
# List Compute Engine instances
gcloud compute instances list --format='json' | jq '.[].name'
# List Cloud SQL instances
gcloud sql instances list --format='json' | jq '.[].name'
# List Memorystore (Redis) instances
gcloud redis instances list --region=REGION --format='json' | jq '.[].name'
Verify monitoring configuration exists:
Prometheus:
# Check prometheus config
cat prometheus.yml
# Look for scrape configs covering services
grep -A 20 "scrape_configs:" prometheus.yml
# Check alert rules exist
ls -la rules/ || ls -la alerts/
Datadog:
# Check Datadog agent config
cat /etc/datadog-agent/datadog.yaml
# List integrations
ls /etc/datadog-agent/conf.d/
CloudWatch:
# List custom metrics
aws cloudwatch list-metrics --namespace "Custom" --query 'Metrics[].MetricName'
# Check for CloudWatch agent config
cat /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
GCP Monitoring:
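# List custom metric descriptors via the Cloud Monitoring API (replace PROJECT_ID)
curl -s -G -H "Authorization: Bearer $(gcloud auth print-access-token)" --data-urlencode 'filter=metric.type = starts_with("custom.googleapis.com/")' "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/metricDescriptors" | jq '.metricDescriptors[].type'
# List alerting policies
gcloud alpha monitoring policies list --format='json' | jq '.[].displayName'
Verify specific metrics are collected: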
Required metrics per service type:
- Compute: CPU utilization, memory usage, disk space
- Database: Connection count, pool utilization, query latency
- Redis: Memory usage, connected clients, hit rate, evictions
- All services: Health/up status
# Prometheus - check metrics exist
curl -s localhost:9090/api/v1/label/__name__/values | jq '.data[]' | grep -E 'cpu|memory|disk|connection|redis'
# CloudWatch - check EC2 metrics
aws cloudwatch list-metrics --namespace AWS/EC2 --dimensions Name=InstanceId,Value=INSTANCE_ID
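The grep above shows which metric names matched, but not which required families are absent. A minimal sketch of that inverse check follows. The function name and the family-to-pattern pairs are assumptions for illustration, not part of any monitoring tool, and the patterns should be adjusted to the exporters actually deployed.

```shell
# Report which required metric families have no matching metric name.
# Reads metric names from stdin, one per line. The family:pattern pairs are
# illustrative assumptions; adjust them to the exporters actually in use.
missing_metric_families() {
  mmf_names=$(tr -d '"')   # tolerate quoted names from plain jq '.data[]'
  mmf_rc=0
  for mmf_entry in \
    "cpu:cpu" \
    "memory:mem" \
    "disk:disk|filesystem" \
    "db_connections:connection|pool" \
    "redis:redis"
  do
    mmf_family=${mmf_entry%%:*}    # text before the first colon
    mmf_pattern=${mmf_entry#*:}    # text after the first colon
    if ! printf '%s\n' "$mmf_names" | grep -Eiq "$mmf_pattern"; then
      echo "MISSING: $mmf_family"
      mmf_rc=1
    fi
  done
  return "$mmf_rc"
}
```

Possible usage, assuming Prometheus listens on localhost:9090 as in the command above: `curl -s localhost:9090/api/v1/label/__name__/values | jq -r '.data[]' | missing_metric_families`. A non-zero exit status means at least one family had no match. A matching name only shows that some target exports the metric, not that every service does.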
Cross-reference with:
- DB-001 (Connection pooling) - pool metrics should match pool config
- HEALTH-002 (Deep health endpoint) - metrics should cover the same services that the deep health checks cover
Pass criteria:
- Service inventory is documented (from CLI output)
- Each service has monitoring configured
- CPU, memory, disk metrics collected for compute
- Connection pool metrics collected for databases
- Redis metrics collected if Redis in use
- Monitoring coverage matrix shows no gaps
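The "each service has monitoring configured" criterion comes down to a set difference between the inventory and the monitored targets. The sketch below assumes both have already been exported to plain text files with one service name per line. The function name and file layout are hypothetical, and names must be normalized to the same form in both files before comparing.

```shell
# Print services listed in the inventory file but absent from the monitored file.
# Both arguments are text files with one service name per line (hypothetical
# layout; build them from the inventory and monitoring tool output).
unmonitored_services() {
  us_inventory=$1
  us_monitored=$2
  # The first file fills a lookup table; the second is filtered against it.
  # FILENAME == ARGV[1] is used instead of NR == FNR so that an empty
  # monitored file still reports every inventory entry as unmonitored.
  awk 'FILENAME == ARGV[1] { monitored[$0] = 1; next }
       !($0 in monitored) && !printed[$0]++ { print }' \
    "$us_monitored" "$us_inventory"
}
```

For example, `unmonitored_services inventory.txt monitored.txt` prints one line per gap and prints nothing when coverage is complete.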
Fail criteria:
- No service inventory exists
- Services exist without monitoring
- Missing critical metrics (CPU, memory, disk)
- Database has no connection pool metrics
- Redis in use but not monitored
If the monitoring tool cannot be identified, ask the user: "What monitoring system does this project use? (Prometheus, Datadog, CloudWatch, GCP Monitoring, New Relic, etc.)
Please provide:
- List of all services/infrastructure components
- Screenshot or export of monitoring dashboard showing coverage
- Any services known to be unmonitored"
Evidence to capture:
- Service inventory (CLI output)
- Monitoring tool in use
- Screenshot of monitoring dashboard
- Coverage matrix: service → metrics collected
- Any gaps identified
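One way to produce the coverage matrix as evidence is to collect "service metric" pairs and pivot them into a table. The sketch below is a hypothetical helper. The input format (one `service family` pair per line, or a bare service name for a service with no metrics at all) and the four column names are assumptions drawn from the required-metrics list, not output of any specific tool.

```shell
# Build a service -> metrics coverage matrix from stdin.
# Input lines: "<service> <family>", or just "<service>" when nothing is collected.
# Columns are an assumed minimum set; extend the split() string as needed.
coverage_matrix() {
  awk '
    BEGIN { n = split("cpu memory disk health", cols, " ") }
    NF >= 1 && !($1 in seen) { seen[$1] = 1; order[++count] = $1 }
    NF >= 2 { have[$1, $2] = 1 }
    END {
      printf "service"
      for (i = 1; i <= n; i++) printf " %s", cols[i]
      printf "\n"
      for (s = 1; s <= count; s++) {
        printf "%s", order[s]
        for (i = 1; i <= n; i++)
          printf " %s", (((order[s], cols[i]) in have) ? "yes" : "GAP")
        printf "\n"
      }
    }'
}
```

Feeding it `printf 'api cpu\napi memory\nworker cpu\ncache\n' | coverage_matrix` yields a header row followed by one row per service, with `GAP` marking every family that was not reported. Any `GAP` cell belongs in the list of gaps identified.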