MON-001 critical general

Infrastructure metrics collection

Every service has monitoring configured that collects CPU, memory, and disk metrics. The service inventory is documented from CLI output. Database connection pools and Redis are monitored explicitly. The coverage matrix shows no gaps.

Question to ask

"Which service has no monitoring right now?"

Verification guide

Severity: Critical

Check automatically:

  1. Get service inventory from CLI:

    Kubernetes:

    # List all deployments/services
    kubectl get deployments -A -o json | jq '.items[] | {namespace: .metadata.namespace, name: .metadata.name}'
    
    # List statefulsets (databases, etc.)
    kubectl get statefulsets -A -o json | jq '.items[] | {namespace: .metadata.namespace, name: .metadata.name}'
    

    Docker Compose:

    # List services defined
    docker compose config --services
    
    # List running containers
    docker ps --format '{{.Names}}'
    

    AWS:

    # List EC2 instances
    aws ec2 describe-instances --query 'Reservations[].Instances[].{ID:InstanceId,Name:Tags[?Key==`Name`].Value|[0],State:State.Name}'
    
    # List ECS services
    aws ecs list-services --cluster CLUSTER_NAME
    
    # List RDS instances
    aws rds describe-db-instances --query 'DBInstances[].DBInstanceIdentifier'
    
    # List ElastiCache clusters (Redis)
    aws elasticache describe-cache-clusters --query 'CacheClusters[].CacheClusterId'
    

    GCP:

    # List Compute Engine instances
    gcloud compute instances list --format='json' | jq '.[].name'
    
    # List Cloud SQL instances
    gcloud sql instances list --format='json' | jq '.[].name'
    
    # List Memorystore (Redis) instances
    gcloud redis instances list --region=REGION --format='json' | jq '.[].name'
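
The per-platform lists above are easier to compare against monitoring targets once they are merged into a single file. The following is a minimal sketch of that merge, not a prescribed procedure. The function names `prefix_lines` and `build_inventory`, the `<source>/<name>` line format, and the `inventory.txt` file name are all hypothetical conventions chosen for illustration.

```shell
#!/bin/sh
# prefix_lines SOURCE: read names on stdin, trim whitespace and surrounding
# double quotes (jq prints names quoted), drop blank lines, and prefix each
# name with "SOURCE/". The prefix is a hypothetical labelling convention.
prefix_lines() {
    src="$1"
    sed -e 's/^[[:space:]"]*//' \
        -e 's/[[:space:]"]*$//' \
        -e '/^$/d' \
        -e "s|^|${src}/|"
}

# build_inventory: read prefixed lines on stdin, emit a sorted,
# de-duplicated inventory on stdout.
build_inventory() {
    LC_ALL=C sort -u
}

# Example (not executed here; assumes the CLIs from step 1 are installed):
# {
#     docker compose config --services | prefix_lines compose
#     gcloud sql instances list --format='json' | jq '.[].name' | prefix_lines cloudsql
# } | build_inventory > inventory.txt
```

Keeping the source prefix makes it clear which platform each entry came from when the inventory is later attached as evidence.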
    
  2. Verify monitoring configuration exists:

    Prometheus:

    # Check prometheus config
    cat prometheus.yml
    
    # Look for scrape configs covering services
    grep -A 20 "scrape_configs:" prometheus.yml
    
    # Check alert rules exist
    ls -la rules/ || ls -la alerts/
    

    Datadog:

    # Check Datadog agent config
    cat /etc/datadog-agent/datadog.yaml
    
    # List integrations
    ls /etc/datadog-agent/conf.d/
    

    CloudWatch:

    # List custom metrics ("Custom" is a placeholder namespace; the CloudWatch agent publishes to "CWAgent" by default)
    aws cloudwatch list-metrics --namespace "Custom" --query 'Metrics[].MetricName'
    
    # Check for CloudWatch agent config
    cat /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
    

    GCP Monitoring:

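    # List metrics scopes (shows which projects are monitored together; this does not list individual metrics)
    # Depending on the gcloud version, this may need the beta component and a projects/PROJECT_ID argument
    gcloud monitoring metrics-scopes list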
    
    # List alerting policies
    gcloud alpha monitoring policies list --format='json' | jq '.[].displayName'
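
For Prometheus, reading `prometheus.yml` by eye does not scale past a few jobs. As a rough sketch, the job names can be pulled out of the config so they can be compared with the service inventory. The function name `list_scrape_jobs` is hypothetical, and the approach is text matching rather than YAML parsing: it assumes each `job_name:` sits on its own line, which is the usual layout but not guaranteed.

```shell
#!/bin/sh
# list_scrape_jobs FILE: print the job_name of every scrape config found in a
# Prometheus config file, one per line, sorted and de-duplicated.
# Text-based sketch: does not parse YAML, and a "#" inside a job name would
# be treated as the start of a comment.
list_scrape_jobs() {
    sed -n 's/^[[:space:]]*-\{0,1\}[[:space:]]*job_name:[[:space:]]*//p' "$1" |
        sed -e 's/[[:space:]]*#.*$//' \
            -e 's/[[:space:]]*$//' \
            -e "s/^[\"']//" \
            -e "s/[\"']\$//" |
        LC_ALL=C sort -u
}

# Example (not executed here):
# list_scrape_jobs prometheus.yml > scrape_jobs.txt
```

Job names rarely match service names exactly, so the output is a starting point for the coverage matrix rather than proof of coverage.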
    
  3. Verify specific metrics are collected:

    Required metrics per service type:

    • Compute: CPU utilization, memory usage, disk space
    • Database: Connection count, pool utilization, query latency
    • Redis: Memory usage, connected clients, hit rate, evictions
    • All services: Health/up status

    # Prometheus - check metrics exist
    curl -s localhost:9090/api/v1/label/__name__/values | jq '.data[]' | grep -E 'cpu|memory|disk|connection|redis'
    
    # CloudWatch - check EC2 metrics
    aws cloudwatch list-metrics --namespace AWS/EC2 --dimensions Name=InstanceId,Value=INSTANCE_ID
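
The required-metrics list above can be checked mechanically once metric names are saved to a file, one per line (for example from the Prometheus label query with `jq -r`). The sketch below is one possible way to do that. The function names are hypothetical, and the regular expressions are illustrative guesses based on common exporter naming (such as `redis_connected_clients`), not an authoritative mapping; adjust them to the exporters actually in use.

```shell
#!/bin/sh
# required_patterns TYPE: print one case-insensitive regex per required metric
# for a service type. Patterns are illustrative assumptions, not a standard.
required_patterns() {
    case "$1" in
        compute)  printf '%s\n' 'cpu' 'mem' 'disk|filesystem' ;;
        database) printf '%s\n' 'connection|numbackends' 'pool' 'latency|duration' ;;
        redis)    printf '%s\n' 'redis.*memory' 'connected_clients' 'keyspace_hits' 'evicted' ;;
        *)        return 1 ;;
    esac
}

# missing_metrics TYPE FILE: print each required pattern that matches no
# metric name in FILE. Returns 0 if nothing is missing, 1 if something is
# missing, 2 for an unknown service type.
missing_metrics() {
    patterns=$(required_patterns "$1") || { echo "unknown type: $1" >&2; return 2; }
    missing=$(printf '%s\n' "$patterns" | while IFS= read -r p; do
        grep -Eiq -- "$p" "$2" || printf '%s\n' "$p"
    done)
    [ -z "$missing" ] && return 0
    printf '%s\n' "$missing"
    return 1
}

# Example (not executed here; assumes Prometheus on localhost:9090):
# curl -s localhost:9090/api/v1/label/__name__/values | jq -r '.data[]' > metrics.txt
# missing_metrics compute metrics.txt
```

A pattern that prints here is a prompt to look closer, since a metric may exist under a name the pattern does not anticipate.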
    

Cross-reference with:

  • DB-001 (Connection pooling) - pool metrics should match pool config
  • HEALTH-002 (Deep health endpoint) - metrics should cover the same services that the deep health checks cover

Pass criteria:

  • Service inventory is documented (from CLI output)
  • Each service has monitoring configured
  • CPU, memory, disk metrics collected for compute
  • Connection pool metrics collected for databases
  • Redis metrics collected if Redis in use
  • Monitoring coverage matrix shows no gaps

Fail criteria:

  • No service inventory exists
  • Services exist without monitoring
  • Missing critical metrics (CPU, memory, disk)
  • Database has no connection pool metrics
  • Redis in use but not monitored
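
The first two fail criteria (no inventory, services without monitoring) reduce to a set difference between two lists. A minimal sketch of that comparison follows, assuming two plain-text files with one service name per line: an inventory and a list of monitored services. The function names and the file format are hypothetical, and names must match exactly between the two files for the comparison to be meaningful.

```shell
#!/bin/sh
# unmonitored INVENTORY MONITORED: print services present in INVENTORY but
# absent from MONITORED. Inputs are sorted into temp files because comm
# requires sorted input.
unmonitored() {
    inv=$(mktemp) && mon=$(mktemp) || return 2
    LC_ALL=C sort -u "$1" > "$inv"
    LC_ALL=C sort -u "$2" > "$mon"
    # comm -23 keeps only lines unique to the first file.
    LC_ALL=C comm -23 "$inv" "$mon"
    rm -f "$inv" "$mon"
}

# check_coverage INVENTORY MONITORED: return 0 only when an inventory exists
# and every service in it also appears in MONITORED.
check_coverage() {
    [ -s "$1" ] || { echo "FAIL: no service inventory"; return 1; }
    gaps=$(unmonitored "$1" "$2")
    if [ -n "$gaps" ]; then
        printf 'FAIL: unmonitored services:\n%s\n' "$gaps"
        return 1
    fi
    echo "PASS: no coverage gaps"
}
```

This covers only presence of monitoring per service; the metric-level criteria (CPU, memory, disk, pool, Redis) still need their own check.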

If the monitoring tool cannot be identified, ask the user: "What monitoring system does this project use? (Prometheus, Datadog, CloudWatch, GCP Monitoring, New Relic, etc.)

Please provide:

  1. List of all services/infrastructure components
  2. Screenshot or export of monitoring dashboard showing coverage
  3. Any services known to be unmonitored"

Evidence to capture:

  • Service inventory (CLI output)
  • Monitoring tool in use
  • Screenshot of monitoring dashboard
  • Coverage matrix: service → metrics collected
  • Any gaps identified
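
The coverage matrix in the evidence list can be produced as plain text from two files. The sketch below assumes an inventory file (one service per line) and a pairs file with `service metric` on each line; both formats, the function name `coverage_matrix`, and the `NONE (gap)` marker are hypothetical choices for illustration.

```shell
#!/bin/sh
# coverage_matrix INVENTORY PAIRS: print "service -> metric, metric, ..." for
# each service in INVENTORY, in inventory order. Services with no entry in
# PAIRS are printed as "NONE (gap)".
coverage_matrix() {
    awk '
        # First file: remember each service in the order listed.
        FILENAME == ARGV[1] { if ($1 != "") order[++n] = $1; next }
        # Second file: append each metric to its service.
        NF >= 2 { m[$1] = (m[$1] == "" ? $2 : m[$1] ", " $2) }
        END {
            for (i = 1; i <= n; i++) {
                s = order[i]
                printf "%s -> %s\n", s, (s in m ? m[s] : "NONE (gap)")
            }
        }
    ' "$1" "$2"
}

# Example (not executed here):
# coverage_matrix inventory.txt service_metrics.txt > coverage_matrix.txt
```

Lines marked `NONE (gap)` correspond directly to the "Any gaps identified" evidence item.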

Section

12. Monitoring

Observability