Section 12 · Observability
Monitoring
Audit checklist for monitoring infrastructure metrics, database performance, HTTP logging, alerting, log retention, and status pages.
This guide walks you through auditing a project's monitoring setup: confirming that infrastructure metrics are collected, database performance is tracked, HTTP requests are logged and alerted on, and status pages exist.
The Goal: No Blind Spots
You cannot fix what you cannot see. Complete observability means knowing when things break before users do, understanding why they broke, and having the data to prevent recurrence.
- Complete coverage — Every infrastructure component (compute, databases, caches) has metrics collection with no gaps
- Queryable — Slow queries logged and reviewed regularly; HTTP requests analyzable by status, endpoint, and time
- Alertable — spikes in 404 volume and any 500 are detected immediately and routed to the right people
- Retained — At least 14 days of logs for debugging; status pages for incident communication
Before You Start
- Identify the monitoring stack in use (Prometheus, Datadog, CloudWatch, GCP Monitoring, etc.)
- Have CLI access to cloud providers (AWS, GCP, etc.)
- Have database access or credentials to check slow query settings
- Know where alerting is configured (PagerDuty, Opsgenie, Slack, etc.)
- Identify status page URLs if they exist
The Checklist
Infrastructure metrics
Every service has monitoring that collects CPU, memory, and disk metrics. The service inventory is documented from the CLI. Database connection pools and Redis are monitored specifically. A coverage matrix confirms there are no gaps.
“Which service has no monitoring right now?”
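A quick way to answer that question is to diff the service inventory against the set of services with metrics collection configured. A minimal sketch, with invented service names standing in for the output of your cloud CLI:

```python
# Hypothetical inventory (e.g. from your cloud provider's CLI) versus services
# that actually have metrics collection configured. Names are illustrative.
inventory = {"api", "worker", "postgres", "redis", "cache-proxy"}
monitored = {"api", "worker", "postgres"}

def coverage_gaps(inventory, monitored):
    """Return services in the inventory that have no metrics collection."""
    return sorted(inventory - monitored)

gaps = coverage_gaps(inventory, monitored)
print(gaps)  # any non-empty result is an audit finding
```

Record the result in the coverage matrix; an empty list is the pass condition.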
Slow queries
Slow query logging is enabled with a reasonable threshold. Query analysis tooling is available (pg_stat_statements, MySQL Performance Schema, etc.). A dashboard or report shows slow queries, or an audit record shows they are reviewed regularly.
“When did you last look at your slow query log?”
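The review itself can be as simple as filtering query statistics against the threshold. A sketch assuming rows shaped like pg_stat_statements output (query text, call count, mean execution time in milliseconds); the threshold and sample rows are invented:

```python
THRESHOLD_MS = 200.0  # assumed team-defined slow-query threshold

# Invented sample rows in the shape of pg_stat_statements output.
rows = [
    {"query": "SELECT * FROM orders WHERE user_id = $1", "calls": 9120, "mean_exec_time": 340.2},
    {"query": "SELECT id FROM users WHERE email = $1", "calls": 44100, "mean_exec_time": 1.7},
]

def slow_queries(rows, threshold_ms):
    """Return queries whose mean execution time exceeds the threshold, worst first."""
    slow = [r for r in rows if r["mean_exec_time"] > threshold_ms]
    return sorted(slow, key=lambda r: r["mean_exec_time"], reverse=True)

for r in slow_queries(rows, THRESHOLD_MS):
    print(f'{r["mean_exec_time"]:.1f} ms  x{r["calls"]}  {r["query"]}')
```

If no report like this exists and no review record can be produced, the item fails.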
HTTP request logging
HTTP requests are logged from any source (application, CDN, load balancer, or APM). Logs include timestamp, method, path, status code, and response time. An analysis tool exists to query and visualize them, with filtering by status code and visibility into traffic patterns.
“Could you tell me how many 4xx errors hit you yesterday?”
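If the team can't answer from a dashboard, the raw access logs should still make it answerable in a few lines. A minimal sketch that tallies requests by status class from Common Log Format lines; the sample lines are invented:

```python
import re
from collections import Counter

# Minimal parser for access-log lines; extracts method, path, and status code.
LOG_RE = re.compile(r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})')

# Invented sample lines in Common Log Format.
lines = [
    '203.0.113.7 - - [12/Mar/2024:10:01:55 +0000] "GET /api/items HTTP/1.1" 200 512',
    '203.0.113.8 - - [12/Mar/2024:10:02:03 +0000] "GET /api/missing HTTP/1.1" 404 87',
    '203.0.113.9 - - [12/Mar/2024:10:02:10 +0000] "POST /api/items HTTP/1.1" 500 43',
]

def status_class_counts(lines):
    """Count requests per status class (2xx, 4xx, 5xx, ...)."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.search(line)
        if m:
            counts[m.group("status")[0] + "xx"] += 1
    return counts

print(status_class_counts(lines))
```

The pass condition is that someone can produce yesterday's 4xx count, from whatever tooling, without guessing.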
Alerting
Alerts fire on a high volume of 404s (threshold defined by the team) and on any 500 within a 1-minute window. Alerts route to the appropriate channel, and alert delivery has been verified.
“Would a spike in 500s wake anyone up tonight?”
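The two alert conditions above reduce to a simple decision over a sliding window. A sketch of that logic; the 404 threshold and the sample events are invented:

```python
from datetime import datetime, timedelta

NOT_FOUND_THRESHOLD = 100  # assumed team-defined 404s per window

def should_alert(events, window_end, window=timedelta(minutes=1)):
    """Alert on any 500, or a 404 volume above threshold, inside the window.

    `events` is a list of (timestamp, status_code) pairs.
    """
    start = window_end - window
    in_window = [status for t, status in events if start <= t <= window_end]
    return 500 in in_window or in_window.count(404) > NOT_FOUND_THRESHOLD

now = datetime(2024, 3, 12, 10, 3, 0)
events = [(now - timedelta(seconds=30), 500)]
print(should_alert(events, now))  # True: a single 500 inside the window is enough
```

Verifying delivery matters as much as the rule itself: fire a test alert and confirm it reaches the on-call channel.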
Log retention
Log retention is configured with a minimum of two weeks, the retention policy is documented, and retention is consistent across systems.
“How far back can you investigate an incident?”
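An easy empirical check: find the oldest retained entry in each log store and confirm it covers at least 14 days. A sketch with illustrative dates:

```python
from datetime import date, timedelta

MIN_RETENTION_DAYS = 14

def retention_ok(oldest_entry, today):
    """True if the oldest retained log entry is at least 14 days old."""
    return (today - oldest_entry) >= timedelta(days=MIN_RETENTION_DAYS)

# Illustrative dates; in practice, query the oldest timestamp from each log store.
print(retention_ok(date(2024, 3, 1), date(2024, 3, 20)))  # True: 19 days retained
```

Run the same check against every system that logs independently; a single short-retention store breaks the "consistent across systems" criterion.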
Status pages
A production status page exists and is accessible; a staging status page is recommended. Uptime monitoring is configured to check health endpoints, and downtime alerts route to the appropriate channel.
“Do your customers know before you do when you're down?”
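The decision behind most uptime monitors is to declare downtime only after several consecutive failed health checks, so a single blip doesn't page anyone. A sketch of that logic with invented check results (True means the health endpoint responded):

```python
CONSECUTIVE_FAILURES_TO_ALERT = 3  # assumed threshold; tune to your check interval

def is_down(check_results, threshold=CONSECUTIVE_FAILURES_TO_ALERT):
    """True if the most recent `threshold` checks all failed."""
    recent = check_results[-threshold:]
    return len(recent) == threshold and not any(recent)

print(is_down([True, False, False, False]))  # True: three straight failures
print(is_down([False, False, True]))         # False: the latest check passed
```

Whatever tool implements this, verify the full path end to end: take the health endpoint down in staging and confirm the status page updates and the alert arrives.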