Section 12 · Observability
Monitoring
Audit checklist for monitoring infrastructure metrics, database performance, HTTP logging, alerting, log retention, and status pages.
This guide walks you through auditing a project's monitoring setup: confirming that infrastructure metrics are collected, database performance is tracked, HTTP requests are logged and alerted on, and status pages exist.
The Goal: No Blind Spots
You cannot fix what you cannot see. Complete observability means knowing when things break before users do, understanding why they broke, and having the data to prevent recurrence.
- Complete coverage — Every infrastructure component (compute, databases, caches) has metrics collection with no gaps
- Queryable — Slow queries logged and reviewed regularly; HTTP requests analyzable by status, endpoint, and time
- Alertable — spikes in 404 volume and any 500 are detected immediately and routed to the right people
- Retained — At least 14 days of logs for debugging; status pages for incident communication
Before You Start
- Identify the monitoring stack in use (Prometheus, Datadog, CloudWatch, GCP Monitoring, etc.)
- Have CLI access to cloud providers (AWS, GCP, etc.)
- Have database access or credentials to check slow query settings
- Know where alerting is configured (PagerDuty, Opsgenie, Slack, etc.)
- Identify status page URLs if they exist
The Checklist
Infrastructure metrics
Every service has monitoring that collects CPU, memory, and disk metrics. The service inventory is documented from the CLI. Database connection pools and Redis are monitored specifically. A coverage matrix confirms there are no gaps.
“Which service has no monitoring right now?”
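A quick way to answer that question is to diff the service inventory against the set of services with metrics collection configured. A minimal sketch, with invented service names standing in for the output of your cloud CLI:

```python
# Hypothetical inventory (e.g. from your cloud provider's CLI) versus services
# that actually have metrics collection configured. Names are illustrative.
inventory = {"api", "worker", "postgres", "redis", "cache-proxy"}
monitored = {"api", "worker", "postgres"}

def coverage_gaps(inventory, monitored):
    """Return services in the inventory that have no metrics collection."""
    return sorted(inventory - monitored)

gaps = coverage_gaps(inventory, monitored)
print(gaps)  # any non-empty result is an audit finding
```

Record the result in the coverage matrix; an empty list is the pass condition.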
Slow queries
Slow query logging is enabled with a reasonable threshold. Query analysis tooling is available (pg_stat_statements, MySQL Performance Schema, etc.). A dashboard or report shows slow queries, or an audit record shows they are reviewed regularly.
“When did you last look at your slow query log?”
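The review itself can be as simple as filtering query statistics against the threshold. A sketch assuming rows shaped like pg_stat_statements output (query text, call count, mean execution time in milliseconds); the threshold and sample rows are invented:

```python
THRESHOLD_MS = 200.0  # assumed team-defined slow-query threshold

# Invented sample rows in the shape of pg_stat_statements output.
rows = [
    {"query": "SELECT * FROM orders WHERE user_id = $1", "calls": 9120, "mean_exec_time": 340.2},
    {"query": "SELECT id FROM users WHERE email = $1", "calls": 44100, "mean_exec_time": 1.7},
]

def slow_queries(rows, threshold_ms):
    """Return queries whose mean execution time exceeds the threshold, worst first."""
    slow = [r for r in rows if r["mean_exec_time"] > threshold_ms]
    return sorted(slow, key=lambda r: r["mean_exec_time"], reverse=True)

for r in slow_queries(rows, THRESHOLD_MS):
    print(f'{r["mean_exec_time"]:.1f} ms  x{r["calls"]}  {r["query"]}')
```

If no report like this exists and no review record can be produced, the item fails.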
HTTP request logging
HTTP requests are logged from any source (application, CDN, load balancer, or APM). Logs include timestamp, method, path, status code, and response time. An analysis tool exists to query and visualize them, with filtering by status code and visibility into traffic patterns.
“Could you tell me how many 4xx errors hit you yesterday?”
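If the team can't answer from a dashboard, the raw access logs should still make it answerable in a few lines. A minimal sketch that tallies requests by status class from Common Log Format lines; the sample lines are invented:

```python
import re
from collections import Counter

# Minimal parser for access-log lines; extracts method, path, and status code.
LOG_RE = re.compile(r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})')

# Invented sample lines in Common Log Format.
lines = [
    '203.0.113.7 - - [12/Mar/2024:10:01:55 +0000] "GET /api/items HTTP/1.1" 200 512',
    '203.0.113.8 - - [12/Mar/2024:10:02:03 +0000] "GET /api/missing HTTP/1.1" 404 87',
    '203.0.113.9 - - [12/Mar/2024:10:02:10 +0000] "POST /api/items HTTP/1.1" 500 43',
]

def status_class_counts(lines):
    """Count requests per status class (2xx, 4xx, 5xx, ...)."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.search(line)
        if m:
            counts[m.group("status")[0] + "xx"] += 1
    return counts

print(status_class_counts(lines))
```

The pass condition is that someone can produce yesterday's 4xx count, from whatever tooling, without guessing.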
Alerting
Alerts fire on a high volume of 404s (threshold defined by the team) and on any 500 within a 1-minute window. Alerts route to the appropriate channel, and alert delivery has been verified.
“Would a spike in 500s wake anyone up tonight?”
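The two alert conditions above reduce to a simple decision over a sliding window. A sketch of that logic; the 404 threshold and the sample events are invented:

```python
from datetime import datetime, timedelta

NOT_FOUND_THRESHOLD = 100  # assumed team-defined 404s per window

def should_alert(events, window_end, window=timedelta(minutes=1)):
    """Alert on any 500, or a 404 volume above threshold, inside the window.

    `events` is a list of (timestamp, status_code) pairs.
    """
    start = window_end - window
    in_window = [status for t, status in events if start <= t <= window_end]
    return 500 in in_window or in_window.count(404) > NOT_FOUND_THRESHOLD

now = datetime(2024, 3, 12, 10, 3, 0)
events = [(now - timedelta(seconds=30), 500)]
print(should_alert(events, now))  # True: a single 500 inside the window is enough
```

Verifying delivery matters as much as the rule itself: fire a test alert and confirm it reaches the on-call channel.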
Log retention
Log retention is configured with a minimum of two weeks, the retention policy is documented, and retention is consistent across systems.
“How far back can you investigate an incident?”
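An easy empirical check: find the oldest retained entry in each log store and confirm it covers at least 14 days. A sketch with illustrative dates:

```python
from datetime import date, timedelta

MIN_RETENTION_DAYS = 14

def retention_ok(oldest_entry, today):
    """True if the oldest retained log entry is at least 14 days old."""
    return (today - oldest_entry) >= timedelta(days=MIN_RETENTION_DAYS)

# Illustrative dates; in practice, query the oldest timestamp from each log store.
print(retention_ok(date(2024, 3, 1), date(2024, 3, 20)))  # True: 19 days retained
```

Run the same check against every system that logs independently; a single short-retention store breaks the "consistent across systems" criterion.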
Status pages
A production status page exists and is accessible; a staging status page is recommended. Uptime monitoring is configured to check health endpoints, and downtime alerts route to the appropriate channel.
“Do your customers know before you do when you're down?”
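The decision behind most uptime monitors is to declare downtime only after several consecutive failed health checks, so a single blip doesn't page anyone. A sketch of that logic with invented check results (True means the health endpoint responded):

```python
CONSECUTIVE_FAILURES_TO_ALERT = 3  # assumed threshold; tune to your check interval

def is_down(check_results, threshold=CONSECUTIVE_FAILURES_TO_ALERT):
    """True if the most recent `threshold` checks all failed."""
    recent = check_results[-threshold:]
    return len(recent) == threshold and not any(recent)

print(is_down([True, False, False, False]))  # True: three straight failures
print(is_down([False, False, True]))         # False: the latest check passed
```

Whatever tool implements this, verify the full path end to end: take the health endpoint down in staging and confirm the status page updates and the alert arrives.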