HA-001 recommended High Availability

Production database HA configured

Database has automatic failover to standby; Multi-AZ, regional HA, or replication configured; failover tested

Question to ask

"Your DB goes down at 2am — what happens next?"

Verification guide

Severity: Recommended (Critical when serious money involved)

Production databases should have automatic failover to a standby instance. If the primary fails, the standby is promoted with minimal downtime.

Check automatically:

  1. AWS RDS Multi-AZ:
# Check Multi-AZ status
aws rds describe-db-instances --query "DBInstances[].{ID:DBInstanceIdentifier,MultiAZ:MultiAZ,Engine:Engine}" --output table

# Check for read replicas (can be promoted)
aws rds describe-db-instances --query "DBInstances[?ReadReplicaSourceDBInstanceIdentifier!=null].{ID:DBInstanceIdentifier,Source:ReadReplicaSourceDBInstanceIdentifier}"
  1. GCP Cloud SQL HA:
# Check availability type (REGIONAL = HA, ZONAL = no HA)
gcloud sql instances list --format="table(name,availabilityType,region)"

# Detailed HA config
gcloud sql instances describe INSTANCE_NAME --format="get(settings.availabilityType)"
  1. Azure SQL:
# Check zone redundancy
az sql db show --name DB_NAME --server SERVER_NAME --query "{name:name,zoneRedundant:zoneRedundant}"
  1. Check Terraform/IaC:
# AWS RDS
grep -rE "multi_az\s*=\s*true" --include="*.tf" 2>/dev/null

# GCP Cloud SQL
grep -rE "availability_type\s*=\s*\"REGIONAL\"" --include="*.tf" 2>/dev/null

# Self-hosted replication
grep -rE "streaming_replication|primary_conninfo|hot_standby" --include="*.tf" --include="*.conf" --include="*.yml" 2>/dev/null
  1. For self-hosted databases:
# PostgreSQL streaming replication
grep -rE "primary_conninfo|recovery_target|standby_mode|hot_standby" --include="*.conf" 2>/dev/null

# MySQL replication
grep -rE "server-id|log_bin|relay-log|read_only" --include="*.cnf" --include="*.conf" 2>/dev/null

# Check Docker Compose for replication setup
grep -rE "replication|replica|standby|primary" docker-compose*.yml 2>/dev/null

Ask user:

  • "Is your production database configured for high availability?"
  • "What happens if the primary database node fails?"
  • "Have you tested database failover?"

Cross-reference with:

  • HA-003 (backups - HA doesn't replace backups)
  • HA-005 (PITR - complementary to HA)
  • DB-001 (connection pooling - may need reconfiguration during failover)

Pass criteria:

  • Database has automatic failover to standby (Multi-AZ, REGIONAL, replication)
  • Failover has been tested or documented
  • RTO (Recovery Time Objective) is acceptable for business

Fail criteria:

  • Single database instance with no standby
  • "We've never tested failover"
  • HA configured but never verified working

Evidence to capture:

  • Database HA configuration (Multi-AZ, availability type, replication mode)
  • Failover RTO (expected downtime during failover)
  • Last failover test date (if any)

Section

26. High Availability & Backups

High Availability & DR