HA-001 recommended High Availability

Production database HA configured

Database has automatic failover to standby; Multi-AZ, regional HA, or replication configured; failover tested

Question to ask

"Your DB goes down at 2am — what happens next?"

Verification guide

Severity: Recommended (Critical when serious money involved)

Production databases should have automatic failover to a standby instance. If the primary fails, the standby is promoted with minimal downtime.

Check automatically:

AWS RDS Multi-AZ:

# Check Multi-AZ status
aws rds describe-db-instances --query "DBInstances[].{ID:DBInstanceIdentifier,MultiAZ:MultiAZ,Engine:Engine}" --output table

# Check for read replicas (can be promoted)
aws rds describe-db-instances --query "DBInstances[?ReadReplicaSourceDBInstanceIdentifier!=null].{ID:DBInstanceIdentifier,Source:ReadReplicaSourceDBInstanceIdentifier}"

GCP Cloud SQL HA:

# Check availability type (REGIONAL = HA, ZONAL = no HA)
gcloud sql instances list --format="table(name,availabilityType,region)"

# Detailed HA config
gcloud sql instances describe INSTANCE_NAME --format="get(settings.availabilityType)"

Azure SQL:

# Check zone redundancy
az sql db show --name DB_NAME --server SERVER_NAME --query "{name:name,zoneRedundant:zoneRedundant}"

Check Terraform/IaC:

# AWS RDS
grep -rE "multi_az\s*=\s*true" --include="*.tf" 2>/dev/null

# GCP Cloud SQL
grep -rE "availability_type\s*=\s*\"REGIONAL\"" --include="*.tf" 2>/dev/null

# Self-hosted replication
grep -rE "streaming_replication|primary_conninfo|hot_standby" --include="*.tf" --include="*.conf" --include="*.yml" 2>/dev/null

For self-hosted databases:

# PostgreSQL streaming replication
grep -rE "primary_conninfo|recovery_target|standby_mode|hot_standby" --include="*.conf" 2>/dev/null

# MySQL replication
grep -rE "server-id|log_bin|relay-log|read_only" --include="*.cnf" --include="*.conf" 2>/dev/null

# Check Docker Compose for replication setup
grep -rE "replication|replica|standby|primary" docker-compose*.yml 2>/dev/null

Ask user:

"Is your production database configured for high availability?"
"What happens if the primary database node fails?"
"Have you tested database failover?"

Cross-reference with:

HA-003 (backups - HA doesn't replace backups)
HA-005 (PITR - complementary to HA)
DB-001 (connection pooling - may need reconfiguration during failover)

Pass criteria:

Database has automatic failover to standby (Multi-AZ, REGIONAL, replication)
Failover has been tested or documented
RTO (Recovery Time Objective) is acceptable for business

Fail criteria:

Single database instance with no standby
"We've never tested failover"
HA configured but never verified working

Evidence to capture:

Database HA configuration (Multi-AZ, availability type, replication mode)
Failover RTO (expected downtime during failover)
Last failover test date (if any)

Section

26. High Availability & Backups

High Availability & DR