HA-002 recommended High Availability
Multi-region server deployment with failover
Servers in 2+ regions/data centers OR single region with quick-failover capability; traffic can route away from failed region
Question to ask
"One region goes dark — how long until users notice?"
Verification guide
Severity: Recommended (Critical when significant funds or revenue are at stake)
Production servers should be deployed across multiple regions/data centers with the ability to failover traffic if one region goes down.
Check automatically:
- Check for multi-region deployment:
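# AWS - instances in every enabled region (describe-instances only covers one region per call)
for region in $(aws ec2 describe-regions --query "Regions[].RegionName" --output text); do
  aws ec2 describe-instances --region "$region" --query "Reservations[].Instances[].{ID:InstanceId,AZ:Placement.AvailabilityZone,State:State.Name}" --output table
done
# Regional load balancers (ALB/NLB live in one region; repeat with --region for each region in use)
aws elbv2 describe-load-balancers --query "LoadBalancers[].{Name:LoadBalancerName,Type:Type,Scheme:Scheme}" --output table
# Global entry point: Global Accelerator (its API is served from us-west-2)
aws globalaccelerator list-accelerators --region us-west-2 --query "Accelerators[].{Name:Name,Status:Status,Enabled:Enabled}" --output table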
# GCP - instances across regions
gcloud compute instances list --format="table(name,zone.basename(),status)"
# Kubernetes nodes across zones
kubectl get nodes -o custom-columns=NAME:.metadata.name,ZONE:.metadata.labels."topology\.kubernetes\.io/zone"
- Check Terraform/IaC for multi-region:
# Look for multiple region definitions
grep -rE "region\s*=|availability_zone|location\s*=" --include="*.tf" . 2>/dev/null | sort -u
# Check for global load balancer resources
grep -rE "aws_globalaccelerator|google_compute_global|azurerm_frontdoor|azurerm_cdn_frontdoor|cloudflare_load_balancer" --include="*.tf" . 2>/dev/null
- Check Kubernetes for multi-zone:
# Node distribution
kubectl get nodes --show-labels | grep -E "zone|region"
# Pod anti-affinity rules (spread across zones)
grep -rE "topologySpreadConstraints|podAntiAffinity" --include="*.yaml" --include="*.yml" . 2>/dev/null
- Check for DNS failover:
# Cloudflare load balancing
curl -sX GET "https://api.cloudflare.com/client/v4/zones/{zone_id}/load_balancers" \
-H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" 2>/dev/null | jq '.result[] | {name, proxied, default_pools, fallback_pool}'
# Route53 health checks (DNS failover)
aws route53 list-health-checks --query "HealthChecks[].{Id:Id,Type:HealthCheckConfig.Type}" --output table
- Check for container orchestration multi-region:
# ECS clusters (listed per region; repeat with --region for each region in use)
aws ecs list-clusters --query "clusterArns"
# Check Fly.io regions
grep -rE "primary_region|regions\s*=" fly.toml 2>/dev/null
# Check Railway/Render multi-region config (railway.toml, railway.json, render.yaml)
grep -rE "region|replicas" --include="*.toml" --include="railway.json" --include="render.yaml" . 2>/dev/null
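The commands above print raw listings; a reviewer still has to work out how many regions and zones are actually in use. The helpers below are a minimal sketch of that summarizing step. The function names (`distinct_regions`, `nodes_per_zone`, `tf_regions`) are hypothetical and not part of any cloud CLI, and the zone-name parsing assumes the standard AWS and GCP naming patterns.

```shell
# Helper sketches for summarizing the raw listings. All function names are
# hypothetical; nothing here is provided by aws, gcloud, or kubectl.

# Reads zone names on stdin and prints the distinct regions they imply.
# Assumes standard naming: AWS "us-east-1a" -> "us-east-1",
# GCP "us-central1-a" -> "us-central1". AWS Local Zones and Azure's
# numeric zones do not follow this pattern and need separate handling.
distinct_regions() {
  sed -E 's/-?[a-z]$//' | awk 'NF' | sort -u
}

# Reads "NODE ZONE" lines on stdin (custom-columns output without its
# header) and prints "ZONE COUNT" per zone. Nodes without a zone label
# show up under "<none>".
nodes_per_zone() {
  awk 'NF >= 2 { n[$2]++ } END { for (z in n) print z, n[z] }' | sort
}

# Prints the distinct quoted region values found in *.tf files under a
# directory (default: current directory). Regions set through variables
# (region = var.region) are not resolved.
tf_regions() {
  grep -rhoE 'region[[:space:]]*=[[:space:]]*"[^"]+"' --include='*.tf' "${1:-.}" 2>/dev/null |
    sed -E 's/.*"([^"]+)"$/\1/' | sort -u
}
```

Possible usage: pipe `gcloud compute instances list --format="value(zone.basename())"` into `distinct_regions`, or pipe the `kubectl get nodes` custom-columns command with `--no-headers` added into `nodes_per_zone`. More than one line from `distinct_regions` suggests multi-region; a single zone from `nodes_per_zone` suggests no zone redundancy at all.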
Ask user:
- "Are your production servers deployed across multiple regions or data centers?"
- "What happens if one region/data center goes down?"
- "How quickly can you spin up servers in a different region if needed?"
Acceptable alternatives to multi-region:
- Single region with documented quick-failover capability (can deploy elsewhere within hours)
- Multi-AZ within single region (less resilient but acceptable for smaller projects)
- PaaS with built-in regional failover (Vercel, Cloudflare Workers)
Cross-reference with:
- HA-001 (database HA - both layers need resilience)
- MON-006 (status pages - should reflect regional status)
- Section 34 (Rollback & Recovery - RTO/RPO)
Pass criteria (all of the following):
- Servers in 2+ regions/data centers, OR single region with multi-AZ AND documented quick-failover capability
- Traffic can route away from failed region (load balancer, DNS failover)
- Failover tested or documented
Fail criteria:
- Single region, single AZ deployment with no failover plan
- "We'd figure it out if it happened"
- Multi-region configured but no traffic routing
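The pass and fail criteria can be read as a small decision procedure. The sketch below is one interpretation, not an official scoring rule: `ha002_verdict` and its four arguments are hypothetical, and the case the criteria do not explicitly cover (single region, single AZ, but with a documented failover plan) is reported as REVIEW rather than guessed.

```shell
# Usage: ha002_verdict REGIONS AZS ROUTING PLAN
#   REGIONS  number of regions/data centers serving production
#   AZS      number of availability zones in use
#   ROUTING  yes|no  traffic can route away from a failed region/zone
#   PLAN     yes|no  failover is tested or documented
# Exit status: 0 = PASS, 1 = FAIL, 2 = REVIEW (not covered by the criteria).
ha002_verdict() {
  v_regions=$1; v_azs=$2; v_routing=$3; v_plan=$4

  if [ "$v_regions" -ge 2 ] && [ "$v_routing" != yes ]; then
    echo "FAIL: multi-region configured but no traffic routing"
    return 1
  fi
  if [ "$v_plan" != yes ]; then
    echo "FAIL: no tested or documented failover plan"
    return 1
  fi
  if [ "$v_routing" != yes ]; then
    echo "FAIL: traffic cannot route away from a failed region or zone"
    return 1
  fi
  if [ "$v_regions" -ge 2 ]; then
    echo "PASS: multi-region with traffic routing and a failover plan"
    return 0
  fi
  if [ "$v_azs" -ge 2 ]; then
    echo "PASS: single region, multi-AZ, documented failover"
    return 0
  fi
  echo "REVIEW: single region, single AZ; check the documented quick-failover plan"
  return 2
}
```

For example, `ha002_verdict 1 3 yes yes` reports a pass for a single-region, three-AZ deployment with routing and a documented plan, while `ha002_verdict 2 4 no yes` fails because the second region cannot receive traffic.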
Evidence to capture:
- Regions/zones where servers are deployed
- Failover mechanism (load balancer, DNS, manual)
- RTO for regional failover
- Last failover test date (if any)
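To put a number on the RTO during a failover test, one option is to start a timer when the region is taken down and poll a health probe until it succeeds again. The sketch below assumes a POSIX shell; `seconds_until_healthy` is a hypothetical helper, and the probe is whatever command you pass in (the URL in the usage note is a placeholder).

```shell
# Usage: seconds_until_healthy MAX_SECONDS COMMAND [ARGS...]
# Runs COMMAND about once per second until it succeeds, then prints the
# elapsed whole seconds. Returns 1 if MAX_SECONDS pass without success.
seconds_until_healthy() {
  rto_max=$1
  shift
  rto_start=$(date +%s)
  while :; do
    if "$@" >/dev/null 2>&1; then
      echo $(( $(date +%s) - rto_start ))
      return 0
    fi
    if [ $(( $(date +%s) - rto_start )) -ge "$rto_max" ]; then
      return 1
    fi
    sleep 1
  done
}
```

Possible usage during a drill: `seconds_until_healthy 600 curl -fsS --max-time 5 https://app.example.com/healthz`. The printed value is only a lower bound on what users experience, since DNS caching and client retries can extend the outage beyond the first successful probe.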