HA-002: Multi-region server deployment with failover

Severity: Recommended (Critical when significant money is at stake)

Production servers should be deployed across multiple regions or data centers, with the ability to fail traffic over to a healthy region if one goes down.

Check automatically:

Check for multi-region deployment:

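# AWS - instances in every enabled region (describe-instances queries only one region per call)
for region in $(aws ec2 describe-regions --query "Regions[].RegionName" --output text); do
  aws ec2 describe-instances --region "$region" --query "Reservations[].Instances[].{ID:InstanceId,AZ:Placement.AvailabilityZone,State:State.Name}" --output table
done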

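# Load balancers in the current region (ALB/NLB are regional, not global)
aws elbv2 describe-load-balancers --query "LoadBalancers[].{Name:LoadBalancerName,Type:Type,Scheme:Scheme}" --output table
# Global Accelerator (its API is served from us-west-2)
aws globalaccelerator list-accelerators --region us-west-2 --query "Accelerators[].{Name:Name,Status:Status,Enabled:Enabled}" --output table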

# GCP - instances across regions
gcloud compute instances list --format="table(name,zone,status)"

# Kubernetes nodes across zones
kubectl get nodes -o custom-columns=NAME:.metadata.name,ZONE:.metadata.labels."topology\.kubernetes\.io/zone"
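
The listings above print zones, and the check is really about how many distinct regions those zones belong to. A minimal sketch of that reduction follows; `zones_to_regions` is a hypothetical helper name, and it assumes standard AWS-style (`us-east-1a`) or GCP-style (`us-central1-a`) zone names, so Local Zones and other naming schemes would need their own handling.

```shell
# Hypothetical helper: reads one zone name per line on stdin and prints the
# distinct regions. Assumes AWS-style (us-east-1a) or GCP-style (us-central1-a)
# zone names; other naming schemes are not handled.
zones_to_regions() {
  sed -E '/^$/d; s/-[a-z]$//; s/([0-9])[a-z]$/\1/' | sort -u
}

# Possible usage (commented out because it needs cloud credentials):
#   aws ec2 describe-instances \
#     --query "Reservations[].Instances[].Placement.AvailabilityZone" \
#     --output text | tr '\t' '\n' | zones_to_regions
#   gcloud compute instances list --format="value(zone.basename())" | zones_to_regions
```

More than one line of output suggests the servers span multiple regions; a single line means the deployment is at best multi-AZ.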

Check Terraform/IaC for multi-region:

# Look for multiple region definitions
grep -rE "region[[:space:]]*=|availability_zone|location[[:space:]]*=" --include="*.tf" . 2>/dev/null | sort -u

# Check for global load balancer resources
grep -rE "aws_globalaccelerator|google_compute_global|azurerm_frontdoor|cloudflare_load_balancer" --include="*.tf" . 2>/dev/null
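
To turn the raw grep output into a list of distinct regions, something like the following could be used. It is only a sketch: `tf_regions` is a hypothetical helper name, and it matches literal quoted values only, so regions supplied through variables (for example `region = var.region`) will not appear.

```shell
# Hypothetical helper: prints each distinct quoted value assigned to `region`
# or `location` in the .tf files under a directory (default: current directory).
# Values set through variables or locals are not resolved.
tf_regions() {
  grep -rhoE '(region|location)[[:space:]]*=[[:space:]]*"[^"]+"' \
    --include='*.tf' "${1:-.}" 2>/dev/null |
    sed -E 's/.*"([^"]+)"/\1/' | sort -u
}
```

Running `tf_regions infra/` (the directory name is illustrative) and seeing two or more values is a hint, not proof, that the configuration targets multiple regions.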

Check Kubernetes for multi-zone:

# Node distribution (region and zone shown as label columns)
kubectl get nodes -L topology.kubernetes.io/region,topology.kubernetes.io/zone

# Pod anti-affinity rules (spread across zones)
grep -rE "topologySpreadConstraints|podAntiAffinity" --include="*.yaml" --include="*.yml" . 2>/dev/null
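
The grep above shows where spread rules exist; the inverse question, which workloads have none, is often more useful. The sketch below lists such manifests. `manifests_without_spread` is a hypothetical helper name, and it relies on plain text matching rather than YAML parsing, so multi-document files and templated charts may be misreported.

```shell
# Hypothetical helper: lists YAML manifests under a directory (default: .)
# that define a Deployment or StatefulSet but mention neither
# topologySpreadConstraints nor podAntiAffinity. Text match only, no YAML parsing.
manifests_without_spread() {
  grep -rlE '^kind:[[:space:]]*(Deployment|StatefulSet)' \
    --include='*.yaml' --include='*.yml' "${1:-.}" 2>/dev/null |
    while IFS= read -r f; do
      grep -qE 'topologySpreadConstraints|podAntiAffinity' "$f" || printf '%s\n' "$f"
    done
}
```

Any file it prints is worth a manual look, since the scheduler may place all of that workload's pods in one zone.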

Check for DNS failover:

# Cloudflare load balancing (set CLOUDFLARE_ZONE_ID to the ID of the zone being checked)
curl -s "https://api.cloudflare.com/client/v4/zones/$CLOUDFLARE_ZONE_ID/load_balancers" \
  -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" 2>/dev/null | jq '.result[] | {name, proxied, default_pools, fallback_pool}'

# Route53 health checks (DNS failover)
aws route53 list-health-checks --query "HealthChecks[].{Id:Id,Type:HealthCheckConfig.Type}" --output table
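
Health checks alone do not move traffic; Route53 failover routing also needs a PRIMARY and a SECONDARY record for each name. The following sketch flags names where one half of the pair is missing. `incomplete_failover_pairs` and `HOSTED_ZONE_ID` are hypothetical names, and the input is assumed to be tab-separated `name` and `failover role` columns, as produced by the AWS CLI with `--output text`.

```shell
# Hypothetical helper: reads "<record name><TAB><failover role>" lines on stdin
# and prints every name that lacks either a PRIMARY or a SECONDARY record.
incomplete_failover_pairs() {
  awk -F'\t' '
    $2 == "PRIMARY"   { p[$1] = 1; seen[$1] = 1 }
    $2 == "SECONDARY" { s[$1] = 1; seen[$1] = 1 }
    END { for (n in seen) if (!(p[n] && s[n])) print n }
  ' | sort
}

# Possible usage (commented out because it needs AWS credentials):
#   aws route53 list-resource-record-sets --hosted-zone-id "$HOSTED_ZONE_ID" \
#     --query "ResourceRecordSets[?Failover].[Name,Failover]" \
#     --output text | incomplete_failover_pairs
```

Empty output means every failover-routed name has both roles; no input at all means the zone does not use failover routing.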

Check for container orchestration multi-region:

# ECS clusters (listed per region; repeat with --region for each region in use)
aws ecs list-clusters --query "clusterArns"

# Check Fly.io regions
grep -rE "primary_region|regions\s*=" fly.toml 2>/dev/null

# Check Railway/Render multi-region config
grep -rE "region|replicas" --include="*.toml" --include="railway.json" --include="render.yaml" . 2>/dev/null

Ask user:

"Are your production servers deployed across multiple regions or data centers?"
"What happens if one region/data center goes down?"
"How quickly can you spin up servers in a different region if needed?"

Acceptable alternatives to multi-region:

Single region with documented quick-failover capability (can deploy elsewhere within hours)
Multi-AZ within single region (less resilient but acceptable for smaller projects)
PaaS with built-in regional failover (Vercel, Cloudflare Workers)

Cross-reference with:

HA-001 (database HA - both layers need resilience)
MON-006 (status pages - should reflect regional status)
Section 34 (Rollback & Recovery - RTO/RPO)

Pass criteria:

All of the following must hold:
Servers in 2+ regions/data centers, OR a single region with multi-AZ AND a documented quick-failover capability
Traffic can route away from a failed region or zone (load balancer, DNS failover)
Failover tested or documented
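
The pass criteria can be written down as a small decision function. This is a sketch of one reading, not an official scoring rule: it assumes the placement condition (2+ regions, or multi-AZ plus documented quick failover) must hold together with traffic routing and with tested or documented failover. `ha002_verdict` and its argument order are hypothetical.

```shell
# Hypothetical helper encoding one reading of the pass criteria.
# Args: <region_count> <az_count> <quick_failover_documented: yes|no>
#       <traffic_routing: yes|no> <failover_tested_or_documented: yes|no>
# The body runs in a subshell so its variables do not leak into the caller.
ha002_verdict() (
  regions=$1 azs=$2 quick_failover=$3 routing=$4 verified=$5
  placement=no
  if [ "$regions" -ge 2 ]; then
    placement=yes
  elif [ "$azs" -ge 2 ] && [ "$quick_failover" = yes ]; then
    placement=yes
  fi
  if [ "$placement" = yes ] && [ "$routing" = yes ] && [ "$verified" = yes ]; then
    echo PASS
  else
    echo FAIL
  fi
)
```

For example, `ha002_verdict 3 6 no no yes` yields FAIL, which matches the case of multi-region infrastructure with no traffic routing.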

Fail criteria:

Single region, single AZ deployment with no failover plan
"We'd figure it out if it happened"
Multi-region infrastructure deployed but no mechanism to route traffic away from a failed region

Evidence to capture:

Regions/zones where servers are deployed
Failover mechanism (load balancer, DNS, manual)
RTO for regional failover
Last failover test date (if any)
