Section 26 · High Availability & DR
High Availability & Backups
High availability configuration for databases and servers, backup strategies, point-in-time recovery, and off-site storage
This guide walks you through auditing a project's high availability configuration and backup strategy, so that you can confirm production systems will survive failures and data can be recovered.
The Goal: Survivable Infrastructure
Production systems must survive failures at every level, from individual nodes to entire cloud providers. That requires redundancy that has been shown to work, not redundancy that is merely configured.
- Automatic failover — databases and servers recover without human intervention when primary nodes fail
- Regional resilience — infrastructure spans multiple regions or availability zones with proper traffic routing
- Verified backups — automated backups run successfully with appropriate retention, not just configured but tested
- Off-site protection — backups stored with a separate provider to survive provider-wide failures
- Point-in-time recovery — restore to any moment, not just the last daily snapshot, with windows aligned to RPO/RTO
Before You Start
- Identify database type and hosting (RDS, Cloud SQL, self-hosted PostgreSQL/MySQL, etc.)
- Identify cloud provider(s) (AWS, GCP, Azure, etc.)
- Understand the project's scale: if serious money is involved, treat HA items as Critical severity
- Get access to cloud console/CLI for verification commands
High Availability
Database has automatic failover to standby; Multi-AZ, regional HA, or replication configured; failover tested
“Your DB goes down at 2am — what happens next?”
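One way to turn this item into a repeatable check is sketched below. It assumes the database is described by a dict shaped like one entry of the `DBInstances` list returned by the AWS RDS `describe_db_instances` call (only the `DBInstanceIdentifier` and `MultiAZ` keys are read). The `last_failover_test` argument and the 180-day threshold are hypothetical: the date would come from the team's own records, and the threshold is an assumed policy, not something this guide prescribes.

```python
from datetime import date, timedelta
from typing import Any, Dict, List, Optional

# Assumed policy: a failover test older than this counts as stale.
MAX_FAILOVER_TEST_AGE = timedelta(days=180)


def failover_findings(
    db: Dict[str, Any],
    last_failover_test: Optional[date],
    today: date,
) -> List[str]:
    """Return audit findings for one database; an empty list means the item passes."""
    name = db.get("DBInstanceIdentifier", "<unknown>")
    findings = []
    # A missing key is treated the same as a disabled standby.
    if not db.get("MultiAZ", False):
        findings.append(f"{name}: no standby configured (MultiAZ is off)")
    if last_failover_test is None:
        findings.append(f"{name}: failover has never been tested")
    elif today - last_failover_test > MAX_FAILOVER_TEST_AGE:
        findings.append(f"{name}: last failover test is older than the assumed limit")
    return findings
```

For other hosting options (Cloud SQL, self-hosted replication), the same function shape applies, but the field that indicates a standby will differ.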
Servers in 2+ regions/data centers OR single region with quick-failover capability; traffic can route away from failed region
“One region goes dark — how long until users notice?”
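The multi-region half of this item can be reasoned about by simulating the loss of each region in turn. The sketch below assumes a hypothetical inventory format: a list of dicts with a `region` string and a `healthy` flag. It reports every region whose loss would leave no healthy server anywhere else. It does not cover the single-region quick-failover alternative, which has to be verified separately.

```python
from collections import Counter
from typing import Any, Dict, List


def regions_that_cause_outage(servers: List[Dict[str, Any]]) -> List[str]:
    """Return the regions whose loss would leave zero healthy servers elsewhere."""
    healthy_per_region = Counter(s["region"] for s in servers if s["healthy"])
    outage_regions = []
    for region in sorted({s["region"] for s in servers}):
        # Count healthy servers that would survive if this region went dark.
        remaining = sum(n for r, n in healthy_per_region.items() if r != region)
        if remaining == 0:
            outage_regions.append(region)
    return outage_regions
```

An empty result means traffic has somewhere to go after any single-region failure. Whether routing actually moves it there still needs to be tested.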
Backups
Automated backups enabled; retention period defined (minimum 7 days); backups verified running; restore tested
“When did you last verify a backup actually restores?”
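The "retention defined" and "verified running" parts of this item can be checked mechanically. A minimal sketch follows, assuming snapshot records shaped like the `DBSnapshots` entries from the AWS RDS `describe_db_snapshots` call (only `Status` and `SnapshotCreateTime` are read). The 26-hour freshness limit is an assumption for daily backups with some slack. A restore test cannot be inferred from this data and still has to be performed by hand.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List

MIN_RETENTION_DAYS = 7                 # minimum stated in the checklist item
MAX_BACKUP_AGE = timedelta(hours=26)   # assumed: daily backups plus slack


def backup_findings(
    retention_days: int,
    snapshots: List[Dict[str, Any]],
    now: datetime,
) -> List[str]:
    """Return audit findings; an empty list means backups look healthy."""
    findings = []
    if retention_days < MIN_RETENTION_DAYS:
        findings.append(
            f"retention is {retention_days} days, below the {MIN_RETENTION_DAYS}-day minimum"
        )
    # Only completed snapshots count as evidence that backups are running.
    completed = [s["SnapshotCreateTime"] for s in snapshots if s["Status"] == "available"]
    if not completed:
        findings.append("no completed backup found")
    elif now - max(completed) > MAX_BACKUP_AGE:
        findings.append("latest completed backup is older than the assumed freshness limit")
    return findings
```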
Backups stored with different provider than production (not just cross-region); sync automated; external restore tested
“If AWS went down, where are your backups?”
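The distinction between cross-region and cross-provider copies is easy to miss in an inventory. The sketch below makes it explicit. The record format is hypothetical (a `provider` name and a `last_sync` timestamp per backup destination), and the one-day sync limit is an assumed default. A destination counts as off-site only if its provider differs from production and its last sync is recent.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List


def valid_offsite_copies(
    production_provider: str,
    destinations: List[Dict[str, Any]],
    now: datetime,
    max_sync_age: timedelta = timedelta(days=1),  # assumed sync freshness limit
) -> List[Dict[str, Any]]:
    """Return destinations that are on another provider and recently synced."""
    return [
        d
        for d in destinations
        # Another region of the same provider does not count as off-site.
        if d["provider"].lower() != production_provider.lower()
        and now - d["last_sync"] <= max_sync_age
    ]
```

If the returned list is empty, the item fails: either every copy lives with the production provider, or the off-site sync has stopped running.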
PITR enabled for production database; recovery window appropriate (7-35 days); team knows how to perform PITR restore
“Bad deploy corrupts data — how far back can you go?”
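To answer the question concretely, it helps to compute whether a given moment is still restorable. The sketch below is illustrative and rests on two assumptions: the recovery window equals the backup retention period in days, with 0 meaning automated backups are disabled (as with the RDS `BackupRetentionPeriod` setting), and the service reports a latest restorable time (on RDS, the `LatestRestorableTime` field). Function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List

PITR_MIN_DAYS, PITR_MAX_DAYS = 7, 35  # window from the checklist item


def pitr_findings(retention_days: int) -> List[str]:
    """Return findings about the configured recovery window."""
    if retention_days == 0:
        return ["PITR unavailable: automated backups are disabled"]
    if not PITR_MIN_DAYS <= retention_days <= PITR_MAX_DAYS:
        return [
            f"recovery window of {retention_days} days is outside "
            f"{PITR_MIN_DAYS}-{PITR_MAX_DAYS} days"
        ]
    return []


def can_restore_to(
    target: datetime,
    now: datetime,
    retention_days: int,
    latest_restorable: datetime,
) -> bool:
    """True if `target` lies inside the current recovery window."""
    earliest = now - timedelta(days=retention_days)
    return earliest <= target <= latest_restorable
```

For example, with 7-day retention a corruption noticed 10 days after the bad deploy can no longer be rolled back to the moment before it, which is one reason to size the window around how long problems take to surface.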
Backup window intentional (low-traffic period); frequency aligns with business RPO; no performance impact during backups
“How much data are you willing to lose in a disaster?”
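Two parts of this item reduce to simple arithmetic. If snapshots are the only recovery mechanism, the worst-case data loss is roughly the interval between backups, so that interval should not exceed the RPO. The backup window should also not overlap peak traffic. The sketch below assumes windows are written as `HH:MM-HH:MM` in one shared time zone (the format RDS uses for `PreferredBackupWindow`, in UTC). The peak-hours string and the function names are hypothetical inputs the auditor would supply.

```python
from datetime import timedelta
from typing import Set

MINUTES_PER_DAY = 24 * 60


def _minutes(start: str, end: str) -> Set[int]:
    """Expand an HH:MM start/end pair into the set of minutes of the day it covers."""
    sh, sm = map(int, start.split(":"))
    eh, em = map(int, end.split(":"))
    s, e = sh * 60 + sm, eh * 60 + em
    if e <= s:
        e += MINUTES_PER_DAY  # the window wraps past midnight
    return {m % MINUTES_PER_DAY for m in range(s, e)}


def window_overlaps_peak(backup_window: str, peak_hours: str) -> bool:
    """True if the backup window shares any minute with the peak period."""
    return bool(_minutes(*backup_window.split("-")) & _minutes(*peak_hours.split("-")))


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if worst-case loss (one full backup interval) fits within the RPO."""
    return backup_interval <= rpo
```

With PITR enabled the effective data loss is usually much smaller than one backup interval, so `meets_rpo` is a conservative check for the snapshot-only case.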