Section 26 · High Availability & DR

High Availability & Backups

High availability configuration for databases and servers, backup strategies, point-in-time recovery, and off-site storage

6 items: 2 critical, 4 recommended

This guide walks you through auditing a project's high availability configuration and backup strategy, so you can confirm that production systems survive failures and that data can be recovered.

The Goal: Survivable Infrastructure

Production systems must survive failures at every level, from individual nodes to entire cloud providers. That takes redundancy that actually works when it is needed.

Automatic failover: databases and servers recover without human intervention when primary nodes fail
Regional resilience: infrastructure spans multiple regions or availability zones, with traffic routed away from failures
Verified backups: automated backups run successfully with appropriate retention, and restores are tested, not just configured
Off-site protection: backups are stored with a separate provider so they survive provider-wide failures
Point-in-time recovery: restore to any moment within the recovery window, not just the last daily snapshot, with the window aligned to RPO/RTO

Before You Start

Identify database type and hosting (RDS, Cloud SQL, self-hosted PostgreSQL/MySQL, etc.)
Identify cloud provider(s) (AWS, GCP, Azure, etc.)
Understand the project's scale: if serious money is involved, treat the HA items as Critical severity
Get access to cloud console/CLI for verification commands
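
The severity rule above can be written down as a small lookup. This is a sketch only: the item IDs and default severities come from this checklist, while `effective_severity`, the `serious_money` flag, and the reading that "HA items" means the two entries under the High Availability heading (HA-001 and HA-002) are assumptions.

```python
# Default severities as listed in this checklist.
DEFAULT_SEVERITY = {
    "HA-001": "recommended",
    "HA-002": "recommended",
    "HA-003": "critical",
    "HA-004": "recommended",
    "HA-005": "critical",
    "HA-006": "recommended",
}

# Assumption: "HA items" means the entries under the High Availability heading.
ESCALATED_WHEN_SERIOUS_MONEY = {"HA-001", "HA-002"}


def effective_severity(item_id: str, serious_money: bool) -> str:
    """Return the severity to audit against for one checklist item."""
    if serious_money and item_id in ESCALATED_WHEN_SERIOUS_MONEY:
        return "critical"
    return DEFAULT_SEVERITY[item_id]
```

For example, `effective_severity("HA-001", serious_money=True)` yields `"critical"`, while the same item on a small project stays `"recommended"`.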

High Availability

HA-001

Production database HA configured · recommended

Database fails over automatically to a standby; Multi-AZ, regional HA, or replication is configured; failover has been tested

“Your DB goes down at 2am — what happens next?”
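
As a minimal sketch of the configuration part of HA-001, the function below classifies a dict shaped like one entry of the `DBInstances` list in the AWS RDS `DescribeDBInstances` response. The `MultiAZ` and `ReadReplicaDBInstanceIdentifiers` field names come from that response; the function name `failover_mode` and its return labels are hypothetical. It says nothing about whether failover was ever tested, which still needs a manual drill.

```python
def failover_mode(instance: dict) -> str:
    """Classify one RDS-style instance record by how it would fail over."""
    if instance.get("MultiAZ"):
        # A standby exists in another availability zone.
        return "automatic"
    if instance.get("ReadReplicaDBInstanceIdentifiers"):
        # Replicas exist, but promoting one is typically a manual step.
        return "manual"
    # Single instance with no standby or replica.
    return "none"
```

Anything other than `"automatic"` is worth raising as a finding for a production database.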

HA-002

Multi-region server deployment with failover · recommended

Servers in 2+ regions/data centers OR single region with quick-failover capability; traffic can route away from failed region

“One region goes dark — how long until users notice?”
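
A rough way to put a number on that question, assuming failover is driven by health checks and DNS: the time to detect the failure is the check interval multiplied by the number of consecutive failures required, and clients may keep using the old record for up to one DNS TTL afterwards. The function name and parameters below are illustrative, and real clients or resolvers can cache for longer than the TTL.

```python
def worst_case_failover_seconds(
    check_interval_s: int, failure_threshold: int, dns_ttl_s: int
) -> int:
    """Estimate how long users may be routed to a failed region."""
    # Time for the health checker to declare the region unhealthy.
    detection_s = check_interval_s * failure_threshold
    # Clients may keep the stale DNS answer for up to one TTL after that.
    return detection_s + dns_ttl_s
```

With 30-second checks, a threshold of 3 failures, and a 60-second TTL, the estimate is 150 seconds.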

Backups

HA-003

Production database backup configured · critical

Automated backups enabled; retention period defined (minimum 7 days); backups verified running; restore tested

“When did you last verify a backup actually restores?”
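
The retention and "verified running" criteria for HA-003 can be sketched as a pure check. The 7-day minimum comes from this item; the function name, the 26-hour freshness threshold (a daily backup plus some slack), and the finding messages are assumptions. On RDS, a `BackupRetentionPeriod` of 0 means automated backups are disabled, which is why that value is reported separately. A restore test is still required and is not covered here.

```python
from datetime import datetime, timedelta

MIN_RETENTION_DAYS = 7  # minimum retention named in this checklist item


def backup_findings(
    retention_days: int,
    last_backup_at: datetime,
    now: datetime,
    max_age: timedelta = timedelta(hours=26),  # assumed: daily backups plus slack
) -> list[str]:
    """Return audit findings; an empty list means these checks passed."""
    findings = []
    if retention_days == 0:
        findings.append("automated backups are disabled")
    elif retention_days < MIN_RETENTION_DAYS:
        findings.append(
            f"retention is {retention_days} days, "
            f"below the {MIN_RETENTION_DAYS}-day minimum"
        )
    # Pass either two naive or two timezone-aware datetimes, not a mix.
    if now - last_backup_at > max_age:
        findings.append("latest backup is older than expected")
    return findings
```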

HA-004

Off-site backup storage (outside primary provider) · recommended

Backups are stored with a different provider than production (not just in another region); sync is automated; a restore from the external copy has been tested

“If AWS went down, where are your backups?”
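
One way to confirm that the automated sync produced a faithful off-site copy is to compare checksums of the primary and external files. The sketch below uses only the Python standard library; the function names are illustrative, and how you obtain each copy (download, mounted bucket, etc.) is left open. A matching checksum shows the bytes arrived intact, not that the backup restores, so the external restore test is still needed.

```python
import hashlib
from typing import Iterable, Iterator


def read_chunks(path: str, chunk_size: int = 1024 * 1024) -> Iterator[bytes]:
    """Stream a file in chunks so large backups never load fully into memory."""
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            yield chunk


def sha256_hex(chunks: Iterable[bytes]) -> str:
    """Hash a stream of byte chunks and return the hex digest."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def copies_match(primary: Iterable[bytes], offsite: Iterable[bytes]) -> bool:
    """True when both streams contain exactly the same bytes."""
    return sha256_hex(primary) == sha256_hex(offsite)
```

Typical use would be `copies_match(read_chunks(primary_path), read_chunks(offsite_path))`, where both paths are placeholders for wherever the two copies are reachable.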

HA-005

Point-in-time recovery enabled · critical

PITR is enabled for the production database; the recovery window is appropriate (7-35 days); the team knows how to perform a PITR restore

“Bad deploy corrupts data — how far back can you go?”
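
To answer that question concretely, the sketch below checks whether a target timestamp falls inside the recovery window. It assumes a simplified model in which the window runs from `now` minus the retention period up to the latest restorable time reported by the database service; actual bounds vary by provider (for example, a recently created instance cannot be restored to a time before it existed). All names are illustrative.

```python
from datetime import datetime, timedelta


def pitr_window(
    now: datetime, retention_days: int, latest_restorable: datetime
) -> tuple[datetime, datetime]:
    """Return the (earliest, latest) restorable timestamps under this model."""
    earliest = now - timedelta(days=retention_days)
    return earliest, latest_restorable


def can_restore_to(
    target: datetime, now: datetime, retention_days: int, latest_restorable: datetime
) -> bool:
    """True when the target moment lies inside the recovery window."""
    earliest, latest = pitr_window(now, retention_days, latest_restorable)
    return earliest <= target <= latest
```

With 7 days of retention, a bad deploy from 2 days ago is recoverable, while corruption that went unnoticed for 8 days is not.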

HA-006

Backup window appropriate for RPO · recommended

Backup window is chosen deliberately (low-traffic period); backup frequency aligns with the business RPO; backups cause no performance impact

“How much data are you willing to lose in a disaster?”
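
Both halves of HA-006 can be sketched as small checks. The first assumes snapshot-only backups, where the worst-case loss is one full interval between backups. The second assumes the backup window is given as an `hh:mm-hh:mm` string in 24-hour time (the format RDS uses for `PreferredBackupWindow`, in UTC) and that peak traffic hours are known; the function names and the peak-hours input are illustrative.

```python
from datetime import timedelta
from typing import Iterable


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """With snapshot-only backups, worst-case loss is one full interval."""
    return backup_interval <= rpo


def hours_touched(window: str) -> set[int]:
    """Return the clock hours (0-23) that an 'hh:mm-hh:mm' window overlaps."""
    start, end = window.split("-")
    start_h, start_m = map(int, start.split(":"))
    end_h, end_m = map(int, end.split(":"))
    start_min = start_h * 60 + start_m
    end_min = end_h * 60 + end_m
    if end_min <= start_min:
        end_min += 24 * 60  # the window wraps past midnight
    return {(minute // 60) % 24 for minute in range(start_min, end_min)}


def overlaps_peak(window: str, peak_hours: Iterable[int]) -> bool:
    """True when the backup window touches any peak-traffic hour."""
    return bool(hours_touched(window) & set(peak_hours))
```

For instance, daily snapshots against a 1-hour RPO fail `meets_rpo`, which suggests relying on point-in-time recovery or more frequent backups instead.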
