HA-005 critical Backups

Point-in-time recovery enabled

PITR enabled for production database; recovery window appropriate (7-35 days); team knows how to perform PITR restore

Question to ask

"Bad deploy corrupts data — how far back can you go?"

Verification guide

Severity: Critical (for production)

Point-in-time recovery (PITR) allows restoring the database to any moment within the retention window, not just to the most recent daily snapshot. It is essential for recovering from accidental data deletion or corruption.

Check automatically:

  1. AWS RDS PITR:
# PITR is enabled if BackupRetentionPeriod > 0
aws rds describe-db-instances --query "DBInstances[].{ID:DBInstanceIdentifier,BackupRetention:BackupRetentionPeriod,LatestRestorableTime:LatestRestorableTime}" --output table

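# Check how far back you can restore (restore window of the automated backups)
aws rds describe-db-instance-automated-backups --query "DBInstanceAutomatedBackups[].{ID:DBInstanceIdentifier,EarliestRestore:RestoreWindow.EarliestTime,LatestRestore:RestoreWindow.LatestTime}" --output table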
  1. GCP Cloud SQL PITR:
# Check PITR specifically
gcloud sql instances describe INSTANCE_NAME --format="get(settings.backupConfiguration.pointInTimeRecoveryEnabled)"

# Check binary logging (required for PITR on MySQL)
gcloud sql instances describe INSTANCE_NAME --format="get(settings.backupConfiguration.binaryLogEnabled)"
  1. Azure SQL PITR:
# Check earliest restore point
az sql db show --name DB_NAME --server SERVER_NAME --resource-group RESOURCE_GROUP --query "{name:name,earliestRestoreDate:earliestRestoreDate}"
  1. Check Terraform/IaC:
# AWS RDS - PITR enabled via backup retention
grep -rE "backup_retention_period\s*=\s*[1-9]" --include="*.tf" 2>/dev/null

# GCP Cloud SQL PITR
grep -rE "point_in_time_recovery_enabled\s*=\s*true" --include="*.tf" 2>/dev/null
  1. For self-hosted PostgreSQL (WAL archiving):
# Check for WAL archiving configuration
grep -rE "archive_mode\s*=\s*on|archive_command|wal_level\s*=\s*replica" --include="*.conf" --include="*.tf" --include="*.yml" 2>/dev/null

# Check for WAL-G, pgBackRest, or Barman (common PITR tools); -i also matches "pgBackRest" and "WAL-G"
grep -rEi "wal-g|pgbackrest|barman" --include="*.yml" --include="*.sh" --include="docker-compose*" 2>/dev/null
  1. For self-hosted MySQL (binary logging):
# Check binary log configuration (binlog_expire_logs_seconds replaces expire_logs_days in MySQL 8.0)
grep -rE "log_bin|binlog_format|expire_logs_days|binlog_expire_logs_seconds" --include="*.cnf" --include="*.conf" 2>/dev/null
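
For AWS RDS, the check above comes down to comparing `BackupRetentionPeriod` with the 7-35 day window used in the pass criteria. The following is a minimal sketch of that comparison, not a definitive implementation. The function names `classify_retention` and `check_rds_retention` and the verdict strings are hypothetical, and the 7-day threshold is taken from this check's own criteria.

```shell
# Hypothetical helper: map an RDS BackupRetentionPeriod (in days) to a verdict.
# Thresholds follow this check's pass criteria (7-35 days); 0 means PITR is off.
classify_retention() {
  case "$1" in
    ''|*[!0-9]*) echo "unknown"; return ;;   # empty or non-numeric input
  esac
  if [ "$1" -eq 0 ]; then
    echo "fail: PITR disabled"
  elif [ "$1" -lt 7 ]; then
    echo "warn: window shorter than 7 days"
  else
    echo "pass"
  fi
}

# Print one verdict per instance. Defined but not invoked here, because it
# needs the AWS CLI and valid credentials.
check_rds_retention() {
  aws rds describe-db-instances \
    --query "DBInstances[].[DBInstanceIdentifier,BackupRetentionPeriod]" \
    --output text |
  while read -r id days; do
    printf '%s\t%s\n' "$id" "$(classify_retention "$days")"
  done
}
```

Running `check_rds_retention` in an authenticated shell prints one tab-separated line per instance. Any `fail` or `warn` line is a candidate finding for this check.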

Ask user:

  • "Can you restore your production database to a specific point in time?"
  • "What's your recovery window (how far back can you restore)?"
  • "If someone accidentally deleted critical data at 2:47 PM, could you restore to 2:46 PM?"
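
The 2:47 PM scenario can be rehearsed as a dry run. The sketch below computes a restore target one minute before an incident and prints, without executing, the matching RDS restore command. It assumes GNU `date` (Linux) and an incident time already expressed in UTC. The helper names and instance identifiers are placeholders, not values from this checklist.

```shell
# Hypothetical helper: print the UTC timestamp 60 seconds before the incident.
# Assumes GNU date; BSD/macOS date takes different flags.
restore_target() {
  incident_epoch=$(date -u -d "$1" +%s) || return 1
  date -u -d "@$((incident_epoch - 60))" +%Y-%m-%dT%H:%M:%SZ
}

# Print (do not run) the restore command for review.
# Arguments: source instance, new target instance, incident time (UTC).
# RDS PITR restores into a new instance; the source instance is left untouched.
print_restore_command() {
  printf 'aws rds restore-db-instance-to-point-in-time --source-db-instance-identifier %s --target-db-instance-identifier %s --restore-time %s\n' \
    "$1" "$2" "$(restore_target "$3")"
}
```

For example, `print_restore_command prod-db prod-db-pitr 2024-05-01T14:47:00Z` prints a command with `--restore-time 2024-05-01T14:46:00Z`. Printing rather than running keeps the drill safe until someone has reviewed the target time.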

Cross-reference with:

  • HA-003 (backups - PITR builds on backup infrastructure)
  • HA-004 (off-site - WAL archives should also be stored off-site)
  • Section 34 (RPO - PITR provides near-zero RPO)

Pass criteria:

  • PITR enabled on production database
  • Recovery window appropriate (typically 7-35 days)
  • Team knows how to perform PITR restore
  • PITR restore tested at least once

Fail criteria:

  • Only daily snapshots (can't restore to a specific point in time)
  • PITR not enabled
  • "We've never done a point-in-time restore"
  • WAL archiving configured but failing
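
For self-hosted PostgreSQL, "configured but failing" can be detected from the `pg_stat_archiver` view. The sketch below assumes `psql` can connect using the usual `PG*` environment variables. The helper names and verdict strings are hypothetical.

```shell
# Query the archiver statistics as one line: archived_count|failed_count|failing
# "failing" is t when the most recent archive attempt failed.
# Defined but not invoked here, because it needs a reachable PostgreSQL server.
query_archiver() {
  psql -At -c "SELECT archived_count, failed_count, COALESCE(last_failed_time > last_archived_time, last_failed_time IS NOT NULL) FROM pg_stat_archiver"
}

# Hypothetical helper: turn one line of query_archiver output into a verdict.
archiver_verdict() {
  IFS='|' read -r archived failed failing <<EOF
$1
EOF
  if [ "$failing" = "t" ]; then
    echo "fail: latest archive attempt failed ($failed failures total)"
  elif [ "$archived" = "0" ]; then
    echo "fail: no WAL segment archived yet"
  else
    echo "ok"
  fi
}
```

On the database host, `archiver_verdict "$(query_archiver)"` yields a single line that can be recorded as evidence. A `fail` verdict means the grep-based configuration checks passed while archiving itself is not working.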

Evidence to capture:

  • PITR mechanism (managed service, WAL archiving, binary logs)
  • Recovery window (earliest to latest restorable time)
  • Last PITR restore test (if any)
  • RPO achieved with PITR (typically seconds/minutes)
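
To record the recovery window as evidence, the earliest and latest restorable times can be reduced to a day count. This is a minimal sketch assuming GNU `date` and ISO 8601 timestamps. The helper names and the output format are illustrative only.

```shell
# Hypothetical helper: whole days between two ISO 8601 timestamps (rounded down).
# Assumes GNU date.
window_days() {
  earliest_epoch=$(date -u -d "$1" +%s) || return 1
  latest_epoch=$(date -u -d "$2" +%s) || return 1
  echo $(( (latest_epoch - earliest_epoch) / 86400 ))
}

# Format one evidence line: instance ID, earliest and latest restorable time.
evidence_line() {
  printf '%s: restorable from %s to %s (%s days)\n' \
    "$1" "$2" "$3" "$(window_days "$2" "$3")"
}
```

The two timestamps would come from whichever provider query applies, for example the earliest and latest restorable times reported for the instance.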

Section

26. High Availability & Backups

High Availability & DR