RR-005 critical emergency-recovery

Recovery from backups documented

Step-by-step procedure to restore full stack from backups

Question to ask

"Who knows the steps to restore prod from scratch — right now, tonight?"

Verification guide

Severity: Critical

If your primary infrastructure is completely gone, you need written steps to restore everything from scratch. This check is about total recovery, not rollback.

Check automatically:

  1. Look for disaster recovery documentation:
# Search for DR docs
grep -riE "disaster.*recovery|restore.*backup|recovery.*procedure|server.*down" docs/ runbooks/ README.md CLAUDE.md --include="*.md" 2>/dev/null

# Check for restore scripts
find . -name "*restore*" -o -name "*recovery*" 2>/dev/null | grep -v node_modules

  2. Check for infrastructure-as-code (makes recovery easier):
# Terraform, Pulumi, CDK
# -iname because Pulumi project files are named Pulumi.yaml (capital P)
find . -name "*.tf" -o -iname "pulumi.*" -o -name "cdk.*" 2>/dev/null | head -5
ls terraform/ pulumi/ cdk/ infrastructure/ 2>/dev/null

What the document should cover:

  1. Where are backups stored? (S3, provider snapshots, etc.)
  2. How to access them in an emergency?
  3. How to provision new infrastructure?
  4. How to restore the database from backup?
  5. How to restore application state?
  6. How to update DNS/routing to new infrastructure?
  7. Who has permissions to do this?
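
The seven topics above can be roughly screened with a keyword scan once a candidate document is found. The following is a minimal sketch, not a definitive check: the function name check_dr_coverage and every regex pattern are illustrative assumptions, and a keyword hit only suggests a topic is mentioned, not that it is covered well.

```shell
# Sketch: report which recovery topics a DR document appears to mention.
# Topic names follow the checklist; the regex patterns are assumptions.
check_dr_coverage() {
  local doc="$1" missing=0 entry topic pattern
  for entry in \
    "backup location::backup.*(stored|location|bucket|snapshot)" \
    "emergency access::access|credential|break.?glass" \
    "provisioning::provision|terraform|pulumi|cdk" \
    "database restore::restore.*(database|db)|(database|db).*restore" \
    "application state::application state|uploads|queue|cache|session" \
    "DNS/routing::dns|routing|load balancer" \
    "permissions::permission|who can|on-?call|owner"
  do
    topic="${entry%%::*}"      # text before the "::" separator
    pattern="${entry#*::}"     # text after the "::" separator
    if grep -qiE -- "$pattern" "$doc"; then
      printf 'OK\t%s\n' "$topic"
    else
      printf 'MISSING\t%s\n' "$topic"
      missing=$((missing + 1))
    fi
  done
  # Non-zero exit status when any topic has no keyword hit.
  [ "$missing" -eq 0 ]
}
```

Usage, assuming the document lives at a hypothetical path such as docs/disaster-recovery.md: run check_dr_coverage docs/disaster-recovery.md and read any MISSING lines as prompts for a manual review.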

Ask user:

  • "If your primary server and database were completely gone, do you have written steps to restore?"
  • "Where are your backups stored? (Same provider = risky, different provider = better)"
  • "Who has access to restore from backups?"
  • "Is infrastructure defined as code (Terraform, Pulumi) or manual?"

Cross-reference with:

  • RR-006 (recovery procedure tested) - document is useless if untested
  • RR-007/RR-008 (RTO/RPO) - recovery doc should mention time objectives
  • Section 26 (backups) - backups must exist before you can restore them

Pass criteria:

  • Written step-by-step recovery procedure exists
  • Covers full stack (infra, database, application)
  • Multiple people can execute it
  • Backups are stored separately from primary infrastructure

Fail criteria:

  • No written procedure ("we'll figure it out")
  • Only covers partial recovery (database but not infra)
  • Only one person knows how
  • Backups on same provider/region as primary (could be lost together)
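
The backup-separation criterion can be recorded consistently with a small helper. This is a sketch under stated assumptions: the function name classify_backup_location is hypothetical, the provider and region values are entered by hand rather than discovered, and treating "same provider, different region" as risky rather than failing is one reading of the criteria above.

```shell
# Sketch: classify backup placement relative to primary infrastructure.
# Arguments: primary_provider primary_region backup_provider backup_region
# The three-way classification is an assumed interpretation of the checklist.
classify_backup_location() {
  local primary_provider="$1" primary_region="$2"
  local backup_provider="$3" backup_region="$4"
  if [ "$primary_provider" != "$backup_provider" ]; then
    echo "pass: backups on a different provider"
  elif [ "$primary_region" != "$backup_region" ]; then
    echo "risky: same provider, different region"
  else
    echo "fail: same provider and region as primary"
  fi
}
```

For example, classify_backup_location aws us-east-1 aws us-east-1 reports a fail, since both copies could be lost together.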

Evidence to capture:

  • Location of disaster recovery documentation
  • Backup storage location(s)
  • Whether infrastructure is codified
  • Who has restore permissions
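
Part of this evidence can be gathered mechanically. The sketch below prints candidate documentation and infrastructure-as-code files under a directory and leaves placeholders for the items that need a human answer. The function name collect_recovery_evidence and the output headings are assumptions for illustration.

```shell
# Sketch: print an evidence summary for a repository root (default: current dir).
# Only the first two sections are automated; the rest are filled in by hand.
collect_recovery_evidence() {
  local root="${1:-.}" iac
  echo "## DR documentation candidates"
  grep -rliE "disaster.*recovery|restore.*backup|recovery.*procedure" \
    --include="*.md" "$root" 2>/dev/null || echo "(none found)"
  echo "## Infrastructure-as-code files"
  # Terraform files, Pulumi project files, and CDK config, skipping node_modules.
  iac=$(find "$root" \( -name "*.tf" -o -iname "pulumi.yaml" -o -name "cdk.json" \) \
    ! -path "*/node_modules/*" 2>/dev/null | head -5)
  echo "${iac:-(none found)}"
  echo "## Backup storage location(s): (fill in manually)"
  echo "## Who has restore permissions: (fill in manually)"
}
```

Redirecting the output to a file, for example collect_recovery_evidence . > rr-005-evidence.txt, gives a starting point to attach to the audit record; the file name is only a suggestion.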

Section

34. Rollback & Recovery
