RR-006 critical emergency-recovery

Recovery procedure tested

Full recovery drill completed at least annually, including a database restore

Question to ask

"Have you confirmed your backups actually restore correctly?"

Verification guide

Severity: Critical

Untested backups are Schrödinger's backups: you don't know whether they work until you try to restore them. Many teams discover that their backups are corrupted or incomplete only during a real disaster.

Check automatically:

  1. Look for recovery test records:
# Search for DR test documentation
grep -riE "dr.*test|disaster.*drill|recovery.*test|tested.*recovery" docs/ runbooks/ --include="*.md" 2>/dev/null

# Check for test dates
grep -riE "last.*tested|tested.*on|drill.*date" docs/ runbooks/ --include="*.md" 2>/dev/null
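If the searches above return matches, the most recent test date can be pulled out of them. This is a minimal sketch, assuming test dates are written in ISO form (YYYY-MM-DD) on the same line as the matching phrase. latest_test_date is a hypothetical helper name, not part of any existing tooling.

```shell
#!/usr/bin/env bash
# Print the newest ISO date (YYYY-MM-DD) found on lines that mention a
# recovery test. Assumption: dates sit on the same line as the phrase.
latest_test_date() {
  grep -rihE --include="*.md" "last.*tested|tested.*on|drill.*date" "$@" 2>/dev/null \
    | grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}' \
    | sort \
    | tail -n 1
}

# Usage (prints nothing if no dated test record exists):
#   latest_test_date docs/ runbooks/
```

An empty result is itself a finding: it suggests there is no dated record of a recovery test, which should be confirmed with the user.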

Ask user:

  • "Have you ever done a full restore from backups to a clean environment?"
  • "When was the last disaster recovery drill?"
  • "Did the drill include database restore, not just application redeploy?"
  • "What problems did you discover during testing?"

What a proper test covers:

  1. Provision fresh infrastructure (or use DR environment)
  2. Restore database from backup
  3. Deploy application
  4. Verify data integrity
  5. Verify application functionality
  6. Measure time taken (validates RTO)
  7. Document issues found
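The steps above can be wrapped in a small timing harness so that the drill produces its own duration record. This is a sketch only: run_step and DRILL_LOG are hypothetical names, and every script path in the usage comments is a placeholder for whatever the team actually runs.

```shell
#!/usr/bin/env bash
# Timed recovery drill harness (sketch).
DRILL_LOG="${DRILL_LOG:-drill-$(date +%Y%m%d).log}"

# Run one named step and append its duration and outcome to the drill log.
run_step() {
  local name="$1"
  shift
  local start end status result
  start=$(date +%s)
  "$@"
  status=$?
  end=$(date +%s)
  if [ "$status" -eq 0 ]; then result="ok"; else result="FAILED"; fi
  printf '%s\t%ss\t%s\n' "$name" "$((end - start))" "$result" >> "$DRILL_LOG"
  return "$status"
}

# Usage (all commands are hypothetical placeholders):
#   run_step "provision"    ./scripts/provision-dr-env.sh
#   run_step "restore-db"   ./scripts/restore-db.sh
#   run_step "deploy-app"   ./scripts/deploy.sh
#   run_step "verify-data"  ./scripts/check-data-integrity.sh
#   run_step "verify-app"   ./scripts/smoke-test.sh
```

Summing the per-step durations in the log gives the measured recovery time to compare against the RTO, and any FAILED line is an issue to document.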

Cross-reference with:

  • RR-005 (recovery documented) - test validates the documentation
  • RR-007 (RTO) - test measures actual recovery time
  • RR-002 (rollback tested) - similar principle, different scope
  • Section 26 (backups) - tests verify backups are actually restorable

Pass criteria:

  • Full recovery tested at least annually
  • Test included database restore (not just app redeploy)
  • Issues found during test were fixed
  • Test results documented with time measurements
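The annual criterion can be checked mechanically once the date of the last drill is known. This is a minimal sketch, assuming GNU date (the -d option used here is not available in BSD/macOS date) and a date in YYYY-MM-DD form. drill_is_current is a hypothetical helper name, and the 365-day default is one reading of "at least annually".

```shell
#!/usr/bin/env bash
# Return 0 if the given date is no more than max_age_days old (default 365).
# Return 2 if the date cannot be parsed. Requires GNU date.
drill_is_current() {
  local last_test="$1"
  local max_age_days="${2:-365}"
  local last_epoch now_epoch age_days
  last_epoch=$(date -d "$last_test" +%s 2>/dev/null) || return 2
  now_epoch=$(date +%s)
  age_days=$(( (now_epoch - last_epoch) / 86400 ))
  [ "$age_days" -le "$max_age_days" ]
}

# Usage (the date is whatever the recovery test record states):
#   drill_is_current "YYYY-MM-DD" && echo "within the last year"
```

This covers only the first criterion. Whether the test included a database restore and whether the issues found were fixed still has to be read from the test record.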

Fail criteria:

  • Never tested ("backups exist, that's enough")
  • Only tested app redeploy, never database restore
  • Test failed and issues weren't fixed
  • No record of when/how testing was done

Evidence to capture:

  • Date of last recovery test
  • Scope of test (full stack vs partial)
  • Time taken to recover
  • Issues discovered and their resolution status
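One way to keep this evidence findable is to append a short record to a Markdown file after each drill. The sketch below assumes the record lives alongside the other runbooks. record_drill_evidence and its field layout are hypothetical, chosen so that the "Last tested on" line is picked up by the grep searches in the automatic check.

```shell
#!/usr/bin/env bash
# Append one recovery drill record to a Markdown file (sketch).
# Arguments: file, date, scope, minutes to recover, issues and their status.
record_drill_evidence() {
  local file="$1" drill_date="$2" scope="$3" minutes="$4" issues="$5"
  cat >> "$file" <<EOF
## Recovery drill $drill_date
- Last tested on: $drill_date
- Scope: $scope
- Time to recover: $minutes min
- Issues and status: $issues
EOF
}

# Usage (path and values are placeholders):
#   record_drill_evidence runbooks/dr-tests.md "YYYY-MM-DD" "full stack" \
#     "<minutes>" "<issue summary and resolution status>"
```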

Section

34. Rollback & Recovery

API & Security