RR-007 recommended emergency-recovery

Know RTO (Recovery Time Objective)

Maximum acceptable downtime defined and achievable with current infrastructure

Question to ask

"How long can your business actually survive prod being completely down?"

Verification guide

Severity: Recommended

RTO is the maximum acceptable time from incident start to service restoration. It drives infrastructure decisions and should be agreed with stakeholders.

Check automatically:

  1. Look for RTO documentation:
# Search for RTO mentions (word-bounded so "RTO" does not match inside words such as "sortOrder")
grep -riE "\bRTO\b|recovery.*time.*objective|time.*to.*recover|downtime.*target" docs/ runbooks/ README.md CLAUDE.md SLA* --include="*.md" 2>/dev/null

# Check for SLA documentation
find . -name "*sla*" -o -name "*SLA*" 2>/dev/null | grep -v node_modules

RTO tiers and required strategies:

RTO          Strategy required
< 1 min      Hot standby, automatic failover
< 15 min     Warm standby, quick promotion
< 1 hour     Pre-provisioned DR environment
< 4 hours    Restore from backups to fresh infra
< 24 hours   Manual recovery acceptable
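The tier table can be expressed as a small lookup, which may help when comparing a stated RTO with the strategy actually in place. This is a minimal sketch and not part of the checklist itself: the function name `rto_strategy` is hypothetical, the RTO is passed in seconds, and the message for targets of 24 hours or more is an assumption because the table does not cover that range.

```sh
# Hypothetical helper: map an RTO (in seconds) to the strategy tier from the table above.
rto_strategy() {
  rto_seconds=$1
  if   [ "$rto_seconds" -lt 60 ];    then echo "Hot standby, automatic failover"
  elif [ "$rto_seconds" -lt 900 ];   then echo "Warm standby, quick promotion"
  elif [ "$rto_seconds" -lt 3600 ];  then echo "Pre-provisioned DR environment"
  elif [ "$rto_seconds" -lt 14400 ]; then echo "Restore from backups to fresh infra"
  elif [ "$rto_seconds" -lt 86400 ]; then echo "Manual recovery acceptable"
  else
    # Assumption: the table stops at 24 hours, so anything longer is reported as unlisted.
    echo "Outside the listed tiers"
  fi
}

rto_strategy 600   # a 10-minute RTO falls in the "< 15 min" tier
```

Because the tiers use strict "less than" bounds, an RTO of exactly one hour (3600 seconds) lands in the "< 4 hours" tier, not the "< 1 hour" tier.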

Ask user:

  • "What's the maximum acceptable downtime for your service?"
  • "Is this documented/agreed with stakeholders?"
  • "Does your current infrastructure support achieving this RTO?"
  • "Have you measured actual recovery time in drills?"
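If the team has never timed a drill, one rough way to start is to wrap the recovery command in a timer. The sketch below assumes the recovery procedure can be launched as a single command; `time_recovery` is a hypothetical name and `./restore.sh` in the usage note is a placeholder, not a script the checklist defines. It measures only the command's wall-clock duration, so detection and decision time before the command starts would still need to be added by hand.

```sh
# Hypothetical helper: run a recovery command and report how long it took.
time_recovery() {
  start=$(date +%s)          # seconds since the epoch, before recovery starts
  "$@"                       # run the recovery command exactly as given
  status=$?
  end=$(date +%s)
  echo "Recovery took $(( end - start ))s (exit status ${status})"
  return "$status"
}
```

A drill might then be run as `time_recovery ./restore.sh`, with the reported duration recorded alongside the date of the drill.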

Cross-reference with:

  • RR-006 (recovery tested) - tests measure actual recovery time
  • RR-008 (RPO) - related objective, often defined together
  • Section 26 (HA/backups) - infrastructure must support RTO
  • RR-005 (recovery documented) - procedure should mention RTO target

Pass criteria:

  • RTO is defined (even informally: "we need to be up within 4 hours")
  • RTO is realistic given current infrastructure
  • Team knows the RTO and it influences decisions
  • Actual recovery time (from drills) meets or beats RTO

Fail criteria:

  • No idea what acceptable downtime is
  • RTO is defined but infrastructure can't achieve it
  • RTO exists on paper but team doesn't know it
  • Never measured actual recovery time
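The measurable parts of the pass and fail criteria can be checked mechanically once drill timings exist. The following is a sketch under stated assumptions: `drills_meet_rto` is a hypothetical helper, all values are in seconds, and "meets or beats" is read as "less than or equal to the RTO". Whether the team knows the RTO, or whether it influences decisions, still has to be established by asking.

```sh
# Hypothetical helper: succeed only if every measured drill time is within the RTO.
# Usage: drills_meet_rto <rto_seconds> <drill_seconds>...
drills_meet_rto() {
  rto_seconds=$1
  shift
  # No measurements at all corresponds to "Never measured actual recovery time".
  if [ "$#" -eq 0 ]; then
    echo "FAIL: no drill measurements"
    return 1
  fi
  for measured in "$@"; do
    if [ "$measured" -gt "$rto_seconds" ]; then
      echo "FAIL: drill took ${measured}s, RTO is ${rto_seconds}s"
      return 1
    fi
  done
  echo "PASS: all $# drills within ${rto_seconds}s"
}

drills_meet_rto 14400 9000 12600   # two drills against a 4-hour RTO
```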

Evidence to capture:

  • Defined RTO (or lack thereof)
  • Whether it's documented/agreed with business
  • Actual measured recovery time from drills
  • Gap between target RTO and actual capability
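The evidence items can be recorded in a consistent shape so that audits are comparable over time. This is one possible sketch, not a required format: `rto_evidence` and its output labels are hypothetical, times are in seconds, and the gap is computed as measured recovery time minus target RTO.

```sh
# Hypothetical helper: print an evidence summary, including the target-vs-actual gap.
# Usage: rto_evidence <target_seconds> <measured_seconds> <documented: yes|no>
rto_evidence() {
  target_seconds=$1     # agreed RTO
  measured_seconds=$2   # recovery time from the most recent drill
  documented=$3         # whether the RTO is documented/agreed with the business
  gap=$(( measured_seconds - target_seconds ))
  echo "Defined RTO: ${target_seconds}s"
  echo "Documented/agreed: ${documented}"
  echo "Measured recovery time: ${measured_seconds}s"
  if [ "$gap" -gt 0 ]; then
    echo "Gap: ${gap}s over target"
  else
    echo "Gap: none ($(( target_seconds - measured_seconds ))s of headroom)"
  fi
}

rto_evidence 3600 5400 yes   # 1-hour target, 90-minute drill
```

A positive gap means the current capability does not meet the target, which matches the fail criterion about infrastructure that cannot achieve the RTO.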

Section

34. Rollback & Recovery

API & Security