RR-003 critical rollback-strategy

Can roll back in < 2 minutes

Rollback execution time under 2 minutes with no blocking approval gates

Question to ask

"How long did your last bad deploy actually take to reverse?"

Verification guide

Severity: Critical

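Speed matters during incidents: every minute spent rolling back is added to the outage, so a rollback that takes 10+ minutes becomes a significant part of the incident itself. Target: under 2 minutes from the decision to roll back to the previous version being live.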

Check automatically:

  1. Check deployment platform:
# Vercel - instant rollback via dashboard or CLI
grep -E "vercel" package.json vercel.json 2>/dev/null

# Railway - instant rollback via dashboard
grep -E "railway" package.json railway.json 2>/dev/null

# Fly.io - `fly releases rollback`
find . -name "fly.toml" 2>/dev/null

# Kubernetes - `kubectl rollout undo`
find . \( -name "*.yaml" -o -name "*.yml" \) ! -path "*/node_modules/*" -exec grep -l "kind: Deployment" {} + 2>/dev/null

# Check for rollback scripts
find . -name "*rollback*" -o -name "*revert*" 2>/dev/null | grep -v node_modules
  2. Check CI/CD pipeline duration (if rollback goes through CI):
# Check for pipeline config (both .yml and .yaml workflow files)
cat .github/workflows/*.yml .github/workflows/*.yaml 2>/dev/null | head -100
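
Reading the raw workflow files can be slow when there are many of them. The sketch below narrows the output to two keys that tend to matter for rollback speed. It assumes GitHub Actions; scan_workflows is a hypothetical helper name, and the scan is a heuristic only: an `environment:` key means the job may be subject to required reviewers, which are configured in the repository settings and not in the YAML itself.

```shell
# Heuristic scan of GitHub Actions workflows for signs of gated or slow deploys.
# scan_workflows is a hypothetical helper; pass the workflow directory as $1
# (defaults to .github/workflows).
scan_workflows() {
  dir="${1:-.github/workflows}"
  # Jobs that target an environment may require manual approval, depending on
  # the protection rules set for that environment in the repository settings.
  grep -nE '^[[:space:]]*environment:' "$dir"/*.yml "$dir"/*.yaml 2>/dev/null
  # Explicit timeouts hint at how long a job is allowed (or expected) to run.
  grep -nE '^[[:space:]]*timeout-minutes:' "$dir"/*.yml "$dir"/*.yaml 2>/dev/null
  # Finding nothing is not an error: it only means no hints were found.
  return 0
}
```

Any hit is a prompt for the "Ask user" questions, not proof of a gate. Actual run durations still need to be read from the CI provider's run history.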

Rollback speed tiers:

Speed               Method
Instant (< 30 sec)  Vercel, Railway, Fly.io dashboard click, K8s rollout undo
Fast (< 2 min)      Git revert + fast CI/CD pipeline
Slow (> 5 min)      Manual deploy process, long CI pipelines, approval gates
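
To record evidence consistently, a measured time can be mapped onto these tiers. This is a minimal sketch: rollback_tier is a hypothetical helper, and the "Over target" label for the 2 to 5 minute range is an assumption, since the tiers above do not name that range.

```shell
# rollback_tier is a hypothetical helper: maps a measured rollback time in
# whole seconds to one of the speed tiers.
rollback_tier() {
  secs="$1"
  if [ "$secs" -lt 30 ]; then
    echo "Instant"
  elif [ "$secs" -lt 120 ]; then
    echo "Fast"
  elif [ "$secs" -gt 300 ]; then
    echo "Slow"
  else
    # 2 to 5 minutes is not named by the tiers (label is an assumption);
    # it still misses the under-2-minute target.
    echo "Over target"
  fi
}
```

For example, `rollback_tier 90` prints `Fast` and `rollback_tier 600` prints `Slow`.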

Ask user:

  • "How do you deploy? (Platform with instant rollback vs manual CI/CD)"
  • "How long does your CI/CD pipeline take?"
  • "Are there approval gates that slow down emergency rollbacks?"

Cross-reference with:

  • RR-001 (documented procedure) - speed comes from clear process
  • RR-002 (tested regularly) - testing reveals actual time
  • FF-002 (kill switches) - sometimes faster than rollback, use as complement

Pass criteria:

  • Can roll back production in under 2 minutes
  • No blocking approval gates for emergency rollbacks
  • Process is known and practiced

Fail criteria:

  • Rollback requires a full CI/CD run (> 5 min)
  • Approval gates block emergency rollbacks
  • Nobody knows how long it actually takes
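
If nobody knows the real number, it can be measured during a rollback drill. The sketch below wraps whatever rollback command the team uses and compares the elapsed wall-clock time with the 2-minute target. time_rollback is a hypothetical helper, and the deployment name in the usage comment is a placeholder.

```shell
# time_rollback is a hypothetical helper: runs the command given as arguments,
# then reports elapsed wall-clock seconds against the 2-minute target.
time_rollback() {
  start=$(date +%s)
  "$@"
  status=$?
  end=$(date +%s)
  elapsed=$((end - start))
  if [ "$status" -eq 0 ] && [ "$elapsed" -lt 120 ]; then
    echo "PASS: rollback took ${elapsed}s"
  else
    echo "FAIL: rollback took ${elapsed}s (exit status ${status})"
    return 1
  fi
}

# Usage sketch for Kubernetes (my-app is a placeholder deployment name).
# The rollout status call blocks until the rollout finishes, so the timing
# covers "live" and not just "command accepted":
#   time_rollback sh -c 'kubectl rollout undo deployment/my-app && kubectl rollout status deployment/my-app'
```

This only times the command itself. Time spent deciding, finding the procedure, or waiting for an approval has to be added on top to get the full decision-to-live figure.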

Notes: Kill switches (Section 33) can disable features in seconds, which is faster than any rollback. For critical features, consider: kill switch first (instant), then rollback if needed (< 2 min).

Evidence to capture:

  • Deployment platform and its rollback mechanism
  • Measured or estimated rollback time
  • Any blockers (approval gates, long pipelines)

Section

34. Rollback & Recovery

API & Security