RR-003 critical rollback-strategy
Can roll back in < 2 minutes
Rollback execution time under 2 minutes with no blocking approval gates
Question to ask
"How long did your last bad deploy actually take to reverse?"
Verification guide
Severity: Critical
Speed matters during incidents. A rollback that takes 10+ minutes extends the outage. Target: under 2 minutes from decision to live.
Check automatically:
- Check deployment platform:
# Vercel - instant rollback via dashboard or CLI
grep -E "vercel" package.json vercel.json 2>/dev/null
# Railway - instant rollback via dashboard
grep -E "railway" package.json railway.json 2>/dev/null
# Fly.io - `fly releases rollback`
find . -name "fly.toml" 2>/dev/null
# Kubernetes - `kubectl rollout undo`
find . \( -name "*.yaml" -o -name "*.yml" \) ! -path "*/node_modules/*" -exec grep -l "kind: Deployment" {} + 2>/dev/null
# Check for rollback scripts
find . \( -name "*rollback*" -o -name "*revert*" \) ! -path "*/node_modules/*" ! -path "./.git/*" 2>/dev/null
- Check CI/CD pipeline duration (if rollback goes through CI):
# Inspect the pipeline config (first 100 lines of each workflow file)
head -n 100 .github/workflows/*.yml .github/workflows/*.yaml 2>/dev/null
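Reading the workflow file shows what the pipeline does, but not how long it takes. One way to get a number is to take the start and end timestamps of a recent run and subtract them. The sketch below assumes GNU `date` (the `-d` flag) and uses made-up timestamps. The function name `duration_seconds` is hypothetical. On GitHub Actions, `gh run list --json createdAt,updatedAt` is one possible source of real timestamps.

```shell
#!/usr/bin/env bash
# Compute elapsed seconds between two ISO 8601 timestamps.
# Assumption: GNU date (`date -d`); BSD/macOS date needs different flags.
duration_seconds() {
  local start end
  start=$(date -u -d "$1" +%s) || return 1
  end=$(date -u -d "$2" +%s) || return 1
  echo $(( end - start ))
}

# Example with made-up timestamps for one pipeline run
duration_seconds "2024-01-01T12:00:00Z" "2024-01-01T12:04:30Z"
```

A result above 120 means a rollback routed through this pipeline cannot meet the 2-minute target, whatever else is optimized.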
Rollback speed tiers:
| Speed | Method |
|---|---|
| Instant (< 30 sec) | Platform rollback (Vercel, Railway, Fly.io), `kubectl rollout undo` |
| Fast (< 2 min) | Git revert + fast CI/CD pipeline |
| Slow (> 5 min) | Manual deploy process, long CI pipelines, approval gates |
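The tiers above can be turned into a small classifier for a measured rollback time. This is a minimal sketch. The function name `rollback_tier` is hypothetical, and the "Over target" label for the 2 to 5 minute range is an assumption, since the table does not name that range.

```shell
#!/usr/bin/env bash
# Map a measured rollback time (whole seconds) to a speed tier.
rollback_tier() {
  local secs=$1
  if [ "$secs" -lt 30 ]; then
    echo "Instant"
  elif [ "$secs" -lt 120 ]; then
    echo "Fast"
  elif [ "$secs" -gt 300 ]; then
    echo "Slow"
  else
    # Assumption: the table leaves 2-5 minutes unnamed; it still misses the target.
    echo "Over target"
  fi
}

rollback_tier 90
```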
Ask user:
- "How do you deploy? (Platform with instant rollback vs manual CI/CD)"
- "How long does your CI/CD pipeline take?"
- "Are there approval gates that slow down emergency rollbacks?"
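The approval-gate question can be partly pre-checked before asking. On GitHub Actions, a job that declares an `environment:` may be subject to required reviewers, but those rules live in the repository settings rather than in the YAML. The sketch below therefore only lists candidate jobs to ask about. The function name `find_environment_jobs` is hypothetical, and the default path assumes GitHub Actions.

```shell
#!/usr/bin/env bash
# List workflow lines that declare an `environment:` key.
# These are leads only: whether reviewers are required is set in repo settings.
find_environment_jobs() {
  local dir=${1:-.github/workflows}
  # -r recurse, -n line numbers, -s silence missing-path errors, -E extended regex
  grep -rnsE --include='*.yml' --include='*.yaml' \
    '^[[:space:]]*environment:' "$dir"
}

find_environment_jobs || echo "No environment: keys found (or no workflow directory)"
```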
Cross-reference with:
- RR-001 (documented procedure) - speed comes from clear process
- RR-002 (tested regularly) - testing reveals actual time
- FF-002 (kill switches) - sometimes faster than rollback, use as complement
Pass criteria:
- Can roll back production in under 2 minutes
- No blocking approval gates for emergency rollbacks
- Process is known and practiced
Fail criteria:
- Rollback requires full CI/CD run (> 5 min)
- Approval gates block emergency rollbacks
- Nobody knows how long it actually takes
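To replace "nobody knows" with a measurement, the rollback command can be wrapped in a timer during a drill. This is a minimal sketch. The function name `timed_rollback` is hypothetical, and it measures only command execution, not the time spent deciding to roll back.

```shell
#!/usr/bin/env bash
# Run a rollback command and check it against a time budget.
# Usage: timed_rollback BUDGET_SECONDS COMMAND [ARGS...]
# Returns 0 if under budget, 1 if at or over budget, 2 if the command failed.
timed_rollback() {
  local budget=$1; shift
  local start end elapsed
  start=$(date +%s)
  "$@" || { echo "rollback command failed" >&2; return 2; }
  end=$(date +%s)
  elapsed=$(( end - start ))
  echo "elapsed=${elapsed}s budget=${budget}s"
  [ "$elapsed" -lt "$budget" ]
}

# Example (hypothetical deployment name `web`):
#   timed_rollback 120 sh -c \
#     'kubectl rollout undo deployment/web && kubectl rollout status deployment/web --timeout=120s'
```

On Kubernetes, `kubectl rollout undo` returns before the previous version is serving traffic, so the example chains `kubectl rollout status` to time the rollback through to live.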
Notes: Kill switches (Section 33) can disable features in seconds, which is faster than any rollback. For critical features, consider: kill switch first (instant), then rollback if needed (< 2 min).
Evidence to capture:
- Deployment platform and its rollback mechanism
- Measured or estimated rollback time
- Any blockers (approval gates, long pipelines)