Section 34 · API & Security
Rollback & Recovery
Deployment rollback, database migration rollback, and disaster recovery capabilities
This guide walks you through auditing a project's rollback and disaster recovery capabilities: deployment rollback, database migration rollback, and full recovery from backups.
The Goal: Two-Minute Recovery
When things go wrong, speed matters. Every minute of downtime costs trust. This audit ensures you can undo mistakes fast.
- Documented: Deployment rollback procedures are written down, and any team member can execute them in under 2 minutes
- Migration-aware: Database rollback strategies exist, especially for destructive schema changes
- Full-stack: Disaster recovery procedures cover restoring the entire system from backups
- Tested: Rollback and recovery procedures have been exercised and shown to work
- Objective-driven: Recovery objectives (RTO/RPO) are defined and achievable with current infrastructure
Before You Start
- Identify deployment platform (Vercel, Railway, Fly.io, K8s, custom CI/CD)
- Identify database and migration tool (Prisma, Drizzle, Knex, etc.)
- Identify backup strategy (provider snapshots, S3, PITR)
- Check for existing runbooks (disaster recovery documentation)
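The identification steps above can be partly automated by looking for well-known config files in the repository. The sketch below is only a starting point: the marker lists are assumptions covering some of the tools named above, they are not exhaustive, and `detect_stack` is a hypothetical helper name. Kubernetes and custom CI/CD setups have no single marker file and still need a manual look.

```python
# Assumed marker files; extend these for your own stack.
PLATFORM_MARKERS = {
    "vercel.json": "Vercel",
    "railway.json": "Railway",
    "railway.toml": "Railway",
    "fly.toml": "Fly.io",
}
MIGRATION_MARKERS = {
    "prisma/schema.prisma": "Prisma",
    "drizzle.config.ts": "Drizzle",
    "knexfile.js": "Knex",
}


def detect_stack(repo_files):
    """Return (platforms, migration_tools) found among repo-relative paths."""
    files = set(repo_files)
    platforms = sorted({name for marker, name in PLATFORM_MARKERS.items() if marker in files})
    tools = sorted({name for marker, name in MIGRATION_MARKERS.items() if marker in files})
    return platforms, tools
```

For example, `detect_stack(["fly.toml", "prisma/schema.prisma", "README.md"])` reports Fly.io as the platform and Prisma as the migration tool. An empty result means "not detected", not "not present".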
rollback-strategy
Written procedure for how to roll back a bad deployment
“Where's the rollback doc, and when did you last read it?”
Rollback procedure tested at least quarterly with multiple team members
“When was the last time you actually ran a rollback drill?”
Rollback execution time under 2 minutes with no blocking approval gates
“How long did your last bad deploy actually take to reverse?”
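One way to answer that question with data is to time the rollback during a drill. The sketch below assumes the rollback can be wrapped in a zero-argument callable (for example, a function that shells out to your platform's rollback command, which is not specified here); `time_rollback` is a hypothetical helper name.

```python
import time

ROLLBACK_BUDGET_SECONDS = 120  # the 2-minute target from this audit


def time_rollback(rollback, budget_seconds=ROLLBACK_BUDGET_SECONDS):
    """Run `rollback` (any zero-argument callable) and compare elapsed time to the budget."""
    started = time.monotonic()
    rollback()
    elapsed = time.monotonic() - started
    return {"elapsed_seconds": elapsed, "within_budget": elapsed <= budget_seconds}
```

This only measures the execution itself. Time spent waiting on approvals or finding the right person happens before the callable starts, so record that separately when you run the drill.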
Strategy for rolling back database migrations including destructive changes
“What happens when you deploy with a DROP COLUMN and need to revert?”
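A reverse migration can recreate a dropped column, but not the data that was in it, so destructive statements deserve to be flagged before they ship. The following is a minimal sketch, assuming migrations are available as plain SQL text. The pattern list is an assumption and is not exhaustive, and splitting on `;` is naive (it ignores string literals and comments).

```python
import re

# Assumed, non-exhaustive list of statements that discard data or schema.
DESTRUCTIVE_PATTERNS = [
    r"\bDROP\s+TABLE\b",
    r"\bDROP\s+COLUMN\b",
    r"\bTRUNCATE\b",
]
_DESTRUCTIVE_RE = re.compile("|".join(DESTRUCTIVE_PATTERNS), re.IGNORECASE)


def find_destructive_statements(sql):
    """Return the statements in `sql` that match a destructive pattern."""
    statements = [s.strip() for s in sql.split(";") if s.strip()]
    return [s for s in statements if _DESTRUCTIVE_RE.search(s)]
```

Any migration this flags needs an explicit answer to "how do we get the data back?", such as a backup taken immediately before the migration runs.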
emergency-recovery
Step-by-step procedure to restore full stack from backups
“Who knows the steps to restore prod from scratch — right now, tonight?”
Full recovery drill completed at least annually including database restore
“Have you confirmed your backups actually restore correctly?”
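Confirming a restore means comparing the restored database against the source, not just checking that the restore command exited cleanly. The sketch below illustrates the idea with SQLite from the Python standard library as a stand-in for your real database; `Connection.backup` plays the role of "back up, then restore", and the helper names are hypothetical. Row counts only catch gross failures, so a real drill would also compare checksums or spot-check records.

```python
import sqlite3


def table_row_counts(conn):
    """Map each user table name to its row count."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    )]
    return {name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names}


def restore_matches_source(source, restored):
    """True when both databases have the same tables with the same row counts."""
    return table_row_counts(source) == table_row_counts(restored)


def demo():
    """Build a tiny source database, copy it, and verify the copy."""
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
    source.executemany("INSERT INTO orders (total) VALUES (?)", [(9.5,), (20.0,)])
    source.commit()

    restored = sqlite3.connect(":memory:")
    source.backup(restored)  # stand-in for a real backup and restore cycle
    return restore_matches_source(source, restored)
```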
Maximum acceptable downtime (RTO) defined and achievable with current infrastructure
“How long can your business actually survive prod being completely down?”
Maximum acceptable data loss (RPO) defined, with backup frequency to support it
“How much data could you lose right now before it becomes a crisis?”
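Both recovery objectives reduce to simple arithmetic once the numbers are written down. In the worst case you lose everything written since the last backup, so the backup interval must not exceed the RPO; likewise, the recovery steps measured in a drill must add up to no more than the RTO. The sketch below assumes you supply those durations yourself, and the function names are hypothetical.

```python
from datetime import timedelta


def rpo_supported(backup_interval, rpo):
    """Worst-case data loss equals the backup interval, so it must fit within the RPO."""
    return backup_interval <= rpo


def rto_supported(recovery_step_durations, rto):
    """The measured recovery steps, run one after another, must fit within the RTO."""
    return sum(recovery_step_durations, timedelta()) <= rto
```

For example, nightly snapshots cannot support a one-hour RPO: `rpo_supported(timedelta(hours=24), timedelta(hours=1))` is false. With point-in-time recovery, the effective interval is typically the log archiving lag rather than the snapshot schedule.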