Rollback & Recovery | CTO Checklist

This guide walks you through auditing a project's rollback and disaster recovery capabilities: deployment rollback, database migration rollback, and full recovery from backups.

The Goal: Two-Minute Recovery

When things go wrong, speed matters. Every minute of downtime costs trust. This audit ensures you can undo mistakes fast.

Documented — Deployment rollback procedures can be executed quickly (under 2 minutes) by any team member
Migration-aware — Database rollback strategies exist, especially for destructive schema changes
Full-stack — Disaster recovery procedures cover restoring the entire system from backups
Tested — Rollback and recovery procedures have been validated and actually work
Objective-driven — Recovery objectives (RTO/RPO) are defined and achievable with current infrastructure

Before You Start

Identify deployment platform (Vercel, Railway, Fly.io, K8s, custom CI/CD)
Identify database and migration tool (Prisma, Drizzle, Knex, etc.)
Identify backup strategy (provider snapshots, S3, PITR)
Check for existing runbooks (disaster recovery documentation)

Rollback Strategy

RR-001

Rollback procedure documented (critical)

Written procedure for how to roll back a bad deployment

“Where's the rollback doc, and when did you last read it?”

RR-002

Rollback tested regularly (recommended)

Rollback procedure tested at least quarterly with multiple team members

“When was the last time you actually ran a rollback drill?”
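One way to make the quarterly cadence checkable is to record the date of each drill and flag when the most recent one is too old. The following is a minimal sketch, not a prescribed tool. It assumes "quarterly" means at most 90 days between drills, and the function name and the list of drill dates are hypothetical.

```python
from datetime import date, timedelta

# Assumption: "quarterly" is read as at most 90 days between drills.
MAX_DRILL_AGE = timedelta(days=90)


def drill_is_overdue(drill_dates, today, max_age=MAX_DRILL_AGE):
    """Return True if no drill was run within max_age of today."""
    if not drill_dates:
        return True  # a procedure that was never drilled counts as overdue
    return today - max(drill_dates) > max_age


# Hypothetical drill log: the last drill was on 1 February.
drill_log = [date(2024, 1, 1), date(2024, 2, 1)]
overdue = drill_is_overdue(drill_log, today=date(2024, 6, 1))
```

Recording who ran each drill alongside the date would also let you check the "multiple team members" part of this item.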

RR-003

Can roll back in < 2 minutes (critical)

Rollback execution time under 2 minutes with no blocking approval gates

“How long did your last bad deploy actually take to reverse?”
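To answer that question with a number rather than a guess, a drill can time the rollback command itself. The sketch below is illustrative only: the actual rollback command depends on your platform and is not prescribed here, so a harmless stand-in command is used, and the function name is hypothetical.

```python
import subprocess
import sys
import time

ROLLBACK_BUDGET_SECONDS = 120  # the 2-minute target from this checklist

# Assumption: a placeholder that does nothing. Replace it with whatever
# command your platform actually uses to roll back a deployment.
STAND_IN_COMMAND = [sys.executable, "-c", "pass"]


def timed_rollback(command, budget=ROLLBACK_BUDGET_SECONDS):
    """Run a rollback command and report whether it finished within budget."""
    start = time.monotonic()
    result = subprocess.run(command, check=False)
    elapsed = time.monotonic() - start
    succeeded = result.returncode == 0
    return {
        "succeeded": succeeded,
        "elapsed_seconds": elapsed,
        # A fast failure does not count as meeting the target.
        "within_budget": succeeded and elapsed <= budget,
    }
```

This measures only the command's own runtime. Time spent finding the runbook, getting access, or waiting on an approval gate has to be measured separately during the drill.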

RR-004

Database migration rollback plan (critical)

Strategy for rolling back database migrations, including destructive changes

“What happens when you deploy with a DROP COLUMN and need to revert?”
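A destructive statement such as DROP COLUMN discards data, so reversing the schema change alone does not bring the data back. As a rough audit aid, migration files can be scanned for such statements before they ship. The sketch below is an assumption-laden illustration: the pattern list is deliberately incomplete, the statement splitting is naive, and the function name is hypothetical.

```python
import re

# Assumption: an illustrative, incomplete list of statements that discard
# data. Dialect-specific forms (for example, dropping a column without the
# COLUMN keyword) are not covered.
DESTRUCTIVE_PATTERNS = [
    re.compile(r"\bDROP\s+COLUMN\b", re.IGNORECASE),
    re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
    re.compile(r"\bTRUNCATE\b", re.IGNORECASE),
]


def find_destructive_statements(migration_sql):
    """Return the statements in migration_sql that match a destructive pattern."""
    # Naive split on semicolons; it ignores string literals and comments.
    statements = [s.strip() for s in migration_sql.split(";") if s.strip()]
    return [
        s for s in statements
        if any(p.search(s) for p in DESTRUCTIVE_PATTERNS)
    ]


example_sql = (
    "ALTER TABLE users ADD COLUMN nickname TEXT; "
    "ALTER TABLE users DROP COLUMN legacy_name;"
)
flagged = find_destructive_statements(example_sql)
```

A flagged migration is a prompt to confirm that a rollback plan exists for it, for instance a backup taken immediately beforehand, or splitting the change so the column is dropped only in a later deploy once nothing reads it.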

Emergency Recovery

RR-005

Recovery from backups documented (critical)

Step-by-step procedure to restore the full stack from backups

“Who knows the steps to restore prod from scratch — right now, tonight?”

RR-006

Recovery procedure tested (critical)

Full recovery drill completed at least annually, including a database restore

“Have you confirmed your backups actually restore correctly?”
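Confirming that a backup restores correctly means comparing the restored database against the source, not just checking that the restore command exited cleanly. The sketch below illustrates the idea with SQLite from the Python standard library as a stand-in for a production database. The table, the data, and the function names are hypothetical, and comparing row counts is only a first-pass check.

```python
import sqlite3


def row_counts(conn):
    """Map each user table name to its row count."""
    tables = [
        name for (name,) in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    return {
        t: conn.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0]
        for t in tables
    }


def restore_matches_source(source, restored):
    """First-pass check: same tables and the same number of rows in each."""
    return row_counts(source) == row_counts(restored)


def demo_backup_and_verify():
    """Create a tiny database, copy it, and verify the copy."""
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
    source.executemany("INSERT INTO orders (total) VALUES (?)", [(9.5,), (20.0,)])
    source.commit()

    restored = sqlite3.connect(":memory:")
    source.backup(restored)  # stand-in for restoring from a real backup
    return source, restored
```

In a real drill the restored copy would come from the actual backup artifact, restored into an isolated environment, and the comparison would ideally go beyond counts to checksums or application-level smoke tests.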

RR-007

Know RTO (Recovery Time Objective) (recommended)

Maximum acceptable downtime defined and achievable with current infrastructure

“How long can your business actually survive prod being completely down?”
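Whether an RTO is achievable can be sanity-checked with simple arithmetic: the time to restore the data plus the fixed overhead around it has to fit inside the objective. The sketch below assumes a deliberately simple model (restore time proportional to backup size, plus a fixed overhead for detection, provisioning, and verification). All figures and function names are hypothetical, and real numbers should come from a timed recovery drill.

```python
def estimated_recovery_minutes(backup_size_gb, restore_gb_per_minute,
                               fixed_overhead_minutes):
    """Estimate total recovery time under a simple linear restore model."""
    restore_minutes = backup_size_gb / restore_gb_per_minute
    return fixed_overhead_minutes + restore_minutes


def rto_is_achievable(rto_minutes, backup_size_gb, restore_gb_per_minute,
                      fixed_overhead_minutes):
    """Return True if the estimated recovery time fits within the RTO."""
    estimate = estimated_recovery_minutes(
        backup_size_gb, restore_gb_per_minute, fixed_overhead_minutes
    )
    return estimate <= rto_minutes


# Hypothetical figures: a 120 GB backup restoring at 2 GB per minute,
# with 30 minutes of overhead, gives 30 + 120 / 2 = 90 minutes.
estimate = estimated_recovery_minutes(120, 2, 30)
```

With these example figures, a 60-minute RTO would not be achievable, while a 120-minute RTO would.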

RR-008

Know RPO (Recovery Point Objective) (recommended)

Maximum acceptable data loss defined, with a backup frequency that supports it

“How much data could you lose right now before it becomes a crisis?”
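The link between RPO and backup frequency can be made explicit. With periodic backups, the worst case is a failure just before the next backup completes, so the data at risk spans roughly one backup interval plus the time a backup takes to finish. The sketch below encodes that simplifying assumption; it does not model point-in-time recovery, where the exposure depends on log shipping lag instead. The function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval, backup_duration=timedelta(0)):
    """Worst-case window of lost data for periodic backups.

    Assumption: each backup captures the state at its start time and is
    usable only once it has completed.
    """
    return backup_interval + backup_duration


def rpo_is_supported(rpo, backup_interval, backup_duration=timedelta(0)):
    """Return True if the backup schedule keeps worst-case loss within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


# Hypothetical example: a 1-hour RPO cannot be met by nightly backups.
nightly_ok = rpo_is_supported(timedelta(hours=1), timedelta(hours=24))
```

Under the same assumption, backups every 15 minutes that each take 5 minutes to complete would support a 1-hour RPO.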