RES-001 critical third-party-services

Third-party service resilience

The app handles external service failures gracefully: the process does not crash, unrelated requests keep working, and affected features return errors instead of taking the process down.

Question to ask

"Stripe goes down for an hour — does your whole app crash?"

Verification guide

Severity: Critical

The app should not crash when external services fail. Other requests should continue working. 5xx responses are acceptable for affected features, but the app must stay up.
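
As a minimal TypeScript sketch of that behavior, an external call can be wrapped so that a failure or a hang becomes an error result for that one feature rather than an unhandled rejection. The names `withTimeout` and `callExternal`, the 2000 ms default, and the 503 result shape are illustrative assumptions, not taken from any audited codebase.

```typescript
// Hypothetical helper: reject if the call has not settled within `ms`.
// Note: this stops waiting, it does not cancel the underlying request.
function withTimeout<T>(call: () => Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    Promise.resolve()
      .then(call) // also turns a synchronous throw inside `call` into a rejection
      .then(resolve, reject)
      .finally(() => clearTimeout(timer));
  });
}

// Assumed result shape: success, or a 503 for the affected feature only.
type Result<T> =
  | { status: 200; body: T }
  | { status: 503; body: { error: string } };

// Hypothetical wrapper: every failure mode of the external call ends up
// as a 503 result that the route handler can send back to the client.
async function callExternal<T>(
  service: string,
  call: () => Promise<T>,
  timeoutMs = 2000,
): Promise<Result<T>> {
  try {
    return { status: 200, body: await withTimeout(call, timeoutMs) };
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    console.warn(`[${service}] call failed: ${detail}`);
    return { status: 503, body: { error: `${service} is temporarily unavailable` } };
  }
}
```

A route handler would pass its own client call, for example `callExternal("payments", () => client.charge(order))` where `client.charge` stands in for whatever SDK call the app really makes, and send back the returned status and body.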

Step 1: Identify external dependencies

# Find HTTP clients and external service calls (skipping installed packages)
grep -rlE "(axios|fetch|got|request|node-fetch)" --include="*.ts" --include="*.js" --exclude-dir=node_modules .

# Find service URLs in env configuration
grep -E "(API_URL|SERVICE_URL|ENDPOINT|_HOST|_URI|_URL)" .env.example .env 2>/dev/null

# Find connection strings (Redis, queues, external databases)
grep -rlE "(rediss?://|amqps?://|mongodb(\+srv)?://|postgres(ql)?://|mysql://)" --include="*.ts" --include="*.js" --exclude-dir=node_modules .

# List the services defined in Docker Compose
docker compose config --services 2>/dev/null

Step 2: Check for error handling patterns

# Look for try/catch blocks, then confirm they wrap the external calls found in Step 1
grep -rlE "try\s*\{" --include="*.ts" --include="*.js" --exclude-dir=node_modules . | head -10

# Look for .catch() on promises
grep -rlE "\.catch\s*\(" --include="*.ts" --include="*.js" --exclude-dir=node_modules . | head -10

# Look for timeout configurations
grep -rE "(timeout:|timeout =|setTimeout|AbortController|AbortSignal)" --include="*.ts" --include="*.js" --exclude-dir=node_modules . | head -10

# Look for circuit breaker patterns (opossum, cockatiel, etc.)
grep -rE "(circuit|CircuitBreaker|opossum|cockatiel)" --include="*.ts" --include="*.js" --exclude-dir=node_modules .
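
For reviewers unsure what a circuit breaker looks like in code, here is a deliberately small hand-rolled sketch. It is not the API of opossum or cockatiel. The class name `SimpleBreaker`, its constructor parameters, and the injectable clock are assumptions made for illustration.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Hypothetical minimal breaker: after `failureThreshold` consecutive failures
// it rejects calls immediately for `resetAfterMs`, then lets a trial call through.
class SimpleBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";

  constructor(
    private readonly failureThreshold: number,
    private readonly resetAfterMs: number,
    private readonly now: () => number = Date.now, // injectable clock for tests
  ) {}

  get currentState(): BreakerState {
    return this.state;
  }

  async exec<T>(call: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.resetAfterMs) {
        // Fail fast without touching the struggling service.
        throw new Error("circuit open: call skipped");
      }
      this.state = "half-open"; // cool-down elapsed: allow a trial call
    }
    try {
      const value = await call();
      this.failures = 0;
      this.state = "closed";
      return value;
    } catch (err) {
      this.failures += 1;
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

This sketch does not limit concurrent trial calls in the half-open state, which a production library would normally handle. When auditing, the presence of something with this shape, or of one of the libraries named above, is the signal to look for.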

Step 3: Test startup resilience

For each non-critical dependency identified:

# Start app with broken service URL
REDIS_URL=redis://localhost:9999 npm run dev

# Or start without Docker Compose dependencies
docker compose stop redis  # Stop just redis
npm run dev                 # See if app starts

Questions to consider:

  • Does the app start?
  • Does it log a warning about the unavailable service?
  • Or does it crash with an unhandled error?
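
The startup behavior these questions probe for can be sketched as follows. The helper names `connectOptional`, `connectRequired`, and `startApp`, and the idea of passing connectors in as functions, are hypothetical and only illustrate the critical-versus-optional split.

```typescript
// Hypothetical startup helpers; the caller supplies the connect functions.
type Connector<C> = () => Promise<C>;

function errorMessage(err: unknown): string {
  return err instanceof Error ? err.message : String(err);
}

// Optional dependency: log a warning and carry on with `null`.
async function connectOptional<C>(name: string, connect: Connector<C>): Promise<C | null> {
  try {
    return await connect();
  } catch (err) {
    console.warn(
      `[startup] optional dependency "${name}" unavailable, continuing without it: ${errorMessage(err)}`,
    );
    return null;
  }
}

// Critical dependency: fail startup with a clear message.
async function connectRequired<C>(name: string, connect: Connector<C>): Promise<C> {
  try {
    return await connect();
  } catch (err) {
    throw new Error(`[startup] required dependency "${name}" unavailable: ${errorMessage(err)}`);
  }
}

// Example wiring with stand-in connectors (assumed, not from a real app).
async function startApp(connectDb: Connector<string>, connectCache: Connector<string>) {
  const db = await connectRequired("database", connectDb); // failure aborts startup
  const cache = await connectOptional("cache", connectCache); // failure only disables caching
  return { db, cacheEnabled: cache !== null };
}
```

A connector that never settles would still block startup, so in a real app it probably needs a timeout as well.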

Step 4: Test runtime resilience

With the app running:

# Kill a non-critical dependency mid-flight
docker compose stop redis

# Hit an unrelated endpoint - should still work
curl -s http://localhost:3000/api/users | head -c 200

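# Hit an endpoint that uses the failed service - should return an error status promptly, not hang
curl -s -o /dev/null -w "%{http_code}\n" --max-time 10 http://localhost:3000/api/cached-data

# Check the process is still alive (rough check: counts every process with "node" in its command line)
pgrep -f "node" | wc -l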

Step 5: Verify process stability

# Check for process-level handlers for uncaught exceptions and unhandled promise rejections
grep -rE "(uncaughtException|unhandledRejection)" --include="*.ts" --include="*.js" --exclude-dir=node_modules .

# Look for process exit calls that might be triggered on errors
grep -rE "process\.exit" --include="*.ts" --include="*.js" --exclude-dir=node_modules .
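
For context on what the first grep is looking for: since Node.js 15 an unhandled promise rejection terminates the process by default unless an `unhandledRejection` listener is registered. The sketch below shows one possible pair of process-level handlers. The function name `installProcessGuards` and its logging callback are assumptions, and whether to keep running after an unhandled rejection is a policy choice for the team.

```typescript
// Hypothetical process-level guards, installed once at startup.
function installProcessGuards(
  log: (message: string, detail: unknown) => void = console.error,
): void {
  // A missed .catch() on an external call lands here. Logging it instead of
  // crashing keeps unrelated requests alive while the bug gets fixed.
  process.on("unhandledRejection", (reason) => {
    log("unhandled promise rejection (process kept alive)", reason);
  });

  // A truly uncaught exception leaves the process in an unknown state, so this
  // sketch logs it and exits, leaving the restart to a process manager.
  process.on("uncaughtException", (err) => {
    log("uncaught exception, exiting", err);
    process.exit(1);
  });
}
```

These handlers are a safety net, not a substitute for error handling around each external call. A `process.exit` found by the second grep deserves scrutiny when it sits on a path that a transient external failure can reach.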

Cross-reference with:

  • Section 7 (Health Endpoints) - deep health endpoint should report which services are down
  • Section 5 (Database) - database is typically critical and may require different handling

Pass criteria:

  • App starts even when non-critical services are unavailable
  • Process remains running after external service errors occur
  • Unaffected endpoints continue responding normally
  • Affected endpoints return proper error responses (typically 5xx, such as 503), not process crashes
  • Error handling exists around external service calls (try/catch, .catch(), timeouts)

Fail criteria:

  • App refuses to start because an optional service is down
  • Unhandled promise rejections or exceptions crash the process
  • One failing service causes unrelated endpoints to fail
  • No error handling around external service calls
  • Process exits on transient external failures

If unclear, ask user:

  • "Which of these dependencies are critical (the app cannot function without them) vs optional (the app can degrade gracefully)?"
  • "Is there documentation of external service dependencies?"
  • "What's the expected behavior when [service X] is unavailable?"

Evidence to capture:

  • List of external dependencies discovered (URLs, connection strings)
  • Classification: critical vs optional for each dependency
  • Error handling patterns found (or missing)
  • Startup behavior with each non-critical dependency unavailable
  • Runtime behavior when dependency fails mid-operation
  • Whether process stayed alive through all tests

Section

06. Resilience

Database & Data