AS
AgSkills.dev
MARKETPLACE

incident-runbook-templates

Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use when building runbooks, responding to incidents, or establishing incident response procedures.

29.3k
3.2k

Preview

SKILL.md
name
incident-runbook-templates
description
Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use when building runbooks, responding to incidents, or establishing incident response procedures.

Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

When to Use This Skill

  • Creating incident response procedures
  • Building service-specific runbooks
  • Establishing escalation paths
  • Documenting recovery procedures
  • Responding to active incidents
  • Onboarding on-call engineers

Core Concepts

1. Incident Severity Levels

SeverityImpactResponse TimeExample
SEV1Complete outage, data loss15 minProduction down
SEV2Major degradation30 minCritical feature broken
SEV3Minor impact2 hoursNon-critical bug
SEV4Minimal impactNext business dayCosmetic issue

2. Runbook Structure

1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix

Runbook Templates

Template 1: Service Outage Runbook

# [Service Name] Outage Runbook ## Overview **Service**: Payment Processing Service **Owner**: Platform Team **Slack**: #payments-incidents **PagerDuty**: payments-oncall ## Impact Assessment - [ ] Which customers are affected? - [ ] What percentage of traffic is impacted? - [ ] Are there financial implications? - [ ] What's the blast radius? ## Detection ### Alerts - `payment_error_rate > 5%` (PagerDuty) - `payment_latency_p99 > 2s` (Slack) - `payment_success_rate < 95%` (PagerDuty) ### Dashboards - [Payment Service Dashboard](https://grafana/d/payments) - [Error Tracking](https://sentry.io/payments) - [Dependency Status](https://status.stripe.com) ## Initial Triage (First 5 Minutes) ### 1. Assess Scope ```bash # Check service health kubectl get pods -n payments -l app=payment-service # Check recent deployments kubectl rollout history deployment/payment-service -n payments # Check error rates curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))" ```

2. Quick Health Checks

  • Can you reach the service? curl -I https://api.company.com/payments/health
  • Database connectivity? Check connection pool metrics
  • External dependencies? Check Stripe, bank API status
  • Recent changes? Check deploy history

3. Initial Classification

SymptomLikely CauseGo To Section
All requests failingService downSection 4.1
High latencyDatabase/dependencySection 4.2
Partial failuresCode bugSection 4.3
Spike in errorsTraffic surgeSection 4.4

Mitigation Procedures

4.1 Service Completely Down

# Step 1: Check pod status kubectl get pods -n payments # Step 2: If pods are crash-looping, check logs kubectl logs -n payments -l app=payment-service --tail=100 # Step 3: Check recent deployments kubectl rollout history deployment/payment-service -n payments # Step 4: ROLLBACK if recent deploy is suspect kubectl rollout undo deployment/payment-service -n payments # Step 5: Scale up if resource constrained kubectl scale deployment/payment-service -n payments --replicas=10 # Step 6: Verify recovery kubectl rollout status deployment/payment-service -n payments

4.2 High Latency

# Step 1: Check database connections kubectl exec -n payments deploy/payment-service -- \ curl localhost:8080/metrics | grep db_pool # Step 2: Check slow queries (if DB issue) psql -h $DB_HOST -U $DB_USER -c " SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND duration > interval '5 seconds' ORDER BY duration DESC;" # Step 3: Kill long-running queries if needed psql -h $DB_HOST -U $DB_USER -c "SELECT pg_terminate_backend(pid);" # Step 4: Check external dependency latency curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health # Step 5: Enable circuit breaker if dependency is slow kubectl set env deployment/payment-service \ STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments

4.3 Partial Failures (Specific Errors)

# Step 1: Identify error pattern kubectl logs -n payments -l app=payment-service --tail=500 | \ grep -i error | sort | uniq -c | sort -rn | head -20 # Step 2: Check error tracking # Go to Sentry: https://sentry.io/payments # Step 3: If specific endpoint, enable feature flag to disable curl -X POST https://api.company.com/internal/feature-flags \ -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}' # Step 4: If data issue, check recent data changes psql -h $DB_HOST -c " SELECT * FROM audit_log WHERE table_name = 'payment_methods' AND created_at > now() - interval '1 hour';"

4.4 Traffic Surge

# Step 1: Check current request rate kubectl top pods -n payments # Step 2: Scale horizontally kubectl scale deployment/payment-service -n payments --replicas=20 # Step 3: Enable rate limiting kubectl set env deployment/payment-service \ RATE_LIMIT_ENABLED=true \ RATE_LIMIT_RPS=1000 -n payments # Step 4: If attack, block suspicious IPs kubectl apply -f - <<EOF apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: block-suspicious namespace: payments spec: podSelector: matchLabels: app: payment-service ingress: - from: - ipBlock: cidr: 0.0.0.0/0 except: - 192.168.1.0/24 # Suspicious range EOF

Verification Steps

# Verify service is healthy curl -s https://api.company.com/payments/health | jq # Verify error rate is back to normal curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]' # Verify latency is acceptable curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq # Smoke test critical flows ./scripts/smoke-test-payments.sh

Rollback Procedures

# Rollback Kubernetes deployment kubectl rollout undo deployment/payment-service -n payments # Rollback database migration (if applicable) ./scripts/db-rollback.sh $MIGRATION_VERSION # Rollback feature flag curl -X POST https://api.company.com/internal/feature-flags \ -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'

Escalation Matrix

ConditionEscalate ToContact
> 15 min unresolved SEV1Engineering Manager@manager (Slack)
Data breach suspectedSecurity Team#security-incidents
Financial impact > $10kFinance + Legal@finance-oncall
Customer communication neededSupport Lead@support-lead

Communication Templates

Initial Notification (Internal)

🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards

Updates in #payments-incidents

Status Update

πŸ“Š UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:
- Rolled back deployment v2.3.4 β†’ v2.3.3
- Scaled service from 5 β†’ 10 replicas

Next Steps:
- Continuing to monitor
- Root cause analysis in progress

ETA to Resolution: ~15 minutes

Resolution Notification

βœ… RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4

Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully

Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress

### Template 2: Database Incident Runbook

```markdown
# Database Incident Runbook

## Quick Reference
| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(pid);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |

## Connection Pool Exhaustion
```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate idle connections
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND query_start < now() - interval '10 minutes';

Replication Lag

-- Check lag on replica SELECT CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0 ELSE extract(epoch from now() - pg_last_xact_replay_timestamp()) END AS lag_seconds; -- If lag > 60s, consider: -- 1. Check network between primary/replica -- 2. Check replica disk I/O -- 3. Consider failover if unrecoverable

Disk Space Critical

# Check disk usage df -h /var/lib/postgresql/data # Find large tables psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 10;" # VACUUM to reclaim space psql -c "VACUUM FULL large_table;" # If emergency, delete old data or expand disk

## Best Practices

### Do's
- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress

### Don'ts
- **Don't assume knowledge** - Write for 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident

## Resources

- [Google SRE Book - Incident Management](https://sre.google/sre-book/managing-incidents/)
- [PagerDuty Incident Response](https://response.pagerduty.com/)
- [Atlassian Incident Management](https://www.atlassian.com/incident-management)
GitHub Repository
wshobson/agents
Stars
29,317
Forks
3,212
Open Repository
Install Skill
Download ZIP1 files