
posthog-incident-runbook

Execute PostHog incident response procedures with triage, mitigation, and postmortem. Use when responding to PostHog-related outages, investigating errors, or running post-incident reviews for PostHog integration failures. Trigger with phrases like "posthog incident", "posthog outage", "posthog down", "posthog on-call", "posthog emergency", "posthog broken".

by jeremylongshore


SKILL.md
Metadata

```yaml
name: posthog-incident-runbook
description: |
  Execute PostHog incident response procedures with triage, mitigation, and
  postmortem. Use when responding to PostHog-related outages, investigating
  errors, or running post-incident reviews for PostHog integration failures.
  Trigger with phrases like "posthog incident", "posthog outage", "posthog down",
  "posthog on-call", "posthog emergency", "posthog broken".
allowed-tools: Read, Grep, Bash(kubectl:*), Bash(curl:*)
version: 1.0.0
license: MIT
author: Jeremy Longshore <[email protected]>
```

PostHog Incident Runbook

Overview

Rapid incident response procedures for PostHog-related outages.

Prerequisites

  • Access to PostHog dashboard and status page
  • kubectl access to production cluster
  • Prometheus/Grafana access
  • Communication channels (Slack, PagerDuty)

Severity Levels

| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| P1 | Complete outage | < 15 min | PostHog API unreachable |
| P2 | Degraded service | < 1 hour | High latency, partial failures |
| P3 | Minor impact | < 4 hours | Webhook delays, non-critical errors |
| P4 | No user impact | Next business day | Monitoring gaps |
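
As a rough illustration, the severity ladder can be wired into triage tooling. The numeric error-rate cutoffs below are hypothetical placeholders, not part of this runbook; tune them to your traffic:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: map a Prometheus error rate (errors/sec) to a
# severity level. The cutoffs 10 and 1 are illustrative assumptions.
severity_for_rate() {
  awk -v r="$1" 'BEGIN {
    if (r >= 10)     print "P1"   # treat as complete outage
    else if (r >= 1) print "P2"   # degraded service
    else if (r > 0)  print "P3"   # minor impact
    else             print "P4"   # no user impact
  }'
}

severity_for_rate 0.2   # → P3
```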

Quick Triage

```shell
# 1. Check PostHog status
curl -s https://status.posthog.com | jq

# 2. Check our integration health
curl -s https://api.yourapp.com/health | jq '.services.posthog'

# 3. Check error rate (last 5 min) — quote the URL so the shell
# does not glob the [5m] range selector
curl -s 'localhost:9090/api/v1/query?query=rate(posthog_errors_total[5m])'

# 4. Recent error logs
kubectl logs -l app=posthog-integration --since=5m | grep -i error | tail -20
```

Decision Tree

PostHog API returning errors?
├─ YES: Is status.posthog.com showing incident?
│   ├─ YES → Wait for PostHog to resolve. Enable fallback.
│   └─ NO → Our integration issue. Check credentials, config.
└─ NO: Is our service healthy?
    ├─ YES → Likely resolved or intermittent. Monitor.
    └─ NO → Our infrastructure issue. Check pods, memory, network.
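
The tree above can be sketched as a small helper that turns the three yes/no answers from Quick Triage into a verdict; the function name and output strings are illustrative, not part of the runbook:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the decision tree. Arguments are yes/no answers to:
#   1) Is the PostHog API returning errors?
#   2) Is status.posthog.com showing an incident?
#   3) Is our own service healthy?
triage_verdict() {
  local api_errors=$1 upstream_incident=$2 service_healthy=$3
  if [ "$api_errors" = "yes" ]; then
    if [ "$upstream_incident" = "yes" ]; then
      echo "upstream: wait for PostHog; enable fallback"
    else
      echo "integration: check credentials and config"
    fi
  elif [ "$service_healthy" = "yes" ]; then
    echo "monitor: likely resolved or intermittent"
  else
    echo "infrastructure: check pods, memory, network"
  fi
}

triage_verdict yes no no   # → integration: check credentials and config
```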

Immediate Actions by Error Type

401/403 - Authentication

```shell
# Verify API key is set
kubectl get secret posthog-secrets -o jsonpath='{.data.api-key}' | base64 -d

# Check if key was rotated → verify in PostHog dashboard

# Remediation: update secret and restart pods
kubectl create secret generic posthog-secrets \
  --from-literal=api-key=NEW_KEY \
  --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/posthog-integration
```

429 - Rate Limited

```shell
# Check rate limit headers
curl -v https://api.posthog.com 2>&1 | grep -i rate

# Enable request queuing
kubectl set env deployment/posthog-integration RATE_LIMIT_MODE=queue

# Long-term: contact PostHog for a limit increase
```
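
When queuing is not available, a client-side exponential backoff on 429 responses is a common stopgap. A minimal sketch of the delay schedule (the function name and the 60-second cap are assumptions, not integration behavior):

```shell
#!/usr/bin/env bash
# Hypothetical sketch: exponential backoff delay (in seconds) for retry
# attempt N, 1-indexed, capped at 60 s.
backoff_delay() {
  local n=$1
  local d=$(( 1 << (n - 1) ))    # 1, 2, 4, 8, ...
  [ "$d" -gt 60 ] && d=60
  echo "$d"
}

# A retry loop would sleep "$(backoff_delay "$attempt")" between attempts
# and stop retrying once the response code is no longer 429.
backoff_delay 4   # → 8
```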

500/503 - PostHog Errors

```shell
# Enable graceful degradation
kubectl set env deployment/posthog-integration POSTHOG_FALLBACK=true

# Notify users of degraded service
# Update status page
# Monitor PostHog status for resolution
```
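
What `POSTHOG_FALLBACK=true` does depends on your integration; one common pattern is spooling events to disk for later replay. A minimal sketch under that assumption (`capture_event`, `SPOOL_FILE`, and the status strings are hypothetical names):

```shell
#!/usr/bin/env bash
# Hypothetical sketch: when fallback mode is on, append events to a local
# NDJSON spool instead of calling PostHog; replay the spool after recovery.
capture_event() {
  local event_json=$1
  local spool=${SPOOL_FILE:-/tmp/posthog-spool.ndjson}
  if [ "${POSTHOG_FALLBACK:-false}" = "true" ]; then
    printf '%s\n' "$event_json" >> "$spool"
    echo "spooled"
  else
    # A real integration would POST to the PostHog capture endpoint here.
    echo "sent"
  fi
}
```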

Communication Templates

Internal (Slack)

🔴 P1 INCIDENT: PostHog Integration
Status: INVESTIGATING
Impact: [Describe user impact]
Current action: [What you're doing]
Next update: [Time]
Incident commander: @[name]

External (Status Page)

PostHog Integration Issue

We're experiencing issues with our PostHog integration.
Some users may experience [specific impact].

We're actively investigating and will provide updates.

Last updated: [timestamp]

Post-Incident

Evidence Collection

```shell
# Generate debug bundle
./scripts/posthog-debug-bundle.sh

# Export relevant logs
kubectl logs -l app=posthog-integration --since=1h > incident-logs.txt

# Capture metrics for the incident window (query_range requires unix
# start/end timestamps and a step, not a relative duration)
curl -s "localhost:9090/api/v1/query_range?query=posthog_errors_total&start=$(date -d '2 hours ago' +%s)&end=$(date +%s)&step=60" > metrics.json
```

Postmortem Template

```markdown
## Incident: PostHog [Error Type]

**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P[1-4]

### Summary
[1-2 sentence description]

### Timeline
- HH:MM - [Event]
- HH:MM - [Event]

### Root Cause
[Technical explanation]

### Impact
- Users affected: N
- Revenue impact: $X

### Action Items
- [ ] [Preventive measure] - Owner - Due date
```
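
For the template's Duration field, a tiny helper can turn two same-day HH:MM timeline entries into the "X hours Y minutes" form (the function name is illustrative):

```shell
#!/usr/bin/env bash
# Hypothetical sketch: compute postmortem duration from same-day HH:MM
# start/end times. 10# forces base 10 so "09" is not parsed as octal.
incident_duration() {
  local start=$1 end=$2
  local s=$(( 10#${start%:*} * 60 + 10#${start#*:} ))
  local e=$(( 10#${end%:*} * 60 + 10#${end#*:} ))
  local mins=$(( e - s ))
  echo "$(( mins / 60 )) hours $(( mins % 60 )) minutes"
}

incident_duration 09:05 11:35   # → 2 hours 30 minutes
```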

Instructions

Step 1: Quick Triage

Run the triage commands to identify the issue source.

Step 2: Follow Decision Tree

Determine if the issue is PostHog-side or internal.

Step 3: Execute Immediate Actions

Apply the appropriate remediation for the error type.

Step 4: Communicate Status

Update internal and external stakeholders.

Output

  • Issue identified and categorized
  • Remediation applied
  • Stakeholders notified
  • Evidence collected for postmortem

Error Handling

| Issue | Cause | Solution |
|-------|-------|----------|
| Can't reach status page | Network issue | Use mobile or VPN |
| kubectl fails | Auth expired | Re-authenticate |
| Metrics unavailable | Prometheus down | Check backup metrics |
| Secret rotation fails | Permission denied | Escalate to admin |

Examples

One-Line Health Check

```shell
curl -sf https://api.yourapp.com/health | jq '.services.posthog.status' || echo "UNHEALTHY"
```
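
To wire the one-liner into a cron alert, the raw status string can be mapped to a binary verdict. The status values checked here ("ok", "healthy", "up") are assumptions about your health endpoint's schema:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: map the health endpoint's status string to UP/DOWN.
health_verdict() {
  case $1 in
    ok|healthy|up) echo "UP" ;;
    *)             echo "DOWN" ;;
  esac
}

# Usage with the one-line check:
#   health_verdict "$(curl -sf https://api.yourapp.com/health \
#     | jq -r '.services.posthog.status')"
```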

Next Steps

For data handling, see posthog-data-handling.