AS
AgSkills.dev
MARKETPLACE

slo-implementation

Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use when establishing reliability targets, implementing SRE practices, or measuring service performance.

28.2k
3.1k

Preview

SKILL.md
name
slo-implementation
description
Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use when establishing reliability targets, implementing SRE practices, or measuring service performance.

SLO Implementation

Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.

Purpose

Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.

When to Use

  • Define service reliability targets
  • Measure user-perceived reliability
  • Implement error budgets
  • Create SLO-based alerts
  • Track reliability goals

SLI/SLO/SLA Hierarchy

SLA (Service Level Agreement)
  ↓ Contract with customers
SLO (Service Level Objective)
  ↓ Internal reliability target
SLI (Service Level Indicator)
  ↓ Actual measurement

Defining SLIs

Common SLI Types

1. Availability SLI

# Successful requests / Total requests sum(rate(http_requests_total{status!~"5.."}[28d])) / sum(rate(http_requests_total[28d]))

2. Latency SLI

# Requests below latency threshold / Total requests sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d])) / sum(rate(http_request_duration_seconds_count[28d]))

3. Durability SLI

# Successful writes / Total writes
sum(storage_writes_successful_total)
/
sum(storage_writes_total)

Reference: See references/slo-definitions.md

Setting SLO Targets

Availability SLO Examples

SLO %Downtime/MonthDowntime/Year
99%7.2 hours3.65 days
99.9%43.2 minutes8.76 hours
99.95%21.6 minutes4.38 hours
99.99%4.32 minutes52.56 minutes

Choose Appropriate SLOs

Consider:

  • User expectations
  • Business requirements
  • Current performance
  • Cost of reliability
  • Competitor benchmarks

Example SLOs:

slos: - name: api_availability target: 99.9 window: 28d sli: | sum(rate(http_requests_total{status!~"5.."}[28d])) / sum(rate(http_requests_total[28d])) - name: api_latency_p95 target: 99 window: 28d sli: | sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d])) / sum(rate(http_request_duration_seconds_count[28d]))

Error Budget Calculation

Error Budget Formula

Error Budget = 1 - SLO Target

Example:

  • SLO: 99.9% availability
  • Error Budget: 0.1% = 43.2 minutes/month
  • Current Error: 0.05% = 21.6 minutes/month
  • Remaining Budget: 50%

Error Budget Policy

error_budget_policy: - remaining_budget: 100% action: Normal development velocity - remaining_budget: 50% action: Consider postponing risky changes - remaining_budget: 10% action: Freeze non-critical changes - remaining_budget: 0% action: Feature freeze, focus on reliability

Reference: See references/error-budget.md

SLO Implementation

Prometheus Recording Rules

# SLI Recording Rules groups: - name: sli_rules interval: 30s rules: # Availability SLI - record: sli:http_availability:ratio expr: | sum(rate(http_requests_total{status!~"5.."}[28d])) / sum(rate(http_requests_total[28d])) # Latency SLI (requests < 500ms) - record: sli:http_latency:ratio expr: | sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d])) / sum(rate(http_request_duration_seconds_count[28d])) - name: slo_rules interval: 5m rules: # SLO compliance (1 = meeting SLO, 0 = violating) - record: slo:http_availability:compliance expr: sli:http_availability:ratio >= bool 0.999 - record: slo:http_latency:compliance expr: sli:http_latency:ratio >= bool 0.99 # Error budget remaining (percentage) - record: slo:http_availability:error_budget_remaining expr: | (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100 # Error budget burn rate - record: slo:http_availability:burn_rate_5m expr: | (1 - ( sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m])) )) / (1 - 0.999)

SLO Alerting Rules

groups: - name: slo_alerts interval: 1m rules: # Fast burn: 14.4x rate, 1 hour window # Consumes 2% error budget in 1 hour - alert: SLOErrorBudgetBurnFast expr: | slo:http_availability:burn_rate_1h > 14.4 and slo:http_availability:burn_rate_5m > 14.4 for: 2m labels: severity: critical annotations: summary: "Fast error budget burn detected" description: "Error budget burning at {{ $value }}x rate" # Slow burn: 6x rate, 6 hour window # Consumes 5% error budget in 6 hours - alert: SLOErrorBudgetBurnSlow expr: | slo:http_availability:burn_rate_6h > 6 and slo:http_availability:burn_rate_30m > 6 for: 15m labels: severity: warning annotations: summary: "Slow error budget burn detected" description: "Error budget burning at {{ $value }}x rate" # Error budget exhausted - alert: SLOErrorBudgetExhausted expr: slo:http_availability:error_budget_remaining < 0 for: 5m labels: severity: critical annotations: summary: "SLO error budget exhausted" description: "Error budget remaining: {{ $value }}%"

SLO Dashboard

Grafana Dashboard Structure:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ SLO Compliance (Current)           β”‚
β”‚ βœ“ 99.95% (Target: 99.9%)          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Error Budget Remaining: 65%        β”‚
β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘ 65%                     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ SLI Trend (28 days)                β”‚
β”‚ [Time series graph]                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Burn Rate Analysis                 β”‚
β”‚ [Burn rate by time window]         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Example Queries:

# Current SLO compliance sli:http_availability:ratio * 100 # Error budget remaining slo:http_availability:error_budget_remaining # Days until error budget exhausted (at current burn rate) (slo:http_availability:error_budget_remaining / 100) * 28 / (1 - sli:http_availability:ratio) * (1 - 0.999)

Multi-Window Burn Rate Alerts

# Combination of short and long windows reduces false positives rules: - alert: SLOBurnRateHigh expr: | ( slo:http_availability:burn_rate_1h > 14.4 and slo:http_availability:burn_rate_5m > 14.4 ) or ( slo:http_availability:burn_rate_6h > 6 and slo:http_availability:burn_rate_30m > 6 ) labels: severity: critical

SLO Review Process

Weekly Review

  • Current SLO compliance
  • Error budget status
  • Trend analysis
  • Incident impact

Monthly Review

  • SLO achievement
  • Error budget usage
  • Incident postmortems
  • SLO adjustments

Quarterly Review

  • SLO relevance
  • Target adjustments
  • Process improvements
  • Tooling enhancements

Best Practices

  1. Start with user-facing services
  2. Use multiple SLIs (availability, latency, etc.)
  3. Set achievable SLOs (don't aim for 100%)
  4. Implement multi-window alerts to reduce noise
  5. Track error budget consistently
  6. Review SLOs regularly
  7. Document SLO decisions
  8. Align with business goals
  9. Automate SLO reporting
  10. Use SLOs for prioritization

Reference Files

  • assets/slo-template.md - SLO definition template
  • references/slo-definitions.md - SLO definition patterns
  • references/error-budget.md - Error budget calculations

Related Skills

  • prometheus-configuration - For metric collection
  • grafana-dashboards - For SLO visualization
GitHub Repository
wshobson/agents
Stars
28,275
Forks
3,111
Open Repository
Install Skill
Download ZIP1 files