Prometheus Configuration
Complete guide to Prometheus setup, metric collection, scrape configuration, and recording rules.
Purpose
Configure Prometheus for comprehensive metric collection, alerting, and monitoring of infrastructure and applications.
When to Use
- Set up Prometheus monitoring
- Configure metric scraping
- Create recording rules
- Design alert rules
- Implement service discovery
Prometheus Architecture
    ┌──────────────┐
    │ Applications │ ← Instrumented with client libraries
    └──────┬───────┘
           │ /metrics endpoint
           ↓
    ┌──────────────┐
    │  Prometheus  │ ← Scrapes metrics periodically
    │    Server    │
    └──────┬───────┘
           │
           ├─→ Alertmanager (alerts)
           ├─→ Grafana (visualization)
           └─→ Long-term storage (Thanos/Cortex)
Installation
Kubernetes with Helm
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update

    # Persistent storage is requested through the chart's storageSpec volume claim template
    helm install prometheus prometheus-community/kube-prometheus-stack \
      --namespace monitoring \
      --create-namespace \
      --set prometheus.prometheusSpec.retention=30d \
      --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
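Long `--set` chains are easy to mistype, so the same settings can be kept in a values file instead. The following is a minimal sketch, assuming the chart's `prometheusSpec.storageSpec.volumeClaimTemplate` layout. The file name `values-monitoring.yaml` and the `ReadWriteOnce` access mode are assumptions, not part of the original setup.

```yaml
# values-monitoring.yaml (hypothetical file name)
prometheus:
  prometheusSpec:
    # Keep 30 days of samples in the local TSDB
    retention: 30d
    # Request a 50Gi persistent volume for the Prometheus data directory
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]   # assumed; depends on the storage class
          resources:
            requests:
              storage: 50Gi
```

Pass the file to the install command with `-f values-monitoring.yaml` in place of the `--set` flags.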
Docker Compose
    version: "3.8"
    services:
      prometheus:
        image: prom/prometheus:latest
        ports:
          - "9090:9090"
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml
          - prometheus-data:/prometheus
        command:
          - "--config.file=/etc/prometheus/prometheus.yml"
          - "--storage.tsdb.path=/prometheus"
          - "--storage.tsdb.retention.time=30d"
    volumes:
      prometheus-data:
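The `prometheus.yml` in this guide loads rule files from `/etc/prometheus/rules/` and sends alerts to `alertmanager:9093`, and the Compose file above provides neither. One way to close that gap is sketched below. It assumes a local `./rules` directory and a local `./alertmanager.yml`, both of which are hypothetical and must exist before the stack starts.

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./rules:/etc/prometheus/rules:ro          # assumed local rules directory
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"

  # Reachable from the prometheus container as alertmanager:9093
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro   # assumed local file

volumes:
  prometheus-data:
```

Because both services share the default Compose network, the service name `alertmanager` resolves without extra configuration.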
Configuration File
prometheus.yml:
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: "production"
        region: "us-west-2"

    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - alertmanager:9093

    # Load rules files
    rule_files:
      - /etc/prometheus/rules/*.yml

    # Scrape configurations
    scrape_configs:
      # Prometheus itself
      - job_name: "prometheus"
        static_configs:
          - targets: ["localhost:9090"]

      # Node exporters
      - job_name: "node-exporter"
        static_configs:
          - targets:
              - "node1:9100"
              - "node2:9100"
              - "node3:9100"
        relabel_configs:
          - source_labels: [__address__]
            target_label: instance
            regex: "([^:]+)(:[0-9]+)?"
            replacement: "${1}"

      # Kubernetes pods with annotations
      - job_name: "kubernetes-pods"
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod

      # Application metrics
      - job_name: "my-app"
        static_configs:
          - targets:
              - "app1.example.com:9090"
              - "app2.example.com:9090"
        metrics_path: "/metrics"
        scheme: "https"
        tls_config:
          ca_file: /etc/prometheus/ca.crt
          cert_file: /etc/prometheus/client.crt
          key_file: /etc/prometheus/client.key
Reference: See assets/prometheus.yml.template
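The `kubernetes-pods` job only keeps pods that opt in through annotations. Kubernetes service discovery exposes each annotation as a `__meta_kubernetes_pod_annotation_*` label, with characters such as `.` and `/` replaced by `_`, so `prometheus.io/scrape` becomes `__meta_kubernetes_pod_annotation_prometheus_io_scrape`. A minimal sketch of a pod that the job would scrape follows. The pod name, image, and port are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-api                  # hypothetical pod name
  namespace: default
  annotations:
    prometheus.io/scrape: "true"     # matched by the keep rule
    prometheus.io/path: "/metrics"   # copied into __metrics_path__
    prometheus.io/port: "8080"       # rewritten into __address__ as <pod-ip>:8080
spec:
  containers:
    - name: app
      image: example.com/example-api:1.0   # hypothetical image
      ports:
        - containerPort: 8080
```

Annotation values must be strings, which is why `"true"` and `"8080"` are quoted.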
Scrape Configurations
Static Targets
    scrape_configs:
      - job_name: "static-targets"
        static_configs:
          - targets: ["host1:9100", "host2:9100"]
            labels:
              env: "production"
              region: "us-west-2"
File-based Service Discovery
    scrape_configs:
      - job_name: "file-sd"
        file_sd_configs:
          - files:
              - /etc/prometheus/targets/*.json
              - /etc/prometheus/targets/*.yml
            refresh_interval: 5m
targets/production.json:
    [
      {
        "targets": ["app1:9090", "app2:9090"],
        "labels": {
          "env": "production",
          "service": "api"
        }
      }
    ]
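The `files` globs above also match `*.yml`, so the same target group can be written as YAML. This sketch mirrors the JSON file one-to-one. The file name `targets/production.yml` is an assumption.

```yaml
# targets/production.yml (hypothetical file name)
- targets:
    - "app1:9090"
    - "app2:9090"
  labels:
    env: "production"
    service: "api"
```

Prometheus watches these files and picks up edits without a restart. The `refresh_interval` acts as a periodic re-read in case a change notification is missed.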
Kubernetes Service Discovery
    scrape_configs:
      - job_name: "kubernetes-services"
        kubernetes_sd_configs:
          - role: service
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
Reference: See references/scrape-configs.md
Recording Rules
Create pre-computed metrics for frequently queried expressions:
    # /etc/prometheus/rules/recording_rules.yml
    groups:
      - name: api_metrics
        interval: 15s
        rules:
          # HTTP request rate per service
          - record: job:http_requests:rate5m
            expr: sum by (job) (rate(http_requests_total[5m]))

          # Error rate percentage
          - record: job:http_requests_errors:rate5m
            expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

          - record: job:http_requests_error_rate:percentage
            expr: |
              (job:http_requests_errors:rate5m / job:http_requests:rate5m) * 100

          # P95 latency
          - record: job:http_request_duration:p95
            expr: |
              histogram_quantile(0.95,
                sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
              )

      - name: resource_metrics
        interval: 30s
        rules:
          # CPU utilization percentage
          - record: instance:node_cpu:utilization
            expr: |
              100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

          # Memory utilization percentage
          - record: instance:node_memory:utilization
            expr: |
              100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

          # Disk usage percentage
          - record: instance:node_disk:utilization
            expr: |
              100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)
Reference: See references/recording-rules.md
Alert Rules
    # /etc/prometheus/rules/alert_rules.yml
    groups:
      - name: availability
        interval: 30s
        rules:
          - alert: ServiceDown
            expr: up{job="my-app"} == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Service {{ $labels.instance }} is down"
              description: "{{ $labels.job }} has been down for more than 1 minute"

          - alert: HighErrorRate
            expr: job:http_requests_error_rate:percentage > 5
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High error rate for {{ $labels.job }}"
              description: "Error rate is {{ $value }}% (threshold: 5%)"

          - alert: HighLatency
            expr: job:http_request_duration:p95 > 1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High latency for {{ $labels.job }}"
              description: "P95 latency is {{ $value }}s (threshold: 1s)"

      - name: resources
        interval: 1m
        rules:
          - alert: HighCPUUsage
            expr: instance:node_cpu:utilization > 80
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
              description: "CPU usage is {{ $value }}%"

          - alert: HighMemoryUsage
            expr: instance:node_memory:utilization > 85
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage on {{ $labels.instance }}"
              description: "Memory usage is {{ $value }}%"

          - alert: DiskSpaceLow
            expr: instance:node_disk:utilization > 90
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Low disk space on {{ $labels.instance }}"
              description: "Disk usage is {{ $value }}%"
Validation
    # Validate configuration
    promtool check config prometheus.yml

    # Validate rules
    promtool check rules /etc/prometheus/rules/*.yml

    # Test query
    promtool query instant http://localhost:9090 'up'
Reference: See scripts/validate-prometheus.sh
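Syntax checks confirm that rules parse, not that they fire when expected. `promtool test rules` can replay synthetic series against a rule file and assert on the resulting alerts. The following is a minimal sketch for the `ServiceDown` alert, assuming the test file sits next to `alert_rules.yml`. The test file name and the sample instance are assumptions.

```yaml
# alert_rules_test.yml (hypothetical file name, placed beside alert_rules.yml)
rule_files:
  - alert_rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m                      # spacing between the input samples below
    input_series:
      # The target reports up == 0 at 0m, 1m, 2m, and 3m
      - series: 'up{job="my-app", instance="app1.example.com:9090"}'
        values: "0 0 0 0"
    alert_rule_test:
      # After 2 minutes the alert has outlasted its "for: 1m" and should be firing
      - eval_time: 2m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: my-app
              instance: app1.example.com:9090
            exp_annotations:
              summary: "Service app1.example.com:9090 is down"
              description: "my-app has been down for more than 1 minute"
```

Run it with `promtool test rules alert_rules_test.yml`. The expected annotations must match the rendered templates exactly, so update the test whenever the alert wording changes.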
Best Practices
- Use consistent naming for metrics (prefix_name_unit)
- Set appropriate scrape intervals (15-60s typical)
- Use recording rules for expensive queries
- Implement high availability (multiple Prometheus instances)
- Configure retention based on storage capacity
- Use relabeling for metric cleanup
- Monitor Prometheus itself
- Implement federation for large deployments
- Use Thanos/Cortex for long-term storage
- Document custom metrics
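Two of the practices above, relabeling for metric cleanup and federation, map directly onto scrape configuration. The sketch below shows one possible shape for each. The dropped metric pattern, the downstream server names, and the choice to federate only recording-rule series (names starting with `job:`) are all assumptions for illustration.

```yaml
scrape_configs:
  # Metric cleanup: drop series after the scrape, before they are stored
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node1:9100"]
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "go_gc_.*"              # assumed example of unwanted metrics
        action: drop

  # Federation: a global server pulls aggregated series from per-cluster servers
  - job_name: "federate"
    scrape_interval: 60s
    honor_labels: true                 # keep the job/instance labels set by the source server
    metrics_path: "/federate"
    params:
      "match[]":
        - '{__name__=~"job:.*"}'       # only pre-aggregated recording-rule series
    static_configs:
      - targets:
          - "prometheus-us-west-2:9090"   # hypothetical downstream server
          - "prometheus-eu-west-1:9090"   # hypothetical downstream server
```

Federating only recording-rule output keeps the global server small. Pulling raw series through `/federate` tends to reproduce the load problem federation is meant to solve.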
Troubleshooting
Check scrape targets:
curl http://localhost:9090/api/v1/targets
Check configuration:
curl http://localhost:9090/api/v1/status/config
Test query:
curl 'http://localhost:9090/api/v1/query?query=up'
Reference Files
- assets/prometheus.yml.template - Complete configuration template
- references/scrape-configs.md - Scrape configuration patterns
- references/recording-rules.md - Recording rule examples
- scripts/validate-prometheus.sh - Validation script
Related Skills
- grafana-dashboards - For visualization
- slo-implementation - For SLO monitoring
- distributed-tracing - For request tracing