Kubernetes Specialist

Purpose

Provides expert guidance on Kubernetes and cloud-native applications, with deep knowledge of container orchestration, cluster management, and production-grade deployments. Specializes in Kubernetes architecture, Helm charts, operators, multi-cluster management, and GitOps workflows across EKS, AKS, GKE, and on-premises clusters.

When to Use

Designing Kubernetes cluster architecture for production workloads
Implementing Helm charts, operators, or GitOps workflows (ArgoCD, Flux)
Troubleshooting cluster issues (networking, storage, performance)
Planning Kubernetes upgrades or multi-cluster strategies
Optimizing resource utilization and cost in Kubernetes environments
Setting up service mesh (Istio, Linkerd) and observability
Implementing Kubernetes security and RBAC policies

Quick Start

Invoke this skill for any task listed under When to Use.

Do NOT invoke when:

Simple Docker container needs (use docker commands directly)
Cloud infrastructure provisioning (use cloud-architect instead)
Application code debugging (use backend-developer/frontend-developer)
Database-specific issues (use database-administrator instead)

Decision Framework

Deployment Strategy Selection

├─ Zero downtime required?
│   ├─ Instant rollback needed → Blue-Green Deployment
│   │   Pros: Instant switch, easy rollback
│   │   Cons: 2x resources during deployment
│   │
│   ├─ Gradual rollout → Canary Deployment
│   │   Pros: Test with subset of traffic
│   │   Cons: Complex routing setup
│   │
│   └─ Simple updates → Rolling Update (default)
│       Pros: Built-in, no extra resources
│       Cons: Rollback takes time
│
├─ Stateful application?
│   ├─ Database → StatefulSet + PVC
│   │   Pros: Stable network IDs, ordered deployment
│   │   Cons: Complex scaling
│   │
│   └─ Stateless → Deployment
│       Pros: Easy scaling, self-healing
│
└─ Batch processing?
    ├─ One-time → Job
    ├─ Scheduled → CronJob
    └─ Parallel processing → Job with parallelism
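
The tree above can be read as a small lookup function. The following is a minimal sketch, not a definitive rule set: the Workload fields and the choose function are hypothetical names introduced for illustration, and the order in which the three questions are checked (batch first, then state, then rollout needs) is an assumption, since the tree does not rank them.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Workload:
    """Answers to the questions in the decision tree (hypothetical field names)."""
    batch: bool = False             # runs to completion rather than serving traffic
    scheduled: bool = False         # batch only: runs on a schedule
    parallel: bool = False          # batch only: work is split across several pods
    stateful: bool = False          # needs stable identity and persistent storage
    zero_downtime: bool = False     # updates must not interrupt traffic
    instant_rollback: bool = False  # must be able to switch back immediately
    gradual_rollout: bool = False   # wants to test on a subset of traffic first


def choose(w: Workload) -> Tuple[str, str]:
    """Return (workload resource, rollout strategy) following the tree above."""
    # Batch branch: Job, CronJob, or Job with parallelism.
    if w.batch:
        if w.scheduled:
            return "CronJob", "n/a"
        if w.parallel:
            return "Job with parallelism", "n/a"
        return "Job", "n/a"

    # Stateful branch: StatefulSet + PVC, otherwise a plain Deployment.
    kind = "StatefulSet + PVC" if w.stateful else "Deployment"

    # Rollout branch: blue-green wins over canary when both are requested
    # (an assumption of this sketch), rolling update is the default.
    if w.zero_downtime and w.instant_rollback:
        return kind, "Blue-Green Deployment"
    if w.zero_downtime and w.gradual_rollout:
        return kind, "Canary Deployment"
    return kind, "Rolling Update"
```

For example, choose(Workload(zero_downtime=True, gradual_rollout=True)) yields a Deployment with a canary rollout, while choose(Workload(batch=True, scheduled=True)) yields a CronJob.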

Resource Configuration Matrix

Workload Type	CPU Request	CPU Limit	Memory Request	Memory Limit
Web API	100m-500m	1000m	256Mi-512Mi	1Gi
Worker	500m-1000m	2000m	512Mi-1Gi	2Gi
Database	1000m-2000m	4000m	2Gi-4Gi	8Gi
Cache	100m-250m	500m	1Gi-4Gi	8Gi
Batch Job	500m-2000m	4000m	1Gi-4Gi	8Gi
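
One way to apply the matrix is to turn a row into the resources stanza of a container spec. The sketch below copies the rows above verbatim; the resources_for helper and its use_upper_request flag are hypothetical, and starting from the low end of each request range is an assumption to be tuned against observed usage.

```python
# Rows copied from the matrix above:
# (CPU request range, CPU limit, memory request range, memory limit).
MATRIX = {
    "Web API":   ("100m-500m",   "1000m", "256Mi-512Mi", "1Gi"),
    "Worker":    ("500m-1000m",  "2000m", "512Mi-1Gi",   "2Gi"),
    "Database":  ("1000m-2000m", "4000m", "2Gi-4Gi",     "8Gi"),
    "Cache":     ("100m-250m",   "500m",  "1Gi-4Gi",     "8Gi"),
    "Batch Job": ("500m-2000m",  "4000m", "1Gi-4Gi",     "8Gi"),
}


def resources_for(workload_type: str, use_upper_request: bool = False) -> dict:
    """Build a container `resources` stanza from one matrix row.

    Assumption: use the low end of each request range unless
    use_upper_request is set.
    """
    cpu_req, cpu_lim, mem_req, mem_lim = MATRIX[workload_type]
    pick = 1 if use_upper_request else 0  # index into the "low-high" range
    return {
        "requests": {
            "cpu": cpu_req.split("-")[pick],
            "memory": mem_req.split("-")[pick],
        },
        "limits": {"cpu": cpu_lim, "memory": mem_lim},
    }
```

The returned dictionary can be placed under spec.template.spec.containers[].resources when a manifest is generated programmatically.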

Node Pool Strategy

Use Case	Instance Type	Scaling	Cost
System pods	t3.large (3 nodes)	Fixed	Low
Applications	m5.xlarge	Auto 3-20	Medium
Batch/Spot	m5.large-2xlarge	Auto 0-50	Very Low
GPU workloads	p3.2xlarge	Manual	High

Red Flags → Escalate

STOP and escalate if:

Cluster upgrade with breaking API changes (removed or deprecated API versions)
Multi-region active-active requirements
Compliance requirements (PCI-DSS, HIPAA) need validation
Custom scheduler or controller development needed
etcd corruption or cluster state issues
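
The first red flag can often be spotted before an upgrade by scanning manifests for API versions that the target release no longer serves. The following is a rough sketch: blocked_by_upgrade is a hypothetical helper, and the REMOVED_IN table is a deliberately partial list that should be checked against the official deprecation guide for the target version.

```python
# (apiVersion, kind) -> first Kubernetes release (major, minor) that no longer serves it.
# Partial list for illustration only.
REMOVED_IN = {
    ("extensions/v1beta1", "Deployment"): (1, 16),
    ("extensions/v1beta1", "Ingress"): (1, 22),
    ("networking.k8s.io/v1beta1", "Ingress"): (1, 22),
    ("batch/v1beta1", "CronJob"): (1, 25),
    ("policy/v1beta1", "PodDisruptionBudget"): (1, 25),
    ("policy/v1beta1", "PodSecurityPolicy"): (1, 25),
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): (1, 26),
}


def blocked_by_upgrade(manifests, target):
    """Return a message for each manifest whose apiVersion/kind is gone at `target`.

    `manifests` is a list of already-parsed manifest dicts;
    `target` is a (major, minor) tuple such as (1, 25).
    """
    hits = []
    for m in manifests:
        removed = REMOVED_IN.get((m.get("apiVersion"), m.get("kind")))
        if removed is not None and target >= removed:
            name = m.get("metadata", {}).get("name", "<unnamed>")
            hits.append(
                f'{m["kind"]}/{name} uses {m["apiVersion"]}, '
                f"removed in {removed[0]}.{removed[1]}"
            )
    return hits
```

Any non-empty result is a reason to stop and migrate the affected manifests before the upgrade proceeds.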

Quality Checklist

Cluster Configuration

[ ] Multi-AZ deployment (nodes spread across availability zones)
[ ] Node autoscaling configured (Cluster Autoscaler or Karpenter)
[ ] System node pool with taints (separate critical addons from apps)
[ ] Encryption enabled (secrets at rest with KMS)
[ ] Audit logging enabled (API server logs)

Security

[ ] Pod Security Standards enforced (restricted or baseline)
[ ] Network policies configured (default deny + explicit allow)
[ ] RBAC configured (least privilege for all service accounts)
[ ] Image scanning enabled (scan for vulnerabilities)
[ ] Private container registry configured
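
The "default deny + explicit allow" item above can be sketched as two generated NetworkPolicy manifests. The helper names, policy names, and the app label key are assumptions made for illustration; the policies only take effect when the cluster's network plugin enforces NetworkPolicy.

```python
import json


def default_deny(namespace: str) -> dict:
    """NetworkPolicy that selects every pod in the namespace and allows no traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},                     # empty selector matches all pods
            "policyTypes": ["Ingress", "Egress"],  # no rules listed, so nothing is allowed
        },
    }


def allow_ingress(namespace: str, app: str, from_app: str, port: int) -> dict:
    """Explicit allow: pods labelled app=<from_app> may reach app=<app> on one TCP port."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-{from_app}-to-{app}", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": app}},
            "policyTypes": ["Ingress"],
            "ingress": [{
                "from": [{"podSelector": {"matchLabels": {"app": from_app}}}],
                "ports": [{"protocol": "TCP", "port": port}],
            }],
        },
    }
```

Serializing a policy with json.dumps(default_deny("payments")) gives a JSON manifest that can be applied like a YAML one. Because the default-deny policy also covers egress, DNS lookups are blocked until a separate egress allow rule is added.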

Resource Management

[ ] All pods have resource requests and limits
[ ] HorizontalPodAutoscalers configured for scalable workloads
[ ] PodDisruptionBudgets defined (prevent too many pods down)
[ ] ResourceQuotas set per namespace
[ ] LimitRanges defined (default limits for pods)
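
The first item in this list lends itself to an automated check. Below is a minimal sketch of such a check, assuming the manifest is already parsed into a dictionary and uses a pod template at spec.template.spec (Deployment, StatefulSet, DaemonSet, Job); missing_resources is a hypothetical helper, and CronJobs, which nest the template one level deeper, are out of scope here.

```python
def missing_resources(manifest: dict) -> list:
    """List containers that lack cpu/memory requests or limits."""
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    problems = []
    # Init containers count too: they are scheduled against the same node resources.
    for c in pod_spec.get("initContainers", []) + pod_spec.get("containers", []):
        res = c.get("resources", {})
        for section in ("requests", "limits"):
            for key in ("cpu", "memory"):
                if key not in res.get(section, {}):
                    problems.append(f'{c.get("name", "<unnamed>")}: missing {section}.{key}')
    return problems
```

An empty list means every container satisfies the checklist item. Teams that intentionally omit CPU limits would need to relax the check accordingly.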

High Availability

[ ] Deployments have ≥2 replicas
[ ] Anti-affinity rules prevent pod co-location
[ ] Readiness and liveness probes configured
[ ] PodDisruptionBudgets leave room for voluntary disruptions (node drains during rolling node upgrades are not blocked)
[ ] Multi-region cluster (if global scale required)
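
The replica and PodDisruptionBudget items above interact: a budget that requires every replica to stay available permits no evictions, so node drains stall. The arithmetic can be sketched as follows, assuming integer minAvailable or maxUnavailable values (percentages are not handled) and all replicas healthy; both function names are hypothetical.

```python
def allowed_disruptions(replicas: int, min_available=None, max_unavailable=None) -> int:
    """Voluntary evictions a PDB permits when every replica is healthy."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        return max(replicas - min_available, 0)
    return min(max_unavailable, replicas)


def ha_findings(replicas: int, min_available=None, max_unavailable=None) -> list:
    """Flag the two checklist violations this sketch can detect."""
    findings = []
    if replicas < 2:
        findings.append("fewer than 2 replicas")
    if allowed_disruptions(replicas, min_available, max_unavailable) == 0:
        findings.append("PDB allows 0 disruptions; node drains will block")
    return findings
```

With 3 replicas and minAvailable of 2, one pod can be evicted at a time; with 1 replica and minAvailable of 1, both findings are reported.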

Observability

[ ] Metrics server installed (kubectl top works)
[ ] Prometheus monitoring application metrics
[ ] Centralized logging (CloudWatch, Elasticsearch, Loki)
[ ] Distributed tracing (Jaeger, Tempo)
[ ] Dashboards for cluster and application health

Disaster Recovery

[ ] Velero installed for cluster backups
[ ] Backup schedule configured (daily minimum)
[ ] Restore tested (annual drill)
[ ] etcd backups automated (self-managed clusters; managed services such as EKS, AKS, and GKE operate etcd for you)

Additional Resources

Detailed Technical Reference: See REFERENCE.md
Code Examples & Patterns: See EXAMPLES.md
