Kubernetes Production Best Practices - A Comprehensive Guide

December 10, 2024 · 6 min read

Senior Platform Engineer @ CData

Kubernetes has become the de facto standard for container orchestration in modern cloud-native applications. However, running Kubernetes in production requires careful planning, implementation of best practices, and continuous monitoring. In this comprehensive guide, we'll explore the essential practices that will help you build robust, scalable, and secure Kubernetes clusters.

Disclaimer: Kubernetes®, K8s®, Docker®, and other product names mentioned in this article are trademarks of their respective owners. All logos and trademarks are used for representation purposes only. No prior copyright or trademark authorization has been obtained. This content is for educational purposes only.

1. Resource Management and Limits

One of the most critical aspects of running Kubernetes in production is proper resource management. Without it, you risk cluster instability, performance degradation, and unexpected costs.

Setting Resource Requests and Limits

Always define resource requests and limits for your containers:

apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"

Why this matters:

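Requests tell the scheduler how much CPU and memory to reserve, so pods are placed only on nodes with enough free capacity
Limits cap what a container can use: exceeding the CPU limit causes throttling, and exceeding the memory limit gets the container OOM-killed, which contains noisy neighbor problems
Accurate requests improve cluster bin-packing efficiency, so fewer nodes sit idle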
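
Pods that omit these fields can still slip into a namespace. One way to close that gap is a LimitRange that supplies defaults. The following is a minimal sketch: the object name, the production namespace, and the values (copied from the pod example above) are assumptions to adapt.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-resources   # hypothetical name
  namespace: production                # assumed namespace
spec:
  limits:
  - type: Container
    defaultRequest:        # applied when a container sets no requests
      cpu: "250m"
      memory: "64Mi"
    default:               # applied when a container sets no limits
      cpu: "500m"
      memory: "128Mi"
```

The defaults are injected at admission time, so they affect only pods created after the LimitRange exists.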

Quality of Service (QoS) Classes

Kubernetes assigns QoS classes based on resource specifications:

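Guaranteed - Every container sets CPU and memory requests and limits, and requests = limits
Burstable - The pod does not qualify as Guaranteed, but at least one container sets a CPU or memory request or limit
BestEffort - No container sets any requests or limits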

In production, aim for Guaranteed or Burstable QoS for critical workloads. Under node resource pressure, BestEffort pods are typically the first to be evicted.
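
As an illustration of the Guaranteed class, the sketch below sets requests equal to limits for both CPU and memory. The pod name is hypothetical and the values are only examples.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-demo   # hypothetical name
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "128Mi"
        cpu: "500m"
      limits:
        memory: "128Mi"   # equal to the request
        cpu: "500m"       # equal to the request
```

You can check the class Kubernetes assigned with kubectl get pod guaranteed-demo -o jsonpath='{.status.qosClass}'.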

2. High Availability and Scalability

Horizontal Pod Autoscaling (HPA)

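Use an HPA to scale replica counts automatically based on observed metrics. CPU utilization targets are measured against each container's CPU request, so the target pods must set requests, and the cluster needs a metrics source such as metrics-server: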

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-deployment
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
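
Scale-down speed can also be tuned through the optional behavior field of autoscaling/v2. The fragment below is a sketch that belongs under spec: of the HPA above. The numbers are illustrative assumptions, not recommendations.

```yaml
# Fragment: place under spec: of the HorizontalPodAutoscaler
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # wait 5 minutes of lower load before scaling down
    policies:
    - type: Pods
      value: 1            # remove at most one pod...
      periodSeconds: 60   # ...per 60-second window
```

A conservative scale-down policy like this trades some cost for fewer replica swings when load is spiky.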

Pod Disruption Budgets (PDB)

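Protect your applications during voluntary disruptions such as node drains and cluster upgrades. A PDB limits how many matching pods can be evicted at the same time: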

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: critical-app
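
A PDB can also be expressed as maxUnavailable, which keeps working unchanged as the replica count grows or shrinks. This is a sketch with a hypothetical object name. Use it instead of the minAvailable version, not alongside it, because a pod should be matched by only one PDB.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb-max-unavailable   # hypothetical name
spec:
  maxUnavailable: 1   # at most one matching pod may be evicted at a time
  selector:
    matchLabels:
      app: critical-app
```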

Multi-Zone Deployments

Distribute pods across availability zones using topology spread constraints:

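apiVersion: apps/v1
kind: Deployment
metadata:
  name: multi-zone-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: multi-zone-app
  template:
    metadata:
      labels:
        app: multi-zone-app   # must match spec.selector and the constraint's labelSelector
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: multi-zone-app
      containers:
      - name: app
        image: myapp:v2   # placeholder image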

3. Security Best Practices

Network Policies

Implement zero-trust networking with Network Policies. They take effect only if your CNI plugin enforces them (for example Calico or Cilium):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-policy
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
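  egress:
  - to:
    - podSelector:
        matchLabels:
          app: database
    ports:
    - protocol: TCP
      port: 5432
  # Allow DNS lookups; without this rule the pod cannot resolve service names.
  # Assumes cluster DNS runs in kube-system with the common k8s-app: kube-dns label.
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53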
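
Allow rules like the one above are usually paired with a default-deny policy, so that any pod without an explicit policy is isolated. A minimal sketch, assuming the production namespace and a hypothetical policy name:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all    # hypothetical name
  namespace: production     # assumed namespace
spec:
  podSelector: {}           # empty selector matches every pod in the namespace
  policyTypes:
  - Ingress
  - Egress
```

Network policies are additive, so the backend policy keeps working: traffic is allowed if any policy selecting the pod permits it.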

Pod Security Standards

Enforce security contexts and pod security standards:

apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: myapp:v2  # placeholder; the image must be built to run as non-root with a read-only root filesystem
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
        - ALL
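
Per-pod security contexts can be backed by namespace-wide enforcement through the built-in Pod Security admission controller, which is driven by namespace labels. The sketch below assumes a cluster recent enough to ship that controller (it is stable since Kubernetes 1.25) and a namespace called production.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production   # assumed namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted   # reject pods that violate the restricted profile
    pod-security.kubernetes.io/warn: restricted      # also warn on kubectl apply
    pod-security.kubernetes.io/audit: restricted     # and record violations in the audit log
```

Starting with only the warn and audit labels is a common way to find non-compliant workloads before turning on enforce.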

RBAC Configuration

Implement least-privilege access control:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: production
subjects:
- kind: ServiceAccount
  name: app-service-account
  namespace: production  # required for ServiceAccount subjects
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
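
The binding only matters once a workload runs under that identity. The sketch below creates the ServiceAccount named in the RoleBinding and attaches it to a pod. The pod name and image are hypothetical placeholders.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-service-account
  namespace: production
---
apiVersion: v1
kind: Pod
metadata:
  name: rbac-demo          # hypothetical name
  namespace: production
spec:
  serviceAccountName: app-service-account   # pod calls the API as this identity
  containers:
  - name: app
    image: myapp:v2        # placeholder image
```

To verify the grant, run kubectl auth can-i list pods --as=system:serviceaccount:production:app-service-account -n production.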

4. Monitoring and Observability

Health Checks

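Implement comprehensive health checks. The liveness probe restarts a hung container, the readiness probe keeps a pod out of Service endpoints until it can serve traffic, and the startup probe holds off the other two while a slow application boots. In the example below, the startup probe allows up to 30 × 10 = 300 seconds for startup: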

apiVersion: v1
kind: Pod
metadata:
  name: health-check-demo
spec:
  containers:
  - name: app
    image: myapp:latest
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
    startupProbe:
      httpGet:
        path: /startup
        port: 8080
      failureThreshold: 30
      periodSeconds: 10

Structured Logging

Use structured logging for better observability:

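// Example in Go using logrus: import log "github.com/sirupsen/logrus"
log.SetFormatter(&log.JSONFormatter{}) // emit JSON so log pipelines can parse the fields
log.WithFields(log.Fields{
    "user_id": userID,
    "action": "login",
    "status": "success",
    "duration_ms": duration,
}).Info("User logged in")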

5. Deployment Strategies

Rolling Updates with Safety Checks

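Configure rolling updates so capacity never drops below the desired replica count. With maxUnavailable: 0, an old pod is removed only after its replacement is available, and minReadySeconds requires a new pod to stay ready for 30 seconds before it counts as available: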

apiVersion: apps/v1
kind: Deployment
metadata:
  name: safe-deployment
spec:
  replicas: 5
  selector:
    matchLabels:
      app: safe-deployment
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  minReadySeconds: 30
  progressDeadlineSeconds: 600
  template:
    metadata:
      labels:
        app: safe-deployment   # must match spec.selector
    spec:
      containers:
      - name: app
        image: myapp:v2

Blue-Green Deployments

Run both versions side by side and switch traffic between them by changing the Service selector:

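# Blue deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
      - name: app
        image: myapp:v1  # placeholder for the currently live version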
---
# Service pointing to blue
apiVersion: v1
kind: Service
metadata:
  name: app-service
spec:
  selector:
    app: myapp
    version: blue  # Switch to 'green' for cutover
  ports:
  - port: 80
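
The green side is a second Deployment that differs only in its version label and image. The following is a sketch: the app-green name and the myapp:v2 image are assumptions for illustration.

```yaml
# Green deployment (runs alongside blue until cutover)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-green            # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green       # the Service selects on this label
    spec:
      containers:
      - name: app
        image: myapp:v2      # placeholder for the new version
```

Once the green pods are ready, the cutover is a single selector change, for example kubectl patch service app-service -p '{"spec":{"selector":{"version":"green"}}}'. Rolling back means patching the selector back to blue.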

6. Storage and Persistence

Using StatefulSets for Stateful Applications

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: database
spec:
  serviceName: "db"
  replicas: 3
  selector:
    matchLabels:
      app: database
  template:
    metadata:
      labels:
        app: database   # must match spec.selector
    spec:
      containers:
      - name: postgres
        image: postgres:14
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "fast-ssd"
      resources:
        requests:
          storage: 100Gi
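
The serviceName field above refers to a headless Service that gives each replica a stable DNS name. That Service is not created automatically. A minimal sketch, assuming the default PostgreSQL port:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: db              # must match serviceName in the StatefulSet
spec:
  clusterIP: None       # headless: DNS resolves to individual pod IPs
  selector:
    app: database
  ports:
  - name: postgres
    port: 5432          # assumed default PostgreSQL port
```

With this in place, replicas are reachable as database-0.db, database-1.db, and database-2.db within the namespace.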

7. Configuration Management

Using ConfigMaps and Secrets Properly

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
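data:
  # Keys become environment variable names when loaded with envFrom,
  # so use names that are valid in any shell and Kubernetes version
  DATABASE_HOST: "db.production.svc.cluster.local"
  CACHE_ENABLED: "true"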
---
apiVersion: v1
kind: Secret
metadata:
  name: app-secrets
type: Opaque
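stringData:
  # stringData holds plaintext that the API server stores base64-encoded, not encrypted.
  # Keep real values out of version control and enable encryption at rest.
  DATABASE_PASSWORD: "replace-me"
  API_KEY: "replace-me"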

Expose them to the container as environment variables, as shown here, or mount them as volumes:

spec:
  containers:
  - name: app
    envFrom:
    - configMapRef:
        name: app-config
    - secretRef:
        name: app-secrets
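
For the volume option, each key in the Secret becomes a file under the mount path. The sketch below assumes a hypothetical mount path and a placeholder image.

```yaml
spec:
  containers:
  - name: app
    image: myapp:v2                  # placeholder image
    volumeMounts:
    - name: secrets
      mountPath: /etc/app/secrets    # hypothetical path; one file per Secret key
      readOnly: true
  volumes:
  - name: secrets
    secret:
      secretName: app-secrets
```

Mounted Secrets are refreshed when the Secret changes, whereas environment variables keep their original values until the pod restarts.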

8. Backup and Disaster Recovery

Regular Backup Strategy

etcd Backups - Automate daily etcd snapshots
Persistent Volume Backups - Use tools like Velero
Configuration Backups - Version control all YAML files

Example Velero Backup Schedule

apiVersion: velero.io/v1
kind: Schedule
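metadata:
  name: daily-backup
  namespace: velero  # Schedules must live in the namespace where Velero is installed (velero by default)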
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
    - production
    - staging
    snapshotVolumes: true
    ttl: 720h0m0s
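
A backup is only proven once it has been restored. The sketch below restores the most recent successful backup from the schedule above into a scratch namespace, so a restore drill does not touch live workloads. The object name and the target namespace are assumptions.

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-drill            # hypothetical name
  namespace: velero              # assumes Velero is installed in its default namespace
spec:
  scheduleName: daily-backup     # use the latest successful backup from this schedule
  includedNamespaces:
  - production
  namespaceMapping:
    production: production-restore-test   # hypothetical scratch namespace
  restorePVs: true
```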

9. Cost Optimization

Resource Optimization Tips

Use Cluster Autoscaler for node-level scaling
Implement Pod Priority Classes for critical workloads
Use Spot/Preemptible Instances for non-critical workloads
Monitor and Right-Size resources based on actual usage

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000
globalDefault: false
description: "High priority for critical services"
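
A PriorityClass has no effect until a pod references it. The sketch below shows the reference with a hypothetical pod name and a placeholder image. In a Deployment, the same field goes in the pod template's spec.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: critical-app             # hypothetical name
spec:
  priorityClassName: high-priority   # must match the PriorityClass name
  containers:
  - name: app
    image: myapp:v2              # placeholder image
```

When the cluster is full, the scheduler can preempt lower-priority pods to make room for this one.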

10. GitOps and CI/CD

Implementing GitOps with ArgoCD

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/yourorg/k8s-manifests
    targetRevision: main
    path: production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - CreateNamespace=true

Conclusion

Running Kubernetes in production is a journey that requires continuous learning and improvement. By implementing these best practices, you'll build a solid foundation for reliable, secure, and scalable applications.

Key Takeaways

✅ Always set resource requests and limits
✅ Implement comprehensive monitoring and alerting
✅ Use RBAC and network policies for security
✅ Automate scaling with HPA and cluster autoscaler
✅ Back up regularly and test disaster recovery
✅ Adopt GitOps for declarative infrastructure
✅ Optimize continuously based on metrics

Remember: Production readiness is not a destination but a continuous process of refinement and adaptation to your organization's needs.

What's your experience with Kubernetes in production? Share your challenges and solutions in the comments below!
