Modern Platform Engineering Tools - The Essential DevOps Stack for 2024
Platform engineering has emerged as a critical discipline that bridges development and operations, focusing on building internal developer platforms (IDPs) that enable teams to ship software faster and more reliably. In this comprehensive guide, we'll explore the cutting-edge tools and practices that define modern platform engineering in 2024.
Disclaimer: Kubernetes®, Docker®, Terraform®, AWS®, Azure®, Google Cloud®, ArgoCD™, and other product names mentioned in this article are trademarks of their respective owners and are used for identification purposes only; no affiliation or endorsement is implied. This content is for educational purposes only.
What is Platform Engineering?
Platform engineering is the discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering organizations in the cloud-native era. It's about creating "golden paths" that reduce cognitive load while maintaining flexibility.
Core Principles
🎯 Developer Experience First - Reduce friction and complexity
🔄 Self-Service - Enable teams to be autonomous
🔧 Automation - Eliminate manual toil
📊 Observability - Make systems transparent
🔒 Security by Default - Shift left on security
1. Infrastructure as Code (IaC)
Terraform: The Industry Standard
Terraform has become the de facto standard for infrastructure provisioning across multiple cloud providers.
# Modern Terraform with best practices
terraform {
  required_version = ">= 1.6"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket         = "terraform-state-prod"
    key            = "infrastructure/eks-cluster"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

# VPC Module
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "production-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway   = true
  single_nat_gateway   = false
  enable_dns_hostnames = true

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

# EKS Cluster
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name    = "production-eks"
  cluster_version = "1.28"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {
    general = {
      desired_size = 3
      min_size     = 2
      max_size     = 10

      instance_types = ["t3.large"]
      capacity_type  = "ON_DEMAND"

      labels = {
        workload = "general"
      }
    }
  }

  tags = {
    Environment = "production"
  }
}
Pulumi: Infrastructure as Real Code
For teams preferring general-purpose languages:
import * as aws from "@pulumi/aws";
import * as eks from "@pulumi/eks";

// Create VPC
const vpc = new aws.ec2.Vpc("production-vpc", {
  cidrBlock: "10.0.0.0/16",
  enableDnsHostnames: true,
  tags: { Name: "production-vpc" },
});

// Create EKS Cluster
const cluster = new eks.Cluster("production-eks", {
  vpcId: vpc.id,
  instanceType: "t3.medium",
  desiredCapacity: 3,
  minSize: 2,
  maxSize: 10,
  version: "1.28",
});

export const kubeconfig = cluster.kubeconfig;
OpenTofu: Open-Source Terraform Alternative
After HashiCorp moved Terraform to the Business Source License (BSL), OpenTofu emerged as the community-maintained, open-source fork under the Linux Foundation:
# Install OpenTofu
brew install opentofu
# Compatible with Terraform syntax
tofu init
tofu plan
tofu apply
2. GitOps: Declarative Infrastructure
ArgoCD: Kubernetes GitOps Leader
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/company/k8s-manifests
    targetRevision: main
    path: apps/production
    # Helm values
    helm:
      valueFiles:
        - values-production.yaml
      parameters:
        - name: image.tag
          value: "v1.2.3"
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: false
    syncOptions:
      - CreateNamespace=true
      - PruneLast=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
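The retry stanza above implements capped exponential backoff: attempt n waits duration × factor^(n−1), never exceeding maxDuration. A small sketch of the schedule it produces (the backoffDelays helper is hypothetical, not part of ArgoCD):

```typescript
// Capped exponential backoff, mirroring ArgoCD's retry stanza:
// delay(n) = min(duration * factor^(n - 1), maxDuration).
function backoffDelays(
  baseSeconds: number,
  factor: number,
  maxSeconds: number,
  limit: number
): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt <= limit; attempt++) {
    delays.push(Math.min(baseSeconds * Math.pow(factor, attempt - 1), maxSeconds));
  }
  return delays;
}

// duration: 5s, factor: 2, maxDuration: 3m (180s), limit: 5
console.log(backoffDelays(5, 2, 180, 5)); // [ 5, 10, 20, 40, 80 ]
```

With limit: 5 the cap never kicks in; raise the limit to 7 and the later delays flatten out at 180 seconds.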
Flux CD: Alternative GitOps Solution
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: production-repo
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/company/k8s-manifests
  ref:
    branch: main
  secretRef:
    name: git-credentials
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: production-apps
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: production-repo
  path: ./apps/production
  prune: true
  wait: true
  timeout: 5m
3. CI/CD: Modern Pipeline Tools
GitHub Actions: Integrated CI/CD
name: Production Deployment

on:
  push:
    branches: [main]
    tags: ['v*']

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=ref,event=branch
            type=semver,pattern={{version}}
            type=sha,format=long

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

      - name: Run tests
        # type=sha,format=long above produces a "sha-<full commit sha>" tag
        run: |
          docker run --rm ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:sha-${{ github.sha }} \
            npm test

  deploy:
    needs: build-and-test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - name: Update K8s manifests
        env:
          # A PAT with write access to the manifests repo; the default
          # GITHUB_TOKEN cannot push to a different repository.
          GITHUB_TOKEN: ${{ secrets.MANIFESTS_REPO_TOKEN }}
        run: |
          git clone https://${GITHUB_TOKEN}@github.com/company/k8s-manifests
          cd k8s-manifests
          git config user.name "ci-bot"
          git config user.email "ci-bot@users.noreply.github.com"
          yq e -i '.spec.template.spec.containers[0].image = "${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:sha-${{ github.sha }}"' \
            apps/production/deployment.yaml
          git add .
          git commit -m "Update image to ${{ github.sha }}"
          git push
Tekton: Kubernetes-Native Pipelines
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: build-and-deploy
spec:
  params:
    - name: git-url
    - name: image-name
  workspaces:
    - name: shared-workspace
  tasks:
    - name: fetch-repository
      taskRef:
        name: git-clone
      workspaces:
        - name: output
          workspace: shared-workspace
      params:
        - name: url
          value: $(params.git-url)
    - name: build-image
      taskRef:
        name: buildah
      runAfter:
        - fetch-repository
      workspaces:
        - name: source
          workspace: shared-workspace
      params:
        - name: IMAGE
          value: $(params.image-name)
    - name: deploy-to-k8s
      taskRef:
        name: kubernetes-actions
      runAfter:
        - build-image
      params:
        - name: script
          value: |
            kubectl set image deployment/app app=$(params.image-name)
            kubectl rollout status deployment/app
4. Service Mesh: Traffic Management
Istio: Feature-Rich Service Mesh
# Virtual Service for Canary Deployment
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews-route
spec:
  hosts:
    - reviews
  http:
    - match:
        - headers:
            user-agent:
              regex: ".*Chrome.*"
      route:
        - destination:
            host: reviews
            subset: v2
          weight: 100
    - route:
        - destination:
            host: reviews
            subset: v1
          weight: 90
        - destination:
            host: reviews
            subset: v2
          weight: 10
---
# Destination Rule
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
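In plain terms, the VirtualService says: Chrome user agents always get v2, and everyone else is split 90/10 between v1 and v2, first matching rule wins. A hypothetical sketch of that routing decision (illustrative logic, not Istio code):

```typescript
interface Destination { subset: string; weight: number }

// First matching rule wins, as in an Istio VirtualService: a Chrome
// user agent hits the header-match rule and always gets v2; everything
// else falls through to the weighted 90/10 split.
function pickSubset(userAgent: string, rand: number, weighted: Destination[]): string {
  if (/Chrome/.test(userAgent)) return "v2"; // header-match rule
  let cumulative = 0; // rand must be uniform in [0, total weight)
  for (const d of weighted) {
    cumulative += d.weight;
    if (rand < cumulative) return d.subset;
  }
  return weighted[weighted.length - 1].subset;
}

const routes: Destination[] = [
  { subset: "v1", weight: 90 },
  { subset: "v2", weight: 10 },
];
console.log(pickSubset("Mozilla/5.0 ... Chrome/120.0", 50, routes)); // "v2"
console.log(pickSubset("curl/8.4.0", 50, routes)); // "v1"
```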
Linkerd: Lightweight Alternative
# Linkerd Traffic Split for Canary
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: api-canary
spec:
  service: api
  backends:
    - service: api-stable
      weight: 90
    - service: api-canary
      weight: 10
5. Observability Stack
The Three Pillars
Metrics → Logs → Traces
Prometheus + Grafana: Metrics
# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
---
# PrometheusRule for Alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
spec:
  groups:
    - name: app
      interval: 30s
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              /
            sum(rate(http_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value | humanizePercentage }}"
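The intent of the alert above is an error ratio: 5xx requests as a share of all requests over a five-minute window, where rate() of a counter is roughly (last − first) / elapsed seconds. A rough sketch of that arithmetic on raw counter samples (the sample values are made up):

```typescript
// rate() over a window ~ (last - first) / seconds elapsed,
// for a monotonically increasing counter.
function counterRate(first: number, last: number, seconds: number): number {
  return (last - first) / seconds;
}

// Two counters sampled 300s (5m) apart: 30 new 5xx responses
// against 400 new requests overall.
const errRate = counterRate(120, 150, 300);         // 5xx requests/sec
const totalRate = counterRate(10_000, 10_400, 300); // all requests/sec
const errorRatio = errRate / totalRate;

console.log(errorRatio.toFixed(3)); // "0.075" -> above 0.05, alert fires
```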
Loki: Log Aggregation
# Promtail configuration
clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
    pipeline_stages:
      - docker: {}
      - json:
          expressions:
            level: level
            message: message
      - labels:
          level:
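The json and labels pipeline stages pull level and message out of each structured log line; lines that are not valid JSON simply pass through unparsed. A sketch of that extraction (the parseLogLine helper is hypothetical, not Promtail code):

```typescript
interface ParsedLog { level?: string; message?: string; raw: string }

// Mimics Promtail's `json` stage: pull named fields out of a JSON
// log line, falling back to the raw line when it isn't JSON.
function parseLogLine(line: string): ParsedLog {
  try {
    const obj = JSON.parse(line);
    return { level: obj.level, message: obj.message, raw: line };
  } catch {
    return { raw: line }; // plain-text lines pass through unparsed
  }
}

console.log(parseLogLine('{"level":"error","message":"db timeout"}').level); // "error"
console.log(parseLogLine("plain text line").level); // undefined
```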
Jaeger/Tempo: Distributed Tracing
# OpenTelemetry Collector
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 10s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
    exporters:
      jaeger:
        endpoint: jaeger:14250
        tls:
          insecure: true
      prometheus:
        endpoint: "0.0.0.0:8889"
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [jaeger]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheus]
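The batch processor flushes when it has accumulated send_batch_size items, or when timeout has elapsed since the first item arrived, whichever comes first. A simplified sketch of that flush rule (the real collector's batching is more involved; this Batcher class is illustrative only):

```typescript
// Simplified flush rule of the OTel batch processor: emit a batch
// when it reaches maxSize items, or when maxAgeMs has passed since
// the first item in the current batch arrived.
class Batcher<T> {
  private items: T[] = [];
  private firstAt = 0;

  constructor(
    private maxSize: number,
    private maxAgeMs: number,
    private flush: (batch: T[]) => void
  ) {}

  add(item: T, nowMs: number): void {
    if (this.items.length === 0) this.firstAt = nowMs;
    this.items.push(item);
    if (this.items.length >= this.maxSize || nowMs - this.firstAt >= this.maxAgeMs) {
      this.flush(this.items);
      this.items = [];
    }
  }
}

const flushed: number[][] = [];
const b = new Batcher<number>(1024, 10_000, (batch) => flushed.push([...batch]));
for (let i = 0; i < 1024; i++) b.add(i, 0); // size-triggered flush
b.add(99, 0);
b.add(100, 12_000);                         // time-triggered flush
console.log(flushed.length); // 2
```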
6. Secret Management
HashiCorp Vault
# Vault Kubernetes Auth
path "auth/kubernetes/login" {
  capabilities = ["create", "read"]
}

# Policy for application secrets
path "secret/data/production/app/*" {
  capabilities = ["read", "list"]
}

# Dynamic database credentials
path "database/creds/app-readonly" {
  capabilities = ["read"]
}
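Vault treats a trailing * in a policy path as a prefix glob; other paths must match exactly. A simplified sketch of the capability check (real Vault resolves the most specific matching path, while this hypothetical helper just accepts any match):

```typescript
// Vault-style path matching: a policy path ending in "*" matches any
// request path with that prefix; otherwise it must match exactly.
function pathMatches(policyPath: string, requestPath: string): boolean {
  if (policyPath.endsWith("*")) {
    return requestPath.startsWith(policyPath.slice(0, -1));
  }
  return requestPath === policyPath;
}

// Simplification: grant if ANY policy path matches and lists the
// capability (real Vault uses most-specific-path-wins semantics).
function allowed(
  policies: Record<string, string[]>,
  requestPath: string,
  capability: string
): boolean {
  return Object.entries(policies).some(
    ([path, caps]) => pathMatches(path, requestPath) && caps.includes(capability)
  );
}

const policies = {
  "secret/data/production/app/*": ["read", "list"],
  "database/creds/app-readonly": ["read"],
};
console.log(allowed(policies, "secret/data/production/app/db", "read"));   // true
console.log(allowed(policies, "secret/data/production/app/db", "delete")); // false
```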
External Secrets Operator
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: vault-backend
spec:
  provider:
    vault:
      server: "https://vault.company.com"
      path: "secret"
      version: "v2"
      auth:
        kubernetes:
          mountPath: "kubernetes"
          role: "app-role"
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: SecretStore
  target:
    name: app-secrets
    creationPolicy: Owner
  data:
    - secretKey: database-password
      remoteRef:
        key: production/app/database
        property: password
7. Policy as Code
Open Policy Agent (OPA)
package kubernetes.admission

# Deny pods without resource limits
deny[msg] {
  input.request.kind.kind == "Pod"
  container := input.request.object.spec.containers[_]
  not container.resources.limits
  msg := sprintf("Container %v must have resource limits", [container.name])
}

# Require non-root user
deny[msg] {
  input.request.kind.kind == "Pod"
  not input.request.object.spec.securityContext.runAsNonRoot
  msg := "Pods must run as non-root user"
}

# Deny privileged containers
deny[msg] {
  input.request.kind.kind == "Pod"
  container := input.request.object.spec.containers[_]
  container.securityContext.privileged
  msg := sprintf("Container %v cannot run in privileged mode", [container.name])
}
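The same three checks, sketched as a plain function, are handy for unit-testing policy logic outside the cluster before committing the Rego. The types and the admissionDenials name here are hypothetical, not OPA code:

```typescript
interface Container {
  name: string;
  resources?: { limits?: Record<string, string> };
  securityContext?: { privileged?: boolean };
}
interface PodSpec {
  securityContext?: { runAsNonRoot?: boolean };
  containers: Container[];
}

// Mirrors the three Rego deny rules above: missing resource limits,
// a privileged container, and a pod not forced to run as non-root
// each produce a denial message.
function admissionDenials(spec: PodSpec): string[] {
  const denials: string[] = [];
  for (const c of spec.containers) {
    if (!c.resources?.limits) {
      denials.push(`Container ${c.name} must have resource limits`);
    }
    if (c.securityContext?.privileged) {
      denials.push(`Container ${c.name} cannot run in privileged mode`);
    }
  }
  if (!spec.securityContext?.runAsNonRoot) {
    denials.push("Pods must run as non-root user");
  }
  return denials;
}

const denials = admissionDenials({
  containers: [{ name: "app", securityContext: { privileged: true } }],
});
console.log(denials.length); // 3: no limits, privileged, no runAsNonRoot
```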
Kyverno: Kubernetes Native Policies
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-labels
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-labels
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Labels 'app' and 'env' are required"
        pattern:
          metadata:
            labels:
              app: "?*"
              env: "?*"
---
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-network-policy
spec:
  rules:
    - name: default-deny
      match:
        any:
          - resources:
              kinds:
                - Namespace
      generate:
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        name: default-deny
        namespace: "{{request.object.metadata.name}}"
        data:
          spec:
            podSelector: {}
            policyTypes:
              - Ingress
              - Egress
8. Cost Management
OpenCost: Kubernetes Cost Monitoring
apiVersion: v1
kind: ConfigMap
metadata:
  name: opencost-config
data:
  default.json: |
    {
      "provider": "aws",
      "description": "AWS Cost Configuration",
      "CPU": "0.031611",
      "spotCPU": "0.006655",
      "RAM": "0.004237",
      "spotRAM": "0.000892",
      "GPU": "0.95",
      "storage": "0.00005479452",
      "zoneNetworkEgress": "0.01",
      "regionNetworkEgress": "0.01",
      "internetNetworkEgress": "0.12"
    }
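These figures are hourly unit prices: dollars per vCPU-hour, per GiB-RAM-hour, and so on. A rough sketch of how they roll up into a monthly node cost, using roughly 730 hours per month (the monthlyNodeCost helper is illustrative, not OpenCost code):

```typescript
// Rough monthly cost of a node from OpenCost-style hourly unit
// prices: $/vCPU-hour and $/GiB-RAM-hour, over ~730 hours/month.
function monthlyNodeCost(
  vcpus: number,
  ramGib: number,
  cpuHourly: number,
  ramHourly: number,
  hours = 730
): number {
  return (vcpus * cpuHourly + ramGib * ramHourly) * hours;
}

// A t3.large-shaped node (2 vCPU, 8 GiB) priced with the on-demand
// rates from the ConfigMap above.
const cost = monthlyNodeCost(2, 8, 0.031611, 0.004237);
console.log(cost.toFixed(2)); // roughly $70.90/month
```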
Kubecost Integration
# Install Kubecost
helm install kubecost kubecost/cost-analyzer \
  --namespace kubecost --create-namespace \
  --set kubecostToken="YOUR_TOKEN"

# View cost allocation
kubectl port-forward -n kubecost svc/kubecost-cost-analyzer 9090:9090
9. Developer Portals
Backstage: Internal Developer Platform
# app-config.yaml
app:
  title: FreeOps Developer Portal
  baseUrl: https://portal.freeops.io

backend:
  baseUrl: https://portal.freeops.io
  cors:
    origin: https://portal.freeops.io
  database:
    client: pg
    connection:
      host: ${POSTGRES_HOST}
      port: ${POSTGRES_PORT}
      user: ${POSTGRES_USER}
      password: ${POSTGRES_PASSWORD}

catalog:
  rules:
    - allow: [Component, System, API, Resource, Location]
  locations:
    - type: url
      target: https://github.com/company/software-catalog/blob/main/catalog-info.yaml
Port: Developer Portal Alternative
# port-entity.yaml
identifier: payment-service
title: Payment Service
blueprint: microservice
properties:
  language: golang
  team: payments
  tier: critical
  lifecycle: production
  runtime: kubernetes
relations:
  dependsOn:
    - postgres-db
    - redis-cache
10. Platform Automation
Crossplane: Universal Control Plane
apiVersion: database.aws.crossplane.io/v1beta1
kind: RDSInstance
metadata:
  name: production-db
spec:
  forProvider:
    region: us-east-1
    dbInstanceClass: db.t3.medium
    masterUsername: admin
    allocatedStorage: 100
    engine: postgres
    engineVersion: "14.7"
    skipFinalSnapshot: false
  writeConnectionSecretToRef:
    name: db-credentials
    namespace: production
  providerConfigRef:
    name: aws-provider
Emerging Tools to Watch
🔥 Kratix - Framework for building platforms
🔥 Score - Workload specification standard
🔥 Dapr - Distributed application runtime
🔥 Telepresence - Local development for K8s
🔥 Garden - Development orchestration
🔥 Tilt - Multi-service development
Building Your Platform Stack
Recommended Starting Stack
┌─────────────────────────────────────┐
│ Developer Portal (Backstage) │
└─────────────────────────────────────┘
│
┌─────────────────────────────────────┐
│ GitOps (ArgoCD/Flux) │
└─────────────────────────────────────┘
│
┌─────────────────────────────────────┐
│ Kubernetes (EKS/GKE/AKS) │
│ + Service Mesh (Istio/Linkerd) │
└─────────────────────────────────────┘
│
┌─────────────────────────────────────┐
│ Observability (Prometheus/Loki) │
│ + Tracing (Jaeger/Tempo) │
└─────────────────────────────────────┘
Conclusion
Platform engineering is about creating paved roads that make the right way the easy way. The tools landscape is rich and constantly evolving, but the principles remain constant:
✅ Automate everything - Reduce manual toil
✅ Self-service - Enable developer autonomy
✅ Observability - Make systems transparent
✅ Security - Shift left on security practices
✅ Standards - Create golden paths
✅ Documentation - Make knowledge accessible
Next Steps
- Assess current state - Understand pain points
- Define platform vision - Set clear goals
- Start small - Pick one area to improve
- Measure success - Track DORA metrics
- Iterate - Continuously improve based on feedback
The future of platform engineering is bright, with tools becoming more powerful and accessible. Start building your platform today and empower your teams to ship software faster and more reliably.
Which platform engineering tools are you excited about? Share your experiences and challenges in the comments!
