Kubernetes Production Checklist
Before we jump into the checklist, let me be upfront: this is the exact list I use when reviewing or building production-grade Kubernetes setups—whether it's for an internal platform or a client-facing app. These practices aren't hypothetical or aspirational; they're grounded in real incidents, hard lessons, and the pressure to keep systems up 24/7. If you're running anything beyond toy clusters, you'll likely find something here worth applying.
Kubernetes has become the default platform for orchestrating containerized applications. But while getting your first cluster running is straightforward, taking Kubernetes to production is a different game altogether. There are countless configurations, edge cases, and behaviors that can make or break the availability, scalability, and security of your system.
This guide provides a detailed and opinionated checklist of production best practices for Kubernetes, categorized into key areas such as health checks, scaling, security, governance, logging, and more. Whether you're building new workloads or reviewing existing ones, this document helps you benchmark your setup.
Health Checks
Kubernetes supports three kinds of probes: readiness, liveness, and startup. Each serves a distinct purpose in the lifecycle of a pod. Readiness indicates when an app is ready to serve traffic, liveness checks whether it's still responsive, and startup allows for complex apps to boot before other checks begin.
Why it matters: Without proper health checks, Kubernetes may send traffic to containers that aren’t ready or fail to restart containers that are stuck. This leads to unreliable user experiences and degraded service availability.
Best Practices
Define Readiness Probes
Readiness probes prevent traffic from being routed to a container until it’s fully initialized.
- Use HTTP, TCP, or command-based probes.
- Tune initialDelaySeconds, periodSeconds, and failureThreshold for your app’s boot profile (see the sketch below).
Note: A missing readiness probe can cause early traffic to fail, particularly for apps with slow startup logic (e.g., JIT compilation, DB initialization).
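As a reference point, a minimal readiness probe might look like the sketch below. The pod name, image, port, endpoint path, and timing values are all illustrative; tune them to your app's actual boot profile.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: app
      image: registry.example.com/web:1.0   # placeholder image
      ports:
        - containerPort: 8080
      readinessProbe:
        httpGet:
          path: /readyz            # assumes the app exposes a readiness endpoint here
          port: 8080
        initialDelaySeconds: 10    # allow for slow startup logic (JIT, DB init)
        periodSeconds: 5
        failureThreshold: 3        # mark the pod unready after 3 consecutive failures
```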
Use Liveness Probes for Recovery
Liveness probes detect when your application is stuck and restart it.
- Don’t rely on liveness probes to handle fatal errors; let the app crash and exit instead.
- Use lightweight endpoints that always respond 200 OK when healthy.
Note: Liveness probes are not an error handling mechanism. They are for recovery only.
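A liveness probe for the same hypothetical container might look like this; note that it points at a lightweight /healthz endpoint distinct from the readiness path, in line with the next section. All values are placeholders.

```yaml
# container spec excerpt (same hypothetical app as above)
livenessProbe:
  httpGet:
    path: /healthz           # lightweight endpoint, separate from the readiness path
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3        # after 3 consecutive failures the kubelet restarts the container
```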
Avoid Sharing Readiness and Liveness Endpoints
Pointing both probes to the same endpoint can cause containers to be restarted before they’ve ever reported as ready.
- Split endpoints or vary the logic if reusing the same one.
Consider Startup Probes for Slow Boot Apps
Startup probes temporarily disable liveness and readiness checks while your app initializes.
- Useful for apps like JVM-based services or apps requiring DB migrations.
Note: Without a startup probe, a slow-booting app can be killed by its liveness probe before it has finished initializing.
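A rough sketch for the same hypothetical app: the startup probe below gives it up to roughly five minutes to boot before the liveness and readiness probes take over; the numbers are placeholders.

```yaml
# container spec excerpt
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # 30 x 10s = up to ~300s of boot time before liveness kicks in
```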
Application Resilience
Application resilience ensures that your apps recover gracefully from failures and are not taken down by flaky external dependencies. Kubernetes assumes containers can start independently, so apps must not fail outright if their dependencies (e.g., databases, APIs) are momentarily unavailable. Handling shutdowns properly ensures smoother deployments and rollouts.
Why it matters: Apps that fail on startup due to temporary issues, or that do not shut down properly, can lead to service outages and cascading failures.
Best Practices
- Gracefully handle SIGTERM using a preStop hook and drain logic.
- Make readiness checks independent of external services.
- Implement retry logic for startup dependencies.
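One common shutdown pattern, sketched below with placeholder names and timings: a short preStop sleep gives endpoint removal time to propagate before the container receives SIGTERM, and terminationGracePeriodSeconds bounds the whole shutdown.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  terminationGracePeriodSeconds: 30       # total time allowed for draining and shutdown
  containers:
    - name: app
      image: registry.example.com/web:1.0   # placeholder image
      lifecycle:
        preStop:
          exec:
            # preStop runs before SIGTERM is sent; a brief sleep lets the pod drop
            # out of Service endpoints so no new traffic arrives mid-shutdown
            command: ["sh", "-c", "sleep 10"]
```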
Scaling
Scaling enables Kubernetes to adjust resources based on real-time demand to maintain availability and efficiency. HPA scales workloads based on resource usage like CPU/memory. Cluster Autoscaler adjusts node counts. Local storage impedes scalability; external persistence is preferred.
Why it matters: Without effective scaling strategies, your system may underperform during spikes or over-provision during low usage, leading to poor reliability and high costs.
Best Practices
- Use Horizontal Pod Autoscaler (HPA) to scale based on metrics.
- Use Cluster Autoscaler to scale infrastructure nodes.
- Avoid Vertical Pod Autoscaler (VPA) in production unless necessary.
- Avoid local storage for stateful apps.
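For reference, a CPU-based HPA might look like the following sketch; it assumes a Deployment named web and uses illustrative replica counts and a 70% utilization target.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                # assumes a Deployment named "web"
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70% of requests
```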
Resource Management
Resource limits and requests help Kubernetes schedule workloads accurately and protect node stability. Setting resource requests informs the scheduler about minimum requirements. Limits prevent a pod from consuming too much. Use Vertical Pod Autoscaler (VPA) in recommendation mode to refine these settings.
Why it matters: Improper resource requests and limits can lead to pod evictions, node crashes, or poor performance, and the scheduler depends on accurate requests to place workloads.
Best Practices
- Define CPU and memory requests for every container, and always set memory limits.
- Use LimitRange to enforce defaults in namespaces.
- Be cautious with CPU limits; they can throttle workloads, so prefer setting only CPU requests (see the sketch below).
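A sketch of both ideas, with made-up numbers and a hypothetical team-a namespace: requests for CPU and memory, a memory limit but no CPU limit, and a LimitRange that supplies namespace defaults.

```yaml
# container spec excerpt
resources:
  requests:
    cpu: 250m          # what the scheduler reserves for this container
    memory: 256Mi
  limits:
    memory: 512Mi      # memory limit only; CPU limit deliberately omitted
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: team-a    # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      default:
        memory: 256Mi  # default limit applied when a container specifies none
```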
Security
Security practices reduce the attack surface and contain any potential breaches within the cluster. SecurityContext in Kubernetes lets you define user privileges, file system access, and allowed kernel capabilities. By default, containers run as whatever user the image specifies (often root) and keep a broad set of kernel capabilities unless you restrict them explicitly.
Why it matters: Kubernetes runs production infrastructure. Misconfigured security policies can result in privilege escalation or data exfiltration.
Best Practices
- Run containers as non-root using runAsUser and runAsNonRoot.
- Use read-only root filesystems to prevent tampering.
- Disable privilege escalation.
- Drop unnecessary Linux capabilities.
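Taken together, those settings form a container-level securityContext along these lines; the UID is arbitrary and must exist in, or be permitted by, your image.

```yaml
# container-level securityContext excerpt
securityContext:
  runAsNonRoot: true
  runAsUser: 10001               # illustrative non-zero UID
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL                      # add back only the capabilities the app truly needs
```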
Secrets and Configuration
Configuration and secrets management separates application code from environment-specific details, which keeps workloads portable. Avoid using env vars for secrets: they can be read via process inspection and easily leak into logs or crash dumps. Secrets should be encrypted at rest and access-controlled.
Why it matters: Improper handling of configuration and secrets can lead to credential exposure, leaked data, or insecure systems.
Best Practices
- Use ConfigMaps for non-sensitive configurations.
- Mount secrets as volumes instead of env vars.
- Use encrypted secret management tools like Sealed Secrets, SOPS, or Vault.
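A minimal sketch of mounting a Secret as a file instead of an environment variable; the pod, image, mount path, and Secret name (db-credentials) are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: app
      image: registry.example.com/web:1.0   # placeholder image
      volumeMounts:
        - name: db-credentials
          mountPath: /etc/secrets           # app reads credentials from files here
          readOnly: true
  volumes:
    - name: db-credentials
      secret:
        secretName: db-credentials          # assumes this Secret already exists
```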
Networking
Networking governs communication between services and enforces access controls across the cluster. Ingress controllers manage external traffic using Layer 7 rules. NetworkPolicies act as in-cluster firewalls that restrict ingress and egress traffic between pods.
Why it matters: Unrestricted network traffic between pods can lead to lateral movement in case of a breach or cause unintentional outages.
Best Practices
- Use an ingress controller for HTTP routing.
- Define NetworkPolicies to enforce communication rules.
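As an illustration, the policy below allows only pods labeled as the frontend to reach the API pods on port 8080 in a hypothetical team-a namespace; pair it with a default-deny policy so everything else is blocked.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: team-a                  # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: api    # the pods being protected
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: frontend
      ports:
        - protocol: TCP
          port: 8080
```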
Tagging and Metadata
Metadata tagging helps with ownership tracking, auditing, cost attribution, and compliance. Proper labels enable observability, cost tracking, and automated policy enforcement. They also help tools like Prometheus or cost dashboards to group resources meaningfully.
Why it matters: Without structured tagging, it’s difficult to answer basic questions about resource ownership, cost, or compliance status.
Best Practices
- Use Kubernetes standard labels (app.kubernetes.io/*).
- Add business metadata like owner, project, and cost-center.
- Tag for security (confidentiality, compliance).
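A label set along these lines covers both the standard keys and the business metadata; every value here is an example, and key names like cost-center should match whatever your cost and compliance tooling expects.

```yaml
metadata:
  labels:
    app.kubernetes.io/name: payments-api   # standard Kubernetes labels
    app.kubernetes.io/part-of: billing
    app.kubernetes.io/version: "1.4.2"
    owner: payments-team                   # business metadata (example values)
    project: billing
    cost-center: cc-1234
    confidentiality: internal              # security/compliance tag
```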
Observability and Logging
Logging and monitoring help you detect failures early and understand the system state. Passive logging keeps the app simple and defers aggregation to external tools. Node-level daemons collect logs from containers and send them to systems like Elasticsearch or Loki.
Why it matters: Without observability, diagnosing incidents becomes guesswork. Log access and retention are essential for debugging and audits.
Best Practices
- Log to stdout/stderr following 12-factor principles.
- Use a node-level logging daemon (e.g., FluentBit).
- Avoid logging via sidecars unless necessary.
- Plan for 30+ days of log retention.
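The node-level pattern usually takes the shape of a DaemonSet that reads container logs from each node. The sketch below is deliberately stripped down (no ConfigMap, RBAC, or output configuration); in practice you would likely install Fluent Bit via its Helm chart and pin an exact version.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: fluent-bit
  template:
    metadata:
      labels:
        app.kubernetes.io/name: fluent-bit
    spec:
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:2.2     # illustrative tag; pin an exact version
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true               # read container logs from the node
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
```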
Governance and RBAC
Governance ensures workloads adhere to organizational rules and limits access to what’s necessary. RBAC ensures that users and services can perform only authorized actions. Quotas and limits protect cluster stability.
Why it matters: Over-permissioned service accounts and a lack of quotas can lead to resource exhaustion or security breaches.
Best Practices
- Set ResourceQuotas and LimitRanges in all namespaces.
- Disable auto-mounting of the default ServiceAccount token.
- Use granular RBAC roles per application/service.
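Two of those controls, sketched for a hypothetical team-a namespace with made-up numbers: a ResourceQuota capping aggregate usage, and the default ServiceAccount with token auto-mounting turned off.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a        # hypothetical namespace
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.memory: 128Gi
    pods: "100"
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: team-a
automountServiceAccountToken: false   # pods must opt in to mounting an API token
```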
Policy Enforcement
Policies enforce organizational standards across multiple teams and namespaces. Policy agents validate Kubernetes resources at admission time and prevent violations. This improves trust, especially in shared clusters.
Why it matters: Without policy controls, different teams may unknowingly violate security, compliance, or architectural standards.
Best Practices
- Use OPA Gatekeeper or Kyverno to enforce image and label policies (see the sketch below).
- Prevent duplicate ingress hostnames across teams and namespaces.
- Allow only known domains in ingress rules.
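If you use Kyverno, a label-enforcement policy might be sketched roughly like this (Gatekeeper expresses the same idea through constraint templates and Rego). The policy name, matched kinds, and required label are assumptions to adapt.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-owner-label
spec:
  validationFailureAction: Enforce     # start with Audit to gauge impact first
  rules:
    - name: check-owner-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
                - StatefulSet
      validate:
        message: "All workloads must carry an 'owner' label."
        pattern:
          metadata:
            labels:
              owner: "?*"              # any non-empty value
```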
Cluster Hardening
Hardening the cluster base ensures your platform is built on secure, stable foundations. Cluster hardening includes setting admission controls, auditing, and avoiding insecure defaults. Tools like kube-bench simplify this process.
Why it matters: The default configuration of many clusters exposes dangerous features, and attackers often target misconfigured control planes.
Best Practices
- Run kube-bench to check compliance with CIS benchmarks.
- Disable access to cloud metadata APIs from pods (see the NetworkPolicy sketch below).
- Disable unused alpha/beta features.
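Blocking the metadata endpoint can be sketched as an egress NetworkPolicy like the one below (the namespace name is a placeholder, and the policy otherwise leaves egress open); treat it as a complement to provider-side controls, not a replacement.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-cloud-metadata
  namespace: team-a              # apply per namespace; name is a placeholder
spec:
  podSelector: {}                # all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 169.254.169.254/32   # link-local metadata endpoint on most clouds
```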
Authentication
Secure user and machine access to the cluster is foundational to everything else. OIDC supports identity federation and SSO. Human users should authenticate via IAM or IDPs, not service account tokens.
Why it matters: Exposed credentials or weak access controls can compromise the entire cluster.
Best Practices
- Use OpenID Connect (OIDC) for user auth.
- Reserve ServiceAccount tokens for workloads only.
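On self-managed clusters, OIDC is typically wired up through kube-apiserver flags roughly like the excerpt below; the issuer URL, client ID, and claim names are placeholders, and managed offerings expose equivalent settings through their own configuration.

```yaml
# kube-apiserver static pod manifest excerpt (other flags omitted)
spec:
  containers:
    - name: kube-apiserver
      command:
        - kube-apiserver
        - --oidc-issuer-url=https://idp.example.com   # placeholder IdP
        - --oidc-client-id=kubernetes                 # placeholder client ID
        - --oidc-username-claim=email
        - --oidc-groups-claim=groups
```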
Final Thoughts
Kubernetes gives you infinite flexibility, but also infinite ways to misconfigure your stack. This checklist is the result of lessons learned across many production environments.
Whether you're a startup deploying your first service or a platform team managing dozens of clusters, take time to audit your setup against these practices.
And remember: simplicity, observability, and security always win in production.