Kubernetes Production Checklist
Before we jump into the checklist, let me be upfront: this is the exact list I use when reviewing or building production-grade Kubernetes setups—whether it's for an internal platform or a client-facing app. These practices aren't hypothetical or aspirational; they're grounded in real incidents, hard lessons, and the pressure to keep systems up 24/7. If you're running anything beyond toy clusters, you'll likely find something here worth applying.
Kubernetes has become the default platform for orchestrating containerized applications. But while getting your first cluster running is straightforward, taking Kubernetes to production is a different game altogether. There are countless configurations, edge cases, and behaviors that can make or break the availability, scalability, and security of your system.
This guide provides a detailed and opinionated checklist of production best practices for Kubernetes, categorized into key areas such as health checks, scaling, security, governance, logging, and more. Whether you're building new workloads or reviewing existing ones, this document helps you benchmark your setup.
Health Checks
Kubernetes supports three kinds of probes: readiness, liveness, and startup. Each serves a distinct purpose in the lifecycle of a pod. Readiness indicates when an app is ready to serve traffic, liveness checks whether it's still responsive, and startup allows for complex apps to boot before other checks begin.
Why it matters: Without proper health checks, Kubernetes may send traffic to containers that aren’t ready or fail to restart containers that are stuck. This leads to unreliable user experiences and degraded service availability.
Best Practices
Define Readiness Probes
Readiness probes prevent traffic from being routed to a container until it’s fully initialized.
- Use HTTP, TCP, or command-based probes.
- Tune initialDelaySeconds, periodSeconds, and failureThreshold for your app’s boot profile (see the sketch below).
Note: A missing readiness probe can cause early traffic to fail, particularly for apps with slow startup logic (e.g., JIT compilation, DB initialization).
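As a reference point, a minimal readiness probe might look like the sketch below. The pod name, image, port, endpoint path, and timing values are all illustrative; tune them to your app's actual boot profile.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: app
      image: registry.example.com/web:1.0   # placeholder image
      ports:
        - containerPort: 8080
      readinessProbe:
        httpGet:
          path: /readyz            # assumes the app exposes a readiness endpoint here
          port: 8080
        initialDelaySeconds: 10    # allow for slow startup logic (JIT, DB init)
        periodSeconds: 5
        failureThreshold: 3        # mark the pod unready after 3 consecutive failures
```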
Use Liveness Probes for Recovery
Liveness probes detect when your application is stuck and restart it.
- Don’t rely on liveness probes to handle fatal errors; let the app crash and exit instead.
- Use lightweight endpoints that always respond 200 OK when healthy.
Note: Liveness probes are not an error handling mechanism. They are for recovery only.
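A liveness probe for the same hypothetical container might look like this; note that it points at a lightweight /healthz endpoint distinct from the readiness path, in line with the next section. All values are placeholders.

```yaml
# container spec excerpt (same hypothetical app as above)
livenessProbe:
  httpGet:
    path: /healthz           # lightweight endpoint, separate from the readiness path
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3        # after 3 consecutive failures the kubelet restarts the container
```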
Avoid Sharing Readiness and Liveness Endpoints
Pointing both probes to the same endpoint can cause containers to be restarted before they’ve ever reported as ready.
- Split endpoints or vary the logic if reusing the same one.
Consider Startup Probes for Slow Boot Apps
Startup probes temporarily disable liveness and readiness checks while your app initializes.
- Useful for apps like JVM-based services or apps requiring DB migrations.
Note: Without a startup probe, a slow-booting app can be killed by its liveness probe before it has finished initializing.
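A rough sketch for the same hypothetical app: the startup probe below gives it up to roughly five minutes to boot before the liveness and readiness probes take over; the numbers are placeholders.

```yaml
# container spec excerpt
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # 30 x 10s = up to ~300s of boot time before liveness kicks in
```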
Application Resilience
Application resilience ensures that your apps recover gracefully from failures and are not taken down by flaky external dependencies. Kubernetes assumes containers can start independently, so apps must not fail outright if their dependencies (e.g., databases, APIs) are momentarily unavailable. Handling shutdowns properly ensures smoother deployments and rollouts.
Why it matters: Apps that fail on startup due to temporary issues, or that do not shut down properly, can lead to service outages and cascading failures.
Best Practices
- Gracefully handle SIGTERM using a preStop hook and drain logic.
- Make readiness checks independent of external services.
- Implement retry logic for startup dependencies.
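One common shutdown pattern, sketched below with placeholder names and timings: a short preStop sleep gives endpoint removal time to propagate before the container receives SIGTERM, and terminationGracePeriodSeconds bounds the whole shutdown.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  terminationGracePeriodSeconds: 30       # total time allowed for draining and shutdown
  containers:
    - name: app
      image: registry.example.com/web:1.0   # placeholder image
      lifecycle:
        preStop:
          exec:
            # preStop runs before SIGTERM is sent; a brief sleep lets the pod drop
            # out of Service endpoints so no new traffic arrives mid-shutdown
            command: ["sh", "-c", "sleep 10"]
```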
Scaling
Scaling enables Kubernetes to adjust resources based on real-time demand to maintain availability and efficiency. HPA scales workloads based on resource usage like CPU/memory. Cluster Autoscaler adjusts node counts. Local storage impedes scalability; external persistence is preferred.
Why it matters: Without effective scaling strategies, your system may underperform during spikes or over-provision during low usage, leading to poor reliability and high costs.
Best Practices
- Use Horizontal Pod Autoscaler (HPA) to scale based on metrics.
- Use Cluster Autoscaler to scale infrastructure nodes.
- Avoid Vertical Pod Autoscaler (VPA) in production unless necessary.
- Avoid local storage for stateful apps.
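For reference, a CPU-based HPA might look like the following sketch; it assumes a Deployment named web and uses illustrative replica counts and a 70% utilization target.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                # assumes a Deployment named "web"
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70% of requests
```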
Resource Management
Resource limits and requests help Kubernetes schedule workloads accurately and protect node stability. Setting resource requests informs the scheduler about minimum requirements. Limits prevent a pod from consuming too much. Use Vertical Pod Autoscaler (VPA) in recommendation mode to refine these settings.
Why it matters: Improper resource requests and limits can lead to pod evictions, node crashes, or poor performance, and the scheduler depends on accurate requests to place workloads.
Best Practices
- Define CPU and memory requests for every container, and always set memory limits.
- Use LimitRange to enforce defaults in namespaces.
- Be cautious with CPU limits; they can throttle workloads, so prefer setting only CPU requests (see the sketch below).
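A sketch of both ideas, with made-up numbers and a hypothetical team-a namespace: requests for CPU and memory, a memory limit but no CPU limit, and a LimitRange that supplies namespace defaults.

```yaml
# container spec excerpt
resources:
  requests:
    cpu: 250m          # what the scheduler reserves for this container
    memory: 256Mi
  limits:
    memory: 512Mi      # memory limit only; CPU limit deliberately omitted
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: team-a    # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      default:
        memory: 256Mi  # default limit applied when a container specifies none
```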
Security
Security practices reduce the attack surface and contain any potential breaches within the cluster. SecurityContext in Kubernetes lets you define user privileges, file system access, and allowed kernel capabilities. By default, containers run as whatever user the image specifies (often root) and keep a broad set of kernel capabilities unless you restrict them explicitly.
Why it matters: Kubernetes runs production infrastructure. Misconfigured security policies can result in privilege escalation or data exfiltration.
Best Practices
- Run containers as non-root using runAsUser and runAsNonRoot.
- Use read-only root filesystems to prevent tampering.
- Disable privilege escalation.
- Drop unnecessary Linux capabilities.
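Taken together, those settings form a container-level securityContext along these lines; the UID is arbitrary and must exist in, or be permitted by, your image.

```yaml
# container-level securityContext excerpt
securityContext:
  runAsNonRoot: true
  runAsUser: 10001               # illustrative non-zero UID
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  capabilities:
    drop:
      - ALL                      # add back only the capabilities the app truly needs
```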
Secrets and Configuration
Configuration and secrets management separates application code from environment-specific details, which keeps workloads portable. Avoid using env vars for secrets: they can be read via process inspection and easily leak into logs or crash dumps. Secrets should be encrypted at rest and access-controlled.
Why it matters: Improper handling of configuration and secrets can lead to credential exposure, leaked data, or insecure systems.
Best Practices
- Use ConfigMaps for non-sensitive configurations.
- Mount secrets as volumes instead of env vars.
- Use encrypted secret management tools like Sealed Secrets, SOPS, or Vault.
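A minimal sketch of mounting a Secret as a file instead of an environment variable; the pod, image, mount path, and Secret name (db-credentials) are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: app
      image: registry.example.com/web:1.0   # placeholder image
      volumeMounts:
        - name: db-credentials
          mountPath: /etc/secrets           # app reads credentials from files here
          readOnly: true
  volumes:
    - name: db-credentials
      secret:
        secretName: db-credentials          # assumes this Secret already exists
```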
Networking
Networking governs communication between services and enforces access controls across the cluster. Ingress controllers manage external traffic using Layer 7 rules. NetworkPolicies act as in-cluster firewalls that restrict ingress and egress traffic between pods.
Why it matters: Unrestricted network traffic between pods can lead to lateral movement in case of a breach or cause unintentional outages.
Best Practices
- Use an ingress controller for HTTP routing.
- Define NetworkPolicies to enforce communication rules.
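As an illustration, the policy below allows only pods labeled as the frontend to reach the API pods on port 8080 in a hypothetical team-a namespace; pair it with a default-deny policy so everything else is blocked.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: team-a                  # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: api    # the pods being protected
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: frontend
      ports:
        - protocol: TCP
          port: 8080
```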
Tagging and Metadata
Metadata tagging helps with ownership tracking, auditing, cost attribution, and compliance. Proper labels enable observability, cost tracking, and automated policy enforcement. They also help tools like Prometheus or cost dashboards to group resources meaningfully.
Why it matters: Without structured tagging, it’s difficult to answer basic questions about resource ownership, cost, or compliance status.
Best Practices
- Use Kubernetes standard labels (app.kubernetes.io/*).
- Add business metadata like owner, project, and cost-center.
- Tag for security (confidentiality, compliance).
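A label set along these lines covers both the standard keys and the business metadata; every value here is an example, and key names like cost-center should match whatever your cost and compliance tooling expects.

```yaml
metadata:
  labels:
    app.kubernetes.io/name: payments-api   # standard Kubernetes labels
    app.kubernetes.io/part-of: billing
    app.kubernetes.io/version: "1.4.2"
    owner: payments-team                   # business metadata (example values)
    project: billing
    cost-center: cc-1234
    confidentiality: internal              # security/compliance tag
```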
Observability and Logging
Logging and monitoring help you detect failures early and understand the system state. Passive logging keeps the app simple and defers aggregation to external tools. Node-level daemons collect logs from containers and send them to systems like Elasticsearch or Loki.
Why it matters: Without observability, diagnosing incidents becomes guesswork. Log access and retention are essential for debugging and audits.
Best Practices
- Log to stdout/stderr following 12-factor principles.
- Use a node-level logging daemon (e.g., FluentBit).
- Avoid logging via sidecars unless necessary.
- Plan for 30+ days of log retention.
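The node-level pattern usually takes the shape of a DaemonSet that reads container logs from each node. The sketch below is deliberately stripped down (no ConfigMap, RBAC, or output configuration); in practice you would likely install Fluent Bit via its Helm chart and pin an exact version.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: fluent-bit
  template:
    metadata:
      labels:
        app.kubernetes.io/name: fluent-bit
    spec:
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:2.2     # illustrative tag; pin an exact version
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true               # read container logs from the node
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
```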
Governance and RBAC
Governance ensures workloads adhere to organizational rules and limits access to what’s necessary. RBAC ensures that users and services can perform only authorized actions. Quotas and limits protect cluster stability.
Why it matters: Over-permissioned service accounts and a lack of quotas can lead to resource exhaustion or security breaches.
Best Practices
- Set ResourceQuotas and LimitRanges in all namespaces.
- Disable auto-mounting of the default ServiceAccount token.
- Use granular RBAC roles per application/service.
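Two of those controls, sketched for a hypothetical team-a namespace with made-up numbers: a ResourceQuota capping aggregate usage, and the default ServiceAccount with token auto-mounting turned off.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a        # hypothetical namespace
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.memory: 128Gi
    pods: "100"
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: team-a
automountServiceAccountToken: false   # pods must opt in to mounting an API token
```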
Policy Enforcement
Policies enforce organizational standards across multiple teams and namespaces. Policy agents validate Kubernetes resources at admission time and prevent violations. This improves trust, especially in shared clusters.
Why it matters: Without policy controls, different teams may unknowingly violate security, compliance, or architectural standards.
Best Practices
- Use OPA Gatekeeper or Kyverno to enforce image and label policies (see the sketch below).
- Prevent duplicate ingress hostnames across teams and namespaces.
- Allow only known domains in ingress rules.
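If you use Kyverno, a label-enforcement policy might be sketched roughly like this (Gatekeeper expresses the same idea through constraint templates and Rego). The policy name, matched kinds, and required label are assumptions to adapt.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-owner-label
spec:
  validationFailureAction: Enforce     # start with Audit to gauge impact first
  rules:
    - name: check-owner-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
                - StatefulSet
      validate:
        message: "All workloads must carry an 'owner' label."
        pattern:
          metadata:
            labels:
              owner: "?*"              # any non-empty value
```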
Cluster Hardening
Hardening the cluster base ensures your platform is built on secure, stable foundations. Cluster hardening includes setting admission controls, auditing, and avoiding insecure defaults. Tools like kube-bench simplify this process.
Why it matters: The default configuration of many clusters exposes dangerous features, and attackers often target misconfigured control planes.
Best Practices
- Run kube-bench to check compliance with CIS benchmarks.
- Disable access to cloud metadata APIs from pods (see the NetworkPolicy sketch below).
- Disable unused alpha/beta features.
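Blocking the metadata endpoint can be sketched as an egress NetworkPolicy like the one below (the namespace name is a placeholder, and the policy otherwise leaves egress open); treat it as a complement to provider-side controls, not a replacement.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-cloud-metadata
  namespace: team-a              # apply per namespace; name is a placeholder
spec:
  podSelector: {}                # all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 169.254.169.254/32   # link-local metadata endpoint on most clouds
```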
Authentication
Secure user and machine access to the cluster is foundational to everything else. OIDC supports identity federation and SSO. Human users should authenticate via IAM or IDPs, not service account tokens.
Why it matters: Exposed credentials or weak access controls can compromise the entire cluster.
Best Practices
- Use OpenID Connect (OIDC) for user auth.
- Reserve ServiceAccount tokens for workloads only.
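On self-managed clusters, OIDC is typically wired up through kube-apiserver flags roughly like the excerpt below; the issuer URL, client ID, and claim names are placeholders, and managed offerings expose equivalent settings through their own configuration.

```yaml
# kube-apiserver static pod manifest excerpt (other flags omitted)
spec:
  containers:
    - name: kube-apiserver
      command:
        - kube-apiserver
        - --oidc-issuer-url=https://idp.example.com   # placeholder IdP
        - --oidc-client-id=kubernetes                 # placeholder client ID
        - --oidc-username-claim=email
        - --oidc-groups-claim=groups
```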
Final Thoughts
Kubernetes gives you infinite flexibility, but also infinite ways to misconfigure your stack. This checklist is the result of lessons learned across many production environments.
Whether you're a startup deploying your first service or a platform team managing dozens of clusters, take time to audit your setup against these practices.
And remember: simplicity, observability, and security always win in production.