
Kubernetes Production Checklist

May 10, 2025
  • Kubernetes
Read time: 6 minutes
Abhimanyu Saharan
Table of Contents
  1. Health Checks
  2. Best Practices
  3. Application Resilience
  4. Best Practices
  5. Scaling
  6. Best Practices
  7. Resource Management
  8. Best Practices
  9. Security
  10. Best Practices
  11. Secrets and Configuration
  12. Best Practices
  13. Networking
  14. Best Practices
  15. Tagging and Metadata
  16. Best Practices
  17. Observability and Logging
  18. Best Practices
  19. Governance and RBAC
  20. Best Practices
  21. Policy Enforcement
  22. Best Practices
  23. Cluster Hardening
  24. Best Practices
  25. Authentication
  26. Best Practices
  27. Final Thoughts


Before we jump into the checklist, let me be upfront: this is the exact list I use when reviewing or building production-grade Kubernetes setups—whether it's for an internal platform or a client-facing app. These practices aren't hypothetical or aspirational; they're grounded in real incidents, hard lessons, and the pressure to keep systems up 24/7. If you're running anything beyond toy clusters, you'll likely find something here worth applying.

Kubernetes has become the default platform for orchestrating containerized applications. But while getting your first cluster running is straightforward, taking Kubernetes to production is a different game altogether. There are countless configurations, edge cases, and behaviors that can make or break the availability, scalability, and security of your system.

This guide provides a detailed and opinionated checklist of production best practices for Kubernetes, categorized into key areas such as health checks, scaling, security, governance, logging, and more. Whether you're building new workloads or reviewing existing ones, this document helps you benchmark your setup.

Health Checks

Kubernetes supports three kinds of probes: readiness, liveness, and startup. Each serves a distinct purpose in the lifecycle of a pod. Readiness indicates when an app is ready to serve traffic, liveness checks whether it's still responsive, and startup allows for complex apps to boot before other checks begin.

Why it matters

Without proper health checks, Kubernetes may send traffic to containers that aren’t ready or fail to restart containers that are stuck. This leads to unreliable user experiences and degraded service availability.

Best Practices

Define Readiness Probes

Readiness probes prevent traffic from being routed to a container until it’s fully initialized.

  • Use HTTP, TCP, or command-based probes.
  • Tune initialDelaySeconds, periodSeconds, and failureThreshold for your app’s boot profile.
Note

A missing readiness probe can cause early traffic to fail, particularly for apps with slow startup logic (e.g., JIT compilation, DB initialization).
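
As a rough sketch of those knobs in practice (the image, port, and /healthz/ready path below are placeholders for illustration, not values from this post):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: web
      image: registry.example.com/web:1.0   # hypothetical image
      ports:
        - containerPort: 8080
      readinessProbe:
        httpGet:
          path: /healthz/ready              # hypothetical readiness endpoint
          port: 8080
        initialDelaySeconds: 10             # give slow boot logic time before the first check
        periodSeconds: 5
        failureThreshold: 3
```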

Use Liveness Probes for Recovery

Liveness probes detect when your application is stuck and restart it.

  • Don’t rely on liveness probes to handle fatal errors; let the app exit so Kubernetes restarts it.
  • Use lightweight endpoints that always respond 200 OK when healthy.
Note

Liveness probes are not an error handling mechanism. They are for recovery only.
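
A container-level snippet to pair with the readiness probe above; the /healthz/live path is an assumed lightweight endpoint that returns 200 OK whenever the process is responsive:

```yaml
livenessProbe:
  httpGet:
    path: /healthz/live   # assumed endpoint; deliberately separate from readiness
    port: 8080
  periodSeconds: 10
  failureThreshold: 3     # restart only after sustained failures, not a single blip
```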

Avoid Sharing Readiness and Liveness Endpoints

Pointing both probes at the same endpoint blurs two different signals: a failure that should only take a pod out of rotation can also trigger a restart, sometimes before the container has ever reported ready.

  • Split endpoints or vary the logic if reusing the same one.

Consider Startup Probes for Slow Boot Apps

Startup probes temporarily disable liveness and readiness checks while your app initializes.

  • Useful for apps like JVM-based services or apps requiring DB migrations.
Note

Without a startup probe, a slow-booting app can be killed by its liveness probe before it has finished starting.
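
A minimal sketch: until the startup probe succeeds, Kubernetes holds off on liveness and readiness checks, so the settings below give the container roughly five minutes (30 × 10s) to boot before it can be killed. Path and port are assumptions carried over from the earlier snippets.

```yaml
startupProbe:
  httpGet:
    path: /healthz/ready   # assumed endpoint; the readiness endpoint is often reused here
    port: 8080
  periodSeconds: 10
  failureThreshold: 30     # 30 × 10s ≈ 5 minutes of startup allowance
```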

Application Resilience

Application resilience ensures that your apps recover gracefully from failures and do not fall over when external dependencies misbehave. Kubernetes assumes containers can start independently: apps must not fail permanently if their dependencies (e.g., databases, APIs) are momentarily unavailable. Handling shutdown signals properly ensures smoother deployments and rollouts.

Why it matters

Apps that fail on startup due to temporary issues or do not shut down properly can lead to service outages and cascading failures.

Best Practices

  • Gracefully handle SIGTERM using a preStop hook and drain logic.
  • Make readiness checks independent of external services.
  • Implement retry logic for startup dependencies.
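
To ground the shutdown-handling point, here is one common pattern, sketched with hypothetical names: a short preStop sleep delays SIGTERM so endpoints and load balancers stop sending traffic first, and terminationGracePeriodSeconds leaves room for in-flight requests to drain.

```yaml
spec:
  terminationGracePeriodSeconds: 60         # covers the preStop hook plus app drain time
  containers:
    - name: web
      image: registry.example.com/web:1.0   # hypothetical image
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 10"]   # let endpoint removal propagate before SIGTERM
```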

Scaling

Scaling lets Kubernetes adjust resources to real-time demand so the system stays available and cost-efficient. The Horizontal Pod Autoscaler (HPA) scales workloads based on metrics such as CPU and memory usage, while the Cluster Autoscaler adjusts node counts. Local storage impedes scalability; external persistence is preferred.

Why it matters

Without effective scaling strategies, your system may underperform during spikes or over-provision during low usage, leading to poor reliability and high costs.

Best Practices

  • Use Horizontal Pod Autoscaler (HPA) to scale based on metrics.
  • Use Cluster Autoscaler to scale infrastructure nodes.
  • Avoid Vertical Pod Autoscaler (VPA) in production unless necessary.
  • Avoid local storage for stateful apps.
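
A sketch of a CPU-based HPA for a hypothetical Deployment named web; the real target utilization and replica bounds depend on your traffic profile:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                      # hypothetical workload
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU passes 70% of requests
```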

Resource Management

Resource limits and requests help Kubernetes schedule workloads accurately and protect node stability. Setting resource requests informs the scheduler about minimum requirements. Limits prevent a pod from consuming too much. Use Vertical Pod Autoscaler (VPA) in recommendation mode to refine these settings.

Why it matters

Improper resource limits can lead to pod evictions, node crashes, or poor performance. They are also essential for cluster scheduling.

Best Practices

  • Always define CPU and memory requests and limits.
  • Use LimitRange to enforce defaults in namespaces.
  • Be cautious with CPU limits; prefer setting only requests.
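
A sketch under the assumptions above: a container-level resources block that sets requests plus a memory limit (no CPU limit), and a LimitRange that supplies namespace defaults. The namespace name and the numbers are illustrative only.

```yaml
# Container snippet
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    memory: 512Mi        # memory limit only; CPU limit intentionally omitted
---
# Namespace defaults for containers that do not declare their own values
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: my-app      # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      default:
        memory: 256Mi
```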

Security

Security practices reduce the attack surface and contain potential breaches within the cluster. A pod's or container's securityContext lets you define the user it runs as, file system access, and allowed kernel capabilities. Unless restricted explicitly, containers typically run as root with a broad default set of Linux capabilities.

Why it matters

Kubernetes runs production infrastructure. Misconfigured security policies can result in privilege escalations or data exfiltration.

Best Practices

  • Run containers as non-root using runAsUser and runAsNonRoot.
  • Use read-only root filesystems to prevent tampering.
  • Disable privilege escalation.
  • Drop unnecessary Linux capabilities.
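
Those four practices map directly onto pod- and container-level securityContext fields; a sketch with a hypothetical image and UID:

```yaml
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001                        # assumed non-root UID present in the image
  containers:
    - name: web
      image: registry.example.com/web:1.0   # hypothetical image
      securityContext:
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]                     # add back only what the app truly needs
```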

Secrets and Configuration

Configuration and secrets management separates application code from environment-specific details. Separate configuration from code for better portability. Avoid using env vars for secrets because they can be accessed via process inspection. Secrets should be encrypted at rest and access-controlled.

Why it matters

Improper handling of configuration and secrets can lead to credentials exposure, leaked data, or insecure systems.

Best Practices

  • Use ConfigMaps for non-sensitive configurations.
  • Mount secrets as volumes instead of env vars.
  • Use encrypted secret management tools like Sealed Secrets, SOPS, or Vault.
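
A sketch of mounting a Secret as a file volume rather than exposing it through environment variables; the Secret name and mount path are assumptions, and the Secret itself would ideally be produced by Sealed Secrets, SOPS, or Vault rather than committed in plain text:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: web
      image: registry.example.com/web:1.0   # hypothetical image
      volumeMounts:
        - name: db-credentials
          mountPath: /var/run/secrets/db    # app reads credential files from here
          readOnly: true
  volumes:
    - name: db-credentials
      secret:
        secretName: db-credentials          # hypothetical Secret
```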

Networking

Networking governs communication between services and enforces access controls across the cluster. Ingress controllers manage external traffic using Layer 7 rules. NetworkPolicies are firewalls within the cluster that restrict ingress and egress between pods.

Why it matters

Unrestricted network traffic between pods can lead to lateral movement in case of a breach or cause unintentional outages.

Best Practices

  • Use an ingress controller for HTTP routing.
  • Define NetworkPolicies to enforce communication rules.
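
One common shape for this, sketched with hypothetical namespace and label names: deny all ingress by default, then explicitly allow traffic from the ingress controller's namespace to the web pods.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: my-app                  # hypothetical namespace
spec:
  podSelector: {}                    # applies to every pod in the namespace
  policyTypes: ["Ingress"]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress-controller
  namespace: my-app
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: web    # hypothetical app label
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx   # assumes an ingress-nginx namespace
      ports:
        - port: 8080
```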

Tagging and Metadata

Metadata tagging helps with ownership tracking, auditing, cost attribution, and compliance. Proper labels enable observability, cost tracking, and automated policy enforcement. They also help tools like Prometheus or cost dashboards to group resources meaningfully.

Why it matters

Without structured tagging, it's difficult to answer basic questions about resource ownership, cost, or compliance status.

Best Practices

  • Use Kubernetes standard labels (app.kubernetes.io/*).
  • Add business metadata like owner, project, and cost-center.
  • Tag for security (confidentiality, compliance).
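
A metadata sketch combining the standard app.kubernetes.io labels with business labels; the business keys (owner, cost-center, data-classification) are examples, not a standard, so align them with whatever your organization has agreed on:

```yaml
metadata:
  labels:
    app.kubernetes.io/name: checkout        # hypothetical service
    app.kubernetes.io/version: "1.4.2"
    app.kubernetes.io/part-of: storefront
    app.kubernetes.io/managed-by: helm
    owner: payments-team
    cost-center: cc-1234
    data-classification: internal
```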

Observability and Logging

Logging and monitoring help you detect failures early and understand the system state. Passive logging keeps the app simple and defers aggregation to external tools. Node-level daemons collect logs from containers and send them to systems like Elasticsearch or Loki.

Why it matters

Without observability, diagnosing incidents becomes guesswork. Log access and retention are essential for debugging and audits.

Best Practices

  • Log to stdout/stderr following 12-factor principles.
  • Use a node-level logging daemon (e.g., FluentBit).
  • Avoid logging via sidecars unless necessary.
  • Plan for 30+ days of log retention.
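
A heavily trimmed sketch of a node-level log collector as a DaemonSet, assuming Fluent Bit; the output configuration that ships logs to Elasticsearch or Loki (normally a ConfigMap plus RBAC) is omitted here, and in practice you would install the official chart rather than hand-roll this:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging                        # hypothetical namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: fluent-bit
  template:
    metadata:
      labels:
        app.kubernetes.io/name: fluent-bit
    spec:
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:2.2      # pin an exact version in practice
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log                  # container logs live under /var/log/containers
```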

Governance and RBAC

Governance ensures workloads adhere to organizational rules and limits access to what’s necessary. RBAC ensures that users and services can perform only authorized actions. Quotas and limits protect cluster stability.

Why it matters

Over-permissioned service accounts and lack of quotas can lead to resource exhaustion or security breaches.

Best Practices

  • Set ResourceQuotas and LimitRanges in all namespaces.
  • Disable auto-mounted default service accounts.
  • Use granular RBAC roles per application/service.
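
A sketch of the first two points for a hypothetical namespace: a ResourceQuota that caps aggregate usage, and the default ServiceAccount with token auto-mounting switched off (pods that genuinely need API access get their own accounts and roles).

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: my-app          # hypothetical namespace; numbers are illustrative
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.memory: 40Gi
    pods: "50"
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: my-app
automountServiceAccountToken: false   # stop mounting the default token into every pod
```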

Policy Enforcement

Policies enforce organizational standards across multiple teams and namespaces. Policy agents validate Kubernetes resources at admission time and prevent violations. This improves trust, especially in shared clusters.

Why it matters

Without policy controls, different teams may unknowingly violate security, compliance, or architectural standards.

Best Practices

  • Use OPA Gatekeeper or Kyverno for image and label policies.
  • Restrict ingress hostname duplication.
  • Allow only known domains in ingress rules.
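
As one sketch of admission-time validation, here is a Kyverno ClusterPolicy that rejects workloads missing an owner label; the same rule can be expressed as an OPA Gatekeeper constraint, and the label key is an assumption:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-owner-label
spec:
  validationFailureAction: Enforce    # reject non-compliant resources at admission
  rules:
    - name: check-owner-label
      match:
        any:
          - resources:
              kinds: ["Deployment", "StatefulSet"]
      validate:
        message: "All workloads must carry an 'owner' label."
        pattern:
          metadata:
            labels:
              owner: "?*"             # any non-empty value
```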

Cluster Hardening

Hardening the cluster base ensures your platform is built on secure, stable foundations. Cluster hardening includes setting admission controls, auditing, and avoiding insecure defaults. Tools like kube-bench simplify this process.

Why it matters

The default configuration of many clusters exposes dangerous features. Attackers often target misconfigured control planes.

Best Practices

  • Run kube-bench to check compliance with CIS benchmarks.
  • Disable access to cloud metadata APIs from pods.
  • Disable unused alpha/beta features.
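
For the cloud metadata point, a common approach (where your CNI enforces egress policies) is a NetworkPolicy that permits general egress but carves out the metadata address; the namespace name is hypothetical:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-cloud-metadata
  namespace: my-app          # hypothetical namespace
spec:
  podSelector: {}            # applies to every pod in the namespace
  policyTypes: ["Egress"]
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 169.254.169.254/32   # link-local cloud metadata endpoint
```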

Authentication

Secure user and machine access to the cluster is foundational to everything else. OIDC supports identity federation and SSO. Human users should authenticate via IAM or IDPs, not service account tokens.

Why it matters

Exposed credentials or weak access controls can compromise the entire cluster.

Best Practices

  • Use OpenID Connect (OIDC) for user auth.
  • Reserve ServiceAccount tokens for workloads only.
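
On self-managed control planes, OIDC is typically wired up through kube-apiserver flags; managed distributions (EKS, GKE, AKS, Rancher) expose equivalent settings through their own configuration. A sketch with a hypothetical identity provider:

```yaml
# Excerpt of kube-apiserver arguments (for example, in its static pod manifest)
- --oidc-issuer-url=https://idp.example.com   # hypothetical IdP
- --oidc-client-id=kubernetes
- --oidc-username-claim=email
- --oidc-groups-claim=groups
```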

Final Thoughts

Kubernetes gives you infinite flexibility, but also infinite ways to misconfigure your stack. This checklist is the result of lessons learned across many production environments.

Whether you're a startup deploying your first service or a platform team managing dozens of clusters, take time to audit your setup against these practices.

And remember: simplicity, observability, and security always win in production.
