Cluster Autoscaler on Rancher RKE2
A step-by-step guide to setting up the Cluster Autoscaler on RKE2 with Rancher, covering Helm deployment, scaling benefits, limitations, and troubleshooting.
Modern Kubernetes clusters need to scale on demand to handle varying workloads. Cluster Autoscaler (CA) is a Kubernetes component that automatically adjusts the number of nodes in your cluster by adding or removing worker nodes. It scales up when pods cannot be scheduled due to insufficient resources, and scales down when nodes are underutilized and their pods can be rescheduled elsewhere. In a Rancher RKE2 environment, the Cluster Autoscaler integrates with Rancher's provisioning system to manage node pools dynamically, ensuring your cluster is both elastic and efficient. This article will guide DevOps engineers, Kubernetes admins, and Rancher users through setting up the Cluster Autoscaler on RKE2 using Rancher’s UI, CLI, or Helm, and discuss its benefits, limitations, troubleshooting, and tuning.
How the Cluster Autoscaler Works in RKE2
The Cluster Autoscaler watches for pods in a Pending state that cannot be scheduled due to insufficient cluster capacity. It checks for unschedulable pods every 10 seconds by default (configurable via --scan-interval). If pending pods are detected, CA scales up the cluster by requesting a new node (within the limits you set for the node pool). Kubernetes then registers the new node and schedules the pending pods on it. Conversely, if a node has been underutilized for a while (no critical workloads, and its pods can fit on other nodes), CA may scale down (remove) that node, so you are not paying for idle resources. Importantly, CA bases its decisions on pod resource requests (not actual usage), so setting accurate resource requests and limits on your pods is crucial.
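As a small illustration (a generic Kubernetes manifest, not part of the autoscaler setup), the request below is what the autoscaler counts against node capacity, even if the container sits idle:

apiVersion: v1
kind: Pod
metadata:
  name: request-demo
spec:
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          cpu: "500m"     # the scheduler and CA reserve 0.5 CPU for this pod
          memory: 256Mi   # and 256Mi of memory, regardless of actual usage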
In Rancher RKE2 clusters, the autoscaler uses Rancher as its “cloud provider.” Rancher-managed clusters use node pools (via node drivers or Cloud Credentials) to provision nodes. The autoscaler will communicate with the Rancher server API to create or delete nodes in a node pool when scaling events occur. This means your cluster must be launched with Rancher’s node drivers (e.g. using an infrastructure provider like AWS, vSphere, etc. through Rancher). If you imported a custom cluster or manually provisioned nodes, the Rancher autoscaler provider won’t have an API to create new nodes. Assuming a Rancher-provisioned RKE2 cluster, the autoscaler will interact with Rancher to adjust the node pool sizes on demand.
By design, the Cluster Autoscaler runs on a control-plane (master) node for stability. Rancher RKE2 clusters taint control-plane nodes (so regular workloads don't run there), but we configure the autoscaler pod to tolerate the master node taints and use a node selector so it schedules on a control-plane node. This ensures the autoscaler isn't itself running on a worker node that it might scale down. Kubernetes best practices also recommend marking the autoscaler pod as a critical add-on (using priorityClassName: system-cluster-critical) so it won't be evicted under resource pressure. Additionally, by default CA will not scale down any node hosting certain system pods (non-mirrored pods in the kube-system namespace) to avoid disrupting core services. This behavior can be tuned (as we'll see later), but the default adds a layer of safety.
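For reference, the scheduling and eviction-protection settings look roughly like this as a pod-spec fragment (illustrative only; the Helm chart values shown later set the same fields):

# Pod-spec fragment: keep the autoscaler on control-plane nodes and protect it from eviction
priorityClassName: system-cluster-critical
nodeSelector:
  node-role.kubernetes.io/control-plane: "true"
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule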
Preparing Rancher API Access for Autoscaler
Before deploying the autoscaler, you need to provide it with credentials and information to access the Rancher API. The autoscaler will use these details to invoke Rancher’s cluster provisioning endpoints to add or remove nodes. Specifically, you should prepare:
- Rancher API URL: The URL of your Rancher server (e.g. https://<rancher-server>).
- API Token: A Rancher API access token with permissions to manage clusters/nodes. It's easiest to generate this in the Rancher UI under Account & API Keys (choose Create API Key with no scope). Using an admin-level token is simplest (the examples in this article use an admin token), though you can also create a restricted token with specific roles for better security.
- Cluster Identification: The autoscaler needs to know which cluster to scale. For the Rancher provider, you typically provide the cluster name and the cluster namespace (the namespace of the cluster's provisioning object in Rancher). With newer, Cluster API-driven Rancher provisioning, use the cluster's name as shown in the Rancher UI and its namespace (usually fleet-default).
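Before wiring the token into the cluster, it is worth confirming it can reach the Rancher API; a quick sanity check along these lines works (URL and token are placeholders, and -k is only needed if Rancher uses a self-signed certificate):

# Should return JSON describing the clusters this token can see
curl -sk -H "Authorization: Bearer <your-api-token>" https://<rancher-server>/v3/clusters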
These details are passed to the autoscaler via a cloud-config file. You can create a Kubernetes Secret or ConfigMap containing the Rancher connection info. For example, a secret manifest might look like:
apiVersion: v1
kind: Secret
metadata:
name: cluster-autoscaler-cloud-config
namespace: kube-system
type: Opaque
stringData:
cloud-config: |-
url: https://<your-rancher-server>
token: <your-api-token>
clusterName: <your-cluster-name>
clusterNamespace: fleet-default
In this file, url is the Rancher server URL, token is the API token, and clusterName/clusterNamespace refer to the target RKE2 cluster managed by Rancher. Once this secret (or config map) is created in the cluster, the autoscaler will mount it and use it to authenticate to Rancher.
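Apply it and confirm it exists before installing the chart (assuming the manifest above is saved as cloud-config-secret.yaml, a filename chosen here for illustration):

kubectl apply -f cloud-config-secret.yaml
kubectl -n kube-system get secret cluster-autoscaler-cloud-config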
If your Rancher server uses a self-signed certificate (common in lab setups), you should also provide the Rancher server’s CA certificate to the autoscaler, so it can trust the TLS connection. This can be done by creating a ConfigMap with the CA cert and mounting it in the autoscaler pod (or by adding the CA to the container’s trusted store). For simplicity, using a trusted SSL cert for Rancher or adding the cert to the autoscaler container is recommended to avoid TLS errors when the autoscaler connects to Rancher.
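If you do stick with a self-signed certificate, one possible approach (a sketch, assuming the autoscaler image picks up additional certificates from /etc/ssl/certs) is to publish the CA in a ConfigMap and mount it via the chart's extraVolumes/extraVolumeMounts values:

# ConfigMap created beforehand, e.g.:
#   kubectl -n kube-system create configmap rancher-ca --from-file=rancher-ca.crt=<path-to-ca.pem>
extraVolumes:
  - name: rancher-ca
    configMap:
      name: rancher-ca
extraVolumeMounts:
  - name: rancher-ca
    mountPath: /etc/ssl/certs/rancher-ca.crt
    subPath: rancher-ca.crt
    readOnly: true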
With credentials and cluster info ready, we can proceed to deploy the Cluster Autoscaler.
Installing Cluster Autoscaler via Rancher UI
Rancher’s Cluster Explorer UI makes it straightforward to deploy the Cluster Autoscaler using Helm charts:
- Add the Autoscaler Helm repository: In the Rancher UI, go to your RKE2 cluster and navigate to Apps & Marketplace (in older versions, "Apps" or "Catalog"). Click Repositories and add the official Kubernetes Autoscaler Helm repo:
  - Name: autoscaler (or any name you prefer)
  - URL: https://kubernetes.github.io/autoscaler
  This repository hosts the cluster-autoscaler Helm chart.
- Install the Cluster Autoscaler chart: Still in Apps & Marketplace, click Charts (or Launch from the repo) and find cluster-autoscaler. Install it into the kube-system namespace (a common practice for cluster-wide add-ons). In Rancher, select the System project and kube-system as the namespace.
- Configure chart values: The Helm chart requires certain values to be set for our RKE2 use case. The UI shows the default values, which you can override; switch to Edit as YAML for easier editing. Provide the following key overrides (merging them into the existing values):
  - autoDiscovery.clusterName: set this to your cluster's name (e.g. production-apps). This labels the autoscaler to auto-discover nodes belonging to the cluster's node group.
  - cloudProvider: set to "rancher" so that the autoscaler uses the Rancher provider logic.
  - cloudConfigPath: set to the path where the Rancher config will be mounted (e.g. /config/cloud-config). This path is inside the autoscaler container.
  - Mount the Rancher config secret: Use the chart's options to mount the secret we created. The official chart supports mounting extra secrets via extraVolumeSecrets. For example, you can add:
extraVolumeSecrets:
cluster-autoscaler-cloud-config:
name: cluster-autoscaler-cloud-config
mountPath: /config
This will mount our secret at /config in the container, and the file will be available as /config/cloud-config (since our secret key is cloud-config). Ensure cloudConfigPath matches this path (i.e. /config/cloud-config).
  - extraArgs: include any custom flags for the autoscaler. At minimum, you'll want to set:
    - --v=4 (or higher) for verbose logging (useful for troubleshooting).
    - --stderrthreshold=info and --logtostderr=true to send logs to pod stdout.
    - --balance-similar-node-groups=true to balance scale-out across similar pools.
    - --skip-nodes-with-system-pods=false if you want to allow scaling down nodes even if they run non-critical kube-system pods (by default the autoscaler skips such nodes; setting this to false means it will consider removing nodes running system pods that are replaceable, like DNS).
    - --skip-nodes-with-local-storage=false to allow removing nodes even if they have pods using local storage (the default of true skips those nodes). Be cautious: turning this off can evict pods with local storage, and their data may be lost unless the pods handle it.
    - Other tuning flags as needed (discussed in Performance Tuning below). For example, --scale-down-utilization-threshold=0.6 (60% utilization) and --scale-down-unneeded-time=10m (how long a node must be underutilized before removal) can be set as in our example.
  - nodeSelector and tolerations: configure the autoscaler to run on control-plane nodes. For RKE2, you can add:
nodeSelector:
node-role.kubernetes.io/control-plane: "true"
tolerations:
- key: "node-role.kubernetes.io/control-plane"
operator: "Exists"
effect: "NoSchedule"
- key: "node-role.kubernetes.io/etcd"
operator: "Exists"
effect: "NoExecute"
- Launch the application: After setting the values, deploy the chart. Rancher will install the cluster-autoscaler Deployment in your cluster. Verify that the cluster-autoscaler pod is running in the kube-system namespace. It should be scheduled on a master (control-plane) node (check with kubectl get pods -n kube-system -o wide to see the node).
At this point, the autoscaler is running but not yet actively managing any node pool until we configure the node pool scaling ranges (next section). The UI method conveniently uses the Helm chart's templates, which include the necessary RBAC (ServiceAccount and ClusterRole) for the autoscaler, so you typically don't need to apply those manually. If something went wrong, ensure the ServiceAccount cluster-autoscaler exists in kube-system and has the proper ClusterRole/Binding (the chart usually creates these).
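A quick way to verify the RBAC objects exist (resource names vary with the Helm release name, so grep rather than assuming exact names):

kubectl -n kube-system get serviceaccounts | grep -i autoscaler
kubectl get clusterroles,clusterrolebindings | grep -i autoscaler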
Installing Cluster Autoscaler via Helm CLI or Manifest
If you prefer using the command line or an Infrastructure-as-Code approach, you can deploy the autoscaler without Rancher’s UI:
- Using Helm CLI: First, ensure you have kubectl access to the cluster (e.g. via the kubeconfig from Rancher). Add the autoscaler Helm repo and install the chart:
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
-n kube-system --create-namespace \
--set autoDiscovery.clusterName=<your-cluster-name> \
--set cloudProvider=rancher \
--set cloudConfigPath=/config/cloud-config \
--set extraVolumeSecrets.cluster-autoscaler-cloud-config.mountPath=/config \
--set extraVolumeSecrets.cluster-autoscaler-cloud-config.name=cluster-autoscaler-cloud-config \
--set extraArgs.v=4,extraArgs.stderrthreshold=info,extraArgs.logtostderr=true \
--set extraArgs.balance-similar-node-groups=true,extraArgs.skip-nodes-with-system-pods=false,extraArgs.skip-nodes-with-local-storage=false \
--set extraArgs.scale-down-utilization-threshold=0.6,extraArgs.scale-down-unneeded-time=10m \
  --set-string nodeSelector."node-role\.kubernetes\.io/control-plane"=true \
--set tolerations[0].key="node-role.kubernetes.io/control-plane",tolerations[0].operator="Exists",tolerations[0].effect="NoSchedule" \
--set tolerations[1].key="node-role.kubernetes.io/etcd",tolerations[1].operator="Exists",tolerations[1].effect="NoExecute"
This long command adds the necessary overrides (it's equivalent to what we did in the UI). It assumes you already created the secret cluster-autoscaler-cloud-config in kube-system as described earlier. The extraVolumeSecrets values tell Helm to mount that secret. We also explicitly set various extraArgs and scheduling constraints via --set. In practice, you may prefer a values file for cleanliness; a sketch follows.
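The same overrides expressed as a values file (a sketch assuming the Secret created earlier; pass it with helm install cluster-autoscaler autoscaler/cluster-autoscaler -n kube-system -f values.yaml):

autoDiscovery:
  clusterName: <your-cluster-name>
cloudProvider: rancher
cloudConfigPath: /config/cloud-config
extraVolumeSecrets:
  cluster-autoscaler-cloud-config:
    name: cluster-autoscaler-cloud-config
    mountPath: /config
extraArgs:
  v: 4
  stderrthreshold: info
  logtostderr: true
  balance-similar-node-groups: true
  skip-nodes-with-system-pods: false
  skip-nodes-with-local-storage: false
  scale-down-utilization-threshold: "0.6"
  scale-down-unneeded-time: 10m
nodeSelector:
  node-role.kubernetes.io/control-plane: "true"
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
  - key: node-role.kubernetes.io/etcd
    operator: Exists
    effect: NoExecute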
- Using RKE2 HelmChart Manifest: RKE2 clusters come with a Helm Controller that can deploy charts based on custom resources. You can drop a HelmChart manifest into RKE2's manifest directory (/var/lib/rancher/rke2/server/manifests on a server node) or apply it with kubectl. Below is an example HelmChart custom resource for the Cluster Autoscaler:
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
name: cluster-autoscaler
namespace: kube-system
spec:
chart: cluster-autoscaler
repo: https://kubernetes.github.io/autoscaler
targetNamespace: kube-system
bootstrap: true
valuesContent: |-
autoDiscovery:
clusterName: production-apps
cloudProvider: rancher
extraArgs:
logtostderr: true
stderrthreshold: info
v: 4
cloud-config: /mnt/config.yaml
scale-down-utilization-threshold: "0.6"
scale-down-unneeded-time: "10m"
scale-down-unready-time: "20m"
balance-similar-node-groups: "true"
expander: "least-waste"
skip-nodes-with-local-storage: "false"
skip-nodes-with-system-pods: "false"
scale-down-non-empty-candidates-count: "60"
extraVolumeMounts:
- mountPath: /mnt/config.yaml
name: autoscaler-config
readOnly: true
subPath: config.yaml
extraVolumes:
- name: autoscaler-config
configMap:
name: autoscaler-config
nodeSelector:
node-role.kubernetes.io/control-plane: "true"
tolerations:
- key: "node-role.kubernetes.io/control-plane"
operator: "Exists"
effect: "NoSchedule"
- key: "node-role.kubernetes.io/etcd"
operator: "Exists"
effect: "NoExecute"
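Note that this manifest mounts a ConfigMap named autoscaler-config (with key config.yaml) rather than the Secret used earlier; if you follow this route, you need to create that ConfigMap yourself, along these lines:

apiVersion: v1
kind: ConfigMap
metadata:
  name: autoscaler-config
  namespace: kube-system
data:
  config.yaml: |-
    url: https://<your-rancher-server>
    token: <your-api-token>
    clusterName: production-apps
    clusterNamespace: fleet-default

Since this file holds an API token, a Secret mounted the same way is the safer choice in practice.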
Regardless of the method (UI, Helm CLI, or HelmChart manifest), after deployment you should have a running cluster-autoscaler pod. Confirm it is running, then proceed to enable autoscaling on your cluster's node pools.
Enabling Node Pool Autoscaling in Rancher
Deployment alone is not enough: we must tell the autoscaler which node pools it can scale and within what range. In Rancher, each RKE2 cluster has one or more machine pools (node pools). We enable autoscaling on a pool by annotating the cluster's machine pool configuration with a minimum and maximum size.
To do this via Rancher UI:
- Go to Cluster Management in Rancher and edit your RKE2 cluster (click ⋮ -> Edit Config, then switch to the YAML view).
- In the cluster YAML, locate the machinePools section. Identify the machine pool that you want the autoscaler to manage (e.g. your worker pool). Under that machine pool, add the following annotations:
machineDeploymentAnnotations:
cluster.provisioning.cattle.io/autoscaler-min-size: "1"
cluster.provisioning.cattle.io/autoscaler-max-size: "3"
Replace the values with the desired minimum and maximum node counts for that pool. For example, if you want the pool to scale down to no fewer than 1 node and up to 3 nodes, use "1" and "3" as above. Each pool can have different limits.
- Save the changes. Rancher will update the cluster’s configuration. The autoscaler will detect these annotations via the Rancher API and know it is allowed to scale that node group between the given bounds.
If you prefer kubectl, you can also patch the MachineDeployment or the Cluster custom resource with these annotations. The annotations actually live on the MachineDeployment object in the Rancher management (local) cluster, in the same namespace as the provisioning cluster (typically fleet-default); each MachineDeployment corresponds to a node pool. The Rancher autoscaler provider reads cluster.provisioning.cattle.io/autoscaler-min-size and cluster.provisioning.cattle.io/autoscaler-max-size to determine the limits. Any pool without these annotations will be ignored by the autoscaler (it won't scale it).
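As a rough sketch (run against the Rancher management/local cluster; the MachineDeployment name is a placeholder you look up first, and Rancher may reconcile direct edits away, so the cluster-YAML approach above is the more durable option):

# List the machine deployments backing your node pools
kubectl -n fleet-default get machinedeployments.cluster.x-k8s.io
# Annotate the worker pool's MachineDeployment with scaling bounds
kubectl -n fleet-default annotate machinedeployments.cluster.x-k8s.io <your-worker-md> \
  cluster.provisioning.cattle.io/autoscaler-min-size="1" \
  cluster.provisioning.cattle.io/autoscaler-max-size="3" --overwrite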
Only worker node pools should be scaled. Do not put these annotations on your control-plane pool. Master/etcd nodes are not meant to be auto-scaled by CA, and doing so could destabilize your cluster. Stick to worker pools (Rancher will typically name them or you can identify by the roles in the YAML).
At this stage, the Cluster Autoscaler is fully configured and ready to operate. It runs in the background, monitoring your cluster’s pods and node utilization.
Testing the Autoscaler
To verify that everything is working, you can simulate a workload that triggers scaling:
- Scale Up Test: Deploy a workload that requests more resources than currently available. For example, if you have 1 small worker node, run a deployment with a couple of pods each requesting significant CPU or memory (so that not all pods can schedule at once). For instance:
apiVersion: apps/v1
kind: Deployment
metadata:
name: stress-test
spec:
replicas: 2
selector:
matchLabels:
app: stress-test
template:
metadata:
labels:
app: stress-test
spec:
containers:
- name: cpu-hog
image: busybox
command: ["sh", "-c", "yes > /dev/null"]
resources:
requests:
cpu: "2" # request 2 CPUs
limits:
cpu: "2"
If your worker node cannot fit both 2-CPU pods, at least one pod will remain Pending. Within roughly 10–30 seconds, the autoscaler should detect the unschedulable pod and increment the node pool size. In the Rancher UI you'll see a new node provisioning, and in kubectl get nodes a new node will appear once it is ready. The pending pod will then schedule onto the new node.
- Scale Down Test: After the new node is added and pods are running, reduce the load. You can delete the deployment or scale it down to 0 replicas. The autoscaler will observe that a node is now underutilized (perhaps completely empty) for a period of time (by default, 10 minutes). After that time, it should remove the extra node. Watch the Rancher UI or kubectl get nodes for one node to go away. The autoscaler will respect the autoscaler-min-size annotation (so it won't go below 1 node in our example). Any pods on the removed node are rescheduled onto the remaining nodes automatically.
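To drive the scale-down test, a couple of commands along these lines work (the deployment name matches the stress-test example above):

# Remove the load; the extra node should become empty
kubectl scale deployment stress-test --replicas=0
# After the scale-down-unneeded-time window (10 minutes by default), watch for the node to disappear
kubectl get nodes -w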
During tests, you can observe the autoscaler’s decisions by checking its logs:
kubectl -n kube-system logs -f deployment/cluster-autoscaler
The logs will indicate why it’s scaling up or down, which pool it chose, or any blockers. For example, you might see messages like “Scaled up node group X to 2” or reasons for not scaling down a node. Monitoring the logs is a great way to troubleshoot issues or confirm behavior.
Benefits of Using Cluster Autoscaler in RKE2
Using the Cluster Autoscaler with Rancher RKE2 offers several benefits:
- Automatic Right-Sizing: The autoscaler ensures your cluster always has the “right” amount of resources. It adds nodes when demand increases (ensuring critical applications aren’t starved for capacity) and removes nodes when demand drops (so you don’t pay for idle VMs). This dynamic adjustment can greatly improve cluster efficiency and uptime.
- Cost Efficiency: Especially in cloud environments, automatically removing underutilized nodes saves costs during off-peak times. Conversely, adding nodes on-demand prevents over-provisioning resources for peak capacity that might rarely be used. The combination leads to an elastic, cost-optimized infrastructure.
- Improved Resilience and User Experience: By scaling out when needed, autoscaling can handle sudden spikes in workload without manual intervention. Your users experience fewer slowdowns or failures due to resource exhaustion. The cluster can react faster than human operators in many cases, adding capacity within minutes of detecting a need.
- Synergy with Pod Autoscalers: When used alongside Horizontal Pod Autoscalers (HPA) or other scaling mechanisms, the Cluster Autoscaler completes the picture for full-stack scalability. For example, HPA might increase the replicas of a deployment, and if those new pods can't fit on existing nodes, the Cluster Autoscaler will kick in to provide new nodes for them. This synergy allows truly hands-off scaling for your applications (a minimal HPA sketch follows this list).
- Rancher Integration: In an RKE2 cluster managed by Rancher, the autoscaler leverages Rancher’s robust provisioning system. This means you can use it across multiple infrastructure providers (any supported node driver) with a consistent experience. Rancher handles the low-level provisioning (creating VMs on AWS, vSphere, etc.), while CA simply requests more or fewer nodes. The integration abstracts away cloud-specific autoscaling (no need to manually manage AWS Auto Scaling Groups, for instance, when using Rancher’s provider).
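To make that HPA/CA hand-off concrete, here is a minimal HorizontalPodAutoscaler (autoscaling/v2 API) targeting the stress-test Deployment used in the testing section; the thresholds and replica bounds are arbitrary illustrations:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: stress-test
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: stress-test
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

If the HPA raises replicas beyond what existing nodes can hold, the new pods go Pending and the Cluster Autoscaler adds nodes to fit them.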
Overall, cluster autoscaling greatly enhances cluster health by adding nodes when they are needed, and it is straightforward to set up on RKE2 clusters. It reduces the need for manual capacity planning and can provide a more resilient, cost-effective environment.
Limitations and Considerations
Despite its advantages, it’s important to understand the limitations and nuances of Cluster Autoscaler in a Rancher RKE2 context:
- Not based on actual utilization: The autoscaler makes decisions based on Kubernetes scheduler state (pending pods and requests), not live CPU/memory usage metrics. For example, if you have a node running at 10% CPU but all pods on it have large resource requests reserved, the autoscaler may consider that node fully utilized (because from the scheduler’s perspective those resources are “claimed”). This can lead to less aggressive downscaling in environments where pods over-request resources.
- Scale-down is cautious: By default, CA won't remove a node if doing so would violate certain safety checks. It avoids scaling down nodes that have:
  - Pods that cannot be moved (e.g. pods using a PersistentVolume local to the node, or pods with local storage such as emptyDir, if skip-nodes-with-local-storage=true).
  - System pods in kube-system that are not managed by a DaemonSet (unless you set --skip-nodes-with-system-pods=false as we did).
  - Recently started pods or nodes (new nodes get a stabilization period), and it respects Pod Disruption Budgets.
  These safeguards mean scale-down might not happen even when a node looks underutilized, if any unmovable pods are present. In Rancher RKE2, core components (cattle agents, etc.) typically run on master nodes or as DaemonSets on workers, so the autoscaler can usually move what it needs to. But if you run custom system pods or use local storage volumes, be aware of this limitation.
- Delay in provisioning: When a scale-up is triggered, the autoscaler requests Rancher to add a node. The reaction time includes both the autoscaler’s scan interval (up to 10 seconds by default) and the time for Rancher and the cloud provider to actually create the VM, install RKE2, and join it to the cluster. This can take a few minutes depending on your cloud/infrastructure. During this time, pending pods remain unscheduled. So, autoscaler is not instantaneous – there will be a short period where workload demand outpaces supply until the new node is ready. Plan for this in latency-sensitive scenarios (you might over-provision a bit or use faster node provisioning if possible).
- Requires Rancher-provisioned nodes: As mentioned, the Rancher cloud provider integration only works if Rancher can create/delete nodes. If your RKE2 cluster was created with custom nodes (e.g. you manually installed RKE2 on some servers and imported to Rancher), the autoscaler cannot add more because there’s no node driver to call. Similarly, if using an unsupported node driver or cloud, it may not work. Check that your environment is supported (most major cloud providers and vSphere via Rancher node drivers are supported).
- No control-plane scaling: The Cluster Autoscaler (and Rancher) will not automatically scale your control-plane/etcd nodes. High-availability masters must be planned and added manually. Autoscaler focuses on worker nodes only. Ensure your control plane has enough capacity to handle increased load (API, controller, etc.) when workers scale up.
- Potential race conditions or busy-loop: In edge cases, misconfiguration can cause autoscaler to behave unexpectedly. For example, if min and max sizes are mis-set or if there’s a constant stream of small pods landing and leaving, the cluster could oscillate. The autoscaler does have built-in backoff timers to prevent constant churn, but it’s good to monitor behavior in real scenarios.
- Rancher API Access and Permissions: Using an admin token is straightforward, but in regulated environments you might want to restrict the token. Rancher has a concept of restricted roles for autoscaler (so you don’t have to use a full admin token). Implementing that is more complex and beyond our scope here, but be aware that the autoscaler having broad API access is a potential risk. If the token lacks some permissions, the autoscaler might fail to operate fully (for instance, missing rights to list or update certain Kubernetes API objects, as noted in some Rancher autoscaler docs).
By understanding these considerations, you can better plan your cluster capacity and know what to expect from the autoscaler’s behavior.
Troubleshooting Tips
Setting up the Cluster Autoscaler can involve multiple components (Rancher, cloud provider, Kubernetes). If things aren’t working as expected, here are some troubleshooting tips:
- Check the Autoscaler Pod Status: Make sure the cluster-autoscaler pod is running (kubectl -n kube-system get pods). If it's CrashLooping or in Error, describe the pod to see why. Common issues include missing volume mounts (e.g. if the secret or config map for cloud-config wasn't created or named correctly, the pod may fail to start because the file is missing) or lack of permissions.
- View Logs for Clues: The autoscaler's logs are very informative. Look for lines indicating it recognized your node group, and for any errors. For example, if misconfigured, you might see errors about failing to contact Rancher or missing credentials. If the autoscaler is running but not scaling up when you expect, the logs might say something like “No unschedulable pods” or that the max node limit was reached for a node group; this tells you whether it sees the pending pods and what decision it made. Continuously watching the logs is a good way to verify it's actively monitoring the cluster.
- Verify Rancher Config: If scaling isn't happening, double-check the content of your cloud-config. Is the URL correct (including the /v3 path if needed, depending on Rancher version)? Is the token valid and not expired? Do clusterName and clusterNamespace exactly match the cluster? (They are case-sensitive and must match the Rancher cluster's name and namespace as seen in Rancher's cluster list or the provisioning CR.) If these are wrong, the autoscaler may silently do nothing or log authentication errors.
- Ensure Annotations are in Place: A very common oversight is forgetting to put the autoscaler-min-size/max-size annotations on the machine pool. Without those, the autoscaler will not consider any node group for scaling (it treats every pool as fixed, with size == min == max). Confirm via the Rancher UI (cluster YAML) or via kubectl get clusters.provisioning.cattle.io <cluster-name> -n fleet-default -o yaml (against the Rancher management cluster) that your worker pool has those annotations set. Also ensure the values make sense (min < max, etc.). If you update the annotations, the change should be picked up on the next autoscaler loop.
- RBAC and Permissions: If you deployed the autoscaler via Helm, it should have created a ClusterRole and bound it to the autoscaler ServiceAccount. If you applied manifests manually, ensure you also applied the provided RBAC manifest (as in the official docs or examples). The autoscaler needs permissions to list nodes, pods, etc., and to create events. If it can't, it will log errors about missing permissions. Rancher's cloud provider may also require permissions on Rancher-related API groups (like provisioning.cattle.io and cluster.x-k8s.io). If you see errors referencing those, you might need to adjust RBAC (the Rancher docs or community guides provide the needed rules).
- Connectivity and CA Certs: If the autoscaler cannot connect to the Rancher server, it won't scale. In an on-prem setup, ensure the Rancher URL is reachable from the cluster (if Rancher is private, the cluster nodes need network access to it). Also, if using a self-signed Rancher cert, ensure the autoscaler has the CA trust configured as noted earlier. Look for any TLS handshake or certificate errors in the logs.
- Node Provisioning Issues: Sometimes the autoscaler does its job, but the cloud side fails. For example, autoscaler requests a new node from Rancher, Rancher tries to create a VM but hits a quota or configuration issue. In such cases, you’d see Rancher UI showing an error provisioning a node. This isn’t directly an autoscaler fault but will prevent scale-up. Monitor the Rancher Cluster Events or the Rancher UI for errors creating new nodes if nothing appears after autoscaler says it tried to scale.
- Tune Logging and Verbosity: If you need deeper insight, increase the -v flag (verbosity) to 5 or more for additional debug information. Just be cautious: very high verbosity can generate a lot of log output.
- Simulate Conditions: If scale-down isn't happening, try to identify what pod or resource might be blocking it. For instance, run kubectl drain --ignore-daemonsets <node> with --dry-run on a node to see if Kubernetes reports any pods that cannot be evicted. This can reveal whether a certain pod or volume is preventing node removal, which might hint at a flag to adjust (like skip-nodes-with-local-storage) or simply show that those pods need to run somewhere at all times.
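A concrete form of that check (the node name is a placeholder; --dry-run=client reports blockers without evicting anything):

kubectl drain <node-name> --ignore-daemonsets --dry-run=client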
By following these tips, you can usually pinpoint why the autoscaler isn’t behaving as expected. In many cases, it’s a configuration detail or a safety mechanism that can be adjusted once understood.
Performance Tuning and Advanced Settings
The default settings of the Cluster Autoscaler are conservative to fit general use cases, but you may want to tune them for your environment. Here are some key parameters (configured via extraArgs on the deployment) and how to use them:
- Scale-Down Delay and Utilization Threshold: --scale-down-unneeded-time and --scale-down-utilization-threshold control how aggressively nodes are removed. By setting scale-down-unneeded-time=10m, we specify that a node must be underutilized for 10 minutes before being considered for removal (10 minutes is also the default, which we kept in our example). A utilization threshold of 0.6 means a node whose usage is below 60% (based on requested resources) and whose pods can fit elsewhere is eligible for removal (the default is roughly 50%). Raising this threshold makes the autoscaler more eager to remove nodes (even nodes that are, say, 50% utilized), while lowering it makes it more cautious.
- Scale-Down Candidates Count: --scale-down-non-empty-candidates-count (we set "60" in our example) adjusts how many non-empty nodes can be considered simultaneously as scale-down candidates. In very large clusters, you might increase this so the autoscaler evaluates more nodes at once for potential removal. The default is smaller (around 30); we raised it to 60 for large-cluster scenarios.
- Expander Strategy: The autoscaler's expander setting decides which node group to scale when multiple groups could fit a pending pod. We used --expander=least-waste, which chooses the node group that leaves the least idle resources after scheduling the pending pods (often a good default for cost efficiency). Other options include most-pods (chooses the group that would accommodate the most pods) and random. If you have multiple node types (e.g. some pools with GPU nodes, some with high-memory nodes), consider how you want the autoscaler to pick between them. There is also price, which chooses the cheapest option if you supply node costs.
- Balancing Similar Node Groups: We enabled --balance-similar-node-groups=true. This helps if you have identically configured node pools (for example, two pools in different availability zones). The autoscaler will try to keep their sizes balanced when scaling up and down, which prevents one pool from growing large while another stays small; that matters for availability across AZs or simply for uniform usage.
- Skipping vs. Considering Nodes for Scale-Down: We deliberately set --skip-nodes-with-system-pods=false and --skip-nodes-with-local-storage=false in our configuration to allow maximum flexibility in scaling down. By default, CA would skip any node that has non-DaemonSet kube-system pods (like a DNS or metrics-server pod) or any pod with local storage. By turning these to false, we told CA to consider removing such nodes anyway. This can improve downscaling in clusters where every node always runs some system pod (which is often true). The risk is that those pods will be terminated; in most cases Deployments like CoreDNS simply reschedule on other nodes, so it's fine. But use caution: anything stateful with local storage could lose data. Tune these flags based on how your workloads use local storage and how your system pods are deployed. If unsure, leave them at true (skip) to be safe, at the cost of possibly keeping an extra node around.
- Scale-Up Behavior: Though not set explicitly above, there are also flags like --max-node-provision-time (how long to wait for the cloud to provision a node before giving up), and the autoscaler supports scaling a node group up from 0 nodes when the pool is properly annotated. When scaling from zero, make sure at least one pending pod explicitly requires that node group, so CA knows to bring it up from 0.
- Cluster Scale Limits: The autoscaler-min-size/max-size annotations on the node pool ultimately cap the scaling. If the autoscaler stops scaling further, check whether it hit the max. Conversely, if it scaled down to the min and you expected all nodes to be removed, remember it will not go below the configured minimum. Adjust these as your usage patterns change.
- Version Compatibility: Use an autoscaler version that matches your Kubernetes version. For example, if you run Kubernetes 1.27, use a Cluster Autoscaler image tagged 1.27.*. The Helm chart often picks an image tag based on your cluster version automatically, but verify it; mismatched versions can cause subtle issues.
In our example values, we included many of these tunings. You can see how they appear as extraArgs in the manifest, e.g. --cloud-provider=rancher, --cloud-config=/mnt/config.yaml, verbosity -v=4, and the various scale-down flags. Adjusting these tailors the autoscaler's behavior to your requirements. For instance, in a dev/test cluster you might want faster scale-down to save cost (a shorter unneeded-time), whereas in production you might keep it a bit longer to avoid thrashing nodes during short traffic dips.
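As an illustration, a dev/test cluster might tighten scale-down via the chart's extraArgs (the flag names are real Cluster Autoscaler options; the values are arbitrary examples, not recommendations):

extraArgs:
  scale-down-unneeded-time: 3m             # remove idle nodes sooner than the 10m default
  scale-down-delay-after-add: 5m           # shorten the post-scale-up cooldown as well
  scale-down-utilization-threshold: "0.7"  # treat nodes under 70% requested utilization as removable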
Always monitor the impact of any tuning in a non-prod environment if possible. The defaults are usually safe; only tune when you have a clear need and understanding of the parameter.
Conclusion
The Cluster Autoscaler is a powerful addition to any Kubernetes cluster that experiences variable workloads. In Rancher RKE2 clusters, it brings cloud-like elasticity by leveraging Rancher’s provisioning APIs. We covered how to deploy it via Rancher’s UI for an easy point-and-click setup, as well as via Helm or manifests for those who prefer automation and GitOps. We also walked through configuring the necessary Rancher API access, enabling autoscaling on node pools, and testing the behavior.
By automatically scaling nodes up when resources are tight and scaling down when there’s excess capacity, the autoscaler helps ensure your cluster is always right-sized. This leads to better resilience (pods get the capacity they need) and cost savings (unused nodes don’t stick around for long). As with any automation, it’s important to understand its parameters and limits – we discussed common gotchas like delayed provisioning and the need to properly annotate node pools.
With the Cluster Autoscaler on RKE2, your Kubernetes cluster can truly grow and shrink on demand, hands-free. This not only reduces the operational burden on DevOps teams but also optimizes infrastructure usage over time. Many users find it fast and easy to set up on RKE2 clusters, and it greatly enhances the cluster's ability to self-manage. By following the best practices and tips outlined above, you can confidently implement autoscaling in your Rancher-managed Kubernetes environment and reap the benefits of a dynamic, automated infrastructure.