Fine-Grained Control with Configurable HPA Tolerance
Kubernetes v1.33 introduces a long-awaited enhancement to the Horizontal Pod Autoscaler (HPA): configurable tolerance values. Previously, all HPAs across a cluster used a globally set tolerance of 10% to avoid flapping and limit unnecessary scaling. With this new feature, you can fine-tune scaling sensitivity per workload, giving you more control over responsiveness and resource efficiency.
The Problem with One-Size-Fits-All Tolerance
The HPA works by comparing the current usage of a metric (like CPU) against a desired target. It calculates the number of replicas required by applying the usage ratio to the current replica count. However, to avoid constant scaling due to minor fluctuations, Kubernetes uses a tolerance value—a buffer zone where no scaling occurs if the metric ratio is close to 1.
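As a rough illustration, here is a minimal Go sketch of that logic. The desiredReplicas helper and the sample numbers are illustrative only; the real controller additionally handles pod readiness, missing metrics, and stabilization windows:

package main

import (
    "fmt"
    "math"
)

// desiredReplicas sketches the core HPA calculation: scale the current
// replica count by the usage ratio (current metric / target metric),
// unless that ratio falls inside the tolerance band around 1.0.
func desiredReplicas(current int, usage, target, tolerance float64) int {
    ratio := usage / target
    if math.Abs(1.0-ratio) <= tolerance {
        return current // inside the buffer zone: no scaling
    }
    return int(math.Ceil(float64(current) * ratio))
}

func main() {
    // 1000 replicas, 75% CPU target, default 10% tolerance:
    fmt.Println(desiredReplicas(1000, 0.82, 0.75, 0.10)) // 1000 (ratio ~1.09, suppressed)
    fmt.Println(desiredReplicas(1000, 0.90, 0.75, 0.10)) // 1200 (ratio 1.20, scales up)
}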
Until now, this tolerance was fixed at the cluster level via the kube-controller-manager's --horizontal-pod-autoscaler-tolerance flag, which defaults to 10%. While that works for smaller workloads, it becomes problematic when you're running large-scale applications:
- A 10% tolerance on a 1000-replica deployment can block scaling even when hundreds of pods are under strain: with a 75% CPU target, usage can climb to roughly 82% (a ratio of 1.1) before the HPA reacts, silently absorbing about 100 replicas' worth of extra demand.
- It delays scale-up during sudden load increases, reducing responsiveness.
- Different workloads often require different sensitivities for scale-up vs. scale-down.
The lack of granularity has been a recurring complaint in GitHub issues and user feedback.
What’s New in Kubernetes v1.33
Starting with Kubernetes 1.33, users can define tolerance values per HPA, separately for scale-up and scale-down. This change is backward-compatible and does not affect any HPAs unless the new field is explicitly set.
✅ Key Changes:
- A new optional field, tolerance, has been added under spec.behavior.scaleUp and spec.behavior.scaleDown in the HPA v2 API.
- The default cluster-wide tolerance still applies if no per-HPA tolerance is set.
- You can now:
  - Make your workloads scale faster by lowering the tolerance
  - Avoid unnecessary scale-downs by increasing it
Example: Reacting Faster to Spikes
Let’s say you have a workload running 50 replicas with CPU target utilization set at 75%. The actual usage spikes to 90%. Normally, the HPA would scale up to:
desiredReplicas = ceil(50 × 90 / 75) = 60
Here the usage ratio is 90 / 75 = 1.2, well outside the default 10% tolerance band (0.9 to 1.1), so scaling proceeds. A smaller spike to, say, 80% (a ratio of about 1.07) falls inside that band, however, and no action is taken.
If you want more responsiveness, you can configure a 5% tolerance on scale-up. That way, smaller increases in CPU load, including the 80% spike above, will trigger a scaling decision:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 100
  # metrics omitted for brevity; if unset, it defaults to 80% average CPU utilization
  behavior:
    scaleUp:
      tolerance: 0.05  # 5% scale-up tolerance instead of the 10% default
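To see the tighter band in action, here is a small standalone Go check. The values mirror the example above, and the decision rule is an approximation of the controller's behavior, not its actual source:

package main

import (
    "fmt"
    "math"
)

func main() {
    target := 0.75 // CPU target utilization from the example
    for _, usage := range []float64{0.80, 0.90} {
        ratio := usage / target
        for _, tolerance := range []float64{0.10, 0.05} {
            scales := math.Abs(1.0-ratio) > tolerance
            fmt.Printf("usage=%.0f%%, tolerance=%.0f%%: scale-up=%v\n",
                usage*100, tolerance*100, scales)
        }
    }
}

A spike to 80% (ratio ≈ 1.07) is ignored under the default 10% tolerance but triggers scale-up at 5%; the larger spike to 90% scales in both cases.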
You can also use a different value for scale-down; a larger tolerance there helps avoid premature scale-downs of workloads that are slow or expensive to restart:
behavior:
scaleUp:
tolerance: 0.05
scaleDown:
tolerance: 0.15
How It Works Internally
Under the hood, this enhancement doesn’t alter the scaling algorithm. It only overrides the default tolerance used when comparing the usage ratio against 1.0:
if abs(1.0 - usageRatio) <= tolerance {
// skip scaling
}
When the feature is enabled, the HPA controller checks the tolerance field in the respective scaling rule (scaleUp or scaleDown) and uses it instead of the global value.
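A minimal sketch of that selection, using a hypothetical effectiveTolerance helper rather than the real controller internals:

package main

import "fmt"

// effectiveTolerance picks the tolerance for one scaling direction.
// perHPA is the optional tolerance from spec.behavior.scaleUp (or
// scaleDown); nil means the field is unset on this HPA. global is the
// cluster-wide default (--horizontal-pod-autoscaler-tolerance, 0.1).
func effectiveTolerance(perHPA *float64, global float64) float64 {
    if perHPA != nil {
        return *perHPA // per-HPA override wins
    }
    return global // fall back to the cluster-wide value
}

func main() {
    scaleUpTol := 0.05
    fmt.Println(effectiveTolerance(&scaleUpTol, 0.10)) // 0.05
    fmt.Println(effectiveTolerance(nil, 0.10))         // 0.1
}

Because the lookup happens independently for scale-up and scale-down, asymmetric tolerances like the earlier example fall out naturally.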
How to Enable It
This feature is currently in alpha and requires the HPAConfigurableTolerance feature gate to be enabled in both the kube-apiserver and kube-controller-manager:
--feature-gates=HPAConfigurableTolerance=true
The feature is only available in the autoscaling/v2 API version and is ignored in earlier versions.
Rollback & Compatibility
- If the feature is disabled, any configured tolerance field will be ignored.
- Existing HPAs will continue to function as they did before, using the global default.
- Downgrading is safe: workloads will revert to using the cluster-wide tolerance.
Why This Matters
Scaling behavior is often workload-specific. ML inference servers, web frontends, and CI runners all have different performance and initialization profiles. With this update:
- You gain per-resource control over sensitivity to metric fluctuations.
- You avoid unnecessary scaling churn on large workloads.
- You can react faster to traffic spikes by setting a lower tolerance on scale-up.
This is a small change in terms of API surface, but it offers much-needed flexibility in real-world autoscaling behavior.
Final Thoughts
With v1.33, Kubernetes continues to evolve toward more intelligent and adaptive autoscaling. Configurable tolerance empowers operators to tune workload behavior without cluster-wide compromises.
Start experimenting with this feature if you're managing large-scale deployments, latency-sensitive applications, or simply want better control over your autoscaling logic.