Fine-Grained Control with Configurable HPA Tolerance
Kubernetes v1.33 introduces a long-awaited enhancement to the Horizontal Pod Autoscaler (HPA): configurable tolerance values. Previously, all HPAs across a cluster used a globally set tolerance of 10% to avoid flapping and limit unnecessary scaling. With this new feature, you can fine-tune scaling sensitivity per workload, giving you more control over responsiveness and resource efficiency.
The Problem with One-Size-Fits-All Tolerance
The HPA works by comparing the current usage of a metric (like CPU) against a desired target. It calculates the number of replicas required by applying the usage ratio to the current replica count. However, to avoid constant scaling due to minor fluctuations, Kubernetes uses a tolerance value—a buffer zone where no scaling occurs if the metric ratio is close to 1.
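As a rough illustration, here is a minimal Go sketch of that logic. The desiredReplicas helper and the sample numbers are illustrative only; the real controller additionally handles pod readiness, missing metrics, and stabilization windows:

package main

import (
    "fmt"
    "math"
)

// desiredReplicas sketches the core HPA calculation: scale the current
// replica count by the usage ratio (current metric / target metric),
// unless that ratio falls inside the tolerance band around 1.0.
func desiredReplicas(current int, usage, target, tolerance float64) int {
    ratio := usage / target
    if math.Abs(1.0-ratio) <= tolerance {
        return current // inside the buffer zone: no scaling
    }
    return int(math.Ceil(float64(current) * ratio))
}

func main() {
    // 1000 replicas, 75% CPU target, default 10% tolerance:
    fmt.Println(desiredReplicas(1000, 0.82, 0.75, 0.10)) // 1000 (ratio ~1.09, suppressed)
    fmt.Println(desiredReplicas(1000, 0.90, 0.75, 0.10)) // 1200 (ratio 1.20, scales up)
}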
Until now, this tolerance was fixed at the cluster level via the kube-controller-manager's --horizontal-pod-autoscaler-tolerance flag, which defaults to 10%. While that works for smaller workloads, it becomes problematic when you're running large-scale applications:
- A 10% tolerance on a 1000-replica deployment can block scaling even when hundreds of pods are under strain: with a 75% CPU target, usage can climb to roughly 82% (a ratio of 1.1) before the HPA reacts, silently absorbing about 100 replicas' worth of extra demand.
- It delays scale-up during sudden load increases, reducing responsiveness.
- Different workloads often require different sensitivities for scale-up vs. scale-down.
The lack of granularity has been a recurring complaint in GitHub issues and user feedback.
What’s New in Kubernetes v1.33
Starting with Kubernetes 1.33, users can define tolerance values per HPA, separately for scale-up and scale-down. This change is backward-compatible and does not affect any HPAs unless the new field is explicitly set.
✅ Key Changes:
- A new optional field, tolerance, has been added under spec.behavior.scaleUp and spec.behavior.scaleDown in the HPA v2 API.
- The default cluster-wide tolerance still applies if no per-HPA tolerance is set.
- You can now:
  - Make your workloads scale faster by lowering the tolerance
  - Avoid unnecessary scale-downs by increasing it
Example: Reacting Faster to Spikes
Let’s say you have a workload running 50 replicas with CPU target utilization set at 75%. The actual usage spikes to 90%. Normally, the HPA would scale up to:
desiredReplicas = ceil(50 × 90 / 75) = 60
Here the usage ratio is 90 / 75 = 1.2, well outside the default 10% tolerance band (0.9 to 1.1), so scaling proceeds. A smaller spike to, say, 80% (a ratio of about 1.07) falls inside that band, however, and no action is taken.
If you want more responsiveness, you can configure a 5% tolerance on scale-up. That way, smaller increases in CPU load, including the 80% spike above, will trigger a scaling decision:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 100
  # metrics omitted for brevity; if unset, it defaults to 80% average CPU utilization
  behavior:
    scaleUp:
      tolerance: 0.05  # 5% scale-up tolerance instead of the 10% default
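To see the tighter band in action, here is a small standalone Go check. The values mirror the example above, and the decision rule is an approximation of the controller's behavior, not its actual source:

package main

import (
    "fmt"
    "math"
)

func main() {
    target := 0.75 // CPU target utilization from the example
    for _, usage := range []float64{0.80, 0.90} {
        ratio := usage / target
        for _, tolerance := range []float64{0.10, 0.05} {
            scales := math.Abs(1.0-ratio) > tolerance
            fmt.Printf("usage=%.0f%%, tolerance=%.0f%%: scale-up=%v\n",
                usage*100, tolerance*100, scales)
        }
    }
}

A spike to 80% (ratio ≈ 1.07) is ignored under the default 10% tolerance but triggers scale-up at 5%; the larger spike to 90% scales in both cases.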
You can also use a different value for scale-down; a larger tolerance there helps avoid premature scale-downs of workloads that are slow or expensive to restart:
behavior:
scaleUp:
tolerance: 0.05
scaleDown:
tolerance: 0.15
How It Works Internally
Under the hood, this enhancement doesn’t alter the scaling algorithm. It only overrides the default tolerance used when comparing the usage ratio against 1.0:
if abs(1.0 - usageRatio) <= tolerance {
// skip scaling
}
When the feature is enabled, the HPA controller checks the tolerance field in the respective scaling rule (scaleUp or scaleDown) and uses it instead of the global value.
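A minimal sketch of that selection, using a hypothetical effectiveTolerance helper rather than the real controller internals:

package main

import "fmt"

// effectiveTolerance picks the tolerance for one scaling direction.
// perHPA is the optional tolerance from spec.behavior.scaleUp (or
// scaleDown); nil means the field is unset on this HPA. global is the
// cluster-wide default (--horizontal-pod-autoscaler-tolerance, 0.1).
func effectiveTolerance(perHPA *float64, global float64) float64 {
    if perHPA != nil {
        return *perHPA // per-HPA override wins
    }
    return global // fall back to the cluster-wide value
}

func main() {
    scaleUpTol := 0.05
    fmt.Println(effectiveTolerance(&scaleUpTol, 0.10)) // 0.05
    fmt.Println(effectiveTolerance(nil, 0.10))         // 0.1
}

Because the lookup happens independently for scale-up and scale-down, asymmetric tolerances like the earlier example fall out naturally.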
How to Enable It
This feature is currently in alpha and requires the HPAConfigurableTolerance feature gate to be enabled in both the kube-apiserver and kube-controller-manager:
--feature-gates=HPAConfigurableTolerance=true
The feature is only available in the autoscaling/v2 API version and is ignored in earlier versions.
Rollback & Compatibility
- If the feature is disabled, any configured tolerance field will be ignored.
- Existing HPAs will continue to function as they did before, using the global default.
- Downgrading is safe: workloads will revert to using the cluster-wide tolerance.
Why This Matters
Scaling behavior is often workload-specific. ML inference servers, web frontends, and CI runners all have different performance and initialization profiles. With this update:
- You gain per-resource control over sensitivity to metric fluctuations.
- You avoid unnecessary scaling churn on large workloads.
- You can react faster to traffic spikes by setting a lower tolerance on scale-up.
This is a small change in terms of API surface, but it offers much-needed flexibility in real-world autoscaling behavior.
Final Thoughts
With v1.33, Kubernetes continues to evolve toward more intelligent and adaptive autoscaling. Configurable tolerance empowers operators to tune workload behavior without cluster-wide compromises.
Start experimenting with this feature if you're managing large-scale deployments, latency-sensitive applications, or simply want better control over your autoscaling logic.