Keeping a Raspberry Pi Cluster Cool
Why temperature thresholds weren't enough, and how topology-aware automation improved thermal management in my cluster.
I originally built a Home Assistant automation that would shut down Raspberry Pi nodes when they got too hot.
At first, it seemed like a reasonable solution. If a node crosses a temperature threshold, turn it off and let it cool down.
The problem was that the automation had no understanding of why the node was hot in the first place.
That realization eventually led me to redesign thermal management across my Raspberry Pi Kubernetes cluster and turn what started as a simple shutdown rule into a topology-aware operations controller.
The Cluster
My home lab runs a small Kubernetes cluster built on Raspberry Pis. It hosts a mix of workloads including Jellyfin, qBittorrent, Sonarr, Radarr, Longhorn, monitoring, and several automation services.
One of those services is qbittorrent-smart-queues.
Originally, Smart Queues existed solely to manage download prioritization and media automation. Over time, it became clear that it was also in the best position to make workload decisions when the cluster was under thermal pressure.
qBittorrent is one of the few workloads capable of generating sustained disk and network activity. If temperatures start rising across the cluster, download activity is often one of the easiest places to reduce load without affecting user-facing services.
The challenge was deciding when reducing qBittorrent activity actually made sense.
The First Design Was Wrong
My original thinking was simple:
If a Raspberry Pi gets hot, throttle qBittorrent.
The logic sounds reasonable until storage enters the picture.
In Kubernetes, the node running the workload that generates I/O and the node that actually services that I/O are not always the same.
At one point in my cluster:
- qBittorrent was running on rpi3
- the download PVC was backed by Longhorn
- the Longhorn share-manager was running on rpi1
- the active replica was on rpi2
When rpi1 became the hottest node, reducing qBittorrent activity helped because download traffic was still flowing through the share-manager on that node, even though qBittorrent itself was running on rpi3.
The problem was that my controller did not understand any of that.
It simply reacted to the hottest node.
In this particular case, it accidentally made the correct decision.
A controller that accidentally does the right thing is still fundamentally broken.
Why Shutdown Became a Problem
The original Home Assistant automation used shutdown as the primary mitigation strategy.
While effective from a thermal perspective, it created operational problems:
- workloads disappeared until rescheduled
- Longhorn placement changed unexpectedly
- load shifted to already warm nodes
- recovery often took longer than the thermal event itself
- controller state had to survive node loss
I also experimented with cordon and drain operations.
Those approaches were less destructive than shutdown, but still introduced unnecessary disruption for a small cluster where storage and workloads are tightly coupled.
Eventually I stopped thinking about thermal management as a node problem and started thinking about it as a workload problem.
The goal changed from:
Shut down the hot node.
to:
Reduce heat with the least amount of service disruption.
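One way to picture that goal is as an escalation ladder ordered by disruption. The sketch below is illustrative only: the rung names, their exact ordering, and the `next_mitigation` helper are assumptions made for this example, not the controller's actual code, and a real controller may skip some rungs entirely.

```python
from enum import IntEnum


class Mitigation(IntEnum):
    """Hypothetical mitigation rungs, ordered from least to most disruptive."""
    NONE = 0
    THROTTLE_DOWNLOADS = 1
    PAUSE_DOWNLOADS = 2
    CORDON_NODE = 3
    DRAIN_NODE = 4
    SHUTDOWN_NODE = 5


def next_mitigation(current: Mitigation, still_hot: bool) -> Mitigation:
    """Escalate one rung at a time, and only while the node is still hot."""
    if not still_hot:
        # Temperatures recovered: drop back to no mitigation.
        return Mitigation.NONE
    # Never skip rungs, and never go past the most disruptive option.
    return Mitigation(min(current + 1, Mitigation.SHUTDOWN_NODE))


step = next_mitigation(Mitigation.NONE, still_hot=True)
print(step.name)  # THROTTLE_DOWNLOADS
```

Escalating a single rung per evaluation gives each cheaper action a chance to work before anything touches scheduling or power.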
Making Thermal Decisions Topology-Aware
The most important improvement was teaching Smart Queues to understand the workload path before taking action.
Instead of asking:
Which node is hot?
The controller now asks:
Is the hot node actually involved in qBittorrent's execution or storage path?
To answer that question it discovers:
- where the qBittorrent pod is running
- which PVC is being used for downloads
- which Longhorn volume backs that PVC
- where the Longhorn share-manager is running
- where active replicas are located
Only then does it decide whether qBittorrent should be throttled or paused.
This prevents unrelated thermal events from triggering unnecessary workload changes.
A hot node can still trigger cluster-level mitigation, but qBittorrent is only affected when it is genuinely contributing to the thermal pressure.
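The core of that check can be sketched as a set intersection between the hot nodes and the nodes on qBittorrent's path. This is a minimal sketch, assuming the discovery steps above have already been resolved into plain node names; `DownloadPath` and `should_touch_qbittorrent` are hypothetical names invented for this example, and the Kubernetes and Longhorn lookups themselves are deliberately left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class DownloadPath:
    """Nodes on qBittorrent's execution and storage path (hypothetical model)."""
    pod_node: str                            # node running the qBittorrent pod
    share_manager_node: str | None = None    # node running the Longhorn share-manager
    replica_nodes: frozenset[str] = field(default_factory=frozenset)

    def nodes(self) -> frozenset[str]:
        """Return every node that download I/O passes through."""
        path = {self.pod_node, *self.replica_nodes}
        if self.share_manager_node:
            path.add(self.share_manager_node)
        return frozenset(path)


def should_touch_qbittorrent(hot_nodes: set[str], path: DownloadPath) -> bool:
    """Act on qBittorrent only if a hot node is on its workload path."""
    return bool(hot_nodes & path.nodes())


# The placement described earlier: pod on rpi3, share-manager on rpi1, replica on rpi2.
path = DownloadPath("rpi3", "rpi1", frozenset({"rpi2"}))
print(should_touch_qbittorrent({"rpi1"}, path))  # True
print(should_touch_qbittorrent({"rpi4"}, path))  # False (rpi4 is a made-up unrelated node)
```

Keeping the decision as a pure function over node names makes it easy to test separately from whatever code queries the cluster.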
Additional Refinements
As the controller evolved, a few additional problems became obvious.
Avoid Touching Idle Workloads
The controller was still updating qBittorrent limits even when no downloads were active.
That produced noise without changing cluster behavior.
Now the controller checks for active downloads first.
If qBittorrent is idle:
- no limits are rewritten
- no unnecessary API calls occur
- thermal protection remains active
- new downloads remain blocked until temperatures recover
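The idle check can be sketched as a small planning function. This is a rough illustration under stated assumptions: `Plan` and `plan_thermal_action` are hypothetical names, the active download count is assumed to come from whatever the controller already queries, and restoring limits after recovery is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """What the controller intends to do on this evaluation (hypothetical)."""
    rewrite_limits: bool        # push new rate limits to qBittorrent
    block_new_downloads: bool   # keep new downloads from starting


def plan_thermal_action(active_downloads: int, thermal_pressure: bool) -> Plan:
    """Decide whether qBittorrent needs to be touched at all."""
    if not thermal_pressure:
        # No thermal event: leave qBittorrent alone.
        return Plan(rewrite_limits=False, block_new_downloads=False)
    if active_downloads == 0:
        # Idle: rewriting limits would only add noise, but the guard stays up
        # so nothing new starts until temperatures recover.
        return Plan(rewrite_limits=False, block_new_downloads=True)
    # Active downloads under thermal pressure: throttle and keep the guard up.
    return Plan(rewrite_limits=True, block_new_downloads=True)


print(plan_thermal_action(active_downloads=0, thermal_pressure=True))
# Plan(rewrite_limits=False, block_new_downloads=True)
```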
Use the Right NVMe Temperature
Initially I used the hottest available NVMe sensor.
That turned out to be too aggressive because internal controller temperatures can be significantly higher than the temperature used for normal device health reporting.
For policy decisions I switched to the NVMe Composite temperature while continuing to monitor CPU temperatures independently.
The result is a more stable and predictable signal for automation.
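The difference between the two signals can be shown with a short sketch. It assumes the readings have already been collected into a mapping of sensor label to degrees Celsius; the `"Composite"` and `"Sensor N"` labels follow the naming I would expect from the Linux NVMe hwmon interface, and both the function name and the sample values are made up for illustration.

```python
from __future__ import annotations


def nvme_policy_temp_c(sensors: dict[str, float]) -> float:
    """Return the Composite reading for policy decisions, not the hottest sensor."""
    try:
        return sensors["Composite"]
    except KeyError:
        # Refuse to silently fall back to a hotter internal sensor.
        raise ValueError("no Composite reading available") from None


# Illustrative readings only, not measurements from the cluster.
readings = {"Composite": 52.0, "Sensor 1": 52.0, "Sensor 2": 71.0}
print(max(readings.values()))        # 71.0, the old "hottest sensor" signal
print(nvme_policy_temp_c(readings))  # 52.0, the signal used for policy
```

Raising an error when the Composite value is missing is a design choice for this sketch: it avoids quietly reverting to the more aggressive signal.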

Results
After moving from simple threshold-based actions to topology-aware thermal management, average CPU temperatures dropped on every node, by 3.5°C on rpi1, 6.6°C on rpi2, and 3.2°C on rpi3.

| Node | Avg CPU Before | Avg CPU After |
|---|---|---|
| rpi1 | 71.9°C | 68.4°C |
| rpi2 | 74.4°C | 67.8°C |
| rpi3 | 69.0°C | 65.8°C |

The temperature reduction itself is useful, but it is not the most important outcome.
The real improvement is decision quality.
The cluster now:
- avoids unnecessary shutdowns
- avoids unnecessary qBittorrent changes
- understands workload placement
- understands storage topology
- applies the least disruptive mitigation first
Lessons Learned
The biggest lesson from this project is that thermal automation is not really about temperature.
It is about context.
A node may be running hot because of:
- a workload scheduled there
- a storage frontend
- a Longhorn replica
- background jobs
- network activity
- physical cooling limitations
- unrelated system behavior
Each cause requires a different response.
Treating all thermal events as identical leads to bad automation.
Final Thoughts
The most dangerous automation is automation that lacks context.
My original solution reacted to temperatures.
The current solution reacts to the reason behind those temperatures.
For a small Raspberry Pi Kubernetes cluster, that difference matters. Understanding the workload path before taking action produces better outcomes than simply reacting to thresholds.
Thermal management is ultimately an operations problem, not a temperature problem.






