How to prevent the GPU from overheating and shutting itself off

I was wondering how well Linux could handle a gaming computer, so I built one. As we know, GeForce does not get along with Linux as well as AMD does, which is why I chose the latter.

I built a computer with an AMD Ryzen 7 1800X CPU and a Radeon RX 560D GPU, as the Vega is too expensive for me and the benchmarks said the 560 currently has the best cost-benefit ratio.

After some research I discovered that the D suffix means the card runs at a slightly lower clock speed than the RX 560 without the D, in order to reduce power consumption.

After countless crashes during random gaming sessions I finally found out that the problem is the GPU overheating. Its fan speed tends to follow the CPU fan speed, but of course in some games the CPU is under much less load than the GPU.

I partially solved the problem by making the fan speed depend on the GPU temperature instead of the CPU temperature. It now ramps up gradually and reaches maximum speed at 50 degrees Celsius. The problem is that in some games the fan stays at maximum speed all the time and the system eventually still crashes.
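For illustration, a temperature-based fan curve of this kind could look roughly like the sketch below. It is not the exact setup used here. It assumes the fan is controlled through the amdgpu hwmon files pwm1_enable and pwm1 (duty cycle 0 to 255) and that temp1_input reports millidegrees Celsius. Only the 50-degree full-speed point comes from the description above; the 30-degree ramp start, the minimum duty cycle of 64 and the function names are made up for the example.

```shell
#!/bin/sh
T_LOW=30000    # hypothetical ramp start: 30 C, in millidegrees
T_HIGH=50000   # full speed at 50 C, as described in the post
PWM_MIN=64     # hypothetical idle duty cycle (range is 0-255)

# fan_pwm TEMP: map a temperature in millidegrees C to a PWM duty cycle
# with a linear ramp between T_LOW and T_HIGH.
fan_pwm() {
    if [ "$1" -le "$T_LOW" ]; then
        echo "$PWM_MIN"
    elif [ "$1" -ge "$T_HIGH" ]; then
        echo 255
    else
        echo $(( PWM_MIN + (255 - PWM_MIN) * ($1 - T_LOW) / (T_HIGH - T_LOW) ))
    fi
}

# apply_fan_curve HWMON_DIR: take manual fan control (pwm1_enable=1)
# and set the duty cycle once from the current temperature.
apply_fan_curve() {
    echo 1 > "$1/pwm1_enable" &&
    fan_pwm "$(cat "$1/temp1_input")" > "$1/pwm1"
}
```

Calling apply_fan_curve every few seconds as root, with the hwmon directory found under /sys/class/drm/card0/device/hwmon/, would approximate the gradual ramp.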

Describing the crash: the screen blinks and then goes black, the GPU fan stops, the keyboard LEDs blink and then turn off, the mouse does the same, and the other fan, on the CPU, keeps running. Sometimes the system stays frozen forever, sometimes it reboots on its own.

Since a reboot is required, I could not find any clue in the system logs. Initially I thought it was a kernel panic, but even using kdump with a duplicated kernel the system still crashes in a way I cannot recover from.

I do not know whether Windows would have the same problem, but I strongly believe it would not; I have never seen anyone with this problem on Windows. So my question is: is there some way to tell the kernel to make the GPU take it easy when it is about to overheat, maybe by automatically reducing the GPU clock speed?

Answer

I found the solution. There are some files under /sys/class/drm/card0/device: the file pp_dpm_mclk shows the GPU memory clock levels and the file pp_dpm_sclk shows the GPU core clock levels. Mine:

$ egrep -H . /sys/class/drm/card0/device/pp_dpm_*
/sys/class/drm/card0/device/pp_dpm_mclk:0: 300Mhz 
/sys/class/drm/card0/device/pp_dpm_mclk:1: 1500Mhz *
/sys/class/drm/card0/device/pp_dpm_pcie:0: 2.5GB, x8 *
/sys/class/drm/card0/device/pp_dpm_pcie:1: 8.0GB, x16 
/sys/class/drm/card0/device/pp_dpm_sclk:0: 214Mhz *
/sys/class/drm/card0/device/pp_dpm_sclk:1: 481Mhz 
/sys/class/drm/card0/device/pp_dpm_sclk:2: 760Mhz 
/sys/class/drm/card0/device/pp_dpm_sclk:3: 1000Mhz 
/sys/class/drm/card0/device/pp_dpm_sclk:4: 1050Mhz 
/sys/class/drm/card0/device/pp_dpm_sclk:5: 1100Mhz 
/sys/class/drm/card0/device/pp_dpm_sclk:6: 1150Mhz 
/sys/class/drm/card0/device/pp_dpm_sclk:7: 1196Mhz 
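In this listing the asterisk marks the level that is currently active. A small helper for reading it from a script, as a sketch; the function name active_level is made up for this example:

```shell
#!/bin/sh
# active_level FILE: print the index of the line marked with "*"
# in a pp_dpm_* file (lines look like "3: 1000Mhz *").
active_level() {
    sed -n 's/^\([0-9][0-9]*\):.*\*[[:space:]]*$/\1/p' "$1"
}
```

With the values shown above, active_level /sys/class/drm/card0/device/pp_dpm_sclk would print 0, and the same call on pp_dpm_mclk would print 1.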

The file power_dpm_force_performance_level holds the profile, which can be set to values such as low, auto or manual; the default is auto. With low the GPU always runs at the lowest clock, which is not exactly what I want, so I set it to manual and wrote a script that keeps changing the clock according to the GPU temperature. Voilà, it worked!
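The original script is not reproduced here, so the following is only a rough sketch of the idea, not the script itself. It assumes the GPU temperature is available in millidegrees Celsius from the amdgpu hwmon file temp1_input, and it writes the chosen line number to pp_dpm_sclk. The thresholds, the two-second interval and the function names are illustrative guesses that would need tuning for a real card.

```shell
#!/bin/sh
# Assumption: the GPU is card0; override DEV if it is not.
DEV="${DEV:-/sys/class/drm/card0/device}"

# pick_level TEMP: map a temperature in millidegrees C to an sclk line (0-7).
# The thresholds are hypothetical, not measured values.
pick_level() {
    if   [ "$1" -ge 80000 ]; then echo 0
    elif [ "$1" -ge 75000 ]; then echo 2
    elif [ "$1" -ge 70000 ]; then echo 4
    else echo 7
    fi
}

# read_temp: print the GPU temperature from the first hwmon node found.
read_temp() {
    cat "$DEV"/hwmon/hwmon*/temp1_input 2>/dev/null | head -n 1
}

# throttle_loop: switch to the manual profile, then keep adjusting the clock.
throttle_loop() {
    echo manual > "$DEV/power_dpm_force_performance_level" || return 1
    while :; do
        t=$(read_temp)
        [ -n "$t" ] && pick_level "$t" > "$DEV/pp_dpm_sclk"
        sleep 2
    done
}

# Start the loop only when called with "run", so the file can also be sourced.
if [ "${1:-}" = "run" ]; then
    throttle_loop
fi
```

Saved under a name of your choice and started as root with the argument run, the loop lowers the core clock as the card heats up and raises it again once it cools down.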

To change the clock in the manual profile, just write a number to the file pp_dpm_sclk. The number is the line index, starting at 0 and, in my case, going up to 7.
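As a minimal sketch, assuming the card is card0 and the commands run as root (the function name set_sclk_level is made up for this example):

```shell
#!/bin/sh
# Assumption: the GPU is card0; override DEV if it is not.
DEV="${DEV:-/sys/class/drm/card0/device}"

# set_sclk_level LEVEL: switch to the manual profile and select one
# line of pp_dpm_sclk by its index.
set_sclk_level() {
    echo manual > "$DEV/power_dpm_force_performance_level" &&
    echo "$1" > "$DEV/pp_dpm_sclk"
}
```

With the levels listed above, set_sclk_level 2 would select the 760Mhz line. Writing auto back to power_dpm_force_performance_level hands control back to the driver.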

If you are interested in my script, here it is.

Attribution
Source: Link, Question Author: Tiago Pimenta, Answer Author: Tiago Pimenta
