Title: Fixing stability issues with 1<sup>st</sup> generation Ryzen chips on Debian
Tags: debian, amd

I was an early adopter when Ryzen - AMD's latest CPU line - came out. The prices
were very good, the chips had a lot of cores and they ran pretty fast. At the
time I thought the Ryzen 1600 CPU with its 6 cores and 12 threads all running
at 3.4 GHz with a TDP of 65W (with support for ECC RAM) made the perfect
home server chip.

Fast forward two years: I've finally worked around the stability issues I was
having that hung my server at random intervals. Sometimes, everything was fine
for months, but I also experienced random system freezes twice in a week.

Since I'm using full disk encryption on all the drives in my server, a whole
system freeze meant I had to go back home and reboot the server manually.

I first thought I was affected by a ["rare" bug][RMA] that hit the first
batch of Ryzen CPUs, so I RMAed mine and had to put up with nearly a month of
downtime. Sadly, it didn't solve my problem. Two weeks ago I decided I was tired
of this whole reboot cycle and tried to see if upgrading to a more recent kernel
(4.9 -> 4.18) did the trick. The problem only got worse and my server ended up
freezing each and every night. As always, no errors showed up anywhere in the
logs.

With the 4.18 kernel, the timing of the system freezes got me thinking and I
found this [bug report][bug] in Launchpad. Turns out the problem is caused by
bad low-power handling. When the CPU idles for a long time, it eventually
freezes and hangs the whole system. This is corroborated by this
[AMD report][AMD] that states:
```
1109 MWAIT Instruction May Hang a Thread
Description: Under a highly specific and detailed set of internal timing
conditions, the MWAIT instruction may cause a thread to
hang in SMT (Simultaneous Multithreading) Mode.
Potential Effect on System: The system may hang or reset.
Suggested Workaround: System software may contain the workaround for
this erratum.
Fix Planned: No fix planned
```
To fix the problem I've:

* disabled SMT in the BIOS
* disabled "Cool 'n Quiet" in the BIOS
* disabled "Global C-states" in the BIOS
* set "Power Supply Idle Control" to "Common current idle" in the BIOS
* set `idle=nomwait` on the kernel command line
* set `processor.max_cstate=5` on the kernel command line

Disabling C-states means that the CPU cores always run at 3.4 GHz and the chip
consumes 50W at idle instead of 30W, but that's a price I'm willing to pay to
have a stable server.
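On Debian with GRUB, kernel parameters like the two above usually go into `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub`, followed by `update-grub` and a reboot. To check that they actually reached the running kernel, something like the following Python sketch could be used. The names `EXPECTED_PARAMS`, `parse_cmdline`, `missing_params` and `check_running_kernel` are made up for this example, and the parsing is deliberately naive (it splits on whitespace and ignores quoted values). Calling `check_running_kernel()` on the server should return an empty list once both parameters are active.

```python
from pathlib import Path

# The two kernel parameters from the list above (values as used in this post).
EXPECTED_PARAMS = {"idle": "nomwait", "processor.max_cstate": "5"}


def parse_cmdline(cmdline: str) -> dict[str, str]:
    """Split a kernel command line into {name: value}; bare flags map to ""."""
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def missing_params(cmdline: str, expected: dict[str, str] = EXPECTED_PARAMS) -> list[str]:
    """Return the expected "name=value" pairs that are absent or set differently."""
    active = parse_cmdline(cmdline)
    return [f"{name}={value}" for name, value in expected.items()
            if active.get(name) != value]


def check_running_kernel() -> list[str]:
    """Check the command line the running kernel was booted with (Linux only)."""
    return missing_params(Path("/proc/cmdline").read_text())
```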
Note that from what I've read online, the Ryzen 2 chips aren't affected by this.
Don't take my word for it though. I guess I've learnt the hard way that trying
to build a stable system out of a bleeding edge platform is a bad idea.

[RMA]: https://www.extremetech.com/computing/254750-amd-replaces-ryzen-cpus-users-affected-rare-linux-bug
[bug]: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1690085
[AMD]: https://www.amd.com/system/files/TechDocs/55449_Fam_17h_M_00h-0Fh_Rev_Guide.pdf