AMD EPYC 7002 (“Rome”), a series of server processors built on the “Zen 2” microarchitecture, has been found to contain an erratum that causes the processor to hang after about 1044 days of continuous operation, which is roughly 2 years and 10 months. The issue was identified by AMD, which recommends working around it either by disabling the CC6 power-saving state or by rebooting the server before it reaches 1044 days of uptime.
According to a PDF document published by AMD, the hang occurs when a processor core attempts to exit the CC6 power-saving state once the timer has reached about 1044 days since the last CPU reset. The exact time at which the problem appears may vary with the REFCLK frequency.
Although AMD did not explain the cause of the failure in more detail, the Time Stamp Counter (TSC) register is believed to be the culprit. The TSC counts cycles since the last reset, and at a frequency of 2800 MHz it reaches the value 0x380000000000000 after about 1042 days and 12 hours.
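The arithmetic behind that estimate can be checked with a short Python sketch. It assumes a constant 2800 MHz TSC rate and takes the threshold to be 0x38 followed by 13 hexadecimal zeros, the value that is consistent with the quoted frequency and duration. The function name `days_to_reach` is introduced here for illustration only, and nothing in the sketch is confirmed by AMD.

```python
# Assumptions (not confirmed by AMD): the TSC ticks at a constant 2800 MHz
# and the suspected threshold is 0x38 followed by 13 hexadecimal zeros.
TSC_HZ = 2_800_000_000
TSC_THRESHOLD = 0x380000000000000
SECONDS_PER_DAY = 86_400


def days_to_reach(ticks: int, hz: int) -> float:
    """Return how many days a counter running at `hz` needs to reach `ticks`."""
    return ticks / hz / SECONDS_PER_DAY


days = days_to_reach(TSC_THRESHOLD, TSC_HZ)
print(f"{days:.2f} days")
```

Under these assumptions the result is about 1042.5 days. A different tick rate shifts the figure proportionally, which fits AMD's note that the time to failure depends on the reference clock.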
AMD has no plans to release a fix for the erratum. The problem went unnoticed for a long time because multi-year uptimes are not typical for servers. However, the spread of live kernel patching in Linux distributions, together with long support cycles, can lead to servers running for years without a reboot. Ubuntu, RHEL, and SUSE are supported for 10 years, and Debian for 5 years.
Anyone running servers with AMD EPYC 7002 processors should keep this erratum in mind and either disable CC6 or schedule a reboot before the 1044-day mark.
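One way to keep track of this on a Linux host is to compare the system uptime with the 1044-day figure. The Python sketch below is only an illustration: it reads `/proc/uptime`, treats time since boot as a stand-in for time since the last CPU reset (an assumption), and uses a hypothetical 30-day warning margin. The function names are likewise made up for this example.

```python
import os

LIMIT_DAYS = 1044        # figure quoted by AMD for the hang
WARN_MARGIN_DAYS = 30    # hypothetical safety margin, adjust to taste
SECONDS_PER_DAY = 86_400


def uptime_days(proc_uptime_text: str) -> float:
    """Convert the first field of /proc/uptime (seconds since boot) to days."""
    return float(proc_uptime_text.split()[0]) / SECONDS_PER_DAY


def days_remaining(up_days: float, limit_days: float = LIMIT_DAYS) -> float:
    """Return how many days are left before the limit is reached."""
    return limit_days - up_days


def needs_reboot_soon(up_days: float, margin_days: float = WARN_MARGIN_DAYS) -> bool:
    """Return True once the remaining time drops to the warning margin or below."""
    return days_remaining(up_days) <= margin_days


# Only attempt the check where /proc/uptime exists (Linux).
if os.path.exists("/proc/uptime"):
    with open("/proc/uptime") as f:
        up = uptime_days(f.read())
    status = "schedule a reboot" if needs_reboot_soon(up) else "ok"
    print(f"uptime {up:.1f} days, {days_remaining(up):.1f} days left: {status}")
```

A check like this could be run periodically from a monitoring job. Because the real time to failure may differ from 1044 days, a generous margin is the safer choice.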