Two R5300 G5 servers are involved. Server A rebooted abnormally on the afternoon of February 21, and out-of-band (OOB) monitoring reported numerous bus uncorrectable errors (UCE) pointing to the GPU. Server B, in the same cluster, also reported a large number of bus uncorrectable errors pointing to the GPU that afternoon.
1. Log output:
Server A:
1. The restart time in the SDS log was February 21, 14:12:54:
Informational System ACPI Power State ACPI_State Assertion event From BMC 2025-02-21 14:12:54 CUSTOMER LPC Reset occurred
Before the restart, slot 12 UCE events flooded the log; they cleared after the restart.
1023 Warning Critical Interrupt PCIE12_GPU Assertion event From BIOS 2025-02-21 14:10:38 ENGINEER Bus Uncorrectable Error---Slot 12---PCIE Name: Tesla T4
1025 Warning Critical Interrupt PCIE12_GPU Assertion event From BIOS 2025-02-21 14:10:39 CUSTOMER Bus Uncorrectable Error---Slot 12---PCIE Name: Tesla T4
1026 Warning Critical Interrupt PCIE12_GPU Assertion event From BIOS 2025-02-21 14:10:40 ENGINEER Bus Uncorrectable Error---Slot 12---PCIE Name: Tesla T4
2. In the syslog, the restart time was Feb 21 14:10:02:
Feb 21 14:10:02 sna-12f-b-03-h5300-03-4u12 kernel: Linux version 3.10.0-957.27.8.2.g295089a.el7.x86_64 (root@172-20-53-23) (gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC) ) #1 SMP Mon Nov 14 04:25:17 EST 2022
Before the restart, the following messages appeared in large numbers:
Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.
Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Do you have a strange power saving mode enabled?
Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Dazed and confused, but trying to continue
Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: sched: RT throttling activated
Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.
Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Do you have a strange power saving mode enabled?
Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Dazed and confused, but trying to continue
There was also a restart on the 14th; the SDS logs and syslog output were largely the same as on the 21st.
Server B:
1. The SDS logs contain a large number of slot 10 UCE events, which remain unresolved:
Warning Critical Interrupt PCIE10_GPU Assertion event From BIOS 2025-02-21 14:09:59 CUSTOMER Bus Uncorrectable Error---Slot 10---PCIE Name: Tesla T4
2. Syslog: no restart was recorded on the 21st, but there was a CPU soft lockup, along with the following messages:
Feb 21 14:04:17 sna-12f-b-03-h5300-03-7u4 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.
Feb 21 14:04:23 sna-12f-b-03-h5300-03-7u4 kernel: Do you have a strange power saving mode enabled?
Feb 21 14:04:23 sna-12f-b-03-h5300-03-7u4 kernel: Dazed and confused, but trying to continue
Feb 21 14:04:28 sna-12f-b-03-h5300-03-7u4 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.
Feb 21 14:04:28 sna-12f-b-03-h5300-03-7u4 kernel: sched: RT throttling activated
Feb 21 14:04:28 sna-12f-b-03-h5300-03-7u4 kernel: Do you have a strange power saving mode enabled?
Feb 21 14:04:28 sna-12f-b-03-h5300-03-7u4 kernel: Dazed and confused, but trying to continue
Feb 21 14:04:34 sna-12f-b-03-h5300-03-7u4 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.
Feb 21 14:04:34 sna-12f-b-03-h5300-03-7u4 kernel: Do you have a strange power saving mode enabled?
Feb 21 14:04:50 sna-12f-b-03-h5300-03-7u4 kernel: Dazed and confused, but trying to continue
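To compare the two servers, the log excerpts above can be tallied with a short script. The following is a minimal Python sketch, assuming the syslog and SDS lines keep exactly the layout shown in the excerpts; the function names `count_nmi` and `count_uce` are hypothetical helpers, not part of any vendor tool.

```python
import re
from collections import Counter

# Syslog layout assumed from the excerpts above:
# "Feb 21 14:06:21 <host> kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0."
NMI_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"Uhhuh\. NMI received for unknown reason (?P<reason>[0-9a-f]+) on CPU (?P<cpu>\d+)"
)

# SDS layout assumed from the excerpts above:
# "... 2025-02-21 14:10:38 ENGINEER Bus Uncorrectable Error---Slot 12---PCIE Name: Tesla T4"
UCE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+\S+\s+"
    r"Bus Uncorrectable Error---Slot (?P<slot>\d+)---PCIE Name: (?P<name>.+)$"
)


def count_nmi(lines):
    """Count 'unknown reason' NMI messages per (host, reason code)."""
    counts = Counter()
    for line in lines:
        m = NMI_RE.match(line.strip())
        if m:
            counts[(m["host"], m["reason"])] += 1
    return counts


def count_uce(lines):
    """Count Bus Uncorrectable Error events per (slot, device name)."""
    counts = Counter()
    for line in lines:
        m = UCE_RE.search(line.strip())
        if m:
            counts[(int(m["slot"]), m["name"].strip())] += 1
    return counts


if __name__ == "__main__":
    sample_syslog = [
        "Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.",
        "Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Dazed and confused, but trying to continue",
    ]
    sample_sds = [
        "1023 Warning Critical Interrupt PCIE12_GPU Assertion event From BIOS 2025-02-21 14:10:38 ENGINEER Bus Uncorrectable Error---Slot 12---PCIE Name: Tesla T4",
    ]
    print(count_nmi(sample_syslog))
    print(count_uce(sample_sds))
```

Run against the full logs of both servers, the counts give a quick view of how many NMI messages and UCE events each host and slot produced.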
2. Further analysis of the status value reported with each Bus Uncorrectable Error: the status value is identical every time a UCE is reported, on both devices, as in the following example:
The status is 0x00100000, meaning that only bit 20 is set. The status for the OOB UCE alarm on device 161 is exactly the same as on device 162, also 0x00100000. As illustrated below, bit 20 represents an Unsupported Request (UR) error on the T4 GPU. This error raises a non-maskable interrupt (NMI) on the system processor via the PCIe root port, resulting in an unrecoverable system error.
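The bit decoding above can be sketched in a few lines of Python. This is a minimal illustration, assuming the reported status is the PCIe Advanced Error Reporting (AER) Uncorrectable Error Status register; the source does not name the register, and the helper `decode_uncorrectable_status` is a hypothetical name. The table lists the commonly defined bits only.

```python
# Bit positions in the PCIe AER Uncorrectable Error Status register.
# Assumption: the status value in the OOB alarm maps onto this register.
AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
}


def decode_uncorrectable_status(status):
    """Return (bit, name) pairs for every bit set in a 32-bit status value."""
    decoded = []
    for bit in range(32):
        if status & (1 << bit):
            name = AER_UNCORRECTABLE_BITS.get(bit, "reserved or not listed")
            decoded.append((bit, name))
    return decoded


if __name__ == "__main__":
    for bit, name in decode_uncorrectable_status(0x00100000):
        print(f"bit {bit}: {name}")
```

For the value 0x00100000 seen on both devices, the only entry returned is bit 20, Unsupported Request Error, which matches the reading given above.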
In summary, the OOB alarm was triggered by an Unsupported Request (UR) error involving the T4 GPU, which produced the OOB UCE alarms and the server restart. Follow-up troubleshooting and adjustments will be carried out at the system and service levels.