R5300 G5 T4 GPU UCE issue

2025-12-18 14:47:54 Published


Problem Description

Two R5300 G5 servers are involved. Server A rebooted unexpectedly on the afternoon of the 21st, and the out-of-band (OOB) management log reported numerous Bus Uncorrectable Errors (UCE) pointing to a GPU. Server B, in the same cluster, also logged a large number of Bus Uncorrectable Errors pointing to a GPU on the afternoon of the 21st.

Process Analysis

1. SDS log and syslog output:

Server A:

1. In the SDS log, the restart was recorded at 14:12:54 on February 21:

Informational System ACPI Power State ACPI_State Assertion event From BMC 2025-02-21 14:12:54 CUSTOMER LPC Reset occurred

Before the restart, the log was flooded with slot 12 UCE events, which cleared after the restart.

1023 Warning Critical Interrupt PCIE12_GPU Assertion event From BIOS 2025-02-21 14:10:38 ENGINEER Bus Uncorrectable Error---Slot 12---PCIE Name: Tesla T4

1025 Warning Critical Interrupt PCIE12_GPU Assertion event From BIOS 2025-02-21 14:10:39 CUSTOMER Bus Uncorrectable Error---Slot 12---PCIE Name: Tesla T4

1026 Warning Critical Interrupt PCIE12_GPU Assertion event From BIOS 2025-02-21 14:10:40 ENGINEER Bus Uncorrectable Error---Slot 12---PCIE Name: Tesla T4

2. In the syslog, the restart was recorded at Feb 21 14:10:02:

Feb 21 14:10:02 sna-12f-b-03-h5300-03-4u12 kernel: Linux version 3.10.0-957.27.8.2.g295089a.el7.x86_64 (root@172-20-53-23) (gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC) ) #1 SMP Mon Nov 14 04:25:17 EST 2022

Before the restart, the following messages were printed many times:

Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.

Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Do you have a strange power saving mode enabled?

Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Dazed and confused, but trying to continue

Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: sched: RT throttling activated

Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.

Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Do you have a strange power saving mode enabled?

Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Dazed and confused, but trying to continue

Server A also restarted on the 14th; the SDS log and syslog output were largely the same as on the 21st.

Server B:

1. The SDS log contains a large number of slot 10 UCE events, which have not cleared.

Warning Critical Interrupt PCIE10_GPU Assertion event From BIOS 2025-02-21 14:09:59 CUSTOMER Bus Uncorrectable Error---Slot 10---PCIE Name: Tesla T4

2. Syslog: there is no restart record on the 21st, but there was a CPU soft lockup, along with the following messages:

Feb 21 14:04:17 sna-12f-b-03-h5300-03-7u4 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.

Feb 21 14:04:23 sna-12f-b-03-h5300-03-7u4 kernel: Do you have a strange power saving mode enabled?

Feb 21 14:04:23 sna-12f-b-03-h5300-03-7u4 kernel: Dazed and confused, but trying to continue

Feb 21 14:04:28 sna-12f-b-03-h5300-03-7u4 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.

Feb 21 14:04:28 sna-12f-b-03-h5300-03-7u4 kernel: sched: RT throttling activated

Feb 21 14:04:28 sna-12f-b-03-h5300-03-7u4 kernel: Do you have a strange power saving mode enabled?

Feb 21 14:04:28 sna-12f-b-03-h5300-03-7u4 kernel: Dazed and confused, but trying to continue

Feb 21 14:04:34 sna-12f-b-03-h5300-03-7u4 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.

Feb 21 14:04:34 sna-12f-b-03-h5300-03-7u4 kernel: Do you have a strange power saving mode enabled?

Feb 21 14:04:50 sna-12f-b-03-h5300-03-7u4 kernel: Dazed and confused, but trying to continue
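To compare the two servers, the UCE events in the SDS logs can be tallied per slot. The following is a minimal sketch, not a vendor tool: the regular expression is modelled only on the SDS lines quoted above, and the function name `count_uce_by_slot` and the `sample` list are illustrative assumptions.

```python
import re
from collections import Counter

# Assumed layout, inferred from the SDS lines quoted in this case:
# "... <YYYY-MM-DD HH:MM:SS> <role> Bus Uncorrectable Error---Slot <n>---PCIE Name: <device>"
UCE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*?"
    r"Bus Uncorrectable Error---Slot (?P<slot>\d+)---PCIE Name: (?P<name>.+)$"
)

def count_uce_by_slot(lines):
    """Count Bus Uncorrectable Error events per (slot, device name)."""
    counts = Counter()
    for line in lines:
        match = UCE_RE.search(line.strip())
        if match:
            key = (int(match.group("slot")), match.group("name").strip())
            counts[key] += 1
    return counts

# Sample lines copied from the SDS excerpts above.
sample = [
    "1023 Warning Critical Interrupt PCIE12_GPU Assertion event From BIOS 2025-02-21 14:10:38 ENGINEER Bus Uncorrectable Error---Slot 12---PCIE Name: Tesla T4",
    "1025 Warning Critical Interrupt PCIE12_GPU Assertion event From BIOS 2025-02-21 14:10:39 CUSTOMER Bus Uncorrectable Error---Slot 12---PCIE Name: Tesla T4",
    "Warning Critical Interrupt PCIE10_GPU Assertion event From BIOS 2025-02-21 14:09:59 CUSTOMER Bus Uncorrectable Error---Slot 10---PCIE Name: Tesla T4",
    "Informational System ACPI Power State ACPI_State Assertion event From BMC 2025-02-21 14:12:54 CUSTOMER LPC Reset occurred",
]

for (slot, name), count in sorted(count_uce_by_slot(sample).items()):
    print(f"slot {slot} ({name}): {count} UCE event(s)")
```

Lines that do not match the pattern, such as the LPC Reset record, are ignored, so a full SDS export could be passed in unchanged if it follows the same layout.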

2. Further analysis of the status value reported with each Bus Uncorrectable Error. The status value is identical every time a UCE is reported, on both servers:

The status is 0x00100000, meaning that bit 20 is set. The status for the out-of-band (OOB) UCE alarm on device 161 is exactly the same as on device 162, also 0x00100000. Bit 20 represents an Unsupported Request (UR) error from the T4 GPU. This error triggers a non-maskable interrupt (NMI) on the system processor via the PCIe root port, resulting in an unrecoverable system error.
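The bit decoding can be reproduced with a short script. This is a sketch that assumes the reported status is the PCIe Advanced Error Reporting (AER) Uncorrectable Error Status register, which the case does not state explicitly. The bit names follow the PCIe specification's layout for that register, and the function name is illustrative.

```python
# Bit positions of the PCIe AER Uncorrectable Error Status register.
# Assumption: the OOB "status" value in this case is that register.
AER_UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
}

def decode_uncorrectable_status(status):
    """Return the names of all known error bits set in the status value."""
    return [
        name
        for bit, name in sorted(AER_UNCORRECTABLE_BITS.items())
        if status & (1 << bit)
    ]

status = 0x00100000  # value reported for both servers in this case
print(f"set bits: {[b for b in range(32) if status & (1 << b)]}")
print(decode_uncorrectable_status(status))
```

Only bit 20 is set in 0x00100000, so the decoder reports a single Unsupported Request Error, matching the analysis above.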

Solution

The out-of-band (OOB) alarm was triggered by an Unsupported Request (UR) response on the T4 GPU, which resulted in the OOB UCE and the server restart. Subsequent troubleshooting and adjustments will be carried out at the system and service levels.
