R5300 G5 T4 GPU UCE issue

2025-12-18 14:47:54 Published

Problem Description

Two R5300 G5 servers are involved. Server A rebooted abnormally on the afternoon of the 21st, and its out-of-band (OOB) management reported numerous bus uncorrectable errors (UCE) pointing to the GPU. Server B, in the same cluster, also reported a large number of bus uncorrectable errors pointing to the GPU on the afternoon of the 21st.

Process Analysis

1. SDS log and syslog output:

Server A:

1. The SDS log records the restart at 14:12:54 on February 21:

Informational  System ACPI Power State        ACPI_State      Assertion event       From BMC         2025-02-21 14:12:54      CUSTOMER     LPC Reset occurred

 

Before the restart, the log was flooded with slot 12 UCE events, which stopped after the restart.

1023         Warning Critical Interrupt      PCIE12_GPU   Assertion event       From BIOS       2025-02-21 14:10:38        ENGINEER       Bus Uncorrectable Error---Slot 12---PCIE Name: Tesla T4     

1025         Warning Critical Interrupt      PCIE12_GPU   Assertion event       From BIOS       2025-02-21 14:10:39        CUSTOMER     Bus Uncorrectable Error---Slot 12---PCIE Name: Tesla T4     

1026         Warning Critical Interrupt      PCIE12_GPU   Assertion event       From BIOS       2025-02-21 14:10:40        ENGINEER       Bus Uncorrectable Error---Slot 12---PCIE Name: Tesla T4

 

2. The syslog records the restart at Feb 21 14:10:02:

Feb 21 14:10:02 sna-12f-b-03-h5300-03-4u12 kernel: Linux version 3.10.0-957.27.8.2.g295089a.el7.x86_64 (root@172-20-53-23) (gcc version 8.3.1 20190311 (Red Hat 8.3.1-3) (GCC) ) #1 SMP Mon Nov 14 04:25:17 EST 2022

 

Before the restart, the following messages were logged in large numbers:

Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.

Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Do you have a strange power saving mode enabled?

Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Dazed and confused, but trying to continue

Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: sched: RT throttling activated

Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.

Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Do you have a strange power saving mode enabled?

Feb 21 14:06:21 sna-12f-b-03-h5300-03-4u12 kernel: Dazed and confused, but trying to continue

 

A restart also occurred on the 14th; the SDS logs and syslog output were largely the same as on the 21st.
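To make a UCE flood like this easier to quantify, the SDS event lines can be tallied per slot. The following is a minimal sketch, assuming the event lines have been exported as plain text in the same layout as the excerpts above; the function names `parse_uce_event` and `count_uce_by_slot` are hypothetical and not part of any vendor tool.

```python
import re
from collections import Counter

# Matches the description field as it appears in the excerpts above.
UCE_RE = re.compile(r"Bus Uncorrectable Error---Slot (\d+)---PCIE Name: (.+?)\s*$")
# Matches the event timestamp column, e.g. "2025-02-21 14:10:38".
TS_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def parse_uce_event(line):
    """Return (timestamp, slot, device) for a UCE event line, else None."""
    uce = UCE_RE.search(line)
    ts = TS_RE.search(line)
    if not uce or not ts:
        return None
    return ts.group(0), int(uce.group(1)), uce.group(2)

def count_uce_by_slot(lines):
    """Count UCE events per (slot, device); non-UCE lines are ignored."""
    events = (parse_uce_event(line) for line in lines)
    return Counter((slot, device) for _, slot, device in filter(None, events))

# Sample lines copied from the Server A excerpt.
sample_sds = [
    "Informational  System ACPI Power State        ACPI_State      Assertion event       From BMC         2025-02-21 14:12:54      CUSTOMER     LPC Reset occurred",
    "1023         Warning Critical Interrupt      PCIE12_GPU   Assertion event       From BIOS       2025-02-21 14:10:38        ENGINEER       Bus Uncorrectable Error---Slot 12---PCIE Name: Tesla T4     ",
    "1025         Warning Critical Interrupt      PCIE12_GPU   Assertion event       From BIOS       2025-02-21 14:10:39        CUSTOMER     Bus Uncorrectable Error---Slot 12---PCIE Name: Tesla T4     ",
    "1026         Warning Critical Interrupt      PCIE12_GPU   Assertion event       From BIOS       2025-02-21 14:10:40        ENGINEER       Bus Uncorrectable Error---Slot 12---PCIE Name: Tesla T4",
]

if __name__ == "__main__":
    for (slot, device), n in count_uce_by_slot(sample_sds).items():
        print(f"slot {slot} ({device}): {n} UCE events")
```

Running the same tally over the full exported log for each server would show whether the errors are confined to a single slot, as the excerpts suggest.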

Server B:

1. The SDS logs contain a large number of slot 10 UCE events, which have not cleared.

Warning      Critical Interrupt PCIE10_GPU Assertion event  From BIOS   2025-02-21 14:09:59      CUSTOMER Bus Uncorrectable Error---Slot 10---PCIE Name: Tesla T4

 

2. Syslog: there is no restart record on the 21st, but there was a CPU soft lockup, along with the following messages:

Feb 21 14:04:17 sna-12f-b-03-h5300-03-7u4 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.

Feb 21 14:04:23 sna-12f-b-03-h5300-03-7u4 kernel: Do you have a strange power saving mode enabled?

Feb 21 14:04:23 sna-12f-b-03-h5300-03-7u4 kernel: Dazed and confused, but trying to continue

Feb 21 14:04:28 sna-12f-b-03-h5300-03-7u4 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.

Feb 21 14:04:28 sna-12f-b-03-h5300-03-7u4 kernel: sched: RT throttling activated

Feb 21 14:04:28 sna-12f-b-03-h5300-03-7u4 kernel: Do you have a strange power saving mode enabled?

Feb 21 14:04:28 sna-12f-b-03-h5300-03-7u4 kernel: Dazed and confused, but trying to continue

Feb 21 14:04:34 sna-12f-b-03-h5300-03-7u4 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.

Feb 21 14:04:34 sna-12f-b-03-h5300-03-7u4 kernel: Do you have a strange power saving mode enabled?

Feb 21 14:04:50 sna-12f-b-03-h5300-03-7u4 kernel: Dazed and confused, but trying to continue
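The NMI messages on both servers can be summarized the same way. The sketch below is illustrative only: it assumes standard syslog line prefixes as shown in the excerpts, and the function name `summarize_nmi` is hypothetical. It counts the "NMI received for unknown reason" messages per host, reason code, and CPU.

```python
import re
from collections import Counter

# Matches kernel lines such as:
# "Feb 21 14:04:17 <host> kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0."
NMI_RE = re.compile(
    r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (\S+) kernel: "
    r"Uhhuh\. NMI received for unknown reason ([0-9a-f]+) on CPU (\d+)\."
)

def summarize_nmi(lines):
    """Count unknown-reason NMI messages per (host, reason, cpu)."""
    counts = Counter()
    for line in lines:
        m = NMI_RE.match(line)
        if m:
            _, host, reason, cpu = m.groups()
            counts[(host, reason, int(cpu))] += 1
    return counts

# Sample lines copied from the Server B excerpt.
sample_syslog = [
    "Feb 21 14:04:17 sna-12f-b-03-h5300-03-7u4 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.",
    "Feb 21 14:04:23 sna-12f-b-03-h5300-03-7u4 kernel: Do you have a strange power saving mode enabled?",
    "Feb 21 14:04:28 sna-12f-b-03-h5300-03-7u4 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.",
    "Feb 21 14:04:28 sna-12f-b-03-h5300-03-7u4 kernel: sched: RT throttling activated",
    "Feb 21 14:04:34 sna-12f-b-03-h5300-03-7u4 kernel: Uhhuh. NMI received for unknown reason 2c on CPU 0.",
]

if __name__ == "__main__":
    for (host, reason, cpu), n in summarize_nmi(sample_syslog).items():
        print(f"{host}: {n} NMIs, reason 0x{reason}, CPU {cpu}")
```

Comparing the NMI timestamps with the UCE timestamps in the SDS log is one way to check that the two streams line up, bearing in mind that the BMC and OS clocks may not be synchronized.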

2. Further analysis of the status value reported with each Bus Uncorrectable Error: the status value is identical every time a UCE is reported, on both devices, for example:

The status is 0x00100000, meaning that bit 20 is set. The status for the out-of-band (OOB) UCE alarm on device 161 is exactly the same as on device 162, also 0x00100000. Bit 20 represents an Unsupported Request (UR) response from the T4 GPU. This error triggers a non-maskable interrupt (NMI) on the system processor via the PCIe root port, resulting in an unrecoverable system error.
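The bit decoding above can be reproduced with a short script. This is a sketch under one assumption: that the reported status value follows the layout of the PCIe Advanced Error Reporting (AER) Uncorrectable Error Status register. The case does not name the register, although its reading of bit 20 as Unsupported Request is consistent with that layout. The table lists commonly used bits only, and the function name `decode_uncor_status` is hypothetical.

```python
# Assumed layout: PCIe AER Uncorrectable Error Status register (common bits only).
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
    21: "ACS Violation",
}

def decode_uncor_status(status):
    """Return the names of all bits set in a 32-bit status value."""
    names = []
    for bit in range(32):
        if status & (1 << bit):
            names.append(AER_UNCOR_BITS.get(bit, f"bit {bit} (not in this table)"))
    return names

if __name__ == "__main__":
    print(decode_uncor_status(0x00100000))  # ['Unsupported Request Error']
```

Because only one bit is set, and it is the same bit in every event on both servers, the decoding points to a single repeated error type rather than a mix of link-level faults.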

Solution

The out-of-band (OOB) alarm was triggered by an Unsupported Request (UR) error on the T4 GPU, which led to the OOB UCE alarms and the server restart. Subsequent troubleshooting and adjustments will be carried out at the system and service levels.
