Devices and versions: S6805 F6631
Network: leaf-3 and leaf-4 form an M-LAG pair of leaf devices. They connect to the spine and the remote leaf through tunnels, and their M-LAG interfaces connect to the servers. The following figure shows part of the network:
In an on-site M-LAG+EVPN network where leaf-3 and leaf-4 form an M-LAG pair, packets were being lost between terminals under the remote leaf and terminals under the local leaf. The on-site engineer confirmed that, when the loss occurred, the packets reached leaf-3/4, but leaf-3/4 did not forward them to the downstream terminals.
We picked one faulty flow for detailed analysis. Its destination MAC is fa16-3e9c-b63c, a terminal attached downstream of leaf-3/4 through M-LAG group BAGG2. Normally, both M-LAG devices should learn this MAC entry on their M-LAG interfaces. On the on-site devices, however, leaf-4 had learned the MAC on Tunnel2:
What is Tunnel2? Its detailed information on leaf-4 shows that its source and destination IP addresses are 20.3.16.5 and 20.3.16.6, respectively:
The source IP of Tunnel2 is the virtual VTEP address configured on leaf-4, and the destination IP is the virtual VTEP address configured on leaf-3. A corresponding Tunnel1 exists on leaf-3:
In other words, the two M-LAG devices, leaf-3 and leaf-4, have established a tunnel with each other using the virtual VTEP addresses configured on each device. Sharp-eyed readers have probably already noticed that something is wrong. Let's continue.
During troubleshooting, the on-site engineer checked the MAC address table several times and found that the MAC kept drifting. On a later check, leaf-4 had learned the MAC correctly, but leaf-3 had learned it on Tunnel1:
At this point the cause is clear. The VTEP devices in an M-LAG group must be configured with the same virtual VTEP address, which they use to establish tunnels with remote devices. On site, however, the two devices were configured with different virtual VTEP addresses, so a tunnel was established between the two members of the M-LAG group themselves. Under normal circumstances, both M-LAG devices learn terminal MAC addresses on the AC ports (that is, the M-LAG interfaces). Because of the abnormal tunnel between the M-LAG devices, the terminal MAC addresses were also synchronized between them through EVPN, so the MAC entries drifted between the AC ports and the tunnel ports. This drift caused the packet loss for downstream traffic on this pair of leaf devices.
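The drift between AC ports and tunnel ports can also be spotted mechanically by comparing repeated MAC table checks. The following is only a minimal Python sketch, not a vendor tool: the `Observation` structure and the `find_drifting_macs` helper are hypothetical names, and the sample rows simply restate the checks described in this case. Parsing real device output is left out.

```python
from collections import defaultdict
from typing import NamedTuple


class Observation(NamedTuple):
    """One MAC table row noted during a manual check (hypothetical structure)."""
    device: str
    mac: str
    interface: str


def find_drifting_macs(observations):
    """Return {(device, mac): sorted interfaces} for MACs seen on more than one interface."""
    seen = defaultdict(set)
    for obs in observations:
        seen[(obs.device, obs.mac)].add(obs.interface)
    return {key: sorted(ifs) for key, ifs in seen.items() if len(ifs) > 1}


# Rows restating the checks in this case: leaf-4 first showed the MAC on Tunnel2
# and later on the M-LAG interface BAGG2, while leaf-3 showed it on Tunnel1.
samples = [
    Observation("leaf-4", "fa16-3e9c-b63c", "Tunnel2"),
    Observation("leaf-4", "fa16-3e9c-b63c", "BAGG2"),
    Observation("leaf-3", "fa16-3e9c-b63c", "Tunnel1"),
]

drifting = find_drifting_macs(samples)
for (device, mac), interfaces in drifting.items():
    print(f"{device}: {mac} moved between {', '.join(interfaces)}")
```

A MAC that appears on both an M-LAG interface and a tunnel interface on the same device is the pattern to look for. A single row on a tunnel, such as the leaf-3 row here, is not flagged as drift by this sketch, but it is still wrong for a locally attached terminal.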
After the virtual VTEP addresses on leaf-3 and leaf-4 were changed to the same address, the problem was resolved.
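The rule behind this fix can be written as a small consistency check. The sketch below is illustrative only: `MlagMember` and `check_mlag_vtep` are hypothetical names, the data is entered by hand rather than read from devices, the Tunnel1 endpoints on leaf-3 are assumed to mirror Tunnel2, and the shared address used in the corrected example is an arbitrary choice, because the case does not state which address was finally used.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address


@dataclass
class MlagMember:
    """Hypothetical summary of one M-LAG member, filled in by hand or by a parser."""
    name: str
    virtual_vtep: str
    # Tunnel name -> (source IP, destination IP)
    tunnels: dict = field(default_factory=dict)


def check_mlag_vtep(members):
    """Return a list of findings; an empty list means the group looks consistent."""
    findings = []
    vteps = {m.name: ip_address(m.virtual_vtep) for m in members}

    # Rule 1: all members of one M-LAG group must share one virtual VTEP address.
    if len(set(vteps.values())) > 1:
        detail = ", ".join(f"{name}={ip}" for name, ip in sorted(vteps.items()))
        findings.append(f"virtual VTEP mismatch: {detail}")

    # Rule 2: no tunnel should terminate on another member's virtual VTEP address.
    for m in members:
        for tunnel, (_src, dst) in sorted(m.tunnels.items()):
            for peer, peer_ip in sorted(vteps.items()):
                if peer != m.name and ip_address(dst) == peer_ip:
                    findings.append(
                        f"{m.name} {tunnel} terminates on {peer} virtual VTEP {peer_ip}"
                    )
    return findings


# Faulty state from this case (Tunnel1 endpoints are an assumption).
faulty = [
    MlagMember("leaf-3", "20.3.16.6", {"Tunnel1": ("20.3.16.6", "20.3.16.5")}),
    MlagMember("leaf-4", "20.3.16.5", {"Tunnel2": ("20.3.16.5", "20.3.16.6")}),
]

# Corrected state: one shared address (arbitrary choice), no tunnel between the peers.
fixed = [
    MlagMember("leaf-3", "20.3.16.5"),
    MlagMember("leaf-4", "20.3.16.5"),
]

for finding in check_mlag_vtep(faulty):
    print(finding)
```

For the faulty data, the check reports the address mismatch and both peer-to-peer tunnels. For the corrected data, it returns an empty list.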