The customer uses a WX3540H wireless AC with WA6330 APs. Local forwarding is enabled and PSK is the only encryption method in use. The deployment is a high-density warehouse environment. The terminals are SUNMI PDAs (barcode scanners) running Android, with the wireless network card operating in 802.11n mode.
The customer reported that in one particular warehouse, the PDAs lose packets for 3-5 seconds when roaming, and the Wi-Fi signal icon disappears for the same 3-5 seconds. The problem has existed since the warehouse was first built and has never been resolved. Ordinary terminals such as mobile phones and laptops do not show it.
Other warehouses with the same network configuration do not have the problem, so suspicion first fell on the software version and configuration of the AC. Repeated checks, however, found no differences.
The problem is therefore unusual, and some subtle clue has probably been overlooked.
The approach is to start with a packet capture on the air interface. The PDA behaves differently because of some mechanism of its own, but most PDA manufacturers provide little support for network issues, so that mechanism cannot be learned directly. What comparison with other terminals does establish is that the PDA behaves differently from them, so it is reasonable to infer that the PDA runs its own processing logic.
We captured packets on the air interface while the PDA roamed.
Looking at the capture along the time axis, each step of the roam followed closely on the previous one and looked normal, and we did not identify any unusual behavior. Yet immediately after this apparently normal sequence, the terminal went offline.
The disconnection had no obvious cause. After going offline, the terminal waited almost 3 seconds before starting a new authentication and association. This matches the 3-5 second delay experienced on site, and the disappearance of the Wi-Fi icon during that time shows that the terminal really was disassociated for more than 3 seconds. In a normal environment, no such abnormal disconnection occurs.
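The length of the offline period can be read off a capture by comparing the timestamp of the Deauthentication frame with that of the next Authentication frame from the same station. The following is a minimal sketch, assuming the capture has already been exported as (timestamp, frame type) pairs for a single station. The tuple layout, the frame-type strings, the function name, and the sample timestamps are all illustrative assumptions, not data from the actual capture.

```python
from typing import Iterable, List, Tuple

# (capture timestamp in seconds, 802.11 management frame type); hypothetical layout
Frame = Tuple[float, str]


def offline_gaps(frames: Iterable[Frame]) -> List[float]:
    """Return the delay between each deauth and the next authentication frame."""
    gaps = []
    deauth_time = None
    for ts, kind in sorted(frames):
        if kind == "deauth" and deauth_time is None:
            # Remember when the station went offline.
            deauth_time = ts
        elif kind == "auth" and deauth_time is not None:
            # First authentication after the deauth closes the gap.
            gaps.append(round(ts - deauth_time, 3))
            deauth_time = None
    return gaps


# Made-up timestamps for illustration only.
sample = [
    (10.00, "reassoc-req"),
    (10.01, "reassoc-resp"),
    (11.05, "deauth"),
    (14.02, "auth"),
    (14.03, "assoc-req"),
]
print(offline_gaps(sample))
```

A gap of around 3 seconds in real data would correspond to the delay described above.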
Let us think further: what makes the terminal decide that it needs to deauthenticate? The roam has already succeeded, so why disassociate?
A: Are the wireless parameters unsuitable for the terminal?
B: Has the terminal enabled some kind of check that then failed?
Logically, these are the only two possible explanations. We checked the AC parameters, the AP statistics, and all other indicators across the board, and all of them showed that the terminal itself initiated the disconnection. Since option A turned up nothing significantly abnormal, option B became the most likely cause. To analyze option B at the level of individual wireless packets, an unencrypted environment was needed.
After setting up an unencrypted environment, we repeated the packet capture and analysis, with the following findings:
Before each abnormal disconnection, the terminal sent a unicast ARP request for the gateway's address. It apparently received no response, and after 1 second without one it sent the abnormal Deauth. The same pattern appeared in multiple tests. After the abnormal Deauth, the terminal reconnected in the same way as when it first came online. During the DHCP process the terminal sent ARP again, this time as a broadcast, and the broadcast ARP was answered.
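To make the distinction between the two ARP variants concrete, the sketch below builds both request frames as raw bytes. In this simplified form, the only difference is the Ethernet destination address: the gateway's own MAC for the unicast request, and ff:ff:ff:ff:ff:ff for the broadcast one. The ARP target hardware field is left as zeros in both cases, which is a simplification. The function names, MAC addresses, and IP addresses are hypothetical and are not taken from the capture.

```python
import socket
import struct
from typing import Optional

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' to 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def build_arp_request(src_mac: str, src_ip: str, target_ip: str,
                      target_mac: Optional[str] = None) -> bytes:
    """Build an Ethernet + ARP request frame.

    If target_mac is given, the frame is addressed to that MAC (unicast);
    otherwise it is sent to the broadcast address.
    """
    dst = mac_to_bytes(target_mac or BROADCAST_MAC)
    src = mac_to_bytes(src_mac)
    ethernet = dst + src + struct.pack("!H", 0x0806)  # EtherType: ARP
    # Hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # address lengths 6 and 4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += src + socket.inet_aton(src_ip)          # sender MAC and IP
    arp += bytes(6) + socket.inet_aton(target_ip)  # target MAC (zeros) and IP
    return ethernet + arp


# Hypothetical addresses for illustration only.
unicast = build_arp_request("02:00:00:00:00:01", "192.0.2.10", "192.0.2.1",
                            target_mac="02:00:00:00:00:fe")
broadcast = build_arp_request("02:00:00:00:00:01", "192.0.2.10", "192.0.2.1")
print(unicast[:6].hex(), broadcast[:6].hex())
```

A device that answers only frames addressed to the broadcast MAC would reply to the second frame and stay silent on the first, which is the behavior seen in the capture.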
We also captured the roaming behavior of other devices such as smartphones. They too sent unicast ARP requests to the gateway after roaming and received no response, but they did not disconnect.
Based on this, we can conclude that the PDA has a special judgment mechanism. After roaming, it sends a unicast ARP request to the gateway. If it receives no response within 1 second, it considers the network unavailable, disconnects from the Wi-Fi, and tries to reconnect after about 3 seconds, running the DHCP process as if it were connecting for the first time. Broadcast ARP, by contrast, is answered correctly in this network.
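The inferred behavior can be written down as a small timeline model. This is only a sketch of the logic deduced from the captures, not the PDA vendor's actual implementation. The 1 second and 3 second constants are the values observed in this case; the function and event names are hypothetical.

```python
from typing import List, Optional, Tuple

ARP_TIMEOUT_S = 1.0      # observed: Deauth about 1 s after the unanswered unicast ARP
RECONNECT_DELAY_S = 3.0  # observed: roughly 3 s before re-authentication starts

Event = Tuple[float, str]


def post_roam_events(roam_done_at: float,
                     arp_reply_delay: Optional[float]) -> List[Event]:
    """Model what the PDA appears to do after a successful roam.

    arp_reply_delay is the time the gateway takes to answer the unicast ARP,
    or None if it never answers.
    """
    events = [(roam_done_at, "unicast_arp_sent")]
    if arp_reply_delay is not None and arp_reply_delay <= ARP_TIMEOUT_S:
        events.append((roam_done_at + arp_reply_delay, "arp_reply_received"))
        return events
    deauth_at = roam_done_at + ARP_TIMEOUT_S
    events.append((deauth_at, "deauth"))
    events.append((deauth_at + RECONNECT_DELAY_S, "reconnect_start"))
    return events


def wifi_outage(events: List[Event]) -> float:
    """Seconds between the Deauth and the start of reconnection (0 if none)."""
    times = {name: ts for ts, name in events}
    if "deauth" not in times:
        return 0.0
    return times["reconnect_start"] - times["deauth"]


print(wifi_outage(post_roam_events(0.0, None)))   # gateway never answers
print(wifi_outage(post_roam_events(0.0, 0.01)))   # gateway answers promptly
```

The model covers only the time up to the start of re-authentication. The association and DHCP exchange that follow add to the total interruption.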
We then repeated similar operations in a normal warehouse, where unicast ARP requests were responded to normally and roaming worked fine.
Following this line of thought, we worked hop by hop, capturing mirrored traffic on the interfaces of the access and aggregation switches, and finally traced the issue to the gateway device. Each warehouse had its own S7506E gateway switch, and the software versions differed between them.
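The hop-by-hop reasoning can be sketched as follows. The sketch assumes that each mirror capture has been reduced to two facts: whether the unicast ARP request was seen heading towards the gateway, and whether a reply was seen coming back. The capture points are listed in order from the client side to the gateway side, which is an assumption about the topology. The function name and the sample observations are hypothetical.

```python
from typing import List, Tuple

# (capture point, request seen, reply seen), ordered from client towards gateway
Observation = Tuple[str, bool, bool]


def locate_arp_loss(path: List[Observation], target: str = "gateway") -> str:
    """Return a short verdict on where the ARP exchange breaks."""
    previous = "client"
    for point, request_seen, _ in path:
        if not request_seen:
            return f"request lost between {previous} and {point}"
        previous = point
    # The request reached the last capture point: walk back looking for the reply.
    upstream = target
    for point, _, reply_seen in reversed(path):
        if not reply_seen:
            return f"reply missing between {upstream} and {point}"
        upstream = point
    return "exchange complete at every capture point"


# Illustrative observations: the request passes both switches, no reply returns.
observed = [
    ("access switch", True, False),
    ("aggregation switch", True, False),
]
print(locate_arp_loss(observed))
```

When the request is seen at every capture point and the reply at none, the verdict points at the segment nearest the gateway.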
Further investigation showed that version R7595 of the S7506E switch has a known issue in which it does not respond to unicast ARP requests. The issue can be resolved by upgrading the version or applying a hot patch.
Temporary solutions: relocate the gateway to the aggregation switch, or temporarily switch to centralized forwarding through the AC, with the AC acting as the PDA gateway.
The final solution was to apply the hot patch to the S7506E device in this warehouse, which resolved the problem.