Wireless 9800 WLC KPI Blog – Part 3
Part 3 of the 3-part Wireless Catalyst 9800 WLC KPIs
In previous blogs, Wireless Catalyst 9800 WLC KPIs, Part 1 and Wireless Catalyst 9800 WLC KPIs, Part 2, we shared how to check WLC and connections to other devices as well as how to check AP and RF health status.
In this blog, we will focus on Key Performance Indicators for client analysis, WLC packet drops, and packets punted to WLC CPU. I will share methodical steps and outputs that we can collect from WLC to measure the health of clients’ connectivity and WLC forwarding performance.
The KPIs are grouped into the following buckets or areas:
- WLC checks
- Connection with other devices
- AP checks
- RF checks
- Client checks
- Packet Drops
Client Checks
Once we have verified AP and RF health, we can focus on client connectivity. Using “show wireless summary” we can see the total number of clients connected. In addition, we can find out if there are any excluded, disabled, and foreign/anchored clients. We can run this command periodically to check that the number of clients is within the expected values for our deployment and to identify drastic changes in any of the values. The command also shows the number of APs, roles, radios, and their status.
Gladius1#sh wireless summary
Max APs supported : 2000
Max clients supported : 32000

Access Point Summary
                          Total   Up   Down
------------------------------------------
802.11 2.4GHz             1       1    0
802.11 5GHz               4       3    1
802.11 dual-band          2       2    0
802.11 rx-dual-band       0       0    0
Client Serving(2.4GHz)    3       3    0
Client Serving(5GHz)      4       3    1
Monitor                   0       0    0
Sensor                    0       0    0

Client Summary
Total Clients : 6
Excluded : 0
Disabled : 0
Foreign : 0
Anchor : 0
Local : 6
Check for total number of clients, excluded clients, and radio down APs.
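As a minimal sketch of how this check could be automated, the Python snippet below parses the "Client Summary" counters from text we have already collected and raises simple alerts. The function names and the expected client range are illustrative assumptions, not part of any Cisco tool.

```python
import re

# Client Summary section copied from the output above.
SAMPLE = """\
Client Summary
Total Clients : 6
Excluded : 0
Disabled : 0
Foreign : 0
Anchor : 0
Local : 6
"""

def parse_client_summary(text):
    """Return a dict such as {'Total Clients': 6, ...} from 'Name : number' lines."""
    counters = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([A-Za-z ]+?)\s*:\s*(\d+)\s*$", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters

def summary_alerts(counters, expected_min, expected_max):
    """Flag a client count outside the expected range and any excluded clients.

    expected_min and expected_max are hypothetical baseline values that
    each deployment would have to choose for itself.
    """
    alerts = []
    total = counters.get("Total Clients", 0)
    if not expected_min <= total <= expected_max:
        alerts.append(f"total clients {total} outside {expected_min}-{expected_max}")
    if counters.get("Excluded", 0) > 0:
        alerts.append(f"{counters['Excluded']} excluded clients")
    return alerts

counters = parse_client_summary(SAMPLE)
print(counters)
print(summary_alerts(counters, expected_min=1, expected_max=100))
```

Running the same check against each periodic collection is one way to spot drastic changes without reading every output by hand.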
If we see excluded clients, we need to dig further to identify the reason. Determine whether the excluded clients are misconfigured or whether the exclusion has another cause. Clients can be excluded for reasons such as an incorrect password, an IP address that matches another client’s IP address, or multiple association failures. We can see the list of client exclusion policies and their status using the command “sh wireless wps summary”.
We can break down the number of connected clients by their state in the client state machine. This helps us determine whether too many clients are stuck in transient states like Authenticating, IP Learn, Mobility, or Webauth Pending. Use the command: “show wireless stats client detail | i Authenticating :|Mobility :|IP Learn :|Webauth Pending :|Run :|Delete-in-Progress :”
Gladius1#show wireless stats client detail | i Authenticating :|Mobility :|IP Learn :|Webauth Pending :|Run :|Delete-in-Progress :
Authenticating : 0
Mobility : 0
IP Learn : 1
Webauth Pending : 0
Run : 5
Delete-in-Progress : 0
Check for clients in transient states. In this case, we see one client in the IP Learn state.
We will need to investigate further if the number of clients in transient states is not decreasing. The same applies if most of the clients remain in the same transient state for a long period of time.
One example could be if we see a high number of clients stuck in “IP learn”. Then we should review the DHCP server status and connectivity between WLC and DHCP server. For deployments that allow static IP addresses, we can review ARP forwarding.
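The idea of "not decreasing" can be made concrete with a small Python sketch that compares several collected samples. The threshold and the later samples are hypothetical values chosen for illustration; only the first sample comes from the output above.

```python
TRANSIENT_STATES = ("Authenticating", "Mobility", "IP Learn", "Webauth Pending")

def parse_states(text):
    """Parse 'State : count' lines into a dict, skipping anything else."""
    states = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        if sep and value.strip().isdigit():
            states[name.strip()] = int(value)
    return states

def stuck_states(samples, threshold):
    """Return transient states that stay at or above threshold in every
    sample and whose last count is not lower than the first one."""
    flagged = []
    for state in TRANSIENT_STATES:
        counts = [sample.get(state, 0) for sample in samples]
        if min(counts) >= threshold and counts[-1] >= counts[0]:
            flagged.append(state)
    return flagged

# First sample: the output shown above.
first = parse_states("""\
Authenticating : 0
Mobility : 0
IP Learn : 1
Webauth Pending : 0
Run : 5
Delete-in-Progress : 0
""")

# Hypothetical later samples from a busier network, for illustration only.
later = [
    {"Authenticating": 2, "IP Learn": 40, "Run": 300},
    {"Authenticating": 0, "IP Learn": 45, "Run": 310},
]

print(stuck_states([first], threshold=1))
print(stuck_states(later, threshold=10))
```

A suitable threshold depends on the size of the deployment, so it is better derived from a baseline than hard-coded.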
Another example could be if the number of clients stuck in “Webauth” is high. There are several reasons that can cause this. One reason could be web page redirects not being received or not accessible by clients. Another option could be authentication failures when doing web login for guest SSIDs.
The last example could be if we see lots of clients stuck in “Authenticating”. If clients connected to dot1x SSIDs have authentication issues then we should review the Radius server. We need to determine if the issue occurs with one specific Radius server or with several servers at the same time. In the sections below, I will describe how to verify Radius server status.
We can also review client delete reasons and identify any unexpected reason with counters increasing. “Idle timeout” or “Session timeout” would be expected reasons for clients to disconnect. However, “DOT11 denied data rates” or “MIC validation failed” would be unexpected and may require some further analysis. Use the command: “show wireless stats client delete reasons | e :_0”
Gladius1#show wireless stats client delete reasons | e :_0
Total client delete reasons
---------------------------
Controller deletes
---------------------------
Due to mobility failure : 1
DOT11 denied data rates : 5781192
L2-AUTH connection timeout : 2
IP-LEARN connection timeout : 968
Mobility peer delete : 134
----------------------------
Informational Delete Reason
-----------------------------
AP down/disjoin : 690
Session timeout : 661
-----------------------------
Client initiate delete
-----------------------------
AP Deletes
-----------------------------
AP initiated delete for DHCP timeout : 1
AP initiated delete for reassociation timeout : 266
Check for unexpected delete reasons with a high count that keeps increasing. In this case, DOT11 denied data rates.
In one of the largest worldwide wireless events, we monitored delete reasons, excluding the ones showing zero hits. We could spot a delete reason that was consistently increasing over time. Using always-on tracing we found that the clients deleted for that reason were all connecting to one specific SSID. When reviewing the SSID configuration we could isolate a configuration mistake causing the disconnections. After correcting the configuration, no further client deletes for unexpected reasons were seen. We could proactively spot an issue, find the root cause, and fix it, all without having to wait for end clients to complain before starting the troubleshooting process.
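To decide whether a counter is "increasing", we need at least two collections. The Python sketch below compares two snapshots and reports unexpected reasons that grew by more than a chosen amount. The second snapshot and the `min_delta` value are hypothetical; the set of expected reasons contains only the two named in the text.

```python
import re

# Reasons the text describes as expected; everything else is treated
# as unexpected in this sketch.
EXPECTED_REASONS = {"Idle timeout", "Session timeout"}

def parse_counters(text):
    """Collect 'Reason : count' lines into a dict of integers."""
    pairs = re.findall(r"^(.+?)[ \t]*:[ \t]*(\d+)[ \t]*$", text, re.MULTILINE)
    return {name.strip(): int(count) for name, count in pairs}

def unexpected_increases(before, after, min_delta):
    """Return {reason: delta} for unexpected reasons that grew by min_delta or more."""
    result = {}
    for reason, count in after.items():
        delta = count - before.get(reason, 0)
        if reason not in EXPECTED_REASONS and delta >= min_delta:
            result[reason] = delta
    return result

# First snapshot: counters from the output above.
before = parse_counters("""\
Controller deletes
---------------------------
DOT11 denied data rates : 5781192
IP-LEARN connection timeout : 968
Session timeout : 661
""")

# Second snapshot: hypothetical values for illustration only.
after = parse_counters("""\
DOT11 denied data rates : 5790000
IP-LEARN connection timeout : 968
Session timeout : 700
""")

print(unexpected_increases(before, after, min_delta=100))
```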
The WLC also has a list of predefined possible failures with counters. We can check these counters to identify potential issues and detect them proactively. Use the command: “show wireless stats trace-on-failure | ex :_0”
Gladius1#show wireless stats trace-on-failure | ex :_0
----------------------------------------------------------
Wireless Trace On Failure Statistics
----------------------------------------------------------
006. Export client MM....................................: 1
018. Capwap configuration status failure.................: 46136
020. Client association failure..........................: 5
021. Client MAB authentication failure...................: 5781677
023. Client stage timeout................................: 1642
025. Client mobility clean up............................: 1
027. DTLS handshake failure..............................: 2
030. DTLS no configuration packet drop...................: 5
032. DTLS invalid hello packet drop......................: 168
034. SANET AUTHC failure.................................: 6
Check for failures with high count and increasing. In this case, MAB authentication failures.
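A short Python sketch can rank these failures so the largest counters stand out. It assumes the line layout shown in the output above (a three-digit index, a name padded with dots, and a count); `top_failures` is an illustrative helper name.

```python
import re

# A few lines copied from the output above.
SAMPLE = """\
006. Export client MM....................................: 1
018. Capwap configuration status failure.................: 46136
020. Client association failure..........................: 5
021. Client MAB authentication failure...................: 5781677
023. Client stage timeout................................: 1642
032. DTLS invalid hello packet drop......................: 168
"""

LINE = re.compile(r"^(\d{3})\.\s+(.+?)\.+:\s*(\d+)\s*$")

def top_failures(text, limit):
    """Return up to `limit` (index, name, count) tuples, highest count first."""
    rows = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            rows.append((match.group(1), match.group(2), int(match.group(3))))
    return sorted(rows, key=lambda row: row[2], reverse=True)[:limit]

for index, name, count in top_failures(SAMPLE, limit=3):
    print(index, name, count)
```

Ranking only shows which counters are high; whether they are still increasing requires comparing two collections taken at different times.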
If we are using dot1x and Radius servers, we will need to monitor the status of the Radius servers. IOS-XE uses dead-time and dead criteria to determine the status of a Radius server. Those parameters allow the device to identify a Radius server that is not responding to requests, and to switch over to a secondary Radius server. The server is declared dead once the dead criteria are met. The dead criteria specify the number of tries that should fail and the time with no response from the server. Both criteria must be met to declare the server dead. The server remains in dead status until the dead-time expires.
We can check if there is any dead server at this moment and the number of times a server has been declared dead. This helps us diagnose issues with a specific Radius server caused by loss of connectivity or by misbehavior on the Radius or WLC side. Use the command: “show aaa servers | i Platform Dead: total|RADIUS: id”
Gladius1#show aaa servers | i Platform Dead: total|RADIUS: id
RADIUS: id 1, priority 1, host 192.168.0.98, auth-port 1645, acct-port 1646, hostname ISE
SMD Platform Dead: total time 301s, count 2
Platform Dead: total time 179s, count 10UP
RADIUS: id 2, priority 2, host 192.168.0.99, auth-port 1812, acct-port 1813, hostname ISE3
SMD Platform Dead: total time 0s, count 0
Platform Dead: total time 0s, count 0
Check for platform dead time and count to identify Radius servers that had issues.
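As a rough sketch, the Python below groups the "Platform Dead" lines under the Radius server that precedes them and lists the hosts that have been declared dead at least once. It assumes the filtered layout shown above, where each server line is followed by its dead counters; the helper names are illustrative.

```python
import re

# Output copied from the example above.
SAMPLE = """\
RADIUS: id 1, priority 1, host 192.168.0.98, auth-port 1645, acct-port 1646, hostname ISE
SMD Platform Dead: total time 301s, count 2
Platform Dead: total time 179s, count 10UP
RADIUS: id 2, priority 2, host 192.168.0.99, auth-port 1812, acct-port 1813, hostname ISE3
SMD Platform Dead: total time 0s, count 0
Platform Dead: total time 0s, count 0
"""

def dead_events_by_host(text):
    """Map each Radius host to a list of (dead_seconds, dead_count) tuples."""
    events = {}
    host = None
    for line in text.splitlines():
        server = re.search(r"RADIUS: id \d+, priority \d+, host (\S+),", line)
        if server:
            host = server.group(1)
            events[host] = []
            continue
        dead = re.search(r"Platform Dead: total time (\d+)s, count (\d+)", line)
        if dead and host is not None:
            events[host].append((int(dead.group(1)), int(dead.group(2))))
    return events

def hosts_with_dead_events(events):
    """Return hosts where any dead counter is above zero."""
    return [host for host, rows in events.items()
            if any(count > 0 for _, count in rows)]

events = dead_events_by_host(SAMPLE)
print(events)
print(hosts_with_dead_events(events))
```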
Radius status is displayed per WNCD. It is possible for the same Radius server to be marked as dead for some WNCDs and alive for others. Each AP belongs to a WNCD. There is a command to check the APs assigned to each WNCD: “show wireless load-balance ap affinity WNCD <0-7>”. If clients connected to APs in one specific WNCD send Radius requests and those requests get no response, the Radius status for that WNCD will be DEAD. At the same time, clients in another WNCD may not be sending any Radius requests, or may still be getting responses, so the server stays alive for that WNCD.
For Radius marked as DEAD, we need to check if the Radius server is reachable and replying to authentication and accounting requests. Radius statistics will help us to identify if we are missing any responses for authentication or for accounting, the average time to reply, the number of access rejects and accepts, and latency distribution. Use the command: “show radius statistics”
Gladius1#show radius statistics
                               Auth.    Acct.    Both
Maximum inQ length:            NA       NA       1
Maximum waitQ length:          NA       NA       14
Maximum doneQ length:          NA       NA       1
Total responses seen:          279      0        279
Packets with responses:        279      0        279
Packets without responses:     0        396      396
Access Rejects :               2
Access Accepts :               20
Average response delay(ms):    10       0        10
Maximum response delay(ms):    173      0        173
Number of Radius timeouts:     0        4542     4542
Duplicate ID detects:          0        0        0
Buffer Allocation Failures:    0        0        0
Maximum Buffer Size (bytes):   764      780      780
Malformed Responses :          0        0        0
Bad Authenticators :           0        0        0
Unknown Responses :            0        0        0
Source Port Range: (2 ports only)
1645 - 1646
Last used Source Port/Identifier:
1645/0
1646/3
Elapsed time since counters last cleared: 3w3d20h41m
Radius Latency Distribution:
<= 2ms :    181      0
3-5ms  :    32       0
5-10ms :    13       0
10-20ms:    14       0
20-50ms:    17       0
50-100m:    20       0
100ms  :    2        0
Check for requests without responses, timeouts, and high latency.
At one customer we were troubleshooting dot1x client connectivity issues and found that the failures were caused by the Radius server being marked as dead. When reviewing the outputs, we could see that Radius was replying to authentications but was no longer replying to accounting packets. A workaround to minimize the impact was to disable the accounting list so that the WLC stopped sending accounting packets while the Radius administrators troubleshot the accounting issue on the server.
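One way to quantify "requests without response" is to compute the share of unanswered packets per column. The Python sketch below does this for a few rows taken from the output above; the column indices and helper names are assumptions made for the example.

```python
import re

# Rows copied from the output above (columns: Auth., Acct., Both).
SAMPLE = """\
Maximum inQ length: NA NA 1
Packets with responses: 279 0 279
Packets without responses: 0 396 396
Average response delay(ms): 10 0 10
Number of Radius timeouts: 0 4542 4542
"""

AUTH, ACCT, BOTH = 0, 1, 2

def to_int(value):
    """Convert a counter to int, keeping 'NA' as None."""
    return int(value) if value.isdigit() else None

def parse_three_columns(text):
    """Return {row name: (auth, acct, both)} for rows with three values."""
    rows = {}
    for line in text.splitlines():
        match = re.match(r"^\s*(.+?):\s+(\S+)\s+(\S+)\s+(\S+)\s*$", line)
        if match:
            rows[match.group(1).strip()] = tuple(to_int(v) for v in match.groups()[1:])
    return rows

def unanswered_ratio(rows, column):
    """Share of packets in one column that never received a response."""
    answered = rows["Packets with responses"][column]
    unanswered = rows["Packets without responses"][column]
    total = answered + unanswered
    return unanswered / total if total else 0.0

rows = parse_three_columns(SAMPLE)
print("auth unanswered ratio:", unanswered_ratio(rows, AUTH))
print("acct unanswered ratio:", unanswered_ratio(rows, ACCT))
```

In this sample every authentication packet was answered while no accounting packet was, which is the kind of asymmetry worth following up with the Radius administrators.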
Packet drops and punted to CPU Checks
Now we can check if there are any scalability issues due to the oversubscription of any of the WLC components. I would start by looking at the volume of traffic received and transmitted by the physical interfaces, and then review the number of broadcasts/multicasts and the input and output drops. If we have a baseline, we can compare the current volume of traffic with it and look for discrepancies. Use the command: “show int po1 | i line protocol|put rate|drops|broadcast”. Replace po1 with the physical or logical interface used in your setup.
Gladius1#show int po1 | i line protocol|put rate|drops|broadcast
Port-channel1 is up, line protocol is up
  Input queue: 0/375/0/0 (size/max/drops/flushes); Total output drops: 0
  5 minute input rate 39000 bits/sec, 42 packets/sec
  5 minute output rate 14000 bits/sec, 12 packets/sec
     Received 9389675 broadcasts (34521510 multicasts)
     Output 45735 broadcasts (1075205 multicasts)
     0 unknown protocol drops
Check for the volume of traffic input/output, drops, and broadcasts tx/rx
We can review the packets dropped by the WLC and the reasons for those drops. When monitoring drops, it is important to identify which reasons account for the high volume of packet drops, and then to find out how fast those drop counters are increasing. To do that, we need to collect the same output several times with a time reference. Enabling “terminal exec prompt timestamps” or collecting “show clock” gives us those time references. The time-referenced outputs are key to isolating the drops that have an impact. Use the command: “show platform hardware chassis active qfp statistics drop”
Gladius1#show platform hardware chassis active qfp statistics drop
Last clearing of QFP drops statistics : never
-------------------------------------------------------------------------
Global Drop Stats              Packets        Octets
-------------------------------------------------------------------------
CGACLDrop                      31             7812
Disabled                       635            105934
InvL2Hdr                       701            206223
IpFormatErr                    68             4488
Ipv4NoAdj                      67749          6910538
Ipv4NoRoute                    6              376
Ipv6NoRoute                    1096           61376
Ipv6mcNoRoute                  77683          9477326
SWPortMacConflict              50316          5874782
SwitchL2mLookupMiss            17568          6681680
TailDrop                       54199          29501684
UnconfiguredIpv4Fia            3              242
UnconfiguredIpv6Fia            1564372        186850863
WlsCapwapError                 1018           233293
WlsCapwapReassFragConsume      1064           1231968
WlsClientError                 3116           112631
Check for drop reasons with a high number of packets, and fragmentation/reassembly drops.
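To see how fast a drop counter is growing, we can turn two time-referenced collections into a packets-per-second rate. This Python sketch assumes we already know the number of seconds between the two collections; the second sample and the 60-second interval are hypothetical values for illustration.

```python
def parse_drops(text):
    """Return {drop reason: packets} from 'Reason Packets Octets' rows."""
    drops = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1].isdigit() and parts[2].isdigit():
            drops[parts[0]] = int(parts[1])
    return drops

def drop_rates(first, second, seconds):
    """Return {reason: packets per second} for counters that increased."""
    rates = {}
    for reason, packets in second.items():
        delta = packets - first.get(reason, 0)
        if delta > 0:
            rates[reason] = delta / seconds
    return rates

# First sample: rows copied from the output above.
first = parse_drops("""\
Global Drop Stats Packets Octets
TailDrop 54199 29501684
WlsCapwapReassFragConsume 1064 1231968
WlsClientError 3116 112631
""")

# Second sample: hypothetical counters collected 60 seconds later.
second = parse_drops("""\
TailDrop 54799 29828284
WlsCapwapReassFragConsume 1064 1231968
WlsClientError 3116 112631
""")

print(drop_rates(first, second, seconds=60))
```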
One more check that we should do is to analyze the number of packets sent to the control plane (punted) of the WLC for processing. We can monitor the number of packets punted for each reason and check for abnormal volume. We can correlate an increase of punted packets with high CPU utilization events. Use the command: “show platform hardware chassis active qfp feature wireless punt statistics”
Gladius1#show platform hardware chassis active qfp feature wireless punt statistics
CPP Wireless Punt stats:

App Tag                                 Packet Count
-------                                 ------------
CAPWAP_PKT_TYPE_DOT11_PROBE_REQ         986190
CAPWAP_PKT_TYPE_DOT11_MGMT              10031
CAPWAP_PKT_TYPE_DOT11_IAPP              2975298
CAPWAP_PKT_TYPE_DOT11_DOT1X             24901
CAPWAP_PKT_TYPE_CAPWAP_KEEPALIVE        228099
CAPWAP_PKT_TYPE_CAPWAP_CNTRL            1628480
CAPWAP_PKT_TYPE_CAPWAP_DATA_PAT         33
CAPWAP_PKT_TYPE_MOBILITY_CNTRL          58091
SISF_PKT_TYPE_ARP                       218545290
SISF_PKT_TYPE_DHCP                      15455
SISF_PKT_TYPE_DHCP6                     7772
SISF_PKT_TYPE_IPV6_ND                   199108
SISF_PKT_TYPE_DATA_GLEAN                7
SISF_PKT_TYPE_DATA_GLEAN_V6             100
Check for a high number of punted packets that keeps increasing over time.
We can also identify whether we are seeing any buffer failures and determine the size of the buffers that are reaching their maximum. Use the command: “show buffers | i buffers|failures”
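A quick way to spot an abnormal punt reason is to look at each reason's share of all punted packets. The Python sketch below does this for the counters shown above; the helper names are illustrative, and what counts as an abnormal share is an assumption each deployment would need to set from its own baseline.

```python
# Counters copied from the output above.
SAMPLE = """\
CAPWAP_PKT_TYPE_DOT11_PROBE_REQ 986190
CAPWAP_PKT_TYPE_DOT11_MGMT 10031
CAPWAP_PKT_TYPE_DOT11_IAPP 2975298
CAPWAP_PKT_TYPE_DOT11_DOT1X 24901
CAPWAP_PKT_TYPE_CAPWAP_KEEPALIVE 228099
CAPWAP_PKT_TYPE_CAPWAP_CNTRL 1628480
CAPWAP_PKT_TYPE_CAPWAP_DATA_PAT 33
CAPWAP_PKT_TYPE_MOBILITY_CNTRL 58091
SISF_PKT_TYPE_ARP 218545290
SISF_PKT_TYPE_DHCP 15455
SISF_PKT_TYPE_DHCP6 7772
SISF_PKT_TYPE_IPV6_ND 199108
SISF_PKT_TYPE_DATA_GLEAN 7
SISF_PKT_TYPE_DATA_GLEAN_V6 100
"""

def parse_punts(text):
    """Return {app tag: packet count} from 'TAG count' rows."""
    punts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            punts[parts[0]] = int(parts[1])
    return punts

def punt_shares(punts):
    """Return (tag, share of total) tuples, largest share first."""
    total = sum(punts.values())
    if total == 0:
        return []
    shares = [(tag, count / total) for tag, count in punts.items()]
    return sorted(shares, key=lambda item: item[1], reverse=True)

punts = parse_punts(SAMPLE)
for tag, share in punt_shares(punts)[:3]:
    print(f"{tag}: {share:.1%}")
```

In this sample ARP accounts for the large majority of punted packets, which is the kind of imbalance worth correlating with CPU utilization.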
Gladius1#show buffers | i buffers|failures
Small buffers, 104 bytes (total 1200, permanent 1200):
     0 failures (0 no memory)
Middle buffers, 600 bytes (total 900, permanent 900):
     35 failures (35 no memory)
Big buffers, 1536 bytes (total 900, permanent 900, peak 901 @ 2w6d):
     0 failures (0 no memory)
VeryBig buffers, 4520 bytes (total 100, permanent 100, peak 101 @ 2w6d):
     0 failures (0 no memory)
Large buffers, 5024 bytes (total 100, permanent 100, peak 101 @ 2w6d):
     0 failures (0 no memory)
VeryLarge buffers, 8304 bytes (total 100, permanent 100):
     0 failures (0 no memory)
Huge buffers, 18024 bytes (total 20, permanent 20, peak 21 @ 2w6d):
     0 failures (0 no memory)
Check for buffer failures and identify buffer size.
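The pairing of each buffer pool with its failure counter can be sketched in Python as follows. The regular expression assumes the layout shown above, where each pool header is followed by its failures line; `pools_with_failures` is an illustrative helper name.

```python
import re

# A few pools copied from the output above.
SAMPLE = """\
Small buffers, 104 bytes (total 1200, permanent 1200):
     0 failures (0 no memory)
Middle buffers, 600 bytes (total 900, permanent 900):
     35 failures (35 no memory)
Big buffers, 1536 bytes (total 900, permanent 900, peak 901 @ 2w6d):
     0 failures (0 no memory)
"""

POOL = re.compile(
    r"(\w+) buffers, (\d+) bytes .*?:\s*(\d+) failures \((\d+) no memory\)"
)

def pools_with_failures(text):
    """Return (pool, size_bytes, failures, no_memory) for pools with failures."""
    result = []
    for name, size, failures, no_memory in POOL.findall(text):
        if int(failures) > 0:
            result.append((name, int(size), int(failures), int(no_memory)))
    return result

print(pools_with_failures(SAMPLE))
```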
The last check could be data plane utilization. We can find out whether the WLC is having data plane performance issues due to traffic volume or to specific features that are enabled. Use the command shared in the WLC checks: “show platform hardware chassis active qfp datapath utilization | i Load”
These KPIs were helpful in identifying a customer issue. The customer observed a periodic sharp increase in the number of ARP packets punted to the CPU. By monitoring the counter for ARPs punted to the CPU and collecting a packet capture in the control plane, we could identify that those ARPs were sent from a few specific MAC addresses that were doing malicious ARP scanning.
With this final bucket, we finish the Key Performance Indicators (KPIs) for Catalyst 9800 WLC.
List of commands to use for KPIs and automation scripts
In the document below, there is also a link to a script that will automatically collect all the commands. It will collect commands based on platform and release, save them in a file, and export the file. The script uses the “Guest-shell” feature, which for now is only available on the physical 9800-40/80 and 9800-L WLCs.
The document also provides an example of an EEM script to collect logs periodically. In conclusion, EEM along with the “Guest-shell” script will help to collect 9800 WLC KPIs and have a baseline for your Catalyst 9800 WLC.
For the list of commands used to monitor those KPIs