Wireless Catalyst 9800 WLC KPIs, Part 2
Part 2 of the 3-part Wireless Catalyst 9800 WLC KPIs series
In the previous blog, Wireless Catalyst 9800 WLC KPIs, Part 1, we shared how to check the WLC and its connections to other devices.
In this blog, we will concentrate on Key Performance Indicators (KPIs) for Access Points (APs) and Radio Frequency (RF). I will share approaches and commands to measure the health of the APs and RF.
The KPIs are grouped into the following buckets or areas:
- WLC checks
- Connection with other devices
- AP checks
- RF checks
- Client checks
- Packet drops
AP Checks
Now let’s focus on AP health. First, we can check the total number of APs connected to our WLC and confirm that it matches the expected number. Use command: “show ap sum | i Number of APs”. If the AP count is not correct, we need to identify the missing APs, the reason for the disconnection, and/or why they have not been able to rejoin the controller. As a starting point, it is useful to keep a complete list of APs from a working scenario, including their Ethernet MAC and IP addresses (“show ap summary”).
Gladius1#show ap sum
Load for five secs: 0%/0%; one minute: 0%; five minutes: 0%
Time source is NTP, 19:18:03.363 CEST Wed May 25 2022

Number of APs: 8

AP Name                   Slots  AP Model          Ethernet MAC    Radio MAC       Location          Country  IP Address       State
-----------------------------------------------------------------------------------------------------------------------------------------
AP3800-r2sw1-te1-0-8      2      AIR-AP3802I-E-K9  0042.68a0.fc4a  0062.ecf3.8310  default location  DE       192.168.127.108  Registered
9130i-r2sw1-te2016        3      C9130AXI-E        04eb.409e.14c0  04eb.409f.0c60  default location  DE       192.168.25.133   Registered
9130i-r2sw1-te2015        3      C9130AXI-E        04eb.409e.1724  04eb.409f.1f80  default location  DE       192.168.25.122   Registered
9130i-r3-sw2-g1-0-10      3      C9130AXI-B        04eb.409e.1d28  04eb.409f.4fa0  default location  US       192.168.127.113  Registered
AP1562-r3-sw-3-gi1-0-3    2      AIR-AP1562E-E-K9  0062.ec80.8c8c  2c33.1192.3e40  default location  DE       192.168.127.106  Registered
SS-I-1                    2      C9115AXI-B        7069.5a74.7a50  7069.5a78.7780  default location  US       192.168.127.97   Registered
ap3800i-r2-sw1-te1-0-5    2      AIR-AP3802I-E-K9  0042.68c5.bdf0  cc16.7e5f.f000  default location  CH       192.168.127.109  Registered
9120i-r4-sw2-te1-0-39     2      C9120AXI-E        d4e8.8019.60e8  d4e8.801a.3340  default location  DE       192.168.127.114  Registered
Check the AP count, and keep a list of the Ethernet MAC and IP addresses of all the APs.
We can compare the output of working vs non-working scenarios to quickly identify and locate the missing devices.
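To compare a working baseline with the current state, we can diff two saved “show ap summary” captures by Ethernet MAC. The Python below is a minimal sketch and not a Cisco tool. It assumes both outputs were saved as text. The helper names (reported_count, parse_ap_summary, missing_aps) are hypothetical. The sample rows are copied from the output above.

```python
import re

# Assumption: MACs use the dotted xxxx.xxxx.xxxx format shown in the output above.
MAC = r"[0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4}"
# AP row: name, slots, model, Ethernet MAC, radio MAC, location/country (free text), IPv4, state.
AP_ROW = re.compile(
    rf"^(?P<name>\S+)\s+\d+\s+\S+\s+(?P<eth>{MAC})\s+{MAC}\s+.*?"
    r"(?P<ip>\d+\.\d+\.\d+\.\d+)\s+(?P<state>\S+)$"
)
COUNT = re.compile(r"Number of APs:\s*(\d+)")


def reported_count(text):
    """Value of the 'Number of APs:' line, or None if it is absent."""
    m = COUNT.search(text)
    return int(m.group(1)) if m else None


def parse_ap_summary(text):
    """Map Ethernet MAC -> (AP name, IP address) for every AP row."""
    aps = {}
    for line in text.splitlines():
        m = AP_ROW.match(line.strip())
        if m:
            aps[m["eth"]] = (m["name"], m["ip"])
    return aps


def missing_aps(baseline_text, current_text):
    """APs present in the working baseline but absent from the current capture."""
    current = parse_ap_summary(current_text)
    return {mac: info for mac, info in parse_ap_summary(baseline_text).items()
            if mac not in current}


# Two rows from the output above; the "current" capture omits SS-I-1.
baseline = """Number of APs: 2
AP3800-r2sw1-te1-0-8  2  AIR-AP3802I-E-K9  0042.68a0.fc4a  0062.ecf3.8310  default location  DE  192.168.127.108  Registered
SS-I-1                2  C9115AXI-B        7069.5a74.7a50  7069.5a78.7780  default location  US  192.168.127.97   Registered
"""
current = "Number of APs: 1\n" + baseline.splitlines()[1] + "\n"

print(reported_count(current), missing_aps(baseline, current))
```

We key on the Ethernet MAC rather than the AP name because names can change, while the MAC identifies the hardware.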
Even if we see the expected number of APs connected to our WLC, we need to check whether those APs are stable. The WLC has a command that lets us easily check AP uptime (reloads) and validate CAPWAP tunnel reliability. Use command: “show ap uptime | ex ____([0-9])+ day”. The “exclude” keyword helps us focus on APs reloaded or disconnected within the last day.
Gladius2#sh ap uptime
Number of APs: 8

AP Name                 Ethernet MAC    Radio MAC       AP Up Time                             Association Up Time
---------------------------------------------------------------------------------------------------------------------------------------------------
AP3800-r2sw1-te1-0-8    0042.68a0.fc4a  0062.ecf3.8310  26 days 0 hour 57 minutes 41 seconds   15 days 1 hour 50 minutes 4 seconds
9130i-r2sw1-te2015      04eb.409e.1724  04eb.409f.1f80  9 days 3 hours 26 minutes 48 seconds   9 days 3 hours 24 minutes 24 seconds
9130i-r2sw1-te2016      04eb.409e.14c0  04eb.409f.0c60  9 days 1 hour 39 minutes 29 seconds    9 days 1 hour 26 minutes 47 seconds
9120i-r4-sw2-te1-0-39   d4e8.8019.60e8  d4e8.801a.3340  8 days 1 hour 36 minutes 57 seconds    8 days 1 hour 33 minutes 49 seconds
SS-I-1                  7069.5a74.7a50  7069.5a78.7780  26 days 0 hour 54 minutes 57 seconds   22 minutes 15 seconds
ap3800i-r2-sw1-te1-0-5  0042.68c5.bdf0  cc16.7e5f.f000  26 days 0 hour 46 minutes 12 seconds   22 minutes 13 seconds
9130i-r3-sw2-g1-0-10    04eb.409e.1d28  04eb.409f.4fa0  22 minutes 21 seconds                  19 minutes 39 seconds
Check AP uptime and association uptime. In this case, SS-I-1 and ap3800i-r2-sw1-te1-0-5 faced a disconnection (long AP uptime, short association uptime), while 9130i-r3-sw2-g1-0-10 reloaded (short AP uptime).
With the above command, we can find whether any unexpected AP reloads occurred and whether several APs reloaded at the same time. If the reloaded APs were in the same location or connected to the same switch, that could point to a network or power issue in that location or switch. Similarly, for AP disconnections, we can compare “Association Up Time” values to identify patterns, determine whether there were any unexpected tunnel teardowns, and find out when they occurred. Keep in mind that APs will restart the CAPWAP tunnel after some specific configuration changes, for example when a new tag is applied.
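The reload-versus-disconnection reasoning above can be automated by converting both uptime columns to seconds. The Python below is a minimal sketch. It assumes each uptime column ends with a seconds field, as in the output above. The 1-day window mirrors the “show ap uptime” filter. The function names are hypothetical.

```python
import re

UNIT_SECONDS = {"day": 86400, "hour": 3600, "minute": 60, "second": 1}
PART = re.compile(r"(\d+)\s+(day|hour|minute|second)s?")
MAC = r"[0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4}"
# Row: AP name, Ethernet MAC, radio MAC, AP up time, association up time.
# Assumption: both uptime columns end with "second(s)", as in the output above.
ROW = re.compile(
    rf"^(?P<name>\S+)\s+{MAC}\s+{MAC}\s+"
    r"(?P<ap_up>.*?seconds?)\s+(?P<assoc_up>.*?seconds?)\s*$"
)


def to_seconds(uptime):
    """'26 days 0 hour 57 minutes 41 seconds' -> total seconds."""
    return sum(int(n) * UNIT_SECONDS[unit] for n, unit in PART.findall(uptime))


def classify(text, window=86400):
    """Label each AP 'reloaded', 'disconnected' or 'stable' relative to `window` seconds."""
    result = {}
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if not m:
            continue
        if to_seconds(m["ap_up"]) < window:
            result[m["name"]] = "reloaded"       # the AP itself restarted
        elif to_seconds(m["assoc_up"]) < window:
            result[m["name"]] = "disconnected"   # AP stayed up, CAPWAP tunnel re-established
        else:
            result[m["name"]] = "stable"
    return result


uptime = """\
AP3800-r2sw1-te1-0-8  0042.68a0.fc4a  0062.ecf3.8310  26 days 0 hour 57 minutes 41 seconds  15 days 1 hour 50 minutes 4 seconds
SS-I-1                7069.5a74.7a50  7069.5a78.7780  26 days 0 hour 54 minutes 57 seconds  22 minutes 15 seconds
9130i-r3-sw2-g1-0-10  04eb.409e.1d28  04eb.409f.4fa0  22 minutes 21 seconds  19 minutes 39 seconds
"""
print(classify(uptime))
```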
If “AP Up Time” is lower than expected and not due to a general reload, we can review whether any AP crashes are reported in the WLC and examine the bootflash content for related crash files. Use command: “show ap crash-file” or “dir bootflash: | i crash”.
Gladius1#show ap crash-file
File Location: BOOTFLASH

AP Name                 Crash File                                Radio Slot 0  Radio Slot 1
-------------------------------------------------------------------------------------------------------------------------------
ap3800i-r2-sw1-te0-1    ap3800i-r2-sw1-te0-1_0062ecaade80.crash

Gladius1#dir bootflash: | i crash
54  -rw-   50476  May 9 2022 13:07:34 +02:00   ap3800i-r2-sw1-te0-1_0062ecaade80.crash
66  -rw-  120276  Jan 26 2022 11:46:55 +01:00  AP9120-2-r3-sw2-Gi1-0-39_d4e88019f140.crash
28  -rw-   93952  Nov 2 2021 13:02:21 +01:00   SS-E-2_00eeab18c160.crash
12  -rw-   42975  Oct 27 2021 15:01:44 +02:00  9115i-r4-sw2-te1-0-38_f80f6f154ce0.crash
42  -rw-   42235  May 15 2021 14:24:59 +02:00  9115i-r3-sw2-te1-0-38_f80f6f154960.crash
41  -rw-   26063  Mar 30 2021 13:06:45 +02:00  9115i-r3-sw2-te1-0-38_f80f6f154c80.crash
Check for new AP crashes, multiple crashes from the same AP, and periodic crashes.
It is advisable to review the bootflash content from time to time to locate new crashes. If there are any new crashes, download them and share them with TAC for root cause analysis. Finally, remove old ones to keep the file system clean.
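A saved “dir bootflash: | i crash” listing can be checked for new and repeated crash files with a few lines of Python. This is a minimal sketch. It assumes the <AP name>_<12 hex digits>.crash file naming seen above, and it ignores the UTC offset when comparing dates. The function names are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime

# Line: index, permissions, size, date and time, UTC offset, <AP name>_<12 hex>.crash
LINE = re.compile(
    r"^\s*\d+\s+\S+\s+\d+\s+(?P<ts>\w{3}\s+\d{1,2}\s+\d{4}\s+\d{2}:\d{2}:\d{2})\s+\S+\s+"
    r"(?P<ap>.+)_(?P<mac>[0-9a-f]{12})\.crash\s*$"
)


def parse_crash_files(text):
    """Return (timestamp, AP name, MAC) tuples; the UTC offset is ignored in this sketch."""
    files = []
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            ts = datetime.strptime(" ".join(m["ts"].split()), "%b %d %Y %H:%M:%S")
            files.append((ts, m["ap"], m["mac"]))
    return files


def new_crashes(files, since):
    """Crash files written at or after `since`, e.g. the date of the last review."""
    return [f for f in files if f[0] >= since]


def repeated_crashes(files):
    """AP names with more than one crash file, mapped to the MACs in the file names."""
    by_name = defaultdict(list)
    for _, ap, mac in files:
        by_name[ap].append(mac)
    return {ap: macs for ap, macs in by_name.items() if len(macs) > 1}


dir_output = """\
54 -rw- 50476 May 9 2022 13:07:34 +02:00 ap3800i-r2-sw1-te0-1_0062ecaade80.crash
66 -rw- 120276 Jan 26 2022 11:46:55 +01:00 AP9120-2-r3-sw2-Gi1-0-39_d4e88019f140.crash
28 -rw- 93952 Nov 2 2021 13:02:21 +01:00 SS-E-2_00eeab18c160.crash
12 -rw- 42975 Oct 27 2021 15:01:44 +02:00 9115i-r4-sw2-te1-0-38_f80f6f154ce0.crash
42 -rw- 42235 May 15 2021 14:24:59 +02:00 9115i-r3-sw2-te1-0-38_f80f6f154960.crash
41 -rw- 26063 Mar 30 2021 13:06:45 +02:00 9115i-r3-sw2-te1-0-38_f80f6f154c80.crash
"""
files = parse_crash_files(dir_output)
print([ap for _, ap, _ in new_crashes(files, datetime(2022, 5, 1))])
print(repeated_crashes(files))
```

In this listing, the two files for 9115i-r3-sw2-te1-0-38 carry different MACs. They likely come from two different APs that used the same name, so compare the MACs before concluding that one AP crashed repeatedly.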
If we observe AP disconnections, we can establish the most common termination event and the AP state at that moment. This gives us a global picture. Use command: “show wireless stats ap session termination”.
Gladius1#show wireless stats ap session termination

Event                    Previous State    Occurance Count
------------------------------------------------------------------------------------
DTLS session closed      JOINED            6
Heartbeat timer expiry   JOINED            2
Reset by API             IMAGE_DOWNLOAD    1
Image download status    IMAGE_DOWNLOAD    6
Reset by API             RUN               3
DTLS session closed      RUN               17
Heartbeat timer expiry   RUN               6
Check the events with the highest count. If the AP was in RUN state, disconnections could be due to persistent packet drops.
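To rank termination events, we can total the counts per event and per previous state. The Python below is a minimal sketch over a saved copy of the output above. It assumes the state column is an upper-case token such as RUN, JOINED or IMAGE_DOWNLOAD. The name summarize is hypothetical.

```python
import re
from collections import Counter

# Row: event text, previous state (upper-case token), occurrence count.
ROW = re.compile(r"^(?P<event>.+?)\s+(?P<state>[A-Z_]+)\s+(?P<count>\d+)\s*$")


def summarize(text):
    """Total occurrences per event and per previous state."""
    by_event, by_state = Counter(), Counter()
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            by_event[m["event"]] += int(m["count"])
            by_state[m["state"]] += int(m["count"])
    return by_event, by_state


termination = """\
Event                    Previous State    Occurance Count
------------------------------------------------------------
DTLS session closed      JOINED            6
Heartbeat timer expiry   JOINED            2
Reset by API             IMAGE_DOWNLOAD    1
Image download status    IMAGE_DOWNLOAD    6
Reset by API             RUN               3
DTLS session closed      RUN               17
Heartbeat timer expiry   RUN               6
"""
by_event, by_state = summarize(termination)
print(by_event.most_common(2))
print(by_state.most_common())
```

In this sample, "DTLS session closed" dominates, and most terminations happened in RUN state. That is the pattern the text above associates with possible packet drops.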
After that, we can drill down using the AP history command to get more detailed information per AP. Filtering the AP history by disconnections shows whether several APs disconnected at the same time, and the disconnect reason for each AP. By analyzing the command output, we can also determine whether the same AP disconnects multiple times, and how periodic those disconnections are. Use command: “show wireless stats ap history | i Disjoined”.
Gladius1#show wireless stats ap history | i Disjoined
ap3800i-r2-sw1-te0-1      0042.68a0.ee78  Disjoined  05/24/22 12:27:39  NA  DTLS close alert from peer
ap3800i-r2-sw1-te0-1      0042.68a0.ee78  Disjoined  05/24/22 12:24:26  NA  DTLS close alert from peer
ap3800i-r2-sw1-te0-1      0042.68a0.ee78  Disjoined  05/24/22 12:17:47  NA  DTLS close alert from peer
ap3800i-r2-sw1-te0-1      0042.68a0.ee78  Disjoined  05/24/22 11:41:17  NA  DTLS close alert from peer
ap3800i-r2-sw1-te0-1      0042.68a0.ee78  Disjoined  05/24/22 11:38:04  NA  DTLS close alert from peer
ap3800i-r2-sw1-te0-1      0042.68a0.ee78  Disjoined  05/24/22 10:18:04  NA  DTLS close alert from peer
ap3800i-r2-sw1-te0-1      0042.68a0.ee78  Disjoined  05/09/22 13:02:28  NA  Heart beat timer expiry
ap3800i-r2-sw1-te0-1      0042.68a0.ee78  Disjoined  05/09/22 10:49:34  NA  Heart beat timer expiry
ap3800i-r2-sw1-te0-1      0042.68a0.ee78  Disjoined  05/05/22 19:53:31  NA  Failure decoding wtp descriptor
ap3800i-r3-sw2-Gi1-0-37   0042.68a1.03d2  Disjoined  05/12/22 12:02:38  NA  DTLS close alert from peer
ap3800i-r3-sw2-Gi1-0-37   0042.68a1.03d2  Disjoined  05/12/22 11:57:43  NA  Wtp reset config cmd sent
ap3800i-r3-sw2-Gi1-0-37   0042.68a1.03d2  Disjoined  05/10/22 10:54:49  NA  DTLS close alert from peer
Check timestamps and disjoin reasons. Look for multiple disconnections per AP, and for disconnections occurring at the same time or periodically.
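The history checks above can be sketched in Python. The sketch counts disjoins per AP, finds the dominant reason and the gaps between consecutive disjoins, and looks for disjoins of different APs close together in time. It is a minimal illustration. It assumes the MM/DD/YY timestamp format shown above. The function names and the 300-second window are hypothetical choices. The sample rows are a subset of the output above.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

MAC = r"[0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4}"
# Row: AP name, Ethernet MAC, "Disjoined", MM/DD/YY HH:MM:SS, a column shown as NA, reason.
ROW = re.compile(
    rf"^(?P<ap>\S+)\s+{MAC}\s+Disjoined\s+"
    r"(?P<ts>\d{2}/\d{2}/\d{2}\s+\d{2}:\d{2}:\d{2})\s+\S+\s+(?P<reason>.+?)\s*$"
)


def parse_disjoins(text):
    """Return (timestamp, AP name, reason) tuples sorted by time."""
    events = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            ts = datetime.strptime(" ".join(m["ts"].split()), "%m/%d/%y %H:%M:%S")
            events.append((ts, m["ap"], m["reason"]))
    return sorted(events)


def per_ap(events):
    """Per AP: disjoin count, most common reason, gaps (s) between consecutive disjoins."""
    grouped = defaultdict(list)
    for ts, ap, reason in events:
        grouped[ap].append((ts, reason))
    report = {}
    for ap, items in grouped.items():
        times = [ts for ts, _ in items]
        gaps = [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]
        reason = Counter(r for _, r in items).most_common(1)[0][0]
        report[ap] = (len(items), reason, gaps)
    return report


def simultaneous(events, window=300):
    """Chains of disjoins at most `window` seconds apart that involve 2+ different APs."""
    groups, current = [], []
    for ev in events:
        if current and (ev[0] - current[-1][0]).total_seconds() > window:
            groups.append(current)
            current = []
        current.append(ev)
    if current:
        groups.append(current)
    return [g for g in groups if len({ap for _, ap, _ in g}) > 1]


history = """\
ap3800i-r2-sw1-te0-1 0042.68a0.ee78 Disjoined 05/24/22 12:27:39 NA DTLS close alert from peer
ap3800i-r2-sw1-te0-1 0042.68a0.ee78 Disjoined 05/24/22 12:24:26 NA DTLS close alert from peer
ap3800i-r2-sw1-te0-1 0042.68a0.ee78 Disjoined 05/09/22 13:02:28 NA Heart beat timer expiry
ap3800i-r3-sw2-Gi1-0-37 0042.68a1.03d2 Disjoined 05/12/22 12:02:38 NA DTLS close alert from peer
ap3800i-r3-sw2-Gi1-0-37 0042.68a1.03d2 Disjoined 05/12/22 11:57:43 NA Wtp reset config cmd sent
"""
events = parse_disjoins(history)
report = per_ap(events)
for ap, (count, reason, gaps) in report.items():
    print(ap, count, reason, gaps)
print(simultaneous(events))
```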
Another important check is to review AP tag assignment. Tags determine the SSIDs, AP mode, RF profiles, and policies configured on each AP. We can verify that APs have the expected tags and that the right tag assignment method was used. Comparing the tags attached to APs in the same location, or on working vs non-working APs, can help spot incorrect tag allocation. Use command: “sh ap tag summary”.
Moreover, we need to identify any AP showing misconfigured tags. Misconfigured tags can be caused by a nonexistent or removed parameter (policy profile, RF profile, …) or by an incorrect configuration combination. APs marked as misconfigured will not broadcast any BSSID. Use command: “sh ap tag summary | i Yes”.
Gladius1#sh ap tag summary
Number of APs: 4

AP Name    AP Mac          Site Tag Name     Policy Tag Name     RF Tag Name     Misconfigured  Tag Source
----------------------------------------------------------------------------------------------------------
HG-2       0cd0.f894.0f40  default-site-tag  default-policy-tag  default-rf-tag  No             Default
AP1832I    80e8.6fd8.6330  site2             flex-vlan4          rf-hig          No             Location
ap1700i    f44e.0578.a560  site2             default-policy-tag  default-rf-tag  Yes            Static
AP9120     d4e8.8019.6100  default-site-tag  LOCAL_VLAN169       default-rf-tag  No             Filter
Check for misconfigured tags, the correct tag source, and consistent tag assignment for APs in the same branch.
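Both tag checks can be run on a saved “sh ap tag summary” output: listing misconfigured APs and spotting inconsistent tags within a branch. The Python below is a minimal sketch. It assumes single-token columns, as above. It also treats the site tag as a stand-in for “branch”, which is an assumption: the output itself has no branch field. The function names are hypothetical.

```python
from collections import defaultdict

KEYS = ("ap", "mac", "site", "policy", "rf", "misconfigured", "source")


def parse_tags(text):
    """AP rows of 'show ap tag summary' as dicts (assumes single-token columns)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # AP rows have exactly 7 columns with a dotted MAC in the second one.
        if len(fields) == 7 and fields[1].count(".") == 2:
            rows.append(dict(zip(KEYS, fields)))
    return rows


def misconfigured(rows):
    """APs flagged as misconfigured (these will not broadcast any BSSID)."""
    return [r["ap"] for r in rows if r["misconfigured"] == "Yes"]


def mixed_tags_per_site(rows):
    """Site tags whose APs carry more than one policy/RF tag combination."""
    combos = defaultdict(set)
    for r in rows:
        combos[r["site"]].add((r["policy"], r["rf"]))
    return {site: sorted(c) for site, c in combos.items() if len(c) > 1}


tag_summary = """\
Number of APs: 4
AP Name    AP Mac          Site Tag Name     Policy Tag Name     RF Tag Name     Misconfigured  Tag Source
HG-2       0cd0.f894.0f40  default-site-tag  default-policy-tag  default-rf-tag  No             Default
AP1832I    80e8.6fd8.6330  site2             flex-vlan4          rf-hig          No             Location
ap1700i    f44e.0578.a560  site2             default-policy-tag  default-rf-tag  Yes            Static
AP9120     d4e8.8019.6100  default-site-tag  LOCAL_VLAN169       default-rf-tag  No             Filter
"""
rows = parse_tags(tag_summary)
print(misconfigured(rows))
print(mixed_tags_per_site(rows))
```

A mixed result for default-site-tag may be expected, because that tag is what APs fall back to when nothing else is assigned. A mixed result for a branch-specific site tag such as site2 is a better lead.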
Even if the APs are up and correctly configured, we can do further checks to identify potentially misbehaving APs with no clients connected. We need to be careful, since a perfectly working AP could have no clients at a given moment. Based on our knowledge of the network and the number of clients seen on other APs in the same area, we can isolate APs that could be experiencing issues. For those APs, we can confirm that the radios are up and the AP is broadcasting the correct BSSIDs, and then monitor them for a period of time. If an AP still shows no clients after the monitoring period, we can try resetting the AP radio or the CAPWAP connection with the WLC to recover it. Use command: “show ap sum sort descending client-count | i __0__”.
Gladius1#show ap sum sort descending client-count | i __0__
----------------------------------------------------------------------------------------------------------
AP-name  AP-mac          Client count  Data Usage  Through-Put  Admin-State
----------------------------------------------------------------------------------------------------------
9120i    d4e8.801a.3340  0             1407172     515          Enabled
AP1562   2c33.1192.3e40  0             4189901     69           Disabled
AP3800   0062.ecf3.8310  0             48548613    473          Disabled
Check for APs with zero clients that are in Enabled state.
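The monitoring-period idea above can be expressed as an intersection of captures. An AP is flagged only if it is enabled and has zero clients in every capture taken during the period. The Python below is a minimal sketch. It assumes the six-column row layout shown above, and the function names are hypothetical. The test simply reuses the single capture from this post.

```python
def idle_enabled(text):
    """APs with 0 clients and Admin-State Enabled in one client-count capture."""
    idle = set()
    for line in text.splitlines():
        f = line.split()
        # Assumed row layout: AP-name, AP-mac, client count, data usage, throughput, admin state.
        if len(f) == 6 and f[2] == "0" and f[5] == "Enabled":
            idle.add(f[0])
    return idle


def persistently_idle(snapshots):
    """APs idle and enabled in every capture taken during the monitoring period."""
    sets = [idle_enabled(s) for s in snapshots]
    return set.intersection(*sets) if sets else set()


snapshot = """\
AP-name  AP-mac          Client count  Data Usage  Through-Put  Admin-State
9120i    d4e8.801a.3340  0             1407172     515          Enabled
AP1562   2c33.1192.3e40  0             4189901     69           Disabled
AP3800   0062.ecf3.8310  0             48548613    473          Disabled
"""
print(idle_enabled(snapshot))
```

In practice, the snapshots list would hold captures collected periodically, for example by the EEM approach mentioned at the end of this post.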
One example of these AP KPIs helping to identify an issue was a customer facing random AP disconnections. By analyzing the “show ap uptime” output for the APs that were frequently disconnected, we obtained a list of impacted APs. Thanks to the customer's AP naming convention, combined with the output of “show ap cdp neighbors”, we were able to identify that all of these APs were in the same location and connected to one specific switch. The disconnect reason for those APs pointed to the connection being closed by the AP. When checking the AP logs, we saw multiple retransmissions of CAPWAP packets. We then pinged from the AP to the WLC and saw packet loss. The same packet loss was seen when pinging from the AP to its gateway. The ping tests clearly showed a connectivity issue in the switches between the APs and their gateway.
RF Checks
We can monitor, per band, AP channel assignment, channel width, transmit power, and radio state. With that information, we can check whether channels are evenly distributed to avoid co-channel interference, and find out whether many APs are using maximum Tx power, which could point to coverage issues. We can also identify APs whose radios are not operational and are marked as down. We need to do this verification for 2.4 GHz and 5 GHz, and also for 6 GHz on the new 9136 APs. Use command: “show ap dot11 24ghz/5ghz/6ghz summary”. If you have 802.11ax APs supporting BSS Coloring, you can add the “extended” keyword to check the BSS Color assigned to each AP.
Gladius1#sh ap dot11 5ghz summary
AP Name    Mac Address     Slot  Admin State  Oper State  Width  Txpwr           Channel  Mode
---------------------------------------------------------------------------------------------------------------------------------------------------------
9130E      0c75.bdb5.71e0  1     Enabled      Up          20     *2/8 (21 dBm)   (100)*   Local
9130E      0c75.bdb5.71e0  2     Disabled     Down        20     *1/8 (15 dBm)   (36)*    Local
AP9120A    d4e8.8019.f140  1     Enabled      Up          20     *2/8 (19 dBm)   (40)*    Local
AP9120B    d4e8.801a.3400  1     Enabled      Up          20     7/8 (4 dBm)     (40)     Local
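Check for Txpwr level 1 (maximum power), uneven channel distribution, radios that are down, and unexpected static assignments. Values marked with an asterisk (*) are globally assigned by RRM; in this output, AP9120B is the only radio with static power and channel values.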
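The four radio checks above can be run in one pass over a saved band summary. The Python below is a minimal sketch. It assumes the column layout shown above. It also assumes that a missing asterisk on Txpwr or Channel means a static value, following the RRM convention just described. The function name audit is hypothetical.

```python
import re
from collections import Counter

MAC = r"[0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4}"
# Row: AP name, MAC, slot, admin state, oper state, width, [*]level/levels (N dBm), (channel)[*], mode.
ROW = re.compile(
    rf"^(?P<ap>\S+)\s+{MAC}\s+(?P<slot>\d+)\s+(?P<admin>\S+)\s+(?P<oper>\S+)\s+\d+\s+"
    r"(?P<pstar>\*?)(?P<level>\d+)/\d+\s+\(-?\d+ dBm\)\s+"
    r"\((?P<chan>[\d,]+)\)(?P<cstar>\*?)\s+\S+\s*$"
)


def audit(text):
    """Channel usage, max-power radios, down radios and statically assigned radios."""
    channels, max_power, down, static = Counter(), [], [], []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if not m:
            continue
        radio = (m["ap"], int(m["slot"]))
        if m["oper"] != "Up":
            down.append(radio)
            continue                      # skip down radios in the remaining checks
        channels[m["chan"]] += 1          # radios per channel, to spot uneven distribution
        if m["level"] == "1":
            max_power.append(radio)       # power level 1 is the highest level
        if not m["pstar"] or not m["cstar"]:
            static.append(radio)          # no asterisk: treated as a static value
    return channels, max_power, down, static


summary = """\
9130E      0c75.bdb5.71e0  1     Enabled      Up          20     *2/8 (21 dBm)   (100)*   Local
9130E      0c75.bdb5.71e0  2     Disabled     Down        20     *1/8 (15 dBm)   (36)*    Local
AP9120A    d4e8.8019.f140  1     Enabled      Up          20     *2/8 (19 dBm)   (40)*    Local
AP9120B    d4e8.801a.3400  1     Enabled      Up          20     7/8 (4 dBm)     (40)     Local
"""
channels, max_power, down, static = audit(summary)
print(channels, max_power, down, static)
```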
The next statistics help us check the number of channel changes per radio. For 5 GHz, we can investigate whether an AP is changing channels because radar was detected on its channel (DFS event). If we see many channel changes and the numbers keep increasing, client connectivity could be impacted, because a channel change resets the AP radio and disconnects all of its clients. If the radio moves to a DFS channel in 5 GHz, it must monitor that channel for 60 seconds before beaconing, and clients cannot connect to that radio during that time. Excessive channel changes could point to RF or RRM issues and need to be investigated. Use command: “show ap auto-rf dot11 24ghz/5ghz | i Channel changes due to radar|AP Name|Channel Change Count”.
Gladius1#sh ap auto-rf dot11 5ghz | i Channel changes due to radar|AP Name|Channel Change Count
AP Name                        : 9130E-r3-sw2-g1014
Channel changes due to radar   : 0
Channel Change Count           : 2
AP Name                        : 9130E-r3-sw2-g1014
Channel changes due to radar   : 0
AP Name                        : AP9120-2-r3-sw2-Gi1-0-39
Channel changes due to radar   : 3
Channel Change Count           : 10
AP Name                        : AP9120-r3-sw3-Gi1-0-47
Channel changes due to radar   : 0
Channel Change Count           : 62
Check for a high number of channel changes and for changes due to DFS events.
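The filtered auto-rf output can be turned into one record per “AP Name” block, then flagged for DFS-triggered or frequent channel changes. The Python below is a minimal sketch. The threshold of 20 changes is a hypothetical value to adjust per deployment. A block without a “Channel Change Count” line (like the second 9130E block above) is assumed to mean 0. The function names are hypothetical.

```python
import re

KV = re.compile(r"^\s*(AP Name|Channel changes due to radar|Channel Change Count)\s*:\s*(\S+)\s*$")


def parse_auto_rf(text):
    """One record per 'AP Name' block; a missing count is assumed to be 0."""
    records = []
    for line in text.splitlines():
        m = KV.match(line)
        if not m:
            continue
        key, value = m.groups()
        if key == "AP Name":
            records.append({"ap": value, "radar": 0, "changes": 0})
        elif records:
            field = "radar" if key == "Channel changes due to radar" else "changes"
            records[-1][field] = int(value)
    return records


def flag(records, max_changes=20):
    """Radios with any radar (DFS) channel change or more than `max_changes` changes."""
    return [r["ap"] for r in records if r["radar"] > 0 or r["changes"] > max_changes]


auto_rf = """\
AP Name                        : 9130E-r3-sw2-g1014
Channel changes due to radar   : 0
Channel Change Count           : 2
AP Name                        : 9130E-r3-sw2-g1014
Channel changes due to radar   : 0
AP Name                        : AP9120-2-r3-sw2-Gi1-0-39
Channel changes due to radar   : 3
Channel Change Count           : 10
AP Name                        : AP9120-r3-sw3-Gi1-0-47
Channel changes due to radar   : 0
Channel Change Count           : 62
"""
records = parse_auto_rf(auto_rf)
print(flag(records))
```

To see whether the numbers are increasing, as the text recommends, the same parser can be run on two captures taken some time apart and the counts compared per AP and slot.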
One more check we can do is the load, or channel utilization, per radio. The Catalyst 9800 WLC shows the channel utilization and client count, so we can identify APs with high load. If we see APs with few clients but high load, we can focus on those APs and check whether the load is due to traffic transmitted or received by the AP, or due to co-channel interference. Load information also helps us identify the most loaded APs and the areas where higher AP density may be needed. Use command: “show ap dot11 24ghz/5ghz/6ghz load-info”.
Gladius1#sh ap dot11 5ghz load-info
AP Name    Radio MAC       Slot  Channel Utilization (%)  Clients
----------------------------------------------------------------------------------------
9130E      0c75.bdb5.71e0  1     2                        0
9130E      0c75.bdb5.71e0  2     0                        0
AP9120A    d4e8.8019.f140  1     11                       5
AP9120B    d4e8.801a.3400  1     11                       0
Check for high channel utilization, or channel utilization with no clients (co-channel interference). Here, AP9120B shows 11% utilization with no clients, and co-channel interference is likely because AP9120A and AP9120B are both on channel 40.
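The “load with no clients” check can be sketched by flagging such radios and listing the other radios on the same channel. The channel map comes from the band summary shown earlier. The Python below is a minimal sketch. The 10% threshold is a hypothetical value, and the function names are hypothetical. Sharing a channel in the WLC output does not by itself prove that two radios hear each other, so physical proximity still needs to be confirmed.

```python
def busy_but_idle(text, min_util=10):
    """Radios with channel utilization >= min_util % while no clients are associated."""
    flagged = []
    for line in text.splitlines():
        f = line.split()
        # Assumed row layout: AP name, radio MAC, slot, utilization %, clients.
        if len(f) == 5 and f[2].isdigit() and f[3].isdigit() and f[4].isdigit():
            util, clients = int(f[3]), int(f[4])
            if util >= min_util and clients == 0:
                flagged.append((f[0], int(f[2]), util))
    return flagged


def co_channel_peers(flagged, channel_of):
    """For each flagged radio, other radios on the same channel ({(ap, slot): channel})."""
    return {
        (ap, slot): sorted(k for k, ch in channel_of.items()
                           if ch == channel_of.get((ap, slot)) and k != (ap, slot))
        for ap, slot, _ in flagged
    }


load_info = """\
AP Name    Radio MAC       Slot  Channel Utilization (%)  Clients
9130E      0c75.bdb5.71e0  1     2                        0
9130E      0c75.bdb5.71e0  2     0                        0
AP9120A    d4e8.8019.f140  1     11                       5
AP9120B    d4e8.801a.3400  1     11                       0
"""
# Channels of the operational 5 GHz radios, taken from "sh ap dot11 5ghz summary" above.
channel_of = {("9130E", 1): "100", ("AP9120A", 1): "40", ("AP9120B", 1): "40"}

flagged = busy_but_idle(load_info)
print(flagged, co_channel_peers(flagged, channel_of))
```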
One example of an issue identified by checking these RF KPIs was a customer with client performance issues. When checked, the 5 GHz radio load was quite high even when few or no clients were connected. Digging further, we found that the load was not due to transmitted or received data but due to co-channel interference. When analyzing the channels assigned to the APs with high load, we found that only 4 channels were being used by those APs, due to a configuration issue in the RF profile. After adding more channels to the RF profile, channel utilization decreased and no further performance issues were reported.
For more detailed RF analysis you can use Wireless Config Analyzer Express (WCAE) tool: https://developer.cisco.com/docs/wireless-troubleshooting-tools/#wireless-config-analyzer-express
WCAE shows you the distribution of channels, Tx power, RF metrics per AP, and more details.
With the methodology and commands provided, you can proactively identify issues with your WLC's APs and RF. In the next blog, we will share 9800 WLC KPIs to check client connectivity and WLC dropped/punted packets.
List of commands to use for KPIs and automation scripts
In the document below, there is also a link to a script that automatically collects all the commands. It collects commands based on platform and release, saves the output to a file, and exports the file. The script uses the “Guest-shell” feature, which for now is only available on the physical 9800-40, 9800-80, and 9800-L WLCs.
The document also provides an example of an EEM script to collect logs periodically. In conclusion, EEM together with the “Guest-shell” script will help you collect 9800 WLC KPIs and build a baseline for your Catalyst 9800 WLC.
For the list of commands used to monitor those KPIs