Wireless Catalyst 9800 WLC KPIs, Part 2


Part 2 of the 3-part series on Wireless Catalyst 9800 WLC KPIs

In the previous blog, Wireless Catalyst 9800 WLC KPIs, Part 1, we shared how to check the WLC and its connections to other devices.

In this blog, we will concentrate on Key Performance Indicators (KPIs) for Access Points (APs) and Radio Frequency (RF). I will share approaches and commands to measure the health of the APs and RF.

The KPIs are grouped into the following buckets or areas:

  • WLC checks
  • Connection with other devices
  • AP checks
  • RF checks
  • Client checks
  • Packet drops

AP Checks

Now let’s focus on AP health. First, we can check the total number of APs connected to the WLC and confirm that it matches the expected number. Use command: “show ap sum | i Number of APs”. If the AP count is not correct, we need to identify the missing APs, the reason for the disconnection, and/or why they have not been able to rejoin the controller. As a starting point, it is useful to keep a complete list of APs from a working scenario, with Ethernet MAC and IP addresses (“show ap summary”).

Gladius1#show ap sum

Load for five secs: 0%/0%; one minute: 0%; five minutes: 0%

Time source is NTP, 19:18:03.363 CEST Wed May 25 2022

Number of APs: 8

AP Name               Slots    AP Model              Ethernet MAC    Radio MAC       Location                          Country     IP Address                                 State

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

AP3800-r2sw1-te1-0-8    2      AIR-AP3802I-E-K9      0042.68a0.fc4a  0062.ecf3.8310  default location                  DE          192.168.127.108                            Registered

9130i-r2sw1-te2016      3      C9130AXI-E            04eb.409e.14c0  04eb.409f.0c60  default location                  DE          192.168.25.133                             Registered

9130i-r2sw1-te2015      3      C9130AXI-E            04eb.409e.1724  04eb.409f.1f80  default location                  DE          192.168.25.122                             Registered

9130i-r3-sw2-g1-0-10    3      C9130AXI-B            04eb.409e.1d28  04eb.409f.4fa0  default location                  US          192.168.127.113                            Registered

AP1562-r3-sw-3-gi1-0-3  2      AIR-AP1562E-E-K9      0062.ec80.8c8c  2c33.1192.3e40  default location                  DE          192.168.127.106                            Registered

SS-I-1                  2      C9115AXI-B            7069.5a74.7a50  7069.5a78.7780  default location                  US          192.168.127.97                             Registered

ap3800i-r2-sw1-te1-0-5  2      AIR-AP3802I-E-K9      0042.68c5.bdf0  cc16.7e5f.f000  default location                  CH          192.168.127.109                            Registered

9120i-r4-sw2-te1-0-39   2      C9120AXI-E            d4e8.8019.60e8  d4e8.801a.3340  default location                  DE          192.168.127.114                            Registered

Check the AP count, and keep a list of the Ethernet MAC and IP addresses of all the APs.

We can compare the output of working vs non-working scenarios to quickly identify and locate the missing devices.
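The comparison between a working and a non-working capture can be scripted. The following is a minimal sketch, assuming both “show ap summary” outputs were saved as plain text in the column layout shown above; the function names are hypothetical and not part of any Cisco tooling.

```python
import re

# Dotted-hex MAC such as 0042.68a0.fc4a, and a dotted-decimal IPv4 address.
MAC_RE = re.compile(r"\b[0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4}\b", re.IGNORECASE)
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def parse_ap_summary(text):
    """Return {ethernet_mac: (ap_name, ip_address)} from 'show ap summary' text."""
    aps = {}
    for line in text.splitlines():
        macs = MAC_RE.findall(line)
        ips = IP_RE.findall(line)
        # Assumption: data rows carry two MACs (Ethernet first, then Radio)
        # and one IP address; headers and separators carry neither.
        if len(macs) >= 2 and ips:
            name = line.split()[0]
            aps[macs[0].lower()] = (name, ips[-1])
    return aps


def missing_aps(baseline_text, current_text):
    """APs present in the baseline capture but absent from the current one."""
    baseline = parse_ap_summary(baseline_text)
    current = parse_ap_summary(current_text)
    return {mac: info for mac, info in baseline.items() if mac not in current}
```

Keying on the Ethernet MAC rather than the AP name keeps the comparison valid even if an AP was renamed between the two captures.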

Even if we see the expected count of APs connected to the WLC, we need to check whether those APs are stable. The WLC has a command that lets us easily check uptime (reloads) and validate CAPWAP tunnel reliability. Use command: “show ap uptime | ex ____([0-9])+ day”. The “exclude” keyword helps us focus on APs that reloaded or disconnected within the last day.

Gladius2#sh ap uptime

Number of APs: 8

AP Name                    Ethernet MAC    Radio MAC       AP Up Time                                          Association Up Time

---------------------------------------------------------------------------------------------------------------------------------------------------

AP3800-r2sw1-te1-0-8       0042.68a0.fc4a  0062.ecf3.8310  26 days 0 hour 57 minutes 41 seconds                15 days 1 hour 50 minutes 4 seconds

9130i-r2sw1-te2015         04eb.409e.1724  04eb.409f.1f80  9 days 3 hours 26 minutes 48 seconds                9 days 3 hours 24 minutes 24 seconds

9130i-r2sw1-te2016         04eb.409e.14c0  04eb.409f.0c60  9 days 1 hour 39 minutes 29 seconds                 9 days 1 hour 26 minutes 47 seconds

9120i-r4-sw2-te1-0-39      d4e8.8019.60e8  d4e8.801a.3340  8 days 1 hour 36 minutes 57 seconds                 8 days 1 hour 33 minutes 49 seconds

SS-I-1                     7069.5a74.7a50  7069.5a78.7780  26 days 0 hour 54 minutes 57 seconds                22 minutes 15 seconds

ap3800i-r2-sw1-te1-0-5     0042.68c5.bdf0  cc16.7e5f.f000  26 days 0 hour 46 minutes 12 seconds                22 minutes 13 seconds

9130i-r3-sw2-g1-0-10       04eb.409e.1d28  04eb.409f.4fa0  22 minutes 21 seconds                               19 minutes 39 seconds

Check AP Up Time and Association Up Time. In this case, SS-I-1 and ap3800i-r2-sw1-te1-0-5 experienced a disconnection, while 9130i-r3-sw2-g1-0-10 reloaded.

With the above command, we can find out whether any unexpected AP reloads occurred, and whether several APs reloaded at the same time. If the reloaded APs are in the same location or connected to the same switch, that could point to a network or power issue in that location or switch. Similarly, for AP disconnections, we can compare “Association Up Time” values to identify patterns, determine whether there were any unexpected tunnel teardowns, and when they occurred. Keep in mind that APs will restart the CAPWAP tunnel after some specific configuration changes, for example when a new tag is applied.
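The reload versus disconnection distinction can also be derived from the full “show ap uptime” output offline. This is a minimal sketch, assuming the output was saved as text and that every duration ends with a seconds field as in the sample above; the function names and the one-day window are illustrative choices, not WLC features.

```python
import re

UNIT_SECONDS = {"day": 86400, "hour": 3600, "minute": 60, "second": 1}
TOKEN_RE = re.compile(r"(\d+)\s+(day|hour|minute|second)s?")


def parse_durations(line):
    """Return the durations found on a line, in seconds, in order of appearance.

    Assumption: each duration ends with a 'seconds' field, which is used
    here as the boundary between the two uptime columns.
    """
    durations, total = [], 0
    for value, unit in TOKEN_RE.findall(line):
        total += int(value) * UNIT_SECONDS[unit]
        if unit == "second":
            durations.append(total)
            total = 0
    return durations


def classify_recent_events(text, window_seconds=86400):
    """Map AP name -> 'reload' or 'disconnection' for events inside the window."""
    events = {}
    for line in text.splitlines():
        durations = parse_durations(line)
        if len(durations) != 2:
            continue  # header, separator, or a row in an unexpected format
        ap_uptime, assoc_uptime = durations
        name = line.split()[0]
        if ap_uptime < window_seconds:
            events[name] = "reload"  # the AP itself restarted
        elif assoc_uptime < window_seconds:
            events[name] = "disconnection"  # AP stayed up, CAPWAP tunnel did not
    return events
```

Grouping the resulting AP names by switch or location (for example through a naming convention) is then a quick way to spot a shared network or power problem.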

If “AP Up Time” is lower than expected, and not due to a general reload, we can check whether any AP crashes were reported on the WLC and examine the bootflash content for related crash files. Use command: “show ap crash-file” or “dir bootflash: | i crash”

Gladius1#show ap crash-file

File Location: BOOTFLASH

AP Name                         Crash File                Radio Slot 0                       Radio Slot 1

-------------------------------------------------------------------------------------------------------------------------------

ap3800i-r2-sw1-te0-1             ap3800i-r2-sw1-te0-1_0062ecaade80.crash


Gladius1#dir bootflash: | i crash

54      -rw-            50476   May 9 2022 13:07:34 +02:00  ap3800i-r2-sw1-te0-1_0062ecaade80.crash

66      -rw-           120276  Jan 26 2022 11:46:55 +01:00  AP9120-2-r3-sw2-Gi1-0-39_d4e88019f140.crash

28      -rw-            93952   Nov 2 2021 13:02:21 +01:00  SS-E-2_00eeab18c160.crash

12      -rw-            42975  Oct 27 2021 15:01:44 +02:00  9115i-r4-sw2-te1-0-38_f80f6f154ce0.crash

42      -rw-            42235  May 15 2021 14:24:59 +02:00  9115i-r3-sw2-te1-0-38_f80f6f154960.crash

41      -rw-            26063  Mar 30 2021 13:06:45 +02:00  9115i-r3-sw2-te1-0-38_f80f6f154c80.crash

Check for AP crashes, multiple crashes on the same AP, and periodic crashes.

It is advisable to review the bootflash content from time to time to locate new crashes. If there are any new crash files, download them and share them with TAC for root cause analysis. Finally, remove old ones to keep the file system clean.

If we observe AP disconnections, we can establish the most common termination event and the AP state at that moment. This gives us a global picture. Use command: “show wireless stats ap session termination”.

Gladius1#show wireless stats ap session termination

Event                           Previous State                  Occurance Count

------------------------------------------------------------------------------------

DTLS session closed             JOINED                          6

Heartbeat timer expiry          JOINED                          2

Reset by API                    IMAGE_DOWNLOAD                  1

Image download status           IMAGE_DOWNLOAD                  6

Reset by API                    RUN                             3

DTLS session closed             RUN                             17

Heartbeat timer expiry          RUN                             6

Check the events with the highest count. If the AP was in RUN state, disconnections could be due to persistent packet drops.

After that, we can drill down with the AP history command to get more detailed information per AP. Filtering the AP history by disconnections shows whether several APs disconnected at the same time, and the disconnect reason for each AP. By analyzing the command output, we can also see whether the same AP disconnects multiple times and how often. Use command: “show wireless stats ap history | i Disjoined”
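Ranking the termination events by count is easy to automate. The sketch below assumes the “show wireless stats ap session termination” output was saved as text with the three columns shown above, separated by runs of spaces; the function name is hypothetical.

```python
import re

# Event text (may contain single spaces), then the previous state, then the count.
ROW_RE = re.compile(r"^(?P<event>.+?)\s{2,}(?P<state>\S+)\s+(?P<count>\d+)\s*$")


def top_termination_events(text, limit=3):
    """Return up to `limit` (event, previous_state, count) tuples, highest count first."""
    rows = []
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:  # the header and separator lines do not end in a number
            rows.append((match["event"], match["state"], int(match["count"])))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:limit]
```

Running it on two captures taken some time apart also shows which counters are still increasing, which is usually more telling than the absolute numbers.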

Gladius1#show wireless stats ap history | i Disjoined

ap3800i-r2-sw1-te0-1     0042.68a0.ee78  Disjoined  05/24/22 12:27:39  NA DTLS close alert from peer

ap3800i-r2-sw1-te0-1     0042.68a0.ee78  Disjoined  05/24/22 12:24:26  NA DTLS close alert from peer

ap3800i-r2-sw1-te0-1     0042.68a0.ee78  Disjoined  05/24/22 12:17:47  NA DTLS close alert from peer

ap3800i-r2-sw1-te0-1     0042.68a0.ee78  Disjoined  05/24/22 11:41:17  NA DTLS close alert from peer

ap3800i-r2-sw1-te0-1     0042.68a0.ee78  Disjoined  05/24/22 11:38:04  NA DTLS close alert from peer

ap3800i-r2-sw1-te0-1     0042.68a0.ee78  Disjoined  05/24/22 10:18:04  NA DTLS close alert from peer

ap3800i-r2-sw1-te0-1     0042.68a0.ee78  Disjoined  05/09/22 13:02:28  NA Heart beat timer expiry

ap3800i-r2-sw1-te0-1     0042.68a0.ee78  Disjoined  05/09/22 10:49:34  NA Heart beat timer expiry

ap3800i-r2-sw1-te0-1     0042.68a0.ee78  Disjoined  05/05/22 19:53:31  NA Failure decoding wtp descriptor

ap3800i-r3-sw2-Gi1-0-37  0042.68a1.03d2  Disjoined  05/12/22 12:02:38  NA DTLS close alert from peer

ap3800i-r3-sw2-Gi1-0-37  0042.68a1.03d2  Disjoined  05/12/22 11:57:43  NA Wtp reset config cmd sent

ap3800i-r3-sw2-Gi1-0-37  0042.68a1.03d2  Disjoined  05/10/22 10:54:49  NA DTLS close alert from peer

Check timestamps and disjoin reason. Find multiple disconnections per AP, disconnections occurring at the same time or periodically.
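The per-AP grouping described above can be sketched as follows. This assumes the filtered “show wireless stats ap history | i Disjoined” output was saved as text, that timestamps are in month/day/year order as in the sample, and that the field before the reason (“NA” above) can be skipped; the function name is hypothetical.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

ROW_RE = re.compile(
    r"^(?P<ap>\S+)\s+(?P<mac>\S+)\s+Disjoined\s+"
    r"(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s+\S+\s+(?P<reason>.*\S)\s*$"
)


def summarize_disjoins(text):
    """Return {ap_name: {"count": int, "reasons": Counter, "times": [datetime]}}."""
    summary = defaultdict(lambda: {"count": 0, "reasons": Counter(), "times": []})
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if not match:
            continue
        entry = summary[match["ap"]]
        entry["count"] += 1
        entry["reasons"][match["reason"]] += 1
        # Assumption: mm/dd/yy, inferred from values such as 05/24/22.
        entry["times"].append(datetime.strptime(match["ts"], "%m/%d/%y %H:%M:%S"))
    return dict(summary)
```

Sorting the result by "count" highlights the APs that disconnect most often, and the "times" lists can be compared across APs to find disconnections that happened together.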

Another important check is to review AP tag assignment. Tags determine the SSIDs, AP mode, RF profiles, and policies configured on each AP. We can verify that the APs have the expected tags and that the right method was used for tag assignment. Comparing the tags attached to APs in the same location, or on working vs non-working APs, can help to spot incorrect tag allocation. Use command: “sh ap tag summary”

Moreover, we also need to identify any AP showing misconfigured tags. Misconfigured tags could be due to a nonexistent or removed parameter (policy profile, RF profile, …) or an incorrect config combination. APs marked as misconfigured will not broadcast any BSSID. Use command: “sh ap tag summary | i  Yes”

Gladius1#sh ap tag summary

Number of APs: 4

AP Name   AP Mac      Site Tag Name     Policy Tag Name     RF Tag Name   Misconfigured    Tag Source

----------------------------------------------------------------------------------------------------------

HG-2     0cd0.f894.0f40   default-site-tag   default-policy-tag   default-rf-tag    No      Default

AP1832I  80e8.6fd8.6330   site2              flex-vlan4             rf-hig          No      Location

ap1700i  f44e.0578.a560   site2              default-policy-tag   default-rf-tag    Yes     Static

AP9120   d4e8.8019.6100   default-site-tag   LOCAL_VLAN169        default-rf-tag    No      Filter

Check for misconfigured tags, the correct tag source, and the same tag assignment for APs in the same branch.

Even if the APs are up and have the right configuration, we can do some further checks to identify potentially misbehaving APs with no clients connected. We need to be careful, since a healthy AP could simply have no clients at that moment. Based on our knowledge of the network and the number of clients seen on other APs in the same area, we can isolate APs that could be experiencing issues. For those APs, we can confirm that the radios are up and that the AP is broadcasting the correct BSSIDs, then monitor them for a period of time. If an AP still shows no clients after the monitoring period, we can try resetting the AP radio or the CAPWAP connection to the WLC to recover. Use command: “show ap sum sort descending client-count | i __0__”

Gladius1#show ap sum sort descending client-count | i __0__

----------------------------------------------------------------------------------------------------------

AP-name         AP-mac           Client count          Data Usage          Through-Put     Admin-State

----------------------------------------------------------------------------------------------------------

9120i            d4e8.801a.3340       0                    1407172              515           Enabled

AP1562           2c33.1192.3e40       0                    4189901              69            Disabled

AP3800           0062.ecf3.8310       0                    48548613             473           Disabled

Check for APs with zero clients that are in the Enabled state.

An example of these AP KPIs helping to identify an issue was a customer facing random AP disconnections. By analyzing “show ap uptime”, we could get a list of the APs that were frequently disconnected. Thanks to the customer’s AP naming convention, combined with the output of “show ap cdp neighbors”, we were able to identify that all the impacted APs were in the same location and connected to one specific switch. The disconnect reason for those APs pointed to the connection being closed by the AP. When checking the AP logs, we could see multiple retransmissions of CAPWAP packets. We then tested a ping from the AP to the WLC and saw packet loss. The same packet loss was seen when pinging from the AP to its gateway. The ping tests clearly showed a connectivity issue in the switches between the APs and their gateway.

RF Checks

We can monitor, per band, the AP channel assignment, channel width, transmit power, and state of the radio. With that information, we can review whether channels are evenly distributed to avoid co-channel interference, and find out whether many APs are using maximum TX power, which could point to coverage issues. We can also identify APs with a radio that is not operational and marked as down. We need to do this verification for 2.4 GHz and 5 GHz, and for 6 GHz on the new 9136 APs. Use command: “show ap dot11 24ghz/5ghz/6ghz summary”. If you have 11ax APs supporting BSS Coloring, you can add the “extended” keyword to check the BSS Color assigned to each AP.

Gladius1#sh ap dot11 5ghz summary

AP Name  Mac Address     Slot    Admin State    Oper State    Width    Txpwr           Channel    Mode

---------------------------------------------------------------------------------------------------------------------------------------------------------

9130E    0c75.bdb5.71e0  1       Enabled        Up            20       *2/8 (21 dBm)    (100)*      Local

9130E    0c75.bdb5.71e0  2       Disabled       Down          20       *1/8 (15 dBm)   (36)*        Local

AP9120A  d4e8.8019.f140  1       Enabled        Up            20       *2/8 (19 dBm)    (40)*       Local

AP9120B  d4e8.801a.3400  1       Enabled        Up            20       7/8 (4 dBm)     (40)         Local

Check for Txpwr 1, uneven channel distribution, radios down, and unexpected static assignment.
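These four checks can be collected in one pass over a saved “show ap dot11 5ghz summary” output. The sketch below assumes the row layout shown above and treats a trailing asterisk on the channel as a dynamic assignment (so no asterisk means static); that reading of the asterisk, the regular expression, and the function name are assumptions made for illustration.

```python
import re
from collections import Counter

ROW_RE = re.compile(
    r"^(?P<ap>\S+)\s+(?P<mac>\S+)\s+(?P<slot>\d+)\s+(?P<admin>Enabled|Disabled)\s+"
    r"(?P<oper>Up|Down)\s+(?P<width>\d+)\s+\*?(?P<level>\d+)/(?P<levels>\d+)\s+"
    r"\((?P<dbm>-?\d+) dBm\)\s+\((?P<channel>[\d,]+)\)(?P<dynamic>\*?)\s+(?P<mode>\S+)"
)


def radio_report(text):
    """Summarize channel reuse, down radios, power level 1, and static channels."""
    channels = Counter()
    down, max_power, static = [], [], []
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if not match:
            continue
        radio = (match["ap"], int(match["slot"]))
        if match["oper"] == "Down":
            down.append(radio)
            continue  # a down radio is not counted in the channel distribution
        channels[match["channel"]] += 1
        if match["level"] == "1":
            max_power.append(radio)
        if not match["dynamic"]:  # assumption: no asterisk means static channel
            static.append(radio)
    return {"channels": channels, "down": down, "max_power": max_power, "static": static}
```

A channel that appears far more often than the others in "channels" is a candidate for co-channel interference and is worth checking against the load figures for the same radios.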

The next statistics help us check the number of channel changes per radio. For 5 GHz, we can investigate whether the AP is changing channels because radar was detected on the same channel (DFS event). If we see many channel changes and the numbers keep increasing, that could impact client connectivity, because a channel change resets the AP radio and disconnects all clients. If the channel change in 5 GHz is to a DFS channel, the AP radio needs to monitor the channel for 60 seconds before beaconing, and clients cannot connect to that AP during that time. Excessive channel changes could point to RF or RRM issues and need to be investigated. Use command: “show ap auto-rf dot11 24ghz/5ghz | i Channel changes due to radar|AP Name|Channel Change Count”

Gladius1#sh ap auto-rf dot11 5ghz | i Channel changes due to radar|AP Name|Channel Change Count

AP Name                                           : 9130E-r3-sw2-g1014

Channel changes due to radar              : 0

Channel Change Count                          : 2

AP Name                                           : 9130E-r3-sw2-g1014

Channel changes due to radar              : 0

AP Name                                           : AP9120-2-r3-sw2-Gi1-0-39

Channel changes due to radar              : 3

Channel Change Count                          : 10

AP Name                                           : AP9120-r3-sw3-Gi1-0-47

Channel changes due to radar              : 0

Channel Change Count                          : 62

Check for a high number of channel changes and for changes due to DFS events.

One more check is the load, or channel utilization, per radio. The Catalyst 9800 WLC shows the channel utilization and client count, so we can identify APs with high load. If we see APs with few clients but high load, we can focus on those APs and check whether the load is due to traffic transmitted or received by the AP, or due to co-channel interference. Load information also helps us identify the most loaded APs and the areas where more AP density may be needed. Use command: “show ap dot11 24ghz/5ghz/6ghz load-info”
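The filtered auto-rf output is a flat list of “key : value” lines, so it can be folded back into one record per radio. This is a minimal sketch under that assumption; the function names and the threshold of 10 changes are hypothetical and should be tuned to the environment.

```python
def parse_channel_changes(text):
    """Group the filtered 'show ap auto-rf' lines into one record per 'AP Name' line."""
    records = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key : value" line
        key, value = key.strip(), value.strip()
        if key == "AP Name":
            records.append({"ap": value, "radar": None, "changes": None})
        elif records and key == "Channel changes due to radar":
            records[-1]["radar"] = int(value)
        elif records and key == "Channel Change Count":
            records[-1]["changes"] = int(value)
    return records


def flag_unstable(records, max_changes=10):
    """AP names with any radar-triggered change or more than `max_changes` changes."""
    return [
        r["ap"]
        for r in records
        if (r["radar"] or 0) > 0 or (r["changes"] or 0) > max_changes
    ]
```

Because these counters are cumulative, comparing two captures taken some time apart shows whether the numbers are still increasing.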

Gladius1#sh ap dot11 5ghz load-info

AP Name              Radio MAC       Slot  Channel Utilization (%)  Clients

----------------------------------------------------------------------------------------

9130E                0c75.bdb5.71e0     1                        2        0

9130E                0c75.bdb5.71e0     2                        0        0

AP9120A              d4e8.8019.f140     1                       11        5

AP9120B              d4e8.801a.3400     1                       11        0

Check for high channel utilization, or channel utilization with no clients (co-channel interference). Here we can see co-channel interference because AP9120A and AP9120B are both on channel 40.

An example of an issue identified by checking these RF KPIs was a customer having client performance issues. When checked, the radio load in 5 GHz was quite high even when there were few or no clients connected. We dug further and found that the load was not due to transmitted or received data but to co-channel interference. When analyzing the channels assigned to the APs with high load, we found that only 4 channels were assigned to those APs due to a config issue in the RF profile. After adding more channels to the RF profile, channel utilization decreased and no further performance issues were reported.
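Radios that are busy without serving clients can be picked out of a saved “show ap dot11 5ghz load-info” output. This is a minimal sketch, assuming the five-column layout shown above; the function names and the 10% utilization threshold are hypothetical.

```python
def parse_load_info(text):
    """Return one dict per radio row of 'show ap dot11 <band> load-info' text."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Assumption: data rows have exactly five fields, the last three numeric.
        if len(parts) == 5 and all(p.isdigit() for p in parts[2:]):
            ap, mac, slot, util, clients = parts
            rows.append({"ap": ap, "mac": mac, "slot": int(slot),
                         "util": int(util), "clients": int(clients)})
    return rows


def busy_without_clients(rows, min_util=10, max_clients=0):
    """Radios at or above `min_util` percent utilization with at most `max_clients`."""
    return [
        (r["ap"], r["slot"], r["util"])
        for r in rows
        if r["util"] >= min_util and r["clients"] <= max_clients
    ]
```

With the sample output above, this flags AP9120B, which shows 11% utilization with no clients.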

For more detailed RF analysis, you can use the Wireless Config Analyzer Express (WCAE) tool: https://developer.cisco.com/docs/wireless-troubleshooting-tools/#wireless-config-analyzer-express

WCAE will show you the distribution of channels, TXpower, RF metrics per AP, and more details.

With the provided methodology and commands, you can proactively identify whether there are any issues with your WLC’s APs and RF. In the next blog, we will share 9800 WLC KPIs to check client connectivity and WLC drops/punted packets.

List of commands to use for KPIs and automation scripts

In the document below, there is also a link to a script that will automatically collect all the commands. It will collect commands based on platform and release, save them in a file, and export the file. The script uses the “Guest-shell” feature, which for now is only available on the physical WLCs 9800-40/80 and 9800-L.

The document also provides an example of an EEM script to collect logs periodically. In conclusion, EEM along with the “Guest-shell” script will help to collect 9800 WLC KPIs and have a baseline for your Catalyst 9800 WLC.

 

For the list of commands used to monitor those KPIs
