Wireless 9800 WLC KPI Blog – Part 3


Part 3 of the 3-part Wireless Catalyst 9800 WLC KPIs

In the previous blogs, Wireless Catalyst 9800 WLC KPIs, Part 1 and Wireless Catalyst 9800 WLC KPIs, Part 2, we shared how to check the WLC and its connections to other devices, as well as how to check AP and RF health status.

In this blog, we will focus on Key Performance Indicators for client analysis, WLC packet drops, and packets punted to the WLC CPU. I will share methodical steps and outputs that we can collect from the WLC to measure the health of client connectivity and WLC forwarding performance.

The KPIs are grouped into the following buckets or areas:

  • WLC checks
  • Connection with other devices
  • AP checks
  • RF checks
  • Client checks
  • Packet Drops

Client Checks

After we have verified AP and RF health, we can focus on client connectivity. Using “show wireless summary” we can see the total number of clients connected. In addition, we can find out if there are any excluded, disabled, or foreign/anchored clients. We can run this command periodically to check that the number of clients is within the expected values for our deployment and to identify any drastic changes in any of the values. The command also shows the number of APs, roles, radios, and their status.

Gladius1#sh wireless summary
Max APs supported          : 2000
Max clients supported      : 32000
Access Point Summary
                          Total    Up    Down
------------------------------------------
802.11 2.4GHz             1     1       0
802.11 5GHz               4     3       1
802.11 dual-band          2     2       0
802.11 rx-dual-band       0     0       0

Client Serving(2.4GHz)    3     3       0
Client Serving(5GHz)      4     3       1
Monitor                   0     0       0
Sensor                    0     0       0

Client Summary
Total Clients : 6
Excluded      : 0
Disabled      : 0
Foreign       : 0
Anchor        : 0
Local         : 6

Check the total number of clients, excluded clients, and APs with radios down.
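As a rough illustration of this check, the sketch below parses the “Client Summary” part of the output and flags excluded clients or a client total outside an expected range. The function names and the thresholds are assumptions made for this example, not part of any Cisco tool.

```python
import re

# Text copied from the "Client Summary" part of the output above.
SAMPLE = """\
Client Summary
Total Clients : 6
Excluded      : 0
Disabled      : 0
Foreign       : 0
Anchor        : 0
Local         : 6
"""

def parse_client_summary(text):
    """Return {label: count} for every 'Label : N' line."""
    counts = {}
    for line in text.splitlines():
        m = re.match(r"^\s*([A-Za-z][A-Za-z ]*?)\s*:\s*(\d+)\s*$", line)
        if m:
            counts[m.group(1)] = int(m.group(2))
    return counts

def summary_warnings(counts, expected_min, expected_max):
    """Flag totals outside the expected range (hypothetical thresholds)
    and any excluded clients."""
    warnings = []
    total = counts.get("Total Clients", 0)
    if not expected_min <= total <= expected_max:
        warnings.append(f"total clients {total} outside {expected_min}-{expected_max}")
    if counts.get("Excluded", 0) > 0:
        warnings.append(f"{counts['Excluded']} excluded clients")
    return warnings
```

Calling `summary_warnings(parse_client_summary(SAMPLE), 1, 100)` returns an empty list for the sample above, since the total is in range and no client is excluded.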

If we see excluded clients, we need to dig further to identify the reason. Determine whether the excluded clients are misconfigured or whether the exclusion is due to some other cause. Clients can be excluded because of an incorrect password, an IP address that matches another client’s IP address, multiple association failures, etc. We can see the list of client exclusion policies and their status using the command “sh wireless wps summary”.

We can break down the number of connected clients by their state in the client state machine. This helps us determine whether too many clients are stuck in transient states like Authenticating, IP Learn, Mobility, or Webauth Pending. Use the command: “show wireless stats client detail | i Authenticating         :|Mobility               :|IP Learn               :|Webauth Pending        :|Run                    :|Delete-in-Progress     :”

Gladius1#show wireless stats client detail | i Authenticating         :|Mobility               :|IP Learn               :|Webauth Pending        :|Run                    :|Delete-in-Progress     :
Authenticating         : 0
Mobility               : 0
IP Learn               : 1
Webauth Pending        : 0
Run                    : 5
Delete-in-Progress     : 0

Check for clients in transient states. In this case, we see one client in the IP Learn state.

We need to investigate further if the number of clients in transient states is not decreasing, or if most of those clients remain in the same transient state for a long period of time.
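One way to apply this rule across several collections of the output is sketched below. It treats a transient state as “stuck” when its count never drops between consecutive snapshots and stays at or above a minimum. The helper names and the `min_clients` threshold are assumptions for illustration.

```python
import re

# Transient states named in the text above.
TRANSIENT = ("Authenticating", "Mobility", "IP Learn", "Webauth Pending")

def parse_states(text):
    """Return {state: count} for every 'State : N' line."""
    states = {}
    for line in text.splitlines():
        m = re.match(r"^\s*(.+?)\s*:\s*(\d+)\s*$", line)
        if m:
            states[m.group(1)] = int(m.group(2))
    return states

def stuck_states(snapshots, min_clients=1):
    """Return transient states whose count never decreases across the
    snapshots and is at least min_clients in every one of them."""
    stuck = []
    for name in TRANSIENT:
        series = [snap.get(name, 0) for snap in snapshots]
        non_decreasing = all(b >= a for a, b in zip(series, series[1:]))
        if len(series) >= 2 and min(series) >= min_clients and non_decreasing:
            stuck.append(name)
    return stuck
```

The snapshots are expected in chronological order, one parsed dictionary per collection of the command.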

One example is a high number of clients stuck in “IP Learn”. In that case, we should review the DHCP server status and the connectivity between the WLC and the DHCP server. For deployments that allow static IP addresses, we can review ARP forwarding.

Another example is a high number of clients stuck in “Webauth Pending”. Several things can cause this. One is that the web page redirect is not received by, or not accessible to, the clients. Another is authentication failures during web login on guest SSIDs.

The last example is a large number of clients stuck in “Authenticating”. If clients connected to dot1x SSIDs have authentication issues, we should review the Radius server. We need to determine whether the issue occurs with one specific Radius server or with several servers at the same time. Later in this section, I describe how to verify Radius server status.

We can also review client delete reasons and identify any unexpected reason with counters increasing. “Idle timeout” or “Session timeout” would be expected reasons for clients to disconnect. However, “DOT11 denied data rates” or “MIC validation failed” would be unexpected and may require some further analysis. Use the command: “show wireless stats client delete reasons | e :_0”

Gladius1#show wireless stats client delete reasons | e :_0
Total client delete reasons
---------------------------
Controller deletes
---------------------------
Due to mobility failure                                         : 1
DOT11 denied data rates                                         : 5781192
L2-AUTH connection timeout                                      : 2
IP-LEARN connection timeout                                     : 968
Mobility peer delete                                            : 134
----------------------------
Informational Delete Reason
-----------------------------
AP down/disjoin                                                 : 690
Session timeout                                                 : 661
-----------------------------
Client initiate delete
-----------------------------
AP Deletes
-----------------------------
AP initiated delete for DHCP timeout                            : 1
AP initiated delete for reassociation timeout                   : 266

Check for unexpected delete reasons with a high and increasing count. In this case, “DOT11 denied data rates”.
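To see which unexpected reasons are actually increasing, two collections of the output can be compared. The sketch below is one possible approach. The set of expected reasons only contains the two named in the text, and the function names are assumptions. Because “| e :_0” hides zero counters, a reason missing from the earlier snapshot is treated as zero.

```python
import re

# Reasons the text describes as expected; extend for your deployment.
EXPECTED = {"Idle timeout", "Session timeout"}

def parse_counters(text):
    """Return {reason: count} for every 'Reason : N' line."""
    counters = {}
    for line in text.splitlines():
        m = re.match(r"^\s*(\S.*?)\s*:\s*(\d+)\s*$", line)
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters

def growing_unexpected(before, after, expected=EXPECTED):
    """Return (reason, delta) pairs for unexpected reasons whose counter
    increased between the two snapshots, largest increase first."""
    deltas = []
    for reason, count in after.items():
        delta = count - before.get(reason, 0)
        if reason not in expected and delta > 0:
            deltas.append((reason, delta))
    return sorted(deltas, key=lambda item: item[1], reverse=True)
```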

At one of the largest wireless events worldwide, we monitored delete reasons, excluding the ones showing zero hits. We spotted a delete reason that was consistently increasing over time. Using always-on tracing, we found that the clients deleted for that reason were all connecting to one specific SSID. Reviewing the SSID configuration, we isolated a configuration mistake that was causing the disconnections. After correcting the configuration, no further client deletes for that unexpected reason were seen. We proactively spotted an issue, found the root cause, and fixed it, all without having to wait for end clients to complain before starting the troubleshooting process.

The WLC also keeps a list of predefined possible failures with counters. We can check these counters to identify potential issues and detect them proactively. Use the command: “show wireless stats trace-on-failure | ex :_0”

Gladius1#show wireless stats trace-on-failure | ex :_0
----------------------------------------------------------
Wireless Trace On Failure Statistics
----------------------------------------------------------
006. Export client MM....................................: 1
018. Capwap configuration status failure.................: 46136
020. Client association failure..........................: 5
021. Client MAB authentication failure...................: 5781677
023. Client stage timeout................................: 1642
025. Client mobility clean up............................: 1
027. DTLS handshake failure..............................: 2
030. DTLS no configuration packet drop...................: 5
032. DTLS invalid hello packet drop......................: 168
034. SANET AUTHC failure.................................: 6

Check for failures with high count and increasing. In this case, MAB authentication failures.
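The trace-on-failure lines use dotted leaders between the failure name and its counter, so they need a slightly different pattern than the other outputs. The following sketch ranks failures by count. The regular expression and the function name are assumptions based only on the sample output above.

```python
import re

# Matches e.g. "021. Client MAB authentication failure......: 5781677"
LINE = re.compile(r"^(\d{3})\.\s+(.*?)\.*:\s*(\d+)\s*$")

def top_failures(text, n=3):
    """Return the n failures with the highest counters as
    (id, name, count) tuples, highest first."""
    failures = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            failures.append((m.group(1), m.group(2), int(m.group(3))))
    return sorted(failures, key=lambda item: item[2], reverse=True)[:n]
```

Running it on two collections taken some time apart shows whether the top entries are still increasing.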

If we are using dot1x and Radius servers, we need to monitor the status of the Radius servers. IOS-XE uses dead-time and dead criteria to determine the status of a Radius server. These parameters allow the device to identify a Radius server that is not responding to requests and to switch over to a secondary Radius server. The server is declared dead once the dead criteria are met. The dead criteria specify the number of tries that must fail and the time with no response from the server. Both criteria must be met to declare the server dead. The server remains in dead status until the dead-time expires.
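The logic described above can be modeled in a few lines. This is a simplified sketch of the description in this paragraph, not the IOS-XE implementation. The class, field names, and example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DeadCriteria:
    tries: int         # number of failed tries required
    time_s: float      # seconds with no response required
    deadtime_s: float  # how long the server stays marked dead

def is_dead(criteria, failed_tries, seconds_without_response):
    """Both criteria must be met before the server is declared dead."""
    return (failed_tries >= criteria.tries
            and seconds_without_response >= criteria.time_s)

def is_still_dead(criteria, seconds_since_marked_dead):
    """The server stays dead until the dead-time expires."""
    return seconds_since_marked_dead < criteria.deadtime_s

# Hypothetical values, for illustration only.
example = DeadCriteria(tries=3, time_s=5, deadtime_s=300)
```

With these example values, three failed tries after only two seconds of silence do not mark the server dead, because the time criterion is not yet met.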

We can check whether any server is dead at this moment and how many times a server has been declared dead. This helps us diagnose issues with a specific Radius server caused by loss of connectivity or by misbehavior of the Radius server or the WLC. Use the command: “show aaa servers | i Platform Dead: total|RADIUS: id”

Gladius1#show aaa servers | i Platform Dead: total|RADIUS: id
RADIUS: id 1, priority 1, host 192.168.0.98, auth-port 1645, acct-port 1646, hostname ISE
SMD Platform Dead: total time 301s, count 2
Platform Dead: total time 179s, count 10UP
RADIUS: id 2, priority 2, host 192.168.0.99, auth-port 1812, acct-port 1813, hostname ISE3
SMD Platform Dead: total time 0s, count 0
Platform Dead: total time 0s, count 0

Check for platform dead time and count to identify Radius servers that had issues.
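A small parser can turn this filtered output into a per-server view. The sketch below keeps, for each server, the largest dead time and dead count seen on any of its “Platform Dead” lines. That aggregation choice and the function name are assumptions for this example.

```python
import re

def dead_servers(text):
    """Return {host: {"time_s": ..., "count": ...}} for servers that were
    declared dead at least once, based on the filtered output above."""
    results = {}
    host = None
    for line in text.splitlines():
        m = re.search(r"RADIUS: id \d+, .*?host (\S+?),", line)
        if m:
            host = m.group(1)
            results[host] = {"time_s": 0, "count": 0}
            continue
        m = re.search(r"Platform Dead: total time (\d+)s, count (\d+)", line)
        if m and host:
            entry = results[host]
            entry["time_s"] = max(entry["time_s"], int(m.group(1)))
            entry["count"] = max(entry["count"], int(m.group(2)))
    return {h: v for h, v in results.items() if v["count"] > 0}
```

For the sample output, only 192.168.0.98 is returned, since the second server shows zero dead time and a zero count.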

Radius status is displayed per WNCD. It is possible for the same Radius server to be marked as dead for some WNCDs and alive for others. Each AP belongs to a WNCD. The command “show wireless load-balance ap affinity WNCD <0-7>” shows the APs assigned to each WNCD. If clients connected to APs in one specific WNCD send Radius requests and those requests get no response, the Radius status for that WNCD will be DEAD. At the same time, clients in other WNCDs might not be sending any Radius requests, or might be getting responses, so the server stays alive for those WNCDs.

For a Radius server marked as DEAD, we need to check whether the server is reachable and replying to authentication and accounting requests. Radius statistics help us identify whether we are missing any responses for authentication or accounting, and show the average time to reply, the number of access rejects and accepts, and the latency distribution. Use the command: “show radius statistics”

Gladius1#show radius statistics
Auth.      Acct.       Both
Maximum inQ length:         NA         NA          1
Maximum waitQ length:         NA         NA         14
Maximum doneQ length:         NA         NA          1
Total responses seen:        279          0        279
Packets with responses:        279          0        279
Packets without responses:          0        396        396
Access Rejects           :          2
Access Accepts           :         20
Average response delay(ms):         10          0         10
Maximum response delay(ms):        173          0        173
Number of Radius timeouts:          0       4542       4542
Duplicate ID detects:          0          0          0
Buffer Allocation Failures:          0          0          0
Maximum Buffer Size (bytes):        764        780        780
Malformed Responses        :          0          0          0
Bad Authenticators         :          0          0          0
Unknown Responses          :          0          0          0
Source Port Range: (2 ports only)
1645 - 1646
Last used Source Port/Identifier:
1645/0
1646/3
Elapsed time since counters last cleared: 3w3d20h41m
Radius Latency Distribution:
<= 2ms :        181          0
3-5ms  :         32          0
5-10ms :         13          0
10-20ms:         14          0
20-50ms:         17          0
50-100m:         20          0
>100ms :          2          0

Check for requests without a response, timeouts, and high latency.
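To quantify missing responses, the three-column rows (Auth., Acct., Both) can be parsed and turned into a ratio of unanswered packets per column. This is a minimal sketch. The function names are assumptions, and the column order is taken from the header of the output above.

```python
import re

AUTH, ACCT, BOTH = 0, 1, 2  # column order in the output header

def parse_three_columns(text):
    """Return {label: (auth, acct, both)} for rows with three values.
    'NA' is converted to None."""
    rows = {}
    pattern = r"^\s*(.+?)\s*:\s+(\d+|NA)\s+(\d+|NA)\s+(\d+|NA)\s*$"
    for line in text.splitlines():
        m = re.match(pattern, line)
        if m:
            values = m.groups()[1:]
            rows[m.group(1)] = tuple(None if v == "NA" else int(v) for v in values)
    return rows

def unanswered_ratio(rows, column):
    """Fraction of packets that never received a response."""
    answered = rows["Packets with responses"][column]
    unanswered = rows["Packets without responses"][column]
    total = answered + unanswered
    return unanswered / total if total else 0.0
```

On the sample output, the ratio is 0.0 for authentication and 1.0 for accounting, which points to accounting requests going unanswered.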

With one customer, we were troubleshooting dot1x client connectivity issues and found that the failures were caused by the Radius server being marked as dead. Reviewing the outputs, we could see that the Radius server was replying to authentication requests but was no longer replying to accounting packets. As a workaround to minimize impact, we disabled the accounting list so that the WLC stopped sending accounting packets while the Radius administrators troubleshot the accounting issue on the server.

Packet Drops and Punted-to-CPU Checks

Now we can check whether there are any scalability issues due to oversubscription of any of the WLC components. I would start by looking at the volume of traffic received and transmitted by the physical interfaces, and then review the number of broadcasts/multicasts and the input and output drops. If we have a baseline, we can compare the current volume of traffic with it and look for discrepancies. Use the command: “show int po1 | i line protocol|put rate|drops|broadcast”. Replace po1 with the physical or logical interface used in your setup.

Gladius1#show int po1 | i line protocol|put rate|drops|broadcast
Port-channel1 is up, line protocol is up
  Input queue: 0/375/0/0 (size/max/drops/flushes); Total output drops: 0
  5 minute input rate 39000 bits/sec, 42 packets/sec
  5 minute output rate 14000 bits/sec, 12 packets/sec
     Received 9389675 broadcasts (34521510 multicasts)
     Output 45735 broadcasts (1075205 multicasts)
     0 unknown protocol drops

Check the volume of input/output traffic, drops, and broadcasts transmitted/received.
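The baseline comparison can be sketched as follows: extract the 5-minute rates and flag a rate that deviates from the baseline by more than a chosen fraction. The 50% tolerance and the function names are assumptions for illustration. Pick a tolerance that fits your own deployment.

```python
import re

def parse_rates(text):
    """Return {"input": {"bps": ..., "pps": ...}, "output": {...}}."""
    rates = {}
    pattern = r"5 minute (input|output) rate (\d+) bits/sec, (\d+) packets/sec"
    for direction, bps, pps in re.findall(pattern, text):
        rates[direction] = {"bps": int(bps), "pps": int(pps)}
    return rates

def deviates(current, baseline, tolerance=0.5):
    """True if current differs from baseline by more than the given
    fraction (hypothetical default of 50%)."""
    if baseline == 0:
        return current > 0
    return abs(current - baseline) / baseline > tolerance
```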

We can review the packets dropped by the WLC and the reasons for those drops. When monitoring drops, it is important to identify which reasons account for a high volume of packet drops and then find out how fast those drop counters are increasing. To do that, we need to collect the same output several times with a time reference. Enabling “terminal exec prompt timestamps” or collecting “show clock” gives us those time references. The time-referenced outputs are key to isolating the drops that have an impact. Use the command: “show platform hardware chassis active qfp statistics drop”

Gladius1#show platform hardware chassis active qfp statistics drop
Last clearing of QFP drops statistics : never
-------------------------------------------------------------------------
Global Drop Stats                         Packets                  Octets 
-------------------------------------------------------------------------
CGACLDrop                                      31                    7812 
Disabled                                      635                  105934 
InvL2Hdr                                      701                  206223 
IpFormatErr                                    68                    4488 
Ipv4NoAdj                                   67749                 6910538 
Ipv4NoRoute                                     6                     376 
Ipv6NoRoute                                  1096                   61376 
Ipv6mcNoRoute                               77683                 9477326 
SWPortMacConflict                           50316                 5874782 
SwitchL2mLookupMiss                         17568                 6681680 
TailDrop                                    54199                29501684 
UnconfiguredIpv4Fia                             3                     242 
UnconfiguredIpv6Fia                       1564372               186850863 
WlsCapwapError                               1018                  233293 
WlsCapwapReassFragConsume                    1064                 1231968 
WlsClientError                               3116                  112631

Check for drop reasons with a high number of packets, and fragmentation/reassembly drops.
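Given two time-referenced collections of this output, the drop rate per reason can be computed as the counter difference divided by the elapsed time. The sketch below assumes the caller supplies the two timestamps as `datetime` objects (for example, taken from “show clock”). The function names are hypothetical.

```python
import re
from datetime import datetime

def parse_drops(text):
    """Return {reason: packets} from 'Reason  Packets  Octets' rows."""
    drops = {}
    for line in text.splitlines():
        m = re.match(r"^\s*(\S+)\s+(\d+)\s+(\d+)\s*$", line)
        if m:
            drops[m.group(1)] = int(m.group(2))
    return drops

def drop_rates(first, t_first, second, t_second):
    """Return {reason: packets per second} between two snapshots."""
    elapsed = (t_second - t_first).total_seconds()
    if elapsed <= 0:
        raise ValueError("second snapshot must be later than the first")
    return {name: (count - first.get(name, 0)) / elapsed
            for name, count in second.items()}
```

Sorting the result by rate highlights which drop reasons are growing fastest, which is usually more telling than the absolute counters accumulated since the last clear.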

One more check that we should do is to analyze the number of packets sent to the control plane (punted) of the WLC for processing. We can monitor the number of packets punted for each reason and check for abnormal volume.  We can correlate an increase of punted packets with high CPU utilization events. Use the command: “show platform hardware chassis active qfp feature wireless punt statistics”

Gladius1#show platform hardware chassis active qfp feature wireless punt statistics
CPP Wireless Punt stats:
                                 App Tag     Packet Count
                                 -------     ------------
         CAPWAP_PKT_TYPE_DOT11_PROBE_REQ           986190
              CAPWAP_PKT_TYPE_DOT11_MGMT            10031
              CAPWAP_PKT_TYPE_DOT11_IAPP          2975298
             CAPWAP_PKT_TYPE_DOT11_DOT1X            24901
        CAPWAP_PKT_TYPE_CAPWAP_KEEPALIVE           228099
            CAPWAP_PKT_TYPE_CAPWAP_CNTRL          1628480
         CAPWAP_PKT_TYPE_CAPWAP_DATA_PAT               33
          CAPWAP_PKT_TYPE_MOBILITY_CNTRL            58091
                       SISF_PKT_TYPE_ARP        218545290
                      SISF_PKT_TYPE_DHCP            15455
                     SISF_PKT_TYPE_DHCP6             7772
                   SISF_PKT_TYPE_IPV6_ND           199108
                SISF_PKT_TYPE_DATA_GLEAN                7
             SISF_PKT_TYPE_DATA_GLEAN_V6              100

Check for a high number of punted packets that keeps increasing over time.
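To spot an abnormal volume for one punt reason, it can help to express each counter as a share of all punted packets. The following sketch does that for the output above. The function name is an assumption, and what counts as “abnormal” depends on your own baseline.

```python
import re

def punt_shares(text):
    """Return (tag, count, share_of_total) tuples, largest count first."""
    counts = []
    for line in text.splitlines():
        m = re.match(r"^\s*([A-Z0-9_]+)\s+(\d+)\s*$", line)
        if m:
            counts.append((m.group(1), int(m.group(2))))
    total = sum(count for _, count in counts)
    if total == 0:
        return []
    ranked = sorted(counts, key=lambda item: item[1], reverse=True)
    return [(tag, count, count / total) for tag, count in ranked]
```

In the sample output, SISF_PKT_TYPE_ARP accounts for the large majority of punted packets, so it would be the first counter to track over time.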

We can also check for buffer failures and determine the size of the buffers that are reaching their maximum value. Use the command: “show buffers | i buffers|failures”

Gladius1#show buffers | i buffers|failures
Small buffers, 104 bytes (total 1200, permanent 1200):
     0 failures (0 no memory)
Middle buffers, 600 bytes (total 900, permanent 900):
     35 failures (35 no memory)
Big buffers, 1536 bytes (total 900, permanent 900, peak 901 @ 2w6d):
     0 failures (0 no memory)
VeryBig buffers, 4520 bytes (total 100, permanent 100, peak 101 @ 2w6d):
     0 failures (0 no memory)
Large buffers, 5024 bytes (total 100, permanent 100, peak 101 @ 2w6d):
     0 failures (0 no memory)
VeryLarge buffers, 8304 bytes (total 100, permanent 100):
     0 failures (0 no memory)
Huge buffers, 18024 bytes (total 20, permanent 20, peak 21 @ 2w6d):
     0 failures (0 no memory)

Check for buffer failures and identify buffer size.
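Because each pool header is followed by its failure line, the two can be paired while parsing. This sketch returns only the pools that report failures, together with their buffer size. The function name and the dictionary keys are assumptions for this example.

```python
import re

def buffer_failures(text):
    """Return the buffer pools that report at least one failure."""
    failing = []
    pool = None
    for line in text.splitlines():
        m = re.match(r"^\s*(\w+) buffers, (\d+) bytes", line)
        if m:
            pool = (m.group(1), int(m.group(2)))
            continue
        m = re.match(r"^\s*(\d+) failures \((\d+) no memory\)", line)
        if m and pool and int(m.group(1)) > 0:
            failing.append({
                "pool": pool[0],
                "size_bytes": pool[1],
                "failures": int(m.group(1)),
                "no_memory": int(m.group(2)),
            })
    return failing
```

For the sample output, only the Middle pool (600 bytes) is returned, with 35 failures.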

The last check is data plane utilization. We can find out whether the WLC has data plane performance issues due to traffic volume or to specific features being enabled. Use the command shared in the WLC checks: “show platform hardware chassis active qfp datapath utilization | i Load”

These KPIs helped identify a customer issue. The customer observed a periodic sharp increase in the number of ARP packets punted to the CPU. By monitoring the counter for ARPs punted to the CPU and collecting a packet capture in the control plane, we could identify that those ARPs were sent from a few specific MAC addresses that were doing malicious ARP scanning.

With this final bucket, we finish the Key Performance Indicators (KPIs) for Catalyst 9800 WLC.

List of commands to use for KPIs and automation scripts

In the document below, there is also a link to a script that automatically collects all the commands. It collects commands based on platform and release, saves them in a file, and exports the file. The script uses the “Guest-shell” feature, which for now is only available on the physical WLCs 9800-40/80 and 9800-L.

The document also provides an example of an EEM script to collect logs periodically. In conclusion, EEM together with the “Guest-shell” script helps to collect 9800 WLC KPIs and build a baseline for your Catalyst 9800 WLC.

 

For the list of commands used to monitor those KPIs

 
