Wireless Catalyst 9800 WLC KPIs, Part 1


Part 1 of the 3-part series on Wireless Catalyst 9800 WLC KPIs

When working in critical wireless infrastructures, it is important to be proactive and determine in advance whether any potential issue could impact the end-client experience. Wireless Catalyst 9800 WLC KPIs help with that task.

In this blog, I will share a systematic approach, plus a list of commands, that I have used while providing support in the NOC for one of the largest wireless events worldwide. The idea is to keep a close eye on the Key Performance Indicators (KPIs) of the Catalyst 9800 WLC.

KPI outputs can be collected periodically to create a baseline while the network is working fine. This makes it easier to find any deviation later by comparing new outputs with the previously collected ones.
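As a minimal sketch of that baseline comparison, the Python snippet below diffs a previously saved command output against a fresh one using the standard `difflib` module. The function name `diff_against_baseline` is hypothetical, and how the outputs are collected (SSH, script, copy and paste) is left out on purpose.

```python
import difflib


def diff_against_baseline(baseline: str, current: str) -> list[str]:
    """Return the lines removed (-) or added (+) since the baseline capture."""
    diff = list(
        difflib.unified_diff(
            baseline.splitlines(),
            current.splitlines(),
            fromfile="baseline",
            tofile="current",
            lineterm="",
            n=0,  # no context lines, only the changes
        )
    )
    # The first two lines are the '---'/'+++' file headers; '@@' lines mark hunks.
    return [line for line in diff[2:] if not line.startswith("@@")]
```

Lines that change on every run, such as uptime or counters, will always show up in a raw diff, so in practice it is usually better to compare parsed values for those.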

I have divided WLC KPIs into six different buckets or areas:

  • WLC checks
  • Connection with other devices
  • AP checks
  • RF checks
  • Client checks
  • Packet Drops

KPIs will help us spot issues in any of these six areas. This blog covers WLC checks and Connection with other devices. Two more blogs will cover AP checks, RF checks, Client checks, and Packet Drops.

WLC checks

I usually start by checking the WLC first, since it is the most critical part. If any issues are seen in the controller, they will cascade shortly after as problems with APs and clients. In other words, the idea here is to follow a top-down approach.

While reviewing the health state of the WLC, I first confirm that the WLC is running the intended version and is in install mode. Install mode ensures that the controller boots faster, with a reduced memory footprint. After that, I check the uptime of the WLC to see whether any reload has occurred. Use the command: “show version | i uptime|Installation mode|Cisco IOS Software”

Gladius1#show version | i uptime|Installation mode|Cisco IOS Software
Cisco IOS Software [Amsterdam], C9800 Software (C9800_IOSXE-K9), Version 17.3.5a, RELEASE SOFTWARE (fc2)
Gladius1 uptime is 2 weeks, 5 days, 21 hours, 30 minutes
Installation mode is INSTALL

Check expected release, uptime, and WLC running in install mode.
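For automation, the three values can be pulled out of the filtered output with a few regular expressions. This is a sketch that assumes the output looks like the capture above; `parse_show_version` and `check_version` are hypothetical helper names, and the expected release is whatever you intend to run.

```python
import re


def parse_show_version(output: str) -> dict:
    """Extract release, uptime and installation mode from the filtered output."""
    version = re.search(r"Version (\S+),", output)
    uptime = re.search(r"uptime is (.+)", output)
    mode = re.search(r"Installation mode is (\S+)", output)
    return {
        "version": version.group(1) if version else None,
        "uptime": uptime.group(1).strip() if uptime else None,
        "install_mode": mode.group(1) if mode else None,
    }


def check_version(info: dict, expected_version: str) -> list[str]:
    """Return human-readable findings; an empty list means nothing to report."""
    findings = []
    if info["version"] != expected_version:
        findings.append(f"version is {info['version']}, expected {expected_version}")
    if info["install_mode"] != "INSTALL":
        findings.append(f"installation mode is {info['install_mode']}")
    return findings
```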

For Catalyst 9800 WLCs deployed in High Availability (HA), which is highly recommended for critical deployments, we first need to verify that the HA pair is formed and that the standby is in a standby-hot state. Secondly, check the stack uptime and each member’s individual uptime. Thirdly, identify the number of switchovers between active and standby. Use the command: “show redundancy | i ptime|Location|Current Software state|Switchovers”.

Gladius1#show redundancy | i ptime|Location|Current Software state|Switchovers
       Available system uptime = 2 weeks, 1 day, 2 hours, 48 minutes
Switchovers system experienced = 1
               Active Location = slot 1
        Current Software state = ACTIVE
       Uptime in current state = 7 hours, 10 minutes
              Standby Location = slot 2
        Current Software state = STANDBY HOT
       Uptime in current state = 7 hours, 4 minutes

Check stack uptime, number of switchovers, and uptime for members. A switchover occurred 7 hours ago. Slot 1 is the new active and slot 2 reloaded.
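The same idea applies to the redundancy output. The sketch below, which assumes the field labels shown in the capture above, extracts the switchover count and the software states so they can be compared with the baseline. The function names are hypothetical.

```python
import re


def parse_redundancy(output: str) -> dict:
    """Extract the switchover count and the software state of each member."""
    switchovers = re.search(r"Switchovers system experienced\s*=\s*(\d+)", output)
    states = re.findall(r"Current Software state\s*=\s*(.+)", output)
    return {
        "switchovers": int(switchovers.group(1)) if switchovers else None,
        "states": [state.strip() for state in states],
    }


def standby_is_hot(info: dict) -> bool:
    """True when one of the members reports STANDBY HOT."""
    return "STANDBY HOT" in info["states"]
```

A switchover count higher than the one stored in the baseline is the signal to go and look for cores or system reports.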

In HA deployments, the recommendation is to use the RMI (Redundancy Management Interface) feature. This allows monitoring of active and standby through both the Wireless Management Interface (WMI) and the Redundancy Port (RP). After that, we should enable the Default-gateway Check to confirm that both active and standby can reach the gateway. Here is a link to the 9800 High Availability deployment guide.

The next step is to check whether there have been any WLC crashes, and to determine whether a crash matches the time of a switchover or an unexpected reload. When a WLC crash occurs, it should generate a core dump or a system report. Those files are stored in the WLC harddisk for 9800-40/80, or in bootflash for 9800-L/CL. Use the commands: “dir harddisk:/core/ | i core|system-report” and “dir stby-harddisk:/core/ | i core|system-report”, replacing harddisk with bootflash for 9800-L/CL.

Gladius1#dir harddisk:/core/ | i core|system-report
Directory of harddisk:/core/
3661831  -rw-         11260562  Mar 25 2022 22:07:12 +01:00  Gladius1_1_RP_0_wncd_16574_20220325-220708-CET.core.gz
3661830  -rw-            48528  Mar 25 2022 21:57:20 +01:00  Gladius1_1_RP_0-system-report_20220325-215658-CET-info.txt
3661829  -rw-        126548098  Mar 25 2022 21:57:10 +01:00  Gladius1_1_RP_0-system-report_20220325-215658-CET.tar.gz
3661828  -rw-            57191   Mar 9 2021 16:21:48 +01:00  Gladius1_1_RP_0-system-report_20210309-161907-CET-info.txt
3661827  -rw-        504311304   Mar 9 2021 16:20:51 +01:00  Gladius1_1_RP_0-system-report_20210309-161907-CET.tar.gz
3661826  -rw-         11714625  Nov 19 2020 10:35:54 +01:00  Gladius1_1_RP_0_wncd_30240_20201119-103550-CET.core.gz

Check for cores and system reports. Two cores in the wncd process and two system reports have occurred.

If we observe any core dump, we can identify the impacted process by checking the file name. For example, WLC_1_RP_0_wncd_16574_20220325-220708-CET.core.gz indicates a crash in the “wncd” process, and WLC_1_RP_0_dbm_14119_20201104-092800-CET.core.gz indicates a crash in the “dbm” process. Open a TAC case to identify the root cause of the crash.
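A small helper can extract the process name from a core file name. The naming pattern used here (hostname, chassis, RP slot, process, PID, timestamp) is inferred only from the example file names in this blog, so treat it as an assumption rather than a documented format. `core_process` is a hypothetical name.

```python
import re
from typing import Optional

# Pattern inferred from the sample names: ..._RP_<n>_<process>_<pid>_<date>-<time>-<tz>.core.gz
CORE_NAME = re.compile(
    r"_RP_\d+_(?P<process>.+)_(?P<pid>\d+)_(?P<date>\d{8})-(?P<time>\d{6})-\w+\.core\.gz$"
)


def core_process(filename: str) -> Optional[str]:
    """Return the crashed process name, or None if the name is not a core file."""
    match = CORE_NAME.search(filename)
    return match.group("process") if match else None
```

System report files do not follow this pattern, so the helper returns None for them.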

Once we have checked for crashes or unexpected reloads, we can continue by reviewing WLC CPU and memory utilization. For CPU monitoring, we need to run the command several times to detect whether any process shows CPU above 80% consistently, and not just as a spike. I prefer to execute the command with the sorted keyword, so that processes with high CPU are listed first. We have seen cases where consistently high CPU in the WNCD process led to AP disconnections. However, releases 17.3.5 and 17.6.3 have received additional hardening, with the objective of protecting AP CAPWAP connections if high CPU occurs. Use the command: “show processes cpu platform sorted | ex 0%      0%      0%”

Gladius1#show processes cpu platform sorted | ex 0%      0%      0%
CPU utilization for five seconds:  14%, one minute:  16%, five minutes:  16%
Core 0: CPU utilization for five seconds: 10%, one minute:  7%, five minutes: 11%
Core 1: CPU utilization for five seconds:  6%, one minute: 28%, five minutes: 12%
Core 2: CPU utilization for five seconds: 48%, one minute: 55%, five minutes: 68%
Core 3: CPU utilization for five seconds: 20%, one minute:  8%, five minutes: 11%
Core 4: CPU utilization for five seconds: 38%, one minute: 13%, five minutes: 17%
Core 5: CPU utilization for five seconds: 14%, one minute: 11%, five minutes: 13%
Core 6: CPU utilization for five seconds:  9%, one minute: 20%, five minutes: 23%
Core 7: CPU utilization for five seconds:  5%, one minute:  8%, five minutes: 18%
Core 8: CPU utilization for five seconds:  7%, one minute: 50%, five minutes: 34%
Core 9: CPU utilization for five seconds: 100%, one minute: 58%, five minutes: 27%
Core 10: CPU utilization for five seconds: 27%, one minute: 17%, five minutes: 25%
   Pid    PPid    5Sec    1Min    5Min  Status        Size  Name                 
--------------------------------------------------------------------------------
 19056   19037     99%     99%     99%  R          7525896  wncd_0               
 21922   21913     96%     97%     99%  R           127488  smand                
 19460   19451     37%     34%     33%  R          6363828  wncd_2               
 19604   19596     18%     19%     18%  R          4556132  wncd_3

Check CPU utilization per core and per process. Processes wncd_0 and smand are running at close to 100% CPU utilization.
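To separate a sustained high CPU from a spike, several captures of the command can be compared. The sketch below assumes the process table layout shown above and flags processes whose 5-second value exceeds the threshold in every capture. The function names and the idea of passing a list of captures are illustrative assumptions.

```python
import re

# Pid, PPid, 5Sec, 1Min, 5Min, Status, Size, Name
CPU_ROW = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\d+)%\s+(\d+)%\s+(\d+)%\s+\S+\s+\d+\s+(\S+)"
)


def parse_cpu(output: str) -> dict:
    """Map process name to its (5 sec, 1 min, 5 min) CPU percentages."""
    usage = {}
    for line in output.splitlines():
        match = CPU_ROW.match(line)
        if match:
            usage[match.group(6)] = tuple(int(match.group(i)) for i in (3, 4, 5))
    return usage


def sustained_high(samples: list, threshold: int = 80) -> list:
    """Processes above the threshold (5-second value) in every capture."""
    parsed = [parse_cpu(sample) for sample in samples]
    if not parsed:
        return []
    names = set(parsed[0])
    for usage in parsed[1:]:
        names &= set(usage)
    return sorted(n for n in names if all(u[n][0] > threshold for u in parsed))
```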

Catalyst 9800-CL and 9800-L platforms use CPU cores for data forwarding. Therefore, high CPU in ucode_pkt_PPE0 is expected. To evaluate data plane performance on those platforms, use the command: “show platform hardware chassis active qfp datapath utilization | i Load”

Gladius1#show platform hardware chassis active qfp datapath utilization | i Load
CPP 0: Subdev 0            5 secs        1 min        5 min       60 min
Processing: Load (pct)            4            3            4            3

Check datapath load %.

While checking memory utilization, we need to monitor whether the overall device utilization is too high. Subsequently, identify whether any process is holding memory and not releasing it over time (a leak). Use the commands: “show platform resources” (basic), “show processes memory platform sorted” and “show processes memory platform accounting” (advanced)

Gladius1#show platform resources
**State Acronym: H - Healthy, W - Warning, C - Critical
Resource                 Usage                 Max             Warning         Critical        State
----------------------------------------------------------------------------------------------------
RP0 (ok, active)                                                                               H
Control Processor       0.79%                 100%            80%             90%             H
DRAM                   4839MB(15%)           31670MB         88%             93%             H
harddisk               0MB(0%)               0MB             80%             85%             H
ESP0(ok, active)                                                                               H
QFP                                                                                           H
TCAM                   68cells(0%)           1048576cells    65%             85%             H
DRAM                   420162KB(20%)         2097152KB       85%             95%             H
IRAM                   13738KB(10%)          131072KB        85%             95%             H
CPU Utilization        0.00%                 100%            90%             95%             H

Confirm that the state is healthy for every metric. Review Control Processor and memory utilization.
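Since the State column already condenses each metric into H, W, or C, a script only needs to report rows that are not healthy. This sketch assumes columns are separated by at least two spaces, as in the capture above; `unhealthy_resources` is a hypothetical name.

```python
import re


def unhealthy_resources(output: str) -> list:
    """Return (resource, state) pairs whose state is W (warning) or C (critical)."""
    findings = []
    for line in output.splitlines():
        # Columns are separated by runs of two or more spaces.
        parts = re.split(r"\s{2,}", line.strip())
        if len(parts) >= 2 and parts[-1] in ("H", "W", "C"):
            if parts[-1] != "H":
                findings.append((parts[0], parts[-1]))
    return findings
```

Resource names such as DRAM appear under both RP0 and ESP0, which is why the result is a list of pairs rather than a dictionary.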

Gladius1#show processes memory platform sorted
System memory: 15869340K total, 6152000K used, 9717340K free,
Lowest: 9717340K
Pid    Text      Data   Stack   Dynamic       RSS              Name
----------------------------------------------------------------------
3546  367768   1404580     136       488   1404580   linux_iosd-imag
23602   22335    449968     136      1052    449968    ucode_pkt_PPE0
24525     847    437624     136     46628    437624            wncd_0
24004     160    373176    3956      6400    373176           wncmgrd
26358     128    344868     136    136628    344868         mobilityd

Check the free memory available. Identify the top processes holding the most memory.
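One rough way to look for a possible leak is to compare the RSS column across several captures taken over time. The sketch below assumes the row layout shown above and reports processes whose RSS grows in every consecutive capture. Steady growth is only a hint, not proof of a leak, and the function names are hypothetical.

```python
import re

# Pid, Text, Data, Stack, Dynamic, RSS, Name
MEM_ROW = re.compile(r"^\s*\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+(\d+)\s+(\S+)\s*$")


def rss_by_process(output: str) -> dict:
    """Map process name to its RSS value from one capture."""
    rss = {}
    for line in output.splitlines():
        match = MEM_ROW.match(line)
        if match:
            rss[match.group(2)] = int(match.group(1))
    return rss


def steadily_growing(snapshots: list) -> list:
    """Processes whose RSS increases between every pair of consecutive captures."""
    series = [rss_by_process(snapshot) for snapshot in snapshots]
    if len(series) < 2:
        return []
    common = set(series[0]).intersection(*series[1:])
    return sorted(
        name
        for name in common
        if all(a[name] < b[name] for a, b in zip(series, series[1:]))
    )
```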

Gladius1#show processes memory platform accounting
Hourly Stats
process                 callsite_ID(bytes)  max_diff_bytes   callsite_ID(calls)  max_diff_calls   tracekey                                  timestamp(UTC)
------------------------------------------------------------------------------------------------------------------------------------------------------------
cpp_cp_svr_fp_0         2887897091          7243446          2887897092          1133             1#e4bd31e0c668be2b8786dec9fcc99486        2022-05-25 14:04
ndbmand_rp_0            3571094529          5453112          3570931712          1119             1#00c5632bf072231d06cf80b8ccc37392        2022-05-09 21:52
wncd_4_rp_0             2556049411          3059712          3028615169          227              1#9f4792f37292983824f5bb97d7e2167c        2022-05-10 14:54
wncd_0_rp_0             2556049411          1990656          3028615168          680              1#9f4792f37292983824f5bb97d7e2167c        2022-05-25 11:05
wncd_2_rp_0             2556049411          1953792          3028615169          682              1#9f4792f37292983824f5bb97d7e2167c        2022-05-13 14:01
smand_rp_0              2887895047          1491984          3028615168          89               1#eaf6dd665e73b1edeee32fb9c5ac8639        2022-05-10 14:54

Check top processes and the number of calls. Stats are hourly, daily, weekly, and monthly.

As a final controller health check, we can validate the hardware. Check the status of power supplies, fans, SFPs, and temperature (only for physical WLCs). Likewise, review the license status and confirm that the right number of licenses is in use. Use the commands: “show platform”, “show inventory”, “show environment” and “show license summary | i Status:”

Gladius1#show platform
Chassis type: C9800-40-K9
Slot      Type                State                 Insert time (ago)
--------- ------------------- --------------------- -----------------
0         C9800-40-K9         ok                    2w5d
0/0      BUILT-IN-4X10G/1G   ok                    2w5d
R0        C9800-40-K9         ok, active            2w5d
F0        C9800-40-K9         ok, active            2w5d
P0        C9800-AC-750W-R     ok                    2w5d
P1        Unknown             empty                 never
P2        C9800-40-K9-FAN     ok                    2w5d

Slot      CPLD Version        Firmware Version
--------- ------------------- ---------------------------------------
0         19030712            16.10(2r)
R0        19030712            16.10(2r)
F0        19030712            16.10(2r)

Gladius1#show inventory
NAME: "Chassis 1", DESCR: "Cisco C9800-40-K9 Chassis"
PID: C9800-40-K9       , VID: V03  , SN: TTM242504SR
NAME: "Chassis 1 Power Supply Module 0", DESCR: "Cisco Catalyst 9800-40 750W AC Power Supply Reverse Air"
PID: C9800-AC-750W-R   , VID: V01  , SN: ART2418F0GJ

NAME: "Chassis 1 Fan Tray", DESCR: "Cisco C9800-40-K9 Fan Tray"
PID: C9800-40-K9-FAN   , VID:      , SN:
NAME: "module 0", DESCR: "Cisco C9800-40-K9 Modular Interface Processor"
PID: C9800-40-K9       , VID:      , SN:
NAME: "SPA subslot 0/0", DESCR: "4-port 10G/1G multirate Ethernet Port Adapter"
PID: BUILT-IN-4X10G/1G , VID: N/A  , SN: JAE87654321
NAME: "subslot 0/0 transceiver 0", DESCR: "10GE LR"
PID: SFP-10G-LR          , VID: V02  , SN: AVD2141KCFB
NAME: "module R0", DESCR: "Cisco C9800-40-K9 Route Processor"
PID: C9800-40-K9       , VID: V03  , SN: TTM242504SR
NAME: "module F0", DESCR: "Cisco C9800-40-K9 Embedded Services Processor"
PID: C9800-40-K9       , VID:      , SN:
NAME: "Crypto Asic F0/0", DESCR: "Asic 0 of module F0"
PID: NOT               , VID: V01  , SN: JAE242711XF

Gladius1#show environment
Number of Critical alarms:  0
Number of Major alarms:     0
Number of Minor alarms:     0

Check power supplies, fan status, SFPs, SPAs, and any alarms.
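The alarm counters are easy to track automatically. A minimal sketch, assuming the three summary lines look like the capture above (the function name is hypothetical):

```python
import re


def alarm_counts(output: str) -> dict:
    """Map alarm severity (critical, major, minor) to its count."""
    pairs = re.findall(r"Number of (Critical|Major|Minor) alarms:\s*(\d+)", output)
    return {severity.lower(): int(count) for severity, count in pairs}
```

Any non-zero value is worth a closer look at the full output of the command.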

An example of these Catalyst 9800 WLC KPIs helping to identify an issue was a customer facing a High Availability setup issue between two WLCs. By reviewing the version and hardware installed in both WLCs, we identified a difference in SPA adapters that was preventing the WLCs from pairing as HA.

Connection with other devices Checks

In addition to WLC health, we can check the status of the WLC’s connections. The most important connections are mobility with other WLCs for inter-WLC roams, telemetry with DNAC/PI for monitoring and automation, and NMSP with DNA-Spaces/CMX for location services. We need to ensure that those connections are established and working fine.

Confirm that mobility tunnels with other WLCs are up and using the right encryption and MTU, and that clients can roam or be anchored to another WLC. If tunnels are down, we can find out whether the issue is occurring in the control tunnel (UDP port 16666), in the data tunnel (UDP port 16667), or in both. Use the command: “show wireless mobility summary”

Gladius1#sh wireless mobility summary
Wireless Management VLAN: 25
Wireless Management IP Address: 192.168.25.25
Mobility Control Message DSCP Value: 48
Mobility Keepalive Interval/Count: 10/3
Mobility Group Name: eWLC3
Mobility Multicast Ipv4 address: 0.0.0.0
Mobility MAC Address: 001e.f62a.46ff
Mobility Domain Identifier: 0x2e47
Controllers configured in the Mobility Domain:
 IP          Public Ip    MAC Address      Group Name   Multicast IPv4    Multicast IPv6  Status  PMTU
----------------------------------------------------------------------------------------------------------
192.168.25.25  N/A          001e.f62a.46ff   eWLC3        0.0.0.0         ::              N/A     N/A
192.168.5.35  192.168.5.35  00b0.e1f2.f480   3500-2       0.0.0.0         ::              Up     1385
192.168.25.23 192.168.25.23 706d.1535.6b0b   DAO2         0.0.0.0         :: Control And Data Path Down
192.168.25.33 192.168.25.33 f4bd.9e57.ff6b   5500         0.0.0.0         ::              Up     1005

Check for mobility peers that are down and for low PMTU.
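The peer table can be scanned for both conditions. This sketch assumes IPv4 peers, group names without spaces, and that the local controller's own row shows N/A as its status, as in the capture above. The `min_pmtu` default of 1300 is an arbitrary illustration, not a Cisco recommendation, and `mobility_findings` is a hypothetical name.

```python
import re

IPV4 = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}$")


def mobility_findings(output: str, min_pmtu: int = 1300) -> list:
    """Return (peer_ip, problem) pairs for peers that are down or have low PMTU."""
    findings = []
    for line in output.splitlines():
        tokens = line.split()
        if len(tokens) < 7 or not IPV4.match(tokens[0]):
            continue
        # Columns 0-5: IP, public IP, MAC, group, multicast v4, multicast v6.
        rest = tokens[6:]
        pmtu = None
        if len(rest) > 1 and (rest[-1].isdigit() or rest[-1] == "N/A"):
            pmtu = rest.pop()
        status = " ".join(rest)
        if status == "N/A":
            continue  # assumed to be the local controller's own entry
        if status != "Up":
            findings.append((tokens[0], status))
        elif pmtu and pmtu.isdigit() and int(pmtu) < min_pmtu:
            findings.append((tokens[0], f"PMTU {pmtu}"))
    return findings
```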

If we have DNAC for Assurance or Provision, we can confirm that the DNAC Netconf connection is established. Afterward, verify that telemetry statistics for WLC, APs, and clients are updated in DNAC. Use the command: “show telemetry internal connection”. After 17.7, this command has been replaced by “show telemetry connection all”.

Gladius2#show telemetry internal connection
Load for five secs: 29%/5%; one minute: 4%; five minutes: 2%
Time source is NTP, 10:21:45.942 CET Wed Nov 4 2020
Telemetry connections
Index Peer Address               Port  VRF Source Address             State
----- -------------------------- ----- --- -------------------------- ----------
    1 192.168.0.105              25103   0 192.168.25.42              Active

Check for telemetry state
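A script can report any telemetry connection whose state is not Active. The sketch assumes the six-column layout of the “show telemetry internal connection” capture above; the output of the newer command may differ, and the function name is hypothetical.

```python
import re

# Index, Peer Address, Port, VRF, Source Address, State
TELEMETRY_ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s*$")


def inactive_telemetry_peers(output: str) -> list:
    """Return (peer_address, state) for connections that are not Active."""
    findings = []
    for line in output.splitlines():
        match = TELEMETRY_ROW.match(line)
        if match and match.group(6) != "Active":
            findings.append((match.group(2), match.group(6)))
    return findings
```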

If we are using DNA-Spaces for location, firstly we can confirm the NMSP connection status and the number of packets transmitted and received. Secondly, check the list of clients in the WLC probing database. Lastly, confirm that the client location is updated in DNA-Spaces. Use the command: “show nmsp status”

Gladius1#show nmsp status
NMSP Status
-----------
DNA Spaces/CMX IP Address  Active    Tx Echo Resp  Rx Echo Req   Tx Data     Rx Data     Transport
----------------------------------------------------------------------------------------------------------
192.168.0.65                  Active    693870        693870        16833737    181084      TLS      
192.168.0.66                  Inactive  21            21            222         7           TLS

Check for inactive servers and for a mismatch between echo tx/rx.
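Both conditions can be checked from the status table. The sketch below assumes IPv4 server addresses and the seven-column layout of the capture above; `nmsp_findings` is a hypothetical name.

```python
import re

# IP, state, Tx Echo Resp, Rx Echo Req, Tx Data, Rx Data, Transport
NMSP_ROW = re.compile(
    r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s*$"
)


def nmsp_findings(output: str) -> list:
    """Return (server_ip, problem) for inactive servers or echo mismatches."""
    findings = []
    for line in output.splitlines():
        match = NMSP_ROW.match(line)
        if not match:
            continue
        ip, state = match.group(1), match.group(2)
        tx_echo, rx_echo = int(match.group(3)), int(match.group(4))
        if state != "Active":
            findings.append((ip, f"state {state}"))
        if tx_echo != rx_echo:
            findings.append((ip, f"echo mismatch tx={tx_echo} rx={rx_echo}"))
    return findings
```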

With the checks provided, we can proactively monitor the health of our 9800 WLC and its connections with other devices like CMX/DNA-Spaces, other WLCs, and DNAC. In the next blog, we will share KPIs to monitor APs and RF.

List of commands to use for KPIs and automation scripts

In the document below, there is also a link to a script that will automatically collect all the commands. It collects commands based on platform and release, saves them in a file, and exports the file. The script uses the “Guest-shell” feature which, for now, is only available in physical WLCs 9800-40/80 and 9800-L.

The document also provides an example of an EEM script to collect logs periodically. In conclusion, EEM along with the “Guest-shell” script will help to collect 9800 WLC KPIs and build a baseline for your Catalyst 9800 WLC.

For the list of commands used to monitor those KPIs
