vRealize Operations Capacity Shows 100% Cluster Utilisation
Recently we were examining a vSphere cluster where vRealize Operations Manager was showing 100% CPU utilisation, with zero capacity remaining. However, the usage of all resources in the cluster was generally low. We know that the cluster capacity is based on demand rather than usage. CPU demand is the amount of CPU resources a virtual machine would use if there were no CPU contention or limit. Sometimes, this can cause a little confusion when we look at the utilisation metrics of the cluster.
This type of behaviour is actually expected because of how vRealize Operations interprets the data. When virtual machines have latency sensitivity set to high, all of the CPU is requested by the virtual machine in order to reserve it. Since vRealize Operations Manager cannot differentiate between latency sensitivity reservations and legitimate CPU requests, we see CPU and/or memory contention alerts. More information can be found in the KB article Virtual Machine(s) Workload badge reports constant 100+ score in VMware vRealize Operations Manager (2145552). The KB article suggests that if latency sensitivity cannot be set back to normal, then a custom group can be created to disable the alerts.
This scenario is well documented. However, what if latency sensitivity is not enabled, or not configured beyond the default setting, but the symptoms are the same? In this case, the cluster is dedicated to running SQL workloads.
Using the metrics view of the cluster under the environment tab, we can see high peaks for the CPU co-stop and CPU ready values every night. The discrepancy seems to be caused by the virtual machines claiming all available CPU resource at a specific time. Whilst this might sound specific to one environment, there are a number of scenarios where this could be the case and a workaround is needed.
Beyond changing the behaviour of the virtual machines, some available options are as follows:
- Action the rightsizing recommendations to ensure we are not over-allocating CPU resources
- Follow the steps outlined in the KB article above to ignore/disable the alerts
- Follow the steps outlined below to set a maintenance schedule, disregarding metrics where the peak is at a consistent time every day or night
- In the capacity policy, change the setting of the time remaining calculations
Updating how the time remaining is calculated may be a last resort, but can provide a slightly different interpretation of the data. You can see the description of each setting, and how the associated projection graph changes, in the screenshots below. The default policy uses conservative capacity planning, which takes the higher values, whereas aggressive uses the average values of resource utilisation.
To update this setting, either change the default policy or create a new policy to assign to specific objects like a cluster. Follow the policy-based steps outlined below, disregarding the maintenance schedule. You can find out more information on how remaining time is calculated in the blog Rightsizing VMs with vRealize Operations.
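To make the difference between the two settings concrete, here is a minimal Python sketch. It is not the vRealize Operations capacity engine, which is far more sophisticated; it only assumes a straight-line trend fitted to either the daily peaks (conservative) or the daily means (aggressive). The function names and the sample data are hypothetical.

```python
from statistics import mean


def linear_fit(ys):
    """Least-squares slope and intercept for ys sampled at x = 0, 1, 2, ..."""
    n = len(ys)
    x_bar = (n - 1) / 2
    y_bar = mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in range(n))
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def days_remaining(daily_values, capacity=100.0):
    """Days after the last sample until the trend line reaches capacity.

    Returns 0.0 if the trend is already at or above capacity, and None
    if the trend is flat or falling (capacity is never reached).
    """
    slope, intercept = linear_fit(daily_values)
    last_x = len(daily_values) - 1
    if slope * last_x + intercept >= capacity:
        return 0.0
    if slope <= 0:
        return None
    return (capacity - intercept) / slope - last_x


# Hypothetical demand samples (% of usable capacity), three per day for
# five days: mostly low, with one high spike each day.
hourly_by_day = [[10 + d, 20 + d, 90 + d] for d in range(5)]

conservative = days_remaining([max(day) for day in hourly_by_day])
aggressive = days_remaining([mean(day) for day in hourly_by_day])
```

With this made-up data, the peak-based projection leaves 6 days remaining while the mean-based projection leaves 56, which mirrors why a cluster with short nightly spikes can look full under the conservative setting.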
The following steps walk through creating a maintenance schedule with an associated capacity policy. You can also change the time remaining calculations from the capacity policy, with or without a maintenance schedule. The screenshots are from vROps 8.6, but the process should be similar in earlier 8.x versions.
- First, create the maintenance schedule. From the left hand navigation pane, expand Configure and select Maintenance Schedules.
- Click Add. Enter the name, time zone, and time configuration of the schedule. Click Save.
- Next, we need to create a policy. From the Configure menu again, select Policies.
- Click Add. Enter the name, and select a policy to clone. Click Create Policy.
- Select the policy from the list, and click Edit Policy.
- Select the Capacity block, and then choose the object type.
Here, if required, you can change the policy for the time remaining calculations mentioned above, as well as manually change the alert thresholds. When considering the time remaining calculations, the default conservative policy takes the highest resource utilisation to project the time remaining before it crosses the usable capacity threshold. The aggressive policy uses the mean resource utilisation to project the time remaining before that average crosses the usable capacity threshold. Both policies have their uses; aggressive may be better suited to smaller organisations wanting to sweat hardware assets.
- Make any desired changes to the policy per the description above. Scroll down to Maintenance Schedule and select the schedule created earlier. Click Save.
- Next, select Groups and Objects. Choose a custom group or object to apply the policy to, and click Save.
- Now that the policy is configured and assigned to an object, it is active and in use.
- When we check back on the maintenance schedule we can now see the linked policy.
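Conceptually, the schedule configured above means that samples falling inside the nightly window no longer count towards the analysis. The Python sketch below illustrates that idea only; it is not how vRealize Operations implements maintenance internally, and the function names, window times, and sample values are all hypothetical.

```python
from datetime import datetime, time


def in_window(ts, start, end):
    """True if the timestamp's time of day falls in [start, end)."""
    t = ts.time()
    if start <= end:
        return start <= t < end
    # The window wraps past midnight, e.g. 23:00 to 04:00.
    return t >= start or t < end


def outside_maintenance(samples, start, end):
    """Keep only the (timestamp, value) samples outside the window."""
    return [(ts, v) for ts, v in samples if not in_window(ts, start, end)]


# Hypothetical hourly CPU demand (%) for one day, with a nightly spike
# at 01:00 and 02:00.
samples = [
    (datetime(2022, 1, 10, hour), 95 if hour in (1, 2) else 30)
    for hour in range(24)
]

kept = outside_maintenance(samples, time(23, 0), time(4, 0))
peak_all = max(v for _, v in samples)
peak_kept = max(v for _, v in kept)
```

Here the peak across all samples is 95, but once the 23:00 to 04:00 window is excluded the peak drops to 30, so a peak-based projection is no longer driven by the nightly job.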
There are additional ways of setting maintenance schedules; the example above suits the described use case of disregarding metrics during a certain time interval. You can also manually enter maintenance through both the vROps UI and API (see Maintenance Mode for vRealize Operations Objects, Part 1 by Thomas Kopton), or create dynamic groups containing hosts in maintenance mode (see Maintenance Mode for vRealize Operations Objects, Part 2).
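For the API route, the sketch below builds (but does not send) a request that marks a single object as being in maintenance. Treat the details as assumptions to check against the API documentation for your version: it assumes the suite-api path `/suite-api/api/resources/{id}/maintained` with a `duration` query parameter in minutes, and an `Authorization: vRealizeOpsToken <token>` header from a previously acquired token. The host name, resource ID, and token are placeholders.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_maintenance_request(host, resource_id, token, duration_minutes):
    """Build a PUT request that starts maintenance on one resource.

    Assumed endpoint and header format; verify against the API
    documentation for your vROps version before use.
    """
    query = urlencode({"duration": duration_minutes})
    url = (
        f"https://{host}/suite-api/api/resources/"
        f"{quote(resource_id, safe='')}/maintained?{query}"
    )
    headers = {
        "Authorization": f"vRealizeOpsToken {token}",
        "Accept": "application/json",
    }
    return Request(url, method="PUT", headers=headers)


# Placeholder values for illustration only.
req = build_maintenance_request(
    "vrops.example.local", "abc-123", "TOKEN", 90
)
```

Passing the resulting object to `urllib.request.urlopen` would send it. Building the request separately makes it easy to inspect the URL and headers first.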