25 min read

 In this article by Iwan ‘e1’ Rahabok, author of VMware vRealize Operations Performance and Capacity Management, we will learn that vSphere 5.5 comes with many counters, many more than what a physical server provides. There are new counters that do not have a physical equivalent, such as memory ballooning, CPU latency, and vSphere replication. In addition, some counters have the same name as their physical world counterpart but behave differently in vSphere. Memory usage is a common one, resulting in confusion among system administrators. For those counters that are similar to their physical world counterparts, vSphere may use different units, such as milliseconds.

(For more resources related to this topic, see here.)

As a result, experienced IT administrators find it hard to master vSphere counters by building on their existing knowledge. Instead of trying to relate each counter to its physical equivalent, I find it useful to group them according to their purpose. Virtualization formalizes the relationship between the infrastructure team and application team. The infrastructure team changes from the system builder to service provider. The application team no longer owns the physical infrastructure.

The application team becomes a consumer of a shared service—the virtual platform. Depending on the Service Level Agreement (SLA), the application team can be served as if they have dedicated access to the infrastructure, or they can take a performance hit in exchange for a lower price. For SLAs where performance matters, the VM running in the cluster should not be impacted by any other VMs. The performance must be as good as if it is the only VM running in the ESXi.

Because there are two different counter users, there are two different purposes. The application team (developers and the VM owner) only cares about their own VM. The infrastructure team has to care about both the VM and infrastructure, especially when they need to show that the shared infrastructure is not a bottleneck. One set of counters is to monitor the VM; the other set is to monitor the infrastructure. The following diagram shows the two different purposes and what we should check for each. By knowing what matters on each layer, we can better manage the virtual environment.

VMware vRealize Operations Performance and Capacity Management

The two-tier IT organization

At the VM layer, we care whether the VM is being served well by the platform. Other VMs are irrelevant from the VM owner’s point of view. A VM owner only wants to make sure his or her VM is not contending for a resource. So the key counter here is contention. Only when we are satisfied that there is no contention can we proceed to check whether the VM is sized correctly or not. Most people check for utilization first because that is what they are used to monitoring in the physical infrastructure. In a virtual environment, we should check for contention first.

At the infrastructure layer, we care whether it serves everyone well. Make sure that there is no contention for resource among all the VMs in the platform. Only when the infrastructure is clear from contention can we troubleshoot a particular VM. If the infrastructure is having a hard time serving majority of the VMs, there is no point troubleshooting a particular VM.

This two-layer concept is also implemented by vSphere in compute and storage architectures. For example, there are two distinct layers of memory in vSphere. There is the individual VM memory provided by the hypervisor and there is the physical memory at the host level. For an individual VM, we care whether the VM is getting enough memory. At the host level, we care whether the host has enough memory for everyone. Because of the difference in goals, we look for a different set of counters.

In the previous diagram, there are two numbers shown in a large font, indicating that there are two main steps in monitoring. Each step applies to each layer (the VM layer and infrastructure layer), so there are two numbers for each step. Step 1 is used for performance management. It is useful during troubleshooting or when checking whether we are meeting performance SLAs or not. Step 2 is used for capacity management. It is useful as part of long-term capacity planning. The time period for step 2 is typically 3 months, as we are checking for overall utilization and not a one off spike.

With the preceding concept in mind, we are ready to dive into more detail. Let’s cover compute, network, and storage.

Compute

The following diagram shows how a VM gets its resource from ESXi. It is a pretty complex diagram, so let me walk you through it.

VMware vRealize Operations Performance and Capacity Management

The tall rectangular area represents a VM. Say this VM is given 8 GB of virtual RAM. The bottom line represents 0 GB and the top line represents 8 GB. The VM is configured with 8 GB RAM. We call this Provisioned. This is what the Guest OS sees, so if it is running Windows, you will see 8 GB RAM when you log into Windows.

Unlike a physical server, you can configure a Limit and a Reservation. This is done outside the Guest OS, so Windows or Linux does not know. You should minimize the use of Limit and Reservation as it makes the operation more complex.

Entitlement means what the VM is entitled to. In this example, the hypervisor entitles the VM to a certain amount of memory. I did not show a solid line and used an italic font style to mark that Entitlement is not a fixed value, but a dynamic value determined by the hypervisor. It varies every minute, determined by Limit, Entitlement, and Reservation of the VM itself and any shared allocation with other VMs running on the same host.

Obviously, a VM can only use what it is entitled to at any given point of time, so the Usage counter does not go higher than the Entitlement counter. The green line shows that Usage ranges from 0 to the Entitlement value.

In a healthy environment, the ESXi host has enough resources to meet the demands of all the VMs on it with sufficient overhead. In this case, you will see that the Entitlement, Usage, and Demand counters will be similar to one another when the VM is highly utilized. This is shown by the green line where Demand stops at Usage, and Usage stops at Entitlement. The numerical value may not be identical because vCenter reports Usage in percentage, and it is an average value of the sample period. vCenter reports Entitlement in MHz and it takes the latest value in the sample period. It reports Demand in MHz and it is an average value of the sample period. This also explains why you may see Usage a bit higher than Entitlement in highly-utilized vCPU. If the VM has low utilization, you will see the Entitlement counter is much higher than Usage.

An environment in which the ESXi host is resource constrained is unhealthy. It cannot give every VM the resources they ask for. The VMs demand more than they are entitled to use, so the Usage and Entitlement counters will be lower than the Demand counter. The Demand counter can go higher than Limit naturally. For example, if a VM is limited to 2 GB of RAM and it wants to use 14 GB, then Demand will exceed Limit. Obviously, Demand cannot exceed Provisioned. This is why the red line stops at Provisioned because that is as high as it can go.

The difference between what the VM demands and what it gets to use is the Contention counter. Contention is Demand minus Usage. So if the Contention is 0, the VM can use everything it demands. This is the ultimate goal, as performance will match the physical world. This Contention value is useful to demonstrate that the infrastructure provides a good service to the application team. If a VM owner comes to see you and says that your shared infrastructure is unable to serve his or her VM well, both of you can check the Contention counter.

The Contention counter should become a part of your SLA or Key Performance Indicator (KPI). It is not sufficient to track utilization alone. When there is contention, it is possible that both your VM and ESXi host have low utilization, and yet your customers (VMs running on that host) perform poorly. This typically happens when the VMs are relatively large compared to the ESXi host. Let me give you a simple example to illustrate this. The ESXi host has two sockets and 20 cores. Hyper-threading is not enabled to keep this example simple. You run just 2 VMs, but each VM has 11 vCPUs. As a result, they will not be able to run concurrently. The hypervisor will schedule them sequentially as there are only 20 physical cores to serve 22 vCPUs. Here, both VMs will experience high contention.

Hold on! You might say, “There is no Contention counter in vSphere and no memory Demand counter either.” This is where vRealize Operations comes in. It does not just regurgitate the values in vCenter. It has implicit knowledge of vSphere and a set of derived counters with formulae that leverage that knowledge.

You need to have an understanding of how the vSphere CPU scheduler works. The following diagram shows the various states that a VM can be in:

VMware vRealize Operations Performance and Capacity Management

The preceding diagram is taken from The CPU Scheduler in VMware vSphere® 5.1: Performance Study (you can find it at http://www.vmware.com/resources/techresources/10345). This is a whitepaper that documents the CPU scheduler with a good amount of depth for VMware administrators. I highly recommend you read this paper as it will help you explain to your customers (the application team) how your shared infrastructure juggles all those VMs at the same time. It will also help you pick the right counters when you create your custom dashboards in vRealize Operations.

Storage

If you look at the ESXi and VM metric groups for storage in the vCenter performance chart, it is not clear how they relate to one another at first glance. You have storage network, storage adapter, storage path, datastore, and disk metric groups that you need to check. How do they impact on one another?

I have created the following diagram to explain the relationship. The beige boxes are what you are likely to be familiar with. You have your ESXi host, and it can have NFS Datastore, VMFS Datastore, or RDM objects. The blue colored boxes represent the metric groups.

VMware vRealize Operations Performance and Capacity Management

From ESXi to disk

NFS and VMFS datastores differ drastically in terms of counters, as NFS is file-based while VMFS is block-based. For NFS, it uses the vmnic, and so the adapter type (FC, FCoE, or iSCSI) is not applicable. Multipathing is handled by the network, so you don’t see it in the storage layer. For VMFS or RDM, you have more detailed visibility of the storage. To start off, each ESXi adapter is visible and you can check the counters for each of them. In terms of relationship, one adapter can have many devices (disk or CDROM). One device is typically accessed via two storage adapters (for availability and load balancing), and it is also accessed via two paths per adapter, with the paths diverging at the storage switch. A single path, which will come from a specific adapter, can naturally connect one adapter to one device. The following diagram shows the four paths:

VMware vRealize Operations Performance and Capacity Management

Paths from ESXi to storage

A storage path takes data from ESXi to the LUN (the term used by vSphere is Disk), not to the datastore. So if the datastore has multiple extents, there are four paths per extent. This is one reason why I did not use more than one extent, as each extent adds four paths.

If you are not familiar with extent, Cormac Hogan explains it well on this blog post: http://blogs.vmware.com/vsphere/2012/02/vmfs-extents-are-they-bad-or-simply-misunderstood.html

For VMFS, you can see the same counters at both the Datastore level and the Disk level. Their value will be identical if you follow the recommended configuration to create a 1:1 relationship between a datastore and a LUN. This means you present an entire LUN to a datastore (use all of its capacity).

The following screenshot shows how we manage the ESXi storage. Click on the ESXi you need to manage, select the Manage tab, and then the Storage subtab. In this subtab, we can see the adapters, devices, and the host cache. The screen shows an ESXi host with the list of its adapters. I have selected vmhba2, which is an FC HBA. Notice that it is connected to 5 devices. Each device has 4 paths, so I have 20 paths in total.

VMware vRealize Operations Performance and Capacity Management

ESXi adapter

Let’s move on to the Storage Devices tab. The following screenshot shows the list of devices. Because NFS is not a disk, it does not appear here. I have selected one of the devices to show its properties.

VMware vRealize Operations Performance and Capacity Management

ESXi device

If you click on the Paths tab, you will be presented with the information shown in the next screenshot, including whether a path is active. Note that not all paths carry I/O; it depends on your configuration and multipathing software. Because each LUN typically has four paths, path management can be complicated if you have many LUNs.

VMware vRealize Operations Performance and Capacity Management

ESXi paths

The story is quite different on the VM layer. A VM does not see the underlying shared storage. It sees local disks only. So regardless of whether the underlying storage is NFS, VMFS, or RDM, it sees all of them as virtual disks. You lose visibility in the physical adapter (for example, you cannot tell how many IOPSs on vmhba2 are coming from a particular VM) and physical paths (for example, how many disk commands travelling on that path are coming from a particular VM). You can, however, see the impact at the Datastore level and the physical Disk level. The Datastore counter is especially useful. For example, if you notice that your IOPS is higher at the Datastore level than at the virtual Disk level, this means you have a snapshot. The snapshot IO is not visible at the virtual Disk level as the snapshot is stored on a different virtual disk.

VMware vRealize Operations Performance and Capacity Management

From VM to disk

Counters in vCenter and vRealize Operations

We compared the metric groups between vCenter and vRealize Operations. We know that vRealize Operations provides a lot more detail, especially for larger objects such as vCenter, data center, and cluster. It also provides information about the distributed switch, which is not displayed in vCenter at all. This makes it useful for the big-picture analysis.

We will now look at individual counters. To give us a two-dimensional analysis, I would not approach it from the vSphere objects’ point of view. Instead, we will examine the four key types of metrics (CPU, RAM, network, and storage). For each type, I will provide my personal take on what I think is a good guidance for their value. For example, I will give guidance on a good value for CPU contention based on what I have seen in the field. This is not an official VMware recommendation. I will state the official recommendation or popular recommendation if I am aware of it.

You should spend time understanding vCenter counters and esxtop counters. This section of the article is not meant to replace the manual. I would encourage you to read the vSphere documentation on this topic, as it gives you the required foundation while working with vRealize Operations. The following are the links to this topic:

You should also be familiar with the architecture of ESXi, especially how the scheduler works.

vCenter has a different collection interval (sampling period) depending upon the timeline you are looking at. Most of the time you are looking at the real-time statistic (chart), as other timelines do not have enough counters. You will notice right away that most of the counters become unavailable once you choose a timeline. In the real-time chart, each data point has 20 seconds’ worth of data. That is as accurate as it gets in vCenter. Because all other performance management tools (including vRealize Operations) get their data from vCenter, they are not getting anything more granular than this. As mentioned previously, esxtop allows you to sample down to a minimum of 2 seconds.

Speaking of esxtop, you should be aware that not all counters are exposed in vCenter. For example, if you turn on 3D graphics, there is a separate SVGA thread created for that VM. This can consume CPU and it will not show up in vCenter. The Mouse, Keyboard, Screen (MKS) threads, which give you the console, also do not show up in vCenter.

The next screenshot shows how you lose most of your counters if you choose a timespan other than real time. In the case of CPU, you are basically left with two counters, as Usage and Usage in MHz cover the same thing. You also lose the ability to monitor per core, as the target objects only list the host now and not the individual cores.

VMware vRealize Operations Performance and Capacity Management

Counters are lost beyond 1 hour

Because the real-time timespan only lasts for 1 hour, the performance troubleshooting has to be done at the present moment. If the performance issue cannot be recreated, there is no way to troubleshoot in vCenter. This is where vRealize Operations comes in, as it keeps your data for a much longer period. I was able to perform troubleshooting for a client on a problem that occurred more than a month ago!

vRealize Operations takes data every 5 minutes. This means it is not suitable to troubleshoot performance that does not last for 5 minutes. In fact, if the performance issue only lasts for 5 minutes, you may not get any alert, because the collection may happen exactly in the middle of those 5 minutes. For example, let’s assume the CPU is idle from 08:00:00 to 08:02:30, spikes from 08:02:30 to 08:07:30, and then again is idle from 08:07:30 to 08:10:00. If vRealize Operations is collecting at exactly 08:00, 08:05, and 08:10, you will not see the spike as it is spread over two data points. This means, for vRealize Operations to pick up the spike in its entirety without any idle data, the spike has to last for 10 minutes or more.

In some metrics, the unit is actually 20 seconds. vRealize Operations averages a set of 20-second data points into a single 5-minute data point.

The Rollups column is important. Average means the average of 5 minutes in the case of vRealize Operations. The summation value is actually an average for those counters where accumulation makes more sense. An example is CPU Ready time. It gets accumulated over the sampling period. Over a period of 20 seconds, a VM may accumulate 200 milliseconds of CPU ready time. This translates into 1 percent, which is why I said it is similar to average, as you lose the peak.

Latest, on the other hand, is different. It takes the last value of the sampling period. For example, in the sampling for 20 seconds, it takes the value between 19 and 20 seconds. This value can be lower or higher than the average of the entire 20-second period.

So what is missing here is the peak of the sampling period. In the 5-minute period, vRealize Operations does not collect low, average, and high from vCenter. It takes average only.

Let’s talk about the Units column now. Some common units are milliseconds, MHz, percent, KBps, and KB. Some counters are shown in MHz, which means you need to know your ESXi physical CPU frequency. This can be difficult due to CPU power saving features, which lower the CPU frequency when the demand is low. In large environments, this can be operationally difficult as you have different ESXi hosts from different generations (and hence, are likely to sport a different GHz). This is also the reason why I state that the cluster is the smallest logical building block. If your cluster has ESXi hosts with different frequencies, these MHz-based counters can be difficult to use, as the VMs get vMotion-ed by DRS.

vRealize Operations versus vCenter

I mentioned earlier that vRealize Operations does not simply regurgitate what vCenter has. Some of the vSphere-specific characteristics are not properly understood by traditional management tools. Partial understanding can lead to misunderstanding. vRealize Operations starts by fully understanding the unique behavior of vSphere, then simplifying it by consolidating and standardizing the counters. For example, vRealize Operations creates derived counters such as Contention and Workload, then applies them to CPU, RAM, disk, and network.

Let’s take a look at one example of how partial information can be misleading in a troubleshooting scenario. It is common for customers to invest in an ESXi host with plenty of RAM. I’ve seen hosts with 256 to 512 GB of RAM. One reason behind this is the way vCenter displays information. In the following screenshot, vCenter is giving me an alert. The host is running on high memory utilization. I’m not showing the other host, but you can see that it has a warning, as it is high too. The screenshots are all from vCenter 5.0 and vCenter Operations 5.7, but the behavior is still the same in vCenter 5.5 Update 2 and vRealize Operations 6.0.

VMware vRealize Operations Performance and Capacity Management

vSphere 5.0 – Memory alarm

I’m using vSphere 5.0 and vCenter Operations 5.x to show the screenshots as I want to provide an example of the point I stated earlier, which is the rapid change of vCloud Suite.

The first step is to check if someone has modified the alarm by reducing the threshold. The next screenshot shows that utilization above 95 percent will trigger an alert, while utilization above 90 percent will trigger a warning. The threshold has to be breached by at least 5 minutes. The alarm is set to a suitably high configuration, so we will assume the alert is genuinely indicating a high utilization on the host.

VMware vRealize Operations Performance and Capacity Management

vSphere 5.0 – Alarm settings

Let’s verify the memory utilization. I’m checking both the hosts as there are two of them in the cluster. Both are indeed high. The utilization for vmsgesxi006 has gone down in the time taken to review the Alarm Settings tab and move to this view, so both hosts are now in the Warning status.

VMware vRealize Operations Performance and Capacity Management

vSphere 5.0 – Hosts tab

Now we will look at the vmsgesxi006 specification. From the following screenshot, we can see it has 32 GB of physical RAM, and RAM usage is 30747 MB. It is at 93.8 percent utilization.

VMware vRealize Operations Performance and Capacity Management

vSphere – Host’s summary page

Since all the numbers shown in the preceding screenshot are refreshed within minutes, we need to check with a longer timeline to make sure this is not a one-time spike. So let’s check for the last 24 hours. The next screenshot shows that the utilization was indeed consistently high. For the entire 24-hour period, it has consistently been above 92.5 percent, and it hits 95 percent several times. So this ESXi host was indeed in need of more RAM.

VMware vRealize Operations Performance and Capacity Management

Deciding whether to add more RAM is complex; there are many factors to be considered. There will be downtime on the host, and you need to do it for every host in the cluster since you need to maintain a consistent build cluster-wide. Because the ESXi is highly utilized, I should increase the RAM significantly so that I can support more VMs or larger VMs. Buying bigger DIMMs may mean throwing away the existing DIMMs, as there are rules restricting the mixing of DIMMs. Mixing DIMMs also increases management complexity. The new DIMM may require a BIOS update, which may trigger a change request. Alternatively, the large DIMM may not be compatible with the existing host, in which case I have to buy a new box. So a RAM upgrade may trigger a host upgrade, which is a larger project.

Before jumping in to a procurement cycle to buy more RAM, let’s double-check our findings. It is important to ask what is the host used for? and who is using it?.

In this example scenario, we examined a lab environment, the VMware ASEAN lab. Let’s check out the memory utilization again, this time with the context in mind. The preceding graph shows high memory utilization over a 24-hour period, yet no one was using the lab in the early hours of the morning! I am aware of this as I am the lab administrator.

We will now turn to vCenter Operations for an alternative view. The following screenshot from vCenter Operations 5 tells a different story. CPU, RAM, disk, and network are all in the healthy range. Specifically for RAM, it has 97 percent utilization but 32 percent demand. Note that the Memory chart is divided into two parts. The upper one is at the ESXi level, while the lower one shows individual VMs in that host.

The upper part is in turn split into two. The green rectangle (Demand) sits on top of a grey rectangle (Usage). The green rectangle shows a healthy figure at around 10 GB. The grey rectangle is much longer, almost filling the entire area.

The lower part shows the hypervisor and the VMs’ memory utilization. Each little green box represents one VM.

On the bottom left, note the KEY METRICS section. vCenter Operations 5 shows that Memory | Contention is 0 percent. This means none of the VMs running on the host is contending for memory. They are all being served well!

VMware vRealize Operations Performance and Capacity Management

vCenter Operations 5 – Host’s details page

I shared earlier that the behavior remains the same in vCenter 5.5. So, let’s take a look at how memory utilization is shown in vCenter 5.5. The next screenshot shows the counters provided by vCenter 5.5. This is from a different ESXi host, as I want to provide you with a second example. Notice that the ballooning is 0, so there is no memory pressure for this host. This host has 48 GB of RAM. About 26 GB has been mapped to VM or VMkernel, which is shown by the Consumed counter (the highest line in the chart; notice that the value is almost constant). The Usage counter shows 52 percent because it takes from Consumed. The active memory is a lot lower, as you can see from the line at the bottom. Notice that the line is not a simple straight line, as the value goes up and down. This proves that the Usage counter is actually the Consumed counter.

VMware vRealize Operations Performance and Capacity Management

vCenter 5.5 Update 1 memory counters

At this point, some readers might wonder whether that’s a bug in vCenter. No, it is not. There are situations in which you want to use the consumed memory and not the active memory. In fact, some applications may not run properly if you use active memory. Also, technically, it is not a bug as the data it gives is correct. It is just that additional data will give a more complete picture since we are at the ESXi level and not at the VM level. vRealize Operations distinguishes between the active memory and consumed memory and provides both types of data. vCenter uses the Consumed counter for utilization for the ESXi host.

As you will see later in this article, vCenter uses the Active counter for utilization for VM. So the Usage counter has a different formula in vCenter depending upon the object. This makes sense as they are at different levels.

vRealize Operations uses the Active counter for utilization. Just because a physical DIMM on the motherboard is mapped to a virtual DIMM in the VM, it does not mean it is actively used (read or write). You can use that DIMM for other VMs and you will not incur (for practical purposes) performance degradation. It is common for Microsoft Windows to initialize pages upon boot with zeroes, but never use them subsequently.

For further information on this topic, I would recommend reviewing Kit Colbert’s presentation on Memory in vSphere at VMworld, 2012. The content is still relevant for vSphere 5.x. The title is Understanding Virtualized Memory Performance Management and the session ID is INF-VSP1729. You can find it at http://www.vmworld.com/docs/DOC-6292. If the link has changed, the link to the full list of VMworld 2012 sessions is http://www.vmworld.com/community/sessions/2012/.

Not all performance management tools understand this vCenter-specific characteristic. They would have given you a recommendation to buy more RAM.

Summary

In this article, we covered the world of counters in vCenter and vRealize Operations. The counters were analyzed based on their four main groupings (CPU, RAM, disk, and network). We also covered each of the metric groups, which maps to the corresponding objects in vCenter. For the counters, we also shared how they are related, and how they differ.

Resources for Article:


Further resources on this subject:


LEAVE A REPLY

Please enter your comment!
Please enter your name here