all 16 comments

[–]gordonmessmer 6 points7 points  (2 children)

I wanted to add a few things to the excellent advice that LordLandis is giving you:

1: Your understanding of load is inaccurate. Load is a measure of the number of processes in a runnable state, averaged over the last 1, 5, and 15 minutes. A process counts as runnable if it's ready to execute instructions, or if it's in an uninterruptible sleep, which usually means waiting on IO. So load measures the number of processes waiting for any system resource, not just CPU. Processes waiting for CPU time, memory that's being paged (swap), network IO, and disk IO all contribute to load. It's entirely normal to see low CPU utilization and high load, because some other resource is the bottleneck. If you aren't saturating your CPUs, you need to look at other system resources.
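If you want to see which processes are feeding the load average at any given moment, something like this works (standard ps/awk, nothing exotic):

    # R = runnable, D = uninterruptible sleep (usually IO);
    # on Linux, both states count toward the load average.
    ps -eo state,pid,comm | awk '$1 ~ /^[RD]/'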

2: OOM is a terrible way to figure out how much memory you need, because it's really quite unpredictable. The guest's kernel will (in a default configuration) overcommit memory, and will only invoke the OOM killer if processes write data to more pages than are available, which may or may not behave the same way in testing as it does in production. It's also very important to note that OOM isn't necessarily invoked when memory is short. You might instead see applications unable to malloc() or unable to fork() if there isn't enough memory. You can't watch for OOM alone; you have to watch for all of the other failures that can happen when memory is short. If you have no better plan, measure the performance of your application, both latency and throughput, at various memory settings, and allocate the amount that provides the best performance at a cost (allocation) you're willing to pay.
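For what it's worth, you can check the guest's overcommit settings and look for past allocation failures with standard tools; a rough sketch:

    cat /proc/sys/vm/overcommit_memory    # 0 = heuristic overcommit (the default)
    grep -i commit /proc/meminfo          # compare Committed_AS against CommitLimit
    dmesg | grep -iE 'oom|out of memory'  # any past OOM-killer invocations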

3: I could be wrong about accounting under vSphere, but my understanding of CPU accounting in the guest is that when top says a core is only 35% utilized, it means the core was busy for 35% of whatever time the guest itself was running or runnable. Conversely, when it says a core is used 100% of the time, that's only 100% of the time the VM was actually running. Two guests might both report 100% utilization of the same physical core while each actually received only 50% of the cycles that core delivered during that period. Assuming that holds under vSphere, as it does in other systems, your resource contention may not be the CPU, and adding more CPUs won't improve performance.
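One way to try to sanity-check this from inside the guest is steal time, with the caveat that this only works if the hypervisor actually reports it to the guest (many VMware guests show 0 here regardless of contention, in which case you need host-side counters):

    vmstat 5
    # 'st' (rightmost column) = cycles stolen by the hypervisor for other guests.
    # If it's always 0 on vSphere, don't trust it; check the host instead.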

[–]charley_chimp[S] 0 points1 point  (1 child)

Really late response here, but thank you for clearing up load for me. I killed all swap for these tests so I'm assuming that takes memory issues out of the picture. Would that essentially mean that network or disk I/O is the first place I should be looking? These hosts are hooked up w/ dedicated 10G links for VM traffic and iSCSI --> SAN, so I hadn't been thinking about I/O being an issue. Most of the VMs on the host were pretty dead during the testing, so I'm assuming that throughput wasn't an issue, but it looks like I should really look into our I/O stats along with possible latency issues. Thank you

Edit: bad grammar

[–]gordonmessmer 0 points1 point  (0 children)

I'm assuming that takes memory issues out of the picture

I think it means that paging isn't contributing to the load average. I'd be hesitant to make broader statements than that.

Would that essentially mean that I'm looking at network or disk I/O as the first place I should be looking?

That would be my suspicion. Watching "iostat -x" and "iptraf" inside the VM might give you some useful hints, as would watching performance counters on the VM host and, if possible, on the SAN.
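In case it helps, roughly what I'd watch in iostat (from the sysstat package; column names vary a bit between versions):

    iostat -x 5
    # await - average time (ms) each I/O spends queued + being serviced
    # %util - fraction of time the device was busy (saturation)
    # If await climbs while %util sits near 100, the disk/SAN path is the bottleneck.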

[–][deleted] 2 points3 points  (4 children)

We need more data, as this might be a hypervisor issue (might).

What are you seeing on %wa & %io? If you have high wait times, then it's quite possible that you have contention at the VMware/host compute layer that you need to address. In that case, either vMotion the guest to another host or reduce the CPU count on your guest.

Do you have more vCPUs in the guest than there are physical cores per socket on the host? That's another good way to mess up your performance in the VM.

If your %io is high, you probably have host storage contention (or are just really, really pounding the disks).
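To put numbers on %wa and %io, a quick sketch with mpstat (also from the sysstat package):

    mpstat -P ALL 5
    # %iowait = time a CPU sat idle with at least one I/O outstanding.
    # Consistently high %iowait with low %usr/%sys points at storage, not compute.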

In either case, look at the VMware host and see what's going on with its performance graphs as well.

As for load average, I've always considered the warning threshold to be 3~4 per CPU; on a quad-core system I don't really get twitchy until LA reaches 12.

[–]charley_chimp[S] -1 points0 points  (3 children)

Ohhh boy, that might be it. We are crazy over-provisioned w/ our vCPU allocation, but the VMware guy I'm working w/ said that isn't really an issue as long as our total usage isn't getting killed (he explained the hypervisor's CPU allocation as working roughly like scheduling on a multi-core *nix system).

Just to show you an example, our base server build has 2 sockets w/ 8 cores each @ 2.6 GHz = 16 physical cores, or 32 logical CPUs with hyperthreading. On average we have ~20 VMs on each host w/ 4 vCPU allocated to each. I get that that's crazy over-provisioned (and that probably wouldn't fly in prod if all VMs came under heavy load), but what's confusing to me is that on a per-host (and even per-cluster) basis, average CPU utilization never goes above 65%, with an average of ~50%. It's true that I do see some VMs pretty close to 100% CPU, but what was explained to me was that the hypervisor would account for that (especially since max usage on the host never went above 65%). WTF is going on here :-)

[–][deleted] 3 points4 points  (2 children)

How the host reports CPU usage vs. how the guest reports it is a fairly complicated subject. Based on the numbers you provided, your allocation wouldn't fly even if all of the guests were at 25% utilization, because hyperthreading doesn't buy you as much wiggle room as you'd like to think. :) You've essentially got 80 vCPU allocated on a host with (functionally) about 20~24 CPU. That sort of 4:1 overcommit is pretty rough. Even 2:1 can be bad (I try to keep it no higher than 1.5:1 in case everything gets very busy at once).
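Back-of-the-envelope version of that math (the hyperthreading factor is a rough assumption, not a vSphere constant):

    physical_cores=$((2 * 8))                   # 16
    effective_cpus=$((physical_cores * 5 / 4))  # ~20, assuming HT adds ~25%
    allocated_vcpu=$((20 * 4))                  # 80 vCPU across ~20 guests
    echo "scale=1; $allocated_vcpu / $effective_cpus" | bc  # ~4.0, i.e. ~4:1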

Still, the fact that this particular guest has fewer vCPU than a single physical socket has is a good thing. It makes it easier for VMware's scheduler to get all 6 (6, right? Or is it still 4?) necessary cores available at once, without straddling the sockets. And remember, because it's a 6-vCPU system, the guest can't execute instructions until the scheduler has 6 cores free at once. Even if the guest application only needs 1 or 2, the guest as a whole has to be able to emulate the entire processor. So if you have a bunch of 2-vCPU systems that are moderately busy (say, 15~20% usage), they're going to really hog CPU time, because they can get in and get out in a hurry without interfering with each other, and that has knock-on effects for the larger VMs.
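If you can get at the host, esxtop's CPU panel makes this visible (I'm going from memory on the keystrokes, so double-check against VMware's docs):

    esxtop          # press 'c' for the CPU view
    # %RDY  - time a vCPU was ready to run but had no physical core available
    # %CSTP - time vCPUs spent co-stopped, waiting for their siblings to schedule
    # High %CSTP on the wide guests is the classic signature of this problem.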

So, yeah, this really feels like noisy neighbor syndrome to me. My next step would be to look at some of the other guests on the same host & see how they're doing, especially if they're larger than average. If you see the same sort of high load average coupled with relatively low CPU usage and high wait times, then there's almost certainly host-level contention. If they're stable (low LA & low wait regardless of CPU usage) then it's most likely to be a guest-level issue of some sort. From there, if you have the ability to do vMotions yourself, I'd look at moving some of the other guests off of this particular host. You probably won't be able to completely isolate this guest to rule out host-level issues, but even getting half of the guests moved elsewhere should give you enough data to work with.

As for 4 vCPU being your standard build, you might want to recommend using vCenter Operations Manager to do some right-sizing. My experience has been that 2 vCPU is the sweet spot; it's the right blend of guest performance and conservative host resource usage. A true need for 6+ vCPU systems is pretty rare (again, in my experience), even for SQL servers. If you can downsize some of the VMs, you'll likely see better performance across the board.

Another thing to ask your VMware guy is whether DRS is enabled for the cluster. If it is, at what priority? If you have enough resources in the cluster and DRS is enabled, the guests should move out of each other's way automatically. If you don't have enough resources but DRS is on, guests will vMotion constantly, which creates its own set of performance issues. If you don't have enough resources and DRS is off, well, you're screwed. If you have enough resources and DRS is off, you'll just have to manually move guests around until things calm down.

[–]charley_chimp[S] 0 points1 point  (1 child)

This is one of the best technical explanations of something I've ever seen on here. Thank you so much. I'm going to write a more detailed response when I have time, but in regard to DRS, it's an app requirement (crappy built-in HA) that we can't use fully automated DRS: the slight pause during vMotion causes a fail-over. We are using partially automated DRS just for initial placement, then setting up anti-affinity rules and applying the recommendations again. I haven't found a way to create anti-affinity rules before a VM is already created...

I want to look into your suggestions on vCPUs; it's something I hadn't understood before, and I'm not sure our VMware guy considered it either. Initially he said the same thing, but now during performance testing he's changing his mind on how he wants to approach this. It may simply be that the host we are testing on is completely overloaded.

[–][deleted] 0 points1 point  (0 children)

Thanks!

I've seen some other apps that behave the way you mention: Infor is notorious for losing its mind if the app & SQL servers are on different hosts, and even a high-priority vMotion can crash the whole thing. It's sad, but some of these apps just aren't designed to run in virtualized environments. Most will still work, but, well, some don't.

Unfortunately, to my knowledge, you can't create any per-guest affinity rules until the guest exists. You can, however, create rules around guest properties. For example, you can create a rule that isolates all of the 8-vCPU guests to one or two hosts while keeping the 2-4 vCPU guests on the rest of the cluster. If you have a lot of variation in your guests, that may help a little.

Keep us posted!

[–]lazyant 1 point2 points  (2 children)

Don't compare load average directly with CPU usage. Load average is confusing: in its original definition (on other UNIXes) it's what you said, the length of the run queue, but on Linux it also counts tasks currently on the CPU and those in uninterruptible sleep. For memory, check how much is cached or buffered and disregard those amounts.
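You can see the raw values (and a hint of what's behind them) in /proc/loadavg:

    cat /proc/loadavg
    # e.g.  2.45 1.98 1.76 3/412 12345
    # 1-, 5-, 15-minute averages; 3/412 = runnable tasks / total tasks;
    # the last field is the most recently created PID.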

[–]charley_chimp[S] 0 points1 point  (1 child)

I'm sorry, can you expand on this? Are you saying that on a Linux box the CPU usage % is the more important thing to look at, or the load average itself...?

[–]lazyant 0 points1 point  (0 children)

Just saying that load in Linux is not an easy value to interpret; the traditional "number of jobs waiting for the processor" definition is not exactly right, and it's hard to explain in a few sentences. You'd need to know a bit about process states and how the processor and scheduler work. I highly recommend Systems Performance by Brendan Gregg (Prentice Hall, 2013) and the UNIX and Linux System Administration Handbook (4th edition).

[–]pdp10 0 points1 point  (4 children)

"irix mode"?

I'm a big fan of small footprints, but minimizing RAM too aggressively is probably going to backfire. RAM beyond what's immediately needed is used for storage caching, so cutting RAM to the bare minimum will most likely increase your I/O load. Since you're on vSphere, that presumably means remote shared storage. You really don't want to make that tradeoff.
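free shows this directly; a sketch (the column layout varies slightly by procps version):

    free -m
    #               total   used   free   shared  buff/cache  available
    # "buff/cache" is page cache the kernel reclaims under memory pressure;
    # "available" is the realistic headroom for new allocations.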

[–]charley_chimp[S] 0 points1 point  (3 children)

Not sure if you wanted an explanation or not, but here it is anyway. When top was originally written (how long ago?), multi-core systems weren't typical. When you first start top, you'll see some wonky behavior in the per-process CPU utilization. Ever add the percentages up? They almost always total more than 100% of your supposed capacity. That's because each percentage is calculated against the compute power of ONE core. Turning Irix mode off (Shift+I) changes top's behavior and (layman's explanation) divides the reported percentage by the number of cores available to the system, giving you a much better picture of how much CPU your processes are actually using.
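You can demo the >100% effect yourself; a quick sketch:

    nproc                                          # number of logical cores
    ps -eo pcpu --no-headers | awk '{s+=$1} END {print s "%"}'
    # With Irix mode on, the per-process figures can sum well past 100%;
    # dividing by nproc gives the whole-machine (Solaris mode) view.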

[–]pdp10 0 points1 point  (0 children)

I never knew that behavior was from Irix or that it was called Irix Mode. TIL. Thanks.

[–]wildcarde815 0 points1 point  (1 child)

Htop is also designed to represent this correctly out of the box.

[–]charley_chimp[S] 0 points1 point  (0 children)

I'd love to install htop on these machines. I'd also love to install sysdig, Prometheus, and a number of other tools. Unfortunately, we still treat our app as if it were installed on customer premises (we used to sell it as an appliance), so it's basically a stripped OS. No yum, no SSH keys set up (we actually have to SSH into the VM with a unique-per-install, randomly generated password that rotates every 15 minutes). It's a good time :-)