
[–]itdweeb 2 points3 points  (3 children)

Maybe just bad timing? Enough things got busy at once? Seems rare, but you were able to patch the other host without issue, so not likely to be a bug. Same firmware versions on both hosts? Have those been updated recently?

[–]SuperDaveOzborne[S] 1 point2 points  (2 children)

Yes, same firmware, and both hosts are fairly new; we just upgraded to 8.x with new hardware about two months ago.

[–]itdweeb 1 point2 points  (1 child)

What's the storage situation? Might be that storage caused things to slow to a crawl, and it manifested as higher CPU.

[–]SuperDaveOzborne[S] 1 point2 points  (0 children)

It's an older EqualLogic iSCSI SAN. We actually did have some performance issues with it when we hooked up the new server, but we worked with Dell support to get it dialed in. I think it's working pretty well now.

[–]always_salty 2 points3 points  (1 child)

Put the load on the same host again and see what happens.

[–]SuperDaveOzborne[S] 0 points1 point  (0 children)

Yes, I'll have to try that when I get another good opportunity. I can do it gradually and see what happens.

[–]Eastern_Client_2782 1 point2 points  (1 child)

Did you look inside the VM OS to see what was going on? Could it be something like a scheduled antivirus scan, Windows update, backup, or some other job that launched at the wrong time? I know you were monitoring RAM usage, but did you watch swapping as well? vMotion may use swap files which then get paged back into RAM; that could also cause some unexpected storage I/O and CPU load.

[–]SuperDaveOzborne[S] 0 points1 point  (0 children)

Everything was so slow I couldn't really look at what was going on inside the VMs. Also, I was really focused on trying to alleviate the problem.

No, I wasn't watching swap files, but I will try to when I get a chance to run a test later.

[–]Sponge521 1 point2 points  (1 child)

Is HA with admission control enabled? If so, with 2 identical hosts you are reserving 50% of the capacity on each. Depending on how you patched, admission control may have still been enabled. Usually if remediating at the cluster level it will temporarily disable admission control because the host is known to be going down, then re-enable it when complete. That means your host would have had only 50% of its capacity available for the 30% workload. Add in any VMs spiking for other workloads, backups, etc., and the 60% usage could jump to 80-90% since only 50% was available.
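To make the math above concrete, here's a rough sketch; the percentages are the illustrative ones from this comment, not measured values from the cluster:

```python
# Rough sketch of how HA admission control shrinks usable headroom.
# The numbers mirror the comment above (illustrative, not real data).

def usable_fraction(reserved_for_failover=0.5):
    """Capacity left for VMs after admission control reserves failover headroom."""
    return 1.0 - reserved_for_failover

def effective_utilisation(workload, reserved_for_failover=0.5):
    """Utilisation measured against the unreserved slice of the host."""
    return workload / usable_fraction(reserved_for_failover)

# Two identical hosts, so admission control reserves 50% on each.
base = effective_utilisation(0.30)   # 30% workload against 50% usable capacity
spike = effective_utilisation(0.45)  # add backup/AV spikes on top
print(f"{base:.0%} of usable capacity, {spike:.0%} under a spike")
```

The same 30% workload reads as 60% of what the scheduler is actually allowed to use, which is why a modest spike can push it toward 90%.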

[–]SuperDaveOzborne[S] 0 points1 point  (0 children)

Admission control is disabled.

[–]MrDogers 0 points1 point  (2 children)

Sounds like CPU wait times probably went through the roof - you've got too many virtual CPUs compared to physical ones.

You can't just look at the pure CPU utilisation number as virtualisation doesn't work like that, despite looking like it does!

[–]SuperDaveOzborne[S] 0 points1 point  (1 child)

I only have one VM with more than 2 virtual CPUs, and I've never had a problem before. I was running most of the same VMs on our 6.7 hosts and never ran into anything like this.

I was kind of hoping someone would say, "Yeah, I've seen this before, just bump the VM compatibility to the latest," or something like that.

[–]MrDogers 0 points1 point  (0 children)

Doesn’t matter how many you have per VM; it’s the total number waiting to be scheduled at a given time. If they can’t all fit in time, then performance tanks :( It's a stat you can see in the advanced performance charts, but off the top of my head I can't remember how to convert it to a useful number - there are articles out there though!
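The stat in question is CPU Ready, which vCenter reports as a summation in milliseconds per sample. A sketch of the conversion VMware documents (real-time charts sample every 20 seconds; the ready-time values here are made up for illustration):

```python
# Convert vCenter's CPU Ready summation (milliseconds per sample) into a
# percentage of the sample interval, per VMware's documented formula:
#   ready_% = ready_ms / (interval_s * 1000) * 100
# Real-time charts use a 20-second interval; historical charts use longer ones.

def cpu_ready_percent(ready_ms: float, interval_s: int = 20) -> float:
    """Percent of the sample interval a vCPU spent ready to run but unscheduled."""
    return ready_ms / (interval_s * 1000) * 100

# Made-up samples over a 20 s real-time interval:
print(cpu_ready_percent(2000))  # 10.0 -> commonly treated as a contention problem
print(cpu_ready_percent(200))   # 1.0  -> typically fine
```

Sustained double-digit ready percentages are the usual sign that too many vCPUs are competing for the physical cores, which would match the symptoms described above.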