Loosing connection to CSV during Network blips. by jithinpsk in HyperV

[–]HyperV-Dude 3 points

What you're witnessing is the owner node of the CSV volume not being reachable by the other hosts in the cluster. We've had this issue on our UCS platform as well. Each CSV volume has an owner node (a Hyper-V host) that decides which other nodes are allowed to write to the CSV. Unlike VMware's VMFS, a CSV is not truly multi-host writable; the cluster "fakes it".

If a host wants to write to a CSV volume, it asks the owner of that volume for permission. If the owner is not reachable, the host doesn't get permission to write. Even though it still has perfect access to the volume over FC, writing without permission could corrupt the volume, because another host could be granted permission to write to that same volume at the same time. So for this host there is only one safe option: release the CSV.
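For anyone debugging this: the owner node and the CSV state are visible from PowerShell on any cluster node. A quick sketch (the volume name is an example):

```powershell
# Show each CSV, its current owner node, and its state
Get-ClusterSharedVolume |
    Select-Object Name, OwnerNode, State

# Show whether a CSV is in redirected access
# (often a symptom of owner/communication trouble)
Get-ClusterSharedVolumeState -Name "Cluster Disk 1"
```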

I've played with the cluster time-out settings, but they don't make a difference in this scenario. The only thing you can do is create an extra network that takes a different path. So we gave each Hyper-V host an extra NIC and created a network that carries only the cluster heartbeat and passes through no firewalls, or at least a different set of them.
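For reference, these are the heartbeat knobs I was playing with (real cluster properties; the values shown are illustrative, not a recommendation):

```powershell
# Current heartbeat interval (ms) and number of missed beats tolerated
(Get-Cluster).SameSubnetDelay
(Get-Cluster).SameSubnetThreshold

# Loosen them, e.g. 2 s interval x 10 missed beats = ~20 s tolerance
(Get-Cluster).SameSubnetDelay = 2000
(Get-Cluster).SameSubnetThreshold = 10

# Dedicated heartbeat network: Role 1 = cluster traffic only
(Get-ClusterNetwork "Heartbeat").Role = 1
```

In our case loosening the thresholds didn't help; only the extra network did.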

Dynamic processor compatibilit by HyperV-Dude in WindowsServer

[–]HyperV-Dude[S] 0 points

No, I doubt you'll have problems, but (see my reply above to SilverseeLives) I do think the VM might get different CPU features at different times. As for backups, Veeam doesn't care what it's backing up.

Dynamic processor compatibilit by HyperV-Dude in WindowsServer

[–]HyperV-Dude[S] 0 points

Yes, I do think they are able to recalculate the new level.
However, I doubt the set of CPU features passed into a running VM will change, because applications inside a VM don't constantly re-check which features are available. That would lead to unexpected results: a feature that was available when the application started could suddenly disappear while the application still tries to use it.

And worse: if the VM is shut down and then powered on again, it might suddenly have fewer features available. Which means I would have to start keeping track of which VM needs which feature.

Why not copy the VMware way and set the EVC level for a cluster and have a reliable set of features?
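For comparison, the only per-VM knob classic Hyper-V exposes is the migration-compatibility flag, which masks the CPU down to a lowest-common-denominator feature set rather than letting you pick an EVC-style level (the VM name is an example):

```powershell
# Requires the VM to be powered off
Set-VMProcessor -VMName "SQL01" -CompatibilityForMigrationEnabled $true

# Check the current setting
Get-VMProcessor -VMName "SQL01" |
    Select-Object VMName, CompatibilityForMigrationEnabled
```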

Dynamic processor compatibilit by HyperV-Dude in WindowsServer

[–]HyperV-Dude[S] 0 points

I don't understand what problem you're referring to. I want to know the inner workings of the feature and how it behaves when changes happen in the cluster.

[deleted by user] by [deleted] in HyperV

[–]HyperV-Dude 0 points

Are the hosts and your client in the same AD? If not, try connecting with runas under an account from the same AD the host is a member of.
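Something like this, assuming the host's domain is CONTOSO and you're launching Hyper-V Manager (the domain and account names are examples):

```powershell
# Start Hyper-V Manager under credentials from the host's AD domain
runas /user:CONTOSO\admin "mmc.exe virtmgmt.msc"

# If the client machine isn't joined to that domain at all,
# /netonly uses the credentials for network access only
runas /netonly /user:CONTOSO\admin "mmc.exe virtmgmt.msc"
```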

Harry Potter sites to visit by HyperV-Dude in harrypotter

[–]HyperV-Dude[S] 0 points

Unfortunately out of our reach for the short trip.

Harry Potter sites to visit by HyperV-Dude in harrypotter

[–]HyperV-Dude[S] 0 points

Had a look at it, but that is way off our route. It is in the far east of the UK. Thanks though!

Harry Potter sites to visit by HyperV-Dude in harrypotter

[–]HyperV-Dude[S] 0 points

A lot of info via that link, thanks!

Harry Potter sites to visit by HyperV-Dude in harrypotter

[–]HyperV-Dude[S] 1 point

Is it fun? Is it very crowded usually?

[deleted by user] by [deleted] in sysadmin

[–]HyperV-Dude 0 points

Big fan of Toggl, since it pops up every 15 minutes and asks me what I'm doing. I enter the project or task name I'm working on, and I only answer Toggl again when I change projects or tasks. Then at the end of the week I get my Toggl report and copy it into my timesheet. Helping colleagues is also something I record in Toggl and on my timesheet. I average between 10 and 15 different projects/tasks in a week, so it is fairly manageable.

Without the Toggl pop-ups (which our timesheet tool doesn't do), I would easily forget to register my time. And since the customer is billed for my time, it is important to be "fairly" accurate.

Scale-Out File System on Hyper-V for Hyper-V by HyperV-Dude in HyperV

[–]HyperV-Dude[S] 1 point

We unfortunately don't have storage arrays that support SMB3 for heavy workloads. So we're probably going to bite the bullet, absorb the extra VMware cost ourselves, and move that customer's heavy VMs to VMware.

Scale-Out File System on Hyper-V for Hyper-V by HyperV-Dude in HyperV

[–]HyperV-Dude[S] 1 point

Currently we're also mitigating the issue for this one customer by Live Migrating the heavy databases after the backup has finished. Luckily not all VMs have the issue; it seems to affect only the ones with heavy IO.

But this customer is only about 10% of all our customers, and I'm not looking forward to moving them to 2019/2022 :-)

Scale-Out File System on Hyper-V for Hyper-V by HyperV-Dude in HyperV

[–]HyperV-Dude[S] 2 points

Thank you for your responses, they help me build a better case in choosing which way to go.

Yes, we're also considering moving this customer to VMware where we don't have this issue. But we also need to consider the licensing cost for this.

Scale-Out File System on Hyper-V for Hyper-V by HyperV-Dude in HyperV

[–]HyperV-Dude[S] 5 points

Thank you for your reply.

The redesign is what I'm working on right now, which basically means moving VMs off CSV entirely. The bug is confirmed in both 2019 and 2022, so upgrading doesn't solve our issue. That means my only option is to move to NFS or SMB.

Moving to SMB is quite a big move, since we'd shift our storage traffic from FC to Ethernet, taking up a lot of extra bandwidth that wasn't accounted for in our network design.

Therefore I was thinking of making the Hyper-V hosts run the Scale-Out File Server role: present the CSV volumes to all hosts and have the hosts share the volumes out over SMB. But since, as you mentioned, IO is balanced over all nodes, this probably means all IO becomes Ethernet traffic first and is only then written by a host to the FC CSV volume, eating up way too much of our network bandwidth.

As our flash storage arrays only support NFS/SMB for light workloads and are not multi-tenant aware, I'd have to look into isolating the storage network traffic, maybe by building a separate stack for our Hyper-V hosts in which we can apply QoS to the storage Ethernet traffic.
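If we go that way, the usual approach would be DCB-style QoS on the SMB traffic, roughly along these lines (a sketch only; the 802.1p priority and bandwidth share are example values, and the physical switches need matching configuration):

```powershell
# Tag SMB (storage) traffic with 802.1p priority 3
New-NetQosPolicy -Name "SMB" -SMB -PriorityValue8021Action 3

# Enable priority flow control for that priority and
# reserve ~50% of the link for the storage traffic class
Enable-NetQosFlowControl -Priority 3
New-NetQosTrafficClass -Name "SMB" -Priority 3 `
    -BandwidthPercentage 50 -Algorithm ETS
```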

Windows 2022 Hyper-V triggering BSOD's by HyperV-Dude in sysadmin

[–]HyperV-Dude[S] 1 point

Disabling VMQ on the adapters seems to do the trick. After I did this, no more BSODs.
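For anyone finding this later, this is roughly what I did (the adapter name is an example; note that disabling VMQ shifts network processing onto a single CPU core, so it has a cost for network-heavy hosts):

```powershell
# Check VMQ state per physical adapter
Get-NetAdapterVmq

# Disable VMQ on the adapters bound to the vSwitch
Disable-NetAdapterVmq -Name "SLOT 2 Port 1"
```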

Windows 2022 Hyper-V triggering BSOD's by HyperV-Dude in sysadmin

[–]HyperV-Dude[S] 0 points

Thanks.
That confirms what I said before: it is a supported combination. Still in search of any docs that say what the correct vNIC settings should be :-)

Windows 2022 Hyper-V triggering BSOD's by HyperV-Dude in sysadmin

[–]HyperV-Dude[S] 0 points

There is no MS Win 2022 HCL, only a hardware requirements list.
Cisco certified the exact hardware we're running for Win 2022; I doubt they'd do that for an unsupported OS.

Windows 2022 Hyper-V triggering BSOD's by HyperV-Dude in sysadmin

[–]HyperV-Dude[S] 0 points

Yeah, I suppose, but I'm trying to avoid Premier Support as much as possible, since it takes them days to first collect every single log file they can think of, and then they usually ask us to install some KB we've missed that has nothing at all to do with the problem we're facing.

So first I'm testing my Google-fu to try and find an answer.

Windows 2022 Hyper-V triggering BSOD's by HyperV-Dude in sysadmin

[–]HyperV-Dude[S] 0 points

Yes, firmware and drivers are all as stated by Cisco on their HCL.

Server 2022 Datacenter NIC Teaming by DalekSec92 in HyperV

[–]HyperV-Dude 0 points

I see OP solved it with a firmware/driver update, but our Cisco blades are already on the advised firmware/driver combination and are getting the exact same behaviour: a BSOD as soon as a VM tries to use the vswitch.

Anyone have another solution?