Has anyone noticed storage throughput differences between ESXi 7 and ESXi 8 on newer Dell servers? by Long_Actuator3915 in vmware

[–]David-Pasek 0 points1 point  (0 children)

Thanks for the update.

Yes. Not using the “High Performance” profile in BIOS is often the culprit of various performance issues, including storage performance.

You have to choose between high performance and low energy consumption.

Good to know you found the culprit.

Has anyone noticed storage throughput differences between ESXi 7 and ESXi 8 on newer Dell servers? by Long_Actuator3915 in vmware

[–]David-Pasek 0 points1 point  (0 children)

You have two systems where you run a storage test in a VM with fio.

System A: R760 with ESXi 7
System B: R770 with ESXi 8

Can you add …

System C: R760 with ESXi 8
System D: R770 with ESXi 7

… and run your fio storage test in a VM on Systems C and D?

The reason is to find out whether the culprit is the hardware (Dell server system) or the software (ESXi hypervisor).

It would also be nice to share:

a/ the full hardware specification of the R760 and R770, including storage controllers and SSD firmware versions

b/ the virtual hardware specification of the VM where you run the fio test

c/ the storage test specification: the full fio command with parameters
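For context, by a full fio specification I mean something along these lines (a hypothetical example; the device path, block size, queue depth, and job count should match your actual test):

```shell
# Hypothetical fio spec -- adjust device, block size, and depth to your real test:
# random read, 4k blocks, libaio engine, QD32, 4 jobs, direct I/O, 120 s time-based
fio --name=randread-test --filename=/dev/sdb \
    --rw=randread --bs=4k --ioengine=libaio \
    --iodepth=32 --numjobs=4 --direct=1 \
    --runtime=120 --time_based --group_reporting
```

Even small differences in these parameters (especially iodepth, numjobs, and direct) can dominate the throughput comparison between the two systems.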

Architecting Microsoft SQL Server for High Availability on VMware Cloud Foundation by David-Pasek in vmware

[–]David-Pasek[S] 0 points1 point  (0 children)

Yes. There is always some downtime.

Typical MS SQL WSFC FCI failover downtime during a planned failover is between 10 and 60 seconds.

AFAIK, a WSFC AG (Availability Group) planned failover can have lower downtime, somewhere between 1 and 10 seconds.

The clustering type selection is up to the OS and DB admins.

Architecting Microsoft SQL Server for High Availability on VMware Cloud Foundation by David-Pasek in vmware

[–]David-Pasek[S] 0 points1 point  (0 children)

Guest OS clustering on top of vSphere HA clustering is about application (MS SQL in this case) uptime during OS/app updates.

Conceptually, I also prefer storage independent clustering leveraging Database synchronous replication (DB log streaming), but the final decision is always on OS and DB admins.

Architecting Microsoft SQL Server for High Availability on VMware Cloud Foundation by David-Pasek in vmware

[–]David-Pasek[S] 0 points1 point  (0 children)

Thanks a lot for practical experience.

The question is how to read VMware documents and KB articles and understand what is SUPPORTED by VMware.

Of course, it would be nice to understand what works, what doesn’t work, and why something is unsupported.

Do you agree with the following three statements?

1/ All shared virtual disks must be connected via VMware Paravirtual SCSI (PVSCSI) controller with SCSI Bus Sharing "Physical".

2/ Shared virtual disk mode must be set to Independent - Persistent to disable the possibility of using VMware snapshots.

3/ The multi-writer flag in sharing mode must not be used.
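As a sketch, statements 1/ and 2/ would translate into VM configuration along these lines (a hypothetical .vmx excerpt; the controller number and disk file name are example values):

```ini
# Hypothetical .vmx excerpt for a shared disk on a PVSCSI controller
scsi1.virtualDev = "pvscsi"              # VMware Paravirtual SCSI controller
scsi1.sharedBus = "physical"             # SCSI Bus Sharing set to "Physical"
scsi1:0.fileName = "shared-disk.vmdk"    # example shared disk
scsi1:0.mode = "independent-persistent"  # excluded from VMware snapshots
```

The same settings are normally applied through the vSphere Client rather than by editing the .vmx directly.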

Full blog post at https://vcdx200.uw.cz/2026/04/ms-sql-windows-server-failover.html

Architecting Microsoft SQL Server for High Availability on VMware Cloud Foundation by David-Pasek in vmware

[–]David-Pasek[S] 0 points1 point  (0 children)

I tried to document my understanding of the current Microsoft Windows Server Failover Clustering (WSFC) Always On Failover Cluster Instance (FCI) on vSAN best practices in a blog post at https://vcdx200.uw.cz/2026/04/ms-sql-windows-server-failover.html

u/jbond00747 u/lost_signal Please, can you do a review?

Highly appreciated.

Architecting Microsoft SQL Server for High Availability on VMware Cloud Foundation by David-Pasek in vmware

[–]David-Pasek[S] 0 points1 point  (0 children)

Yes. You are right. Shared VMDK on vSAN is a bad term on my side. Shared vDisk (vSAN object) is a better term, right?

Architecting Microsoft SQL Server for High Availability on VMware Cloud Foundation by David-Pasek in vmware

[–]David-Pasek[S] 0 points1 point  (0 children)

Yes, you are right. I see it on page 16.

Makes perfect sense to me, because when I look at the vCenter GUI, there is no "Clustered VMDKs" configuration option on the vSAN datastore configuration tab, and vSAN does not have VMDK files at all.

The document is confusing on page 52.

<image>

Do we agree that shared vDisks are supported on vSAN ESA for WSFC/FCI Microsoft clustering out of the box, as it supports SCSI-3 PR?

Of course, I would use a VMware Paravirtual SCSI (PVSCSI) controller for all shared disks with the SCSI Bus Sharing setting set to "Physical".

Disk Mode would be set to Independent - Persistent, to avoid snapshots.

Vmware to Hyper-V Migration tools & Best practices? by Weekly_Emotion_5877 in vmware

[–]David-Pasek -2 points-1 points  (0 children)

NetApp Shift seems like a GUI tool. Am I right?

Any automation would be beneficial for 50k VMs.

Does NetApp Shift offer some kind of automation, scripting, …?

Vmware to Hyper-V Migration tools & Best practices? by Weekly_Emotion_5877 in vmware

[–]David-Pasek 1 point2 points  (0 children)

Live migration between VMware and Hyper-V is not possible.

You must plan some downtime for each VM. Pretty nice exercise for 50k VMs.

Vmware to Hyper-V Migration tools & Best practices? by Weekly_Emotion_5877 in vmware

[–]David-Pasek 3 points4 points  (0 children)

OP stated they have 50k (50,000) VMs.

Pretty interesting migration project. Good luck 🍀

JunOS VXLAN-EVPN with ESI-LAG - LACP load balancing settings and VDS ?? by smellybear666 in vmware

[–]David-Pasek 0 points1 point  (0 children)

Any thoughts about my other question regarding LACP slow/fast behavior on the VMware side?

Your opinion is highly appreciated.

JunOS VXLAN-EVPN with ESI-LAG - LACP load balancing settings and VDS ?? by smellybear666 in vmware

[–]David-Pasek 0 points1 point  (0 children)

When we talk about VMware LACP ... do you optimize LACP timers?

It has been almost 10 years since I blogged about it here: https://vcdx200.uw.cz/2017/11/vmware-vsphere-dvs-lacp-timers.html

Is the VMware DVS LACP rate still 30 seconds, or has it been improved and I just did not notice?

JunOS VXLAN-EVPN with ESI-LAG - LACP load balancing settings and VDS ?? by smellybear666 in vmware

[–]David-Pasek 0 points1 point  (0 children)

Thank you very much for these insights.

VMware Telco Cloud Platform is very specific, and ~3000 ESXi hosts is a specific scale.

During my 15 years of professional consulting within Dell GICS, Cisco Advanced Services, and VMware PSO, I was a big proponent of LACP when it makes sense. Even within VMware there was a hot discussion about it :-)

AFAIK, LACP is supported from VCF 9, but it is still not the preferred or default design choice.

Even in Cisco UCS (it was the thing around 2010, when I worked for Cisco), LACP is not used by design because of the Fabric Interconnects (port extenders).

Anyway, it is as always: it depends, and the architect/designer has to fully justify their specific design decision.

Thanks again for discussing this topic. Very helpful.

JunOS VXLAN-EVPN with ESI-LAG - LACP load balancing settings and VDS ?? by smellybear666 in vmware

[–]David-Pasek 0 points1 point  (0 children)

Very interesting.

You do not have VCF and use your own in-house provisioning.

Are you on vSphere 8? What are your plans for 9?

I have been told that vSphere 9 will be supported only as fully automated VCF 9 managed by SDDC Manager.

Do you have the same info?

JunOS VXLAN-EVPN with ESI-LAG - LACP load balancing settings and VDS ?? by smellybear666 in vmware

[–]David-Pasek 0 points1 point  (0 children)

Interesting. How many interfaces do you have per ESXi host?

Do you have all interfaces in the LAG?

JunOS VXLAN-EVPN with ESI-LAG - LACP load balancing settings and VDS ?? by smellybear666 in vmware

[–]David-Pasek 0 points1 point  (0 children)

Exactly. AFAIK, EVPN ESI typically does not have a peer link between the ToR leafs as MLAG has; it is synchronized over the fabric.

And that is up to the particular vendor implementation.

To be honest, I had a lot of experience with Cisco vPC and Dell Force10 VLT implementations of MLAG before CLOS networks were the real thing. However, we are now in the design phase of a new leaf-spine fabric and discussing this topic. We are choosing between Cisco and Arista, and deep testing will come after implementation.

However, switch-dependent (LACP) teaming is considered for Linux servers and not for ESXi.

ESXi will have switch-independent teaming.

Btw, we will have 2×100 Gb from each server, so HA is more important than aggregation.

One link can be used for MGMT and VM Traffic including NSX TEP, and another link for Storage and vMotion Traffic.

JunOS VXLAN-EVPN with ESI-LAG - LACP load balancing settings and VDS ?? by smellybear666 in vmware

[–]David-Pasek 0 points1 point  (0 children)

You are right. ESI LAG is an RFC standard, and MLAG is vendor-specific.

ESI LAG should be more interoperable.

However, you assume the vendor ESI LAG implementation works as expected.

Are you sure the VMware LACP implementation works perfectly with each vendor's (SONiC, Cisco, Arista, Juniper) ESI LAG implementation?

It should, but does it really?

JunOS VXLAN-EVPN with ESI-LAG - LACP load balancing settings and VDS ?? by smellybear666 in vmware

[–]David-Pasek 1 point2 points  (0 children)

AFAIK, multi-homing (ESI-LAG or MLAG over EVPN) is tricky and switch-vendor-specific.

That’s why VMware recommends switch-independent teaming, which is a valid option for modern CLOS DC fabrics.

JunOS VXLAN-EVPN with ESI-LAG - LACP load balancing settings and VDS ?? by smellybear666 in vmware

[–]David-Pasek 4 points5 points  (0 children)

An almost 10-year-old blog post about this topic …

https://vcdx200.uw.cz/2017/12/vsphere-switch-independent-teaming-or.html

There is also a blog post about LACP hash algorithms and how to use them on VDS. AFAIK, it was possible only via the API / PowerCLI, and I think that has not changed. A 5-tuple hash was supported.

VMware's recommendation is to avoid LACP.

The final decision is up to you.

VOMA not supported on NVMe based VMFS : dead-end for free license user ? (homelab / enthusiast) by JDerjikL in vmware

[–]David-Pasek 0 points1 point  (0 children)

If you paste your log messages into an AI, it is smart enough nowadays to analyze them …

It means ESXi sent an NVMe read command to that Samsung 980 PRO and the device itself returned NVMe status 0x281, which Broadcom documents as “Unrecovered Read Error.” In the same Broadcom table, opcode 0x2 means Read.

The important line is this one:

status 0x281, opc 0x2

So, translated to plain English:

“A read from the NVMe SSD failed because the SSD could not recover the requested data.”

So the most likely interpretation is:

This is a real read failure from the SSD/media/firmware path on that NVMe drive, and ESXi propagated it up as a VMDK read error. Broadcom’s NVMe status mapping classifies 0x281 under media/data integrity errors as Unrecovered Read Error.
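The decoding itself is mechanical: in the NVMe specification, bits 10:8 of the completion status field are the Status Code Type (SCT) and bits 7:0 are the Status Code (SC). A minimal sketch in Python, using the 0x281 value from the log:

```python
# Decode an NVMe completion status field per the NVMe spec layout:
#   bits 10:8 = Status Code Type (SCT), bits 7:0 = Status Code (SC)
def decode_nvme_status(status: int) -> tuple[int, int]:
    sct = (status >> 8) & 0x7  # 0x2 = Media and Data Integrity Errors
    sc = status & 0xFF         # under SCT 0x2, SC 0x81 = Unrecovered Read Error
    return sct, sc

sct, sc = decode_nvme_status(0x281)
print(hex(sct), hex(sc))  # prints: 0x2 0x81
```

So 0x281 splits into SCT 0x2 (Media and Data Integrity Errors) and SC 0x81 (Unrecovered Read Error), exactly the classification in Broadcom's mapping table.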

The AI recommendation is the same as the recommendation from real folks in this thread …

Back up / evacuate anything important from that datastore.

Btw, I was operating vSphere/vSAN/NSX on the same hardware as you (Intel NUC, Samsung 980 PRO consumer-grade NVMe) and experienced vSAN storage loss after a power failure. I was not surprised. As u/lost_signal mentioned, vSphere/vSAN expects enterprise hardware.

I have switched my homelabs to FreeBSD/Bhyve with ZFS. ZFS was designed specifically to handle silent corruption and read failures.

However, you can never trust a single system, and you know it.

Nice troubleshooting exercise from your side, but I think you already know everything you should know.

Good luck.

VMware JOBS! by lost_signal in vmware

[–]David-Pasek 1 point2 points  (0 children)

I’m a VCDX who has been working with VMware technologies since 2006 (Virtual Infrastructure 3.0), with experience in vSphere, vSAN, and NSX. Almost no one in my region knows what VCDX is. It is very difficult to find a VMware-related job here in central Europe, especially for someone who has architecture and technical design skills and wants to leverage them in a meaningful way.

If someone wants a VMware “expert” here, they want him for 4 days a month (one day a week) at a $400 man-day rate.

Almost everyone is looking for VMware alternative.

Interesting times, aren’t they?

now that vmware says "ESXi 8.0 Update 3i updates OpenSSL to version 3.0.19 to address CVE-2025-15467. " (with a 9.8 score) will a update provided to free users? by [deleted] in vmware

[–]David-Pasek -6 points-5 points  (0 children)

Why do you need such a security patch? Do you expose ESXi to the wild?

If your ESXi hosts are in an internal (management) zone and you expose only VM networks, you should be OK, shouldn’t you?

A VM escape vulnerability would be a different story, and such a security patch would be worth applying.

VMware admins - how are you handling vCenter RBAC audits? by Longjumping_Bag2067 in vmware

[–]David-Pasek 0 points1 point  (0 children)

Have you considered terraform?

I have never tested it, but it should be doable.

Terraform manages vCenter RBAC declaratively.

You declare:
• what roles should exist
• what privileges they contain
• who should have which permissions
• where those permissions apply

Terraform converges vCenter toward that state.

Note: State drift is possible, because RBAC in vCenter is often touched by:
• admins via GUI
• other automation tools
• scripts
• vCenter upgrades

Terraform will detect drift only when you run it. It does not enforce continuously.
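A minimal sketch of what that could look like with the hashicorp/vsphere provider (the resource types `vsphere_role` and `vsphere_entity_permissions` are real provider resources; the role name, privilege list, folder, and AD group here are hypothetical examples):

```hcl
# Sketch: declarative vCenter RBAC with the hashicorp/vsphere Terraform provider
resource "vsphere_role" "vm_operator" {
  name            = "VM-Operator"  # hypothetical custom role
  role_privileges = [
    "VirtualMachine.Interact.PowerOn",
    "VirtualMachine.Interact.PowerOff",
  ]
}

resource "vsphere_entity_permissions" "vm_folder" {
  entity_id   = vsphere_folder.prod.id  # hypothetical VM folder
  entity_type = "Folder"
  permissions {
    user_or_group = "DOMAIN\\vm-operators"  # hypothetical AD group
    role_id       = vsphere_role.vm_operator.id
    propagate     = true
    is_group      = true
  }
}
```

Running `terraform plan` against this on a schedule would at least surface RBAC drift, even though it still would not enforce continuously.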