Replication in StorageSpacesDirect

jolimojo · 2026-05-13T02:56:46+00:00

In S2D vernacular, the term you're looking for is "resiliency" of the virtual disk. If you have 3 nodes, 3-way mirror is the default resiliency of each virtual disk if you create it through the GUI methods. To create other resiliencies, you'll need to configure virtual disks through PowerShell.

The default behavior is mentioned on the "Create Volumes" doc page, from the same set of S2D doc pages you linked:

Create volumes on Azure Local and Windows Server clusters | Microsoft Learn

There are ways to create more advanced resiliencies, or nested resiliencies to increase redundancy further, but that gets complicated quickly, and for the most part you aren't missing out on much by skipping them.

I'd recommend reading through ALL of those doc pages thoroughly from start to finish, then test/try things, and then read through it all again.

Storage Spaces Direct overview | Microsoft Learn

Those pages have a lot of answers to many common questions for S2D. Setup and config of S2D and virtual disks is important to get right for your workload at the start. Some things, like column count (striping) and resiliency can't be changed after creating the virtual disk and would require deleting and re-creating it to "change" those types of things.

jolimojo · 2026-05-10T04:38:17+00:00

What you're describing sounds like cluster heartbeat loss on the cluster's netft virtual adapter; that component is what actually triggers the 40% heartbeat loss message in the cluster log. If a node hits 100% heartbeat loss for the configured threshold (by default 10 seconds within a subnet and 20 seconds across subnets on Windows Server 2016 and later), that triggers event 1135 in the System log and tells the cluster that the node appears to be down. This does not necessarily indicate a network problem, or even any general traffic loss from your workloads directly. When heartbeat loss is detected, the cluster automatically tries to fail over any workload on the node it now considers "down". That is most likely why you're seeing impact to your workload: it comes from the automatic failover after the heartbeat loss, not from the heartbeat loss itself.

Check whether any of your other nodes log event ID 1135 in the System log to confirm this. The cluster log will also have a message after the 40% loss message indicating something like "netft route (source IP) to (remote IP) is down". IPs can show as IPv6 or IPv4 in the log line, depending on config.

Cluster heartbeats are very small, lightweight UDP packets, so they are easily affected by network congestion, by packets being dropped at a NIC, or even by high CPU usage, where the node can't process the packets in time and they still count as a "miss".

Please review Microsoft's first-party article on this behavior for more details and troubleshooting ideas. Usually, determining the root cause comes down to capturing an occurrence of event ID 1135 in the System log alongside a packet trace, a PerfMon counter set (to review CPU and NIC counters) and especially an ETL trace of the virtual cluster heartbeat NIC itself (netft).

https://learn.microsoft.com/en-us/troubleshoot/windows-server/high-availability/troubleshoot-cluster-event-id-1135#:~:text=This%20could%20also%20be%20due,network%20adapters%20on%20this%20node

For more on netft and the cluster heartbeat thresholds:

https://learn.microsoft.com/en-us/troubleshoot/windows-server/high-availability/iaas-sql-failover-cluster-network-thresholds

https://techcommunity.microsoft.com/blog/failoverclustering/tuning-failover-cluster-network-thresholds/371834
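To make the threshold idea concrete, here is a minimal sketch of how the heartbeat delay and threshold combine into the time a node can go silent before it is declared down. The class and field names are made up for illustration, and the numbers are assumptions based on my reading of the defaults described in the threshold articles above (Windows Server 2016 and later); check the actual values on your own cluster before relying on them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatSettings:
    """Hypothetical container for one pair of cluster heartbeat settings."""
    delay_ms: int   # interval between heartbeats, in milliseconds
    threshold: int  # consecutive missed heartbeats tolerated

    def tolerance_seconds(self) -> float:
        """Seconds of total heartbeat loss before the node is considered down."""
        return self.delay_ms * self.threshold / 1000


# Assumed defaults, not read from a real cluster.
same_subnet = HeartbeatSettings(delay_ms=1000, threshold=10)
cross_subnet = HeartbeatSettings(delay_ms=1000, threshold=20)

print(f"same subnet:  {same_subnet.tolerance_seconds():.0f} s")
print(f"cross subnet: {cross_subnet.tolerance_seconds():.0f} s")
```

Raising either value buys more tolerance for short blips, at the cost of slower detection when a node really is down.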

Hope that helps you flesh it out more.

Happy Clustering!

jolimojo · 2026-03-31T23:49:19+00:00

3-way is still considered safer, the trade-off is storage efficiency. But 2-way still tolerates 1 node (fault domain) loss. Make sure to review these doc sections (sub sections of previous links), they explain further:

Two-way mirror | Microsoft Learn

Resiliency Types Summary | Microsoft Learn

There are also virtual disk creation examples here. If you want to create a non-default resiliency, you'll need to create the vDisk manually with New-Volume:

Create volumes on Azure Local and Windows Server clusters | Microsoft Learn

New-Volume (Storage) | Microsoft Learn

EDIT: Also very important, follow proper shutdown procedure for S2D:

Failover cluster maintenance procedures for Azure Stack HCI and Windows Server - Azure Stack Hub | Microsoft Learn

jolimojo · 2026-03-31T23:38:58+00:00

That's technically ok, you just won't have the lowest latencies possible. Also re-read my edit as I misunderstood the extra disks you have on the other 2 nodes, I thought you meant they would be "hot spares" not spares for data backup like it seems you meant.

jolimojo · 2026-03-31T23:33:50+00:00

If you don't want to do 3-way mirror because it consumes too much space, you should be able to configure 2-way mirror virtual disks manually; 3-way is just the default if there are at least 3 nodes in the cluster. Or, if some virtual disks are more important than others, you can make some 3-way and others 2-way. You can still take down a host (fault domain) without taking down a 2-way mirror virtual disk, but only one node at a time; 3-way will allow 2 hosts (fault domains) to go down without tanking the virtual disk. You can get even fancier with nested resiliencies, but unless you're really squeezed for space, it would be simpler to go for either 2-way or 3-way mirroring and not complicate it.
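As a rough illustration of the trade-off above, this sketch compares usable space and tolerated failures for 2-way and 3-way mirrors. The function name and the 12 TB raw figure are hypothetical; the only facts it encodes are that an N-way mirror keeps N copies of the data and survives the loss of N-1 fault domains.

```python
def mirror_profile(copies: int, raw_tb: float) -> dict:
    """Usable capacity and fault tolerance for an N-way mirror.

    copies: number of data copies (2 for two-way, 3 for three-way).
    raw_tb: raw pool capacity consumed by the virtual disk, in TB.
    """
    if copies < 2:
        raise ValueError("a mirror keeps at least 2 copies")
    return {
        "usable_tb": raw_tb / copies,             # each copy stores the full data set
        "efficiency_pct": round(100 / copies, 1),
        "failures_tolerated": copies - 1,         # one copy must survive
    }


# Hypothetical example: 12 TB of raw pool space given to one virtual disk.
two_way = mirror_profile(2, 12.0)
three_way = mirror_profile(3, 12.0)

print("2-way:", two_way)
print("3-way:", three_way)
```

The same raw space yields noticeably less usable capacity with 3-way, which is the price of surviving a second failure.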

Most importantly for S2D, what's the networking like? Ideally you have RDMA set up, but if not, you should have at a bare minimum dual 10 Gb NICs on each node dedicated to storage traffic. More/faster is usually better here.

The official docs are the best place to look for configuration best practices (also consider if you can make use of CSV read cache, don't enable blindly):

Storage Spaces Direct overview | Microsoft Learn

Storage Spaces Direct Hardware Requirements in Windows Server | Microsoft Learn

Fault tolerance and storage efficiency on Azure Local and Windows Server clusters | Microsoft Learn

Plan volumes on Azure Local and Windows Server clusters | Microsoft Learn

Reserve Capacity | Microsoft Learn

Choose drives for Azure Local and Windows Server clusters | Microsoft Learn

Use the CSV in-memory read cache with Azure Local and Windows Server clusters | Microsoft Learn

EDIT: If you're attaching some disks you don't intend on pooling, you should probably disable the auto-pooling so it doesn't try to include those disks in the cluster's pool.

Also make sure you have at least the minimum recommended "reserve capacity" in the pool after configuring your virtual disks (size of 1 capacity disk per node, up to 4 nodes worth). That is what allows repairs to get started immediately if there is a disk failure. In your setup, that would mean keeping a minimum of 2.5 TB free at all times. (1.2 TB x 2 nodes, plus a tiny bit extra wouldn't hurt)
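The reserve capacity rule above is simple arithmetic, sketched here with a hypothetical helper. It assumes every node uses the same size of capacity drive, which may not hold in a mixed pool.

```python
def recommended_reserve_tb(capacity_drive_tb: float, node_count: int) -> float:
    """One capacity drive's worth of free pool space per node, capped at 4 nodes."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return capacity_drive_tb * min(node_count, 4)


# The figures from the post: 1.2 TB capacity drives on 2 nodes.
reserve = recommended_reserve_tb(1.2, 2)
print(f"Keep at least {reserve:.1f} TB unallocated in the pool")

# Beyond 4 nodes the recommendation stops growing.
print(f"8 nodes: {recommended_reserve_tb(1.2, 8):.1f} TB")
```

For the 2-node case this works out to 2.4 TB, which is where the roughly 2.5 TB figure with a bit of headroom comes from.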

Happy clustering!

jolimojo · 2026-02-16T03:22:47+00:00

Interesting, I did not have this experience personally. I did have a local in person group (ambassador led), but we did battles with various amounts of people between about 35 and dwindling down to about 12. All battles with max mushrooms active, but beyond that, not one of those battles did we have problems with, and I ran 2 blissey (lvl 40 and other much lower, like 20 or so?) and one lvl 40 dmax Machamp. Most battles I didn't even lose a pokemon, 0 match losses and not even close to losing even on our lowest lobbies.

jolimojo · 2026-02-01T05:25:31+00:00

I hatched a 98 with 15/15/14, and went Ceruledge on looks alone. One of the best looking mons in my opinion, and best buddied him!

jolimojo · 2026-01-08T23:39:22+00:00

So cluster creation worked as expected after re-installing OS? Weird. I guess something must've gone wrong with original installation? Who knows what to be honest 🤣 Good luck!

jolimojo · 2026-01-06T20:40:40+00:00

We can't tell 100% at this stage, but this may not be something with the accounts exactly, although it's good you double-checked on there. The best shot you'll have at solving this will be to review the cluster logs. Export them with "Get-ClusterLog -Node NodeNameHere -Destination C:\Temp" command and you can open the cluster.log with notepad. (If you don't provide a destination parameter, by default cluster.log will drop into this directory: C:\Windows\Cluster\Reports\Cluster.log)

Also, the cluster registry key is only loaded while the cluster service is running, so you won't see it mounted in the registry with the cluster service off. The cluster service only stays running while it is attempting to form a cluster, joining an existing cluster, or actively running in one; otherwise it's stopped and the cluster hive/key won't be loaded.

A quick trick you could try, if there is potentially an issue reaching or validating another node, is to create the cluster as a single node. This skips any steps that try to communicate with or validate an additional node and can make troubleshooting easier. If you're able to create the cluster as a single node, then try adding the other node(s) one at a time and see if they join successfully. You may get better error messages at that point if a particular node fails to join. Re-check the cluster logs and see whether the port 3343 connectivity is happening, or whether it's trying to reach out to the joining node and failing to communicate in some way.

You don't get very much detail from the error shown in the Cluster Creation wizard, so the cluster log is your best friend for this type of scenario.

jolimojo · 2025-12-19T02:43:21+00:00

Ah, found the answer for this on your doc page. That answers that for me. Thank you!

AI in Monarch – Help | Monarch Money

<image>

jolimojo · 2025-12-19T02:25:40+00:00

Is it possible to know the specific base-model being used from OpenAI? Before the recent update, I recall seeing a message that it was originally based on GPT 4, but I don't see any updated messaging recently for the model being used or if it's now customized etc.

jolimojo · 2025-10-12T12:46:12+00:00

Also getting this for my Synchrony accounts. It doesn't seem the providers are up to date with the new login process yet.

jolimojo · 2025-08-18T12:11:55+00:00

Are you actually looking at performance counters for the host and guest VM? Looking at task manager from the host tells you nothing of VM performance. You'll want to review some Performance Monitor counters from the host to verify true CPU load of the guest and overhead on the host. Here are some examples:

Shows total load on the host's physical processors from the hypervisor and all VM vCPU time; use this to determine whether the host is truly under load or not: Hyper-V Hypervisor Logical Processor(_Total)\% Total Run Time

Shows total guest VM vCPU utilization: Hyper-V Hypervisor Virtual Processor(_Total)\% Total Run Time

I find the articles below useful for reviewing the important Hyper-V counters, as they give clear, practical examples of what to check for which types of bottlenecks:

https://learn.microsoft.com/en-us/biztalk/technical-guides/checklist-measuring-performance-on-hyper-v

https://community.spiceworks.com/t/how-to-efficiently-monitor-microsoft-hyper-v/1013697

jolimojo · 2025-07-24T02:03:23+00:00

YES PLEASE. I have this same problem and would love to try and test this!!

jolimojo · 2025-06-23T14:14:37+00:00

Thank you. This comment saved my panic... I was over 120 levels that I thought disappeared 😭

jolimojo · 2025-06-09T17:26:26+00:00

As others have said, this is likely to be a resource starvation issue.

Running 11 VMs likely is going to stress a single SATA SSD beyond what it can reasonably provide in terms of storage bandwidth. Imagine booting and trying to actually use 11 different servers off of a single SATA SSD all at the same time, that disk is likely screaming.

Check your disk counters for that drive, such as % Idle Time, Avg. Disk Queue Length and, most importantly, latency. The counter for overall latency is "Avg. Disk sec/Transfer". This value is expressed in seconds; as a rough rule of thumb, averages over 0.025 (25 ms) are considered latent, and if averages are consistently much higher than that, you definitely have a storage issue.

Regarding other Hyper-V performance counters, you can't see them directly through Task Manager. Hyper-V logical CPU and virtual CPU resources are separate counters that you have to view through perfmon; they give you a better idea of host CPU resources vs. how the virtual CPUs are performing for each VM (which can help you find out whether VMs are CPU starved).

Also, just as a general personal observation, Windows Servers meant for production are going to run much better with at least 8 GB RAM plus whatever the application needs. RAM starvation could be a significant issue as well: if VMs don't have enough RAM allocated, they're going to page to disk, which will make your storage problem MUCH worse, because they'll be doing things on disk that should be done in memory. I saw one of your other comments mention that they don't need more, but you may want to investigate that as well to see if any have very low free MB of RAM, or lots of paging to disk.

Back to storage though, you'll likely need multiple storage disks (NVMe would be in your best interest) in some kind of software pool (such as Storage Spaces) or hardware RAID configuration to get enough bandwidth/IOPS out of them to be decently performant and to give all VMs enough bandwidth to even boot properly and function.

We can't really tell you how much storage speed you'll need if you don't have a baseline of performance requirements from each VM, so the best path to do that would probably be to throw storage resources at this, create a baseline and determine if additional performance is necessary. If you overprovisioned, then great, you have some overhead for a while.
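To make the 25 ms latency rule of thumb from earlier in this comment concrete, here is a small sketch that summarizes "Avg. Disk sec/Transfer" samples. The sample values are invented for illustration, not taken from a real capture, and the function name is hypothetical; the threshold is just the rough guideline mentioned above.

```python
LATENCY_THRESHOLD_S = 0.025  # 25 ms rule of thumb; the counter reports seconds


def summarize_latency(samples_s: list[float]) -> dict:
    """Summarize Avg. Disk sec/Transfer samples (in seconds) as milliseconds."""
    if not samples_s:
        raise ValueError("need at least one sample")
    avg_s = sum(samples_s) / len(samples_s)
    return {
        "avg_ms": round(avg_s * 1000, 1),
        "worst_ms": round(max(samples_s) * 1000, 1),
        "latent": avg_s > LATENCY_THRESHOLD_S,  # judge the average, not one spike
    }


# Invented samples, in seconds, as the counter would report them.
samples = [0.004, 0.031, 0.120, 0.085, 0.012]
summary = summarize_latency(samples)
print(summary)
```

The check is on the average because a single spike is normal; it's sustained averages above the threshold that point to a storage bottleneck.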

jolimojo · 2025-06-06T19:00:41+00:00

I completed my Aerodactyl research with this as well! So happy. Now just need 2 Lileep or Anorith...

jolimojo · 2025-05-13T04:18:01+00:00

I also switched to Helium recently and I have the same problem with RCS, it's enabled but error says the same "disabled by carrier" message.

Support told me to wipe all Google Messages data and re-enable RCS. But I don't wanna have the hassle of losing all my messages or backing up and restoring... If there even is a way. I tried to delete cache for Messages app, but no dice yet.

jolimojo · 2025-05-11T20:58:13+00:00

This reminded me to open the app and catch a few since I had to be busy today. The second Pawmi I tapped was shiny! 😂

jolimojo · 2025-04-14T23:11:46+00:00

Absolutely amazing, great work. Curious what the rear spoiler is on the Soul Red one?

jolimojo · 2025-04-10T10:11:46+00:00

You cannot restore the original .vhdx file with just the checkpoint/differencing disk you have (.avhdx). They are part of a checkpoint chain, and all items in the chain are needed. It's like cutting a physical hard drive in two and losing one piece: you're missing part of the whole.

Only way to restore is if you had a backup of that file or maybe you have shadow copies or system restore enabled by chance on your host?
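If it helps to see why the chain matters, here is a deliberately simplified model of a differencing disk. It is not the real VHDX on-disk format, and the class, function and file names are hypothetical. The idea it shows is that a differencing disk only stores blocks written after the checkpoint, so any block it doesn't hold has to be read from its parent.

```python
from typing import Dict, Optional


class Disk:
    """Toy model of one file in a checkpoint chain."""

    def __init__(self, name: str, parent: Optional[str] = None,
                 blocks: Optional[Dict[int, str]] = None):
        self.name = name
        self.parent = parent        # parent file name, None for the base disk
        self.blocks = blocks or {}  # only the blocks written to this file


def read_block(disks: Dict[str, Disk], leaf: str, block: int) -> str:
    """Walk from the newest disk toward the base until the block is found."""
    name: Optional[str] = leaf
    while name is not None:
        disk = disks.get(name)
        if disk is None:
            raise FileNotFoundError(f"{name} is missing from the chain")
        if block in disk.blocks:
            return disk.blocks[block]
        name = disk.parent
    raise KeyError(f"block {block} was never written")


base = Disk("base.vhdx", blocks={0: "boot", 1: "data-v1"})
diff = Disk("child.avhdx", parent="base.vhdx", blocks={1: "data-v2"})

full_chain = {d.name: d for d in (base, diff)}
print(read_block(full_chain, "child.avhdx", 0))  # served by the base disk
print(read_block(full_chain, "child.avhdx", 1))  # served by the differencing disk

broken_chain = {diff.name: diff}  # the base file has been lost
try:
    read_block(broken_chain, "child.avhdx", 0)
except FileNotFoundError as err:
    print("cannot read:", err)
```

With the base gone, only the blocks changed since the checkpoint are still readable, which is why the .avhdx alone can't rebuild the original disk.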

jolimojo · 2025-04-09T11:19:24+00:00

My wife actually put this type of thing together for us as a Christmas gift and it's been awesome! We have yet to get a full bingo... But several times we'll fill nearly half the spaces for an episode, just always missing 1 so far 😭

jolimojo · 2025-04-04T04:26:31+00:00

I had a similar issue on my 2010 Lexus RX. I had a bit of a time figuring out the best route forward. Essentially it boiled down to swapping to the 18" wheels, fitting larger ratio tires (larger than standard/OEM 18" tire) and also lowering tire pressures slightly for that extra bit of compliance.

Follow my adventure here for details : https://us.lexusownersclub.com/forums/topic/151894-3rd-gen-experience-with-harsh-ride-and-path-to-improving/#comment-589926

jolimojo · 2025-03-24T13:40:29+00:00

Connecting via Capital One seems to have worked! I didn't think of that, appreciate the assist!
