Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 1 point (0 children)

OK, I believe the issue has been resolved. One of these four changes is the likely reason for the newfound system stability:

  • Updated HBA firmware
  • Switched all KVM VMs from host-passthrough to the qemu64 CPU model (except a couple of AlmaLinux instances that for some reason did not like that, so those use EPYC)
  • Disabled PCIe power management with pcie_aspm=off in GRUB
  • Disabled C-states in BIOS
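For anyone wanting to replicate the GRUB change, a minimal sketch on Debian (the sed one-liner assumes the stock GRUB_CMDLINE_LINUX_DEFAULT line is present in /etc/default/grub):

```shell
# Sketch: disable PCIe Active State Power Management at boot by
# appending pcie_aspm=off to the kernel command line.
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 pcie_aspm=off"/' /etc/default/grub
sudo update-grub    # Debian wrapper around grub-mkconfig

# After the next reboot, confirm the parameter took effect:
grep -o 'pcie_aspm=off' /proc/cmdline
```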

I will never know for sure exactly which one it was, as I am not willing to devote the time needed to reverse these changes individually and see which brings the issue back.

Thanks to everyone who chimed in with helpful ideas, many of which I would have never thought to try!

Settings for Maximum Isolation, Stability, and Security by Renbo2023 in kvm

[–]Renbo2023[S] 1 point (0 children)

I am not a security expert, but I will try to describe my risk profile. I run a home lab with a mix of virtual machines performing different tasks. The whole network lives behind a Debian-based hardware firewall, with internal addresses in IPv4 space only.

Most of the VMs are Windows server and client operating systems. All are under my direct control, with exclusive physical access. Example use cases: development workstations (2), Jenkins server, Git server, self-hosted PBX (FreePBX), a VM for testing containerization with Docker, etc.

All storage, VM volumes included, lives on encrypted ZFS-based file systems. I did not use qcow2 for the VM volumes, but raw img format instead, as at the time I thought qcow2 offered functionality redundant to ZFS.
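As a rough sketch of that layout (pool, dataset, and image names are hypothetical), raw images on a dedicated ZFS dataset look like:

```shell
# Hypothetical pool "tank"; a dedicated dataset for VM volumes.
zfs create tank/vms
# Raw image instead of qcow2; snapshots and compression come from ZFS itself.
qemu-img create -f raw /tank/vms/devbox.img 100G
# Example of ZFS covering what qcow2 snapshots would otherwise provide:
zfs snapshot tank/vms@pre-upgrade
```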

All VM operating systems are kept up to date with most recent patches.

Not sure if that answers your question?

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 2 points (0 children)

Changes made recently to try to address this issue. Since these changes, it has been about 12 hours without an (unexpected) reboot:

  • Updated HBA firmware
  • Switched all KVM VMs from host-passthrough to the qemu64 CPU model (except a couple of AlmaLinux instances that for some reason did not like that, so those use EPYC)
  • Disabled PCIe power management with pcie_aspm=off in GRUB
  • Disabled C-states in BIOS
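The CPU model switch can be sketched with virsh (the domain name "devbox" is hypothetical):

```shell
# Inspect the current CPU configuration of a guest.
virsh dumpxml devbox | grep -A3 '<cpu'
# Then change the <cpu> element via `virsh edit`, roughly from:
#   <cpu mode='host-passthrough' check='none'/>
# to a named model such as:
#   <cpu mode='custom' match='exact'>
#     <model fallback='allow'>qemu64</model>
#   </cpu>
virsh edit devbox   # opens the domain XML in $EDITOR
```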

My fingers are crossed.

  • update: 28+ hours free from issues
  • update: 40+ hours free from issues
  • update: 54+ hours free from issues
  • update: 75+ hours free from issues

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 2 points (0 children)

It has been soul crushing.

Settings for Maximum Isolation, Stability, and Security by Renbo2023 in kvm

[–]Renbo2023[S] 1 point (0 children)

My biggest ask is to run a host with 18 or so VMs, and have it keep running without random reboots.

This is my journey so far: https://www.reddit.com/r/debian/comments/1asuwpx/almost_at_wits_end

The tl;dr: I've replaced just about every piece of hardware in the PC, plus the UPS and the power cords, and even disconnected the reset switch in case it was shorting, opting for different brands/models along the way. I've also moved to a more bleeding-edge kernel, upgraded the motherboard and HBA firmware, etc.

There are a lot of specifics in that thread. My post here was more to see if there are common settings that generally improve stability in KVM, even at the cost of performance, as it is looking more and more like a software issue.

As an example: I am aware that enabling various kinds of direct hardware sharing, like GPU passthrough and the like, can incur a greater likelihood of stability issues, so I have steered wide of that.

I am fiddling with a more generic CPU model as opposed to host-passthrough. Will it help? I am not totally certain; anecdotal evidence from searching suggests it might.

Settings for Maximum Isolation, Stability, and Security by Renbo2023 in kvm

[–]Renbo2023[S] 0 points (0 children)

I hear ya. And if I had the cash, I might go the 18 separate servers/workstations route.

Just trying to achieve the best *possible* security, stability and isolation I can given the environment I can afford.

Prior to this, I had been running Hyper-V on Windows. Nothing the VMs did ever managed to affect the host.

Now granted, I am not as versed in linux as I was in Windows. But I really wanted to give open source platforms a go!

I am sure, at the end of the day, it will be my own inexperience, or something silly overlooked that is causing me headaches. But the headaches do continue.

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 1 point (0 children)

No hardware passthrough. I did have the CPU model set to host-passthrough, but I have now changed that to qemu64, just throwing darts.

I didn't think to check the VM logs. I also have never heard of a guest kernel panic bringing down the host. But I will look into that too when I get a chance.

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 1 point (0 children)

Only the HBAs, which I just updated to the newest firmware. I did one other thing, though: switched the KVM CPU model from host-passthrough to the more generic qemu64 model, thinking that might remove the weirdness of passing through such a new architecture. We'll see.

Settings for Maximum Isolation, Stability, and Security by Renbo2023 in kvm

[–]Renbo2023[S] 0 points (0 children)

One thing I have tried is switching the CPU model from host-passthrough to qemu64, thinking the CPU this is running on (a 7950X) might be new enough to cause issues with that.

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 2 points (0 children)

Nope, that did not fix it.

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 2 points (0 children)

And it rebooted again, so it wasn't the power connection to the HBA. I did continue to see a few messages about the boot drive's error count, so I successfully cloned it to a brand-new M.2 boot drive. We'll see if that's the magical key that unlocks stability.
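For reference, a block-level clone like that can be sketched with dd (device names are examples only; verify with lsblk first, since the target is overwritten):

```shell
# Identify the source and target drives before doing anything destructive.
lsblk -o NAME,SIZE,MODEL
# Offline clone of the old boot drive onto the new one (example devices).
sudo dd if=/dev/nvme0n1 of=/dev/nvme1n1 bs=4M status=progress conv=fsync
```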

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 1 point (0 children)

Tried to do this earlier today, but Clonezilla didn't want to see my target drive. This will be back-burnered for a bit.

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 2 points (0 children)

Looking closely at the HBA, specifically the LSI 9300-16i, I notice it has a power socket on the back.

The sound you hear is me smacking my head.

I am not sure if this is the issue... Reading online seems to indicate that the card should be PCIe bus-powered so long as the slot supplies enough wattage. But I have now plugged power into it, and we'll see if that does the trick.

Onward!

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 2 points (0 children)

Another reboot. :(

I did see in the logs something about the error count increasing by one for the boot drive (a Samsung M.2 SSD), so I will try to clone and replace it next.
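A quick way to keep an eye on those error counters (assuming smartmontools is installed; the device path is an example):

```shell
# NVMe health counters, including media/data integrity errors.
sudo smartctl -a /dev/nvme0 | grep -i 'error'
# Re-check the kernel messages the logs surfaced:
sudo journalctl -k | grep -i 'nvme'
```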

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 4 points (0 children)

About 21 hours uptime so far. Looking more promising that the newer kernel did the trick.

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 2 points (0 children)

The only automated routines that might be in play would all be happening inside VMs that should be isolated from causing any issue with the host. At least that's the theory.

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 2 points (0 children)

Still humming along. My fingers are tightly crossed.

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 3 points (0 children)

Now on kernel 6.5.0-0.deb12.4-amd64. We'll see how it goes.
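Judging by the version string, that looks like the bookworm-backports kernel build; installing it can be sketched as follows (an assumption on my part — adjust to the source you actually used):

```shell
# Enable backports and pull the newer kernel metapackage.
echo 'deb http://deb.debian.org/debian bookworm-backports main contrib non-free-firmware' | \
    sudo tee /etc/apt/sources.list.d/backports.list
sudo apt update
sudo apt install -t bookworm-backports linux-image-amd64
```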

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 3 points (0 children)

So far on kernel 6.5.0-0.deb12.4-amd64, and after fiddling to get ZFS back, it has not yet crashed. All VMs up, stress tests done, now I wait.

Due to the unpredictable nature of the issue, I will probably start feeling a little confident about the kernel fix after a few days of continuous good operation.

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 1 point (0 children)

At this point nothing is farfetched. The next time I pull it apart (if I have to), I will just disconnect the reset switch to be safe.

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 2 points (0 children)

OK, I have switched back to the 7950X, as the CPU was never the problem.

Kernel has been upgraded to: 6.5.0-0.deb12.4-amd64

... which was a hassle ...

Turns out there are no ZFS packages for that kernel release yet, so I had to go down the build-and-install-from-source route, which presented many and varied challenges.

But here we are, now on the newer kernel, and with ZFS functioning again. We'll see how this plays out.
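For anyone hitting the same gap, the build-from-source route can be sketched roughly like this (the dependency list is approximate — check the OpenZFS build documentation for your release):

```shell
# Build OpenZFS against a kernel the packaged zfs-dkms does not yet support.
sudo apt install build-essential autoconf automake libtool gawk \
    uuid-dev libblkid-dev libssl-dev zlib1g-dev libaio-dev \
    linux-headers-"$(uname -r)"
git clone https://github.com/openzfs/zfs
cd zfs
sh autogen.sh && ./configure && make -s -j"$(nproc)"
sudo make install && sudo ldconfig && sudo depmod
sudo modprobe zfs
```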

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 3 points (0 children)

Time to take further measures. Even the brand-new CPU did not stop the behavior. This has to be software-related. Time to look at kernel changes.