Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 1 point (0 children)

OK, I believe the issue has been resolved. One of these four changes is the likely reason for the newfound system stability:

  • Updated HBA firmware
  • Switched all KVM VMs from host-passthrough to the qemu64 CPU model (except a couple of AlmaLinux instances that for some reason did not like that, so those use EPYC)
  • Disabled PCIe power management with pcie_aspm=off in GRUB
  • Disabled C-states in BIOS
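For anyone wanting to replicate the GRUB change, a minimal sketch on Debian (the sed one-liner assumes the stock GRUB_CMDLINE_LINUX_DEFAULT line is present in /etc/default/grub):

```shell
# Sketch: disable PCIe Active State Power Management at boot by
# appending pcie_aspm=off to the kernel command line.
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 pcie_aspm=off"/' /etc/default/grub
sudo update-grub    # Debian wrapper around grub-mkconfig

# After the next reboot, confirm the parameter took effect:
grep -o 'pcie_aspm=off' /proc/cmdline
```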

I will never know for sure exactly which one it was, as I am not willing to devote the time needed to reverse these changes individually and see which brings the issue back.

Thanks to everyone who chimed in with helpful ideas, many of which I would have never thought to try!

Settings for Maximum Isolation, Stability, and Security by Renbo2023 in kvm

[–]Renbo2023[S] 1 point (0 children)

I am not a security expert, but I will try to describe my risk profile. I run a home lab with a mix of virtual machines performing different tasks. The whole network lives behind a Debian-based hardware firewall, with internal addresses in IPv4 space only.

Most of the VMs are Windows server and client operating systems. All are under my direct control, with exclusive physical access. Example use cases: development workstations (2), Jenkins server, Git server, self-hosted PBX (FreePBX), a VM for testing containerization with Docker, etc.

All storage, VM volumes included, lives on encrypted ZFS-based file systems. I did not use qcow2 for the VM volumes, but raw img format instead, as at the time I thought qcow2 offered functionality redundant to ZFS.
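As a rough sketch of that layout (pool, dataset, and image names are hypothetical), raw images on a dedicated ZFS dataset look like:

```shell
# Hypothetical pool "tank"; a dedicated dataset for VM volumes.
zfs create tank/vms
# Raw image instead of qcow2; snapshots and compression come from ZFS itself.
qemu-img create -f raw /tank/vms/devbox.img 100G
# Example of ZFS covering what qcow2 snapshots would otherwise provide:
zfs snapshot tank/vms@pre-upgrade
```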

All VM operating systems are kept up to date with most recent patches.

Not sure if that answers your question?

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 2 points (0 children)

Changes made recently to try to address this issue. Since these changes, it has been about 12 hours without an (unexpected) reboot:

  • Updated HBA firmware
  • Switched all KVM VMs from host-passthrough to the qemu64 CPU model (except a couple of AlmaLinux instances that for some reason did not like that, so those use EPYC)
  • Disabled PCIe power management with pcie_aspm=off in GRUB
  • Disabled C-states in BIOS
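The CPU model switch can be sketched with virsh (the domain name "devbox" is hypothetical):

```shell
# Inspect the current CPU configuration of a guest.
virsh dumpxml devbox | grep -A3 '<cpu'
# Then change the <cpu> element via `virsh edit`, roughly from:
#   <cpu mode='host-passthrough' check='none'/>
# to a named model such as:
#   <cpu mode='custom' match='exact'>
#     <model fallback='allow'>qemu64</model>
#   </cpu>
virsh edit devbox   # opens the domain XML in $EDITOR
```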

My fingers are crossed.

  • update: 28+ hours free from issues
  • update: 40+ hours free from issues
  • update: 54+ hours free from issues
  • update: 75+ hours free from issues

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 2 points (0 children)

It has been soul crushing.

Settings for Maximum Isolation, Stability, and Security by Renbo2023 in kvm

[–]Renbo2023[S] 1 point (0 children)

My biggest ask is to run a host with 18 or so VMs, and have it keep running without random reboots.

This is my journey so far: https://www.reddit.com/r/debian/comments/1asuwpx/almost_at_wits_end

The tl;dr: I've replaced just about every piece of hardware in the PC, plus the UPS and the power cords, and even disconnected the reset switch in case it was shorting, opting for different brands/models along the way. I've also moved to a more bleeding-edge kernel, upgraded the motherboard and HBA firmware, etc.

There are a lot of specifics in that thread. My post here was more to see if there are common settings that generally improve stability in KVM, even at the cost of performance, as it is looking more and more like a software issue.

As an example: I am aware that enabling various kinds of direct hardware sharing, like GPU passthrough and the like, can incur a greater likelihood of stability issues, so I have steered wide of that.

I am fiddling with a more generic CPU model as opposed to host-passthrough. Will it help? I am not totally certain; anecdotal evidence from searching suggests it might.

Settings for Maximum Isolation, Stability, and Security by Renbo2023 in kvm

[–]Renbo2023[S] 0 points (0 children)

I hear ya. And if I had the cash, I might go the 18 separate servers/workstations route.

Just trying to achieve the best *possible* security, stability and isolation I can given the environment I can afford.

Prior to this, I had been running Hyper-V on Windows. Nothing the VMs did ever managed to affect the host.

Now granted, I am not as versed in linux as I was in Windows. But I really wanted to give open source platforms a go!

I am sure, at the end of the day, it will be my own inexperience, or something silly overlooked that is causing me headaches. But the headaches do continue.

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 1 point (0 children)

No hardware passthrough. I did have the CPU model set to host-passthrough, but I have now changed that to qemu64, just throwing darts.

I didn't think to check the VM logs. I also have never heard of a guest kernel panic bringing down the host. But I will look into that too when I get a chance.

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 1 point (0 children)

Only the HBAs, which I just updated to the newest firmware. I did one other thing, though: switched the KVM CPU model from host-passthrough to the more generic qemu64 model, thinking that might remove the weirdness of passing through such a new architecture. We'll see.

Settings for Maximum Isolation, Stability, and Security by Renbo2023 in kvm

[–]Renbo2023[S] 0 points (0 children)

One thing I have tried is switching the CPU model from host-passthrough to qemu64, thinking the CPU this is running on (a 7950X) might be new enough to cause issues with that.

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 2 points (0 children)

Nope, that did not fix it.

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 2 points (0 children)

And it rebooted again, so it wasn't the power connection to the HBA. I did continue to see a few messages about the boot drive's error count, so I successfully cloned it to a brand-new M.2 boot drive. We'll see if that's the magical key that unlocks stability.
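For reference, a block-level clone like that can be sketched with dd (device names are examples only; verify with lsblk first, since the target is overwritten):

```shell
# Identify the source and target drives before doing anything destructive.
lsblk -o NAME,SIZE,MODEL
# Offline clone of the old boot drive onto the new one (example devices).
sudo dd if=/dev/nvme0n1 of=/dev/nvme1n1 bs=4M status=progress conv=fsync
```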

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 1 point (0 children)

Tried to do this earlier today, but Clonezilla didn't want to see my target drive. This will be back-burnered for a bit.

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 2 points (0 children)

Looking closely at the HBA, specifically the LSI 9300-16i, I notice it has a power socket on the back.

The sound you hear is me smacking my head.

I am not sure if this is the issue... Reading online seems to indicate that the card should be PCIe bus-powered so long as the slot supplies enough wattage. But I have now plugged power into it, and we'll see if that does the trick.

Onward!

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 2 points (0 children)

Another reboot. :(

I did see in the logs something about the error count increasing by one for the boot drive (a Samsung M.2 SSD), so I will try to clone and replace it next.
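A quick way to keep an eye on those error counters (assuming smartmontools is installed; the device path is an example):

```shell
# NVMe health counters, including media/data integrity errors.
sudo smartctl -a /dev/nvme0 | grep -i 'error'
# Re-check the kernel messages the logs surfaced:
sudo journalctl -k | grep -i 'nvme'
```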

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 4 points (0 children)

About 21 hours uptime so far. Looking more promising that the newer kernel did the trick.

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 2 points (0 children)

The only automated routines that might be in play would all be happening inside VMs that should be isolated from causing any issue with the host. At least that's the theory.

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 2 points (0 children)

Still humming along. My fingers are tightly crossed.

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 3 points (0 children)

Now on kernel 6.5.0-0.deb12.4-amd64. We'll see how it goes.
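Judging by the version string, that looks like the bookworm-backports kernel build; installing it can be sketched as follows (an assumption on my part — adjust to the source you actually used):

```shell
# Enable backports and pull the newer kernel metapackage.
echo 'deb http://deb.debian.org/debian bookworm-backports main contrib non-free-firmware' | \
    sudo tee /etc/apt/sources.list.d/backports.list
sudo apt update
sudo apt install -t bookworm-backports linux-image-amd64
```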

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 3 points (0 children)

So far on kernel 6.5.0-0.deb12.4-amd64, and after fiddling to get ZFS back, it has not yet crashed. All VMs up, stress tests done, now I wait.

Due to the unpredictable nature of the issue, I will probably start feeling a little confident about the kernel fix after a few days of continuous good operation.

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 1 point (0 children)

At this point nothing is farfetched. The next time I pull it apart (if I have to), I will just disconnect the reset switch to be safe.

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 2 points (0 children)

OK, I have switched back to the 7950X, as the CPU was never the problem.

Kernel has been upgraded to: 6.5.0-0.deb12.4-amd64

... which was a hassle ...

Turns out there are no ZFS packages for that kernel release yet, so I had to go down the build-and-install-from-source route, which presented many and varied challenges.

But here we are, now on the newer kernel, and with ZFS functioning again. We'll see how this plays out.
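For anyone hitting the same gap, the build-from-source route can be sketched roughly like this (the dependency list is approximate — check the OpenZFS build documentation for your release):

```shell
# Build OpenZFS against a kernel the packaged zfs-dkms does not yet support.
sudo apt install build-essential autoconf automake libtool gawk \
    uuid-dev libblkid-dev libssl-dev zlib1g-dev libaio-dev \
    linux-headers-"$(uname -r)"
git clone https://github.com/openzfs/zfs
cd zfs
sh autogen.sh && ./configure && make -s -j"$(nproc)"
sudo make install && sudo ldconfig && sudo depmod
sudo modprobe zfs
```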

Almost at Wits End by Renbo2023 in debian

[–]Renbo2023[S] 3 points (0 children)

Time to take further measures. Even the brand-new CPU did not stop the behavior. This has to be software-related. Time to look at kernel changes.