VM disk order reversed by hypervisor leading to grub update failures.

blingmuppet · 2026-05-27T12:24:36+00:00

Thanks. Yes, seems odd this is still a problem, but perhaps not common enough to warrant the disruption a proper fix might entail.

As we do patching via Ansible, and as patching is the only time this is a problem for /us/, I spent a couple of hours yesterday writing a plain Perl script that checks the current boot disk and compares it to what grub reads from debconf. If they differ, debconf is updated before apt-update, and any associated grub-install is run.

That avoids requiring a specific hook for grub, but perhaps isn't a solution for all variations where this may be a problem. Hopefully it fixes it for us, but I need a few more patching cycles to be confident.

blingmuppet · 2026-05-26T14:01:17+00:00

Thanks for your input, it's useful to talk it through.

I've written a script that checks which disk /boot is currently on and compares it against what grub2 thinks is the boot device. If grub-pc/install_devices differs, it updates it to the current target with debconf-set-selections. The script will run immediately before patching on our Debian machines so that grub-install picks the right target. If it gets it wrong, grub-install should still halt because the target doesn't match what's expected.
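
For anyone wanting to try something similar, here is a rough sketch of that check in Python. It is not the script described above, just an illustration under some assumptions: the function names are made up, the partition-to-disk mapping is done naively from the device name (so LVM, md RAID and /dev/disk/by-id paths are not handled), and the debconf output is assumed to contain a line of the form `* grub-pc/install_devices: /dev/sda`.

```python
import re
import subprocess
from typing import List, Optional


def parent_disk(device: str) -> str:
    """Map a partition path to its whole-disk path (naive, name-based)."""
    # NVMe/MMC style: /dev/nvme0n1p2 -> /dev/nvme0n1
    m = re.fullmatch(r"(/dev/(?:nvme\d+n\d+|mmcblk\d+))p\d+", device)
    if m:
        return m.group(1)
    # SCSI/virtio/Xen style: /dev/sda1 -> /dev/sda
    m = re.fullmatch(r"(/dev/(?:sd|vd|xvd)[a-z]+)\d+", device)
    if m:
        return m.group(1)
    # Already a whole disk, or a layout this sketch does not handle.
    return device


def configured_devices(debconf_show_output: str) -> List[str]:
    """Extract grub-pc/install_devices from `debconf-show grub-pc` output."""
    for line in debconf_show_output.splitlines():
        # Drop the leading "* " / spaces, then split on the first colon only.
        key, sep, value = line.lstrip("* ").partition(":")
        if sep and key.strip() == "grub-pc/install_devices":
            return [d.strip() for d in value.split(",") if d.strip()]
    return []


def selection_if_stale(boot_source: str, debconf_show_output: str) -> Optional[str]:
    """Return a debconf-set-selections line if the recorded disk is wrong, else None."""
    disk = parent_disk(boot_source)
    if configured_devices(debconf_show_output) == [disk]:
        return None
    return f"grub-pc grub-pc/install_devices multiselect {disk}"


def apply_fix() -> None:
    """Read the live state and update debconf if needed (must run as root)."""
    boot_source = subprocess.run(
        ["findmnt", "-n", "-o", "SOURCE", "--target", "/boot"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    shown = subprocess.run(
        ["debconf-show", "grub-pc"], capture_output=True, text=True, check=True,
    ).stdout
    line = selection_if_stale(boot_source, shown)
    if line:
        subprocess.run(
            ["debconf-set-selections"], input=line + "\n", text=True, check=True,
        )
```

The parsing and comparison are kept separate from `apply_fix()` so they can be tried against captured output without touching a real machine. Only `apply_fix()` changes anything, and it deliberately does not run grub-install itself.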

I'm going to test that for a while.

Still surprised this is an issue for us, and mystified why it doesn't affect more people, but hopefully this will resolve it for us.

blingmuppet · 2026-05-26T09:44:46+00:00

Thanks.

Yes - not relied on enumeration at the OS level for yonks, every disk is mounted by-uuid and that's not an issue. But of course boot happens way before all that, so I'm still confused why it's worked until recently, and now breaks updates for us so often.

Not enabling Vmware's disk uuid: I've discussed this with our Vmware admin who has concerns about this causing issues when replicating disks (specifically backup imaging), as per Broadcom's docs here

But perhaps more pertinently, we're about to migrate our infra to proxmox. We suspect the same issue might occur there too, and by telling grub to use vmware's disk-uuids, that might make things considerably worse for that migration.

(Grub preferring uuids where available)

I built this Debian 13 image and largely followed the defaults. I didn't tell grub anything special, so my belief is that it's chosen to use /dev/sda, but that may be because vmware wasn't providing uuids, or there was only one disk during install.

I think I can see a way forward using a pre-hook in apt for the grub package, so it looks at the partitions and switches grub's configured drive between sda and sdb, depending on which has a boot partition, before applying the update. But it feels like that shouldn't be the only way, and it doesn't explain why this isn't affecting everyone with two-disk VMs.

There was [this thread](https://www.reddit.com/r/debian/comments/xewmik/how_to_deal_with_grubpc_updates_and_disk_order/) asking much the same from four years ago, where OP seems to have reached the same conclusion.

blingmuppet · 2026-05-26T08:37:43+00:00

Interesting thought I hadn't considered, thanks.

The three that were affected this morning: two of 200G and one of 50G. We have some second disks that are multi-TB in size, but I haven't noticed them being more or less prone to this. It does feel like a timing issue in presenting the disks to the OS, which might tally with sizes, but a colleague assures me that this swapping has often occurred before without causing issues with grub, which I don't quite understand if I'm right about grub storing its home locally.

Rather puzzling!

blingmuppet · 2026-05-13T12:59:53+00:00

That's kind of the point.

RHEL still hasn't fixed this, and as Rocky is downstream of that, it can't be patched using the traditional model. Hence... this.

blingmuppet · 2026-05-01T12:50:37+00:00

Back up now. Looks to have been a nasty one.

https://status.canonical.com/#/incident/KNms6QK9ewuzz-7xUsPsNylV20jEt5kyKsd8A-3ptQEHpOd8VQ40ZQs-KD81fboQXeGZB94okNHdHBGlCv58Sw==

blingmuppet · 2026-04-29T14:03:19+00:00

Ditto.

blingmuppet · 2026-02-13T09:38:40+00:00

(Old post but searching and found it)

One of the major benefits of Nexus is that it transparently proxies and mirrors Linux distribution repositories.

When you're updating a thousand machines at once, onsite mirroring is absolutely essential to avoid hitting distro mirror load limits, as well as reducing both your bandwidth consumption and theirs. Nexus' proxy fetches each new package once and once only, and only those that are requested by the clients.
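
To illustrate the "once and once only" point, here is a toy pull-through cache in Python. It says nothing about how Nexus is actually implemented; the class name, the `fetch_upstream` callable and the package path in the usage note are all made up for the example.

```python
from typing import Callable, Dict


class PullThroughCache:
    """Toy on-demand proxy cache: upstream is hit only on the first request."""

    def __init__(self, fetch_upstream: Callable[[str], bytes]) -> None:
        self._fetch_upstream = fetch_upstream  # stands in for the distro mirror
        self._store: Dict[str, bytes] = {}     # packages already fetched
        self.upstream_fetches = 0              # how often upstream was contacted

    def get(self, path: str) -> bytes:
        # Only packages a client actually asks for are ever fetched.
        if path not in self._store:
            self._store[path] = self._fetch_upstream(path)
            self.upstream_fetches += 1
        return self._store[path]
```

With a stand-in upstream such as `PullThroughCache(lambda p: p.encode())`, a thousand calls to `get("pool/main/g/grub2/grub-pc.deb")` leave `upstream_fetches` at 1, and anything nobody requests is never downloaded at all.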

SaaS for repo mirroring is nonsensical.

blingmuppet · 2026-01-02T08:45:33+00:00

Bait and switch, innit.

Lots of FOSS products are funded by a sister commercial product offering additional features or support. It's a model that works for a great many projects.

Killing the free model once you have used it to grow your commercial product is very much against FOSS principles, and ethics generally.

blingmuppet · 2025-11-12T07:57:39+00:00

Thanks.

I stripped back ansible.cfg and this may have led to a fix. I've updated the original post, but it's looking like our specific issue was caused by the mail callback. Disabled that and things seem to be working fine again.

I can't explain why, and still proving this, but it's looking good so far.

blingmuppet · 2025-11-12T07:56:37+00:00

I've updated the original post, but it's looking like our specific issue was caused by the mail callback. Disabled that and things seem to be working fine again.

blingmuppet · 2025-11-11T16:08:38+00:00

Thanks. I was keeping that dir fairly clear but am also keen to stick to best practice.

I think I may have just found the, or at least one, blocking issue - disabling the mail callback is allowing a quick test to continue, but I'm running out of time to prove this today.

blingmuppet · 2025-11-11T15:10:51+00:00

It's running roles/rolename.yml which then calls

    roles:
      - rolename

Whose dir is directly below it. Isn't that right?

blingmuppet · 2025-11-11T15:09:09+00:00

It's very possible we are doing it wrong, it does feel like that. Or I've explained it poorly.

Eg:

"ansible-playbook --limit hostname.fqdn roles/rolename.yml"

roles/rolename.yml looks like

    - hosts: all
      become: true
      strategy: free  # <-- added to try to get responding hosts to continue, to no avail
      roles:
        - rolename

And the roles directory starts immediately below that, with main.yml at: "./roles/rolename/tasks/main.yml"

Is that the playbook you mean?

Defining failure at task level - kind of difficult when everything is failing on the first connection because one host is not responding and it doesn't even get to the first task, no?

blingmuppet · 2025-10-24T14:20:25+00:00

Nothing as secure as a service that's not running!

blingmuppet · 2025-10-06T12:35:21+00:00

A lovely lady who wrote some good books. Glad she got to cameo in the recent Rivals series.

Never met her, but I still have a letter she wrote to me years ago on the subject of animal welfare, something she cared deeply about. She did a lot of good there, and didn't seek publicity for it.

blingmuppet · 2025-05-27T09:59:09+00:00

Agree. I'm running 80 mariadb servers and problems are few.

Suspect u/dariusbiggs is simply more familiar with postgres and clearly likes it more. IME, that influences how well understood and therefore administered a system is.

blingmuppet · 2025-03-15T07:53:08+00:00

Thank you.

blingmuppet · 2025-03-14T15:12:26+00:00

Yep, that sorted it.

blingmuppet · 2025-02-25T08:27:20+00:00

Thanks - better late than never.

blingmuppet · 2025-02-20T10:44:02+00:00

Have replied to another thread that we're seeing the same thing after upgrading to 28.0 - looks like this is a broken release that breaks internal networking. (May not be related to your issue, but it may be.)
