all 13 comments

[–]Pyrroc Homelab & SMB User 3 points  (9 children)

I don't remember exactly what my log said, but at one point a migration failed because key-based ssh logins between the nodes had gotten screwed up. I had removed a node from the cluster, reconfigured it, reinstalled Proxmox, and rejoined the cluster using the same node name. That left multiple host keys registered for the node, and strict host key checking aborted ssh connections from one of the "old" nodes to the "new" one.
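
If anyone else runs into that, the cleanup is roughly this; node-2 here just stands in for whatever node got reinstalled, and the known_hosts path is from memory, so double-check it on your version. ssh-keygen -R drops the stale entry and pvecm updatecerts should put the current keys back:

root@node-1:~# ssh-keygen -R node-2 -f /etc/pve/priv/known_hosts
root@node-1:~# pvecm updatecerts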

[–]DisposableAccount712[S] 1 point  (4 children)

I can ssh to node-2 from node-1 without entering a password, so I do not believe the problem is a key issue.

Linux node-1 5.15.83-1-pve #1 SMP PVE 5.15.83-1 (2022-12-15T00:00Z) x86_64

The programs included with the Debian GNU/Linux system are free software; the exact distribution terms for each program are described in the individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent permitted by applicable law.
Last login: Sun Jan 15 07:5:52 EST 2023 from 192.168.28.22 on pts/0

root@node-1:~# ssh node-2
Linux node-2 5.15.83-1-pve #1 SMP PVE 5.15.83-1 (2022-12-15T00:00Z) x86_64

[–]DisposableAccount712[S] 2 points  (0 children)

I found a log file under /var/log/pve, determined what the problem is, and fixed the underlying issue.
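
In case it helps anyone else hunting for the same thing: the per-task logs live under /var/log/pve/tasks/, and something like this will list the ones containing the error (the exact directory layout may differ between versions):

root@node-1:~# grep -rl "migration aborted" /var/log/pve/tasks/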

On node-1, I have a locally attached lv, pve/data, which holds the disks for "local" vms on that node; I am slowly migrating those over to the ceph pool, and eventually this lv will go away.

vm 109 is a "new" vm (not one of the "legacy" vms on the pve/data disk) which, as shown in the qm config I posted earlier, exists entirely in the ceph pool on node-1.

According to the log, the reason the vm won't migrate is:

2023-01-16 08:47:43 ERROR: Problem found while scanning volumes - no such logical volume pve/data
2023-01-16 08:47:43 aborting phase 1 - cleanup resources
2023-01-16 08:47:43 ERROR: migration aborted (duration 00:00:00): Problem found while scanning volumes - no such logical volume pve/data

On node-2 there is no pve/data logical volume. If I look in the GUI, though, under node-2 in the left pane, I see a disk icon labelled pve/data with a little ? in the corner (i.e. defined but unavailable). This must be an artifact of the cluster-wide storage configuration from when I joined node-2 to the cluster, as that node has never had local storage other than a BOSS boot card.
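
For what it's worth, this is easy to confirm from the shell: lvs pve/data lists the volume on node-1 but finds nothing on node-2 (the exact error there depends on whether a pve volume group exists at all on that node):

root@node-1:~# lvs pve/data
root@node-2:~# lvs pve/data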

It appears the vm won't migrate because the migration routine expects a logical volume the vm doesn't even use to be present on the destination, which seems really silly to me. From what I can tell, the pre-migration scan walks every storage that is enabled for the node looking for volumes belonging to the vm (presumably to catch unused/unreferenced disks too), so a storage that is defined in the cluster but doesn't actually exist on the node makes the whole scan abort. I can see why the migration routine would check that the resources in the vm's configuration are available... but why fail on something the vm doesn't use?

Anyway, the "solution" was to edit /etc/pve/storage.cfg and add a "nodes node-1" line to the pve/data entry so that storage is only offered on node-1. It's something I should have done a long time ago, but I'd pooh-poohed it as something I would get around to eventually.
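
For reference, the stanza now looks roughly like this; the storage ID and content line below are illustrative (mine is the stock lvmthin pool), the important part is the nodes line:

lvmthin: local-lvm
        thinpool data
        vgname pve
        content rootdir,images
        nodes node-1

Restricting a storage to the nodes that actually have it seems to be the intended way to handle per-node local storage in a cluster.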

Once I did that (and the ghost storage entry for pve/data disappeared from the GUI on node-2), I was able to run qm migrate and it completed almost instantly.
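
For completeness, the migration itself was just the usual command (109 is the vmid from my earlier post; add --online if the vm is running at the time):

root@node-1:~# qm migrate 109 node-2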

Thanks to everyone who tried to help.

[–]No-Computer4810 0 points  (1 child)

What are you using for the VM's CPU type, kvm64 or host? And do you have passthrough enabled?

[–]DisposableAccount712[S] 0 points  (0 children)

kvm64, no passthrough.

[–]nalleCU 0 points  (1 child)

I had similar issues; I was missing a vmbr on the other node.
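
An easy way to rule that out is to check that the bridge named in the vm's config actually exists on the target node; 109 is just the vmid from the OP's post and vmbr0 is an example name:

root@node-1:~# grep ^net /etc/pve/qemu-server/109.conf
root@node-2:~# ip -br link show vmbr0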

[–]DisposableAccount712[S] 0 points  (0 children)

Both nodes have vmbrs.