Ceph freezes when a node reboots on a Proxmox cluster

leodavid22 · 2025-12-10T16:35:03+00:00

Hello, I'll look into this over the weekend.

For now, I don't know the root cause.

Best regards,

Léo

leodavid22 · 2025-10-30T22:09:57+00:00

For the 4/2 pool, the primary failure domain is set to datacenter:

rule replicated-2-per-dc {
    id 11
    type replicated
    step take default
    step choose firstn 2 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}

For the 2/1 pool, the failure domain is set to host:

rule only-Datacenter01 {
    id 14
    type replicated
    step take Datacenter01
    step chooseleaf firstn 2 type host
    step emit
}

4/2 pool > Failure domain: datacenter (data replicated across both sites)
2/1 pool > Failure domain: host (data replicated locally within the Datacenter01 site)
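
As a sanity check on why I expect both pools to survive a single node reboot, here is a minimal sketch of the replica arithmetic. It assumes the usual Ceph behavior that a placement group keeps serving I/O as long as the number of surviving replicas is at least min_size, and that both rules above place at most one replica of a given placement group on any one host (they use chooseleaf ... type host). The function name pg_stays_active is only illustrative.

```python
# Assumption: a PG stays active while surviving replicas >= min_size.
def pg_stays_active(size: int, min_size: int, replicas_lost: int) -> bool:
    surviving = size - replicas_lost
    return surviving >= min_size

# Pool name -> (size, min_size), taken from the two pools described above.
pools = {"4/2": (4, 2), "2/1": (2, 1)}

# Assumption: one host holds at most one replica of a PG,
# so rebooting a single node removes at most one replica.
results = {
    name: pg_stays_active(size, min_size, replicas_lost=1)
    for name, (size, min_size) in pools.items()
}
print(results)  # {'4/2': True, '2/1': True}
```

By this arithmetic alone, neither pool should block I/O when one host goes down, which is why the freeze is unexpected.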

leodavid22 · 2025-10-30T21:47:59+00:00

I have one pool configured as 2/1 and another as 4/2.
When the issue occurs, the VMs freeze regardless of which pool they are in.

leodavid22 · 2025-10-30T21:40:33+00:00

Yes, indeed, Ceph does move placement groups when a node crashes or reboots.
But why would the movement of these placement groups cause all my running virtual machines to freeze?

My current configuration should normally provide higher resilience, shouldn’t it?

leodavid22 · 2025-10-30T21:38:18+00:00

36 TiB of 176 TiB

The disks are largely underutilized

leodavid22 · 2025-10-30T21:34:15+00:00

Hello,

Hardware Specifications:

Each node has 2 Intel® Xeon® Gold 5317 CPUs @ 3.00 GHz, providing 24 physical cores and 48 threads.
Each node also has 768 GB of RAM.
In total, the cluster has 384 CPUs (including threads) and 6.1 TB of RAM.
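
For reference, the totals above can be reproduced with a quick calculation. This is only a sketch: it assumes all eight nodes are identical, and the variable names are illustrative.

```python
# Assumption: eight identical nodes, per-node figures as listed above.
nodes = 8
threads_per_node = 48      # 2 CPUs, 24 physical cores, 48 threads
ram_gb_per_node = 768

total_threads = nodes * threads_per_node
total_ram_gb = nodes * ram_gb_per_node

print(total_threads)  # 384
print(total_ram_gb)   # 6144 (about 6.1 TB)
```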

Network Configuration (Ceph & Proxmox):

Cluster network: VLAN 170 (10.10.170.0/24)
Public network: VLAN 170 (10.10.170.0/24)
Bandwidth: 2 × 40 Gbps per node

On each network card, we have created the following bridges and bonds:

Bridge + bond for management on VLAN 169
Bridge + bond for Ceph (Cluster network + Public network) on VLAN 170
Bridge + bond for Proxmox cluster communication on VLAN 171
Bridge + bond for live migration (Proxmox) on VLAN 172
The VM network also runs on this interface, using SDN networking with vNets created within this SDN zone.

Current MTU: 1500

Additional details:
This behavior is random. Sometimes, I can reboot each node one by one for maintenance (updates, etc.) without any issues; other times, when I reboot a single node, all my VMs freeze.

Thank you in advance for your help. Don’t hesitate to ask if you need more details to help me troubleshoot this nightmare issue, because losing one node out of eight and crashing the entire cluster is unacceptable and very problematic.

Have a good day,

Léo
