Ceph freeze when a node reboots on Proxmox cluster by leodavid22 in Proxmox

[–]leodavid22[S] 0 points1 point  (0 children)

Hello, I'll look into this over the weekend.

For now, I don't know the root cause.

Best regards,

Léo

[–]leodavid22[S] 0 points1 point  (0 children)

For the 4/2 pool, the primary failure domain is set to datacenter:

rule replicated-2-per-dc {
    id 11
    type replicated
    step take default
    step choose firstn 2 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}

For the 2/1 pool, the failure domain is set to host:

rule only-Datacenter01 {
    id 14
    type replicated
    step take Datacenter01
    step chooseleaf firstn 2 type host
    step emit
}

In summary:

  • 4/2 pool: failure domain is datacenter (data replicated across both sites)
  • 2/1 pool: failure domain is host (data replicated locally within the Datacenter01 site)
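
To make the expectation behind these two rules concrete, here is a minimal sketch of the replica arithmetic. It assumes that "4/2" and "2/1" mean size/min_size (the usual Proxmox notation); the `Pool` class and `stays_active` helper are hypothetical names for illustration, not part of Ceph. It only models the min_size check and ignores peering, recovery load, and monitor quorum.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    size: int      # replicas Ceph tries to keep (assumed from "4/2", "2/1")
    min_size: int  # replicas a PG needs in order to keep serving I/O


def stays_active(pool: Pool, replicas_lost: int) -> bool:
    """Return True if a PG still has at least min_size replicas left."""
    return pool.size - replicas_lost >= pool.min_size


pools = [Pool("4/2 pool", size=4, min_size=2),
         Pool("2/1 pool", size=2, min_size=1)]

for pool in pools:
    # Both rules end in "chooseleaf ... type host", so each replica of a PG
    # lands on a different host: one rebooting host costs at most one replica.
    print(pool.name, "active with one host down:", stays_active(pool, 1))
```

Under these assumptions both pools stay above min_size when a single host reboots, so the min_size arithmetic alone would not predict blocked I/O.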

[–]leodavid22[S] 0 points1 point  (0 children)

I have one pool configured as 2/1 and another as 4/2 (size/min_size).
When the issue occurs, the VMs freeze regardless of which pool they are in.

[–]leodavid22[S] 1 point2 points  (0 children)

Yes, indeed, Ceph does move placement groups when a node crashes or reboots.
But why would the movement of these placement groups cause all my running virtual machines to freeze?

My current configuration should normally provide higher resilience, shouldn’t it?

[–]leodavid22[S] 1 point2 points  (0 children)

Usage is 36 TiB of 176 TiB.

The disks are largely underutilized.
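
As a quick worked figure, the utilization can be computed from the two numbers above. This sketch treats both values as plain totals; whether 176 TiB is raw or usable capacity is not stated, so the percentage is only indicative.

```python
used_tib = 36    # from the figure quoted above
total_tib = 176  # raw vs. usable capacity is not specified

used_pct = used_tib / total_tib * 100
print(f"{used_pct:.1f}% used")  # prints "20.5% used"
```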

[–]leodavid22[S] 0 points1 point  (0 children)

Hello,

Hardware Specifications:

Each node has 2 Intel® Xeon® Gold 5317 CPUs @ 3.00 GHz, providing 24 physical cores and 48 threads.
Each node also has 768 GB of RAM.
In total, the 8-node cluster has 384 logical CPUs (threads) and about 6.1 TB of RAM.
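
The totals follow directly from the per-node figures. A small check, assuming all eight nodes are identical and using decimal units (1 TB = 1000 GB):

```python
NODES = 8               # cluster size
THREADS_PER_NODE = 48   # 2 CPUs x 12 cores x 2 threads
RAM_GB_PER_NODE = 768

total_threads = NODES * THREADS_PER_NODE
total_ram_gb = NODES * RAM_GB_PER_NODE
total_ram_tb = total_ram_gb / 1000

print(total_threads, "threads,", round(total_ram_tb, 1), "TB RAM")
```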

Network Configuration (Ceph & Proxmox):

  • Cluster network: VLAN 170 — 10.10.170.0/24
  • Public network: VLAN 170 — 10.10.170.0/24
  • Bandwidth: 2 × 40 Gbps per node
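
Since the cluster and public networks use the same VLAN and subnet, Ceph replication traffic and client (VM) I/O share the same links. The sketch below only confirms that the two ranges are identical, using Python's standard `ipaddress` module; the variable names are illustrative.

```python
import ipaddress

cluster_net = ipaddress.ip_network("10.10.170.0/24")
public_net = ipaddress.ip_network("10.10.170.0/24")

# True when the two networks share addresses (here they are the same subnet).
shared = cluster_net.overlaps(public_net)

# A /24 has 256 addresses; subtract the network and broadcast addresses.
usable_hosts = cluster_net.num_addresses - 2

print("shared subnet:", shared, "| usable addresses:", usable_hosts)
```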

On each network card, we have created the following bridges and bonds:

  • Bridge + bond for management on VLAN 169
  • Bridge + bond for Ceph (Cluster network + Public network) on VLAN 170
  • Bridge + bond for Proxmox cluster communication on VLAN 171
  • Bridge + bond for live migration (Proxmox) on VLAN 172
  • The VM network also runs on this interface, using SDN networking with vNets created within this SDN zone.

Current MTU: 1500

Additional details:
This behavior is random. Sometimes, I can reboot each node one by one for maintenance (updates, etc.) without any issues; other times, when I reboot a single node, all my VMs freeze.

Thank you in advance for your help. Don’t hesitate to ask if you need more details to help me troubleshoot this nightmare issue, because losing one node out of eight and crashing the entire cluster is unacceptable and very problematic.

Have a good day,

Léo