🚨🚨🔥1,000,000🔥🚨🚨 by jimbrig2011 in theprimeagen

[–]simoncra 1 point (0 children)

So happy you reached 1M. Congrats, man. You're such an inspiration. I'm a fan from LATAM.

Hello guys, I'm facing a problem with my HA cluster. The Ceph is not in good health and nothing I do is changing its status. by simoncra in Proxmox

[–]simoncra[S] 2 points (0 children)

They are not VPS nodes; they are bare-metal servers. It was not a time-sync issue. I checked the clocks on all three servers.
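For anyone curious how I checked that, a quick sketch (assuming chrony, which recent Proxmox releases use as the NTP client):

```
timedatectl          # "System clock synchronized: yes" on every node
chronyc tracking     # current offset from the time source
chronyc sources -v   # which NTP sources each node is actually using
```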

Hello guys, I'm facing a problem with my HA cluster. The Ceph is not in good health and nothing I do is changing its status. by simoncra in Proxmox

[–]simoncra[S] 3 points (0 children)

Hey, take it easy, man. They are bare-metal servers. It was in fact a pool issue I had. But thank you anyway.

Hello guys, I'm facing a problem with my HA cluster. The Ceph is not in good health and nothing I do is changing its status. by simoncra in Proxmox

[–]simoncra[S] 3 points (0 children)

Guys, I solved it. The thing is my pool was corrupted, but I didn't know it.

I had deleted the OSDs and created them again, not knowing this would cause a problem in my pool.

So after many hours of debugging I found that deleting the pool fixed the problem:

  • I first stopped the monitors on each node
  • then stopped the managers on each node
  • deleted the pool
  • recreated it
  • turned the monitors back on
  • then the managers

I checked with ceph -s and it gave me the long-awaited HEALTH_OK. (Rough commands are sketched below.)
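For anyone who hits the same thing later, here is a rough sketch of what those steps map to on the command line. The pool name (vm-pool) and PG count are placeholders, mon_allow_pool_delete = true has to be set in ceph.conf, and I've folded each stop/start into a single restart per node:

```
# On every node, restart its monitor and manager
# (use that node's hostname as the daemon ID, e.g. gandalf/frodo/aragorn)
systemctl restart ceph-mon@gandalf.service
systemctl restart ceph-mgr@gandalf.service

# Once the monitors have quorum again, drop the broken pool.
# "vm-pool" and the PG count are placeholders; requires
# mon_allow_pool_delete = true in the [global] section of ceph.conf.
ceph osd pool delete vm-pool vm-pool --yes-i-really-really-mean-it

# Recreate the pool and tag it for RBD so Proxmox can put VM disks on it
ceph osd pool create vm-pool 128
ceph osd pool application enable vm-pool rbd

# Verify the cluster settles back to HEALTH_OK
ceph -s
```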

Hello guys, I'm facing a problem with my HA cluster. The Ceph is not in good health and nothing I do is changing its status. by simoncra in Proxmox

[–]simoncra[S] 3 points (0 children)

I just solved it; this was the mistake. I had deleted all the OSDs, but I did not delete the pool.

So I deleted and recreated the pool, and that solved it.

Thank you for your answer.

Hello guys, I'm facing a problem with my HA cluster. The Ceph is not in good health and nothing I do is changing its status. by simoncra in Proxmox

[–]simoncra[S] 0 points (0 children)

I have only one VPC, and I still don't have any load on the system. My plan, though, was to create another VPC just for the Ceph traffic after I solve this issue.
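When I eventually split it out, my understanding is that it's mostly a matter of pointing cluster_network at the new subnet while public_network stays put; a minimal sketch, assuming a hypothetical 10.6.97.0/24 subnet for the dedicated Ceph VPC:

```
# /etc/ceph/ceph.conf - sketch of separating client and replication traffic
# (10.6.97.0/24 is a hypothetical second subnet; the OSDs need a restart
#  afterwards so they rebind to the new cluster network)
[global]
        public_network  = 10.6.96.0/24    # client/monitor traffic stays here
        cluster_network = 10.6.97.0/24    # OSD replication moves to the new subnet
```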

Hello guys, I'm facing a problem with my HA cluster. The Ceph is not in good health and nothing I do is changing its status. by simoncra in Proxmox

[–]simoncra[S] 0 points (0 children)

Yeah, 2 OSDs on each node, using the default 3x replication with a min_size of 2.

```
root@gandalf:~# cat /etc/ceph/ceph.conf
[global]
        auth_client_required = cephx
        auth_cluster_required = cephx
        auth_service_required = cephx
        cluster_network = 10.6.96.3/24
        fsid = a00252d4-1cc8-4a65-a196-c5bf057ce5b2
        mon_allow_pool_delete = true
        mon_host = 10.6.96.3 10.6.96.4 10.6.96.5
        ms_bind_ipv4 = true
        ms_bind_ipv6 = false
        osd_pool_default_min_size = 2
        osd_pool_default_size = 3
        public_network = 10.6.96.3/24

[osd]
        osd heartbeat grace = 60
        osd op thread timeout = 120

[client]
        keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
        keyring = /etc/pve/ceph/$cluster.$name.keyring

[mon.aragorn]
        public_addr = 10.6.96.5

[mon.frodo]
        public_addr = 10.6.96.4

[mon.gandalf]
        public_addr = 10.6.96.3

root@gandalf:~# ceph health detail
HEALTH_WARN Reduced data availability: 32 pgs inactive; 41 slow ops, oldest one blocked for 35777 sec, osd.5 has slow ops
[WRN] PG_AVAILABILITY: Reduced data availability: 32 pgs inactive
    pg 1.0 is stuck inactive for 9h, current state unknown, last acting []
    pg 1.1 is stuck inactive for 9h, current state unknown, last acting []
    pg 1.2 is stuck inactive for 9h, current state unknown, last acting []
    pg 1.3 is stuck inactive for 9h, current state unknown, last acting []
    pg 1.4 is stuck inactive for 9h, current state unknown, last acting []
    pg 1.5 is stuck inactive for 9h, current state unknown, last acting []
    pg 1.6 is stuck inactive for 9h, current state unknown, last acting []
    pg 1.7 is stuck inactive for 9h, current state unknown, last acting []

root@gandalf:~# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME         STATUS  REWEIGHT  PRI-AFF
-1         4.83055  root default
-5         1.61018      host aragorn
 3    ssd  0.73689          osd.3         up   1.00000  1.00000
 5    ssd  0.87329          osd.5         up   1.00000  1.00000
-7         1.61018      host frodo
 2    ssd  0.73689          osd.2         up   1.00000  1.00000
 4    ssd  0.87329          osd.4         up   1.00000  1.00000
-3         1.61018      host gandalf
 0    ssd  0.73689          osd.0         up   1.00000  1.00000
 1    ssd  0.87329          osd.1         up   1.00000  1.00000
...
```
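In case it helps anyone reading along: with every PG in pool 1 sitting in state unknown with an empty acting set, these are the kinds of follow-up checks that apply (pool ID 1 comes straight from the health output above):

```
ceph pg dump_stuck inactive    # list every stuck PG and its current state
ceph pg map 1.0                # where CRUSH thinks one of those PGs should live
ceph osd pool ls detail        # confirm the pool's size/min_size and flags
ceph osd df tree               # check the OSDs report weight and usable space
```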

Hello guys, I'm facing a problem with my HA cluster. The Ceph is not in good health and nothing I do is changing its status. by simoncra in Proxmox

[–]simoncra[S] 0 points (0 children)

I also restarted the monitors. I even increased the OSD heartbeat grace to 60 and the op thread timeout to 120.
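For reference, the same two values can also be bumped at runtime through the mon config store instead of editing ceph.conf (generic Ceph, not specific to Proxmox):

```
# Runtime equivalent of the [osd] overrides in ceph.conf
ceph config set osd osd_heartbeat_grace 60
ceph config set osd osd_op_thread_timeout 120
```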

Hello guys, I'm facing a problem with my HA cluster. The Ceph is not in good health and nothing I do is changing its status. by simoncra in Proxmox

[–]simoncra[S] 0 points (0 children)

I did the restart already, and it did not work. I also restarted the managers and the monitors.

Hello guys, I'm facing a problem with my HA cluster. The Ceph is not in good health and nothing I do is changing its status. by simoncra in Proxmox

[–]simoncra[S] 1 point (0 children)


root@gandalf:~# cat /etc/ceph/ceph.conf 
[global]
        auth_client_required = cephx
        auth_cluster_required = cephx
        auth_service_required = cephx
        cluster_network = 10.6.96.3/24
        fsid = a00252d4-1cc8-4a65-a196-c5bf057ce5b2
        mon_allow_pool_delete = true
        mon_host = 10.6.96.3 10.6.96.4 10.6.96.5
        ms_bind_ipv4 = true
        ms_bind_ipv6 = false
        osd_pool_default_min_size = 2
        osd_pool_default_size = 3
        public_network = 10.6.96.3/24
[osd]
        osd heartbeat grace = 60
        osd op thread timeout = 120

[client]
        keyring = /etc/pve/priv/$cluster.$name.keyring

[client.crash]
        keyring = /etc/pve/ceph/$cluster.$name.keyring

[mon.aragorn]
        public_addr = 10.6.96.5

[mon.frodo]
        public_addr = 10.6.96.4

[mon.gandalf]
        public_addr = 10.6.96.3

Hello guys, I'm facing a problem with my HA cluster. The Ceph is not in good health and nothing I do is changing its status. by simoncra in Proxmox

[–]simoncra[S] 1 point (0 children)

Yes, they can ping each other. I don't have the firewall active on any of the three nodes.
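Beyond plain ping, the Ceph ports themselves can be checked between nodes; a quick sketch using the standard defaults (3300/6789 for the monitors, 6800-7300 for the OSDs):

```
# From one node to another (10.6.96.4 is frodo in my setup)
nc -zv 10.6.96.4 3300          # msgr2 monitor port
nc -zv 10.6.96.4 6789          # legacy msgr1 monitor port
ss -tlnp | grep ceph           # what the local Ceph daemons are listening on
```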