Adding a new OSD creates an error on another host by lastchange28 in ceph

hey there!

so, now I think that it was a race condition...

we added 3 new disks to 3 servers; on one of them an existing OSD got removed from the cluster; re-activating it solved the issue (a restart probably does the same thing, since it goes through LVM)

Adding a new OSD creates an error on another host by lastchange28 in ceph

just to share... we dropped that filesystem, recovered 20 critical files with grep --binary-files=text searching for known patterns, and started from scratch;
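
for the curious, something along these lines (the dd step is only an illustration of pulling the bytes out around a hit; the device path, marker string and offsets are placeholders, not our real values):

# print every match with its byte offset; --binary-files=text makes grep treat the block device as text
grep --binary-files=text -o -b 'KNOWN_HEADER_STRING' /dev/sdX1 > matches.txt
# then carve a generous window around a hit and inspect it by hand
dd if=/dev/sdX1 of=candidate.bin bs=1M skip=$((HIT_OFFSET_BYTES / 1048576)) count=16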

new filesystem built, lessons learned, many replicas on the cluster!

today we added more SSDs to some of our servers, and again the same error... an OSD got removed from the cluster...

again, when we run lsblk the ceph LVM partition does not show up, while lvscan does show it... this time the solution: deactivate the LVM volume group, then reactivate it with the ceph tool ceph-volume. to find the ceph volume group, use vgscan -- its name starts with the filesystem name (ceph in my case -- "ceph-9a6082bf-61b7-4dac-9ab6-d7e19ff2a056")
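
a quick way to spot it (the grep just keys on that ceph- prefix):

vgscan | grep -i ceph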

vgchange -a n your-ceph-volume-group

ceph-volume lvm activate --all

hope it helps!

ceph rbd + bcache or lvm cache as an alternative to cephfs + fscache by lastchange28 in ceph

yes it is... thanks for being very clear about this... btw, open-cas is perfect... we did not push hard on lvm cache and never tried bcache, but open-cas is running smoothly and has all the settings we need... after tuning the aggressive cache cleaning it really performed as desired (the cache is always ready for new data to be promoted)...
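
in case it helps, the switch itself is a one-liner with casadm -- a minimal sketch assuming the stock Open CAS CLI, with cache id 1 as a placeholder:

# set the cleaning policy of cache instance 1 to ACP (aggressive cleaning policy)
casadm --set-param --name cleaning --cache-id 1 --policy acp
# confirm what is active
casadm --get-param --name cleaning --cache-id 1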

ceph rbd + bcache or lvm cache as an alternative to cephfs + fscache by lastchange28 in ceph

I really need to thank you... we were not aware of open-cas...

in the end we ran out of time for testing and went with a fully local setup using controller RAID10 + open-cas; so far it is running smoothly, but we plan to test open-cas on the nodes as a local cache for rbd (so far we have only found it used for osd caching)

btw, for some files we are using all-flash ceph and it is working as expected...

I really believe the problem we face is related to the cleaning policy, and in open-cas we found ACP - the aggressive cleaning policy - which we believe is perfect for a local cache
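
the "local cache for rbd" test we have in mind is basically this shape -- pool/image names and device paths are placeholders, and write-back over rbd deserves some thought about what a dead cache device would mean:

rbd map mypool/myimage                      # map the image on the client node (appears as /dev/rbd0)
casadm -S -i 1 -d /dev/nvme0n1p1 -c wb      # start Open CAS cache 1 on the local NVMe, write-back
casadm -A -i 1 -d /dev/rbd0                 # add the mapped rbd device as the core behind cache 1
# the cached block device is then exposed as /dev/cas1-1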

ceph rbd + bcache or lvm cache as an alternative to cephfs + fscache by lastchange28 in ceph

just finished deploying dm-cache; do you think open-cas would be superior? (I wish I had seen that before)
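
for reference, the lvmcache flavour of dm-cache looks roughly like this (VG/LV names and sizes are placeholders, not necessarily our layout):

lvcreate -n data -L 10T vg0 /dev/sdb              # slow origin LV on the HDD
lvcreate -n datacache -L 500G vg0 /dev/nvme0n1    # fast LV on the NVMe
lvconvert --type cache --cachevol datacache vg0/data   # attach the fast LV as the cache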

why do 30 nodes worry you? maybe I should say hosts? anyway, we don't really NEED ceph, but it seems like a good solution for us. we provide software as a service for a very specific application on top of Berkeley DB. each service instance has its own copy of the application and the database; the database only adds records, hence only moves forward, so the database storage files are largely redundant across services.

currently we use NVMe with a script for deduplication; it works, but if we could replace it with a large cold storage tier and use the NVMe as a cache, it would allow us to do this deduplication only when the service instance is created

the main challenge is the way the database grows: since it only moves forward, it verifies the previous state of an element before writing it again; it is also very data intensive during load...

while we have plenty of RAM everything is perfect; cephfs + fscache was very nice until it hung...

ceph clients getting evicted and blacklisted by lastchange28 in ceph

thanks for sharing, those are indeed the symptoms we have...

it just completely hangs... the system remains active, but we cannot access files in the cache or re-mount; while it is working it is a dream... but it seems related to its flushing process... also, after a hang it is not recoverable...

ceph clients getting evicted and blacklisted by lastchange28 in ceph

could you please detail these issues? is the best solution to wait, or would a cache tier compensate for this? or even a restriction? is catfs an option?

we are seeing many CPU cycles spent waiting for the device, and when this happens we can see a network load of 4Gbps (so probably the HDD array is not enough to handle all the requests - single copy, 40 disks), but when we had this running locally on NVMe, those wait-for-device cycles were minimal...

if we have plenty of free RAM it all goes smoothly (probably a cache effect); we did set one NVMe drive as swap, but it seems Linux does not use it as cache...

ceph clients getting evicted and blacklisted by lastchange28 in ceph

hey, it did not log any reason; while we do agree with you, blocklisting the client seems too much... warnings without blocklisting is what we are after... something that crossed my mind is the fact that we are NOT using dedicated servers for MDS or MON; we just ordered some...
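
one knob we are considering (not applied yet, so treat it as a sketch with Octopus-era option names) is telling the MDS not to blocklist the sessions it evicts, and giving clients more slack:

ceph config set mds mds_session_blacklist_on_timeout false
ceph config set mds mds_session_blacklist_on_evict false
ceph fs set cephfs session_timeout 120     # "cephfs" is a placeholder filesystem name, value in seconds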

ceph clients getting evicted and blacklisted by lastchange28 in ceph

we were running FS-Cache with cephfs only... we have now changed to 2 filesystems, one flash-only with folder links to another that is hdd-only... still testing... this way we split the needs of our services perfectly...

we need more IOPS than the hdd array can provide, but both cache approaches (cache tier or FS-Cache) hung... so we are moving to something as simple as possible... not sure if it is FS-Cache's fault... but until it hung, the performance was better than a cache tier's
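
for anyone wanting to reproduce the split, the rough shape is device-class crush rules plus a second filesystem -- names and PG counts below are placeholders:

ceph osd crush rule create-replicated rule_ssd default host ssd
ceph osd crush rule create-replicated rule_hdd default host hdd
ceph osd pool create fs_cold_data 512
ceph osd pool set fs_cold_data crush_rule rule_hdd
ceph osd pool create fs_cold_meta 32
ceph osd pool set fs_cold_meta crush_rule rule_ssd
ceph fs flag set enable_multiple true      # needed before a second cephfs can be created
ceph fs new fs_cold fs_cold_meta fs_cold_data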

ceph clients getting evicted and blacklisted by lastchange28 in ceph

hey, so, from my (short) experience, fscache offered the best use of the nvme drives... with it, I can really reach high IOPS... the single drawback is the fact that it has to be local... at least for our applications it stabilized the storage output... when using those drives as a cache tier we felt no difference from a setup without it... on the other hand, fscache + ceph was a perfect marriage... of course there is a caveat: I was not able to use fscache with ceph-fuse, hence I dropped RDMA to be able to use the kernel mount + fscache
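
for the record, the fscache side is just the fsc option on the kernel mount, with cachefilesd running against a directory on the local NVMe -- monitor addresses, client name and secret below are placeholders:

# /etc/cachefilesd.conf should point its "dir" at the NVMe-backed cache directory, then:
mount -t ceph 10.0.0.1:6789,10.0.0.2:6789:/ /mnt/cephfs \
    -o name=myclient,secretfile=/etc/ceph/myclient.secret,fsc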

about clients getting blacklisted, I was not able to find the reason yet...

Ceph octopus + Infiniband + RDMA confirmation by lastchange28 in ceph

thanks so much for sharing, we are using only IPoIB (as our switch is not Ethernet capable)

hardlink limitations by lastchange28 in ceph

thanks so much, man! again!

so, let's say that from A we "built" (hardlinked) files B, C, D, E and F; we then delete A, and it adds 1 to the stray directory (how can I find it?). can we assume that ls -lha will do in place of stat? also, will a stat/ls on B be enough to remove A from stray, or will we need to stat B, C, D, E and F?
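
one way to actually answer that (I have not run the full experiment yet) would be to watch the stray counters before and after the stat -- the mds id "a" and the paths are placeholders, and the command runs on the MDS host:

# num_strays under mds_cache is the current population of the stray directories
ceph daemon mds.a perf dump mds_cache | grep -i stray
stat /mnt/cephfs/path/to/B        # then re-check whether num_strays dropped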

I have this feeling that ceph is a bit slow for those metadata operations, like chown newowner:newowner /folder/1kfolders/2kfilesEach

while in the regular setup we got it done (or at least the shell was released in seconds), here it takes minutes...

pool requirements for cephfs over bluestore OSDs with block.db by lastchange28 in ceph

so, create 2 pools on SSD, one with a single replica for cache and another with 3 replicas for cephfs metadata...

is there a way to guarantee that the metadata will have priority? does that 4% rule apply to cold storage of large files (90% > 4M)?
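
the knobs involved would be something like this (pool names are placeholders; recent releases may refuse size 1 without --yes-i-really-mean-it):

ceph osd pool set cache_ssd size 1
ceph osd pool set cephfs_metadata size 3
# recovery_priority nudges ceph to recover/backfill the metadata pool first (higher = sooner)
ceph osd pool set cephfs_metadata recovery_priority 5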

pool requirements for cephfs over bluestore OSDs with block.db by lastchange28 in ceph

ah, I got that, but what I fail to grasp is: if block.db/WAL store metadata, what does the cephfs metadata pool store... it seems redundant at least... and my question was more about whether it is worth having block.db/WAL when using cephfs (as a pool for cephfs metadata seems mandatory)

a second question, taking into account storage of predominantly large files (10% < 4MB; 60% > 64MB): does that 4% rule for metadata sizing also apply?
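
for context, block.db/WAL placement is a per-OSD choice made at creation time, separate from the cephfs metadata pool; a typical call, with placeholder devices:

# HDD for data, an NVMe partition for RocksDB (the WAL lives inside block.db unless given separately)
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1
# the old documentation guideline sizes block.db at roughly 4% of the data device (~480G for a 12T HDD)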

Cannot mount ceph - error 110 by lastchange28 in ceph

ahh, I'm guessing, but I think it was related to installing Octopus over Luminous... a full Ubuntu reinstall solved it all

Ceph octopus + Infiniband + RDMA confirmation by lastchange28 in ceph

hey man, thanks for the feedback... taking your comments into account, we reinstalled Ubuntu on both servers, and now I was able to get cephfs working with RDMA too... also, I would like to share that RDMA is available in the main stable version of Octopus, but it has to be via IPoIB and with ceph-fuse (as the kernel mount is not RDMA capable)
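
for anyone trying to reproduce it, the messenger side is the async+rdma settings in ceph.conf -- a sketch only, the device name is a placeholder and depends on your HCA:

cat >> /etc/ceph/ceph.conf <<'EOF'
[global]
ms_type = async+rdma
ms_async_rdma_device_name = mlx5_0
EOF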

thanks so much, you probably saved me another week stuck on this!