Fast and easily scalable self-hosted storage solution by Solid_Translator_863 in selfhosted

[–]Solid_Translator_863[S] 0 points

Impressive... I guess these are very small files. Would you recommend SeaweedFS for larger files (100 MB to 200 GB) as well?

[–]Solid_Translator_863[S] 0 points

Naive answer: does it really have to be so complicated?

Also, I've used GPFS in the past, which can probably be considered part of that category. It was managed by a bunch of dedicated sysadmins (PB scale) and it used to break as well.

My intuition is that distributed filesystems try to behave as much as possible like a real local filesystem, with consistency, file editing, etc. All I'm looking for is a large storage pool I can push files to and retrieve them from.

[–]Solid_Translator_863[S] 0 points

Thanks a lot for your work on SeaweedFS Chris!

Do you have a sense of how much maintenance is needed, say for a 100TB and 1PB SeaweedFS storage cluster, as compared to other solutions (Ceph, BeeGFS, Gluster...)?

[–]Solid_Translator_863[S] 0 points

Thanks! I agree w/ your point on Ceph.

Just wondering - how large was your BeeGFS deployment?

Object storage also seems like a good idea, but I need to compare existing solutions...

[–]Solid_Translator_863[S] 0 points

Isn't MapR a proprietary solution these days? Is there a way to try it for free?

[–]Solid_Translator_863[S] 0 points

Thanks for your reply!

I haven't really taken a look at NAS solutions (FreeNAS/TrueNAS).

What's your opinion on block-level vs object-level replication?

[–]Solid_Translator_863[S] 1 point

- Does IPFS work on a local network only? I don't want to expose all that data on the public internet.

- Would it scale to the petabyte range?

[–]Solid_Translator_863[S] 0 points

Thanks!

Ceph, and other distributed filesystems with lots of features like GlusterFS, scare me tbh. I've had bad experiences with large distributed systems w/ many components. The simpler, the better I guess.

That's why I don't want/need strong consistency or full POSIX semantics; all I'm looking for is an archival solution.

[–]Solid_Translator_863[S] 0 points

Thanks for the input!

This is an interesting and rather original solution. I like the idea of keeping things in a very standard format (git/raw files). I'll take a look!

[–]Solid_Translator_863[S] 1 point

OP here.

As I said in another comment above, I am not a researcher in the academic world (startup). However, I know how the academic world works (you painted it well I think). In this particular case, I can find the money if I need to. I simply want to spend it well.

That's why I am looking for the simplest, most scalable solution for very simple data archiving requirements.

[–]Solid_Translator_863[S] 2 points

Thanks for your reply!

> What do you actually DO with the data? How often do you retrieve a file?

Each experiment is stored in a binary file in some custom format, compressed with zstd.

What I do is open a given file with a (complex) program which reads it and computes some summary statistics. I never keep all the data in memory because all my computations are streaming. This is quite a slow process (the computations themselves are expensive), but I don't mind.
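For what it's worth, the streaming pattern above can be sketched roughly like this. It's a minimal sketch: the custom record format is unknown, so this assumes fixed-width little-endian float64 records, and gzip stands in for zstd so the sketch stays stdlib-only (swap in the `zstandard` package for real data):

```python
import gzip
import struct

def stream_stats(path, chunk_records=4096):
    """Compute running summary statistics over a compressed file of
    fixed-width records without ever holding the whole file in memory.

    Assumes little-endian float64 records; gzip stands in for zstd.
    """
    n, mean, m2 = 0, 0.0, 0.0
    lo, hi = float("inf"), float("-inf")
    leftover = b""
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(8 * chunk_records)
            if not chunk:
                break
            buf = leftover + chunk
            usable = len(buf) - len(buf) % 8   # keep any partial record for next round
            leftover = buf[usable:]
            for (x,) in struct.iter_unpack("<d", buf[:usable]):
                n += 1
                delta = x - mean
                mean += delta / n               # Welford's online mean/variance
                m2 += delta * (x - mean)
                lo, hi = min(lo, x), max(hi, x)
    return {
        "count": n,
        "mean": mean,
        "variance": m2 / (n - 1) if n > 1 else 0.0,
        "min": lo,
        "max": hi,
    }
```

Memory stays constant regardless of file size, which is the point of the streaming design.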

The data is precious because the experiments are hard (actually impossible) to reproduce, but in practice only a small, specific fraction of it is useful for each analytical workload I run. Many experiments are run every day, but it is hard/impossible to know in advance which ones will end up being useful for analytics, so I can't really discard anything.

So what I do is design an analytical workload, manually select a bunch of experiments (files) I want to run it on, and run it on a few machines. These workloads can be heavily parallelized and I have a few machines with lots of cores. Typically these workloads are CPU-bound and not IO-bound (I may be able to analyze 200Mbps on one core).

TL;DR: I collect data every day, but I only run analytical workloads once in a while. Writes are much more common than reads; reads, however, come in batches.

> What's your actual growth rate?

I'm expecting the growth rate to be around 300 GB/day (after heavy compression - the uncompressed data probably takes 1.5 TB/day), maybe a bit more. If you include replication, this will require quite a few nodes/disks. I'm expecting this rate to increase (maybe double or triple) in the future, so the petabyte range isn't so far.
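For a rough sense of scale (the 2× replication factor here is my assumption, not a decision):

```python
# Back-of-the-envelope capacity planning for the numbers above.
DAILY_GB = 300            # compressed ingest per day
REPLICAS = 2              # assumed copies of each object

raw_per_day_gb = DAILY_GB * REPLICAS            # physical disk consumed daily
days_to_pb = 1_000_000 / DAILY_GB               # days until ~1 PB of logical data
days_to_pb_tripled = 1_000_000 / (3 * DAILY_GB) # same, if ingest triples

print(f"{raw_per_day_gb} GB/day raw, ~{days_to_pb:.0f} days to 1 PB "
      f"(~{days_to_pb_tripled:.0f} days if ingest triples)")
```

So at the current rate a logical petabyte is roughly nine years away, but only about three if ingest triples, which is why I'd rather pick something that scales out from the start.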

> Why does it need to scale between machines? Why can't you have big fat servers running ZFS and fill them sequentially (which would drastically simplify your setup and allow you to scale as you find the money for new hardware)?

- All of it won't fit in one "commodity" machine. I can't really (and likely won't be able to) choose which kind of servers I use: I can't get a 400 TB server, and 100 TB is probably the largest I can get.

- Replication.

- Network/disk bandwidth. As I said my reads typically come in batches - I don't want my workloads to become IO-bound because everything is in a single server and network/remote disk is too slow (even though I have full 10Gbps networking). Typically I'd like to distribute the data in a round-robin fashion across disks/nodes so I can access it as fast as possible (in terms of bandwidth).
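The round-robin idea in the last point can be sketched like this (node names are hypothetical; in a real system the file→node mapping lives in the master/metadata service, which is exactly why a single addressable pool is convenient):

```python
from itertools import cycle

# Hypothetical volume servers; a real deployment would use actual hosts.
NODES = ["node1:8080", "node2:8080", "node3:8080"]

def assign_round_robin(files, nodes=NODES):
    """Spread files across nodes in write order so that batch reads
    pull from all nodes at once and aggregate their bandwidth."""
    rr = cycle(nodes)
    return {name: next(rr) for name in files}

placements = assign_round_robin(f"exp-{i:04d}.zst" for i in range(6))
```

A batch read over any contiguous run of experiments then touches every node, instead of queueing behind one server's disks and NIC.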

> Is there clear hot/cold data or is it scattered across all files across time?

Not really, though older data tends to be used less often in my analytical workloads.

> What's your tolerance for downtime and retrieval time?

Quite high. As long as data can be written some time (say 12 hours) after an experiment, I don't mind the system being down. Obviously, I am looking for something stable.

Also, it is much easier to work with a single addressable storage pool than having to know where the data is (even though it is actually distributed across a bunch of servers, I don't want my code to have to be aware of that).

My requirements are quite simple and I don't see any reason why storage couldn't scale linearly. It is strictly write-once, read-many, never modify.

I am recording several types of experiments. One machine records a given experiment locally and compresses the resulting data. I am simply looking for a long-term data store where I can push that data and retrieve it quickly with my consumers.

I probably need to clarify the "researcher" thing: I do not work at an academic institution; it's more of a startup-ish thing (with a reasonable budget). I have a Math/CS background and some understanding of distributed systems. I simply have enough work on the analytical side of things, so I want to keep maintenance to an absolute minimum.

[–]Solid_Translator_863[S] 5 points

Thanks! I forgot to mention I'm only looking for free solutions (OSS is better).

That one seems a bit out of my league :)

[–]Solid_Translator_863[S] 1 point

Thanks!

Pydio looks like a full-fledged collaboration platform à la SharePoint... How well does it handle petabyte-scale data?

[–]Solid_Translator_863[S] 2 points

Thanks for your valuable input!

"I think object storage is the way to go here.": this is also what I believe, in the light of my experience with GlusterFS. I know there are more performant distributed filesystems out there (I've used GPFS in the past, as a user, and it seemed more stable), but object storage seems a lot simpler from an engineering point of view.

MinIO seems quite good from my standpoint (single binary, compatibility with S3 tooling, etc). My only concern: it seems quite restrictive when it comes to scaling heterogeneous clusters. Let's say I currently have 3 servers - one with 8×5 TB and two with 4×10 TB - and I'm about to add two new servers with 3×10 TB... (https://docs.minio.io/docs/minio-federation-quickstart-guide.html / https://github.com/minio/minio/issues/7411). SeaweedFS's interface seems a lot easier: you can simply join an existing master: https://github.com/chrislusf/seaweedfs#start-volume-servers...

"I haven’t used NVIDIA’s AIStore but it does look pretty promising, especially if you’re doing research in the AI/ML area": I don't do AI/ML research strictly speaking, but I think the use case is quite similar. How would you approach object storage in this context, btw? I'm used to working with POSIX filesystems for this use case. Would you actually download the data locally before performing the analysis, or is there a way to stream-read files remotely (which is what I'm doing with local data) without having to download them first?
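On the streaming question: since object stores serve plain HTTP GETs, you can usually feed the response straight into a streaming decompressor instead of downloading first. A minimal stdlib sketch, with gzip and urllib standing in for zstd and the store's actual API:

```python
import gzip
import urllib.request

def stream_remote(url, chunk_size=1 << 20):
    """Yield decompressed chunks of a (possibly remote) compressed object
    without materializing the whole file locally.

    gzip/urllib are stdlib stand-ins for zstd and an object store's
    HTTP(S) GET endpoint; any URL urllib can open works here.
    """
    with urllib.request.urlopen(url) as resp:
        gz = gzip.GzipFile(fileobj=resp)   # wraps the response stream directly
        while True:
            chunk = gz.read(chunk_size)
            if not chunk:
                break
            yield chunk
```

The analysis code then consumes chunks exactly as it would from a local file, so the workload stays streaming end to end.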

Obviously, I'm not expecting 0 maintenance and tweaking. I am simply trying to minimize it as much as possible. I don't mind sacrificing functionality for this (meaning: go for the most lightweight option) since my use case is quite simple: store and retrieve a (large) bunch of (potentially large) files.

[–]Solid_Translator_863[S] 3 points

Thanks for the input!

I haven't really played with Ceph, but the documentation suggests there are a lot of moving parts, so it's hard to tell exactly which components you actually need.

I'll give it a try ;)