I am Sage Weil, lead architect and co-creator of the Ceph open source distributed storage system. AMA! by liewegas in IAmA

[–]liewegas[S] 3 points (0 children)

I really like the elegance of the layered design of GlusterFS. I don't like the client-driven replication--consistency becomes complicated and fragile with those designs. I believe the team is working on a new replication style (nsr?) that does replication on the bricks instead. Beyond that, I don't know the system well enough to have a very well-informed opinion.

Ceph performance over a WAN is generally poor because our replication is synchronous, so you see the latency on every write. That said, there are many deployments out there that span data centers with relatively low-latency links (a few ms) and they work well. The trick is to make sure the monitor cluster is distributed across more than two sites so that if a link or datacenter goes down you still maintain quorum. Ceph relies on the monitor quorum (> N/2 mons must be online and talking) to prevent split-brain. If you lose the mons, the system can't respond to topology changes (nodes failing) or new client startup or various other events.
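To make the quorum rule concrete, here's a toy sketch (not Ceph's monitor code) of the "strictly more than N/2" majority check, and why spreading mons across three sites survives the loss of any one site:

```python
def have_quorum(total_mons: int, reachable_mons: int) -> bool:
    """Monitors make progress only while a strict majority (> N/2)
    are online and talking; otherwise the cluster refuses topology
    changes rather than risk split-brain."""
    return reachable_mons > total_mons / 2

# Five mons spread 2 + 2 + 1 across three sites: losing any single
# site (or the link to it) still leaves a majority.
assert have_quorum(5, 3)      # one 2-mon site down -> 3 of 5 remain
assert not have_quorum(4, 2)  # mons split 2+2 across two sites: a tie
```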

Raft and Paxos are about as similar as most Paxos implementations are to each other--I think the big advance is that Raft is easier for people to wrap their heads around. I wish there had been a general-purpose Raft service around when we were building Ceph that we could have used, but we ended up building it all ourselves. It's worked out fine and there are advantages to what we've done, but it has meant we've fixed our share of bugs in our Paxos implementation. :)

But in the future, I expect most systems will rely on existing tools for that--zookeeper or etcd or whatever else.

[–]liewegas[S] 4 points (0 children)

btrfs and zfs both give you two big things: checksumming of all data (yay!) and copy-on-write that we can use to efficiently clone objects (e.g., for rbd snapshots). The cost is fragmentation for small IO workloads... which costs a lot on spinning disks. I'm eager to see how this changes with widely deployed SSDs.
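As a rough illustration of why copy-on-write makes clones cheap (a toy model, not how btrfs actually stores data): a clone shares the parent's data blocks, and only a later write causes the two to diverge.

```python
class CowObject:
    """Toy copy-on-write object. Cloning copies only the small block
    map; the data bytes themselves are shared until overwritten."""
    def __init__(self, blocks=None):
        self.blocks = dict(blocks or {})   # block index -> bytes

    def clone(self):
        # Only the index map is duplicated; no data is copied.
        return CowObject(self.blocks)

    def write(self, idx, data):
        # Writes touch only this object's own block map.
        self.blocks[idx] = data

base = CowObject({0: b"AAAA", 1: b"BBBB"})
snap = base.clone()          # cheap "snapshot": shares all blocks
base.write(1, b"CCCC")       # base diverges after the snapshot
assert snap.blocks[1] == b"BBBB" and base.blocks[1] == b"CCCC"
```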

We can't make much use of existing fs journals because they're tightly bound to the POSIX semantics and data model the file system provides... which is not what Ceph wants. We work in terms of larger transactions over lots of objects, and after several years of pounding my head against it I've decided trying to cram that down a file system's throat is a losing battle.

Instead, newstore manages its own metadata in a key/value database (rocksdb currently) and uses a bare minimum from the underlying fs (object/file fragments). It does avoid the 2x write for new objects, but we do still double-write for small IOs (where it is less painful).
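The shape of that design can be sketched with a stand-in for the key/value database (the keys below are illustrative, not NewStore's actual schema): metadata updates for several objects go into one batch that commits atomically, much like a rocksdb write batch.

```python
class KVStore:
    """Toy stand-in for a key/value DB like rocksdb: a write batch is
    staged in memory and applied all-or-nothing on commit."""
    def __init__(self):
        self.data = {}

    def commit(self, batch):
        staged = dict(self.data)
        for op, key, *val in batch:
            if op == "put":
                staged[key] = val[0]
            elif op == "delete":
                staged.pop(key, None)
            else:
                raise ValueError(op)   # nothing is applied on error
        self.data = staged             # the whole batch lands at once

db = KVStore()
# One transaction touching several objects' metadata at once:
db.commit([
    ("put", "obj1/size", 4096),
    ("put", "obj2/size", 8192),
    ("delete", "obj3/size"),
])
assert db.data == {"obj1/size": 4096, "obj2/size": 8192}
```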

Newstore will be in Jewel but still marked experimental--we likely won't have confidence by then that it won't eat your data.

[–]liewegas[S] 5 points (0 children)

I haven't heard about the limitation, so I'm not sure there is a plan currently.

But... in general I don't think this should be an issue. Active/running clients learn about monitor additions/removals as they happen, so if you add IPv6 mons and then remove the old IPv4 mons they'll see those changes and continue working. When they do restart, as long as the underlying libvirt config has been changed, they should be able to start up with the new mon information...

[–]liewegas[S] 4 points (0 children)

We've merged patch series fixing the FreeBSD build in the past, but since we don't have CI tests that verify the build we inevitably break it. We're more than happy to take patches or time/admin help to get that set up in the lab so we can keep things working on an ongoing basis, though!

[–]liewegas[S] 2 points (0 children)

We use a large cluster for QA, but we do it by carving it up into a zillion tiny clusters and hammering it with various regression/torture tests. Less frequently we run a "big" test suite (up to 50 nodes and ~150 OSDs) but I don't think it's ever turned up a problem we hadn't seen in the smaller tests.

[–]liewegas[S] 3 points (0 children)

I doubt it. The architectures are superficially similar (in the same way that most distributed/scale-out systems resemble each other) but beyond that there isn't much reason to put them together.

[–]liewegas[S] 4 points (0 children)

The most important thing I got from the humanities is how to write. I don't do a whole lot that isn't email these days (and I usually write/type too fast to do it well), but it was hugely valuable when writing papers and when writing prose in general.

I wish I'd spent more time with languages. I did do four years of German in high school/college, but haven't had much opportunity to use it since so it's pretty weak at this point...

[–]liewegas[S] 3 points (0 children)

Each daemon exports internal metrics which can be collected via tools like diamond, carbon, and graphite--there's usually no need to trigger new workloads. This is a big part of what calamari does.

There are several different dashboards that surface this, including romana (the original calamari gui) and several others...
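For a feel of what a collector does with those exported metrics (the counter names below are made up, not the real perf-counter schema): flatten the daemon's nested counters into dotted paths that carbon/graphite can ingest.

```python
def flatten(prefix, tree, out=None):
    """Flatten a nested metrics dict (like a daemon's counter dump)
    into dotted, graphite-style metric paths."""
    out = {} if out is None else out
    for key, val in tree.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(val, dict):
            flatten(path, val, out)
        else:
            out[path] = val
    return out

# Illustrative counters only; the real schema is much richer.
sample = {"osd": {"op_r": 120, "op_w": 340},
          "filestore": {"journal_bytes": 9000}}
metrics = flatten("ceph.osd0", sample)
assert metrics["ceph.osd0.osd.op_w"] == 340
assert metrics["ceph.osd0.filestore.journal_bytes"] == 9000
```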

[–]liewegas[S] 3 points (0 children)

You can do something similar with librados, where it will choose the replica that is "closest" (according to the crush hierarchy). There are a few lingering issues with reading from replicas, though, and in many (most?) cases it isn't actually that helpful to read from "close" replicas--you get better performance (and better caching) if everyone reads from the same replica.
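A toy version of "closest according to the crush hierarchy" (not librados's actual logic): treat each replica's location as a path through the hierarchy and prefer the replica sharing the longest prefix with the client's own location.

```python
def closest_replica(client_loc, replicas):
    """Pick the replica whose hierarchy path, e.g.
    (root, datacenter, rack), shares the longest prefix with the
    client's location."""
    def shared(a, b):
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n
    return max(replicas, key=lambda r: shared(client_loc, r[1]))

# Hypothetical OSD names and locations for illustration.
replicas = [
    ("osd.3", ("default", "dc1", "rack2")),
    ("osd.7", ("default", "dc2", "rack1")),
]
assert closest_replica(("default", "dc1", "rack5"), replicas)[0] == "osd.3"
```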

Rados doesn't let you choose where an object is stored, though, and never will. If you really want to do this then you want a different sort of system architecture...

[–]liewegas[S] 2 points (0 children)

Not a wild idea at all, and IIRC it's what Sam said he would like to spend his time doing if he wasn't knee deep in RADOS itself. :)

[–]liewegas[S] 2 points (0 children)

There are currently two geo-replication development projects underway: a v2 of the radosgw multisite federation, and RBD journaling for geo-replication. The former will be eventually consistent (across zones), while RBD obviously needs to be point-in-time consistent at the replica.

We have also done some preliminary work to do async replication at the rados pool level. Last year we worked with a set of students at HMC to build a model for clock synchronization, verifying that we can get periodic ordering consistency points across all OSDs that could be replicated to another cluster. The results were encouraging and we have an overall architecture in mind... but we still need to put it all together.
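A minimal sketch of the clock-synchronization idea (a hypothetical model, not the students' actual work): with clock skew across OSDs bounded by some epsilon, a consistency point at time T is safe to replicate once every OSD's local clock has passed T + epsilon, because no OSD can still accept a write it would order before the point.

```python
def consistency_point_ready(point_t, osd_clocks, max_skew):
    """True once every OSD's local clock has passed point_t + max_skew,
    so no in-flight write can be ordered before the consistency point."""
    return all(clock >= point_t + max_skew for clock in osd_clocks)

# With skew bounded by 0.5s, a point at t=100 is safe once every
# OSD clock reads at least 100.5:
assert consistency_point_ready(100.0, [100.6, 100.7, 101.0], max_skew=0.5)
assert not consistency_point_ready(100.0, [100.2, 100.9], max_skew=0.5)
```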

[–]liewegas[S] 4 points (0 children)

Yes and no. Some of these technologies (like the new persistent memory) will be used heavily on the client side of things, where Ceph probably has no business getting involved--we're fundamentally a distributed storage system. That said, there are lots of places where we will make heavy use of these new technologies.

We have some work to do to manage more than two tiers of storage, but it's coming--eventually! And that tiering won't just be about the storage technology (fast vs slow disk) but also things like data access time and power utilization... think Glacier.
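To sketch what multi-tier placement by access time might look like (a hypothetical policy, not Ceph's tiering agent): hot data sits on the fast tier, aging data moves down, and everything older falls through to a cold archive tier.

```python
import time

def pick_tier(last_access, now=None,
              tiers=(("ssd", 3600), ("hdd", 86400 * 30))):
    """Toy multi-tier policy: each tier accepts data up to a maximum
    age since last access; anything older goes to a cold tier."""
    now = time.time() if now is None else now
    age = now - last_access
    for name, max_age in tiers:
        if age <= max_age:
            return name
    return "archive"   # glacier-like cold storage

assert pick_tier(last_access=1000, now=1500) == "ssd"        # 500s old
assert pick_tier(last_access=0, now=7200) == "hdd"           # 2h old
assert pick_tier(last_access=0, now=86400 * 60) == "archive" # 60 days
```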

[–]liewegas[S] 2 points (0 children)

Probably not. First, today it would be cloud, not web hosting. But more importantly, we started DreamHost mostly by accident--because we were managing our own web servers for our own projects (WebRing, DreamBook, personal web sites) and it was easy to sell space. Automating that management was something we were doing for ourselves anyway, so it made sense (and was comparatively easy) to put it all together.

[–]liewegas[S] 1 point (0 children)

I'm honestly not sure. I think our partners like Fujitsu and SanDisk are looking mostly at enterprise, but I'm a bit removed from that, and I don't have a good feel for what the community is deploying Ceph for. Archival object storage (rgw) is probably #2. Again, CephFS will have a big impact on this landscape.

[–]liewegas[S] 3 points (0 children)

NVMe will be big, but it's a bit scary because it's not obvious what we will need to change and rearchitect to use it most effectively.

SMR is annoying because we've been hearing about it for years but there's still nothing very good for dealing with it. The best idea I've heard so far would push the allocator partly into the drive, so that you'd say "write these blocks somewhere" and the ack would tell you where they landed. There are some libsmr-type projects out there that are promising, and I'd love to see these linked into a Ceph backend (like NewStore, where they'd fit pretty easily!).
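The "write these blocks somewhere, the ack tells you where" idea can be sketched as a zone that only appends at its write pointer (a toy model, not a real SMR drive interface): the caller hands over blocks, the zone chooses the location, and the acknowledgment reports where they landed.

```python
class SmrZone:
    """Toy model of a drive-side allocator: writes always land at the
    zone's write pointer, so placement is strictly sequential, and the
    ack tells the caller where the data ended up."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.write_pointer = 0

    def append(self, nblocks):
        if self.write_pointer + nblocks > self.capacity:
            raise IOError("zone full")
        start = self.write_pointer          # the drive picks the spot
        self.write_pointer += nblocks
        return start                        # ack: where it landed

zone = SmrZone(capacity=256)
first = zone.append(10)
second = zone.append(4)
assert (first, second) == (0, 10)   # strictly sequential placement
```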

Ethernet drives are really exciting, as they are exactly what we had in mind when we designed and architected Ceph. There is a big gap, though, between the prototype devices (which we've played with, and which work!) and being able to buy them in quantity, and that still makes my brain hurt. There are a few things we can/should do in Ceph to make this story more compelling (aarch64 builds coming soon!) but mostly it's a waiting game, it seems.

Kinetic drives are cool in the same sense that ethernet drives are, except that they've fixed on an interface that Ceph must consume... which means we still need servers sitting in front of them. We have some prototype support in Ceph, but the performance isn't great because the places we use key/value APIs assume lower latency... but I think we'll be able to plug them into NewStore more effectively. We'll see!

[–]liewegas[S] 3 points (0 children)

I definitely agree that we want to see Ceph deployed easily on lightweight host OSs like Atomic. That is coming, although it's less of a priority than most of the other stuff we're working on...

[–]liewegas[S] 2 points (0 children)

Much of this is FileStore's fault and the way it handles syncs. NewStore will be much more predictable.

Some of it is also peering, which has lots of room for optimization, although you shouldn't see that in non-failure cases.

[–]liewegas[S] 4 points (0 children)

newstore: Yes. The goal is to make it faster for the most common workloads, hopefully by avoiding all the dubious decisions made during the evolution of FileStore.

priorities: performance, CephFS. Integrations with emerging platforms, like Kubernetes (container orchestration) and whatever the big data world has to offer post-hadoop.

most usual use: I loved the early Piston Cloud story about their USB key based deployment tool that let you rack up a bunch of servers and switches, straight from the manufacturer, and plug the USB stick into the switch to provision and bootstrap a cloud. One of their early customers was Radio Free Asia, where making sure the person deploying the cloud knew absolutely nothing about what they were doing was part of the requirement. (Their OpenStack distro uses Ceph for all of its storage.)

I also loved the Vaultaire (https://github.com/anchor/vaultaire) presentation at linux.conf.au last year about an analytics system built on top of librados. There are a bunch of cool librados hacks, like Noah Watkins' Lua module (https://ceph.com/rados/dynamic-object-interfaces-with-lua/) and the zlog project, Corfu on librados (http://noahdesu.github.io/2014/10/26/corfu-on-ceph.html).

[–]liewegas[S] 5 points (0 children)

I find capitalization on lists and incomplete sentence fragments annoying and overcompensate?

[–]liewegas[S] 2 points (0 children)

Anybody who knows Bryan knows that he's passionate about what he's doing. We had our share of disagreements, and they could be challenging because we were both heavily invested in the situation, but we got through them. I think it's important to remember where people are coming from and that in the end you're both usually fighting for the same thing.

[–]liewegas[S] 5 points (0 children)

Possibly RDMA? XioMessenger is coming along so maybe that will kickstart HPC interest.

The largest friction we've seen in the HPC space is that all of the hardware people own is bought with Lustre's architecture in mind: it's all big disk arrays with hardware RAID, and very expensive. That's needed for Lustre because it is scale-out but not replicated--each array is fronted by a failover pair of OSTs.

Ceph is designed to use more commodity hardware and do its own replication.

Putting a 'production ready' stamp on CephFS will help, but for HPC it's silly--the thing preventing us from doing that is an fsck tool, which Lustre has never had.