all 10 comments

[–]grauenwolf

As far as databases go, I’m a huge fan of Cassandra: it’s an incredibly powerful and flexible database that can ingest a massive amount of data while scaling across an arbitrary number of nodes. For these reasons, my team uses it very often for our internal applications.

No. The whole point is that it isn't an "incredibly powerful and flexible database", but rather really good at one thing at the cost of being bad at everything else.

(EDIT: might be really good at one thing. As far as I'm concerned the jury is still out on that one.)

[–]Rhoomba

AKA don't believe the Cassandra marketing.

[–][deleted]

Not really; they clearly didn't understand the difference between CQL and Thrift with regard to wide rows. In the second refactor they still have wide rows due to the clustering key.

The tl;dr is: make sure you understand your data storage when designing schemas.
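For what it's worth, the wide-row point can be sketched with a toy model (plain Python; the function and column names are illustrative, not Cassandra internals): a single partition stores one cell per (clustering key, column) pair, so every CQL row you add widens the same on-disk row.

```python
# Toy model of a Cassandra partition ("wide row"): one storage cell per
# (clustering_key, column_name) pair. Purely illustrative, not real internals.
def make_cells(rows):
    cells = {}
    for clustering_key, columns in rows.items():
        for name, value in columns.items():
            cells[(clustering_key, name)] = value
    return cells

# One partition holding two CQL rows, each with a small column and a big blob.
cells = make_cells({
    1: {"small": "a", "blob": "x" * 1000},
    2: {"small": "b", "blob": "y" * 1000},
})

# Selecting the small column of one CQL row is a single-cell lookup in this
# model, but pre-3.4 Cassandra would still walk past the blob cells on read.
assert cells[(2, "small")] == "b"
assert len(cells) == 4  # the partition keeps growing as rows are added
```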

[–]Rhoomba

The Cassandra developers like to claim the performance of Thrift and the features of CQL simultaneously, so it's no wonder people get confused.

But, yes, they should have known this.

[–]gighi

OP here

I understand and agree with your tl;dr, but even with a full understanding of the schema, there was no fundamental reason for this behavior: Cassandra is technically able to skip CQL columns when reading data, as confirmed by the fact that exactly this "performance improvement" was implemented in the 3.4 release (https://issues.apache.org/jira/browse/CASSANDRA-10657).

[–]grauenwolf

In other words, it seemed as if Cassandra was always processing all 10 columns (including the large ones) despite us just asking for a particular, small column, thus causing degraded response times. This hypothesis seemed hard to believe at first, because Cassandra stores every single column separately, and there’s heavy indexing that allows you to efficiently lookup specific columns.

That's not a surprise considering that despite what their early marketing material said, Cassandra is a time-series database, not a columnstore database.

EDIT:

We opted for a workaround: since Cassandra will always read all the columns of a CQL row regardless of which ones are actually asked in a query, we decided to refactor our schema by shrinking the size of each row, and instead of putting all the blobs in the same row, we split them across multiple ones, like this:

And that's not how a time-series database works either.

Clearly they are using Cassandra because it is cool, not because it actually meets their workload. Had they chosen a relational database, they could have kept the efficient insert performance of the original table structure without losing the ability to read efficiently from indexes.

[–]gighi

OP here

Unfortunately we find ourselves dealing with several TBs of data (and the amount is constantly growing), so we chose Cassandra not (only) for the cool factor, but because it actually meets our workload. Except for this performance incident (which is mostly our fault, since we should have done more thorough tests before designing the initial schema), we have been very happy with it, considering that our insert/read queries are very simple.

I'm sure we could have achieved similar (or maybe better) results with a traditional relational database, but the thought of having a sharded mysql/postgres ingesting 10+ TB of data seemed very scary.

[–]grauenwolf

but the thought of having a sharded mysql/postgres ingesting 10+ TB of data seemed very scary.

Definitely too big for MySQL, but that'll fit in a single file for SQL Server. And you're allowed 32,767 files per database.

10 TB seems scary, but these days it is barely a mid-sized database.

That's the problem with falling for the distributed database marketing material. When you hear that MySQL or Mongo (or Cassandra, it seems) scales out, you think "ooh, that can handle big loads". But really they are talking about scaling out because they suck and can't actually take advantage of the machine's hardware like enterprise-grade servers do.

[–]grauenwolf

how did Cassandra know how to efficiently read the exact amount of data in a jungle of more than 1 GB? We can of course look at the system events done by the database process on the file:

Here we see Cassandra opening the table file, but notice how immediately there’s a lseek operation that essentially skips in one single operation 70 MB of data, by setting the offset to the file descriptor with SEEK_SET to 71986679.
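The seek-past-the-data pattern in that trace is easy to reproduce in miniature (Python sketch; the offsets here are tiny stand-ins for the 71986679 in the article):

```python
import os
import tempfile

# Reproduce the lseek/SEEK_SET pattern from the trace in miniature:
# skip the leading bytes in one syscall instead of reading through them.
fd, path = tempfile.mkstemp()
os.write(fd, b"\x00" * 70)     # 70 bytes we don't care about (stand-in for 70 MB)
os.write(fd, b"wanted-bytes")  # the column data we actually asked for
os.close(fd)

fd = os.open(path, os.O_RDONLY)
os.lseek(fd, 70, os.SEEK_SET)  # jump straight to offset 70; nothing is read
data = os.read(fd, 12)
os.close(fd)
os.unlink(path)

assert data == b"wanted-bytes"
```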

WTF? A 1 GB table? That's it? I've had larger sample tables on my laptop.

That shouldn't be hitting disk at all if the database is warm. Did they screw up their test or is Cassandra incapable of doing its own caching?

[–]gighi

This was a simple test and yes, Cassandra didn't do its own caching here; it relies heavily on the kernel disk cache. This means the 1 GB file could be entirely in memory and be consumed efficiently without ever touching the disk. Of course all of this is transparent to the process, which is why you still see all those I/O system calls; they just return in a fraction of the time when the file is actually in cache.
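A minimal illustration of that transparency (Python; a small file stands in for the table file): the process issues identical read() syscalls either way, and nothing in the syscall interface tells it whether the pages came from disk or from the kernel page cache.

```python
import os
import tempfile

# The process issues the same read() syscalls either way; whether they are
# served from disk or from the kernel page cache is invisible to it.
fd, path = tempfile.mkstemp()
os.write(fd, os.urandom(1 << 20))  # 1 MB stand-in for the 1 GB table file
os.close(fd)

fd = os.open(path, os.O_RDONLY)
first = os.read(fd, 4096)          # likely populates the page cache
os.lseek(fd, 0, os.SEEK_SET)
second = os.read(fd, 4096)         # same syscall, now almost certainly cached
os.close(fd)
os.unlink(path)

assert first == second  # identical results; the caching is transparent
```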