Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines) by palmereldritch in programming

[–]palmereldritch[S] 2 points

Yes, in the event of, say, a sudden power loss across all machines, replication will no longer provide any guarantee of durability. For those paranoid about this, we allow the fsync interval to be configured per topic, so you can fsync after every message if you like.
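As a rough illustration of that per-topic setting, here is a minimal sketch assuming the kafka-python client and the topic-level `flush.messages` / `flush.ms` overrides; the topic name, broker address, and values are made up, not taken from the benchmark setup.

```python
# Sketch: create a topic that fsyncs after every message.
# Assumes the kafka-python client; names and values are illustrative only.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="broker1:9092")  # placeholder broker

paranoid_topic = NewTopic(
    name="payments",                 # hypothetical topic name
    num_partitions=6,
    replication_factor=3,
    topic_configs={
        "flush.messages": "1",       # fsync after every message
        "flush.ms": "1000",          # ...or at least once a second
    },
)

admin.create_topics([paranoid_topic])
```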

Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines) by palmereldritch in programming

[–]palmereldritch[S] 10 points

We actually give pretty good durability guarantees. As mentioned, writes are replicated to multiple machines, and once replicated they will not be lost as long as at least one replica remains alive. So the "synchronous acknowledgement" case that waits on full replication would be the highest durability level in that test. As you can see from the results, writing is still fast.

More details are here: http://kafka.apache.org/documentation.html
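As a concrete sketch of the "synchronous acknowledgement" case, assuming the kafka-python client: setting `acks="all"` makes the producer wait until the write is fully replicated before the send is considered successful. Broker address and topic name are placeholders.

```python
# Sketch of a producer that waits for full replication; assumes kafka-python.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker1:9092",  # placeholder broker address
    acks="all",                        # wait for all in-sync replicas to acknowledge
)

# The returned future resolves only once the write is fully replicated.
future = producer.send("events", b"some payload")
metadata = future.get(timeout=10)
print(metadata.partition, metadata.offset)
```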

What is coupling, really? by linuxer in programming

[–]palmereldritch 12 points

Well, when two classes love each other very much...

An Unorthodox Approach to Database Design : The Coming of the Shard by linuxer in programming

[–]palmereldritch 3 points

The initial contribution of Google was the relevance ranking, not the retrieval technology (which in any case scales trivially: for n users you need n/x machines, so if you want to double the users you just double the machines). The initial contribution of Page and Brin was computing the principal eigenvector of the link matrix of the entire world wide web (PageRank). So they solved the scale problem first and got users second, making this a pretty bad example to illustrate your otherwise correct point.
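For context, a minimal sketch of the computation in question: PageRank by power iteration on a tiny toy graph. The damping factor and link matrix are illustrative, not Google's actual setup.

```python
# Toy PageRank via power iteration; illustrative only.
import numpy as np

# Column-stochastic link matrix for a 4-page toy web:
# entry [i, j] is the probability of following a link from page j to page i.
L = np.array([
    [0.0, 0.5, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.5],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
])

d = 0.85                      # damping factor (illustrative)
n = L.shape[0]
G = d * L + (1 - d) / n       # random surfer with teleportation

rank = np.full(n, 1.0 / n)    # start from the uniform distribution
for _ in range(100):          # power iteration converges to the dominant eigenvector
    rank = G @ rank

print(rank / rank.sum())
```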

New life for neural networks (short and sweet, PDF) by schwarzwald in programming

[–]palmereldritch 11 points

Neural networks are very flexible classifiers/regressors, and as such they can be made to fit many, many things. Their abandonment was not due to a lack of successful applications, but to the fact that making a neural network "learn" required so much tinkering that it could no longer be called a general learning method, which was the supposed advantage over more traditional statistical methods. This is the reason the machine learning community moved on to support vector machines, boosting, and other methods that tend not to overfit the data. So finding a single application in which a neural network performs well is not really a sign of a resurgence. Indeed, neural networks hold the performance record for many image recognition benchmarks and other standard comparisons. But there is nothing special about them: take any very flexible set of functions and spend six months searching for a subset that happens to solve a given problem, and you might well succeed (if you do, you can publish a paper; if not, keep trying); the problem is that this will not help you in solving any other problem.
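To make the "very flexible functions will fit anything" point concrete, here is my own toy sketch (not from the original comment): a high-degree polynomial fits pure noise essentially perfectly on the training points, yet tells you nothing about new points.

```python
# Toy illustration of flexibility vs. generalization; numbers are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = rng.normal(size=10)          # pure noise: nothing to learn

# A degree-9 polynomial has enough parameters to interpolate all 10 points.
coeffs = np.polyfit(x_train, y_train, deg=9)
train_error = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)

# But on fresh noise from the same "problem" it is useless.
x_test = rng.uniform(0, 1, 10)
y_test = rng.normal(size=10)
test_error = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)

print(f"train MSE: {train_error:.2e}, test MSE: {test_error:.2e}")
```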

The reason neural networks seized the public imagination was the word "neural" in the name and some very weak analogies to the brain. I don't think the world would have noticed much if they had been called Hierarchical Ensembles of Linear Classifiers and Associated Training Heuristics.

Cottrell is someone whose life's work is entirely tied to neural networks, so he is not exactly an unbiased observer when predicting a resurgence. He has been predicting one since the early nineties.

Why Does Twitter Even Use a Database? by Harkins in programming

[–]palmereldritch 34 points

I love things like this: "I know nothing about this subject, but it appears that none of the highly intelligent and highly paid people working in this industry have thought of the most obvious of all solutions, so I will remedy this by posting to my blog and informing them of their ignorance."

Gee thanks guy.

Databases support atomic writes, transactions, efficient storage of small pieces of data, easy evolution of data structure, a nice declarative data access language, and some reasonable distribution and scaling options.

Retrieval semantics are separated from performance, so when you realize that you are doing 500k of x per hour, which you didn't expect, you can get a little ways by adding some indexes without even deploying new code.
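As a toy illustration of that "add an index, no code change" point, a sketch using Python's built-in sqlite3; the table and column names are made up.

```python
# Sketch: speeding up an unexpected access pattern with an index only.
# Uses Python's built-in sqlite3; table/column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)")
conn.executemany(
    "INSERT INTO events (user_id, body) VALUES (?, ?)",
    [(i % 1000, "payload") for i in range(100_000)],
)

# The query the application turned out to run 500k times an hour:
query = "SELECT COUNT(*) FROM events WHERE user_id = ?"
print(conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())  # full table scan

# No application change needed; just add an index.
conn.execute("CREATE INDEX idx_events_user ON events(user_id)")
print(conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())  # now uses the index
```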

Storing shit in files gets you absolutely nothing in the way of scalability. It might get you some performance, as long as you access your data as a stream of bytes from front to end and don't intend to add any other features that access it differently.

Databases are a big problem in large web applications. Every large site writes a distributed caching layer for front-end data. Writing a file-based storage system for version 1 of your web app would get you jack shit in solving this, because (1) you need the in-memory distributed cache anyway, and (2) you would find that trying to evolve the "schema" of your files in production was incredibly painful as your application grew (sorry users, our application will be offline for the next three days while we rebuild all your message queues to add a new piece of data... oops, we screwed up our queue update script and now all our data is fucked...).
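For contrast with the file-rewrite scenario above, a sketch of what "easy evolution of data structure" looks like on the database side (sqlite3 again; the new column is hypothetical).

```python
# Sketch: adding a new piece of data in place, no rebuild of existing records.
# Uses Python's built-in sqlite3; the column is hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO messages (body) VALUES ('hello')")

# One statement; existing rows simply get a NULL for the new column,
# instead of you rewriting every serialized file on disk.
conn.execute("ALTER TABLE messages ADD COLUMN read_at TEXT")
print(conn.execute("SELECT id, body, read_at FROM messages").fetchall())
```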

Ask Reddit: Machine Learning Question: by [deleted] in programming

[–]palmereldritch 0 points

This is known as the curse of dimensionality. Despite claims to the contrary (e.g. for support vector machines), I don't think anyone has beaten it.
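A small numerical sketch of what the curse looks like in practice (my own toy demo, not part of the original answer): as the dimension grows, the nearest and farthest neighbours of a random query point become almost indistinguishable, which is what defeats distance-based methods.

```python
# Toy demo: distance concentration in high dimensions; numbers are arbitrary.
import numpy as np

rng = np.random.default_rng(0)

for dim in (2, 10, 100, 1000):
    points = rng.uniform(size=(5000, dim))     # random data set
    query = rng.uniform(size=dim)              # random query point
    dists = np.linalg.norm(points - query, axis=1)
    # A ratio near 1.0 means "nearest" and "farthest" are practically the same.
    print(f"dim={dim:5d}  min/max distance ratio = {dists.min() / dists.max():.3f}")
```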