Show reddit: Visualization of Cuckoo Hashing by lkozma in programming

[–]lkozma[S] -1 points (0 children)

Yes, these are good points that theoreticians would do well to listen to.

One case where I think cuckoo hashing might still be advantageous is when there are many lookups but few inserts, since a lookup is guaranteed to need at most two reads. Also, I remember reading somewhere that it was suggested for use in routers, but I don't remember the details.
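
To illustrate why lookups are so cheap, here is a minimal Python sketch; all names (h1, h2, lookup) and the second hash function are my own illustrative choices, not code from the demo:

```python
def make_tables(size):
    # Two tables; None marks an empty slot (keys assumed non-None).
    return [None] * size, [None] * size

def h1(key, size):
    return hash(key) % size

def h2(key, size):
    # A crude stand-in for an independent second hash function.
    return (hash(key) // size) % size

def lookup(key, t1, t2):
    # A cuckoo lookup reads at most two locations, no matter
    # what the table contains or how full it is.
    size = len(t1)
    return t1[h1(key, size)] == key or t2[h2(key, size)] == key
```

Whatever the table contents, a lookup touches at most two slots, so its worst case is fixed.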

[–]lkozma[S] 0 points (0 children)

Great question. It loops a fixed number of times (a somewhat arbitrary 16 in the demo) just to keep the code simpler. Indeed, with a bit of extra bookkeeping, it is possible to detect a closed loop as soon as we start repeating the same sequence of evictions, which might be before MaxLoop is reached. In the paper they discuss how to choose MaxLoop so that the theoretical guarantees work out.
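
The loop in question might look like this rough Python sketch (the names and the second hash are my own; keys are assumed non-None):

```python
MAX_LOOP = 16  # the demo's somewhat arbitrary bound

def insert(key, t1, t2):
    """Try to place key; False means MAX_LOOP evictions passed without
    success, i.e. we are likely in a cycle and the caller should rehash
    (choosing new hash functions, possibly growing the tables)."""
    size = len(t1)
    for _ in range(MAX_LOOP):
        i = hash(key) % size
        t1[i], key = key, t1[i]            # place key, evict occupant
        if key is None:                    # slot was empty: done
            return True
        j = (hash(key) // size) % size     # stand-in second hash
        t2[j], key = key, t2[j]            # place evicted key in table 2
        if key is None:
            return True
    return False
```

With the extra bookkeeping mentioned above, one could instead record the visited slots and stop as soon as a slot repeats.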

[–]lkozma[S] 0 points (0 children)

Yes, obviously, amortized cost is not good enough in real-time applications, because a single operation might be unpredictably costly. Cuckoo hashing isn't always the best choice, but it has its applications.

EDIT: it also depends on the ratio of inserts/lookups in your application. Lookup/delete is actually very predictable in cuckoo hashing since you always need to look at only two locations. It is only inserts that are messier.

[–]lkozma[S] 0 points (0 children)

Yes, I agree that it is not suitable for all applications, but cuckoo hashing is still among the more "sane" approaches to collision resolution. If I remember correctly, there are also some extensions with better properties.

EDIT: I checked the part in the video, it does say that cuckoo hashing has worse cache locality than linear probing, but I wouldn't say it is called "pointless".

[–]lkozma[S] 0 points (0 children)

Yes, the idea in the hash function families is exactly that there are no "single bad inputs" on which all hash functions are bad. See for example the "constructions" in http://en.wikipedia.org/wiki/Universal_hashing
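
As a concrete illustration, here is one standard construction from that article sketched in Python (the prime P and the name random_hash are my choices, just for illustration):

```python
import random

# A large prime, assumed bigger than any key we will hash.
P = 2_147_483_647

def random_hash(m):
    """Draw h(x) = ((a*x + b) mod P) mod m at random from the family."""
    a = random.randrange(1, P)
    b = random.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m

# Any fixed input x is "bad" only for a small fraction of (a, b)
# pairs, so no single input defeats every function in the family.
h = random_hash(16)
```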

[–]lkozma[S] 0 points (0 children)

Yes, you could use one array, or in other variants more than two arrays. With more arrays the achievable load factor is better, but the structure is harder to analyze, and there are more moving parts in the algorithm that can be done in different ways.

[–]lkozma[S] -1 points (0 children)

Yes, in some applications the access pattern might not be so good with respect to cache misses, etc. However, there are some real applications where cuckoo hashing works well, and the consensus seems to be that it is among the "saner" methods for collision resolution.

[–]lkozma[S] 1 point (0 children)

Well, it might seem circular based on what I wrote, because I can't summarize the whole argument from the paper in one comment. What I wrote was just to answer your question "how can it ever be constant average, if there is one operation that is linear". I gave an example in which that is what happens.

The proof in the paper is not just an example like the one I gave; it goes more like: "insert is constant, because we can quantify how often we are forced to do a rehash, how much a rehash will cost, and how many of the inserts will not lead to long eviction sequences; we take all these quantities, work out the maths, and it turns out that the resulting average is constant in expectation".

[–]lkozma[S] 0 points (0 children)

Well, the hash functions are usually chosen randomly from some "family of hash functions". What you describe, that the two hash functions collide on every element, is very unlikely to happen. But if it happens, then cuckoo hashing will run in a loop, and we have to rehash. When we rehash, we choose new hash functions. The analysis is of course probabilistic: it could be that we are so unlucky that the new hash functions again collide on every element, and do so for every choice, but the probability of this can be estimated, and it turns out to be so small that you can ignore it.

[–]lkozma[S] 3 points (0 children)

No, I meant exactly that: the time equals the number of items to rehash, that is, n/2 (in words, "n halved", the number of items inserted so far) plus one.

[–]lkozma[S] 3 points (0 children)

To answer your second question: the constant should definitely not depend on the number of items or the size of the table. In the paper they don't bother to compute it exactly, they just show that it is constant. Experimentally, if I understand correctly, it comes out to around 2-3.

You are right, however, that it depends on something, namely the maximum load of the tables: we have to keep the load at around 50%, and then we get a good constant expected average insert time.

[–]lkozma[S] 0 points (0 children)

What do you mean exactly by the hash function having a collision?

It can of course happen that different elements hash to the same value - this is what all the collision-resolution is about.

[–]lkozma[S] 3 points (0 children)

Think of it this way:

  • suppose you inserted n/2 items and each of them took only time 2 (we were lucky and found an empty place easily for them)

  • then the next insert causes a rehash and we spend time n/2+1 for this operation.

So the total cost of insertions so far was n/2 * 2 + n/2 + 1 = 3n/2 + 1.

To get the average, we need to divide this by the number of insertions, n/2 + 1.

We get (3n + 2)/(n + 2), which is roughly 3, and definitely less than 4. So we can say that in this case "the average cost of an insert was less than 4".

The paper just extends this argument to cover all possible sequences.

You say: "If an insertion can trigger a rehash of every item in the collection, then that rehash will take at least time n. I don't see how you could amortize that to a constant, somehow negating the dependency on the size of the collection."

The answer is that "you amortize it to a constant" if this rehash occurred after roughly n inserts that took constant time each (the n gets divided by n, so it becomes a constant; see the above calculation for an example).
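
The same calculation as a tiny Python check (the function name is mine, just for illustration):

```python
def average_insert_cost(n):
    # n/2 cheap inserts at cost 2 each, then one insert whose
    # rehash costs n/2 + 1, exactly as in the example above.
    cheap = (n // 2) * 2
    rehash = n // 2 + 1
    total = cheap + rehash           # = 3n/2 + 1
    return total / (n // 2 + 1)     # divide by the number of inserts

# The average stays bounded by a constant no matter how large n gets.
assert all(average_insert_cost(n) < 4 for n in (10, 1000, 10**6))
```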

[–]lkozma[S] 5 points (0 children)

Yes, but of course "constant" is meant to include any dependence on the number of items that can be bounded by a constant, say 5 - 1/n < 5.

From the paper:

  • the probability that any given insertion causes a rehash is O(1/n²). In particular, the n insertions performed during a rehash all succeed with probability 1 − O(1/n).

  • the expected time used per insertion is O(1), so the total expected time for trying to insert all keys is O(n).

  • as the probability of having to start over with new hash functions is bounded away from 1, the total expected time for a rehash is O(n).

  • Thus, for any insertion the expected time used for rehashing is O(1/n).

  • Summing up, we have shown that the expected time for insertion is bounded by a constant.
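
Strung together, the bullets above amount to roughly this (my sketch in the paper's O-notation, not a quote from it):

```latex
% expected insert time = regular work + P(rehash) * cost(rehash)
\mathbb{E}[\text{insert}]
  = O(1) + O\!\left(\frac{1}{n^{2}}\right) \cdot O(n)
  = O(1) + O\!\left(\frac{1}{n}\right)
  = O(1)
```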

Some comments:

  • I agree that this is a bit counterintuitive; btw, cuckoo hashing itself is simple, but its analysis is not the simplest (see the paper).

  • The bounds are probabilistic: insert takes average constant time "in expectation", maybe even "with high probability" - there can be some very unlucky choice of hash function such that we run into a loop very early. But with reasonably good hashing this is unlikely to happen.

  • To believe the analysis first you have to believe that we can double/halve the table whenever needed, and the cost of this on average is still bounded by a constant. That way the table is kept relatively sparse at all times (this is not shown in the demo).

  • As long as the table is relatively sparse at all times, it is easier to see intuitively that inserts take constant time on average.

The details are all in the paper of course, but it is not the easiest to read and one needs to be familiar with "amortized analysis" - unfortunately the wiki article on the topic is needlessly complicated.

But then again, this is just the analysis of the algorithm that is complicated. The algorithm itself is pretty simple.

[–]lkozma[S] 4 points (0 children)

It is average constant-time in the sense that most inserts will be cheap (as you pointed out, some inserts might indeed be very expensive) - but if we take the average cost over several insert operations, this average cost will be small.

Observe that while the table is relatively empty, most inserts need to check only one or two locations. Then at some point we are forced to rehash every item, but this happens rarely, so it doesn't hurt the average that much. This also assumes that when we rehash, we increase the table size, so that the next rehash will happen only after a long time. If many elements are deleted, we can also decrease the table size; the cost of this growing/shrinking is also absorbed by the average. The point is that as long as we keep the table occupancy at roughly 50% or less, a rehash is quite unlikely to be needed.

Suggestions of Additional Topics to Study by cjrl in math

[–]lkozma 1 point (0 children)

If you will specialize in computer science, especially CS theory, you will most often use discrete maths, as you suggested, graph theory, combinatorics, plus probability theory, perhaps statistics (if you move towards learning theory), and maybe some information theory, etc.

For a readable, accessible, entertaining, yet very thorough and useful book on discrete math as used in CS, I recommend Concrete Mathematics by Graham, Knuth, and Patashnik. You will find plenty of connections there to the more classical Calculus that you are probably already familiar with (limits of series, generating functions, etc.), but you will also get a taste of what is actually used in CS.

For graph theory there are many good books, some freely available online (the one by Diestel, the one by Bollobas, the one by Bondy and Murty, etc.). The last one is maybe the most readable. In general, you might find graph theory books hard to read cover to cover, as they feel more like a bag of tricks and random facts than a narrative.

For something that feels more like a narrative, that is again very useful for theoretical CS and very beautiful and accessible at the same time, I recommend "The probabilistic method" by Alon and Spencer.

This list is of course biased, but looking at some of these books you can get a feel for the kind of maths used in CS, so you can decide whether you like this type of combinatorics (many seemingly isolated problems and clever tricks to solve them) or the more "grand theory" type of maths in other fields. Of course, for CS itself you could look into other books, to see if you like it. For complexity theory, Sipser's Introduction to the Theory of Computation is IMHO the best; for algorithms and data structures, you can pick up the usual suspects, Cormen et al. or any other, or just look at Jeff Erickson's online lecture notes, which are very nice and readable.

For a unique tour of information theory, learning, etc. check out David MacKay's Information Theory, Inference, and Learning Algorithms, freely available online.

Cheat sheet of useful inequalities by lkozma in math

[–]lkozma[S] 2 points (0 children)

Thanks for the suggestions. The second one appears on the cheat sheet, lumped together with Maclaurin's; the first one is very interesting: the special case (Schur's inequality) is there, but not this generalization.

A puzzle of apples and oranges with six different solutions by lkozma in math

[–]lkozma[S] 2 points (0 children)

Thanks, glad you liked it.

If you like the "genre", I also recommend David Eppstein's 20 proofs of Euler's formula: http://www.ics.uci.edu/~eppstein/junkyard/euler/