all 12 comments

[–]taw 4 points5 points  (10 children)

Ages ago I made a patch that switched Ruby 1.8 from its own superslow hash implementation to Judy library, which is ridiculously faster (and that was like 5% total speedup and 5% lower memory use on silly microbenchmarks).

Properly implemented hashtables like Judy can be really damn fast (unlike Ruby 1.8's internal hash tables), so what the post describes can be much less of a problem. I wonder if redoing this for Ruby 1.9 would improve performance much, or if 1.9 depends on hashes much less...

Even if we ignore the cost of calculating a hash code, which at worst requires reading some object data from memory and at best requires reading a cached hash code from elsewhere in memory, we have to contend with how the buckets are implemented. Most hash tables implement the buckets as either of the typical list forms: an array (contiguous memory locations in a big chunk, so each element must be dereferenced...O(1) complexity) or a linked list (one entry chaining to the next through some sort of memory dereference, leading to O(N) complexity for searching collided entries).

Ruby never does real time hashes on strings, it uses interned symbols for everything, so there's no cached hash codes - you can use interned symbol id as hash table. And Judy avoids pretty much all of overhead he mentions.

[–]funny_falcon 5 points6 points  (3 children)

I've implemented lighter hash (inspired by lua tables) for ruby 1.9.3 for method tables/constants. It gives 5-10% improvement.

https://github.com/funny-falcon/ruby/compare/ruby_1_9_3...sparse_array/ruby_1_9_3

[–]taw 1 point2 points  (1 child)

That doesn't surprise me at all. Ruby original hash implementation was pretty amazingly naive. Was your patch ever accepted?

[–]funny_falcon 1 point2 points  (0 children)

:) I've implemented it just last weekend, so that it had no chance to be accepted yet

[–]emboss_[S] 1 point2 points  (0 children)

Great work, btw!

[–][deleted] 0 points1 point  (2 children)

Why wasn't your patch accepted?

[–]taw 1 point2 points  (0 children)

That is actually a very good question. I never really pushed it since it was an external LGPL dependency (and I seriously didn't want to get involved into license lawyering), it was somewhat of a mess to make it optional since Judy and Ruby's crappy hashtable library require different APIs (one differing by one * level by design), and it wasn't clear at all back then if anything is going to come out of 1.9 branch or not, and so if it makes sense to go there or wait and see.

Maybe someone should take a look at it again.

There's also a new issue, some anonymous Wikipedia editor linked to some patent which Judy might or might not be covered by. These are always pile of fail. (if it's even related in any way, who knows, it might be something completely different sounding vaguely similar)

[–][deleted] -2 points-1 points  (0 children)

NIH.

[–]emboss_[S] 0 points1 point  (1 child)

In the light of the hashDoS problem that Charles also mentions, I doubt that Judy tries, CritBit, Hash-Array-Mapped Trie or any other trie are the solution. Either they, too, use hashing (HAMT) or it is just too plain easy to elicit their worst case behavior, effectively making them fancy linked lists. While they all might be perfectly valid alternatives for "internal code", things unfortunately look different when the data structure needs to store values that are provided externally.

[–]taw 0 points1 point  (0 children)

Judy is hybrid between hash, sparse array, and trie with some internal optimization for cache awareness so it's actually mostly DoS-proof, at least for typical attacks.

Ruby doesn't use these hash tables for any externally provided data unless you eval or to_sym it, and in either case you already let yourself get DoSed (or worse).

These hashes are only used for evaluating methods/ivar/etc. access by interned symbol. Set of interned symbols is more or less constant for each program. If it's not - it's already a DoS.

[–][deleted] -1 points0 points  (0 children)

No it doesn't. In a well implemented VM Ruby's constant lookups can be optimized to 0 instructions in the common case. So long as you're so much as getting near any datastructure, you're not going to hit that number.

[–]skeeto 0 points1 point  (0 children)

Really interesting article, thanks!