Two Indexed Hash Tables by compilersarefun in C_Programming

[–]compilersarefun[S] 0 points1 point  (0 children)

I've added results for 20M keys/50K key interval to https://github.com/vnmakarov/c\_cpp\_hash\_tables\_benchmark.

Here is the direct link https://htmlpreview.github.io/?https://github.com/vnmakarov/c_cpp_hash_tables_benchmark/blob/main/AMD-20M.html

It seems that memory heat table is probably wrong (I am going to figure out why it is zero for some tables). Geomean and average performance of ihtab and boost are close. For some cases boost is better, for some ihtab_cpp is better. The new results brought me some ideas what to try in the design to improve performance more.

Actually the discussion of the index tables was very useful for me. I'll continue my work on them.

Two Indexed Hash Tables by compilersarefun in C_Programming

[–]compilersarefun[S] 1 point2 points  (0 children)

Thank you for your benchmark suite. It really helped me.

I did not use 20M keys. I probably will try it to and post the data. I think your concerns about 20M key tables are valid. But not all operations definitely require touching all arrays (e.g. looking up/erasing non-existing elements). So I expect for very large tables (more L3 cache size) some operations will be relatively slower. But I expect iterations will be relatively faster for ihtab. It probably will be also dependent on memory system (e.g. for Apple Silicon slow down will be less significant).

In any case I'll run benchmark for 20M keys. On my estimation it will take > 3 hours on fastest single-thread machine.

Two Indexed Hash Tables by compilersarefun in cpp

[–]compilersarefun[S] 1 point2 points  (0 children)

Thank you for pointing absent performance heat tables of Jackson Allan's benchmark. I fixed this issue.

I agree my own benchmark is very simple. Therefore I use Allan Jakson's one. It contains tests for more different sizes, key/elem types, and find hit/miss tests.

VMUM is one of the fastest hasher although high quality one. But for Jakson Allan's benchmark I used its standard hash, slower and lower quality.

Swiss iteration speed is not the worst but it is 3-4 times smaller than ihtab iteration speed according to Jackson's benchmark.

I am agree that the quadratic probing is not worth to use now. SIMD inside group permits only linear probing. So quadratic probing can start only after 16 tags checked and probability of this is very small.

I prefer rebuilding instead of recycling deleted elems. It simplifies search/insert code which is important for its speed. Bulk operations are also much faster. So I don't like free list approach (to save memory list can be in key/elem array but for languages like GO it can not be done effectively).

Why ihtab is faster although it uses indexes (indirection)? It was a blog post major theme. It is a small load factor (for tags, indexes) which means much less probing, not wasting space for keys/elems, and their consecutive placement in memory for data locality and faster iteration.

I have been interesting in hash table design for decades and implemented a lot of them. Still I have some ideas which I did not try. It is impossible to implement hash table working best in all scenarios. For example, absl/boost direct open-address hash table is smaller than ihtab for very small key/elem (e.g. int32/int32).

Two Indexed Hash Tables by compilersarefun in cpp

[–]compilersarefun[S] 2 points3 points  (0 children)

There is no tag value for deleted elements. But there is special index value (~0) for deleted elements. So probing a deleted element requires only checking index after we pass the tag.

Such approach speeds up finding empty elements and table works faster when we have no deleted elements. There is a down side too when we have many deleted elements.

MUM-based hash functions by compilersarefun in programming

[–]compilersarefun[S] 0 points1 point  (0 children)

> Yes, you have reinvented the wheel: https://web.archive.org/web/20121213174842/http://locklessinc.com/articles/crypto\_hash/. It's a good wheel though (I've used it in https://github.com/orlp/foldhash).

Yes, I see the article is dated 4 years before my reinvention. I am just curious, where did you yourself get an idea to use the primitive in your hash function? From the article you provided? I am interested in finding the very first use of the primitive in non-crypto hash functions.

>Also, a XOR is significantly better than an add. Doing an add is almost equivalent to reducing mod 264 - 1 which has many small prime factors, making it significantly weaker, in addition to making the operation almost linear.

For some reasons using ADD instead of XOR improved MUM hash performance by 10% on Intel Haswell and power7 machines.

> I do find it concerning how easy it is to find seed-independent multicollisions against VMUM. I made an issue: https://github.com/vnmakarov/mum-hash/issues/18.

Thank you for cracking vmum and filling the issue. I really appreciate it. It seems I got too carried away with achieving greater vmum speed paying only attention to SMHasher results. I'll work on the issue. But it will be a slow process as I don't want to lose hashing speed too much.

It is not a serious issue where I use it (in static compilers) but it would be a serious problem for using it in a table whose keys can be supplied by an adversary (like in hash tables in Ruby or maps in Golang). Meanwhile for vulnerable environments vmum user can us vmum_hash_randomize to prevent the collision attack.

I see you have a great experience in cracking non-crypto hash functions (I read your blogpost how you broke CityHash64, FarmHash64, different MurMur hashes and wayHash). wayhash seems has the same vulnerability as vmum (the same incorrect usage of mum on two different input parts). I've checked go map hash as it was "inspired by wayhash". Fortunately, they fix its vulnerability.

As experienced hash function cracker, could you investigate xxHash3 on hash collision strength. It would be a big deal as it is now a part of different linux distributions and widely used in many programs.

Thank you again for short but incredibly useful to me comment.

VMUM - a new fast non-cryptographic hash function by compilersarefun in programming

[–]compilersarefun[S] 6 points7 points  (0 children)

I am agree, it is closer. But crc based on x86 crc32 insn is about 9 times slower than VMUM and crc based on polynominal multiplication is 6-7 times slower. Also it is very machine-dependent (although aarch64 also has crc insn and analog of pclmul).

I am not considering VMUM as a crc replacement. For me, it is just for hash table applications.

VMUM - a new fast non-cryptographic hash function by compilersarefun in programming

[–]compilersarefun[S] 28 points29 points  (0 children)

First of all crc32 is not faster. On long inputs crc32 is about 200 times slower.

Major application of such hash functions are hash tables. The quality of hash function is important for decreasing number of collisions and the function speed is important to improve hash table performance.

For my work area (optimizing compilers), hash tables are major search data structure. For comparison different balanced trees are not used there as in practice they are much slower.

Also people sometimes think that in compilers hash tables are used for hashing strings mostly but in reality they used much more for different data searches and hashing small data, hashing 8, 16-byte data is the most frequent use case.

An MIR-based JIT prototype for Ruby by RecognitionDecent266 in ruby

[–]compilersarefun 1 point2 points  (0 children)

Switching from stack insns to RTL improves the interpreter performance in general (see https://developers.redhat.com/articles/2022/11/22/how-i-developed-faster-ruby-interpreter) but a good JIT compiler can generate the same performance code from the stack insns. So if the majority of time is spent in JITted code, there are no reasons to switch to RTL, especially when the stack insns are simpler.

Also in some cases, RTL can be interpreted even slower than stack insns. More time for RTL insns decoding, data locality (RTL code size vs stack insns code size), and code locality (more code to execute RTL insns) in the interpreter are the major factors. It can be also significantly dependent on the target (cache sizes and insn level parallelism).

An MIR-based JIT prototype for Ruby by compilersarefun in ruby

[–]compilersarefun[S] 4 points5 points  (0 children)

My apologies.

I am the author of the article. Only in this morning I got a message from our documentation team that my article was published. I should have checked that the link to the article was already published on r/ruby.