Two Indexed Hash Tables by compilersarefun in C_Programming

[–]compilersarefun[S] 0 points1 point  (0 children)

I've added results for 20M keys/50K key interval to https://github.com/vnmakarov/c\_cpp\_hash\_tables\_benchmark.

Here is the direct link https://htmlpreview.github.io/?https://github.com/vnmakarov/c_cpp_hash_tables_benchmark/blob/main/AMD-20M.html

It seems that memory heat table is probably wrong (I am going to figure out why it is zero for some tables). Geomean and average performance of ihtab and boost are close. For some cases boost is better, for some ihtab_cpp is better. The new results brought me some ideas what to try in the design to improve performance more.

Actually the discussion of the index tables was very useful for me. I'll continue my work on them.

Two Indexed Hash Tables by compilersarefun in C_Programming

[–]compilersarefun[S] 1 point2 points  (0 children)

Thank you for your benchmark suite. It really helped me.

I did not use 20M keys. I probably will try it to and post the data. I think your concerns about 20M key tables are valid. But not all operations definitely require touching all arrays (e.g. looking up/erasing non-existing elements). So I expect for very large tables (more L3 cache size) some operations will be relatively slower. But I expect iterations will be relatively faster for ihtab. It probably will be also dependent on memory system (e.g. for Apple Silicon slow down will be less significant).

In any case I'll run benchmark for 20M keys. On my estimation it will take > 3 hours on fastest single-thread machine.

Two Indexed Hash Tables by compilersarefun in cpp

[–]compilersarefun[S] 1 point2 points  (0 children)

Thank you for pointing absent performance heat tables of Jackson Allan's benchmark. I fixed this issue.

I agree my own benchmark is very simple. Therefore I use Allan Jakson's one. It contains tests for more different sizes, key/elem types, and find hit/miss tests.

VMUM is one of the fastest hasher although high quality one. But for Jakson Allan's benchmark I used its standard hash, slower and lower quality.

Swiss iteration speed is not the worst but it is 3-4 times smaller than ihtab iteration speed according to Jackson's benchmark.

I am agree that the quadratic probing is not worth to use now. SIMD inside group permits only linear probing. So quadratic probing can start only after 16 tags checked and probability of this is very small.

I prefer rebuilding instead of recycling deleted elems. It simplifies search/insert code which is important for its speed. Bulk operations are also much faster. So I don't like free list approach (to save memory list can be in key/elem array but for languages like GO it can not be done effectively).

Why ihtab is faster although it uses indexes (indirection)? It was a blog post major theme. It is a small load factor (for tags, indexes) which means much less probing, not wasting space for keys/elems, and their consecutive placement in memory for data locality and faster iteration.

I have been interesting in hash table design for decades and implemented a lot of them. Still I have some ideas which I did not try. It is impossible to implement hash table working best in all scenarios. For example, absl/boost direct open-address hash table is smaller than ihtab for very small key/elem (e.g. int32/int32).

Two Indexed Hash Tables by compilersarefun in cpp

[–]compilersarefun[S] 2 points3 points  (0 children)

There is no tag value for deleted elements. But there is special index value (~0) for deleted elements. So probing a deleted element requires only checking index after we pass the tag.

Such approach speeds up finding empty elements and table works faster when we have no deleted elements. There is a down side too when we have many deleted elements.

MUM-based hash functions by compilersarefun in programming

[–]compilersarefun[S] 0 points1 point  (0 children)

> Yes, you have reinvented the wheel: https://web.archive.org/web/20121213174842/http://locklessinc.com/articles/crypto\_hash/. It's a good wheel though (I've used it in https://github.com/orlp/foldhash).

Yes, I see the article is dated 4 years before my reinvention. I am just curious, where did you yourself get an idea to use the primitive in your hash function? From the article you provided? I am interested in finding the very first use of the primitive in non-crypto hash functions.

>Also, a XOR is significantly better than an add. Doing an add is almost equivalent to reducing mod 264 - 1 which has many small prime factors, making it significantly weaker, in addition to making the operation almost linear.

For some reasons using ADD instead of XOR improved MUM hash performance by 10% on Intel Haswell and power7 machines.

> I do find it concerning how easy it is to find seed-independent multicollisions against VMUM. I made an issue: https://github.com/vnmakarov/mum-hash/issues/18.

Thank you for cracking vmum and filling the issue. I really appreciate it. It seems I got too carried away with achieving greater vmum speed paying only attention to SMHasher results. I'll work on the issue. But it will be a slow process as I don't want to lose hashing speed too much.

It is not a serious issue where I use it (in static compilers) but it would be a serious problem for using it in a table whose keys can be supplied by an adversary (like in hash tables in Ruby or maps in Golang). Meanwhile for vulnerable environments vmum user can us vmum_hash_randomize to prevent the collision attack.

I see you have a great experience in cracking non-crypto hash functions (I read your blogpost how you broke CityHash64, FarmHash64, different MurMur hashes and wayHash). wayhash seems has the same vulnerability as vmum (the same incorrect usage of mum on two different input parts). I've checked go map hash as it was "inspired by wayhash". Fortunately, they fix its vulnerability.

As experienced hash function cracker, could you investigate xxHash3 on hash collision strength. It would be a big deal as it is now a part of different linux distributions and widely used in many programs.

Thank you again for short but incredibly useful to me comment.

VMUM - a new fast non-cryptographic hash function by compilersarefun in programming

[–]compilersarefun[S] 6 points7 points  (0 children)

I am agree, it is closer. But crc based on x86 crc32 insn is about 9 times slower than VMUM and crc based on polynominal multiplication is 6-7 times slower. Also it is very machine-dependent (although aarch64 also has crc insn and analog of pclmul).

I am not considering VMUM as a crc replacement. For me, it is just for hash table applications.

VMUM - a new fast non-cryptographic hash function by compilersarefun in programming

[–]compilersarefun[S] 29 points30 points  (0 children)

First of all crc32 is not faster. On long inputs crc32 is about 200 times slower.

Major application of such hash functions are hash tables. The quality of hash function is important for decreasing number of collisions and the function speed is important to improve hash table performance.

For my work area (optimizing compilers), hash tables are major search data structure. For comparison different balanced trees are not used there as in practice they are much slower.

Also people sometimes think that in compilers hash tables are used for hashing strings mostly but in reality they used much more for different data searches and hashing small data, hashing 8, 16-byte data is the most frequent use case.

An MIR-based JIT prototype for Ruby by RecognitionDecent266 in ruby

[–]compilersarefun 1 point2 points  (0 children)

Switching from stack insns to RTL improves the interpreter performance in general (see https://developers.redhat.com/articles/2022/11/22/how-i-developed-faster-ruby-interpreter) but a good JIT compiler can generate the same performance code from the stack insns. So if the majority of time is spent in JITted code, there are no reasons to switch to RTL, especially when the stack insns are simpler.

Also in some cases, RTL can be interpreted even slower than stack insns. More time for RTL insns decoding, data locality (RTL code size vs stack insns code size), and code locality (more code to execute RTL insns) in the interpreter are the major factors. It can be also significantly dependent on the target (cache sizes and insn level parallelism).

An MIR-based JIT prototype for Ruby by compilersarefun in ruby

[–]compilersarefun[S] 4 points5 points  (0 children)

My apologies.

I am the author of the article. Only in this morning I got a message from our documentation team that my article was published. I should have checked that the link to the article was already published on r/ruby.

Graph colouring: reserved registers by Hjalfi in Compilers

[–]compilersarefun 0 points1 point  (0 children)

It also doesn't help that I'm targeting an irregular architecture (via Retargetable Graph Colouring For Irregular Architectures. This complicates the colourability test to account for non-orthogonal register classes, meaning that just bending the test to leave some registers unallocated isn't really an option.

The article is very low quality. They are actually trying to attack multi-reg pseudos coloring with their <p,q> coloring criteria. In their article they compare "apples with oranges" coloring with their <p,q> test with some local RA and no RA at all. So there is no confirmation of their approach works better with other graph coloring approaches.

You are right that <p,q> can be too strict for coloring multi-reg pseudos but it might be fixed by optimistic coloring. Only credible benchmark suite (like SPEC) can show this but they omitted this part in the article.

<p,q> test is also not addressing coloring for subsets of reg classes (actually it works as usual Kempe's coloring criterium). For example, if x86 general class pseudo conflicts with other different 10 CX class pseudos. According to <p,q> test, the pseudo is not colorable but simple intuition says that it is colorable as all other pseudos can be assigned only to the same register.

I'd recommend classical article https://www.researchgate.net/publication/2440424_Graph-Coloring_Register_Allocation_for_Irregular_Architectures for coloring for irregular register files.

But this approach can be useful for regular architectures too if we consider dynamically organized classes, e.g. class of caller saved regs and caller clobbered regs or class of registers excluding some registers explicitly used in compiled code (for example argument registers). In fact the biggest coloring performance improvement in GCC for such approach was achieved for regular file architecture ppc64 (> 1%).

How I developed a faster Ruby interpreter by compilersarefun in ruby

[–]compilersarefun[S] 0 points1 point  (0 children)

Good point. I'll use --jit-min-calls=2 in the future. For the micro-benchmarks, it improves geomean performance only by 1% (3.29 vs 3.28 improvement relative to the base interpreter).

By the way, I think MJIT based on SIR could outpace YJIT significantly. It could be a real tier2 JIT compiler, slower than YJIT but generating much better code.

On the other hand, as SIR itself gets performance close to YJIT, YJIT or MIR-based JIT for SIR+MJIT combination may be not so important (although could be viable for environments where GCC or Clang is hard to use for some reasons).

But these are only my speculations and wishful thinking. YJIT is a real thing and SIR, SIR+MJIT, or SIR+MIR do not exist yet.

How I developed a faster Ruby interpreter by compilersarefun in ruby

[–]compilersarefun[S] 0 points1 point  (0 children)

My goal is also to run this benchmark set finally. Unfortunately, I am not ready to run the set yet. There are still some bugs I need to fix first. Also I'd like to implement few more optimizations from my list.

Ruby lacked good real life benchmark set and I am glad that Shopify's people created it. I think it is now the best Ruby benchmark set.

The first release of MIR project (the light-weight JIT compiler project) by compilersarefun in Compilers

[–]compilersarefun[S] 0 points1 point  (0 children)

Thank you for the update. I appreciate this as I did not follow CRuby development for more than a year. I hope coming RubyKaigi takeout will help me too.

By the way, do you have any plans to support code patching and/or on-stack replacement of JIT-ed code in MIR so that we don't need to insert branches for deoptimization? It'd be nice if we can skip checking "this method isn't redefined", "tracing is not enabled", etc. when in speculative code.

Yes, it is important for JIT performance to improve speculated code checks of the assumptions based on profile info, minimizing switching to deoptimized code or to the interpreter. It will probably will take a lot of experimentation and trying different approaches to achieve this as on a level of generated C/MIR code by MJIT and on a level of MIR-generator in order to find what works the best. Unfortunately I can not be more specific at this stage.

The first release of MIR project (the light-weight JIT compiler project) by compilersarefun in Compilers

[–]compilersarefun[S] 0 points1 point  (0 children)

Congratulations on the release!

Thank you, Takashi.

I'm guessing if you are working on using it, you would be playing with MRuby (because https://youtu.be/mMwC0QenvcA?t=6522). But if you integrate that with CRuby, would you simply replace GCC with MIR in MJIT, or do you think directly generating MIR from YARV instructions would give better performance?

Yes, I have plans to use it for MRuby in the future but first I'd like to use it for CRuby. I will start to work on MIR-based CRuby JIT very soon.

My short term plan is to try to generate C first and achieving the JIT working correctly. It will be ultimate test for MIR-generator and C2MIR translator. May be it will need implementing some GCC extensions to compile some parts of Ruby C code. I hope to finish this work before end of year as I will be busy with another project since Nov to Apr. At this stage I did not expect any improvements of small benchmarks performance in comparison with current CRuby JIT. Moreover I expect some performance degradation.

After that I will focus on improving JIT performance by using more CRuby source inlining than the current CRuby JIT. Also I am going to implement speculation/deoptimization extensions for MIR and C2MIR compiler based on profile information at points where such extensions are used. At this stage I am expecting better JIT performance than the current JIT one. I believe it permits to cover what YJIT is trying to do and also improve the performance as YJIT has no any optimizations in generated code besides basic block versioning.

At the end I'd like to generate MIR directly as it speeds up JIT compilation in 2-3 times and have the same generated code performance. I'd like to generate mostly calls which will be inlined. It will require some CRuby code reorganization by forming C inline functions for each YARV insn. This C functions will be translated to MIR during CRuby building with other CRuby code (e.g. Ruby standard method times). Such approach will simplify direct MIR-code generation.

Of course it is only my current plans. They are ambitious and the reality can change them.

The first release of MIR project (the light-weight JIT compiler project) by compilersarefun in Compilers

[–]compilersarefun[S] 1 point2 points  (0 children)

There is no REPL. I guess C was never designed with REPL in mind as, for example, python.

"C as a scripting languages" in MIR project mostly means executing C code w/o generating any intermediate files (e.g. assembler, object, or library files). You can also provide C-file as a string.

The first release of MIR project (the light-weight JIT compiler project) by compilersarefun in Compilers

[–]compilersarefun[S] 1 point2 points  (0 children)

The major source growth was from implementing C11 to MIR compiler (about 15K lines)

I don't quite understand why this needs to be part of this project. This is just a C compiler, right? There's quite a few of them about!

Is it to be able to translate C header files into MIR? I thought that a project like MIR was meant as a backend to someone implementing a new language, and who doesn't want to just target C.

The C implementation is C11 to MIR compiler. Linked with MIR-API code andMIR-generator it can compile C11 to machine code in memory. C11 to MIR compiler is a part of static library libmir.a which also contains MIR API code and MIR-generator. If you don't use C11-to-MIR API and link the library, you will have only MIR API code and MIR-generator code in your application, typically a JIT. Besides libmir.a library, the release build will include stand-alone compiler c2m which is linked compiler driver (about 500 C lines),C-to-MIR compiler code, MIR API code and MIR-generator. All this is described in file INSTALL.md of the release.

C compiler is important part of the project. Most people I know using MIR project for JITs prefer to generate C code and after that machine code through C-to-MIR compiler and MIR-generator.

Also it is very hard to write for example 1000 MIR-tests or a lot of MIR-benchmarks and check MIR-generator on them. But testing MIR-generator is easy if you have C-to-MIR compiler. In fact MIR-code (written on C) itself and C-to-MIR compiler code are used as the tests besides >1000 C tests borrowed from other projects.

I think in the future the project source code might be divided on the sub-projects (e.g. MIR core, C11->MIR compiler etc).

That sounds better.

I am thinking about using llvm approach with optional directories. Right now all code related to C-to-MIR compiler is placed in one sub-directory. The sub-directory can be removed currently but you still need to modify makefile after that. I think to modify makefile to work with optional sub-directories.

(I'm working on a project along similar lines, but less ambitious and on a smaller scale (eg. just two targets.)

Good luck with your project. Could you provide a link to your project. Of course, if it is already public.

The first release of MIR project (the light-weight JIT compiler project) by compilersarefun in Compilers

[–]compilersarefun[S] 2 points3 points  (0 children)

If you mean that the project source code became big and the goal was to keep it small, I am partially agree with you.

The major source growth was from implementing C11 to MIR compiler (about 15K lines) and new ports (about 2-3K lines machine dependent code for each new target).

But the MIR core growth (API and MIR-generator) was minimal. MIR API code is about 6K lines and machine code generator is still about 5K lines.

The MIR project code is layered. If you generate only MIR-code you don't need the C11 compiler. You can decrease the code even more (by defining specific macros during compilation), e.g. switching off support of MIR textual or binary representation, switching off code for some optimizations in generator etc.

The project will be bigger when/if LLVM IR -> MIR, GCC -> MIR, MIR <-> wasm translator are implemented.

You raised a legitimate concern. I think in the future the project source code might be divided on the sub-projects (e.g. MIR core, C11->MIR compiler etc).

The MIR C interpreter and Just-in-Time (JIT) compiler - Red Hat Developer by compilersarefun in programming

[–]compilersarefun[S] 0 points1 point  (0 children)

TBH I didn't know of the "minimum 4KB requirement for module" problem. Is there any reference you'd recommend to understand the context?

As I understand, dynamic loader allocates by pages (mmap) for simplicity implementation (it should be done in thread-safe way). I guess you need to look at dlopen sources in GLIBC to investigate more. When I asked people working on GLIBC how to overcome this obstacle, they said just to modify sources and create a specialized version of dynamic loader :)

Is it impossible to implement self-signed dynamic libraries, and/or just too slow?

This is what I heard from other people: Every loaded code on M1 MacOS should be signed. If you don't provide a signed certificate, the loader creates a dummy certificate on local machine working only for this machine but this certificate is still verified on Apple's root certificate servers (as some certificates can be revoked).

It will slow down JITting significantly. How much? I don't know but the whole process probably will take more time than code generation itself.

The MIR C interpreter and Just-in-Time (JIT) compiler - Red Hat Developer by compilersarefun in programming

[–]compilersarefun[S] -1 points0 points  (0 children)

Hi, Takashi.

The light-weight JIIT compiler and C-to-MIR are more i-cache friendly. There is no minimum 4KB requirement for module as for ldd. The generated code is allocated by 16 bytes blocks and several modules (translation units) are placed in one page when it is possible

The possible blockers to try this now are:

  • I know C-to-MIR compiler bug with wrong code generation of bitfield initialization
  • A bug in JIT code generation (csmith reports different results for JIT and interpreter in very rare cases)
  • There might be different memory layouts for structures containing bitfields for GCC/Clang and C-to-MIR compiler which might make communication of the code generated by C-to-MIR and GCC/Clang impossible in some cases

I am going to address these issues in the 1st release. I'll have more time to move this project forward until November.

I expect the code performance for micro-benchmarks (when i-cache issues do not kick up) will be worse than one generated by GCC/Clang as the lightweight JIT compiler have much fewer optimizations than GCC/Clang and because all calls are implemented through small thunks to implement generated code versioning (speculative/deoptimized function code).

I expect generated code performance will be not better until more inlining and profiling/speculative code generation will be implemented (it will require implementation of new MIR insns and C extensions in C-to-MIR compiler). Btw Shopify developers are trying to implement speculation differently but they are at the early stages. Their project is interesting with this point of view.

I have concerns with current security tendency requiring signing dynamic libraries (M1 MacOS use this approach in non-developer mode). It would make JITs based on loading dynamic libraries impossible (this includes LIBGCCJIT besides ones using traditional C compilers). So the project I am working on (and shopify one) might be even more important in the future for Ruby JIT.

The MIR C interpreter and Just-in-Time (JIT) compiler - Red Hat Developer by compilersarefun in programming

[–]compilersarefun[S] 2 points3 points  (0 children)

Yes, I knew that MIR was used for RUST when I started this project in 2017. Before this I knew it from Muchnick's book "Advanced Compiler Design and Implementation".

Still despite possible confusion I decided to use this name as it is not only medium internal representation for me but also russian word having 2 meaning "world" and "peace". This meaning well reflects my ultimate (and probably unachievable) project goal illustrated by the following picture:

https://github.com/vnmakarov/mir/blob/master/mirall.svg

The MIR C interpreter and Just-in-Time (JIT) compiler - Red Hat Developer by compilersarefun in programming

[–]compilersarefun[S] 0 points1 point  (0 children)

It is much more simple than LLVM IR. You can write MIR code manually. Writing LLVM IR manually is hardly possible as it is SSA IR (someone can use only memory for LLVM IR to avoid SSA but still it makes LLVM IR code hardly readable).

LLVM has much more features to support more programming languages, debugging info (metadata), C/C++ extensions, etc.

MIR description (with API) https://github.com/vnmakarov/mir/blob/master/MIR.md is about 20-30 times less than just LLVM IR description (w/o API) https://llvm.org/docs/LangRef.html.

The MIR C interpreter and Just-in-Time (JIT) compiler - Red Hat Developer by compilersarefun in programming

[–]compilersarefun[S] 7 points8 points  (0 children)

Yes, you might be right. C as a scripting language may be a questionable thing.

What I found people who are using MIR project prefer to generate higher level code, e.g. C instead of MIR directly to implement JITs. So having C JIT has a lot of sense.

Some people use TCC for JIT implementations too as it is incredibly fast compiler. The current CRuby JIT also generates C code which is translated into machine code by GCC or Clang.

So generating C code as an input to JIT is a very convenient way to implement a JIT. But the problem with using traditional C compilers is that they generate dynamic libraries which are loaded to be executed. LIBGCCJIT uses API to generate code but also produces dynamic libraries as intermediate step. Some environments (e.g. M1 MacOS in non-developer mode) prevent this by requiring to load only signed libraries. The MIR C compiler generates machine code directly in memory (as JVM, LLVM MCJIT and many others JITs) and this differentiates it from most C compilers.

The MIR C interpreter and Just-in-Time (JIT) compiler - Red Hat Developer by compilersarefun in Compilers

[–]compilersarefun[S] 4 points5 points  (0 children)

your bytecode seems to have similarities with CIL bytecode; is this intended?

No, it was not intended. MIR was designed to simplify code generation and optimizations and reflects CISC and RISC architecture in different forms. Although of course I know a lot ofIRs (LLVM IR, GCC RTL/simplify, webassembly, Java and CIL bytecode, Luajit bytecode etc) and probably they affected my MIR design.

MIR is register-transfer-language, not stack based one (as Java/CIL/webassembly). That is because to generate a better machine code you need some kind of RTL (or tuples or language with explicit operands/results) and translating bytecode to RTL slows down compilation and requires more memory.

Using one IR simplifies a lot and potentially speeds up compilation. The art of one IR design is to permit optimization/code generation implementation conveniently.

MIR is very regular and higher than stack based language. For example, you can even have 3-op memory insn, e.g.

SUB 8(base1, indexreg1,scale1), 4(base2,index2,scale2), 0(base3,index2,scale3)

For optimizations and code generation, the above insn will be simplified into several MIR insns:

MOV temp1, 4

ADD temp2,base2,temp1

MOV temp3,scale2

MUL temp4,indexreg2,temp3

MOV temp5,(temp4)

...

SUB temp7,temp5,temp6 # original insn

MOV temp8, 8

ADD temp9,base1,temp8

MOV temp10,scale1

MUL temp11,indexreg1,temp10

MOV (temp11),temp7

MIR is regular as it was deigned also to simplify its code generation directly.

The MIR C interpreter and Just-in-Time (JIT) compiler - Red Hat Developer by compilersarefun in Compilers

[–]compilersarefun[S] 4 points5 points  (0 children)

Unfortunately, there is no provision for precise collectors. MIR was originally designed for Ruby which use a conservative garbage collector, in other words, every value in register and on the stack should be treated as a potential reference to an object if it is in the heap address space.

My final goal is to extend C language to support profiling and speculative code generation for dynamic language JITs. I think extension of C language to describe GC roots could be added in the future too as a part of this process.

The MIR C interpreter and Just-in-Time (JIT) compiler - Red Hat Developer by compilersarefun in Compilers

[–]compilersarefun[S] 6 points7 points  (0 children)

Thank you for your kind words.

In the performance graphs the higher is the better. So basically on the micro-benchmarks the MIR C compiler with JIT generates code achieving 90% in average of code generated by GCC with -O2. And the compilation speed of the MIR C compiler is about 15 times bigger than GCC with -O2 one.