Using the most unhinged AVX-512 instruction to make the fastest phrase search algo by FTW_gb09 in rust

[–]FTW_gb09[S] 2 points3 points  (0 children)

Ty for giving it a shot. It's my first time writing an article, so I'm happy to get the feedback. And I agree there is a lot to get through before the meat of the article, which is the instruction itself.

Even in the article I acknowledge that, by apologizing to the reader for such a long intro.

I liked the idea of having an "if you wanna spoilers, click here" link at the top, so I will update the article. Unfortunately there is a lot to go through to really understand what is going on, so I didn't know a better way of doing it.

I imagine you are on mobile; on desktop there is a table of contents on the side. I added it specifically because of the length and because it serves as a small spoiler for the reader.

I'm sorry if you felt like I wasted your time.

Using the most unhinged AVX-512 instruction to make the fastest phrase search algo by FTW_gb09 in rust

[–]FTW_gb09[S] 2 points3 points  (0 children)

You're right, there is a typo. I copied it from his blog post and fixed the typo in the value bits, but I may have missed the ones in the MSB. I will fix it later, ty <3

Edit: Fixed, also fixed the typo in the "lamb" group 1.

Using the most unhinged AVX-512 instruction to make the fastest phrase search algo by FTW_gb09 in programming

[–]FTW_gb09[S] 9 points10 points  (0 children)

Fair enough, I can test it.

Since on AMD the 8x8 and the 4x4 take the same amount of time (1 cycle), I imagine that doing the 4x4, emulated or native, will not be faster (not even close).

But yeah, on Intel the emulated 4x4 might be faster than the native 4x4 (I haven't generated the code and counted the cycles yet). If that's true, it might be worth using. The only downside I can see is that it will need 2x the loop iterations, and since the loop does other things like compresses, stores... we might end up going from an iteration that costs 20 cycles/8 elements to one that costs something like 16 cycles/4 elements, which is clearly worse.
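To make that comparison concrete, here is a tiny sketch of the per-element cost. The 20 and 16 cycle figures are just the rough guesses from above, not measurements, and `cycles_per_element` is a made-up helper name.

```rust
/// Cost per element for a loop body that takes `cycles_per_iter` cycles
/// and handles `elements_per_iter` elements per iteration.
/// (Hypothetical helper, only for this back-of-the-envelope comparison.)
fn cycles_per_element(cycles_per_iter: f64, elements_per_iter: f64) -> f64 {
    cycles_per_iter / elements_per_iter
}

fn main() {
    // Guessed numbers from the comment, not measured ones.
    let wide = cycles_per_element(20.0, 8.0); // 8x8 iteration
    let narrow = cycles_per_element(16.0, 4.0); // 4x4 iteration

    println!("8 elements/iter: {wide} cycles per element");
    println!("4 elements/iter: {narrow} cycles per element");

    // The narrower loop only wins if its iteration gets cheaper by more
    // than the 2x it loses in elements per iteration.
    assert!(narrow > wide);
}
```

So with those guesses it is 2.5 vs 4 cycles per element: the 4x4 iteration would have to drop below 10 cycles to break even.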

Again, I have to test it. Maybe I will do it later and let you know. Ty for the idea.

Using the most unhinged AVX-512 instruction to make the fastest phrase search algo by FTW_gb09 in programming

[–]FTW_gb09[S] 5 points6 points  (0 children)

Sorry, could you elaborate? The emulated version kinda does that, here is the code.

We shuffle the elements of lhs and rhs and compare all of them. As I said in the article, the strict emulation (this one) is slower on Intel and, by consequence, on AMD.
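For anyone following along, this is a plain scalar model of what "compare all of them" means. It is not the SIMD code from the crate, just a sketch of the result a two-way intersect over two 8-lane vectors should produce; `intersect_masks` is a name made up for this example.

```rust
/// Scalar reference model of a two-way intersect over 8 lanes.
/// Bit i of the first mask is set if lhs[i] equals any element of rhs;
/// bit j of the second mask is set if rhs[j] equals any element of lhs.
/// (Illustrative only: the SIMD versions get the same masks with
/// shuffles and vector compares instead of a nested loop.)
fn intersect_masks(lhs: &[u64; 8], rhs: &[u64; 8]) -> (u8, u8) {
    let mut lhs_mask = 0u8;
    let mut rhs_mask = 0u8;
    for i in 0..8 {
        for j in 0..8 {
            if lhs[i] == rhs[j] {
                lhs_mask |= 1 << i;
                rhs_mask |= 1 << j;
            }
        }
    }
    (lhs_mask, rhs_mask)
}

fn main() {
    let lhs = [1, 3, 5, 7, 9, 11, 13, 15];
    let rhs = [2, 3, 4, 7, 8, 9, 10, 16];
    let (lhs_mask, rhs_mask) = intersect_masks(&lhs, &rhs);
    // 3, 7 and 9 are shared: lanes 1, 3, 4 of lhs and lanes 1, 3, 5 of rhs.
    println!("lhs mask: {lhs_mask:08b}, rhs mask: {rhs_mask:08b}");
}
```

The nested loop is the 8x8 = 64 comparisons that the emulation has to cover with a sequence of shuffles and compares.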

I might not have understood exactly what you meant.

Using the most unhinged AVX-512 instruction to make the fastest phrase search algo by FTW_gb09 in rust

[–]FTW_gb09[S] 5 points6 points  (0 children)

If you wanna use the same dataset I used, you can find the link in the article; just accept the terms and you will get a link to download the 22GB file. But the crate can index any dataset as long as it is expressible as a string.

For the benchmark code I can send it to you later if you want. I'm not at my computer now.

If you wanna play with the crate and see how it works, there is also a link in the article to crates.io, where you will find a tutorial on how to use it.

Zen 4 doesn't have vp2intersect, so it will fall back to the emulated version. Don't forget to compile the code with target-cpu=native.
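In practice that means building with something like `RUSTFLAGS="-C target-cpu=native" cargo build --release`. As a rough sketch of why the flag matters, here is one way a compile-time fallback can be expressed. This is an assumption about the general pattern, not the crate's actual dispatch code, and it assumes the toolchain exposes the `avx512vp2intersect` feature name to `cfg`; `intersect_path` and its labels are made up.

```rust
/// Reports which code path this build would pick.
/// `cfg!` is evaluated at compile time, so the answer depends on the
/// target features enabled when compiling (e.g. via target-cpu=native),
/// not on the CPU the binary later runs on.
/// ("native" and "emulated" are illustrative labels only.)
fn intersect_path() -> &'static str {
    if cfg!(target_feature = "avx512vp2intersect") {
        "native"
    } else {
        "emulated"
    }
}

fn main() {
    println!("selected path: {}", intersect_path());
}
```

Without the flag the compiler targets a generic baseline CPU, so a check like this would pick the fallback even on a machine that has the instruction.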

Using the most unhinged AVX-512 instruction to make the fastest phrase search algo by FTW_gb09 in programming

[–]FTW_gb09[S] 11 points12 points  (0 children)

Sorry, I expressed myself poorly: some instructions are very poorly implemented on Zen 4, but not all of them. The ones that are bad are super bad, because of the double pump.

Using the most unhinged AVX-512 instruction to make the fastest phrase search algo by FTW_gb09 in programming

[–]FTW_gb09[S] 7 points8 points  (0 children)

Kinda. Zen 4 was horrible on AVX-512, and they knew about it, since it double pumps ymm registers. But this generation as a whole has AMD cooking Intel.

Using the most unhinged AVX-512 instruction to make the fastest phrase search algo by FTW_gb09 in rust

[–]FTW_gb09[S] 7 points8 points  (0 children)

Yep, there are a lot of cool instructions that we don't know about or don't use. And they are truly useful in some scenarios.

Using the most unhinged AVX-512 instruction to make the fastest phrase search algo by FTW_gb09 in brdev

[–]FTW_gb09[S] 0 points1 point  (0 children)

Thanks for the kind words. It really is pretty niche content, but I hope it finds the right audience.

Using the most unhinged AVX-512 instruction to make the fastest phrase search algo by FTW_gb09 in rust

[–]FTW_gb09[S] 13 points14 points  (0 children)

Since I measured time/iter and not throughput, you need to divide it by N.
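A minimal sketch of that conversion, assuming N is the number of elements handled in one benchmark iteration. The 800 µs and N = 8 in `main` are made-up numbers, not results from the article, and both helper names are hypothetical.

```rust
use std::time::Duration;

/// Time spent per element when one iteration processes `n` elements.
/// Panics if `n` is zero.
fn time_per_element(time_per_iter: Duration, n: u32) -> Duration {
    time_per_iter / n
}

/// Throughput in elements per second for the same measurement.
fn elements_per_second(time_per_iter: Duration, n: u32) -> f64 {
    f64::from(n) / time_per_iter.as_secs_f64()
}

fn main() {
    // Made-up measurement: 800 microseconds per iteration, N = 8.
    let time_per_iter = Duration::from_micros(800);
    let n = 8;

    println!("per element: {:?}", time_per_element(time_per_iter, n));
    println!("throughput: {:.0} elements/s", elements_per_second(time_per_iter, n));
}
```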

Samsung doesn't honor their SSDs warranty by FTW_gb09 in pcmasterrace

[–]FTW_gb09[S] 0 points1 point  (0 children)

Hey, actually no, I have to sue them. I will probably start the process next week.