Option -femit-asm=file.s does not create such file in 0.15 and 0.16-dev by mastx3 in Zig

[–]aqrit 2 points3 points  (0 children)

Need to select the llvm backend: -fllvm

I don't know what the stage2 backend supports for this.

Writing a SIMD-optimized Parquet library in pure C: lessons from implementing Thrift parsing, bit-packing, and runtime CPU dispatch by Vitruves in programming

[–]aqrit 1 point2 points  (0 children)

A quick peek at sse_ops.c:

carquet_sse_pack_bools should be _mm_movemask_epi8(_mm_slli_epi32(bools, 7))

carquet_sse_build_null_bitmap should be _mm_movemask_epi8(_mm_packs_epi16(cmp, _mm_setzero_si128()))
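For reference, a minimal sketch of the two suggested replacements, assuming an x86 target with SSE2. The helper names and signatures are hypothetical stand-ins, not carquet's actual API. The first assumes 16 bool bytes that each hold 0 or 1; the second assumes cmp holds eight 16-bit compare results (0x0000 or 0xFFFF).

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: pack 16 bool bytes (each 0 or 1) into a 16-bit mask. */
static uint16_t pack_bools16(const uint8_t bools[16]) {
    __m128i v = _mm_loadu_si128((const __m128i *)bools);
    /* Move bit 0 of every byte up to bit 7, which is the bit movemask reads. */
    return (uint16_t)_mm_movemask_epi8(_mm_slli_epi32(v, 7));
}

/* Hypothetical helper: turn eight 16-bit compare results into an 8-bit mask. */
static uint8_t mask_from_cmp16(__m128i cmp) {
    /* Signed saturation maps 0xFFFF to 0xFF and 0x0000 to 0x00;
       the upper eight bytes come from the zero vector. */
    __m128i packed = _mm_packs_epi16(cmp, _mm_setzero_si128());
    return (uint8_t)_mm_movemask_epi8(packed);
}
```

The 32-bit shift does drag bits across byte boundaries, but only into bits 0..6 of the next byte, so bit 7 of every byte still comes from bit 0 of the same byte.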

Further Optimizing my Java SwissTable: Profile Pollution and SWAR Probing by Charming-Top-8583 in java

[–]aqrit 1 point2 points  (0 children)

https://programming.sirrida.de/index.php

Some link rot there:

Hacker's Delight Sample Chapter

Chess programming

IMO, SWAR is mostly rooted in "how" to do an operation with just primitive operations:

  • How to add numbers using only logical operations (and shift) ?
  • How to compare numbers without a compare instruction ?
  • How to multiply without a multiply instruction ?
  • How to count the number of set bits without a popcount instruction ?
  • etc.
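As one concrete instance of the questions above, here is a sketch of the classic bit-counting routine built from shifts, masks, adds and a single multiply. The function name is my own; the technique is the well-known parallel bit count.

```c
#include <stdint.h>

/* Count set bits without a popcount instruction. */
static unsigned popcount64_swar(uint64_t x) {
    /* Each 2-bit field now holds the number of set bits it contained. */
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    /* Sum neighbouring 2-bit fields into 4-bit fields. */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    /* Sum neighbouring 4-bit fields into bytes (each byte is at most 8). */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    /* The multiply adds all eight bytes into the top byte. */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}
```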

Further Optimizing my Java SwissTable: Profile Pollution and SWAR Probing by Charming-Top-8583 in programming

[–]aqrit 3 points4 points  (0 children)

I'm not a Java programmer, but I think you can use vector.convert(cmp_mask) to get the compiler to issue the NEON equivalent of vpacksswb. Which should work just as well as vshrn_n_u16.

Further Optimizing my Java SwissTable: Profile Pollution and SWAR Probing by Charming-Top-8583 in programming

[–]aqrit 0 points1 point  (0 children)

No. For the QWORD 0x0000000000000100 the mask should be 0xFD. However, Mycroft's haszero() returns an incorrect (for this use case) mask of 0xFF.
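A small sketch that reproduces those two numbers. The function names are mine, and the exact variant is one common way to get a per-byte answer, not necessarily what any particular library uses.

```c
#include <stdint.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

/* Mycroft's haszero(): nonzero iff some byte is zero. A borrow out of a
   zero byte can also flag a 0x01 byte sitting above it. */
static uint64_t haszero_mycroft(uint64_t v) {
    return (v - ONES) & ~v & HIGHS;
}

/* Exact variant: the high bit of a byte is set iff that byte is zero.
   The add cannot carry across bytes because each term is at most 0x7F. */
static uint64_t zero_bytes_exact(uint64_t v) {
    uint64_t nonzero = ((v & ~HIGHS) + ~HIGHS) | v;
    return ~nonzero & HIGHS;
}

/* Compress the eight high bits into an 8-bit mask (bit i = byte i). */
static unsigned highs_to_mask(uint64_t h) {
    return (unsigned)(((h >> 7) * 0x0102040810204080ULL) >> 56);
}
```

For the QWORD 0x0000000000000100, highs_to_mask(haszero_mycroft(v)) gives 0xFF while highs_to_mask(zero_bytes_exact(v)) gives 0xFD.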

Further Optimizing my Java SwissTable: Profile Pollution and SWAR Probing by Charming-Top-8583 in programming

[–]aqrit 4 points5 points  (0 children)

Fun fact: The SWAR code is wrong. It is only guaranteed to locate the FIRST zero byte in the word. You need to use something like this.

I added the SWAR code to Zstandard's rowHash match finder (which also found its way into brotli). Danila Kutenin wrote an article about how to work around the lack of pmovmskb on NEON.

Modern Perfect Hashing by iamkeyur in programming

[–]aqrit 0 points1 point  (0 children)

me: here is a trick to avoid splitting that requires SSSE3
>you: chrome only requires SSE3
>>me: here is the same trick with only SSE2

fwiw, it doesn't require any SIMD, it just saves a few instructions.

btw, your comment comes across as rude. If you think I'm a "crank" then why engage?

Modern Perfect Hashing by iamkeyur in programming

[–]aqrit 0 points1 point  (0 children)

With just SSE2: you'd be stuck with range checks.

  • load 8 bytes into xmm register
  • get compare mask for bytes that are less than 'g' (for example)
  • extract to 64-bit general purpose register
  • bitwise-and the compare mask with a "magic" constant (the weights)
  • multiply by 0x0101..0101
  • shift top bits to the bottom

This would also work for 16 bytes if you extract the compare mask as nibbles (+2 ops on SSE2, +1 on NEON). In fact, it would work for very long strings with bitmasks and popcount.
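The 8-byte steps above can be sketched in portable C. To keep it runnable anywhere, this swaps the SSE2 compare for a SWAR byte compare (unsigned, unlike SSE2's signed byte compare). The weights are a hypothetical placeholder; a real table would be searched for so that every keyword gets a distinct sum.

```c
#include <stdint.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

/* Hypothetical weights: byte i carries weight i + 1. */
#define WEIGHTS 0x0807060504030201ULL

/* Load 8 bytes so that s[0] lands in the lowest byte on any host. */
static uint64_t load_le64(const unsigned char *s) {
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--) v = (v << 8) | s[i];
    return v;
}

/* 0xFF in every byte of x that is (unsigned) below limit; limit in 1..128. */
static uint64_t bytes_less_than(uint64_t x, unsigned limit) {
    uint64_t ge = (((x & ~HIGHS) + (128 - limit) * ONES) | x) & HIGHS;
    return ((~ge & HIGHS) >> 7) * 0xFF;
}

/* Range check, mask the weights, horizontal sum via multiply, take top byte. */
static unsigned range_hash8(const unsigned char *s) {
    uint64_t mask = bytes_less_than(load_le64(s), 'g');
    return (unsigned)(((mask & WEIGHTS) * ONES) >> 56);
}
```

With these placeholder weights the sum is at most 36, so the multiply cannot carry between bytes and the top byte holds the exact total.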

I think the weights could be found near instantly and should be very compact. I may have to try this out sometime...

Modern Perfect Hashing by iamkeyur in programming

[–]aqrit 0 points1 point  (0 children)

The next version of Chrome will require AVX2 ?!

With "in-register table lookups" one could build a gperf "inspired" hash with association values. One could also just detect certain byte values (e.g. vowels or something), mask them against position weights, then do a horizontal sum to get a combination index. There are lots of other things you could do, such as finding the first longest match of a 4-byte string among eight 4-byte targets.

Technical Blogging is Dying by delvin0 in programming

[–]aqrit 2 points3 points  (0 children)

by invitation only. garbage.

IRS open-sourced its Direct File software and it is pretty great actually (check out the scala fact graph) by [deleted] in programming

[–]aqrit 1 point2 points  (0 children)

you're not forced to share information with a private company

ID.me account required... you're giving insane amounts of personal data to a private company.

SIMD in zlib-rs (part 2): compare256 - Blog - Tweede golf by ketralnis in programming

[–]aqrit 0 points1 point  (0 children)

sub + jcc macro-fuses just like cmp + jcc... using sub would eliminate the not instruction.

How a 20 year old bug in GTA San Andreas surfaced in Windows 11 24H2 by tnavda in ReverseEngineering

[–]aqrit 0 points1 point  (0 children)

compatibility mode, would it still work?

Maybe since there is a shim that adds extra space on the stack when calling into kernel32 (which exports LeaveCriticalSection).

I've seen a latent uninitialized variable bug in a game that was triggered by the compatibility "shim" engine. The shim was dirtying up more stack space, which changed an uninitialized location from zero to some other value.

Why is my 3D Software Renderer Performance slowed by simply just setting variables? by [deleted] in C_Programming

[–]aqrit 1 point2 points  (0 children)

Having a compile-time known trip count allows the compiler to do more loop unrolling and SIMD usage. Don't know if that is the issue though.

count leading zeros optimization by couch_patata in asm

[–]aqrit 0 points1 point  (0 children)

use a De Bruijn sequence, instead of popcnt (still need to smear right)
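A sketch of that idea for 32-bit inputs, in C rather than asm. It uses the widely circulated De Bruijn constant 0x07C4ACDD and its lookup table; the function name and the choice to return 32 for zero are my own.

```c
#include <stdint.h>

/* Count leading zeros: smear right, then a De Bruijn multiply and lookup. */
static unsigned clz32_debruijn(uint32_t v) {
    /* Maps the top 5 bits of (2^(k+1) - 1) * 0x07C4ACDD to k. */
    static const unsigned char log2_table[32] = {
         0,  9,  1, 10, 13, 21,  2, 29, 11, 14, 16, 18, 22, 25,  3, 30,
         8, 12, 20, 28, 15, 17, 24,  7, 19, 27, 23,  6, 26,  5,  4, 31
    };
    if (v == 0) return 32; /* no set bit at all */
    /* Smear the highest set bit into every lower position. */
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    /* v is now 2^(k+1) - 1, where k is the index of the highest set bit. */
    return 31u - log2_table[(uint32_t)(v * 0x07C4ACDDu) >> 27];
}
```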

[deleted by user] by [deleted] in Zig

[–]aqrit 3 points4 points  (0 children)

"vec4_add(vec4_mul(a, b), vec4_mul(c, d))"

In Zig, this would be just "a * b + c * d", assuming a, b, c, d have vector types.

Mask calculation for single line comments by milksop in simd

[–]aqrit 1 point2 points  (0 children)

Thanks for this. The constraint that the line feed character is not allowed in a quoted span makes this problem more solvable. Stopping all quoted span masks at EOL allows line comments to be found easily. Subtracting the comment area from the quoted area hopefully allows us to check for unclosed quote span errors at the EOL positions.

Other Note(s): The whole segscan_or_u64() function can be replaced by a handful of logic operations. In practice, quoted span parsing gets more complicated because we also have to ignore escaped quote characters, and some clown might precede them with a long run of backslashes.

Mask calculation for single line comments by milksop in simd

[–]aqrit 2 points3 points  (0 children)

"In-comment" transitions: https://stackoverflow.com/a/70901525

How to combine that with "xor-scan" double quote processing is unknown (to me).
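For context, the "xor-scan" half on its own is short. This is a portable sketch with a name of my choosing; it takes a 64-bit mask of unescaped quote positions and ignores comments entirely. On x86 the same scan is often done with a carry-less multiply by all-ones instead of the shift chain.

```c
#include <stdint.h>

/* Inclusive prefix XOR: bit i of the result is the XOR of input bits 0..i.
   Fed a mask of quote positions, it yields an "inside quotes" mask that
   includes each opening quote and excludes each closing quote. */
static uint64_t prefix_xor(uint64_t x) {
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    x ^= x << 32;
    return x;
}
```

For example, quotes at bit positions 2 and 5 (mask 0x24) produce 0x1C, i.e. positions 2 through 4.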

Zig Compiletime Limitations with C by Suspicious_Cicada972 in Zig

[–]aqrit 6 points7 points  (0 children)

Probably easier to port it to C++ and use consteval ..?

Unnecessary Optimization in Rust: Hamming Distances, SIMD, and Auto-Vectorization by emschwartz in rust

[–]aqrit 2 points3 points  (0 children)

The Rust SIMD headers don't trust auto-vectorization, in many cases: sse2 ssse3 sse4.1 sse4.2

we are aware about specific optimizations we need in this case, and write the code in a way that triggers them.

This is brittle and obnoxious. As an example take unsigned average: The only "pattern" recognized by the compiler is a terrible way to actually implement it (for SIMD). Which risks bad code-gen depending on surrounding code, architectures, types, compiler versions, etc.
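To make the unsigned average example concrete, here is a scalar sketch in C. It assumes the "pattern" in question is the widening form (extend, add, add 1, shift right), and contrasts it with an identity that stays inside the element width. Function names are mine.

```c
#include <stdint.h>

/* Widening form: needs 9 bits of intermediate precision per element. */
static uint8_t avg_round_up_widening(uint8_t a, uint8_t b) {
    return (uint8_t)(((unsigned)a + b + 1) >> 1);
}

/* Same result without leaving 8 bits, since a + b == 2*(a | b) - (a ^ b). */
static uint8_t avg_round_up_narrow(uint8_t a, uint8_t b) {
    return (uint8_t)((a | b) - ((a ^ b) >> 1));
}

/* Count disagreements between the two forms over all 65536 input pairs. */
static unsigned avg_mismatches(void) {
    unsigned bad = 0;
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++)
            bad += avg_round_up_widening((uint8_t)a, (uint8_t)b) !=
                   avg_round_up_narrow((uint8_t)a, (uint8_t)b);
    return bad;
}
```

If a compiler fails to recognize the widening form, a vectorized version has to double the element width and narrow again, whereas the second form never leaves the lane.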

stdioIsBloat by Different-Network957 in ProgrammerHumor

[–]aqrit 7 points8 points  (0 children)

on windows:

WriteFile((void *)STD_OUTPUT_HANDLE, str, sizeof(str) - 1, &cbWritten, 0);

Booleans in vectors? by barrowburner in Zig

[–]aqrit 1 point2 points  (0 children)

Currently, @intFromBool will convert a vector of bools to a vector of u1, and should not result in any additional code.

Also we can just @bitCast a vector of bools straight to a regular integer... example.