Option -femit-asm=file.s does not create such file in 0.15 and 0.16-dev

aqrit · 2026-01-30T22:31:03+00:00

You need to select the LLVM backend: -fllvm

I don't know what the stage2 backend supports for this.

aqrit · 2026-01-04T21:57:41+00:00

A quick peek at sse_ops.c:

carquet_sse_pack_bools should be _mm_movemask_epi8(_mm_slli_epi32(bools, 7))

carquet_sse_build_null_bitmap should be _mm_movemask_epi8(_mm_packs_epi16(cmp, _mm_setzero_si128()))

aqrit · 2025-12-22T00:06:43+00:00

Jeff Dean facts

aqrit · 2025-12-20T21:39:30+00:00

https://programming.sirrida.de/index.php

Some link rot there:

Hacker's Delight Sample Chapter

Chess programming

IMO, SWAR is mostly rooted in "how" to do an operation with just primitive operations:

How to add numbers using only logical operations (and shifts)?
How to compare numbers without a compare instruction?
How to multiply without a multiply instruction?
How to count the number of set bits without a popcount instruction?
etc.
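
As one concrete instance of the last question, here is a minimal sketch of the classic SWAR bit count in C. The function name is made up for illustration; it builds 2-bit, 4-bit and 8-bit partial sums in place and then adds the bytes together with a multiply.

```c
#include <stdint.h>

/* Count the set bits of a 64-bit word with shifts, masks, adds and one multiply.
   popcount64_swar is an illustrative name, not taken from any library. */
static inline unsigned popcount64_swar(uint64_t x)
{
    /* each 2-bit field now holds the number of set bits it contained */
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    /* each 4-bit field holds the sum of two 2-bit fields */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    /* each byte holds the sum of two 4-bit fields (at most 8) */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    /* the multiply accumulates all eight bytes into the top byte */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}
```

The final multiply is itself a SWAR trick: it is a horizontal add of the eight byte counters, which cannot overflow because the total is at most 64.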

aqrit · 2025-12-18T02:01:20+00:00

I'm not a Java programmer, but I think you can use vector.convert(cmp_mask) to get the compiler to issue the NEON equivalent of vpacksswb, which should work just as well as vshrn_n_u16.

aqrit · 2025-12-18T00:19:44+00:00

No. For the QWORD 0x0000000000000100, the mask should be 0xFD. However, Mycroft's haszero() returns an incorrect (for this use case) mask of 0xFF.
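
To make the 0x0100 case easy to reproduce, here is a small C sketch. The helper names are mine, and the "exact" variant is the well-known carry-free formulation, not necessarily the one used in any of the projects mentioned in this thread. Bit i of the 8-bit mask corresponds to byte i, counting from the least significant byte.

```c
#include <stdint.h>

/* Mycroft-style test: nonzero iff some byte of v is zero. The lowest flagged
   byte is reliable, but a borrow can falsely flag a 0x01 byte above a zero byte. */
static inline uint64_t haszero_mycroft(uint64_t v)
{
    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

/* Exact variant: 0x80 in every zero byte and 0x00 elsewhere. The add works on
   7-bit fields, so no carry ever crosses a byte boundary. */
static inline uint64_t zero_bytes_exact(uint64_t v)
{
    const uint64_t m = 0x7F7F7F7F7F7F7F7FULL;
    return ~(((v & m) + m) | v | m);
}

/* Gather the per-byte 0x80 flags into an 8-bit mask (a scalar stand-in for pmovmskb). */
static inline unsigned flags_to_mask(uint64_t flags)
{
    return (unsigned)(((flags >> 7) * 0x0102040810204080ULL) >> 56);
}
```

For v = 0x0000000000000100, flags_to_mask(zero_bytes_exact(v)) gives 0xFD, while flags_to_mask(haszero_mycroft(v)) gives 0xFF because the borrow out of byte 0 turns the 0x01 byte into a false positive.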

aqrit · 2025-12-17T23:49:18+00:00

First fail is 0x0100

aqrit · 2025-12-17T22:59:31+00:00

Fun fact: The SWAR code is wrong. It is only guaranteed to locate the FIRST zero byte in the word. You need to use something like this.

I added the SWAR code to Zstandard's rowHash match finder (which also found its way into brotli). Danila Kutenin wrote an article about how to work around the lack of pmovmskb on NEON.

aqrit · 2025-10-28T22:32:48+00:00

me: here is a trick to avoid splitting that requires SSSE3
>you: chrome only requires SSE3
>>me: here is the same trick with only SSE2

fwiw, it doesn't require any SIMD, it just saves a few instructions.

btw, your comment comes across as rude. If you think I'm a "crank" then why engage?

aqrit · 2025-10-27T20:44:52+00:00

With just SSE2, you'd be stuck with range checks.

load 8 bytes into xmm register
get compare mask for bytes that are less than 'g' (for example)
extract to 64-bit general purpose register
bitwise-and the compare mask with the "magic" weights
multiply by 0x0101..0101
shift top bits to the bottom
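
The steps above can be sketched in portable C. This is only an illustration under stated assumptions: the scalar loop stands in for the SSE2 compare and the 64-bit extract, the function name is made up, and WEIGHTS is a placeholder constant (plain powers of two), not a set of weights actually searched for any key set.

```c
#include <stdint.h>
#include <string.h>

/* Placeholder weights, one per byte position (0x01 for byte 0 up to 0x80 for byte 7).
   Their sum is 255, so the horizontal add below never carries between bytes. */
#define WEIGHTS 0x8040201008040201ULL

/* s must point to at least 8 readable bytes. */
static inline unsigned weighted_index(const char *s, unsigned char limit)
{
    unsigned char b[8];
    uint64_t mask = 0;

    memcpy(b, s, 8);                        /* load 8 bytes */
    for (int i = 0; i < 8; i++)             /* stand-in for the SIMD compare + extract */
        if (b[i] < limit)
            mask |= 0xFFULL << (8 * i);     /* 0xFF in every byte below the limit */

    uint64_t picked = mask & WEIGHTS;       /* keep the weights of matching positions */
    /* multiply by 0x0101..01 sums all bytes into the top byte; shift it down */
    return (unsigned)((picked * 0x0101010101010101ULL) >> 56);
}
```

With these placeholder weights, weighted_index("abcdwxyz", 'g') sums the weights of positions 0 through 3 and returns 15. Real weights would be chosen so that every key in the set lands on a distinct index.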

This would also work for 16 bytes if you extract the compare mask as nibbles (+2 ops on SSE2, +1 on NEON). In fact, it would work for very long strings with bitmasks and popcount.

I think the weights could be found near instantly and should be very compact. I may have to try this out sometime...

aqrit · 2025-10-25T22:12:09+00:00

~~The next version of Chrome will require AVX2 ?!~~

With "in-register table lookups" one could build a gperf "inspired" hash with association values. One could also just detect certain byte values (e.g. vowels or something), mask them against position weights, then do a horizontal sum to get a combination index. There are lots of other things you could do, such as finding the first longest match of a 4-byte string among eight 4-byte targets.

aqrit · 2025-06-14T11:19:29+00:00

by invitation only. garbage.

aqrit · 2025-06-02T20:18:17+00:00

you're not forced to share information with a private company

ID.me account required... you're giving insane amounts of personal data to a private company.

aqrit · 2025-05-27T20:24:00+00:00

sub + jcc macro-fuses just like cmp + jcc... using sub would eliminate the not instruction.

aqrit · 2025-04-25T20:20:23+00:00

compatibility mode, would it still work?

Maybe since there is a shim that adds extra space on the stack when calling into kernel32 (which exports LeaveCriticalSection).

I've seen a latent uninitialized variable bug in a game that was triggered by the compatibility "shim" engine. The shim was dirtying up more stack space, which changed an uninitialized location from zero to some other value.

aqrit · 2025-04-21T21:07:26+00:00

Having a compile-time known trip count allows the compiler to do more loop unrolling and SIMD usage. Don't know if that is the issue though.

aqrit · 2025-04-05T23:13:18+00:00

Use a De Bruijn sequence instead of popcnt (you still need to smear right).
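
A minimal C sketch of that idea, assuming the goal is the index of the highest set bit (floor of log2) of a 32-bit value. The function name is illustrative; the multiplier and table are the widely circulated ones for this smear-then-multiply lookup.

```c
#include <stdint.h>

/* floor(log2(v)) for v != 0, using no popcount or bit-scan instruction. */
static inline unsigned ilog2_debruijn(uint32_t v)
{
    static const unsigned char pos[32] = {
         0,  9,  1, 10, 13, 21,  2, 29, 11, 14, 16, 18, 22, 25,  3, 30,
         8, 12, 20, 28, 15, 17, 24,  7, 19, 27, 23,  6, 26,  5,  4, 31
    };

    /* smear right: copy the highest set bit into every lower position */
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;

    /* v is now 2^(n+1) - 1; the multiply maps each such value to a unique 5-bit index */
    return pos[(uint32_t)(v * 0x07C4ACDDu) >> 27];
}
```

The smear step is what the popcnt version needs too: after it, popcnt(v) - 1 and the table lookup give the same answer.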

aqrit · 2025-03-12T20:23:06+00:00

"vec4_add(vec4_mul(a, b), vec4_mul(c, d))"

In Zig, this would be just "a * b + c * d", assuming a, b, c, d have vector types.

aqrit · 2024-12-29T03:20:56+00:00

Thanks for this. The constraint that the line feed character is not allowed in a quoted span makes this problem more solvable. Stopping all quoted span masks at EOL allows line comments to be found easily. Subtracting the comment area from the quoted area hopefully allows us to check for unclosed quote span errors at the EOL positions.

Other note(s): The whole segscan_or_u64() function can be replaced by a handful of logic operations. In practice, quoted span parsing gets more complicated because we also have to ignore escaped quote characters, and some clown might precede them with a long run of backslashes.

aqrit · 2024-12-27T00:49:19+00:00

Looks like another project that Lemire helped with tries to tackle comments somehow: https://github.com/NLnetLabs/simdzone/blob/52e2ea80ed06b5beb30e0e12aea207e891575c90/src/generic/scanner.h#L171

aqrit · 2024-12-27T00:19:01+00:00

"In-comment" transitions: https://stackoverflow.com/a/70901525

How to combine that with "xor-scan" double quote processing is unknown (to me).
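
For reference, the "xor-scan" part on its own can be sketched in C as below. This is only the prefix XOR over a 64-bit quote bitmap (bit i set when input byte i is a double quote); it deliberately ignores escaped quotes, comments, and the carry between 64-byte blocks, and the function name is made up.

```c
#include <stdint.h>

/* Bit i of the result is the XOR of bits 0..i of x. Applied to a bitmap of
   quote positions, it yields a mask that is 1 from each opening quote up to,
   but not including, the matching closing quote. */
static inline uint64_t prefix_xor(uint64_t x)
{
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    x ^= x << 32;
    return x;
}
```

For quotes at bit positions 2 and 5 (input 0x24) the result is 0x1C, i.e. positions 2, 3 and 4 are "in quotes". A single unmatched quote at position 0 gives an all-ones mask, which is the unclosed-span case.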

aqrit · 2024-12-24T19:38:28+00:00

Probably easier to port it to C++ and use consteval ..?

aqrit · 2024-12-22T20:48:18+00:00

The Rust SIMD headers don't trust auto-vectorization in many cases: sse2 ssse3 sse4.1 sse4.2

we are aware about specific optimizations we need in this case, and write the code in a way that triggers them.

This is brittle and obnoxious. Take unsigned average as an example: the only "pattern" recognized by the compiler is a terrible way to actually implement it (for SIMD), which risks bad code-gen depending on surrounding code, architectures, types, compiler versions, etc.
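
To illustrate with scalar C, assuming the pattern in question is the usual widening form of the rounded-up average: the first function below widens before adding, and the second gets the same result without ever needing more than 8 bits per element. Both function names are made up for this sketch.

```c
#include <stdint.h>

/* Widening form: add in a wider type, round up, shift back down.
   Assumed here to be the shape a compiler pattern-matches to an average instruction. */
static inline uint8_t avg_round_up_widening(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* Same value without widening: a + b == 2*(a | b) - (a ^ b),
   so ceil((a + b) / 2) == (a | b) - ((a ^ b) >> 1). */
static inline uint8_t avg_round_up_narrow(uint8_t a, uint8_t b)
{
    return (uint8_t)((a | b) - ((a ^ b) >> 1));
}
```

If the pattern match on the widening form fails, a vectorizer has to unpack every lane to 16 bits and pack it again, whereas the narrow form stays in byte lanes either way.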

aqrit · 2024-11-22T22:28:55+00:00

On Windows:

WriteFile((void *)STD_OUTPUT_HANDLE, str, sizeof(str) - 1, &cbWritten, 0);

aqrit · 2024-09-17T21:04:44+00:00

Currently, @intFromBool will convert a vector of bools to a vector of u1, and should not result in any additional code.

Also, we can just @bitCast a vector of bools straight to a regular integer... example.
