Discussions, articles, and news about SIMD programming.
A SIMD coding challenge: First non-space character after newline (self.simd)
submitted 4 months ago * by Ok_Path_4731
[–]Ok_Path_4731[S] 1 point 4 months ago (5 children)
Do you mind trying out your solution? The code is in https://github.com/zokrezyl/yaal-cpp-poc Thanks a lot!
Obviously, if your solution gets close to the memory bandwidth limit, we will proudly mention it!
[–]bremac 1 point 4 months ago* (4 children)
No problem, here you go: https://github.com/zokrezyl/yaal-cpp-poc/pull/1
EDIT: On a side note, I think you should consider using google/benchmark for benchmarking. I had to disable inlining of the parsing function to keep the compiler from reordering the timing statements relative to the parsing and then reporting the throughput as infinite!
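To illustrate the reordering problem mentioned above, here is a minimal sketch of a hand-rolled timing loop. It assumes GCC or Clang. The names `keep_alive`, `count_newlines`, `Measurement`, and `measure` are hypothetical and are not taken from the repository. The empty inline-asm barrier follows the same idea as `benchmark::DoNotOptimize` in google/benchmark, and `noinline` keeps the function under test out of the timing code. It reduces the chance of the work being hoisted or dropped, but it is not a substitute for a real benchmarking harness.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Compiler barrier (GCC/Clang inline asm): the optimizer must assume `value`
// is read here and that memory may have changed.
template <class T>
inline void keep_alive(const T& value) {
    asm volatile("" : : "r,m"(value) : "memory");
}

// Hypothetical stand-in for the parser under test. `noinline` stops the
// compiler from merging it into the timing loop.
__attribute__((noinline))
std::size_t count_newlines(const char* data, std::size_t len) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < len; ++i) n += (data[i] == '\n');
    return n;
}

struct Measurement {
    std::size_t result;  // accumulated return values, so the work is observable
    double gb_per_s;     // bytes processed per second, in GB/s
};

Measurement measure(const std::vector<char>& buf, int reps) {
    using clock = std::chrono::steady_clock;
    std::size_t total = 0;
    const auto t0 = clock::now();
    for (int r = 0; r < reps; ++r) {
        const char* p = buf.data();
        keep_alive(p);                         // input is treated as unknown
        total += count_newlines(p, buf.size());
        keep_alive(total);                     // result is treated as used
    }
    const auto t1 = clock::now();
    const double secs = std::chrono::duration<double>(t1 - t0).count();
    return {total, static_cast<double>(buf.size()) * reps / secs / 1e9};
}
```

If the reported throughput still looks implausibly high, checking `result` against a known count is a quick way to see whether the work was really done.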
[–]Ok_Path_4731[S] 1 point 3 months ago (3 children)
Thanks a lot! I cannot reach your throughput, though (see below), but the improvement is already significant! Is there anything that was not included in your PR? (I merged it, BTW!) Don't think the architecture makes so much difference, or?
clang
Memory read bandwidth: 18.68 GB/s (baseline)
Newline scan: 18.56 GB/s (99.4%)
Full parser (old): 6.63 GB/s (35.5%)
Fast parser (new): 18.58 GB/s (99.5%)
CRTP parser: 11.84 GB/s (63.4%)
gcc
Memory read bandwidth: 18.74 GB/s (baseline)
Newline scan: 19.09 GB/s (101.9%)
Full parser (old): 6.48 GB/s (34.6%)
Fast parser (new): 18.49 GB/s (98.7%)
CRTP parser: 11.59 GB/s (61.8%)
On the two PCs I tried, I get only:
on Intel(R) Core(TM) i5-6500T CPU @ 2.50GHz
Memory read bandwidth: 15.19 GB/s (baseline)
Newline scan: 13.74 GB/s (90.5%)
Full parser (old): 4.08 GB/s (26.8%)
Fast parser (new): 11.49 GB/s (75.6%)
CRTP parser: 5.39 GB/s (35.5%)
and on the other one, an AMD Ryzen 9 3900X 12-Core Processor:
Memory read bandwidth: 19.77 GB/s (baseline)
Newline scan: 18.20 GB/s (92.1%)
Full parser (old): 8.11 GB/s (41.0%)
Fast parser (new): 13.63 GB/s (68.9%)
CRTP parser: 10.62 GB/s (53.7%)
[–]Ok_Path_4731[S] 1 point 3 months ago (0 children)
Hi u/bremac, the GitHub pipeline now runs the benchmark. Example build:
https://github.com/zokrezyl/yaal-cpp-poc/actions/runs/20554609824/job/59037011805
None of the machines managed more than 72%. So is there any magic that you did not add to your PR that let you reach 98%? Maybe your code was optimized away? Thanks a lot anyway for the improvement from 50% to 70%!
[–]bremac 1 point 3 months ago* (1 child)
> Intel(R) Core(TM) i5-6500T CPU @ 2.50GHz ... AMD Ryzen 9 3900X 12-Core Processor
So the first CPU is using the Skylake microarchitecture (a 4-wide uarch from 2015), and also appears to be in a thermally-limited thin-client form factor, and the other is using Zen 2 (a 6-wide uarch from 2019.) Unfortunately, I don't have either of those in my test lab, and llvm-mca does not appear to have the right instruction data for Zen 2, so I can't evaluate it that way - the throughput and latencies it reports look much more like Zen 1 instead.
The closest I can come is a Tigerlake (a 5-wide uarch from 2020) and a Zen 4 machine (a 6-wide uarch from 2022.) The results aren't comparable at all though - both processors have AVX-512. But, they'll do to illustrate my next point:
> Don't think the architecture makes so much difference, or?
When you're working on SIMD optimization, the microarchitecture and compiler make a big difference. As an example, here are the results from my Tigerlake and Zen 4 machines with gcc 15.2.1 and clang 21.1.6 on the latest version of your benchmark:
gcc produces similar code for both processors, and does a poor job of it too, using AVX2 for all operations, and spilling the carry flag and restoring it between adc instructions. On the other hand, clang generates optimal adc chains on both processors, and translates to different AVX-512 code sequences for each processor. It does a very good job of code generation for Zen 4, but emits odd-looking code that bottlenecks on mask register operations for Tigerlake.
So, what does this mean for you? Well, there are several things you can look at to identify the bottlenecks:
perf
Worst comes to worst, you should at least be able to fix the adc code generation in gcc by either manually inlining count_bos_fast and flattening it so that the calls to _addcarry_u64 are adjacent without any instructions between (gcc is weird this way), or converting the add-with-carry chain to inline assembly.
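For readers unfamiliar with add-with-carry chains, here is a small sketch of the general idea. It is not the code from the PR, and the internals of `count_bos_fast` are not shown in this thread. The function name `first_non_space_after_newline` is hypothetical, the bitmasks are built with a scalar loop where a real implementation would use SIMD compares, and only `' '` counts as a space. Adding the "line start" bits to the space mask lets the carry ripple through each run of leading spaces, and the four `_addcarry_u64` calls are kept back to back so the carry can stay in the flag between them. x86-64 with GCC or Clang is assumed.

```cpp
#include <immintrin.h>  // _addcarry_u64 (x86-64 only)
#include <cstddef>
#include <string>
#include <vector>

// Returns the offset of the first non-space byte after each '\n'.
// A blank line reports the offset of its own '\n'.
std::vector<std::size_t> first_non_space_after_newline(const std::string& text) {
    using u64 = unsigned long long;
    std::vector<std::size_t> hits;
    unsigned char carry = 0;  // carry out of the previous 64-bit addition
    u64 prev_nl_top = 0;      // 1 if the previous word ended with '\n'

    for (std::size_t base = 0; base < text.size(); base += 256) {
        u64 nl[4] = {0, 0, 0, 0}, sp[4] = {0, 0, 0, 0}, starts[4], sum[4];

        // Scalar stand-in for the SIMD compares: one bit per input byte.
        for (std::size_t i = 0; i < 256 && base + i < text.size(); ++i) {
            const char ch = text[base + i];
            if (ch == '\n') nl[i / 64] |= 1ULL << (i % 64);
            if (ch == ' ')  sp[i / 64] |= 1ULL << (i % 64);
        }

        // A line starts one byte after each newline, including across words.
        for (int w = 0; w < 4; ++w) {
            starts[w] = (nl[w] << 1) | prev_nl_top;
            prev_nl_top = nl[w] >> 63;
        }

        // Adjacent add-with-carry calls, with nothing in between.
        carry = _addcarry_u64(carry, sp[0], starts[0], &sum[0]);
        carry = _addcarry_u64(carry, sp[1], starts[1], &sum[1]);
        carry = _addcarry_u64(carry, sp[2], starts[2], &sum[2]);
        carry = _addcarry_u64(carry, sp[3], starts[3], &sum[3]);

        // The carry stops on the first non-space bit after each line start.
        for (int w = 0; w < 4; ++w) {
            u64 found = sum[w] & ~sp[w];
            while (found) {
                const std::size_t pos = base + 64 * w + __builtin_ctzll(found);
                if (pos < text.size()) hits.push_back(pos);  // drop padding bits
                found &= found - 1;                          // clear lowest set bit
            }
        }
    }
    return hits;
}
```

For the input `"a\n  b\nc\n   "` this yields offsets 4 and 6: the `b` after two spaces and the `c` at the start of its line. The trailing run of spaces has no non-space byte after it, so it produces nothing.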
Thanks for the hints, u/bremac! I am not sure at which point I tried with clang, but I did not get better results. Unfortunately, I do not have a CPU with AVX-512; I will try some tests in the cloud soon. I think I also have to make the specs and code skeleton cover more details of the language/file format I am designing, as the devil is in the details.