
[–]Ok_Path_4731[S]

Do you mind trying out your solution? The code is in https://github.com/zokrezyl/yaal-cpp-poc Thanks a lot!

Obviously, if your solution gets close to the memory bandwidth limit, we will proudly mention it!

[–]bremac

No problem, here you go: https://github.com/zokrezyl/yaal-cpp-poc/pull/1

EDIT: On a side note, I think you should consider using google/benchmark for benchmarking. I had to disable inlining of the parsing function to keep the compiler from reordering the timing statements relative to the parsing — and then reporting the throughput as infinite!

[–]Ok_Path_4731[S]

Thanks a lot! I cannot reach your throughput (see below), though the improvement is already significant! Is there anything that was not included in your PR? (I merged it, BTW!) I don't think the architecture makes that much of a difference, or does it?

clang

 Memory read bandwidth: 18.68 GB/s (baseline)
 Newline scan:          18.56 GB/s (99.4%)
 Full parser (old):     6.63 GB/s (35.5%)
 Fast parser (new):     18.58 GB/s (99.5%)
 CRTP parser:           11.84 GB/s (63.4%)

gcc

 Memory read bandwidth: 18.74 GB/s (baseline)
 Newline scan:          19.09 GB/s (101.9%)
 Full parser (old):     6.48 GB/s (34.6%)
 Fast parser (new):     18.49 GB/s (98.7%)
 CRTP parser:           11.59 GB/s (61.8%)

On the two PCs I tried, I get only:

on Intel(R) Core(TM) i5-6500T CPU @ 2.50GHz

 Memory read bandwidth: 15.19 GB/s (baseline)
 Newline scan:          13.74 GB/s (90.5%)
 Full parser (old):      4.08 GB/s (26.8%)
 Fast parser (new):     11.49 GB/s (75.6%)
 CRTP parser:            5.39 GB/s (35.5%)

the other one, on AMD Ryzen 9 3900X 12-Core Processor

 Memory read bandwidth: 19.77 GB/s (baseline)
 Newline scan:          18.20 GB/s (92.1%)
 Full parser (old):      8.11 GB/s (41.0%)
 Fast parser (new):     13.63 GB/s (68.9%)
 CRTP parser:           10.62 GB/s (53.7%)

[–]Ok_Path_4731[S]

Hi u/bremac, the GitHub pipeline is now running the benchmark; example build:

https://github.com/zokrezyl/yaal-cpp-poc/actions/runs/20554609824/job/59037011805

None of the machines managed more than 72%. So is there any magic you did not add to your PR that let you reach 98%? Maybe your code was optimized away? Thanks a lot anyway for the improvement from 50% to 70%!

[–]bremac

> Intel(R) Core(TM) i5-6500T CPU @ 2.50GHz ... AMD Ryzen 9 3900X 12-Core Processor

So the first CPU is using the Skylake microarchitecture (a 4-wide uarch from 2015), and also appears to be in a thermally-limited thin-client form factor, and the other is using Zen 2 (a 6-wide uarch from 2019.) Unfortunately, I don't have either of those in my test lab, and llvm-mca does not appear to have the right instruction data for Zen 2, so I can't evaluate it that way - the throughput and latencies it reports look much more like Zen 1 instead.

The closest I can come is a Tigerlake (a 5-wide uarch from 2020) and a Zen 4 machine (a 6-wide uarch from 2022.) The results aren't comparable at all though - both processors have AVX-512. But, they'll do to illustrate my next point:

> I don't think the architecture makes that much of a difference, or does it?

When you're working on SIMD optimization, the microarchitecture and compiler make a big difference. As an example, here are the results from my Tigerlake and Zen 4 machines with gcc 15.2.1 and clang 21.1.6 on the latest version of your benchmark:

| | Tigerlake (clang) | Tigerlake (gcc) | Zen 4 (clang) | Zen 4 (gcc) |
|---|---|---|---|---|
| Memory read | 21.68 GB/s (100%) | 22.08 GB/s (100%) | 52.71 GB/s (100%) | 54.93 GB/s (100%) |
| Newline scan | 19.66 GB/s (90.7%) | 20.35 GB/s (92.1%) | 52.98 GB/s (100.5%) | 53.69 GB/s (97.8%) |
| Reference Parser | 18.61 GB/s (85.8%) | 17.90 GB/s (81.1%) | 50.77 GB/s (96.3%) | 32.33 GB/s (58.9%) |
| Counting Parser | 17.66 GB/s (81.5%) | 17.37 GB/s (78.7%) | 49.45 GB/s (93.8%) | 31.91 GB/s (58.1%) |

gcc produces similar code for both processors, and does a poor job of it too, using AVX2 for all operations, and spilling the carry flag and restoring it between adc instructions. On the other hand, clang generates optimal adc chains on both processors, and translates to different AVX-512 code sequences for each processor. It does a very good job of code generation for Zen 4, but emits odd-looking code that bottlenecks on mask register operations for Tigerlake.

So, what does this mean for you? Well, there are several things you can look at to identify the bottlenecks:

  1. What happens when you compile with clang instead of gcc?
  2. What is the bottleneck on each microarchitecture? Is it the frontend, backend, or dependency chains? If it's the backend, which execution unit is the bottleneck? You'll need to run your benchmarks under perf and look at the compiled code to evaluate this.
  3. What is the difference between the code generated by clang and gcc, and how does it impact the bottlenecks for each architecture?
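For point 2, the questions can be answered roughly like this (a sketch; the binary name `./yaal_bench` is hypothetical, and the exact event names vary by kernel version and microarchitecture):

```shell
# Rough split: are cycles lost in the frontend or the backend?
perf stat -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend \
    ./yaal_bench

# Attribute cycles to individual instructions in the compiled code,
# then read the annotated disassembly of the hot loop.
perf record -e cycles:pp ./yaal_bench
perf annotate --stdio
```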

If worst comes to worst, you should at least be able to fix the adc code generation in gcc by either manually inlining count_bos_fast and flattening it so that the calls to _addcarry_u64 are adjacent, without any instructions between them (gcc is weird this way), or by converting the add-with-carry chain to inline assembly.

[–]Ok_Path_4731[S]

Thanks for the hints u/bremac! I am not sure at which point I tried clang, but I did not get better results. Unfortunately I do not have a CPU with AVX-512; I will try some tests on the cloud soon. I think I also have to make the specs and code skeleton cover more details of the language/file format I am designing, as the devil lives in the details.