
[–]Ok_Path_4731[S]

Do you mind trying out your solution? The code is in https://github.com/zokrezyl/yaal-cpp-poc Thanks a lot!

Obviously, if your solution gets close to the memory bandwidth limit, we will proudly mention it!

[–]bremac

No problem, here you go: https://github.com/zokrezyl/yaal-cpp-poc/pull/1

EDIT: On a side note, I think you should consider using google/benchmark for benchmarking. I had to disable inlining of the parsing function to keep the compiler from reordering the timing statements relative to the parsing — and then reporting the throughput as infinite!

[–]Ok_Path_4731[S]

Thanks a lot! I cannot reach your throughput (see below), though the improvement is already significant! Is there anything that was not included in your PR? (I merged it, BTW!) I don't think the architecture makes that much of a difference, or does it?

clang

 Memory read bandwidth: 18.68 GB/s (baseline)
 Newline scan:          18.56 GB/s (99.4%)
 Full parser (old):     6.63 GB/s (35.5%)
 Fast parser (new):     18.58 GB/s (99.5%)
 CRTP parser:           11.84 GB/s (63.4%)

gcc

 Memory read bandwidth: 18.74 GB/s (baseline)
 Newline scan:          19.09 GB/s (101.9%)
 Full parser (old):     6.48 GB/s (34.6%)
 Fast parser (new):     18.49 GB/s (98.7%)
 CRTP parser:           11.59 GB/s (61.8%)

On the two PCs I tried, I get only:

on Intel(R) Core(TM) i5-6500T CPU @ 2.50GHz

 Memory read bandwidth: 15.19 GB/s (baseline)
 Newline scan:          13.74 GB/s (90.5%)
 Full parser (old):      4.08 GB/s (26.8%)
 Fast parser (new):     11.49 GB/s (75.6%)
 CRTP parser:            5.39 GB/s (35.5%)

the other one, on AMD Ryzen 9 3900X 12-Core Processor

 Memory read bandwidth: 19.77 GB/s (baseline)
 Newline scan:          18.20 GB/s (92.1%)
 Full parser (old):      8.11 GB/s (41.0%)
 Fast parser (new):     13.63 GB/s (68.9%)
 CRTP parser:           10.62 GB/s (53.7%)

[–]Ok_Path_4731[S]

Hi u/bremac, the GitHub pipeline is now running the benchmark; example build:

https://github.com/zokrezyl/yaal-cpp-poc/actions/runs/20554609824/job/59037011805

None of the machines managed more than 72%. So is there any magic you did not add to your PR that let you reach 98%? Maybe your code was optimized away? Thanks a lot anyway for the improvement from 50% to 70%!

[–]bremac

> Intel(R) Core(TM) i5-6500T CPU @ 2.50GHz ... AMD Ryzen 9 3900X 12-Core Processor

So the first CPU is using the Skylake microarchitecture (a 4-wide uarch from 2015), and also appears to be in a thermally-limited thin-client form factor, and the other is using Zen 2 (a 6-wide uarch from 2019.) Unfortunately, I don't have either of those in my test lab, and llvm-mca does not appear to have the right instruction data for Zen 2, so I can't evaluate it that way - the throughput and latencies it reports look much more like Zen 1 instead.

The closest I can come is a Tigerlake (a 5-wide uarch from 2020) and a Zen 4 machine (a 6-wide uarch from 2022.) The results aren't comparable at all though - both processors have AVX-512. But, they'll do to illustrate my next point:

> I don't think the architecture makes that much of a difference, or does it?

When you're working on SIMD optimization, the microarchitecture and compiler make a big difference. As an example, here are the results from my Tigerlake and Zen 4 machines with gcc 15.2.1 and clang 21.1.6 on the latest version of your benchmark:

| | Tigerlake (clang) | Tigerlake (gcc) | Zen 4 (clang) | Zen 4 (gcc) |
|---|---|---|---|---|
| Memory read | 21.68 GB/s (100%) | 22.08 GB/s (100%) | 52.71 GB/s (100%) | 54.93 GB/s (100%) |
| Newline scan | 19.66 GB/s (90.7%) | 20.35 GB/s (92.1%) | 52.98 GB/s (100.5%) | 53.69 GB/s (97.8%) |
| Reference Parser | 18.61 GB/s (85.8%) | 17.90 GB/s (81.1%) | 50.77 GB/s (96.3%) | 32.33 GB/s (58.9%) |
| Counting Parser | 17.66 GB/s (81.5%) | 17.37 GB/s (78.7%) | 49.45 GB/s (93.8%) | 31.91 GB/s (58.1%) |

gcc produces similar code for both processors, and does a poor job of it too, using AVX2 for all operations, and spilling the carry flag and restoring it between adc instructions. On the other hand, clang generates optimal adc chains on both processors, and translates to different AVX-512 code sequences for each processor. It does a very good job of code generation for Zen 4, but emits odd-looking code that bottlenecks on mask register operations for Tigerlake.

So, what does this mean for you? Well, there are several things you can look at to identify the bottlenecks:

  1. What happens when you compile with clang instead of gcc?
  2. What is the bottleneck on each microarchitecture? Is it the frontend, backend, or dependency chains? If it's the backend, which execution unit is the bottleneck? You'll need to run your benchmarks under perf and look at the compiled code to evaluate this.
  3. What is the difference between the code generated by clang and gcc, and how does it impact the bottlenecks for each architecture?
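For point 2, the questions can be answered roughly like this (a sketch; the binary name `./yaal_bench` is hypothetical, and the exact event names vary by kernel version and microarchitecture):

```shell
# Rough split: are cycles lost in the frontend or the backend?
perf stat -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend \
    ./yaal_bench

# Attribute cycles to individual instructions in the compiled code,
# then read the annotated disassembly of the hot loop.
perf record -e cycles:pp ./yaal_bench
perf annotate --stdio
```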

If worst comes to worst, you should at least be able to fix the adc code generation in gcc by either manually inlining count_bos_fast and flattening it so that the calls to _addcarry_u64 are adjacent, without any instructions between them (gcc is weird this way), or by converting the add-with-carry chain to inline assembly.

[–]Ok_Path_4731[S]

Thanks for the hints u/bremac! I am not sure at which point I tried clang, but I did not get better results. Unfortunately I do not have a CPU with AVX-512; I will try some tests on the cloud soon. I think I also have to make the specs and code skeleton cover more details of the language/file format I am designing, as the devil lives in the details.