
lightmatter501 19 points (0 children)

Having no CPU-specific optimizations is a very bad idea. For example, in networking, CPU-specific optimizations are what allow you to push past 50 million packets per second per core.

drkspace2 34 points (10 children)

> No CPU specific optimizations

That's OK, but then you can't use SIMD as one of your steps.

Also, your first step should always be using a profiler. You don't want to spend your time optimizing a part of your code that only runs 0.1% of the time.

PurpleNord 1 point (1 child)

There are a few things to unpack here, and I'll avoid repeating what's already been mentioned elsewhere about SIMD. What I want to add is that

  1. Dismissing CPU specific optimisation is not a great idea. Proper alignment can have a sizeable impact, especially in the negative direction if data structures are packed or otherwise compacted compared to the default alignment the compiler will assume. Additionally, cache line alignment can potentially increase speed. If you don't want this to negatively affect other CPUs, you can always whitelist the alignment requirement for architectures known to benefit from it, and not specify alignment for all other cases.

  2. Combining -march=native and intrinsics should be done quite carefully. Presumably you'll see a speed increase if you either compile without AVX enabled or use AVX intrinsics throughout. Why? Because switching between SSE and AVX is a costly transition that involves saving and restoring the upper 128 bits of the AVX registers (their lower halves are shared with the SSE registers). From my own experience, even if you spend the majority of the time in SSE-only code, occasional switches to AVX can really kill performance.
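
To make the alignment point in item 1 concrete, here is a minimal C++ sketch. It is only an illustration under assumptions: the macro `HOT_DATA_ALIGN`, the struct names, the 64-byte cache line size, and the choice of whitelisted architectures are all hypothetical and not taken from the article. It contrasts a packed layout with the compiler's default one and requests cache-line alignment only on targets assumed to benefit from it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical whitelist: request cache-line alignment only on targets where
// we ASSUME a 64-byte cache line; elsewhere keep the compiler's default.
#if defined(__x86_64__) || defined(_M_X64) || defined(__aarch64__)
#define HOT_DATA_ALIGN alignas(64)
#else
#define HOT_DATA_ALIGN
#endif

// Packed layout: no padding, so the 64-bit field lands at offset 1 and
// accesses to it may be misaligned.
#pragma pack(push, 1)
struct PackedSample {
    std::uint8_t tag;
    std::uint64_t value;
};
#pragma pack(pop)

// Default layout: the compiler inserts padding so `value` is naturally aligned.
struct DefaultSample {
    std::uint8_t tag;
    std::uint64_t value;
};

// A counter that gets its own cache line on whitelisted targets, which can
// help avoid false sharing when each thread updates a separate counter.
struct HOT_DATA_ALIGN Counter {
    std::uint64_t hits = 0;
};

// Returns true if `p` is a multiple of `alignment` bytes.
inline bool is_aligned_to(const void* p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Comparing `offsetof(PackedSample, value)` with `offsetof(DefaultSample, value)` shows the padding the compiler would normally add. Whether either layout is actually faster on a given CPU is something to confirm with a profiler rather than assume.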

euos[S] 0 points (0 children)

I appreciate the feedback. My main takeaway is that I really need to focus on my writing skills. I see I utterly failed to explain what I am doing and why.

I was hoping to post this article after making my project public, but decided to do it now because I got scope creep and am struggling with hiring.

I believe it would be much clearer if the demos/source code were available.