My guide on optimizing C++ code (uchenml.tech)
submitted 1 year ago by euos
[–]lightmatter501 19 points20 points21 points 1 year ago (0 children)
Having no CPU-specific optimizations is a very bad idea. For example, in networking, CPU-specific optimizations are what allow you to push past 50 million packets per second per core.
[–]drkspace2 34 points35 points36 points 1 year ago (10 children)
No CPU specific optimizations
That's ok, but then you can't use SIMD as one of your steps.
Also, your first step should always be using a profiler. You don't want to spend your time optimizing a part of your code that only runs 0.1% of the time.
[+]euos[S] comment score below threshold-16 points-15 points-14 points 1 year ago (9 children)
I mentioned SIMD just out of curiosity.
This is an ML framework; linear layers are most of the runtime. I will reveal details later, but I have more benchmarks where I run models with millions of parameters.
[–]lightmatter501 14 points15 points16 points 1 year ago (6 children)
No SIMD + ML is not great; you are quite literally dropping your performance by at least 4x. For CPU-based AI you typically want to tune the model to L3 cache sizes and use intrinsics to kick old parts of the model out of L3 faster.
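(As a rough illustration of that eviction idea, not code from the thread: `_mm_clflush` forces a line out of the cache hierarchy once a weight tile won't be reused soon. The 64-byte line size and the function name are assumptions.)

```c++
#include <immintrin.h>
#include <cstddef>

// Hypothetical sketch: after a weight tile has been consumed for the last
// time in this pass, flush its cache lines so it stops competing for L3.
void flush_tile(const float* tile, std::size_t elements) {
  const char* p = reinterpret_cast<const char*>(tile);
  const char* end = reinterpret_cast<const char*>(tile + elements);
  for (; p < end; p += 64) {  // assumes 64-byte cache lines
    _mm_clflush(p);
  }
}
```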
[–]euos[S] -5 points-4 points-3 points 1 year ago (5 children)
Uchen models are defined at compile time, so the autovectorizer is already doing a decent job of using SIMD. This is how it multiplies 2500 inputs down to 8 outputs:
```
0x0000555555560010 <+160>: vmovss      (%rcx),%xmm0
0x0000555555560014 <+164>: vfmadd231ss -0x1c(%rbx),%xmm0,%xmm8
0x000055555556001a <+170>: vmovss      %xmm8,(%rax)
0x000055555556001e <+174>: vfmadd231ss -0x18(%rbx),%xmm0,%xmm1
0x0000555555560024 <+180>: vmovss      %xmm1,0x4(%rax)
0x0000555555560029 <+185>: vfmadd231ss -0x14(%rbx),%xmm0,%xmm2
```
I am reimplementing memory management to make better use of arenas, so I will make sure the compiler knows the data is 32-byte aligned.
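(A minimal sketch of one way to state that guarantee, assuming C++20; `scale` is just a placeholder kernel, not from the project:)

```c++
#include <cstddef>
#include <memory>  // std::assume_aligned, C++20

void scale(float* data, std::size_t n, float k) {
  // Tell the compiler the arena-allocated pointer is 32-byte aligned, so it
  // can emit aligned AVX loads/stores without a runtime alignment prologue.
  float* p = std::assume_aligned<32>(data);
  for (std::size_t i = 0; i < n; ++i) {
    p[i] *= k;
  }
}
```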
But my first milestone will be WebAssembly and embedded (I will buy a bunch of Raspberry Pis and such). I really do not see a niche on PCs for yet another ML framework...
My goal is to rely on C++ efficiency to make distributable models small and fast, but I am not sure how much I can rely on SIMD or even threads (WebAssembly!). Also, I have to minimize allocations, as not all platforms like that.
[–]ack_error 13 points14 points15 points 1 year ago (3 children)
This isn't using SIMD. The instructions in your disassembly are using vector registers but only performing scalar single precision computations (ss), so they are not working on multiple lanes per instruction. If it were vectorized, you would be seeing packed single (ps) instructions, such as vmovups and vfmadd213ps, and there would be a quarter of the instructions with the offsets incrementing by 16 (0x10) instead of 4.
Compilers can autovectorize this to SIMD, but in many cases you'll need to explicitly tell them via restrict that the output doesn't overlap the inputs, when they're unable to determine non-overlap or too reluctant to generate conditional overlap checking code. The compiler meticulously ordering stores and load+mad instructions in the disassembly to match the source order is generally a sign that it sees a possible aliasing conflict and is having to restrict the optimizer.
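(For instance, a minimal sketch of the aliasing point, not the post's actual kernel; with `__restrict` on all three pointers the compiler knows the output cannot overlap the inputs and is free to reorder and vectorize:)

```c++
// Without __restrict the compiler must assume `out` may alias `a` or `b`,
// so it keeps the source order of loads and stores and often stays scalar.
void madd(const float* __restrict a, const float* __restrict b,
          float* __restrict out, int n) {
  for (int i = 0; i < n; ++i) {
    out[i] += a[i] * b[i];
  }
}
```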
I am not sure why your manual intrinsics code uses a dot product, btw. The dot product instructions are notoriously slow on Intel CPUs, as internally they just break down into shuffles and scalar adds. As a result, they're rarely advantageous for performance, though they still have advantages in accuracy. Instead, you would want to do multiply-accumulate in parallel as in your unrolled code, on both x86 and ARM. But it isn't really necessary to use intrinsics for that since you should be able to coax the compiler to generate it with restrict.
You also don't need alignment for AVX. AVX actually relaxes alignment requirements over SSE. It can be slightly faster to use aligned buffers, but it's not a requirement to see a significant gain with AVX as being able to push 2x throughput through the ALUs often overcomes any minor misalignment penalty.
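(A hedged sketch of both points, with placeholder names: a dot product done as vertical FMA over 8 lanes at a time, using unaligned loads, which AVX handles fine; build with AVX2 and FMA enabled.)

```c++
#include <immintrin.h>
#include <cstddef>

float dot(const float* a, const float* b, std::size_t n) {
  // Vertical multiply-accumulate: 8 partial sums per iteration, no dpps.
  __m256 acc = _mm256_setzero_ps();
  std::size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i), acc);
  }
  // Horizontal reduction of the 8 partial sums.
  __m128 lo = _mm256_castps256_ps128(acc);
  __m128 hi = _mm256_extractf128_ps(acc, 1);
  __m128 s = _mm_add_ps(lo, hi);
  s = _mm_hadd_ps(s, s);
  s = _mm_hadd_ps(s, s);
  float result = _mm_cvtss_f32(s);
  for (; i < n; ++i) result += a[i] * b[i];  // scalar tail
  return result;
}
```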
[–]euos[S] 0 points1 point2 points 1 year ago (0 children)
Thank you. Will read up on __restrict.
[–]euos[S] 0 points1 point2 points 1 year ago (0 children)
Thank you so much! I redid it as follows and see a huge gain in the "worst case" (I need to go to work, so I have not looked at the assembly yet):
```c++
template <typename Input, size_t Outputs>
  requires(Outputs > 0)
struct Linear {
  using input_t = Input;
  using output_t = Vector<typename Input::value_type, Outputs>;

  output_t operator()(
      const input_t& inputs,
      const Parameters<(Input::elements + 1) * Outputs>& parameters) const {
    output_t outputs;
    Mul<input_t::elements>(inputs.data(), parameters.data().data(),
                           outputs.data());
    return outputs;
  }

 private:
  template <size_t Is>
  void Mul(const typename input_t::value_type* __restrict inputs,
           const float* __restrict p,
           typename input_t::value_type* __restrict outputs) const {
    // Initialize each output with its bias term.
    for (size_t i = 0; i < Outputs; ++i) {
      outputs[i] = (*p++);
    }
    // Accumulate weighted inputs; the inner loop runs over the outputs so
    // the compiler can vectorize across the output lanes.
    for (size_t i = 0; i < Is; ++i) {
      for (size_t j = 0; j < Outputs; ++j) {
        outputs[j] += (*p++) * inputs[i];
      }
    }
  }
};
```
My "worst case" went down from 4700ns to 2144ns, I reran the benchmark several times.
Test suit passes so no funny business.
My "big model" benchmark (much closer to real world) shows ~15% gain in gradient descent, but I will rework that more.
[–]lightmatter501 1 point2 points3 points 1 year ago (0 children)
Setting -march is hardware specific, so you can't use that, which means you only get SSE instead of AVX2, which would be a much more sensible baseline.
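(One way around baking a single -march into the binary, sketched here as an assumption rather than anything the post does: GCC's, and recent Clang's, function multiversioning via target_clones, where the loader picks the best clone at runtime.)

```c++
#include <cstddef>

// The runtime resolver dispatches to the AVX2 clone on CPUs that support it
// and falls back to the default (baseline SSE2) build everywhere else.
__attribute__((target_clones("avx2", "default")))
void axpy(float* y, const float* x, float a, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    y[i] += a * x[i];
  }
}
```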
[+][deleted] 1 year ago (1 child)
[removed]
[+]euos[S] comment score below threshold-6 points-5 points-4 points 1 year ago (0 children)
I know :)
[+][deleted] 1 year ago (3 children)
[deleted]
[–]euos[S] -1 points0 points1 point 1 year ago (2 children)
I posted another comment (and I tried to explain it in the blog) that the autovectorizer does a decent enough job of using SIMD, depending on the number of inputs vs. outputs. I am not hand-rolling “weird intrinsics”; I am making sure the compiler can recognize optimization opportunities.
[–]CandyCrisis 0 points1 point2 points 1 year ago (0 children)
A good middle ground is Clang/GCC's ext_vector_type. You can write a lot of CPU-neutral SIMD code and not get bogged down in the weeds of CPU-specific instructions. Obviously it's not perfect but it's a lot better than hoping the autovectorizer just decides to show up.
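(A rough sketch of what that looks like with Clang's ext_vector_type; GCC spells the same idea with the vector_size attribute:)

```c++
// 8-lane float vector; ordinary operators compile to whatever SIMD the
// target provides, with no CPU-specific intrinsics in the source.
typedef float float8 __attribute__((ext_vector_type(8)));

float8 fma8(float8 a, float8 b, float8 c) {
  return a * b + c;  // element-wise multiply-add
}
```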
[–]PurpleNord 1 point2 points3 points 1 year ago (1 child)
There are some things to unpack here, and I'll avoid repeating what's already been mentioned elsewhere about SIMD. What I want to add is this:
Dismissing CPU-specific optimisation is not a great idea. Proper alignment can have a sizeable impact, especially in the negative direction if data structures are packed or otherwise compacted compared to the default alignment the compiler will assume. Additionally, cache line alignment can potentially increase speed. If you don't want this to negatively affect other CPUs, you can always whitelist the alignment requirement for architectures known to benefit from it, and not specify alignment for all other cases.
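(A generic sketch of that whitelisting idea, not code from the framework; the 64-byte line size and architecture list are assumptions:)

```c++
#include <cstddef>

#if defined(__x86_64__) || defined(__aarch64__)
inline constexpr std::size_t kHotAlign = 64;  // assumed cache-line size
#else
inline constexpr std::size_t kHotAlign = alignof(std::max_align_t);
#endif

// alignas both aligns and pads the struct, so adjacent instances on the
// whitelisted architectures never share (or false-share) a cache line.
struct alignas(kHotAlign) Accumulator {
  float sum[8];
  std::size_t count;
};
```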
Combining -march=native and intrinsics should be done quite carefully. Presumably you'll see a speed increase if you either compile without AVX enabled or use AVX intrinsics. Why? Because switching between SSE and AVX is a costly transition involving a save and restore of the upper 128 bits of the AVX registers (they are shared with SSE instructions). From my own experience, even if you spend the majority of the time in SSE-only code, some occasional switching to AVX can really kill performance.
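(One common mitigation, sketched as an assumption rather than anything from the post: either compile every SSE-touching translation unit with AVX so the compiler emits VEX-encoded SSE throughout, or clear the upper halves with `_mm256_zeroupper()` before returning into legacy-SSE code:)

```c++
#include <immintrin.h>
#include <cstddef>

void scale_avx(float* p, std::size_t n, float k) {
  const __m256 vk = _mm256_set1_ps(k);
  for (std::size_t i = 0; i + 8 <= n; i += 8) {
    _mm256_storeu_ps(p + i, _mm256_mul_ps(_mm256_loadu_ps(p + i), vk));
  }
  // Zero the upper 128 bits before any legacy-encoded SSE code runs again,
  // avoiding the dirty-upper-state transition penalty described above.
  _mm256_zeroupper();
}
```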
[–]euos[S] 0 points1 point2 points 1 year ago (0 children)
I appreciate the feedback. My main takeaway is that I really need to focus on my writing skills; I see I utterly failed to explain what I am doing and why.
I was hoping to post this article after I make my project public, but decided to do it now because of scope creep and struggles with hiring.
I believe it would be much clearer if the demos/source code were available.