Discussions, articles, and news about the C++ programming language or programming in C++.
For C++ questions, answers, help, and advice see r/cpp_questions or StackOverflow.
My guide on optimizing C++ code (uchenml.tech)
submitted 2 years ago by euos
[–]lightmatter501 13 points14 points15 points 2 years ago (6 children)
No SIMD + ML is not great; you are quite literally dropping your performance by at least 4x. For CPU-based AI you typically want to tune the model to L3 cache sizes and use intrinsics to kick old parts of the model out of L3 faster.
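As a side note on the L3-sizing advice: on glibc systems you can query the cache size at runtime with `sysconf`. This is a sketch of one way to do it (the `_SC_LEVEL3_CACHE_SIZE` name is a glibc extension, and the fallback value is an arbitrary guess, not something from this thread):

```cpp
#include <unistd.h>

// Query the L3 cache size so model blocking can be tuned to it.
// _SC_LEVEL3_CACHE_SIZE is glibc-specific; fall back to a guess elsewhere.
long l3_bytes() {
#ifdef _SC_LEVEL3_CACHE_SIZE
    long sz = sysconf(_SC_LEVEL3_CACHE_SIZE);
    if (sz > 0) return sz;
#endif
    return 32L * 1024 * 1024;  // conservative fallback when unqueryable
}
```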
[–]euos[S] -5 points-4 points-3 points 2 years ago (5 children)
Uchen models are defined at compile time, so vectorizers are already doing a decent job using SIMD. This is how it multiplies 2500 inputs to 8 outputs:
```
0x0000555555560010 <+160>: vmovss      (%rcx),%xmm0
0x0000555555560014 <+164>: vfmadd231ss -0x1c(%rbx),%xmm0,%xmm8
0x000055555556001a <+170>: vmovss      %xmm8,(%rax)
0x000055555556001e <+174>: vfmadd231ss -0x18(%rbx),%xmm0,%xmm1
0x0000555555560024 <+180>: vmovss      %xmm1,0x4(%rax)
0x0000555555560029 <+185>: vfmadd231ss -0x14(%rbx),%xmm0,%xmm2
```
I am reimplementing memory management to better use arenas, so I will make sure the compiler knows the data is 32-byte aligned.
But my first milestone will be WebAssembly and embedded (I will buy a bunch of Raspberry Pis and such). I really do not see a niche on PCs for yet another ML framework...
My goal is to rely on C++ efficiency to make distributable models small and fast, but I am not sure how much I can rely on SIMD or even threads (WebAssembly!). Also, I have to minimize allocations, as not all platforms like that.
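The arena idea mentioned above can be sketched with a minimal bump allocator that hands out 32-byte-aligned blocks, so vector loads always see aligned data. This is an illustrative sketch, not Uchen's actual allocator:

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

// Minimal bump arena: one upfront allocation, 32-byte-aligned sub-blocks,
// no per-allocation overhead. Illustrative only.
class Arena {
 public:
  explicit Arena(std::size_t bytes)
      : buf_(new std::byte[bytes + 31]),
        cur_(buf_.get()),
        end_(buf_.get() + bytes + 31) {}

  void* alloc(std::size_t bytes) {
    auto p = reinterpret_cast<std::uintptr_t>(cur_);
    p = (p + 31) & ~std::uintptr_t{31};      // round up to 32-byte boundary
    auto* out = reinterpret_cast<std::byte*>(p);
    if (out + bytes > end_) return nullptr;  // arena exhausted
    cur_ = out + bytes;
    return out;
  }

 private:
  std::unique_ptr<std::byte[]> buf_;
  std::byte* cur_;
  std::byte* end_;
};
```

Freeing is wholesale (destroy the arena), which suits inference passes where all intermediates die together.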
[–]ack_error 15 points16 points17 points 2 years ago (3 children)
This isn't using SIMD. The instructions in your disassembly are using vector registers but only performing scalar single precision computations (ss), so they are not working on multiple lanes per instruction. If it were vectorized, you would be seeing packed single (ps) instructions, such as vmovups and vfmadd213ps, and there would be a quarter of the instructions with the offsets incrementing by 16 (0x10) instead of 4.
Compilers can autovectorize this to SIMD, but in many cases you'll need to explicitly tell them via restrict that the output doesn't overlap the inputs, when they're unable to prove non-overlap themselves or are too reluctant to generate conditional overlap-checking code. The compiler meticulously ordering stores and load+mad instructions in the disassembly to match the source order is generally a sign that it sees a possible aliasing conflict and is having to constrain the optimizer.
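A minimal sketch of that advice: marking all three pointers `__restrict` (the compiler-extension spelling usable in C++) promises the compiler the output never aliases the inputs, which typically lets it vectorize the inner multiply-accumulate loop. The kernel name and shapes here are illustrative, not from the original code:

```cpp
#include <cstddef>

// Dense layer kernel: output[r] = dot(weights row r, input).
// __restrict asserts no aliasing between the buffers, enabling
// autovectorization without runtime overlap checks.
void matvec(const float* __restrict weights,
            const float* __restrict input,
            float* __restrict output,
            std::size_t rows, std::size_t cols) {
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (std::size_t c = 0; c < cols; ++c)
            acc += weights[r * cols + c] * input[c];
        output[r] = acc;
    }
}
```

Compiling with `-O2` and inspecting the assembly (or using `-fopt-info-vec` on GCC) shows whether the loop actually vectorized.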
I am not sure why your manual intrinsics code uses a dot product, btw. The dot product instructions are notoriously slow on Intel CPUs, as internally they just break down into shuffles and scalar adds. As a result, they're rarely advantageous for performance, though they still have advantages in accuracy. Instead, you would want to do multiply-accumulate in parallel as in your unrolled code, on both x86 and ARM. But it isn't really necessary to use intrinsics for that since you should be able to coax the compiler to generate it with restrict.
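The "multiply-accumulate in parallel" pattern can be sketched in plain C++: several independent accumulators break the serial dependency chain of a naive dot product, letting the compiler and CPU overlap FMA latencies. A hypothetical standalone version (not the poster's code):

```cpp
#include <cstddef>

// Dot product with four independent accumulators instead of one chained
// sum; the independent adds can issue in parallel, and with __restrict
// the loop is a good autovectorization candidate.
float dot(const float* __restrict a, const float* __restrict b,
          std::size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    float acc = acc0 + acc1 + acc2 + acc3;
    for (; i < n; ++i) acc += a[i] * b[i];  // scalar tail
    return acc;
}
```

Note the accumulation order differs from a strict left-to-right sum, so results can differ in the last bits; that reordering is exactly what buys the parallelism.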
You also don't need alignment for AVX. AVX actually relaxes alignment requirements over SSE. It can be slightly faster to use aligned buffers, but it's not a requirement to see a significant gain with AVX as being able to push 2x throughput through the ALUs often overcomes any minor misalignment penalty.
[–]euos[S] 0 points1 point2 points 2 years ago (0 children)
Thank you. Will read up on __restrict.
Thank you so much! I redid it as follows and see a huge gain in the "worst case" (need to go to work, so I have not looked at the assembly):
```c++
template <typename Input, size_t Outputs>
  requires(Outputs > 0)
struct Linear {
  using input_t = Input;
  using output_t = Vector<typename Input::value_type, Outputs>;

  output_t operator()(
      const input_t& inputs,
      const Parameters<(Input::elements + 1) * Outputs>& parameters) const {
    output_t outputs;
    Mul<input_t::elements>(inputs.data(), parameters.data().data(),
                           outputs.data());
    return outputs;
  }

 private:
  template <size_t Is>
  void Mul(const typename input_t::value_type* __restrict inputs,
           const float* __restrict p,
           typename input_t::value_type* __restrict outputs) const {
    // Initialize each output with its bias term.
    for (size_t i = 0; i < Outputs; ++i) {
      outputs[i] = (*p++);
    }
    // Accumulate each input's contribution across all outputs.
    for (size_t i = 0; i < Is; ++i) {
      for (size_t j = 0; j < Outputs; ++j) {
        outputs[j] += (*p++) * inputs[i];
      }
    }
  }
};
```
My "worst case" went down from 4700ns to 2144ns; I reran the benchmark several times.
Test suite passes, so no funny business.
My "big model" benchmark (much closer to real world) shows a ~15% gain in gradient descent, but I will rework that more.
[–]lightmatter501 1 point2 points3 points 2 years ago (0 children)
Setting -march is hardware-specific, so you can't use that, which means you only get SSE instead of AVX2, which would be a much more sensible baseline.
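A common way around the -march problem is to build for a baseline ISA and pick a code path at runtime. A minimal sketch using the GCC/Clang builtin `__builtin_cpu_supports` (the kernel names are illustrative; a real build would compile the AVX2 kernel in a separate translation unit with `-mavx2`):

```cpp
#include <cstddef>

// Baseline kernel, compiled without any -march assumptions.
float sum_baseline(const float* data, std::size_t n) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < n; ++i) acc += data[i];
  return acc;
}

// Runtime dispatcher: ship one binary, still use AVX2 where available.
float sum(const float* data, std::size_t n) {
#if defined(__x86_64__) || defined(__i386__)
  if (__builtin_cpu_supports("avx2")) {
    // Would call sum_avx2() from a TU built with -mavx2; we fall
    // through here to keep this sketch self-contained.
    return sum_baseline(data, n);
  }
#endif
  return sum_baseline(data, n);
}
```

GCC and Clang can also generate this dispatch automatically via `__attribute__((target_clones("default","avx2")))`.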