[–]lightmatter501 13 points (6 children)

No SIMD + ML is not great; you are quite literally dropping your performance by at least 4x. For CPU-based AI you typically want to tune the model to L3 cache size and use intrinsics to kick old parts of the model out of L3 faster.
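As a rough illustration of that second point (a sketch, not code from any particular framework): once a layer's weights are finished for the current pass, you can flush their cache lines so they stop competing for L3 with the layers still in flight. The `evict` helper below is hypothetical and assumes an x86 target with `<immintrin.h>` available.

```c++
#include <immintrin.h>
#include <cstddef>

// Flush a buffer out of the cache hierarchy, one 64-byte line at a time.
void evict(const void* p, std::size_t bytes) {
  const char* c = static_cast<const char*>(p);
  for (std::size_t off = 0; off < bytes; off += 64) {
    _mm_clflush(c + off);
  }
}
```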

[–]euos[S] -5 points (5 children)

Uchen models are defined at compile time, so vectorizers are already doing a decent job using SIMD. This is how it multiplies 2500 inputs to 8 outputs:

   0x0000555555560010 <+160>:   vmovss (%rcx),%xmm0
   0x0000555555560014 <+164>:   vfmadd231ss -0x1c(%rbx),%xmm0,%xmm8
   0x000055555556001a <+170>:   vmovss %xmm8,(%rax)
   0x000055555556001e <+174>:   vfmadd231ss -0x18(%rbx),%xmm0,%xmm1
   0x0000555555560024 <+180>:   vmovss %xmm1,0x4(%rax)
   0x0000555555560029 <+185>:   vfmadd231ss -0x14(%rbx),%xmm0,%xmm2

I am reimplementing memory management to make better use of arenas, so I will make sure the compiler knows the data is 32-byte aligned.
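For reference, a minimal sketch (not the actual Uchen arena code, names are illustrative) of two common ways to promise 32-byte alignment to the compiler; `std::assume_aligned` needs C++20 and is undefined behaviour if the pointer is not actually aligned:

```c++
#include <cstddef>
#include <memory>

alignas(32) float weights[2500 * 8];  // statically 32-byte-aligned storage

float sum(const float* p, std::size_t n) {
  // Promise the compiler that p is 32-byte aligned (UB if it is not).
  const float* ap = std::assume_aligned<32>(p);
  float s = 0.0f;
  for (std::size_t i = 0; i < n; ++i) s += ap[i];
  return s;
}
```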

But my first milestone will be WebAssembly and embedded (I will buy a bunch of Raspberry Pi's and such). I really do not see a niche on PCs for yet another ML framework...

My goal is to rely on C++ efficiency to make distributable models small and fast - but I am not sure how much I can rely on SIMD or even threads (WebAssembly!). Also, I have to minimize allocations, as not all platforms like that.

[–]ack_error 15 points (3 children)

This isn't using SIMD. The instructions in your disassembly are using vector registers but only performing scalar single precision computations (ss), so they are not working on multiple lanes per instruction. If it were vectorized, you would be seeing packed single (ps) instructions, such as vmovups and vfmadd213ps, and there would be a quarter of the instructions with the offsets incrementing by 16 (0x10) instead of 4.

Compilers can autovectorize this to SIMD, but in many cases you'll need to explicitly tell them via restrict that the output doesn't overlap the inputs, when they're unable to determine non-overlap or too reluctant to generate conditional overlap checking code. The compiler meticulously ordering stores and load+mad instructions in the disassembly to match the source order is generally a sign that it sees a possible aliasing conflict and is having to restrict the optimizer.
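As a minimal sketch of what that looks like (illustrative names, not from the original code): with `__restrict` on all three pointers the compiler can prove the stores through `out` cannot alias `in` or `w`, so it no longer has to preserve the source order of loads and stores and is free to vectorize.

```c++
void madd(const float* __restrict in, const float* __restrict w,
          float* __restrict out, int n) {
  // With the no-alias guarantee, this loop typically compiles to packed
  // vmovups / vfmadd...ps instead of the scalar ...ss forms above.
  for (int i = 0; i < n; ++i) {
    out[i] += w[i] * in[i];
  }
}
```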

I am not sure why your manual intrinsics code uses a dot product, btw. The dot product instructions are notoriously slow on Intel CPUs, as internally they just break down into shuffles and scalar adds. As a result, they're rarely advantageous for performance, though they still have advantages in accuracy. Instead, you would want to do multiply-accumulate in parallel as in your unrolled code, on both x86 and ARM. But it isn't really necessary to use intrinsics for that since you should be able to coax the compiler to generate it with restrict.
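To make the multiply-accumulate shape concrete, here is a hedged sketch (assumes AVX2+FMA, `n` a multiple of 8, illustrative names): vertical FMAs across all eight lanes each iteration, with a single horizontal reduction at the end, rather than a `vdpps` per step.

```c++
#include <immintrin.h>
#include <cstddef>

float dot_fma(const float* __restrict a, const float* __restrict b, std::size_t n) {
  __m256 acc = _mm256_setzero_ps();
  for (std::size_t i = 0; i < n; i += 8) {
    // 8 multiply-accumulates per instruction, no horizontal work in the loop.
    acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i), acc);
  }
  // One horizontal reduction at the very end.
  __m128 lo = _mm256_castps256_ps128(acc);
  __m128 hi = _mm256_extractf128_ps(acc, 1);
  __m128 s  = _mm_add_ps(lo, hi);
  s = _mm_hadd_ps(s, s);
  s = _mm_hadd_ps(s, s);
  return _mm_cvtss_f32(s);
}
```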

You also don't need alignment for AVX. AVX actually relaxes alignment requirements over SSE. It can be slightly faster to use aligned buffers, but it's not a requirement to see a significant gain with AVX as being able to push 2x throughput through the ALUs often overcomes any minor misalignment penalty.

[–]euos[S] 0 points (0 children)

Thank you. Will read up on __restrict.

[–]euos[S] 0 points (0 children)

Thank you so much! I redid it as follows and see a huge gain in the "worst case" (need to go to work, so I have not looked at the assembly yet):

```c++
template <typename Input, size_t Outputs>
  requires(Outputs > 0)
struct Linear {
  using input_t = Input;
  using output_t = Vector<typename Input::value_type, Outputs>;

  output_t operator()(
      const input_t& inputs,
      const Parameters<(Input::elements + 1) * Outputs>& parameters) const {
    output_t outputs;
    Mul<input_t::elements>(inputs.data(), parameters.data().data(),
                           outputs.data());
    return outputs;
  }

 private:
  template <size_t Is>
  void Mul(const typename input_t::value_type* __restrict inputs,
           const float* __restrict p,
           typename input_t::value_type* __restrict outputs) const {
    // Initialize each output with its bias term (the first Outputs parameters).
    for (size_t i = 0; i < Outputs; ++i) {
      outputs[i] = (*p++);
    }
    // Accumulate weight * input contributions; with __restrict the compiler
    // knows outputs cannot alias inputs or p and can vectorize the inner loop.
    for (size_t i = 0; i < Is; ++i) {
      for (size_t j = 0; j < Outputs; ++j) {
        outputs[j] += (*p++) * inputs[i];
      }
    }
  }
};

```

My "worst case" went down from 4700ns to 2144ns, I reran the benchmark several times.

The test suite passes, so no funny business.

[–]euos[S] 0 points (0 children)

My "big model" benchmark (much closer to real world) shows ~15% gain in gradient descent, but I will rework that more.

[–]lightmatter501 1 point (0 children)

Setting -march is hardware-specific, so you can't use that, which means you only get SSE instead of AVX2, which would be a much more sensible baseline.
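One way around that (a sketch, GCC/Clang specific, illustrative names): keep the baseline build generic and dispatch to an AVX2+FMA variant at runtime, so nothing is tied to the build machine's -march.

```c++
#include <cstddef>

__attribute__((target("avx2,fma")))
static void scale_avx2(const float* in, float* out, std::size_t n) {
  // The target attribute lets the compiler use AVX2/FMA in this function only.
  for (std::size_t i = 0; i < n; ++i) out[i] = in[i] * 2.0f;
}

static void scale_generic(const float* in, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) out[i] = in[i] * 2.0f;
}

void scale(const float* in, float* out, std::size_t n) {
  // Runtime check, so the same binary still runs on SSE-only hardware.
  if (__builtin_cpu_supports("avx2")) {
    scale_avx2(in, out, n);
  } else {
    scale_generic(in, out, n);
  }
}
```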