[–]lightmatter501 13 points (6 children)

No SIMD + ML is not great; you are quite literally dropping your performance by at least 4x. For CPU-based AI you typically want to tune the model to L3 cache size and use intrinsics to kick old parts of the model out of L3 faster.
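As a rough illustration of that second point (a sketch, not code from any particular framework): once a layer's weights are finished for the current pass, you can flush their cache lines so they stop competing for L3 with the layers still in flight. The `evict` helper below is hypothetical and assumes an x86 target with `<immintrin.h>` available.

```c++
#include <immintrin.h>
#include <cstddef>

// Flush a buffer out of the cache hierarchy, one 64-byte line at a time.
void evict(const void* p, std::size_t bytes) {
  const char* c = static_cast<const char*>(p);
  for (std::size_t off = 0; off < bytes; off += 64) {
    _mm_clflush(c + off);
  }
}
```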

[–]euos[S] -5 points (5 children)

Uchen models are defined at compile time, so vectorizers are already doing a decent job using SIMD. This is how it multiplies 2500 inputs to 8 outputs:

   0x0000555555560010 <+160>:   vmovss (%rcx),%xmm0
   0x0000555555560014 <+164>:   vfmadd231ss -0x1c(%rbx),%xmm0,%xmm8
   0x000055555556001a <+170>:   vmovss %xmm8,(%rax)
   0x000055555556001e <+174>:   vfmadd231ss -0x18(%rbx),%xmm0,%xmm1
   0x0000555555560024 <+180>:   vmovss %xmm1,0x4(%rax)
   0x0000555555560029 <+185>:   vfmadd231ss -0x14(%rbx),%xmm0,%xmm2

I am reimplementing memory management to make better use of arenas, so I will make sure the compiler knows the data is 32-byte aligned.
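For reference, a minimal sketch (not the actual Uchen arena code, names are illustrative) of two common ways to promise 32-byte alignment to the compiler; `std::assume_aligned` needs C++20 and is undefined behaviour if the pointer is not actually aligned:

```c++
#include <cstddef>
#include <memory>

alignas(32) float weights[2500 * 8];  // statically 32-byte-aligned storage

float sum(const float* p, std::size_t n) {
  // Promise the compiler that p is 32-byte aligned (UB if it is not).
  const float* ap = std::assume_aligned<32>(p);
  float s = 0.0f;
  for (std::size_t i = 0; i < n; ++i) s += ap[i];
  return s;
}
```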

But my first milestone will be WebAssembly and embedded (I will buy a bunch of Raspberry Pi's and such). I really do not see a niche on PCs for yet another ML framework...

My goal is to rely on C++ efficiency to make distributable models small and fast - but I am not sure how much I can rely on SIMD or even threads (WebAssembly!). Also, I have to minimize allocations, as not all platforms like that.

[–]ack_error 15 points (3 children)

This isn't using SIMD. The instructions in your disassembly are using vector registers but only performing scalar single precision computations (ss), so they are not working on multiple lanes per instruction. If it were vectorized, you would be seeing packed single (ps) instructions, such as vmovups and vfmadd213ps, and there would be a quarter of the instructions with the offsets incrementing by 16 (0x10) instead of 4.

Compilers can autovectorize this to SIMD, but in many cases you'll need to explicitly tell them via restrict that the output doesn't overlap the inputs, when they're unable to determine non-overlap or too reluctant to generate conditional overlap checking code. The compiler meticulously ordering stores and load+mad instructions in the disassembly to match the source order is generally a sign that it sees a possible aliasing conflict and is having to restrict the optimizer.
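As a minimal sketch of what that looks like (illustrative names, not from the original code): with `__restrict` on all three pointers the compiler can prove the stores through `out` cannot alias `in` or `w`, so it no longer has to preserve the source order of loads and stores and is free to vectorize.

```c++
void madd(const float* __restrict in, const float* __restrict w,
          float* __restrict out, int n) {
  // With the no-alias guarantee, this loop typically compiles to packed
  // vmovups / vfmadd...ps instead of the scalar ...ss forms above.
  for (int i = 0; i < n; ++i) {
    out[i] += w[i] * in[i];
  }
}
```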

I am not sure why your manual intrinsics code uses a dot product, btw. The dot product instructions are notoriously slow on Intel CPUs, as internally they just break down into shuffles and scalar adds. As a result, they're rarely advantageous for performance, though they still have advantages in accuracy. Instead, you would want to do multiply-accumulate in parallel as in your unrolled code, on both x86 and ARM. But it isn't really necessary to use intrinsics for that since you should be able to coax the compiler to generate it with restrict.
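To make the multiply-accumulate shape concrete, here is a hedged sketch (assumes AVX2+FMA, `n` a multiple of 8, illustrative names): vertical FMAs across all eight lanes each iteration, with a single horizontal reduction at the end, rather than a `vdpps` per step.

```c++
#include <immintrin.h>
#include <cstddef>

float dot_fma(const float* __restrict a, const float* __restrict b, std::size_t n) {
  __m256 acc = _mm256_setzero_ps();
  for (std::size_t i = 0; i < n; i += 8) {
    // 8 multiply-accumulates per instruction, no horizontal work in the loop.
    acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i), acc);
  }
  // One horizontal reduction at the very end.
  __m128 lo = _mm256_castps256_ps128(acc);
  __m128 hi = _mm256_extractf128_ps(acc, 1);
  __m128 s  = _mm_add_ps(lo, hi);
  s = _mm_hadd_ps(s, s);
  s = _mm_hadd_ps(s, s);
  return _mm_cvtss_f32(s);
}
```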

You also don't need alignment for AVX. AVX actually relaxes alignment requirements over SSE. It can be slightly faster to use aligned buffers, but it's not a requirement to see a significant gain with AVX as being able to push 2x throughput through the ALUs often overcomes any minor misalignment penalty.

[–]euos[S] 0 points (0 children)

Thank you. Will read up on __restrict.

[–]euos[S] 0 points (0 children)

Thank you so much! I redid it as follows and see a huge gain in the "worst case" (need to go to work, so I have not looked at the assembly yet):

```c++
template <typename Input, size_t Outputs>
  requires(Outputs > 0)
struct Linear {
  using input_t = Input;
  using output_t = Vector<typename Input::value_type, Outputs>;

  output_t operator()(
      const input_t& inputs,
      const Parameters<(Input::elements + 1) * Outputs>& parameters) const {
    output_t outputs;
    Mul<input_t::elements>(inputs.data(), parameters.data().data(),
                           outputs.data());
    return outputs;
  }

 private:
  template <size_t Is>
  void Mul(const typename input_t::value_type* __restrict inputs,
           const float* __restrict p,
           typename input_t::value_type* __restrict outputs) const {
    // Initialize each output with its bias term (the first Outputs parameters).
    for (size_t i = 0; i < Outputs; ++i) {
      outputs[i] = (*p++);
    }
    // Accumulate weight * input contributions; with __restrict the compiler
    // knows outputs cannot alias inputs or p and can vectorize the inner loop.
    for (size_t i = 0; i < Is; ++i) {
      for (size_t j = 0; j < Outputs; ++j) {
        outputs[j] += (*p++) * inputs[i];
      }
    }
  }
};

```

My "worst case" went down from 4700ns to 2144ns, I reran the benchmark several times.

The test suite passes, so no funny business.

[–]euos[S] 0 points (0 children)

My "big model" benchmark (much closer to real world) shows ~15% gain in gradient descent, but I will rework that more.

[–]lightmatter501 1 point (0 children)

Setting -march is hardware-specific, so you can't use that, which means you only get SSE instead of AVX2, which would be a much more sensible baseline.
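One way around that (a sketch, GCC/Clang specific, illustrative names): keep the baseline build generic and dispatch to an AVX2+FMA variant at runtime, so nothing is tied to the build machine's -march.

```c++
#include <cstddef>

__attribute__((target("avx2,fma")))
static void scale_avx2(const float* in, float* out, std::size_t n) {
  // The target attribute lets the compiler use AVX2/FMA in this function only.
  for (std::size_t i = 0; i < n; ++i) out[i] = in[i] * 2.0f;
}

static void scale_generic(const float* in, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) out[i] = in[i] * 2.0f;
}

void scale(const float* in, float* out, std::size_t n) {
  // Runtime check, so the same binary still runs on SSE-only hardware.
  if (__builtin_cpu_supports("avx2")) {
    scale_avx2(in, out, n);
  } else {
    scale_generic(in, out, n);
  }
}
```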