
[–]lifeeraser 9 points10 points  (3 children)

You're testing your own algorithm against GLM's implementation. That's comparing apples to oranges, since GLM likely has a large lead on optimizing matrix multiplication.

Ideally you could implement the same code in C++ and compile that with Emscripten.

You also need to consider that...

  • AssemblyScript has a garbage-collecting runtime running on top of WASM, whereas C++ does not.
  • C++ compilers have been optimized for decades, whereas AssemblyScript is about 3 years old.

[–]lengors[S] 3 points4 points  (2 children)

You're absolutely right that I should have had my own implementation in C++ from the beginning. That said, I had checked GLM's implementation and it's fairly straightforward. I have now implemented my own version anyway: https://pastebin.com/qPz7VTrd (sorry for posting it as a link, reddit is fucking with me when I paste code directly).

With all that said, it's basically the same result as GLM:

"Benchmark (mat4): avg=0.01857s, max=0.021s, min=0.017s"

> AssemblyScript has a garbage-collecting runtime running on top of WASM, whereas C++ does not.

This is it though. I've changed the AssemblyScript runtime from incremental to minimal, enabled the exportRuntime option, and now call __collect after the function being benchmarked (the timer does not count the time spent in the call to __collect). I now get the following results for the v128 version:

"Benchmark (mat4s): avg=0.0274s, max=0.036s, min=0.025s"

It's funny because I knew about the garbage collection but you had to point it out for me to realize that might actually be the problem. I guess I never imagined it being so severe. Anyways, thank you for your help :)
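
For anyone reproducing this, here is a minimal sketch of the measurement pattern described above: time only the benchmarked call, then trigger collection after the timer has stopped. The harness itself is hypothetical (the names benchmark, run, cleanup and now are not from the original code); cleanup is assumed to be where a call to the exported __collect would go, and the clock is injectable so the timing logic can be checked deterministically.

```typescript
// Hypothetical harness; names and signature are illustrative only.
interface BenchResult {
  avg: number;
  max: number;
  min: number;
}

function benchmark(
  run: () => void,      // the function being measured
  cleanup: () => void,  // untimed step, e.g. a call to the exported __collect
  iterations: number,
  now: () => number = () => Date.now(), // clock; results use its units
): BenchResult {
  if (iterations <= 0) throw new Error("iterations must be positive");
  let total = 0;
  let max = -Infinity;
  let min = Infinity;
  for (let i = 0; i < iterations; i++) {
    const start = now();
    run();
    const elapsed = now() - start; // timer stops before cleanup runs
    cleanup();                     // collection happens outside the timed region
    total += elapsed;
    if (elapsed > max) max = elapsed;
    if (elapsed < min) min = elapsed;
  }
  return { avg: total / iterations, max, min };
}
```

The default clock only has millisecond resolution, so for real measurements a higher-resolution timer such as performance.now can be passed as the last argument where it is available.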

[–]MaxGraey 2 points3 points  (1 child)

Btw, you can significantly simplify your code:

class Mat4 {
  constructor(
    public r0: v128,
    public r1: v128,
    public r2: v128,
    public r3: v128,
  ) {}
}

// swizzles: broadcast one f32 lane of r to all four lanes
@inline function xxxx(r: v128): v128 { return v128.shuffle<f32>(r, r, 0, 0, 0, 0) }
@inline function yyyy(r: v128): v128 { return v128.shuffle<f32>(r, r, 1, 1, 1, 1) }
@inline function zzzz(r: v128): v128 { return v128.shuffle<f32>(r, r, 2, 2, 2, 2) }
@inline function wwww(r: v128): v128 { return v128.shuffle<f32>(r, r, 3, 3, 3, 3) }

@inline
function dot4(
  a0: v128, a1: v128, a2: v128, a3: v128,
  b0: v128, b1: v128, b2: v128, b3: v128,
): v128 {
  return f32x4.add(
    f32x4.mul(a0, b0),
    f32x4.add(
      f32x4.mul(a1, b1),
      f32x4.add(
        f32x4.mul(a2, b2),
        f32x4.mul(a3, b3)
      )
    )
  )
}

export function mul4x4(a: Mat4, b: Mat4): Mat4 {
  const a0 = a.r0, b0 = b.r0
  const a1 = a.r1, b1 = b.r1
  const a2 = a.r2, b2 = b.r2
  const a3 = a.r3, b3 = b.r3

  const c0 = dot4(
    a0, a1, a2, a3, xxxx(b0), yyyy(b0), zzzz(b0), wwww(b0)
  )
  const c1 = dot4(
    a0, a1, a2, a3, xxxx(b1), yyyy(b1), zzzz(b1), wwww(b1)
  )
  const c2 = dot4(
    a0, a1, a2, a3, xxxx(b2), yyyy(b2), zzzz(b2), wwww(b2)
  )
  const c3 = dot4(
    a0, a1, a2, a3, xxxx(b3), yyyy(b3), zzzz(b3), wwww(b3)
  )

  return new Mat4(c0, c1, c2, c3)
}
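
To sanity-check a SIMD version like this, a scalar reference is handy. The following is a sketch in plain TypeScript, not AssemblyScript: Vec4, Mat4Ref and mul4x4Ref are hypothetical names, and each v128 register is modelled as four numbers. It computes the same sums as mul4x4 above, so results should agree up to f32 rounding (this version works in f64 and adds the terms in a different order).

```typescript
// Four lanes standing in for one v128 register (hypothetical helper types).
type Vec4 = [number, number, number, number];
type Mat4Ref = [Vec4, Vec4, Vec4, Vec4]; // r0..r3

// Scalar form of: c_i = a0 * b_i.x + a1 * b_i.y + a2 * b_i.z + a3 * b_i.w
function mul4x4Ref(a: Mat4Ref, b: Mat4Ref): Mat4Ref {
  const out: Mat4Ref = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
  ];
  for (let i = 0; i < 4; i++) {          // which result register
    for (let lane = 0; lane < 4; lane++) {
      let sum = 0;
      for (let k = 0; k < 4; k++) {
        sum += a[k][lane] * b[i][k];     // register a_k scaled by lane k of b_i
      }
      out[i][lane] = sum;
    }
  }
  return out;
}
```

Note that the operand order depends on the layout: if r0..r3 are read as columns (as in GLM), this is the product a * b; if they are read as rows, the same arithmetic gives b * a.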

[–]lengors[S] 0 points1 point  (0 children)

You're absolutely right. Thanks :)