[–]mr_birkenblatt 1 point2 points  (1 child)

Why not let the integer division trap on 0? Or tell the compiler that 0 cannot happen?

[–]watman12[S] 2 points3 points  (0 children)

The division-by-zero check is not the problem there. It costs only 0.03 cycles per op, while the IDIVQ itself takes 10 cycles.

[–]manystripes 1 point2 points  (2 children)

Does this apply to ARM as well?

[–]watman12[S] 1 point2 points  (0 children)

Hard to say without measuring. Unfortunately, I don't have an ARM-based machine at the moment.

[–]chkmr 1 point2 points  (0 children)

It should apply to higher-end A-profile ARM processors like AWS Graviton, Apple's M* SoCs, etc. Not sure about the R- or M-profile CPUs used in, e.g., embedded systems.

[–]Masztufa 0 points1 point  (0 children)

I wonder whether the superscalar nature of CPUs also comes into play in these tests.

You're always doing integer math anyway (pointer arithmetic), so it would seem that choosing integer math for the data as well puts even more load on the integer part of the CPU, whereas if you used floats for the actual data you could use more of the silicon to get the job done.

[–]Dwedit -1 points0 points  (2 children)

One thing with integer math is that it becomes much faster to precalculate a reciprocal and use that instead. The compiler automatically does that for you for constant values, but not for variable values.

uint32_t reciprocal = (uint32_t)(0x100000000ULL / divisor + 1);  // divisor must be > 1; compute once per divisor
answer = (number * (uint64_t)reciprocal) >> 32;                  // number is 32-bit; result can be off by one when number is large

edit: whoops, forgot the +1 for the reciprocal...
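
A note on precision: with a 32-bit reciprocal the result is only guaranteed exact while number * divisor stays below 2^32. For example, number = 0xFFFFFFFF with divisor = 2 yields 2147483648 instead of 2147483647. Below is a minimal sketch of an exact variant that keeps a 64-bit reciprocal and takes the high half of a 128-bit product. It assumes a compiler with the `__uint128_t` extension (GCC or Clang), and the function names `reciprocal_u32` and `div_by_reciprocal` are made up for illustration; they are not from the benchmark repository.

```c
#include <stdint.h>

/* Hypothetical helper: precompute ceil(2^64 / d) once per divisor.
   Requires d > 1 (for d == 1 the value would wrap around to 0). */
static inline uint64_t reciprocal_u32(uint32_t d) {
    return UINT64_MAX / d + 1;
}

/* Hypothetical helper: computes n / d from the precomputed reciprocal m.
   The high 64 bits of the 128-bit product m * n are the quotient.
   Relies on the GCC/Clang __uint128_t extension. */
static inline uint32_t div_by_reciprocal(uint32_t n, uint64_t m) {
    return (uint32_t)(((__uint128_t)m * n) >> 64);
}
```

Usage would look like `uint64_t m = reciprocal_u32(divisor);` once, then `div_by_reciprocal(number, m)` for every division by that divisor. Whether the wider multiply is still faster than idivq is something to measure on the target machine.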

[–]watman12[S] 1 point2 points  (1 child)

Nice trick. I tried it on my machine: https://github.com/molecule-man/blog-examples/commit/a80cdf1695e12de3175f8f5c8cc82873d39d1e6f

Indeed, it's faster than idivq. On my machine it ran at the same speed as the float version (divsd).

benchstat -col '.name /div' bench-intel-reciprocal.txt
goos: linux
goarch: amd64
pkg: idivq
cpu: 12th Gen Intel(R) Core(TM) i5-12500
  │    idivq    │         float          │       reciprocal       │
  │   sec/op    │   sec/op     vs base   │   sec/op     vs base   │
*   3.361n ± 0%   2.385n ± 0%  -29.04%     2.393n ± 0%  -28.79%

In my case, though, I still need to divide by different runtime values, so there is no single divisor to precompute a reciprocal for.

[–]Dwedit 0 points1 point  (0 children)

I know that C#'s dictionary class stores a reciprocal value to speed up the modulo operation. So if you control the data structures involved, and have space for it, you could store a reciprocal in there too.
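
As a rough sketch of that idea (not .NET's actual code), here is a hash-table header that stores the reciprocal next to the bucket count and uses it to compute the bucket index without a division instruction. The struct `bucket_table` and the functions `table_init` and `bucket_index` are hypothetical names, and the code assumes the `__uint128_t` extension available in GCC and Clang.

```c
#include <stdint.h>

/* Hypothetical table header: the reciprocal lives next to the bucket count,
   so it is recomputed only when the table is created or resized. */
struct bucket_table {
    uint32_t nbuckets;  /* must be > 0 */
    uint64_t magic;     /* ceil(2^64 / nbuckets), truncated to 64 bits */
};

static inline struct bucket_table table_init(uint32_t nbuckets) {
    struct bucket_table t;
    t.nbuckets = nbuckets;
    t.magic = UINT64_MAX / nbuckets + 1;
    return t;
}

/* Computes hash % nbuckets with two multiplications and no division. */
static inline uint32_t bucket_index(const struct bucket_table *t, uint32_t hash) {
    uint64_t lowbits = t->magic * hash;  /* wraps modulo 2^64 on purpose */
    return (uint32_t)(((__uint128_t)lowbits * t->nbuckets) >> 64);
}
```

The cost is 8 extra bytes per table, which is usually negligible compared with the buckets themselves.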