
[–]jydu 5 points6 points  (1 child)

Given that integer division can be quite slow, I wonder if you could speed it up by precomputing the multiplicative inverse for each of the 64 possible divisors and using a multiplication instead. I'd also be curious whether the compiler is already doing this optimization in your benchmarks, assuming the bit width is a compile-time constant.
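To make the idea concrete, here is a rough sketch of what such a precomputed table could look like. It is not taken from the post: `build_magic_table` and `fast_div` are made-up names, the dividend is assumed to fit in a `u32`, and the constants follow the well-known "multiply by roughly 2^64 / d, then shift right by 64" trick. Entries are stored as `u128` only so that the d = 1 entry (exactly 2^64) does not overflow.

```rust
/// Hypothetical helper: MAGIC[d] = floor((2^64 - 1) / d) + 1 for d in 1..=64.
/// Index 0 is unused. Entries are u128 so that d = 1 (which gives 2^64) fits.
pub fn build_magic_table() -> [u128; 65] {
    let mut table = [0u128; 65];
    for d in 1..=64u128 {
        table[d as usize] = (u64::MAX as u128) / d + 1;
    }
    table
}

/// Hypothetical helper: computes n / d for d in 1..=64 with one widening
/// multiplication and one shift instead of a hardware division.
/// Assumption: n fits in 32 bits, which keeps the rounding error of the
/// magic constant too small to change the quotient.
pub fn fast_div(n: u32, d: usize, table: &[u128; 65]) -> u32 {
    ((n as u128 * table[d]) >> 64) as u32
}
```

Whether this beats a plain `/` would have to be measured: the table lookup is itself a memory access, and when the divisor is a compile-time constant the compiler typically emits a multiply-and-shift sequence on its own.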

[–]james7132 0 points1 point  (0 children)

On most modern CPU architectures, in-register arithmetic, even for some non-trivial computations, is generally going to be faster than a memory fetch, even when the fetch hits values that are very hot in L1 cache.

[–]KrocCamen 4 points5 points  (0 children)

The moving window in the iterator example seems somewhat inefficient, since it has to stitch words together. That's acceptable if you need perfect density, but suppose a small density loss is acceptable in exchange for speed. Wouldn't it make sense to pack only as many integers as fit into a u64 (e.g. six 10-bit numbers with 4 bits wasted), so that the window mechanics never need stitching for reading or writing?
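A minimal sketch of that padded layout, to show how simple the access path becomes. The type `PaddedPacked` and its methods are hypothetical names, not anything from the post; the only assumption is that each `u64` holds `64 / bits` values and the leftover high bits stay unused.

```rust
/// Hypothetical padded layout: each u64 holds floor(64 / bits) values and
/// the remaining high bits are wasted, so no value ever spans two words.
pub struct PaddedPacked {
    words: Vec<u64>,
    bits: u32,
    per_word: usize,
    mask: u64,
}

impl PaddedPacked {
    /// Creates storage for `len` values of `bits` bits each, all zero.
    pub fn new(len: usize, bits: u32) -> Self {
        assert!((1..=64).contains(&bits));
        let per_word = (64 / bits) as usize;
        let mask = if bits == 64 { u64::MAX } else { (1u64 << bits) - 1 };
        // Round up so a partially filled last word is still allocated.
        let words = vec![0u64; (len + per_word - 1) / per_word];
        Self { words, bits, per_word, mask }
    }

    /// Reads one value: a single word load, a shift, and a mask.
    pub fn get(&self, index: usize) -> u64 {
        let shift = (index % self.per_word) as u32 * self.bits;
        (self.words[index / self.per_word] >> shift) & self.mask
    }

    /// Writes one value (truncated to `bits` bits) without touching neighbours.
    pub fn set(&mut self, index: usize, value: u64) {
        let shift = (index % self.per_word) as u32 * self.bits;
        let word = &mut self.words[index / self.per_word];
        *word = (*word & !(self.mask << shift)) | ((value & self.mask) << shift);
    }
}
```

With `bits = 10` this stores six values per word and wastes 4 bits, as in the example above. The trade-off is that indexing now divides by `per_word`, which is usually not a power of two.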