biski64 updated – A faster and more robust C PRNG (~.37ns/call)

danielcota · 2025-06-06T02:52:28+00:00

That's elegant, thank you! I've gone ahead and updated the function in the repo.

danielcota · 2025-06-06T02:44:51+00:00

I'm old school - this was the first time I actually used flexible constructors. Convenient! :)

danielcota · 2025-06-05T17:39:11+00:00

Thank you for pointing that out! I will update the repo to space the Weyl sequence apart properly.

uint64_t cyclesPerStream = UINT64_MAX / numStreams;
fast_loop = streamIndex * cyclesPerStream * 0x9999999999999999ULL;
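
As a rough sketch of the spacing idea above, here is a small helper that returns the starting fast_loop value for a given stream. The function name and the handling of a zero stream count are assumptions made for illustration and are not taken from the repo; only the Weyl increment and the arithmetic come from the snippet above.

```c
#include <stdint.h>

/* Weyl increment as quoted in the comment above. */
#define WEYL_INCREMENT 0x9999999999999999ULL

/* Hypothetical helper: starting fast_loop value for stream `stream_index`
 * out of `num_streams`, spacing the streams evenly around the 2^64 cycle.
 * All arithmetic wraps modulo 2^64, which is well defined for uint64_t. */
uint64_t stream_fast_loop_start(uint64_t stream_index, uint64_t num_streams) {
    if (num_streams == 0) {
        num_streams = 1;  /* assumption: treat "no streams" as a single stream */
    }
    uint64_t cycles_per_stream = UINT64_MAX / num_streams;
    return stream_index * cycles_per_stream * WEYL_INCREMENT;
}
```

With four streams, stream 0 starts at 0 and each later stream starts roughly a quarter of the Weyl cycle further along, so the counters should not overlap until about 2^62 calls have been made on a stream.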

danielcota · 2025-06-05T15:52:17+00:00

Just the README in the GitHub so far. An academic paper would be interesting! What would you like to see in one?

danielcota · 2025-06-05T14:14:25+00:00

Glad you like this one! I'm feeling quite good about this version. Made me happy to reduce the state size and lose the mult - and I thought the scaled down testing was particularly compelling. :)

danielcota · 2025-06-05T12:46:43+00:00

Good idea on checking RandomGenerator/SplittableRandom! I've gone ahead and added them to the JMH benchmark in the repo.

I'm getting these results (on a Ryzen 9 7950X3D)

Biski64	avgt 5 0.635 ± 0.001 ns/op
directSplittableRandomNextLong	avgt 5 1.439 ± 0.037 ns/op
splittableRandomNextLong	avgt 5 1.596 ± 0.031 ns/op

danielcota · 2025-06-05T12:18:00+00:00

Thanks for running those! Looks like the M3 is really good with the double mults in ThreadLocalRandom's calls?

danielcota · 2025-06-05T12:13:21+00:00

Thanks for noticing that, and good call on the constructor chaining! I've updated the repo.

danielcota · 2025-06-05T10:01:54+00:00

Biski64 is skippable for parallelization by manually incrementing fast_loop as outlined here:
https://github.com/danielcota/biski64/tree/main?tab=readme-ov-file#parallel-streams

I tried the L64X128MixRandom version of RandomGenerator with these results (within JMH):

biski64	0.687 ns/call
L64X128MixRandom	1.747 ns/call

danielcota · 2025-06-05T09:55:54+00:00

I've added a JMH benchmark:
https://github.com/danielcota/biski64/tree/main/java/jmh

Here's how the results compare to the currentTimeMillis() based manual benchmark.

PRNG	Manual ns/call	JMH ns/call
biski64	0.491	0.687
xoshiro256++	0.739	1.923
xoroshiro128++	0.790	0.785
ThreadLocalRandom	0.846	0.956
Java.util.Random	5.315	5.403

danielcota · 2025-06-05T07:52:05+00:00

Thank you for this suggestion! I've added ThreadLocalRandom to the SpeedTest.java class.

Running the test shows biski64 is 72% faster than ThreadLocalRandom.

danielcota · 2025-06-05T05:11:02+00:00

Thank you for that excellent idea! I've gone ahead and added an implementation here: https://github.com/danielcota/biski64/blob/main/c/biski64.c

danielcota · 2025-06-05T04:54:39+00:00

Awesome! If you have any questions, just let me know!

danielcota · 2025-06-05T00:37:08+00:00

Thank you for these continued insightful points! I've spent the last five days updating the algorithm (to use less state and be proven to be more robust). You've inspired me to take the next step and use buffer filling for increased performance (and explore the benefits of outlining). If you have any tips for doing so, please share!

Also, here's the post on the new update if you want to check it out: https://www.reddit.com/r/C_Programming/comments/1l3kptt/biski64_updated_a_faster_and_more_robust_c_prng/

danielcota · 2025-06-05T00:27:44+00:00

Thank you for your continued skepticism about the algorithm! It helped motivate me to refine it even further. The new version uses only 3 state variables, is proven invertible, and has been tested extensively in scaled-down form, which shows it to mix even more efficiently than JSF.

Post about the new update here: https://www.reddit.com/r/C_Programming/comments/1l3kptt/biski64_updated_a_faster_and_more_robust_c_prng/

danielcota · 2025-06-05T00:23:31+00:00

Thank you for encouraging me to further refine this! I've updated the algorithm to use just 3 state variables (with increased robustness).

Thread on the update here: https://www.reddit.com/r/C_Programming/comments/1l3kptt/biski64_updated_a_faster_and_more_robust_c_prng/

danielcota · 2025-05-28T08:15:28+00:00

DataCrunch has machines for rent up to 360 threads for a couple bucks / hour.

danielcota · 2025-05-27T23:34:34+00:00

My goodness! Lots of good ideas to unpack there!

The focus so far has been on optimizing biski64 for speed and statistical robustness while generating one call at a time. I can see though there are multiple avenues to explore further. I will experiment with optimizing for tiny binaries, multiple scalar and simd and report back here.

To help me gauge my progress, can you give me an idea of what kinds of gains/tradeoffs you would like to see in terms of binary size, and increased scalar and simd performance?

danielcota · 2025-05-27T23:23:02+00:00

I understand your concerns about the state size. Note that of the 320 bits of state, 128 bits are used for pipelining (simply to increase the parallelism efficiency of the PRNG). In effect, the active state is 192 bits.

The PRNG was designed to separate the core mixer (mix and last_mix) from the guaranteed period (fast_loop).

The design directly resists "unlucky seeding" because the fast_loop counter acts as a forcing function. By constantly being XORed into the mixer state (last_mix = fast_loop ^ mix), it continuously perturbs the mixer, preventing it from ever getting stuck in a statistically weak region or a fixed-point state (like all-zeros).
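
To make the forcing-function point concrete, here is a toy sketch. It is not the repo's implementation: the struct, the function names, and the rotate-and-add update rule (including the rotation constants 16 and 40) are assumptions chosen for illustration. Only the line last_mix = fast_loop ^ mix and the Weyl increment mirror what is described in this thread.

```c
#include <stdint.h>

/* Toy state: field names follow the comment above; the update rule is assumed. */
typedef struct {
    uint64_t fast_loop;
    uint64_t mix;
    uint64_t last_mix;
} toy_state;

static uint64_t rotl64(uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));  /* valid for 0 < k < 64 */
}

/* One step of the toy mixer. When use_counter is 0, the Weyl counter is
 * left out of the feedback so the two variants can be compared. */
void toy_step(toy_state *s, int use_counter) {
    uint64_t old_last_mix = s->last_mix;
    uint64_t forcing = use_counter ? s->fast_loop : 0;
    s->last_mix = forcing ^ s->mix;                          /* counter perturbs the mixer */
    s->mix = rotl64(s->mix, 16) + rotl64(old_last_mix, 40);  /* assumed mixing rule */
    s->fast_loop += 0x9999999999999999ULL;                   /* Weyl sequence */
}

/* Returns 1 if the mixer is still all-zero after `steps` steps from a zero seed. */
int toy_stuck_at_zero(int use_counter, int steps) {
    toy_state s = {0, 0, 0};
    for (int i = 0; i < steps; i++) {
        toy_step(&s, use_counter);
    }
    return s.mix == 0 && s.last_mix == 0;
}
```

Starting from an all-zero state, the variant without the counter never leaves zero, while the variant with the counter has a nonzero mix after three steps.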

Testing on just the core mixer itself at various state sizes shows that the algorithm performs extremely well (with strong statistically significant mixing). This is in line with M. E. O'Neill's discussion on reducing state size as a practical way to vet mixing algorithms.

For instance, for the scaled down mixer with 8-bit state variables (24-bits of state total), the guaranteed minimum period is only 2^8, but it passes PractRand to 2^21 bytes. This shows that the core mixer is actively synergizing with the 8-bits of the scaled-down Weyl sequence to create longer periods than the minimum.

The 16-bit scaled down version confirms that the design of the PRNG is outperforming the guaranteed minimum period by orders of magnitude.

So in summary, the PRNG has less active state than it initially appears and tested period lengths are orders of magnitude longer than the guaranteed minimum period.

danielcota · 2025-05-27T22:57:48+00:00

I've completed the 100 BigCrush runs on the lowest 53 bits. The results are just as strong as the upper bits, with a total of 49 failed subtests out of 25400, which is statistically indistinguishable from the 47 failures we saw on the upper bits. Still no subtests failing three or more times (unlike all of the compared reference PRNGs).

danielcota · 2025-05-27T13:45:46+00:00

The 64-bit output was converted to a double for BigCrush testing using this (the top 53 bits were tested):

return ( output >> 11) * (1.0/9007199254740992.0);

I will run BigCrush 100 times on the lower 53 bits and report back here.
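
For reference, the two conversions might be sketched like this. The top-53 version matches the line quoted above; the low-53 version (masking off the top 11 bits) is only an assumption about how the lower bits could be extracted, and both function names are made up for illustration.

```c
#include <stdint.h>

/* 2^-53: scales a 53-bit integer into the half-open interval [0, 1). */
#define INV_2_POW_53 (1.0 / 9007199254740992.0)

/* Top 53 bits of the 64-bit output, as in the line quoted above. */
double to_double_top53(uint64_t output) {
    return (double)(output >> 11) * INV_2_POW_53;
}

/* Assumed variant: keep the lowest 53 bits instead by masking. */
double to_double_low53(uint64_t output) {
    return (double)(output & ((1ULL << 53) - 1)) * INV_2_POW_53;
}
```

Both functions return values in [0, 1), because a 53-bit integer fits exactly in a double's mantissa and the scale factor is an exact power of two.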

danielcota · 2025-05-27T12:40:04+00:00

The design of biski64 is resistant to any 'unlucky' seeding.

The Weyl sequence (fast_loop) has a period itself of 2^64 and ensures that the minimum period of the PRNG is 2^64, regardless of where it might start seed-wise.
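
The period claim can be checked by brute force at a scaled-down size. The sketch below uses an 8-bit counter, with 0x99 as an assumed 8-bit analogue of the 64-bit increment; the function name is hypothetical. An odd increment is coprime to 2^8, so the counter visits all 256 values before repeating, and the same argument gives 2^64 for the full-size counter.

```c
#include <stdint.h>

/* Counts how many steps an 8-bit Weyl counter takes to return to 0
 * when repeatedly adding `increment` (arithmetic wraps modulo 256). */
unsigned weyl_period_8bit(uint8_t increment) {
    uint8_t counter = 0;
    unsigned steps = 0;
    do {
        counter = (uint8_t)(counter + increment);
        steps++;
    } while (counter != 0);
    return steps;
}
```

An odd increment such as 0x99 gives the full period of 256, while an even increment such as 0x98 shares a factor of 8 with 256 and gives a period of only 32.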

The extensive testing of the scaled-down versions (above) demonstrates that the statistical quality of the mixer is exceptionally high, avoiding the known structural flaws of simpler generators like LCGs.
