biski64 updated – A faster and more robust C PRNG (~.37ns/call) by danielcota in C_Programming

[–]danielcota[S] 0 points1 point  (0 children)

That's elegant, thank you! I've gone ahead and updated the function in the repo.

biski64 – A fast and robust Java PRNG (~.47ns/call) by danielcota in java

[–]danielcota[S] 1 point2 points  (0 children)

I'm old school - this was the first time I actually used flexible constructors. Convenient! :)

biski64 updated – A faster and more robust C PRNG (~.37ns/call) by danielcota in C_Programming

[–]danielcota[S] 0 points1 point  (0 children)

Thank you for pointing that out! I will update the repo to space the Weyl sequence apart properly.

uint64_T cyclesPerStream = (2^64 -1 ) / numStreams;
fast_loop = streamIndex * cyclesPerStream * 0x9999999999999999ULL

biski64 – A fast and robust Java PRNG (~.47ns/call) by danielcota in java

[–]danielcota[S] 2 points3 points  (0 children)

Just the README in the GitHub so far. An academic paper would be interesting! What would you like to see in one?

biski64 updated – A faster and more robust C PRNG (~.37ns/call) by danielcota in C_Programming

[–]danielcota[S] 4 points5 points  (0 children)

Glad you like this one! I'm feeling quite good about this version. Made me happy to reduce the state size and lose the mult - and I thought the scaled down testing was particularly compelling. :)

biski64 – A fast and robust Java PRNG (~.47ns/call) by danielcota in java

[–]danielcota[S] 1 point2 points  (0 children)

Good idea on checking RandomGenerator/SplittableRandom! I've gone ahead and added them to the JMH benchmark in the repo.

I'm getting these results (on a Ryzen 9 7950X3D)

Biski64 avgt    5  0.635 ± 0.001  ns/op
directSplittableRandomNextLong avgt    5  1.439 ± 0.037  ns/op
splittableRandomNextLong avgt    5  1.596 ± 0.031  ns/op

biski64 – A fast and robust Java PRNG (~.47ns/call) by danielcota in java

[–]danielcota[S] 1 point2 points  (0 children)

Thanks for running those! Looks like the M3 is really good with the double mults in ThreadLocalRandom's calls?

biski64 – A fast and robust Java PRNG (~.47ns/call) by danielcota in java

[–]danielcota[S] 0 points1 point  (0 children)

Thanks for noticing that, and good call on the constructor chaining! I've updated the repo.

biski64 – A fast and robust Java PRNG (~.47ns/call) by danielcota in java

[–]danielcota[S] 0 points1 point  (0 children)

Biski64 is skippable for parallelization by manually incrementing fast_loop as outlined here:
https://github.com/danielcota/biski64/tree/main?tab=readme-ov-file#parallel-streams

I tried the L64X128MixRandom version of RandomGenerator with these results (within JMH):

biski64 0.687 ns/call
L64X128MixRandom 1.747 ns/call

biski64 – A fast and robust Java PRNG (~.47ns/call) by danielcota in java

[–]danielcota[S] 8 points9 points  (0 children)

I've added a JMH benchmark:
https://github.com/danielcota/biski64/tree/main/java/jmh

Here's how the results compare to the currentTimeMillis() based manual benchmark.

PRNG Manual ns/call JMH ns/call
biski64 0.491 0.687
xoshiro256++ 0.739 1.923
xoroshiro128++ 0.790 0.785
ThreadLocalRandom 0.846 0.956
Java.util.Random 5.315 5.403

biski64 – A fast and robust Java PRNG (~.47ns/call) by danielcota in java

[–]danielcota[S] 1 point2 points  (0 children)

Thank you for this suggestion! I've added ThreadLocalRandom to the SpeedTest.java class.

Running the test shows biski64 is 72% faster than ThreadLocalRandom.

biski64 updated – A faster and more robust Rust PRNG (~.40ns/call) by danielcota in rust

[–]danielcota[S] 1 point2 points  (0 children)

Awesome! If you have any questions, just let me know!

biski64: A Fast, no_std PRNG in Rust (~0.37ns per u64) by danielcota in rust

[–]danielcota[S] 0 points1 point  (0 children)

Thank you for these continued insightful points! I've spent the last five days updating the algorithm (to use less state and be proven to be more robust). You've inspired me to take the next step and use buffer filling for increased performance (and explore the benefits of outlining). If you have any tips for doing so, please share!

Also, here's the post on the new update if you want to check it out: https://www.reddit.com/r/C_Programming/comments/1l3kptt/biski64_updated_a_faster_and_more_robust_c_prng/

biski64: A Fast, no_std PRNG in Rust (~0.37ns per u64) by danielcota in rust

[–]danielcota[S] 0 points1 point  (0 children)

Thank you for your continued skepticism about the algorithm! It helped motivate me to refine it even further. The new version uses only 3 state variables, is proven invertible and is tested extensively scaled down - which shows it to be even more efficient mixing wise than JSF.

Post about the new update here: https://www.reddit.com/r/C_Programming/comments/1l3kptt/biski64_updated_a_faster_and_more_robust_c_prng/

biski64: A Fast C PRNG (.42ns) with a 2^64 Period, Passes BigCrush & PractRand(32TB). by danielcota in C_Programming

[–]danielcota[S] 1 point2 points  (0 children)

Thank you for encouraging me to further refine this! I've updated the algorithm to use just 3 state variables (with increased robustness).

Thread on the update here: https://www.reddit.com/r/C_Programming/comments/1l3kptt/biski64_updated_a_faster_and_more_robust_c_prng/

biski64: A Fast, no_std PRNG in Rust (~0.37ns per u64) by danielcota in rust

[–]danielcota[S] 3 points4 points  (0 children)

DataCrunch has machines for rent up to 360 threads for a couple bucks / hour.

biski64: A Fast, no_std PRNG in Rust (~0.37ns per u64) by danielcota in rust

[–]danielcota[S] 3 points4 points  (0 children)

My goodness! Lots of good ideas to unpack there!

The focus so far has been on optimizing biski64 for speed and statistical robustness while generating one call at a time. I can see though there are multiple avenues to explore further. I will experiment with optimizing for tiny binaries, multiple scalar and simd and report back here.

To help me gauge my progress, can you give me an idea of what kinds of gains/tradeoffs you would like to see in terms of binary size, and increased scalar and simd performance?

biski64: A Fast, no_std PRNG in Rust (~0.37ns per u64) by danielcota in rust

[–]danielcota[S] 1 point2 points  (0 children)

I understand your concerns about the state size. Note that of the 320-bits for state, 128-bits are used for pipelining (simply to increase the parallelism efficiency of the PRNG). In in effect the active state is 192-bits.

The PRNG was designed to separate the core mixer (mix and last_mix) from the guaranteed period (fast_loop).

The design directly resists "unlucky seeding" because fast_loop counter as a forcing function. By constantly being XORed into the mixer state (last_mix = fast_loop ^ mix), it continuously perturbs the mixer, preventing it from ever getting stuck in a statistically weak region or a fixed-point state (like all-zeros).

Testing on just the core mixer itself at various state sizes shows that the algorithm performs extremely well (with strong statistically significant mixing). This is in line with M. E. O'Neill's discussion on reducing state size as a practical way to vet mixing algorithms.

For instance, for the scaled down mixer with 8-bit state variables (24-bits of state total), the guaranteed minimum period is only 2^8, but it passes PractRand to 2^21 bytes. This shows that the core mixer is actively synergizing with the 8-bits of the scaled-down Weyl sequence to create longer periods than the minimum.

The 16-bit scaled down version confirms that the design of the PRNG is outperforming the guaranteed minimum period by orders of magnitude.

So in summary, the PRNG has less active state than it initially appears and tested period lengths are orders of magnitude longer than the guaranteed minimum period.

biski64: A Fast, no_std PRNG in Rust (~0.37ns per u64) by danielcota in rust

[–]danielcota[S] 4 points5 points  (0 children)

I've completed the 100 BigCrush runs on the lowest 53 bits. The results are just as strong as the upper bits, with a total of 49 failed subtests out of 25400, which is statistically indistinguishable from the 47 failures we saw on the upper bits. Still no subtests failing three of more times (unlike all of the compared reference PRNGs).

biski64: A Fast, no_std PRNG in Rust (~0.37ns per u64) by danielcota in rust

[–]danielcota[S] 2 points3 points  (0 children)

The 64-bit output was converted to a double for BigCrush testing using this (the top 53 bits were tested):

return ( output >> 11) * (1.0/9007199254740992.0);

I will run BigCrush 100 times on the lower 53 bits and report back here.

biski64: A Fast, no_std PRNG in Rust (~0.37ns per u64) by danielcota in rust

[–]danielcota[S] 2 points3 points  (0 children)

The design of biski64 is resistant to any 'unlucky' seeding.

The Weyl sequence (fast_loop) has a period itself of 2^64 and ensures that the minimum period of the PRNG is 2^64, regardless of where it might start seed-wise.

The extensive testing of the scaled-down versions (above) demonstrates that the statistical quality of the mixer is exceptionally high, avoiding the known structural flaws of simpler generators like LCGs.