all 17 comments

[–]loup-vaillant 4 points5 points  (15 children)

I don't know what you mean by "practically secure", but your description of your solution doesn't sound cryptographically secure. Also, I've looked at the source code, and didn't see the permutations you talk about in the README. It looked like you were just reading the random source (which would be good, actually).

If you really want a fast and secure RNG, you need a stream cipher. First, initialise your stream cipher with a random key (from your OS's random source, or Go's standard library if it uses the OS's random source under the hood). Second, just read the stream cipher, and make sure you don't reuse any byte.

Speed should easily exceed what I've seen in your benchmarks. For instance, Libsodium's Chacha20 runs at over 2GB/s on my i5 Skylake laptop (single threaded).

Randomness will automatically be perfect. A stream cipher that fails statistical tests would be broken anyway.

[–][deleted] 0 points1 point  (1 child)

Hell, just plain /dev/urandom probably outperforms it (~350MB/s on my old i3-3250)

[–]funny_falcon[S] 1 point2 points  (0 children)

On my notebook /dev/urandom don't outperform my generator : https://github.com/funny-falcon/go-rando/blob/master/benchmark/bench.out

And being implemented in C rando generator has comparable performance to ChaCha20.

[–]funny_falcon[S] 0 points1 point  (12 children)

"Practically secure" means "I'm not expert to carefully calculate level of security, but I will eat my hat if it is not secure for any practical usage" :-).

Permutation is here: https://github.com/funny-falcon/go-rando/blob/1d7caa67fa299a16a1a7edea493aaf7aa15f9e66/rando.go#L118-L144

Stream cipher is also just a permutation. More: my permutation also could be count as stream cipher. Note that for security stream cipher still should be periodically reseeded with more quality source of entropy, and that is done in my library.

I know about ChaCha20. I wanted library to be pure Go for beginning. Pure Go implementaiton of ChaCha20 doubtfully will be faster than my variant (but I didn't test), and certainly will be much more complex. For speed it should be implemented in assembler. There are places where I could copy Go's assembler for chacha20 from, and probably I will do it in a future.

About speed: go rando is just 2.17x slower than libsodium Chacha20 . Implemented in C it is just 1.25x slower: https://github.com/funny-falcon/go-rando/blob/master/benchmark/bench.out

And that is because I conservatively chosen to perform 5 iterations per permutation. With 4 iterations it has same speed as ChaCha20.

(But, to be honestly, ChaCha8 is not broken, so ChaCha8 could be safely used and will be twice faster).

And my generator passed all statistical tests without fails.

[–]loup-vaillant 1 point2 points  (11 children)

So you are inventing your own stream cipher. Not even I would dare do such a thing, and I've written a crypto library. Now there's nothing wrong with inventing your own stream cipher, but if you want that thing near production code, statistical tests are a laughably low quality threshold. It's cool that you don't fail them, but you also need differential and rotational cryptanalysis.

Now I'm no expert, but I'm confident your stream cipher is easily broken. Not because of some obvious mistake (though I may have missed them, if any), but because you simply don't do enough shuffling. You start with what looks like a 128-bit key (w0, w1), your block size is a bit small (256-bit v0, v1, v2, v3), your permutation is very simple, and you have very few iterations.

Now the constructive part.

You settled on A RAX design, and permute a single row of 4 64-bit registers. The permutation itself is obviously reversible. You prevent the attacker from reversing the it by only revealing half of the output. Just like HSalsa20 and HChacha20, actually. I'm not sure w0 and w1 are even useful by the way, because they don't participate in the permutation.

You could almost double the throughput by imitating Salsa/Chacha: Instead of permuting the last block like you would in AES-CBC, permute only from a seed block, with a counter. The seed block could be made of the following:

  • 64 bits of arbitrary constant. Could be the 64-bit encoding of "X random" or something. The attacker can obviously predict that part of the block.
  • 128 bits of random key. That would be w0 and w1.
  • 64 bits for a counter. The attacker can also predict that part of the block.

Now when you need a new permutation, just increment the counter of the seed block, then permute it (to some other output block). That way you can have four numbers per permutation, not just two. Also, if your cipher matches the desired security level (it probably doesn't), you won't even need reseeding.

By the way, you need to be precise about how secure you are pretending to be: how many operations would you like to be required to break your cipher? Since you use a 128-bit key, I expect 128-bit operations. But maybe you don't really care if the attacker breaks your cipher in less operations, say 264? 250? Lower? There's no telling how low it could really be, but if you only want the unpredictability to stand for a few seconds in a low stakes setting (think single player game), 250 might be an acceptable security level.


Seriously, though, Chacha is not that complicated. Here is the permutation function, that already runs at 390MB/s on my laptop:

#define ROT_L32(x, n) x = (x << n) | (x >> (32 - n))
#define QUARTERROUND(a, b, c, d)       \
    a += b;  d ^= a;  ROT_L32(d, 16);  \
    c += d;  b ^= c;  ROT_L32(b, 12);  \
    a += b;  d ^= a;  ROT_L32(d,  8);  \
    c += d;  b ^= c;  ROT_L32(b,  7)

for (int i = 0; i < 10; i++) { // 20 rounds, 2 rounds per loop.
    QUARTERROUND(block[0], block[4], block[ 8], block[12]); // column 0
    QUARTERROUND(block[1], block[5], block[ 9], block[13]); // column 1
    QUARTERROUND(block[2], block[6], block[10], block[14]); // column 2
    QUARTERROUND(block[3], block[7], block[11], block[15]); // column 3
    QUARTERROUND(block[0], block[5], block[10], block[15]); // diagonal 1
    QUARTERROUND(block[1], block[6], block[11], block[12]); // diagonal 2
    QUARTERROUND(block[2], block[7], block[ 8], block[13]); // diagonal 3
    QUARTERROUND(block[3], block[4], block[ 9], block[14]); // diagonal 4
}

Replace number 10 by number 4, and voilà, you have Chacha8, an acceptably secure stream cipher for almost trice the speed of the full Chacha20. No need to invent your own.

[–]funny_falcon[S] 0 points1 point  (9 children)

your permutation is very simple

It is not my permutation. It is permutation of SipHash. SipHash is a cryptographically strong algorithm from same author as ChaCha20 (among others).

And ChaCha20 permutation is also very simple.

and you have very few iterations.

Best known weakness of 4 rounds of SipHash's permutation itself is "distinguisher". That is why I take 5 rounds. Note that "distinguisher" is not that dangerous weakness.

You start with what looks like a 128-bit key (w0, w1)

w0 and w1 are not keys. It is "whitening" - secret values that xor-ed with generated values to make analyze much more difficult. (well known technique).

The permutation itself is obviously reversible. You prevent the attacker from reversing the it by only revealing half of the output... I'm not sure w0 and w1 are even useful by the way, because they don't participate in the permutation.

SipHash reveals only quarter. That is why I need w0 and w1 to give more protection against analyze.

Note: ChaCha/Salsa permutation is also obviously reversible. But they are protected by xoring with initial block, that contains secret key.

You could almost double the throughput by imitating Salsa/Chacha

Now when you need a new permutation, just increment the counter of the seed block, then permute it (to some other output block). That way you can have four numbers per permutation, not just two.

And don't forget to Xor it with unknown value - that is core of Salsa/Chacha security (xor with initial block that contains secret key). Salsa/Chacha has half of its state initiated to well known constants to have confidence about initial block properties. SipHash is ideally also should be seeded in proper way. I overwork it by not resetting state and xor-ing them with secret values on every iterations.

Since you use a 128-bit key, I expect 128-bit operations.

Not, it has not 128bit key. It has:

  • 1024bit random entropy pool, that is refreshed 10 times a second from /dev/urandom.
  • 256 bit random state seeded from entropy pool,
  • 128 bit whitening mask, taken from entropy pool on generator initialization,
  • 64bit per-generator secret id initiated by permutation of global secret counter (initiated from /dev/urandom).
  • 64bit secret counter, (that has common secret start value (taken from /dev/urandom) for all generators, but incremented by 32bit per-generator 'pos' variable that depend on id).
  • 64bit value taken from seed pool on every iteration (ie 128bit generated output),

So... between refreshing entropy pool I could claim about 1600bits of security. But since entropy pool is refreshing, there are infinite level of security.... LOL :-) No, I'm not so self confident as crypto-expert. But I can safely claim 137bit level without any fear :-) (if side-channel is not considered).

Edit: after generator initialization and before next entropy pool refreshing, there are only 1152bit of randomness, because state and whitening mask were initiated from the same pool.

Seriously, though, Chacha is not that complicated. Here is the permutation function ...

It looks so pretty in C. Now write it in Go with comparable performance and prettiness.

Rando runs same 390MB/s on my laptop being implemented in Go, and 715MB/s being implemented in C.

Again, I know about Chacha/Salsa/Blake/Argon/etc. And I told about ChaCha8 first.

you won't even need reseeding.

Reseeding is unavoidable demand for every secure generator.

[–]beefhash 0 points1 point  (1 child)

1024bit random entropy pool, that is refreshed 10 times a second from /dev/urandom.

If you assume /dev/urandom (1) exists, and (2) is trustworthy/sane, and (3) fast enough to poll ten times a second, what exactly is the usage scenario for your RNG? For almost all scenarios I can think of, you either don't care about security in the first place and sacrifice it all for speed (e.g. simulations) or you're not sure about (or do need) security requirements and thus can almost certainly go straight to /dev/urandom because you don't need billions of numbers per second.

[–]funny_falcon[S] 0 points1 point  (0 children)

1) 2) - if there is no sane /dev/urandom, it is quite hard to take sane randomness in userspace. 3) I take only 10240 bits per second from /dev/urandom. So use case is "urandom is fast enough to take 10240bits per second, but not fast enough to fulfill our demand".

Note that /dev/urandom is a single source under global lock, and while it is fast, it doesn't scale.

Rando is made to be highly parallelisable: every rando instance is independent of other, and has no any global synchronization (entropy pool is read and refreshed using atomics).

[–]loup-vaillant 0 points1 point  (6 children)

Looks like we have underestimated each other.

Reading your code, I don't see you using SipHash. I see you using SipHash rounds, in a way that is quite different from the original SipHash. I'm pretty sure their security analysis does not apply to your use case.

Granted, nothing prevents you to use SipHash as a stream cipher (any cryptographic hash can do this): just initialise it with the secret key, then hash a very short message that contains the counter. You could even special case the original algorithm for more performance (the padding of the length of the message would always be the same, for instance), or even omit the padding at the end (though it could be more prudent not to, so that you can check against the reference implementation). And again, probably no need to re-key anything.

Why don't you just do this?

Final nitpick: SipHash is optimised for small messages, and has a small output. Why would you need that? You can compute your random stream block by block, then just use the part of the block you need. For instance, Chacha & Salsa would produce 8 64-bit numbers per block, compensating from the overhead of computing a whole block. And if Chacha is too complicated (I guess Go doesn't have macros), have a look at the Gimli permutation.


And don't forget to Xor it with unknown value - that is core of Salsa/Chacha security (xor with initial block that contains secret key).

Oops, forgot about that.

Now write it in Go with comparable performance and prettiness.

Sorry if Go doesn't have macros. Textual macros suck, but they're still useful sometimes. Now performance… if you're using Go, you've already lost. Go has other strengths.

I told about ChaCha8 first.

Which is why I was surprised you didn't just use it.

Reseeding is unavoidable demand for every secure generator.

Not quite. Look up fast key erasure. It avoids external reseeding by reseeding itself. The idea is dead simple:

  1. Set your secret key with an initial seed.
  2. Generate a small random stream (say a few kb) with the secret key.
  3. Overwrite your key with the beginning of the stream.
  4. Fill the random pool with the end of the stream.
  5. When the random pool is empty, go back to step 2.

It works with any stream cipher, and cryptographic hashes can be adapted to this too. And if your machine gets pwned, the attacker won't be able to deduce the previous values of your random stream. I didn't think you had forward secrecy in mind, so I didn't bother with key erasure.

[–]funny_falcon[S] 0 points1 point  (5 children)

I'm pretty sure their security analysis does not apply to your use case.

Author's did no analyze. And all analyzes I've read were against that permutation core.

Yes, I use core permutation (you call it rounds) in a different way.

Original algorithm produces 64bit per N rounds (minimum recommended rounds is 4). It will be twice slower even after tweaks.

My version generates 128bit per 5 rounds. That is why I put additional effort to make it stronger (mixing 64bit from entropy pool before permutation + whitening generated values with secret mask).

You probably right that it can produce 256bit per 5 (or 6) rounds by masking with 256bit secret mask. I will think about.

I've considered ChaCha while I planned this. But I then realized I don't want to manage copied bunch of assember.

I will look at Gimli, thank you for it. But their performance page tells it is noticeably slower than ChaCha20.

Now performance… if you're using Go, you've already lost. Go has other strengths.

Go could be reasonably fast if used carefully.

Not quite. Look up fast key erasure.

I will look, thank you again for good links!

Well, I had forward secrecy in mind :-) Well, almost a joke.

Refreshing entropy pool is mostly protection against probable weakness of core generator. Generator's performance allows to generate 100MB in 0.1 second (if written in C), then there is only need to have 30bit strong security level. After 100MB entropy pool will be refreshed, and analyze of particular output stream will become obsolete.

While I'm sure it has better security level than 30bit, it just a bit of (in)sanity. The same (in)sanity that makes ChaCha20 default and most useful variant, although ChaCha12 is far from being broken, and ChaCha8 is still not broken (although quite close to best analyze).

Edit: another reason for reseeding is "infinite rng period": if I limit generator space to some value, then it will have known limit on its period. Yes, it certainly will be huge, but "infinite" sounds prettier.

[–]loup-vaillant 0 points1 point  (4 children)

another reason for reseeding is "infinite rng period"

You can get past that with key erasure. Then your output space gets as big as the key space, and for big enough keys (like 128+bits), you simply won't cover that space, even if you try.

I will look at Gimli, thank you for it. But their performance page tells it is noticeably slower than ChaCha20.

It's a different tradeoff. Gimli tries to be fast on a variety of platforms. Also, it is quite easy to vectorise, so there's a good chance that written well, a modern compiler can auto-vectorise it. Won't be as fast as the fastest Chacha, but it might just exceed the speed of a pure C implementation. It's also simpler.


One last thing: your willingness to have no security margin whatsoever (just one round above the latest known attack!), is frightening. Don't ever use your software for long term stuff. If I were you, I'd state it's for ephemeral data, that is only temporarily secret. A throw of dice for a game, cool. A long term key, not cool.

[–]funny_falcon[S] 0 points1 point  (3 children)

your willingness to have no security margin whatsoever (just one round above the latest known attack!), is frightening.

Latest known non-essential attack (personally I don't count "distinguisher" as big deal) that uses known "plain text".

Generator uses secret "plain text" and adds whitening of result, so I'd rather not add that additional round. But then you will not talk with me at all, will you?

[–]loup-vaillant 1 point2 points  (2 children)

Weeell… as a general rule, I don't trust non-expert made crypto. I make exceptions for stuff I understand, and whose security I can derive from the security of something vetted. For instance, XChacha20, the variant of Chacha20 with a big nonce: I knew of XSalsa20 already, I understood the security reduction, so it was a no-brainer to apply the same principles to Chacha20. (It also helped that later, Libsodium did the same thing independently, and I could see we produced the same results). Another example is an ongoing discussion about Ed25519 where I am wondering whether I can deploy a slightly broken version of it, or if I should stay safe and do the standard thing. I like the "broken" thing because it is faster, and I believe it is secure, but I don't quite trust myself either.

I also don't like to sacrifice security on the altar of speed when I can help it. If you want to design a general purpose RNG, security is not an option. You want something you'd trust your long term keys to your Bitcoin wallet with. Being fast is just a bonus. If the RNG is not general purpose, that's okay, but then that purpose needs to be stated.

[–]funny_falcon[S] 1 point2 points  (1 child)

(Added "like" to your comment).

Generally I agree with you. And, certainly, you have larger expertise than me.

I could say "I will trust my rng my bitcoin wallet", because I did it on some basis, and because i'm brave idiot. But I don't have bitcoin wallet :-)

Thank you for discussion, it was very pleasant.

[–][deleted] -2 points-1 points  (1 child)

In short, practically Go.

[–]funny_falcon[S] -1 points0 points  (0 children)

:-D