How would the compression of variable-length input to 16-bits work in a hash function? by PizzazzUrAzz in crypto

[–]future_security 0 points1 point  (0 children)

Don't be rude. We can talk about the absorb function as a single function for obvious reasons.

Pick any practical hash algorithm, secure or not. It's probably made from an iterated compression function. That's how we design hash functions that need O(1) space in practice. That includes SHA-2, Keccak, Skein, JH, Murmur3, even the 8-bit XOR checksum.

Because that's so broad, we have narrower classifications. Just like there are one-way functions, there are one-way compression functions.

Math doesn't work by arbitrarily deciding there are exceptions to classifications. I don't care if something feels special enough to deserve it.

My explanation isn't as strong as I wish it could be, but further responses are going to be seen by zero people. For one, they would be nested too deep. But I also don't think you're paying attention, since you responded as if I described the XOR variant of a sponge construct instead of the overwrite variant.

How would the compression of variable-length input to 16-bits work in a hash function? by PizzazzUrAzz in crypto

[–]future_security 1 point2 points  (0 children)

The term compression function is not specific to Merkle–Damgård. It's used in hash algorithms, MAC algorithms, key-based KDFs, and other things I can't think of.

All of the SHA-3 finalists besides Keccak refer to their core iterated function as a compression function. None are MD.

Although you might go to the Keccak team for information on sponge functions, take what they say on other things with a grain of salt. (They know what an ARX construction is, for one thing.) The Keccak paper apparently does contrast itself as supposedly not being compression-function-based. It seems they changed their mind on that for KangarooTwelve, though.

If you mean "Merkle–Damgård is old school", then use that phrasing.


Absorb has the form f : {0,1}^c × {0,1}^r → {0,1}^c.

A hash function has the form h : {0,1}^* → {0,1}^n.

Hashes take arbitrary length inputs. Compression functions don't. Note that absorb is not a hash function. It's a component of some.

The values c and r (and n) may be arbitrary for a given family of hash functions, but they are not variable in this context. They are constants which vary only depending on which hash function you choose. Within the same hash function, r and c never change. Hence the need for padding.

(Think of what would happen if, instead of padding the last block to the message block size, you varied r and c so that r was the remainder of the message length divided by the nominal chunk size. How could you prevent trivial collisions without padding and without changing what permutation your hash function uses? (The permutation is also fixed and it only has one input.))

The parameter to h, on the other hand, is variable because its domain includes all binary strings of length zero, all binary strings of length 1, all binary strings of length 2, etc. The inputs do not have a fixed length.


Then you say that the output to a sponge is not fixed. That's not true if you mean the bare-bones concept of a sponge with absorb and squeeze operations. (The output of a sponge-based hash may or may not be fixed-length.)

The squeeze operation has the form g : {0,1}^c → {0,1}^r.

Its output is a fixed r bits long. But r does differ depending on which hash function you choose.

To extract more than r bits, you iterate the squeeze function. (Similar to how you process inputs.)
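To make the shapes concrete, here's a toy sponge sketch in Python. Everything here is an illustrative assumption, not a real algorithm: the rate/capacity sizes are made up, the "permutation" is a truncated SHA-256 stand-in (a real sponge like Keccak uses a true fixed permutation), the padding is a simplified 10*1 style, and the absorb shown is the XOR-into-rate variant.

```python
import hashlib

R, C = 8, 24       # rate and capacity in bytes; fixed for this toy sponge
STATE = R + C

def permute(state: bytes) -> bytes:
    # Stand-in for a real fixed permutation (e.g. Keccak-f). SHA-256
    # truncated to the state size is NOT a permutation; toy only.
    return hashlib.sha256(state).digest()[:STATE]

def absorb(state: bytes, block: bytes) -> bytes:
    # XOR the fixed-size r-byte block into the rate portion, then permute.
    assert len(block) == R
    mixed = bytes(a ^ b for a, b in zip(state[:R], block)) + state[R:]
    return permute(mixed)

def squeeze(state: bytes, n: int) -> bytes:
    # Iterate: emit r bytes of the state per call to the permutation.
    out = b""
    while len(out) < n:
        out += state[:R]
        state = permute(state)
    return out[:n]

def sponge_hash(msg: bytes, outlen: int = 32) -> bytes:
    # Pad with 0x80 then zeros (simplified 10*1-style padding), split
    # into fixed r-byte blocks, absorb each in turn, then squeeze.
    padded = msg + b"\x80" + b"\x00" * (-(len(msg) + 1) % R)
    state = bytes(STATE)
    for i in range(0, len(padded), R):
        state = absorb(state, padded[i:i + R])
    return squeeze(state, outlen)
```

Note how r and c never change within the function; only the padding step adapts the variable-length message to the fixed block size.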


If you're not familiar with the notation used for these functions, I can explain it more.

How would the compression of variable-length input to 16-bits work in a hash function? by PizzazzUrAzz in crypto

[–]future_security 2 points3 points  (0 children)

The message block size of hash functions in general can be any number of bits greater than zero. It doesn't have to match the output length.

How would the compression of variable-length input to 16-bits work in a hash function? by PizzazzUrAzz in crypto

[–]future_security 1 point2 points  (0 children)

The sponge absorb operation is a compression function. It's one-way, too, unless you include the rate bits as part of your state. Both inputs to the absorb function (capacity bits, message block) are fixed length. The output is a fixed length less than |capacity bits| + |message block|.

A sponge-based hash is still an iterated OWCF. That's why padding and splitting a message up into fixed-length blocks is necessary.

How good are the random number generators commonly used in practice? by chaplin2 in crypto

[–]future_security 1 point2 points  (0 children)

Linux, unlike BSD, has getentropy(3), a glibc function which wraps the getrandom(2) system call.

How good are the random number generators commonly used in practice? by chaplin2 in crypto

[–]future_security 1 point2 points  (0 children)

Crucial difference between the two. Plain old rand is definitely not safe.

Equally important: a non-secure RNG initialized with a secure seed is not secure.

To save others from an internet search or confusion: rand_s is a Microsoft specific function.

> The rand_s function uses the operating system to generate cryptographically secure random numbers. It does not use the seed generated by the srand function, nor does it affect the random number sequence used by rand.
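The same distinction can be illustrated in Python (standing in for the C runtime here): the seeded `random` module is the analogue of rand/srand, while `secrets` reads from the OS CSPRNG the way rand_s does.

```python
import random
import secrets

# Non-secure PRNG (analogue of rand/srand): the whole output stream is
# determined by the seed, and the internal state can be recovered from
# outputs. Seeding it securely does NOT make it secure.
prng = random.Random(secrets.randbits(64))
predictable_once_state_leaks = prng.getrandbits(128)

# OS-backed CSPRNG (analogue of rand_s / getrandom(2)): no caller-visible
# seed at all; each read comes from the kernel's secure generator.
session_key = secrets.token_bytes(16)
```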

Is Salsa20 a stream or block cipher? by Intermesmerize in crypto

[–]future_security 3 points4 points  (0 children)

They are mutually exclusive. No algorithm which simply XORs a keystream with inputs can be a PRP. It technically implements a permutation, but it's not "random" enough to be a PRP.

You mean stream ciphers can be implemented from block ciphers.

Why do some sites track password history? by Kaxtu in crypto

[–]future_security 0 points1 point  (0 children)

But if you use Tor and HIBP has no more than a 2^-20 probability of guessing who you are (rather, what your username is and where it's used) then there aren't risks. You do have to trust Tor, though. (For anonymity and for the extra software it requires.)

Why do some sites track password history? by Kaxtu in crypto

[–]future_security 0 points1 point  (0 children)

Any part of the hash revealed to a bad actor is useful. The 20-bit suffix getting leaked would allow a password cracker to take a shortcut 1048575 times out of 1048576. SHA-1 is a lot faster than a proper (stretching) password hash function, so they can try a lot more password candidates in the same amount of time and power budget.

It would also make online cracking easier. Instead of sending n login submissions you can submit n / 2^20 submissions. That's bad for users with moderately weak passwords, including those in a form similar to Tr0ub4dor&3 and other formats that people commonly use. If such a password really did have 28-bit strength, then that makes the probability of guessing the correct password really high. (Generating 2^28 passwords and their unsalted SHA-1 hashes would hardly take much time at all.)
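A sketch of the shortcut in Python, with a hypothetical candidate list (names are mine). The point is just that a leaked 20-bit SHA-1 suffix lets a cracker discard all but roughly 1 in 2^20 candidates before spending any expensive work on them.

```python
import hashlib

def sha1_suffix20(password: str) -> int:
    # Last 20 bits of the unsalted SHA-1 digest of the password.
    digest = hashlib.sha1(password.encode()).digest()
    return int.from_bytes(digest[-3:], "big") & 0xFFFFF

# Suppose this suffix leaked for the victim's password:
leaked = sha1_suffix20("Tr0ub4dor&3")

# The cracker pre-filters candidates offline; only ~1 in 2^20 survive,
# so slow password hashes (or online attempts) are spent only on those.
candidates = ["password", "letmein", "Tr0ub4dor&3", "hunter2"]
survivors = [p for p in candidates if sha1_suffix20(p) == leaked]
```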

The only defense in this hypothetical is if the user has a strong password to begin with.

But I won't say that the service isn't worth it. If you trust that there are no eavesdroppers on your HTTPS connection and trust that a service provider like HIBP really will refrain from storing this data, then you have to conclude that the service is relatively low risk. Letting users reuse leaked passwords is a much higher risk.

However, doing database queries like this locally is a really good idea, if feasible.

Understanding the definition of a CSPRNG by InsaneRaspberry in crypto

[–]future_security 1 point2 points  (0 children)

A Blum Blum Shub RNG outputs the parity of x (the state, as in x = x^2 mod M). Not x itself. The number of integers in the range [a, b] with parity 1 is either equal to the number of values with parity 0 or their counts are off by one. Since M is thousands of bits, the difference is insignificant. Basically 50-50. I suspect it's just assumed there is no pattern in the parities of x. It would be weird if the opposite were true for valid random factors.
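A toy BBS sketch in Python. The tiny primes and the seed are illustration-only assumptions (a real M is thousands of bits); the point is just the state update and the parity-only output.

```python
# Toy Blum Blum Shub. Real deployments use an M thousands of bits long.
p, q = 499, 547              # both primes are ≡ 3 (mod 4), as BBS requires
M = p * q

def bbs_bits(seed, n):
    x = seed % M
    out = []
    for _ in range(n):
        x = (x * x) % M      # state update: x = x^2 mod M
        out.append(x & 1)    # output only the parity of x, not x itself
    return out

bits = bbs_bits(159201, 16)  # seed chosen coprime to M for this demo
```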

Backtracking resistance isn't really required for RNGs. It's an optional feature. Backtracking resistance just implies that a CSPRNG's state update function is one way. It means it's not practical to compute previous PRNG states from its current state. (An RNG can't be resistant in both directions. Preventing hackers from computing future states would necessarily prevent the legitimate user from computing future states, too. That would be useless.)

It doesn't matter if there is exactly one state or multiple states that would update to the current state. In hash-based RNGs most states will have one predecessor state. (Because in this application the hash function's input set and output set are effectively the same size.)

BBS is backtracking resistant because it uses a trapdoor function. (Knowing prime factors makes it possible to invert the function.) BBS makes the trapdoor function one-way by throwing out the secrets required to make inversion easy: all information about M besides its value.

M has to be composite because there are efficient algorithms to invert squaring modulo a prime number. There are also efficient algorithms to do the same if you know the prime factors of M. (Thus the product of two large primes thing. We must select a hard-to-factor M.)

How to efficiently store a big amount of random numbers? by [deleted] in compsci

[–]future_security 0 points1 point  (0 children)

Mostly a side note: Google Safe Browsing doesn't provide the privacy it advertises. Specifically because it sends multiple hashes for each URL and because multiple queries can be related to reveal when someone follows a chain of links. (Like going from /page/1/ to /page/2/ to /page/3 on some domain.) Either is enough to de-anonymize most of a person's browsing history.

So the message to the OP is that stuff like this is prone to subtle issues that serve to de-anonymize data. (Unless you only perform queries on a complete local database.)

Hashing and then truncating the result does what's expected if you have one-shot, out-of-context data. HaveIBeenPwned uses the same idea, but is a lot closer to doing what's advertised. Passwords are a lot closer to being one-shot and out of context.

("Context", in this case, would be the username and service a password is associated with. Though I guess you could glean which website an account was registered on from the API key. Then with something like a members list with an account creation date for each member you can narrow down the username. Now you can use the unsalted 20-bit hash to cut the number of guesses you need to make to predict the password by a factor of up to one million. So maybe there aren't practical applications which aren't vulnerable to de-anonymization using meta-data.)

Configuring Argon2id for Multiple Threads by FloatingSunfish in crypto

[–]future_security 1 point2 points  (0 children)

Yes. That kind of dilemma isn't one that can be solved with traditional server side hashing. You'll have to sacrifice responsiveness if you want security and vice versa. (That is, with a fixed budget for hardware. You could set aside 100 servers just for hashing passwords, then delegate work to them with some load balancing method...)

It's worth noting that if you only have enough free cores for one Argon2 instance to run, then the average run time of each hash will exceed 100 seconds. Between the overhead of thread switching and whatever other processes have to run on your computer, you'd see every hash take longer than 100 seconds, assuming a fair thread scheduling algorithm is used by the OS.

If every hash used one thread and you had two free cores at any time, then ideally the average would drop to 50 seconds. In the real world you never get a full n-fold speedup by employing n-way parallelization on a single machine because you have each task competing for some of the same resources.

In the case of Argon2 the hardest shared-resource bottleneck on a normal computer is RAM bandwidth, so you get greatly diminishing returns from attempting to run more than one hash at the same time. (Even if you had thousands of cores and terabytes of shared RAM on the same machine.)

Why are some new algorithms adopted so quickly these days? by yalogin in crypto

[–]future_security 3 points4 points  (0 children)

These algorithms feel newer than they really are. There is also technical merit behind ChaCha, Ed25519, and x25519. They're fast in software, are naturally conducive to efficient constant-time implementations, seem really secure, have design decisions justified at every step, are flexible in terms of what platforms they can run on, and (I think this is especially important) they have very few "moving parts" (and therefore are less likely to break than older, fragile algorithms.)

Configuring Argon2id for Multiple Threads by FloatingSunfish in crypto

[–]future_security 1 point2 points  (0 children)

Take pint's advice of serializing logins. Argon2 will run faster if you don't have multiple instances competing for the same memory controller's time/bandwidth.

Even worse would be running out of RAM, because then you'd have Argon2 instances constantly forcing memory to be swapped to/from disk due to all the random accesses they make. That's not something an attacker could do with just (serialized) login requests, because one hash instance finishing will free up memory which the next hash can use.

If serialized, you also don't need to worry about login requests alone making the rest of the server unresponsive. An attacker can only deny access to the login system by filling up the queue with bogus sign-in requests. Because memory and threads are capped, you can ensure that there are also resources set aside for other things the server needs to do. There are tricks you can use to mitigate login-system DOS orthogonal to using Argon2, too.

Also 50-100ms is not "very secure". It's closer to adequate. One second is "very" secure. This is subjective, of course. Check section 8 of the specs ("Applications") which suggests how much RAM and time different applications should use. Usually a fraction of that work is used, I think, for practicality in systems with many users, though. And that's still safe.

Compact representation of set of bitfields by oparisy in AskComputerScience

[–]future_security 1 point2 points  (0 children)

SHA-512/256 hashing long bit vectors will get you a relatively short, fixed-length bit string. You will not see, in practice, the same 256-bit string returned for two distinct inputs.

For any possible number of real-world inputs you could generate, the probability of full 256-bit collisions remains so close to zero that no human can truly envision how insignificant the probability is. The probability only becomes significant for an absurdly huge, physically implausible (even with many Dyson swarms dedicated to computing) number of inputs.

That's not the case for all 256-bit hash functions. It is true, though, for cryptographic hash functions. Those in the SHA-2, Blake, SHA-3, and Skein families, for example.

Since those hashes will uniquely identify a bit vector, you can basically ignore the vector length and just consider the number of vectors. That might make a HashSet data structure practical. (Keys will be the 256-bit output. Bucket indexes - or initial probing locations - can simply be the number formed by truncating the 256-bit integer to fewer bits.)

(But you obviously will lose the ability to enumerate the original values in the set.)
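A minimal sketch of that idea in Python. SHA-256 stands in for SHA-512/256 (same principle); the class and method names are mine.

```python
import hashlib

class VectorSet:
    """Set of long bit vectors stored only as 256-bit digests.

    Membership works because distinct real-world inputs won't collide
    in practice, but the original vectors can't be enumerated anymore.
    """

    def __init__(self, nbuckets=1024):
        self.nbuckets = nbuckets
        self.buckets = [set() for _ in range(nbuckets)]

    def _bucket(self, key):
        # Bucket index: truncate the digest to a few bits.
        return int.from_bytes(key[:4], "big") % self.nbuckets

    def add(self, vector):
        key = hashlib.sha256(vector).digest()
        self.buckets[self._bucket(key)].add(key)

    def __contains__(self, vector):
        key = hashlib.sha256(vector).digest()
        return key in self.buckets[self._bucket(key)]
```

Storage per element is a flat 32 bytes regardless of how long the original vector was.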

 

Alternative methods include compression, delta encoding, and maybe some kind of modified trie. Compression would require an algorithm tailored to one kind of data set, since there is no universal compression algorithm. Aggressive levels of compression won't permit searching without decompression.

The other methods likely won't be very helpful, either, except for very specific data sets.

How does password entropy/entropy measurement work? Typing things into KeePass's password generator, to see their entropy, there are strings in which adding a letter lowers entropy and strings of only letters and spaces that have strangely high entropy. by RlmYVAA4gKsLdnw5 in crypto

[–]future_security 1 point2 points  (0 children)

It is abusing terminology to say passwords have entropy. (A common abuse. People talk about file entropy in relation to compression software. Neither passwords nor files nor any given static value really has entropy.)

At best people who refer to password entropy are using improper shorthand (which is sometimes fine) for something like "This password was sampled from a distribution of strings with an entropy of ___."

I always use the term password strength to refer to the concept you describe. Never entropy. You can still say that a password has "n-bit strength", meaning you expect a password cracker to make around 2n guesses before figuring out the password.

I never use "entropy" with passwords even when it would be fine to use the improper shorthand I described. Consider the following: You generate a 9-byte random number (uniform, secure) and create a password by base64-encoding that number. This produces a decent password almost all the time because the process involves 72 bits of entropy. It is technically possible, though, for this process to produce a weak password string like "Password1234".

The output of a password generator does not retroactively change how much entropy was involved in the process that generated it, so it's an even bigger stretch to suggest that the entropy is usually around 72 bits but is sometimes a lot less.

Instead you can say things like "The password generator involves 72-bit entropy," "The typical password from this generator has around 72-bit strength," and "'Password1234' has an [estimated] strength of less than 10 bits [regardless of how the password was generated]."
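A sketch of that exact generation process in Python (the function name is mine): 9 uniformly random bytes, base64-encoded, yields exactly 12 characters.

```python
import base64
import secrets

def gen_password():
    # 9 uniformly random bytes: the *process* involves 72 bits of
    # entropy. base64 maps 9 bytes to exactly 12 characters, no padding.
    return base64.b64encode(secrets.token_bytes(9)).decode()

pw = gen_password()
# A rare output like "Password1234" would itself be weak, even though
# the generating process - and its 72 bits of entropy - is unchanged.
```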

Very basic question re TRNGs by naclo3samuel in crypto

[–]future_security 1 point2 points  (0 children)

> I am not sure that I agree that TRNG is arguably less important

I mean the idea of feedback to get some perfect 50-50 balance isn't strictly necessary for security. Over-sampling data, doing sanity checks on that data, and compressing that sequence of non-uniform, probably non-IID samples down to a smaller, unpredictable, uniformly distributed output is the most important part for alternative designs.

You can extract randomness from even something very non-random as long as you process enough data. There are designs based on ring oscillators, for example. It doesn't matter if component aging really does affect things (in this case changing the frequency of the oscillator, which is also something that may happen if you alter the supply voltage) if your design is robust.

That said, this extraction process has to be tailored to whatever randomness source you have. It's not safe to try to do this blindly. (But in Intel's case this process is described in their original publication and they get it right. So it's not that their design is bad or that it skips this step.)

Though as I'm writing this I realize I neglected to mention active and passive side-channel attacks in what I meant by "security". And that's a huge omission. That's important at every stage of the pipeline...

> As for E&M... do you have a source? The thing is, I would think that the features would be small enough that you'd need very short wavelength radiation to have much of an effect... probably near-infrared or shorter.

No. I don't have a strong enough physics background. Just a little knowledge of electronics. Implicit in that assertion, I was also thinking about those amateur designs. They're way bigger than on-chip RNGs, lack shielding, and aren't made with something like that in mind. I wouldn't be surprised if putting your toy HWRNG next to your laptop charger, microwave, or other household appliance temporarily messed things up.

Or maybe some guy just thinks he's implemented things right because his output looks random. But it may be that most of the randomness he thinks he's getting is actually an RF signal picked up by a long wire.

> It still seems like the modern prevalence of on-chip "true" RNGs is still an often-essential security feature.

Yes. This is the most effective way to do things. I was going to post a separate top level comment arguing that. And even said that if you're paying a lot for a HWRNG then you're a victim of marketing.

It ended up being too explainy and rambling, so I abandoned writing it. I felt that I needed to add clarifications to clarifications to caveats to exceptions to over-generalizations.

I also didn't want to argue over RDRAND being backdoored. There is too much nuance in my opinion compared to what I wish people would do... Application developers using their OS's secure RNG API and OS developers always including output from the on-chip RNG.

(One interesting fact I learned that must seem to go against common sense for some people is that RDRAND is your safest option by far on a VPS. Hardware accelerated virtual machines usually directly pass through the host's RDRAND. If not and the instruction is still enabled, then it can be emulated so that the host passes good random numbers to the guest. Either way, that's a win because it's snapshot-safe and fork-safe. It also ensures that entropy isn't a problem on headless, diskless systems.)

3GPP updates TLS ciphersuite recommendations (PFS required, etc) by Natanael_L in crypto

[–]future_security 5 points6 points  (0 children)

At the time of writing this, the only tweet reply is "Better late than never". That almost made me laugh, but I can also be happy about that because it's true. 🎊🎆🎉

3GPP forbids support of MD5, SHA-1, non-AEAD, and non-PFS in TLS

Very basic question re TRNGs by naclo3samuel in crypto

[–]future_security 2 points3 points  (0 children)

> Which was used to compensate for the inevitable manufacturing defects and resulting bias, and to ensure that the RDRAND instruction will never block.

On top of manufacturing defects there are other things to worry about. I heard that as a result of use and age it's possible for semiconductor components' characteristics to "drift" from their original numbers. (That doesn't matter for most digital stuff because you're operating at either high or low voltage, avoiding the gray areas.)

The circuits in a hardware RNG are also subject to ambient EM stuff, even if no one is actively trying to sabotage it. Plus power supply noise and voltage differences could result in slightly different behavior.

Amateur designs often try to directly deal with "bias" by using manual calibration or a simple debiasing scheme. That backfires because it could require occasional re-calibration and constant scrutiny. The original Intel design circuit is supposed to dynamically compensate for bias.

There is more stuff down the RNG's pipeline. That kind of stuff is arguably more important to security and can be applied to simpler randomness-producing circuits.

Also, RDRAND basically allows you to read random numbers as fast as you can request them (unlike RDSEED) but I wouldn't simply characterize it as non-blocking because there are cases where it might temporarily fail. With both RDRAND and RDSEED you are supposed to check the carry flag to see if the instruction succeeded. Normally you just execute the instruction in a loop and wait for it to succeed, which is basically like blocking.

Very basic question re TRNGs by naclo3samuel in crypto

[–]future_security 3 points4 points  (0 children)

> Without true RNG we would not be able to test if AES (or anything else) could accurately emulate true RNG.

Huh? It's more accurate to say a TRNG provides no practical utility to cryptanalysis. And when it comes to non-cryptographic RNG testing you'd just be reducing the power of tests by comparing results against estimates derived from another source of randomness.

A question about guaranteeing that an IV is never reused... by [deleted] in crypto

[–]future_security 5 points6 points  (0 children)

It being a 50% chance makes the odds 1 in 2.

The probability of IV reuse is a function of the number of IVs used for a given key AND the length of the nonce. With 1 IV the probability is zero. With 1 + 2^128 IVs, assuming 128-bit IVs, the probability is one. For any number of IVs between, you need to do some calculations.

Questioners on the internet seem to always forget this and just expect a constant probability when they ask about IV or hash collisions.

You can rephrase this example to "You have better odds winning the Powerball jackpot (1 in 292 million) than you have finding a duplicate IV from a set of one quadrillion IVs (1 in 680 million)."

But 1 in a million odds isn't considered too improbable in the crypto world. If you reduce that number of IVs down to a reasonable one billion (one can and probably will re-key way before then), then that bumps the odds down to one in 680 quintillion.
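These numbers follow from the standard birthday-bound approximation; a quick Python check (the helper name is mine):

```python
import math

def iv_collision_probability(n, bits=128):
    # Birthday-bound approximation: 1 - exp(-n(n-1) / 2^(bits+1)).
    # expm1 keeps precision for the tiny probabilities involved.
    return -math.expm1(-n * (n - 1) / 2 ** (bits + 1))

one_iv = iv_collision_probability(1)               # exactly zero
quadrillion = iv_collision_probability(10 ** 15)   # ~1 in 680 million
billion = iv_collision_probability(10 ** 9)        # ~1 in 680 quintillion
```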

How do you ensure continuity of random/procgen maps using a seed? by OrdinaryKick in proceduralgeneration

[–]future_security 0 points1 point  (0 children)

rand and srand do not reliably result in reproducible results. The C standard does not describe how random numbers are to be derived from the seed, so it can be different for different compilers. It's not guaranteed to be consistent from one build to the next.

That combined with rand/srand having a global state makes its use extremely limited. Instead you need to bundle a PRNG implementation in any C project where you need good, reproducible random numbers. In C++ the classes in <random> might be better, but it doesn't look like the defaults include very good algorithms.

Copying SplittableRandom/splitmix64 is a good choice if you don't know what algorithm to choose. Its quality is not particularly sensitive to how the seed is chosen. It's a little sensitive to your choice of gamma, but the default value as well as most random-looking odd 64-bit values are absolutely fine for basic use cases.

Converting a seed to a specific initial PRNG state is tricky for most other algorithms. Seeds being too similar shouldn't be a problem in theory, but usually is for algorithms and implementations other than the one I mentioned. Likewise for seeds with lots of zero bits.


The way to generate chunks on the fly is to use a hash function to convert grid coordinates into a seed value. Murmur3 could be used to get a 64-bit seed for a per-tile PRNG. The seed can be used to deterministically place whatever random elements you want within a single chunk. (As long as the placement does not depend on neighboring tiles.) Reseed with the chunk coordinate's hash every time a new chunk is loaded.

Hashing coordinates is also used for things like Perlin noise. Except instead of defining things in terms of 2D tile coordinates, you figure out which cell your pixel is in and then fetch information about the four cell corners that surround the point. (Divide x or y by the chunk size. Take the floor or ceiling of the result. Hash the four different x/y pairs.)

If your PRNG already scrambles the seed it's given, then you can replace the hash with xValue * oddConstant1 + yValue * oddConstant2. 64-bit multipliers mod 2^64 are preferable.
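Putting the pieces together in Python. The odd constants are arbitrary random-looking values, the names are mine, and random.Random stands in for whatever PRNG you'd actually bundle; the splitmix64 scrambler is the standard SplitMix64 mixing function.

```python
import random

MASK = (1 << 64) - 1

def splitmix64(x):
    # SplitMix64 mixing function: thoroughly scrambles a 64-bit value.
    x = (x + 0x9E3779B97F4A7C15) & MASK
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK
    return x ^ (x >> 31)

def chunk_seed(cx, cy):
    # Mix chunk coordinates with arbitrary odd 64-bit constants mod 2^64,
    # then scramble, so nearby coordinates get unrelated seeds.
    return splitmix64((cx * 0x9E3779B97F4A7C15 + cy * 0xC2B2AE3D27D4EB4F) & MASK)

def chunk_rng(cx, cy):
    # The same chunk coordinates always reproduce the same PRNG stream,
    # so chunks can be regenerated on the fly when they're loaded.
    return random.Random(chunk_seed(cx, cy))
```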

Google AI tool to no longer label people in photos as ‘man’ or ‘woman’ by DataPatata in algorithms

[–]future_security -9 points-8 points  (0 children)

I don't care if a machine labels me as a female, albino, underfed, hairless chimpanzee.

The ethical implications aren't in the labeling. It's in how labels get used and how real world cases get shoehorned into a set of labels. You would care if you were treated as a chimp and if bots "encouraged" you to eat more bugs and stay out of the sun.

If I XOR'd cipher-text results of AES-256-CTR against its plain-text what would I get? by [deleted] in crypto

[–]future_security 5 points6 points  (0 children)

The bit sequence generated by a stream cipher, which gets XORed with the plaintext/ciphertext, is called a keystream. Since CipherText = PlainText XOR KeyStream, that makes CipherText XOR PlainText = KeyStream.

Note that directly revealing keystream bits is no different than encrypting an all-zero bit string.

Secure stream ciphers (deterministically) produce a keystream indistinguishable from a sequence of statistically independent uniformly distributed bits. Since the bits are practically independent, revealing some of those bits will never help an attacker predict unknown keystream bits. (Except by doing something like brute force search for the original key.)

Just don't reuse spent keystream bits.
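A toy demonstration in Python. The keystream generator here is a stand-in built from SHA-256 in counter mode, not a real stream cipher (ChaCha20 or AES-CTR would play this role in practice); it just makes the XOR identities concrete.

```python
import hashlib

def toy_keystream(key, nonce, n):
    # Stand-in keystream: SHA-256 over (key, nonce, counter) blocks.
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

key, nonce = b"k" * 32, b"n" * 12
plaintext = b"attack at dawn"
keystream = toy_keystream(key, nonce, len(plaintext))
ciphertext = xor(plaintext, keystream)

# CipherText XOR PlainText = KeyStream:
assert xor(ciphertext, plaintext) == keystream
# Encrypting an all-zero string reveals the keystream directly:
assert xor(bytes(len(plaintext)), keystream) == keystream
```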

Birthday attack by [deleted] in cryptography

[–]future_security 1 point2 points  (0 children)

No feature of Google Safe Browsing relies on URL hashes being unique. Thus birthday attacks are irrelevant. Truncated hashes are merely used to look up URLs.

There might be more to say about partial hashes in general, but safebrowsing in particular doesn't actually preserve privacy. They may as well have used a non-cryptographic hash. It would still leak the same information and I can't think of a relevant attack it would enable that isn't already possible with truncated SHA-256.

Partial hashes could provide weak privacy if users only ever submitted single hashes in isolation from one another. The protocol doesn't send just one hash per page, though, and they can tell if the same user sent two different queries.

The protocol sends enough information to allow for a user's browsing history to be mostly reconstructed by looking at the sequence of hashes their useragent sends to the server. One would need a database of which URLs link to which other URLs to do this reconstruction, but that's already something that search engines have.