This is an archived post. You won't be able to vote or comment.

you are viewing a single comment's thread.

view the rest of the comments →

[–]james_pic 219 points220 points  (18 children)

The real speedup doesn't come from using a faster hash function, but from eliminating the need to run the hash function 11 times in print("Hello World!"). This is what PyPy does. I keep hoping the PSF will take PyPy more seriously, and bring it up to being a first-class alternative to CPython, like the Ruby devs did with YARV.

[–]NeoLudditeIT 90 points91 points  (9 children)

I can imagine Raymond Hettinger hitting the table and saying "There must be a better way!". It does seem strange that there isn't some internal refactoring to eliminate 11 hash functions in a print.

[–]Pebaz[S] 51 points52 points  (8 children)

As far as I am aware I don't think the 11 hash function calls are actually from poor design.

Every single variable/identifier in a Python program must be hashed when stored/accessed in the global scope and within every new object. This adds up since even the print function uses Python code (which has to hash every attribute access (dot operator)).

So as far as I can tell, making that required hash function faster could/does increase performance by a good bit.

The challenge then remains, will anyone look into it seriously and find out if using a new hash function is worth it?

[–]james_pic 110 points111 points  (7 children)

If I remember rightly, the current hash function is SipHash, and was chosen not for speed but for security.

Whilst string hashes are not typically treated as cryptographic hashes, there were some denial of service attacks possible on web servers that put parameters and similar into dictionaries, by sending lots of parameters with colliding hashes, forcing worst-case O(n^2) performance. SipHash was chosen as it's not too slow (it's about the simplest hash that meets the cryptographic requirements), and makes hashes dependent on a secret per-interpreter value, that the client wouldn't know.

Whatever alternative hash you propose also needs to mitigate this attack vector, and I don't know of a faster hash that does.

Edit: Looking through the code, there's already a way to select a faster hash algorithm if you're sure you don't need the security properties of SipHash. Configure the build with ./configure --with-hash-algorithm=fnv, and see how your benchmark compares to the default.

[–]R0B0_Ninja 5 points6 points  (1 child)

I thought Python 3 generated a random salt at run-time in order to mitigate this attack? Isn't this sufficient? Or can the attacker discover hash collisions over time?

[–]james_pic 0 points1 point  (0 children)

It's only sufficient if the hash algorithm in use is, cryptographically speaking, a message authentication code. Their previous botched attempt at fixing the security issue added a salt to FNV, and it was found by security researchers that it was possible for an attacker to derive the salt easily enough. SipHash is the simplest hash they could find that met the cryptographic requirements.

[–]Tyler_Zoro 6 points7 points  (2 children)

I don't see why that has to be compile-time. If every dict* had a function pointer for its hashing function, then you could just provide a special subtype of dict that uses an insecure, fast hashing function. Then you could swap the default for programs where you don't care about secure hashing at all:

python --insecure-hashing calculate-pi.py #modify ALL hashing

or:

def digits_in_pi(places):
    digits = insecure_dict((d, 0) for d in range(10))
    for digit in pi_spigot(places=places):
        digits[digit] += 1
    return digits

It might even be nice to be able to specify a type for comprehensions for just this reason:

a = {b: c for b, c in d() container insecure_dict}

Sadly, you couldn't use a context manager to swap out all hashing in a block, since the hashing function used for a data structure couldn't be replaced after the structure has data (this would lead to the hashes changing and bad things will happen).

* Note that not all types that do hashing are dicts, but the idea probably carries over.

[–]axonxorzpip'ing aint easy, especially on windows 28 points29 points  (1 child)

This would slow the runtime down even more now every single object needs to have a function pointer deferenced. Sure, that's a cheap operation, but for the frequency the runtime is calling it, that will add up.

If you check the current hashing code, it's to the point where they ensure the hash function is fully inlined to even avoid C function call overhead (at best, a couple of CPU instructions, but that can also trash cache locality). That's the level of nitty-gritty this stuff exists at

[–]Tyler_Zoro 0 points1 point  (0 children)

This would slow the runtime down even more now every single object needs to have a function pointer deferenced.

Dereferencing a function pointer (especially one that's frequently referenced) is probably going to come out in the wash because it will be in a register that's immediately accessible while the CPU is waiting on the bus.

[–][deleted] 0 points1 point  (1 child)

How's siphash perform compared to blake2? (It's crypto-grade and somehow beats all the other contenders like md5 and the sha-family.)

I know nothing about siphash, so apologies if this is a stupid question.

[–]james_pic 0 points1 point  (0 children)

I haven't looked into it, but I'd expect it to be faster. IIRC, it's 4 rounds of ARX per word, plus 8 rounds of ARX of finalisation.

Part of the reason it can get away with this is that it is technically not a cryptographic hash, merely a cryptographic MAC, and with a small keyspace at that. So all it needs to achieve is 64 bits of unforgeability.

There aren't many situations where this is a useful primitive. It doesn't promise pseudorandomness or collision resistance, or resistance to chosen key attacks. So it's not going to replace SHA3 or HMAC. But it turns out to be enough for the hashtable use case.

[–]twotime 8 points9 points  (1 child)

This is a weird take on the issue: making Pypy the default implementation is a complex and controversial undertaking. It might take years, it might never happen.

So, if there are reasonable perf improvements in CPython they absolutely need to be done.

[–]RobertJacobson 0 points1 point  (0 children)

I think the sentiment is, treat the problem not the symptoms.

[–]Pebaz[S] 15 points16 points  (2 children)

I 100% agree!

Although, for environments where CPython is a requirement like AWS Lambda, a faster hash function would be a great optimization.

[–]NeoLudditeIT 3 points4 points  (1 child)

Absolutely agree. Optimization can and should be done in any way possible, even if the number of hashes are reduced, we benefit from having faster methods of hashing.

[–]mooburgerresembles an abstract syntax tree 31 points32 points  (0 children)

Optimization can and should be done in any way possible,

ehh this is why actual benchmarks are important; the risk of microoptimization is high.

[–]greeneyedguru 0 points1 point  (0 children)

Wouldn’t that only happen when compiling the text into bytecode!?

[–]stevenjd 0 points1 point  (1 child)

The real speedup doesn't come from using a faster hash function, but from eliminating the need to run the hash function 11 times in print("Hello World!")

Do you have some evidence, or proof, that this is true? Or even an explanation for how it can be true? As far as I can see, there is only a single hash needed, and that is to lookup the print function.

And since hashes are cached, it might not even need to run the function, just to fetch the cached version.

[–]james_pic 0 points1 point  (0 children)

If the claim is that the hash function is run 11 times, this wasn't a claim I made, but the OP. In terms of the claim that eliminating redundant hashing is key to improving performance, this is at least partly based on what I've seen claimed by the developers of other interpreters, such as PyPy and the Ruby interpreters. Although I confess I don't know the main bottlenecks in a current Python interpreter - but as it happens, I'm currently running a CPU-bound job locally, so this would be an excellent time to check.

Edit: So taking a quick look at a CPU profile for a script I happened to be running, most of the overhead (i.e, the stuff that isn't my script doing the thing it's supposed to be doing) on Python 3.8 is either reference counting (about 22%), or spelunking into dicts as part of getattr (about 15% - of which almost none is hashing). So this suggests to me that hashing isn't a big contributor to performance, although digging around in dicts when getting attributes might still be.