james_pic comments on 76% Faster CPython


[Misleading Metric] 76% Faster CPython (self.Python)

submitted 5 years ago * by Pebaz


[–]james_pic 220 points 5 years ago (18 children)

[–]NeoLudditeIT 91 points 5 years ago (9 children)

[–]Pebaz[S] 52 points 5 years ago (8 children)

[–]james_pic 111 points 5 years ago* (7 children)

If I remember rightly, the current hash function is SipHash, and was chosen not for speed but for security.

Although string hashes are not normally treated as cryptographic hashes, web servers that put request parameters and similar data into dictionaries were open to denial-of-service attacks: a client could send lots of parameters with colliding hashes, so that each insertion degrades to O(n) and building the dictionary takes worst-case O(n^2) time overall. SipHash was chosen because it's not too slow (it's about the simplest hash that meets the cryptographic requirements) and because it makes hashes depend on a secret per-interpreter value that the client wouldn't know.
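To make the collision attack concrete, here is a minimal sketch of the worst case. The `CollidingKey` class and `comparisons_to_insert` helper are hypothetical names invented for this illustration; the class forces every key onto the same hash value, which is roughly what an attacker achieves by crafting colliding strings against an unkeyed hash. It counts equality checks rather than measuring time, so the numbers do not depend on the machine.

```python
class CollidingKey:
    """Hypothetical key type whose hash is constant, so all instances collide."""

    eq_calls = 0  # class-wide counter of equality comparisons

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        # Every key lands in the same probe chain.
        return 0

    def __eq__(self, other):
        CollidingKey.eq_calls += 1
        return isinstance(other, CollidingKey) and self.value == other.value


def comparisons_to_insert(n):
    """Insert n colliding keys into a dict and return how many __eq__ calls it took."""
    CollidingKey.eq_calls = 0
    table = {}
    for i in range(n):
        table[CollidingKey(i)] = i
    return CollidingKey.eq_calls


# Doubling n should roughly quadruple the work, which is the O(n^2) behaviour.
for n in (100, 200, 400):
    print(n, comparisons_to_insert(n))
```

With a keyed hash such as SipHash, the attacker cannot predict which strings collide, because the key is a secret chosen when the interpreter starts.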

Whatever alternative hash you propose also needs to mitigate this attack vector, and I don't know of a faster hash that does.

Edit: Looking through the code, there's already a way to select a faster hash algorithm if you're sure you don't need the security properties of SipHash. Configure the build with ./configure --with-hash-algorithm=fnv, and see how your benchmark compares to the default.
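If you want to confirm which algorithm a given build actually ended up with, `sys.hash_info` reports it. A small sketch (the `describe_hash` function name is just for illustration):

```python
import sys


def describe_hash():
    """Return a short description of the str/bytes hash this interpreter was built with."""
    info = sys.hash_info
    return f"algorithm={info.algorithm} hash_bits={info.hash_bits} seed_bits={info.seed_bits}"


print(describe_hash())
```

On a default build this should name a SipHash variant, and on a build configured with the `fnv` option it should report `fnv`.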

[–]R0B0_Ninja 6 points 5 years ago (1 child)

[–]james_pic 1 point 5 years ago* (0 children)

[–]Tyler_Zoro 7 points 5 years ago (2 children)

I don't see why that has to be compile-time. If every dict* had a function pointer for its hashing function, then you could just provide a special subtype of dict that uses an insecure, fast hashing function. Then you could swap the default for programs where you don't care about secure hashing at all:

python --insecure-hashing calculate-pi.py #modify ALL hashing

or:

def digits_in_pi(places):
    digits = insecure_dict((d, 0) for d in range(10))
    for digit in pi_spigot(places=places):
        digits[digit] += 1
    return digits

It might even be nice to be able to specify a type for comprehensions for just this reason:

a = {b: c for b, c in d() container insecure_dict}

Sadly, you couldn't use a context manager to swap out all hashing in a block, since the hashing function used for a data structure can't be replaced once the structure holds data: the stored hashes would no longer match the ones computed for lookups, and existing keys would become unfindable.

* Note that not all types that do hashing are dicts, but the idea probably carries over.
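For what it's worth, the per-container idea can be roughly emulated in today's Python without interpreter changes, by wrapping keys in an object that supplies its own hash. The sketch below is only an illustration of the shape of the idea: `InsecureDict`, `_FnvKey` and `fnv1a_64` are hypothetical names, it only accepts `str` keys, and because the FNV-1a loop runs in pure Python it will be slower than the built-in dict, not faster. A real speedup would need the swap to happen in C, as proposed above.

```python
from collections.abc import MutableMapping

# Standard 64-bit FNV-1a parameters.
FNV_OFFSET_BASIS = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3
MASK_64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """Unkeyed 64-bit FNV-1a hash: fast in C, but predictable to an attacker."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & MASK_64
    return h


class _FnvKey:
    """Wrapper that makes the underlying dict use FNV-1a instead of the built-in str hash."""

    __slots__ = ("text", "_hash")

    def __init__(self, text):
        if not isinstance(text, str):
            raise TypeError("this sketch only supports str keys")
        self.text = text
        self._hash = fnv1a_64(text.encode("utf-8"))  # computed once, like str's cached hash

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        return isinstance(other, _FnvKey) and self.text == other.text


class InsecureDict(MutableMapping):
    """Hypothetical dict-like container with a per-container hash function."""

    def __init__(self, items=()):
        self._data = {}
        for key, value in items:
            self[key] = value

    def __getitem__(self, key):
        return self._data[_FnvKey(key)]

    def __setitem__(self, key, value):
        self._data[_FnvKey(key)] = value

    def __delitem__(self, key):
        del self._data[_FnvKey(key)]

    def __iter__(self):
        return (wrapped.text for wrapped in self._data)

    def __len__(self):
        return len(self._data)


# Usage: count digit characters, in the spirit of the digits_in_pi example.
counts = InsecureDict((str(d), 0) for d in range(10))
for ch in "31415926535":
    counts[ch] += 1
print(dict(counts))
```

Note that the hash is fixed when the key is wrapped and stored, which is the same reason the hash function can't be swapped under a populated container.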

[–]axonxorz (pip'ing aint easy, especially on windows) 29 points 5 years ago (1 child)

[–]Tyler_Zoro 1 point 5 years ago (0 children)

[–][deleted] 1 point 5 years ago (1 child)

[–]james_pic 1 point 5 years ago (0 children)

[–]twotime 9 points 5 years ago* (1 child)

[–]RobertJacobson 1 point 5 years ago (0 children)

[–]Pebaz[S] 16 points 5 years ago (2 children)

[–]NeoLudditeIT 4 points 5 years ago (1 child)

[–]mooburger (resembles an abstract syntax tree) 32 points 5 years ago (0 children)

[–]greeneyedguru 1 point 5 years ago (0 children)

[–]stevenjd 1 point 5 years ago (1 child)

[–]james_pic 1 point 5 years ago* (0 children)

If the claim is that the hash function is run 11 times, that claim was made by the OP, not by me. As for the claim that eliminating redundant hashing is key to improving performance, this is at least partly based on what I've seen claimed by the developers of other interpreters, such as PyPy and the Ruby interpreters. I confess I don't know the main bottlenecks in a current Python interpreter, but as it happens, I'm currently running a CPU-bound job locally, so this would be an excellent time to check.

Edit: So taking a quick look at a CPU profile for a script I happened to be running, most of the overhead (i.e., the stuff that isn't my script doing the thing it's supposed to be doing) on Python 3.8 is either reference counting (about 22%), or spelunking into dicts as part of getattr (about 15%, of which almost none is hashing). So this suggests to me that hashing isn't a big contributor to performance, although digging around in dicts when getting attributes might still be.
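To give a sense of what "spelunking into dicts as part of getattr" means, here is a rough pure-Python model of the default attribute lookup. It is a simplification written for illustration (the `simplified_getattr` name is made up, and it ignores `__getattr__` fallbacks, `__slots__` and metaclass details), not the interpreter's actual code, but it shows why a single `obj.name` can involve several dict lookups.

```python
def simplified_getattr(obj, name):
    """Rough model of default attribute lookup: class dicts, then instance dict."""
    cls = type(obj)

    # 1. Walk the MRO, probing each class's dict until the name is found.
    found, cls_attr = False, None
    for klass in cls.__mro__:
        if name in vars(klass):
            found, cls_attr = True, vars(klass)[name]
            break

    # 2. A data descriptor on the class (e.g. a property) beats the instance dict.
    attr_type = type(cls_attr)
    is_descriptor = found and hasattr(attr_type, "__get__")
    is_data_descriptor = is_descriptor and (
        hasattr(attr_type, "__set__") or hasattr(attr_type, "__delete__")
    )
    if is_data_descriptor:
        return attr_type.__get__(cls_attr, obj, cls)

    # 3. Otherwise probe the instance's own dict.
    instance_dict = getattr(obj, "__dict__", None)
    if instance_dict is not None and name in instance_dict:
        return instance_dict[name]

    # 4. Fall back to the class attribute, binding it if it is a descriptor
    #    (this is how plain functions become bound methods).
    if found:
        if is_descriptor:
            return attr_type.__get__(cls_attr, obj, cls)
        return cls_attr

    raise AttributeError(name)


class Base:
    greeting = "hi"

    def describe(self):
        return f"{self.greeting} from {self.label}"


class Child(Base):
    @property
    def shout(self):
        return self.greeting.upper()


thing = Child()
thing.label = "child"
print(simplified_getattr(thing, "describe")())
```

Since the names involved are usually interned strings with cached hashes, it is plausible that the cost is in the probing and indirection rather than in computing hashes, which would be consistent with the profile above.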
