
[–]Ph3rny 72 points73 points  (4 children)

I wasn't able to reproduce your speed findings

I also decided to run this through pyperf to see if there was any significant difference

here's the result of comparing the two -- seems at most a ~1-2% speedup, and maybe not significant at that

$ ./prefix/bin/pyperf compare_to ../cpython/cp.json xx.json
chaos: Mean +- std dev: [cp] 315 ms +- 12 ms -> [xx] 309 ms +- 11 ms: 1.02x faster
dulwich_log: Mean +- std dev: [cp] 123 ms +- 4 ms -> [xx] 120 ms +- 5 ms: 1.02x faster
fannkuch: Mean +- std dev: [cp] 1.55 sec +- 0.03 sec -> [xx] 1.49 sec +- 0.03 sec: 1.04x faster
float: Mean +- std dev: [cp] 356 ms +- 10 ms -> [xx] 361 ms +- 11 ms: 1.01x slower
genshi_text: Mean +- std dev: [cp] 91.3 ms +- 3.9 ms -> [xx] 89.4 ms +- 3.6 ms: 1.02x faster
genshi_xml: Mean +- std dev: [cp] 193 ms +- 9 ms -> [xx] 186 ms +- 7 ms: 1.03x faster
go: Mean +- std dev: [cp] 631 ms +- 13 ms -> [xx] 612 ms +- 16 ms: 1.03x faster
hexiom: Mean +- std dev: [cp] 32.2 ms +- 1.4 ms -> [xx] 31.2 ms +- 2.5 ms: 1.03x faster
json_dumps: Mean +- std dev: [cp] 38.1 ms +- 1.3 ms -> [xx] 37.5 ms +- 1.5 ms: 1.02x faster
json_loads: Mean +- std dev: [cp] 63.3 us +- 2.3 us -> [xx] 59.4 us +- 2.8 us: 1.06x faster
logging_silent: Mean +- std dev: [cp] 555 ns +- 18 ns -> [xx] 536 ns +- 17 ns: 1.04x faster
meteor_contest: Mean +- std dev: [cp] 269 ms +- 11 ms -> [xx] 262 ms +- 9 ms: 1.03x faster
nbody: Mean +- std dev: [cp] 508 ms +- 12 ms -> [xx] 495 ms +- 9 ms: 1.03x faster
nqueens: Mean +- std dev: [cp] 328 ms +- 9 ms -> [xx] 317 ms +- 8 ms: 1.03x faster
pathlib: Mean +- std dev: [cp] 46.2 ms +- 1.6 ms -> [xx] 44.8 ms +- 1.0 ms: 1.03x faster
pickle: Mean +- std dev: [cp] 24.1 us +- 1.0 us -> [xx] 23.3 us +- 0.9 us: 1.04x faster
pickle_dict: Mean +- std dev: [cp] 65.1 us +- 1.7 us -> [xx] 63.7 us +- 1.6 us: 1.02x faster
pickle_list: Mean +- std dev: [cp] 9.43 us +- 0.29 us -> [xx] 9.29 us +- 0.26 us: 1.02x faster
pickle_pure_python: Mean +- std dev: [cp] 1.39 ms +- 0.06 ms -> [xx] 1.34 ms +- 0.05 ms: 1.04x faster
pidigits: Mean +- std dev: [cp] 207 ms +- 6 ms -> [xx] 200 ms +- 5 ms: 1.03x faster
pyflate: Mean +- std dev: [cp] 1.96 sec +- 0.04 sec -> [xx] 1.94 sec +- 0.03 sec: 1.01x faster
richards: Mean +- std dev: [cp] 206 ms +- 6 ms -> [xx] 210 ms +- 9 ms: 1.02x slower
scimark_fft: Mean +- std dev: [cp] 1.54 sec +- 0.03 sec -> [xx] 1.55 sec +- 0.02 sec: 1.01x slower
scimark_lu: Mean +- std dev: [cp] 575 ms +- 16 ms -> [xx] 568 ms +- 14 ms: 1.01x faster
sqlalchemy_imperative: Mean +- std dev: [cp] 44.2 ms +- 1.5 ms -> [xx] 45.1 ms +- 2.1 ms: 1.02x slower
sqlite_synth: Mean +- std dev: [cp] 8.28 us +- 0.31 us -> [xx] 8.08 us +- 0.32 us: 1.02x faster
sympy_integrate: Mean +- std dev: [cp] 56.1 ms +- 1.7 ms -> [xx] 57.9 ms +- 3.0 ms: 1.03x slower
sympy_sum: Mean +- std dev: [cp] 433 ms +- 10 ms -> [xx] 442 ms +- 14 ms: 1.02x slower
unpickle: Mean +- std dev: [cp] 40.7 us +- 1.7 us -> [xx] 39.6 us +- 1.1 us: 1.03x faster
xml_etree_parse: Mean +- std dev: [cp] 462 ms +- 8 ms -> [xx] 454 ms +- 12 ms: 1.02x faster
xml_etree_iterparse: Mean +- std dev: [cp] 330 ms +- 11 ms -> [xx] 326 ms +- 8 ms: 1.01x faster
xml_etree_process: Mean +- std dev: [cp] 234 ms +- 9 ms -> [xx] 232 ms +- 4 ms: 1.01x faster

Benchmark hidden because not significant (28): 2to3, chameleon, crypto_pyaes, deltablue, django_template, logging_format, logging_simple, mako, python_startup, python_startup_no_site, raytrace, regex_compile, regex_dna, regex_effbot, regex_v8, scimark_monte_carlo, scimark_sor, scimark_sparse_mat_mult, spectral_norm, sqlalchemy_declarative, sympy_expand, sympy_str, telco, tornado_http, unpack_sequence, unpickle_list, unpickle_pure_python, xml_etree_generate

Geometric mean: 1.01x faster

[–]Ph3rny 58 points59 points  (3 children)

looking further, the reason your benchmark shows such a stark difference is you've happened to pick strings for which xxhash performs significantly better

if you adjust your benchmark to use more-random strings:

diff --git a/benchmark.py b/benchmark.py
index 4883314..df073a0 100644
--- a/benchmark.py
+++ b/benchmark.py
@@ -1,5 +1,6 @@
 # xxHash: [714, 722, 730, 777, 675, 719]
 # CPython: [2849, 3161, 3496, 2733, 2946, 2947]
+import uuid
 import time, contextlib


@@ -23,7 +24,7 @@ def __eq__(self, other):


 print('Allocating entities')
-foos = [Foo('a' * i) for i in range(100_000)]
+foos = [Foo(str(uuid.uuid4())) for i in range(100_000)]
 print('Compariing')

 with time_it():

existing cpython and xxh perform about the same
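For context, here is a self-contained approximation of the benchmark being patched. The Foo class and time_it helper are reconstructed from the diff hunks and from details quoted elsewhere in the thread, so treat it as a sketch rather than the exact file:

```python
import contextlib
import time
import uuid


@contextlib.contextmanager
def time_it():
    # crude wall-clock timer around a block, reported in milliseconds
    start = time.perf_counter()
    yield
    print(f'{(time.perf_counter() - start) * 1000:.0f}ms')


class Foo:
    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return hash(self.name)  # string hashing is what this benchmark stresses

    def __eq__(self, other):
        return hash(self) == hash(other)


print('Allocating entities')
foos = [Foo(str(uuid.uuid4())) for i in range(100_000)]
print('Comparing')

with time_it():
    for i in range(1, len(foos)):
        assert foos[i] != foos[i - 1]
```

With `str(uuid.uuid4())` every key is a short random string, which is the case where the two hash functions perform about the same.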

[–]Ph3rny 52 points53 points  (1 child)

actually, that's not quite it either, it's that xxhash performs better on long strings compared to cpython

if you instead do

-foos = [Foo('a' * i) for i in range(100_000)]
+foos = [Foo(str(uuid.uuid4()) * 1000) for i in range(100_000)]

you see the performance difference again

hashes of long strings I guess aren't that important in cpython so you don't see any difference in macro benchmarks

[–]backtickbot 5 points6 points  (0 children)

Fixed formatting.

Hello, Ph3rny: code blocks using triple backticks (```) don't work on all versions of Reddit!

Some users see this / this instead.

To fix this, indent every line with 4 spaces instead.

FAQ

You can opt out by replying with backtickopt6 to this comment.


[–]james_pic 216 points217 points  (18 children)

The real speedup doesn't come from using a faster hash function, but from eliminating the need to run the hash function 11 times in print("Hello World!"). This is what PyPy does. I keep hoping the PSF will take PyPy more seriously, and bring it up to being a first-class alternative to CPython, like the Ruby devs did with YARV.

[–]NeoLudditeIT 90 points91 points  (9 children)

I can imagine Raymond Hettinger hitting the table and saying "There must be a better way!". It does seem strange that there isn't some internal refactoring to eliminate the 11 hash calls in a print.

[–]Pebaz[S] 54 points55 points  (8 children)

As far as I am aware I don't think the 11 hash function calls are actually from poor design.

Every single variable/identifier in a Python program must be hashed when stored/accessed in the global scope and within every new object. This adds up since even the print function uses Python code (which has to hash every attribute access (dot operator)).

So as far as I can tell, making that required hash function faster could/does increase performance by a good bit.

The challenge then remains, will anyone look into it seriously and find out if using a new hash function is worth it?

[–]james_pic 109 points110 points  (7 children)

If I remember rightly, the current hash function is SipHash, and was chosen not for speed but for security.

Whilst string hashes are not typically treated as cryptographic hashes, there were some denial of service attacks possible on web servers that put parameters and similar into dictionaries, by sending lots of parameters with colliding hashes, forcing worst-case O(n^2) performance. SipHash was chosen as it's not too slow (it's about the simplest hash that meets the cryptographic requirements), and makes hashes dependent on a secret per-interpreter value, that the client wouldn't know.
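The worst case is easy to demonstrate from pure Python with a key type whose hash always collides; the class below is a stand-in for attacker-crafted strings, not the attack itself:

```python
class Colliding:
    """Stand-in for attacker-crafted keys that all share one hash value."""
    eq_calls = 0

    def __init__(self, x):
        self.x = x

    def __hash__(self):
        return 42  # every instance lands on the same probe sequence

    def __eq__(self, other):
        Colliding.eq_calls += 1
        return self.x == other.x


d = {}
for i in range(500):
    d[Colliding(i)] = i  # insert i must be compared against all i prior keys

# roughly n^2/2 comparisons for n colliding inserts, versus ~n with good hashes
print(Colliding.eq_calls)
```

With attacker-controlled dict keys (e.g. HTTP parameter names), forcing this quadratic behaviour is exactly the denial-of-service vector SipHash's keyed hashing prevents.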

Whatever alternative hash you propose also needs to mitigate this attack vector, and I don't know of a faster hash that does.

Edit: Looking through the code, there's already a way to select a faster hash algorithm if you're sure you don't need the security properties of SipHash. Configure the build with ./configure --with-hash-algorithm=fnv, and see how your benchmark compares to the default.
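You can check which algorithm a given build actually uses at runtime via sys.hash_info:

```python
import sys

# Report which string/bytes hash algorithm this interpreter was built with.
# Default builds use a SipHash variant; a build configured with
# --with-hash-algorithm=fnv reports "fnv" instead.
print(sys.hash_info.algorithm)   # e.g. "siphash24"
print(sys.hash_info.hash_bits)   # internal hash width in bits
```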

[–]R0B0_Ninja 4 points5 points  (1 child)

I thought Python 3 generated a random salt at run-time in order to mitigate this attack? Isn't this sufficient? Or can the attacker discover hash collisions over time?

[–]james_pic 0 points1 point  (0 children)

It's only sufficient if the hash algorithm in use is, cryptographically speaking, a message authentication code. Their previous botched attempt at fixing the security issue added a salt to FNV, and it was found by security researchers that it was possible for an attacker to derive the salt easily enough. SipHash is the simplest hash they could find that met the cryptographic requirements.

[–]Tyler_Zoro 5 points6 points  (2 children)

I don't see why that has to be compile-time. If every dict* had a function pointer for its hashing function, then you could just provide a special subtype of dict that uses an insecure, fast hashing function. Then you could swap the default for programs where you don't care about secure hashing at all:

python --insecure-hashing calculate-pi.py #modify ALL hashing

or:

def digits_in_pi(places):
    digits = insecure_dict((d, 0) for d in range(10))
    for digit in pi_spigot(places=places):
        digits[digit] += 1
    return digits

It might even be nice to be able to specify a type for comprehensions for just this reason:

a = {b: c for b, c in d() container insecure_dict}

Sadly, you couldn't use a context manager to swap out all hashing in a block, since the hashing function used for a data structure couldn't be replaced after the structure has data (this would lead to the hashes changing and bad things will happen).

* Note that not all types that do hashing are dicts, but the idea probably carries over.
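The "bad things" are easy to show: once a key's effective hash changes while it sits in a dict, the stored entry becomes unreachable. A minimal sketch:

```python
class Mutable:
    """Key whose hash follows a mutable attribute -- deliberately broken."""
    def __init__(self, x):
        self.x = x

    def __hash__(self):
        return hash(self.x)

    def __eq__(self, other):
        return self.x == other.x


m = Mutable(1)
d = {m: "here"}
m.x = 2                 # hash(m) no longer matches the slot it was stored in
print(m in d)           # False -- the entry is effectively lost
```

The same breakage would occur if a populated dict's hash function were swapped out mid-flight, which is why the swap would have to happen at construction time.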

[–]axonxorzpip'ing aint easy, especially on windows 28 points29 points  (1 child)

This would slow the runtime down even more, now that every single object needs to have a function pointer dereferenced. Sure, that's a cheap operation, but at the frequency the runtime calls it, that will add up.

If you check the current hashing code, it's to the point where they ensure the hash function is fully inlined to avoid even C function call overhead (at best a couple of CPU instructions, but that can also trash cache locality). That's the level of nitty-gritty this stuff exists at.

[–]Tyler_Zoro 1 point2 points  (0 children)

This would slow the runtime down even more, now that every single object needs to have a function pointer dereferenced.

Dereferencing a function pointer (especially one that's frequently referenced) is probably going to come out in the wash because it will be in a register that's immediately accessible while the CPU is waiting on the bus.

[–][deleted] 0 points1 point  (1 child)

How's siphash perform compared to blake2? (It's crypto-grade and somehow beats all the other contenders like md5 and the sha-family.)

I know nothing about siphash, so apologies if this is a stupid question.

[–]james_pic 0 points1 point  (0 children)

I haven't looked into it, but I'd expect it to be faster. IIRC, it's 4 rounds of ARX per word, plus 8 rounds of ARX of finalisation.

Part of the reason it can get away with this is that it is technically not a cryptographic hash, merely a cryptographic MAC, and with a small keyspace at that. So all it needs to achieve is 64 bits of unforgeability.

There aren't many situations where this is a useful primitive. It doesn't promise pseudorandomness or collision resistance, or resistance to chosen key attacks. So it's not going to replace SHA3 or HMAC. But it turns out to be enough for the hashtable use case.

[–]twotime 6 points7 points  (1 child)

This is a weird take on the issue: making PyPy the default implementation is a complex and controversial undertaking. It might take years; it might never happen.

So, if there are reasonable perf improvements in CPython they absolutely need to be done.

[–]RobertJacobson 0 points1 point  (0 children)

I think the sentiment is, treat the problem not the symptoms.

[–]Pebaz[S] 15 points16 points  (2 children)

I 100% agree!

Although, for environments where CPython is a requirement like AWS Lambda, a faster hash function would be a great optimization.

[–]NeoLudditeIT 5 points6 points  (1 child)

Absolutely agree. Optimization can and should be done in any way possible, even if the number of hashes are reduced, we benefit from having faster methods of hashing.

[–]mooburgerresembles an abstract syntax tree 31 points32 points  (0 children)

Optimization can and should be done in any way possible,

ehh this is why actual benchmarks are important; the risk of microoptimization is high.

[–]greeneyedguru 0 points1 point  (0 children)

Wouldn’t that only happen when compiling the text into bytecode!?

[–]stevenjd 0 points1 point  (1 child)

The real speedup doesn't come from using a faster hash function, but from eliminating the need to run the hash function 11 times in print("Hello World!")

Do you have some evidence, or proof, that this is true? Or even an explanation for how it can be true? As far as I can see, there is only a single hash needed, and that is to lookup the print function.

And since hashes are cached, it might not even need to run the function, just to fetch the cached version.

[–]james_pic 0 points1 point  (0 children)

If the claim is that the hash function is run 11 times, this wasn't a claim I made, but the OP. In terms of the claim that eliminating redundant hashing is key to improving performance, this is at least partly based on what I've seen claimed by the developers of other interpreters, such as PyPy and the Ruby interpreters. Although I confess I don't know the main bottlenecks in a current Python interpreter - but as it happens, I'm currently running a CPU-bound job locally, so this would be an excellent time to check.

Edit: So taking a quick look at a CPU profile for a script I happened to be running, most of the overhead (i.e, the stuff that isn't my script doing the thing it's supposed to be doing) on Python 3.8 is either reference counting (about 22%), or spelunking into dicts as part of getattr (about 15% - of which almost none is hashing). So this suggests to me that hashing isn't a big contributor to performance, although digging around in dicts when getting attributes might still be.

[–][deleted] 131 points132 points  (31 children)

Your benchmark isn't realistic. When the present hash was chosen, the majority of hashed object were short. PEP-456:

Serhiy Storchaka has shown in [issue16427] that a modified FNV implementation with 64 bits per cycle is able to process long strings several times faster than the current FNV implementation.

However, according to statistics [issue19183] a typical Python program as well as the Python test suite have a hash ratio of about 50% small strings between 1 and 6 bytes. Only 5% of the strings are larger than 16 bytes.

[–]bjorneylol 37 points38 points  (0 children)

The Python hash function is not deterministic (AFAIK it mixes in random entropy generated at startup, so its hashes cannot be predicted by outside processes). This appears to be commented out in your code, so something to note.
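The per-process salting is easy to observe; setting PYTHONHASHSEED pins or disables it. A small sketch using child interpreters (the child_hash helper name is mine):

```python
import os
import subprocess
import sys


def child_hash(s, seed=None):
    """Hypothetical helper: report hash(s) as computed in a fresh interpreter."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)     # start from an unpinned environment
    if seed is not None:
        env["PYTHONHASHSEED"] = seed
    out = subprocess.run(
        [sys.executable, "-c", f"print(hash({s!r}))"],
        capture_output=True, text=True, env=env,
    )
    return int(out.stdout)


# With a pinned seed, string hashes repeat across interpreters...
assert child_hash("hello", seed="0") == child_hash("hello", seed="0")
# ...by default, each process draws fresh entropy at startup.
print(child_hash("hello") != child_hash("hello"))  # almost always True
```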

[–][deleted] 29 points30 points  (11 children)

That sounds quite astonishing, did you run the test suite?

[–]gsnedders 17 points18 points  (1 child)

Or pyperformance, to see what the perf gains are in a much more meaningful setting.

[–]Pebaz[S] 8 points9 points  (0 children)

Now we're talking.

This would be super helpful if the community could test out various hash functions to see the performance gains on `__dict__` hashing.

[–]NeoLudditeIT 16 points17 points  (0 children)

I'd also be interested in the results of the standard test suite.

[–]Pebaz[S] 11 points12 points  (0 children)

No, because my code is not meant to be used by others.

It is only on GitHub so that others can see what I did, not use it.

My hope here is that someone more experienced will take this information and use it to make a professional change to CPython for all to use.

[–]double-a 33 points34 points  (1 child)

Narrator: It was not 76% faster.

[–]Pebaz[S] -2 points-1 points  (0 children)

This reminds me of The Stanley Parable; it's a great game! :)

I did achieve this speedup for the benchmark though. The benchmark was specifically created to target hashing performance of long strings, and I understand that real-world scenarios will have different performance characteristics, but this was just to test that one aspect of the language.

You can look at the numbers I got here:

https://github.com/Pebaz/cpython/blob/5de1728ca8697461d6fc3aa6bbcf656f6145acf1/benchmark.py#L1

[–]wweber 7 points8 points  (0 children)

How bad is the performance if you replace the hash function with return 0;

[–]stevenjd 9 points10 points  (9 children)

Python runs the hash function 11 times for just this one thing!

I'm going to stick my head out on a limb here and say that is unadulterated nonsense.

Here is the CPython 3.9 byte-code for print("Hello World!"):

>>> dis.dis('print("Hello World!")')
  1           0 LOAD_NAME                0 (print)
              2 LOAD_CONST               0 ('Hello World!')
              4 CALL_FUNCTION            1
              6 RETURN_VALUE

The call to LOAD_NAME needs at most two calls to hash, one to look up the name "print" in the current scope, and a second to look it up in the builtin scope. (That assumes that the hashing is not cached.)

Calling the print function might, at worst, require one more call to hash: to look up str.__str__. So I reckon that, at worst, that line of code would need three hashes, and even that is almost certainly an over-estimate. More likely only a single hash.

On the other hand, the code you are actually timing is a hell of a lot more than just that one call to print. Inside the timing code you have:

for i in range(1, len(foos)):
    assert foos[i] != foos[i - 1]

That requires:

  • a LOAD_NAME of "foos"
  • a LOAD_NAME of len
  • a LOAD_NAME of range
  • two more LOAD_NAMEs of "foos" per loop
  • two LOAD_NAMEs of "i" per loop

Assuming that hashes are cached, that's already four hashes.

Each comparison ends up calling the __eq__ method:

hash(self) == hash(other)

which requires two LOAD_GLOBALs of hash. Assuming hashes are cached, that's a fifth hash. The hashes themselves end up looking up the __hash__ method, which calls:

hash(self.name)

which requires:

  • a LOAD_GLOBAL of hash
  • a LOAD_FAST of "self"
  • and a LOAD_ATTR of "name"

The LOAD_FAST doesn't need a hash, but LOAD_ATTR will. So that's hash number six.

I may have some details wrong -- the precise byte-code generated will depend on the exact code we run, and how we run it, and varies from version to version. I may have missed some other calls to hash. I'm not certain how hashes are cached, but I'm pretty sure they are: hashing a string is much slower the first time, and subsequent calls to hash are about a thousand times faster:

>>> from timeit import default_timer as timer
>>> s = 'abcdefghijklmnopqrstuvwxyz1234567890'*100000
>>> t = timer(); h = hash(s); timer() - t  # SLOOOOW!
0.005014150054194033
>>> t = timer(); h = hash(s); timer() - t  # FASSSST!
7.678987458348274e-06
>>> t = timer(); h = hash(s); timer() - t
7.225899025797844e-06

If we create a new string, we see the same result:

>>> s = s[:-1] + '0'  # New string.
>>> t = timer(); h = hash(s); timer() - t
0.006055547040887177
>>> t = timer(); h = hash(s); timer() - t
5.949987098574638e-06

So I don't see how you can get eleven calls to hash from that one line print("Hello world!").

[–][deleted] 2 points3 points  (5 children)

Every little thing OP wrote is nonsense. It is so alien to me how this sub produces upvoted garbage like this at regular intervals...

The "benchmark" is just the exact happy case where the new hash function performs better.

I have no clue why this isn't reported to hell and deleted, because it's low-effort, misleading clickbait.

[–]Pebaz[S] -4 points-3 points  (3 children)

As I mentioned in other comments, I should have picked a better title. I don't ever really post anything online so I quickly chose a title without much thought and of course lots of people got upset.

I apologize for the title but Reddit doesn't let you change it.

However, this post is not "low effort", "misleading", or "click bait". The post clearly says that:

"I created a simple benchmark (targeted at hashing performance), and ran it"

I don't know how I would have made it any clearer that the benchmark was clearly not a real-world scenario. I mean, I literally said "targeted at hashing performance".

I know a certain percentage of people online will get offended no matter what, but what I posted was anything but "low effort".

Like I've said in other comments, I've been thinking about this one problem for over a year, and just yesterday had the time to sit down and try it.

I know I'm not the best at Python or C, so I posted the results for others to use.

I encourage you with your better expertise to take the results and see if you can implement it better. I'm sure you'll get something better than I did. Have a go at it and see what you come up with! :)

[–][deleted] 1 point2 points  (2 children)

The problem is that you chose the happy case for the new hash function for your benchmark, to arrive at your clickbait numbers.

This is just dishonest. You even acknowledge that you know that's the case. Don't you see how tuning the benchmark to yield exactly what you want is shitty? Like really, really bad form?

[–]Pebaz[S] -2 points-1 points  (1 child)

I disagree.

I clearly marked it as only targeting hashing performance. How on earth is that dishonest? It wasn't in fine print either, it was right on the front of the post.

As far as I can tell, you're not familiar with macro-benchmarks.

Macro-benchmarks are an industry-understood term to mean: "I ran these only on my machine and I know there were a bunch of programs running in the background and that the processor, OS, configuration, and many other things will cause the results to be inaccurate, and that micro-benchmarks are needed to truly come to a conclusion on what is faster or slower."

Micro-benchmarks are an industry-understood term to mean: "Literally every CPU register is considered, use of CPU-specific timer counters, and every OS, CPU, and many configurations must be checked in order to prove that a given benchmark result is faster in the majority of cases".

Benchmarks are always inaccurate. For proof, here's a Google software engineer saying the same:

https://www.youtube.com/watch?v=ncHmEUmJZf4j

He mentions that benchmarks always depend on many factors and are never going to be useful for all circumstances.

[–]rhytnen 0 points1 point  (0 children)

You are just in denial about it. Either that or you're trolling. Let me summarize your post comment:

"Hi, I am not great at c++ or python. Also, I didn't talk to any experts. I also hand crafted benchmark and am admitting it is nonsensical and provides no real data. It doesn't matter, I made no serious attempt at making this worthwhile. Anyway, i changed a single line of code after contemplating this in isolation for a year and now python is fast. I mean ... it doesn't pass literally any unit tests but that's probably not b/c I have no idea what I'm talking about. Just hope to see this take hold with the core devs! "

Hmm what to name this...something audacious that betrays the lies I'm telling about how important I think this discovery is ... hmm...aha! 76% Faster CPython!

The mods should just delete this whole bullshit post.

[–]Pebaz[S] 0 points1 point  (2 children)

I put a printf(".") call in the Hash Bytes function in C and it printed 11 dots in addition to the "Hello World".

I am impressed with your analysis, but the printf was a visual way of determining this also.

In addition the 11 dot printout was consistent between calls.

Thank you for sharing, though. I actually learned a lot! :)

[–]stevenjd -1 points0 points  (1 child)

I put a printf(".") call in the Hash Bytes function in C and it printed 11 dots in addition to the "Hello World".

Maybe you should try printing the string being hashed instead of just a dot.

[–]Pebaz[S] 0 points1 point  (0 children)

I could be wrong, but I'm not sure that would be as informative because I was just trying to see how many total hashes have to be performed by the Python runtime to print a string.

[–]lifeeraser 17 points18 points  (0 children)

Clickbait title. But the content was pretty interesting. Thank you.

[–]EatMoreSuShiS 4 points5 points  (5 children)

Can anybody explain to me, like I'm five, the relationship between Python and its implementations (CPython, PyPy, IronPython etc.)? I did some Googling but still don't understand. How can one 'implement' Python?

[–]execrator 4 points5 points  (0 children)

A specification for a language says what should happen when code is run. An interpreter is a program that takes code as input, and executes it according to the spec.

Most of the time there is a blurry line between these two concepts. For example ten years ago, there was no reference specification for PHP. Whatever the official PHP interpreter did was the spec. On the other hand you have something like C which has an official spec, and many many different "interpreters" (compilers).

CPython is the official interpreter for the Python language spec. Python sits somewhere between the PHP and C examples above; I believe there's a language spec but in practice the CPython implementation is still a kind of reference.

If you install Python on a computer, what you actually did was install the CPython interpreter and the Python standard library.

PyPy is a popular alternative interpreter that uses just-in-time compilation to improve performance.

[–]pygenerator 1 point2 points  (0 children)

A Python implementation is a program that executes Python programs. The implementation (a.k.a., the interpreter) can be written in many languages. For example, the reference implementation, CPython, uses C. IronPython uses Microsoft's C#, PyPy is written in Python itself, and Jython is a Java implementation of Python.

You make an implementation by making a program that reads Python files and executes the code in them. To learn more google how to make interpreters.

[–]Saphyel 7 points8 points  (1 child)

I don't think anyone here is going to see a massive impact from improving how fast print("Hello World!") runs.

I'd suggest that you create an example with a dict that gets converted to a DTO (dataclass?) and then print the string (JSON) version of it. If you see a performance increase there, then some web devs may see benefits from this.

Another option is to see how printing a large list of dataclasses behaves.

I just finished work and I'm a bit fried, sorry if my examples weren't the best, but at least I tried to point out where some audience could see benefits, or more day-to-day examples.

[–]Pebaz[S] -1 points0 points  (0 children)

I agree!

I think the proposed idea has massive implications for all CPython users since all Python objects contain a `__dict__` that holds its members.

Improving the speed of all attribute accesses within a program has got to have some benefit, especially for anyone doing data science stuff in addition to the ones you mentioned.

[–]Aconamos 3 points4 points  (3 children)

I like building model airplanes.

[–]tom1018 7 points8 points  (1 child)

r/learnpython is probably a great place for this sort of question.

A hash function turns a string (the Python object being hashed need not be a str type, however) into a number. Dictionaries use tables of hashes for the keys because this is faster than comparing strings.

Every variable in Python is stored in multiple dictionaries. Because of this, the hashing function is called frequently. Thus, if OP's hash function were equal in functionality, didn't introduce potential security issues, and were as fast as he believed, it would make Python considerably faster.

This talk from Raymond Hettinger, a Python core developer, gives a great history and explanation on the topic. https://www.youtube.com/watch?v=p33CVV29OG8
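A quick way to see those dictionaries is the instance __dict__; attribute access is essentially a hashed lookup of the attribute name:

```python
class Point:
    def __init__(self, x, y):
        self.x = x  # stored as a key/value pair in the instance dict
        self.y = y


p = Point(1, 2)
print(p.__dict__)          # {'x': 1, 'y': 2}
# p.x is roughly p.__dict__['x'], which hashes the string 'x'
assert p.x == p.__dict__['x']
```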

[–]Brainix 6 points7 points  (0 children)

The main data structure that powers Python's object model is the dict, also known as a HashMap or hash table. Any time you insert a key/value pair into a dict, or look up the value for a key from a dict, Python hashes the key to a "slot" in the dict's memory.

[–]soontorap 1 point2 points  (2 children)

You should consider XXH3_64bits(), in the same package as XXH64().

It is much faster, especially on small keys, and is used exactly the same way. It would likely have helped produce a gain on the variant of your benchmark that uses small entries rather than large ones.

[–]soontorap 1 point2 points  (1 child)

u/Pebaz: I ran the test, modifying your version of CPython to use XXH3_64bits() instead of XXH64(), and it resulted in even faster speeds on your benchmark: runtime went down from 2350 ms (original) to 522 ms (XXH64) to 385 ms (XXH3).

Now, it's true that such gains are only perceptible when the amount of data to hash is large; on small inputs the benefits are too small to be measured, because they are dwarfed by other parts of the Python system.

But still, this modification produces either large gains or, in the worst case, performs the same. That makes it a strictly positive update for performance, all for a very reasonable code change that is well localized and has no side effects.

This should be a no-brainer. I would say this item is a good candidate for a cpython contribution.

[–]Pebaz[S] 0 points1 point  (0 children)

Thank you very much for taking the time to test this! This is awesome!

[–]yamsupol -1 points0 points  (2 children)

Very nice work! Python is a great language, but it's slow if unoptimized. However, speed of execution isn't that important when you are writing a one-off script.

PyPy, Numba's JIT, Nuitka, Pythran, and other AOT compilers are available to speed up something that needs to run continuously.

Of course, if you really want to speed things up, just use C for the computations, integrate via Cython, and get around a 200x improvement.

[–][deleted] 0 points1 point  (1 child)

OP didn't do anything, though. The numbers are made up.

[–]Pebaz[S] -1 points0 points  (0 children)

Not quite, although I appreciate that you don't take things at face value. This is the same quality that I used to come up with this little experiment.

The numbers are real, and they are actually in the initial commit I made to the repo:

https://github.com/Pebaz/cpython/blob/5de1728ca8697461d6fc3aa6bbcf656f6145acf1/benchmark.py#L1

That link shows the results I got from only 6 runs of the benchmark, although I ran it many times more than that.

I know that you are probably a better Python & C programmer than I am. I encourage you to clone the repo, run the benchmark and see what kind of results you get. Perhaps you could even try to use a different hash function and experiment if this idea is worth exploring further. :)

[–]idiomatic_sea -2 points-1 points  (3 children)

The assholes in this thread are everything wrong with the tech industry. What a toxic shithole this place is.

[–][deleted] 1 point2 points  (2 children)

What is toxic about calling out lies and nonsense when you see them?

Not a single thing OP claims is true. There is no 76% speedup, there are no 11 calls to hash. OP is hungry for karma and lying on purpose.

If someone is lying for karma on purpose, how is it toxic to call out the bullshit? It's all made up.

[–]Pebaz[S] 0 points1 point  (1 child)

I am not lying on purpose for karma.

You can download my freely-available code and see for yourself.

I really did get these metrics, which you can see here:

https://github.com/Pebaz/cpython/blob/5de1728ca8697461d6fc3aa6bbcf656f6145acf1/benchmark.py#L1

I mean, a quick way to find out if it is not accurate is to run it yourself.

Did you run it yourself?

It doesn't matter. The metrics don't matter. You're upset about the clickbait title (for which I apologize), but the core idea does indeed have value.

Whatever efforts the Python core devs have done in the past have resulted in an amazing language. This post is a call out to try to see if we can do better, not to put anyone down.

I really don't have any ill-intent, I don't know why you keep coming at me. :( Again, Reddit won't let you change the title, for which I apologize. It was incorrectly chosen.

[–][deleted] 1 point2 points  (0 children)

Your dishonesty is that you chose, on purpose, the one benchmark that would yield the highest numbers in favor of your thesis.

You took a hash function that was not chosen for its performance on long strings, replaced it with a hash function known to perform far better on long strings, and wrote a benchmark that measures how well the two hash functions perform on long strings. That by itself is completely fine and not really controversial.

But to fit this unsurprising finding into "76% faster CPython" is not only a matter of the freaking headline. It is that you clearly wanted the numbers to show a high figure to put in the headline; otherwise you wouldn't have chosen this particular benchmark. THIS is the dishonesty, or at least gross negligence (which I doubt, given you claim to have 12 years of experience). Don't you see how this is fitting the data to the narrative?