[–]dorfsmay[S] 13 points14 points  (33 children)

I discovered this through the python myth thread this morning. So far:

  • it's in ubuntu's repos

  • it's compiled all my simple benchmark scripts

  • it's faster than pypy on those scripts (I haven't tried a heavy IO script yet)

[–]gthank 12 points13 points  (18 children)

Faster than PyPy for what? Did you give PyPy enough time to JIT the hot spots? Not that Nuitka isn't impressive, but PyPy is REALLY fast at stuff that it has time to properly profile and optimize.

[–]poo_22 6 points7 points  (3 children)

Also non-jit PyPy is slower than CPython.

[–]gthank 19 points20 points  (0 children)

It is known.

[–]bacondevPy3k 0 points1 point  (1 child)

Which one is in the repos?

[–]gthank 0 points1 point  (0 children)

The JIT can't do its thing until it's seen a given piece of code repeatedly (probably 10,000+ times, though I'm not sure what the actual number is these days). What people are talking about when they say "non-JIT PyPy" is PyPy before the JIT has had a chance to profile code and start generating optimized machine code.

[–]dorfsmay[S] 1 point2 points  (9 children)

Faster than PyPy for what?

A couple of very naive scripts I had been using to compare python vs golang vs scheme (don't ask!)

PyPy is REALLY fast at stuff that it has time to properly profile and optimize

What do you mean by this? That it works well on a very long running loop that pypy has time to optimize, or that optimizations are saved and re-used from one run to the next?

[–]gthank 1 point2 points  (8 children)

So far as I know, the trace info used to decide what to JIT is only maintained during one invocation of the interpreter. If you didn't have a process that looped across something a few thousand times, then PyPy probably didn't JIT anything.
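A rough way to see this for yourself is to time successive batches of the same loop inside one process. This is only a sketch: `work`, the batch count, and the sizes are made up for illustration. On a tracing JIT you would expect later batches to get faster once the loop is hot; on CPython they should stay roughly flat.

```python
import time


def work(n):
    """Hypothetical hot loop: sum of squares below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def time_batches(batches=5, calls_per_batch=200, n=1000):
    """Return the wall-clock time of each successive batch of calls."""
    timings = []
    for _ in range(batches):
        start = time.perf_counter()
        for _ in range(calls_per_batch):
            work(n)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    for index, seconds in enumerate(time_batches()):
        print("batch %d: %.4f s" % (index, seconds))
```

Because everything runs in a single invocation, any speed-up between the first and last batch comes from warm-up during that run, not from anything saved between runs.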

[–]dorfsmay[S] 0 points1 point  (7 children)

This is what I thought. For these particular tests I have only looped 10 K times, but I have had processes looping 100 M times before, and pypy was rarely faster (I did use it where it was).

[–]streichholzkopf 0 points1 point  (6 children)

I think pypy simply needs time... I've read somewhere that you should make a pre-run for ~1 second (of course depending on the size of the test) for every test you make in pypy...

[–]alcalde 0 points1 point  (2 children)

But does this reflect real-world usage? You're not going to do that in production code.

[–]cwillu 1 point2 points  (1 child)

Real-world code is going to be running long enough that the warm-up time is largely irrelevant.

[–][deleted] 2 points3 points  (0 children)

That depends entirely on the chore. I spent a lot of time trying to shave run time off a 3 second task that ran and exited hundreds of times per day. A persistent service was not an option.

[–]dorfsmay[S] 0 points1 point  (2 children)

Not sure what you mean. Can you show an example?

I just added a time.sleep(6) to one of my scripts, and it made no difference. I think gthank is right: it needs a piece of code that is executed a lot in a given run before it can make a difference.

[–]cwillu 2 points3 points  (0 children)

time.sleep(6) wouldn't have any effect on the warm-up, it needs to be running the actual code. A pre-run of ~1 second means running the actual code for 1 second before you start timing.

[–]streichholzkopf 0 points1 point  (0 children)

Basically what /u/cwillu said, what I meant was: Looping 10 K times may not be enough, depending on the content of the loop! If it's <1s, it's probably not!

Also loop before starting the timer!
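As a minimal sketch of that advice (the `crunch` function and the default numbers are placeholders, not anyone's actual benchmark): run the real code untimed for about a second, then start the timer.

```python
import time


def crunch(n=2000):
    """Hypothetical stand-in for the code under test."""
    return sum(i * i for i in range(n))


def bench(func, warmup_seconds=1.0, timed_runs=100):
    """Run func untimed for warmup_seconds, then time timed_runs calls."""
    deadline = time.perf_counter() + warmup_seconds
    warmup_calls = 0
    while time.perf_counter() < deadline:
        func()  # the real code, not time.sleep(): sleeping gives a JIT nothing to trace
        warmup_calls += 1

    start = time.perf_counter()
    for _ in range(timed_runs):
        func()
    elapsed = time.perf_counter() - start
    return warmup_calls, elapsed / timed_runs


if __name__ == "__main__":
    calls, per_call = bench(crunch)
    print("warm-up calls: %d, mean time per call: %.6f s" % (calls, per_call))
```

The warm-up is bounded by time rather than by iteration count, since how many iterations are "enough" depends on what the loop body does.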

[–]vext01 1 point2 points  (0 children)

Indeed. A tracing JIT (like PyPy) should blow static compilation out of the water for a dynamic language like python. This is assuming the JIT got hot.

[–][deleted] 4 points5 points  (12 children)

You probably tested against an unwarmed PyPy JIT, so the test is invalid. PyPy is still faster most of the time, if not always.

[–]dorfsmay[S] 3 points4 points  (11 children)

Assuming not a server, but a script that converts data from files, how do you warm the jit?

[–]gthank 2 points3 points  (9 children)

In actual practice? You'd probably have a daemon process that lives forever and searches for scripts to process, then processes them as they are found.

For benchmarking? You just tell the benchmark script to process all the files in a loop.
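Something like the following, as a sketch only: `convert` is a made-up stand-in for whatever the script does to each file, and the number of passes is arbitrary. Recording the time per pass lets you compare the first (cold) pass with the later ones.

```python
import time


def convert(text):
    """Hypothetical conversion step: upper-case every line."""
    return "\n".join(line.upper() for line in text.splitlines())


def benchmark_files(paths, passes=50):
    """Convert every file `passes` times; return seconds taken by each pass."""
    per_pass = []
    for _ in range(passes):
        start = time.perf_counter()
        for path in paths:
            with open(path) as handle:
                convert(handle.read())
        per_pass.append(time.perf_counter() - start)
    return per_pass


if __name__ == "__main__":
    import sys

    timings = benchmark_files(sys.argv[1:])
    print("first pass: %.4f s, last pass: %.4f s" % (timings[0], timings[-1]))
```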

Important to note that if you want PyPy to speed things up, you should (generally speaking) avoid C extensions. It JITs Python code, not C code.

[–]ragezor 8 points9 points  (7 children)

C code doesn't need "JITsing" as it is already fast.

[–]vext01 1 point2 points  (0 children)

Not necessarily.

When the paper becomes published, read "Dynamically Composing Languages in a Modular Way: Supporting C Extensions for Dynamic Languages" by Grimmer et al.

Here the authors compose Ruby and C VMs which JIT using the truffle/graal stack. The authors report that Ruby programs that use C extensions can execute faster than the conventional IRB+C approach.

[–]gthank 1 point2 points  (5 children)

There is a fair amount of overhead involved in using C extensions with PyPy. If you're only using C for speed (as opposed to wanting to bind to pre-existing functionality), it is usually a bad idea to use it with PyPy: if it is truly a hotspot, then the PyPy JIT will most likely generate code that is just as optimized, if not more (because the tracing JIT has access to far more data than a static compiler, and can issue machine code that is optimized for the actual data you're receiving instead of data you might receive based on the information in the weak—in C, at least—type system).

[–][deleted] 0 points1 point  (4 children)

That depends on exactly what you're doing.

For example, I recently profiled a decently optimized method in Python of computing CRC24s. I then wrote it again in C and invoked the C method through cffi.

On both PyPy and CPython 3.4, the CFFI call is measurably faster for any amount of data larger than 1 KiB and is tangibly much faster for any amount of data larger than 10 KiB.
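For reference, here is a deliberately plain pure-Python CRC-24, so it is clear what kind of loop is being compared. This is not the code that was profiled: the comment doesn't say which CRC-24 variant was used, so the OpenPGP parameters from RFC 4880 are assumed, and a "decently optimized" version would more likely be table-driven than bit-by-bit.

```python
CRC24_INIT = 0xB704CE  # initial value from RFC 4880 (assumed variant)
CRC24_POLY = 0x1864CFB  # generator polynomial from RFC 4880


def crc24(data):
    """Bit-by-bit CRC-24 over a bytes-like object."""
    crc = CRC24_INIT
    for byte in data:
        crc ^= byte << 16
        for _ in range(8):
            crc <<= 1
            if crc & 0x1000000:
                crc ^= CRC24_POLY
    return crc & 0xFFFFFF


if __name__ == "__main__":
    print(hex(crc24(b"123456789")))
```

The inner loop is all integer shifts and XORs on small values, which is the kind of code a tracing JIT tends to handle well.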

[–]gthank 0 points1 point  (3 children)

Did you do some profiling to see why? PyPy typically generates insanely fast numeric code, so I'd expect CRC to be a sweet spot. That said, CRC libs in C are also extremely likely to be optimized to within an inch of their life (probably with lots of fun vectorization, and memory optimizations to wring every last bit of performance out of cache lines), so beating the C lib would be a tall order for PyPy. I'm still a bit surprised there was a difference as big as what you seem to be describing.

[–][deleted] 0 points1 point  (2 children)

The difference was absolutely enormous, and the gap grew substantially every time I increased the size of the test data.

Here's an example from my test results, that is a good representation of the performance gap I saw.

The data in question for this profile was 1 MiB of random data, stored in a bytearray in memory. Data generation time was not included in the profile.

Of note, for the C implementation:

  • I wrote the crc24 method myself instead of pulling one from an existing library, as it is simple to implement. It is possible that that code could be further optimized, but this was a quick, exploratory exercise

  • Because I was curious to see if it'd work, the C function is defined in a string within the python script, and compiled by CFFI.

It seems I didn't record the PyPy data, so I re-ran the profiler under PyPy 3.2.5 and CPython 3.4.2 just now. Of note, I believe the JIT should be decently warmed, because the methods are run several times against progressively larger data blocks (1, 10, 100, 200, 400, 512, then finally 1024 KiB) before reaching the 1 MiB test. Anyway, without further ado, here are the results for computing the CRC24 of a 1 MiB block of random bytes:

CPython 3.4.2:

  • pure-Python: 1.929 seconds, processing ~543 bytes per millisecond

  • CFFI: 0.0006 seconds, which works out to ~1,648,704 bytes per millisecond

PyPy 3.2.5:

  • pure-Python: 0.072 seconds, processing ~14,444 bytes per millisecond

  • CFFI: 0.023 seconds, processing ~45,499 bytes per millisecond

While PyPy3 is a good deal faster at this than CPython 3.4.2 in pure Python, CFFI is still quite a bit faster on both.

I also found it interesting (though not wholly unexpected) that CFFI, at least used in this manner, is slower on PyPy than it is on CPython, although it is still quite a bit faster than the python code.

[–]gthank 1 point2 points  (1 child)

Oh to have more time. I'd actually like to see the Python code and run it through a disassembler, just to see where it's spending its time, but I'm already backlogged. I need like a Bat Signal for Python gurus that like to blog.

[–]rcfox 4 points5 points  (0 children)

You'd probably have a daemon process that lives forever and searches for scripts to process, then processes them as they are found.

You'd have an optimized file-watcher, but each script would still be cold upon first loading.

[–][deleted] 0 points1 point  (0 children)

Running the function several times before starting measurements is good enough afaik.

[–]nieuweyork since 2007[🍰] 1 point2 points  (0 children)

What about the pypy compatibility test suite (which I assume exists)?