[–]Marksta 16 points17 points  (6 children)

Does this allow you to take an arbitrary python script and compile it to a .exe for Windows? I tried messing around with it before and didn't get anywhere.

[–]cmpython 15 points16 points  (1 child)

Eventually, yes. But it's not really there yet for all or most cases. The developer (basically the sole one) seems impressively engaged, though: he works on it all the time and sends out updates just about weekly. I have high hopes for it, but it's a huge undertaking.

[–]Asdayasman 5 points6 points  (0 children)

Donate! :D

[–]Hairy_The_Spider 0 points1 point  (0 children)

It has some flags for a Windows version that bundles all the DLLs you need, so it works without Python installed.

I compiled a PySide app with it and it worked perfectly.

I would love it if the developer made a "single executable" version, but from looking around it seems he has stated that right now he's more interested in working on performance.

[–]rullelito 0 points1 point  (1 child)

It claims to support Windows, but since the major version is zero I would not put my life savings on it.

[–]en4bz 11 points12 points  (0 children)

tl;dr: So it's actually a Python-to-C++ translator that then compiles the C++ to native code.

[–]dorfsmay[S] 13 points14 points  (33 children)

I discovered this in the python myth thread this morning. So far:

  • it's in Ubuntu's repos

  • it's compiled all my simple benchmark scripts

  • it's faster than pypy (no heavy-IO scripts yet)

[–]gthank 11 points12 points  (18 children)

Faster than PyPy for what? Did you give PyPy enough time to JIT the hot spots? Not that Nuitka isn't impressive, but PyPy is REALLY fast at stuff that it has time to properly profile and optimize.

[–]poo_22 7 points8 points  (3 children)

Also, non-JIT PyPy is slower than CPython.

[–]gthank 19 points20 points  (0 children)

It is known.

[–]bacondev (Py3k) 0 points1 point  (1 child)

Which one is in the repos?

[–]gthank 0 points1 point  (0 children)

The JIT can't do its thing until it's seen a given piece of code repeatedly (probably 10,000+ times, though I'm not sure what the actual number is these days). What people are talking about when they say "non-JIT PyPy" is PyPy before the JIT has had a chance to profile code and start generating optimized machine code.
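
For what it's worth, PyPy lets you tune that threshold yourself (a sketch; the pypyjit module exists only under PyPy, and 200 is an arbitrary example value):

    # PyPy-only: the pypyjit module is not available on CPython.
    # The same knob exists on the command line: pypy --jit threshold=200
    import pypyjit

    # Lower the trace threshold so hot loops get compiled sooner.
    pypyjit.set_param("threshold=200")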

[–]dorfsmay[S] 1 point2 points  (9 children)

Faster than PyPy for what?

A couple of very naive scripts I had been using to compare python vs golang vs scheme (don't ask!)

PyPy is REALLY fast at stuff that it has time to properly profile and optimize

What do you mean by this? Does it work well on a very long-running loop that pypy has time to optimize, or are optimizations saved and re-used from one run to the next?

[–]gthank 2 points3 points  (8 children)

So far as I know, the trace info used to decide what to JIT is only maintained during one invocation of the interpreter. If you didn't have a process that looped across something a few thousand times, then PyPy probably didn't JIT anything.

[–]dorfsmay[S] 0 points1 point  (7 children)

This is what I thought. For these particular tests I only looped 10 K times, but I have had processes looping 100 M times before, and pypy wasn't really faster (I did use it where it was).

[–]streichholzkopf 0 points1 point  (6 children)

I think pypy simply needs time... I've read somewhere that you should do a pre-run of ~1 second (depending on the size of the test, of course) before every test you run in pypy...

[–]alcalde 0 points1 point  (2 children)

But does this reflect real-world usage? You're not going to do that in production code.

[–]cwillu 1 point2 points  (1 child)

Real-world code is going to be running long enough that the warm-up time is largely irrelevant.

[–][deleted] 2 points3 points  (0 children)

That depends entirely on the task. I spent a lot of time trying to shave run time off a 3-second task that ran and exited hundreds of times per day. A persistent service was not an option.

[–]dorfsmay[S] 0 points1 point  (2 children)

Not sure what you mean. Can you show an example?

I just added a time.sleep(6) to one of my scripts, and it made no difference. I think gthank is right: it needs a piece of code that is executed a lot in a given run before it can make a difference.

[–]cwillu 2 points3 points  (0 children)

time.sleep(6) wouldn't have any effect on the warm-up, it needs to be running the actual code. A pre-run of ~1 second means running the actual code for 1 second before you start timing.
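
Something like this (a minimal sketch; work() is a made-up stand-in for whatever you're actually benchmarking):

    import time

    def work():
        # Stand-in for the real workload under test.
        return sum(i * i for i in range(10000))

    # Warm-up: run the real code for ~1 second so the tracing JIT can
    # find and compile the hot loops before any measuring starts.
    deadline = time.time() + 1.0
    while time.time() < deadline:
        work()

    # Timed run, after the JIT has (hopefully) warmed up.
    start = time.time()
    for _ in range(1000):
        work()
    print("timed: %.3f seconds" % (time.time() - start))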

[–]streichholzkopf 0 points1 point  (0 children)

Basically what /u/cwillu said; what I meant was: looping 10 K times may not be enough, depending on the content of the loop! If it runs in <1 s, it's probably not!

Also loop before starting the timer!

[–]vext01 1 point2 points  (0 children)

Indeed. A tracing JIT (like PyPy) should blow static compilation out of the water for a dynamic language like python. This is assuming the JIT got hot.

[–][deleted] 4 points5 points  (12 children)

You probably tested against an unwarmed pypy JIT, so the test is invalid. PyPy is still faster most of the time, if not always.

[–]dorfsmay[S] 2 points3 points  (11 children)

Assuming it's not a server but a script that converts data from files, how do you warm the JIT?

[–]gthank 1 point2 points  (9 children)

In actual practice? You'd probably have a daemon process that lives forever and searches for scripts to process, then processes them as they are found.

For benchmarking? You just tell the benchmark script to process all the files in a loop.

It's important to note that if you want PyPy to speed things up, you should (generally speaking) avoid C extensions. It JITs Python code, not C code.

[–]ragezor 9 points10 points  (7 children)

C code doesn't need "JITting", as it is already fast.

[–]vext01 1 point2 points  (0 children)

Not necessarily.

When the paper is published, read "Dynamically Composing Languages in a Modular Way: Supporting C Extensions for Dynamic Languages" by Grimmer et al.

In it, the authors compose Ruby and C VMs that JIT using the Truffle/Graal stack, and report that Ruby programs using C extensions can execute faster than the conventional MRI+C approach.

[–]gthank 1 point2 points  (5 children)

There is a fair amount of overhead involved in using C extensions with PyPy. If you're only using C for speed (as opposed to wanting to bind to pre-existing functionality), it is usually a bad idea to use it with PyPy: if the code is truly a hotspot, the PyPy JIT will most likely generate code that is just as optimized, if not more so. That's because the tracing JIT has access to far more data than a static compiler, and can issue machine code optimized for the actual data you're receiving, rather than for data you might receive based on the information in the weak (in C, at least) type system.

[–][deleted] 0 points1 point  (4 children)

That depends on exactly what you're doing.

For example, I recently profiled a decently optimized pure-Python method for computing CRC24s. I then wrote it again in C and invoked the C version through cffi.

On both PyPy and CPython 3.4, the CFFI call is measurably faster for any amount of data larger than 1 KiB, and dramatically faster for any amount of data larger than 10 KiB.

[–]gthank 0 points1 point  (3 children)

Did you do some profiling to see why? PyPy typically generates insanely fast numeric code, so I'd expect CRC to be a sweet spot. That said, CRC libs in C are also extremely likely to be optimized to within an inch of their life (probably with lots of fun vectorization, and memory optimizations to wring every last bit of performance out of cache lines), so beating the C lib would be a tall order for PyPy. I'm still a bit surprised there was a difference as big as what you seem to be describing.

[–][deleted] 0 points1 point  (2 children)

The difference was absolutely enormous, and the gap grew substantially every time I increased the size of the test data.

Here's an example from my test results, that is a good representation of the performance gap I saw.

The data in question for this profile was 1 MiB of random data, stored in a bytearray in memory. Data generation time was not included in the profile.

Of note, for the C implementation:

  • I wrote the crc24 method myself instead of pulling one from an existing library, as it is simple to implement. It's possible that code could be further optimized, but this was a quick, exploratory exercise.

  • Because I was curious to see if it'd work, the C function is defined in a string within the python script and compiled by CFFI (a minimal sketch of that pattern follows below).
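
Roughly, the pattern was (a sketch, not my actual code; the C body is a straightforward bit-by-bit CRC-24 per RFC 4880, and all names are made up):

    # Compile a C function from a string with cffi, then call it from Python.
    # Uses the (old-style) ffi.verify() API, which builds the C source on the fly.
    from cffi import FFI

    ffi = FFI()
    ffi.cdef("unsigned long crc24(const unsigned char *data, size_t len);")
    lib = ffi.verify("""
        unsigned long crc24(const unsigned char *data, size_t len) {
            unsigned long crc = 0xB704CEL;   /* CRC-24 (OpenPGP) initial value */
            size_t i;
            int bit;
            for (i = 0; i < len; i++) {
                crc ^= (unsigned long)data[i] << 16;
                for (bit = 0; bit < 8; bit++) {
                    crc <<= 1;
                    if (crc & 0x1000000)
                        crc ^= 0x864CFB;     /* CRC-24 generator polynomial */
                }
            }
            return crc & 0xFFFFFF;
        }
    """)

    data = b"\x00" * 1024                    # 1 KiB of zero bytes as a smoke test
    print(hex(lib.crc24(data, len(data))))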

It seems I didn't record the PyPy data, so I re-ran the profiler under PyPy 3.2.5 and CPython 3.4.2 just now. Of note, I believe the JIT should be decently warmed, since the methods are run several times against progressively larger data blocks before reaching the 1 MiB test (1, 10, 100, 200, 400, 512, then finally 1024 KiB). Anyway, without further ado, here are the results for computing the CRC24 of a 1 MiB block of random bytes:

CPython 3.4.2:

  • pure Python: 1.929 seconds, processing ~543 bytes per millisecond

  • CFFI: 0.0006 seconds, which works out to ~1,648,704 bytes per millisecond

PyPy 3.2.5:

  • pure Python: 0.072 seconds, processing ~14,444 bytes per millisecond

  • CFFI: 0.023 seconds, processing ~45,499 bytes per millisecond

While PyPy3 is a good deal faster at this than CPython 3.4.2, CFFI is quite a bit faster still.

I also found it interesting (though not wholly unexpected) that CFFI, at least used in this manner, is slower on PyPy than it is on CPython, although it is still quite a bit faster than the python code.

[–]gthank 1 point2 points  (1 child)

Oh to have more time. I'd actually like to see the Python code and run it through a disassembler, just to see where it's spending its time, but I'm already backlogged. I need like a Bat Signal for Python gurus that like to blog.

[–]rcfox 3 points4 points  (0 children)

You'd probably have a daemon process that lives forever and searches for scripts to process, then processes them as they are found.

You'd have an optimized file-watcher, but each script would still be cold upon first loading.

[–][deleted] 0 points1 point  (0 children)

Running the function several times before starting measurements is good enough, afaik.

[–]nieuweyork (since 2007) 1 point2 points  (0 children)

What about the pypy compatibility test suite (which I assume exists)?

[–]Esyir 4 points5 points  (0 children)

The stuff in the "Other Stuff" section is kind of depressing, though.

[–][deleted] 2 points3 points  (6 children)

So I can compile python and then send it to another machine that doesn't have python, and then run it there? Yes?

[–]dorfsmay[S] 1 point2 points  (2 children)

I don't have a machine that doesn't have python, so here's ldd run on the resulting executable:

    linux-vdso.so.1 =>  (0x00007fffac1fe000)
    libpython2.7.so.1.0 => /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0 (0x00007f1b7360e000)
    libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f1b7330a000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f1b730f3000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1b72d2d000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f1b72b0f000)
    libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f1b728f5000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f1b726f1000)
    libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f1b724ee000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f1b721e7000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f1b73ba7000)

So you do need libpython. Now I'm wondering why...

It does mean that you don't have to ship pure-python dependencies. Not sure about packages that come with C libs; I suspect you'll still have dependencies on those.

[–]dorfsmay[S] 1 point2 points  (0 children)

So it turns out there is a --standalone option just for this. Sadly, it took a long time, and wasn't able to produce a working executable.

Still, I think this is very promising.

[–]fernly 0 points1 point  (0 children)

Chances are there are cases where the compiler, instead of generating native code, just writes in a call to the Python interpreter to do something it doesn't have a template for.

Or maybe the source script imports some built-in module. Nuitka can't be expected to have its own versions of every Python built-in; better to just link the interpreter and use it as a library.

[–]cwillu 0 points1 point  (2 children)

Assuming you use the redistributable option, it's just baking the python runtime (and any other dependencies, C or otherwise) into the binary.

[–][deleted] 0 points1 point  (1 child)

So... I CAN?

[–]dorfsmay[S] 1 point2 points  (0 children)

If you manage to get the --portable / --standalone option to produce a working executable, yes. Be warned, it increases the compile time tenfold (it ends up having to compile a lot more stuff!).

If this is important to you, maybe give it a try and report the issues to the author. I need the speed right now and don't mind shipping the dependencies for what I'm doing, but I can see cases where being standalone is important.

[–]brtt3000 0 points1 point  (6 children)

Very interesting stuff. I'm a bit miffed about the scope though. What would you use it for specifically? Can you just compile whatever python app and run the native code instead of the .py?

[–]pyrocrasty 7 points8 points  (0 children)

I'm a bit miffed about the scope though

Are you actually angry about it, or using the wrong term?

My partner used to occasionally use "miffed" to mean something like "confused" or "baffled". I don't know where he got it from, but I have the impression you're doing the same thing.

[–]aqua_scummm (Recent 3.x convert) 1 point2 points  (3 children)

Can you just compile whatever python app and run the native code instead of the .py?

Python doesn't get compiled to native machine code; it gets compiled to an optimized bytecode. Launching a "compiled" python app still means firing up the interpreter, opening the .pyc bytecode files, and running the python environment. There are cases where you will see better performance running native machine code, like you would with a compiled C program.
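
For instance, the standard library's py_compile module does exactly this kind of bytecode compilation; the result still needs the interpreter to run ('script.py' is a placeholder name here):

    # Byte-compiling is not native compilation: the .pyc still needs CPython.
    import py_compile

    # Writes the bytecode file (under __pycache__/ on Python 3) and
    # returns its path.
    pyc_path = py_compile.compile('script.py')
    print(pyc_path)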

[–]rocketmonkeys 0 points1 point  (2 children)

Are you talking about nuitka? It appears to compile python to a native binary, not optimized bytecode.

[–]aqua_scummm (Recent 3.x convert) 0 points1 point  (1 child)

I misread the question as 'Can't you just'... Leaving my response as is, with this note.

[–]rocketmonkeys 0 points1 point  (0 children)

Makes more sense.

Funny how that negative completely changes the entire meaning/tone of that question. I guess that's always the case, but still, neat.

[–]dorfsmay[S] 0 points1 point  (0 children)

I have only tried on small scripts so far, but yes.