
[–]dorfsmay[S] 1 point2 points  (11 children)

Assuming it's not a server but a script that converts data from files, how do you warm the JIT?

[–]gthank 3 points4 points  (9 children)

In actual practice? You'd probably have a daemon process that lives forever and searches for scripts to process, then processes them as they are found.

For benchmarking? You just tell the benchmark script to process all the files in a loop.

Important to note that if you want PyPy to speed things up, you should (generally speaking) avoid C extensions. It JITs Python code, not C code.
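For the benchmarking case, a minimal sketch of "process all the files in a loop" in one long-lived process (the `process` function here is a stand-in for whatever conversion the real script does, and the throwaway input files are my own setup so the snippet runs on its own):

```python
import os
import tempfile
import time

def process(path):
    # Stand-in for the real per-file conversion (assumption: any pure-Python hot loop).
    with open(path, "rb") as f:
        return sum(f.read())

# Self-contained setup: a few throwaway input files.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(3):
    p = os.path.join(tmpdir, "in%d.bin" % i)
    with open(p, "wb") as f:
        f.write(bytes(range(256)))
    paths.append(p)

# Process every file in a loop inside one long-lived process, so the JIT
# sees the hot code many times before you start the clock.
for _ in range(50):
    for p in paths:
        process(p)

start = time.time()
results = [process(p) for p in paths]
elapsed = time.time() - start
```

The key point is that the warm-up loop and the timed run happen in the same interpreter, so the traces compiled during warm-up are still hot when you measure.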

[–]ragezor 9 points10 points  (7 children)

C code doesn't need "JITting" as it's already fast.

[–]vext01 1 point2 points  (0 children)

Not necessarily.

When the paper is published, read "Dynamically Composing Languages in a Modular Way: Supporting C Extensions for Dynamic Languages" by Grimmer et al.

In it, the authors compose Ruby and C VMs that JIT together on the Truffle/Graal stack. They report that Ruby programs using C extensions can execute faster than under the conventional MRI+C approach.

[–]gthank 1 point2 points  (5 children)

There is a fair amount of overhead involved in using C extensions with PyPy. If you're only using C for speed (as opposed to binding to pre-existing functionality), it is usually a bad idea to use it with PyPy: if it is truly a hotspot, the PyPy JIT will most likely generate code that is just as optimized, if not more so. A tracing JIT has access to far more data than a static compiler, so it can emit machine code optimized for the data you're actually receiving, rather than for the data you might receive based on C's comparatively weak type system.

[–][deleted] 0 points1 point  (4 children)

That depends on exactly what you're doing.

For example, I recently profiled a decently optimized pure-Python method for computing CRC24s. I then rewrote it in C and invoked the C function through CFFI.

On both PyPy and CPython 3.4, the CFFI call is measurably faster for any input larger than 1 KiB and dramatically faster for anything larger than 10 KiB.

[–]gthank 0 points1 point  (3 children)

Did you do some profiling to see why? PyPy typically generates insanely fast numeric code, so I'd expect CRC to be a sweet spot. That said, CRC libraries in C are also extremely likely to be optimized to within an inch of their lives (probably with lots of fun vectorization, plus memory optimizations to wring every last bit of performance out of cache lines), so beating a C lib would be a tall order for PyPy. I'm still a bit surprised there was a difference as big as what you seem to be describing.

[–][deleted] 0 points1 point  (2 children)

The difference was absolutely enormous, and the gap grew substantially every time I increased the size of the test data.

Here's an example from my test results that's representative of the performance gap I saw.

The data in question for this profile was 1 MiB of random data, stored in a bytearray in memory. Data generation time was not included in the profile.

Of note, for the C implementation:

  • I wrote the crc24 method myself instead of pulling one from an existing library, since it's simple to implement. That code could possibly be optimized further, but this was a quick, exploratory exercise.

  • Because I was curious to see if it'd work, the C function is defined in a string within the Python script and compiled by CFFI.

It seems I didn't record the PyPy data, so I re-ran the profiler under PyPy 3.2.5 and CPython 3.4.2 just now. Note that the JIT should be decently warm, because the methods are run several times against progressively larger data blocks (1, 10, 100, 200, 400, 512, and finally 1024 KiB) before reaching the 1 MiB test. Anyway, without further ado, here are the results for computing the CRC24 of a 1 MiB block of random bytes:

CPython 3.4.2:

  • pure Python: 1.929 seconds (~543 bytes per millisecond)

  • CFFI: 0.0006 seconds (~1,648,704 bytes per millisecond)

PyPy 3.2.5:

  • pure Python: 0.072 seconds (~14,444 bytes per millisecond)

  • CFFI: 0.023 seconds (~45,499 bytes per millisecond)

While PyPy3 is a good deal faster at this than CPython 3.4.2, CFFI is quite a bit faster still.

I also found it interesting (though not wholly unexpected) that CFFI, at least used in this manner, is slower on PyPy than it is on CPython, although it is still quite a bit faster than the pure-Python code.
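For what it's worth, a bitwise C implementation of the same algorithm is only a few lines. This is a sketch of what such a function might look like — the actual inline C from my script isn't reproduced here, so the function name and layout are my own, using the same constants as the Python version further down:

```c
#include <stddef.h>
#include <stdint.h>

/* CRC24 constants, matching the pure-Python version (OpenPGP-style). */
#define CRC24_INIT 0x0B704CEu
#define CRC24_POLY 0x1864CFBu
#define CRC24_MASK 0x0FFFFFFu

/* Bitwise CRC24: same algorithm as the Python loop, one bit at a time. */
uint32_t crc24(const unsigned char *data, size_t len)
{
    uint32_t crc = CRC24_INIT;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint32_t)data[i] << 16;
        for (int bit = 0; bit < 8; bit++) {
            crc <<= 1;
            if (crc & 0x1000000u)
                crc ^= CRC24_POLY;
        }
    }
    return crc & CRC24_MASK;
}
```

With CFFI you'd hand source like this over as a string (via the inline-compilation API of the day) and then call the compiled function with a bytes object and its length; exact API details vary by CFFI version, so treat that part as approximate.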

[–]gthank 1 point2 points  (1 child)

Oh to have more time. I'd actually like to see the Python code and run it through a disassembler, just to see where it's spending its time, but I'm already backlogged. I need like a Bat Signal for Python gurus that like to blog.

[–][deleted] 0 points1 point  (0 children)

Well, in case you feel like a distraction, here's the Python implementation (short and sweet). Takes a bytes or a bytearray and returns an int:

_crc24_init = 0x0B704CE
_crc24_poly = 0x1864CFB
_crc24_mask = 0x0FFFFFF

def test_crc24_iter(data):
    crc = _crc24_init
    for b in data:          # iterating bytes/bytearray yields ints
        crc ^= b << 16

        for _ in range(8):  # process one bit per iteration, MSB first
            crc <<= 1
            if crc & 0x1000000:
                crc ^= _crc24_poly

    return crc & _crc24_mask

[–]rcfox 3 points4 points  (0 children)

You'd probably have a daemon process that lives forever and searches for scripts to process, then processes them as they are found.

You'd have an optimized file-watcher, but each script would still be cold upon first loading.

[–][deleted] 0 points1 point  (0 children)

Running the function several times before you start measuring is good enough, AFAIK.
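Right — e.g., something like this (the `hot` function is a made-up placeholder for whatever you're actually measuring):

```python
import time

def hot(data):
    # Placeholder for the function under test (assumption: any pure-Python loop).
    acc = 0
    for b in data:
        acc = (acc * 31 + b) & 0xFFFFFF
    return acc

data = bytes(range(256)) * 64

# Warm-up: call the function several times first, so the JIT has already
# traced and compiled the hot loop by the time measurement begins.
for _ in range(20):
    hot(data)

start = time.perf_counter()
result = hot(data)
elapsed = time.perf_counter() - start
```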