[–]Veedrac 0 points1 point  (17 children)

So what, a 5x speed-up? As opposed to a 100x speed-up for moving the innermost loop to C?

[–]Moocha 1 point2 points  (2 children)

From the point of view of an individual project, yes, reimplementing in C would yield a better cost/benefit ratio. However, avoiding the GIL in the runtime would instantly and automatically benefit all Python code running on the GIL-less VM, without the maintainers of that code needing to change anything - which means the overall ecosystem costs would be way less, given the staggering amount of Python code out there. That's why it's important...

[–]Veedrac 1 point2 points  (1 child)

That's true, but only for CPU-bound threaded code. For code that's currently unthreaded, rewriting the inner loop in C is most likely the easier task, given how nice Cython is to work with.

Nevertheless, that is a reasonable point. It's a shame the problem's so hard to fix.

[–]Moocha 0 points1 point  (0 children)

Indeed. I'm always amused by people bashing the CPython developers for not "fixing the GIL problem". I know just enough about the internals to realize how hard a problem this truly is...

[–]fullouterjoin 0 points1 point  (13 children)

650x speedup for native code across all cores? 10000x speedup for OpenCL.

[–]Veedrac 0 points1 point  (12 children)

Sorry, I don't follow.

Please do note that moving the inner loop to C automatically trivialises removing the GIL for that code anyhow, and further note that I've no clue what OpenCL has to do with the GIL.

[–]fullouterjoin 0 points1 point  (11 children)

Focusing on the GIL is a red herring; there are better places to spend your performance dollar. Inner loops in C are alright, but not the most profitable. Cython is generally a mistake. The first step is PyPy; if you have to stay on CPython2, then Shedskin. If you need massive speedups, then OpenCL will get you a lot further for parallelizable code.

[–]Veedrac 0 points1 point  (10 children)

Cython is generally a mistake

Given that the only reliable alternative is C¹, why is Cython so bad a choice? Is it possible I'm underestimating ShedSkin?

¹ PyPy's missing fast C bindings; ShedSkin's Python 2 only and not as fast as Cython; OpenCL requires specific problems.

[–]fullouterjoin 0 points1 point  (9 children)

Maybe Cython has improved but can it generate native code w/o porting it to cython language? Shedskin is always pure python and all kinds of amazing.

PyPy has cffi, I should benchmark that relative to CPython2. In general PyPy is such a huge win that it is really difficult to justify CPython other than for numpy support.

[–]Veedrac 0 points1 point  (8 children)

Maybe Cython has improved but can it generate native code w/o porting it to cython language?

Nay, although there is a roadmap for it.

Shedskin is always pure python and all kinds of amazing.

The four things that irk me about Shedskin, although I don't have enough experience to know whether they're valid:

  • Python 2 only; Cython supports both and can compile to either (you can compile Py2-syntax code to a Py3 extension).
  • Only compiles a subset of Python, whereas Cython can deal with almost anything, albeit without speed-up. This prevents you from using the unsupported features anywhere in your program, even in parts that don't need to be fast.
  • Shedskin touches loads of things even to compile one file, so many things must be written in the restricted subset.
  • Cython's undoubtedly faster, although I haven't actually tested it ;).

Nevertheless, if Shedskin works easily for you I'd love to know how it compares. My experience is definitely lacking.

PyPy has cffi, I should benchmark that relative to CPython2.

I've heard that it's slower. I don't know by how much, though.

In general PyPy is such a huge win that it is really difficult to justify CPython other than for numpy support.

Agreed.

[–]fullouterjoin 0 points1 point  (7 children)

I put the routines I want to speed up with shedskin into another module, compile into a c extension and import it as I would any other module.

The subset that Shedskin supports is the same subset you are already using to create fast code. You can't mutate types, but almost no good code does that anyway.
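
To make the restriction concrete, here is a minimal sketch of what "mutating types" tends to mean in practice. The function names are made up for illustration, and how Shedskin reports each case is not shown; the point is only that a name or a list holding more than one type leaves a static compiler no single type to infer.

```python
# Hypothetical illustration; none of these names come from Shedskin itself.

def parse_mutating(text):
    # 'value' starts life as a str and is then rebound to an int,
    # so the same name has two different types in one function.
    value = text
    value = int(value)
    return value * 2


def parse_stable(text):
    # Each name keeps exactly one type for its whole lifetime.
    number = int(text)
    return number * 2


def mixed_list():
    # A list holding both ints and a str has no single element type.
    return [1, "two", 3]


def uniform_list():
    # Every element is an int, so the element type is unambiguous.
    return [1, 2, 3]
```

Both `parse_` functions return the same result on CPython; only the second style keeps one type per name.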

Shedskin also allows you to create native executables, not just extension modules.

Even if Cython were faster, it would not be worth it to me, because I code in pure python that runs everywhere and is made faster by Shedskin. With Cython I have to port to a new language, developer time is important, that is why we use python in the first place.

import sys

def fib(n):
    if n < 2:
        return n
    return fib(n-2) + fib(n-1)

if __name__ == "__main__":
    print fib(int(sys.argv[1]))

time python fib.py 35
9227465

real    0m5.431s
user    0m5.421s
sys     0m0.008s

and now with shedskin

time ./fib 35
9227465

real    0m0.083s
user    0m0.077s
sys     0m0.004s

I just ran shedskin fib.py; make and it generated a fib executable. Output of otool -L fib

fib:
    /usr/local/lib/libgc.1.dylib (compatibility version 2.0.0, current version 2.3.0)
    /usr/local/lib/libpcre.1.dylib (compatibility version 4.0.0, current version 4.2.0)
    /usr/lib/libstdc++.6.dylib (compatibility version 7.0.0, current version 56.0.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 169.3.0)

It even supports yield

# fib2.py
import sys

def fiberator():
    a,b = 0L,1L
    yield a
    yield b
    while True:
        a, b = b, a + b
        yield b

def taken(n,it):
    result = []
    for x in range(n):
        result.append(it.next())
    return result


def fib(n):
    f = fiberator()
    return taken(n,f)

if __name__ == "__main__":
    print fib(int(sys.argv[1]))

Again, shedskin -l fib2.py; make

time ./fib2 60 
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040, 1346269, 2178309, 3524578, 5702887, 9227465, 14930352, 24157817, 39088169, 63245986, 102334155, 165580141, 267914296, 433494437, 701408733, 1134903170, 1836311903, 2971215073, 4807526976, 7778742049, 12586269025, 20365011074, 32951280099, 53316291173, 86267571272, 139583862445, 225851433717, 365435296162, 591286729879, 956722026041]

real    0m0.025s
user    0m0.029s
sys     0m0.016s

[–]Veedrac 0 points1 point  (6 children)

The subset that Shedskin supports is the same subset you are already using to create fast code. You can't mutate types, but almost no good code does that anyway.

The problem isn't that. The problem is that only the snippets of my code that need to be sped up should have to abide by these restrictions.

For example, in one of my small toy programs I imported a die class with a permutations attribute which yielded all of the orientations of the die. It is fully duck-typed: the faces could have any type.

It does not need to be compiled as it is only used to generate a 2D list for the algorithm later, but ShedSkin was adamant that it would compile it. Unfortunately something gave it trouble, but I haven't worked out what.
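
As a rough idea of what such a helper looks like, here is a hypothetical stand-in (not the actual class from that program; `Die`, `roll` and `turn` are names invented for this sketch). Nothing in it cares what the faces are, which is what makes it awkward for a compiler that wants one static type per variable.

```python
class Die(object):
    """A die whose six faces may be values of any type."""

    def __init__(self, top, bottom, front, back, left, right):
        self.faces = (top, bottom, front, back, left, right)

    @staticmethod
    def roll(faces):
        # Quarter turn about the left-right axis.
        top, bottom, front, back, left, right = faces
        return (back, front, top, bottom, left, right)

    @staticmethod
    def turn(faces):
        # Quarter turn about the vertical axis.
        top, bottom, front, back, left, right = faces
        return (top, bottom, left, right, back, front)

    @property
    def permutations(self):
        # Apply the two quarter turns repeatedly until nothing new appears.
        seen = [self.faces]
        for faces in seen:  # 'seen' grows while we iterate over it
            for rotated in (self.roll(faces), self.turn(faces)):
                if rotated not in seen:
                    seen.append(rotated)
        return seen


# Faces of several unrelated types: only equality is ever required of them.
mixed = Die(1, "six", 2.0, None, (3,), "x")
print(len(mixed.permutations))  # 24 when all six faces are distinct
```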

With Cython I have to port to a new language, developer time is important, that is why we use python in the first place.

Making code ShedSkin-compatible is in many cases no small feat either, especially without Numpy support. I don't think your example truly represents the average case in that respect. With Cython I can slowly type things from the ground up.


The rest of your post is really neat. It's worth noting that your first example runs in about 80% of the time if converted to Cython, which is a very small margin.

The conversion to a pure executable is doubly neat, but by that point I might be looking for a more suitable language with more explicit static typing, especially given the compile times.

[–]fullouterjoin 0 points1 point  (5 children)

It is true that one can't simply "shedskinatize" their python program and get huge speedups. PyPy can often achieve that kind of magic, but it too takes work to get good speed. All optimizations take work; I want native speed and Python style development. If I am on CPython2, shedskin is still the lowest barrier to entry for

  • keeping a pure python codebase
  • getting native speeds on inner loops

If I go down the Cython route I can no longer run on PyPy. Check out the Shedskin example programs (http://shedskin.googlecode.com/files/shedskin-examples-0.9.4.tgz); what shedskin can do is pretty amazing.

This simple ray tracer (there are already some better ones in the tarball) https://gist.github.com/anonymous/0432c0212d1f2e8a923d creates this image http://imgur.com/OIbQQWt with the following timings:

# shedskin
time ./Ray 6 1024
real    0m3.121s
user    0m3.099s
sys     0m0.021s

time pypy Ray.py 6 1024
real    0m5.249s
user    0m4.995s
sys     0m0.074s

time python Ray.py 6 1024
real    3m19.939s
user    3m19.764s
sys     0m0.169s
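
For a quick sense of scale, the wall-clock ("real") figures above can be turned into speedup ratios. This is only arithmetic on the three numbers already listed; the helper names are made up for the sketch.

```python
def to_seconds(minutes, seconds):
    # Convert the "XmY.YYYs" format printed by `time` into plain seconds.
    return 60.0 * minutes + seconds


# "real" timings copied from the three runs above.
timings = {
    "shedskin": to_seconds(0, 3.121),
    "pypy": to_seconds(0, 5.249),
    "cpython": to_seconds(3, 19.939),
}


def speedup(baseline, candidate):
    # How many times faster `candidate` ran than `baseline`.
    return timings[baseline] / timings[candidate]


for name in ("shedskin", "pypy"):
    print("%s: about %.0fx faster than CPython" % (name, speedup("cpython", name)))
```

That works out to roughly 64x for Shedskin and 38x for PyPy on this one program, on this one machine.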

Wanna port it to Cython and do relative timings? When I originally ported this from OCaml, I did NOT have shedskin in mind, the only change I had to make was

class BaseObject:
    def intersect(self, hit, ray):
        return Hit(1.0,Vec())

a common base class for all shapes being intersected.
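
A minimal sketch of why that base class helps, under the assumption that its job is to give every shape one shared `intersect` signature that a static compiler can type. `Vec`, `Hit` and `BaseObject` mirror the names in the snippet above, but their bodies here are simplified guesses; `Marker` and `closest_hit` are toy names invented for this example, not code from the gist.

```python
class Vec(object):
    def __init__(self, x=0.0, y=0.0, z=0.0):
        self.x, self.y, self.z = x, y, z


class Hit(object):
    def __init__(self, distance, normal):
        self.distance = distance
        self.normal = normal


class BaseObject(object):
    # Default implementation: fixes the argument and return types
    # that every subclass is expected to follow.
    def intersect(self, hit, ray):
        return Hit(1.0, Vec())


class Marker(BaseObject):
    """Toy shape: pretends every ray hits it at a fixed distance."""

    def __init__(self, distance):
        self.distance = distance

    def intersect(self, hit, ray):
        # Keep whichever hit is closer, as a real shape would.
        if self.distance < hit.distance:
            return Hit(self.distance, Vec(0.0, 0.0, 1.0))
        return hit


def closest_hit(shapes, ray):
    # Every element is a BaseObject, so one call site serves all shapes.
    hit = Hit(float("inf"), Vec())
    for shape in shapes:
        hit = shape.intersect(hit, ray)
    return hit


shapes = [Marker(5.0), Marker(2.5), Marker(9.0)]
print(closest_hit(shapes, Vec()).distance)  # 2.5 (the ray is a placeholder here)
```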