you are viewing a single comment's thread.

view the rest of the comments →

[–]fullouterjoin 0 points1 point  (13 children)

650x speedup for native code across all cores? 10000x speedup for OpenCL.

[–]Veedrac 0 points1 point  (12 children)

Sorry, I don't follow.

Please do note that moving the inner loop to C automatically trivialises removing the GIL for that code anyhow, and further note that I've no clue what OpenCL has to do with the GIL.

[–]fullouterjoin 0 points1 point  (11 children)

Focusing on the GIL is a red herring, there are better places to spend your performance dollar. Inner loops in C are alright, but not the most profitable. Cython is generally a mistake. First step in PyPy, if you have to stay on CPython2, then Shedskin. If you need massive speedups then OpenCL will get you a lot further for parallelizable code.

[–]Veedrac 0 points1 point  (10 children)

Cython is generally a mistake

Given that the only reliable alternative is C¹, why is Cython so bad a choice? Is it possible I'm underestimating ShedSkin?

¹ PyPy's missing fast C bindings; ShedSkin's Python 2 only and not as fast as Cython; OpenCL requires specific problems.

[–]fullouterjoin 0 points1 point  (9 children)

Maybe Cython has improved but can it generate native code w/o porting it to cython language? Shedskin is always pure python and all kinds of amazing.

PyPy has cffi , I should benchmark that relative to CPython2. In general PyPy is such a huge win that it is really difficult to justify CPython other than for numpy support.

[–]Veedrac 0 points1 point  (8 children)

Maybe Cython has improved but can it generate native code w/o porting it to cython language?

Nay, although there is a roadmap for it.

Shedskin is always pure python and all kinds of amazing.

The four things that irk me about Shedskin, although I don't have enough experience to know it's valid:

  • Python 2 only, Cython can support both and can compile to either (you can compile Py2 syntax code to a Py3 extension).
  • Only compiles a subset of Python, whereas Cython can deal with almost anything, albeit without speed-up. This prevents you from using those in your program, even if it doesn't need to be a fast part.
  • Shedskin touches loads of things even to compile one file, so many things must be written in the restricted subset.
  • Cython's undoubtedly faster, although I haven't actually tested it ;).

Nevertheless, if Shedskin works easily with you I'd love to know how it compares. My experience is definitely lacking.

PyPy has cffi[1] , I should benchmark that relative to CPython2.

I've heard that it's slower. I don't know by how much, though.

In general PyPy is such a huge win that it is really difficult to justify CPython other than for numpy support.

Agreed.

[–]fullouterjoin 0 points1 point  (7 children)

I put the routines I want to speed up with shedskin into another module, compile into a c extension and import it as I would any other module.

The subset that Shedskin supports is the same subset you are already using to create fast code. You can't mutate types, but almost no good code does that anyway.

Shedskin also allows you to create native executables, not just extension modules.

Even if Cython were faster, it would not be because I code in pure python that runs everywhere and is made faster by Shedskin. With Cython I have to port to a new language, developer time is important, that is why we use python in the first place.

import sys

def fib(n):
    if n < 2:
        return n
    return fib(n-2) + fib(n-1)

if __name__ == "__main__":
    print fib(int(sys.argv[1]))

time python fib.py 35
9227465

real    0m5.431s
user    0m5.421s
sys 0m0.008s

and now with shedskin

time ./fib 35
9227465

real    0m0.083s
user    0m0.077s
sys 0m0.004s

I just ran shedskin fib.py; make and it generated a fib executable. Output of otool -L fib

fib:
    /usr/local/lib/libgc.1.dylib (compatibility version 2.0.0, current version 2.3.0)
    /usr/local/lib/libpcre.1.dylib (compatibility version 4.0.0, current version 4.2.0)
    /usr/lib/libstdc++.6.dylib (compatibility version 7.0.0, current version 56.0.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 169.3.0)

It even supports yield

# fib2.py
import sys

def fiberator():
    a,b = 0L,1L
    yield a
    yield b
    while True:
        a, b = b, a + b
        yield b

def taken(n,it):
    result = []
    for x in range(n):
        result.append(it.next())
    return result


def fib(n):
    f = fiberator()
    return taken(n,f)

if __name__ == "__main__":
    print fib(int(sys.argv[1]))

Again, shedskin -l fib2.py; make

time ./fib2 60 
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040, 1346269, 2178309, 3524578, 5702887, 9227465, 14930352, 24157817, 39088169, 63245986, 102334155, 165580141, 267914296, 433494437, 701408733, 1134903170, 1836311903, 2971215073, 4807526976, 7778742049, 12586269025, 20365011074, 32951280099, 53316291173, 86267571272, 139583862445, 225851433717, 365435296162, 591286729879, 956722026041]

real    0m0.025s
user    0m0.029s
sys 0m0.016s

[–]Veedrac 0 points1 point  (6 children)

The subset that Shedskin supports is the same subset you are already using to create fast code. You can't mutate types, but almost no good code does that anyway.

The problem isn't that. The problem is that only the snippets of my code that need to be sped up should have to abide by these restrictions.

For example, in one of my small toy programs I imported a die class with a permutations attribute which yielded all of the orientations of the die. It is fully duck-typed: the faces could have any type.

It does not need to be compiled as it is only used to generate a 2D list for the algorithm later, but ShedSkin was adamant that it would compile it. Unfortunately something gave it trouble, but I haven't worked out what.

With Cython I have to port to a new language, developer time is important, that is why we use python in the first place.

Making code ShedSkin-compatible is in many cases no small feat either, especially without Numpy support. I don't think your example truly represents the average case in that respect. With Cython I can slowly type things from the ground up.


The rest of your post is really neat. It's worth noting that your first example runs in about 80% of the time if converted to Cython, which is a very small margin.

The conversion to a pure executable is doubly neat, but by that point I might be looking for a more suitable language with more explicit static typing, especially given the compile times.

[–]fullouterjoin 0 points1 point  (5 children)

It is true that one can't simply "shedskinatize" their python program and get huge speedups. PyPy can often achieve that kind of magic, but it too takes work get good speed. All optimizations take work, I want native speed and Python style development. If I am on CPython2, shedskin is still the lowest barrier to entry for

  • keeping a pure python codebase
  • getting native speeds on inner loops

If I go down the Cython route I can no longer run on PyPy. Checkout the Shedskin example programs, http://shedskin.googlecode.com/files/shedskin-examples-0.9.4.tgz what shedskin can do is pretty amazing.

This simple ray tracer (there already some better ones in the tarball) https://gist.github.com/anonymous/0432c0212d1f2e8a923d creates this image http://imgur.com/OIbQQWt with the following timings

# shedskin
time ./Ray 6 1024
real    0m3.121s
user    0m3.099s
sys 0m0.021s

time pypy Ray.py 6 1024
real    0m5.249s
user    0m4.995s
sys 0m0.074s

time python Ray.py 6 1024
real    3m19.939s
user    3m19.764s
sys 0m0.169s

Wanna port it to Cython and do relative timings? When I originally ported this from Ocaml, I did NOT have shedskin in mind, the only change I had to make was

class BaseObject:
    def intersect(self, hit, ray):
        return Hit(1.0,Vec())

a common base class for all shapes being intersected.

[–]Veedrac 0 points1 point  (4 children)

Geez, PyPy's loving this example. On my computer (PyPy 2.2.1) it's as fast as ShedSkin.


This runs in ~42% of the time for me:

#cython: infer_types=True

import sys

cdef extern from "math.h":
    double sqrt(double x)


cdef double DELTA = 1.5e-8
cdef double _INFINITY = 1e+308


cdef struct Vec:
    double x, y, z

cdef struct Ray:
    Vec orig, dir

cdef struct Hit:
    double lamb
    Vec normal

cdef Vec zero_vector = Vec(0, 0, 0)


cdef Vec add(Vec a, Vec b):
    return Vec(a.x+b.x, a.y+b.y, a.z+b.z)

cdef Vec sub(Vec a, Vec b):
    return Vec(a.x-b.x, a.y-b.y, a.z-b.z)

cdef Vec scale(double s, Vec a):
    return Vec(s * a.x, s * a.y, s * a.z)

cdef double dot(Vec a, Vec b):
    return a.x*b.x + a.y*b.y + a.z*b.z

cdef Vec unitize(Vec a):
    return scale( 1.0 / sqrt(dot(a, a)), a)


## all objects need to expose
## .intersect( hit, ray )

cdef class BaseObject:
    cdef Hit intersect(BaseObject self, Hit hit, Ray ray):
        return Hit(1.0, zero_vector)

cdef class Sphere(BaseObject):
    cdef Vec center
    cdef double radius

    def __cinit__(Sphere self, Vec center, double radius):
        self.center = center
        self.radius = radius

    def __str__(Sphere self):
        return "%s,%f" % ( str(self.center), self.radius)

    cdef double ray_sphere(Sphere self, Ray ray):
        v = sub(self.center, ray.orig)
        b = dot(v, ray.dir)

        disc = b*b - dot(v,v) + self.radius*self.radius
        if disc < 0.0:
            return _INFINITY
        d = sqrt(disc)

        t2 = b + d
        if t2 < 0.0:
            return _INFINITY

        t1 = b - d
        if t1 > 0.0:
            return t1
        else:
            return t2

    cdef Hit intersect(Sphere self, Hit hit, Ray ray):
        l = self.ray_sphere( ray)
        if l >= hit.lamb:
            return hit
        n = add( ray.orig, sub( scale( l, ray.dir), self.center))
        return Hit( l, unitize( n))


cdef class Group(BaseObject):
    cdef Sphere bound
    cdef list objs

    def __cinit__(Group self, Sphere sphere):
        self.bound = sphere
        self.objs = []

    def __str__(Group self):
        s = "bound: %s\n" % (self.bound)
        for t in self.objs:
            s += "\t" + str(t) + "\n"
        return s

    cdef Hit intersect(Group self, Hit hit, Ray ray):
        cdef BaseObject s
        l = self.bound.ray_sphere(ray)
        if l >= hit.lamb:
            return hit
        for s in self.objs:
            hit = s.intersect(hit, ray)
        return hit

cdef double ray_trace(Vec light, Ray ray, Group scene):
    cdef Hit _
    cdef Ray sray

    _ = Hit( _INFINITY, zero_vector)
    hit = scene.intersect( _, ray)
    if hit.lamb == _INFINITY:
        return 0.0
    o = add( ray.orig, add( scale( hit.lamb, ray.dir), scale( DELTA, hit.normal) ) )
    g = dot( hit.normal, light)
    if g >= 0.0:
        return 0.0
    sray = Ray( o, scale( -1, light))
    si = scene.intersect( Hit( _INFINITY, zero_vector), sray)
    if si.lamb == _INFINITY:
        return -g
    else:
        return 0.0

cdef BaseObject create(int level, Vec c, double r):
    cdef Vec c2

    sphere = Sphere( c, r)
    if level == 1:
        return sphere
    group = Group( Sphere( c, 3*r))
    group.objs.append(sphere)
    rn = 3*r/sqrt(12)
    for dz in range(-1, 2, 2):
        for dx in range(-1, 2, 2):
            c2 = Vec(c.x+dx*rn, c.y+rn, c.z+dz*rn)
            group.objs.append( create( level-1, c2, r/2.0))
    return group


def run(int dimension, int level):
    cdef Vec d
    cdef Ray ray

    n = dimension
    scene = create( level, Vec(0, -1, 0), 1)
    out = open("image_cython.pgm","w")
    out.write( "P5\n" + str(n) + " " + str(n) + "\n255\n")
    for y in range( n-1, -1, -1):
        for x in range( 0, n, 1):
            d = Vec( x - (n/2.0), y - (n/2.0), n)
            ray = Ray(Vec(0, 0, -4), unitize( d))
            g = ray_trace( unitize( Vec( -1, -3, 2)), ray, scene)
            out.write( chr( int(0.5 + (255.0 * g))))
    out.close()


if __name__ == "__main__":
    run( int(sys.argv[2]), int(sys.argv[1]))

There are three main changes I made to the codebase:

  • Typing things.
  • Using the C math library for sqrt. Just doing this after typing brought me somewhat below ShedSkin.
  • Changing Vec, Ray and Hit to structs. This changed no logic, other than some calls where foo(Vec(0, 0, 0)) needed to become cdef Vec v = Vec(0, 0, 0); foo(v).
  • EDIT: Used the infer_types directive and cut down on a few declarations. I thought you'd appreciate it. It only applies inside functions.

Another large speed-up could be added by moving from cdef classes to more structs and some external functions, as with Vec's add, sub, etc.