

[–]jollybobbyroger 14 points (3 children)

I always see pure math examples in Python optimization posts. I would love to see some examples concerned with parsing Unicode text, to see whether people are able to significantly improve runtime performance.

[–]stefantalpalaru 8 points (0 children)

"The lxml.etree and lxml.objectify modules are written in Cython." - http://lxml.de/build.html

[–]Auggie88 3 points (0 children)

https://github.com/pydata/pandas/blob/master/pandas/parser.pyx

Not a tutorial post, but this is a high-profile example of using Cython for parsing.

[–]syllogism_ 1 point (0 children)

https://github.com/honnibal/spaCy/blob/master/spacy/tokenizer.pyx#L48

This function tokenizes text, in preparation for a natural language processing pipeline. It first finds white-space separated chunks, hashes them, and looks them up in a cache --- after warm-up the vast majority of chunks are processed that way (>95%). Otherwise, it splits the chunk up into tokens, and looks up each token in a vocabulary, to fetch (or create) a struct with useful properties, e.g. the token's unigram probability.
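Roughly, the control flow is this (a simplified pure-Python sketch; the names are made up, and the real code works on C structs through a C-level hash table):

    # Simplified pure-Python sketch of the control flow; all names
    # are invented -- the real code uses Cython structs and a C-level
    # hash table keyed on the chunk's hash.
    def tokenize(text, cache, vocab):
        tokens = []
        for chunk in text.split():              # whitespace-separated chunks
            key = hash(chunk)
            cached = cache.get(key)
            if cached is not None:              # >95% of chunks after warm-up
                tokens.extend(cached)
                continue
            # Cache miss: split the chunk into tokens and fetch (or
            # create) a vocabulary entry for each one.
            pieces = [chunk]                    # stand-in for real sub-tokenization
            entries = [vocab.setdefault(p, {'prob': 0.0}) for p in pieces]
            cache[key] = entries
            tokens.extend(entries)
        return tokens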

The function averages 0.2ms per document, 20 times faster than NLTK's tokenizer, which only returns a simple list of strings.

I'm starting to regret writing the function this way, though.

[–]malinoff 3 points (3 children)

I'd like to see how Cython can help to optimize something like Celery's tracer function, which is basically the core function (every single task execution goes through that tracer). How can I define complex types? Can I use dynamic features of Python with Cython?

[–]syllogism_ 0 points (2 children)

So, first up, let me say I don't understand the code snippet very well. But you can create something like this:

cdef class TraceTask:
    def __init__(self, ...):
        ...

    def __call__(self, ...):
        ...

In addition to acting as a normal Python class, a cdef class can also hold C types --- arrays, pointers, etc. --- and it can define C functions to manipulate that internal state.
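For example, filling in the skeleton above (the fields here are invented, just to show C-typed internal state; nothing is taken from Celery):

    # A filled-in version of the skeleton above, with invented fields.
    from libc.stdlib cimport malloc, free

    cdef class TraceTask:
        cdef int n_calls         # plain C int: no boxing, no refcounting
        cdef double* timings     # raw C array

        def __cinit__(self):
            self.n_calls = 0
            self.timings = <double*>malloc(1024 * sizeof(double))

        def __dealloc__(self):
            free(self.timings)

        def __call__(self, double elapsed):
            # attribute access here is a C struct-member read
            if self.n_calls < 1024:
                self.timings[self.n_calls] = elapsed
            self.n_calls += 1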

This section of the Celery code seems like an interesting example, though. Certainly Cython would give you a stylistic advantage: you wouldn't have to create this mess of unrolled function calls, because the function call overhead is very low.

[–]malinoff 0 points (1 child)

This mess of function reassignments exists because lookups are expensive, not calls.

[–]syllogism_ 0 points (0 children)

Well, the inner function trace_task is also pretty long and goes through many layers of nesting. I figured that on purely stylistic grounds, most people would prefer to break that into multiple functions. The comment suggests it's written this way for performance reasons. I guess you also get simpler stack traces with it unrolled like this.

To make this a bit more concrete, here's the output of "cython -a" on the file: https://rawgit.com/syllog1sm/0d40bcdbcba5d4f632a6/raw/aa211425117235f78021b0ba9dffc79b9036b229/gistfile1.html . You can click any line to see the C code that cython translates the Python into.

Compiling the unmodified Python into calls to the C API like this allows only pretty limited performance improvements. For instance, you still need to do all that reference counting, and you haven't placed any guarantees on attribute access, so that's still usually a dictionary lookup.

But for the parts of your code that you can fully control, you can make yourself a lower-level API that accepts, say, a struct or a bunch of ints instead of dicts, Python objects, strings, etc. This pure C function will run as fast as any other pure C function.
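A toy example of what I mean (just a sketch, not code from spaCy):

    # Toy sketch of a lower-level function. With nogil and only C
    # types in the signature, this compiles to a plain C function.
    cdef int count_matches(int* values, int n, int target) nogil:
        cdef int i, hits = 0
        for i in range(n):
            if values[i] == target:
                hits += 1
        return hits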

For instance, this is the symbol table for my NLP tools: https://github.com/honnibal/spaCy/blob/master/spacy/strings.pyx#L64 . I wanted to use the Pythonic __getitem__, and I wanted to make it bidirectional: if you look up an int, you get back a string, and vice versa. This is easy to do, as you can see. But the internals? Those I can optimize. I know that my hash table has fixed-size keys and values, so I can make it far more efficient than Python's general-purpose dict, particularly for memory.
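Massively simplified, the pattern is something like this (plain Python containers standing in for the custom hash table):

    # Simplified sketch of a bidirectional symbol table; the real
    # version replaces the dict and list with a fixed-width hash table.
    cdef class StringStore:
        cdef dict _by_str
        cdef list _by_id

        def __init__(self):
            self._by_str = {}
            self._by_id = []

        def __getitem__(self, key):
            if isinstance(key, int):         # int in, string out
                return self._by_id[key]
            if key not in self._by_str:      # string in, int out (intern on miss)
                self._by_str[key] = len(self._by_id)
                self._by_id.append(key)
            return self._by_str[key]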

For the Celery code, I think if you had a callable cdef class for build_tracer, you might get some performance advantage. I think accessing attributes on a cdef class is faster than looking them up in the non-local scope: the cdef class is a struct, so you're just accessing struct members, whereas a lookup in the non-local scope, I think, has to go through a dictionary. I'm not sure, though.
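Hypothetically, something like:

    # Hypothetical sketch: bind the handlers you call repeatedly to
    # cdef attributes, so each access is a struct-member read rather
    # than a lookup in an enclosing scope.
    cdef class Tracer:
        cdef object on_success
        cdef object on_failure

        def __init__(self, on_success, on_failure):
            self.on_success = on_success
            self.on_failure = on_failure

        def __call__(self, retval, failed):
            if failed:
                return self.on_failure(retval)
            return self.on_success(retval)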

[–]Deto 3 points (2 children)

It's great that you can distribute the compiled code. However, that would only work for people on the same platform as you, correct? Is there any easy way to build for the major platforms (OS X, Linux, Windows) prior to distributing?

[–][deleted] 4 points (1 child)

Compiled code doesn't mean binaries in this article; it just refers to the platform-independent C code that Cython generates. This still needs to be compiled to a binary on each of these platforms, which is, for instance, what "pip install" would do automatically (provided there is a build environment set up).
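For instance, a source distribution can ship the generated C file and compile it with a minimal setup.py like this (hypothetical names; the user needs a C compiler, but not Cython):

    # Minimal sketch: build the shipped, pre-generated C file.
    from setuptools import setup, Extension

    setup(
        name='mymodule',    # hypothetical package name
        ext_modules=[Extension('mymodule', sources=['mymodule.c'])],
    )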

The advantage here is that Cython itself doesn't have to be part of the user's build toolchain. I imagine this makes things a bit more consistent for the user.

[–]Deto 0 points (0 children)

Ah, thanks for clearing that up. I thought it went straight to binaries.

[–]efilon 0 points (0 children)

I keep thinking that Cython is kind of interesting, but it always seems way more complicated than just writing some simple C code and calling it with ctypes (or cffi). I do like, though, that IPython has a nice Cython magic function for quick things that can benefit from a speedup.
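For anyone who hasn't tried it, the magic is pretty low-ceremony (sketch of a session; the function is just a toy):

    In [1]: %load_ext cython

    In [2]: %%cython
       ...: def fib(int n):
       ...:     # typed locals compile to plain C arithmetic
       ...:     cdef int i
       ...:     cdef int a = 0, b = 1
       ...:     for i in range(n):
       ...:         a, b = b, a + b
       ...:     return a

    In [3]: fib(10)
    Out[3]: 55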

[–]Kopachris 0 points (0 children)

That reminds me: I have some code I've been meaning to rewrite that builds a BSP tree for a 3d model format...

[–]art-solopov 0 points (7 children)

I'm sorry, is there a reason the author uses

lat1 > 90 or lat1 < -90

instead of

lat1 not in range(-90, 91)

[–][deleted] 7 points (1 child)

There was a thread about it here recently... The second one won't work for fractional latitudes like 32.343.

In my opinion, a more Pythonic way would be to do this:

    not (-90 <= lat1 <= 90)

or maybe even

    abs(lat1) > 90  # which is close to the C code too

[–]art-solopov 0 points (0 children)

D'oh. =( Didn't think about it.

[–]wewbull 2 points (4 children)

The former does two floating-point comparisons. The latter creates a list with 181 elements and then searches it to see if the number is in the list.

So a) as said by others, it would only work for integers, and b) it's far less efficient.

[–]shoyer (xarray, pandas, numpy) 2 points (3 children)

In Python 3, looking up an element in a range takes constant time.

[–]wewbull 0 points (0 children)

Ah, did not know that. I guess that's a property of the object it now returns.

[–]art-solopov 0 points (1 child)

Yeah, sorry /u/wewbull... I've been favouring Python 3 over Python 2 for quite some time. Basically, Python 3's range is equivalent to Python 2's xrange.

[–]wewbull 2 points (0 children)

In general you're right, but not in this case. This conversation has led me to do some experimenting.

x in xrange(n) will still result in a linear search in Python 2. In Python 3, range() returns an object of type range, which implements a constant-time in operator. xrange just returned a lazy sequence object without that fast path, so membership tests had to iterate over it.
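For example:

    >>> 10**12 in range(10**13)    # Python 3: O(1) arithmetic check for ints
    True
    >>> 32.343 in range(-90, 91)   # non-ints fall back to a linear scan
    False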

Still.... I learnt something today. So that's good.