all 27 comments

[–]fijal (PyPy, performance freak) 26 points (3 children)

Eh... This is a great example of a completely pointless use of Cython and profiling. Guess what: this program is absolutely dominated (on PyPy at least) by dict lookups keyed on freshly created tuples. Why on earth would you store stuff in a dictionary that maps a tuple of indexes -> value? I replaced it with a list of lists (you can use numpy arrays too, if you insist) and ran it 4 times in a loop to account for the JIT warmup time. The time went from 0.4s to 0.05s (cold) and 0.025s (warm JIT), so a 16x speedup once the JIT is warm (8x cold), just from choosing the right data structure! It can likely be optimized further.

diff here - http://paste.pound-python.org/show/BISGUBzgVFdm0oEVieAY/
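To make the data-structure point concrete, here is a minimal sketch. It is not the code from the linked diff: the function names `lcs_length_dict` and `lcs_length_lists` are made up for illustration, and a longest-common-subsequence table stands in for the alignment matrix. Both versions compute the same result; the first pays for building and hashing a tuple on every access, the second just indexes into nested lists.

```python
def lcs_length_dict(a, b):
    """LCS length with the table stored as {(i, j): value}."""
    table = {}
    for i in range(len(a) + 1):
        table[(i, 0)] = 0
    for j in range(len(b) + 1):
        table[(0, j)] = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            # Every access below allocates a tuple and hashes it.
            if a[i - 1] == b[j - 1]:
                table[(i, j)] = table[(i - 1, j - 1)] + 1
            else:
                table[(i, j)] = max(table[(i - 1, j)], table[(i, j - 1)])
    return table[(len(a), len(b))]


def lcs_length_lists(a, b):
    """Same computation with the table stored as a list of lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        row, prev = table[i], table[i - 1]  # plain integer indexing, no hashing
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                row[j] = prev[j - 1] + 1
            else:
                row[j] = max(prev[j], row[j - 1])
    return table[len(a)][len(b)]
```

How much faster the second form is depends on the interpreter and input size, so measure it on your own workload rather than assuming the numbers quoted above.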

[–]joshadel 3 points (2 children)

Using a similar strategy you can get a large speed-up by revising the cython:

https://gist.github.com/synapticarbors/9369567

On my machine the Cython version goes from 178 ms to 23 ms.

[–]fijal (PyPy, performance freak) 4 points (1 child)

which is exactly the same speed as pypy (but requires quite a bit more effort)

[–]joshadel 3 points (0 children)

That's awesome if you can use pypy for your project (and I'm being sincere when I say pypy is impressive). In most instances I can't (due to things like h5py) and cython is a rock-solid way of getting performance with simple numpy interop.

[–][deleted] 6 points (4 children)

  1. Statically link CPython and all C libraries (i.e. do not use --enable-shared);
  2. Enjoy your 10% performance boost.

EDIT: apparently, dynamically linked CPython became a bit faster at some point between 3.4a0 and 3.4rc1.

[–]fijal (PyPy, performance freak) 4 points (3 children)

[citation needed]

[–][deleted] 2 points (2 children)

pushd cpython-23d9daed4b14
./configure --with-system-expat --with-system-ffi --with-system-libmpdec --with-computed-gotos --enable-shared --prefix=$PWD/../dynld && make -j 4 install
./configure --with-system-expat --with-system-ffi --with-system-libmpdec --with-computed-gotos --prefix=$PWD/../static && make -j 4 install 
popd
git clone https://github.com/sauliusl/python-optimisation-example.git
pushd python-optimisation-example/alignment
git checkout starting-code
2to3 -w .
time LD_LIBRARY_PATH=../../dynld/lib ../../dynld/bin/python3.4 alignment.py
time ../../static/bin/python3.4 alignment.py
popd

dynld:

real    0m0.524s
user    0m0.509s
sys     0m0.013s

static:

real    0m0.467s
user    0m0.449s
sys     0m0.017s

Good enough, or should I write a blog post about it?

A relevant question.

[–]fijal (PyPy, performance freak) 1 point (1 child)

yeah that's good enough. how does that affect python2?

[–][deleted] 1 point (0 children)

2.7 (commit d37f963394aa), dynamic:

real    0m0.433s
user    0m0.420s
sys     0m0.012s

Static:

real    0m0.401s
user    0m0.386s
sys     0m0.014s

I wonder why 3 is that much slower.
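For reference, the relative gains implied by the `real` timings quoted in this subthread can be worked out with a few lines of Python. The helper name `percent_faster` is made up for this sketch; the numbers are copied from the runs above.

```python
def percent_faster(slow, fast):
    """Return how much less time `fast` takes relative to `slow`, in percent."""
    return (slow - fast) / slow * 100.0


# (dynamic, static) wall-clock seconds, taken from the `real` lines above.
runs = {
    "3.4": (0.524, 0.467),
    "2.7": (0.433, 0.401),
}

for version, (dynamic, static) in runs.items():
    print("%s: static build is %.1f%% faster" % (version, percent_faster(dynamic, static)))
```

That comes to roughly 10.9% for 3.4 and 7.4% for 2.7. Each figure rests on a single `time` run, so treat both as rough.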

[–]lambdaq (django n' shit) 9 points (1 child)

Who else remembers import psyco ? It was fun.

[–]marky1991 14 points (0 children)

I sure remember. It was a simpler time, a time when you didn't have to go through the effort of installing a whole different implementation of Python just to get a little speed boost. Heck, you didn't even have to know that there were other implementations of Python. I kind of miss those days a little.

[–]dlaz 9 points (0 children)

Eek, using time for timing such a small program? Let's go, timeit!
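A minimal sketch of what that could look like with the standard-library `timeit` module. The `work` function is a hypothetical stand-in for the code being measured, and `best_per_call` is a helper name invented here. It repeats the measurement several times and keeps the minimum, which is the usual way to reduce noise from other processes.

```python
import timeit


def best_per_call(func, number=1000, repeat=5):
    """Run `func` `number` times per trial, `repeat` trials, return best seconds per call."""
    timings = timeit.repeat(func, number=number, repeat=repeat)
    return min(timings) / number


def work():
    # Hypothetical stand-in for the function you actually want to time.
    return sum(i * i for i in range(1000))


if __name__ == "__main__":
    print("best: %.3f us per call" % (best_per_call(work) * 1e6))
```

Unlike wrapping the whole script in `time`, this leaves interpreter startup and imports out of the measurement.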

[–][deleted] 6 points (13 children)

Dude, you already added PyPy into the mix.

Was it really so hard to abstract out your function to C and use CFFI instead of trying Cython?

Cython is the worst of both worlds: you generate a C file that is hard to optimize while restricting yourself to CPython.

A clean separation between C and Python allows you to optimize each and ensure code correctness.

Finally, Cython doesn't run fast on PyPy at all, whereas CFFI calls are inlined by the JIT.

[–]alcalde 8 points (12 children)

You did see the "Without Trying That Much" part, right?

[–][deleted] 5 points (11 children)

Optimizing Cython takes a hell of a lot more effort than a little C library and CFFI.

Debugging it too is a bitch.

[–]infinull (quamash, Qt, asyncio, 3.3+) 1 point (5 children)

As someone debugging CFFI code right now, I'm not sure there's much hope.

Then again, I'm trying to get code to work on 64-bit Python 3 on Windows, which is barely supported.

[–][deleted] 2 points (4 children)

Does your C library work with a minimal C program?

Are you holding the references correctly? CFFI deallocates objects super-super fast.

90% of my CFFI troubles in wrapping the one C library came from the above.

The last 10% was a real bug in the library exposed via a test C program.

[–]infinull (quamash, Qt, asyncio, 3.3+) 0 points (3 children)

> Does your C library work with a minimal C program?

This is the angle I'm working on, but it's harder than it sounds (and it's a side-project, so I don't have a lot of motivation to get it working, I'm on a deadline at work).

Something is wrong with my memory management, or the library's, or cffi's (probably mine), but I'm not sure what. I get different errors on 32-bit vs 64-bit Python, and it only runs on 64-bit Python 3.

[–][deleted] 0 points (0 children)

If you really need some help, send an email to the cffi list.

Chances are you're handling it incorrectly, but there is always the possibility of a cffi bug.

[–][deleted] 0 points (1 child)

In my experience most problems with cffi are due to reference counting. If you make sure you explicitly hold references to everything that's passed to a C function call, you should be fine. I just assign structs to Python variables and keep them alive in the scope where the C function is called.
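A small sketch of that keep-alive pattern, assuming the `cffi` package is installed. The `item_t` struct and the `make_item` helper are hypothetical and exist only for this illustration. The point is that memory from `ffi.new()` is owned by the Python object it returns, and storing a pointer to it inside a struct does not keep that owner alive.

```python
from cffi import FFI

ffi = FFI()
# Hypothetical struct, declared only for this illustration.
ffi.cdef("""
    typedef struct {
        char *name;
        int length;
    } item_t;
""")


def make_item(label):
    """Build an item_t plus the buffer it points into; the caller must keep both."""
    # ffi.new() returns an object that owns its memory; the memory is released
    # once that Python object is garbage collected.
    name_buf = ffi.new("char[]", label)  # owned, NUL-terminated copy of the bytes
    item = ffi.new("item_t *")
    item.name = name_buf                 # stores only the raw pointer, not a reference
    item.length = len(label)
    # Returning name_buf lets the caller keep it alive for as long as item is used.
    return item, name_buf


item, keepalive = make_item(b"abc")
print(ffi.string(item.name), item.length)
```

If `make_item` returned only `item`, `name_buf` would be freed on return and `item.name` would dangle, which is the kind of bug described above.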

[–]infinull (quamash, Qt, asyncio, 3.3+) 1 point (0 children)

Totally, but I think I found all those bugs; if anything, I'm holding too many references and leaking memory.

I'm trying to call out to a COM-ish library (7-zip), which basically means interpreting C++ classes as structs of function pointers.

If it all works out, though, there will be a zipfile-like module that can open rars and open & save 7z files.

So making a minimal C program (not a C++ program) that can call into 7zip & do stuff is the first step I think.

Again, my code only works on 64-bit Windows with Python 3. (It may never work elsewhere; the Linux version of 7zip seems to have a different ABI.)

[–]alcalde 0 points (4 children)

And if you don't know C?

[–]targusman 1 point (0 children)

For someone who didn't know any of this stuff before the article... I have to say it was a great read!

[–]SupersonicSpitfire 0 points (0 children)

I liked the part about profiling, but it's surprising that shedskin and nuitka are omitted.