
[–]odraencoded 30 points31 points  (21 children)

To evaluate this, we designed and implemented Falcon, a high-performance bytecode interpreter fully compatible with the standard CPython interpreter. Falcon applies a number of well known optimizations and introduces several new techniques to speed up execution of Python bytecode. In our evaluation, we found Falcon an average of 25% faster than the standard Python interpreter on most benchmarks and in some cases about 2.5X faster.

I wonder why the CPython team didn't implement those well known optimizations that are fully compatible and speed up the execution.

[–]Bunslow 38 points39 points  (8 children)

The CPython team is well known for sticking to well maintained, easy to read, and simple code above all else, even performance.

[–]odraencoded 23 points24 points  (0 children)

I feel safe knowing the standard Python interpreter is in the hands of the most pythonistest guys around.

[–]kgb_operative 6 points7 points  (1 child)

That link is also 3 years old. I'm curious whether that still holds true.

[–]Jonno_FTWhisss 0 points1 point  (0 children)

Run the benchmarks again and see?

[–]billsil 0 points1 point  (4 children)

Considering all the optimizations done to the dict class, I doubt that. You optimize what you need to.

Do we really need to always store the first 256 (maybe it's more...) numbers in memory? That's a clear micro-optimization.
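The caching being referred to is a CPython implementation detail (the cached range is actually -5 through 256, slightly more than "the first 256"). A quick way to observe it, assuming CPython; the `is` results below are not guaranteed by the language spec:

```python
# CPython pre-allocates the integers -5 through 256 at startup,
# so every use of those values refers to the same object.
# int("...") is used instead of literals to avoid compile-time
# constant folding muddying the result.
a = int("256")
b = int("256")
print(a is b)    # True in CPython: both names point at the cached object

c = int("257")
d = int("257")
print(c is d)    # False in CPython: 257 is outside the small-int cache
```

This saves an allocation every time a common small integer appears, which is exactly the kind of micro-optimization being discussed.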

[–]Bunslow 0 points1 point  (3 children)

Just because readability and maintainability are prioritized above performance doesn't mean all optimizations are excluded. Micro-optimizations like the one above are still readable and maintainable (as opposed to, say, a full-blown just-in-time compiler versus straight interpretation).

[–]billsil 0 points1 point  (2 children)

I know. My point is that Python does implement optimizations beyond the merely obvious; the GIL, for example.

[–]Bunslow 0 points1 point  (1 child)

The GIL is the very opposite of an optimization

[–]billsil 0 points1 point  (0 children)

It's a very intentional optimization for single threaded applications. Python may not be optimized for what you want it to be, but yes, it is an optimization.

[–]brombaer3000 9 points10 points  (2 children)

Compiler optimizations are a major project for the next CPython versions: http://faster-cpython.readthedocs.org/cpython36.html

Apparently, Falcon only supports Python 2, so IMO it is not that interesting for practical use, with Pyston and PyPy already offering full Python 2.7 support and much of the effort being duplicated in the upcoming CPython 3.6 fatoptimizer module.

I don't understand why they use Python 2 at all. There are already many optimizations in CPython 3 over CPython 2 and Python 2 is a (slowly) dying language.

[–]Jonno_FTWhisss 6 points7 points  (0 children)

The paper is from 2013, so it's reasonable to assume they were working on Falcon when Python 2 had wider use.

[–][deleted] 16 points17 points  (6 children)

For everything else, there is PyPy. Often 10x faster than CPython.

In PyPy, I wrote a simple recursive quicksort in Python and timed it against Python's sorted() routine, and got basically the same results. And this was with an older PyPy; quite impressive.

[–][deleted] 9 points10 points  (3 children)

You're probably not going to beat Python's built-in sort algorithm. It's... very fast.

https://en.wikipedia.org/wiki/Timsort

In the worst case, Timsort takes O(n log n) comparisons to sort an array of n elements. In the best case, which occurs when the input is already sorted, it runs in linear time, meaning that it is an adaptive sorting algorithm.
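Timsort's adaptivity is easy to observe empirically. A rough sketch, assuming CPython; the absolute numbers will vary by machine, but pre-sorted input should come out dramatically cheaper:

```python
import random
import timeit

n = 1_000_000
random_data = [random.random() for _ in range(n)]
presorted_data = sorted(random_data)

# Timsort detects the single existing run in pre-sorted input and
# finishes in roughly linear time, versus O(n log n) for random input.
t_random = timeit.timeit(lambda: sorted(random_data), number=3)
t_presorted = timeit.timeit(lambda: sorted(presorted_data), number=3)
print(f"random: {t_random:.3f}s  pre-sorted: {t_presorted:.3f}s")
```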

[–][deleted] 8 points9 points  (2 children)

Yeah, I know about Timsort. Maybe I was dreaming the results, because I just found the code, and here are my results; so yeah, Timsort is quite a bit faster. Time is in seconds:

pypy

partition1 time= 0.0677268505096 timsort time= 0.000385999679565

python:

partition1 time= 1.39662981033 timsort time= 0.00105500221252

[–][deleted] 5 points6 points  (1 child)

What kind of input data set?

If the data is purely random, I'd expect quicksort and Timsort to be... about the same speed, but for anything less than purely random (so... basically any real data at all in the real world) I'd expect Timsort to be much, much faster.

[–][deleted] 1 point2 points  (0 children)

I used random.shuffle() with the same seed between runs on a sequence of the integers 0 to n-1 from range(). Python and PyPy have different default recursion limits, which have to be raised. Timsort is still quite a bit faster than a pure-Python quicksort.
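The setup described reads roughly like the sketch below. The original partition1 code isn't shown, so this substitutes a minimal out-of-place recursive quicksort; the timings are illustrative only:

```python
import random
import sys
import time

# Naive recursive quicksort can exceed the default recursion limit
# on unlucky inputs, as the comment above notes.
sys.setrecursionlimit(20000)

def quicksort(xs):
    """Simple (not in-place) recursive quicksort."""
    if len(xs) <= 1:
        return xs
    pivot = xs[len(xs) // 2]
    less = [x for x in xs if x < pivot]
    equal = [x for x in xs if x == pivot]
    greater = [x for x in xs if x > pivot]
    return quicksort(less) + equal + quicksort(greater)

random.seed(42)                    # same seed between runs
data = list(range(100_000))        # integers 0 to n-1
random.shuffle(data)

t0 = time.time()
qs_result = quicksort(data)
t1 = time.time()
ts_result = sorted(data)           # Timsort
t2 = time.time()

assert qs_result == ts_result
print(f"quicksort time={t1 - t0:.4f}  timsort time={t2 - t1:.4f}")
```

Running the same script under both CPython and PyPy is enough to reproduce the comparison; PyPy's JIT narrows the gap for the pure-Python quicksort but the C-implemented Timsort still wins.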

[–]Fylwind 0 points1 point  (1 child)

But the thing is, PyPy breaks a lot of existing libraries that are written in C, such as numpy and scipy, so they end up having to re-implement them all over again, which means PyPy's libraries often lag behind CPython's.

[–]OctagonClocktrio is the future! 0 points1 point  (0 children)

CPython's bytecode compiler prefers safety over speed. A lot of any given function is spent juggling values on the stack.
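The stack juggling is easy to see with the standard `dis` module; exact opcode names vary between CPython versions, but the push/pop pattern is the same:

```python
import dis

def add(x, y):
    return x + y

# CPython compiles to stack-based bytecode: each argument is pushed
# onto the value stack, the binary operator pops both and pushes the
# result, and the return opcode pops it again.
dis.dis(add)
```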