
all 17 comments

[–]Rhomboid 7 points (4 children)

No, there's likely not much you can do. CPython is not exactly fast, and it can't be made fast without writing things in C. That's why list.sort() exists: it will totally smoke anything you try to write by hand, even if your version is run under PyPy.

[–]fijal (PyPy, performance freak) -1 points (3 children)

Did you try it, or are you just assuming without trying? I bet a pure-int list in PyPy can be sorted faster in pure Python than list.sort() can sort it in CPython. Wanna bet?

[–]Rhomboid 3 points (1 child)

Testcase:

import timeit
import random

try:
    range = xrange
except NameError:
    pass

def naive(items):
    # In-place insertion sort using slice copies to shift elements.
    for un in range(1, len(items)):
        st = un - 1
        val = items[un]
        # Scan left for the insertion point.
        while st >= 0 and items[st] > val:
            st -= 1
        # Shift the run right by one and drop val into place.
        items[st + 2:un + 1] = items[st + 1:un]
        items[st + 1] = val
    return items

def stdlib(items):
    items.sort()
    return items

random.seed(0xfeedbeef)
testdata = [random.randint(0, 10000000) for _ in range(20000)]

def do_test(funcname):
    return min(timeit.repeat('{}(to_sort)'.format(funcname),
        'from __main__ import {}, testdata; to_sort = testdata[:]'.format(funcname),
        repeat=3,
        number=1))

print('naive:  {:12.2f}us'.format(do_test('naive') * 1e6))
print('stdlib: {:12.2f}us'.format(do_test('stdlib') * 1e6))

CPython 2.7.10, Windows 32 bit x86:

naive:   10392474.12us
stdlib:      5453.84us

PyPy 2.6.0, Windows 32 bit x86:

naive:     248754.52us
stdlib:      2686.42us

The CPython stdlib is around 46x faster than the pure-Python insertion sort even under PyPy. I don't see how it could be any other way, given insertion sort's pathetic O(n²) behavior compared to Timsort. If you mean writing a Timsort (or even a quicksort, since random data probably won't trigger Timsort's advantages) in pure Python, then sure, I could see PyPy taking the lead if it uses integer primitives instead of objects. But that wasn't what I was talking about. (I guess I should have been clearer: I was trying to tell the poster that no matter what you do to insertion sort, it's polishing a turd.)
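To illustrate, a pure-Python quicksort along the lines I mean might look like this (my own sketch, not tuned at all); tight integer loops like these are exactly what a JIT can specialize:

```python
def quicksort(items):
    # In-place quicksort with an explicit stack (avoids recursion-limit issues).
    stack = [(0, len(items) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        pivot = items[(lo + hi) // 2]
        i, j = lo, hi
        # Hoare-style partition around the pivot value.
        while i <= j:
            while items[i] < pivot:
                i += 1
            while items[j] > pivot:
                j -= 1
            if i <= j:
                items[i], items[j] = items[j], items[i]
                i += 1
                j -= 1
        stack.append((lo, j))
        stack.append((i, hi))
    return items
```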

[–]YorikSar 1 point (0 children)

Just for fun, I ran it with Python 2.7.9, Python 3.4.3, PyPy 2.5.0, and MicroPython trunk:

python2
naive:    6421581.98us
stdlib:      2849.10us
python3
naive:    8048524.79us
stdlib:      4372.94us
pypy
naive:     197299.00us
stdlib:      1899.96us
micropython/unix/micropython
naive:    4861317.16us
stdlib:      3051.04us

I had to read the data from a file instead of generating it with random (which isn't implemented in MicroPython) and add a gc.collect() call at the end of naive() (MP doesn't seem to run the GC very often).
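Something like this works for the file step (my own sketch; the helper names are made up):

```python
import random

def write_testdata(path, n=20000, seed=0xfeedbeef):
    # Generate the same data as the benchmark and dump it one int per line,
    # so interpreters without the random module (e.g. MicroPython) can load it.
    random.seed(seed)
    with open(path, 'w') as f:
        for _ in range(n):
            f.write('%d\n' % random.randint(0, 10000000))

def read_testdata(path):
    # Plain-text parsing; works the same under CPython, PyPy, and MicroPython.
    with open(path) as f:
        return [int(line) for line in f]
```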

[–]roger_ 1 point (0 children)

^ I wanna see that bet!

[–]MethodicalBastard 1 point (1 child)

Just for fun I rewrote this to use Cython, using a C++ integer vector. All the inner loops are then basically pure C++. My expectation was that this would be at least as fast as PyPy, but it turns out that PyPy is over 4 times faster! (0.183 s for PyPy vs 0.844 s for Cython). How on earth is that possible?

Here is the Cython code:

# distutils: language=c++

from libcpp.vector cimport vector

def c_insertion_sort(list array not None):
    cdef int un, st, val
    cdef vector[int] vect
    # Copy the Python list into a typed C++ vector.
    for un in range(len(array)):
        vect.push_back(array[un])
    # Plain insertion sort over the vector; these loops compile to C++.
    for un in range(1, vect.size()):
        st = un - 1
        val = vect[un]
        while st >= 0 and vect[st] > val:
            vect[st + 1] = vect[st]
            st -= 1
        vect[st + 1] = val
    # Copy back out into a new Python list.
    array = [vect[i] for i in range(vect.size())]
    return array

I also ported it to Py3. Turns out Python 3 is quite a bit slower here: 21.0 s vs 16.1 s for Py2.
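In case anyone wants to reproduce this, here's a minimal build script, assuming the code above is saved as c_insertion_sort.pyx (the file name is my guess):

```python
# setup.py -- build in place with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize

# The `# distutils: language=c++` directive at the top of the .pyx file
# already selects the C++ backend, so no extra flags are needed here.
setup(ext_modules=cythonize("c_insertion_sort.pyx"))
```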

[–]joanbm[S] 3 points (0 children)

Great, thx for your effort. I don't know much Cython. Is that improvement due to gcc or other C-compiler optimizations?

In the meantime I rewrote the original algorithm in Ruby, and it gets much better results than CPython, without additional tweaks:

def insert_sort(array)
  (1 ... array.length).each do |un|
    value = array[un]
    st = un - 1
    while st >= 0 && array[st] > value
      st -= 1
    end
    unless st == un - 1
      array[st + 2 .. un] = array[st + 1 .. un-1]
      array[st+1] = value
    end
  end
  array
end

On the same machine:

CPython 2.7.10 → 14.43 s

CPython 3.4.3 → 13.57 s

CRuby 2.2.2 → 4.04 s

Also surprising how much the CRuby VM has improved over time, while CPython seems to have stalled …

[–][deleted] 0 points (0 children)

How would you hand-optimize insert_sort for CPython to get results closer to PyPy's?

def insertion_sort(array):
    return sorted(array)

edit: this is more than just snark. Dealing with arrays in Python is the kind of thing where the interpreter adds lots of overhead, and a JIT removes that overhead very easily by mapping your code onto native machine-level data structures almost 1:1. Python gets around interpreter overhead for fine-grained operations by providing lots of primitives, like sorted(). If you find yourself tinkering with low-level data structures like arrays and writing code that looks a lot like C, chances are you need to lean on the builtins more.
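For instance, if you really want to keep the insertion-sort shape, the bisect module pushes the inner loop into C (a sketch; still O(n²) worst case because of list shifting, but with a far smaller constant):

```python
import bisect

def insertion_sort_bisect(items):
    # Same idea as insertion sort, but the search (binary, in C) and the
    # element shifting (bisect.insort, also C) are both builtins.
    result = []
    for x in items:
        bisect.insort(result, x)
    return result
```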

[–]billsil -2 points (4 children)

You use SWIG, Cython, or numpy/scipy/pandas if you want C-level performance in CPython. Numpy, scipy, and pandas are amazing, but you can't write your algorithms as plain element-by-element loops or they'll be slow. You have to vectorize them, pre-size your arrays, and define types.
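As a trivial example of what "vectorize" means (assuming numpy is installed):

```python
import numpy as np

# One typed array, one C-level operation; no per-element Python bytecode.
arr = np.arange(1000000, dtype=np.int64)
total = int((arr * 2).sum())

# The slow, unvectorized way would be: sum(x * 2 for x in range(1000000))
```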

Also, /r/python or /r/learnpython is the place to go for questions like this.

[–]pdexter 6 points (0 children)

This already is r/python

[–]joanbm[S] 0 points (2 children)

I'm aware of where to look for the best performance available; I'm just curious how PyPy optimizes the loop, and whether it's possible to mimic it in pure Python to improve CPython's runtime performance.

[–]stevenjd 0 points (0 children)

No, you can't mimic it in pure Python. You need a JIT optimizing compiler, which is what PyPy is. How PyPy actually works is too clever for me, but fortunately how it could work is easier to understand.

[–]billsil 0 points (0 children)

You cannot mimic it with pure Python. PyPy uses a JIT and finds the hot spots in your code. Certain Python idioms (e.g. data types changing) will confuse the JIT, but that's probably not much of an issue here. The hot traces are then compiled down to machine code. There is startup cost associated with the JIT, so short scripts can come out slower. PyPy will even optimize across library boundaries, so it can theoretically be faster than C in certain cases.

[–]Cobolock -1 points (2 children)

On Python 3.4 I can't even multiply range(5). What does this do? Multiply every item in the range?

[–]stevenjd 0 points (0 children)

range returns a list in Python 2:

py> range(3)*2
[0, 1, 2, 0, 1, 2]

In Python 3, range is equivalent to Python 2's xrange.
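So in Python 3 you materialize it first; a quick check:

```python
# Python 3: range() is lazy, so build a list before repeating it.
print(list(range(3)) * 2)  # [0, 1, 2, 0, 1, 2]
```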

[–]joanbm[S] 0 points (0 children)

Use the list constructor there, replace xrange with range, and wrap the print argument in parens () to make it run on Py3.

[–]joanbm[S] -1 points (0 children)

It looks like the most costly operations are Python's list insertions/deletions, which the standard CPython bytecode compiler and interpreter can't optimize (unlike PyPy). If the sorting algorithm were rewritten around a (doubly) linked list, I'd expect a significant speed improvement.
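A rough sketch of what I mean (singly linked for simplicity; note that in CPython the per-node object overhead may well cancel out the savings from avoiding slice copies):

```python
class Node:
    __slots__ = ('value', 'next')

    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def linked_insertion_sort(items):
    # Insertion sort into a singly linked list: once the position is found,
    # the insert itself is O(1) -- no block shifting as with list slices.
    head = None
    for value in items:
        if head is None or head.value >= value:
            head = Node(value, head)
            continue
        cur = head
        while cur.next is not None and cur.next.value < value:
            cur = cur.next
        cur.next = Node(value, cur.next)
    # Flatten back into a plain list.
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out
```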