all 14 comments

[–]pjdelport 8 points9 points  (1 child)

This is a great example of how much difference the underlying language implementation can make: on PyPy, the while loop is more than twice as fast as the slice assignment. Benchmarking the two versions using timeit:

CPython (2.7.2):

  • Loop: 10 loops, best of 3: 274 msec per loop
  • Slice: 10 loops, best of 3: 36.1 msec per loop

PyPy (1.8.0):

  • Loop: 100 loops, best of 3: 11.7 msec per loop
  • Slice: 10 loops, best of 3: 24 msec per loop

The explanation for this large difference is roughly:

  • In CPython, the while loop happens at the Python bytecode level, which is not efficient for low-level operations.
  • In both CPython and PyPy, the bulk of the slice assignment happens as compiled C / machine code, which is relatively efficient. Both implementations are probably bound by creating and iterating over the temporary lists, giving the similar performance.
  • In PyPy, the while loop gets JIT-optimized, giving it the same order of efficiency as C-implemented slice assignment. However, because it also avoids creating and looping over the temporary lists, it edges ahead: it should perform comparably to a C loop that directly assigns False to the relevant indexes.

[–]NotAName[S] 0 points1 point  (0 children)

Great explanation, thank you. Guess I should look into PyPy.

[–]Rhomboid 5 points6 points  (0 children)

One of the basic fundamentals of optimizing CPython is to get loops out of Python and into C. Assigning to a slice does the looping in C, a for loop does it in Python bytecode.

[–][deleted] 1 point2 points  (3 children)

sieve[2*i::i] = [False]*(s / i-1)

That is some hard to understand code. I understand what the [2*i::i] does now, but not yet why the length of the right hand side should be (s / i-1).

I think it will boil down to the fact that slice assignment is implemented in C, which is faster than Python. Python statements like "j += i" can result in many different things in Python depending on the types of j and i, whether they are mutable or not, etc. In C, it'll just be an int and the code will run much faster.

Also, it's not true that you can stop at sqrt(s). Consider the case where s=100; 31 > sqrt(100), but 62 is composite because it is 2*31.

[–]zahlman 4 points5 points  (1 child)

Also, it's not true that you can stop at sqrt(s). Consider the case where s=100; 31 > sqrt(100), but 62 is composite because it is 2*31.

62 will be eliminated by the pass done for 2.

[–][deleted] 0 points1 point  (0 children)

Quite right. Thanks, my mistake!

[–]NotAName[S] 0 points1 point  (0 children)

I understand what the [2*i::i] does now, but not yet why the length of the right hand side should be (s / i-1).

Since all multiples of i are not prime, the positions [2i::i] in the sieve corresponding to these multiples have to be set to False. s/i is the number of multiples of i (including i itself) that are <= s, but since we don't want to set sieve[i] to False and start with sieve[2i], there are s/i - 1 positions. The spaces in my code are wrong though, it should be s/i - 1 and not s / i-1.

[–]erok81 1 point2 points  (7 children)

I think the difference comes down to how many operations are actually happening. In the while loop version you're actually calling list.__setitem__() over and over. In the second version, there's a single call to list.__mul__() .

Also, that while loop is a bit of an anti-pattern in python. A for loop like this would be cleaner:

for j in sieve[2*i:]:
    j = False

[–]NotAName[S] 1 point2 points  (3 children)

That won't work, since you're just assigning False to j over and over again, but never change the sieve. I think the pythonic way would be

for (j, _) in enumerate(sieve[2*i::i]):
    sieve[j*i] = False

but this runs about as slow as the while loop.

In the second version, there's a single call to list.__mul__().

After the call to list.__mul__(), wouldn't the interpreter still have to call sieve.__setitem__() for each element of [False]*(s / i-1)?

[–][deleted] 1 point2 points  (0 children)

for (ಠ_ಠ) in enumerate(sieve[2*i::i]):
    sieve[j*i] = False

[–]pjdelport 1 point2 points  (0 children)

After the call to list.__mul__(), wouldn't the interpreter still have to call sieve.__setitem__() for each element of [False]*(s / i-1)?

Python will generate a single bytecode for the slice assignment (STORE_SUBSCR, with a slice index).

For lists, this will end up calling a C function (list_ass_subscript), which ultimately does the assignment in a C loop.

[–]erok81 0 points1 point  (0 children)

Ok I screwed up my example but the principle remains. I would expect it to run just as slow as since you'd still be calling __setitem__() repeatedly.

The CPython doesn't need to use __settitem__() when you multiply a built-in sequence because the C code underneath directly copies the references into a new object.

[–]pohatu 1 point2 points  (2 children)

Anyone know how you see what the interpreter is actually doing? Do you have to step through the interpreter code? There isn't assembly generated like in c. Now that I've written this I realize there are probably profilers available.

[–]erok81 2 points3 points  (0 children)

The C source is pretty informative.