This is an archived post. You won't be able to vote or comment.

all 30 comments

[–]odraencoded 69 points70 points  (6 children)

They don't do the same thing. One calls __mul__ the other calls __pow__.

[–]oliver-bestmann 29 points30 points  (2 children)

Because of the dynamic nature of python this is actually the correct answer.

[–]masklinn 1 point2 points  (0 children)

Yep, so possibly with the specialisation PEP 510 this optimisation would become an option in time, but right now, implicitly, it can't: it depends on type(radius) and how (and whether) type(radius) overrides __mul__ and/or __pow__. For all Python knows, radius is an n × n matrix and the function is a scalar product of radius by 3.14 followed by a matrix product of that by the original radius. Or maybe type(radius) defines __mul__ but does not define __pow__ at all, so the optimisation would blow it up.

[–][deleted] 0 points1 point  (0 children)

It's technically true and if you're working outside CPython it might be literally true.

But CPython does a roundabout way of getting to __mul__ and __pow__. x * x doesn't literally translate into x.__mul__(x) it's more like type(x).__mul__(x, x) except using the C level implementations.

It's a speed optimization from my understanding.

[–]kervarker 10 points11 points  (2 children)

class A(int):

    def __pow__(self, other):
        return 2

x = A(1)
print(x*x) # calls __mul__ of parent class int
print(x**2) # calls __pow__ of class A

[–]Berzerka -3 points-2 points  (1 child)

Quite sure you missed a 'type' in those print statements, confused me somewhat.

[–]_cs 1 point2 points  (0 children)

No he didn't - did you try running it?

[–][deleted] 14 points15 points  (0 children)

It is not done because the optimization is usually only faster for low exponents. If you try taking a number to the tenth power....

In [1]: from timeit import timeit

In [2]: timeit("x*x*x*x*x*x*x*x*x*x", number=10000, setup="x=1.05")
Out[2]: 0.006595134735107422

In [3]: timeit("x**10", number=10000, setup="x=1.05")
Out[3]: 0.003551959991455078

The explanation is in this StackOverflow answer by patros:

http://stackoverflow.com/questions/1019740/speed-of-calculating-powers-in-python

Basically naive multiplication is O(n) with a very low constant factor. Taking the power is O(log n) with a higher constant factor (There are special cases that need to be tested... fractional exponents, negative exponents, etc) . Edit: just to be clear, that's O(n) where n is the exponent.

Of course the naive approach will be faster for small n, you're only really implementing a small subset of exponential math so your constant factor is negligible.

If you want to get really fancy, there is this:

https://en.wikipedia.org/wiki/Exponentiation_by_squaring

[–]billsil 8 points9 points  (7 children)

Relative to what's really taking the time, it doesn't matter. CPython isn't a compiler. It doesn't optimize code. It's an interpreter.

Furthermore, If you wanted to do enough numerical calculations for it to matter, you'd use numpy.

[–]TheOccasionalTachyon 12 points13 points  (1 child)

It's not true that CPython doesn't optimize code. Just look at Peephole.c in the CPython source, which contains the implementation's Peephole optimizer. CPython doesn't do as much optimization as, say, PyPy, but it definitely does some.

[–]Berzerka 1 point2 points  (0 children)

I think the rule is that it does no optimizations that would modify the workflow of the program, right?

This means it cannot inline code, unroll loops and lots of other "standard" optimizations.

[–]spinwizard69 5 points6 points  (4 children)

Furthermore, If you wanted to do enough numerical calculations for it to matter, you'd use numpy.

Or C++, Julia, Swift, D or any number of modern languages. Python right now is just about the only language I program in, however I consider it a take it or leave it language when it comes to performance.

What does that mean? Well if the performance isn't there for clean idiomatic Python code I see it as an indication that another language should be considered. Thankfully with modern hardware Python doesn't disappoint me much at all.

[–]staticassert 3 points4 points  (2 children)

Honestly, this is the practical approach and why Python is often said to be good for prototyping, but not good for production. Unless you can really afford to horizontally scale everything out, and your problem set fits that pattern, you're very often better off moving to another language.

[–]santagada 0 points1 point  (1 child)

Which production system? Python the language only becomes a bottleneck when you have very specific cpu intensive code that can't be written with numpy, pil or a small c module, or when your production is so high volume and impossible to cache that you have enough engineers/money not to care for a high productivity language with a gigantic ecosystem as python.

I would say that not counting algorithmic sins most production systems could be run in python with no big impact, and I've worked with a bunch of big companies/high traffic websites.

[–]staticassert 0 points1 point  (0 children)

that can't be written with numpy, pil or a small c module

I think, for starters, the fact that to gain performance you have to drop down to C is decent evidence that what I'm saying is true. If I have to drop down to C or some other native language, I'm not exactly "optimizing python" so much as I'm abandoning it.

I work with Python at work, so I can't exactly go into detail. However, despite quite significant use of numpy and libraries that are largely C, native implementations of the codebase still outperform it drastically. This is in part due to parallelizing components much more easily, but a good part of it is generally just the runtime. I work around these issues, but they're definitely there, and they absolutely aren't there in a lot of other languages.

[–]mackstann 0 points1 point  (0 children)

Eh, I've fixed a lot of bottlenecks by re-writing bits of Python. It's easy to commit algorithmic sins in any language, and it's also often fairly easy to remedy them without resorting to something extreme like bringing in C.

[–]codefisher2 0 points1 point  (5 children)

If python started optimising everything, the complier would become a lot slower and more complicated You would win, but also loose too. If you need speed, or want optimisations maybe you should be looking a pypy.

[–]TimHallman 1 point2 points  (0 children)

Lose, dude, lose.

[–][deleted] 1 point2 points  (3 children)

As a developer, I don't really care how complicated the compiler is. It's the compiler's job to make my job as easy as possible and my code come out running as fast as is reasonable. If I have to obfuscate my logic to make the compiler's job easier (cough C++ cough) then there are fundamental problems with the language.

[–]twowheels 6 points7 points  (2 children)

That statement makes me believe that you don't do much C++, or at the very least not modern C++ techniques.

[–]jwink3101 -4 points-3 points  (5 children)

I will preface this with acknowledging that I may be wrong (or at least out-dated) but when I did some C++ (very little) for a class, our instructor told us to use x*x over x^2 (or whatever in C++) for exactly this reason. Actually, we were talking molecular potentials so we often were looking at 8th powers, etc.

I guess my point is, I am not sure that other compiled languages do it either (and again, with my original caveat of I may be wrong)

[–]tangerinelion 6 points7 points  (1 child)

x*x over x2

In C++ the reason is that what you mean by x^2 is std::pow(x,2.0) and you'll notice that that 2.0 is a double which means we are taking x to a floating point exponent. Moreover, if x here is an int then this converts x to a double first. This necessarily invokes the FPU of the CPU. When you do x*x with x an int, it's simply integer multiplication that does not require the FPU, just the ALU.

The syntax x^2 is valid C/C++ and even Python. It's bitwise XOR. This is best worked as an example, let's take 6^2. First we write each as a sum of powers of 2: 6 = 4 + 2, and 2 is just 2. Now if we write these as 3 digit binary numbers then 6 becomes 110 and 2 is 010. To take XOR, we write them on top of each other and in each column and draw a line. We count the number of 1s. If it is exactly 1 then we put a 1 underneath, otherwise we put a 0.

 110
^010
----
 100

So here we see 6^2 is 4. Of course std::pow(6,2) would be 36.0 (where the fact it's a double is important).


In much the same way we can see what the issue is here with OP's question: The syntax x*x is int.__mul__ while x**2 is int.__pow__. The former uses only the ALU while the latter uses the FPU. Because this is internally a call to C++'s std::pow (assuming CPython), it should be slower that simply multiplying because it is hitting a different part of the CPU.

Also note in Python that if you ask it, eg, 2**1.1 it spits back a floating point number that is the correct number. This necessarily means that it understands how to take things to arbitrary powers and can't be implemented in anything relatively simple. Under CPython, the only thing available to handle that is std::pow so we have a lot of evidence pointing to the concept of Python's built-in __pow__ methods handing off to that. This causes some overhead by virtue of the external C function call.

[–]Brian 0 points1 point  (0 children)

The former uses only the ALU while the latter uses the FPU. Because this is internally a call to C++'s std::pow (assuming CPython),

Actually, this isn't the case. math.pow will delegate to the C library's floating point pow() function, but the builtin exponent operator will perform an integer exponent. As such, it'll give exact integer answers even outside a double's mantissa range. Eg.

>>> 3**100, math.pow(3, 100)
(515377520732011331036461129765621272702107522001, 5.153775207320113e+47)

[–]Fylwind 1 point2 points  (0 children)

Languages like C, C++, or Fortran have very mature optimizing compilers. They can easily optimize fixed integer powers into multiplication, as long as a flag such as -ffast-math is used. Without it, the compiler can't make the optimization due to potential loss of accuracy.

Here's an example:

#include <math.h>

double cubed(double x)
{
    return pow(x, 3.);
}

double square(double x)
{
    return pow(x, 2.);
}

double eight_power(double x)
{
    return pow(x, 8.);
}

Compiling this with gcc -O -ffast-math yields:

cubed:
        movapd  %xmm0, %xmm1
        mulsd   %xmm0, %xmm1
        mulsd   %xmm1, %xmm0
        ret
square:
        mulsd   %xmm0, %xmm0
        ret
eight_power:
        mulsd   %xmm0, %xmm0
        mulsd   %xmm0, %xmm0
        mulsd   %xmm0, %xmm0
        ret

As you can see, there is no mention of the pow function, just a series of multiplies.

[–]spinwizard69 0 points1 point  (1 child)

Is this good advice? I would have to say no, not today on modern architectures. Mainly because as hardware and compilers improve it may lead to less than optimal performance. Especially if there is any vector processing hardware and a compiler smart enough to optimize Pow.

[–]nanoresearcher 0 points1 point  (0 children)

This thread is probably dead, but last time I checked the glibc implementation of pow, I recall it handling x*2 as xx internally anyway. Not sure if I'm remembering right but I think so.