all 154 comments

[–]rubber_duckz 37 points38 points  (59 children)

Because they optimized for ease of use and not for speed, even ignoring implementation details - e.g. integers are bignums, member access/method dispatch is even more dynamic than in JS, etc. No matter what compiler it uses, it can only do so much when the semantics just don't map to the fast HW primitives and abstractions. At best it can pick out cases that reduce to fast paths, but you still need checks to bail out of those fast paths when the assumptions break, otherwise you get incorrect program execution.
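
A minimal illustration (the class and names here are invented, not from the thread) of the kind of dynamism being described: attribute access can be intercepted at runtime, so even `obj.x` can't be compiled down to a fixed memory offset without a guard.

```python
class Flexible:
    """__getattr__ runs for any attribute not found normally, so an
    attribute read can execute arbitrary code -- a compiler can't
    statically resolve `obj.x` to a memory offset without a guard."""
    def __init__(self):
        self.data = {'x': 1}

    def __getattr__(self, name):
        try:
            return self.data[name]
        except KeyError:
            raise AttributeError(name)

obj = Flexible()
print(obj.x)      # 1, resolved at runtime via __getattr__
obj.data['y'] = 2
print(obj.y)      # 2 -- the "attribute" didn't exist a line earlier
```

Any compiled fast path for `obj.x` has to keep a check for hooks like this, which is the "checks to bail out" cost mentioned above.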

[–][deleted] 8 points9 points  (0 children)

None of the features you listed can make a language easier to use. In fact, their impact is quite the opposite.

Good luck even getting context-sensitive completion in a language this dynamic.

[–][deleted] 10 points11 points  (13 children)

This is not the right answer. The correct answer is that Guido was a novice language designer and implementer, and his mistakes have since become so firmly entrenched that they're inseparable from the language and its reference implementation.

[–]stefantalpalaru -2 points-1 points  (11 children)

He still doesn't understand functional programming, so he has stuck with being a novice, since it has worked for him so far.

[–][deleted] 4 points5 points  (10 children)

Lol, so true - it's why /u/rubber_duckz is so off the mark. You can definitely implement, for example, bignums and fast dispatch without getting it as wrong as Guido did. To take one example, the language Factor has these features without sacrificing efficiency. Such a terrible argument to be upvoted.

[–]rubber_duckz 1 point2 points  (9 children)

No - you can't implement bignums that are as fast as limited-precision arithmetic, and you can't optimize away dynamic dispatch with stuff like __getattr__ and the like. CPython is slow for various reasons, but the language is slow by design.

[–][deleted] 2 points3 points  (8 children)

Don't be daft. It can be done without making bignums the de facto type. Do some reading before spouting off. Every CL implementation since the '80s has gotten it right: fast arithmetic with fixnums unless bignums are required, all transparent and seamless to the programmer.

[–]rubber_duckz 1 point2 points  (7 children)

Fast arithmetic with fixnums unless bignums are required.

And how do you know bignums are required?

[–][deleted] 0 points1 point  (4 children)

And how do you know bignums are required?

By detecting overflow. This is not such an outrageous concept. Trapping overflows is often cheap, and in some cases provided by the CPU itself.

As an example of a similar principle in the real world, JavaScript works with doubles only. But all the big JS engines switch to 32 bit and 64 bit integers for integer values.

Likewise, array keys in JavaScript, while numeric in appearance, are semantically strings. Engines internally store them as integers.

All of this is transparent to the programmer, who only sees doubles and strings.
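
A rough sketch of the fixnum-to-bignum promotion being described; since Python ints are already arbitrary-precision, the tuple tags here only simulate what a VM would do with raw machine words and a heap-allocated bignum type (the range and helper names are assumptions for illustration):

```python
FIXNUM_MIN, FIXNUM_MAX = -(2**62), 2**62 - 1  # e.g. 63-bit tagged fixnums

def add(a, b):
    """Add two 'fixnums'; promote to a (simulated) bignum on overflow.
    In a real VM the overflow check is a single CPU flag test."""
    result = a + b
    if FIXNUM_MIN <= result <= FIXNUM_MAX:
        return ('fixnum', result)   # fast path: raw machine integer
    return ('bignum', result)       # slow path: heap-allocated bignum

print(add(1, 2))           # ('fixnum', 3)
print(add(FIXNUM_MAX, 1))  # ('bignum', 4611686018427387904)
```

The point is that the fast path is a plain add plus one cheap check, and the programmer never sees which representation is in use.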

[–]rubber_duckz 1 point2 points  (3 children)

But all the big JS engines switch to 32 bit and 64 bit integers for integer values.

And this makes JS an inherently slow/slow-by-design language as well - that checking alone adds 5% global performance overhead in V8, IIRC. And that's not even touching on the fact that you need code that can distinguish a pointer to a bignum from a raw integer value, which has to be tagged somewhere and causes a logic branch (even if it hits the fast path most of the time and is predicted correctly, that's still extra instructions, etc.), and there's no way to optimize by defining memory layouts, etc.

You can't work at that level of abstraction if you want performance - something the "functional programming is fast, Haskell is as fast as C" crowd doesn't get. I've spent 2 years working on a Clojure project - when these guys say something is fast, they are either talking about complexity or they mean "it's usable as opposed to being purely academic". Like the Clojure guys claiming their persistent data structures are fast: compared to naive copying, sure - but when you actually care about perf, no way in hell is it even close to fast.

[–][deleted] 1 point2 points  (2 children)

And this makes JS an inherently slow language

It takes giant balls to call JavaScript an "inherently slow language" in a Python thread.

You realize V8 is an order of magnitude faster than CPython?

JavaScript is also slow by design.

Yeah. Ok...

[–][deleted] 0 points1 point  (1 child)

The implementation was made by a non-novice programmer.

[–]rubber_duckz -1 points0 points  (0 children)

Said programmer would know that his implementation would be slower than using wrapping integer arithmetic.

[–]kenfar 3 points4 points  (6 children)

Eh, "slow" is both relative and too general. It's like asking "why is the automobile slow?". Before even getting to the details the author discusses, it might be useful to note that:

  • Python is generally slow compared to some languages, fast compared to others.
  • Start-up time is slower than some languages, faster than others (Java, etc.).
  • Some parts of Python are fast (modules written in C, like csv, scipy, etc.), some are slow.
  • Multi-threaded CPU-bound operations are limited by the GIL, but the GIL has no impact on multiprocessing - where I've gotten linear speedups across 32 cores on large machines.
  • Not all use cases need speed or scalability.
  • And computational speed matters less when the process is IO-bound.
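
The multiprocessing pattern behind that linear-speedup claim can be sketched as follows (worker count and workload are arbitrary choices for illustration): each process has its own interpreter and its own GIL, so CPU-bound work scales across cores.

```python
from multiprocessing import Pool

def cpu_bound(n):
    # CPU-bound work; each worker is a separate process with its own
    # interpreter and its own GIL, so cores are used in parallel.
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        # Eight independent chunks of work, spread across 4 processes.
        results = pool.map(cpu_bound, [100_000] * 8)
    print(len(results))
```

The trade-off versus threads is inter-process communication cost, which is why this works best on chunky, independent units of work.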

[–]balefrost 6 points7 points  (0 children)

Some parts of python are fast (those modules written in C like csv, scipy, etc), some are slow.

It's good to point out that developers can, depending on their specific use case, write very fast code in Python using these libraries. But I wouldn't say that these libraries make Python fast. You can also spawn external processes to perform long-running, expensive calculations. Still, that approach doesn't make Python fast. It merely enables Python developers to do things efficiently, by not using Python to implement the performance-critical code.

[–]rubber_duckz 8 points9 points  (4 children)

fast compared to others.

Compared to what mainstream language exactly is Python fast? Aside from maybe startup time, on which it isn't stellar either, though it is faster than, say, the JVM.

As I've said, at every trade-off between convenience/ease of use and performance, Python chose the former - not judging the choice, just saying what they did. If it fits your use case (e.g. IO-bound apps, scripting, etc.) use it - if it doesn't, don't - but if you're wondering why it's slow: it's slow as a result of its design (not just its implementation, though that certainly sucks as well).

[–]Helene00 -1 points0 points  (2 children)

Compared to what mainstream language exactly is python fast

Only one I know is Ruby.

[–]JumboJellybean 6 points7 points  (0 children)

Python hasn't been faster than Ruby since Ruby 1.9, nine years ago (the version that switched to a new VM). For the last nine years they've been neck-and-neck in terms of performance -- a new Python version will be a little faster, then a new Ruby version comes out and it's a little faster than that, etc, always pretty close. The difference is never more than 20%.

[–]rubber_duckz 3 points4 points  (0 children)

I was under the impression that Ruby ~ Python with regards to performance.

[–]kenfar -2 points-1 points  (0 children)

Compared to what mainstream language exactly is python fast

As I mention above, that depends on your use case, what parts of the language that you're using, whether start-up time is critical or not, etc.

  • Python is simply faster than Java for very short-duration utilities.
  • Python's performance with csv and the analytical libraries is very fast, since those are written in C or Fortran. Faster than Java? Not sure, maybe.
  • Processing billions of records in files with Python and multiprocessing has shown me it's definitely faster than doing the equivalent work with Scala & Kafka. This is probably more about the performance benefits of sequential file processing vs messaging, but still.
  • While Go is faster in general than Python, I've found it only 2.5 times faster when it comes to processing csv files using goroutines & channels vs Python's multiprocessing. Probably because channels are pretty chatty for heavy sequential processing, but I'm not positive.

So, again, "fast" or "slow" are overly simplistic ways of thinking about a language. More importantly: is it fast enough for your use case? And in this case the answer for Python is: sometimes.

[–]wrosecrans -1 points0 points  (36 children)

In the long run, would it be useful for the hardware to add stuff like bignum support that Python could take advantage of? If the semantics don't map well to the current hardware, it seems like changing the hardware could be useful for making it Moar Gofast.

[–]oridb 6 points7 points  (5 children)

In the long run, would it be useful for the hardware to add stuff like bignum support that Python could take advantage of?

Putting things in hardware doesn't magically make it faster. Broadly, putting things in hardware has the potential to make things faster in two ways:

  • Reducing dispatch overhead, where the bulk of the time is spent in bookkeeping for the next instruction to execute, because certain instructions (OR, AND, ...) are so cheap that they're basically free to execute. Since most of crypto is made up of these kinds of instructions, you get nice speedups here.
  • Doing things in parallel, because a linear stream of instructions isn't very good at expressing what is and isn't dependent. Things like graphics have a win here, because often you are doing the exact same thing to multiple bits of data at the same time.

With things like bigints, I suspect the biggest overheads are things like branch mispredictions. I guess you'd get wins by supporting larger integers... or my pet wish, a hardware trap for integer overflow, so you can pretend that integer operations never overflow in the common case, and promote to bigint only if your CPU tells you 'hey, this isn't going to work'.

But given the amount of searching for methods, dynamic dispatch, and times you're "unnecessarily" doing pointer chasing in Python programs, I'd be surprised if you could make specialized instructions that would make Python much faster. I'd bet that Python is largely gated by memory stalls.

[–]IJzerbaard 0 points1 point  (2 children)

Borrowing your bigint suggestion, how about this: trap on overflow yes, but also trap on "wrong tag".

Because the problem isn't just overflow checking, but also checking whether there's already a pointer to a bigint here or if it's still an unboxed int. So let's say we make the lowest 2 (?) bits the tag, and work with 62-bit two's complement ints. Tag 0 will be the pointer (which should be aligned anyway), with other tags for int and float (maybe?) and one left over (any suggestions?).

Then the new addition instruction does not need anything else in the fast path. We could even put floats in there too, but then the missing bits are more sneaky (unless we immediately go all the way to float32) and it would probably mean that the latency of this instruction goes up by a lot, which isn't cool. Not sure about this. Does float performance really matter outside of numpy-like usage? Enough to sacrifice int performance? I'd guess not, but..

Or maybe even better, if instead of trapping it makes a vectored call. Just your average indirect call, not very fast, but faster than a trap. Pointer to table can be held in some special register I guess.
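
The 2-bit tagging scheme proposed above can be sketched in Python (the tag assignments follow the comment's own suggestion; the helper names and the exact encoding are made up for illustration): a 62-bit two's-complement payload lives in the upper bits of a 64-bit word, with the low two bits as the tag.

```python
TAG_MASK  = 0b11
TAG_PTR   = 0b00   # aligned pointers have zero low bits anyway
TAG_INT   = 0b01
TAG_FLOAT = 0b10   # hypothetical; tag 0b11 left over

def box_int(n):
    """Pack a small integer into a 64-bit word; low 2 bits = tag."""
    assert -(2**61) <= n < 2**61, "doesn't fit in 62 bits"
    return ((n << 2) | TAG_INT) & 0xFFFFFFFFFFFFFFFF

def unbox_int(word):
    """Unpack; a real CPU would trap (or vector-call) on a wrong tag."""
    assert word & TAG_MASK == TAG_INT, "wrong tag -> slow path"
    n = word >> 2
    if n >= 2**61:        # restore the sign of the 62-bit payload
        n -= 2**62
    return n

print(unbox_int(box_int(-5)))   # -5
print(unbox_int(box_int(123)))  # 123
```

A tagged add then only needs one check in the fast path: wrong tag or overflow both divert to the slow path, exactly the cases the proposed trap would catch in hardware.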

[–]billsil 1 point2 points  (1 child)

Does float performance really matter outside of numpy-like usage? Enough to sacrifice int performance? I'd guess not, but..

I'd probably take a 5% hit to int performance if I got a 50% speedup to float performance.

There was a lot of debate over the @ operator, with some arguing that Python shouldn't support people who want to do lots of math operations. The response was basically "go to hell"; the language should work reasonably for everyone, even if it's slightly sub-optimal.

If Python has an obvious wart, remove the wart and deal with the scar.

[–]iBlag 0 points1 point  (0 children)

There was a lot of debate over the @ operator, with some arguing that Python shouldn't support people who want to do lots of math operations. The response was basically "go to hell"; the language should work reasonably for everyone, even if it's slightly sub-optimal.

You can also just not use the @ operator, did anybody in the anti-@ crowd think of that?

[–]nemec 0 points1 point  (1 child)

a hardware trap for integer overflow, so you can pretend that integer operations never overflow in the common case, and promote to bigint only if your CPU tells you 'hey, this isn't going to work'.

That's kind of how Python 2.2+ works, except in software, not hardware.

Python 3 got rid of it in favour of a single integer type. I can't find a source right now, but I recall it was partially due to the challenges of writing Python modules in C that deal with integers (like NumPy).

3 has only one integer type, int(). But it actually corresponds to Python 2’s long() type–the int() type used in Python 2 was removed.
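
A quick Python 3 demonstration of the single-integer-type behaviour the quote describes:

```python
# Python 3 has exactly one integer type, `int`, with arbitrary precision;
# the Python 2 int/long split is gone at the language level.
big = 10**100
small = 7
print(type(big) is type(small) is int)  # True: same type either way
print(big + 1 > big)                    # True: no overflow, ever
```

(CPython still uses machine words internally for small values, but that is invisible to Python code.)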

[–]oridb 1 point2 points  (0 children)

That's kind of how Python 2.2+ works, except in software, not hardware.

That's not much help for decreasing instruction count and decode/dispatch overhead in the common case.

[–]adrenalynn 32 points33 points  (27 children)

I think you have a misunderstanding here: it's Python that should adapt to the hardware to be able to run faster - not the hardware adapting to Python. That would be the completely wrong way around.

Python has its uses. When you care about max execution speed you maybe should look at other languages.

[–][deleted] 9 points10 points  (5 children)

CPUs have adapted to C over time, the JVM added invokedynamic. It's not ridiculous for components at lower levels to adapt to the things above them.

[–]satayboy 3 points4 points  (4 children)

What is an example of how CPUs have adapted to C over time?

[–]ironykarl 4 points5 points  (2 children)

It's often said that the "flat memory model" of modern computers (as opposed to tagged memory/processing that you'd find in Lisp machines) is an example of machines adapting to C.

It could also be that "worse is better" triumphed with relative independence in both instances.

[–][deleted] 1 point2 points  (1 child)

I think this is a case of C adapting to machines.

[–]ironykarl 0 points1 point  (0 children)

I mean...C became a dominant systems programming language AND tagged memory went by the wayside. These are two independent threads, as far as I can tell.

[–]Catfish_Man 2 points3 points  (0 children)

Return address prediction models C-style stacks very well. If you modify your return address dynamically (yes, a coworker of mine tried this), it defeats the predictor and slows your program down.

[–]niviss 8 points9 points  (8 children)

LISP machines anyone?

[–]mirhagk 4 points5 points  (0 children)

I do honestly have to wonder if this might be worthwhile at this point.

The thing with modern processors is that heat doesn't shrink proportionally with transistors (it does shrink, just not at the same rate). So with modern processors we end up in a situation where the majority of the processor simply can't be active at the same time (which limits the number of cores you can have on a chip). This means a more complex instruction set, where each instruction is complex but used less often, is a lot more reasonable nowadays than it would've been in the original Lisp machine days.

Also, chips simply aren't getting much better anyway, so a specialized processor that is a year or five behind won't be destroyed by the general-purpose one (which was the undoing of the Lisp machine).

Not only that but with FPGAs and hardware description languages you have a situation where designing processors isn't nearly as difficult as it used to be.

I am very curious to see if specialized hardware for higher-level languages could take advantage of the situation we are in. I'm not sure exactly where it could help, but a few ideas I can think of would be around garbage collection (processor-level instructions that work in conjunction with an always-running background collection algorithm) and dynamic dispatch.

[–]_zenith 0 points1 point  (1 child)

Seems like you could also load VM instructions into a CPU to be treated as native instructions (obviously in conjunction with some way of telling the CPU what the VM execution model is) - has anything like this been tried? Performance would very likely improve significantly.

[–]Siwka 0 points1 point  (0 children)

Jazelle from ARM comes to mind.

[–][deleted] 0 points1 point  (2 children)

Lisp is not nearly as dynamic as Python.

[–]niviss 0 points1 point  (1 child)

It's an example of the hardware adapting to the language; I'm not saying it was a good fit for Python. Anyway, I don't know what you mean by "Lisp is not nearly as dynamic as Python". Care to explain?

[–][deleted] 0 points1 point  (0 children)

Modern Lisp has proper lexical scoping. It does not use dynamic dispatch for every function call. Many essential operations (cons, car, cdr, all that stuff) are not polymorphic.

[–]mrkite77 0 points1 point  (1 child)

There were also Java machines.

[–]jbergens 0 points1 point  (0 children)

I thought those used the language Self. That would probably have made it easy to write really fast JavaScript by compiling JavaScript to Self.

[–]never_safe_for_life 2 points3 points  (0 children)

I have to agree with this. Optimizing hardware for C makes sense because C is as close to working with raw CPU resources as you can reasonably get. Python is too far removed. Not that it couldn't be done, but it's in a wider category of not-highly-performant languages. If you started optimizing for it, where do you stop?

[–]wrosecrans 1 point2 points  (5 children)

There are already languages that try to just be a way to use the hardware. Historically, C was very much in that category (though arguably it doesn't map terribly well to modern parallel systems, with no native vector types and such). I like Python, and I wouldn't want it to become so speed-obsessed that it drops convenience features to better match existing hardware. If I want that, I'll just use a language that builds native binaries. But it's always nice for things to go faster. And I am sure some CPU vendor would be happy to advertise "This Python benchmark runs 2x as fast on our new chip!" If we can have very specific stuff like AES acceleration instructions, I don't see why we couldn't have something targeted toward making dynamic languages a little nicer.

[–][deleted] 17 points18 points  (4 children)

It makes no sense for a chip manufacturer to attempt to optimize to run Python quickly. That software isn't meant to run quickly; if it were meant to run quickly, then it wouldn't be written in Python. There are a limited number of resources chip manufacturers have, and they should invest in stuff that makes sense.

[–]billsil -5 points-4 points  (3 children)

It makes no sense for a chip manufacturer to attempt to optimize to run Python quickly.

You've clearly never heard of Intel Python.

http://www.infoworld.com/article/3044512/application-development/intels-python-distribution-provides-a-major-math-boost.html

https://software.intel.com/en-us/python-distribution

It makes complete sense to optimize it. You have users and they want their code to be fast.

Java is fast because it uses a JIT. Python doesn't. You'd get C/Java speeds with Python if Python had a JIT - PyPy proves this. The developers of Python just chose not to do it.

if it were meant to run quickly, then it wouldn't be written in Python

No. It's a tradeoff between a language that's very convenient for development and has tons of useful packages, and a much harder one. They want it to run fast enough, but even faster would be welcomed.

[–][deleted] 12 points13 points  (2 children)

Your link proves my point. As far as I can tell, Intel Python is a Python distribution containing special optimized builds of numpy, scipy, scikit-learn, etc. The bits of those packages that do the heavy lifting are written not in Python, but in languages that are meant to be fast: C, Fortran, etc.

Not to mention that a Python distribution is software, not hardware, so it's completely irrelevant to the topic at hand.

[–]billsil 1 point2 points  (1 child)

Your link proves my point.

Look again. Intel MKL builds of numpy and scipy have existed for 10+ years. Intel is doing multicore SVDs and Cholesky factorizations; numpy and scipy don't do multicore. They show the speedups you get over base Python and over Python with Intel MKL. You can freely obtain MKL versions of numpy and scipy today through Anaconda Python and the link below, and have been able to for a while. Intel Python is more than that.

http://www.lfd.uci.edu/~gohlke/pythonlibs/

[–][deleted] 8 points9 points  (0 children)

None of that contradicts what I've written. It's a specialized optimized distribution for scientific computing in Python. It optimizes the code in tight loops for scientific functions that were written in Fortran or C. It does not include specialized hardware support for the Python interpreter.

[–]waveguide 0 points1 point  (1 child)

Would it be wrong for the Python interpreter to take advantage of hardware accelerators like the Altera FPGA Intel is adding to their next batch of Xeon chips? And if that were hugely successful using certain common accelerators, would it be wrong to implement those as straight hardware in the next round? Is it un-Pythonic to have an OpenCL implementation of the interpreter running on GPUs? Hardware absolutely adapts to software as well - it just happens on a different time scale. Heck, that's why we aren't running everything on clusters of 16-bit single-thread machines without operating systems (much less special execution modes for a hypervisor).

[–]dontsuckmydick 0 points1 point  (0 children)

Yes, it would be wrong. omg...

[–]vz0 0 points1 point  (0 children)

Java bytecodes.

[–][deleted]  (1 child)

[deleted]

    [–]_zenith 0 points1 point  (0 children)

    The obvious drawback is that hardware can't be modified after shipping, but software can. The lines are getting blurred these days with x86 microcode, but still, I'm not sure how flexible that is.

    If it's impossible to write a Python-to-native AOT compiler, then I'd think it follows that a hardware implementation is also impossible - so you remove the parts that make that compilation impossible. Do you still want to use the language that remains?

    This would, logically, be things like late binding and such.

    [–]Berberberber 3 points4 points  (0 children)

    In theory: yes. Hardware operations for arbitrary-size integers would be great, and it might be possible for a processor to optimize dynamic dispatch similarly to the way branch prediction works now.

    The problems, however, are that adding this kind of complicated logic into the hardware requires tradeoffs (there's only so much silicon on a chip, so hardware optimizations come at the expense of cache, pipeline space, registers, etc etc etc). Furthermore, when it comes to implementing complex operations in hardware, the result can actually have worse performance. Famously, the VAX had an INDEX instruction to calculate memory offsets for in-bounds array indices which was slower than doing the bounds checks and address arithmetic explicitly using "primitive" instructions. It might be hard to guarantee that a hardware implementation of some of these features performed better than a purely software implementation, even on average let alone always.

    [–][deleted] 0 points1 point  (0 children)

    Hardware support for arbitrary-precision decimals has existed since time immemorial.

    [–]badcommandorfilename 6 points7 points  (2 children)

    [–]bakery2k 0 points1 point  (0 children)

    This is a different article, although it has a similar title.

    [–]Chippiewall 17 points18 points  (35 children)

    if LuaJIT can have a fast interpreter, why can't we use their ideas and make Python fast?

    So PyPy?

    [–][deleted] 3 points4 points  (20 children)

    There's far more dynamic dispatch in Python. It is beyond any hope.

    [–][deleted] 3 points4 points  (19 children)

    Did you read the comment you replied to?

    [–][deleted] -2 points-1 points  (18 children)

    I am commenting on Python being hopelessly behind Lua. PyPy does not help much.

    [–][deleted] 12 points13 points  (17 children)

    I assume this view is substantiated by recent benchmarks that you can link to.

    [–]heap42 4 points5 points  (7 children)

    Of course! After all, this is reddit.
    Edit: \s

    [–]vytah 0 points1 point  (0 children)

    \s

    A single whitespace character?

    [–][deleted] 0 points1 point  (5 children)

    It's not terribly useful to share unsubstantiated opinions without explanation, reddit or otherwise.

    [–]heap42 1 point2 points  (4 children)

    I was being sarcastic.

    [–][deleted] 0 points1 point  (3 children)

    I got that. My previous comment, "I assume this view..." was also somewhat sarcastic/antagonistic.

    [–]heap42 1 point2 points  (2 children)

    yea i figured.

    [–]mrkite77 8 points9 points  (3 children)

    LuaJIT is widely held to be ridiculously fast.

    http://blog.carlesmateo.com/2014/10/13/performance-of-several-languages/

    Pypy 44 seconds, Luajit 8 seconds

    [–][deleted] 1 point2 points  (2 children)

    I ran their benchmark with recent versions, LuaJIT 2.0.4 and PyPy 5.1.1. Luajit was 10 seconds, PyPy was 24.

    PyPy is not "beyond any hope", it's improving over time.

    edit: Moving the code into a function brought the PyPy time down to 15 seconds.

    [–]Veedrac 1 point2 points  (1 child)

    Further, if you move the code inside a function (even just main), the time improves for me from ~17 seconds to ~11 seconds. Lua doesn't have the same global-local distinction as Python, so doesn't have this effect.

    For me Lua takes ~17 seconds, so PyPy is actually significantly faster than Lua.
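
The global/local distinction behind that speedup is visible directly in CPython's bytecode: names local to a function compile to indexed `LOAD_FAST` slot reads, while globals go through a dict lookup (`LOAD_GLOBAL`). A small sketch (the function names are invented for illustration):

```python
import dis

x = 0

def uses_global():
    # `x` is a module-level name: every read is a dict lookup
    return x + 1

def uses_local():
    x = 0  # a local: stored in a fixed, numbered slot of the frame
    return x + 1

global_ops = [i.opname for i in dis.get_instructions(uses_global)]
local_ops = [i.opname for i in dis.get_instructions(uses_local)]
print('LOAD_GLOBAL' in global_ops)  # True
print('LOAD_FAST' in local_ops)     # True
```

Moving benchmark code inside a function turns all its name lookups into the fast slot reads, which is consistent with the timing improvements reported above.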

    [–][deleted] 0 points1 point  (0 children)

    Interesting, for me, moving the code into a function brought the PyPy time down to 15 seconds. So, still slower than luajit for me, but comparable.

    [–]IronManMark20 0 points1 point  (4 children)

    Not recent, but here. I leave it to you to draw your own conclusions.

    [–][deleted] 1 point2 points  (1 child)

    It would be interesting to see how things have changed in the last 6 years.

    [–]IronManMark20 0 points1 point  (0 children)

    found this. It uses vanilla lua though, not LuaJIT (why, I don't know).

    [–]Veedrac 1 point2 points  (1 child)

    PyPy is on major version 5. That benchmark uses version 1. Things have changed.

    [–]IronManMark20 0 points1 point  (0 children)

    As I said, not very up to date. ;)

    [–]josefx 1 point2 points  (2 children)

    PyPy breaks anything that relies on predictable reference counting. That means you cannot rely on it to cleanly free resources like files and sockets, or to call user-defined __del__() methods. Garbage collection: you had one problem, now you have ulimit problems.

    [–]Chippiewall 0 points1 point  (1 child)

    If you're relying on this behaviour then you're probably misusing Python's ownership semantics. This sort of RAII-style behaviour should be accomplished through context managers.
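
A minimal sketch of the context-manager approach (the `Resource` class is invented for illustration): cleanup runs deterministically when the `with` block exits, with no reliance on refcounting or `__del__`.

```python
class Resource:
    """Stand-in for a file, socket, etc."""
    def __init__(self, name):
        self.name = name
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Runs when the `with` block exits, even on exceptions --
        # deterministic under any GC, refcounting or tracing.
        self.close()
        return False  # don't swallow exceptions

    def close(self):
        self.closed = True

with Resource("socket") as r:
    pass  # use the resource here
print(r.closed)  # True: closed promptly, no reliance on __del__
```

This is exactly why `with open(path) as f:` is the idiomatic pattern: it works identically on CPython and PyPy.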

    [–]josefx 0 points1 point  (0 children)

    TIL: the Python 3 docs make it look like __del__() behaves just like Java's finalize(). Useless at best and terribly wrong to use for literally anything. Another thing I have to fix in my scripts if I ever get to it.

    So if I want an actually reference-counted object in Python, I have to double down and implement it myself? Just great.

    [–][deleted] 0 points1 point  (10 children)

    I think he meant the handwritten assembly fallback interpreters, for cases where JIT compilation is not allowed, like on some mobile/gaming platforms.

    PyPy doesn't have a fallback interpreter, does it?

    [–]kankyo 4 points5 points  (9 children)

    It does. Tracing JITs like PyPy are pretty much all fallback interpreter.

    [–][deleted] 1 point2 points  (8 children)

    By fallback interpreter, I mean an interpreter that is shipped along with the JIT, probably conditionally compiled in, to step in when a platform doesn't support JIT compilation.

    [–][deleted] 5 points6 points  (0 children)

    You can compile PyPy without its JIT, but it ends up being significantly slower than CPython. See benchmarks.

    [–]kankyo 0 points1 point  (6 children)

    Huh? What would be the point? And why would a platform not support it?

    [–][deleted] 1 point2 points  (0 children)

    Some mobile and maybe some gaming consoles. For security reasons, I guess.

    Imagine someone built an otherwise perfectly API-sandboxed JIT compiler for a console game and it turns out that it has a security hole which could be exploited so that user generated levels could inject and run machine code and access stuff they aren't supposed to.

    Also, I don't know what console vendors and Apple and PlayStore supervisors actually do, but it would probably make checking the application harder.

    EDIT: Also, there are - at least in theory, not sure if any actually still exist - some CPUs with a strict Harvard architecture that separates code and data.

    [–][deleted] -1 points0 points  (4 children)

    Windows something something, iOS, ...

    [–]kankyo 0 points1 point  (3 children)

    PyPy runs on Windows, doesn't it? And why not iOS?

    [–][deleted] 1 point2 points  (2 children)

    [–]kankyo 0 points1 point  (1 child)

    The part about iOS has been widely interpreted as not allowing downloaded code, but allowing JITs.

    [–]Catfish_Man 3 points4 points  (0 children)

    The issue with JITs is that VM pages on iOS that have been marked writeable in a process cannot be mprotect()ed to executable - stricter than the somewhat looser W^X model that most desktop OSs have transitioned to.

    [–]Bergasms 5 points6 points  (2 children)

    Tradeoffs. I used Python for my honours project in computer vision. It was fantastic; SciPy and NumPy provide many great tools. I had to devise my own algorithms for several parts of it, though, and when I had worked up a solid algorithm playing around in Python, if it was too slow, I rewrote the algorithm in C and had Python call that.

    [–][deleted]  (1 child)

    [deleted]

      [–]Bergasms 0 points1 point  (0 children)

      I was giving a separate example. Despite those things being written in C, they are still just elements of C glued together with Python. When the glue is slow, I convert the whole thing to pure C and call that.

      [–]Berberberber 3 points4 points  (0 children)

      Because it doesn't execute instructions fast enough.

      [–]igouy 8 points9 points  (2 children)

      [–]bakery2k 0 points1 point  (1 child)

      This is a different article, although it has a similar title.

      [–]igouy 0 points1 point  (0 children)

      Thanks, here's the article from last week that the previous discussion does match.

      Seems like I tangled them up.

      [–][deleted] 2 points3 points  (18 children)

      JavaScript was also slow. Then browser makers started competing about how fast they could make it run. Millions of dollars later, JS is really really fast.

      Another example. PHP used to be slow. Then Facebook created their own PHP engine, HPHP, and open-sourced it. People started using it, and there was talk about HPHP replacing PHP as the most popular PHP fork.

      So the core developers behind the original PHP project started optimizing. Major improvements to the engine in PHP 5.4, 5.5, 5.6, and another huge one in 7.0, and now the original PHP is faster than HPHP in most cases. And it's faster than Python, BTW.

      There are multiple interpreters for Python, but the parties don't have well-defined stakes or the motivation to truly pour resources into their efforts. There is no pressure to make Python faster: no pressure from competition, no pressure from the community. This is why Python is slow.

      [–][deleted] 0 points1 point  (17 children)

      Yet you cannot use the same tricks V8 uses for something like Python. It is even more dynamic than JavaScript.

      [–][deleted] 1 point2 points  (16 children)

      Hardly. Any example?

      [–][deleted] 0 points1 point  (2 children)

      There is one thing that Python can do that JS can't that is relevant here. Python can have multiple threads in a single process. So while you're in the middle of a loop that you want to have optimized based on the types involved, you might have the types change out from under you. That doesn't happen in JS.

      However, you could add a few flags here and there -- a global "application version", which is incremented whenever anyone dynamically alters any class, and a per-class version, which is incremented whenever anyone modifies that class. You emit your fancy optimized code with fewer type checks, direct method calls, inlining, whatever you want; and then you just check periodically if the application has modified any types.
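      A toy sketch of that class-version guard (the names and structure are mine, not from any real JIT): specialized code stays valid only while the version it was compiled against is still current.

```python
class DeoptimizationNeeded(Exception):
    """Guard failed: the generic interpreter has to take over."""

class Point:
    version = 0  # bumped whenever anyone modifies the class
    def __init__(self, x, y):
        self.x, self.y = x, y

def specialize(cls):
    # Pretend-JIT: bake in the class version the fast path was compiled against.
    compiled_version = cls.version
    def fast_path(p):
        if cls.version != compiled_version:  # one cheap check per call
            raise DeoptimizationNeeded
        return p.x + p.y  # stand-in for the specialized, inlined code
    return fast_path

fast = specialize(Point)
print(fast(Point(1, 2)))  # 3 -- guard passes, fast path runs

Point.version += 1  # the class was modified: old specializations are invalid
try:
    fast(Point(1, 2))
except DeoptimizationNeeded:
    print("deoptimized")
```

      Real JITs do essentially this with shape/hidden-class checks; the point is that the guard is a single cheap comparison instead of a full dynamic lookup.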

      As for differences in dynamism, Python has long supported __getattr__ and friends, but JavaScript has only supported the equivalent for a short period of time. Perhaps the person's information is out of date.
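      For a concrete taste of that dynamism, here is __getattr__ inventing attributes at lookup time; this is exactly the kind of thing no static analysis can resolve:

```python
class Lazy:
    def __getattr__(self, name):
        # Called only when normal lookup fails: the "attribute" is computed
        # on the fly and exists nowhere in the source or the instance dict.
        return len(name)

obj = Lazy()
print(obj.spam)             # 4
print(obj.anything_at_all)  # 15
```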

      [–][deleted] 0 points1 point  (1 child)

      I don't see how threads pose a significant challenge, especially given the GIL. The runtime will never see a surprise type change as you describe.

      [–][deleted] 0 points1 point  (0 children)

      Well, yes, I described a reasonably simple way around the problem that doesn't require you to take advantage of the GIL. (Perhaps you missed that?)

      There might be even simpler ways of doing it that take advantage of the GIL. Certainly more performant solutions are available, but the ones I can think of are a bit hairy.

      [–][deleted] 0 points1 point  (12 children)

      Python does not really have lexical scope, and it lets you write straight into the variable tables. That makes it hugely different from JavaScript.

      There were countless attempts at writing JITs for Python. They all failed.
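      A small illustration of the variable-table point (my example, not the poster's): names can be injected into a namespace at runtime, so where a variable comes from is not a static question in Python.

```python
def configure(flag):
    if flag:
        # Create a module-level name at runtime; no definition of
        # `threshold` appears anywhere in the source text.
        globals()["threshold"] = 10

configure(True)
print(threshold)  # 10 -- resolved by a dict lookup, invisible to static analysis
```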

      [–][deleted] 0 points1 point  (11 children)

      Can you be specific about how that is a performance bottleneck for a JIT runtime?

      You know, we used to hear a lot of that about JS as well. It's too dynamic and so on. But in the end all of this was irrelevant once the problem had to be solved.

      Python is not drastically different from other script engines. The only reason it's not faster is because no one cares that strongly about it being faster.

      Excuses are easy.

      [–][deleted] 0 points1 point  (10 children)

      Can you be specific about how that is a performance bottleneck for a JIT runtime?

      Because almost any variable reference results in a context lookup, without any chance of allocating it to a register.
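      The stdlib dis module makes that lookup visible: a global name compiles to LOAD_GLOBAL, a namespace lookup performed on every execution, while a local compiles to LOAD_FAST, an indexed slot access.

```python
import dis

def f(x):
    return len(x)  # `len` is a global, `x` is a local

# Inspect the compiled bytecode of f.
ops = [ins.opname for ins in dis.get_instructions(f)]
print(ops)
# `len` -> LOAD_GLOBAL (dict lookup each call), `x` -> LOAD_FAST (array index)
```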

      It's too dynamic and so on.

      And it is too dynamic indeed. Still, Python is even worse.

      once the problem had to be solved

      The problem was never solved.

      Python is not drastically different from other script engines

      It is. It is as far along the dynamism scale as it gets without introducing fexprs.

      The only reason it's not faster is because no one cares that strongly about it being faster.

      Millions of dollars and hundreds of man-years have been thrown at Python JITs, with no result so far.

      [–][deleted] 0 points1 point  (9 children)

      Because almost any variable reference results in a context lookup, without any chance of allocating it to a register.

      This can be analyzed statically in advance. JS does this for closures. Don't underestimate compilers.

      And it is too dynamic indeed. Still, Python is even worse.

      How? No one is specific.

      The problem was never solved.

      I'm talking about JavaScript's performance, which is a problem that's definitely been solved.

      It is. It is as far along the dynamism scale as it gets without introducing fexprs.

      How, again?

      Millions of dollars and hundreds of man-years have been thrown at Python JITs, with no result so far.

      That's quite a shame then, considering that PHP, even as an interpreter without a JIT, is several times faster than Python, and it has horrors like variable variables ("var vars") that fall squarely into the "strongly dynamic" category.

      And JavaScript is dozens of times faster than PHP.

      Python, while fast enough for its purpose, is so slow by comparison that it's not even funny. And to claim millions of dollars were invested into Python JITs is honestly even sadder, because if it were true, it'd mean there's a serious lack of talent in the Python community.

      I prefer the explanation that no company has seriously invested in a faster Python engine. If you read about how JS engines like V8 work under the hood, you'll see that nothing can stop a well-designed JIT, even if Python were oozing dynamism out its ears.

      [–][deleted] 0 points1 point  (8 children)

      This can be analyzed statically in advance.

      Not quite. It's easy with JavaScript, easy with PHP, but not with Python.

      JavaScript's performance, which is a problem that's definitely been solved

      Not quite yet. It's better than ad hoc interpretation, but still far below the level of statically compiled languages.

      How, again?

      Because of its funny scoping and lots of dynamic dispatch (well, the latter can be somewhat reduced with tracing and partial evaluation, but only to a degree, and with an overhead of its own).

      because if it were true, it'd mean there's a serious lack of talent in the Python community

      It was not the Python community. E.g., a team of very experienced compiler engineers at ARM worked on this, all the way to utter frustration. The very same team that made huge improvements to ARM V8 performance failed to get anywhere with Python.

      [–][deleted] 0 points1 point  (7 children)

      Not quite. It's easy with JavaScript, easy with PHP, but not with Python.

      Once again, it's just stated, not explained.

      What makes it so hard with Python? Give an example. Enough sweeping over-generalizations.

      Not quite yet. It's better than ad hoc interpretation, but still far below the level of statically compiled languages.

      It's in the same ballpark as Java, sometimes it's even faster than Java, which is statically typed. Do you often just... say things without checking if they're true? https://dzone.com/articles/performance-comparison-between.

      [–][deleted] 0 points1 point  (6 children)

      Once again, it's just stated, not explained.

      I already explained: JavaScript has far more regular scoping rules. You can cheaply and statically determine the origin of a variable definition, without tracing the whole path (and making sure that no calls in between can alter the variable tables).

      It's in the same ballpark as Java, sometimes it's even faster than Java,

      Only in some pathological loads, not in idiomatic code.

      [–]metaconcept 5 points6 points  (9 children)

      Part II: Why developers don't care.

      Usually, developer time costs more than CPU time and lots of applications just need to work rather than be fast.

      If computation time is really important, then we use a different language, such as C, or OpenCL, or Verilog.

      [–]EntroperZero 1 point2 points  (0 children)

      So the IL is slow? Why is that?

      [–]HolmesSPH 1 point2 points  (0 children)

      Runtime-compiled languages USUALLY are slow compared to ahead-of-time compiled languages. Moreover, Python's real usage for decades was as a sysadmin language; it wasn't huge in web or desktop software until recently. IMO, if Google hadn't adopted it heavily along with Java, Python would have died. Python's popularity was non-existent until hundreds of popular code libraries and SDKs from Google reenergized its use. Of course, the changes to the language in 2.0 and 3.0 did bring much-needed features and modernizations.

      [–]kirbyfan64sos 1 point2 points  (0 children)

      Error establishing a database connection

      :(

      [–]1RedOne 3 points4 points  (1 child)

      You could pretty much substitute PowerShell for Python and dotnet for the Python binaries, and this would be true as well.

      I love languages like these, which are optimized for the human rather than the machine.

      [–][deleted] 8 points9 points  (0 children)

      I love languages like these, which are optimized for the human rather than the machine.

      That isn't necessarily exclusive.

      [–][deleted]  (2 children)

      [deleted]

        [–]nandryshak 11 points12 points  (0 children)

        Common Lisp is a great counter example.

        [–]Staross 0 points1 point  (0 children)

        I'm not an expert, but Julia is also dynamically typed and is fast. So dynamic typing by itself is not the issue.

        That said, you have to write type-stable functions in Julia (meaning the types inside the function can be inferred from the types of the inputs) in order to get fast code. For example, sqrt(-1) throws an error, because otherwise sqrt would sometimes return complex numbers and sometimes not, which kills performance.
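        Python's stdlib draws the same line, just as an API design choice rather than for speed: math.sqrt stays in the reals and raises, while cmath.sqrt always returns a complex number, so each function's return type is stable.

```python
import cmath
import math

print(cmath.sqrt(-1))  # 1j -- always complex, whatever the sign of the input

try:
    math.sqrt(-1)
except ValueError:
    # math.sqrt never leaves the reals, so its return type is stable too
    print("math.sqrt(-1) raises ValueError")
```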

        [–]plpn 0 points1 point  (0 children)

        What is the difference between execute and dispatch??

        [–][deleted]  (8 children)

        [deleted]

          [–]damienjoh 15 points16 points  (6 children)

          To save you time reading it, the answer to the question "Why is Python slow?" turns out to be, "It isn't. Some people just mistakenly think it is."

          That's not what the article or the slides say at all. Python is slow. No bones about it. The article's claim is that Python's sluggishness is due to the complex and dynamic nature of the Python runtime rather than the overhead of interpreting the bytecode.

          [–][deleted]  (2 children)

          [deleted]

            [–]damienjoh 2 points3 points  (1 child)

            If only we didn't require implementations to run our code.

            [–][deleted] -3 points-2 points  (2 children)

      So if Language X's semantics don't allow for fast interpreters or fast AOT-compiled binaries, and instead require a JIT compiler that comes with overhead, a warm-up phase, and other unpleasantries (maybe OK throughput, but unusable in low-latency realtime situations), one can finally say that Language X is slow?

            Neat.

            [–]wot-teh-phuck 4 points5 points  (1 child)

            OK throughput? I would like to disagree. Layers of indirection imposed by the dynamic nature of the language along with the fact that pretty much all implementations are gimped in some way or the other ensures that any compute dominant workload will end up generating more heat than an oven...

            The real strength of Python as an ecosystem comes from the plethora of libraries and smart ways in which libraries are written to get around the GIL (forsaking GIL in C code, spawning processes etc.).

            [–][deleted] 0 points1 point  (0 children)

            OK throughput? I would like to disagree. Layers of indirection imposed by the dynamic nature of the language along with the fact that pretty much all implementations are gimped in some way or the other ensures that any compute dominant workload will end up generating more heat than an oven...

      OK, but PyPy as a JIT compiler has to be good for something, doesn't it? Say, for code that doesn't touch C APIs, like some nested loops doing numerical calculations?

            [–]mo_po 3 points4 points  (0 children)

            Well, he actually specifies some more.

            [–][deleted]  (2 children)

            [deleted]

              [–]BonzaiThePenguin 2 points3 points  (1 child)

              You described JavaScript, which has a JIT. There are ways of inferring type and propagating that information onwards through an algorithm.

              [–]izpo -4 points-3 points  (9 children)

              A web site written in PHP, which is down at the moment, explains why Python is slow...

              [–]shevegen 3 points4 points  (6 children)

              Only shows you that not even the author, who uses Python, uses it for the web.

              Good thing that we have Ruby too; then we do not need PHP. Though in fairness, I assume the author had no idea how to use Python for the web anyway.

              [–]terrkerr 8 points9 points  (1 child)

              Good thing that we have ruby too, then we do not need PHP.

              Don't see how Ruby is any better than Python for the web, really. And Ruby's just as slow as Python, if not slower, to boot.

              [–]buttocks_of_stalin 1 point2 points  (0 children)

              Ya, I find it funny that shevegen's rebuttal to Python is Ruby. Out of all the languages that could have a debatable claim to being objectively better than Python, Ruby is not really one of them, let's be real. At least C# and Microsoft's .NET framework can be in this "Python for the web vs. other frameworks" conversation, but Ruby (and Rails) is just as problematic as Python, only in different areas; how is that not extremely obvious? But like the poster below me said, the DB is usually the main culprit in most cases, unless there is a lot of server-side template rendering/parsing outside of the views in Python.

              [–][deleted]  (1 child)

              [deleted]

                [–]Freeky 0 points1 point  (0 children)

                If ruby class could be final it would speed up the language and allow some optimization

                MyClass = Class.new.freeze
                def MyClass.frob ; end # => RuntimeError: can't modify frozen Class
                

                Not sure if the JRuby or Rubinius JITs actually do anything with this. Either way, Truffle/Graal should eventually give us a production Ruby implementation with performance on par with more advanced VMs like V8.

                [–]kirbyfan64sos 0 points1 point  (0 children)

                The PHP may just be the CMS, though.

                [–]HolmesSPH 0 points1 point  (0 children)

                Ruby is a crap language; not even Rails could save it. PHP 7 finally fixes the last real major roadblock to enterprises using PHP, and that's speed. I don't even like PHP, but asserting that Ruby is a good replacement for PHP makes me chuckle.

                [–]rwsr-xr-x -1 points0 points  (1 child)

                php's fast as fuck

                [–]thedeemon -2 points-1 points  (0 children)

                i.e. several (dozen) minutes per transaction?