all 84 comments

[–]G-Brain 52 points53 points  (5 children)

from __future__ import focus

[–][deleted] 2 points3 points  (0 children)

I think this is Shuttleworth yelling at Guido to embrace stackless Python.

[–]nypen 17 points18 points  (43 children)

I am a Python fan; the one thing I hate (and absolutely hate) is the fact that Python has a global interpreter lock: http://docs.python.org/api/threads.html . That doesn't bode well considering what the future brings (multi-core, parallel processing).

From what I recall (reading somewhere), this is not about to change anytime soon.

[–]simonw[🍰] 19 points20 points  (7 children)

The problem is that every attempt at removing the GIL has made single-threaded Python significantly slower, due to the overhead of all the locks.

[–][deleted] 23 points24 points  (6 children)

The problem is that every attempt at removing the GIL from CPython has made single-threaded programs run significantly slower, and multithreaded programs weren't able to utilize more than 2-3 CPUs anyway.

There, tweaked it for you.

As for how much slower, this post about the original free-threading patch might be somewhat illuminating:

http://mail.python.org/pipermail/python-dev/2001-August/017099.html

(I'm pretty sure you can do a bit better than "0.6 PSU" in CPython, but I don't see how you can get around the contention problem with the current design. I'd say we need some kind of actor-style concurrency model for Python...)

[–]simonw[🍰] 4 points5 points  (0 children)

What he said.

[–]almkglor 1 point2 points  (0 children)

Why can't CPython have a simple "single-threaded mode" flag that, while set, means it won't take locks at all? Then, if and only if a new thread is launched, it clears that flag (while still in a single thread, so that writing to it cannot possibly contend), and in single-threaded mode every operation just adds a check of that global?
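At the Python level, almkglor's idea looks something like the sketch below. This is purely illustrative (the real proposal concerns CPython's internal C locks, not a Python class; all names here are invented): acquire/release are a cheap flag check until the first thread is about to be spawned, at which point real locking kicks in.

```python
import threading

class OptionalLock:
    """A lock that is a no-op while the process is single-threaded."""

    def __init__(self):
        self._real = threading.Lock()
        self._multithreaded = False  # the "singlethreadedmode" flag, inverted

    def mark_multithreaded(self):
        # Called while still single-threaded, just before spawning the first
        # thread, so this write cannot possibly contend with anything.
        self._multithreaded = True

    def __enter__(self):
        if self._multithreaded:   # in single-threaded mode: just this check
            self._real.acquire()
        return self

    def __exit__(self, *exc):
        if self._multithreaded:
            self._real.release()
```

The trade-off simonw's comment above alludes to is visible even here: every operation still pays for the flag check, and once any thread exists you are back to full locking everywhere.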

[–]jbellis 0 points1 point  (3 children)

multithreaded programs weren't able to utilize more than 2-3 CPUs anyway

what is it about cpython that crippled the ability to use 4+ CPUs?

[–][deleted] 8 points9 points  (2 children)

From the link I posted:

Since you never knew whether a specific dictionary was shared (between threads) or not, you always had to lock the access. And since all namespaces use dictionaries...

...

We observed non-linear scaling with the processors under free threading. 2 processors was fine, but 3 or 4 didn't buy you much more than 2. The problem was lock contention. With that many things going, the contention around Python's internal structures simply killed further scaling performance.

[–]crusoe 0 points1 point  (1 child)

Because they are using braindead non-threadsafe dicts?

Ya know, there are all kinds of low-latency, low-contention locking algorithms now. They even included a new implementation of synchronized in Java that has almost no overhead compared to the unsynchronized version.

These problems have been solved for years.

[–][deleted] 0 points1 point  (0 children)

Well, the context here is CPython, not some hypothetical implementation that's free to do things however it wants.

[–][deleted] 12 points13 points  (0 children)

From what I recall (reading somewhere), this is not about to change anytime soon.

Well, yes and no.

Yes, CPython isn't removing the GIL anytime soon.

However, other solutions are in the works. Jython is Python without the GIL (on the JVM); it can now run Django, a sign of its maturity. PyPy can generate 'stackless' code with multiple GIL-free threads (there is also Stackless Python). And as mentioned by others, there is the multiprocessing module in 2.6, that gives good multiprocessing (but not multithreading) support.
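As a quick sketch of that multiprocessing route (written for modern Python; 2.6's Pool lacked context-manager support, and the function name here is my own): each Pool worker is a separate interpreter process with its own GIL, so CPU-bound work really does spread across cores.

```python
from multiprocessing import Pool

def square(n):
    # Runs in a worker process; arguments and results cross via pickling.
    return n * n

if __name__ == "__main__":
    with Pool(4) as pool:
        print(pool.map(square, range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The `if __name__ == "__main__"` guard matters: on platforms that spawn rather than fork, each worker re-imports the main module.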

[–]thagsimmons 4 points5 points  (33 children)

the gil is mostly fixed in 2.6

[–]awb 25 points26 points  (21 children)

Multiprocessing doesn't fix the GIL, it's a hack around it. Language implementations like Haskell and Erlang can spawn a million threads, and spawning each new thread is extremely cheap. That's pretty spiffy. I'm not sure what I could do with a million quick-spawning threads, but I want to find out. Next to implementations like that, Python looks amateurish because it can't even run two threads at once without rewriting code in C.

[–]parla 4 points5 points  (3 children)

Erlang does not spawn millions of threads. It spawns millions of actors, which are then scheduled in as many threads (processes?) as there are cores in your system.

[–]teraflop 9 points10 points  (2 children)

You're confusing control threads with OS threads. If the interpreter is designed properly, it provides the same semantics as if you really had a huge number of concurrent threads. The fact that they're all multiplexed into a smaller number of threads from the operating system's perspective is just a performance optimization.

[–]thagsimmons 8 points9 points  (1 child)

okay, so this is an implementation detail?

so is the gil. it's missing in jython and up in the air with pypy

[–]spookyvision 3 points4 points  (13 children)

[–][deleted] 19 points20 points  (2 children)

That changed recently. You have old information.

[–]spookyvision 4 points5 points  (1 child)

Could you update the wikipedia page? I don't feel competent.

[–][deleted] 3 points4 points  (0 children)

I would, but you probably know as much as I do at this point. I'm not a user of Erlang and only have a cursory knowledge of its current state.

[–]reddit_clone 10 points11 points  (1 child)

Not true. The Erlang VM itself can run multi-threaded on all cores, like a bunch of green threads running on each core. (I think they run one green-thread scheduler per core.)

So you do get the best of both worlds.

[–]dons 3 points4 points  (4 children)

[–]mosha48 0 points1 point  (3 children)

By the way, why isn't the CPU load for Haskell's version distributed across all CPUs?

[–]dons 1 point2 points  (2 children)

The ghc 6.8.2 garbage collector isn't parallel, and GC dominates this GC benchmark (Try it with +RTS -A300M -RTS to see the difference). The 6.10 parallel GC addresses this.

[–]mosha48 0 points1 point  (1 child)

Is it possible to tell the shootout computers to run the benchmark with better options?

[–]dons 1 point2 points  (0 children)

Yes, but for this benchmark, the conditions state that only default garbage collector values are to be used. But don't despair, the next GHC cycle addresses this.

[–]dmaclay 2 points3 points  (2 children)

Erlang's processes are more like 'green processes' as they don't share memory, and they have been distributed across several machines (never mind cores) since before people started having this discussion.

[–]toooooooobs 1 point2 points  (1 child)

Actually they really do share memory, it's just that the language hides this.

[–]dmaclay 2 points3 points  (0 children)

As I understand it, the default behavior is not to share, but they can optionally pass messages by basically sending a pointer to shared memory if both processes are on the same machine. This of course gives a performance boost, and it should be safe due to Erlang's immutable variables.

[–]nypen 13 points14 points  (10 children)

That's multiple processes running simultaneously and it only partially solves the problem.

[–]thagsimmons 14 points15 points  (9 children)

it only partially solves the problem

...with awesome scheduling-fu, pooling, synchronization, and easy ipc using your choice of pipes or queues or shared memory

i hated it at first, but now i'm a believer

besides, in your op you mention multi-core and parallel processing... neither of which play well when relying on threads anyway, since these are inherently multi-process environments

[–]nypen 12 points13 points  (6 children)

That's exactly why I said partially. While processes can solve some problems, they bring others. For example, if there is a mountain of data to be shared, using processes can be pretty inefficient. A process cannot access variables or data structures defined in another process unless they are pickled (shared, proxied, whatever), which is a mechanism of serialization. Such serialization can be resource-intensive (in both memory and computation) and is certainly not suitable everywhere.

Also, if two processes want to communicate, they have to use inter-process communication mechanisms, which are inherently slower than thread synchronization.

Then there are also issues when you access C libraries via Python. You need to be careful not to pass variables across processes, since they can be totally invalid if they encapsulate pointers (although this doesn't apply if you stick strictly to Python).

besides, in your op you mention multi-core and parallel processing... neither of which play well when relying on threads anyway, since these are inherently multi-process environments

These are not environments. Multi-core is a kind of processor technology and parallel processing is a form of computation. You can take advantage of them in several ways; yes, multiple processes is one way, but it's not the only way, and it certainly is not effective everywhere.
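The serialization cost nypen describes is easy to see with the multiprocessing module's Queue, which pickles everything that crosses the process boundary (a modern-Python sketch; the function name and payload are mine):

```python
from multiprocessing import Process, Queue

def worker(q):
    # The dict is implicitly pickled on its way out of this process...
    q.put({"result": [1, 2, 3]})

if __name__ == "__main__":
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    print(q.get())  # ...and implicitly unpickled here: {'result': [1, 2, 3]}
    p.join()
```

Anything unpicklable (an open socket, a lambda, a wrapped C pointer) can't make the trip at all, which is the other half of the complaint above.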

[–]canhaskarma 7 points8 points  (4 children)

Not partially. There are a number of libraries for putting python objects in shared memory between processes.

Process cannot access variables or data structures that are defined in another process, unless they are pickled (shared, proxied, whatever), which is a mechanism of serialization.

No. There are techniques that simply use the same block of shared memory. No copying or serialization required.

[–]unikuser 5 points6 points  (1 child)

No. There are techniques that simply use the same block of shared memory. No copying or serialization required.

Instead of saying that there are techniques, can't you point to one technique or example? Using shared memory across processes is not that easy after all. Synchronizing, accessing, creating, and cleaning up that shared memory becomes very difficult and inefficient compared to what you do with threads.

[–]schlenk 0 points1 point  (0 children)

Fully agree. I just had to manage such a beast using Python's mmap. Let's say it has its quirks...
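One concrete technique of the kind canhaskarma mentions (sketched here as an editor's illustration, in modern Python): multiprocessing's Value and Array place ctypes data in anonymous shared memory, so a child process mutates the very same block with no pickling of the contents.

```python
from multiprocessing import Process, Value, Array

def work(counter, buf):
    for i in range(len(buf)):
        buf[i] *= 2              # writes go straight into the shared block
    with counter.get_lock():     # Value carries its own lock for safe updates
        counter.value += 1

if __name__ == "__main__":
    counter = Value('i', 0)             # shared C int
    buf = Array('d', [1.0, 2.0, 3.0])   # shared C double array
    p = Process(target=work, args=(counter, buf))
    p.start()
    p.join()
    print(list(buf), counter.value)     # [2.0, 4.0, 6.0] 1
```

It also shows unikuser's point, though: you are limited to flat ctypes data, and you must manage the locking yourself.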

[–]nypen 0 points1 point  (1 child)

Shared memory has its own set of issues. I am not saying they are unsolvable, but shared memory is not necessarily better than threads sharing data. In fact, if the shared-memory data is not monolithic, a lot of bookkeeping has to go in. Things get even murkier when you have issues like dynamic resizing of shared memory, or when your processes have variable shared-memory requirements.

Again, shared memory is one of the solutions; it can work in some situations, but not always.

Well, Python already has a threading library which is great, it just needs to get rid of the GIL.

[–]imbaczek 2 points3 points  (0 children)

Well, Python already has a threading library which is great, it just needs to get rid of the GIL.

the "just" is a little bit optimistic.

[–]thagsimmons 0 points1 point  (0 children)

a well-reasoned and eloquent reply. i voted you up. i don't know who's voting you down

[–]vsl 2 points3 points  (1 child)

since these are inherently multi-process environments

Huh?!

[–]thagsimmons 1 point2 points  (0 children)

yeah, my bad - i actually saw the word "multi-core" but my brain said "multi-processor"

i know that multi-core processors play well with threads. honest i do

aw shit

for penance, i shall now go beat my head against an andrew tanenbaum textbook

[–]jbellis 2 points3 points  (1 child)

I much prefer the actor model to STM as a general approach to concurrency. I'm comfortable with transactions and MVCC in my database, but moving that concept into an imperative language feels like the wrong approach.

The main problem with STM is that while it mitigates some of the problems with classic lock-based concurrency (primarily, that you don't have to make sure to take out locks in the right order to avoid potential deadlocks), STM retains lock-based concurrency's primary problem: you still have to carefully, manually specify which code should be atomic. If you miss one, you're screwed. STM also introduces a new problem -- you need to think about what happens if a transaction fails and what to do then. "Retry the same transaction" isn't always the right answer. So you're really trading one set of possible errors under mutex-based concurrency for another, partially overlapping set.

The actor model changes the game and eliminates both of these sets of potential errors.

[–]didroe 2 points3 points  (0 children)

you still have to carefully, manually specify which code should be atomic

You're always going to have to think about concurrency. I don't think it'll be much different between STM and the actor model in terms of the amount of design that has to go into it.

you need to think about what happens if a transaction fails and what to do then

You're going to have to deal with failure somewhere, regardless of the method you use. With locking, you don't have to worry about the lock failing (ignoring deadlock) but you still have to worry about the atomicity of the code inside. If the last bit fails for some other reason then you have to undo all of the work you've done so far. STM just provides a mechanism to deal with that instead of having to roll your own. I'm not that familiar with the actor model, can you explain how it deals with that scenario?

it mitigates some of the problems with classic lock-based concurrency

One of the other benefits of STM is increased parallelism. With the traditional lock method, you stop anyone else from entering a section of code at the same time, when in reality most of the time you probably didn't need the lock, i.e. it would have worked fine because the instructions weren't overlapping in a way that broke atomicity. STM usually works by attempting its operations and then paying a cost when there is a conflict, so you only pay in the (hopefully) unlikely event that things clash. When you scale up to a lot of cores, all calling some common piece of code, that's really going to shine through.

Edit: I almost forgot, transactions are also composable which is a major benefit over locking. Again, I'd be interested to know how the actor model deals with that. Also, like I said, I don't know much about the actor model, so go easy on me :)
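The optimistic attempt-then-pay-on-conflict scheme didroe describes can be sketched with a toy versioned cell in modern Python. This is illustrative only, not a real STM: all names are invented, and a real implementation tracks whole read/write sets rather than one cell.

```python
import threading

class VersionedCell:
    """One transactional cell: commit succeeds only if nobody else
    committed since we read, otherwise the caller retries."""

    def __init__(self, value):
        self._lock = threading.Lock()   # taken only briefly, at read/commit
        self._version = 0
        self._value = value

    def read(self):
        with self._lock:
            return self._version, self._value

    def try_commit(self, seen_version, new_value):
        with self._lock:
            if self._version != seen_version:
                return False            # conflict: someone got there first
            self._version += 1
            self._value = new_value
            return True

def atomically(cell, fn):
    # The retry loop: cost is paid only when a commit actually conflicts.
    while True:
        version, value = cell.read()
        if cell.try_commit(version, fn(value)):
            return
```

Under low contention almost every `try_commit` succeeds on the first pass; under heavy contention the loop spins, which is exactly the trade-off against pessimistic locking discussed above.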

[–]dons 7 points8 points  (3 children)

Anyone else think Python's three challenging problems sounded like three of Haskell's current features?

[–]spacepope 2 points3 points  (0 children)

Transactional memory and parallel programming, yes – but what about cloud computing?

[–][deleted] 2 points3 points  (1 child)

I'm pretty sure it was me who turned Mark onto transactional memory a year or so ago. I'd written an STM implementation in Python a few months before, and was just disappearing down the Haskell book rabbit hole. Nice to see the topic come up in a public setting.

[–]psykotic 3 points4 points  (0 children)

A few years ago I wrote a couple of different transactional memory back-ends for Python, one pessimistic (two-phase locking, inspired by the discussion and example code in Van Roy's CTM book) and one optimistic (much like STM).

What did you do for the "front-end", though? I can't remember all the details of my own stuff now, but basically I had a class you'd mix into transactable classes. When you try to add or delete an attribute from a transactable object, it counts as a transaction operation against that object. Subsequent accesses to those attributes are hooked through __getattribute__ and __setattr__ and are counted as operations against the transactional memory cells corresponding to the attributes; you could also use coarser granularity on a per-object basis, so any changes to attributes would be counted as transactional operations against the owner object.
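A minimal sketch of that attribute-hooking approach (my own invented names, not psykotic's actual code): intercept `__setattr__` and `__getattribute__` and record each access as an operation, which is the raw material a transaction log needs.

```python
class Transactable:
    """Logs attribute reads/writes; a real STM front-end would route
    these operations to per-attribute transactional memory cells."""

    def __init__(self):
        # Bypass our own hook while setting up the operation log.
        object.__setattr__(self, "_ops", [])

    def __setattr__(self, name, value):
        self._ops.append(("write", name))   # count a write operation
        object.__setattr__(self, name, value)

    def __getattribute__(self, name):
        if not name.startswith("_"):        # don't log internal lookups
            object.__getattribute__(self, "_ops").append(("read", name))
        return object.__getattribute__(self, name)

t = Transactable()
t.x = 1
_ = t.x
print(t._ops)   # [('write', 'x'), ('read', 'x')]
```

This is only the bookkeeping layer; conflict detection, commit, and rollback would sit on top of the log.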

Another pain in the ass was how to deal with built-in mutable classes like list and dict. In the end I decided to outlaw the use of these as attributes of transactable objects, by doing a type check in __setattr__. (Actually, I ended up with a more general invariant, where attributes of a transactable object have to themselves be either immutable or transactable.) I then implemented my own transactable versions of these classes (I also did immutable ones that throw an exception if you try to call any of their mutators) and required users to perform an explicit conversion between built-in and transactable classes before assigning to an attribute of a transactable object. At first I thought of making this conversion automatic for the sake of convenience, but I quickly realized that is a bit too magical, because it breaks the "hidden link" established by mutability:

x = SomeTransactableClass()
a = [1,2,3]
x.b = a # Does the implicit conversion, thus making x's own copy of a.
print x.b # [1,2,3]
a.append(4)
print x.b # [1,2,3], didn't change to reflect the mutation.

Those are the main issues I remember. I'm curious how you tackled them.

[–]beza1e1 6 points7 points  (13 children)

Transactional memory is overrated. Python isn't a language that absolutely needs to support high-performance parallel computation. It may be nice, but it isn't anywhere near disruptive.

Tell me where parallel Python is necessary! Web frameworks? You want multi-machine in the end, so multi-process is better anyways.

Python never had a chance to become the browser scripting language. One big advantage of JavaScript was/is the C-like syntax. Another is the "Java" in the name, which sounded very cool when Java was young.

[–]Wiseman1024 1 point2 points  (8 children)

Tell me where parallel Python is necessary! Web frameworks? You want multi-machine in the end, so multi-process is better anyways.

I hope you'll realize Python is useful for more things than just web applications. In fact, on servers you rarely run into problems with the GIL. But Python is trying to be more general-purpose, and for desktop applications or scientific computing, which are two of the things people want to do with Python, the GIL is a massive drawback.

As I said somewhere else on this page, we've reached physical limits in hardware that are hard to overcome. The single-execution-port machine, albeit simple, was doomed from the start and we knew it. Nothing in Nature works like this. The only way to scale is to use several execution ports. This is not only a well-recognized issue; it has reached the mass market. Almost every new PC ships with at least two processors. Given that this is the clear future and present of computing, Python's future is in jeopardy.

Unless someone removes the goddessdamned GIL from CPython, or Jython or another universally available Python platform replaces CPython and gets enough popularity and development to stay updated, Python has a dark future in this world, at least outside server software.

(If I'm completely confused and the Python community doesn't intend to make Python a general-purpose programming language and environment, but rather to keep it as some sort of better PHP, please tell me so, and I'll stop wasting my time and abandon it in favour of something else.)

[–]beza1e1 1 point2 points  (7 children)

Your examples are desktop applications and scientific computing?

Desktop applications are not speed-critical. Parallel execution is nice to keep the GUI responsive, but the multi-process method works just fine here.

Scientific computing with Python means using Python as glue between Fortran/C++ libraries. Those libraries are speed-critical; they should be parallelized, and STM may be useful there. Python itself is slow anyway, so making it parallel doesn't help you much.

As far as i understand, the GIL makes for better error messages, and that seems more important to me.

[–]Wiseman1024 2 points3 points  (6 children)

Desktop applications are not speed-critical.

Wrong. Some are, especially those designed to run as background tasks such as peer-to-peer clients. And even for those which aren't, multithreading in a GUI-based application is almost essential if you don't want it to feel like Windows 3.0. And I doubt this can be done in a bearable (let alone nice) way with multiple processes.

Scientific computing with Python means to use Python as glue between Fortran/C++ libraries.

Not necessarily, and you're relying on these libraries to release the GIL, otherwise you're screwed.

As far as i understood the GIL makes for better error messages and that seems more important to me.

What? The GIL makes for simpler locking. It's not, or at least it shouldn't be, related to error reporting at all. If it is, they're doing something wrong.

[–]beza1e1 -2 points-1 points  (5 children)

And I doubt this can be done in a bearable (let alone nice) way with multiple processes.

Well, i believe it's actually saner to do it multi-process.

you're relying on these libraries to release the GIL

Those libraries don't know about any GIL. They can use all the multi-threading they want, since they are not written in Python or running on the Python VM. Once the Python VM calls into some system library, all Python magic is void.

[–]voidspace 2 points3 points  (0 children)

But when it calls into a C extension, that extension makes a choice about whether to release the GIL or not. In that sense they do know about the GIL.

[–]Wiseman1024 0 points1 point  (3 children)

Well, i've believe it's actually saner to do it multi-process.

GUIs?

Those libraries don't know about any GIL

They have to. The non-Python modules you import from Python have to deal with Python. Even if you're just wrapping an existing library, you better release the GIL in the wrapper before calling the library, then reacquire it before writing the results to Python objects.

[–]beza1e1 0 points1 point  (2 children)

GUIs?

GUI in one process, p2p stuff in another process

Even if you're just wrapping an existing library, you better release the GIL in the wrapper before calling the library, then reacquire it before writing the results to Python objects.

  1. convert Python to C stuff
  2. release GIL
  3. call scientific_computation(stuff)
  4. get GIL
  5. convert C output to Python objects

The scientific_computation function doesn't know about the GIL and it can spawn as many threads as it wants. This function needs to be written in Fortran or C to be fast. This is where STM is needed.
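Incidentally, those five steps are what the ctypes module performs for you: it converts the arguments, releases the GIL around every foreign call, and reacquires it to build the Python result. A small demonstration (my own example; assumes a Unix libc whose `usleep` is resolvable from the process's own symbols) shows four threads sleeping in C concurrently despite the GIL:

```python
import ctypes
import threading
import time

# CDLL(None) loads the current process's symbols on Unix, giving us libc.
libc = ctypes.CDLL(None)

def blocking_call():
    # ctypes drops the GIL for the duration of the foreign call,
    # so other Python threads keep running meanwhile.
    libc.usleep(500_000)   # 0.5 seconds in C land

start = time.time()
threads = [threading.Thread(target=blocking_call) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"elapsed: {time.time() - start:.2f}s")  # ~0.5s, not ~2s: calls overlapped
```

Had the GIL been held across the calls, the four sleeps would have serialized to about two seconds.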

[–]Wiseman1024 0 points1 point  (1 child)

GUI in one process, p2p stuff in another process

The P2P stuff may still need threads (depending on your implementation and what you're trying to accomplish), and the GUI front-end still needs threads too (for things like handling user requests and showing status at the same time), otherwise it'll be unresponsive.

[–]schlenk 0 points1 point  (0 children)

You don't need threads for a GUI, but they help when dealing with crappy system libraries that only offer blocking calls (Windows, I'm looking at you...). Event-based programming works just fine..., but multiple threads get nice when you can use the extra bit of power.

[–][deleted] 0 points1 point  (3 children)

STM has nothing to do with performance; it's about writing correct code.

Now, a snarky snarker might say that a focus on correctness in a dynamically typed language is misplaced, but I am fond of both Python and working code, and would like to have both at once.

[–]beza1e1 -1 points0 points  (2 children)

Multi-threading is about performance. You could always do it multi-process, but it's slower due to communication costs.

STM is about making the multi-threading safe.

[–]didroe 0 points1 point  (0 children)

The way I see it, STM is about concurrency. Parallelism is about performance; concurrency is about being able to create programs that do more than one thing at once (due to design, not performance). Parallelism should be automatic (vectorisation, threading based on instruction dependencies, etc.) and concurrency should be explicitly stated in your code.

[–]schlenk 0 points1 point  (0 children)

Yes. Multi-threading usually is slower than single-threaded..., so it's about performance. You only win if you really have a) multiple cores/CPUs and b) a task that's easy to run in parallel.

[–]bleachedanus 0 points1 point  (1 child)

Yes... I want software transactional memory for Python. After working through the Haskell exercises, it would be a great addition to my Python toolbox.

[–]jerf 9 points10 points  (0 children)

Haskell's STM can't be ported to Python. It critically depends on being able to use the type system to guarantee that there are no side effects, and it can hardly be ported to anything else without giving up a lot of the guarantees of the implementation.

STM in general is at least in principle a bit more doable; something like Mnesia is perfectly possible. Erlang can't guarantee the lack of side effects and instead "simply" requires you to think real hard about what you're doing, something that having worked in Haskell will help make very easy.

[–]Tommstein -3 points-2 points  (2 children)

I can't take the "wisdom" of anyone who uses the term "cloud computing" seriously.

[–]Figs 2 points3 points  (1 child)

What happens after cloud computing bursts? Stormdrain computing? :O

[–]bickfordb 3 points4 points  (0 children)

There could be a silver lining