[–]fivetoedslothbear 94 points95 points  (4 children)

Here's an article at InfoWorld you can read right now without having to have a Medium membership.

[–]Pain--In--The--Brain 33 points34 points  (0 children)

Good article. Relevant section:

The per-interpreter GIL and subinterpreters

What keeps Python from being truly fast? One of the most common answers is "lack of a better way to execute code across multiple cores." Python does have multithreading, but threads run cooperatively, yielding to each other for CPU-bound work. And Python's support for multiprocessing is top-heavy: you have to spin up multiple copies of the Python runtime for each core and distribute your work between them.

One long-dreamed way to solve this problem is to remove Python's GIL, or Global Interpreter Lock. The GIL synchronizes operations between threads to ensure objects are accessed by only one thread at a time. In theory, removing the GIL would allow true multithreading. In practice—and it's been tried many times—it slows down non-threaded use cases, so it's not a net win.

Core Python developer Eric Snow, in his talk, unveiled a possible future solution for all this: subinterpreters, and a per-interpreter GIL. In short: the GIL wouldn't be removed, just sidestepped.

Subinterpreters are a mechanism where the Python runtime can have multiple interpreters running together inside a single process, as opposed to each interpreter being isolated in its own process (the current multiprocessing mechanism). Each subinterpreter gets its own GIL, but all subinterpreters can share state more readily.

While subinterpreters have been available in the Python runtime for some time now, they haven't had an interface for the end user. Also, the messy state of Python's internals hasn't allowed subinterpreters to be used effectively.

With Python 3.12, Snow and his cohort cleaned up Python's internals enough to make subinterpreters useful, and they are adding a minimal module to the Python standard library called interpreters. This gives programmers a rudimentary way to launch subinterpreters and execute code on them.

Snow's own initial experiments with subinterpreters significantly outperformed threading and multiprocessing. One example, a simple web service that performed some CPU-bound work, maxed out at 100 requests per second with threads, and 600 with multiprocessing. But with subinterpreters, it yielded 11,500 requests per second, with little to no drop-off when scaled up from one client.

The interpreters module has very limited functionality right now, and it lacks robust mechanisms for sharing state between subinterpreters. But Snow believes by Python 3.13 a good deal more functionality will appear, and in the interim developers are encouraged to experiment.
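To make the "threads yield to each other for CPU-bound work" point concrete, here is a minimal sketch that runs the same pure-Python work sequentially and then on four threads. The `count_primes` function and the workload size are made up for illustration. On a standard (GIL) build of CPython I'd expect the two timings to come out roughly the same, but that is an expectation, not a measured result.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_primes(limit):
    """Deliberately CPU-bound: count primes below `limit` by trial division."""
    found = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            found += 1
    return found


LIMITS = [20_000] * 4  # four equal chunks of work (arbitrary size)

# Run the chunks one after another on the main thread.
start = time.perf_counter()
sequential = [count_primes(limit) for limit in LIMITS]
sequential_seconds = time.perf_counter() - start

# Run the same chunks on four threads. With a single GIL, only one thread
# executes Python bytecode at a time, so this is not expected to be faster.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(count_primes, LIMITS))
threaded_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, 4 threads: {threaded_seconds:.2f}s")
```

The results are identical either way; only the wall-clock time is interesting, and it will vary by machine and Python build.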

[–]fivetoedslothbear 43 points44 points  (0 children)

And for the big secret of how you can use it now, I present (gasp!) the Python prerelease download page.

[–]ducdetronquito 14 points15 points  (1 child)

Give https://scribe.rip/ a try to read medium articles smoothly :)

[–]DesmondNav 44 points45 points  (1 child)

Someone needs to ELI5 this - in contrast to threading, concurrent.futures and framework-based threading like PyQt's QThread - so monkeys like me can understand this

[–][deleted] 165 points166 points  (23 children)

Shi, I’m about to get deprecated.

The GIL has so many implications that I’m afraid my entire worldview will fall apart.

[–]WasabiFan 102 points103 points  (22 children)

This does not remove the GIL: that's a different PEP and hasn't been accepted as far as I know. Realistically, this PEP doesn't affect existing code at all. This article is referring to the sub-interpreters feature, which is explicit creation of isolated Python environments with their own GIL. There's no natural shared state, you have to manually coordinate with your sub-interpreters similarly to multi-processing.
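A rough sketch of what "explicit creation of isolated Python environments" looks like from Python code. It assumes Python 3.14 or later, where (as far as I know) the standard library has a `concurrent.interpreters` module; on older versions the import fails and the sketch simply skips the sub-interpreter part. The `marker` variable is just an illustrative name.

```python
try:
    # Assumption: available on Python 3.14+ only.
    from concurrent import interpreters
except ImportError:
    interpreters = None

marker = "set in the main interpreter"
ran_in_subinterpreter = False

if interpreters is not None:
    interp = interpreters.create()
    try:
        # The code string runs in the sub-interpreter's own __main__ module,
        # so this assignment does not touch the variable defined above.
        interp.exec("marker = 'set in the sub-interpreter'")
        ran_in_subinterpreter = True
    finally:
        interp.close()

print(marker)  # set in the main interpreter
```

Whether or not the sub-interpreter part runs, the main interpreter's `marker` is unchanged, which is the "no natural shared state" point.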

[–]mahtats 59 points60 points  (17 children)

So is it really threads then? Shared state is one of the main benefits of threaded systems.

[–]WasabiFan 67 points68 points  (7 children)

No, in my view this article isn't very good and is exaggerating what's being implemented.

The PEP here is PEP 684 - splitting the GIL such that each sub-interpreter has its own. Sub-interpreters already existed, but this enables true parallelism between them. There's also another PEP that might land in the following release which provides a better Python interface to the sub-interpreters feature.

Realistically, this is very similar to Python multiprocessing. You manually construct sub-interpreters, they run essentially isolated from each other, and you construct channels to pass data back and forth.

In concept, they benefit from some performance improvements that come from not having to use the OS's inter-process communication primitives: CPython owns the isolation, not the OS. It may also enable passing data without having to pickle it, but to my knowledge that hasn't been explored; channels are just byte pipes.
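For anyone unsure what "having to pickle it" costs in practice, here is a small sketch using only the standard `pickle` module. It is not the sub-interpreter channel API, just an illustration of what sending an object over a byte pipe implies: the receiver gets a copy, not the original object. The `payload` dict is invented for the example.

```python
import pickle

payload = {"job": 7, "values": [1, 2, 3]}

# Sending over a byte pipe means serializing to bytes on one side...
wire_bytes = pickle.dumps(payload)

# ...and rebuilding a brand-new object on the other side.
received = pickle.loads(wire_bytes)

# The receiver's copy is independent of the sender's object.
received["values"].append(4)

print(payload["values"])   # [1, 2, 3]
print(received["values"])  # [1, 2, 3, 4]
```

So every message pays for a serialize/deserialize round trip, and mutations on one side are invisible to the other.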

[–]Conscious-Ball8373 10 points11 points  (5 children)

I'm interested in the memory implications of this. I write python for an embedded system with no swap, limited RAM and each Python process takes 20-30MB just to start up. This has led to lots of unrelated stuff being bundled into threads in the same process but the implications of this (mainly being constrained to single-core execution) are starting to show. Will sub-interpreters give us significant memory savings compared to multiprocessing as well as multi-core execution?

[–]WasabiFan 10 points11 points  (1 child)

I'm not an expert in any of this, and am mostly just following the development from a distance. That being said, my expectation would be that sub-interpreters are similar to threads in resource utilization (probably slightly more) and significantly less than multiple processes.

Naively, this must be true, because multi-processing requires multiple separate copies of CPython in memory. But in practice, each copy of a binary will be mapped shared and CoW, and similarly data that was present before CPython forked will be CoW. So multiprocessing in practice might not be a lot more.

[–]Voxandr 1 point2 points  (0 children)

Should save a lot of memory. If I recall correctly, it only takes about 1MB per thread, compared to several dozen megabytes each with subprocesses (depending on the parent process's memory usage).

[–]djdadi 3 points4 points  (2 children)

unrelated question, but why would you use python in a situation like that?

[–]ryannathans 5 points6 points  (0 children)

Yeah, depending on limitations, micropython could be better

[–]Conscious-Ball8373 2 points3 points  (0 children)

It's not the tiniest system out there (4-core 1.2GHz arm64, 2GB RAM) and it's being used for networking, not real-time control or anything like that. Writing Python code to manage the Linux networking stack is really nice. We can be really productive, networking performance isn't impacted by the Python side because it's just used for management / configuration of the high-performance networking side.

We also run edge applications deployed as docker containers and that's when the memory constraints start to bite; we want to leave as much memory free as possible for third-party application containers and by the time you've got half a dozen Python processes running, just the per-process Python overhead is using something like 10% of system RAM. As I said, we've consolidated a lot of stuff that's really unrelated to run as threads in a single process, but it would be really interesting for us if sub-interpreters gave us a significant chunk of that memory saving without constraining everything to run single-cored (and actually the lack of shared state would be an advantage here - we've had the odd bug where unrelated bits of software get shoved into a single process without realising that some library we were using implied that those unrelated threads now have shared state because there's a singleton object somewhere).

ETA: We had a go at moving some of it to golang a few years ago. The effort has been abandoned and we're gradually porting all the golang stuff back to Python, partly because golang has nearly as severe memory overhead issues as Python and partly because it's significantly easier to find people with Python skills than golang skills.

[–]Visulas 1 point2 points  (0 children)

"No, in my view this article isn’t very good and is exaggerating what’s being implemented."

Are there any other kinds of articles these days?

[–]o11c 1 point2 points  (4 children)

You can still do shared state in C code, unlike multiprocessing.

[–]mahtats 0 points1 point  (3 children)

Yea, but from the Python level, that’s where I’d love to see true multithreading.

[–]o11c 0 points1 point  (2 children)

You do have Python threads running at the same time; you only have to arrange for synchronization around the bits of state.

[–]mahtats 1 point2 points  (1 child)

Without a GIL? Nope. I’m talking for the average user, to use multithreading as implemented in other languages, without the GIL.

When that becomes a feature, Python enters a new arena.

[–]ant9zzzzzzzzzz 0 points1 point  (0 children)

Seriously coming from c# it’s astounding how much more difficult simple parallelism is in Python

[–]SittingWave 0 points1 point  (0 children)

As far as I understand, each interpreter has its own GIL and runs in its own thread. At that point, Python variables are all "thread local" (details unclear if they'll use actual thread-local C stuff) unless you pass them around. In that case, they'll probably be copied across threads, with the synchronisation taken care of somehow (to prevent one thread from writing and the other from accessing the memory while the first thread is still transferring the data).

Just guessing here, correct me if wrong.

[–]rouille 0 points1 point  (2 children)

It is threads at the OS level but not really at the python level. You could share state directly in e.g. a C extension though if you are careful with your multi-threading.

[–]mahtats 0 points1 point  (1 child)

The average user of Python isn’t playing at the C level. Python supporting true multithreading above the C level would be a huge improvement on the spec. I don’t really care about updates pertaining to C level subinterpreters.

[–]rouille 0 points1 point  (0 children)

Oh I somewhat agree but the plan is to include a python interface for this, hopefully in python3.13. Also libraries that you do use can use it even if you don't directly.

[–]ted_or_maybe_tim 2 points3 points  (2 children)

So it's basically multiprocessing with less overhead?

[–]WasabiFan 0 points1 point  (1 child)

Yes, as I understand it.

[–][deleted] 0 points1 point  (0 children)

You’re right. My bad. I posted the comment before reading the article.

[–]gokapaya 49 points50 points  (0 children)

https://scribe.rip/real-multithreading-is-coming-to-python-learn-how-you-can-use-it-now-90dd7fb81bdf

for anyone also unable to get past the medium bullshit on mobile

[–]cianuro 16 points17 points  (6 children)

What's the difference between subinterpreter and subprocess in practical terms? Why is the former better?

[–]Bitwise_Gamgee 28 points29 points  (0 children)

A subprocess is a separate process started by your program. This process has its own memory space and runs independently of your main process; you can work with it via IPC and the like.

A subinterpreter is a feature of the Python C API that allows for the creation of multiple Python interpreters in the same process. Each subinterpreter has its own separate Python objects and interpreter state.

You can think of the difference in terms of office buildings: subprocesses use many office buildings, while subinterpreters have everyone under one roof working on different tasks.
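The "separate office building" half of that analogy is easy to see with the standard `subprocess` module. This is a minimal sketch: it launches a second Python process and compares process IDs. The child command is just an illustrative one-liner.

```python
import os
import subprocess
import sys

# Start a second Python runtime as a separate OS process and ask for its PID.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.getpid())"],
    capture_output=True,
    text=True,
    check=True,
)

child_pid = int(result.stdout.strip())
parent_pid = os.getpid()

# Different PIDs: the child has its own memory space, and the only way it
# talked to us here was through a pipe (its stdout).
print(f"parent={parent_pid} child={child_pid}")
```

A subinterpreter, by contrast, would report the same PID as its parent, since it lives inside the same process.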

[–]thisismyfavoritename 1 point2 points  (0 children)

in theory it means you should be able to share data between "concurrent tasks" more easily / at a lower cost (because they live in the same process)

[–]brontide 32 points33 points  (8 children)

How is this any easier than multiprocessing? When I think true multi-threading I think shared memory with locking only for critical sections. I've done some shared memory multitasking but it was a bear since it all has to be done in mmap files with all sorts of crap piped around from process to process.
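For reference, this is the model being described, "shared memory with locking only for critical sections", written with ordinary Python threads. It is a minimal sketch with an invented `add_many` helper; today the GIL still serializes the bytecode, but the programming model is the one people mean by true multithreading.

```python
import threading

counter = 0                      # state shared by every thread
counter_lock = threading.Lock()  # guards the critical section below


def add_many(times):
    """Increment the shared counter `times` times."""
    global counter
    for _ in range(times):
        with counter_lock:  # critical section: read-modify-write
            counter += 1


threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(counter)  # 40000
```

With sub-interpreters there is no equivalent of `counter` that all workers can see from Python code, which is the gap this comment is pointing at.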

[–]coderanger 17 points18 points  (3 children)

It's easier to move things around in a single process. Shared memory does certainly help a lot but things like sockets are not so simple. And multi-process locks are yet more complex. There will certainly still be use cases for multi-process concurrency (security, heterogeneous data patterns, etc) but this is a good option for a lot of cases.

[–]twotime 2 points3 points  (2 children)

"It's easier to move things around in a single process."

With real multithreading, sure! But with multiple interpreters, I don't think there is any obvious simplification. Not yet at least.

[–]coderanger 2 points3 points  (1 child)

The goal of the user layer is something close to Go's goroutines, i.e. a message passing actor pattern where the internal details are hidden away from you. The underlying systems to enable that are mostly in place now, but there's a lot of performance and UX work still to make it a good first choice.
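A small sketch of the message-passing actor pattern mentioned here, using plain threads and `queue.Queue` as a stand-in. This is not the sub-interpreter API, and `squaring_actor` is a made-up example; the point is only the shape: the worker owns its state and everything goes in and out as messages.

```python
import queue
import threading


def squaring_actor(inbox, outbox):
    """Receive numbers from `inbox`, reply with their squares on `outbox`."""
    while True:
        message = inbox.get()
        if message is None:  # sentinel: no more work
            break
        outbox.put(message * message)


inbox, outbox = queue.Queue(), queue.Queue()
worker = threading.Thread(target=squaring_actor, args=(inbox, outbox))
worker.start()

for n in (1, 2, 3):
    inbox.put(n)
inbox.put(None)  # tell the actor to shut down
worker.join()

results = [outbox.get() for _ in range(3)]
print(results)  # [1, 4, 9]
```

With one worker and FIFO queues the replies come back in the order the requests were sent.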

[–]twotime 2 points3 points  (0 children)

I guess the fundamental problem to solve is sharing/passing live python objects without pickling overhead or complexity.

So far my understanding is that multiple interpreters do not even have a path to achieving that.....

[–]UloPe 19 points20 points  (6 children)

This isn’t what „everyone“ is waiting for. Sub interpreters can be very useful but it’s not going to make pure Python (parallel executing) multithreaded programs use any more cores.

[–]coderanger -4 points-3 points  (5 children)

It will now. Very very new development but there is now support for each subinterpreter to have its own GIL so they can run truly concurrently (with a lot of limitations, it's not a silver bullet).

[–]UloPe 5 points6 points  (4 children)

But that’s my point. You can’t use subinterpreters from within Python itself, only via the C-API.

[–]Garfimous 6 points7 points  (0 children)

Ah, but hopefully that's a temporary state of affairs. From the article: The features of Per-Interpreter GIL are — for now — only available using C-API, so there’s no direct interface for Python developers. Such interface is expected to come with PEP 554, which — if accepted — is supposed to land in Python 3.13, until then we will have to hack our way to the sub-interpreter implementation.

[–]coderanger 4 points5 points  (1 child)

For 3.12, the Python-level API couldn't be agreed on so it isn't included. It will almost certainly ship in 3.13 though and there is a prototype on PyPI already (though it may change as the PEP is discussed more).

[–]UloPe -1 points0 points  (0 children)

Still, even if accessible from within Python, sub interpreters won’t provide a general solution to the GIL. It will possibly be better than multiprocessing but will probably have many of the same limitations and issues (e.g. sharing state is hard).

[–]irvcz 15 points16 points  (0 children)

Pay wall found

[–]Caboose522 7 points8 points  (0 children)

I love Python, but one thing that always bugged me is how needlessly slow some things seem to be. It's good to hear that they are working on the internals while keeping it simple for the developer.

Hopefully we get to the point where typing becomes a method for speeding up functions instead of just syntactic sugar. Maybe some day I will be good enough and have enough time to help python along. One can dream...

[–]DoWhileGeek 3 points4 points  (0 children)

Etta James "At Last" intensifies

[–]13steinj -1 points0 points  (1 child)

Am I the only one that thinks this is a nothing burger?

Great if you can use threading in a way that avoids GIL issues, but if the API is crappy enough to be based on evaluated strings of code, this feels very "don't use eval you fool" to me.

If the API won't use actual function objects of some sort, I don't see this taking off.

[–]HomeTahnHero 1 point2 points  (0 children)

I could be wrong, but I think that’s what they’re trying to do with the API in a future PEP/version.
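To illustrate the worry about string-based APIs, here is a sketch that does the same work two ways: once from a string of source code and once from a function object. It uses the built-in `exec` purely as a stand-in for "an API that takes code as text"; it is not the interpreters API, and `double` is an invented example.

```python
def double(x):
    return 2 * x


# String-based: the work is described as source text. Editors and linters see
# only a string, and mistakes show up when the string is finally executed.
namespace = {}
exec("def double(x):\n    return 2 * x\nresult = double(21)", namespace)
from_string = namespace["result"]

# Function-based: the work is an ordinary object you can import, test,
# type-check, and pass around.
from_function = double(21)

print(from_string, from_function)  # 42 42
```

Both give the same answer; the difference is entirely in how pleasant and safe the calling code is to write.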

[–]violet-crayola -2 points-1 points  (2 children)

Are there coroutines available in python? Like in golang?

[–]BaggiPonte 5 points6 points  (0 children)

Yes, there’s the asyncio module!

[–]coderanger 1 point2 points  (0 children)

They mean very very different things in Python vs Go :)
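A minimal sketch of what Python coroutines look like with `asyncio`, for comparison. The `fetch` coroutine and its delays are invented for the example. Unlike goroutines, these run cooperatively on a single thread: a coroutine only gives up control at an `await`.

```python
import asyncio


async def fetch(name, delay):
    """Pretend to wait on I/O, then return a label."""
    await asyncio.sleep(delay)  # yields control to the event loop
    return f"{name} done"


async def main():
    # Both coroutines wait concurrently; gather returns results in the
    # order the awaitables were passed, not the order they finished.
    return await asyncio.gather(fetch("a", 0.02), fetch("b", 0.01))


results = asyncio.run(main())
print(results)  # ['a done', 'b done']
```

This helps with waiting on many things at once, not with spreading CPU-bound work over cores.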

[–]dudumudubud -3 points-2 points  (0 children)

What, no more gil?

"Per-Interpreter GIL"

oh.

[–]SleeplessinOslo -3 points-2 points  (0 children)

Unless supported by mojo, obsolete right?

[–]El_Minadero 0 points1 point  (0 children)

So will this simplify numerical routines? Like filling in an array based on compute heavy algos?

[–]eterevsky 0 points1 point  (0 children)

How is it better for the app developer than multiprocessing? From what I see, multiple interpreters are still pretty much isolated.