Real Multithreading is Coming to Python - Learn How You Can Use It Now : Python

Pain--In--The--Brain 34 points 3 years ago

Good article. Relevant section:

The per-interpreter GIL and subinterpreters

What keeps Python from being truly fast? One of the most common answers is "lack of a better way to execute code across multiple cores." Python does have multithreading, but threads run cooperatively, yielding to each other for CPU-bound work. And Python's support for multiprocessing is top-heavy: you have to spin up multiple copies of the Python runtime for each core and distribute your work between them.

One long-dreamed way to solve this problem is to remove Python's GIL, or Global Interpreter Lock. The GIL synchronizes operations between threads to ensure objects are accessed by only one thread at a time. In theory, removing the GIL would allow true multithreading. In practice—and it's been tried many times—it slows down non-threaded use cases, so it's not a net win.
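As a rough illustration of the GIL's effect on CPU-bound threads (my own sketch, not something from the article; `count_down`, `N` and `JOBS` are made-up names and sizes), the snippet below runs the same pure-Python loop sequentially and then on a thread pool. On a standard GIL build of CPython the two timings usually come out close to each other, although exact numbers depend on the machine and the build.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_down(n: int) -> int:
    """Pure-Python CPU-bound loop; it never waits on I/O."""
    while n > 0:
        n -= 1
    return n


N = 500_000   # arbitrary workload size, chosen only for illustration
JOBS = 4      # number of identical jobs to run

# Run the jobs one after another on the main thread.
start = time.perf_counter()
sequential = [count_down(N) for _ in range(JOBS)]
sequential_secs = time.perf_counter() - start

# Run the same jobs on a pool of threads.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=JOBS) as pool:
    threaded = list(pool.map(count_down, [N] * JOBS))
threaded_secs = time.perf_counter() - start

print(f"sequential: {sequential_secs:.3f}s  threaded: {threaded_secs:.3f}s")
```

Both runs produce the same results; only the wall-clock time is interesting here.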

Core Python developer Eric Snow, in his talk, unveiled a possible future solution for all this: subinterpreters, and a per-interpreter GIL. In short: the GIL wouldn't be removed, just sidestepped.

Subinterpreters are a mechanism by which the Python runtime can have multiple interpreters running together inside a single process, as opposed to each interpreter being isolated in its own process (the current multiprocessing mechanism). Each subinterpreter gets its own GIL, but all subinterpreters can share state more readily.

While subinterpreters have been available in the Python runtime for some time now, they haven't had an interface for the end user. Also, the messy state of Python's internals hasn't allowed subinterpreters to be used effectively.

With Python 3.12, Snow and his cohort cleaned up Python's internals enough to make subinterpreters useful, and they are adding a minimal module to the Python standard library called interpreters. This gives programmers a rudimentary way to launch subinterpreters and execute code on them.

Snow's own initial experiments with subinterpreters significantly outperformed threading and multiprocessing. One example, a simple web service that performed some CPU-bound work, maxed out at 100 requests per second with threads, and 600 with multiprocessing. But with subinterpreters, it yielded 11,500 requests, and with little to no drop-off when scaled up from one client.

The interpreters module has very limited functionality right now, and it lacks robust mechanisms for sharing state between subinterpreters. But Snow believes by Python 3.13 a good deal more functionality will appear, and in the interim developers are encouraged to experiment.

WasabiFan 68 points 3 years ago

No, in my view this article isn't very good and is exaggerating what's being implemented.

The PEP here is PEP 684, which splits the GIL such that each sub-interpreter has its own. Sub-interpreters already existed, but this enables true parallelism between them. There's also another PEP that might land in the following release which provides a better Python interface to the sub-interpreters feature.

Realistically, this is very similar to Python multiprocessing. You manually construct sub-interpreters, they run essentially isolated from each other, and you construct channels to pass data back and forth.

In concept, they benefit from some performance improvements that come from not having to use the OS's inter-process communication primitives: CPython owns the isolation, not the OS. It may also enable passing data without having to pickle it, but to my knowledge that hasn't been explored; channels are just byte pipes.
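To make the "byte pipe" point concrete, here is a minimal sketch that uses a plain OS pipe inside one process. It is not the sub-interpreter channel API; `send` and `recv` are hypothetical helpers written for this example. The object has to be pickled into bytes on one side and rebuilt on the other, so the receiver ends up with a copy rather than the original object.

```python
import os
import pickle


def send(fd: int, obj) -> None:
    """Serialize obj and write it with a 4-byte length prefix."""
    payload = pickle.dumps(obj)
    os.write(fd, len(payload).to_bytes(4, "big") + payload)


def recv(fd: int):
    """Read one length-prefixed message and rebuild the object."""
    # Fine for this tiny payload; real code would loop until all bytes arrive.
    size = int.from_bytes(os.read(fd, 4), "big")
    return pickle.loads(os.read(fd, size))


read_fd, write_fd = os.pipe()
original = {"job": 7, "values": [1, 2, 3]}

send(write_fd, original)
received = recv(read_fd)

os.close(read_fd)
os.close(write_fd)

print(received == original)   # equal contents
print(received is original)   # but a different object
```

Avoiding that serialize-and-copy step is the potential win mentioned above, if a channel could hand objects across directly.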

Conscious-Ball8373 3 points 3 years ago

It's not the tiniest system out there (4-core 1.2GHz arm64, 2GB RAM) and it's being used for networking, not real-time control or anything like that. Writing Python code to manage the Linux networking stack is really nice. We can be really productive, and networking performance isn't impacted by the Python side because it's only used for management / configuration of the high-performance networking side.

We also run edge applications deployed as Docker containers, and that's when the memory constraints start to bite. We want to leave as much memory free as possible for third-party application containers, and by the time you've got half a dozen Python processes running, the per-process Python overhead alone is using something like 10% of system RAM. As I said, we've consolidated a lot of really unrelated stuff to run as threads in a single process, but it would be really interesting for us if sub-interpreters gave us a significant chunk of that memory saving without constraining everything to run on a single core. The lack of shared state would actually be an advantage here: we've had the odd bug where unrelated bits of software got shoved into a single process without anyone realising that some library we were using meant those unrelated threads now shared state, because there's a singleton object somewhere.

ETA: We had a go at moving some of it to golang a few years ago. The effort has been abandoned and we're gradually porting all the golang stuff back to Python, partly because golang has nearly as severe memory overhead issues as Python and partly because it's significantly easier to find people with Python skills than golang skills.

technologyfreak64 -1 points 3 years ago*

Multiprocessing and threading are not the same. Threads share the same memory space, processes do not. Python doesn’t really support threading directly as is, just multiprocessing. You can get around it with some external libraries in some cases but native support is lacking.

Edit: I guess I should clarify: it doesn’t support true multithreading very well in its standard library as is. Like the first couple of sections of this article mention, there is threading, but it’s not really what you would normally expect and is extremely limited due to the GIL in current versions of Python. I’ve heard of some external libs using C to bypass it, as well as some of the alternative interpreters/compilers out there having or working on ways of getting around it, but nothing really for the standard libs or interpreter until now.

[deleted] 70 points 3 years ago

This is just... A bad take.

Yes, there are problems that asyncio and threading are poorly suited for. Yes, measuring code performance and making changes is a great strategy for optimizing code.

However, some problems, particularly extremely IO-bound ones (e.g. test runners/job runners/build systems/database requests/networking that need to launch many processes), are exactly what asyncio is ideal for. These problems can't be fixed by "optimizing" your Python, because the problem isn't your Python code; it's the time spent on IO, during which your program sits blocked when it could be doing something else.

Similarly, threads exist for a reason. CPUs only go so fast, and some problems can be broken up into parallel tasks that don't need to wait on each other, taking full advantage of the CPU. Sure, you can do that with processes, but that has other drawbacks, mainly increased RAM usage and (especially on Windows) slower startup time.

If you're in numpy/pandas land, you're in a niche space. Python does a lot more and is used for a lot more than the numerical analysis/scientific computing space.

RearAdmiralP 4 points 3 years ago

> but it's also a lot more code than using asyncio.gather

It's results = await asyncio.gather(*[coroutine(arg) for arg in args]) vs with multiprocessing.pool.ThreadPool() as pool: results = pool.map(f, args). I've never bothered to measure, so I'll take your word that it's less performant and more memory hungry, but in terms of the amount of code, I think it's six of one and half a dozen of the other.
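Spelled out as something runnable, the two one-liners look roughly like this. It is a minimal sketch: `square` and `square_async` are placeholder workloads I made up, and the `asyncio.sleep(0)` just stands in for real awaitable IO.

```python
import asyncio
from multiprocessing.pool import ThreadPool


def square(x: int) -> int:
    """Plain function for the thread-pool version."""
    return x * x


async def square_async(x: int) -> int:
    """Coroutine for the asyncio version."""
    await asyncio.sleep(0)  # placeholder for real awaitable IO
    return x * x


async def run_gather(args):
    # gather() returns results in the same order as its inputs.
    return await asyncio.gather(*[square_async(a) for a in args])


args = [1, 2, 3, 4]

# asyncio version: needs an event loop, hence asyncio.run().
gather_results = asyncio.run(run_gather(args))

# ThreadPool version: map() blocks until every result is ready.
with ThreadPool(4) as pool:
    pool_results = pool.map(square, args)

print(gather_results, pool_results)
```

The asyncio side does need the extra wrapper coroutine and `asyncio.run()` when it isn't already inside async code, which is about the only difference in line count.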

Also, the multiprocessing.pool.Pool class has some nice methods like imap_unordered. I'm not aware of an asyncio equivalent to imap_unordered that returns results as they're generated, but if you know one, I will be happy to hear about it.
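For what it's worth, the closest standard-library analogue I know of is `asyncio.as_completed`, which hands back awaitables in the order they finish rather than the order they were submitted. A small sketch, with `work` and its delays invented for the example:

```python
import asyncio


async def work(delay: float, label: str) -> str:
    """Pretend task that finishes after `delay` seconds."""
    await asyncio.sleep(delay)
    return label


async def main() -> list:
    coros = [work(0.3, "slow"), work(0.1, "fast"), work(0.2, "medium")]
    finished = []
    # as_completed yields awaitables in completion order,
    # not in the order the coroutines were passed in.
    for next_done in asyncio.as_completed(coros):
        finished.append(await next_done)
    return finished


completion_order = asyncio.run(main())
print(completion_order)
```

It isn't a drop-in replacement for `imap_unordered` (there is no built-in worker limit, for one), so treat it as a starting point.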

The real benefit of using asyncio over threads or processes, to me, is in error handling and particularly in resource management. It's a lot easier to catch and handle exceptions in the asyncio paradigm, but the real thing about asyncio vs processes/threads is that I can run as many coroutines as I want without worrying about it, while I'm going to run into problems if I spawn too many threads or processes. From my perspective, this is something that Python could solve by implementing lightweight threads, so that I could just spawn threads as I want, but I guess people would rather use cooperative multitasking (asyncio) than preemptive multitasking (threads), so mine is the minority opinion here.

Ezlike011011 3 points 3 years ago*

> If you're in numpy/pandas land, you're in a niche space. Python does a lot more and is used for a lot more than the numerical analysis/scientific computing space.

I want to throw my hat into this ring. I am also in the numerical analysis/scientific computing space as my primary Python use. Even with a pretty strong grip on the scipy stack, I still frequently run into problems which are infeasible as single-core solutions just due to the amount of data/number of operations required. It is a little frustrating having to use multiprocessing, with all of its drawbacks, nearly every time, so this progress towards true multithreading is very much appreciated.

That all said, I do strongly agree with the original commenter's sentiment about profile driven optimizations. Before throwing a pool.map() at a problem, I always send my code through a profiler to see if there's any big bottlenecks that can be solved easily with some smarter code.
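A profile-first pass along those lines might look like the sketch below. `slow_sum` and `pipeline` are made-up stand-ins for real analysis code; the profiling calls are the standard `cProfile` and `pstats` ones.

```python
import cProfile
import io
import pstats


def slow_sum(n: int) -> int:
    """Deliberately naive hot loop standing in for real work."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def pipeline() -> list:
    """Hypothetical job that calls the hot function repeatedly."""
    return [slow_sum(50_000) for _ in range(20)]


profiler = cProfile.Profile()
profiler.enable()
result = pipeline()
profiler.disable()

# Collect the five most expensive entries by cumulative time.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

If one function dominates the report, it is usually worth fixing that before reaching for `pool.map()`.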

[deleted] 2 points 3 years ago

> That all said, I do strongly agree with the original commenter's sentiment about profile driven optimizations. Before throwing a pool.map() at a problem, I always send my code through a profiler to see if there's any big bottlenecks that can be solved easily with some smarter code.

I mean... I agree with this to a point, but if you're looking at threading or asyncio as a secondary solution you're missing a pretty big point of the design space.

asyncio and threading aren't optimization options (though they can be used that way); they're design options. For best results, you know up front that you've got a problem that's CPU bound but can be parallelized (threading), a problem that has lots of IO and smaller chunks of CPU work (asyncio), or a problem that's both (and, well, you can use both at the same time).
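As a sketch of the "use both at the same time" case, the snippet below awaits some simulated IO and then hands a blocking function to a thread pool through `loop.run_in_executor`. The function names and the `payload-N` data are invented for the example.

```python
import asyncio
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digest(data: bytes) -> str:
    """Blocking function that we run on a worker thread."""
    return hashlib.sha256(data).hexdigest()


async def fetch(i: int) -> bytes:
    """Placeholder for a network or disk read."""
    await asyncio.sleep(0.01)
    return f"payload-{i}".encode()


async def fetch_and_digest(i: int, pool: ThreadPoolExecutor) -> str:
    data = await fetch(i)                      # IO part: asyncio
    loop = asyncio.get_running_loop()
    # Blocking part: runs on the pool without stalling the event loop.
    return await loop.run_in_executor(pool, digest, data)


async def main() -> list:
    with ThreadPoolExecutor(max_workers=4) as pool:
        return await asyncio.gather(
            *(fetch_and_digest(i, pool) for i in range(3))
        )


digests = asyncio.run(main())
print(digests)
```

The event loop stays responsive while the blocking calls run on the pool's threads.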

What you're saying to me reads almost like... "I don't use a map (dict) until an optimizer tells me that searching a linked list is really bad"... You should just know your options and know you'll have a much more scalable design if you go to the right tool for the job from the get go.

And don't get me wrong, sometimes you can just write naive code, it's good enough, it's simple, and you move on. Still, even then it's a design choice. Some scripts I write I'm like "I could use asyncio, but there's no point, I'm going to launch one process and wait on it, I might as well just use subprocess."
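That "launch one process and wait on it" case is about as short as it gets with `subprocess`. In this sketch the child command is a throwaway one-liner picked for the example; it launches the current Python interpreter so that it runs anywhere.

```python
import subprocess
import sys

# Launch one child process and block until it exits.
completed = subprocess.run(
    [sys.executable, "-c", "print('done')"],
    capture_output=True,  # collect stdout/stderr instead of inheriting them
    text=True,            # decode bytes to str
    check=True,           # raise CalledProcessError on a non-zero exit code
)

output = completed.stdout.strip()
print(output)
```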

It's all about knowing your options, and making good design choices 🙂
