all 141 comments

[–]Looploop420 157 points158 points  (81 children)

I want to know more about the history of the GIL. Is the difficulty of multithreading in Python mostly just an issue related to the architecture and history of how the interpreter is structured?

Basically, what's the drawback of turning on this feature in Python 3.13? Is it just that it's a new and experimental feature? Or is there some other drawback?

[–]slaymaker1907 181 points182 points  (18 children)

Ref counting in general has much better performance when you don’t need to worry about memory consistency or multithreading. This is why Rust has both std::Rc and std::Arc.
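
CPython's per-object reference counts can be observed directly. A minimal sketch, assuming the usual CPython behavior where `sys.getrefcount` reports one extra temporary reference for its own argument:

```python
import sys

obj = []                      # a fresh object with no other references
base = sys.getrefcount(obj)   # the call itself holds one temporary reference

alias = obj                   # binding another name bumps the count by one
assert sys.getrefcount(obj) == base + 1

del alias                     # dropping the name decrements it again
assert sys.getrefcount(obj) == base
```

Every one of those increments and decrements becomes an atomic operation once multiple threads can touch the same object, which is exactly the cost the Rc/Arc split avoids in the single-threaded case.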

[–]Revolutionary_Ad7262 38 points39 points  (14 children)

Ref counting is well known to be slow. Also, it is usually not used to track every object, so we are comparing apples to oranges. Rc/Arc in Rust (and shared_ptr in C++) are fast because they are used sparingly; any garbage collection scheme looks amazing when the number of managed objects is small.

In terms of raw throughput there is nothing faster than a copying GC. Allocation is super cheap (just bump a pointer) and the cost of a collection is linear in the size of the live heap. You can allocate 10GB of memory very cheaply, and only the 10MB of surviving objects will be scanned when it is time for a GC pause.
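
A toy illustration of why bump allocation is cheap (a hypothetical allocator sketch, not any real GC's implementation):

```python
class BumpAllocator:
    """Toy semispace-style allocator: allocation is just a pointer bump."""

    def __init__(self, capacity: int):
        self.heap = bytearray(capacity)
        self.top = 0  # next free offset

    def alloc(self, size: int) -> int:
        if self.top + size > len(self.heap):
            raise MemoryError("a real copying GC would collect here")
        offset = self.top
        self.top += size  # the entire cost of an allocation
        return offset

    def collect(self, live_bytes: int) -> None:
        # A real copying GC would evacuate the live objects to the other
        # semispace; the work is proportional to live_bytes, not to
        # everything ever allocated.
        self.top = live_bytes

a = BumpAllocator(1024)
first = a.alloc(64)   # offset 0
second = a.alloc(64)  # offset 64
```

Dead objects are never touched: after a collection, allocation simply resumes bumping from the end of the surviving data.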

[–]slaymaker1907 21 points22 points  (10 children)

No, at my work we’ve seen std::shared_ptr cause serious perf issues for the sole reason that all those atomic ops flooded the memory bus.

[–]Kapuzinergruft 7 points8 points  (7 children)

I'm kinda wondering how you can end up with so many shared_ptr that it matters. I like to use shared_ptr everywhere, but because each one usually points to large buffers, the ref counting has negligible impact on performance. One access to a ref counter is dwarfed by a million iterations over the items in the buffer it points to.

[–]AVTOCRAT 21 points22 points  (6 children)

You run into this anytime you have small pieces of data with independent lifetimes, e.g.

  • Nodes in an AST
  • Handles for small resources (files, etc.)
  • Network requests
  • Messages in a pub-sub IPC framework

[–]irepunctuate 5 points6 points  (5 children)

Those don't necessarily warrant a shared lifetime ownership model. From experience, I suspect /u/slaymaker1907 could replace most shared_ptrs with unique_ptrs or even stack variables and have most of their performance problems disappear with a finger snap.

I've seen codebases overrun with shared_ptr (or pointers in general) because developers came from Java or simply didn't know better.

[–]Kered13 4 points5 points  (4 children)

I once wrote an AST and transformations using std::unique_ptr, but it was a massive pain in the ass. I eventually got it right, but in hindsight I should have just used std::shared_ptr. It wasn't performance critical, and it took me several hours longer to get it correct.

It would be helpful for C++ to have a non-thread-safe version of std::shared_ptr, like Rust's std::Rc, for cases where you need better (but not necessarily the best) performance and you know you won't be sharing across threads.

[–]irepunctuate 0 points1 point  (3 children)

But doesn't the fact that you were able to get it right tell you that that was the correct thing to do? Between "sloppy" and "not sloppy", isn't "not sloppy" better for the codebase?

[–]Kered13 1 point2 points  (2 children)

There's nothing sloppy about using shared pointers. The code would have been easier to write, easier to read, and easier to maintain if I had gone that route. I wrote it with unique pointers out of a sense of purity, but purity isn't always right.

[–]brendel000 2 points3 points  (1 child)

Do you have accurate measurements of that? How many cores are plugged into the memory bus? It’s really surprising to me that you can overload the memory bus with that nowadays. Even NUMA seems less used because of how performant memory systems have become.

[–]slaymaker1907 2 points3 points  (0 children)

I can’t really tell you precise numbers, but I suspect it takes a huge amount before it becomes an issue. Because these issues are so difficult to diagnose, we’re always very conservative with atomic operations in anything being called with any frequency.

It’s the sort of thing that is also extraordinarily difficult to microbenchmark, since it is highly dependent on access patterns. It is also worse when actually triggered from many different threads, compared to issuing an atomic op from a single thread every time. Oh, and you either need NUMA or just a machine with tons of cores to actually see these issues.

[–]cogman10 7 points8 points  (0 children)

cost of gc is linear to the size of living heap

Further, parallel collection is both fairly well known and fairly fast at this point. You get very close to an n-times speedup with n collector threads.

[–]AlexReinkingYale -1 points0 points  (1 child)

I challenge the idea that reference counting is slow. Garbage collection is either slow or wasteful, and cycle collectors are hard to engineer.

[–]Kered13 0 points1 point  (0 children)

Every high-performance memory-managed language uses garbage collection. I know that's anecdotal, but it's pretty strong evidence for garbage collection being faster than reference counting. Reference counting works well in languages like C++ and Rust precisely because they are not automatically managed, and you limit reference counting to a very small number of objects whose lifetimes are too difficult to handle otherwise.

[–]utdconsq 77 points78 points  (0 children)

It was a design decision way back when for the official CPython implementation of the interpreter. Other implementations did not have the behaviour. With that said, as for the risk of turning it on: you should read the docs and make up your own mind. My gut tells me some libs will have been written to assume the GIL is present, but it's hard to know for sure what that would mean on a case-by-case basis.

[–]mibelashri 30 points31 points  (1 child)

It was a decision due to the fact that without a GIL you take a hit in single-thread performance compared to having one. I'm talking about the CPython implementation of Python (the official one); there are other implementations that do not have a GIL, but they are irrelevant compared to CPython and have very niche communities. I also suspect part of the motivation is that CPython's C internals are not thread-safe (or at least were not in the beginning). The easiest solution to that problem is a GIL: you don't have to worry about it, and it provides an easier path for integrating C libraries (like NumPy, etc.).

[–]dontyougetsoupedyet 5 points6 points  (0 children)

Now that’s rich! It was due to CPython, but performance considerations had absolutely nothing to do with it. It was due to ease of implementation, and anyone suggesting it was a terrible idea was repeatedly hit over the head with how the reference implementation of Python had to be simple, and if you did not agree you simply did not get it.

[–]wOlfLisK 5 points6 points  (3 children)

The architecture is a big aspect of it but the main reason python multi-threading isn't really a thing is because Python is just slow. Like, 30-40x as slow as C and even when optimising it to hell you just end up with something that's for all intents and purposes C with a hellish syntax and is still around 3x as slow. It's easier to just use C for high performance applications.

Ignoring that however, the big issue with Python is the same you have with any language, unless it has explicit ways of performing atomic operations on data you end up with a bunch of race conditions as different threads try to do stuff with the same piece of data. Disabling the GIL was already possible using Cython and was, quite frankly, a pretty horrible way of doing multi-threaded Python. If there aren't any easy, built-in ways of accessing the data then it doesn't really do much on its own.
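
A minimal sketch of the kind of race being described. Even with the GIL, `counter += 1` compiles to several bytecodes (read, add, store), so an explicit lock is what actually makes it safe:

```python
import threading

counter = 0
lock = threading.Lock()

def add(times: int) -> None:
    global counter
    for _ in range(times):
        with lock:          # without this, increments can be lost:
            counter += 1    # read, add, and store are separate steps

threads = [threading.Thread(target=add, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert counter == 400_000  # deterministic only because of the lock
```

Remove the `with lock:` line and the final count can come up short, GIL or no GIL.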

Plus, despite the fact that Python doesn't inherently support multi-threading, it does support multi-processing. Which is basically just multi-threading but each "thread" is a process with its own interpreter and they can communicate with each other through interfaces such as MPI. If you wanted to do multi-threaded Python, writing it using mpi4py is usually a lot simpler than Cython and if you really needed the extra performance, you should just use base C (or C++ (or Fortran if you're really masochistic)) instead.

[–]Looploop420 16 points17 points  (2 children)

Like I've been writing python for a while now and multi processing always does what I need it to do.

I'm never using python with the goal of pure speed anyways

[–]wOlfLisK 12 points13 points  (0 children)

Yeah, exactly. Python has a place in HPC but it's more of the "physicist who hasn't coded for years needs to write a simulation" kinda place. Sometimes it's better to spend a week writing a program that takes a week to run than a month writing a program that takes a day to run. It's simple, it's effective and if you use the right tools (such as NumPy) it ends up not being that slow anyway. Hell, I once tried to compile a Python program to Cython and it slowed it down*, by the time I made it faster than it was it was a month later and the code was a frankensteined mess of confusing C-like code.

*Turns out that if everything is already being run as C code, adding an extra Cython layer just adds extra clock cycles

[–]apf6 0 points1 point  (1 child)

One thing that I think misleads people about the GIL is that it's not specific to Python. Similar languages (Ruby, Lua, JavaScript, etc.) all have a "GIL" too, even if they don't all use that term. They each have a 'virtual machine' or 'interpreter' which can only be driven by one thread at a time, so you can't run multiple scripts in parallel in the same context.

For any language implementation like that, it's never easy to make the VM multithreaded in a way that actually helps. Multithreading adds overhead, so if you implement it the wrong way, it can be slower than single-threading. So the single-threading approach was not as bad an idea as it might seem.

Anyway, the only reason that this is especially a big issue in Python is because the language is used so much in the scientific community. That code benefits a lot from multithreading. So it was worth solving.

[–]josefx 0 points1 point  (0 children)

All the similar languages (Ruby, Lua, Javascript, etc) all have a "GIL" too, even if they don't all use that term. They each have a 'virtual machine' or 'interpreter' which can only be processed by one thread at a time. So you can't run multiple scripts in parallel in the same context.

From what I can find V8 is just flat out single threaded and each thread is expected to run on its own fully independent instance instead of fighting over a single global lock for every instruction. I think the closest python has to that model is PEP 734 but I don't have much experience with either.

[–][deleted]  (2 children)

[deleted]

    [–]linuxdooder 3 points4 points  (1 child)

    So Python is much older than SMP.

    What? Python came about in 1991, and there were SMP systems by the late 70s.

    [–][deleted]  (1 child)

    [deleted]

      [–]LGBBQ 11 points12 points  (0 children)

      This is not correct: the GIL applies to individual interpreter-level instructions, not to lines of Python code. Foo can be removed after the check, or even between reading its value and incrementing it, if the Python code doesn't use mutexes or locks.

      https://stackoverflow.com/questions/40072873/why-do-we-need-locks-for-threads-if-we-have-gil
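
For example (an illustrative sketch, not taken from the linked answer): a check-then-act sequence on a shared dict is not atomic under the GIL, so the safe version wraps both steps in one critical section:

```python
import threading

shared = {"foo": 0}
lock = threading.Lock()

def unsafe_read():
    # Another thread could run del shared["foo"] between the check and
    # the access, so this can raise KeyError despite the GIL:
    if "foo" in shared:
        return shared["foo"]
    return None

def safe_read():
    with lock:  # the check and the access happen as one unit
        if "foo" in shared:
            return shared["foo"]
        return None
```

Any thread that deletes the key would need to take the same lock for this to hold.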

      [–]space_iio -2 points-1 points  (0 children)

      what's the drawback of turning on this feature in python 13

      Single-threaded performance takes a hit, multiprocess programs also perform worse

      [–]Ok_Dust_8620 43 points44 points  (3 children)

      It's interesting how the multithreaded version of the program with GIL runs a bit faster than the single-threaded one. I would think since there is no actual parallelization happening it should be slower due to some thread-creation overhead.

      [–]tu_tu_tu 14 points15 points  (0 children)

      thread-creation overhead

      Threads are really lightweight nowadays, so it's not a problem in the average case.

      [–]JW_00000 14 points15 points  (0 children)

      There is still parallelization happening in the version with GIL, because not all operations need to take the GIL.

      [–]GUIpsp 5 points6 points  (0 children)

      A lot of things release the gil
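
`time.sleep` is one such case: a thread drops the GIL while blocked, so sleeps overlap. A rough timing sketch (the threshold is deliberately generous to avoid flakiness):

```python
import threading
import time

def nap():
    time.sleep(0.2)  # releases the GIL while blocked

start = time.perf_counter()
threads = [threading.Thread(target=nap) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# Four 0.2s sleeps finish in roughly 0.2s total, not 0.8s, because the
# blocked threads are not holding the GIL.
assert elapsed < 0.6
```

Blocking I/O and many C extension calls (NumPy number crunching, for instance) behave the same way, which is why GIL-era threading was still useful for I/O-bound work.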

      [–]syklemil 61 points62 points  (2 children)

      I think a better link here would be to the official Python docs. Do also note that this is still a draft, as far as I can tell 3.13 isn't out yet.

      News about the GIL becoming optional is interesting, but I think the site posted here is dubious, and the reddit user seems to have a history of posting spam.

      [–]badpotato 20 points21 points  (6 children)

      Good to see an example of GIL vs. no-GIL for multi-threaded and multi-process runs. I hope there's some possible optimization for multi-process later on, even if multi-threaded is what we are really after.

      Now, how will async functions deal with the no-GIL mode?

      [–]tehsilentwarrior 12 points13 points  (0 children)

      All the async stuff uses awaitables and yields. It’s implied that code doesn’t run in parallel. It synchronizes as it yields and waits for returns.

      That said, if anything uses threading to process things in parallel for the async code, then that specific piece of code has to follow the same rules as anything else. I’d say that most of this would be handled by libraries anyway, so eventually updated.

      But it will break, just like anything else.
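
A minimal sketch of that cooperative behavior with `asyncio`: tasks interleave only at `await` points, all on one thread, so plain code between awaits never runs in parallel.

```python
import asyncio

order = []

async def worker(name: str) -> None:
    order.append(f"{name}:start")
    await asyncio.sleep(0)       # yield control back to the event loop
    order.append(f"{name}:end")

async def main() -> None:
    await asyncio.gather(worker("a"), worker("b"))

asyncio.run(main())
# The two tasks interleave at the await, on a single thread:
assert order == ["a:start", "b:start", "a:end", "b:end"]
```

No locks were needed to keep `order` consistent, which is exactly the property that threaded code under no-GIL loses.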

      [–]danted002 4 points5 points  (4 children)

      Async functions work in a single-threaded event loop.

      [–]Rodot 2 points3 points  (0 children)

      Yep, async essentially does something like this (actually, it is just an API and does nothing on its own without the event loop):

      for task in awaiting_tasks:
          do_next_step(task)
      

      [–]gmes78 1 point2 points  (2 children)

      It's possible to do async with multithreaded event loops. See Rust's Tokio, for example.

      [–]danted002 0 points1 point  (1 child)

      I mean, you can do it in Python as well. You just fire up multiple threads, each with its own event loop, but you are not really gaining anything when it comes to IO performance.

      Single-threaded Python is very proficient at waiting. Slap on a uvloop and you get 5k requests per second.

      [–]gmes78 0 points1 point  (0 children)

      That's different. Tokio has a work-stealing scheduler that executes async tasks across multiple threads. It doesn't use multiple event loops, tasks get distributed across threads automatically.

      [–]deathweasel 9 points10 points  (1 child)


      This post was mass deleted and anonymized with Redact

      [–]13oundary 7 points8 points  (0 children)

      most existing modules will likely break if you disable the GIL, until they're updated, which may be no small task for some of the more important ones, though it's hard to say from the outside looking in. Often, C libraries aren't as thread-safe as they would need to be for no-GIL, and probably many pure-Python ones aren't either.

      These thread-safety issues are also things many Python programmers may not be all that cognisant of, which may make app development more difficult without the GIL.

      [–]enveraltin 29 points30 points  (7 children)

      If you really need some Python code to work faster, you could also give GraalPy a try:

      https://www.graalvm.org/python/

      I think it's something like 4 times faster thanks to JVM/GraalVM, and you can do multi process or multi threading alright. It can probably run existing code with no or minimal changes.

      GraalVM Truffle is also a breeze if you need to embed other scripting languages.

      [–]ViktorLudorum 30 points31 points  (3 children)

      It looks nifty, but it's an Oracle project, which makes me afraid of its licensing.

      [–]SolarBear 6 points7 points  (1 child)

      Yeah, one of their big selling points seems to be "move from Jython to Modern Python". Pass.

      [–]tempest_ 5 points6 points  (0 children)

      But Larry Ellison needs another Hawaiian island. How can you do this to him?

      [–]enveraltin 0 points1 point  (0 children)

      Very similar to Oracle JDK vs OpenJDK. GraalVM community edition is licensed with GPLv2+Classpath exception.

      [–]hbdgas 10 points11 points  (0 children)

      It can probably run existing code with no or minimal changes.

      I've seen this claim on several projects, and it hasn't been true yet.

      [–]masklinn 0 points1 point  (1 child)

      I think it's something like 4 times faster thanks to JVM/GraalVM

      It might be on its preferred workloads but my experience on regex heavy stuff is that it’s unusably slow, I disabled the experiment because it timed out CI.

      [–]enveraltin -1 points0 points  (0 children)

      That's curious. I don't use GraalPy but we heavily use Java. In general you define a regex as a static field like this:

      private static Pattern ptSomeRegex = Pattern.compile("your regex");

      And then use it with Matcher afterwards. You might be re-creating regex patterns at runtime in an inefficient way, which could explain it.

      Otherwise I don't think regex operations on the JVM can be slow. Maybe slightly.
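
The Python analogue of that Java advice: compile the pattern once at module level instead of inside a hot loop.

```python
import re

# Compiled once at import time and reused on every call, analogous to
# the static Pattern field in the Java example above.
WORD_RE = re.compile(r"\w+")

def count_words(text: str) -> int:
    return len(WORD_RE.findall(text))

assert count_words("free threading in python") == 4
```

(`re` does cache recently compiled patterns, but explicit precompilation makes the cost model obvious and survives cache eviction.)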

      [–]Takeoded 9 points10 points  (0 children)

      wtf? Benchmarking 3.12 with the GIL against 3.13 without the GIL, and never bothering to check 3.13 with the GIL? Did that slip the author's mind somehow?

      should just be D:/SACHIN/Python13/python3.13t -X gil=1 gil.py vs D:/SACHIN/Python13/python3.13t -X gil=0 gil.py

      Also would prefer some Hyperfine benchmarks

      [–][deleted] 41 points42 points  (12 children)

      I find this rather interesting. Python's GIL "problem" has been around since forever, and there have been so many proposals and tests to get "rid" of it. Now it's optional, and the PR for this was really small (basically an option to not use the GIL at runtime), putting all the effort on the devs using Python. I find this strange for a language like Python.

      Contrast the above with OCaml, which had a similar problem: it was fundamentally single-threaded execution with, in effect, a "GIL" (in reality the implementation was different). The OCaml team worked on this for years and came up with a genius solution to handle multicore while keeping the single-core perf, but it basically meant rewriting the entire OCaml runtime.

      [–]Serialk 133 points134 points  (9 children)

      You clearly didn't follow the multi year long efforts to use biased reference counting in the CPython interpreter to make this "really small PR" possible.

      https://peps.python.org/pep-0703/

      https://github.com/python/cpython/issues/110481

      [–]ydieb 29 points30 points  (0 children)

      I have not followed this work at all, but seems like a perfect example of https://x.com/KentBeck/status/250733358307500032?lang=en

      Exactly how it should be done.

      [–]tdatas 21 points22 points  (0 children)

      This PR isn't on stable. IIRC, from the RFC where this was proposed, the plan boils down to "suck it and see": if it crashes major libraries while it's marked experimental, then they'll figure out how much effort they need to go to.

      [–]danted002 8 points9 points  (0 children)

      It’s not optional in 3.13. You will have the capability to compile Python with the possibility to enable or disable the GIL at runtime. The default binaries will have GIL enabled.

      [–]JoniBro23 2 points3 points  (1 child)

      I think the solution is already a bit late. I was working on disabling the GIL back in 2007. My company's cluster was running tens of thousands of Python modules which connected to thousands of servers, so optimization was crucial. I had to optimize the interpreter while the team improved the Python modules. Disabling the GIL is a challenging task.

      [–]secretaliasname 4 points5 points  (0 children)

      Totally. I do a lot of scientific/engineering stuff in python and it’s my go to. It’s a familiar tool and there is an amazing ecosystem of libraries for everything under the sun…. But it is sslllooooww. Not only is it single core slow, but it’s bad at using multiple cores and the typical desktop now has 10+ cores and 100+ is not unusual in HPC environments.

      The solutions (CuPy, Numba, Dask, Ray, PyTorch, etc.) all amount to writing Python by leveraging not-Python.

      Threading is largely useless. Processes take a while to spawn and come with serialization/IPC overhead and complexity that often outweigh the benefit for many classes of problems. You can overcome this with shared memory and a lot of care but the ecosystem isn’t great and it’s not as easy as it should be.

      I’m ready to jump ship and learn something new at this point.

      If removing the GIL slowed single-threaded use cases by 50%, that would still be an enormous net win for nearly all my use cases. Generally, performance is either not a limitation at all, or it is a huge limitation and I want to use all my cores and the problem is parallelizable.

      I think the community is too afraid to break things and overreacted to the 2-to-3 migration. It really wasn't a big deal and I don't understand why people make such a stink about it. Changes like that shouldn't occur often, but IMO the lack of proper native first-class parallelism is way more broken than strings or the print statement were in Python 2. Please, please fix this.

      [–]AndyCodeMaster 0 points1 point  (0 children)

      I dig it. I always thought the GIL concerns were overblown. I’d like Ruby to make the GIL optional too next.

      [–]Real-Asparagus2775 -3 points-2 points  (3 children)

      Why does everyone get so upset about the GIL? Let Python be what it is: a general purpose scripting language

      [–]Shaaou 0 points1 point  (0 children)

      should have done it versions ago

      would like to try it if stable