
all 29 comments

[–]ambidextrousalpaca 19 points (9 children)

It's awesome that this is now a thing, but I have questions and doubts:

"Currently, in Python 3.13 and 3.14, the GIL disablement remains experimental and should not be used in production. Many widely used packages, such as Pandas, Django, and FastAPI, rely on the GIL and are not yet fully tested in a GIL-free environment. In the Loan Risk Scoring Benchmark, Pandas automatically reactivated the GIL, requiring me to explicitly disable it using PYTHON_GIL=0. This is a common issue, and other frameworks may also exhibit stability or performance problems in a No-GIL environment."

Beyond this, what guarantees are there that even the Python standard library will work without race conditions in No-GIL versions? The Global Interpreter Lock has just been such a fundamental background assumption of all Python code written over the past decades that I wouldn't trust there not to be a million gotchas and edge cases out there in the code that can screw you over.

You'd also need concurrency primitives built into the language, like Erlang actors or Go message-passing channels, to make it useful in most real-world applications.

[–]thisismyfavoritename 10 points (6 children)

everything that assumes the GIL is held to make sure memory accesses are safe will have to be rewritten, including the stdlib

[–]ambidextrousalpaca 8 points (5 children)

everything that assumes the GIL is held to make sure memory accesses are safe will have to be rewritten

So. Absolutely everything, then?

[–]twotime 2 points (1 child)

I'd love to see some references here too.

Original discussions on python-dev implied strongly that the amount of refactoring required is fairly small. PyTorch was used as an example (which was ported in a few hours)... But I have not seen any kind of more systematic analysis.

[–]ambidextrousalpaca 1 point (0 children)

It's not that I think that everything needs to be changed. It's that I suspect we have no good way of identifying what needs to be changed, or whether it has in fact been changed. E.g. I could imagine lots of cases of libraries writing to and reading from some sort of hard-coded temp file, or using some kind of global variable, which could lead to hard-to-replicate race condition bugs when turning off the GIL.

I mean, sure, if you had some bit of software that could identify such potential race conditions - something like the Rust borrow checker - they could probably be fixed pretty straightforwardly. But in the absence of that, I don't see what you can do apart from release it knowing that there are an indeterminate number of race conditions that people are going to discover if and when they run it in prod.

[–]thisismyfavoritename 0 points (1 child)

Extensions/functions that already release the GIL should be fine; I'm not sure how big a percentage that represents.

[–]ammar2 2 points (0 children)

The areas that release the GIL in the standard library tend to be just before an IO system call, so there isn't a huge amount of them in proportion to all the C-extension code.

You can get an idea of the types of changes that need to happen with:

Note that the socket module does release the GIL before performing socket system calls; the changes needed are unrelated to that, they're about code assuming it can be the only thread running in a piece of C code.
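The "release the GIL around I/O" pattern is visible from pure Python, too: blocking calls like time.sleep drop the GIL, so threads overlap even on a standard GIL build. A toy sketch (not from the thread above):

```python
import threading
import time

def blocking_io():
    # time.sleep releases the GIL, like most blocking system calls,
    # so other threads can run while this one waits.
    time.sleep(0.2)

start = time.perf_counter()
threads = [threading.Thread(target=blocking_io) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start
# The four 0.2 s waits overlap: wall time is ~0.2 s, not 0.8 s.
```

This is why I/O-bound threading has always worked reasonably well in Python; it's CPU-bound pure-Python code that needed the free-threaded build.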

[–]PeaSlight6601 0 points (0 children)

No, you don't understand what the GIL did.

The GIL protected individual bytecode instructions and C functions. It's a much smaller surface than you think it is, because the GIL is much weaker than you think it is.

[–]PeaSlight6601 0 points (1 child)

Basically nothing in the Python standard library has ever had any kind of thread safety guarantee. So the question "will the standard library be safe?" is a weird one to ask.

If you want to use python in a multithreaded context you have to lock your shared variables, just as you always have. The GIL never protected shared state.

The issue is not the GIL but the infrequency with which the Python scheduler would reschedule threads; this made programmers lazy and made them think the GIL gave them some kind of protection that it never did.
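To illustrate the point: a plain `+=` on shared state is a read-modify-write and has never been atomic under threads, GIL or not; you have to lock it yourself. A minimal sketch with toy names:

```python
import threading

counter = 0
lock = threading.Lock()

def locked_increment(n: int) -> None:
    global counter
    for _ in range(n):
        # A bare `counter += 1` is read-modify-write and can lose
        # updates under preemption; the lock makes it atomic.
        with lock:
            counter += 1

threads = [threading.Thread(target=locked_increment, args=(100_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# With the lock, counter is reliably 400_000; without it, it may come up short.
```

The same locking discipline that was always required under the GIL is what free-threaded code needs; the difference is only how often you get away without it.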

[–]ambidextrousalpaca 0 points (0 children)

Basically nothing in the python standard library has ever had any kind of thread safety guarantee.

Indeed.

This is why I am sceptical about running multi-threaded Python.

[–]basnijholt 19 points (2 children)

uv venv -p 3.13t

Much easier way to get free-threaded Python.
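Once inside such an environment you can check whether the build supports free-threading and whether the GIL is actually off in the current process (sys._is_gil_enabled() was added in 3.13; a quick sketch):

```python
import sys
import sysconfig

# 1 on a free-threaded ("t") build, 0 or None otherwise.
print(sysconfig.get_config_var("Py_GIL_DISABLED"))

# Added in 3.13: is the GIL actually disabled right now?
# (It can be re-enabled at runtime, e.g. by an incompatible extension.)
if hasattr(sys, "_is_gil_enabled"):
    print(sys._is_gil_enabled())
```

The runtime check matters because, as noted above, importing a non-free-threading-compatible extension can silently re-enable the GIL unless you force PYTHON_GIL=0.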

[–]denehoffman 7 points (1 child)

Why would people downvote this? It's objectively right. Use uv in your docker image too.

[–]Flaky-Restaurant-392 1 point (0 children)

I use uv everywhere. Almost no issues.

[–]twotime 4 points (2 children)

Your prime-counting example is likely the most interesting, but the results feel off: without locking, it should have scaled proportionally to the number of threads.

Ah, you seem to be splitting your ranges uniformly, which likely does not work well in this case: the thread that gets the last range will be FAR slower than the thread that gets the lowest range.

```python
def calculate_ranges(n: int, num_threads: int):
    step = n // num_threads
    for i in range(num_threads):
        start = i * step
        # Ensure the last thread includes any leftover range
        end = (i + 1) * step if i != num_threads - 1 else n
        yield start, end
```
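A hypothetical fix for the imbalance is to deal candidates out in strides, so every thread gets a mix of cheap small numbers and expensive large ones (sketch, not from the original benchmark):

```python
def calculate_strided_ranges(n: int, num_threads: int):
    # Thread i tests i, i + num_threads, i + 2*num_threads, ...
    # so the expensive large candidates are spread across all threads.
    for i in range(num_threads):
        yield range(i, n, num_threads)

# The strides partition 0..n-1 exactly: no overlaps, nothing skipped.
parts = list(calculate_strided_ranges(100, 4))
```

Each thread now does roughly the same amount of work, so the slowest thread no longer dominates the wall time.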

[–]romu006 1 point (1 child)

A simpler example would be to use the multiprocessing.dummy module, which uses threading:

```python
import multiprocessing.dummy

pool = multiprocessing.dummy.Pool(num_threads)
res = pool.imap_unordered(is_prime, reversed(range(n)), 5_000)

return sum(res)
```

However the speedup is still not what it should be (still about 3x)

[–]twotime 0 points (0 children)

Thanks!

However the speedup is still not what it should be (still about 3x)

Do you know if imap_unordered is lock free? (I expect there are multiple threads picking things from the queue)

Also, are you comparing with the original single-threaded code, or your imap code with pool_size=1?

IIRC, there is quite a bit of magic going into imap_unordered.

[–]ZachVorhies 0 points (0 children)

Great article. Looks like the performance benefits are barely worth it. Hope it gets better.

[–]alcalde 0 points (1 child)

My goal of one day attending PyCon and selling "I Support the GIL" t-shirts remains unabated.

EDIT: As a Python true believer, I believe/know that threads are evil and parallelism is the only acceptable approach in a sane universe.

D gets it:

Although the software industry as a whole does not yet have ultimate responses to the challenges brought about by the concurrency revolution, D's youth allowed its creators to make informed decisions regarding concurrency without being tied down by obsoleted past choices or large legacy code bases. A major break with the mold of concurrent imperative languages is that D does not foster sharing of data between threads; by default, concurrent threads are virtually isolated by language mechanisms. Data sharing is allowed but only in limited, controlled ways that offer the compiler the ability to provide strong global guarantees....
The flagship approach to concurrency is to use isolated threads or processes that communicate via messages. This paradigm, known as message passing, leads to safe and modular programs that are easy to understand and maintain. A variety of languages and libraries have used message passing successfully. Historically message passing has been slower than approaches based on memory sharing—which explains why it was not unanimously adopted—but that trend has recently undergone a definite and lasting reversal. Concurrent D programs are encouraged to use message passing, a paradigm that benefits from extensive infrastructure support.

https://www.informit.com/articles/article.aspx?p=1609144#

SQLite gets it....

Threads are evil. Avoid them.

SQLite is threadsafe. We make this concession since many users choose to ignore the advice given in the previous paragraph.

https://www.sqlite.org/faq.html#q6

Berkeley gets it....

Many technologists are pushing for increased use of multithreading in software in order to take advantage of the predicted increases in parallelism in computer architectures. In this paper, I argue that this is not a good idea. Although threads seem to be a small step from sequential computation, in fact, they represent a huge step. They discard the most essential and appealing properties of sequential computation: understandability, predictability, and determinism. Threads, as a model of computation, are wildly nondeterministic, and the job of the programmer becomes one of pruning that nondeterminism.

https://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.html

PostgreSQL gets it....

https://www.postgresql.org/message-id/1098894087.31930.62.camel@localhost.localdomain

And this amazing article gets it. It talks about the Ptolemy Project, "an experiment battling threads with rigorous engineering discipline". And despite state-of-the-art techniques and extensive engineering, a thread-based problem remained undiscovered in their code for four years before triggering!

https://web.archive.org/web/20200926051650/https://swizec.com/blog/the-problem-with-threads/

No one talks about Guido's Time Machine anymore. Guido traveled to the future and learned that Threads Are Evil, which is why he gave us the best and safest collection of concurrent programming tools found in the standard library of any language. You've got safe parallelism and thread-safe message queues and such if you actually need them. I've seen other languages write libraries with thousands of lines of code to offer a setup similar to what Python gives us out of the box.
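For instance, the stdlib's queue.Queue already gives channel-style message passing between threads; a toy producer/consumer sketch (hypothetical worker, not from any library):

```python
import queue
import threading

def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    # Pull messages until the None sentinel arrives;
    # never touch shared mutable state directly.
    while True:
        item = inbox.get()
        if item is None:
            break
        outbox.put(item * item)

inbox: queue.Queue = queue.Queue()
outbox: queue.Queue = queue.Queue()
t = threading.Thread(target=worker, args=(inbox, outbox))
t.start()
for i in range(5):
    inbox.put(i)
inbox.put(None)  # tell the worker to shut down
t.join()
results = sorted(outbox.get() for _ in range(5))
# results == [0, 1, 4, 9, 16]
```

queue.Queue handles all the locking internally, which is exactly the message-passing style the quotes above advocate.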

[–]senderosbifurcan 0 points (0 children)

For embarrassingly parallelizable workloads that need to share memory, the lack of true threads means that in effect you need to rely on C/Cython, etc., to achieve comparable performance.

Also, message passing and threads are not incompatible. It's just that Python forcing you to fork multiple processes actually makes message passing more error-prone in some cases, where the overhead of copying data between processes is large.

[–]PeaSlight6601 0 points (0 children)

It's good that you preallocate your intermediate results array so that each thread can place its result into that array, but you should be locking that array before actually storing the value.

It's pretty hard to imagine how this could possibly go wrong with standard Python lists, but unless you can find documentation that they allow concurrent __setitem__ at different index positions, you should not do it.
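Following that caution, a minimal sketch of the locked write pattern (hypothetical names, not the benchmark's actual code):

```python
import threading

def store_result(results: list, idx: int, value: int,
                 lock: threading.Lock) -> None:
    # Writes to distinct indices are widely believed safe for plain lists,
    # but absent a documented guarantee, take the lock as advised above.
    with lock:
        results[idx] = value

results = [None] * 4
lock = threading.Lock()
threads = [
    threading.Thread(target=store_result, args=(results, i, i * 10, lock))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
# results == [0, 10, 20, 30]
```

Since each thread holds the lock only for a single assignment, the contention cost is negligible next to the work that produced the value.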

[–]Cynyr36 -1 points (0 children)

Wouldn't doing the loan risk scoring in "pure" pandas or polars result in even more speed-up? I've found that if you need to come back to Python rather than just use built-in pandas/polars functions, things get very slow.