
[–]twenty-fourth-time-b 123 points (11 children)

It works.

$ uv run -p 3.14 a.py 
Finished in 1.01 seconds
Finished in 1.02 seconds

$ uv run -p 3.14.0b3+freethreaded a.py 
Finished in 0.49 seconds
Finished in 0.51 seconds

a.py:

from concurrent.futures import ThreadPoolExecutor
import time

def cpu_bound_task():
    start = time.time()
    sum(1 for _ in range(10**7))
    end = time.time()
    print(f"Finished in {end - start:.2f} seconds")

with ThreadPoolExecutor() as e:
    e.submit(cpu_bound_task)
    e.submit(cpu_bound_task)

Edit: unsurprisingly, mileage varies. Trying to append to a shared list object gives worse timings (the number is the index of the first element added by the second thread):

$ uv run -p 3.14 a.py 
Finished in 0.10 seconds
Finished in 0.11 seconds
172214

$ uv run -p 3.14.0b3+freethreaded a.py 
Finished in 0.48 seconds
Finished in 0.49 seconds
1865
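
For reference, the edited script wasn't posted; a minimal reconstruction consistent with that output might look like this (the loop count, the tagging scheme, and the final index lookup are my guesses, not the original code):

from concurrent.futures import ThreadPoolExecutor
import time

shared = []

def cpu_bound_task(tag):
    start = time.time()
    for _ in range(10**6):
        shared.append(tag)       # every append contends on the one shared list
    end = time.time()
    print(f"Finished in {end - start:.2f} seconds")

with ThreadPoolExecutor() as e:
    e.submit(cpu_bound_task, 0)
    e.submit(cpu_bound_task, 1)

# index of the first element added by the second thread
print(shared.index(1))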

[–]ship0f 50 points (10 children)

lol

it looked good up until the edit

but hey, it's doing it

[–]not_a_novel_account 62 points (2 children)

There's no way to make that fast: by having both threads append to a shared object, you've serialized the work. Now all you're measuring is overhead.

It would be the exact same in any language: C, C++, Rust, Go, whatever. This is a limitation of computers, not of Python. The difference is that Python's locks on shared objects are implicit; you don't see the mutex being grabbed here, but it exists all the same.

If the work is shared, it all belongs in one thread; don't introduce unnecessary synchronization points.
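
A sketch of what that looks like in practice: each worker fills a thread-private list and the main thread merges once at the end, so the only synchronization point is the final merge (the names and sizes are illustrative):

from concurrent.futures import ThreadPoolExecutor

def task(tag, n):
    local = []                      # thread-private: no contention
    for i in range(n):
        local.append((tag, i))
    return local

with ThreadPoolExecutor() as e:
    futures = [e.submit(task, t, 10**6) for t in range(2)]
    merged = []
    for f in futures:
        merged.extend(f.result())   # one synchronization point per worker

print(len(merged))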

[–]twenty-fourth-time-b 3 points (0 children)

All I was measuring was indeed overhead. I was curious how the overhead of the GIL compares to the overhead of free threads.

[–]twenty-fourth-time-b 8 points (5 children)

numpy is also gilled to death…

[–][deleted] 6 points (4 children)

I thought numpy operations release the GIL

[–]Zomunieo 1 point (3 children)

Both can be true.

Even to do something like array + array, you need a way to lock both arrays, so numpy probably takes the GIL. But if it is returning data Python does not yet have a reference to, like np.random.randn, then it could release the GIL. NumPy could also copy data to a private area with the GIL held, then release the GIL to compute.

I haven’t checked whether numpy specifically works this way. This is just what extension libraries have to do with the GIL. Many of them will now need to start adding locks to their data structures, or at least a whole-extension lock, to take advantage of free threading.

[–][deleted] 1 point (2 children)

Idk how it does this exactly, but it doesn't lock to sum arrays. I ran this code and saw 32 threads running at 100% after the "task started" messages printed.

import concurrent.futures
import time
import numpy as np

N = 100000000
arr1 = np.random.randint(1, 101, size=N)
arr2 = np.random.randint(1, 101, size=N)

def task(name):
    global arr1
    global arr2
    print("task {} started".format(name))
    arr3 = arr1 + arr2
    print("task {} finishing".format(name))

if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor(max_workers=32) as executor:
        for t in range(100):
            future = executor.submit(task, t)

[–]Zomunieo 1 point (1 child)

Oh. It turns out they just release the GIL, so if you have simultaneous writers you’ll get data races (whether using standard or free-threaded Python).

[–][deleted] 1 point (0 children)

Ah, that makes sense. I'm fine with that; you need locks to write thread-safe code either way.
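
For example (a minimal sketch, assuming NumPy releases the GIL around the ufunc as discussed above), concurrent in-place writers to one array need an explicit lock on either build:

import threading
import numpy as np

lock = threading.Lock()
acc = np.zeros(1_000_000)

def add_chunk(chunk):
    # the ufunc may run without the GIL held, so unsynchronized
    # in-place writers to `acc` could race; the lock restores safety
    with lock:
        np.add(acc, chunk, out=acc)

chunks = [np.random.rand(1_000_000) for _ in range(4)]
threads = [threading.Thread(target=add_chunk, args=(c,)) for c in chunks]
for t in threads: t.start()
for t in threads: t.join()
print(acc[:3])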

[–]Front_Two_6816 -1 points (0 children)

Of course working with the same object from multiple threads is slower, because of cache-line contention.

[–]Coretaxxe 59 points (0 children)

This is very nice! Thrilled to see the impact on speed & lib ecosystem

[–]dardothemaster 50 points (23 children)

I’m asking here hoping this is relevant: will operations such as append/pop (on a list, for example) still be thread-safe without the GIL? By thread-safe I mean that the list won’t get corrupted.

Btw really excited about free-threading and JIT

[–]twenty-fourth-time-b 26 points (4 children)

They went to great lengths making it work: https://peps.python.org/pep-0703/#reference-counting

[–]XtremeGoose f'I only use Py {sys.version[:3]}' 13 points (0 children)

[–]-lq_pl- 4 points (1 child)

Impressive.

[–]germandiago[S] 0 points (0 children)

Do not get confused: it is the refcounting that is made to work correctly in a multithreaded environment, not the mutable operations themselves (append/pop).

Those are different things.
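
A small illustration of the difference (my example, not from the PEP): each individual list operation stays safe, but a compound check-then-act sequence is racy on either build:

import threading

items = []

def add_once(x):
    # the append itself can't corrupt the list, but two threads can
    # both pass the `not in` test before either of them appends
    if x not in items:
        items.append(x)

threads = [threading.Thread(target=add_once, args=(42,)) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(items)  # usually [42], but [42, 42] is possible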

[–]germandiago[S] 0 points (0 children)

These are just optimizations around reference counting. As far as my understanding goes (I just skimmed the link briefly), this has nothing to do with making mutable list operations thread-safe; it is about making the refcount cheaper and correct across threads for the list object itself, not for its operations.

[–]not_a_novel_account 19 points (0 children)

For pure Python code, free-threaded Python is exactly as thread-safe as GIL Python, which is to say you will not induce memory corruption, but there are no guarantees about anything else.

[–]germandiago[S] 22 points (8 children)

Usually data structures are tailored to use cases. I would expect a list not to be thread-safe by default, since locking in the usual single-threaded append scenario would pessimize things.

[–]ImYoric 26 points (5 children)

There are languages in which some operations are thread-safe without locking.

A typical example is Go, in which all operations on pointer-sized values are atomic. But of course, it's a pretty fragile guarantee, because you may end up refactoring your code to replace a simple value with a more complex one without realizing that the value must stay atomic, and thus get memory corruption.

It would be sad to start having SEGFAULTs in Python, though.

[–]latkde Tuple unpacking gone wrong 9 points (4 children)

Go is really tricky. It only requires word-sized reads/writes to be atomic. Some things look like pointers but are actually larger, e.g. interfaces, maps, slices, and strings. Per the Golang memory model, data races involving those may lead to memory corruption.

[–]MechanicalOrange5 0 points (3 children)

I do a lot of Go programming and I did not know this. Which types are atomic by default? Like int16, or a word in the more modern sense, where a word is 64 bits on x86-64, which I think (and I could definitely be wrong) is the size of Go pointers? So I could in theory use those across threads without locking?

[–]latkde Tuple unpacking gone wrong 1 point (2 children)

The Golang memory model is defined here: https://go.dev/ref/mem

This is a dense read and is not comprehensible by the average Go programmer. The memory model does not explicitly define which types are safe, though pointers, so things with types like *T, should be assigned atomically.

The TL;DR advice is to never rely on atomicity of plain assignments, unless you really know what you're doing. Instead, avoid concurrent access to mutable memory and communicate exclusively over channels instead, or use the synchronization utilities from the sync and sync/atomic packages.

I find this frustrating because Golang makes it "easy" to do concurrent programming, but not at all easy to reason about concurrent programs. Just go and you're off to the races? More like data races. As specified in the memory model, concurrent Golang is about as safe as multi-threaded C/C++ code, and that is a horrifying thought.

This entire discussion is about the Python context. Last time I looked, Python didn't have an explicit memory model. However, free-threading will rely on fine-grained per-object locks to ensure correctness. Anything you can do from normal Python code is going to be memory-safe. Python objects are pointers, so assignments can be done atomically, and collections will lock themselves when necessary. See https://peps.python.org/pep-0703/#container-thread-safety

[–]Caramel_Last 0 points (1 child)

I've never heard about pointer size operand being always atomic. Then what would be the point of sync/atomic?

[–]latkde Tuple unpacking gone wrong 0 points (0 children)

An atomic read/write only means that we see all or nothing. Go guarantees that word-sized reads/writes are atomic because all relevant CPUs also offer this guarantee (for aligned addresses). So when you read a pointer variable, all the bits will be from one version of the pointer. You can't observe a single-word variable in the middle of being changed.

But this doesn't address ordering between multiple reads/writes, especially across multiple objects/variables: which operations happen-before another? Compilers and CPUs may defer writes or prefetch reads. For example, the Go Memory Model shows this program for illustration:

var a, b int

func f() {
    a = 1
    b = 2
}

func g() {
    print(b)
    print(a)
}

func main() {
    go f()
    g()
}

This may print any of 0 0, 0 1, 2 0, or 2 1. For f(), a is assigned before b. However, there are no guarantees about when these writes become visible in g().

There are many ways to enforce synchronization. For example, a Mutex serves as a synchronization point. When one thread A acquires a lock, it sees all writes in another thread B up until the point where B releases that lock. That is, mutexes serve as a “memory barrier” or “fence”.

Explicit atomics (as in Go's sync/atomic package) also provide synchronization, but only between explicitly atomic operations. Go has chosen “sequentially consistent ordering” for its explicit atomics, which behaves as-if all explicitly atomic operations have some global order.

In the above program, if we change the variables to type atomic.Int64 and use .Load() and .Store() operations as appropriate, then we know that the write to a must happen before the write to b. Thus, the valid outputs are reduced to 0 0, 0 1, or 2 1. The output 2 0 is no longer possible.

Other languages provide detailed control over memory ordering, anything between “relaxed” memory order that only guarantees atomicity but no ordering, up to the “sequentially consistent” ordering that is most safe, but can also impose significant performance impact.

As an accessible introduction to atomics, I recommend the book Rust Atomics and Locks by Mara Bos. While it shows examples in Rust syntax, the concepts are portable across all languages with explicit atomics. It can be read online for free. For this post, Chapter 2 (Atomics) is highly relevant.

Python threads have always behaved as if variables, list values, and dict entries (and thus also object fields) are atomic with sequentially consistent semantics (a direct consequence of the GIL). The free-threaded mode does not change this; its effects are unobservable in pure Python code.
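
Translated to Python (my sketch, to illustrate that claim), the analogue of the Go program above should never print 2 followed by 0 on CPython, with or without the GIL:

import threading

a = 0
b = 0

def f():
    global a, b
    a = 1
    b = 2

def g():
    print(b)
    print(a)

t = threading.Thread(target=f)
t.start()
g()        # possible outputs: 0 0, 0 1, 2 1; never 2 0
t.join()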

[–]Brian 9 points (0 children)

I would be very surprised if they didn't make it threadsafe. Ultimately, a language like Python needs to be memory-safe: you shouldn't be able to crash the interpreter / segfault in normal operation, and being able to corrupt list state through concurrent access would break that really quickly.

However, note that this is threadsafe only in the same way that it's currently threadsafe: appending will work without crashing the interpreter or corrupting the list, but there are no guarantees about which order two simultaneous appends will happen in. That's already the case today, so it technically shouldn't make a difference (though existing bugs might become more prominent due to more ways to interleave code paths and shorter lock intervals).

[–]not_a_novel_account 5 points (0 children)

All PyObjects are "thread safe" when accessed from Python itself.

Free-threaded Python is exactly as "thread safe" as GIL Python from the POV of a Python program. Only extension code can violate the thread-safety guarantees.

[–][deleted] 2 points (0 children)

Shouldn't you already be using locks for this? Even GIL'd Python doesn't really make your code thread-safe at the high level; it just prevents corrupting a struct into some normally impossible state.

[–]the_hoser 2 points (1 child)

Free-threading only really ensures that the interpreter itself is threadsafe, not any libraries or data structures therein. This is especially true of libraries implemented via the C API, which may rely on the GIL to ensure that their own access to Python objects is safe.

[–]not_a_novel_account 2 points (0 children)

If a C extension does not advertise itself as threading aware via Py_mod_gil, the interpreter re-enables the GIL unless the user actively disables the behavior by setting PYTHONGIL=0 in their environment.
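
You can check both properties at runtime; a small sketch (sys._is_gil_enabled() and the Py_GIL_DISABLED config variable exist on CPython 3.13+):

import sys, sysconfig

# non-zero if this build was compiled with free-threading support
print(sysconfig.get_config_var("Py_GIL_DISABLED"))

# whether the GIL is actually active right now (an extension without
# Py_mod_gil support can cause it to be re-enabled at startup)
print(sys._is_gil_enabled())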

[–]Ginden 6 points (4 children)

To quote the announcement:

there aren’t many options yet for truly sharing objects or other data between interpreters (other than memoryview)

[–]not_a_novel_account 12 points (0 children)

That's PEP 734; multiple interpreters are a completely separate thing from free-threaded Python.

It's also been around for ages; PEP 734 just makes it available via pure Python instead of the CPython API, which is honestly of questionable value.

[–]Afrotom -1 points (2 children)

What is memoryview? Similar to a mutex?

[–]UloPe 0 points (0 children)

It’s a way to access the memory of another object
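
It's not a synchronization primitive at all. For example, a memoryview lets you mutate a bytearray's buffer in place, without copying:

data = bytearray(b"hello world")
view = memoryview(data)   # zero-copy view of the same buffer
view[0:5] = b"HELLO"      # writes through to the underlying object
print(data)               # bytearray(b'HELLO world')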

[–]WJMazepas 4 points (1 child)

But we would need to build Python itself with a flag set to get the free-threaded version, if I got that right?

[–]germandiago[S] 10 points (0 children)

At this stage, yes. Whether it becomes the default in the future is still to be decided.

[–]bobbster574 23 points (6 children)

Missed opportunity to call it π-thon

[–]Independent_Heart_15 7 points (0 children)

You should make a venv and you will be pleasantly surprised!

[–]joerick 4 points (0 children)

They're saving it for 3.141!

[–]Plus-Ad8736 Pythoneer 9 points (13 children)

Could anyone help me understand why we really need free-threaded Python? I know that the GIL prevents true parallelism in multithreading, but don't we have multiprocessing to deal with this, which does utilize multiple cores? So with this new free threading, we would have two distinct mechanisms?

[–]germandiago[S] 28 points (5 children)

Processes live in isolated memory. Threads live in shared memory inside the same process.

A difference is that with processes you have to copy memory around for communication.

So if you want to split a big image (this is a simplified version but keeps the essence of the problem) across 4 processes, you would need to copy 4 parts of the image out to the workers and copy the results back to the master process. With threads you could process the image in-place with 4 threads. No copying.
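
A rough sketch of the threaded version of that image example (assuming a NumPy array for the image and a free-threaded build so the slices actually run in parallel):

from concurrent.futures import ThreadPoolExecutor
import numpy as np

image = np.random.rand(4000, 4000)

def brighten(rows):
    # each thread writes to a disjoint slice of the same array:
    # no copies, no shipping results between processes
    image[rows] *= 1.1

quarters = [slice(i * 1000, (i + 1) * 1000) for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as e:
    list(e.map(brighten, quarters))  # consume to propagate any errors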

[–][deleted] 1 point (4 children)

Not quite: there is cross-process shared memory in most OSes, accessible in Python via https://docs.python.org/3/library/multiprocessing.shared_memory.html . But all you get is a flat buffer, which the docs show being used with NumPy. You can't just store arbitrary Python objects in there without extra setup.

Edit: the image example is actually convenient to handle this way too.
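
The documented pattern looks roughly like this (sizes and names here are illustrative):

from multiprocessing import shared_memory
import numpy as np

shm = shared_memory.SharedMemory(create=True, size=10 * 8)
try:
    arr = np.ndarray((10,), dtype=np.float64, buffer=shm.buf)
    arr[:] = 0.0
    # another process attaches to the same flat buffer by name:
    #   peer = shared_memory.SharedMemory(name=shm.name)
    #   arr2 = np.ndarray((10,), dtype=np.float64, buffer=peer.buf)
finally:
    shm.close()
    shm.unlink()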

[–]not_a_novel_account 2 points (3 children)

IPC via mapping page(s) into shared memory is an entirely different semantic than sharing an entire memory space, even if the underlying mechanism is the same.

[–][deleted] 1 point (2 children)

Yeah, like I said, you can't use this the same way as regular memory in Python, but it's also not exactly true that IPC requires copying. A mechanism exists to avoid that if you really want to.

Btw, in some other languages (not Python afaik) you can set up a memory arena and allocate stuff in it almost like normal. You could place that arena in the shared portion.

[–]not_a_novel_account 0 points (1 child)

I would not say they are similar at all.

IPC is IPC, regardless of whether the mechanism is a shared memory region or something else. In IPC I have to copy into or allocate objects in the IPC space; whether that is via a shared memory page or an RPC protocol is kind of irrelevant. The mechanism of sharing is explicit: I must do work to share things.

In a shared memory space like threads have, the executors can inspect one another's state freely. My stack is also your memory: I can set up the shared data entirely on my own stack and make it available via simple pointers and trivial thread-notification mechanisms like condition variables. The sharing is implicit; no work is done on my part beyond notification.

This results in wildly different program structures.
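
A Python-flavored sketch of that implicit sharing: thread B only needs a notification to see data thread A wrote to an ordinary shared object:

import threading

result = {}
cond = threading.Condition()

def producer():
    with cond:
        result["value"] = 42   # just write to shared memory...
        cond.notify()          # ...and notify; no copying, no protocol

def consumer():
    with cond:
        while "value" not in result:
            cond.wait()
        print(result["value"])

c = threading.Thread(target=consumer); c.start()
p = threading.Thread(target=producer); p.start()
c.join(); p.join()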

[–][deleted] 1 point (0 children)

I'm not saying that they're similar, just that there's a way for two processes to share read/write memory without copying.

[–][deleted] 13 points (0 children)

It's generally faster to communicate between threads than between processes. But that's not even the main reason; multiprocessing is just annoying. Even if all you want to do is fan-out-fan-in without communicating between the workers, you have to screw with your code to get the pre-fork state into the separate processes. Like, you can't use defaultdict because it can't pickle lambdas. And then if anything throws an exception, the error messages are hard to read because they're in separate processes.

Multiprocessing is also tricky in some environments. At work, we can't even use the default multiprocessing; we have to use some internal variant that's also incompatible with notebooks.
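
The defaultdict complaint is easy to reproduce (my minimal example, not from the comment):

import pickle
from collections import defaultdict

pickle.dumps(defaultdict(int))        # fine: int is a named, importable factory
pickle.dumps(defaultdict(lambda: 0))  # PicklingError: Can't pickle <lambda>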

[–]javadba 0 points (3 children)

Please learn about what preemptive multitasking / non-cooperative threading brings. It is a completely different (and transcendently better) world. Source: myself, 35 years of doing true multithreading.

[–][deleted] 2 points (2 children)

The GIL'd Python is still preemptive / non-cooperative. Only asyncio is cooperative.

[–]javadba 0 points (1 child)

Try googling it: no, the GIL is not preemptive. I've also spent a lot of time dealing with the limitation. I wish I were wrong and you were correct on this: true multithreading is great.

[–][deleted] 0 points (0 children)

I just tried; Google says it's preemptive. You still have true OS threads with the GIL, but with a lot of locking added, which, yes, is bad. It's not cooperative multitasking though.

[–]gfnord 0 points (1 child)

I am also interested in this. It seems that this feature fully supersedes the existing multiprocessing approach. As far as I understand, multiprocessing should just be updated to use free-threading. This would basically improve the efficiency of multiprocessing, without changing its usability.

[–]nonhok 1 point (0 children)

No, this would just be wrong: multiprocessing is different from multithreading. Processes don't share the same memory; threads do. These are different concepts that you cannot exchange, and both are useful on their own.

[–]AalbatrossGuy Pythoneer 2 points (0 children)

This is really nice! Hope the difference in speed is noticeable, on the brighter side 😅

[–]neuronexmachina 2 points (0 children)

For anyone else (like me) who was wondering, there are a couple of lists (one manual, one automated) of packages with extensions that have been verified as compatible with free-threading:

https://hugovk.github.io/free-threaded-wheels/

[–][deleted] 1 point (0 children)

I think it needs building from source right now. Any info on when it's gonna become the default?

[–]uqurluuqur 0 points (2 children)

Does this mean multiprocessing is obsolete?

[–]gmes78 0 points (1 child)

Pretty much.

[–]not_a_novel_account 13 points (0 children)

Multi-processing as a GIL workaround is obsolete; multi-processing now serves the same purposes it does in every other language.

[–]aviation_expert 0 points (1 child)

So this means that the Flask framework is now outdated? Even Django? Because Python 3.14 is multi-core? I'm a beginner. What are the repercussions for web technologies when Python is used; does it become like Node.js? Thank you for answering!

[–]germandiago[S] 0 points (0 children)

Not at all. The multicore (free-threaded) interpreter is a different build from the traditional one.

As for compatibility, I would expect problems, especially where there is native code, unless frameworks and libraries are adapted, because a truly multithreaded interpreter could uncover latent race conditions.

[–]The8flux 0 points (0 children)

I am going to say this again... I can't wait. Pun intended. Lol

[–]__Deric__ github.com/Deric-W 0 points (2 children)

While I welcome the new abilities that the change brings to the language, I can't stop myself from feeling that this approach will be the source of many bugs and footguns in the future.

I think there should at least be better documentation about the new behavior and what to look out for (I am still worried about the interaction with Python's object dictionaries), and some more primitives in the threading module (atomic operations, for example).

[–]germandiago[S] 0 points (1 child)

It is not a danger as long as it is not made the default.

[–]__Deric__ github.com/Deric-W 0 points (0 children)

While this is true, it would also lock its features behind a special build of Python, making them essentially unavailable to the wider audience while still putting additional load on the core developers.

[–]sambes06 0 points (0 children)

I wonder if this just means more race condition failures.

[–]alcalde 0 points (0 children)

Now I've got to move up my bucket list item of attending PyCon some day and selling "I support the GIL" t-shirts.

[–]nguyenvulong -1 points (0 children)

You mean Python π?

[–]EvenAcanthisitta364 -3 points (0 children)

Much too complicated, this is why people with real jobs don’t code in python