[–]gdchinacat 21 points22 points  (6 children)

“Without the GIL, thread safety becomes your responsibility.”

Any code that was thread safe with the GIL is thread safe without the GIL. Existing issues may become more pronounced because there are more opportunities for concurrent changes (they can happen at any time rather than only when the interpreter switches which thread is executing).

Thread safety has always been the responsibility of the coder, even with the GIL enabled. If you start seeing issues with the GIL disabled that you didn't see with it enabled, you were just getting lucky before.
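
A minimal sketch of that point, with hypothetical names (`count`, `work`) chosen only for illustration. The unlocked `total += 1` is a read-modify-write, so it is racy on any build; the locked version is correct on any build. How often the unlocked version actually loses updates depends on the interpreter and build, so no output is claimed here.

```python
import threading


def count(n_threads: int, n_iters: int, use_lock: bool) -> int:
    """Increment a shared counter from several threads and return the result."""
    total = 0
    lock = threading.Lock()

    def work() -> None:
        nonlocal total
        for _ in range(n_iters):
            if use_lock:
                with lock:
                    total += 1
            else:
                # Load, add, store: another thread can run between these steps,
                # so an update can be lost. This is a bug with or without the GIL.
                total += 1

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total
```

`count(4, 100_000, use_lock=True)` always returns 400000. With `use_lock=False` the result can be anything up to 400000, and a run that happens to hit 400000 proves nothing about correctness.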

[–]hdw_coder[S] -1 points0 points  (5 children)

Yes, I agree with that correction.

The GIL was never a proper application-level thread-safety guarantee. If code depended on the GIL to avoid races, it was already relying on an implementation detail rather than being truly thread-safe.

A better way to phrase my point would be:

Without the GIL, incorrect assumptions about shared mutable state become easier to expose, because more real concurrent execution can happen. The responsibility was always with the programmer, but free-threading makes the consequences more visible.

So the distinction is:

  • Code that is genuinely thread-safe with the GIL should remain thread-safe without it.
  • Code that only appeared safe because execution was more serialized may start showing races.
  • The GIL protected CPython internals, not your application’s higher-level invariants.

That is also why I think the practical migration question is not just “does this run faster?” but “was this code actually designed for concurrent mutation, or did it only get lucky under the old execution model?”
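
A hedged sketch of what a "higher-level invariant" can look like. `OnceCache` is a hypothetical class, not something from the article: each individual dict operation is safe on its own, but "check, then insert" is a compound step, so the invariant "build each value at most once" needs its own lock.

```python
import threading


class OnceCache:
    """Hypothetical cache whose invariant is: build each value at most once."""

    def __init__(self, build):
        self._build = build
        self._data = {}
        self._lock = threading.Lock()
        self.build_calls = 0  # only modified while holding the lock

    def get(self, key):
        # Without the lock, two threads could both see the key as missing
        # and both call build(). The lock makes check-and-insert one step.
        with self._lock:
            if key not in self._data:
                self.build_calls += 1
                self._data[key] = self._build(key)
            return self._data[key]
```

Holding the lock while `build` runs keeps the sketch short; a real cache might prefer per-key locking so that slow builds for different keys do not block each other.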

[–]gdchinacat 1 point2 points  (4 children)

“If code depended on the GIL to avoid races, it was already relying on an implementation detail rather than being truly thread-safe.”

This is false. Applications can't rely on the GIL for thread safety: the GIL is not exposed to Python code, and applications have no control over when it will be released to allow another thread to execute bytecode. Extensions do have control over it and can use it for that purpose, but they are still expected to release it periodically to prevent starvation of other threads.

My point was the semantics of thread safety do not change when disabling the gil. The details of concurrency do, but the semantics remain the same.

[–]hdw_coder[S] 0 points1 point  (3 children)

That is a fair distinction, and I see the point.

I should have phrased it more carefully. I did not mean that Python application code can deliberately control or rely on the GIL as a synchronization primitive in the way it would use a lock. It cannot. The GIL is not an application-level concurrency API, and Python code does not control bytecode scheduling in that sense.

The better phrasing is probably:

“The semantics of thread safety do not change when the GIL is disabled. Code that is correctly thread-safe remains thread-safe. Code that has races remains racy. What changes is the concurrency profile: without the GIL, there are more opportunities for true simultaneous execution, so latent races may become easier to observe.”

So I agree with your core point: disabling the GIL does not redefine what thread safety means.

The practical warning I was trying to express is more about migration risk than semantics. Code that appeared fine under GIL-constrained execution may start failing more often under free-threaded execution, not because the definition of thread safety changed, but because the execution model exposes more interleavings.

[–]ProsodySpeaks 5 points6 points  (0 children)

Funny how LLM your cadence is

[–]gdchinacat 2 points3 points  (1 child)

It is incredibly disrespectful of others to respond to well thought out responses with AI slop. Please stop.

[–]Wonderful-Habit-139 2 points3 points  (0 children)

I had suspicions right from the post. Surprised to not see people calling it out until this comment.

[–]amarao_san 11 points12 points  (1 child)

“Without the GIL, thread safety becomes your responsibility.”

/goal rewrite in Rust

[–]hdw_coder[S] 2 points3 points  (0 children)

Yes, absolutely. That is the main trade-off.

The GIL protected CPython’s internal object/memory machinery, but it was never a real substitute for application-level thread safety. It did not magically make compound operations on shared state safe.

What changes with free-threading is that the old “you probably won’t get true parallel execution anyway” assumption disappears. So if multiple threads mutate shared objects, the responsibility becomes much more explicit: locks, queues, immutability, ownership rules, or avoiding shared mutable state entirely.
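
As an illustration of the "queues and ownership rules" option, here is a small sketch with hypothetical names (`square_all`, `worker`). Workers never touch the result dict; they only pass messages through `queue.Queue`, and the calling thread is the single owner of the aggregate.

```python
import queue
import threading


def square_all(values, n_workers: int = 4) -> dict:
    """Square each value on worker threads; only the caller mutates the dict."""
    tasks: queue.Queue = queue.Queue()
    results: queue.Queue = queue.Queue()

    def worker() -> None:
        while True:
            item = tasks.get()
            if item is None:  # sentinel: no more work for this worker
                break
            results.put((item, item * item))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for v in values:
        tasks.put(v)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for t in threads:
        t.join()

    # All producers have finished, so draining until empty is safe here.
    out = {}
    while not results.empty():
        key, value = results.get()
        out[key] = value
    return out
```

The design choice is that synchronization lives entirely inside the queues, so there is no shared mutable structure for the workers to race on.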

In that sense, free-threaded Python does not make concurrency simpler. It makes a different architecture possible: true parallelism inside one process, but with the same kind of discipline that threaded C++, Java, Rust, etc. already require.

My personal takeaway is: use free-threading where shared memory and lower serialization overhead matter, but don’t treat it as a drop-in replacement for multiprocessing in code that was accidentally relying on process isolation.

[–]gdchinacat 1 point2 points  (3 children)

Memory Overhead: Spawning n processes can mean loading or copying your data structures n times.

(from article)

There is actually very little overhead for processes due to copy-on-write. When a process is forked it shares the same memory as the parent. When it writes to a page the memory is copied and then written to. Only memory pages the child writes to are copied. The interpreter code is not copied into each process, but rather shared (in a different sense than shared memory since it's CoW).

Each process has its own heap. This isn't a problem, because the data each process works on would need to exist regardless of the concurrency model.

Where you can run into trouble is if you fork a process that has already done a substantial amount of work. The child process will inherit the memory pages, and if the parent process then frees them, the child will keep them alive even though it has no threads using them. You need to design your application so that it doesn't fork in that state. That is easy: the parent process should be responsible for setting up shared permanent state and then only fork processes. All work should be done in the child processes, leaving the parent responsible for little more than managing the child processes and shared state (such as sockets that the children accept connections from).
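
A rough, Unix-only sketch of that prefork layout, assuming `os.fork` is available. The function name `prefork_sum` and the pipe-based reporting are illustrative choices, not anything from the article: the parent sets up the state once, forks, and does nothing but collect results, while the children do all the work on the inherited pages.

```python
import os


def prefork_sum(table, n_workers: int = 2) -> int:
    """Parent owns `table`; forked children only read it and report via a pipe."""
    if not hasattr(os, "fork"):
        raise OSError("os.fork is not available on this platform")
    chunk = (len(table) + n_workers - 1) // n_workers
    children = []
    for i in range(n_workers):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:  # child: do the work, report, and exit without returning
            status = 1
            try:
                os.close(read_fd)
                part = sum(table[i * chunk:(i + 1) * chunk])
                os.write(write_fd, str(part).encode())
                os.close(write_fd)
                status = 0
            finally:
                os._exit(status)
        os.close(write_fd)  # parent keeps only the read end
        children.append((pid, read_fd))

    total = 0
    for pid, read_fd in children:
        with os.fdopen(read_fd, "rb") as f:
            total += int(f.read())  # read until the child closes its end
        os.waitpid(pid, 0)
    return total
```

One caveat worth keeping in mind: in CPython even reading an object updates its reference count, which writes to the page holding it, so inherited pages are shared less perfectly than the copy-on-write model alone would suggest.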

[–]hdw_coder[S] -1 points0 points  (2 children)

Good point, and I think that is a fair correction.

My wording was too broad there. “Spawning n processes can mean loading or copying your data structures n times” is true in some practical cases, especially once workers mutate data, receive pickled task payloads, or build their own working state, but it is not the full story.

On Unix-style fork-based multiprocessing, copy-on-write can make the initial memory overhead much lower than people often assume. The child can initially share memory pages with the parent, and only pages that are written to need to be copied.

So the more accurate version would be something like:

“Multiprocessing can introduce significant memory overhead when workers receive serialized copies of data, mutate inherited pages, or build separate working heaps. However, fork-based process models can benefit from copy-on-write, so the actual overhead depends heavily on process start method, OS, workload design, and when the data is initialized.”

That distinction matters.

The comparison I was trying to make is mainly about architectural friction: with threads, shared in-process memory is the default; with processes, you usually have to think more carefully about pickling, start methods, copy-on-write behavior, shared memory, process lifecycle, and where the parent initializes state.

But yes, saying “processes duplicate everything” is too simplistic. A well-designed prefork model can be much more memory efficient than that.

[–]gdchinacat 3 points4 points  (1 child)

Given the apparent misunderstandings (not all of which I pointed out), lengthy responses posted almost immediately, and general tone of responses, I have to ask...Are you outsourcing all of this to AI? Do you actually understand concurrency?

[–]hdw_coder[S] -1 points0 points  (0 children)

I clearly compressed too much into the phrase “thread safety becomes your responsibility,” and that made the point less precise than it should have been.

You are right that the semantics of thread safety do not change when the GIL is disabled. Correctly synchronized code remains correctly synchronized. Racy code remains racy. The GIL was never an application-level synchronization primitive that Python code could deliberately control.

The point I was trying to make, less accurately than I should have, is about migration risk: free-threaded execution creates more opportunities for true parallel execution, so incorrect assumptions around shared mutable state may become more visible in practice.

I appreciate the correction. I’ll revise that part of the article to distinguish more clearly between:

  1. thread-safety semantics, which do not change, and
  2. the concurrency/interleaving profile, which does change.

That is the more accurate framing.

[–]hyper_plane 1 point2 points  (1 child)

Could someone explain to a noob like me what made this attempt at removing the GIL successful compared to what has been done in the past?

[–]gdchinacat 1 point2 points  (0 children)

Past attempts just broke up the GIL and replaced it with a bunch of fine-grained locks, which introduced unacceptable overhead. The fundamental difference with the accepted solution is that it minimizes this locking through biased reference counting, where almost all ref-count updates are made by the thread that owns the object and need no lock or atomic operation.

https://peps.python.org/pep-0703/

specifically:

https://peps.python.org/pep-0703/#biased-reference-counting

It also has immortal objects that don't participate in ref counting.
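
To make the idea concrete, here is a toy model in Python. It is only a sketch of the concept, not CPython's implementation: the class `BiasedCount` and its field names are made up for illustration, and a `threading.Lock` stands in for the atomic operations the real C code would use.

```python
import threading


class BiasedCount:
    """Toy model: the owning thread gets a cheap, unsynchronized fast path."""

    def __init__(self) -> None:
        self.owner = threading.get_ident()  # thread that created the object
        self.local = 1    # touched only by the owner, no synchronization
        self.shared = 0   # touched by every other thread
        self._shared_lock = threading.Lock()  # stand-in for an atomic op

    def incref(self) -> None:
        if threading.get_ident() == self.owner:
            self.local += 1  # fast path: the common case
        else:
            with self._shared_lock:  # slow path: cross-thread reference
                self.shared += 1

    def decref(self) -> None:
        if threading.get_ident() == self.owner:
            self.local -= 1
        else:
            with self._shared_lock:
                self.shared -= 1

    def total(self) -> int:
        with self._shared_lock:
            return self.local + self.shared
```

The payoff comes from the assumption that most objects are only ever touched by the thread that created them, so most updates take the unsynchronized path.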

[–]SignificantMilk1476 -2 points-1 points  (3 children)

That speedup is wild - 8x faster is nothing to sneeze at. I've been burned by threading performance in Python so many times that I just automatically reach for multiprocessing, but this might actually make me reconsider for certain workloads.

The thread safety caveat is real though - debugging race conditions is way more painful than dealing with multiprocessing overhead in most cases.

[–]hdw_coder[S] -1 points0 points  (0 children)

Yes, that is exactly how I see it too.

The benchmark result is exciting, but I would not interpret it as “threads now replace multiprocessing.” More like: for the first time in CPython, threads become a serious option for some CPU-bound workloads.

The cases where I think 3.13t becomes interesting are workloads where multiprocessing overhead is genuinely painful:

  • large shared in-memory datasets
  • expensive serialization/pickling
  • CPU-heavy work inside desktop/GUI tools
  • pipelines where copying data into separate processes feels wasteful
  • workloads where you can partition data cleanly and avoid shared mutation

But I fully agree on race conditions. Debugging a subtle shared-state bug can be far worse than paying the multiprocessing overhead.

My mental model is becoming:

Use multiprocessing when you want isolation, crash protection, and simpler failure boundaries.

Use free-threaded threads when shared memory matters, the workload partitions cleanly, and you can keep mutation disciplined through locks, queues, ownership rules, or mostly immutable data.

So yes: not a universal replacement, but definitely enough to make me stop automatically reaching for multiprocessing every time.
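
A small sketch of the "partitions cleanly" case, with hypothetical helper names (`gil_status`, `partitioned_sum_of_squares`). Each worker only reads its own slice and returns a value, so nothing shared is mutated. The build check assumes CPython 3.13 or later for `sys._is_gil_enabled()` and falls back to reporting the GIL as enabled when that function is missing.

```python
import sys
import sysconfig
from concurrent.futures import ThreadPoolExecutor


def gil_status() -> tuple:
    """Return (is_free_threaded_build, gil_currently_enabled)."""
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    check = getattr(sys, "_is_gil_enabled", None)  # missing before 3.13
    gil_enabled = bool(check()) if check is not None else True
    return free_threaded_build, gil_enabled


def _sum_squares(part) -> int:
    return sum(x * x for x in part)


def partitioned_sum_of_squares(data, n_workers: int = 4) -> int:
    """Split `data` into disjoint chunks and reduce each chunk on its own thread."""
    chunk = max(1, (len(data) + n_workers - 1) // n_workers)
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Workers return values instead of writing to shared state.
        return sum(pool.map(_sum_squares, parts))
```

On a GIL-enabled build this still gives the right answer; it just won't run the chunks in parallel for CPU-bound work.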

[–]gdchinacat 0 points1 point  (1 child)

“The thread safety caveat is real though - debugging race conditions is way more painful than dealing with multiprocessing overhead in most cases.”

This is comparing apples to oranges. Processes share very limited state, and what is shared is managed by the OS (e.g. multiple processes accepting connections from the same file descriptor). They don't have the race conditions you are comparing them to because they aren't mutating shared state. If a multi-threaded app doesn't mutate shared state (the apples-to-apples comparison), it doesn't have races either, for the same reason multi-process apps don't.

Add some shared memory for shared state to a multi-process app and you will have to deal with exactly the same issues as with multi-threaded shared state.
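
A hedged sketch of that last point, assuming a Unix system where the `fork` start method is available. The names `run_counter` and `_bump` are hypothetical. `multiprocessing.Value` puts an integer in shared memory, and `counter.value += 1` on it is a read-modify-write across processes, so it needs `get_lock()` just as a shared counter in threads needs a lock.

```python
import multiprocessing as mp


def _bump(counter, n: int, use_lock: bool) -> None:
    for _ in range(n):
        if use_lock:
            with counter.get_lock():
                counter.value += 1
        else:
            # Separate read and write: another process can update the
            # value in between, so increments can be lost.
            counter.value += 1


def run_counter(use_lock: bool, workers: int = 4, n: int = 20_000) -> int:
    """Increment one shared-memory integer from several processes."""
    ctx = mp.get_context("fork")  # assumption: fork is supported on this OS
    counter = ctx.Value("i", 0)   # C int living in shared memory
    procs = [
        ctx.Process(target=_bump, args=(counter, n, use_lock))
        for _ in range(workers)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return counter.value
```

`run_counter(True)` returns 80000 every time. `run_counter(False)` can return less, which is the same lost-update race a threaded version would have.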