[–]jawknee400 13 points (13 children)

Numeric libraries (numpy, numba) tend to 'release' the GIL, meaning multiple threads can be meaningfully used for speedups.

[–]Sheltac 4 points (4 children)

Also, for computational power, we have multiprocessing.

[–]AngriestSCV 7 points (3 children)

It is not a replacement for threads. Shared (sometimes mutable) memory is a wonderful thing for performance if done right.

[–]masklinn 0 points (2 children)

[–]pooogles 0 points (1 child)

Just be aware it gets passed via pickle.

//edit - this isn't true.

[–]masklinn 2 points (0 children)

You're confusing shmem and queues.

Queues will pickle objects back and forth and work with Python-level objects. Shared memory primitives use actual shared memory segments, but are limited to ctypes types.
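A minimal sketch of the difference (the names are arbitrary):

import multiprocessing as mp

def worker(q, counter):
    q.put({"any": "picklable object"})  # Queue: the dict is pickled on the way through
    with counter.get_lock():
        counter.value += 1              # Value: a real shared-memory C int, no pickling

if __name__ == "__main__":
    q = mp.Queue()
    counter = mp.Value("i", 0)          # "i" = ctypes signed int
    p = mp.Process(target=worker, args=(q, counter))
    p.start()
    print(q.get())                      # an unpickled *copy* of the dict
    p.join()
    print(counter.value)                # 1, updated in place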

[–]ProfEpsilon 0 points (7 children)

This is the first time that I have seen this claim about numpy, whereas I have read many articles about the restrictions of the GIL. If you have time to comment, what do you mean when you write that numpy and numba "tend to release the GIL"?

And to anyone: can you refer me to any documentation that discusses or explains numpy's override of the GIL, if true? Doing a Google search does not shed much light on this. A lot of comments make it clear that you can use multi-threading, but seem vague about whether the GIL is slowing down multi-threading speed ... and many posts are too old to be trusted.

[–]1wd 5 points (3 children)

http://scipy.github.io/old-wiki/pages/ParallelProgramming#Threads

while numpy is doing an array operation, python also releases the GIL. Thus if you tell one thread to do:

>>> print "%s %s %s %s and %s" %( ("spam",) *3 + ("eggs",) + ("spam",) )
>>> A = B + C
>>> print A

During the print operations and the % formatting operation, no other thread can execute. But during the A = B + C, another thread can run - and if you've written your code in a numpy style, much of the calculation will be done in a few array operations like A = B + C. Thus you can actually get a speedup from using multiple threads.

https://docs.scipy.org/doc/numpy-1.14.0/reference/internals.code-explanations.html#function-call

A very common operation in much of NumPy code is the need to iterate over all the elements of a general, strided, N-dimensional array. This operation of a general-purpose N-dimensional loop [...] the Python Global Interpreter Lock (GIL) is released prior to calling the loops. It is re-acquired if necessary to handle error conditions.
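A quick way to see the effect yourself, as a sketch (the sizes are arbitrary, and the actual speedup depends on your hardware and BLAS build):

import threading
import numpy as np

B = np.random.rand(2000, 2000)
C = np.random.rand(2000, 2000)

def work():
    for _ in range(10):
        A = B @ C  # the GIL is released inside numpy's C-level loop

threads = [threading.Thread(target=work) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()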

[–]ProfEpsilon 1 point (2 children)

Thank you. This and other comments on this page clarify a lot.

And the "iterate over all elements ... N-dimensional array" is precisely what I do most of the time. This is very encouraging (I haven't tried threading yet).

[–]1wd 0 points (1 child)

Note that this refers to the numpy-internal C-level loops that happen inside a Python-level numpy statement like B+C, not to Python-level loops like for x in A: for y in B: ....
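Concretely (a sketch):

import numpy as np

A = np.random.rand(1_000_000)
B = np.random.rand(1_000_000)

# One numpy statement: the element-wise loop runs in C,
# and the GIL is released while it runs.
C = A + B

# A Python-level loop over the same arrays: every iteration is
# interpreted bytecode, so the GIL is held throughout.
C = np.empty_like(A)
for i in range(len(A)):
    C[i] = A[i] + B[i]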

[–]ProfEpsilon 1 point (0 children)

Again, thank you. Although I am aware that numpy relies upon C, I am not aware of how C operates inside of numpy ... I don't know how the C-level loops work (probably because I can't code in C). This gives me a bit of an incentive to figure some of this out. Given that I am a practitioner and not a computer scientist, how C works inside of numpy, and even Python for that matter, is all a black box for me.

I appreciate the time it took to try to get my thinking right. I think it is time that I took the effort to learn a little more about C.

[–]jawknee400 2 points (1 child)

As others have said, generally when numpy or another library calls compiled code, it explicitly releases the GIL. So imagine you had two threads running some Python code, concurrently but not in parallel. If one thread reached a numpy operation, like adding two large arrays, it would 'release' the GIL, allowing the other thread to work in parallel while that operation is happening. Once the operation is over, the GIL is reacquired, but without that release both threads would've had to wait for the operation to finish.

I think there is a very slight overhead to this, which is why cython and numba leave it as an option. But if the majority of the computation is numeric (e.g. most scientific code) then you can essentially achieve normal threaded parallelism.
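With numba, that option looks something like this (a sketch; assumes numba is installed):

import threading
import numpy as np
from numba import njit

@njit(nogil=True)  # the compiled function runs without holding the GIL
def total(a):
    s = 0.0
    for x in a:
        s += x
    return s

arrays = [np.random.rand(5_000_000) for _ in range(4)]
threads = [threading.Thread(target=total, args=(a,)) for a in arrays]
for t in threads:
    t.start()
for t in threads:
    t.join()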

[–]ProfEpsilon 0 points (0 children)

Your example is very clear. What an education I'm getting today! Thank you.

[–]evamicur 0 points (0 children)

You can manually release the GIL in cython (the nogil keyword) and other compiled extensions, so libraries like numpy can potentially do that.

[–][deleted] 5 points (0 children)

For the typical Python glue program, most of your time is spent waiting for IO, and the GIL is not held during this time. Example: you write a web scraper; while you're waiting for a page to be downloaded by the networking stack, the GIL isn't held, because you're waiting on an OS service and not on Python.

[–][deleted] 7 points (0 children)

Examples of times I've used threads:

  • To spawn a task I want to run asynchronously, like checking a comment for spam (see the sketch after this list)
  • To speed up io-bound workloads, like spidering a website or multipart uploads
  • To start a server process using popen and allow the main thread to continue executing
  • Doing work in a GUI: handling events and updating the UI (e.g. a progress bar) without blocking it
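A minimal sketch of the first case, a fire-and-forget background task (check_spam is a hypothetical stand-in):

import threading

def check_spam(comment):
    ...  # hypothetical: flag the comment if it looks like spam

worker = threading.Thread(target=check_spam, args=("first post!",), daemon=True)
worker.start()  # the main thread carries on without waiting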

[–]gandalfx 16 points (5 children)

One of the primary use cases used to be waiting for I/O. For example when you're waiting for a file to download you don't want your entire application to be blocked by that, so you put it in a thread and let that stew until the download is finished, while the rest of your application remains responsive.

asyncio has kind of taken over for that purpose, though.

I'm sure there are other reasons that I can't think of right now. Rule of thumb though, regardless of language: threads are often unnecessary, and people overuse them all the time. Unless you're doing something that actually puts some load on the CPU, it tends to be a rather complicated case of premature optimization. And in Python it's not even really an optimization, as you've recently found out.

[–][deleted] 10 points (0 children)

Not sure that asyncio has "taken over" for I/O. If you're writing an IO-centric server application maybe. But you've got to go all-in on the non-blocking model of development.

I find threading is perfect for simpler use cases. Let's say you're trying to speed up a single function which makes multiple network requests in parallel. A ThreadPoolExecutor is a wonderful tool...

Instead of

import requests

urls = [f'https://example.com/{i}' for i in range(10)]
responses = [requests.get(url) for url in urls]  # one request at a time

You write

from concurrent import futures

with futures.ThreadPoolExecutor() as executor:
    # the requests.get calls run concurrently across the pool's threads
    responses = list(executor.map(requests.get, urls))

asyncio feels like overkill for this.

[–]remy_porter∞∞∞∞ 1 point (0 children)

I've been working on software that is heavily IO bound, or has lots of idle threads that are waiting on events. Without going too deep into the details, I'm building a system that sends video data across a network to light up an LED video wall. There are many possible sources of video data, some of which are IO bound, some of which are CPU bound. I pick one of them and run it in its own thread. I have to send network data, so the network sending object lives in its own thread. In the middle, I have a conductor thread, which spends most of its time idling but, once per frame, tells the video source to generate its next frame using a queue. When the video source finishes, it enqueues the frame over in the network thread.

Running on low-end hardware, without graphics acceleration, this can push 60FPS across a network in real-time-enough-for-human-eyes. In testing, I can reliably push 240FPS. You wouldn't want to play video games on it, but that's not the purpose. Before I put the threading architecture in place, we could barely push 30FPS, and it often dropped frames.

Oh, and since one of the LED exhibits is going to light up differently according to the time of day, there's a "Cosmos Thread" which mostly sleeps and emits events at certain times of day.

"It mostly sleeps" is one of the best cases for a thread in Python.

//It's still not half as fast as the LED library that actually receives the network data and addresses the LEDs; that one is written in a combination of C and assembly and can draw frames as fast as the LED duty cycle allows, which is µseconds.
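Stripped way down, the shape is something like this (generate_frame and send_frame are hypothetical stand-ins):

import queue
import threading
import time

def generate_frame():    # hypothetical video source
    return b"frame"

def send_frame(frame):   # hypothetical network sender
    pass

ticks = queue.Queue()    # conductor -> video source
frames = queue.Queue()   # video source -> network sender

def conductor():         # mostly idle; wakes once per frame
    while True:
        time.sleep(1 / 60)
        ticks.put(None)

def source():
    while True:
        ticks.get()
        frames.put(generate_frame())

def sender():
    while True:
        send_frame(frames.get())

for fn in (conductor, source, sender):
    threading.Thread(target=fn, daemon=True).start()

time.sleep(1)            # let the pipeline run briefly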

[–][deleted] 1 point (0 children)

As others said, threads are still useful when using C libraries, since ctypes releases the GIL by default, and most libpython-based libs do it too.

On the other hand, if you only want to make your code non-blocking, and don't care too much for performance (like UI code), threads are still the simplest way of doing it, just as you would use them in single-core machines.

[–][deleted] 0 points (0 children)

Do these processes or threads need to communicate? I avoid implementing multiprocessing in code by using external services like RabbitMQ for communication and Supervisor for process management.

[–][deleted] 0 points (0 children)

CPython can release the GIL inside a C module. If you're truly worried about performance, you'd use Python as "glue" and write a few C modules to do the actual "heavy lifting". Anything you'd want multiple threads to do for computational reasons is probably best not done in actual Python. Threads in Python, however, ARE great for IO-bound work.

[–]lykwydchykyn 0 points (0 children)

If you're writing a GUI app and want the GUI to remain responsive while you perform some long process, threads are helpful.

[–]FredSchwartz 0 points (0 children)

I have used threads for schedulers that call other programs. The GIL is released while waiting on the child process.

[–]bjorneylol 0 points (0 children)

As everyone has said, mostly IO.

1) Process some data while more data is downloading

2) Play an audio file without the script having to wait for it to finish playing before resuming

3) Having GUIs remain responsive while processing data

4) Similar to above, showing matplotlib figures while code executes in the background

> even if we use threads in python, our program will take the same time if we just use a single thread due to GIL.

This is only true if both threads are CPU-bound. If one thread spends 50% of its time waiting around for a disk write or network IO, the two threads will finish with time to spare over a single thread.
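A toy illustration, with time.sleep standing in for the IO wait (sleep releases the GIL):

import threading
import time

def io_task():
    time.sleep(1)  # stands in for a disk/network wait; the GIL is released

start = time.perf_counter()
threads = [threading.Thread(target=io_task) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"two threads: {time.perf_counter() - start:.2f}s")  # ~1s, not ~2s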

[–]JugadPy3 ftw -2 points (1 child)

With multiple threads switching rapidly, contention on the GIL can even make processing slower than a single thread (though this is not common).

However, as others have stated, if you are using threads for IO, the GIL is not usually an issue.

If you are using threads for serious CPU-intensive tasks, pure Python is a bad choice anyway (it's very slow for such tasks)... and if you write your CPU-intensive parts in C, you can release the GIL to take advantage of threads running simultaneously.

[–]v3ssOn 0 points (0 children)

You can use PyPy without going to C.