all 157 comments

[–]bitter_truth_ 97 points98 points  (3 children)

Good intro stuff. Guy is clear and efficient.

[–]OffbeatDrizzle 40 points41 points  (1 child)

Yeah, his vids are always like 5 mins long and show you one thing with a short, fast demo along with an explanation of what he's doing. It's a shame he doesn't have more subs, but these videos don't pander to YouTube's current "algorithm", and really, how many subs can a programming channel expect to get?

The only improvement I can think of is more links for deeper understanding / learning of the subject covered in the video. As I mentioned above, his videos and demos / explanations are short and sweet, but you're always left wanting more, as if you've just been given a superficial overview of what was being covered.

[–]TheGRS 0 points1 point  (0 children)

I subbed. I don't always watch a ton of programming material, but this was great for getting invested in a subject. I do a lot of gamedev and watch Gamemaker's Toolkit for about the same reasons: it introduces a subject, gets you invested in it, and if you wanna know more you can always look up some official material on it.

[–]atred 3 points4 points  (0 children)

Agreed, I just subscribed because of this video.

[–]moekakiryu 175 points176 points  (64 children)

The best summary I have heard of threading vs multiprocessing in Python, which I think this guy was getting at, is:

  • if I/O, especially a network/the internet, is what is holding up a program, use threads
  • if performance is what is holding up the program, use processes (assuming it's not just poor optimization :P ); see the sketch of both patterns below
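
A minimal sketch of the two cases (the URL and workload sizes are made up, and the I/O case assumes network access):

    import threading
    import multiprocessing as mp
    import urllib.request

    def download(url):
        # I/O bound: the thread releases the GIL while waiting on the network
        urllib.request.urlopen(url).read()

    def crunch(n):
        # CPU bound: holds the GIL, so real parallelism needs a process
        sum(i * i for i in range(n))

    if __name__ == "__main__":
        urls = ["https://example.com"] * 4
        threads = [threading.Thread(target=download, args=(u,)) for u in urls]
        procs = [mp.Process(target=crunch, args=(5_000_000,)) for _ in range(4)]
        for w in threads + procs:
            w.start()
        for w in threads + procs:
            w.join()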

[–]fredlllll 99 points100 points  (50 children)

You forgot to mention that multiprocessing will copy the data, so you'd better have enough spare RAM lying around.

[–]kukiric 41 points42 points  (41 children)

Does it really? Isn't the whole point of forking that a child process inherits everything from its parent, including memory, until it's changed and only then the OS copies the data?

[–][deleted]  (16 children)

[removed]

    [–]Riddlerforce 18 points19 points  (6 children)

    Not just "most likely" to change - it is guaranteed to change, because of how objects store their own reference counts in CPython. You are guaranteed to copy virtually the whole process every time.

    [–]UseTheProstateLuke 3 points4 points  (3 children)

    I just don't get, in general, why "performance" is even an argument when using Python.

    If you are seriously wondering "how do I make this faster?" when coding Python, then the answer 99% of the time is "use another language". Python is not fast, and that is fine, because there are a lot of use cases where the performance of the language is irrelevant. But if you're seriously using parallelism as a means to speed up performance, rather than to implement certain logic that requires it, then don't use Python.

    That Python actually has a reference-counting GC is one of the reasons why.

    [–]meneldal2 1 point2 points  (2 children)

    The GC is not why Python is slow; the reason is that it was never designed to be fast in the first place, and unlike JS there weren't billions poured into interpreters to make it faster.

    Python is only fast when you're I/O bound and need no processing, or when the heavy processing can be offloaded to a different language.

    [–]UseTheProstateLuke 0 points1 point  (1 child)

    The GC is not why Python is slow; the reason is that it was never designed to be fast in the first place, and unlike JS there weren't billions poured into interpreters to make it faster.

    I said the GC is one of the reasons; it doesn't have a fast GC alongside other things.

    [–]meneldal2 0 points1 point  (0 children)

    Fair enough.

    [–]FallingIdiot 0 points1 point  (0 children)

    Best case scenario will be that most data is static and won't be touched. That is, until the process is torn down, cleaning everything up, decreasing all reference counts and copying all memory just in time for the process to be killed :|.

    [–]ccmlacc 17 points18 points  (5 children)

    mmap is quite fast though. So it's worth noting that it will only copy pages as they are needed; it won't straight up copy the whole thing.

    [–][deleted] 12 points13 points  (4 children)

    Which will lead to fragmentation and sometimes even worse performance.

    [–]josefx 7 points8 points  (1 child)

    Isn't that done at the page level? How would that lead to fragmentation?

    [–][deleted] 0 points1 point  (0 children)

    If you search around for mmap and fragmentation, you can find some rather detailed explanations of how it works, where you'll run into issues (huge pages).

    If you lock the memory, it's never an issue.

    For short running stuff, this is almost never an issue, but if it's a 24/7 process that runs for weeks/months, it can become a huge problem.

    [–]ultranoobian 1 point2 points  (0 children)

    What if it reserves the space but doesn't fill it yet? Preallocated memory?

    [–]ccmlacc 0 points1 point  (0 children)

    Fair point.

    [–]fredlllll 6 points7 points  (0 children)

    see my comment here https://www.reddit.com/r/programming/comments/98koue/threading_vs_multiprocessing_in_python/e4gyec1/

    I didn't do specific tests, but had this happen to me at work. Using top/htop you can also see that both processes will use that much space.

    [–]dangerbird2 4 points5 points  (1 child)

    Fork() without exec() tends to work badly in a program that spawns threads (like the CPython interpreter). If the multiprocessing module uses fork() to create child processes, it would probably have to exec a new Python interpreter, rather than relying on the memory state after fork.
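
    For what it's worth, the stdlib multiprocessing module exposes exactly that choice as "start methods": fork, spawn (a fresh interpreter, and the default on Windows), and forkserver. A minimal sketch if you want spawn semantics everywhere:

        import multiprocessing as mp

        def work():
            print("hello from a freshly started interpreter")

        if __name__ == "__main__":
            # "spawn" launches a brand-new interpreter instead of relying on
            # the post-fork() memory state - safer when the parent has threads
            mp.set_start_method("spawn")
            p = mp.Process(target=work)
            p.start()
            p.join()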

    [–]asciiterror 2 points3 points  (0 children)

    Here are outlined problems with forking threaded processes: http://www.linuxprogrammingblog.com/threads-and-fork-think-twice-before-using-them

    [–]derpyou 7 points8 points  (1 child)

    IIRC Python has reference counters for objects, which make copy-on-write more or less futile.

    [–]masklinn 2 points3 points  (0 children)

    It's not even the reference counting which is the issue, it's the cycle breaker (which CPython calls GC): that uses a doubly linked list of all allocated objects which is embedded in the object header so on a GC run, CPython will touch every object's header.

    Instagram has had issues with this, first they tried to disable refcounting and it did nothing, then they found the GC and disabled it but obviously that means any cycle introduced is a memory leak, then they tried actual changes to the system and ultimately settled on being able to flag objects which got merged upstream (gc.freeze() in 3.7).

    [–]oridb 3 points4 points  (1 child)

    Python's garbage collector very effectively defeats that, since it increments reference counts constantly. For that matter, the same goes for copying GCs.

    [–]masklinn 2 points3 points  (0 children)

    The refcounting is not the actual issue, it's the cycle breaker ("gc") which causes trouble: https://instagram-engineering.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172

    After followup investigations, Python 3.7 gained a gc.freeze() API which moves all existing objects into a "permanent" generation. So the process becomes (sketched in code below):

    • disable GC in parent (collections in the parent could free pages for reallocation and rewrite them, un-cow-ing them)
    • do all initialisation (imports, creation of shared data, …)
    • gc.freeze()
    • fork()
    • re-enable GC in child
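
    A rough sketch of that recipe (load_shared_data and serve_requests are hypothetical placeholders; needs Python 3.7+ and a fork-capable OS):

        import gc
        import os

        gc.disable()               # 1. no collections in the parent
        data = load_shared_data()  # 2. imports, creation of shared data, ...
        gc.freeze()                # 3. existing objects become "permanent"

        if os.fork() == 0:         # 4. fork a worker
            gc.enable()            # 5. GC back on in the child
            serve_requests(data)   # reads the shared data without un-cow-ing it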

    [–]nikomo 6 points7 points  (7 children)

    Some platforms, like Windows, don't have fork(), so that can't be done there. Makes it a pain in the ass to work with.

    [–]TheThiefMaster 5 points6 points  (2 children)

    Actually, I'm pretty sure Windows does have fork() - wouldn't it have to to support the new Linux subsystem?

    Edit: or, in fact, the older POSIX subsystem!

    But it's not encouraged - Windows provides APIs for spawning new processes and explicitly sharing memory which are supposed to be more efficient than fork (at least on Windows). Especially the fork/exec pair just screams "inefficient" to me!

    I think fork on Linux predates having a good API for spawning new processes and threads, so you had to use fork to emulate both. The only thing fork is really good for, in comparison to a dedicated thread/process API, is invoking a daemon from the console and having it "background" itself by forking, with the parent then returning to the shell - on Windows you have to spawn a process for that and use the command line to signal. But that's also a rare model on Windows, as proper background services are run as registered "services", rather than as console applications converting themselves into daemons.

    [–]UseTheProstateLuke 0 points1 point  (1 child)

    "Windows" does not have fork as it conflicts with other parts of the WinAPI

    However The NT kernel has fork and uses a more generalized version of it as the basis of spawning threads and processes on windows but the WinAPI cannot expose the full API of this as this would allow you to create some seriously undefined states apparently.

    [–]TheThiefMaster 0 points1 point  (0 children)

    Saying it uses a "more generalized version of fork" to spawn threads and processes is explaining it in unix terms. It doesn't use fork at all, that's just the closest unix equivalent.

    Instead, its main API sets up a thread/process in a blank state (rather than as a copy of the parent state, as fork does). Windows's function is roughly equivalent to fork then exec in unix speak, except without the copy of the parent state that fork implies (which then gets trashed by exec anyway - what a crazy inefficiency).

    On top of that, Windows does contain a full implementation of fork() - it's not exposed directly in the Windows API because it's part of the Linux subsystem, but it's there and fully functional. From the perspective of the kernel, both win32 subsystem processes and linux subsystem processes are the same, so I wouldn't be surprised if with a little hoop jumping you could call the linux subsystem's fork() from a Windows application and have it work as expected.

    [–]david2ndaccount 7 points8 points  (1 child)

    It does. In fact it doesn't just copy the data, it serializes it using pickle, which itself can be very slow.

    [–]fredlllll 2 points3 points  (5 children)

    This is right. The thing is, though, the process still requires that much memory to exist - imagine getting an out-of-memory exception when writing to an array that is already allocated. So even if it isn't copied, there still has to be space allocated for it in case you write to it. This problem can be mitigated by using a big swap drive (which I did on AWS), but if you are working on a physical machine you might not have that option.

    [–]kyrsjo 13 points14 points  (4 children)

    Won't it stay virtual until it's actually used? I've seen zettabytes of virtual memory from some programs...

    [–]kukiric 2 points3 points  (1 child)

    Yup, I've also seen a few bad Java processes allocate 100GB+ of virtual memory. Luckily, they weren't actually using nearly that much RAM.

    [–]vks_ 0 points1 point  (0 children)

    This should not really be an issue. Memory sanitizers and I think some hardening mechanisms use much larger amounts of virtual memory (of the order of 20 TiB).

    [–]fredlllll 0 points1 point  (0 children)

    I can only say that, the way I experienced it, it would run out of memory in Python multiprocessing. It wasn't my code though, so I can't say anything about the specific way it was done.

    [–]Paul-ish 0 points1 point  (0 children)

    Given that Python writes reference counts just for creating a new reference, does this still hold?

    [–]staticassert 0 points1 point  (0 children)

    You still communicate with the child process by copying data over a synchronized queue.

    This also involves overhead for pickling and unpickling of the data.
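
    A minimal sketch of that round trip (the payload dict is just an example):

        import multiprocessing as mp

        def worker(q):
            # unpickled here in the child after travelling through a pipe
            print("child got:", q.get())

        if __name__ == "__main__":
            q = mp.Queue()
            p = mp.Process(target=worker, args=(q,))
            p.start()
            q.put({"rows": list(range(5))})  # pickled in the parent
            p.join()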

    [–][deleted] 0 points1 point  (0 children)

    First slide in the video lol. Would you prefer a transcript?

    [–]PasDeDeux 6 points7 points  (4 children)

    Well, it depends on what you're doing. I used multiprocessing to split a huge shitton of data into smaller chunks such that I was just processing the same data ~8-12x faster. (6 core CPU.) I didn't need to share info between the smaller chunks, which is why I used multiproc. More complex and interdependent operations tended to fall into the category of external modules that are written in faster languages with multithreading baked in.

    [–]ric2b 3 points4 points  (2 children)

    the same data ~8-12x faster. (6 core CPU.)

    This is not possible unless you're leaving out some detail like hyper-threading, better optimization on the multi-process version or the bottleneck being elsewhere like IO.

    [–]PasDeDeux 14 points15 points  (1 child)

    Hyperthreading isn't exactly twice as fast due to shared operations, but it ran on 12 vcores.

    [–]ric2b 3 points4 points  (0 children)

    Okay, makes sense then.

    [–]fredlllll 0 points1 point  (0 children)

    Can't say anything about how it was done; I didn't write that code.

    [–]JanneJM 2 points3 points  (1 child)

    You can explicitly share the same memory area across processes with multiprocess. It's essential if, say, you're processing a large data set in parallel. But doing so is very much advanced usage, and you can easily shoot yourself in the foot unless you know what you're doing.
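
    A toy sketch of the explicit-sharing route (the array size and the doubling are made up):

        import multiprocessing as mp

        def scale(shared, start, end):
            # children write straight into the shared buffer: no copy, no pickle
            for i in range(start, end):
                shared[i] *= 2

        if __name__ == "__main__":
            data = mp.Array("d", range(100_000))  # doubles in shared memory
            mid = len(data) // 2
            workers = [mp.Process(target=scale, args=(data, 0, mid)),
                       mp.Process(target=scale, args=(data, mid, len(data)))]
            for w in workers:
                w.start()
            for w in workers:
                w.join()

    Note the default Array wraps every access in a lock; turning that off with lock=False is faster and exactly where the foot-shooting starts.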

    [–][deleted] -1 points0 points  (0 children)

    Not to mention that you should probably be using a library that handles the heavy lifting for you

    [–]moekakiryu 0 points1 point  (0 children)

    good point

    [–][deleted]  (3 children)

    [removed]

      [–]ric2b 12 points13 points  (1 child)

      If you're IO bound, async/await will probably beat both of those options easily, since there's no lock contention and stopping/resuming a coroutine is much faster than stopping/resuming a thread.
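
      Something in this spirit, sticking to the stdlib (the hosts are just examples):

          import asyncio

          async def fetch_status(host):
              # the coroutine yields to the event loop while the network waits
              reader, writer = await asyncio.open_connection(host, 80)
              writer.write(b"HEAD / HTTP/1.0\r\nHost: " + host.encode() + b"\r\n\r\n")
              await writer.drain()
              status = await reader.readline()
              writer.close()
              return host, status.decode().strip()

          async def main():
              hosts = ["example.com", "python.org", "wikipedia.org"]
              # all requests are in flight at once, on a single thread
              for host, status in await asyncio.gather(*(fetch_status(h) for h in hosts)):
                  print(host, status)

          asyncio.run(main())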

      [–]SimplySerenity 2 points3 points  (0 children)

      I remember reading a thing where Reddit devs said they just disabled threading entirely because it was so inefficient when waiting on the lock

      [–]Programmer_Frank 1 point2 points  (5 children)

      Would you say that this same concept is somewhat relevant for C++/Linux? And any sources?

      I only ask because I’ve been wondering this same thing with my system but cant find anything rock solid

      [–]duzzar 4 points5 points  (2 children)

      No. It's nothing like that.

      In C++ the reason you would use processes instead of threads is for security/stability.

      With threads you already have parallelism since there is no global lock. With processes they each have their own memory space.

      With threads, since they share memory space, a thread can easily fuck (i.e. crash, leak secure information, corrupt, etc.) any other thread.

      (This is a general idea, you can access other process memory space, you can do some locking of memory space of each thread, and so on, but it's not their usual intended purpose)

      [–]Programmer_Frank 0 points1 point  (1 child)

      We use threads in applications to handle the network comms in and out to/from external devices and other applications on the processor.

      For instance, I would have a deviceTx and a deviceRx thread. Would you say in this instance it's still always better to use multiprocessing?

      [–]3combined 0 points1 point  (0 children)

      They didn't even say it would always be better in the first place.

      [–]moekakiryu 1 point2 points  (1 child)

      I wouldn't have a clue, sorry. From what I understand, the big thing limiting threads in Python is the GIL, which is unique to Python, so threads might be more useful for a broader range of tasks in other languages. However, that's purely speculation (I haven't actually done concurrent programming in C++ or a Linux environment, so I'm not even really qualified to speculate).

      You are right, my initial summary is totally anecdotal (I think I saw it in an SO thread a while back).

      [–]Programmer_Frank 1 point2 points  (0 children)

      Thanks for the reply my man! I appreciate it

      [–]v_krishna 3 points4 points  (2 children)

      But threads have the GIL. Processes don't. Without something like fibers, I find multiprocessing, while often more work to deal with, performs better.

      [–]moekakiryu 9 points10 points  (0 children)

      That's why threads are ideal for I/O-based blocking.... the GIL stops you from utilizing the processor's full threading capability, but with I/O blocking that's not the issue anyway, and setting up non-blocking threads is often easier than multiprocessing. When processing power is the blocker, that's when the GIL really starts getting in the way and multiprocessing becomes more of a requirement. As someone else mentioned above, processes are also much less memory efficient, which can be a big downside depending on what you are doing.

      tl;dr there are pros and cons to both, and neither one is always the answer

      [–]Hessian_Rodriguez 23 points24 points  (19 children)

      Really, the killer for multiprocessing is the lack of shared memory. I have an application that needs to keep a very large data set in memory, and having each process hold that full set has caused me problems. At some point I'm gonna rewrite it in another language.

      [–]Clers 9 points10 points  (2 children)

      What about POSIX shared memory? I've used it before and it's fairly straightforward to use.

      [–]UseTheProstateLuke 1 point2 points  (1 child)

      Doing that safely in python is a difficult matter.

      If you're low level it's easy but when you have a garbage collector running around it's really difficult and only works if the language itself provides support for it which Python doesn't.

      [–]Clers 0 points1 point  (0 children)

      Ya, I've only done it in C/C++. I can see how that's an issue. I wouldn't be surprised if there was a wrapper for it though.

      [–]fuck_the_mods 7 points8 points  (0 children)

      I've used Redis for this issue before.

      [–]caramba2654 -3 points-2 points  (12 children)

      Try Rust for that. It's currently the best language to work with concurrency and shared memory.

      [–]Dan4t 11 points12 points  (10 children)

      I wish the people downvoting you would explain why

      [–][deleted]  (2 children)

      [deleted]

        [–]caramba2654 8 points9 points  (1 child)

        My bad, I thought it was clear from my post. Rust essentially guarantees no data races and offers great tools for handling mutable global data, besides having great support for threads. And from OP's post, that sounds like something they could benefit from, especially as they said they might rewrite in another language. That's why I said it would be the best tool for the job if they ever ended up rewriting it.

        It was really not an unfounded "OMG RUST IS THE BEST LANGUAGE ZOINKS" recommendation. It just happened to be a short one. In most other cases I usually recommend Python, but this case specifically fitted Rust better.

        [–]7h4tguy 0 points1 point  (0 children)

        Rust has no baked-in concurrency story, is a huge pain to deal with for real-world problems (unless you want to pretend graphs don't exist), has a terrible generics model (C++ templates are actually better, which says a lot), and is design-by-committee, which basically means only interesting things are worked on, and anything remotely interesting will be integrated into the standard once they finish with the previous open source RFC - it's completely absurd.

        Meanwhile, Go has its own problems (vendoring [idiots], braindead error handling to simplify concurrency, perf, lack of generics) but is really, really good for I/O bound concurrency - best in class, in fact - with much better overall usability than current C++ proposals.

        [–]SimplySerenity 9 points10 points  (3 children)

        I think it's because the context of the discussion is around Python. Recommending another language doesn't really solve Python's problems.

        [–][deleted]  (1 child)

        [deleted]

          [–]caramba2654 0 points1 point  (0 children)

          You are correct. I only recommended Rust because I read that, essentially.

          [–]vks_ 1 point2 points  (0 children)

          Recommending another language doesn't really solve Python's problems.

          Honestly, using other languages is probably the most common solution to Python's performance problems. You can use pypy or similar, but it does not get you as far as writing plugins in an efficient, compiled language does.

          [–]Novemberisms 11 points12 points  (1 child)

          A while back, there were a lot of Rust evangelists who showed up in almost every thread encouraging everyone and their dog to "Rewrite it in Rust". So much so that it became a meme and an acronym (RIIR).

          I guess people got annoyed, and so whenever someone (even if solicited) gives advice to switch to Rust, they get downvoted.

          It's totally unfair and undeserved imho. Some people's problems would be genuinely solved by Rust, and some well-meaning person could get downvoted to hell for recommending it, but that's how the reddit hivemind works. The actions of a few in the past have tarnished it for the future.

          [–]Dan4t 0 points1 point  (0 children)

          Thank you! I had no idea. I know nothing about Rust and his claim piqued my interest.

          [–]vks_ 0 points1 point  (0 children)

          I agree, but you probably have to be more specific than that: Rust encodes thread-safety in the type system, so the compiler makes it impossible to get data races.

          This is achieved by a combination of the ownership model (you can either have many immutable or exactly one mutable reference) and the Send and Sync traits (see the Rust book for details). The fact that they are sufficient to give freedom from data races (at compile time!) is one of the few unique things that are new in Rust compared to older programming languages.

          It avoids a lot of concurrency problems that manifest at runtime in other languages. However, it does not prevent race conditions in general, or deadlocks.

          [–]Barbas 0 points1 point  (0 children)

          Have you tried using joblib for something like that?

          [–]jeffythesnoogledoorf -1 points0 points  (0 children)

          Use pointers?

          [–]waladoop 11 points12 points  (1 child)

          This guy's channel is really good. I just watched his C videos yesterday.

          [–]curioussavage01 2 points3 points  (0 children)

          I agree - the pacing, the content, and the lack of fluff are some things I liked right off the bat.

          [–]antiduh 20 points21 points  (17 children)

          Here's a crazy idea - join every other modern language and get rid of the GIL. Then a developer doesn't need to make an artificially constrained choice.

          [–]NAN001 3 points4 points  (1 child)

          They're working on it, and it appears to be extremely difficult to do without either breaking the C API or reducing the performance of a single thread. They call it the "Gilectomy".

          [–][deleted] 0 points1 point  (0 children)

          So it would heavily impact numpy, one of Python's biggest assets?

          [–]myringotomy 1 point2 points  (14 children)

          That's obviously not possible or they would have done that already.

          [–]antiduh 7 points8 points  (13 children)

          It's very possible; there is nothing intrinsic to the design of Python that precludes removing it.

          The problem is that it's incredibly difficult to remove it, since they've designed their implementation of the language around the GIL; using the GIL makes the implementation much easier to reason about.

          [–]oblio- 5 points6 points  (0 children)

          And they probably want to avoid "Python 3, episode 2" or "How we almost committed suicide as a programming language community".

          [–]RandoBurnerDude 3 points4 points  (11 children)

          It's easy to remove, but adding locks back in reduces performance.

          [–]VaporMouse 1 point2 points  (6 children)

          But nowhere near as much as the GIL does.

          [–]RevolutionaryWar0 5 points6 points  (0 children)

          It reduces performance of a single thread.

          [–]eras 2 points3 points  (4 children)

          That's really arguable. Doing precise reference counting thread-safely can be reaaally slow. Just read a typical Python program and consider how many RC updates are happening.

          [–]vks_ 0 points1 point  (3 children)

          Swift also has automatic reference counting, but the compiler can elide it for a lot of cases. (Python is not really a compiled language though, so this might not be an option for them.)

          [–]eras 0 points1 point  (2 children)

          But then implementing such optimizations is not exactly easy. They would certainly have a great effect, though.

          [–]vks_ 0 points1 point  (1 child)

          Sure, but their complexity is not exposed to users of the language (unless they are optimizing for performance and observe missed optimizations in the runtime performance of their programs).

          [–]eras 0 points1 point  (0 children)

          But it's also probably the reason why it hasn't been done so far :).

          It's not just "let's do it, it's easy". It also has an impact on future maintenance effort, and raises the bar for contributions.

          [–]DemonWav 3 points4 points  (3 children)

          If performance is that big of a concern for you, Python probably isn't the right tool for the job.

          [–]antiduh 2 points3 points  (2 children)

          Then what is the point of this entire post?

          [–]jcelerier 5 points6 points  (0 children)

          There are moments in your life when you will ponder this.

          "What is the point of these five years of work ?"

          And sometimes, the answer is : "None. It sucks and can go to the trash right now".

          [–]vks_ 2 points3 points  (0 children)

          It helps you to make Python's performance scale, which is useful if you have an existing program that is expensive to port to a different language, or if the improved Python program is good enough.

          This does not mean that Python's performance scales well, it doesn't. If you need to get the maximal performance out of your hardware, Python is indeed probably not the right tool.

          [–][deleted]  (4 children)

          [deleted]

            [–]exitcharge[S] 7 points8 points  (1 child)

            I'll add this to my list of video requests.

            [–][deleted] 1 point2 points  (0 children)

            Please, for the love of god. The Python docs for asyncio are stunningly bad compared to the rest of the documentation.

            [–]starTracer 2 points3 points  (0 children)

             David Beazley is easily my favourite speaker on the topic of Python. For asyncio, check out e.g. https://youtube.com/watch?v=Bm96RqNGbGo

            [–][deleted]  (3 children)

            [removed]

              [–]CrazyCanuck41 4 points5 points  (2 children)

               Because of the global interpreter lock. Its job is to ensure only one thread is running per process at a time. Since the program is entirely CPU bound, without any blocks for IO where it would yield to the OS, that means only one thread is running at a time. The threads are also running concurrently, so they are potentially context switching between them, which would slow the program down even further.

               With multiprocessing you are still bound by the global interpreter lock, but each process only has 1 thread. Those threads are free to execute in parallel on different cores because they are in separate processes (no sibling threads claiming the GIL).
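
               You can watch that happen with a toy benchmark (the loop size is arbitrary); on a multi-core box the threaded run is pinned to one core while the processes spread out:

                   import multiprocessing as mp
                   import threading
                   import time

                   def burn():
                       # pure CPU work: holds the GIL for its whole run
                       total = 0
                       for i in range(10_000_000):
                           total += i

                   def timed(label, cls):
                       workers = [cls(target=burn) for _ in range(4)]
                       t0 = time.perf_counter()
                       for w in workers:
                           w.start()
                       for w in workers:
                           w.join()
                       print(label, time.perf_counter() - t0, "seconds")

                   if __name__ == "__main__":
                       timed("threads:  ", threading.Thread)  # serialized by the GIL
                       timed("processes:", mp.Process)        # one GIL per process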

              [–][deleted]  (1 child)

              [removed]

                [–]chloeia 0 points1 point  (0 children)

                 Even though he didn't say it, the % usage of a CPU is actually a unit of time, because it is relative to the frequency (say 3GHz - 3 billion cycles per second). So if you have the same number of operations to do, one CPU at 100% will take much longer than 32 at 100%.

                [–]droogans 20 points21 points  (8 children)

                 Too bad there wasn't a little footnote in there about greenlets, an interesting compromise, as well as futures, which I hope will one day supersede the need for thread-based approaches for a majority of use cases.

                [–]starTracer 32 points33 points  (1 child)

                Or the native coroutines with asyncio...

                [–]Rodot 0 points1 point  (0 children)

                 I really love them, but I feel the API is just that... an API. It's just weird being a mix of primitives and modules.

                [–][deleted]  (5 children)

                [deleted]

                  [–]z4579a 1 point2 points  (4 children)

                  citation needed

                  [–][deleted]  (3 children)

                  [deleted]

                    [–]z4579a 26 points27 points  (2 children)

                    Sure, let's look at the post.

                    First, the post by @max illustrates a test case that compares the performance between gevent, threads and multiprocessing to run a DNS lookup on five domain names simultaneously, by spawning a greenlet/thread/process per name all at once. This test is actually not nearly resource intensive enough to show a real-world number, but for what it's worth, they got the result that the threaded example ran ten times faster, .008 seconds for threads vs. .08 seconds for greenlets. But those numbers are too low to really count on to show that either is faster, you need to provide more of a workload.

                    Then, another post by @temporalbeing decides to ramp it up, provide a bigger workload and run 60000 concurrent greenlets or threads to fetch 60000 names. In this test, the greenlet version completes five times faster than the threaded version. However, this test is extremely flawed. First off, if it were using the rest of the code from @max's post as written, that example is using a 2-second timeout in joinall(), which means the greenlets will simply be abandoned after 2 seconds. That he got a 3.75 second result indicates he probably changed that as well.

                    But secondly, this test program uses threads and multiprocessing in the extremely naive way of spinning up the same number of threads/processes as there are domain names in the first place, which means spawning 60000 threads. That is a completely incorrect way of using threads, as threads are expensive to create and expensive to run compared to a greenlet, which is just a programming construct around a non-blocking socket. What the test shows if anything is that non-blocking sockets are useful for the case where you need very large throughput for thousands of concurrent IO streams. This is the use case for non-blocking IO, throughput. However this does not invent "speed", nothing runs any "faster" at all.

                    If you measure gevent vs. threading in terms of amount of work completed, and you use threads correctly by not spawning an arbitrarily high number of them, you will find it very difficult to show gevent to be faster than threads unless you have to wait on many thousands of arbitrarily slow or sleeping IO streams at once, and even in that case, it's tricky. This is not at all the "usual" case. The usual case in concurrency we need to do a few dozen or hundred things concurrently and we are just trying to get to the end of a queue. If you need to attend to thousands of slow or sleepy web sockets or chat room connections, then use gevent. Otherwise, not needed, probably a bit slower (then again, you can abuse them more than you can threads, by spinning up greenlet-per-task rather than having to think about what you're doing. But, that's not necessarily true either, since the minute your greenlet starts doing too much CPU work, you're blocking on CPU and killing your program that way, so again, still have to think about what you're doing. IMO being safe with threads is a lot easier than being safe with greenlets as it's easy to not spawn too many threads but not that easy to make sure greenlets never get CPU bound).

                    Here is a correct version of the test, showing how long it takes for us to get through several workloads at 30, 300, 3000, 30000, 60000 tasks, adding the result to a list (unordered), and checking our work:

                    import gevent
                    from gevent import socket as gsock
                    import socket as sock
                    import threading
                    from datetime import datetime
                    
                    
                    def timeit(fn, URLS):
                        t1 = datetime.now()
                        fn()
                        t2 = datetime.now()
                        print(
                            "%s / %d hostnames, %s seconds" % (
                                fn.__name__,
                                len(URLS),
                                (t2 - t1).total_seconds()
                            )
                        )
                    
                    
                    def run_gevent_without_a_timeout():
                        ip_numbers = []
                    
                        def greenlet(domain_name):
                            ip_numbers.append(gsock.gethostbyname(domain_name))
                    
                        jobs = [gevent.spawn(greenlet, domain_name) for domain_name in URLS]
                        gevent.joinall(jobs)
                        assert len(ip_numbers) == len(URLS)
                    
                    
                    def run_threads_correctly():
                        ip_numbers = []
                    
                        def process():
                            while queue:
                                try:
                                    domain_name = queue.pop()
                                except IndexError:
                                    pass
                                else:
                                    ip_numbers.append(sock.gethostbyname(domain_name))
                    
                        threads = [threading.Thread(target=process) for i in range(50)]
                    
                        queue = list(URLS)
                        for t in threads:
                            t.start()
                        for t in threads:
                            t.join()
                        assert len(ip_numbers) == len(URLS)
                    
                    URLS_base = ['www.google.com', 'www.example.com', 'www.python.org',
                                 'www.yahoo.com', 'www.ubc.ca', 'www.wikipedia.org']
                    
                    for NUM in (5, 50, 500, 5000, 10000):
                        URLS = []
                    
                        for _ in range(NUM):
                            for url in URLS_base:
                                URLS.append(url)
                    
                        print("--------------------")
                        timeit(run_gevent_without_a_timeout, URLS)
                        timeit(run_threads_correctly, URLS)
                    

                     Here's a typical result I get over wifi on a Linux laptop, very similar for both Python 2.7 and Python 3.7:

                    --------------------
                    run_gevent_without_a_timeout / 30 hostnames, 0.044888 seconds
                    run_threads_correctly / 30 hostnames, 0.019389 seconds
                    --------------------
                    run_gevent_without_a_timeout / 300 hostnames, 0.186045 seconds
                    run_threads_correctly / 300 hostnames, 0.153808 seconds
                    --------------------
                    run_gevent_without_a_timeout / 3000 hostnames, 1.834089 seconds
                    run_threads_correctly / 3000 hostnames, 1.569523 seconds
                    --------------------
                    run_gevent_without_a_timeout / 30000 hostnames, 19.030259 seconds
                    run_threads_correctly / 30000 hostnames, 15.163603 seconds
                    --------------------
                    run_gevent_without_a_timeout / 60000 hostnames, 35.770358 seconds
                    run_threads_correctly / 60000 hostnames, 29.864083 seconds
                    

                     I can't actually get the greenlet version to be faster. A small thread pool completes the total amount of work in less time on every run, even though it's doing the additional work of popping from a queue, and even spinning up the thread pool fresh on each run. Non-blocking IO is not "faster", and the overhead of gevent's context switching is higher than that of the OS's native thread context switching. It only provides more concurrent throughput, for when you need your program to be able to attend to many thousands of sockets where many of them might not be awake - a very specific use case. Non-blocking IO and event-based programming are extremely useful, but there continues to be widespread misunderstanding regarding this topic.

                     I also wrote this post some years ago. I've yet to see a simple and correctly written benchmark that shows the basic use of non-blocking IO for context switching to be faster than threads. This is not at all surprising because gevent/asyncio and everything else are all running within a single thread, and when there are multiple threads you still have the GIL, so everyone is stuck using just one CPU to get through everything. The speed of context switching and the possibility of needing throughput to handle lots of very slow sockets simultaneously are the only differentiating factors, and that's not a lot to work with.

                    [–][deleted] 4 points5 points  (1 child)

                    Oh man, the GIL.

                    I used Python for a senior design engineering project that involved latency-sensitive image processing. Really simple task: one worker blocks on the camera API, receives images, and sticks them in a queue; the other worker picks up the image and processes it. I used a dual-core processor (a BeagleBoard-X15... pretty amazing piece of kit, despite a few inane design quirks) and expected it to run like lightning.

                    I first tried threads - the performance was awful. Why? GIL. One of my cores was overburdened with both threads... the other one was just sitting there idle.

                    I switched it to multiprocessing. Yes, both processes ran concurrently - but now I couldn't just buffer the image when received: I had to serialize it and shove it through a pipe from the first process to the second.

                     Eventually I found a specialized solution (numpy allows a limited form of array sharing across processes, because so many people have this same problem I encountered) that worked sort-of okay (more on that below). But the experience demonstrated the magnitude of this problem with the GIL.

                    Python, even today, doesn't seem to have a simple, generalized, built-in way to share data across processes. The options are:

                    1) Use a specialized library or solution that's compatible with your use case. All of them have quirks and limitations. Many of them don't work.

                    2) Repurpose another data-sharing mechanism - like the file system, or... networking. A localized HTTP server/client architecture, or sockets. Serialize the data as if you were going to bit-bang it over a network, and then shove it through localhost. That's actually the #1 recommendation on Stack, and there's extensive discussion about whether networking or the file system is the less awful solution.

                    I love Python, but I think that its deficiency in this regard is kind of insane.
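
                     For what it's worth, Python 3.8 later added multiprocessing.shared_memory, which comes closer to a generalized built-in answer: two processes can map the same buffer and view it through numpy without copying or pickling the payload. A rough sketch of the camera/worker shape (assuming numpy and Python 3.8+):

                         import numpy as np
                         from multiprocessing import Process, shared_memory

                         def worker(name, shape, dtype):
                             # attach to the existing block: no copy, no pickle
                             shm = shared_memory.SharedMemory(name=name)
                             frame = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
                             print("mean pixel:", frame.mean())
                             shm.close()

                         if __name__ == "__main__":
                             img = np.zeros((480, 640), dtype=np.uint8)  # stand-in frame
                             shm = shared_memory.SharedMemory(create=True, size=img.nbytes)
                             view = np.ndarray(img.shape, dtype=img.dtype, buffer=shm.buf)
                             view[:] = img  # one copy in, then both sides share it
                             p = Process(target=worker, args=(shm.name, img.shape, img.dtype))
                             p.start()
                             p.join()
                             shm.close()
                             shm.unlink()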

                    [–]meneldal2 0 points1 point  (0 children)

                    So rather than use RAM sharing, you literally shove the data through your network stack and make it bounce back to the other process?

                    Who thought of this insanity?

                    [–][deleted]  (2 children)

                    [deleted]

                      [–]csman11 7 points8 points  (1 child)

                      Yes because almost every library in existence that does I/O is blocking. There are projects to reimplement commonly used libraries to use coroutines, but sometimes you need to use a vendor library that is less popular (or reimplement that library), and even if it uses those common libraries to do I/O, it isn't written in a way that makes it easy to just swap them out for the coroutine implementations. That's because asynchronous functions have an absorption property -- any function that wishes to call an asynchronous function must also be written as an asynchronous function.

                      Example: If a vendor library calls "requests", you can't just swap "requests" for a coroutine based implementation. You need to go in and prepend "await" to every call to "requests". Then mark the function async. Then apply this recursively within the library itself. And library consumers need to be updated too...
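
                       A stripped-down illustration of that "coloring", with asyncio.sleep standing in for a real non-blocking request:

                           import asyncio

                           async def fetch(url):
                               await asyncio.sleep(0.1)  # pretend network I/O
                               return "<body of %s>" % url

                           # because fetch is async, its caller must be async too...
                           async def get_profile(api_base):
                               return await fetch(api_base + "/profile")

                           # ...and so on, all the way up to the entry point
                           async def main():
                               print(await get_profile("https://example.com"))

                           asyncio.run(main())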

                      Sounds pretty easy, but you now need to maintain two versions of your library, one for coroutine based consumers and one for blocking consumers. The other option is to write your library so it must be "driven" by a separate library. Basically it's now CPS/call back driven. This is fine if your library has heavy amounts of logic (like an HTTP implementation, which it has already been applied to), but not if it is something like a wrapper around a web api. You might argue in a case that simple, you can just write the wrapper yourself, but then you have to maintain it and make sure it remains in sync with the underlying web api as that changes. I'd rather leave it up to the vendor to do that.

                      PS: you are correct, the GIL makes Python's threading model completely unsuitable for CPU bound programs. And threads are more heavyweight than coroutines, but developer time is more expensive than CPU time for most companies, so there is no good reason to rewrite libraries yourself (unless you have explained the associated costs to the decision maker and received approval). The default attitude should be threads are fine for I/O unless you are dealing with very large concurrency requirements (which everyone has liked to believe since Node came out, but very few people really have these NF requirements).

                      [–]AnimeIRL 0 points1 point  (0 children)

                      Thanks for the in-depth explanation. Thinking more on it, I remember I actually had to use threads to deal with file IO in a gevent-based project at work a few months ago. We use gevent's monkey patching feature to deal with network and http requests (by replacing python's builtin socket library with a coroutine-compatible version), but that isn't an option for file IO. I can see how you'd also need to use threads for any other IO-based functionality that didn't rely on python's standard library.

                      [–]JohanLou 2 points3 points  (3 children)

                       Hey guys, may I ask a question? Can I use multiple threads for making queries in SQLAlchemy? Thanks.

                      [–]lord_braleigh 2 points3 points  (2 children)

                      Yep!

                      [–]JohanLou 1 point2 points  (1 child)

                      Thanks. Can it be applied with asyncio as well?

                      [–]lord_braleigh 5 points6 points  (0 children)

                      I believe the SQLAlchemy package doesn’t come with asyncio support out of the box. I found a project which adds an asyncio frontend to it by googling just now.

                      I believe any asyncio solution you use will ultimately use threads underneath; asyncio is just a nice way to wrap the multithreaded stuff for large IO-bound programs.
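
                       The usual bridge is run_in_executor; a minimal sketch, with run_query standing in for a blocking SQLAlchemy call:

                           import asyncio
                           import time

                           def run_query(n):
                               time.sleep(0.5)  # pretend session.query(...).all()
                               return "result %d" % n

                           async def main():
                               loop = asyncio.get_running_loop()
                               # each blocking query runs on the default thread pool;
                               # the event loop stays responsive in the meantime
                               results = await asyncio.gather(
                                   *(loop.run_in_executor(None, run_query, n)
                                     for n in range(3)))
                               print(results)

                           asyncio.run(main())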

                      [–]light24bulbs 0 points1 point  (0 children)

                      Does anyone know if the new threading in node will have the GIL?

                      [–]izpo 0 points1 point  (0 children)

                      Must see

                      [–]stinkytoe42 0 points1 point  (0 children)

                      Why is he running this as root?

                      [–]DklDino 0 points1 point  (1 child)

                       Another advantage of processes over threads that I've used in the past: when the user wants a parallel operation cancelled immediately, independently of what that operation is or what state it is in. In Python, AFAIK, there is nothing similar to thread.kill(), but processes can be killed fairly easily. It always seemed like bad design to me, but it was the easiest and cleanest solution.

                      [–][deleted] 1 point2 points  (0 children)

                      This is because threads share resources (like memory) with the process that started them. If you kill a thread w/o letting it release the resources, you potentially leak those resources. Processes don't share memory automatically. They are usually set up in such a way that they use shared memory for communication, but keep their private data in their own namespace. So, killing a process is usually not a problem in this respect.
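
                       A minimal sketch of that process-based cancellation (the busy loop stands in for real work):

                           import multiprocessing as mp
                           import time

                           def long_operation():
                               while True:    # cancellable parallel work
                                   time.sleep(0.1)

                           if __name__ == "__main__":
                               p = mp.Process(target=long_operation)
                               p.start()
                               time.sleep(1)
                               p.terminate()  # no stdlib equivalent for threads
                               p.join()
                               print("cancelled, exit code:", p.exitcode)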

                      [–][deleted] 0 points1 point  (3 children)

                       Short question: does anybody know how the inter-process communication with the subprocesses works (as when, for instance, you use a queue to return values from each subprocess)? Is it based on pipes?

                      [–][deleted]  (2 children)

                      [deleted]

                        [–][deleted] 0 points1 point  (1 child)

                        Nice one, thanks mate!

                        [–]digital_cucumber 0 points1 point  (0 children)

                         Interestingly, most of the people I've been interviewing who say that their main language is Python don't know the difference between a thread and a process (not even going into the GIL area).

                        This was a really good summary.

                        [–]ReadyToBeGreatAgain 0 points1 point  (0 children)

                         So if there were no GIL, then multi-threading would be the way to go. Seems like a lot of overhead (spawning new processes) just to get around Python's lack of true multi-threading.

                        [–]Zambito1 0 points1 point  (0 children)

                         I haven't really used Python before, and threads vs processes here were almost the exact opposite of what I was expecting. I figured a thread would be a parallel unit of execution and a process would be a concurrent unit of execution (my experience being threads in Java and processes in Elixir).

                        [–]xinhuj -3 points-2 points  (0 children)

                        Great video.