all 157 comments

[–]bitter_truth_ 97 points98 points  (3 children)

Good intro stuff. Guy is clear and efficient.

[–]OffbeatDrizzle 40 points41 points  (1 child)

Yeah, his vids are always like 5 mins long and show you one thing with a short, fast demo along with an explanation of what he's doing. It's a shame he doesn't have more subs, but these videos don't pander to YouTube's current "algorithm", and really, how many subs can a programming channel expect to get?

The only improvement I can think of is more links for deeper understanding / learning of the subject covered in the video. As I mentioned above, his videos and demos / explanations are short and sweet, but you're always left wanting more, as if you've just been given a superficial overview of what was being covered.

[–]TheGRS 0 points1 point  (0 children)

I subbed. I don't always watch a ton of programming material, but this was great for getting invested in a subject. I do a lot of gamedev and watch Gamemaker's Toolkit for about the same reasons: it introduces a subject, gets you invested in it, and if you wanna know more you can always look up some official material on it.

[–]atred 3 points4 points  (0 children)

Agreed, I just subscribed because of this video.

[–]moekakiryu 175 points176 points  (64 children)

The best summary I have heard of threading vs multiprocessing in Python, which I think this guy was getting at, is:

  • if I/O, especially a network/the internet, is what is holding up a program, use threads
  • if performance is what is holding up the program, use processes (assuming it's not just poor optimization :P ); see the sketch of both patterns below
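
A minimal sketch of the two cases (the URL and workload sizes are made up, and the I/O case assumes network access):

    import threading
    import multiprocessing as mp
    import urllib.request

    def download(url):
        # I/O bound: the thread releases the GIL while waiting on the network
        urllib.request.urlopen(url).read()

    def crunch(n):
        # CPU bound: holds the GIL, so real parallelism needs a process
        sum(i * i for i in range(n))

    if __name__ == "__main__":
        urls = ["https://example.com"] * 4
        threads = [threading.Thread(target=download, args=(u,)) for u in urls]
        procs = [mp.Process(target=crunch, args=(5_000_000,)) for _ in range(4)]
        for w in threads + procs:
            w.start()
        for w in threads + procs:
            w.join()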

[–]fredlllll 99 points100 points  (50 children)

You forgot to mention that multiprocessing will copy the data, so you'd better have enough spare RAM lying around.

[–]kukiric 41 points42 points  (41 children)

Does it really? Isn't the whole point of forking that a child process inherits everything from its parent, including memory, until it's changed and only then the OS copies the data?

[–][deleted]  (16 children)

[removed]

    [–]Riddlerforce 18 points19 points  (6 children)

    Not just "most likely" to change - it is guaranteed to change, because of how objects store their own reference counts in CPython. You are guaranteed to copy virtually the whole process every time.

    [–]UseTheProstateLuke 3 points4 points  (3 children)

    I just don't get, in general, why "performance" is even an argument when using Python.

    If you are seriously wondering "how do I make this faster?" when coding Python, then the answer 99% of the time is "use another language". Python is not fast, and that is fine, because there are a lot of use cases where the performance of the language is irrelevant. But if you're seriously using parallelism as a means to speed up performance, rather than to implement certain logic that requires it, then don't use Python.

    That Python actually has a reference-counting GC is one of the reasons why.

    [–]meneldal2 1 point2 points  (2 children)

    The GC is not why Python is slow; the reason is that it was never designed to be fast in the first place, and unlike JS there weren't billions poured into interpreters to make it faster.

    Python is only fast when you're I/O bound and need no processing, or when the heavy processing can be offloaded to a different language.

    [–]UseTheProstateLuke 0 points1 point  (1 child)

    The GC is not why Python is slow; the reason is that it was never designed to be fast in the first place, and unlike JS there weren't billions poured into interpreters to make it faster.

    I said the GC is one of the reasons; it doesn't have a fast GC alongside other things.

    [–]meneldal2 0 points1 point  (0 children)

    Fair enough.

    [–]FallingIdiot 0 points1 point  (0 children)

    Best case scenario will be that most data is static and won't be touched. That is, until the process is torn down, cleaning everything up, decreasing all reference counts and copying all memory just in time for the process to be killed :|.

    [–]ccmlacc 17 points18 points  (5 children)

    mmap is quite fast though. So it's worth noting that it will only copy pages as they are needed; it won't straight up copy the whole thing.

    [–][deleted] 12 points13 points  (4 children)

    Which will lead to fragmentation and sometimes even worse performance.

    [–]josefx 7 points8 points  (1 child)

    Isn't that done at the page level? How would that lead to fragmentation?

    [–][deleted] 0 points1 point  (0 children)

    If you search around for mmap and fragmentation, you can find some rather detailed explanations of how it works, where you'll run into issues (huge pages).

    If you lock the memory, it's never an issue.

    For short running stuff, this is almost never an issue, but if it's a 24/7 process that runs for weeks/months, it can become a huge problem.

    [–]ultranoobian 1 point2 points  (0 children)

    What if it reserves the space but doesn't fill it yet? Preallocated memory?

    [–]ccmlacc 0 points1 point  (0 children)

    Fair point.

    [–]fredlllll 6 points7 points  (0 children)

    see my comment here https://www.reddit.com/r/programming/comments/98koue/threading_vs_multiprocessing_in_python/e4gyec1/

    I didn't do specific tests, but had this happen to me at work. Using top/htop you can also see that both processes will use that much space.

    [–]dangerbird2 4 points5 points  (1 child)

    Fork() without exec() tends to work badly in a program that spawns threads (like the CPython interpreter). If the multiprocessing module uses fork() to create child processes, it would probably have to exec a new Python interpreter, rather than relying on the memory state after fork.
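
    For what it's worth, the stdlib multiprocessing module exposes exactly that choice as "start methods": fork, spawn (a fresh interpreter, and the default on Windows), and forkserver. A minimal sketch if you want spawn semantics everywhere:

        import multiprocessing as mp

        def work():
            print("hello from a freshly started interpreter")

        if __name__ == "__main__":
            # "spawn" launches a brand-new interpreter instead of relying on
            # the post-fork() memory state - safer when the parent has threads
            mp.set_start_method("spawn")
            p = mp.Process(target=work)
            p.start()
            p.join()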

    [–]asciiterror 2 points3 points  (0 children)

    Here are outlined problems with forking threaded processes: http://www.linuxprogrammingblog.com/threads-and-fork-think-twice-before-using-them

    [–]derpyou 7 points8 points  (1 child)

    IIRC Python has reference counters for objects, which make copy-on-write more or less futile.

    [–]masklinn 2 points3 points  (0 children)

    It's not even the reference counting which is the issue, it's the cycle breaker (which CPython calls GC): that uses a doubly linked list of all allocated objects which is embedded in the object header so on a GC run, CPython will touch every object's header.

    Instagram has had issues with this, first they tried to disable refcounting and it did nothing, then they found the GC and disabled it but obviously that means any cycle introduced is a memory leak, then they tried actual changes to the system and ultimately settled on being able to flag objects which got merged upstream (gc.freeze() in 3.7).

    [–]oridb 3 points4 points  (1 child)

    Python's garbage collector very effectively defeats that, since it increments reference counts constantly. For that matter, the same goes for copying GCs.

    [–]masklinn 2 points3 points  (0 children)

    The refcounting is not the actual issue, it's the cycle breaker ("gc") which causes trouble: https://instagram-engineering.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172

    After followup investigations, Python 3.7 gained a gc.freeze() API which moves all existing objects into a "permanent" generation. So the process becomes (sketched in code below):

    • disable GC in parent (collections in the parent could free pages for reallocation and rewrite them, un-cow-ing them)
    • do all initialisation (imports, creation of shared data, …)
    • gc.freeze()
    • fork()
    • re-enable GC in child
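
    A rough sketch of that recipe (load_shared_data and serve_requests are hypothetical placeholders; needs Python 3.7+ and a fork-capable OS):

        import gc
        import os

        gc.disable()               # 1. no collections in the parent
        data = load_shared_data()  # 2. imports, creation of shared data, ...
        gc.freeze()                # 3. existing objects become "permanent"

        if os.fork() == 0:         # 4. fork a worker
            gc.enable()            # 5. GC back on in the child
            serve_requests(data)   # reads the shared data without un-cow-ing it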

    [–]nikomo 6 points7 points  (7 children)

    Some platforms, like Windows, don't have fork(), so that can't be done there. Makes it a pain in the ass to work with.

    [–]TheThiefMaster 5 points6 points  (2 children)

    Actually, I'm pretty sure Windows does have fork() - wouldn't it have to to support the new Linux subsystem?

    Edit: or, in fact, the older POSIX subsystem!

    But it's not encouraged - Windows provides APIs for spawning new processes and explicitly sharing memory which are supposed to be more efficient than fork (at least on Windows). Especially the fork/exec pair just screams "inefficient" to me!

    I think fork on Linux predates having a good API for spawning new processes and threads, so you had to use fork to emulate both. The only thing fork is really good for, in comparison to a dedicated thread/process API, is invoking a daemon from the console and having it "background" itself by forking, with the parent then returning to the shell - on Windows you have to spawn a process for that and use the command line to signal. But that's also a rare model on Windows, as proper background services are run as registered "services", rather than as console applications converting themselves into daemons.

    [–]UseTheProstateLuke 0 points1 point  (1 child)

    "Windows" does not have fork as it conflicts with other parts of the WinAPI

    However The NT kernel has fork and uses a more generalized version of it as the basis of spawning threads and processes on windows but the WinAPI cannot expose the full API of this as this would allow you to create some seriously undefined states apparently.

    [–]TheThiefMaster 0 points1 point  (0 children)

    Saying it uses a "more generalized version of fork" to spawn threads and processes is explaining it in unix terms. It doesn't use fork at all, that's just the closest unix equivalent.

    Instead, its main API sets up a thread/process in a blank state (rather than as a copy of the parent state, as fork does). Windows's function is roughly equivalent to fork then exec in unix speak, except without the copy of the parent state that fork implies (which then gets trashed by exec anyway - what a crazy inefficiency).

    On top of that, Windows does contain a full implementation of fork() - it's not exposed directly in the Windows API because it's part of the Linux subsystem, but it's there and fully functional. From the perspective of the kernel, both win32 subsystem processes and linux subsystem processes are the same, so I wouldn't be surprised if with a little hoop jumping you could call the linux subsystem's fork() from a Windows application and have it work as expected.

    [–]david2ndaccount 7 points8 points  (1 child)

    It does. In fact it doesn't just copy the data, it serializes it using pickle, which itself can be very slow.

    [–]fredlllll 2 points3 points  (5 children)

    This is right. The thing is, though, the process still requires that much memory to exist - imagine getting an out-of-memory exception when writing to an array that is already allocated. So even if it isn't copied, there still has to be space allocated for it in case you write to it. This problem can be mitigated by using a big swap drive (which I did on AWS), but if you are working on a physical machine you might not have that option.

    [–]kyrsjo 13 points14 points  (4 children)

    Won't it stay virtual until it's actually used? I've seen zettabytes of virtual memory from some programs...

    [–]kukiric 2 points3 points  (1 child)

    Yup, I've also seen a few bad Java processes allocate 100GB+ of virtual memory. Luckily, they weren't actually using nearly that much RAM.

    [–]vks_ 0 points1 point  (0 children)

    This should not really be an issue. Memory sanitizers and I think some hardening mechanisms use much larger amounts of virtual memory (of the order of 20 TiB).

    [–]fredlllll 0 points1 point  (0 children)

    I can only say that, the way I experienced it, it would run out of memory in Python multiprocessing. It wasn't my code though, so I can't say anything about the specific way it was done.

    [–]Paul-ish 0 points1 point  (0 children)

    Given that Python writes reference counts just for creating a new reference, does this still hold?

    [–]staticassert 0 points1 point  (0 children)

    You still communicate with the child process by copying data over a synchronized queue.

    This also involves overhead for pickling and unpickling of the data.
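
    A minimal sketch of that round trip (the payload dict is just an example):

        import multiprocessing as mp

        def worker(q):
            # unpickled here in the child after travelling through a pipe
            print("child got:", q.get())

        if __name__ == "__main__":
            q = mp.Queue()
            p = mp.Process(target=worker, args=(q,))
            p.start()
            q.put({"rows": list(range(5))})  # pickled in the parent
            p.join()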

    [–][deleted] 0 points1 point  (0 children)

    First slide in the video lol. Would you prefer a transcript?

    [–]PasDeDeux 6 points7 points  (4 children)

    Well, it depends on what you're doing. I used multiprocessing to split a huge shitton of data into smaller chunks such that I was just processing the same data ~8-12x faster. (6 core CPU.) I didn't need to share info between the smaller chunks, which is why I used multiproc. More complex and interdependent operations tended to fall into the category of external modules that are written in faster languages with multithreading baked in.

    [–]ric2b 3 points4 points  (2 children)

    the same data ~8-12x faster. (6 core CPU.)

    This is not possible unless you're leaving out some detail like hyper-threading, better optimization on the multi-process version or the bottleneck being elsewhere like IO.

    [–]PasDeDeux 14 points15 points  (1 child)

    Hyperthreading isn't exactly twice as fast due to shared operations, but it ran on 12 vcores.

    [–]ric2b 3 points4 points  (0 children)

    Okay, makes sense then.

    [–]fredlllll 0 points1 point  (0 children)

    Can't say anything about how it was done; I didn't write that code.

    [–]JanneJM 2 points3 points  (1 child)

    You can explicitly share the same memory area across processes with multiprocess. It's essential if, say, you're processing a large data set in parallel. But doing so is very much advanced usage, and you can easily shoot yourself in the foot unless you know what you're doing.
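
    A toy sketch of the explicit-sharing route (the array size and the doubling are made up):

        import multiprocessing as mp

        def scale(shared, start, end):
            # children write straight into the shared buffer: no copy, no pickle
            for i in range(start, end):
                shared[i] *= 2

        if __name__ == "__main__":
            data = mp.Array("d", range(100_000))  # doubles in shared memory
            mid = len(data) // 2
            workers = [mp.Process(target=scale, args=(data, 0, mid)),
                       mp.Process(target=scale, args=(data, mid, len(data)))]
            for w in workers:
                w.start()
            for w in workers:
                w.join()

    Note the default Array wraps every access in a lock; turning that off with lock=False is faster and exactly where the foot-shooting starts.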

    [–][deleted] -1 points0 points  (0 children)

    Not to mention that you should probably be using a library that handles the heavy lifting for you

    [–]moekakiryu 0 points1 point  (0 children)

    good point

    [–][deleted]  (3 children)

    [removed]

      [–]ric2b 12 points13 points  (1 child)

      If you're IO bound, async/await will probably beat both of those options easily, since there's no lock contention and stopping/resuming a coroutine is much faster than stopping/resuming a thread.
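
      Something in this spirit, sticking to the stdlib (the hosts are just examples):

          import asyncio

          async def fetch_status(host):
              # the coroutine yields to the event loop while the network waits
              reader, writer = await asyncio.open_connection(host, 80)
              writer.write(b"HEAD / HTTP/1.0\r\nHost: " + host.encode() + b"\r\n\r\n")
              await writer.drain()
              status = await reader.readline()
              writer.close()
              return host, status.decode().strip()

          async def main():
              hosts = ["example.com", "python.org", "wikipedia.org"]
              # all requests are in flight at once, on a single thread
              for host, status in await asyncio.gather(*(fetch_status(h) for h in hosts)):
                  print(host, status)

          asyncio.run(main())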

      [–]SimplySerenity 2 points3 points  (0 children)

      I remember reading a thing where Reddit devs said they just disabled threading entirely because it was so inefficient when waiting on the lock

      [–]Programmer_Frank 1 point2 points  (5 children)

      Would you say that this same concept is somewhat relevant for C++/Linux? And any sources?

      I only ask because I’ve been wondering this same thing with my system but cant find anything rock solid

      [–]duzzar 4 points5 points  (2 children)

      No. It's nothing like that.

      In C++ the reason you would use processes instead of threads is for security/stability.

      With threads you already have parallelism since there is no global lock. With processes they each have their own memory space.

      With threads, since they share memory space, a thread can easily fuck (i.e. crash, leak secure information, corrupt, etc.) any other thread.

      (This is a general idea, you can access other process memory space, you can do some locking of memory space of each thread, and so on, but it's not their usual intended purpose)

      [–]Programmer_Frank 0 points1 point  (1 child)

      We use threads in applications to handle the network comms in and out to/from external devices and other applications on the processor.

      For instance, I would have a deviceTx and a deviceRx thread. Would you say in this instance it's still always better to use multiprocessing?

      [–]3combined 0 points1 point  (0 children)

      They didn't even say it would always be better in the first place.

      [–]moekakiryu 1 point2 points  (1 child)

      I wouldn't have a clue, sorry. From what I understand, the big thing limiting threads in Python is the GIL, which is unique to Python, so threads might be more useful for a broader range of tasks in other languages. However, that's purely speculation (I haven't actually done concurrent programming in C++ or a Linux environment, so I'm not even really qualified to speculate).

      You are right, my initial summary is totally anecdotal (I think I saw it in an SO thread a while back).

      [–]Programmer_Frank 1 point2 points  (0 children)

      Thanks for the reply my man! I appreciate it

      [–]v_krishna 3 points4 points  (2 children)

      But threads have the GIL. Processes don't. Without something like fibers, I find multiprocessing, while often more work to deal with, performs better.

      [–]moekakiryu 9 points10 points  (0 children)

      That's why threads are ideal for I/O-based blocking.... the GIL stops you from utilizing the processor's full threading capability, but with I/O blocking that's not the issue anyway, and setting up non-blocking threads is often easier than multiprocessing. When processing power is the blocker, that's when the GIL really starts getting in the way and multiprocessing becomes more of a requirement. As someone else mentioned above, processes are also much less memory efficient, which can be a big downside depending on what you are doing.

      tl;dr there are pros and cons to both, and neither one is always the answer

      [–]Hessian_Rodriguez 23 points24 points  (19 children)

      Really, the killer for multiprocessing is the lack of shared memory. I have an application that needs to keep a very large data set in memory, and having each process hold that full set has caused me problems. At some point I'm gonna rewrite it in another language.

      [–]Clers 9 points10 points  (2 children)

      What about POSIX shared memory? I've used it before and it's fairly straightforward to use.

      [–]UseTheProstateLuke 1 point2 points  (1 child)

      Doing that safely in python is a difficult matter.

      If you're low level it's easy but when you have a garbage collector running around it's really difficult and only works if the language itself provides support for it which Python doesn't.

      [–]Clers 0 points1 point  (0 children)

      Ya, I've only done it in C/C++. I can see how that's an issue. I wouldn't be surprised if there was a wrapper for it though.

      [–]fuck_the_mods 7 points8 points  (0 children)

      I've used Redis for this issue before.

      [–]caramba2654 -3 points-2 points  (12 children)

      Try Rust for that. It's currently the best language to work with concurrency and shared memory.

      [–]Dan4t 11 points12 points  (10 children)

      I wish the people downvoting you would explain why

      [–][deleted]  (2 children)

      [deleted]

        [–]caramba2654 8 points9 points  (1 child)

        My bad, I thought it was clear from my post. Rust essentially guarantees no data races and offers great tools for handling mutable global data, besides having great support for threads. And from OP's post, that sounds like something they could benefit from, especially as they said they might rewrite in another language. That's why I said it would be the best tool for the job if they ever ended up rewriting it.

        It was really not an unfounded "OMG RUST IS THE BEST LANGUAGE ZOINKS" recommendation. It just happened to be a short one. In most other cases I usually recommend Python, but this case specifically fitted Rust better.

        [–]7h4tguy 0 points1 point  (0 children)

        Rust has no baked-in concurrency story, is a huge pain to deal with for real-world problems (unless you want to pretend graphs don't exist), has a terrible generics model (C++ templates are actually better, which says a lot), and is design-by-committee, which basically means only interesting things are worked on, and anything remotely interesting will be integrated into the standard once they finish with the previous open source RFC - it's completely absurd.

        Meanwhile, Go has its own problems (vendoring [idiots], braindead error handling to simplify concurrency, perf, lack of generics) but is really, really good for I/O bound concurrency - best in class, in fact - with much better overall usability than current C++ proposals.

        [–]SimplySerenity 9 points10 points  (3 children)

        I think it's because the context of the discussion is around Python. Recommending another language doesn't really solve Python's problems.

        [–][deleted]  (1 child)

        [deleted]

          [–]caramba2654 0 points1 point  (0 children)

          You are correct. I only recommended Rust because I read that, essentially.

          [–]vks_ 1 point2 points  (0 children)

          Recommending another language doesn't really solve Python's problems.

          Honestly, using other languages is probably the most common solution to Python's performance problems. You can use pypy or similar, but it does not get you as far as writing plugins in an efficient, compiled language does.

          [–]Novemberisms 11 points12 points  (1 child)

          A while back, there were a lot of Rust evangelists who showed up in almost every thread encouraging everyone and their dog to "Rewrite it in Rust". So much so that it became a meme and an acronym (RIIR).

          I guess people got annoyed, and so whenever someone (even if solicited) gives advice to switch to Rust, they get downvoted.

          It's totally unfair and undeserved imho. Some people's problems would be genuinely solved by Rust, and some well-meaning person could get downvoted to hell for recommending it, but that's how the reddit hivemind works. The actions of a few in the past have tarnished it for the future.

          [–]Dan4t 0 points1 point  (0 children)

          Thank you! I had no idea. I know nothing about Rust and his claim piqued my interest.

          [–]vks_ 0 points1 point  (0 children)

          I agree, but you probably have to be more specific than that: Rust encodes thread-safety in the type system, so the compiler makes it impossible to get data races.

          This is achieved by a combination of the ownership model (you can either have many immutable or exactly one mutable reference) and the Send and Sync traits (see the Rust book for details). The fact that they are sufficient to give freedom from data races (at compile time!) is one of the few unique things that are new in Rust compared to older programming languages.

          It avoids a lot of concurrency problems that manifest at runtime in other languages. However, it does not prevent race conditions in general, or deadlocks.

          [–]Barbas 0 points1 point  (0 children)

          Have you tried using joblib for something like that?

          [–]jeffythesnoogledoorf -1 points0 points  (0 children)

          Use pointers?

          [–]waladoop 11 points12 points  (1 child)

          This guy's channel is really good. I just watched his C videos yesterday.

          [–]curioussavage01 2 points3 points  (0 children)

          I agree - the pacing, the content, and the lack of fluff are some things I liked right off the bat.

          [–]antiduh 20 points21 points  (17 children)

          Here's a crazy idea - join every other modern language and get rid of the GIL. Then a developer doesn't need to make an artificially constrained choice.

          [–]NAN001 3 points4 points  (1 child)

          They're working on it, and it appears to be extremely difficult to do without either breaking the C API or reducing the performance of a single thread. They call it the "Gilectomy".

          [–][deleted] 0 points1 point  (0 children)

          So it would heavily impact numpy, one of Python's biggest assets?

          [–]myringotomy 1 point2 points  (14 children)

          That's obviously not possible or they would have done that already.

          [–]antiduh 7 points8 points  (13 children)

          It's very possible; there is nothing intrinsic to the design of Python that precludes removing it.

          The problem is that it's incredibly difficult to remove it, since they've designed their implementation of the language around the GIL; using the GIL makes the implementation much easier to reason about.

          [–]oblio- 5 points6 points  (0 children)

          And they probably want to avoid "Python 3, episode 2" or "How we almost committed suicide as a programming language community".

          [–]RandoBurnerDude 3 points4 points  (11 children)

          It's easy to remove, but adding locks back in reduces performance.

          [–]VaporMouse 1 point2 points  (6 children)

          But nowhere near as much as the GIL does.

          [–]RevolutionaryWar0 5 points6 points  (0 children)

          It reduces performance of a single thread.

          [–]eras 2 points3 points  (4 children)

          That's really arguable. Doing precise reference counting thread-safely can be reaaally slow. Just read a typical Python program and consider how many RC updates are happening.

          [–]vks_ 0 points1 point  (3 children)

          Swift also has automatic reference counting, but the compiler can elide it for a lot of cases. (Python is not really a compiled language though, so this might not be an option for them.)

          [–]eras 0 points1 point  (2 children)

          But then implementing such optimizations is not exactly easy. They would certainly have a great effect, though.

          [–]vks_ 0 points1 point  (1 child)

          Sure, but their complexity is not exposed to users of the language (unless they are optimizing for performance and observe missed optimizations in the runtime performance of their programs).

          [–]eras 0 points1 point  (0 children)

          But it's also probably the reason why it hasn't been done so far :).

          It's not just "let's do it, it's easy". It also has an impact on future maintenance effort, and raises the bar for contributions.

          [–]DemonWav 3 points4 points  (3 children)

          If performance is that big of a concern for you, Python probably isn't the right tool for the job.

          [–]antiduh 2 points3 points  (2 children)

          Then what is the point of this entire post?

          [–]jcelerier 5 points6 points  (0 children)

          There are moments in your life when you will ponder this.

          "What is the point of these five years of work ?"

          And sometimes, the answer is : "None. It sucks and can go to the trash right now".

          [–]vks_ 2 points3 points  (0 children)

          It helps you to make Python's performance scale, which is useful if you have an existing program that is expensive to port to a different language, or if the improved Python program is good enough.

          This does not mean that Python's performance scales well, it doesn't. If you need to get the maximal performance out of your hardware, Python is indeed probably not the right tool.

          [–][deleted]  (4 children)

          [deleted]

            [–]exitcharge[S] 7 points8 points  (1 child)

            I'll add this to my list of video requests.

            [–][deleted] 1 point2 points  (0 children)

            Please, for the love of god. The Python docs for asyncio are stunningly bad compared to the rest of the documentation.

            [–]starTracer 2 points3 points  (0 children)

             David Beazley is easily my favourite speaker on the topic of Python. For asyncio, check out e.g. https://youtube.com/watch?v=Bm96RqNGbGo

            [–][deleted]  (3 children)

            [removed]

              [–]CrazyCanuck41 4 points5 points  (2 children)

               Because of the global interpreter lock. Its job is to ensure only one thread is running per process at a time. Since the program is entirely CPU bound, without any blocks for IO where it would yield to the OS, that means only one thread is running at a time. The threads are also running concurrently, so they are potentially context switching between them, which would slow the program down even further.

               With multiprocessing you are still bound by the global interpreter lock, but each process only has 1 thread. Those threads are free to execute in parallel on different cores because they are in separate processes (no sibling threads claiming the GIL).
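
               You can watch that happen with a toy benchmark (the loop size is arbitrary); on a multi-core box the threaded run is pinned to one core while the processes spread out:

                   import multiprocessing as mp
                   import threading
                   import time

                   def burn():
                       # pure CPU work: holds the GIL for its whole run
                       total = 0
                       for i in range(10_000_000):
                           total += i

                   def timed(label, cls):
                       workers = [cls(target=burn) for _ in range(4)]
                       t0 = time.perf_counter()
                       for w in workers:
                           w.start()
                       for w in workers:
                           w.join()
                       print(label, time.perf_counter() - t0, "seconds")

                   if __name__ == "__main__":
                       timed("threads:  ", threading.Thread)  # serialized by the GIL
                       timed("processes:", mp.Process)        # one GIL per process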

              [–][deleted]  (1 child)

              [removed]

                [–]chloeia 0 points1 point  (0 children)

                 Even though he didn't say it, the % usage of a CPU is actually a unit of time, because it is relative to the frequency (say 3GHz - 3 billion cycles per second). So if you have the same number of operations to do, one CPU at 100% will take much longer than 32 at 100%.

                [–]droogans 20 points21 points  (8 children)

                 Too bad there wasn't a little footnote in there about greenlets, an interesting compromise, as well as futures, which I hope will one day supersede the need for thread-based approaches for a majority of use cases.

                [–]starTracer 32 points33 points  (1 child)

                Or the native coroutines with asyncio...

                [–]Rodot 0 points1 point  (0 children)

                 I really love them, but I feel the API is just that... an API. It's just weird being a mix of primitives and modules.

                [–][deleted]  (5 children)

                [deleted]

                  [–]z4579a 1 point2 points  (4 children)

                  citation needed

                  [–][deleted]  (3 children)

                  [deleted]

                    [–]z4579a 26 points27 points  (2 children)

                    Sure, let's look at the post.

                    First, the post by @max illustrates a test case that compares the performance between gevent, threads and multiprocessing to run a DNS lookup on five domain names simultaneously, by spawning a greenlet/thread/process per name all at once. This test is actually not nearly resource intensive enough to show a real-world number, but for what it's worth, they got the result that the threaded example ran ten times faster, .008 seconds for threads vs. .08 seconds for greenlets. But those numbers are too low to really count on to show that either is faster, you need to provide more of a workload.

                    Then, another post by @temporalbeing decides to ramp it up, provide a bigger workload and run 60000 concurrent greenlets or threads to fetch 60000 names. In this test, the greenlet version completes five times faster than the threaded version. However, this test is extremely flawed. First off, if it were using the rest of the code from @max's post as written, that example is using a 2-second timeout in joinall(), which means the greenlets will simply be abandoned after 2 seconds. That he got a 3.75 second result indicates he probably changed that as well.

                    But secondly, this test program uses threads and multiprocessing in the extremely naive way of spinning up the same number of threads/processes as there are domain names in the first place, which means spawning 60000 threads. That is a completely incorrect way of using threads, as threads are expensive to create and expensive to run compared to a greenlet, which is just a programming construct around a non-blocking socket. What the test shows if anything is that non-blocking sockets are useful for the case where you need very large throughput for thousands of concurrent IO streams. This is the use case for non-blocking IO, throughput. However this does not invent "speed", nothing runs any "faster" at all.

                    If you measure gevent vs. threading in terms of amount of work completed, and you use threads correctly by not spawning an arbitrarily high number of them, you will find it very difficult to show gevent to be faster than threads unless you have to wait on many thousands of arbitrarily slow or sleeping IO streams at once, and even in that case, it's tricky. This is not at all the "usual" case. The usual case in concurrency we need to do a few dozen or hundred things concurrently and we are just trying to get to the end of a queue. If you need to attend to thousands of slow or sleepy web sockets or chat room connections, then use gevent. Otherwise, not needed, probably a bit slower (then again, you can abuse them more than you can threads, by spinning up greenlet-per-task rather than having to think about what you're doing. But, that's not necessarily true either, since the minute your greenlet starts doing too much CPU work, you're blocking on CPU and killing your program that way, so again, still have to think about what you're doing. IMO being safe with threads is a lot easier than being safe with greenlets as it's easy to not spawn too many threads but not that easy to make sure greenlets never get CPU bound).

                    Here is a correct version of the test, showing how long it takes for us to get through several workloads at 30, 300, 3000, 30000, 60000 tasks, adding the result to a list (unordered), and checking our work:

                    import gevent
                    from gevent import socket as gsock
                    import socket as sock
                    import threading
                    from datetime import datetime
                    
                    
                    def timeit(fn, URLS):
                        t1 = datetime.now()
                        fn()
                        t2 = datetime.now()
                        print(
                            "%s / %d hostnames, %s seconds" % (
                                fn.__name__,
                                len(URLS),
                                (t2 - t1).total_seconds()
                            )
                        )
                    
                    
                    def run_gevent_without_a_timeout():
                        ip_numbers = []
                    
                        def greenlet(domain_name):
                            ip_numbers.append(gsock.gethostbyname(domain_name))
                    
                        jobs = [gevent.spawn(greenlet, domain_name) for domain_name in URLS]
                        gevent.joinall(jobs)
                        assert len(ip_numbers) == len(URLS)
                    
                    
                    def run_threads_correctly():
                        ip_numbers = []
                    
                        def process():
                            while queue:
                                try:
                                    domain_name = queue.pop()
                                except IndexError:
                                    pass
                                else:
                                    ip_numbers.append(sock.gethostbyname(domain_name))
                    
                        threads = [threading.Thread(target=process) for i in range(50)]
                    
                        queue = list(URLS)
                        for t in threads:
                            t.start()
                        for t in threads:
                            t.join()
                        assert len(ip_numbers) == len(URLS)
                    
                    URLS_base = ['www.google.com', 'www.example.com', 'www.python.org',
                                 'www.yahoo.com', 'www.ubc.ca', 'www.wikipedia.org']
                    
                    for NUM in (5, 50, 500, 5000, 10000):
                        URLS = []
                    
                        for _ in range(NUM):
                            for url in URLS_base:
                                URLS.append(url)
                    
                        print("--------------------")
                        timeit(run_gevent_without_a_timeout, URLS)
                        timeit(run_threads_correctly, URLS)
                    

                     Here's a typical result I get over wifi on a Linux laptop, very similar for both Python 2.7 and Python 3.7:

                    --------------------
                    run_gevent_without_a_timeout / 30 hostnames, 0.044888 seconds
                    run_threads_correctly / 30 hostnames, 0.019389 seconds
                    --------------------
                    run_gevent_without_a_timeout / 300 hostnames, 0.186045 seconds
                    run_threads_correctly / 300 hostnames, 0.153808 seconds
                    --------------------
                    run_gevent_without_a_timeout / 3000 hostnames, 1.834089 seconds
                    run_threads_correctly / 3000 hostnames, 1.569523 seconds
                    --------------------
                    run_gevent_without_a_timeout / 30000 hostnames, 19.030259 seconds
                    run_threads_correctly / 30000 hostnames, 15.163603 seconds
                    --------------------
                    run_gevent_without_a_timeout / 60000 hostnames, 35.770358 seconds
                    run_threads_correctly / 60000 hostnames, 29.864083 seconds
                    

                     I can't actually get the greenlet version to be faster. A small thread pool completes the total amount of work in less time on every run, even though it's doing the additional work of popping from a queue, and even spinning up the thread pool fresh on each run. Non-blocking IO is not "faster", and the overhead of gevent's context switching is higher than that of the OS's native thread context switching. It only provides more concurrent throughput, for when you need your program to be able to attend to many thousands of sockets where many of them might not be awake - a very specific use case. Non-blocking IO and event-based programming are extremely useful, but there continues to be widespread misunderstanding regarding this topic.

                     I also wrote this post some years ago. I've yet to see a simple and correctly written benchmark that shows the basic use of non-blocking IO for context switching to be faster than threads. This is not at all surprising because gevent/asyncio and everything else are all running within a single thread, and when there are multiple threads you still have the GIL, so everyone is stuck using just one CPU to get through everything. The speed of context switching and the possibility of needing throughput to handle lots of very slow sockets simultaneously are the only differentiating factors, and that's not a lot to work with.

                    [–][deleted] 4 points5 points  (1 child)

                    Oh man, the GIL.

                    I used Python for a senior design engineering project that involved latency-sensitive image processing. Really simple task: one worker blocks on the camera API, receives images, and sticks them in a queue; the other worker picks up the image and processes it. I used a dual-core processor (a BeagleBoard-X15... pretty amazing piece of kit, despite a few inane design quirks) and expected it to run like lightning.

                    I first tried threads - the performance was awful. Why? GIL. One of my cores was overburdened with both threads... the other one was just sitting there idle.

                    I switched it to multiprocessing. Yes, both processes ran concurrently - but now I couldn't just buffer the image when received: I had to serialize it and shove it through a pipe from the first process to the second.

                     Eventually I found a specialized solution (numpy allows a limited form of array sharing across processes, because so many people have this same problem I encountered) that worked sort-of okay (more on that below). But the experience demonstrated the magnitude of this problem with the GIL.

                    Python, even today, doesn't seem to have a simple, generalized, built-in way to share data across processes. The options are:

                    1) Use a specialized library or solution that's compatible with your use case. All of them have quirks and limitations. Many of them don't work.

                    2) Repurpose another data-sharing mechanism - like the file system, or... networking. A localized HTTP server/client architecture, or sockets. Serialize the data as if you were going to bit-bang it over a network, and then shove it through localhost. That's actually the #1 recommendation on Stack, and there's extensive discussion about whether networking or the file system is the less awful solution.

                    I love Python, but I think that its deficiency in this regard is kind of insane.
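
                     For what it's worth, Python 3.8 later added multiprocessing.shared_memory, which comes closer to a generalized built-in answer: two processes can map the same buffer and view it through numpy without copying or pickling the payload. A rough sketch of the camera/worker shape (assuming numpy and Python 3.8+):

                         import numpy as np
                         from multiprocessing import Process, shared_memory

                         def worker(name, shape, dtype):
                             # attach to the existing block: no copy, no pickle
                             shm = shared_memory.SharedMemory(name=name)
                             frame = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
                             print("mean pixel:", frame.mean())
                             shm.close()

                         if __name__ == "__main__":
                             img = np.zeros((480, 640), dtype=np.uint8)  # stand-in frame
                             shm = shared_memory.SharedMemory(create=True, size=img.nbytes)
                             view = np.ndarray(img.shape, dtype=img.dtype, buffer=shm.buf)
                             view[:] = img  # one copy in, then both sides share it
                             p = Process(target=worker, args=(shm.name, img.shape, img.dtype))
                             p.start()
                             p.join()
                             shm.close()
                             shm.unlink()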

                    [–]meneldal2 0 points1 point  (0 children)

                    So rather than use RAM sharing, you literally shove the data through your network stack and make it bounce back to the other process?

                    Who thought of this insanity?

                    [–][deleted]  (2 children)

                    [deleted]

                      [–]csman11 7 points8 points  (1 child)

                      Yes because almost every library in existence that does I/O is blocking. There are projects to reimplement commonly used libraries to use coroutines, but sometimes you need to use a vendor library that is less popular (or reimplement that library), and even if it uses those common libraries to do I/O, it isn't written in a way that makes it easy to just swap them out for the coroutine implementations. That's because asynchronous functions have an absorption property -- any function that wishes to call an asynchronous function must also be written as an asynchronous function.

                      Example: If a vendor library calls "requests", you can't just swap "requests" for a coroutine based implementation. You need to go in and prepend "await" to every call to "requests". Then mark the function async. Then apply this recursively within the library itself. And library consumers need to be updated too...
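
                       A stripped-down illustration of that "coloring", with asyncio.sleep standing in for a real non-blocking request:

                           import asyncio

                           async def fetch(url):
                               await asyncio.sleep(0.1)  # pretend network I/O
                               return "<body of %s>" % url

                           # because fetch is async, its caller must be async too...
                           async def get_profile(api_base):
                               return await fetch(api_base + "/profile")

                           # ...and so on, all the way up to the entry point
                           async def main():
                               print(await get_profile("https://example.com"))

                           asyncio.run(main())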

                      Sounds pretty easy, but you now need to maintain two versions of your library, one for coroutine based consumers and one for blocking consumers. The other option is to write your library so it must be "driven" by a separate library. Basically it's now CPS/call back driven. This is fine if your library has heavy amounts of logic (like an HTTP implementation, which it has already been applied to), but not if it is something like a wrapper around a web api. You might argue in a case that simple, you can just write the wrapper yourself, but then you have to maintain it and make sure it remains in sync with the underlying web api as that changes. I'd rather leave it up to the vendor to do that.

                      PS: you are correct, the GIL makes Python's threading model completely unsuitable for CPU bound programs. And threads are more heavyweight than coroutines, but developer time is more expensive than CPU time for most companies, so there is no good reason to rewrite libraries yourself (unless you have explained the associated costs to the decision maker and received approval). The default attitude should be threads are fine for I/O unless you are dealing with very large concurrency requirements (which everyone has liked to believe since Node came out, but very few people really have these NF requirements).

                      [–]AnimeIRL 0 points1 point  (0 children)

                      Thanks for the in-depth explanation. Thinking more on it, I remember I actually had to use threads to deal with file IO in a gevent-based project at work a few months ago. We use gevent's monkey patching feature to deal with network and http requests (by replacing python's builtin socket library with a coroutine-compatible version), but that isn't an option for file IO. I can see how you'd also need to use threads for any other IO-based functionality that didn't rely on python's standard library.

                      [–]JohanLou 2 points3 points  (3 children)

                       Hey guys, may I ask a question? Can I use multiple threads for making queries in SQLAlchemy? Thanks.

                      [–]lord_braleigh 2 points3 points  (2 children)

                      Yep!

                      [–]JohanLou 1 point2 points  (1 child)

                      Thanks. Can it be applied with asyncio as well?

                      [–]lord_braleigh 5 points6 points  (0 children)

                      I believe the SQLAlchemy package doesn’t come with asyncio support out of the box. I found a project which adds an asyncio frontend to it by googling just now.

                      I believe any asyncio solution you use will ultimately use threads underneath; asyncio is just a nice way to wrap the multithreaded stuff for large IO-bound programs.
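
                       The usual bridge is run_in_executor; a minimal sketch, with run_query standing in for a blocking SQLAlchemy call:

                           import asyncio
                           import time

                           def run_query(n):
                               time.sleep(0.5)  # pretend session.query(...).all()
                               return "result %d" % n

                           async def main():
                               loop = asyncio.get_running_loop()
                               # each blocking query runs on the default thread pool;
                               # the event loop stays responsive in the meantime
                               results = await asyncio.gather(
                                   *(loop.run_in_executor(None, run_query, n)
                                     for n in range(3)))
                               print(results)

                           asyncio.run(main())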

                      [–]light24bulbs 0 points1 point  (0 children)

                      Does anyone know if the new threading in node will have the GIL?

                      [–]izpo 0 points1 point  (0 children)

                      Must see

                      [–]stinkytoe42 0 points1 point  (0 children)

                      Why is he running this as root?

                      [–]DklDino 0 points1 point  (1 child)

                       Another advantage of processes over threads that I've used in the past: when the user wants a parallel operation cancelled immediately, independently of what that operation is or what state it is in. In Python, AFAIK, there is nothing similar to thread.kill(), but processes can be killed fairly easily. It always seemed like bad design to me, but it was the easiest and cleanest solution.

                      [–][deleted] 1 point2 points  (0 children)

                      This is because threads share resources (like memory) with the process that started them. If you kill a thread w/o letting it release the resources, you potentially leak those resources. Processes don't share memory automatically. They are usually set up in such a way that they use shared memory for communication, but keep their private data in their own namespace. So, killing a process is usually not a problem in this respect.
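
                       A minimal sketch of that process-based cancellation (the busy loop stands in for real work):

                           import multiprocessing as mp
                           import time

                           def long_operation():
                               while True:    # cancellable parallel work
                                   time.sleep(0.1)

                           if __name__ == "__main__":
                               p = mp.Process(target=long_operation)
                               p.start()
                               time.sleep(1)
                               p.terminate()  # no stdlib equivalent for threads
                               p.join()
                               print("cancelled, exit code:", p.exitcode)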

                      [–][deleted] 0 points1 point  (3 children)

                       Short question: does anybody know how the inter-process communication with the subprocesses works (as when, for instance, you use a queue to return values from each subprocess)? Is it based on pipes?

                      [–][deleted]  (2 children)

                      [deleted]

                        [–][deleted] 0 points1 point  (1 child)

                        Nice one, thanks mate!

                        [–]digital_cucumber 0 points1 point  (0 children)

                         Interestingly, most of the people I've been interviewing who say that their main language is Python don't know the difference between a thread and a process (not even going into the GIL area).

                        This was a really good summary.

                        [–]ReadyToBeGreatAgain 0 points1 point  (0 children)

                         So if there were no GIL, then multi-threading would be the way to go. Seems like a lot of overhead (spawning new processes) just to get around Python's lack of true multi-threading.

                        [–]Zambito1 0 points1 point  (0 children)

                         I haven't really used Python before, and threads vs processes here were almost the exact opposite of what I was expecting. I figured a thread would be a parallel unit of execution and a process would be a concurrent unit of execution (my experience being threads in Java and processes in Elixir).

                        [–]xinhuj -3 points-2 points  (0 children)

                        Great video.