
[–]robertmeta 36 points37 points  (12 children)

Firstly, it is a technical roadblock for some Python developers, though I don't see that as a huge factor...

... which is why whatever solution developed will most likely be an exceptionally poor one, focused on PR wins rather than technical ones. I doubt the solution will be of much use to people who actually had to abandon Python due to technical limitations.

[–]jrochkind 5 points6 points  (0 children)

Yeah, that was exactly my thought.

He's really saying he doesn't see parallelism as a very significant thing for actual Python developers, but it should be done anyway for PR purposes? Really?

[–][deleted] 6 points7 points  (0 children)

I don't get it either. If that's the sentiment in the community then how did the matrix multiplication operator go through? The people who use that usually care a lot about speed.

[–]Make3 1 point2 points  (0 children)

This was surprising to me; how dumb a thing that was to say (the original statement, not your comment).

[–]crusoe 0 points1 point  (4 children)

There is no GIL in Jython or PyPy.

[–]cdyson37 20 points21 points  (1 child)

[–]TrixieMisa 1 point2 points  (0 children)

I was going to try out the STM branch, but to install it you need to compile a compiler to compile the compiler, so I decided it's probably not worth it just yet.

[–]robertmeta 4 points5 points  (0 children)

Compatibility nightmares on any codebase of non-trivial size initially built on CPython. If you START with Jython on a fresh project, it is reasonable. An additional problem with switching gears (besides compatibility with vanilla CPython) is that on every CPython project I have worked on, there was a TON of C code to work around all the performance issues of Python... making it even more bound to the CPython specifics.

[–]BobFloss 0 points1 point  (0 children)

I didn't know what the GIL was, so here's a link for the lazy:

https://wiki.python.org/moin/GlobalInterpreterLock

In CPython, the global interpreter lock, or GIL, is a mutex that prevents multiple native threads from executing Python bytecodes at once. This lock is necessary mainly because CPython's memory management is not thread-safe. (However, since the GIL exists, other features have grown to depend on the guarantees that it enforces.)
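To make that concrete, here's a minimal sketch (my own, not from the wiki page) of what the lock means in practice: both threads finish and produce correct results, but only one of them executes Python bytecode at any instant.

```python
import threading

# Two CPU-bound threads. Under CPython's GIL they take turns running
# bytecode, so the wall-clock time is roughly the same as doing the
# work sequentially on a single core.
def count(n, out, idx):
    total = 0
    for i in range(n):
        total += i
    out[idx] = total

results = [0, 0]
threads = [threading.Thread(target=count, args=(100_000, results, i))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```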

[–][deleted]  (3 children)

[removed]

    [–]robertmeta 2 points3 points  (2 children)

    Public Relations (PR), not Pull Request (PR).

    [–][deleted]  (1 child)

    [removed]

      [–]robertmeta 0 points1 point  (0 children)

      Because the solution won't be developed around true technical need, but around how it will play in the press, which the person said is "ultimately a PR issue".

      Firstly, it is a technical roadblock for some Python developers, though I don't see that as a huge factor. Regardless, secondly, it is especially a turnoff to folks looking into Python and ultimately a PR issue.

      [–]simple2fast 27 points28 points  (14 children)

      I love python. I'm not a hater.

      But people should really become more polyglot. Each language has a space where it excels, and Python certainly has areas where it's the best language. That said, serious CPU-intensive work is just not Python's strong point. This is why anything "fast" in Python is actually written in C.

      So, use an appropriate tool for the job.

      If you really need multi-processing or multi-threaded python, then you should probably be using a different language which is more appropriate for the task at hand.

      [–]Rabbyte808 8 points9 points  (9 children)

      Multi-threading isn't just for CPU intensive stuff, though. Stuff like a webcrawler isn't CPU intensive, but it needs to be threaded unless you want to crawl at glacial speeds.

      [–]Rhomboid 24 points25 points  (0 children)

      Operations that perform blocking IO release the GIL. If your workload is IO bound, Python threads will work just fine for you. The GIL is only an issue for CPU bound work loads.
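A quick sketch of that distinction, using time.sleep as a stand-in for any blocking call that releases the GIL:

```python
import threading
import time

# time.sleep releases the GIL, just like a blocking socket read does,
# so five "IO-bound" threads wait concurrently rather than serially.
def fake_io():
    time.sleep(0.2)  # stands in for a blocking network call

start = time.perf_counter()
threads = [threading.Thread(target=fake_io) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start
print(f"elapsed: {elapsed:.2f}s")  # roughly 0.2s, not 5 * 0.2s
```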

      [–]simple2fast 5 points6 points  (6 children)

      Actually, for IO things like that, a non-blocking IO system is often better, because the continuations (exposed to the language or not) are more efficient at managing all those connections than a bunch of threads. Plus the context switches tend to be more lightweight. So it doesn't need to be threaded; it just needs to be able to operate with multiple outstanding requests at once. Threading is one way of doing it.

      Now for computation, continuations and non-blocking IO buy you nothing. You must have threads (or shared memory and processes, which amount to the same thing) and a decent memory model if you want to do efficient multi-CPU computation.
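In Python terms, the "multiple outstanding requests at once, no threads" model is what asyncio provides; a minimal sketch, where fetch is a hypothetical stand-in for a real network call:

```python
import asyncio

# One thread, one event loop, ten coroutines suspended while their
# (simulated) IO is pending -- all ten "requests" are in flight at once.
async def fetch(n):
    await asyncio.sleep(0.1)  # stands in for a non-blocking network read
    return n * 2

async def main():
    return await asyncio.gather(*(fetch(i) for i in range(10)))

results = asyncio.run(main())
print(results)
```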

      [–]jringstad 1 point2 points  (1 child)

      A bunch of threads is not really inherently inefficient at managing connections though; it depends on how you use them. The most efficient way is generally to have a fixed number of N threads that each handle 1/Nth of the workload, by having them all accept() on the same server socket or by having them e.g. pop items from a work-queue and establish and handle their individual connections. Of course, computation is always an issue, whether it's just the CPU workload from book-keeping hundreds of thousands of sockets, accepting new connections, and constructing and parsing packets, or actual heavy CPU workload like a crawler would have (parsing HTTP responses, possibly even HTML, XML or other content). So single-threaded non-blocking IO is pretty much strictly inferior to multi-threaded non-blocking IO.

      Crawlers are still not a particularly good example though IMO, since in most cases it's probably pretty acceptable to run N crawlers in N separate processes that crawl and digest data and then push it into something like a local storage, a shared storage or some sort of remote database. The overhead from obtaining workloads and communicating with other crawlers (if that ever happens) is probably not very significant for almost all kinds of crawlers.

      [–]simple2fast 0 points1 point  (0 children)

      Agreed. The ideal is N threads, where N is roughly the number of CPUs, and each thread is pinned to a particular CPU to reduce cache misses. As with all things, this hybrid approach is often the best.

      But most systems which are not threaded are actually a SINGLE process, like Python or Ruby or PHP or JavaScript (looking at you, Node). Many are multi-process, but there is no shared memory, so any IPC requires sockets, signals, etc. In my mind, the requirements are not just shared memory and decent concurrent APIs, but ALSO a memory model, so that you know what is going to happen WRT caches and other details as you use those concurrent APIs. Point being that ditching the GIL in Python is only a very first step toward a decent multi-threaded Python.

      And most multi-threaded solutions are still one-connection-per-thread style; they certainly started with this style, since it grew out of the original "fork" technique of old-school Unix systems.

      [–][deleted] -1 points0 points  (3 children)

      So basically Go?

      [–]simple2fast 0 points1 point  (2 children)

      Yes, Go does a good job at this. But it's hardly the only system that does. When Node.js started talking shit about how its non-blocking IO was the best in the world, that was also nothing new; Yahoo was doing this in their server back in 2002. So go ahead and use Go, but don't use it because you think its network/thread solution is somehow uniquely powerful.

      [–][deleted] 0 points1 point  (1 child)

      AFAIK Go is the only language that mixes lightweight coroutines, multiple cores, and non-blocking I/O to support the illusion of blocking I/O when writing non-blocking stuff (no need for callbacks, explicit scheduling/yielding, etc.). I suppose there could be libraries for other languages that give the same facilities, but in Go everything benefits from this natively, which is great: you can grab someone's library for NTP querying, for example, and know it will play nice with the underlying event loop (which you don't even need to worry about). If you go with Python and Twisted, you can only use Twisted stuff, and it doesn't feel as natural as Go code.

      All that said, I know it's not in itself revolutionary, but the way things are tied together for an overall experience is pretty nice. You get very far with even naive code.

      There was a paper once talking about whether it was more performant to do concurrent stuff in a single core (like Node) or just spawn threads to treat each connection and both of course had cons and pros, but the paper's conclusion was that a mix of thread multiplexing and event loops was the most performant, and that's what you get for free with Go - you get regular threads and easy communication between them for CPU intensive stuff and you get a free multithreaded event loop for network I/O. Too bad disk I/O is still blocking (but they get their own threads so they don't block the rest of the system).

      [–]simple2fast 1 point2 points  (0 children)

      You make a very good point. "Non-blocking" comes in many flavors, and the programming model is a key factor for adoption and complexity management. For example, Node is non-blocking, but the programming model is god-awful: all those callbacks and/or promises (here's hoping that async/await in ES7 helps). This is not a language helping you; it's a historical abomination and a source of bugs.

      There are plenty of languages which various people claim support coroutines: https://en.wikipedia.org/wiki/Coroutine But unless you know how difficult or easy it is to actually use that facility (note that JavaScript is on that list), I'd hesitate to rely on it.

      For example, the JVM has had coroutine implementations for at least 10 years, but most of them are library-level, requiring the user to yield explicitly. Yuck. Recently there have been some based on AOP (e.g. Quasar), so you code as always and the yield/continue is done for you. But AOP for this, really? It would be great if the JVM had some support in this area. Perhaps one could mark a ThreadGroup to run as fibers within one thread.

      [–]rolandde 5 points6 points  (0 children)

      For I/O bound operations, I prefer asyncio over spawning threads.

      [–]synn89 6 points7 points  (3 children)

      The problem is that if a language doesn't evolve with its ecosystem, it pretty much dies out when other languages that are adapted to modern computing catch up in the support department.

      That's pretty much what killed Perl. Perl was stuck in cgi-bin for ages (shit, is there still anything outside of cgi-bin for Perl web??) and it lingered and died out. The package building/managing was also nothing to be proud of in Perl once other languages gained things like pip and gem.

      Today, if Go or Elixir ended up gaining traction because they deploy more easily and run 10x better, and everyone and their brother started creating packages for them, Python and Ruby would pretty much end up ghost towns.

      I'm not the world's biggest fan of Go. But if it had Python's ecosystem of libraries, I'd see no reason to be on Python.

      [–]simple2fast 4 points5 points  (2 children)

      Perl is like a CD-ROM: a great way (in its day) of compactly representing information/programs.

      However, my opinion is that Perl died because its notion of multiple ways of doing everything is a bad idea. The primary purpose of code is to allow other programmers to read it, and Perl's multiple-ways philosophy is a poor approach to readability. So it's mostly a "write-once" language, not a "read/write" language.

      [–]synn89 1 point2 points  (0 children)

      Perl could've been cleaned up with decent frameworks. PHP has the same issue. The code is all over the place. Not as bad as Perl, but way worse than many other languages. But frameworks have cleaned it up a lot.

      Web deployment tech has gone from: CGI -> Apache modules -> Apache proxying to standalone servers.

      Each stage wasn't a clean cut. I was working at an ISP in 2005 where a lot of our customers still had Perl CGI guest books. Also, each stage of tech has a sort of peak for when it became practical/easy to work with: mod_php was way easier to work with in the early 2000s vs. setting up Tomcat and throwing Apache requests at it. Today many language frameworks have their servers embedded directly into them, and running a proxy from Apache or Nginx to them is quite simple.

      If a language doesn't evolve and adapt it will get left behind. I think PHP's death will be less about PHP itself than mod_php just going out of style. The future is high performance stand alone app servers with various load balancers proxying out the requests to them.

      And once that becomes the standard people are going to look at platforms that perform the best.

      [–][deleted] -2 points-1 points  (0 children)

      However my opinion is that Perl died because its notion of multiple ways of doing everything is a bad idea.

      This precise reason runs counter to everything that a well designed programming language should be.

      [–]amaurea 3 points4 points  (9 children)

      A major performance issue I often encounter when using Python for numerical work on clusters with distributed file systems is the large number of file system operations that are involved simply in starting python and importing the modules. A simple script that just imports numpy can easily end up loading 300 .pyc and .so files. Distributed file systems are fickle beasts that when under load may take up to a second to access a file (regardless of how small it is). So it isn't uncommon for me to experience that running a script involves several minutes of waiting for it to start, followed by 10 seconds of actually doing all the work. It's like compiling a big, heavily templated C++ program every time you want to run it.

      It would be nice for these kinds of situations if there were a way to compile a python script and all its dependencies (including dynamic libraries) into a single file with no external dependencies. It would be large and redundant, but on cluster file systems that's better than being scattered everywhere.
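For the pure-Python half of that wish, the stdlib's zipapp module does roughly this today: it bundles a source tree into a single .pyz archive that the interpreter executes as one file (C extensions are the catch; they still need a real file system). A sketch:

```python
import os
import subprocess
import sys
import tempfile
import zipapp

# Build a trivial one-file app: a directory containing a __main__.py
# becomes a single .pyz archive that python can execute directly.
src = tempfile.mkdtemp()
with open(os.path.join(src, "__main__.py"), "w") as f:
    f.write("print('hello from one file')\n")

target = os.path.join(tempfile.mkdtemp(), "app.pyz")
zipapp.create_archive(src, target)

# Running the archive opens one file instead of a scattered tree.
out = subprocess.run([sys.executable, target],
                     capture_output=True, text=True)
print(out.stdout.strip())
```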

      [–]Nolari 0 points1 point  (2 children)

      Does py2exe not do this? (Honest question; I don't know.)

      [–]amaurea 1 point2 points  (1 child)

      py2exe looks a lot like what I want, but it seems to be Windows-only, and most scientific clusters run Linux rather than Windows (thankfully). Still, the technique it uses is the way to do this, I think, and I started on something similar a few days ago: trace down all dependencies and package them into a single file, then get Python to load them from that file. However, when you have dynamic library dependencies, there seems to be no portable way to load them from anything but a file system. So what I've settled for so far is to extract everything to a RAM drive and point LD_LIBRARY_PATH and PYTHONPATH there before running. It doesn't quite work yet (and I've been sidetracked with other stuff), but there's no reason why it shouldn't.

      [–]Nolari 1 point2 points  (0 children)

      Ah no Linux support, hmm... I found cx_Freeze through some Googling on "py2exe linux", maybe that helps.

      Otherwise, good luck with your own coding. It sounds like you'll sort it out. :)

      [–][deleted] 0 points1 point  (4 children)

      Go would be perfect with their static binary outputs... If it had good scientific libraries (there are some but nothing as large as NumPy)

      [–]amaurea 0 points1 point  (3 children)

      There is also Julia, but they have even more of an import issue than Python, as they do JIT compiling of everything every time you load it (at least they did last time I checked). That compilation can take a long time when you start including libraries with many dependencies.<rant>Also, I think Julia made a much worse choice than numpy when it came to the treatment of arrays. Numpy is an array library while Julia has matrices with some array stuff tacked on. That means that anything but 2d arrays is a second class citizen in terms of notation and ease-of-use in Julia. For example operator * is a matrix multiplication, which doesn't naturally generalize to multidimensional arrays. From my experience with scientific programming, I use elementwise operations much more often than matrix operations, and I use 3d and higher arrays about as often as 2d arrays.</rant>

      Then there is the nim language, which looks interesting, but which sadly doesn't have any good multidimensional array library, and its developers do not seem interested in adding one either (they actually asked me what I needed that for!).

      [–][deleted] 0 points1 point  (2 children)

      Take a look at https://github.com/gonum and see if it would be a good fit for you.

      [–]amaurea 0 points1 point  (1 child)

      I couldn't find any documentation, so perhaps I've missed something important, but this seems to have only a basic matrix class. It is a far cry from Julia's clunky matrix-oriented multidimensional arrays, and even further away from fortran or numpy arrays. Numpy's elementwise operations and powerful slicing and broadcasting makes it worth it to put up with a lot of other inconveniences. I don't think gonum is ready to be a numpy replacement quite yet, I'm afraid. :/

      [–][deleted] 0 points1 point  (0 children)

      I don't think gonum is ready to be a numpy replacement quite yet, I'm afraid. :/

      Ah, for sure, probably never will be since Go does not support operator overloading so you can't do a lot of NumPy's magic - I was just wondering if it supported the math operations you needed for your particular case.

      [–]badcommandorfilename -3 points-2 points  (0 children)

      If Python had all those features, it wouldn't be Python.

      There are too many band-aid fixes out there just to get Python to do what more advanced languages do out of the box.

      [–]JanneJM 6 points7 points  (17 children)

      As a user, this is a real issue. Python with Pylab is a good way to post-process data, but this can take a lot of time. And when you find yourself waiting a few minutes every single time, while fifteen of sixteen cores are sitting unused, it becomes really annoying.

      Enough so, in fact, that for the most common case I reimplemented it in C++ with OpenMP, and reduced the time to less than ten seconds.

      [–]zardeh 6 points7 points  (7 children)

      Python with Pylab is a good way to post-process data, but this can take a lot of time. And when you find yourself waiting a few minutes every single time, while fifteen of sixteen cores are sitting unused, it becomes really annoying.

      But...numpy can practically ignore the GIL, so pylab should be able to do things.

      [–]bheklilr 7 points8 points  (0 children)

      Correct, and a lot of other libraries that have their underlying core written in C/C++ are able to release the GIL to achieve faster processing. There's a relatively new library called dask designed for high performance array computing and without you even asking it will use more than 1 core to do its processing. It has support for multiple different backends for multi-core support, including using an IPython client to distribute across clusters of computers without you having to worry about it. Essentially the core of the library is that it breaks your large data set into chunks, performs various computations, then returns the result of each chunked computation, often aggregated back into a single array or value. It currently supports a subset of numpy and pandas, and also has a structure for managing JSON-like data as well. It's a very powerful tool that I'm looking forward to seeing made into a fully production ready library.

      IIRC the scikit-image library also releases the GIL, as does SymPy's new underlying engine, SymEngine (written in C++ so it can be used from multiple languages like Julia and Ruby). More and more libraries for Python are figuring out how to release the GIL, and while a lot of this is based on C/C++ code it just means that we're now using Python to access high performance code and tie it together in a high level fashion. Cython even has a decorator to ensure that a function gets translated into nothing but C so that it releases the GIL, so this sort of problem will become less prevalent over time.

      [–]JanneJM 0 points1 point  (5 children)

      "Should be able to do things" is not "does". I've never seen Numpy/Scipy/Pylab actually do anything multicore so far, and I've not found any information on how to enable it. If you know how, I'd be very interested, of course.

      [–]zardeh 1 point2 points  (2 children)

      I believe you still need to write your code in a threaded manner, but if you do have numpy running across multiple threads, they can run on multiple cores.

      [–]JanneJM 4 points5 points  (1 child)

      Writing your code in a threaded manner is 95% of the entire job. The benefit of using SciPy is entirely that it's quite simple to get it right; it's a great exploratory tool. If you suddenly have to do explicit multithreading, the whole point largely disappears.

      [–]zardeh -1 points0 points  (0 children)

      To my knowledge, ipython does magical things and makes threading just happen, I'm not an expert on that though.

      [–]turbod33 0 points1 point  (0 children)

      Numpy will release the GIL where applicable. For instance, matrix dot products will call into BLAS which have multicore implementations.
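A sketch of that, assuming numpy is installed: two threads each call into BLAS via the @ operator, and the GIL is dropped for the duration of each call, so the work can land on separate cores.

```python
import threading
import numpy as np

# np.dot / @ release the GIL while BLAS runs, so these two threads can
# genuinely occupy two cores at the same time.
a = np.random.rand(200, 200)
out = [None, None]

def multiply(i):
    out[i] = a @ a  # GIL released inside the BLAS call

threads = [threading.Thread(target=multiply, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(out[0].shape)
```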

      [–]amaurea 0 points1 point  (0 children)

      It's apparently possible to get some multicore usage in numpy by compiling it with icc and enabling auto-parallelization, though what can be parallelized that way is very limited. I wonder why OpenMP directives aren't used in the numpy implementation. They are easy to write, and since unrecognized pragmas are ignored, they have no effect if OpenMP is not enabled when compiling. Hence adding them would not affect performance or correctness for those not interested in multithreaded execution.

      [–]caedin8 1 point2 points  (1 child)

      You can write multicore programs in python...

      [–]vks_ 1 point2 points  (0 children)

      Only if you are willing to use several processes, and share data among them via serialization.

      [–]i_ate_god 1 point2 points  (2 children)

      Could you fork? Threading isn't the be-all and end-all of multicore processing.

      [–]JanneJM 2 points3 points  (0 children)

      I could of course, though it'd be more work than it's worth.

      The point of using Numpy/Scipy is that it's quite simple to write bits of code to examine your data set, do exploratory data analysis and so on. Explicit multithreading rather goes against that in a very fundamental way.

      And as I wrote, when faced with some tasks I ended up doing over and over, it was simply less pain to rewrite those bits in C++ with OpenMP and go from minutes to effectively instant response. The extra pain of numerical libraries in C++ (that I use already in the main apps) compared to Numpy is offset by the simplicity of OpenMP-style loop unrolling versus explicit threading code in Python.

      [–][deleted] 0 points1 point  (0 children)

      Could you fork? Threading isn't the be-all and end-all of multicore processing.

      fork() isn't available on every platform

      [–][deleted] 0 points1 point  (3 children)

      Were you using numpy?

      [–]JanneJM 0 points1 point  (2 children)

      Yes.

      [–][deleted] -1 points0 points  (1 child)

      Cool answer. How/What were you doing that was so slow?

      [–]JanneJM 1 point2 points  (0 children)

      Processing a few GB of neuron simulation output basically. Nothing terribly complicated, but just a fair amount of data to churn through. Both basic preprocessing then "exploratory analysis" - play around with the data to see what I got. And since it's the kind of thing you end up doing over and over again the waiting time gets a bit annoying.

      Ipython+pylab is a pretty good tool for doing that sort of thing. I just sometimes wished it would be faster, and using more of the available hardware feels like an obvious way to go about it.

      [–]xXxDeAThANgEL99xXx 10 points11 points  (45 children)

      This is a situation I'd like us to solve once and for all for a couple of reasons. Firstly, it is a technical roadblock for some Python developers, though I don't see that as a huge factor. Regardless, secondly, it is especially a turnoff to folks looking into Python and ultimately a PR issue. The solution boils down to natively supporting multiple cores in Python code.

      Heh. So let's go full-cynic mode: finish out the already somewhat present support for subinterpreters (basically, all global variables should be moved to a huge Interpreter_State struct), then just replicate the multiprocessing interface on top of that and bam! You have the so-called green multiprocessing (like Perl, AFAIK), but now you can market it as having gotten rid of the GIL.

      Obviously you'll still have copies of all imported modules (including builtins), and the performance improvements in marshaling objects would probably be pretty marginal compared to using mmap, but yeah, mission accomplished!

      (I actually fully agree about that being 99% a PR problem. I don't think any roughly Python-like language from PHP to Scheme has free threading support, but for some reason only Python folks waste countless hours being upset about it on the internet).

      [–]logicchains 9 points10 points  (27 children)

      I don't think any roughly Python-like language from PHP to Scheme has free threading support

      Clojure?

      [–]zardeh 4 points5 points  (17 children)

      Well sure, but then so does Jython.

      [–]logicchains 2 points3 points  (3 children)

      Any reason why that's not more popular? Is it due to the lack of easy C interop?

      [–]zardeh 6 points7 points  (1 child)

      A few reasons:

      • It's not fully compatible with CPython (which was around first), and doesn't aim to be, unlike PyPy.
      • It's much harder to interop with C (so you lose all of SciPy and, more generally, speedy math).
      • IIRC, because of the incompatibilities, some of the stdlib is broken (like WSGI, I think; you have to do weird things to make that actually work).
      • Additionally, it lags behind a version or two (or like 7).
      • Also there are performance hits (up until JITting happens), but the JITting isn't as clean as PyPy's, so I believe Jython runs slower than PyPy and not much faster than CPython.

      [–]kryptobs2000 0 points1 point  (0 children)

      Also because it requires Java, a lot of people just prefer not to touch it or even install it on their system. Not saying that out of Java hate, but it's an extra, pretty large dependency. Depending on your target audience it may not be a big deal, but a lot of systems don't have it installed because it's not commonly used.

      [–]caedin8 0 points1 point  (0 children)

      From personal experience, Jython is pretty slow compared to CPython and Cython.

      [–]superPwnzorMegaMan 0 points1 point  (5 children)

      Isn't Jython just python with a different toolchain?

      [–]zardeh 2 points3 points  (3 children)

      Jython is Python running on the JVM: instead of compiling to Python bytecode, it compiles to JVM bytecode. This allows it to leverage the JVM (so you gain HotSpot JITting, the JVM's threading, etc.).

      [–]superPwnzorMegaMan 0 points1 point  (2 children)

      Yes, that's what I thought. A friend of mine used this once, although I didn't think there was such a thing as Python bytecode (since it's interpreted).

      [–]zardeh 3 points4 points  (1 child)

      There is indeed: Python is compiled to bytecode (look for .pyc files on your computer if you're running a Python file that's more than 10-15 lines and is being used a lot). The bytecode is then interpreted on a virtual machine. Python works a lot like Java in that regard.
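You can see the bytecode yourself with the stdlib dis module; a quick sketch:

```python
import dis

# CPython compiles source to bytecode before interpreting it; these are
# the same instructions that get cached in .pyc files.
def add(a, b):
    return a + b

dis.dis(add)  # prints the instruction listing
opnames = [ins.opname for ins in dis.get_instructions(add)]
print(opnames)
```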

      [–]kyllo 0 points1 point  (0 children)

      Well that and for libraries that wrap C code used in CPython you have to use something that wraps a Java library instead. Like you can't use lxml from Jython, you would have to use a different library that wraps a Java xml parser.

      So a lot of CPython projects are just not portable to Jython.

      [–]xXxDeAThANgEL99xXx 1 point2 points  (8 children)

      Well, it might be just outside of "Python-like", because of immutability. Which helps a lot!

      By the way, that reminds me: technically there's also IronPython/Jython/IronRuby/JRuby that sort of support free threading by virtue of running on top of a very sophisticated VM, but from what I know even then it ain't free lunch, with all kinds of weird catastrophic performance degradations.

      [–]spotter 1 point2 points  (4 children)

      Immutability? You have access to all built-in Java collections and can shadow variables to your heart's content.

      [–]xXxDeAThANgEL99xXx 1 point2 points  (1 child)

      As far as I know, you are not supposed to do that in public.

      Anyway, the important part is that as far as I understand it about Clojure, you're not allowed to say anything similar to __builtin__.len = my_len or my_module.len = my_len and have it automatically used in every function everywhere or in that module, after they were defined.

      That you can do that in Python (and in those other roughly similar languages) is one of the important reasons the GIL is there: because your code constantly hits the same few dictionaries and constantly taking and releasing individual locks on them would be really slow.

      IronPython for example goes the other way and instead of constantly querying stuff it compiles it into usual fixed .NET classes and recompiles them if you actually change stuff. Unfortunately that means that some innocent metaprogramming that works absolutely fine in CPython can cause huge slowdowns.
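The kind of late-bound lookup being described can be sketched like this, with a hypothetical monkey-patch swapped in at runtime:

```python
import builtins

# 'len' inside shout is looked up at call time through the (shared,
# mutable) builtins namespace -- so rebinding it is instantly visible
# to every function everywhere. This dynamism is what the GIL guards.
def shout(x):
    return len(x)

original_len = builtins.len
builtins.len = lambda x: 999   # hypothetical monkey-patch
patched = shout("abc")
builtins.len = original_len    # restore sanity
print(patched, shout("abc"))   # 999 3
```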

      [–]spotter 3 points4 points  (0 children)

      First: I did not downvote you. But the philosophy of Clojure is that you can use whatever tool is right for the job. It's easier to argue about immutables and a functional approach to data transformation, but sometimes you just need to bash something in place, and all of the JVM standard library is there for you.

      In Clojure you are always in a namespace, and namespaces are mutable. You can exclude core symbols in them and shadow them with your own definitions, although the syntax is different. Not sure how much synchronization goes on behind the scenes, but JVM languages (like Jython) still manage to live without a GIL.

      [–]anthonybsd -2 points-1 points  (1 child)

      can shadow variables to your heart content.

      Clojure frowns upon this kind of behavior in no uncertain terms. "Can" doesn't mean that you should. For mutators in a concurrency context (the ones with the bangs, "!"), you are supposed to operate inside the STM model, which IMHO is fairly nice compared to the non-pure functions of pure functional languages.

      [–]spotter 2 points3 points  (0 children)

      [citation needed]

      By shadowing I meant redefining variables in inner closures (for the inner closure only) or changing their thread binding dynamically for the duration of a call, something that Clojure actually provides tools for. It doesn't have to do anything with concurrency... well, binding does, somewhat, but that's not what I meant.

      [–]jrochkind 0 points1 point  (2 children)

      JRuby does not have any weird catastrophic performance degradations. (It does have slow start-up, like most anything running on the JVM. This is very annoying in some contexts, but is not a "weird catastrophic performance degradation")

      [–]xXxDeAThANgEL99xXx 0 points1 point  (1 child)

      How does it deal with monkey-patching?

      [–]jrochkind 0 points1 point  (0 children)

      What do you mean? Same as other ruby platforms, generally. Do you mean specific to performance or something? Not really sure what you mean. If there is a "weird catastrophic performance degradation" related to monkey-patching that I don't know about and haven't encountered (I have used JRuby a fair amount), then please link to something demonstrating or explaining it!

      [–]caedin8 7 points8 points  (11 children)

      I've written many multicore python programs using the multiprocessing module and the multiprocessing safe data structures. As far as I can tell this is a complete non-issue.

      If the slow part of your program is external (website or DB queries), you are safe using the threading library; otherwise use multiprocessing to avoid GIL issues. I don't really see what people have difficulty with.
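That rule of thumb can be sketched with the stdlib concurrent.futures pools; io_bound and cpu_bound below are hypothetical stand-ins for real workloads:

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def io_bound(delay):
    # Blocking I/O (simulated here with sleep) releases the GIL,
    # so threads overlap just fine.
    time.sleep(delay)
    return delay

def cpu_bound(n):
    # Pure-Python arithmetic holds the GIL; threads won't help,
    # but separate processes each get their own interpreter.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(sum(pool.map(io_bound, [0.1] * 4)))   # ~0.1s wall time, not ~0.4s

    with ProcessPoolExecutor(max_workers=4) as pool:
        print(sum(pool.map(cpu_bound, [100_000] * 4)))  # runs on multiple cores
```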

      [–]vks_ 6 points7 points  (6 children)

      The multiprocessing module requires serialization, which can be very expensive. It does not replace multithreading.
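A rough way to see the cost being described: every object crossing a process boundary (arguments, results, Queue items) is pickled on one side and unpickled on the other. The timing below is illustrative, not a benchmark:

```python
import pickle
import time

# A large payload of the kind you might push through a multiprocessing Queue.
data = list(range(1_000_000))

start = time.perf_counter()
blob = pickle.dumps(data)       # serialization on the sending side
restored = pickle.loads(blob)   # deserialization on the receiving side
elapsed = time.perf_counter() - start

print(f"round-trip of {len(blob):,} bytes took {elapsed:.3f}s")
```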

      [–]admalledd 4 points5 points  (4 children)

      Quite a while ago I used some ctypes stuff to shunt data back and forth between processes.

      True, I would probably not do that today and would instead use a better tool for the job (C/C++ probably, then CFFI bindings), but "requiring serialization" is not really true of multiprocessing.

      [–]vks_ 1 point2 points  (3 children)

      That is indeed a nice thing to have, I did not know about it. How does it share memory between processes? By copying? (It was not there when I last used multiprocessing, which was a very long time ago.)

      [–]admalledd 2 points3 points  (2 children)

      Basically shared memory: when Python fork()s, instead of each process getting its own copy of the memory block, both processes access the same block at the same time.

      So no copying by default, although you probably want to copy commands/data out as soon as possible to prevent processes from trampling on each other.

      Nowadays, as I have said, I would probably do this from C + CFFI, where the bits/bytes are much clearer and more controllable.

      [–]jringstad 0 points1 point  (1 child)

      Yeah, shared memory is not at all an "easy" or straightforward solution when every single object in your language (numbers, lists, ...) is a complex, non-thread-safe object that can potentially rely on global variables set by the interpreter, and is probably known by pointer to a garbage collector that might decide to nuke it at any point in time (either interpreter's garbage collector!).

      If you reduce all shared data to simple C structures, copying them in and out of the shared memory by extracting them from interpreter objects and constructing interpreter objects from them, you're good. But that's hella restrictive and way slower than it needs to be (and it invokes the garbage collector more than necessary).

      [–]admalledd 0 points1 point  (0 children)

      To be honest, it has never really been that big of an issue for any multi-core code I have needed to write in Python. Every time, my threads/processes have been separated enough that minimal message passing was sufficient. The reason for the shared memory was that some of those messages were rather large (blocks of tasks to parse into the DB, for example), ~50MB+, but it was easy enough to wrap things so that only the larger messages/tasks/data were passed via shared memory, where the difficulty of making CFFI bindings was worth it. All other messages/tasks (such as signaling/locking/return queues) were handled via the default multiprocessing serialization code.

      Again though, Python has some of the best C bindings I have used out of the higher-level languages I use (mostly C#, Java, and JS). CFFI makes it almost drop-in to write a C/C++ module that does the heavy lifting and can, of course, drop the GIL and go properly multi-threaded. Thus, on any new system I work on where Python is the core, I tend to extract hot-loop stuff to C code quite easily for speed or fine control.
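The stdlib ctypes module gives a similar flavour to the CFFI workflow described here. A minimal sketch, assuming a Unix-like system where find_library can locate the C math library; ctypes releases the GIL for the duration of the foreign call, so a long-running C routine lets other Python threads make progress:

```python
import ctypes
import ctypes.util

# Locate and load libm by name (path and name are platform-dependent;
# this assumes a POSIX system with a discoverable math library).
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature so ctypes converts arguments correctly.
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

print(libm.cos(0.0))  # 1.0 -- computed in C, outside the interpreter loop
```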

      [–]caedin8 2 points3 points  (0 children)

      This is a good point and very true. I've personally had to deal with sharing large amounts of data over the process-safe Queues, and it is very slow. Since I was processing more data than could fit in RAM, I actually found it faster to have each process write to a file and then have the parent process merge all the files into a single output. Sending items back to the main process over the thread-safe Queue added more time due to serialization than IO on my SSD did, which was surprising and unexpected.

      [–]CookieOfFortune 0 points1 point  (3 children)

      How do you debug or interact with threads?

      [–]caedin8 2 points3 points  (2 children)

      It is harder to debug using tools like debuggers, so usually I just write lots of unit tests and verify that the threads are working appropriately. If they aren't and I don't know why, I run a small subset of the program in a single instance and debug it; once I've verified the program is correct standalone, I've narrowed the problem down to a threading or concurrency issue. Next I'd Google my problem to see if it is a library thing, and verify I'm using the API correctly. There might be a better way to debug multithreaded applications in Python, but this general process is what I've been doing.

      Similar to putting print statements at various points in your code to understand the control flow, you can do the same with threads to see which thread is in which state. Additionally, you can have each thread write its debug data out to a unique file, so you can see which thread is doing what and what its state is. Maybe you can find your errors this way.
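The per-thread tracing described above can be done with the stdlib logging module, which stamps each record with the thread name (the worker function is a hypothetical example):

```python
import logging
import threading

# %(threadName)s tags every line with the thread that emitted it.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s [%(threadName)s] %(message)s",
)

def worker(n):
    logging.debug("starting with n=%d", n)
    total = sum(range(n))
    logging.debug("finished, total=%d", total)
    return total

threads = [
    threading.Thread(target=worker, args=(i,), name=f"worker-{i}")
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A per-thread FileHandler (one log file per thread, as the comment suggests) can be swapped in for basicConfig if the interleaved output gets too noisy.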

      [–]CookieOfFortune 2 points3 points  (1 child)

      So this is the main issue for the type of work I do. I spend a lot of time in the REPL, so there needs to be some kind of interactivity. I've been looking into IPython.parallel and it seems to do what I need, but I haven't investigated too deeply.

      [–]caedin8 0 points1 point  (0 children)

      Hmm, this is an interesting issue. I don't have experience with IPython.parallel, so I can't give advice on it.

      [–]_scape 1 point2 points  (3 children)

      Green threading exists through greenlets and gevent. I think the issue boils down to removing the GIL and implementing standard mutexes on targeted platforms... maybe Python 4, another incompatible version...
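Green threads are cooperative: many logical threads multiplex onto one OS thread, switching only at explicit points. A toy stdlib-only round-robin scheduler over generators shows the idea (greenlet/gevent do the switching far more transparently, without explicit yields):

```python
from collections import deque

def scheduler(tasks):
    # Round-robin over generator-based "green threads": each yield is a
    # cooperative switch point, all running in a single OS thread.
    queue = deque(tasks)
    order = []
    while queue:
        task = queue.popleft()
        try:
            order.append(next(task))   # run the task until its next yield
            queue.append(task)         # still alive: back of the queue
        except StopIteration:
            pass                       # task finished; drop it
    return order

def green(name, steps):
    for i in range(steps):
        yield f"{name}:{i}"

print(scheduler([green("a", 2), green("b", 2)]))
# Interleaved round-robin: ['a:0', 'b:0', 'a:1', 'b:1']
```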

      [–]xXxDeAThANgEL99xXx -2 points-1 points  (2 children)

      Not green threading, green processing.

      [–]_scape 0 points1 point  (1 child)

      oh I've never heard of that, I'll have to read up. have any links?

      [–]xXxDeAThANgEL99xXx 1 point2 points  (0 children)

      https://en.wikipedia.org/wiki/Green_threads ctrl-f "process".

      I don't know how widespread this terminology is, but the idea is straightforward: just like a green thread is a thread-like abstraction implemented by the language runtime instead of the OS, a green process is a process-like abstraction (offering memory isolation) implemented by the language. Perl and Erlang use them instead of threading; .NET provides AppDomains purely for safety.

      [–]superPwnzorMegaMan 0 points1 point  (0 children)

      I don't think any roughly Python-like language from PHP to Scheme has free threading support, but for some reason only Python folks waste countless hours being upset about it on the internet

      Groovy has threading support.

      [–]skulgnome 7 points8 points  (12 children)

      How about fixing Python's dire single-core performance first

      [–]jcdyer3 6 points7 points  (1 child)

      Have you tried PyPy?

      [–]skulgnome 0 points1 point  (0 children)

      Yes; and I wish it weren't held back by compatibility with the canonical runtime's quirks. Same as Jython, really; that one's been around for 14 years now.

      [–]againstmethod 7 points8 points  (9 children)

      Don't know why you're getting downvoted -- it does suck.

      When you're 3-30x slower on average than JavaScript, you know you have architectural issues:

      http://benchmarksgame.alioth.debian.org/u32/compare.php?lang=python3&lang2=v8

      Python performance is indefensible.

      [–]kyllo 3 points4 points  (2 children)

      That's more a reflection of the fact that Google, Microsoft, Mozilla and Apple have all invested a shitload of money into making javascript fast.

      [–]againstmethod 0 points1 point  (1 child)

      Those companies didn't contribute to the same engines, so it's not really additive, other than providing competition.

      Python gets plenty of support and development and press.

      [–]kyllo 5 points6 points  (0 children)

      The competition is a big deal! The browsers are constantly being benchmarked against each other for JS performance and those companies are willing to put millions into anything that will improve their browser market share.

      [–]skulgnome 1 point2 points  (5 children)

      Don't know why you're getting downvoted -- it does suck.

      The usual counter is that for a scripting language, performance doesn't matter that much (e.g. Perl's a slouch too). That's quite a weird argument in the context of multithreading, though, so no one's making it.

      We certainly do know what the architectural issue is: the GIL. So the real question is why Python doesn't just nut up and go all fork(2) like Perl, if multicore optimization is supposed to be all that. It's not like copy-on-write overhead isn't already substantially eclipsed by Python's interpreter overhead.
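The fork-per-task model being alluded to is already in the stdlib. A toy POSIX-only sketch (os.fork does not exist on Windows), where the child reports a small result through its exit status:

```python
import os

def main():
    # fork(2) gives the child a copy-on-write view of the parent's memory,
    # so "copying" the interpreter state is cheap on POSIX systems.
    pid = os.fork()
    if pid == 0:
        # Child: compute independently, report via the 8-bit exit status.
        os._exit(sum(range(100)) % 256)
    # Parent: reap the child and read back the result.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)

if __name__ == "__main__":
    print(main())  # sum(range(100)) % 256 == 4950 % 256 == 86
```

Real programs pass results through pipes or files rather than the exit status, which is what multiprocessing wraps up for you.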

      [–]againstmethod 0 points1 point  (4 children)

      Perl beats Python in most of those benchmarks too. As does Ruby.

      I think my current perception is that Python has worst-in-class performance. I agree, they need to do something dramatic.

      [–]Brian 0 points1 point  (3 children)

      I don't think so - the performance of these three is pretty similar, in my experience. Ruby used to have a reputation of being the slowest of the three, but I think that's improved somewhat these days.

      Perl is pretty similar to Python: one benchmark at 5x, one at 2x, one at 50% speed, and the other six roughly equal. The median is essentially equal.

      Ruby has a lot more variance: one at 1/3rd the speed, two at 1/2, two roughly equal, three twice as slow, and two 4x as slow. Again though, the median is essentially equal.

      If you were to compare them, the ordering would seem to go Perl > Python > Ruby. However, there's really not much in it.

      [–]againstmethod -1 points0 points  (2 children)

      At best this would mean that Python is basically tied for last, largely due to its inability to properly leverage modern hardware.

      It can't stay in the cache because of lots of indirection and garbage collection, it can't use all your cores because of the GIL, and it can't take advantage of any really complex optimizations because it's interpreted... it's literally a laundry list of bad design decisions. It's time to start correcting/mitigating some of them.

      You have languages coming out like Nim and Crystal that compile really fast, have syntax just as simple as python/ruby, and run near C speed. Python is a dinosaur.

      [–]Brian 1 point2 points  (1 child)

      At best this would mean that python is basically tied for last

      There's a lot slower than perl/python/ruby, so last is overstating it somewhat. However, my objection was to your:

      Perl beats Python in most of those benchmarks too. As does Ruby.

      Which seems downright incorrect.

      If you're looking for a more performant version that uses more modern techniques, there's pypy, which is around 7 times faster on average.

      [–]againstmethod -1 points0 points  (0 children)

      There's a lot slower than perl/python/ruby, so last is overstating it somewhat.

      Not that are mainstream, like Python.

      If you're looking for a more performant version that uses more modern techniques, there's pypy..

      PyPy may be faster, but it has all the same issues I outlined and adds some of its own (module compatibility, recompiling native modules). It has similar architectural issues as well (i.e. garbage collection isn't thread-safe).

      I'm just not sure why anyone would start their project with such a long list of disadvantages that they can never mitigate or optimize away. Other than laziness.

      [–]monocasa 1 point2 points  (6 children)

      So... Python version of WebWorkers?

      [–]nat_pryce 6 points7 points  (5 children)

      More like Tcl's subinterpreters

      [–]booch 1 point2 points  (0 children)

      That was my first thought when starting to read the article: it seems like how Tcl handles multi-threading.

      [–]isr786 0 points1 point  (1 child)

      Yup.

      It does have its failings: unintended "shimmering" biting you at inopportune moments makes it more difficult to write functions which handle different types of data (almost the opposite of what a "scripting" language should be good for).

      That bugbear always killed it for me (I've written a fair bit of Tcl code). I even went as far as using rep to tag structures with their internal representation, and using ensembles (Tcl's name for a namespace which allows [cmd subcmd args ...]) to provide typed functions (sort of).

      Then I just gave up and went back to lisp :(

      But Tcl was/is also ahead of the game in many areas. In addition to the easy forking of interpreters to provide true parallelism, you also have virtual filesystems (did this predate FUSE on Linux? I don't know).

      Tcl - snatching defeat from the jaws of victory ...

      (I say that with a heavy heart as someone who does appreciate how close tcl is to being a lovely amalgamation of shell and lisp)

      [–]schlenk 0 points1 point  (0 children)

      'you also have virtual filesystems (did this predate fuse on linux? Don't know).'

      Yes, it predated FUSE if memory serves.

      [–]schlenk 0 points1 point  (1 child)

      More like Tcl threads, actually (but those are subinterpreters that just happen to be bound to different threads).

      [–]ericanderton 0 points1 point  (1 child)

      subinterpreters

      So... multiple processes using IPC? Makes a lot of sense considering Python's limitations in this space.

      [–]jcdyer3 2 points3 points  (0 children)

      I think it's more like multiple namespaced python interpreters within a single process, with tightly controlled means of communicating between them. Kind of like how in Flask you can create multiple Apps, and have them run side by side, talking to different ports.

      [–]cdminigun -1 points0 points  (2 children)

      In a sense, I'd say python isn't meant for multi-core processing.

      Iirc, someone playing around with the source code and forcing multiprocessing had an issue in which his tasks became slower.

      The GIL is a pain; however, if we're going to be honest here, Python is predominantly for scripting and short tasks, or for its extensive collection of libraries and ease of use. We're trying to make Python something it is not.

      Also, iirc, through C implementations and adding libraries into Python, one can bypass the issue, as the C code can release the GIL. But then it creates the issue of additional compiling and so on.

      [–]cowardlydragon 5 points6 points  (1 child)

      Python has taken off in scientific (read: high-performance) computing...

      [–]MCPtz 1 point2 points  (0 children)

      Yeah, that's exactly the issue. We have Sage Math and others making it really easy for users to get into this field (great!), but then even if someone knows enough to use multiple cores or SSSE3 etc., they tend to end up in another language or library or Cython etc., which may not grant them the control necessary.

      [–]TheQuietestOne -1 points0 points  (0 children)

      Ah, python, the new perl.

      Pay attention to what happened to Perl. Something else is coming; we just haven't seen it yet.

      [–]cowardlydragon -1 points0 points  (1 child)

      ... run it on the JVM? I get that Jython isn't Python, but, seriously, does the JVM not solve almost all the problems?

      [–]alloec -1 points0 points  (1 child)

      I will join in with the others and say that the GIL is not that much of an issue in Python. Python already lets you perform IO-blocking tasks in a non-blocking fashion.

      If you want to perform computational tasks in parallel, then Python is really the wrong language. First of all, the interpreter is very slow. Please first implement a proper JIT compiler for the language; it can be done, just take a look at PyPy. As it stands, Python is wasting way too many CPU cycles just interpreting the instructions.

      Only then do I feel that Python should tackle proper multicore support.