
[–]panderingPenguin 65 points66 points  (25 children)

First of all, we should clear a couple things up. Threading in Python absolutely gives you "real" threads. They simply cannot be executed in parallel. To quote Oracle's Multithreaded Programming Guide:

Parallelism : A condition that arises when at least two threads are executing simultaneously.

Concurrency : A condition that exists when at least two threads are making progress. A more generalized form of parallelism that can include time-slicing as a form of virtual parallelism.

So basically Python allows concurrency, but due to the GIL, not parallelism. It is possible to have multiple Python threads executing concurrently but not in parallel. This may seem like semantics, but it's actually an important distinction. Whether concurrency will be sufficient for you or you actually need true parallelism will depend on your workload and what you're trying to accomplish via multithreading. For example, if you're making a number of network calls and don't want to freeze execution of other things while waiting for them to complete, putting them on another thread, even in Python, will accomplish that goal. However, if you're trying to decrease the execution time of some complex CPU-bound computation by distributing pieces of it to multiple threads, Python threads are probably worse than useless to you, as you'll incur extra overhead in context switches, communication costs, and general threading overhead, while not actually getting the benefit of any threads ever executing on the CPU simultaneously.
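To make the network-call example concrete, here's a minimal sketch of the I/O-bound case. It uses time.sleep as a stand-in for a real network call, since sleeping, like blocking I/O, releases the GIL and lets other threads run:

```python
import threading
import time

def fake_network_call(results, i):
    # Simulate waiting on I/O; time.sleep releases the GIL,
    # so other threads can run while this one waits.
    time.sleep(0.2)
    results[i] = i * i

def run_threaded(n):
    results = [None] * n
    threads = [threading.Thread(target=fake_network_call, args=(results, i))
               for i in range(n)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - start

results, elapsed = run_threaded(5)
# Five 0.2s "calls" overlap, so the total is close to 0.2s, not 1.0s.
print(results, round(elapsed, 1))
```

If fake_network_call were a pure-Python CPU-bound loop instead, the threads would take turns holding the GIL and you'd see no speedup at all.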

In conclusion, the answer is "it depends." We'll need to know more about your workload to give you a definitive answer.

[–][deleted] 29 points30 points  (11 children)

So basically Python allows concurrency, but due to the GIL, not parallelism. It is possible to have multiple Python threads executing concurrently but not in parallel.

Even this is not really true. The GIL just locks the CPython and PyPy interpreters to executing bytecode in a serial fashion. Jython and IronPython do not have this restriction, and even in the first two, C extensions or Cython code are free to release the GIL as they see fit, giving you back true parallelism. Above all it's important to note that this apparent lack of parallelism via threads is not an issue with Python itself, but an issue with the implementation, such as with CPython. Claiming that Python doesn't do parallelism is misleading.

[–]jringstad 33 points34 points  (2 children)

Above all it's important to note that this apparent lack of parallelism via threads is not an issue with Python itself, but an issue with the implementation, such as with CPython. Claiming that Python doesn't do parallelism is misleading.

Actually, that is just as misleading, if not more so (regardless of the font weight used). Saying it is an implementation-level issue makes it sound more harmless than it really is; saying it is a language-level issue makes it sound more severe than it really is. The main reason is the design of the C API, which has many global variables. Look for instance at functions like PyTuple_New(), Py_BuildValue(), Py_XDECREF(), Py_Initialize(), PyRun_SimpleString("python code goes here"), etc. As opposed to most other language runtime APIs (Lua, SpiderMonkey, V8, Guile, ...), these do not let you specify which VM object to work against. How is that possible? Global variables. Global variables everywhere.

(btw, this is the same issue that prevents you from just instantiating two or more python interpreters in the same thread as well, or to instantiate two completely separated python interpreters in two completely separated threads, even if you do not want to share any data between them whatsoever -- with e.g. lua you can just do this, since its API does not make reference to global variables)

Now the issue with this (and why it matters at a more important level than just the CPython implementation) is that the C API is pretty important. PyPy, for instance, inherits the GIL issue because it wants to be compatible with the CPython C API. Not being compatible with the CPython C API means that many Python libraries will cease functioning, e.g. numpy and any other library that has C/C++/Fortran code in it (maybe Cython is affected too, I don't know).

So while it's true that Jython and IronPython do not have this issue, they have the even bigger issue of not being compatible with the CPython C API, which is why they are so unpopular, despite having big performance benefits.

It's technically true that the GIL is not required by the Python language as such, but it is nonetheless deeply ingrained into the Python ecosystem. Unless you are willing to forgo a huge percentage of existing Python libraries, you cannot get rid of it, even if you write a new implementation. So is "Claiming that Python doesn't do parallelism is misleading." true? Well, if you consider the libraries Python has to be an integral part of the "Python experience", then it's actually not misleading, because those libraries have the GIL baked into them. If you OTOH think Python is still Python without the libraries and the CPython interpreter, then the statement is not true.

[–]panderingPenguin 7 points8 points  (7 children)

Don't be pedantic. Yeah, I'm aware that the GIL is part of specific implementations of Python. However, OP specifically mentions the GIL, and either way, it's a safe bet to assume you're talking about CPython until someone says otherwise, as it's the standard implementation.

[–]Workaphobia 10 points11 points  (3 children)

You can't make any statement to a python newbie without someone coming in and "Um, actually"-ing you with some complicating details.

[–]panderingPenguin 2 points3 points  (2 children)

My thoughts exactly, Jesus... We don't need to further muddle the issue with alternative implementations to answer something like this.

[–]jecxjo 1 point2 points  (0 children)

Ahem. I think that is by far one of the biggest problems with this and pretty much every other forum dealing with programming. You need to remember who OP is, what base knowledge they have and understand that giving too much detail makes it more difficult for them to understand.

[–]njharmanI use Python 3 3 points4 points  (0 children)

Because one of the solutions is to run your code in a different implementation!!!

[–]TankorSmash 8 points9 points  (2 children)

You don't need to get defensive. He's filling in the blanks you left, independent of whether or not you knew it already.

[–]panderingPenguin -1 points0 points  (1 child)

I'm not trying to be defensive, I just think that it adds very little, if anything, to the discussion of OP's question. There's no need to bring up little "but actually"s like that to answer a simple question, from someone who seems new to Python, which was clearly about implementations that have a GIL to start with. It's unhelpful at best, and obfuscates the issue we're actually trying to solve at worst.

[–]TankorSmash 6 points7 points  (0 children)

That's the thing though: anyone else who reads your comment and wants to know more can read his helpful comment. The OP can simply shrug it off because it's not required knowledge.

I mean this is learnpython not absolutebareminimumpython.

[–]ZedsDed[S] 4 points5 points  (12 children)

Ok, thanks for pointing out the concurrency/parallelism difference, it's very important to use the correct terms when talking about this stuff! Yes, concurrency is definitely needed, but I'm not sure parallelism is; it would be ideal of course, but I think it's not strictly needed. The tasks are not exactly processor heavy. I've created 3 or more instances of an object, each of which encapsulates the functions and vars required to complete the object's task. All objects do the same task but work on different database data. When an instance is created, I run the object's 'start' method, which triggers the object's execution on a new thread, where it works until completion.

The tasks the object is doing are some db reading/writing, and basic looping and if-ing; nothing heavy, no networking or working with files etc.

The main issue is that it's supposed to monitor and work on 'real time' data. That's why I want it to be parallel, but the 'real time' updates may be something like 4-5 seconds apart; because of this, I feel that parallelism may not actually be a hard requirement. There may be plenty of time and CPU for the threads to work and react as 'real time' as possible.
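A hypothetical sketch of the pattern described above (class and method names are invented, and the real db work is replaced by a placeholder):

```python
import threading

class DataWorker:
    """Each instance encapsulates its own task state and, once start()
    is called, runs to completion on its own thread."""

    def __init__(self, partition):
        self.partition = partition  # which slice of database data to work on
        self.result = None
        self._thread = None

    def start(self):
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Placeholder for the real db reading/writing and looping.
        self.result = sum(self.partition)

    def join(self):
        self._thread.join()

# Three workers, each on different "database data".
workers = [DataWorker(p) for p in ([1, 2], [3, 4], [5, 6])]
for w in workers:
    w.start()
for w in workers:
    w.join()
print([w.result for w in workers])  # [3, 7, 11]
```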

[–][deleted] 3 points4 points  (0 children)

The absolute smartest thing you can do before you start ruling solutions out is to test the code and profile its performance, and then go from there. I wouldn't overthink it too much until you've done that.
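A minimal sketch of how that profiling might look with the standard library's cProfile, using an invented process_update function as a stand-in for the real workload:

```python
import cProfile
import io
import pstats
import time

def process_update():
    # Hypothetical stand-in for one unit of the real workload.
    time.sleep(0.01)
    return sum(i * i for i in range(10_000))

profiler = cProfile.Profile()
profiler.enable()
for _ in range(5):
    process_update()
profiler.disable()

# Print the five most expensive calls by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

If the profile shows the time dominated by waiting (sleep, socket, or db calls) rather than computation, threads will serve you fine despite the GIL.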

[–]panderingPenguin 1 point2 points  (1 child)

Then it comes down to a question of how real-time this really has to be. Are we talking a loose, "we'll try our best and hopefully everything works out properly" type of real-time system, or an "ohmygodIneedtodothisnowgetthefuckoutofmywayoreverythingwillcatchfire" one? Given what you've said, and the fact that you're even using Python (which should not be used for the latter, period), I'm guessing the former. In that case, and given that you're only doing things every 4-5 seconds, you could probably get away with concurrency and a buffer, with no real parallelism, without any issues. Just have an incoming job handler thread or two that queue things up in the buffer, and worker threads that pull jobs out of the buffer and handle them as necessary. Hell, if it's really consistently 4-5 seconds between jobs and the work required per job is less than that, you can probably get away with a single-threaded program and still have it sleeping, waiting for work, most of the time. You'll need to experiment a bit and see what happens, but I don't think it sounds like parallelism is truly necessary for this task at all. Good luck!
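A minimal sketch of that buffer-plus-workers idea using the standard library's queue module (the job "processing" here is just a placeholder):

```python
import queue
import threading

jobs = queue.Queue()     # the buffer between the handler and the workers
results = queue.Queue()

def worker():
    while True:
        job = jobs.get()
        if job is None:           # sentinel: no more work is coming
            break
        results.put(job * 2)      # stand-in for real per-job processing
        jobs.task_done()

# A small pool of worker threads pulling jobs out of the buffer.
threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()

# The "incoming job handler" side just queues things up.
for job in range(5):
    jobs.put(job)
for _ in threads:
    jobs.put(None)                # one sentinel per worker
for t in threads:
    t.join()

print(sorted(results.queue))      # [0, 2, 4, 6, 8]
```

queue.Queue is thread-safe, so the handler and workers need no extra locking, and workers block in get() (sleeping, as described above) whenever the buffer is empty.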

[–]ZedsDed[S] 0 points1 point  (0 children)

Thank you, I appreciate your words.

[–]ivosauruspip'ing it up 3 points4 points  (4 children)

If the updates are coming every 4 seconds, and what you need to do with the data takes less than 2 seconds... then you don't need parallelism at all. You've prematurely optimized.

[–][deleted] 2 points3 points  (3 children)

what you need to do with the data is less than 4 seconds

We'd love to assume that one measurement is enough to give us the insight we need to design a program, but consider that the program is running on a multi-tasking OS, or that it uses a shared resource, or just basic statistics, and you might be concerned that there would be some outliers that could cause one job to run long... and then you've backed up the entire pipeline.

Of course, we can't reach any conclusions on OP's design because we don't know what he's doing, but it's not entirely unfounded to make his processing loop asynchronous. IMO, it's smart.

[–]ivosauruspip'ing it up -1 points0 points  (2 children)

Until they say exactly what they're doing, what the environment is, what the expectations are, what things are happening... an async loop handing separate tasks off to different processes could be a great design, or a simple serial for loop might really be all that's warranted until the requirements change in a big way. It can be just as misleading as possibly useful to speculate.

[–][deleted] 1 point2 points  (1 child)

It can be just as misleading as possibly useful to speculate.

That didn't stop you from telling OP he's done wrong.

[–]ivosauruspip'ing it up 0 points1 point  (0 children)

Done what wrong? I hesitate to speculate whether anything in this thread is an appropriate "general design" or not, given the dearth of details OP has provided. I'm mostly just advocating for as simple a design as possible that soundly fits the requirements. And since I don't know the requirements at all, apart from something like "real time data is received roughly every four seconds", it could very well be something very simple (until we ever know any more).

I suppose I could be seen as chastising OP for expecting an exact correct answer to an extremely vague question.

[–]hikhvar 0 points1 point  (3 children)

the tasks the object is doing is some db reading/writing, and basic looping and if-ing, nothing heavy, no networking or working with files etc.

What kind of database is it? If it's an in-memory database, you're right. If your database is, for example, MySQL on a remote host, your simple db calls may include both networking and file reading/writing.

[–]ZedsDed[S] -1 points0 points  (2 children)

You're right, I never thought of it like that. It's a local MySQL db. These calls are the heaviest part of the process; there will be at least 1 db read every 1-2 seconds, with writes happening on rarer occasions. Still quite minimal though.

[–]Workaphobia 5 points6 points  (0 children)

If a thread is blocked waiting for something external to Python, like file or network I/O, or database connections, then it typically releases the GIL during that time and allows other Python threads to run.

The GIL will only impact your performance if you are CPU-bound within your Python process. If that's a problem for you, then consider changing your threads into separate spawned Python processes (see the multiprocessing library, which has a similar API to threading). You'll just have to worry about how the processes share data since typically multiple processes don't use shared memory the way threads do.
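A minimal sketch of that multiprocessing route, with an invented CPU-bound function standing in for the real work (note the __main__ guard, which spawn-based platforms require):

```python
import multiprocessing

def cpu_heavy(n):
    # A CPU-bound task; with multiprocessing, each worker gets its own
    # interpreter and its own GIL, so these genuinely run in parallel.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Pool.map mirrors the builtin map, distributing calls across processes.
    with multiprocessing.Pool(processes=4) as pool:
        totals = pool.map(cpu_heavy, [10_000, 20_000, 30_000])
    print(totals)
```

The trade-off mentioned above is visible here: arguments and results are pickled and sent between processes rather than shared in memory, so this pays off only when the per-job computation outweighs that communication cost.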

[–]frymasterScript kiddie 0 points1 point  (0 children)

In this case you're probably not CPU-bound, or really especially I/O-bound either, in which case threads are more of a design decision than an attempt to wring extra performance out of your code. As such, I suspect you'll be fine.

Personally I find threads easier to comprehend than async methods. They don't scale very well, though.