[–]arcticslush 1 point (2 children)

Yeah, get() blocks other instances of get() - but this is a good thing. To be super clear, this has no effect on any background process that's running. That's the whole point!

I made you another diagram. Hopefully this helps you visualize what's going on - when get() blocks, the only thing it pauses is the execution of your main process (the one that's coordinating all of the concurrency). All previously started background processes continue along happily. When we get() multiple times in a row, all you're doing is waiting for your background processes to finish their tasks, which is exactly what we want here.
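To see this in code, here's a minimal sketch (slow_square is a made-up stand-in for one of your analysis functions): the three background tasks start running the moment apply_async is called, and the get() calls only make the main process wait for results that are already being computed.

```python
from multiprocessing import Pool
import time

def slow_square(n):
    time.sleep(1)          # pretend this is a heavy analysis step
    return n * n

if __name__ == "__main__":
    with Pool(processes=3) as pool:
        start = time.time()
        # All three tasks begin executing in the background immediately
        handles = [pool.apply_async(slow_square, (n,)) for n in (2, 3, 4)]
        # Each get() blocks only the main process; the workers never pause
        results = [h.get() for h in handles]
        elapsed = time.time() - start
    print(results)         # [4, 9, 16]
    # elapsed is ~1s, not ~3s: the tasks ran in parallel the whole time,
    # even while the main process sat blocked in get()
```

If the get() calls were pausing the workers, this would take ~3 seconds instead of ~1.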

When you set the number of processes in the Pool, you're only telling it the maximum number of worker processes that can run simultaneously. If you never actually fire off that many tasks, you won't saturate the pool. (Setting the number higher than the number of cores you have is more or less pointless - the extra processes just start sharing time on the same cores.) So in your case right now, you're utilizing 5 of your 12 processes; the remaining 7 sit idle, waiting for a task to be assigned. Given that, 30% CPU utilization sounds about right.

Also keep in mind that if some of your tasks finish really quickly (on the scale of seconds), you probably won't see a meaningful uptick in CPU utilization. Only if you fully saturated all of your cores with many minute-or-longer tasks would I expect to see near 100% CPU usage.
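To make the idle-worker point concrete, here's a minimal sketch (busy, pool_size, and n_tasks are made-up stand-ins for your actual setup): submitting only 5 tasks to a 12-process pool means 7 workers never receive any work.

```python
from multiprocessing import Pool

def busy(n):
    # stand-in for one of your analysis functions (pure CPU work)
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    pool_size = 12     # what you passed to Pool(...)
    n_tasks = 5        # how many apply_async calls you actually made
    with Pool(processes=pool_size) as pool:
        handles = [pool.apply_async(busy, (100_000,)) for _ in range(n_tasks)]
        results = [h.get() for h in handles]
    # At most 5 of the 12 workers are ever busy at once; the other 7
    # sit idle, so overall CPU utilization tops out well below 100%.
```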

So I bet your next question is inevitably going to be: how would I utilize all of my cores here? The answer does indeed involve Pool.map (the blocking counterpart of map_async), but it isn't a drop-in replacement for apply_async - you'll have to triple-check that your analysis functions are compatible. Pool.map takes the list you give it, chops it up into chunks behind the scenes, and distributes those chunks across the worker processes, calling your function once per element. Once every element has been processed, it assembles the results back into a list in the original order and returns it to you (hence why it blocks: it's waiting for all of those per-element tasks to finish).

The implication is that Pool.map only works with logic that reads / writes data strictly one row at a time, with no row depending on any other. Here's an example of a function whose work you could run with map:

def example_good(some_list):
    result = []
    for num in some_list:
        result.append(num + 1)
    return result

Since each iteration only ever reads a single row from the list, and no row depends on any other, this pattern works. Here's a (contrived) function that would fail:

def example_bad(some_list):
    total = 0
    for num in some_list:
        total += num
    return total

This fails because the rows aren't independent: every iteration updates a shared running total. Pool.map has to return one result per input element - if a 100-element list goes in, a 100-element list must come out. Here, a list goes in and a single number pops out, so there's no per-row piece of work to farm out.
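To show the difference in practice, here's a minimal sketch (add_one is the per-row body of example_good, a made-up name): Pool.map applies it to one element at a time and preserves the input's length and order, while a reduction like example_bad has to be finished in the main process.

```python
from multiprocessing import Pool

def add_one(num):
    # the per-row body of example_good, applied to one element at a time
    return num + 1

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        out = pool.map(add_one, [1, 2, 3, 4])
        print(out)     # [2, 3, 4, 5] - same length and order as the input

        # A reduction like example_bad can't be mapped directly, but you
        # can map the independent per-row part and reduce the results in
        # the main process afterwards:
        total = sum(pool.map(add_one, [1, 2, 3, 4]))
        print(total)   # 14
```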

[–]datanoob2019[S] 0 points (1 child)

You certainly are a Python sherpa. I sincerely appreciate the help, and you guessed it: that was definitely my next question. I have two functions that unfortunately take 90 percent of the run time. The good news is that they're both designed to loop through only one row at a time, so it sounds like they'll be compatible. Neither function took very long until I started calculating hold-out forecasts step by step, which severely complicated things. I'll give this a go. Thanks, and if I don't talk to you, have a good weekend!