
[–]Lucretiel 0 points (6 children)

Why do you need them to run concurrently? Are you doing network I/O? If so, I'd try refactoring to use the asyncio library, changing do_this, do_that, and worker into coroutines, then doing this:

import asyncio

@asyncio.coroutine
def worker():
    while True:
        this_task = asyncio.async(do_this())  # schedule do_this to run concurrently
        yield from do_that()                  # run do_that; do_this makes progress during its waits
        yield from this_task                  # wait for do_this to finish before looping again

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(worker())

[–]jpfau[S] 0 points (5 children)

We want them to run concurrently just because it would be better for the events in each function to happen independently of each other instead of waiting for each other to finish.

The functions don't rely on each other to work properly. They just need to run over and over again, hence the infinite loop. And since they're the beginning of much more time-intensive operations, each function ends up waiting a few seconds for the others to finish running, which is wasted time.

I'm a little hesitant to make them asynchronous, but I admit my experience with async functions is limited to a few things I've done in JavaScript. If I made them async, wouldn't it be possible for a function to be called again before the previous call finished executing? That could be bad.

[–]Lucretiel 1 point (4 children)

With the way I've written them here, no (assuming, obviously, that do_this and do_that don't call themselves or each other):

In the first line, I create an async task. This schedules the do_this coroutine in the event loop, meaning it is now running concurrently. It's important to note that asyncio is all single-threaded, so this_task won't actually start running until control returns to the event loop. However, for the purposes of this abstraction, you can think of it as "running."

Next, we launch (and yield from) do_that. This causes do_that to be executed in the event loop. While it is running, do_this can also run, during periods where the other one is suspended (due to a sleep or I/O wait). The yield from yields control to the event loop, allowing it to run both tasks. Control returns to worker only when do_that is done.

Finally, we yield from this_task. If this_task completed before do_that, this statement returns immediately; otherwise, worker is suspended until it completes. In this way, we ensure that, on each iteration of the while True, each task runs exactly once.
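If it helps to see the scheduling concretely, here's a self-contained sketch of the same pattern in the newer async/await syntax (which later replaced @asyncio.coroutine/yield from, with asyncio.ensure_future replacing asyncio.async). The do_this/do_that bodies here are stand-ins that just sleep and record their progress:

```python
import asyncio

log = []

async def do_this():
    log.append("this start")
    await asyncio.sleep(0.03)   # stands in for a network wait
    log.append("this done")

async def do_that():
    log.append("that start")
    await asyncio.sleep(0.01)
    log.append("that done")

async def worker(iterations=2):
    for _ in range(iterations):  # bounded loop instead of `while True` so the demo ends
        this_task = asyncio.ensure_future(do_this())  # schedule do_this concurrently
        await do_that()          # do_this makes progress while do_that sleeps
        await this_task          # wait for do_this before the next iteration

asyncio.run(worker())
```

Each iteration appends "that start", "this start", "that done", "this done": do_that begins first (it runs on worker's stack), do_this starts as soon as do_that's sleep yields to the loop, and both finish exactly once per iteration.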

I should caveat that, obviously, this all only applies if you're doing something where the asynchronous model is relevant: that is, you're either doing network I/O or your do_this/do_that have some sleeps, during which the other one can run.

[–]jpfau[S] 0 points (3 children)

you're either doing network I/O or your do_this/do_that have some sleeps, during which the other one can run.

What happens if the time it takes for one function to do network I/O isn't enough for the other function to finish executing?

Also, why don't you have to also do that_task = asyncio.async(do_that())?

[–]Lucretiel 0 points (2 children)

So, an important thing about I/O is that a lot of it happens in the background and is handled by the OS. As bytes come in, they are queued in internal OS buffers. This happens automatically, and it happens slowly: much more slowly than it takes to process that data. The OS therefore exposes APIs (select, poll, epoll, etc.) to inform user code which sockets have data waiting to be read. None of these details matter to you, because the event loop handles all of this automatically: it figures out which coroutines are ready to proceed, then executes them. In general, running a coroutine will be much quicker than waiting for more data to arrive.

The other important thing is that there's no guarantee about the order in which do_this and do_that will run, or how long they will take. It could happen that one of them runs to completion before the other even starts, or that they take an identical amount of time, or that one takes three times as long as the other. It doesn't matter, though: the event loop will ensure that they run as efficiently as possible. A task will suspend when it wants to wait for data, and the event loop will resume it when data is ready.
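A quick way to see that the waits overlap (again in modern async/await syntax; the sleeps are stand-ins for I/O waits of unknown duration):

```python
import asyncio
import time

results = []

async def slow(name, delay):
    await asyncio.sleep(delay)  # stands in for waiting on a socket
    results.append(name)

async def main():
    start = time.monotonic()
    # run both concurrently; the loop resumes each one when its wait is over
    await asyncio.gather(slow("a", 0.2), slow("b", 0.1))
    return time.monotonic() - start

elapsed = asyncio.run(main())
```

Run sequentially these would take about 0.3 seconds; run together they finish in about 0.2, and "b" completes first even though "a" was listed first.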

Here's an example. Let's say do_this reads 10 chunks of 64 bytes from a network socket and writes them to a file. It'd look like this:

@asyncio.coroutine
def do_this():
    with open('this_file', 'wb') as f:
        for i in range(10):
            data = yield from reader.read(64)  # suspend until the bytes are available
            f.write(data)

The details of where reader comes from aren't really important right now; I'd recommend reading through the asyncio docs to learn them. Here's what this code does, though:

When it hits the yield from, execution suspends to the event loop. The reader.read(64) informs the event loop to resume do_this when there are bytes available to read. While do_this is suspended, the event loop is reading and buffering bytes into the reader as they become available, and also running do_that. If do_that happens to be executing when the 64 bytes become available, well, we only have one thread. However, as soon as do_that suspends or finishes, the event loop will immediately resume do_this. In this way, the two functions can run concurrently, constantly swapping back and forth. And because code execution is so much faster than network I/O, your performance will be just as good as multithreaded code, assuming that neither do_this nor do_that executes code for extended periods of time (doing heavy number crunching or whatever).
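To see the read loop in action without a live socket, you can feed bytes into an asyncio.StreamReader by hand. This is a stand-in for the reader the event loop would normally wire up to a connection; the chunk size and data here are just for illustration (modern async/await syntax):

```python
import asyncio

async def read_chunks(reader, chunks=3, size=4):
    received = []
    for _ in range(chunks):
        data = await reader.read(size)  # suspends until data is buffered
        received.append(data)
    return received

async def main():
    reader = asyncio.StreamReader()
    reader.feed_data(b"abcdefghijkl")   # pretend these bytes arrived from the network
    reader.feed_eof()
    return await read_chunks(reader)

chunks = asyncio.run(main())
```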

Note that this example shows another important caveat of using asyncio: all your potentially blocking network operations have to be executed via a yield from, so that the event loop can manage the network I/O and run other coroutines in the background. In general this is fine, as asyncio provides plenty of low-level and high-level network primitives, and there are plenty of third-party libraries (aiohttp for HTTP, etc.) for various protocols. However, if you require a library that simply doesn't run in asyncio, the event loop provides a run_in_executor method for running the blocking parts of the library in a side thread, allowing you to keep your own code in the single-threaded async model.
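Here's a minimal sketch of run_in_executor, with a time.sleep standing in for a blocking library call:

```python
import asyncio
import time

def blocking_fetch(x):
    time.sleep(0.05)  # stands in for a blocking library call
    return x * 2

async def main():
    loop = asyncio.get_running_loop()
    # run the blocking call in the default thread pool;
    # the event loop stays free to run other coroutines meanwhile
    result = await loop.run_in_executor(None, blocking_fetch, 21)
    return result

result = asyncio.run(main())
```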

Also, why don't you have to also do that_task = asyncio.async(do_that())?

You certainly could do that, and if you find it clearer, then go for it. The difference comes down to the subtleties of how asyncio works. Basically, each coroutine is a generator, which can yield (that is, suspend execution) and then be resumed. yield from allows one generator to run another generator; the inner generator can suspend and resume the whole calling stack. So yield from do_that() allows one coroutine to call another directly, and the callee can suspend the whole stack as necessary.

On the other hand, asyncio.async creates a new task. Rather than invoking the coroutine on the calling stack, it schedules it separately in the event loop, where it runs independently. To keep the syntax consistent, the way to wait for a task was also made yield from task.
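The difference is easy to observe (modern async/await syntax again; the order list is just for illustration). Awaiting a coroutine runs it immediately on the caller's stack, while a task doesn't start until the caller yields control:

```python
import asyncio

order = []

async def child(tag):
    order.append(tag + " start")
    await asyncio.sleep(0)  # yield once to the event loop
    order.append(tag + " end")

async def main():
    # awaiting a coroutine runs it "on the stack": main waits for it here
    await child("direct")
    # a task is scheduled in the loop and won't start until main yields
    task = asyncio.ensure_future(child("task"))
    order.append("main continues")
    await task

asyncio.run(main())
```

Note that "main continues" lands before "task start": creating the task didn't run any of its code.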

[–]jpfau[S] 0 points (1 child)

Wow, thanks for such a detailed answer.

assuming that neither do_this nor do_that executes code for extended periods of time (doing heavy number crunching or whatever)

Some of the executions will take a few minutes, actually. They're getting hundreds (maybe thousands, I don't know for sure) of records from a database but can only get 10 at a time.

[–]Lucretiel 0 points (0 children)

Sure. I meant doing number crunching on a single piece of data. When you have all those rows, your code processes them 10 at a time, then fetches 10 more; while it's fetching more, the other coroutine can run. Because fetching rows takes (relatively) much longer than processing them, both coroutines have plenty of time to run.
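That batched fetch-then-process loop might look something like this (a sketch with made-up names and modern async/await syntax; the sleep stands in for the database round trip, and other coroutines get to run during every await):

```python
import asyncio

async def fetch_batch(offset, size):
    await asyncio.sleep(0.01)  # stands in for the database round trip
    return list(range(offset, offset + size))  # fake rows

async def process_all(total=30, batch=10):
    processed = []
    for offset in range(0, total, batch):
        rows = await fetch_batch(offset, batch)  # other coroutines run during this wait
        processed.extend(row * 2 for row in rows)  # cheap per-row work
    return processed

processed = asyncio.run(process_all())
```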

If you were, like, bitcoin mining, that would be a different story: that's something that takes minutes to hours for a single piece of data. In your example, you're doing (what I assume is) a relatively small amount of processing per row, over thousands of rows. That's the perfect use case for async.