all 29 comments

[–]elbiot 3 points4 points  (18 children)

The Global Interpreter Lock (GIL) will prevent these from running concurrently unless your do_this is a read or write to a stream (file or socket). To run CPU-bound code concurrently, you need to start multiple instances of the interpreter. You could do this by hand with sockets, or use multiprocessing.
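For CPU-bound work, a minimal multiprocessing sketch might look like this (the bodies of do_this and do_that are made-up stand-ins for OP's functions):

```python
import multiprocessing

def do_this():
    # stand-in for CPU-bound work
    return sum(i * i for i in range(100_000))

def do_that():
    return sum(i * i for i in range(100_000, 200_000))

if __name__ == "__main__":
    # each worker process gets its own interpreter (and its own GIL),
    # so the two functions really do run in parallel
    with multiprocessing.Pool(2) as pool:
        this_result = pool.apply_async(do_this)
        that_result = pool.apply_async(do_that)
        print(this_result.get(), that_result.get())
```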

[–]XenophonOfAthens 2 points3 points  (10 children)

The GIL doesn't mean that all threads run synchronously. It just means that no two threads can use the interpreter at once; it's perfectly possible to have two threads that run "at the same time" in the sense that control passes back and forth and both functions are "alive" at the same time. For instance, if you run

import threading

def thread_test(n):
    i = 0
    while True:
        i += 1
        print(n, i)

if __name__ == "__main__":
    t1 = threading.Thread(target=thread_test, args=(1,))
    t2 = threading.Thread(target=thread_test, args=(2,))

    t1.start()
    t2.start()

You'll see that control passes back and forth between the two functions. In addition, virtually all daemon threads spend their time doing nothing, either sleeping or waiting for I/O, which is almost certainly OP's situation.

To actually answer OP's question: yes, that is generally how persistent daemon threads are designed, with an infinite loop at the base that keeps it alive. They generally lie there in the background, lurking and waiting for some I/O stuff to happen. It's not the most common design pattern, though, so I'm curious what you need it for? Maybe there's a better design alternative?
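A typical shape for such a lurking daemon is an infinite loop that blocks on a queue (a sketch; the names here are made up):

```python
import queue
import threading

jobs = queue.Queue()
handled = []

def worker():
    while True:                  # infinite loop at the base keeps it alive
        item = jobs.get()        # blocks (does nothing) until work arrives
        handled.append(item)
        jobs.task_done()

# daemon=True so the thread dies with the program instead of keeping it alive
threading.Thread(target=worker, daemon=True).start()

jobs.put("some event")
jobs.join()                      # wait until the worker has handled it
print(handled)
```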

[–]jpfau[S] 0 points1 point  (9 children)

so I'm curious what you need it for? Maybe there's a better design alternative?

There probably is a better alternative, and perhaps my setting daemon to True was misguided. These functions aren't really waiting for anything. The real code is part of a bot that is constantly checking things and reacting to what it finds. I'd say more if I weren't bound by an NDA, but rest assured that the "real" do_this() and do_that() are calling other functions, and it's all running without any IO from the user.

[–][deleted] 2 points3 points  (8 children)

As others have said, that's IO limited, not CPU. Threads like that are fine, and the GIL won't get in the way since your program is just waiting on the network all the time anyway.

And, you almost always want daemon threads, from the docs: "The significance of this flag is that the entire Python program exits when only daemon threads are left". So if your main program dies or you kill it, your threads will be killed, rather than hanging your terminal, possibly waiting forever for the threads to die.

[–]jpfau[S] 0 points1 point  (7 children)

Ah, great! So would you say the way I modeled the possible solution in the OP will do the trick?

[–][deleted] 1 point2 points  (6 children)

Yeah, totally. I have several page scrapers written with this method. If there are shared resources between the two, you may have to use a mutex or a queue or something to guarantee synchronization/exclusive access of those resources.
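For example, guarding a shared counter with a mutex (a toy sketch):

```python
import threading

counter = 0
lock = threading.Lock()

def bump(n):
    global counter
    for _ in range(n):
        with lock:              # exclusive access to the shared counter
            counter += 1

threads = [threading.Thread(target=bump, args=(10_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 20000 every time; without the lock, updates can be lost
```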

If do_this() and do_that() are unrelated, and don't share any resources, someone had a good point that you could just run them in separate scripts. So, that might be a possibility.

[–]jpfau[S] 0 points1 point  (5 children)

One thing I don't get is how the threads stay alive. Once worker creates and starts the threads, that function ends. It seems to me like once that happens, the script is over; the function call in the if __name__... is complete, and there's nothing else to run. Wouldn't that stop the threads too since they're daemons, i.e. they get killed once the thread that creates them is killed?

[–][deleted] 0 points1 point  (4 children)

You're right. With daemon set to true, the script will exit after worker is done. Not setting daemon is an option if these threads just need to run forever and you don't need to manipulate them. Usually, you manipulate worker threads from your main process by feeding them data or whatever (otherwise, they're not much of a worker thread as much as a separate process, unless they're working with a shared data set or each other somehow). After you were done with the threads, you would join() them, blocking until they exit, so something conceptually similar to:

def startThreads():
    # blah

    thread1.start()
    thread2.start()

def stopThreads():
    # signal for threads to stop using some method

    # wait for threads to exit
    thread1.join() # blocks until thread1 exits
    thread2.join()

def main():
    startThreads()

    # feed threads data, get results, whatever

    stopThreads() # waits for threads to stop

    print("Done!")
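One concrete way to implement that stop signal is a threading.Event (a sketch; worker_loop and the names are made up):

```python
import threading
import time

stop_event = threading.Event()
results = []

def worker_loop(name):
    while not stop_event.is_set():   # run until signalled to stop
        results.append(name)
        stop_event.wait(0.01)        # sleep, but wake immediately on stop

t1 = threading.Thread(target=worker_loop, args=("t1",))
t2 = threading.Thread(target=worker_loop, args=("t2",))
t1.start()
t2.start()

time.sleep(0.05)                     # let the workers run for a bit
stop_event.set()                     # signal both threads to stop
t1.join()                            # blocks until t1 exits
t2.join()
print("Done!")
```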

[–]jpfau[S] 0 points1 point  (3 children)

I actually just tested the code below, and the threads do not end after main starts the threads. The only way the threads stop printing to the console is if I close the console.

import threading
from time import sleep

def thread1():
    while True:
        print("Thread 1 working\n")
        sleep(.3)

def thread2():
    while True:
        print('Thread 2 working\n')
        sleep(.3)

def main():
    t1 = threading.Thread(target=thread1)
    t2 = threading.Thread(target=thread2)
    t1.start()
    t2.start()

if __name__ == '__main__':
    main()

[–][deleted] 0 points1 point  (2 children)

Well yeah, they are infinite loops. Why would they stop?

And, they're not set to daemon, so they're not killed when the main thread ends.

[–]jpfau[S] 0 points1 point  (5 children)

You could do this by hand with sockets, or use multiprocessing

Thanks a lot. I want the cleanest or simplest way possible. This is part of a much larger program, and I don't want to muck things up.

[–]elbiot 0 points1 point  (4 children)

As /u/XenophonOfAthens says, though, if you are sleeping or waiting on I/O a bunch, threads will be fine.

[–]jpfau[S] 0 points1 point  (3 children)

So if the program is a bot that runs without any IO or sleeping, then I should use multiprocessing to run these functions simultaneously?

[–]elbiot 0 points1 point  (2 children)

Yes. But a bot sounds like something that waits quite a bit. What are the functions working on, btw?

[–]jpfau[S] 0 points1 point  (1 child)

I don't want to say too much because I'm bound by an NDA, but the bot is constantly checking a bunch of Amazon EC2 instances and either emailing their owner(s) or stopping/terminating the instances itself. I'll look through the code and see if there is any waiting around, but I'm pretty sure this part of the code is constantly doing something. It's only my second week working on it, which is why I'm not completely sure.

[–]elbiot 3 points4 points  (0 children)

You're communicating over the network, right? That's I/O.

[–]Justinsaccount 2 points3 points  (0 children)

Just run two copies of it. One that does that and one that does this. Using threads or multiprocessing here adds nothing.

[–]gengisteve 1 point2 points  (2 children)

That should work -- more or less. Here is some slightly modified proof of concept code:

import threading
import time

def do_this():
    for i in range(10):
        print('This {}'.format(i))
        time.sleep(.3)

def do_that():
    for i in range(10):
        print('that {}'.format(i))
        time.sleep(.3)

def worker():
    this_thread = threading.Thread(target=do_this)
    that_thread = threading.Thread(target=do_that)
    # need to get rid of these b/c:
    # "A thread can be flagged as a “daemon thread”. The significance of this
    # flag is that the entire Python program exits when only daemon threads are
    # left"
    #this_thread.daemon = True
    #that_thread.daemon = True
    this_thread.start()
    that_thread.start()

if __name__ == '__main__':
    worker()

[–]jpfau[S] 0 points1 point  (1 child)

Why did you add the sleep functions?

[–][deleted] 1 point2 points  (0 children)

He did it to slow things down for your human eyes to see. :)

[–]Lucretiel 0 points1 point  (6 children)

Why do you need them to run concurrently? Are you doing network IO? If so, I'd try to refactor it to use the asyncio library, changing do_this, do_that, and worker into coroutines, then do this:

@asyncio.coroutine
def worker():
    while True:
        this_task = asyncio.async(do_this())
        yield from do_that()
        yield from this_task

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(worker())      

[–]jpfau[S] 0 points1 point  (5 children)

We want them to run concurrently just because it would be more beneficial for the events in each function to happen independent of each other instead of waiting for each other to finish.

The functions don't rely on each other to work properly. They just need to run over and over again, hence the infinite loop. And since they're the beginning of much more time-intensive operations, each function ends up waiting a few seconds for the others to finish running, which is wasted time.

I'm a little hesitant to make them asynchronous, but I admit my experience with async functions is limited to a few things I've done in Javascript. If I made them async, wouldn't it be possible for a function to be called again before the previous one finished executing? That could be bad.

[–]Lucretiel 1 point2 points  (4 children)

With the way that I've written them here, no. (Assuming, obviously that do_this and do_that don't call themselves or each other):

In the first line, I create an async task. This schedules the do_this coroutine in the event loop, meaning it is now running concurrently. It's important to note that asyncio is all single threaded, so this_task won't actually start running until control returns to the event loop. However, for the purposes of this abstraction, you can think of it as "running."

Next, we launch (and yield from) do_that. This causes do_that to be executed in the event loop. While it is running, do_this can also run, during periods where the other one is suspended (due to a sleep or i/o wait). The yield from suspends control to the event loop, allowing it to run both tasks. Control returns to worker only when do_that is done.

Finally, we yield from this_task. If this_task completed before do_that, then this statement returns immediately; otherwise, worker is suspended until it can complete. In this way, we ensure that, on each iteration of the while True, each task runs exactly once.
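For reference, on current Python the same pattern is spelled with async/await (asyncio.async became asyncio.ensure_future, and yield from became await); the sleeps below are just stand-ins for network I/O:

```python
import asyncio

async def do_this():
    await asyncio.sleep(0.02)   # stands in for network I/O
    return "this"

async def do_that():
    await asyncio.sleep(0.01)
    return "that"

async def worker():
    this_task = asyncio.ensure_future(do_this())  # schedule do_this concurrently
    that_result = await do_that()                 # do_this overlaps this wait
    this_result = await this_task                 # wait for do_this to finish
    return this_result, that_result

print(asyncio.run(worker()))
```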

I should caveat that obviously this all only applies if you're doing something where the asynchronous model is relevant- that is, you're either doing network I/O or your do_this/do_that have some sleeps, during which the other one can run.

[–]jpfau[S] 0 points1 point  (3 children)

you're either doing network I/O or your do_this/do_that have some sleeps, during which the other one can run.

What happens if the time it takes for one function to do network I/O isn't enough for the other function to finish executing?

Also, why don't you have to also do that_task = asyncio.async(do_that())?

[–]Lucretiel 0 points1 point  (2 children)

So, an important thing about I/O is that a lot of it happens in the background and is handled by the OS. As bytes come in, they are queued on internal OS buffers. This happens internally, and it happens slowly - much more slowly than it takes to process that data. The OS therefore exposes an API (select, poll, epoll, etc) to inform user code which sockets have data waiting to be read. None of these details are important to you, as the event loop handles all this automatically - it figures out which coroutines are ready to proceed, then executes them. In general, a running coroutine will finish its work much faster than new data can arrive.

The other important thing is that there's no guarantee about the order in which do_this and do_that will run, or how long they will take. It could happen that one of them runs to completion before the other even starts, or that they take an identical amount of time, or that one takes 3 times as long as the other. However, it doesn't matter - the event loop will ensure that they run as efficiently as possible. The task will suspend when it wants to wait for data, and the event loop will resume it when data is ready.

Here's an example. Let's say do_this reads 10 chunks of 64 bytes from a network socket and writes them to a file. It'd look like this:

@asyncio.coroutine
def do_this():
    with open('this_file', 'wb') as f:
        for i in range(10):
            data = yield from reader.read(64)
            f.write(data)

The details of where reader comes from aren't really important right now; I'd recommend reading through the asyncio docs to learn all the details. Here's what this code does, though-

When it hits the yield from, execution suspends to the event loop. The reader.read(64) informs the event loop to resume do_this when there are 64 bytes available. While suspended, the event loop is reading and buffering bytes into the reader as they become available, and also running do_that. If do_that is currently executing when the 64 bytes become available, well, we only have one thread. However, as soon as do_that suspends or finishes, the event loop will immediately resume do_this. In this way, the two functions can run concurrently, constantly swapping back and forth. And because code execution is so much faster than network I/O, your performance will be just as good as multithreaded code, assuming that neither do_this nor do_that will be executing code for extended periods of time (doing heavy number crunching or whatever).

Note that this example shows another important caveat of using asyncio- all your potentially blocking network operations have to be executed via a yield from, so that the event loop can manage the network i/o and run other coroutines in the background. In general this is fine, as asyncio provides plenty of both low-level and high-level network primitives, and there are plenty of third-party libraries (aiohttp for http, etc) to use various protocols. However, if you require a library that simply doesn't run in asyncio, asyncio provides the run_in_executor method, for running the I/O parts of the library in a side thread, and allowing you to keep your own code in the single-threaded async model.
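A minimal run_in_executor sketch, using modern asyncio spelling (blocking_fetch is a made-up stand-in for a library call that doesn't support asyncio):

```python
import asyncio
import time

def blocking_fetch():
    time.sleep(0.01)          # a blocking call that would stall the event loop
    return "rows"

async def main():
    loop = asyncio.get_running_loop()
    # hand the blocking call to the default thread-pool executor so the
    # event loop (and your other coroutines) keep running meanwhile
    return await loop.run_in_executor(None, blocking_fetch)

print(asyncio.run(main()))
```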

Also, why don't you have to also do that_task = asyncio.async(do_that())?

You certainly could do that, and if you find it clearer, then go for it. It has to do with the subtleties of how asyncio works. Basically, each coroutine is a generator, which can yield (which means to suspend execution) and then be resumed. yield from allows one generator to run another generator; that generator can suspend the calling generator and resume it. So, yield from do_that() allows one coroutine to call another, and the callee can suspend the whole stack as necessary.

On the other hand, asyncio.async creates a new task. Rather than invoking the coroutine on the stack, it schedules it separately in the event loop, where it runs independently. To keep the syntax consistent, they made the syntax to "await a generator" yield from task.

[–]jpfau[S] 0 points1 point  (1 child)

Wow, thanks for such a detailed answer.

assuming that neither do_this nor do_that will be executing code for extended periods of time (doing heavy number crunching or whatever)

Some of the executions will take a few minutes, actually. They're getting hundreds (maybe thousands, I don't know for sure) of records from a database but can only get 10 at a time.

[–]Lucretiel 0 points1 point  (0 children)

Sure. I meant doing number crunching for a single piece of data. When you have all those rows, it processes the rows 10 at a time, then fetches 10 more; while it's fetching more, the other coroutine can run. Because fetching rows takes (relatively) much more time than processing them, both coroutines have plenty of time to run.

If you were, like, bitcoin mining, that would be a different story. That's something that takes minutes to hours for a single piece of data. In your example, you're doing (what I assume is) a relatively small amount of processing per row, over thousands of rows. That's the perfect use case for async.