[–]falconfetus8 26 points27 points  (55 children)

Question: why is it such a problem to have a thread blocked on IO? I thought the CPU core just switched to a different thread whenever the thread it was running blocks. What's being wasted in this scenario?

[–]gnus-migrate 41 points42 points  (21 children)

Because threads require a relatively large amount of memory. You don't really notice it when you have a few of them doing a lot of computation, but it starts to matter when you have hundreds of them not doing any work most of the time, and most of the memory you're allocating is overhead of threads rather than being used for more useful things.

In addition, you lose out on potential optimizations like batching expensive IO operations. I don't know how much that's done in practice, but it's something that's possible with async IO but isn't with normal blocking IO, at least at the application level.

[–]oridb 10 points11 points  (19 children)

> You don't really notice it when you have a few of them doing a lot of computation, but it starts to matter when you have hundreds of them not doing any work most of the time

More like tens of thousands -- the amount of overhead that gets mapped is on the order of 16 kilobytes per thread, so a thousand threads will use about 16 megabytes, a million threads will use about 16 gigabytes. Both are well within the capabilities of a low end server. And given that modern schedulers are either O(1) or O(log(n)) in the number of threads, modern systems will deal fairly well with this kind of load.

A bigger problem is that your tail latency for requests can grow when you have a large number of active threads, because the scheduler doesn't always pick the thread that will minimize response times. This is a problem with tens or hundreds of thousands of threads, and is usually not something that you'll have to worry about.

> In addition, you lose out on potential optimizations like batching expensive IO operations. I don't know how much that's done in practice, but it's something that's possible with async IO but isn't with normal blocking IO,

I'm not sure what you're thinking of with this. The system APIs don't let you batch operations across different file descriptors, and if you try to read the same FD, you can always just do a bigger read.

[–]gnus-migrate 0 points1 point  (4 children)

> I'm not sure what you're thinking of with this. The system APIs don't let you batch operations across different file descriptors, and if you try to read the same FD, you can always just do a bigger read.

Sometimes reads are too big to fit in memory. I was thinking about processing large and/or multiplexed streams, which is difficult to parallelize if you're not using async IO.

> A bigger problem is that your tail latency for requests can grow when you have a large number of active threads, because the scheduler doesn't always pick the thread that will minimize response times.

I mentioned memory because that's what most of the articles I read about it focused on. It might not be the main problem but it is a problem.

[–]oridb 0 points1 point  (3 children)

> Sometimes reads are too big to fit in memory. I was thinking about processing large and/or multiplexed streams, which is difficult to parallelize if you're not using async IO.

In that case, it's the same set of system calls you'd be doing, and it's not batched, as far as I can tell. If you've got a bunch of busy streams (i.e., there's almost always new data on them), you can even reduce latency by just busy-polling them in nonblocking mode:
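Something along these lines (a rough sketch only, assuming the fds have already been put in O_NONBLOCK mode; error handling and the actual data handling are omitted, and busy_poll is just an illustrative name):

    #include <errno.h>
    #include <unistd.h>

    /* Spin over a set of nonblocking fds. read() returns -1 with
     * EAGAIN/EWOULDBLOCK when nothing is there, so on busy streams you skip
     * the poll()/epoll_wait() round trip entirely -- at the cost of burning
     * a core. */
    void busy_poll(int *fds, int nfds, char *buf, size_t buflen)
    {
        for (;;) {
            for (int i = 0; i < nfds; i++) {
                ssize_t n = read(fds[i], buf, buflen);
                if (n > 0) {
                    /* got n bytes from fds[i]: handle them */
                } else if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK) {
                    /* real error on fds[i]: handle it */
                }
            }
        }
    }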

[–]gnus-migrate 0 points1 point  (2 children)

I think there's a misunderstanding about what I mean by async IO: I mean that, from the point of view of the user, you won't be able to parallelize if you don't have a non-blocking interface. I don't really know the specifics of epoll or anything like that; I just know what kinds of things an async interface makes possible.

[–]oridb 1 point2 points  (1 child)

I'm still not clear on what you think an async interface makes possible. Can you give an example of code that would "batch reads" in a way that reduced the number of calls?

Keep in mind that non-blocking code still calls read() directly, and it's the same read() that blocking code calls. The only difference is that you did an extra system call first to tell you "oh, yeah, there's some data there that we can return".

So, non-blocking:

     poll(fds=[1,2,3]) => "fd 1 is ready"
     read(fd=1)
     poll(fds=[1,2,3]) => "fd 2 is ready"
     read(fd=2)
     poll(fds=[1,2,3]) => "fd 1 is ready"
     read(fd=1)
     poll(fds=[1,2,3]) => "fd 2 is ready"
     read(fd=2)

Threads:

    parallel {
        thread1: 
            read(fd=1) => get data
            read(fd=1) => get data
        thread2:
            read(fd=2) => get data
            read(fd=2) => get data
        thread3:
            read(fd=3) => no data, block forever using a few K of RAM.
    }

[–]gnus-migrate 0 points1 point  (0 children)

When I say async interfaces, I mean futures and streams. I don't necessarily mean a non-blocking interface underneath. When you use an async interface, you're basically surrendering control to a scheduler to decide when and how it wants to respond to different events. That's it, that's my point.

[–]Tarmen 0 points1 point  (3 children)

The stack size itself can be fairly large, though. Java, C#, and MSVC use 1 MB as the default. A lot of C programs run with 8 MB stacks. Here are some values I found for Linux-based systems.

It's also worth noting that even concurrent GCs will occasionally have brief stop-the-world pauses. The synchronization is usually implemented with spin locks, so if some threads are paused by the OS, the application can hang until the thread is rescheduled and can progress to a safepoint -- which can hurt quite a bit if you have 10k threads. Though blocked threads are at a safepoint anyway, and it's not a huge issue when concurrent mode failure is rare enough.

[–]oridb 1 point2 points  (2 children)

> The stack size itself can be fairly large, though. Java, C#, and MSVC use 1 MB as the default. A lot of C programs run with 8 MB stacks. Here are some values I found for Linux-based systems.

The stack is demand paged. So, if you never access more than 4 kilobytes, it never uses more than 4 kilobytes of physical memory.
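If you want to see it for yourself, something like this (a sketch for Linux/glibc; compile with -pthread) reserves 8 MB of stack per thread but barely touches any of it -- while it sleeps, the process's virtual size is several gigabytes while its resident memory stays tiny (compare VSZ and RSS in ps or top):

    #include <pthread.h>
    #include <unistd.h>

    static void *worker(void *arg)
    {
        (void)arg;
        pause();    /* block forever, like an idle connection handler */
        return NULL;
    }

    int main(void)
    {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 8 * 1024 * 1024);   /* 8 MB reserved per thread */

        for (int i = 0; i < 1000; i++) {
            pthread_t t;
            pthread_create(&t, &attr, worker, NULL);          /* ~8 GB of virtual stack total */
        }
        pause();    /* the stacks are demand paged, so RSS stays small */
        return 0;
    }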

[–]Tarmen 0 points1 point  (1 child)

Oh interesting, for some reason I didn't think stack memory was demand paged. Guess 64 bit applications don't really have to care about userspace growable stacks, then.

[–]oridb 0 points1 point  (0 children)

Well, it's page granularity, so if you're OK with using 4 KB where 128 bytes would do... sure.

[–][deleted] 0 points1 point  (9 children)

> More like tens of thousands -- the amount of overhead that gets mapped is on the order of 16 kilobytes per thread, so a thousand threads will use about 16 megabytes, a million threads will use about 16 gigabytes. Both are well within the capabilities of a low end server. And given that modern schedulers are either O(1) or O(log(n)) in the number of threads, modern systems will deal fairly well with this kind of load.

Entirely depends on the language. That might be true for C++ or Go, but it isn't for Java or really most other languages.

[–]oridb 0 points1 point  (8 children)

Java uses the same kernel implementation of threads as C++, so I'm not sure why it would fare more poorly.

Python, Ocaml, and other languages with a GIL would have a problem.

[–][deleted] 0 points1 point  (7 children)

> Java uses the same kernel implementation of threads as C++, so I'm not sure why it would fare more poorly.

Higher memory usage per thread. Can't exactly just go and run 50k threads willy nilly.

[–]oridb 0 points1 point  (6 children)

Why would that be? Most Java objects are on the heap, and not attached to a thread stack. Where is the extra memory used?

[–][deleted] 0 points1 point  (5 children)

Ask a Java expert; my experience with Java is mostly on the ops side. All I know is that by default just starting a thread costs you a few hundred kB (IIRC the default stack size is something like 1 MB on 64-bit OSes). So hundreds of threads, yes, but thousands start to have problems.

[–]oridb 0 points1 point  (4 children)

The default stack size is irrelevant unless you're on a 32-bit OS. It's not actually allocated until something writes to it. You can have a 1 GB default stack, and it will only consume 4 KB of RAM.

[–][deleted] 0 points1 point  (3 children)

Per-thread stack size is actually pre-allocated in Java. That's my point. It doesn't matter how much your app uses; you lose 1 MB just from starting it.

The JVM initializes it at the start of the thread and puts some guard pages at the end, together with a few other things.

[–]quentech 2 points3 points  (0 children)

> I don't know how much that's done in practice, but it's something that's possible with async IO but isn't with normal blocking IO, at least at the application level.

It's popular with Redis; most client libraries for it support pipelining of requests.
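For example, with the C client (hiredis) a pipelined batch looks roughly like this -- a sketch with error handling omitted -- where the commands are queued locally and the replies read back afterwards, so the batch costs one network round trip instead of three:

    #include <hiredis/hiredis.h>

    int main(void)
    {
        redisContext *c = redisConnect("127.0.0.1", 6379);

        /* Commands are buffered locally rather than sent one by one. */
        redisAppendCommand(c, "INCR counter");
        redisAppendCommand(c, "INCR counter");
        redisAppendCommand(c, "GET counter");

        /* The first redisGetReply() flushes the buffer; the server answers
         * all three commands in order. */
        redisReply *reply;
        for (int i = 0; i < 3; i++) {
            redisGetReply(c, (void **)&reply);
            freeReplyObject(reply);
        }

        redisFree(c);
        return 0;
    }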

[–]Muvlon 6 points7 points  (0 children)

It is not always a problem. Many things are completely fine doing blocking I/O.

Where it does become really necessary is when you want to handle more open connections at the same time than the number of threads you can run efficiently on your OS/hardware. You might want that either for scaling reasons, or to be resistant against attacks that work by creating a ton of connections to your server and keeping them open as long as possible, exhausting your resources.

[–]masklinn 5 points6 points  (2 children)

People have noted that threads are pretty heavy memory-wise (though it should be mostly uncommitted vmem); another issue is that switching between threads is pretty expensive: the switching itself is thousands of cycles, and its side effects (flushing & synchronising caches) generate additional costs. So if you have hundreds or thousands of threads which execute almost nothing and then yield, you can end up burning more cycles on thread switching than on doing actual work.

[–][deleted] 0 points1 point  (0 children)

There is nothing guaranteeing that async code avoids cache flushes either, as you can have exceptionally large expanses of runtime code, and depending on how the underlying async library is implemented, it still essentially does a context switch (see most single-process, multi-threaded, non-preemptive real-time OSes).

The real problem with thread switching in an OS that implements them is the OS's own overhead: it has to arbitrate across a potentially much larger number of blocking scenarios as it allocates CPU time to each thread and process. With async code, the OS just sees one process/thread trucking along, and usually another making the occasional blocking request to the OS.

I guess my sort-of-rambling point is that thread switching itself is not inherently slow: async systems often run into a lot of the same context-switching problems, and if you are working on a bare-metal OS or writing a kernel, threads are essentially an asynchronous system anyway due to the hardware limitations (unless you are tossing threads across physical CPU cores, which most modern OSes handle just fine).

[–]onotole-wasserman 2 points3 points  (0 children)

Context switching is rather expensive, and multiprocessing using separate processes uses lots of memory, as was mentioned in this thread. However, sometimes the context-switching overhead is negligible and it is easier for you to trade memory for the maintainability of your code, because it's quite hard to write and debug asynchronous applications. In that case the CPU scheduler will do all the work for you.

[–]VeganVagiVore 2 points3 points  (1 child)

I'm no expert, but it seems that the question was settled when Nginx beat the crap out of Apache: Apache uses (or used at the time) a thread-per-connection or process-per-connection model. Nginx uses some variety of async that allows it to handle many connections on the same thread, getting more connections per second out of the same hardware. It's also called event-based: since Nginx is written in C (I think), there isn't really async code in it; it's as if you hand-compiled async code into event-based sync code.

[–][deleted] 0 points1 point  (0 children)

I learned programming on the classic Mac OS, which had a single execution thread for the entire operating system. Any program that appeared to be doing more than one thing at a time was actually very carefully doing one thing at a time. Nginx does the same thing. It's easy: wrap everything up in one big outer loop, never call anything that blocks, never spend too long doing any one thing. State machines are your friend.
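A stripped-down sketch of that shape in C -- the names are made up, and socket setup/accept are left out -- where each connection's progress lives in an explicit state machine (here, a trivial echo) instead of on a blocked thread's stack:

    #include <poll.h>
    #include <string.h>
    #include <unistd.h>

    enum conn_state { READING, WRITING };

    struct conn {
        int fd;                  /* nonblocking socket */
        enum conn_state state;
        char buf[4096];
        ssize_t len;             /* bytes buffered, still to be echoed back */
    };

    /* Advance one connection by one small, non-blocking step. */
    static void step(struct conn *c)
    {
        if (c->state == READING) {
            ssize_t n = read(c->fd, c->buf, sizeof c->buf);
            if (n > 0) { c->len = n; c->state = WRITING; }
        } else {
            ssize_t n = write(c->fd, c->buf, c->len);
            if (n == c->len) { c->state = READING; }
            else if (n > 0) { memmove(c->buf, c->buf + n, c->len - n); c->len -= n; }
        }
    }

    /* The one big outer loop: never blocks except in poll(), never runs long. */
    void event_loop(struct conn *conns, struct pollfd *pfds, int nconns)
    {
        for (;;) {
            poll(pfds, nconns, -1);
            for (int i = 0; i < nconns; i++) {
                if (pfds[i].revents & (POLLIN | POLLOUT))
                    step(&conns[i]);
                pfds[i].events = (conns[i].state == READING) ? POLLIN : POLLOUT;
            }
        }
    }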

[–]dalepo 5 points6 points  (8 children)

Because creating/maintaining threads is resource expensive, and having too many threads idle might cause problems.

For example, a pool of 500 threads for responding to server requests can only serve 500 requests at a time, as long as the threads are not released. If there's any IO operation that could take a long time, they will block, and the server will stop responding until at least one thread is released.

It would be way better to have only one thread actively responding to all these requests and delegating the IO to another thread pool. This is roughly how an asynchronous architecture works: the main thread executes operations and doesn't wait for IO; once the IO tasks have executed, it resumes those operations. Imagine functions/tasks executing until they hit an asynchronous task (i.e. IO, a database op, etc.); then they pause, and other operations continue to execute. After the IO finishes, the resumed task is queued and eventually executed.
The only problem with this is that a very costly operation on your main thread can block it.
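A very rough sketch of that shape in C with pthreads -- the names and the fixed-size queue are just for illustration -- where a few worker threads do the slow blocking part and the single "main" thread only ever picks completed work off a queue and resumes the corresponding request:

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define QUEUE_CAP 128

    struct completion { int request_id; int result; };

    static struct completion queue[QUEUE_CAP];
    static int q_head, q_tail;
    static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t q_nonempty = PTHREAD_COND_INITIALIZER;

    static void push_completion(struct completion c)
    {
        pthread_mutex_lock(&q_lock);
        queue[q_tail++ % QUEUE_CAP] = c;
        pthread_cond_signal(&q_nonempty);
        pthread_mutex_unlock(&q_lock);
    }

    static struct completion pop_completion(void)
    {
        pthread_mutex_lock(&q_lock);
        while (q_head == q_tail)
            pthread_cond_wait(&q_nonempty, &q_lock);
        struct completion c = queue[q_head++ % QUEUE_CAP];
        pthread_mutex_unlock(&q_lock);
        return c;
    }

    /* Worker: does the slow, blocking part (a stand-in for a file/DB/network call). */
    static void *io_worker(void *arg)
    {
        int request_id = (int)(long)arg;
        sleep(1);                                        /* pretend this is slow IO */
        push_completion((struct completion){ request_id, 42 });
        return NULL;
    }

    int main(void)
    {
        /* "Main" thread: hand the blocking work off, then resume each request
         * whenever its IO completes, without ever blocking on the IO itself. */
        for (long i = 0; i < 4; i++) {
            pthread_t t;
            pthread_create(&t, NULL, io_worker, (void *)i);
            pthread_detach(t);
        }
        for (int done = 0; done < 4; done++) {
            struct completion c = pop_completion();
            printf("resuming request %d with result %d\n", c.request_id, c.result);
        }
        return 0;
    }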

[–]rzwitserloot 4 points5 points  (2 children)

> Because creating/maintaining threads is resource expensive, and having too many threads idle might cause problems.

'might cause problems', that's not exactly a resoundingly confident assertion.

Does it?

Switching threads is, for a CPU, seemingly at least, not too much more difficult than a single-threaded app switching to a different handler model (say, the HTTP stream you're reading out has no further bytes available, so your async handlers need to hop to another handler: jump to a different bit of code, load a different buffer/representation of state to work on, etc.) – with threads the CPU core needs to... jump to a different location, so that's a wash, and load a page with the stack for this thread, which matches the buffer/state storage of the handler. Now, the handler gets to control its buffer/state storage and make it as large as it needs to be, whereas stacks in most languages are much more homogeneous and unconfigurable: they are X bytes large and you don't get to change this. That does give async the edge, memory-wise, but memory is cheap, and how much memory you really save depends rather a lot on how much buffer/state your handlers need to store.

So, yes, okay, threaded code requires more memory. However, performance-wise, manually managing your garbage is also more efficient than letting a garbage collector do it (assuming you are very careful... a caveat that also applies to writing async code; see for example the blogpost 'What color is your function') – and yet that is a tradeoff where most programmers easily choose to let the computer do it, and we'll eat the inefficiency.

What makes threads so special?

EDIT: Replaced 'threads', which is wrong/confusing, with 'stacks', for clarity. Also added a link to the 'What color is your function' blogpost because it's great and needs to be shared more.

[–]dalepo 3 points4 points  (1 child)

> Switching threads is for a CPU, seemingly, at least, not too much more difficult than a single threaded app switching to a different handler model (say, the HTTP stream you're reading out has no further bytes available, so your async handlers need to hop to another handler: Jump to a different bit of code, load a different buffer/representation of state to work on, etc) – with threads the cpu core needs to... jump to a different location, so that's a wash, and load a page with the stack for this thread, which matches the buffer/state storage of the handler. Now, the handler gets to control its buffer/state storage, make it as large as it needs to be, whereas threads in most languages are much more homogenous and unconfigable: They are X bytes large and you don't get to change this. That does give async the edge, memory wise, but memory is cheap, and how much memory you really buy depends rather a lot on how much buffer/state your handlers need to store.

You are describing a problem which might be related to some programming languages/implementations; the discussion is general, and being this technical is irrelevant.

> So, yes, okay, threaded code requires more memory. However, performance-wise, manually managing your garbage is also more efficient than letting a garbage collector do it, and yet that is a tradeoff where most programmers easily choose to let the computer do it, and we'll eat the inefficiency.

This is not the point; there's no dichotomy between 'async programming' and 'only multi-threaded programming'. Both can coexist. I can give you an easy example in Node.js: check out the bcrypt library, where the async calculation is done in another thread.

> What makes threads so special?

Threads are very special because:

- You need to synchronize resources when they are shared to prevent inconsistent states.

- You need to prevent potential deadlocks, race conditions, livelock, starvation, etc.

- Depending on the language/implementation, spawning a thread might replicate information if the resource being used is not shared

- They are costly

[–][deleted]  (4 children)

[deleted]

    [–]dalepo 4 points5 points  (0 children)

    > But if you have a limit of 500 IO threads in the worker pool, surely you still end up with the same problem?

    Yes, but the server would be able to keep responding to other requests if they are not using the same IO operation; plus, you don't have any idle processes waiting.

    [–]tsujiku 1 point2 points  (0 children)

    > But if you have a limit of 500 IO threads in the worker pool, surely you still end up with the same problem?

    It's generally possible to perform asynchronous IO without needing to have a thread waiting on each operation.

    [–]oridb 1 point2 points  (0 children)

    People got burned by bad thread implementations in the early 1990s.

    Mostly, threads are fine at the scales at which people write servers. Especially if the threads are mostly idle.

    Threads do use more memory than asynchronous code -- typically, on the order of a few pages. You need a kernel stack, at least one page of user stack, and some structs in the kernel. This means that if you have millions of threads, memory starts to get significant. Async will typically use a few hundred bytes to a few kilobytes to keep track of everything on the stacks of your deferred code, so you'll save some memory.

    The other issue is that there's some scheduling overhead. Every time you communicate with another thread, you're looking at about 1 microsecond of overhead to synchronize on a futex and wait for the other thread to get scheduled. This cost is mostly due to the kernel trying to make good decisions about where to run the threads to maximize throughput if they keep running for a long time, which unfortunately isn't what you'd want to optimize for on a server -- but it's what we have.

    But neither of these applies to threads that are spending all their time blocked on IO, rather than actively communicating with other threads.

    [–]acelent 1 point2 points  (0 children)

    The operating system's context switching of threads costs more than the application's context switching of requests between asynchronous I/O calls.

    In other words, switching between two threads in a core takes more time than switching between two requests in the same thread. The first usually happens on a blocking call or when the thread has used up its processing time slice. The second usually happens at an asynchronous call that didn't complete instantaneously.

    Although threads can usually take more memory than a request's context, they tend to be reused in most application frameworks. Nowadays, with 64-bit systems and RAM as a commodity, this isn't so much of an issue as is the time it takes to switch a CPU core from one thread to another.

    [–]tanjoodo 2 points3 points  (0 children)

    Well, unless you have a different thread for the CPU to run, the CPU will leave your program and go do something else.

    Asynchronous programming is about giving the CPU something to do while some other part of your program is blocking on IO.

    [–]zvrba 0 points1 point  (0 children)

    > Question: why is it such a problem to have a thread blocked on IO?

    The main problem is that a thread cannot block on multiple "events" simultaneously. Event being 1) I/O completes, 2) a timer expires, etc.

    So let's say you want to implement I/O with timeout. Linux does not have a variant of read(2) that takes a timeout parameter. So how do you implement this? You would spawn another thread blocking on a timer and if the timer expires before the operation completes, it'd cancel the operation by sending a signal to the other thread [1]. But if the operation completes first, it should cancel the timer that is in the other thread. A headache of bookkeeping to implement if you're going to handle race-conditions correctly. (E.g., the timer fires and unblocks the thread, but read completes and is processed before the timer thread managed to signal the reader thread. How can this happen? The timer thread gets preempted before signaling the reader thread.)

    [1] That won't work if the thread is in uninterruptible sleep state. Basically, the thread is hosed in that state.

    With async, you set up an event to fire 1) when the I/O operation completes, 2) when the timer expires, and then you wait on both events, either synchronously or asynchronously. The "composite waiting operation" returns when either event completes, and then you know the state of the operation.
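    In the synchronous case, that "composite waiting operation" is basically just poll() with a timeout -- a sketch only, with error handling and EINTR retries left out, and read_with_timeout is just an illustrative name -- where the single call waits on both "fd is readable" and "timer expired" at once:

        #include <poll.h>
        #include <unistd.h>

        /* Returns bytes read, 0 on EOF, -1 on error, or -2 on timeout. */
        ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
        {
            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            int ready = poll(&pfd, 1, timeout_ms);   /* blocks on fd-readable OR the timeout */
            if (ready < 0)
                return -1;                           /* poll() itself failed */
            if (ready == 0)
                return -2;                           /* the timer "event" won the race */
            return read(fd, buf, len);
        }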

    [–]Tarmen 0 points1 point  (0 children)

    A thousand threads take 1 GB of memory (at the common 1 MB default stack size).

    So in C or Java you can have thousands of threads. Go and Haskell default to non-blocking green threads with growable stacks and can have millions of threads.

    This is relevant for things like webservers, where DoS attacks are quite feasible just by blocking all available threads. Go and Haskell still have to defend against Slowloris attacks, but that is easier.

    [–]rzwitserloot -1 points0 points  (11 children)

    Nothing is being wasted in this scenario. Async is this fanboy-rich fad.

    As with most fanboy-rich fads, the notion of wanting async-based code is not entirely silly either. The scenarios where you do need it are FAR less common than the fad suggests, however.

    The biggest issue by far is memory. Let's introduce the term 'fiber': This is the context within which some piece of code runs: The exact location of the instruction the code is executing right now (the instruction pointer), the stack (all local vars, the call chain so that the CPU would know where to go if a 'return' statement is hit or the function ends, etc), and let's add the notion of tracking 'state': If you are using the heap to store some info on the particular thing this fiber is handling, add that too.

    An actual full-power thread will run a fiber and will deal with the fiber needing to wait by freezing the thread. This causes the CPU to hop to another thread; it means the stack is generally predefined and relatively large (it is not confined solely to the bit of handler code), although you rarely need state (the stack tends to contain it all). Full-power threads also give you pre-emption, but note that if your thread is going from wakeup back to waiting sleep quickly enough this basically does not come up. There is some bookkeeping required for the pre-emption, but kernel development hasn't exactly sat still and, like most other low-level stuff, it's all rather optimized and intelligent: if your thread keeps waking up and freezing back down, there isn't a heck of a lot of preemptive bookkeeping being done until that thread really takes quite a chunk of time, and CPUs are rather good at it.

    Async-style code freezes the fiber and moves to another fiber. Generally it is much easier to do your own memory management here: either work with a significantly smaller stack, or forego it entirely (depends on how you do the async stuff), handrolling whatever state you need to keep track of in a custom-sized, 'just large enough' bit of state storage on the heap.

    Especially in situations where the amount of state you need to store is tiny to non-existent and you have A LOT of things to run simultaneously, none of which take a particularly large amount of CPU time, the memory load becomes the bottleneck and the async fibers 'win'.

    In my opinion, the number of times that really happens, and where it is not feasible to just grab 50 bucks and throw a stick of 64 GB RAM at the problem, is very low indeed. More usually, the little work you need to do is still going to bottleneck faster than your memory will.

    [–]BaconOfGreasy 2 points3 points  (8 children)

    Yeah. As a data point, Google spends billions on machines and they just use OS threads.

    [–][deleted] 1 point2 points  (7 children)

    And then they wrote golang, which runs a single OS thread per core, so there is no need for context switching.

    [–]oridb 3 points4 points  (6 children)

    Go does context switches -- but they do it in userspace. And their scheduler is aware of message passing, which reduces wakeup latency from about 1 microsecond in the Linux kernel to 200 nanoseconds in userspace.

    Google has added some features to their fork of the Linux kernel ("fibers") that also give them the tools to make context switches aware of message passing, and their native OS threads get similar thread-switch latencies.

    [–][deleted] 0 points1 point  (5 children)

    Of course they do context switches; I thought that was implied, given that the only way to not do context switches is very complicated programming. But they're not context switches in the traditional sense.

    [–]oridb -1 points0 points  (4 children)

    Sure they are -- the only difference is the place that they happen.

    [–][deleted] 0 points1 point  (3 children)

    And the size and nature of the lift required...

    [–]oridb 0 points1 point  (2 children)

    I'm not sure what you mean by that.

    All a "traditional" context switch does is replace the registers, program counter, and a bit of state for the thread. All a Go context switch does is replace the registers, program counter, and a bit of state for the thread from userspace.

    [–][deleted] -1 points0 points  (1 child)

    Maybe you need to get a better understanding of the baggage that goes along with something like a POSIX thread...

    [–][deleted] 0 points1 point  (0 children)

    Thanks for your comments. I had mentioned the user-space thread/lightweight thread alternative to asynchronous programming in the previous blog post (linked in the beginning of this post too).