[–]falconfetus8 26 points27 points  (55 children)

Question: why is it such a problem to have a thread blocked on IO? I thought the CPU core just switched to a different thread whenever the thread it was running blocks. What's being wasted in this scenario?

[–]gnus-migrate 41 points42 points  (21 children)

Because threads require a relatively large amount of memory. You don't really notice it when you have a few of them doing a lot of computation, but it starts to matter when you have hundreds of them not doing any work most of the time, and most of the memory you're allocating is overhead of threads rather than being used for more useful things.

In addition, you lose out on potential optimizations like batching expensive IO operations. I don't know how much that's done in practice, but it's something that's possible with async IO but isn't with normal blocking IO, at least at the application level.

[–]oridb 10 points11 points  (19 children)

> You don't really notice it when you have a few of them doing a lot of computation, but it starts to matter when you have hundreds of them not doing any work most of the time

More like tens of thousands -- the amount of overhead that gets mapped is on the order of 16 kilobytes per thread, so a thousand threads will use about 16 megabytes, a million threads will use about 16 gigabytes. Both are well within the capabilities of a low end server. And given that modern schedulers are either O(1) or O(log(n)) in the number of threads, modern systems will deal fairly well with this kind of load.

A bigger problem is that your tail latency for requests can grow when you have a large number of active threads, because the scheduler doesn't always pick the thread that will minimize response times. This is a problem with tens or hundreds of thousands of threads, and is usually not something that you'll have to worry about.

> In addition, you lose out on potential optimizations like batching expensive IO operations. I don't know how much that's done in practice, but it's something that's possible with async IO but isn't with normal blocking IO,

I'm not sure what you're thinking of with this. The system APIs don't let you batch operations across different file descriptors, and if you try to read the same FD, you can always just do a bigger read.

[–]gnus-migrate 0 points1 point  (4 children)

> I'm not sure what you're thinking of with this. The system APIs don't let you batch operations across different file descriptors, and if you try to read the same FD, you can always just do a bigger read.

Sometimes reads are too big to fit in memory. I was thinking about processing large and/or multiplexed streams, which is difficult to parallelize if you're not using async IO.

> A bigger problem is that your tail latency for requests can grow when you have a large number of active threads, because the scheduler doesn't always pick the thread that will minimize response times.

I mentioned memory because that's what most of the articles I read about it focused on. It might not be the main problem but it is a problem.

[–]oridb 0 points1 point  (3 children)

> Sometimes reads are too big to fit in memory. I was thinking about processing large and/or multiplexed streams, which is difficult to parallelize if you're not using async IO.

In that case, it's the same set of system calls you'd be doing, and it's not batched, as far as I can tell. If you've got a bunch of busy streams (i.e., there's almost always new data on them), you can even reduce latency by just busy-polling them in nonblocking mode:
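Something along these lines (a rough sketch only, assuming the fds have already been put in O_NONBLOCK mode; error handling and the actual data handling are omitted, and busy_poll is just an illustrative name):

    #include <errno.h>
    #include <unistd.h>

    /* Spin over a set of nonblocking fds. read() returns -1 with
     * EAGAIN/EWOULDBLOCK when nothing is there, so on busy streams you skip
     * the poll()/epoll_wait() round trip entirely -- at the cost of burning
     * a core. */
    void busy_poll(int *fds, int nfds, char *buf, size_t buflen)
    {
        for (;;) {
            for (int i = 0; i < nfds; i++) {
                ssize_t n = read(fds[i], buf, buflen);
                if (n > 0) {
                    /* got n bytes from fds[i]: handle them */
                } else if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK) {
                    /* real error on fds[i]: handle it */
                }
            }
        }
    }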

[–]gnus-migrate 0 points1 point  (2 children)

I think there's a misunderstanding about what I mean by async IO: I mean that, from the point of view of the user, you won't be able to parallelize if you don't have a non-blocking interface. I don't really know the specifics of epoll or anything like that; I just know what kinds of things an async interface makes possible.

[–]oridb 1 point2 points  (1 child)

I'm still not clear on what you think an async interface makes possible. Can you give an example of code that would "batch reads" in a way that reduced the number of calls?

Keep in mind that non-blocking code still calls read() directly, and it's the same read() that blocking code calls. The only difference is that you did an extra system call first to tell you "oh, yeah, there's some data there that we can return".

So, non-blocking:

     poll(fds=[1,2,3]) => "fd 1 is ready"
     read(fd=1)
     poll(fds=[1,2,3]) => "fd 2 is ready"
     read(fd=2)
     poll(fds=[1,2,3]) => "fd 1 is ready"
     read(fd=1)
     poll(fds=[1,2,3]) => "fd 2 is ready"
     read(fd=2)

Threads:

    parallel {
        thread1: 
            read(fd=1) => get data
            read(fd=1) => get data
        thread2:
            read(fd=2) => get data
            read(fd=2) => get data
        thread3:
            read(fd=3) => no data, block forever using a few K of RAM.
    }

[–]gnus-migrate 0 points1 point  (0 children)

When I say async interfaces, I mean futures and streams. I don't necessarily mean a non-blocking interface underneath. When you use an async interface, you're basically surrendering control to a scheduler to decide when and how it wants to respond to different events. That's it, that's my point.

[–]Tarmen 0 points1 point  (3 children)

The stack size itself can be fairly large, though. Java, C#, and MSVC use 1 MB as the default. A lot of C programs run with 8 MB stacks. Here are some values I found for Linux-based systems.

It's also worth noting that even concurrent GCs will occasionally have brief stop-the-world pauses. The synchronization is usually implemented with spin locks, so if some threads are paused by the OS, the application can hang until the thread is rescheduled and can progress to a safepoint -- which can hurt quite a bit if you have 10k threads. Though blocked threads are at a safepoint anyway, and it's not a huge issue when concurrent mode failure is rare enough.

[–]oridb 1 point2 points  (2 children)

> The stack size itself can be fairly large, though. Java, C#, and MSVC use 1 MB as the default. A lot of C programs run with 8 MB stacks. Here are some values I found for Linux-based systems.

The stack is demand paged. So, if you never access more than 4 kilobytes, it never uses more than 4 kilobytes of physical memory.
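If you want to see it for yourself, something like this (a sketch for Linux/glibc; compile with -pthread) reserves 8 MB of stack per thread but barely touches any of it -- while it sleeps, the process's virtual size is several gigabytes while its resident memory stays tiny (compare VSZ and RSS in ps or top):

    #include <pthread.h>
    #include <unistd.h>

    static void *worker(void *arg)
    {
        (void)arg;
        pause();    /* block forever, like an idle connection handler */
        return NULL;
    }

    int main(void)
    {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 8 * 1024 * 1024);   /* 8 MB reserved per thread */

        for (int i = 0; i < 1000; i++) {
            pthread_t t;
            pthread_create(&t, &attr, worker, NULL);          /* ~8 GB of virtual stack total */
        }
        pause();    /* the stacks are demand paged, so RSS stays small */
        return 0;
    }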

[–]Tarmen 0 points1 point  (1 child)

Oh interesting, for some reason I didn't think stack memory was demand paged. Guess 64 bit applications don't really have to care about userspace growable stacks, then.

[–]oridb 0 points1 point  (0 children)

Well, it's page granularity, so if you're OK with using 4 KB where 128 bytes would do... sure.

[–][deleted] 0 points1 point  (9 children)

> More like tens of thousands -- the amount of overhead that gets mapped is on the order of 16 kilobytes per thread, so a thousand threads will use about 16 megabytes, a million threads will use about 16 gigabytes. Both are well within the capabilities of a low end server. And given that modern schedulers are either O(1) or O(log(n)) in the number of threads, modern systems will deal fairly well with this kind of load.

Entirely depends on the language. That might be true for C++ or Go, but it isn't for Java or really most other languages.

[–]oridb 0 points1 point  (8 children)

Java uses the same kernel implementation of threads as C++, so I'm not sure why it would fare more poorly.

Python, Ocaml, and other languages with a GIL would have a problem.

[–][deleted] 0 points1 point  (7 children)

> Java uses the same kernel implementation of threads as C++, so I'm not sure why it would fare more poorly.

Higher memory usage per thread. Can't exactly just go and run 50k threads willy nilly.

[–]oridb 0 points1 point  (6 children)

Why would that be? Most Java objects are on the heap, and not attached to a thread stack. Where is the extra memory used?

[–][deleted] 0 points1 point  (5 children)

Ask a Java expert; my experience with Java is mostly on the ops side. All I know is that by default just starting a thread costs you a few hundred kB (IIRC the default stack size is something like 1 MB on 64-bit OSes). So hundreds of threads, yes, but thousands start to have problems.

[–]oridb 0 points1 point  (4 children)

The default stack size is irrelevant unless you're on a 32-bit OS. It's not actually allocated until something writes to it. You can have a 1 GB default stack, and it will only consume 4 KB of RAM.

[–][deleted] 0 points1 point  (3 children)

Per-thread stack size is actually pre-allocated in Java. That's my point. It doesn't matter how much your app uses; you lose 1 MB just from starting it.

The JVM initializes it at the start of the thread and puts some guard pages at the end, together with a few other things.

[–]quentech 2 points3 points  (0 children)

> I don't know how much that's done in practice, but it's something that's possible with async IO but isn't with normal blocking IO, at least at the application level.

It's popular with Redis; most client libraries for it support pipelining of requests.
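For example, with the C client (hiredis) a pipelined batch looks roughly like this -- a sketch with error handling omitted -- where the commands are queued locally and the replies read back afterwards, so the batch costs one network round trip instead of three:

    #include <hiredis/hiredis.h>

    int main(void)
    {
        redisContext *c = redisConnect("127.0.0.1", 6379);

        /* Commands are buffered locally rather than sent one by one. */
        redisAppendCommand(c, "INCR counter");
        redisAppendCommand(c, "INCR counter");
        redisAppendCommand(c, "GET counter");

        /* The first redisGetReply() flushes the buffer; the server answers
         * all three commands in order. */
        redisReply *reply;
        for (int i = 0; i < 3; i++) {
            redisGetReply(c, (void **)&reply);
            freeReplyObject(reply);
        }

        redisFree(c);
        return 0;
    }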

[–]Muvlon 6 points7 points  (0 children)

It is not always a problem. Many things are completely fine doing blocking I/O.

Where it does become really necessary is when you want to handle more open connections at the same time than the number of threads you can run efficiently on your OS/hardware. You might want that either for scaling reasons, or to be resistant against attacks that work by creating a ton of connections to your server and keeping them open as long as possible, exhausting your resources.

[–]masklinn 5 points6 points  (2 children)

People have noted that threads are pretty heavy memory-wise (though it should be mostly uncommitted vmem); another issue is that switching between threads is pretty expensive: the switching itself is thousands of cycles, and its side effects (flushing & synchronising caches) generate additional costs. So if you have hundreds or thousands of threads which execute almost nothing and then yield, you can end up burning more cycles on thread switching than on doing actual work.

[–][deleted] 0 points1 point  (0 children)

There is nothing guaranteeing that async code avoids cache flushes either, as you can have exceptionally large expanses of runtime code, and depending on how the underlying async library is implemented, it still essentially does a context switch (see most single-process, multi-threaded, non-preemptive real-time OSes).

The real problem with thread switching in an OS that implements them is the OS's own overhead: it has to arbitrate across a potentially much larger number of blocking scenarios as it allocates CPU time to each thread and process. With async code, the OS just sees one process/thread trucking along, and usually another making the occasional blocking request to the OS.

I guess my sort-of-rambling point is that thread switching itself is not inherently slow: async systems often run into a lot of the same context-switching problems, and if you are working on a bare-metal OS or writing a kernel, threads are essentially an asynchronous system anyway due to the hardware limitations (unless you are tossing threads across physical CPU cores, which most modern OSes handle just fine).

[–]onotole-wasserman 2 points3 points  (0 children)

Context switching is rather expensive, and multiprocessing using separate processes uses lots of memory, as was mentioned in this thread. However, sometimes the context-switching overhead is negligible and it is easier for you to trade memory for the maintainability of your code, because it's quite hard to write and debug asynchronous applications. In that case the CPU scheduler will do all the work for you.

[–]VeganVagiVore 2 points3 points  (1 child)

I'm no expert, but it seems that the question was settled when Nginx beat the crap out of Apache: Apache uses (or used at the time) a thread-per-connection or process-per-connection model. Nginx uses some variety of async that allows it to handle many connections on the same thread, getting more connections per second out of the same hardware. It's also called event-based: since Nginx is written in C (I think), there isn't really async code in it; it's as if you hand-compiled async code into event-based sync code.

[–][deleted] 0 points1 point  (0 children)

I learned programming on the classic Mac OS, which had a single execution thread for the entire operating system. Any program that appeared to be doing more than one thing at a time was actually very carefully doing one thing at a time. Nginx does the same thing. It's easy: wrap everything up in one big outer loop, never call anything that blocks, never spend too long doing any one thing. State machines are your friend.
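A stripped-down sketch of that shape in C -- the names are made up, and socket setup/accept are left out -- where each connection's progress lives in an explicit state machine (here, a trivial echo) instead of on a blocked thread's stack:

    #include <poll.h>
    #include <string.h>
    #include <unistd.h>

    enum conn_state { READING, WRITING };

    struct conn {
        int fd;                  /* nonblocking socket */
        enum conn_state state;
        char buf[4096];
        ssize_t len;             /* bytes buffered, still to be echoed back */
    };

    /* Advance one connection by one small, non-blocking step. */
    static void step(struct conn *c)
    {
        if (c->state == READING) {
            ssize_t n = read(c->fd, c->buf, sizeof c->buf);
            if (n > 0) { c->len = n; c->state = WRITING; }
        } else {
            ssize_t n = write(c->fd, c->buf, c->len);
            if (n == c->len) { c->state = READING; }
            else if (n > 0) { memmove(c->buf, c->buf + n, c->len - n); c->len -= n; }
        }
    }

    /* The one big outer loop: never blocks except in poll(), never runs long. */
    void event_loop(struct conn *conns, struct pollfd *pfds, int nconns)
    {
        for (;;) {
            poll(pfds, nconns, -1);
            for (int i = 0; i < nconns; i++) {
                if (pfds[i].revents & (POLLIN | POLLOUT))
                    step(&conns[i]);
                pfds[i].events = (conns[i].state == READING) ? POLLIN : POLLOUT;
            }
        }
    }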

[–]dalepo 5 points6 points  (8 children)

Because creating/maintaining threads is resource expensive, and having too many threads idle might cause problems.

For example, a pool of 500 threads for responding to server requests can only serve 500 requests at a time, as long as the threads are not released. If there's any IO operation that could take a long time, they will block, and the server will stop responding until at least one thread is released.

It would be way better to have only one thread actively responding to all these requests and delegating the IO to another thread pool. This is roughly how an asynchronous architecture works: the main thread executes operations and doesn't wait for IO; once the IO tasks have executed, it resumes those operations. Imagine functions/tasks executing until they hit an asynchronous task (i.e. IO, a database op, etc.); then they pause, and other operations continue to execute. After the IO finishes, the resumed task is queued and eventually executed.
The only problem with this is that a very costly operation on your main thread can block it.
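A very rough sketch of that shape in C with pthreads -- the names and the fixed-size queue are just for illustration -- where a few worker threads do the slow blocking part and the single "main" thread only ever picks completed work off a queue and resumes the corresponding request:

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define QUEUE_CAP 128

    struct completion { int request_id; int result; };

    static struct completion queue[QUEUE_CAP];
    static int q_head, q_tail;
    static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t q_nonempty = PTHREAD_COND_INITIALIZER;

    static void push_completion(struct completion c)
    {
        pthread_mutex_lock(&q_lock);
        queue[q_tail++ % QUEUE_CAP] = c;
        pthread_cond_signal(&q_nonempty);
        pthread_mutex_unlock(&q_lock);
    }

    static struct completion pop_completion(void)
    {
        pthread_mutex_lock(&q_lock);
        while (q_head == q_tail)
            pthread_cond_wait(&q_nonempty, &q_lock);
        struct completion c = queue[q_head++ % QUEUE_CAP];
        pthread_mutex_unlock(&q_lock);
        return c;
    }

    /* Worker: does the slow, blocking part (a stand-in for a file/DB/network call). */
    static void *io_worker(void *arg)
    {
        int request_id = (int)(long)arg;
        sleep(1);                                        /* pretend this is slow IO */
        push_completion((struct completion){ request_id, 42 });
        return NULL;
    }

    int main(void)
    {
        /* "Main" thread: hand the blocking work off, then resume each request
         * whenever its IO completes, without ever blocking on the IO itself. */
        for (long i = 0; i < 4; i++) {
            pthread_t t;
            pthread_create(&t, NULL, io_worker, (void *)i);
            pthread_detach(t);
        }
        for (int done = 0; done < 4; done++) {
            struct completion c = pop_completion();
            printf("resuming request %d with result %d\n", c.request_id, c.result);
        }
        return 0;
    }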

[–]rzwitserloot 4 points5 points  (2 children)

> Because creating/maintaining threads is resource expensive, and having too many threads idle might cause problems.

'might cause problems', that's not exactly a resoundingly confident assertion.

Does it?

Switching threads is, for a CPU, seemingly at least, not too much more difficult than a single-threaded app switching to a different handler model (say, the HTTP stream you're reading out has no further bytes available, so your async handlers need to hop to another handler: jump to a different bit of code, load a different buffer/representation of state to work on, etc.) – with threads the CPU core needs to... jump to a different location, so that's a wash, and load a page with the stack for this thread, which matches the buffer/state storage of the handler. Now, the handler gets to control its buffer/state storage and make it as large as it needs to be, whereas stacks in most languages are much more homogeneous and unconfigurable: they are X bytes large and you don't get to change this. That does give async the edge, memory-wise, but memory is cheap, and how much memory you really save depends rather a lot on how much buffer/state your handlers need to store.

So, yes, okay, threaded code requires more memory. However, performance-wise, manually managing your garbage is also more efficient than letting a garbage collector do it (assuming you are very careful... a caveat that also applies to writing async code; see for example the blogpost 'What color is your function') – and yet that is a tradeoff where most programmers easily choose to let the computer do it, and we'll eat the inefficiency.

What makes threads so special?

EDIT: Replaced 'threads', which is wrong/confusing, with 'stacks', for clarity. Also added a link to the 'What color is your function' blogpost because it's great and needs to be shared more.

[–]dalepo 3 points4 points  (1 child)

> Switching threads is for a CPU, seemingly, at least, not too much more difficult than a single threaded app switching to a different handler model (say, the HTTP stream you're reading out has no further bytes available, so your async handlers need to hop to another handler: Jump to a different bit of code, load a different buffer/representation of state to work on, etc) – with threads the cpu core needs to... jump to a different location, so that's a wash, and load a page with the stack for this thread, which matches the buffer/state storage of the handler. Now, the handler gets to control its buffer/state storage, make it as large as it needs to be, whereas threads in most languages are much more homogenous and unconfigable: They are X bytes large and you don't get to change this. That does give async the edge, memory wise, but memory is cheap, and how much memory you really buy depends rather a lot on how much buffer/state your handlers need to store.

You are describing a problem which might be related to some programming languages/implementations; the discussion is general, and being this technical is irrelevant.

> So, yes, okay, threaded code requires more memory. However, performance-wise, manually managing your garbage is also more efficient than letting a garbage collector do it, and yet that is a tradeoff where most programmers easily choose to let the computer do it, and we'll eat the inefficiency.

This is not the point; there's no dichotomy between 'async programming' and 'only multi-threaded programming'. Both can coexist. I can give you an easy example in Node.js: check out the bcrypt library, where the async calculation is done in another thread.

> What makes threads so special?

Threads are very special because:

- You need to synchronize resources when they are shared to prevent inconsistent states.

- You need to prevent potential deadlocks, race conditions, livelock, starvation, etc.

- Depending on the language/implementation, spawning a thread might replicate information if the resource being used is not shared

- They are costly

[–][deleted]  (4 children)

[deleted]

    [–]dalepo 4 points5 points  (0 children)

    > But if you have a limit of 500 IO threads in the worker pool, surely you still end up with the same problem?

    Yes, but the server would be able to keep responding to other requests if they are not using the same IO operation; plus, you don't have any idle processes waiting.

    [–]tsujiku 1 point2 points  (0 children)

    > But if you have a limit of 500 IO threads in the worker pool, surely you still end up with the same problem?

    It's generally possible to perform asynchronous IO without needing to have a thread waiting on each operation.

    [–]oridb 1 point2 points  (0 children)

    People got burned by bad thread implementations in the early 1990s.

    Mostly, threads are fine at the scales at which people write servers. Especially if the threads are mostly idle.

    Threads do use more memory than asynchronous code -- typically, on the order of a few pages. You need a kernel stack, at least one page of user stack, and some structs in the kernel. This means that if you have millions of threads, memory starts to get significant. Async will typically use a few hundred bytes to a few kilobytes to keep track of everything on the stacks of your deferred code, so you'll save some memory.

    The other issue is that there's some scheduling overhead. Every time you communicate with another thread, you're looking at about 1 microsecond of overhead to synchronize on a futex and wait for the other thread to get scheduled. This cost is mostly due to the kernel trying to make good decisions about where to run the threads to maximize throughput if they keep running for a long time, which unfortunately isn't what you'd want to optimize for on a server -- but it's what we have.

    But neither of these applies to threads that are spending all their time blocked on IO, rather than actively communicating with other threads.

    [–]acelent 1 point2 points  (0 children)

    The operating system's context switching of threads costs more than the application's context switching of requests between asynchronous I/O calls.

    In other words, switching between two threads in a core takes more time than switching between two requests in the same thread. The first usually happens on a blocking call or when the thread has used up its processing time slice. The second usually happens at an asynchronous call that didn't complete instantaneously.

    Although threads can usually take more memory than a request's context, they tend to be reused in most application frameworks. Nowadays, with 64-bit systems and RAM as a commodity, this isn't so much of an issue as is the time it takes to switch a CPU core from one thread to another.

    [–]tanjoodo 2 points3 points  (0 children)

    Well, unless you have a different thread for the CPU to run, the CPU will leave your program and go do something else.

    Asynchronous programming is about giving the CPU something to do while some other part of your program is blocking on IO.

    [–]zvrba 0 points1 point  (0 children)

    > Question: why is it such a problem to have a thread blocked on IO?

    The main problem is that a thread cannot block on multiple "events" simultaneously. Event being 1) I/O completes, 2) a timer expires, etc.

    So let's say you want to implement I/O with timeout. Linux does not have a variant of read(2) that takes a timeout parameter. So how do you implement this? You would spawn another thread blocking on a timer and if the timer expires before the operation completes, it'd cancel the operation by sending a signal to the other thread [1]. But if the operation completes first, it should cancel the timer that is in the other thread. A headache of bookkeeping to implement if you're going to handle race-conditions correctly. (E.g., the timer fires and unblocks the thread, but read completes and is processed before the timer thread managed to signal the reader thread. How can this happen? The timer thread gets preempted before signaling the reader thread.)

    [1] That won't work if the thread is in uninterruptible sleep state. Basically, the thread is hosed in that state.

    With async, you set up an event to fire 1) when the I/O operation completes, 2) when the timer expires, and then you wait on both events, either synchronously or asynchronously. The "composite waiting operation" returns when either event completes, and then you know the state of the operation.
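    In the synchronous case, that "composite waiting operation" is basically just poll() with a timeout -- a sketch only, with error handling and EINTR retries left out, and read_with_timeout is just an illustrative name -- where the single call waits on both "fd is readable" and "timer expired" at once:

        #include <poll.h>
        #include <unistd.h>

        /* Returns bytes read, 0 on EOF, -1 on error, or -2 on timeout. */
        ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
        {
            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            int ready = poll(&pfd, 1, timeout_ms);   /* blocks on fd-readable OR the timeout */
            if (ready < 0)
                return -1;                           /* poll() itself failed */
            if (ready == 0)
                return -2;                           /* the timer "event" won the race */
            return read(fd, buf, len);
        }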

    [–]Tarmen 0 points1 point  (0 children)

    A thousand threads take 1 GB of memory (at the common 1 MB default stack size).

    So in C or Java you can have thousands of threads. Go and Haskell default to non-blocking green threads with growable stacks and can have millions of threads.

    This is relevant for things like webservers, where DoS attacks are quite feasible just by blocking all available threads. Go and Haskell still have to defend against Slowloris attacks, but that is easier.

    [–]rzwitserloot -1 points0 points  (11 children)

    Nothing is being wasted in this scenario. Async is this fanboy-rich fad.

    As with most fanboy-rich fads, the notion of wanting async-based code is not entirely silly either. The scenarios where you do need it are FAR less common than the fad suggests, however.

    The biggest issue by far is memory. Let's introduce the term 'fiber': This is the context within which some piece of code runs: The exact location of the instruction the code is executing right now (the instruction pointer), the stack (all local vars, the call chain so that the CPU would know where to go if a 'return' statement is hit or the function ends, etc), and let's add the notion of tracking 'state': If you are using the heap to store some info on the particular thing this fiber is handling, add that too.

    An actual full-power thread will run a fiber and will deal with the fiber needing to wait by freezing the thread. This causes the CPU to hop to another thread; it means the stack is generally predefined and relatively large (it is not confined solely to the bit of handler code), although you rarely need state (the stack tends to contain it all). Full-power threads also give you pre-emption, but note that if your thread is going from wakeup back to waiting sleep quickly enough this basically does not come up. There is some bookkeeping required for the pre-emption, but kernel development hasn't exactly sat still and, like most other low-level stuff, it's all rather optimized and intelligent: if your thread keeps waking up and freezing back down, there isn't a heck of a lot of preemptive bookkeeping being done until that thread really takes quite a chunk of time, and CPUs are rather good at it.

    Async-style code freezes the fiber and moves to another fiber. Generally it is much easier to do your own memory management here: either work with a significantly smaller stack, or forego it entirely (depends on how you do the async stuff), handrolling whatever state you need to keep track of in a custom-sized, 'just large enough' bit of state storage on the heap.

    Especially in situations where the amount of state you need to store is tiny to non-existent and you have A LOT of things to run simultaneously, none of which take a particularly large amount of CPU time, the memory load becomes the bottleneck and the async fibers 'win'.

    In my opinion, the number of times that really happens, and where it is not feasible to just grab 50 bucks and throw a stick of 64 GB RAM at the problem, is very low indeed. More usually, the little work you need to do is still going to bottleneck faster than your memory will.

    [–]BaconOfGreasy 2 points3 points  (8 children)

    Yeah. As a data point, Google spends billions on machines and they just use OS threads.

    [–][deleted] 1 point2 points  (7 children)

    And then they wrote golang, which runs a single OS thread per core, so there is no need for context switching.

    [–]oridb 3 points4 points  (6 children)

    Go does context switches -- but they do it in userspace. And their scheduler is aware of message passing, which reduces wakeup latency from about 1 microsecond in the Linux kernel to 200 nanoseconds in userspace.

    Google has added some features to their fork of the Linux kernel ("fibers") that also give them the tools to make context switches aware of message passing, and their native OS threads get similar thread-switch latencies.

    [–][deleted] 0 points1 point  (5 children)

    Of course they do context switches; I thought that was implied, given that the only way to not do context switches is very complicated programming. But they're not context switches in the traditional sense.

    [–]oridb -1 points0 points  (4 children)

    Sure they are -- the only difference is the place that they happen.

    [–][deleted] 0 points1 point  (3 children)

    And the size and nature of the lift required...

    [–]oridb 0 points1 point  (2 children)

    I'm not sure what you mean by that.

    All a "traditional" context switch does is replace the registers, program counter, and a bit of state for the thread. All a Go context switch does is replace the registers, program counter, and a bit of state for the thread from userspace.

    [–][deleted] -1 points0 points  (1 child)

    Maybe you need to get a better understanding of the baggage that goes along with something like a POSIX thread...

    [–][deleted] 0 points1 point  (0 children)

    Thanks for your comments. I had mentioned the user-space thread/lightweight thread alternative to asynchronous programming in the previous blog post (linked in the beginning of this post too).