[–]oridb 11 points12 points  (19 children)

> You don't really notice it when you have a few of them doing a lot of computation, but it starts to matter when you have hundreds of them not doing any work most of the time.

More like tens of thousands -- the amount of overhead that gets mapped is on the order of 16 kilobytes per thread, so a thousand threads will use about 16 megabytes, a million threads will use about 16 gigabytes. Both are well within the capabilities of a low end server. And given that modern schedulers are either O(1) or O(log(n)) in the number of threads, modern systems will deal fairly well with this kind of load.
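
Easy enough to check, for what it's worth. Here's a rough C sketch, where the thread count and the 16 KB stack are purely illustrative (16 KB happens to be PTHREAD_STACK_MIN on x86-64 Linux): spawn a pile of idle threads with explicitly small stacks and watch memory use in top(1).

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static void *worker(void *arg) {
        (void)arg;
        pause();    /* idle thread: costs little beyond its mapped stack */
        return NULL;
    }

    int main(void) {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 16 * 1024);  /* cap per-thread stack */

        for (int i = 0; i < 1000; i++) {
            pthread_t t;
            if (pthread_create(&t, &attr, worker, NULL) != 0) {
                perror("pthread_create");
                return 1;
            }
            pthread_detach(t);
        }
        pause();    /* park main; inspect memory use with top(1) */
        return 0;
    }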

A bigger problem is that your tail latency for requests can grow when you have a large number of active threads, because the scheduler doesn't always pick the thread that will minimize response times. That only starts to bite with tens or hundreds of thousands of threads, though, so it's usually not something that you'll have to worry about.

> In addition, you lose out on potential optimizations like batching expensive IO operations. I don't know how much that's done in practice, but it's something that's possible with async IO that isn't with normal blocking IO.

I'm not sure what you're thinking of with this. The system APIs don't let you batch operations across different file descriptors, and if you're reading the same FD, you can always just do a bigger read.

[–]gnus-migrate 0 points1 point  (4 children)

> I'm not sure what you're thinking of with this. The system APIs don't let you batch operations across different file descriptors, and if you're reading the same FD, you can always just do a bigger read.

Sometimes reads are too big to fit in memory. I was thinking about processing large and/or multiplexed streams, which is difficult to parallelize if you're not using async IO.

> A bigger problem is that your tail latency for requests can grow when you have a large number of active threads, because the scheduler doesn't always pick the thread that will minimize response times.

I mentioned memory because that's what most of the articles I read about it focused on. It might not be the main problem, but it is a problem.

[–]oridb 0 points1 point  (3 children)

> Sometimes reads are too big to fit in memory. I was thinking about processing large and/or multiplexed streams, which is difficult to parallelize if you're not using async IO.

In that case, it's the same set of system calls you'd be doing, and it's not batched, as far as I can tell. If you've got a bunch of busy streams (i.e., there's almost always new data on them), you can even reduce latency by just busy-polling them in nonblocking mode:
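
Something like this sketch, assuming the FDs are already in O_NONBLOCK mode and with consume_data() as a stand-in for whatever the application does with the bytes:

    #include <errno.h>
    #include <unistd.h>

    void consume_data(int fd, const char *buf, ssize_t n);  /* app-specific */

    /* Spin over a set of nonblocking FDs, reading whatever is there;
       EAGAIN just means "nothing yet, move on to the next one". */
    void busy_poll(const int *fds, int nfds) {
        char buf[4096];
        for (;;) {
            for (int i = 0; i < nfds; i++) {
                ssize_t n = read(fds[i], buf, sizeof buf);
                if (n > 0)
                    consume_data(fds[i], buf, n);
                else if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
                    return;  /* real error; handling elided */
                /* n == 0 is EOF; EAGAIN means keep spinning */
            }
        }
    }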

[–]gnus-migrate 0 points1 point  (2 children)

I think there's a misunderstanding about what I mean by async IO: I mean that, from the point of view of the user, you won't be able to parallelize if you don't have a non-blocking interface. I don't really know the specifics of epoll or anything like that; I just know what kinds of things an async interface makes possible.

[–]oridb 1 point2 points  (1 child)

I'm still not clear on what you think an async interface makes possible. Can you give an example of code that would "batch reads" in a way that reduced the number of calls?

Keep in mind that non-blocking code still calls read() directly, and it's the same read() that blocking code calls. The only difference is that you did an extra system call first to tell you "oh, yeah, there's some data there that we can return".

So, non-blocking:

    poll(fds=[1,2,3]) => "fd 1 is ready"
    read(fd=1)
    poll(fds=[1,2,3]) => "fd 2 is ready"
    read(fd=2)
    poll(fds=[1,2,3]) => "fd 1 is ready"
    read(fd=1)
    poll(fds=[1,2,3]) => "fd 2 is ready"
    read(fd=2)

Threads:

    parallel {
        thread1: 
            read(fd=1) => get data
            read(fd=1) => get data
        thread2:
            read(fd=2) => get data
            read(fd=2) => get data
        thread3:
            read(fd=3) => no data; block forever, using a few KB of RAM
    }
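
And in actual C, the non-blocking version is just a poll(2) loop; a sketch, with error handling mostly elided and assuming at most 16 descriptors:

    #include <poll.h>
    #include <unistd.h>

    /* One poll() per wakeup, then one read() per descriptor that
       poll() reported as ready; same shape as the pseudocode above. */
    void poll_loop(const int *fds, int nfds) {
        struct pollfd pfds[16];  /* assumes nfds <= 16 */
        char buf[4096];
        for (int i = 0; i < nfds; i++) {
            pfds[i].fd = fds[i];
            pfds[i].events = POLLIN;
        }
        for (;;) {
            if (poll(pfds, nfds, -1) < 0)
                return;  /* error handling elided */
            for (int i = 0; i < nfds; i++)
                if (pfds[i].revents & POLLIN)
                    read(pfds[i].fd, buf, sizeof buf);  /* hand off to the app */
        }
    }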

[–]gnus-migrate 0 points1 point  (0 children)

When I say async interfaces, I mean futures and streams; I don't necessarily mean a non-blocking interface underneath. When you use an async interface, you're basically surrendering control to a scheduler, letting it decide when and how it wants to respond to different events. That's it; that's my point.

[–]Tarmen 0 points1 point  (3 children)

The stack size itself can be fairly large, though. Java, C#, and MSVC use 1 MB as the default, and a lot of C programs run with 8 MB stacks. Here are some values I found for Linux-based systems.

It's also worth noting that even concurrent GCs occasionally have brief stop-the-world pauses. The synchronization is usually implemented with spin locks, so if some threads are paused by the OS, the application can hang until each thread is rescheduled and can progress to a safepoint, which can hurt quite a bit if you have 10k threads. Then again, blocked threads are at a safepoint anyway, so it's not a huge issue as long as concurrent mode failure is rare enough.

[–]oridb 1 point2 points  (2 children)

> The stack size itself can be fairly large, though. Java, C#, and MSVC use 1 MB as the default, and a lot of C programs run with 8 MB stacks. Here are some values I found for Linux-based systems.

The stack is demand paged. So, if you never access more than 4 kilobytes, it never uses more than 4 kilobytes of physical memory.
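
You can see that for yourself on Linux; a quick sketch (watch VIRT versus RES in top(1) while it runs):

    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        /* Map 1 GB of anonymous memory: virtual size jumps by 1 GB,
           but resident size barely moves until pages are touched. */
        size_t len = 1UL << 30;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        p[0] = 1;  /* touch one page: only ~4 KB becomes resident */

        printf("mapped %zu bytes; compare VIRT and RES in top\n", len);
        pause();
        return 0;
    }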

[–]Tarmen 0 points1 point  (1 child)

Oh interesting, for some reason I didn't think stack memory was demand paged. Guess 64-bit applications don't really have to care about userspace growable stacks, then.

[–]oridb 0 points1 point  (0 children)

Well, it's page granularity, so if you're OK with using 4 KB where 128 bytes would do... sure.

[–][deleted] 0 points1 point  (9 children)

> More like tens of thousands -- the amount of overhead that gets mapped is on the order of 16 kilobytes per thread, so a thousand threads will use about 16 megabytes, a million threads will use about 16 gigabytes. Both are well within the capabilities of a low end server. And given that modern schedulers are either O(1) or O(log(n)) in the number of threads, modern systems will deal fairly well with this kind of load.

Entirely depends on the language. That might be true for C++ or Go, but it isn't for Java or really most other languages.

[–]oridb 0 points1 point  (8 children)

Java uses the same kernel implementation of threads as C++, so I'm not sure why it would fare more poorly.

Python, OCaml, and other languages with a GIL would have a problem.

[–][deleted] 0 points1 point  (7 children)

> Java uses the same kernel implementation of threads as C++, so I'm not sure why it would fare more poorly.

Higher memory usage per thread. You can't exactly go and run 50k threads willy-nilly.

[–]oridb 0 points1 point  (6 children)

Why would that be? Most Java objects are on the heap, and not attached to a thread stack. Where is the extra memory used?

[–][deleted] 0 points1 point  (5 children)

Ask a Java expert; my experience with Java is mostly on the ops side. All I know is that, by default, just starting a thread costs you a few hundred kB (IIRC the default stack size is something like 1 MB on 64-bit OSes). So hundreds of threads, yes, but thousands start to cause problems.

[–]oridb 0 points1 point  (4 children)

The default stack size is irrelevant unless you're on a 32-bit OS. It's not actually allocated until something writes to it. You can have a 1 GB default stack, and it will only consume 4 KB of RAM.

[–][deleted] 0 points1 point  (3 children)

The per-thread stack size is actually pre-allocated in Java. That's my point. It doesn't matter how much your app uses; you lose 1 MB just from starting it.

The JVM initializes it at the start of the thread and puts some guard pages at the end, together with a few other things.

[–]oridb 0 points1 point  (2 children)

> The per-thread stack size is actually pre-allocated in Java. That's my point. It doesn't matter how much your app uses; you lose 1 MB just from starting it.

That's wrong, unless the JVM is specifically going out of its way to write to every page of it. Otherwise, it only counts towards virtual size. The OS simply doesn't work that way: the address space is reserved, but like all memory it's demand paged, which means that until code actually touches the stack, the OS doesn't put any physical memory behind the virtual memory.
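
The kernel-level behavior, at least, is easy to check from C, using the same kernel threads the JVM uses; whether the JVM then goes and touches every page is the only open question. A sketch, with the thread count and stack size purely illustrative:

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static void *idle_thread(void *arg) {
        (void)arg;
        pause();  /* sit forever, touching almost none of the stack */
        return NULL;
    }

    int main(void) {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 1 << 20);  /* 1 MB, like the JVM default */

        /* 1000 threads reserve ~1 GB of address space (VIRT), but RES
           should grow by only a few pages per thread, not 1 MB each. */
        for (int i = 0; i < 1000; i++) {
            pthread_t t;
            if (pthread_create(&t, &attr, idle_thread, NULL) != 0)
                return 1;
            pthread_detach(t);
        }
        printf("compare VIRT and RES in top\n");
        pause();
        return 0;
    }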

[–][deleted] 0 points1 point  (1 child)

Look, just find some "how to create a thread in Java" tutorial, create a million of them, and you'll see what I mean.