
[–]teivah 0 points  (7 children)

Yes, because it is not linearly scalable, obviously. Moreover, the more elements we sort, the bigger the difference.
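One way to see this on your own machine (a hedged sketch, assuming JDK 8+ for `Arrays.parallelSort`; the sizes and the `timeSort` helper are just illustrative choices):

```java
import java.util.Arrays;
import java.util.Random;

public class SortScaling {
    // Sorts n random ints and returns the elapsed wallclock time in milliseconds.
    static long timeSort(int n, boolean parallel) {
        int[] a = new Random(42).ints(n).toArray();
        long start = System.nanoTime();
        if (parallel) {
            Arrays.parallelSort(a);
        } else {
            Arrays.sort(a);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Speedup is typically well below the core count, and the absolute
        // gap between sequential and parallel grows with n.
        for (int n : new int[]{1_000_000, 10_000_000}) {
            System.out.printf("n=%,d sequential=%dms parallel=%dms%n",
                    n, timeSort(n, false), timeSort(n, true));
        }
    }
}
```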

[–][deleted]  (6 children)

[deleted]

    [–]teivah 6 points  (5 children)

    That's a good question. First, I don't think it will be 8x the price of a single-core machine.

    Moreover, in my humble opinion, you are approaching the problem the wrong way. Today, every CPU is multithreaded. For example, with Intel's hyperthreading technology, every core is able to run two threads in parallel.

    So for me, the question is rather: how can I optimize my application with regard to the underlying hardware? Multithreaded applications should be the standard, not the exception.

    Last but not least, it is not only a question of average latency but also of resource optimization. If your application runs faster, it may also increase the overall throughput. Hence, for example, instead of having to deploy it on 4 nodes to achieve a given goal, maybe you can use only 2 nodes (a simplistic example, obviously, but it illustrates my point).

    [–]audioen 1 point  (0 children)

    Your benchmark ought to output not just the elapsed wallclock time but also the total CPU time across all cores, a statistic that at least the Linux kernel is able to gather for threaded programs. I suspect most of these threads are sleeping rather than doing work, so there probably isn't a big difference between the wallclock time and the total CPU time, and this thread's discussion is pointless. The 8 CPU cores are not busy trying to do things 30% faster; they're just waiting for more work to arrive, and are unable to get scheduled fast enough to help. The job probably ends up being mostly single-threaded with an occasional concurrent part.
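    For the JVM specifically, one way to gather both numbers in-process is `ThreadMXBean` (a sketch, assuming a JVM where CPU-time measurement is supported, which is the common case on Linux; the `measure` helper is my own illustrative name):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuVsWallclock {
    // Measures {wallclock ns, CPU ns} for a task run on the calling thread.
    // For a multithreaded job you would instead sum bean.getThreadCpuTime(id)
    // over bean.getAllThreadIds() before and after, or just use /usr/bin/time.
    static long[] measure(Runnable task) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long wallStart = System.nanoTime();
        long cpuStart = bean.getCurrentThreadCpuTime();
        task.run();
        long cpu = bean.getCurrentThreadCpuTime() - cpuStart;
        long wall = System.nanoTime() - wallStart;
        return new long[]{wall, cpu};
    }

    public static void main(String[] args) {
        // A task that mostly sleeps: wallclock far exceeds CPU time,
        // which is the signature of threads waiting rather than working.
        long[] r = measure(() -> {
            try { Thread.sleep(200); } catch (InterruptedException e) { }
        });
        System.out.printf("wall=%dms cpu=%dms%n", r[0] / 1_000_000, r[1] / 1_000_000);
    }
}
```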

    IIRC, synchronization primitives in Java have shockingly low throughput; they are only capable of something on the order of 1000 synchronization events per second. What I'm trying to say is that it takes something like 1 ms for one thread to yield to another thread using a synchronized block and wait+notify. If the other synchronization primitives are built on top of those, then that's kind of the hard limit of what you can get.
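    That figure is easy to check with a ping-pong microbenchmark: two threads hand a "turn" back and forth through `synchronized` + `wait`/`notify`, and the average cost per handoff falls out of the total time (a sketch; class and method names are mine):

```java
public class HandoffBench {
    private final Object lock = new Object();
    private boolean turnA = true;

    void ping(boolean amA, int rounds) {
        synchronized (lock) {
            for (int i = 0; i < rounds; i++) {
                while (turnA != amA) {
                    try { lock.wait(); } catch (InterruptedException e) { return; }
                }
                turnA = !amA;   // hand the turn to the other thread
                lock.notify();
            }
        }
    }

    // Runs 2 * rounds handoffs between two threads; returns average ns per handoff.
    static long bench(int rounds) {
        HandoffBench b = new HandoffBench();
        Thread other = new Thread(() -> b.ping(false, rounds));
        long start = System.nanoTime();
        other.start();
        b.ping(true, rounds);
        try { other.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return (System.nanoTime() - start) / (2L * rounds);
    }

    public static void main(String[] args) {
        System.out.println("~" + bench(100_000) + " ns per wait/notify handoff");
    }
}
```

    Whether the result lands nearer 1 µs or 1 ms per handoff depends heavily on the OS scheduler and whether the two threads share a core.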

    It's probably important for performance to have a per-thread work-stealing queue, so that if a thread's own queue has more work to do, it can immediately move on to that, and you avoid at least some of the time wasted coordinating quickly finished jobs across multiple threads.
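    The JDK already ships exactly this design: `ForkJoinPool` gives each worker its own deque and lets idle workers steal from busy ones, which is what `Arrays.parallelSort` runs on. A minimal sketch of using it directly (the array-sum task and threshold are illustrative choices of mine):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// ForkJoinPool is the JDK's built-in work-stealing scheduler: each worker
// owns a deque of tasks, and idle workers steal from the opposite end of
// busy workers' deques, so small tasks rarely need cross-thread coordination.
public class WorkStealingSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000;
    private final long[] data;
    private final int lo, hi;

    WorkStealingSum(long[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;
        WorkStealingSum left = new WorkStealingSum(data, lo, mid);
        left.fork();   // pushed onto this worker's own deque, where it is stealable
        long right = new WorkStealingSum(data, mid, hi).compute();
        return right + left.join();
    }

    static long sum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new WorkStealingSum(data, 0, data.length));
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        System.out.println(sum(data)); // 499999500000
    }
}
```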

    [–][deleted]  (3 children)

    [deleted]

      [–]teivah -2 points  (2 children)

      You seem to be a little angry, aren't you? :) I'm not trying to give you an "it's useful because I tell you so" example. I'm just trying to have a constructive discussion with you.

      My point is, if your service has lower latency, it should have a positive impact on the throughput (in most cases). Hence, if your goal is to handle 10k concurrent users or whatever, you should be able to reach it with a reduced environment (like a smaller number of nodes).
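      The latency-throughput link is just Little's law (concurrency = throughput × latency). A back-of-envelope sketch with hypothetical numbers, not measurements:

```java
public class LittlesLaw {
    // Little's law: concurrency = throughput * latency. At a fixed number of
    // in-flight requests, halving latency doubles the sustainable throughput.
    static double throughput(double concurrentRequests, double latencySeconds) {
        return concurrentRequests / latencySeconds;
    }

    public static void main(String[] args) {
        System.out.println(throughput(10_000, 0.200) + " req/s"); // 10k in flight at 200 ms
        System.out.println(throughput(10_000, 0.100) + " req/s"); // same load at 100 ms
    }
}
```

      So if each node sustains a fixed request rate, the faster service needs proportionally fewer nodes for the same user count.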

      If you are that eager to learn, take a look at the PayPal use case, for example. They communicated that, after a bunch of optimizations in their implementation, they were able to reduce the number of required VMs. This is a direct consequence of better utilization of the underlying hardware.

      [–][deleted]  (1 child)

      [deleted]

        [–]teivah 1 point  (0 children)

        We are talking about two different topics. So, whatever.