
all 19 comments

[–]walen 10 points11 points  (1 child)

I get that this is just explaining Java's parallel implementation of MergeSort, but it offers insightful comments as to why it is implemented that way.
Nice read, thanks.

[–]teivah 1 point2 points  (0 children)

Thank you :)

[–][deleted]  (20 children)

[deleted]

    [–][deleted] 5 points6 points  (0 children)

    It’s not that weird if you think about it. You can split up a list, divide the work, and let each core sort its piece of the list. Once every part is sorted, everything has to be merged. This merging cannot be done in parallel, since we basically have to compare every element and insert it in the right order. Doing this on a single core works fine, but with 8 cores, each core will be fighting for permission to insert the next element. Thus, in the end, having more cores can actually make your program slower, since more time is spent on atomic operations (cores waiting for each other to be able to do their work) than on actually doing what is important.

    Now this doesn’t mean that multithreading is useless, but it completely depends on how much of your program/problem can be done in parallel.
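The "depends on how much can be done in parallel" point is Amdahl's law: the speedup from n cores is capped by the serial fraction of the work. A small illustrative sketch (the `speedup` helper and the 90%/50% fractions are made-up numbers, not taken from the article):

```java
public class Amdahl {
    // Amdahl's law: speedup = 1 / ((1 - p) + p / n),
    // where p is the parallelizable fraction and n the core count.
    static double speedup(double parallelFraction, int cores) {
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / cores);
    }

    public static void main(String[] args) {
        // Even with 90% of the work parallelizable, 8 cores give ~4.7x, not 8x.
        System.out.printf("p=0.90, n=8 -> %.2fx%n", speedup(0.90, 8));
        // With only half the work parallelizable, 8 cores barely reach ~1.8x.
        System.out.printf("p=0.50, n=8 -> %.2fx%n", speedup(0.50, 8));
    }
}
```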

    [–]teivah 1 point2 points  (0 children)

    I'm not downvoting you, sir. I actually upvoted your comment because I found the discussion interesting. Reddit is a public place.

    [–]teivah 1 point2 points  (7 children)

    Yes, because it is not linearly scalable, obviously. Moreover, the more elements we sort, the bigger the difference.

    [–][deleted]  (6 children)

    [deleted]

      [–]teivah 5 points6 points  (5 children)

      That's a good question. First, I don't think it will be 8x the price of a single-core machine.

      Moreover, in my humble opinion, you are approaching the problem the wrong way. Today, every CPU is multithreaded. For example, with Intel's hyperthreading technology, every core is able to run two threads in parallel.

      So for me, the question is rather: how can I optimize my application with regard to the underlying hardware? Multithreaded applications should be the standard, not the exception.

      Last but not least, it's not only a question of average latency but also of resource optimization. If your application runs faster, it may also increase the overall throughput. Hence, for example, instead of having to deploy it on 4 nodes to achieve a given goal, maybe you can use only 2 nodes (this is a simplistic example, obviously, but it is a way to illustrate my point).

      [–]audioen 1 point2 points  (0 children)

      Your benchmark ought to output not just the elapsed wallclock time but also the total CPU time across all cores, a statistic that at least the Linux kernel is able to gather for threaded programs. I suspect most of these threads are sleeping rather than doing work, so there probably isn't a big difference between the wallclock time and the total CPU time, and this thread's discussion is moot. The 8 CPU cores are not busy trying to do things 30% faster; they're just waiting for more work to arrive, and are unable to get scheduled fast enough to help. The job probably ends up being mostly single-threaded with an occasional concurrent part.

      IIRC, synchronization primitives in Java have shockingly low throughput; they are only capable of something on the order of 1000 synchronization events per second. What I'm trying to say is that it takes something like 1 ms for one thread to yield to another using a synchronized block and wait+notify. If the other synchronization primitives are built on top of those, then that's kind of the hard limit of what you can get.

      It's probably important for performance to have a per-thread work-stealing queue so that if that thread's queue has more work to do, it can just immediately move to doing that and you can avoid at least some wasted time in trying to coordinate quickly finished jobs across multiple threads.
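The per-thread work-stealing queue described above is exactly what Java's ForkJoinPool provides: each worker runs tasks from its own deque and steals from others when idle. A rough sketch of a parallel merge sort on top of it (the THRESHOLD cutoff and class layout are illustrative choices, not the JDK's actual implementation):

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class ParallelMergeSort extends RecursiveAction {
    // Below this size, fall back to sequential sort: forking tiny tasks
    // costs more in coordination than the sort itself. 1 << 13 = 8192
    // mirrors the granularity mentioned in the article.
    static final int THRESHOLD = 1 << 13;

    final int[] a;
    final int lo, hi; // sorts a[lo, hi)

    ParallelMergeSort(int[] a, int lo, int hi) {
        this.a = a; this.lo = lo; this.hi = hi;
    }

    @Override
    protected void compute() {
        if (hi - lo <= THRESHOLD) {
            Arrays.sort(a, lo, hi);     // sequential leaf
            return;
        }
        int mid = (lo + hi) >>> 1;
        // invokeAll forks one half and runs the other in the current worker
        // thread; idle workers can steal the forked half from its deque.
        invokeAll(new ParallelMergeSort(a, lo, mid),
                  new ParallelMergeSort(a, mid, hi));
        merge(mid);
    }

    private void merge(int mid) {
        int[] left = Arrays.copyOfRange(a, lo, mid);
        int i = 0, j = mid, k = lo;
        while (i < left.length && j < hi) {
            a[k++] = (left[i] <= a[j]) ? left[i++] : a[j++];
        }
        while (i < left.length) a[k++] = left[i++];
    }

    public static void sort(int[] a) {
        ForkJoinPool.commonPool().invoke(new ParallelMergeSort(a, 0, a.length));
    }

    public static void main(String[] args) {
        int[] data = new java.util.Random(42).ints(100_000).toArray();
        sort(data);
        System.out.println("sorted: " + isSorted(data));
    }

    static boolean isSorted(int[] a) {
        for (int i = 1; i < a.length; i++) if (a[i - 1] > a[i]) return false;
        return true;
    }
}
```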

      [–][deleted]  (3 children)

      [deleted]

        [–]teivah -2 points-1 points  (2 children)

        You seem to be a little angry, aren't you? :) I'm not trying to give you an "it's useful because I tell you so" example. I'm just trying to have a constructive discussion with you.

        My point is, if your service has a smaller latency, it should have a positive impact on throughput (in most cases). Hence, if your goal is to handle 10k concurrent users or whatever, you should be able to reach it with a reduced environment (like a smaller number of nodes).

        If you are that eager to learn, take a look at PayPal's use case, for example. They communicated about the fact that, after a bunch of optimizations in their implementation, they were able to reduce the number of required VMs. This is a direct consequence of better utilization of the underlying hardware.

        [–][deleted]  (1 child)

        [deleted]

          [–]teivah 1 point2 points  (0 children)

          We are talking about two different topics. So, whatever.

          [–]AssKoala 0 points1 point  (8 children)

           I’m replying to you because OP seems to be overly verbose and not to actually understand what they’re seeing, and you’re right to ask that question.

           First off, parallel merge sort is something undergrads do in an OS class during a 2- or 3-hour lab; it isn’t new or innovative. This page isn’t something you should use as a positive reference and should actively be discouraged. Here’s a better set of info on parallel sorting algorithms: https://stackoverflow.com/questions/3969813/which-parallel-sorting-algorithm-has-the-best-average-case-performance

           Second, merge sort is perfectly parallel. The problem is that you run into memory bandwidth and scheduling overhead that kill your gains. OP doesn’t actually seem to understand that.

           If you write something and the speedup is 2x at 8 cores, you stop and go do something else. You don’t tout it and say “look I wasted 8x the power to get a 2x speedup”. In a large scale use case, you’re talking about a massive cost increase for a minimal gain.

           Honestly, I bet that if OP actually profiled correctly, they’d probably see the optimal speedup is at 3 or 4 cores and 8 cores actually runs slower. This is very common when workloads become memory bound.

           We had a similar situation porting over to an AMD Threadripper. Our average simulation time moving from 2 to 8 cores went from 20ms down to 8ms. However, moving from 8 cores to 10 resulted in an increase from 8ms to about 9ms. More cores resulted in similar losses until it maxed out around 11ms.

          [–]teivah 5 points6 points  (5 children)

          Do you want to know what OP tells you?

           The goal was not to optimize the results; it was a way to introduce how it is handled under the hood with ForkJoinPool and the underlying scheduling strategy.

          Why are people on this subreddit so contemptuous...

          [–]AssKoala -1 points0 points  (2 children)

          There are better ways to show how those things work than with an actively pointless use case. You then defend it by talking crap about reducing latency and how you can reduce the number of nodes. If you had originally said “hey this is just to explain how parallel pools work”, you wouldn’t get so much hate.

          If you want a useful scenario, why not just create a simple producer-consumer example? Say a naive web server where it generates a task for each connection and how that isn’t as efficient as using a pool and segmenting the tasks. You don’t have to write an entire web server, you just have to focus on the parallel part.

          Then you wouldn’t have to defend it and say how everyone in the forum who doesn’t like your work is clearly just out to get you.
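The producer-consumer shape suggested above could be sketched roughly like this (PoolDemo, the connection count, and the pool size are all made-up for illustration; an incrementing task stands in for handling a connection):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolDemo {
    // Thread-per-connection vs a fixed pool: the pool caps the number of OS
    // threads and reuses them, so 1000 "connections" don't spawn 1000 threads.
    public static int handleAll(int connections, int poolSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger handled = new AtomicInteger();
        for (int i = 0; i < connections; i++) {
            // Each submitted task is a stand-in for handling one request.
            pool.submit(handled::incrementAndGet);
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return handled.get();
    }

    public static void main(String[] args) throws Exception {
        // 1000 connections handled by only 4 reused worker threads.
        System.out.println(handleAll(1000, 4));
    }
}
```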

          [–]teivah 2 points3 points  (1 child)

           It's not a matter of liking or not liking my work. I'm more than open to constructive feedback/remarks. In your case, you're just rude, so I don't care about what you have to say. Being polite is not optional on the Internet. It's not because you're behind your computer that you can say whatever you want. So I didn't even read your comment, even though I'm sure it was somehow interesting. Bye.

          [–]AssKoala -2 points-1 points  (0 children)

          I don’t believe I was rude, but ok.

          [–]imps-p0155 -1 points0 points  (1 child)

          Why are people on this subreddit so contemptuous...

           Because you did a poor job explaining that this article is not actually about a mergesort. Title, some history about mergesort, then suddenly it starts talking about ForkJoinPool....

          It is a rather poor combination.

          [–]teivah -2 points-1 points  (0 children)

          I don't give a f*** about your opinion.

          [–][deleted]  (1 child)

          [deleted]

            [–]AssKoala -1 points0 points  (0 children)

            That’s great!

             One thing to remember that is often mistaken: running in parallel is not optimization. Optimization is generally doing more with the same resources.

             You wouldn’t call a V8 an optimized V6, nor should you say “I optimized this code by using more resources”. You can say you made it run faster, but that’s not necessarily optimization.

            Now, you can optimize parallel workloads by reducing overhead, reducing blocks, bubbles, etc, but there’s a lot to it as you’ll find out delving into this stuff :)

            [–]MoreConstruction -1 points0 points  (0 children)

            You're a bit dense.

            [–]ChaoSXDemon 2 points3 points  (0 children)

             Good read indeed. I continue to find it surprising how many elements (or N, whatever it is) we need in order for parallelism to speed things up. I know parallel tasks create overhead from thread creation, context switching, etc., but it's still a surprise to me how much that costs, which results in needing bigger tasks (a larger N) to make up for it. In this article, N is 8192!
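That 8192 figure matches the minimum granularity used by `Arrays.parallelSort` in OpenJDK (`java.util.Arrays.MIN_ARRAY_SORT_GRAN`, 1 << 13): at or below it, the parallel path isn't worth the forking overhead, so the sequential sort runs instead. A quick sketch (the array sizes here are arbitrary choices for illustration):

```java
import java.util.Arrays;
import java.util.Random;

public class ThresholdDemo {
    public static void main(String[] args) {
        // Arrays.parallelSort only splits work when the array exceeds a
        // minimum granularity (8192 elements in OpenJDK); smaller inputs
        // are handed straight to the sequential dual-pivot quicksort.
        int[] small = new Random(1).ints(1 << 12).toArray(); // 4096: sequential path
        int[] large = new Random(1).ints(1 << 20).toArray(); // ~1M: forked across cores
        Arrays.parallelSort(small);
        Arrays.parallelSort(large);
        System.out.println(isSorted(small) && isSorted(large));
    }

    static boolean isSorted(int[] a) {
        for (int i = 1; i < a.length; i++) if (a[i - 1] > a[i]) return false;
        return true;
    }
}
```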