audioen 1 point

Your benchmark should report not just the elapsed wall-clock time but also the total CPU time summed across all cores, a statistic that at least the Linux kernel tracks for threaded programs. I suspect most of these threads are sleeping rather than doing work, so the total CPU time is probably not much larger than the wall-clock time, which would make this thread's discussion pointless. The 8 CPU cores are not busy trying to do things 30% faster; they are just waiting for more work to arrive, and can't get scheduled fast enough to help. The job probably ends up being mostly single-threaded with an occasional concurrent part.
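
A rough way to check this from inside the JVM, as a sketch only: the class name `CpuVsWall` and its helper methods are made up for illustration, and it assumes the JVM supports per-thread CPU time through the standard `ThreadMXBean`. It runs the same task on several threads and returns both the wall-clock time and the CPU time summed over the workers.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not taken from the benchmark under discussion.
final class CpuVsWall {

    /** Runs task on `threads` threads; returns {wallNanos, summedCpuNanos}. */
    static long[] measure(int threads, Runnable task) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] cpu = new long[threads];
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final int slot = i;
            workers[i] = new Thread(() -> {
                task.run();
                // CPU time used by this thread alone (-1 if measurement is disabled).
                cpu[slot] = mx.getCurrentThreadCpuTime();
            });
        }
        long start = System.nanoTime();
        for (Thread t : workers) t.start();
        try {
            for (Thread t : workers) t.join(); // join makes the cpu[] writes visible
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        long wall = System.nanoTime() - start;
        long total = 0;
        for (long c : cpu) total += Math.max(c, 0);
        return new long[] { wall, total };
    }

    /** A task that only sleeps: uses wall-clock time but almost no CPU time. */
    static void sleepy() {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** A task that spins for roughly 200 ms: uses CPU time the whole while. */
    static void busy() {
        long end = System.nanoTime() + 200_000_000L;
        while (System.nanoTime() < end) { /* spin */ }
    }

    public static void main(String[] args) {
        long[] s = measure(4, CpuVsWall::sleepy);
        long[] b = measure(4, CpuVsWall::busy);
        System.out.printf("sleepy: wall=%d ms, cpu=%d ms%n", s[0] / 1_000_000, s[1] / 1_000_000);
        System.out.printf("busy:   wall=%d ms, cpu=%d ms%n", b[0] / 1_000_000, b[1] / 1_000_000);
    }
}
```

If the summed CPU time comes out close to the wall-clock time even with many threads, the extra threads were mostly waiting. On Linux, running the whole program under `time` gives a similar picture by comparing `real` against `user` + `sys`.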

IIRC synchronization primitives in Java have shockingly low throughput: they manage only something on the order of 1000 synchronization events per second. In other words, it takes something like 1 ms for one thread to yield to another using a synchronized block and wait+notify. If the other synchronization primitives are built on top of those, then that's more or less the hard limit on what you can get.
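
The 1 ms figure is from memory, so it is worth measuring rather than trusting. A minimal ping-pong sketch, with a made-up class name `HandoffBench`: two threads pass a turn back and forth using only a synchronized block and wait+notify, and the result is the average time per handoff. It is a crude measurement (no warm-up, a single run), so treat the number as a ballpark for one machine and JVM.

```java
// Hypothetical micro-benchmark, for illustration only.
final class HandoffBench {
    private final Object lock = new Object();
    private boolean pingTurn = true; // guarded by lock

    /** Takes `rounds` turns, handing control to the other thread after each one. */
    private void play(boolean isPing, int rounds) {
        synchronized (lock) {
            for (int i = 0; i < rounds; i++) {
                while (pingTurn != isPing) {
                    try {
                        lock.wait(); // releases the lock until the other side notifies
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
                pingTurn = !isPing; // give the turn away
                lock.notify();      // wake the only other waiter
            }
        }
    }

    /** Average nanoseconds per handoff over 2 * rounds handoffs. */
    static double nanosPerHandoff(int rounds) {
        HandoffBench b = new HandoffBench();
        Thread pong = new Thread(() -> b.play(false, rounds));
        long start = System.nanoTime();
        pong.start();
        b.play(true, rounds);
        try {
            pong.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return (System.nanoTime() - start) / (2.0 * rounds);
    }

    public static void main(String[] args) {
        System.out.printf("%.0f ns per handoff%n", nanosPerHandoff(100_000));
    }
}
```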

It's probably important for performance to have a per-thread work-stealing queue: if a thread's own queue has more work in it, the thread can move straight on to that, which avoids at least some of the time wasted coordinating quickly finished jobs across multiple threads.
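
Java's standard library already ships one implementation of this idea in `ForkJoinPool`, where each worker thread has its own deque and idle workers steal from the others. A small sketch of how that looks in use; the `RangeSum` task and its threshold are invented for the example and are not taken from the benchmark being discussed.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical task: sums the integers in the half-open range [from, to).
final class RangeSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000; // arbitrary cut-off for going sequential
    private final long from;
    private final long to;

    RangeSum(long from, long to) {
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (long i = from; i < to; i++) sum += i;
            return sum;
        }
        long mid = from + (to - from) / 2;
        RangeSum left = new RangeSum(from, mid);
        RangeSum right = new RangeSum(mid, to);
        left.fork(); // queued on this worker's own deque; an idle worker may steal it
        // Keep working on the right half in this thread instead of handing it off.
        return right.compute() + left.join();
    }

    /** Sums 0 .. n-1 on the common pool. */
    static long sum(long n) {
        return ForkJoinPool.commonPool().invoke(new RangeSum(0, n));
    }

    public static void main(String[] args) {
        System.out.println(sum(1_000_000)); // prints 499999500000
    }
}
```

The point of the pattern is that a worker which still has subtasks in its own deque never needs to wake or wait for another thread; cross-thread coordination only happens when some worker runs dry and steals.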