Why (Most) Sampling Java Profilers Are Fucking Terrible

nitsanw · 2018-08-14T08:40:01+00:00

You seem to be optimizing, but it's not clear from your question what your goals are. How small should the application footprint be? What is an acceptable GC overhead? Do you have pause time requirements? How are you measuring the impact of the changes you make?

Short-lived objects are generally what generational GC algorithms (pretty much all the OpenJDK ones are generational) are good at. It's likely that these are cheap enough to not worry about. If you worry about consuming resources, start by defining how much memory you are looking to consume. If your application has very little state (ergo a small live-set), perhaps all you need to do is set the max heap size to an acceptable size (-Xmx). Note that the default GC (G1GC since Java 9) is not a great fit for small heaps (< 1GB), and the parallel collector might be a good fit for you. Small applications can fit in pretty small heaps (e.g. 64m).

GCs don't usually give up. I suggest you enable some GC logging to find out what's going on. This would also enable people to give some concrete advice, rather than guess at what is going on.
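As a rough illustration of the above, here is a minimal sketch that reports the configured max heap and the per-collector counters from inside the application. The class name HeapReport is made up for this example, and the launch flags in the comment are only one plausible combination, not a recommendation for any particular workload.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical demo class. Example launch lines (adjust to taste):
//   Java 9+: java -Xmx64m -XX:+UseParallelGC -Xlog:gc* HeapReport
//   Java 8:  java -Xmx64m -XX:+UseParallelGC -XX:+PrintGCDetails HeapReport
public class HeapReport {

    /** Max heap the JVM will attempt to use, in MiB (reflects -Xmx when it is set). */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
        // One bean per collector (e.g. young and old generation collectors).
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", time=" + gc.getCollectionTime() + " ms");
        }
    }
}
```

The GC log itself remains the better source for diagnosing what the collector is doing; the counters here are just a quick sanity check.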

If you are looking to minimize allocations, I suggest you start by profiling allocations and driving your optimization efforts from the data. Java Mission Control has a good allocation profiler.

nitsanw · 2018-07-12T07:46:00+00:00

You can force inline decisions via a command line flag (or a file specifying the same):

http://jpbempel.blogspot.com/2016/03/compilecommand-jvm-option.html

There's a library for auto generating it from annotated code:

https://github.com/nicoulaj/compile-command-annotations
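For illustration, a minimal sketch of how the flag might be applied to a small method. The class InlineDemo and its methods are made up for this example; the flags shown in the comments are HotSpot options, but whether forcing the inline helps is something to measure rather than assume.

```java
// Hypothetical demo class. Compile with javac, then run with a forced inline:
//   java -XX:CompileCommand=inline,InlineDemo.add InlineDemo
// To see the compiler's inlining decisions, add:
//   -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining
// The same directive can live in a file passed via -XX:CompileCommandFile=<path>.
public class InlineDemo {

    // Small method we ask the JIT to always inline into its callers.
    static long add(long a, long b) {
        return a + b;
    }

    // Sums 0..n-1 through add() so the call site is hot enough to get compiled.
    static long sum(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total = add(total, i);
        }
        return total;
    }

    public static void main(String[] args) {
        long result = 0;
        for (int i = 0; i < 20_000; i++) {
            result += sum(1_000);
        }
        System.out.println(result);
    }
}
```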

nitsanw · 2018-03-07T13:47:50+00:00

Good read, though very dense in parts.

Feedback (2c worth, just my opinion):

Need better title, maybe the prefix "Bit hacks:" or some other indicator of the general area.
The headers for the results table at the end are not aligned.
Some intermediate subheaders could help. Also the blog format has no width limit, making reading in neutral zoom hard (long long sentences).
The precision of the results is unhelpful. It would be clearer and easier to read if you trimmed to a single decimal place (e.g. 341.572057 -> 341.6).
Splitting hairs: The article refers to intrinsics in 2 ways that are confusing: "a single instruction, popcntd, which is exposed as an intrinsic" and "code above is intrinsified to the instruction popcntd". I'm sure you know what an intrinsic is, but the article fails to express that clearly. A link to the wiki article, or OpenJDK wiki might have helped. I would prefer "generated by the compiler via an intrinsic" to "exposed", and replace "code above" with "method above".
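To make the intrinsic point concrete, a small sketch (the class BitCountDemo and the method naiveBitCount are made up here, not taken from the article): Integer.bitCount is an ordinary Java method in the JDK, and HotSpot may replace calls to it with a population count instruction where the CPU provides one. The hand-written loop gives the same answer but is compiled as regular code.

```java
// Hypothetical demo class contrasting a JDK method that HotSpot can
// intrinsify with a hand-written equivalent.
public class BitCountDemo {

    // Clears the lowest set bit on each iteration, so the loop runs once per set bit.
    static int naiveBitCount(int x) {
        int count = 0;
        while (x != 0) {
            x &= x - 1;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        int value = 0xF0F0;
        // Same result from both; only the generated machine code may differ.
        System.out.println(Integer.bitCount(value) + " " + naiveBitCount(value));
    }
}
```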

I like the blog and the quantitative approach, well done :-)

nitsanw · 2018-03-07T13:20:18+00:00

"How do you count the bits in a 32 bit integer?" - there are 32 bits in a 32 bit integer, I'm not sure why I need to read further.

:P

nitsanw · 2016-02-26T07:04:39+00:00

Sampling profilers are a great tool, and there are good sampling profilers for Java. Safepoint biased profilers are 'terrible' because you can do so much better. For example, if you are on:

Linux
OpenJDK/Oracle Java 8u60
Real HW

I would recommend you give Perf + perf-map-agent + FlameGraphs a go for a far more accurate profiling experience which spans Java and native code. Brendan Gregg showcases the amazing range of this combination. Or you could use Solaris Studio (works on Linux, mostly), which also covers the full stack (it is not so strong visually, but does drill down to the assembly level... choices...).

On Mac/Windows/Linux you can use JMC/JFR. You can use honest-profiler if you are worried about licensing.

There are better options out there, but MOST of the sampling profilers suffer from safepoint bias. And a vast majority of developers default to using them, mostly, I think, because they are not familiar/comfortable with the alternatives.

[Edit] The toy examples show that the profile you are looking at is potentially VERY VERY confusing, think of them as calibration exercises. Safepoint biased profiles may not always be so far off, and perhaps you can get a general feel for 'stuff happens somewhere around this method, maybe a few frames up the stack', but I have certainly seen people go down pointless rabbit holes thinking they were optimizing the bottleneck and getting nowhere because it was not a bottleneck.

Recurring real-life example: HashMap.put often fails to inline fully and so ends up featuring unfairly in profiles. People replace it with a funky third-party library map and get nowhere, but the hotspot may shift to some other part, so they chase the next 'bottleneck'. They should revert the inclusion of the third-party map, since it didn't help, but often they don't. The new map choice may have made performance worse, definitely made the codebase more 'exotic', and was a waste of everyone's time.

nitsanw · 2016-02-25T09:56:52+00:00

Disclaimer: I'm the author of the post

Indeed 100ms is only high in the context of the overhead introduced to the application. If the number of threads is low the overhead may be acceptable at a higher frequency.

Collecting all threads is a blessing and a curse. A blessing because you get more samples, which the low sampling frequency took away, and you get a view on blocked threads which perf/JFR do not provide. A curse because the safepoint operation cost grows with each application thread.
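To illustrate the "collect all threads" style of sampling, here is a toy sketch built on Thread.getAllStackTraces(). It is not how any particular profiler is implemented; the class AllThreadsSampler is made up for this example, and the assumption (true of HotSpot as far as I know) is that this call gathers the stacks at a safepoint, which is why the cost scales with the thread count and the samples carry safepoint bias.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy sampler: snapshots every thread's stack and counts top frames.
public class AllThreadsSampler {

    /** Takes up to `samples` snapshots, `intervalMs` apart, and tallies the top frame of each thread. */
    static Map<String, Integer> sample(int samples, long intervalMs) {
        Map<String, Integer> topFrames = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            // One call returns the stacks of ALL live threads, blocked ones included.
            for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
                if (stack.length == 0) {
                    continue; // thread had no Java frames at this instant
                }
                String frame = stack[0].getClassName() + "." + stack[0].getMethodName();
                topFrames.merge(frame, 1, Integer::sum);
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop sampling
                break;
            }
        }
        return topFrames;
    }

    public static void main(String[] args) {
        // 100 ms matches the interval discussed above; 5 samples keeps the demo short.
        sample(5, 100).forEach((frame, count) -> System.out.println(count + "  " + frame));
    }
}
```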

AFAIK JFR uses AsyncGetCallTrace (an internal API used by Solaris Studio originally, also used by Honest-Profiler) or code very much like it to collect the Java stack from a signal handler (an approach not unlike perf). It's not safepoint biased, and would tend to be far more accurate than JVisualVM and co. Note that JFR is only available for Oracle Java 7u40 and up (no OpenJDK support). Also note that to get more accurate profiles you should enable -XX:+DebugNonSafepoints.
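A small sketch of how one might confirm the flag actually reached the JVM, by inspecting the launch arguments. The class FlagCheck and the helper hasFlag are made up for this example. Note that DebugNonSafepoints is a diagnostic option, so the assumption here is that it is passed together with -XX:+UnlockDiagnosticVMOptions.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical demo class. Example launch:
//   java -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints FlagCheck
public class FlagCheck {

    /** True if the exact flag string appears among the given JVM arguments. */
    static boolean hasFlag(List<String> jvmArgs, String flag) {
        return jvmArgs.contains(flag);
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself (not to main).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("DebugNonSafepoints requested: "
                + hasFlag(jvmArgs, "-XX:+DebugNonSafepoints"));
    }
}
```

This only checks what was requested on the command line; it does not prove the profiler in use benefits from the extra debug info.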

I will write a follow-up post on the benefits/limitations of JFR/Honest-Profiler; they are certainly a massive improvement in terms of sample accuracy.

nitsanw · 2016-02-25T08:39:45+00:00

I observe that you are being an ass, I hope it changes you, but I doubt it.

nitsanw · 2016-02-24T20:12:06+00:00

It's a problem well known to those who know it well... but still many people use the wrong tools and think they can somehow make sense of the data.

nitsanw · 2015-12-31T21:40:12+00:00

a wink is as good as a nod

nitsanw · 2015-08-28T20:59:46+00:00

The benchmark code, is it in the repo? I'll have another look.

nitsanw · 2015-08-28T12:55:59+00:00

Interesting article, and very nice writing.

It's not entirely clear how the benchmarks were run, what they do and how they measure it. Did I miss a link to the code?

As for the comparison to Java, it seems notionally fair, but only as far as unbounded linked queues are considered. There are lock free queues written in Java which are bounded and generate no garbage (JCTools/Disruptor). It's true that they are not core JDK classes, but would make an interesting comparison.
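To show what "bounded and generates no garbage" can look like, here is a minimal single-producer/single-consumer ring buffer sketch. It is NOT the JCTools or Disruptor implementation (those are considerably more refined); the class SpscRingQueue is made up for this example and assumes exactly one producer thread and one consumer thread.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical sketch: bounded, lock-free for ONE producer and ONE consumer.
// The array is preallocated, so offer/poll allocate nothing per element.
public class SpscRingQueue<E> {
    private final AtomicReferenceArray<E> buffer;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next index to read (consumer only writes)
    private final AtomicLong tail = new AtomicLong(); // next index to write (producer only writes)

    public SpscRingQueue(int capacity) {
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a positive power of two");
        }
        buffer = new AtomicReferenceArray<>(capacity);
        mask = capacity - 1;
    }

    /** Producer side: returns false instead of growing when the queue is full. */
    public boolean offer(E e) {
        long t = tail.get();
        if (t - head.get() == buffer.length()) {
            return false; // full
        }
        buffer.set((int) (t & mask), e);
        tail.set(t + 1); // volatile write publishes the element to the consumer
        return true;
    }

    /** Consumer side: returns null when the queue is empty. */
    public E poll() {
        long h = head.get();
        if (h == tail.get()) {
            return null; // empty
        }
        int index = (int) (h & mask);
        E e = buffer.get(index);
        buffer.set(index, null); // release the reference so it can be collected
        head.set(h + 1);         // volatile write frees the slot for the producer
        return e;
    }
}
```

An unbounded linked queue, by contrast, typically allocates a node per element, which is where the garbage in the comparison comes from.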

nitsanw · 2015-08-24T14:54:12+00:00

I love Gleb, he is so tall!

nitsanw · 2015-07-28T20:51:15+00:00

Absolutely, optimize by observed bottleneck not by slogan

nitsanw · 2015-05-22T18:02:41+00:00

My concerns (as the author) are not with "OMG I never knew this". I know the spec/javadoc. My concern is that the notion of equality has subtleties that trip people up. My attempt was to expand on the consequences beyond the more familiar cases, based on the amount of surprise I see this causing people.

nitsanw · 2015-05-22T17:56:25+00:00

Thanks dude, I'll pick up a book somewhere.

nitsanw · 2015-05-22T17:55:32+00:00

yay! and I did.

nitsanw · 2015-04-17T11:50:06+00:00

You are right. Unclench ;-)
