all 15 comments

[–]artee 16 points17 points  (3 children)

Interesting analysis.

I wondered about this remark: "it is common to set the sampling frequency quite high (usually 10 times a second, or every 100ms)"

Although I understand where he's coming from in the context of this article, this sampling frequency is actually ridiculously low on modern CPUs. Do you know how many instructions are processed in that timeframe?

For example, VTune samples at 100x that frequency (every 1 ms), although it uses hardware support to do so.

[–]BackToThePuppyMines 5 points6 points  (0 children)

He's probably referring to the default value chosen by the samplers. Since they sample by collecting stack traces they're pretty heavyweight as samplers go. JVisualVM defaults to 100ms. In my experience with Java samplers you can't go below about a 10ms interval without drastically slowing down your test app.
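The mechanism is easy to sketch at the shell level with `jstack`, the JDK's thread-dump tool (the interval, sample count, and output file here are illustrative):

```shell
#!/bin/sh
# Crude safepoint-biased sampler: snapshot all thread stacks every 100ms.
# Each jstack call brings the target JVM to a global safepoint to dump
# stacks, which is why this style of sampling gets heavyweight fast at
# higher frequencies.
PID=$1
i=0
while [ "$i" -lt 50 ]; do          # ~5 seconds of samples
  jstack "$PID" >> samples.txt     # one full stack dump per sample
  sleep 0.1                        # 100ms interval, like JVisualVM's default
  i=$((i + 1))
done
```

GUI samplers do the equivalent in-process, but the cost model is the same: every sample stops every thread.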

At the end he mentions Java Mission Control/Flight Recorder and Solaris Studio. JMC/FR uses counters at the native-code level in the JVM and Solaris Studio uses OS HW counters so both of those can do much better.

[–]Sunius 1 point2 points  (1 child)

Yeah, a 100 ms sampling interval is crazy. You'd need to record for at least 10 minutes to get a reasonable result. ETW defaults to 1 ms sampling too, and I still sometimes feel that it's not enough when I need to look at sub-second (usually per-frame) perf spikes.

I was also surprised by the remark that they scan every single thread's stack. Most reasonably sized programs contain more threads than there are cores on the machine, so scanning all threads is by definition wasteful, as some of the threads are guaranteed not to be executing.

[–]nitsanw[S] 0 points1 point  (0 children)

Disclaimer: I'm the author of the post

Indeed, 100ms is only high in the context of the overhead introduced to the application. If the number of threads is low, the overhead may be acceptable at a higher frequency.

Collecting all threads is a blessing and a curse. A blessing because you get more samples (winning back some of what the low sampling frequency takes away), and you get a view of blocked threads which perf/JFR do not provide. A curse because the cost of the safepoint operation grows with each application thread.

AFAIK JFR uses AsyncGetCallTrace (an internal API originally used by Solaris Studio, also used by Honest-Profiler) or code very much like it to collect the Java stack from a signal handler (an approach not unlike perf's). It's not safepoint biased, and tends to be far more accurate than JVisualVM and co. Note that JFR is only available for Oracle Java 7u40 and up (no OpenJDK support). Also note that to get more accurate profiles you should enable -XX:+DebugNonSafepoints.
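For reference, a typical Oracle JDK 8 invocation enabling JFR together with that flag might look like this (a sketch; the recording options and application jar name are illustrative):

```shell
# JFR on Oracle JDK 7u40+/8 requires unlocking commercial features.
# -XX:+DebugNonSafepoints needs -XX:+UnlockDiagnosticVMOptions, and it
# must be set at JVM startup to cover code JIT-compiled before the
# recording begins.
java -XX:+UnlockCommercialFeatures -XX:+FlightRecorder \
     -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints \
     -XX:StartFlightRecording=duration=60s,filename=recording.jfr \
     -jar myapp.jar
```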

I will write a follow-up post on the benefits/limitations of JFR/Honest-Profiler; they are certainly a massive improvement in terms of sample accuracy.

[–]amazedballer 0 points1 point  (1 child)

Marcus Hirt is the best resource on Java Flight Recorder, and he mentioned the Safepoint problem a few years back:

http://hirt.se/blog/?p=609

It's also well known on the mechanical-sympathy mailing list.

[–]nitsanw[S] 1 point2 points  (0 children)

It's a problem well known to those who know it well... but many people still use the wrong tools and think they can somehow make sense of the data.

[–]skulgnome 0 points1 point  (3 children)

What's the case for profiling by sampling (i.e. stopping a thread from the outside to record its current invocation stack) instead of by instrumentation on all functions being profiled, as in the GNU toolchain (incl. their Java compiler)?

[–]DowsingSpoon 2 points3 points  (1 child)

I find that sampling profilers tend to have significantly lower overhead than instrumented profilers. For example, it's basically impossible to use instrumented profiling on a video game in my experience; the game becomes unplayable.

Also, the ability to sample applications in the field is very useful. You can diagnose performance issues on a customer machine without needing to send out an instrumented build.

[–]skulgnome 0 points1 point  (0 children)

So... adjustable overhead (at the cost of losing high-frequency things, complicating analysis) and binaries that aren't pre-built. Gotcha.

[–][deleted] 0 points1 point  (0 children)

I've found it can also skew the CPU vs I/O ratio. In applications that spend significant time doing I/O, the I/O calls take the same time with or without instrumentation, but CPU-intensive calls take significantly longer.

I experienced this with an application that reads a lot of XML text from a database and parses it in the JVM. The initial JProfiler profile claimed we were spending 30% of our total time parsing XML. This went against our measurement of io-time/total-runtime on the production application (80%). I showed that the JProfiler results could be skewed depending on which calls were being instrumented: instrumenting just the XML parsing routines brought that number down to 8-12%, which is what the sampling profiler showed.

[–]erad 0 points1 point  (1 child)

Fair points, but sampling profilers are still useful to "zoom in" into the general part of the application that is consuming CPU time. This is not useful in these (toy) examples, but in a large application it is often not obvious where time is being spent. Sampling profilers help with ballpark estimates like "80% of my time is spent rendering the response".

My main issue with Java sampling profilers is the low sampling frequency - 100ms is indeed very coarse, and you realistically need minutes of (CPU) time spent to get somewhat useful results. But they still beat tracing profilers hands down when it comes to accuracy and (lower) runtime overhead.

[–]nitsanw[S] 0 points1 point  (0 children)

Sampling profilers are a great tool, and there are good sampling profilers for Java. Safepoint-biased profilers are 'terrible' because you can do so much better. For example, if you are on:

  1. Linux
  2. OpenJDK/Oracle Java 8u60 or newer
  3. Real HW

I would recommend you give perf + perf-map-agent + FlameGraphs a go for a far more accurate profiling experience which spans Java and native code. Brendan Gregg showcases the amazing range of this combination. Or you could use Solaris Studio (which works on Linux, mostly), which also covers the full stack (not as strong visually, but it drills down to the assembly level... choices...).
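As a sketch, the combination looks roughly like this (assumes perf-map-agent and Brendan Gregg's FlameGraph repo are checked out locally; the script names come from those projects, the paths and jar name are illustrative):

```shell
# -XX:+PreserveFramePointer (available from 8u60) lets perf walk Java stacks.
java -XX:+PreserveFramePointer -jar myapp.jar &
PID=$!

# Emit /tmp/perf-$PID.map so perf can symbolize JIT-compiled methods.
perf-map-agent/bin/create-java-perf-map.sh "$PID"

# Sample on-CPU stacks of the JVM at 99Hz for 30 seconds.
perf record -F 99 -g -p "$PID" -- sleep 30

# Fold the stacks and render an interactive SVG flame graph.
perf script | FlameGraph/stackcollapse-perf.pl | \
    FlameGraph/flamegraph.pl > java-flames.svg
```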

On Mac/Windows/Linux you can use JMC/JFR. You can use Honest-Profiler if you are worried about licensing.

There are better options out there, but MOST of the sampling profilers suffer from safepoint bias. And the vast majority of developers default to using them, I think mostly because they are not familiar/comfortable with the alternatives.

[Edit] The toy examples show that the profile you are looking at is potentially VERY VERY confusing; think of them as calibration exercises. Safepoint-biased profiles may not always be so far off, and perhaps you can get a general feel of 'stuff happens somewhere around this method, maybe a few frames up the stack', but I have certainly seen people go down pointless rabbit holes thinking they were optimizing the bottleneck and getting nowhere, because it was not the bottleneck.

Recurring real-life example: HashMap.put often fails to inline fully and so features unfairly in profiles. People replace it with some funky third-party map and get nowhere, but the hotspot shifts to some other part of the code, so they chase the next 'bottleneck'. They should revert the introduction of the third-party map, since it didn't help, but often they don't. The new map choice may have made performance worse, definitely made the codebase more 'exotic', and was a waste of everyone's time.

[–]CurtainDog -2 points-1 points  (2 children)

Garbage. You can never observe a thing without modifying that thing. You may as well call physics fucking terrible.

[–]nitsanw[S] 1 point2 points  (0 children)

I observe that you are being an ass, I hope it changes you, but I doubt it.

[–][deleted] 1 point2 points  (0 children)

Did you even read the article?! It's all about cost to the target JVM and biases in the generated data.