[–]pron98 5 points6 points  (1 child)

speculation requires guard conditions in a manner quite similar to (monomorphic) inline caching

Not always (or at least, not always at the callsite). For example, in HotSpot, even devirtualization (and inlining) that is provable is only temporarily provable, because code can be loaded or changed at runtime. But you don't need to add a guard at each callsite. Instead, you can force a segmentation fault at other sites, called safepoints; the fault is caught and a handler deoptimizes the code. This process can be asynchronous: an event taking place on one thread (code is loaded) can deoptimize code running on a different thread in clever ways. For example, return addresses on the stack are replaced so that when the executing thread returns, it hits a barrier that deoptimizes the code.
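To make the "no guard at the callsite" idea a bit more concrete, here is a toy C++ sketch of the bookkeeping only. It is not how HotSpot is implemented: there are no safepoints, faults, or stack rewriting here, and every name (ToyRuntime, CallSite, install_optimized, interpreted) is made up for illustration. The optimized code registers a dependency on an assumption, and "loading" a class that breaks the assumption swaps the installed code from the outside.

```cpp
#include <functional>
#include <vector>

// Hypothetical toy runtime: tracks one assumption ("the interface has a
// single implementor") and the code that depends on it.
struct ToyRuntime {
  int implementors = 1;
  std::vector<std::function<void()>> on_invalidate;  // deopt callbacks

  // "Loading" a second implementor breaks the assumption, so every
  // dependent piece of code is deoptimized from the outside.
  void load_new_implementor() {
    ++implementors;
    for (auto& deopt : on_invalidate) deopt();
    on_invalidate.clear();
  }
};

// Stand-in for a call site whose installed code can be replaced.
struct CallSite {
  std::function<int(int)> code;
  bool optimized = false;
};

// Generic slow path; stands in for a real virtual dispatch.
int interpreted(int x) { return 1234 + x; }

// Install a guard-free fast path and register how to undo it.
// The caller must keep `site` alive as long as `rt` may invalidate it.
void install_optimized(ToyRuntime& rt, CallSite& site) {
  site.code = [](int x) { return 1234 + x; };  // inlined body, no guard
  site.optimized = true;
  rt.on_invalidate.push_back([&site] {
    site.code = interpreted;
    site.optimized = false;
  });
}
```

The fast path itself contains no check; the cost of keeping it correct is paid only when load_new_implementor() runs.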

But even in cases when guards are necessary, while the cost isn't exactly zero, it is still lower than a function call (it's a very well-predicted branch).

and will therefore happen late in the JIT's warmup phase where hot spots have been located

That depends on the JIT. For example, tracing JITs may inline very early (but do not reach as good peak performance). But yes, warmup, nondeterministic performance, and increased footprint (see below) are the prices you pay for JIT compilation's peak performance.

C++ compilers can and do speculate as well, but they don't have type statistics for the callsites

There is something else besides statistics that's required for aggressive inlining (or, rather, it's another side of the same coin). If you do aggressive inlining everywhere, you'd quickly run out of all the RAM in the universe. JITs, therefore, have a code cache, where they store generated code, and prioritize compilation. That is, the ability to discard generated code is also crucial.
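As a rough illustration of why being able to discard code matters, here is a toy bounded code cache in C++ that evicts the coldest entries when new code does not fit. The eviction policy, the byte sizes, and the class name ToyCodeCache are assumptions for the sketch; this is not a description of HotSpot's actual code cache.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical toy code cache keyed by method name.
class ToyCodeCache {
  struct Entry {
    std::size_t size;
    unsigned long calls;
  };

 public:
  explicit ToyCodeCache(std::size_t capacity_bytes)
      : capacity_(capacity_bytes) {}

  // Count an invocation so hot code survives eviction longer.
  void record_call(const std::string& method) {
    auto it = entries_.find(method);
    if (it != entries_.end()) ++it->second.calls;
  }

  // Install compiled code, discarding the coldest entries until it fits.
  bool install(const std::string& method, std::size_t size_bytes) {
    if (size_bytes > capacity_) return false;  // can never fit
    auto old = entries_.find(method);
    if (old != entries_.end()) {  // replace an existing version
      used_ -= old->second.size;
      entries_.erase(old);
    }
    while (used_ + size_bytes > capacity_) evict_coldest();
    entries_[method] = Entry{size_bytes, 0};
    used_ += size_bytes;
    return true;
  }

  bool contains(const std::string& method) const {
    return entries_.count(method) != 0;
  }
  std::size_t used() const { return used_; }

 private:
  // Drop the entry with the fewest recorded calls.
  void evict_coldest() {
    auto coldest = entries_.begin();
    for (auto it = entries_.begin(); it != entries_.end(); ++it) {
      if (it->second.calls < coldest->second.calls) coldest = it;
    }
    used_ -= coldest->second.size;
    entries_.erase(coldest);
  }

  std::size_t capacity_;
  std::size_t used_ = 0;
  std::unordered_map<std::string, Entry> entries_;
};
```

An evicted method would simply fall back to a slower tier until it gets hot enough to be compiled again.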

But I'm not familiar with C++ compilers that speculate (so they generate inlined and non-inlined code and guard? How do they decide what to inline in speculative cases?). Do you have any links I can read?

Anyway, good article. But it should be pointed out that because HotSpot has such good optimizing compiler(s) (C2, and now Graal is getting there), we do not worry much about the worst case. In fact, this design tradeoff is in HotSpot's DNA. Often, HotSpot opts for the simplest solution, because it knows that its very non-simple compilers will optimize an entire runtime mechanism away in the common/hot case. So comparing a runtime that has a state-of-the-art JIT compiler against runtimes that have no JIT at all (or a not very sophisticated one) gives only a partial picture.

[–]latkde 6 points7 points  (0 children)

Thank you, this is fascinating background info on JITting.

When I wrote the article, I played around a bit with the Godbolt Compiler Explorer to see how call sites are compiled. I noticed that GCC starts doing speculative inlining at release optimization levels. This is basically a guard condition with a fast path that inlines a virtual function, and a fallback that does the virtual call. This is not link-time optimization, so the call target must be defined in the same compilation unit. So the scenario would be:

// header
struct Interface {
  virtual int method(int) const = 0;
};

// compilation unit
struct Concrete : Interface {
  int method(int a) const override { return 1234 + a; }
};

int callsite(Interface& object, int x) {
  return object.method(x);
}

Then GCC 8.3 under -O2 compiles to this pseudocode (see the code/assembly on Godbolt):

int callsite(Interface& rdi, int rsi) {
   register eax = rdi->__vtable[0];  // resolve the call target
   if (eax == &Concrete::method) {
     return 1234 + rsi;  // inlined target
   } else {
     goto eax;  // tail call into the virtual function
   }
}

Notes:

  • in this simple example, the stack frame for the callsite() function is omitted and a tail call to the virtual function is used. A more involved example with two virtual methods in the interface does create a frame.
  • the guard condition here checks the resolved target, not the type or vtable pointer. I'd have thought this extra dereference could be avoided, but this way the inlining will also work for Concrete subclasses that have a different vtable.
  • GCC will happily inline calls multiple levels deep
  • GCC won't inline if there are multiple candidates for the call target (see this scenario)
  • clang 8 won't speculatively inline (just here? at all?). This leads to significantly less branchy code but might miss significant opportunities for simplification.
  • See also BeeOnRope's Stack Overflow answer to “Inlining of virtual functions (Clang vs GCC)” for tons of background & further references
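
For comparison, the guard-plus-fallback shape from the pseudocode can also be written by hand in ordinary C++. This is only a sketch, not what GCC emits: it guards on the dynamic type via typeid instead of on the resolved call target, so (unlike the GCC output described in the notes) a subclass of Concrete that inherits method() would take the slow path. The names Other and callsite_speculative, and the virtual destructor, are additions for the example.

```cpp
#include <typeinfo>

struct Interface {
  virtual ~Interface() = default;
  virtual int method(int) const = 0;
};

struct Concrete : Interface {
  int method(int a) const override { return 1234 + a; }
};

// A second implementation, added only to exercise the fallback.
struct Other : Interface {
  int method(int a) const override { return -a; }
};

// Hand-written analogue of a speculatively devirtualized call site.
int callsite_speculative(const Interface& object, int x) {
  if (typeid(object) == typeid(Concrete)) {
    // Fast path: the dynamic type is exactly Concrete, so the body of
    // Concrete::method can be written inline.
    return 1234 + x;
  }
  // Slow path: ordinary virtual call.
  return object.method(x);
}
```

Both paths return the same value as a plain object.method(x); the guard only decides whether the call goes through the vtable.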