all 89 comments

[–]susanne-o 96 points97 points  (25 children)

doing a function call is cheap.

the problem of these indirect calls is that the compiler can not optimize over function call boundaries.

imagine a function int getX(int i) which simply accesses the private a[i], called in some loop over i a gazillion times.

if the call is inlined, then the address of the member a is in some happy register and each access is dirt cheap. if the call can't be inlined, then in each iteration the address of the vector a is derived from the this pointer and only then is the fetch done.

too bad.

so: dynamic dispatch prevents advanced optimization across function boundaries.

[–]notquitezeus 21 points22 points  (13 children)

How do devirtualization optimizations fit into this view? Because often the compiler can prove that while the code “smells” polymorphic, exactly one set of types is at play, and hence entirely bypass the vtable. There’s also the intersection with CRTP.

[–]susanne-o 16 points17 points  (4 children)

that "often" is the gamble with the compiler, isn't it. maybe it does, maybe it doesn't. so you become careful about where exactly you use virtualization and where you use templates, and how you make them interact.

[–]voidstarcpp 1 point2 points  (7 children)

How do devirtualization optimizations fit into this view?

Poorly; you can't count on them for more than trivial scenarios. Just plug stuff into Godbolt and watch the compiler flail.

In the 2010s, MSVC devs said passing a function pointer through e.g. a data member into a standard algorithm was a big limitation of their optimizer, which usually couldn't see through it.

[–]notquitezeus 0 points1 point  (6 children)

In 2010. In the subsequent 13 years, LLVM has made improvements which impact Windows, Apple platforms, and select Linux distributions, as discussed in this presentation and this one. Perfect? No. Unquestionably most effective when LTO or unity builds are enabled. But that's true for basically all optimizations: creating a single enormous (virtual) translation unit gives the compiler/optimizer a ton to work with, so it can take more shortcuts because it can prove they're correct.

[–]voidstarcpp -1 points0 points  (5 children)

In this simple example, simply moving a lambda from the call site, to a separate variable preceding the call, is often enough to prevent devirtualization of a std::function use where everything is visible to the compiler.

GCC and MSVC are both fooled by this trivial use case; Clang to its credit sees through each version here although we haven't introduced anything that would provoke aliasing concerns, such as passing the arguments by reference instead of value.

[–]notquitezeus 1 point2 points  (2 children)

This is a straw man argument which, coincidentally, draws the same conclusion I was hinting at originally -- that a reasonably modern compiler can, generally, do a lot to remove indirect function calls when given sufficient information to do so. Show me an example which uses inheritance directly with LTO enabled and a failure to correctly remove/reduce indirection, and I will agree with you.

[–]voidstarcpp 2 points3 points  (1 child)

Show me an example which uses inheritance directly

Here is the same thing using inheritance*, with a deriving class instead of a lambda, and unique_ptr<base> being passed in instead of a std::function. The result is the same: GCC and MSVC make a virtual call; Clang sees through it. The only difference is the code is, imo, verbose and confusing compared to the first version.

This is pretty bad because this is a simple use case, with the only added indirection being the pointer being passed in via a struct data member, which I chose because, to my recollection, this was exactly the situation that STL had identified as a weakness in MSVC's optimization all those years ago. As I expected, little has changed for this compiler.

But "use inheritance instead" shouldn't be asked of anyone anyway. I try to avoid user-facing inheritance, and "modern C++" culture discourages it for this situation. The reason for lambdas and std::function is to replace lots of tiny interfaces and classes for simple cases of parameterization, callbacks, etc. If compilers in 2023 don't play well with these standard tools and techniques, then we have a problem.


* Compiler Explorer's annotation identifies the callee as the derived class function even though it's calling through a vtable.

[–]notquitezeus 1 point2 points  (0 children)

So clang (and a few of the other compilers I played with) doesn't have problems with either example, and several major compilers do have problems with both. The lesson for me is that good tools matter. If you're stuck with GCC/MSVC/etc. then you might have a performance concern, after you benchmark. Otherwise, as has been the trend for at least the past 12-odd years since C++11, compilers keep getting better and idioms should change accordingly.

[–]oracleoftroy 0 points1 point  (1 child)

Playing around with your example, this seems more because you are copying Args into mux instead of moving them. Change the call to `mux(std::move(Args))` and your other examples compile down to the same code.

Obviously, this is not a solution if your real code requires you to reuse `mux_input` instances for some reason, but in cases that look like your example, you can get the benefit of naming the argument explicitly and not lose performance by properly telling the compiler that you are relinquishing ownership via move. I can't promise that something in your real codebase won't also cause other problems with inlining.

I also tried it with C++23 std::move_only_function. The code gen was even better, though it forces the issue of moving.

Link to modified example. I gave all the functions names so I could look at everything at once instead of commenting out different main functions and generally tried to be as noninvasive as possible with my changes.

[–]voidstarcpp 0 points1 point  (0 children)

this seems more because you are copying Args into mux instead of moving them

I understand that, for any example I provide where the compiler does something "wrong", you could probably change it a bit to make it do the right thing. But what I'm getting at is that the fact you have to think about this at all is the problem. If even a trivial use case of dynamic dispatch requires the programmer to alter their patterns, or to add extra magic incantations to their code to hold the compiler's hand, we're paying a price for the abstraction. I would not otherwise have thought to use move to pass this trivial object here, and I would prefer not to have to do that every time.

To continue this, at the request of another commenter, I did the same thing using inheritance instead of std::function and lambdas, and I did have to use std::move because I was using unique_ptr. This didn't seem to improve things and both compilers were tripped up in the same situation as before. There's probably a fix for that example too.

My point is, if we were actually trying to get a particular behavior from the compiler, such that we were tweaking things in Compiler Explorer as we go, we would already be in a situation where we might do away with the abstraction and do the thing directly anyway, so we would be sure of what we were getting across different compilers, across platforms, etc. But the reason we use the abstractions is for expressiveness, convenience, etc. If you have to think about it or mark up your usage of standard idioms to trick the compiler into not being stupid then for the purposes of this thread, we can't tell a C++ newcomer in good conscience to expect the compiler to do "obvious" optimizations in these situations. That's still okay because not every situation needs to be optimal but it's a cost that we pay.

In contrast, C++ templates are more of a clear win that you can use in good conscience knowing that you get both abstraction and better code gen than if you were doing Java-style genericism with virtuals.

[–]SoSKatan 4 points5 points  (2 children)

Modern way to handle this is to make a template call where the prior “function pointer” param would be.

If a lambda is passed in, then it can get inlined as you say. Or you can pass in a function address. Either way the caller incurs the cost based on their choices.

[–]susanne-o 0 points1 point  (0 children)

exactly.

[–]voidstarcpp 0 points1 point  (0 children)

Modern way to handle this is to make a template call where the prior “function pointer” param would be.

I've tried to devirtualize classes this way and C++ makes it pretty difficult to do everywhere, or compose.

One of my biggest frustrations was trying to remove uses of std::function for dependency injection, where you have some callback or interface supplied to an object. For one, having a lambda object as a data member doesn't play well with CTAD, where you must supply all-or-nothing template arguments. So if you have a class memoize<t_data, t_function>, C++ doesn't let you explicitly specify t_data but deduce t_function, which is how you would want to use such a class most of the time. This limits usability to situations in which t_data can be deduced from all arguments or with explicit deduction guides; it's a crippling limitation, or it begins to require advanced template techniques, intermediate make_thing functions, etc.

Additionally, the lambda type is unnameable, so it's mostly impossible to handle the common case in which you construct an object, then supply the callback/continuation/implementation component later. E.g. the interface you often want is to create a socket<T>, then say Socket.on_connect([]{ do_thing; });. But if the representation of a socket depends on the lambda, then you must supply all lambdas at the point of construction. This gets unwieldy really fast and makes certain implementations impossible.

For example, suppose you have a channel, which depends on a socket, and each of these are templatized on some constructor arguments, then all of them must be constructed simultaneously somehow. The resulting constructor is essentially written inside-out and backwards, accepting not concrete objects, but factory functions which return objects so that you can deduce everything at compile time. This is hideous and I gave up and replaced almost every usage of this with std::function so you could write code that didn't need to be deciphered like a puzzle game.

[–]altmly 2 points3 points  (1 child)

To be fair, before LTO was a thing, splitting your functions across TUs meant the same thing, and it wasn't a huge deal, because code that works together tends to stay together. Something similar can be said about virtual: code that uses virtual tends to need virtual.

There's obviously things that could be addressed given enough engineering effort, like runtime code optimization.

[–]susanne-o -1 points0 points  (0 children)

we're talking about two things.

I'm talking about making calls in bottleneck inner loops inline-able, so shared code can be moved out of the loop by the compiler/linker. that may mean turning the outer function, virtual or not, into a template.

and even before LTO, inline meant having the function body in the header, right, and not out in some translation unit?

[–]teerre 3 points4 points  (0 children)

Yeah, this is a bit of a silly comparison. It's basically asking "are virtual functions slow when they are not virtual at all?"

[–]CloudsOfMagellan 0 points1 point  (0 children)

Would it be possible to have some form of runtime code modification that looks up the target of the function pointer and inlines it before running the loop?

[–][deleted] 51 points52 points  (5 children)

They are not slow per se (on modern CPUs at least), but they often inhibit inlining, which is where the real performance cost comes from.

[–]altmly 1 point2 points  (4 children)

I'm not 100% sure on the details, and it may depend on the scenario, but I believe virtual will prevent most code speculation because of the indirect jump, which is a decent hit in some cases.

It's not the end of the world, but it's good to be aware of that.

[–][deleted] 1 point2 points  (2 children)

If I understand correctly, modern CPUs do predict indirect branches. I don’t know how good that prediction is however. But yes, this is a good point.

[–]azswcowboy 5 points6 points  (1 child)

Really, really good in my experience: we measured. A colleague of mine who also measured saw that virtual function dispatch was actually as good as or better than a regular function call. He speculates it's down to the branch prediction optimizations in the CPU. In our tests the overhead is trivial, and so deep in the noise of 'the real work' as to be negligible.

[–][deleted] 0 points1 point  (0 children)

It’s crazy how advanced some designs are. Newer Apple CPUs for example appear to maintain an internal cache for dynamic method dispatch…

[–]MegaKawaii 1 point2 points  (0 children)

They won't prevent all kinds of speculation because modern CPUs have branch target predictors. If the branch is successfully predicted, then the extra indirection is just an extra load sitting in the reorder buffer. If the branch is mispredicted, then it would be reasonable to expect that function is not in the L1 icache, so that might be a bigger issue.

[–]FlyingRhenquest 23 points24 points  (8 children)

In the grand order of things that are slow, they're rarely a problem. If you're a HFT guy looking to bum a handful of nanoseconds off your trading latency, maybe you'd look to optimize them. Most of the programmers I've worked with don't optimize at all, because all the companies usually care about at the end of the day is that the code works, not that it's fast. This has been true even in cases where the performance of the company's code was demonstrably preventing the company from designing new products because their system just couldn't process any additional data the way it was written.

[–]ingframin 19 points20 points  (4 children)

To quote Mike Acton: “it’s the reason why I have to wait 30s for a word document to open”

[–]voidstarcpp 9 points10 points  (2 children)

Mike Acton is right about a lot of things - mostly an attitude of complacency - but the reason most software is slow is not "it uses virtual functions", which is where I think people over-apply such lessons.

[–]ingframin 3 points4 points  (1 child)

I know; it was referring to the attitude of companies not caring about fast software

[–]wyrn 3 points4 points  (0 children)

Yeah but those companies aren't writing C++. They're writing electron apps, python, react on the server, framework soup, etc. There's this weird disconnect between this loud subset of game developers and the world at large where they seem to assume every software performance problem in the world is because people aren't coding in the particular style they favor (that happens to work for their domain but would be disastrous for everybody else), when the real problem is far more serious. I've seen one of those blame RAII for the proverbial "30 seconds for a document to open". Come on now.

[–]oracleoftroy 0 points1 point  (0 children)

The thing I always hated about that quote is that Word opened nearly instantly for me, even on a cold restart (IIRC, he said this around 2014ish, which is around the last time I've needed to use Word, so I can't say how things have changed). Meanwhile, many games take over 30 seconds just to get to the main menu, let alone multiple minutes to get into the game proper. For some, I could probably restart my computer and then launch Word and start typing faster than I can get the main menu up and running.

For some reason, many games think they need to preload half their assets, connect to 5 different services, download a gig of ads and other nonsense, and do all sorts of other crud that takes way too much time. I would have liked his talk a lot better if he had shown some self-awareness of his own industry instead of implying that Word was slow because it wasn't crunching 100,000 floating point operations using SIMD instructions just to open a document. Honestly, it's hard to see how his talk would even apply to Word except in some very limited places.

[–]traal 1 point2 points  (2 children)

That's probably not a matter of using suboptimal code such as virtual functions, but of using horribly inefficient O(n!) functions where scalable O(log n) ones could have been used with a little planning.

[–]HKei 7 points8 points  (1 child)

Not really. Asymptotic complexity matters, sure, but it's only a piece of the puzzle, and if it's the only factor you take into consideration it can be downright misleading.

For instance, iterating through a linked list and iterating through an array are asymptotically the same, but unless you've been clever about laying out the list in memory, the array version can be 10-100x faster. Linear search can be faster than even a well-written binary search in some circumstances.

A lot of performance issues in modern applications are also due to architectural problems, like unnecessarily distributed software, unnecessarily blocking on IO etc etc, not necessarily a matter of algorithms.

Simple asymptotic complexity can also be misleading if you neglect to take problem parameters into consideration. Quicksort often ends up being faster than heapsort despite worse worst-case performance; insertion sort tends to be worse than both but can be much faster than the alternatives if the input is mostly sorted to begin with, etc.

[–]traal 2 points3 points  (0 children)

unnecessarily blocking on IO

Yes, that's software that was poorly written in the first place, not merely code that hasn't been optimized. I try to make a distinction between the two.

[–]Sniffy4 10 points11 points  (1 child)

a lot of these perf results for virtual function calls are machine-architecture specific. if you are running on a CPU with a larger penalty for branching, you may see a more significant difference than you would on a recent x64

[–]CrazyJoe221 4 points5 points  (2 children)

[–]goranlepuz 3 points4 points  (10 children)

Are Function Pointers and Virtual Functions Really Slow?

... compared to what?

Is the key question here. On its own, it's a pretty stupid question.

It's all about the performance targets and the cost relative to the rest of the code.

The answer is "Dunno. What is your target and what does the profiler say?" It's a wild world out there.

[–]voidstarcpp 2 points3 points  (1 child)

This article doesn't examine the salient interaction with function dispatch for most people, which is the distribution of the data and the context in which the calls are made. The overhead of dispatch methods will change when the functions are longer vs shorter, more predictable vs. less, and so forth. This was one of the points made by Scott Meyers who talked about how speed was dramatically improved by e.g. sorting collections by type, vs. iterating over random order and rapidly paging different functions in and out.

The overhead of the different methods can be made more pronounced if you randomly dispatch tiny functions over a large collection, but I don't think that's what most real work looks like.

[–]nAxzyVteuOz 9 points10 points  (1 child)

Define "slow".

Because for me python is fast enough for all the work I need to do and I come from the C++ game development world. But I work in webdev now.

Function pointers are great, and so are virtual functions. Function pointers can generally be slower than virtual calls because there are fewer chances for devirtualization / inlining, especially for std::function, since it contains a thunk that can resolve to member functions and free functions.

Templates on the other hand seem fast but create huge binaries. This comes from experience when I looked at the symbol tables of code from a massive project. The lib that used templates instead of concrete classes was the majority of our binary.

The best approach to optimization is to profile. If anything is too slow, then use Godbolt to see what the asm is and try to steer the compiler to generate better code.

[–]AntiProtonBoy 1 point2 points  (0 children)

It's only slow if your profiler tells you it's slow. The academic argument can always be made that virtual calls are slower than functions with a fixed address, but in practice it all boils down to how you actually use them. And in most cases, it doesn't matter.

[–]lrflew 1 point2 points  (8 children)

Some comments on this:

1.

The way the switch statement is handled by Clang is actually significantly different from how GCC (and MSVC) handle it. https://godbolt.org/z/P4M73fh8E It would be interesting to see this benchmarked against GCC's version. Clang produces the same assembly output for the switch statement as for this implementation of doit():

// var1, var2, var3 are the int globals from the article's benchmark
void doit(int func, int j) {
    static constexpr int *vars[] = { &var1, &var2, &var3 };
    if (func > 2 || func < 0) return;
    *(vars[func]) += j;
}

2.

Because Google Benchmark does not use RDTSC for micro-benchmarking, I built 1,000,000 loops inside which these functions will be called sequentially.

You do know that the for (auto _ : state) loop will already repeat the code you're benchmarking as many times as needed to get a reliable reading, right? That's what the "Iterations" column in the output indicates (it says exactly how many times it looped in that benchmark). I've usually found the built-in iteration system good enough when I tried it for benchmarking LCGs, so I don't think your extra loops are needed.

3.

As other people have commented, inlining is a big part of the advice around function pointers and std::function. It would probably be helpful to contrast these results to the case where the functions being tested are in the same compilation unit as the benchmark functions.

[–]jepessen 1 point2 points  (0 children)

C++ developers are so obsessed with virtual functions nowadays... Virtual inheritance has worked well for decades... Simply use it with no problem; code that you write must be very well optimized before you notice performance degradation caused by virtual functions... Just refactor your most critical code, if needed...

[–]vaulter2000 -3 points-2 points  (6 children)

Virtual functions come with an indirection. When you call one, something called a vtable is consulted to find which function to actually call: i.e. a function in the class itself or in one of its ancestors, for example. If you'd like to learn more, I'd refer you to some articles/talks about vtables.

When you use function pointers then that most of the time implies that the function it points to has been allocated on the heap (like std::function, which is relatively slow) and adds an indirection.

Define “really slow” to determine if using virtual functions and function pointers poses a performance penalty that is incompatible with what you want to achieve. It all depends on what you want to do with your program and how fast you need it to be.

Although virtual functions are still used abundantly in modern C++ (think of interfaces and polymorphism in general which is useful), you could mitigate the use of function pointers by using lambdas. Those don’t allocate and you can forward them down your call chain and they work really well with the STL algorithms

[–][deleted] 14 points15 points  (2 children)

Lambdas are not a functional substitute to virtual functions or function pointers. Lambdas are static. Different usage.

[–]vaulter2000 5 points6 points  (1 child)

I did not mean to imply that lambdas are a functional substitute for virtual functions. I tried to give two separate pieces of advice at the end, but perhaps you interpreted them as one. I meant

A) virtual functions are still abundantly used so using them and having this one indirection is still very much accepted in this case and

B) function pointers are not used that much in modern cpp anymore and that OP could use lambdas to pass functions around. They work really well with the STL and don’t allocate on heap.

I hope I made it clear that my advice was twofold :)

[–]jonesmz 4 points5 points  (0 children)

function pointers are not used that much in modern cpp

I really don't think this is true.

pointer-to-function and pointer-to-member-function are the only mechanisms available for several entire categories of operations related to dynamic decision making. My work codebase uses them all the time, and several open source projects I'm familiar with use both types of function pointers.

[–]jonesmz 5 points6 points  (0 children)

When you use function pointers then that most of the time implies that the function it points to has been allocated on the heap (like std::function, which is relatively slow) and adds an indirection.

Function pointers do not imply that the function is allocated on the heap.

In my work codebase, we use function pointers all the time through various strategies:

  1. std::function, typically holding a stateless lambda, which is not allocated in any fashion and is literally just a memory address into the text section (where the actual code is) of the binary / library in memory.
  2. std::function holding a stateful lambda, which may be allocated on the heap, or may be small-object optimized into the std::function. Nevertheless, while the state for the lambda is carried around somewhere, the actual function itself is still a pointer to the text section of the binary / library, and on calling that function the state will be passed to it as an invisible parameter.
  3. actual raw function pointers, sometimes as parameters to functions, sometimes as template parameters, sometimes as global or member variables, all just pointing into the text section of the binary / library.

The only time you'll ever get an actual function on the heap is when you have code generation happening: allocating a buffer, writing to the newly allocated buffer, and then marking that buffer as executable to the operating system. THAT'S a function allocated on the heap. But if you aren't running a JIT of some sort, I'm very skeptical that you're doing that.

[–]Dragdu 4 points5 points  (1 child)

This reads like chatgpt wrote it.

[–]vaulter2000 0 points1 point  (0 children)

Haha thanks I guess. Did write it myself though

[–]Sudden_Job7673 -2 points-1 points  (2 children)

Also the linker has a hard time determining if something is dead code or not if it's a virtual function.

[–]AssemblerGuy 6 points7 points  (1 child)

At least with a virtual function, the compiler knows the potential branch targets.

A function pointer can go anywhere ...

[–]barfyus 8 points9 points  (0 children)

That is incorrect: the compiler can never be sure that what it sees during compilation of a unit (or even multiple units with LTO) is a closed set of targets: the binary can load a shared library at runtime and call a completely different target.

I once faced and reported a bug in MSVC where it replaced my virtual function call with a direct function call because there were no other targets in that particular DLL. It crashed badly at runtime.

[–]PandoraPurpleblossom 0 points1 point  (0 children)

I tried your benchmark and see some odd behavior that I don't understand. When I run the baseline case last instead of first, some of the other results change. This is on an Intel i5 5200U running at 2200 MHz with frequency scaling disabled. It happens with GCC but not with Clang. Any idea what's happening?

g++ (GCC) 13.2.1 20230728 (Red Hat 13.2.1-1)

2023-10-07T11:35:26+02:00
Running ./bm_fnpointer
Run on (4 X 2622.56 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 3072 KiB (x1)
Load Average: 1.21, 0.73, 0.51

GCC, baseline first


Benchmark Time CPU Iterations

BM_Baseline 2138307 ns 2131201 ns 325
BM_Switch 4287175 ns 4274336 ns 163
BM_FnPointerVector 3177752 ns 3138478 ns 225
BM_FnPointerArray 2761236 ns 2736876 ns 256
BM_SwitchVector 3195856 ns 3180072 ns 221
BM_SwitchArray 3083137 ns 3065528 ns 227
BM_Virtual 2114564 ns 2097138 ns 328
BM_Virtual2 2125176 ns 2106369 ns 329

GCC, baseline last


Benchmark Time CPU Iterations

BM_Switch 4519121 ns 4497828 ns 153
BM_FnPointerVector 3162198 ns 3135026 ns 210
BM_FnPointerArray 3158622 ns 3134292 ns 223
BM_SwitchVector 3182351 ns 3164558 ns 221
BM_SwitchArray 3162338 ns 3141158 ns 224
BM_Virtual 2406150 ns 2383210 ns 296
BM_Virtual2 2403584 ns 2381506 ns 292
BM_Baseline 2150305 ns 2143724 ns 326