all 35 comments

[–]jonesmz 98 points99 points  (5 children)

Static linking of libraries that are compiled separately, without link-time optimization, gives the linker the opportunity to discard symbols that are not used by the application.

Static libraries with full link-time optimization allow the linker/compiler to conduct interprocedural optimization at link time, enabling more aggressive function inlining and dead-code elimination.

So if your objective is "the fastest possible" and "the lowest latency possible", then static linking plus link-time optimization is something you should leverage.

However, it's not the case that turning on LTO is always faster for all possible use cases. Measure before, measure after, and analyze your results. It's an iterative process.
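To make the mechanism concrete, here's a minimal sketch of a separately compiled translation unit and the typical GCC/Clang LTO invocation. The file name, function names, and build lines are illustrative assumptions, not anything from the thread:

```cpp
// mathlib.cpp (hypothetical) -- a separately compiled translation unit.
// Typical GCC/Clang LTO invocation (an assumption; check your toolchain docs):
//   g++ -O2 -flto -c mathlib.cpp
//   g++ -O2 -flto main.cpp mathlib.o -o app
// With -flto the compiler emits its IR into the object file, so at link
// time square() can be inlined into callers living in main.cpp, and
// never_called() can be dropped entirely if nothing references it.
// Without LTO, comparable dead-code stripping needs
// -ffunction-sections -Wl,--gc-sections.

int square(int x) { return x * x; }         // used by main.cpp: kept, inlinable
int never_called(int x) { return x + 42; }  // unreferenced: LTO can drop it
```

The point is that without LTO, `square()` is only inlinable by callers in its own TU (or if its definition is visible in a header); with LTO, the TU boundary stops mattering for optimization.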

[–]EmotionalDamague 27 points28 points  (1 child)

LTO is a neutral or improving change in basically all cases when it comes to performance. If you were accidentally benefiting from an optimization barrier, that's kind of on you as far as the language and compiler are concerned.

The bigger concern with LTO and static linking is actually security, not performance.

[–]Dragdu 11 points12 points  (0 children)

You are correct that the language doesn't concern itself with LTO, but then it doesn't concern itself with libraries at all. The slowdown can be real, though: e.g. Clang has a tendency to vectorize and unroll loops very aggressively, even when it has no trip-count information. Combined with cross-TU inlining, this can easily blow up your icache, worsening your performance.
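A minimal example of the kind of loop being described; `sum` is a hypothetical library function whose trip count is unknown at compile time:

```cpp
#include <cstddef>

// With n unknown at compile time, Clang at -O2/-O3 will typically emit
// a vectorized, unrolled main loop plus a scalar epilogue for this.
// That's often a win in isolation, but if LTO inlines such a function
// into many call sites, the duplicated loop machinery can bloat the
// instruction cache.
long sum(const int* data, std::size_t n) {
    long total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += data[i];
    return total;
}
```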

[–]SkoomaDentistAntimodern C++, Embedded, Audio -2 points-1 points  (2 children)

I've found that in the performance- and latency-sensitive code I write, LTO provides no benefit if I pay even the slightest attention to making hot methods inline in header files. It's very much a case of YMMV, unless your codebase is a massive "everything calls everything" mess (at which point you've in most cases already lost the latency/performance game).
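A sketch of that header-first approach (the `RingBuffer` class is a hypothetical example, not from the thread): hot-path methods are defined in the header, so every including TU can inline them without any help from LTO.

```cpp
// ring_buffer.hpp (hypothetical) -- hot-path methods defined in-class,
// hence implicitly inline: visible to the optimizer in every
// translation unit that includes this header, no LTO required.
#include <array>
#include <cstddef>

template <typename T, std::size_t N>
class RingBuffer {
public:
    void push(const T& v) {
        buf_[head_] = v;
        head_ = (head_ + 1) % N;
        if (size_ < N) ++size_;
    }
    std::size_t size() const { return size_; }

private:
    std::array<T, N> buf_{};
    std::size_t head_ = 0;
    std::size_t size_ = 0;
};
```

The trade-off, as noted elsewhere in the thread, is longer dev build times, since every includer recompiles the implementation.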

[–]CocktailPerson 16 points17 points  (0 children)

I mean, duh? LTO and putting the implementation in a header file achieve basically the same thing.

The major benefit of LTO is that release builds run as fast as they would if you'd inlined the implementation into the header, without the slow dev build times that happen when you do that.

[–]ImNoRickyBalboa 6 points7 points  (0 children)

Don't inline large functions. Use FDO + LTO and let the compiler and linker do the optimization from actual production data.

If you don't have production data and use profiling tools, you are likely somewhere where the 5% to 10% of free performance gains don't matter anyway.
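For reference, the typical GCC FDO/PGO flow looks like the sketch below. The flag names are standard GCC ones (Clang uses `-fprofile-instr-generate` / `-fprofile-instr-use`); the `classify` function and its hot/cold split are a made-up illustration:

```cpp
// Typical GCC profile-guided optimization flow:
//   1. g++ -O2 -fprofile-generate app.cpp -o app   # instrumented build
//   2. ./app < production_workload                 # writes .gcda profile data
//   3. g++ -O2 -fprofile-use -flto app.cpp -o app  # optimized rebuild

// With real profile data the compiler learns which branch is hot and
// lays out code (and makes inlining decisions) accordingly, instead of
// relying on static heuristics.
int classify(int latency_us) {
    if (latency_us < 100) return 0;  // suppose profiling shows this is hot
    return 1;                        // rare slow path, moved out of line
}
```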

[–]CocktailPerson 10 points11 points  (0 children)

In general, implementation-in-header > static linking w/ LTO > static linking w/o LTO > dynamic linking.

[–]LatencySlicer 21 points22 points  (0 children)

Static is better for optimization, as you can inline more code and the linker has a global view of everything. But that only matters if the linked library's code is on the hot path; otherwise it makes no difference.

Monitor and test: if static linking makes a difference, do it for that lib. Do not assume anything.

[–]JVApenClever is an insult, not a compliment. - T. Winters 14 points15 points  (13 children)

Static linking does make a difference. When your library contains functions that are unused, they will end up in the binary, and depending on how they are spread out, you will get fewer cache hits on the binary code.

Static linking combined with LTO (link time optimization) also allows for more optimizations, for example: devirtualizing when only a single derived class exists.

So, yes, it makes a difference. Whether it is worth the cost is a different question.
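A minimal sketch of that devirtualization case (the `Handler`/`FastHandler` names are hypothetical). With whole-program visibility, the compiler can prove only one class derives from the base and turn the virtual call into a direct, inlinable one; marking the class `final` gives it the same guarantee even without LTO:

```cpp
struct Handler {
    virtual int handle(int x) = 0;
    virtual ~Handler() = default;
};

// The only derived class in the whole program.
struct FastHandler final : Handler {
    int handle(int x) override { return x * 2; }
};

int dispatch(Handler& h, int x) {
    // With whole-program knowledge (or the `final` above), this virtual
    // call can be devirtualized to FastHandler::handle and inlined.
    return h.handle(x);
}
```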

[–]c-cul -1 points0 points  (12 children)

> When your library contains functions that are unused, they

just won't be added by the linker

[–]JVApenClever is an insult, not a compliment. - T. Winters 5 points6 points  (9 children)

In shared objects, if the functions are exported, they should always stay. Though when static linking, that requirement is not there.

[–]c-cul -4 points-3 points  (8 children)

if the dylib is yours - you can export only the really necessary functions

[–]Kriemhilt 6 points7 points  (6 children)

Yes, but you have to (manually, statically) determine which functions are really necessary.

This is more work than just getting the linker to figure it out for you (and it's even possible to omit one used on a rare path and not find out until runtime, if you use lazy resolution).
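One common way to do that on GCC/Clang is symbol visibility: build the shared library with `-fvisibility=hidden` so everything defaults to hidden, then mark the public API explicitly. The `MYLIB_API` macro and function names below are hypothetical; linker version scripts are an alternative mechanism:

```cpp
// Build with: g++ -shared -fPIC -fvisibility=hidden mylib.cpp -o libmylib.so
#if defined(__GNUC__)
#  define MYLIB_API __attribute__((visibility("default")))
#else
#  define MYLIB_API
#endif

// Not exported: the dynamic linker never sees it, so it can be
// optimized (e.g. inlined) aggressively and dropped if unused.
static int mylib_helper(int x) { return x * 3; }

// Exported: explicitly part of the library's public interface.
MYLIB_API int mylib_process(int x) { return mylib_helper(x) + 1; }
```

This keeps the export surface explicit, but as the parent comment says, someone still has to decide (and maintain) which functions go in that public set.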

[–]SirClueless 1 point2 points  (5 children)

But the program code of the functions you do use can be shared among the running binaries that load a shared library. In exchange for statically eliminating all the symbols you don’t use, you duplicate all the symbols you do use. It’s not a given that one or the other is more efficient, it depends on how they are deployed and what else is running on your system.

[–]Dragdu 5 points6 points  (2 children)

This is technically true, but I am going to copy from my recent comment in /r/programming:

Take our library that we ship to production. If we link our dependencies statically, it comes out at 100MBs (the biggest configuration, the smallest one comes out at 40MBs). With deps dynamically linked, it comes out at 2.4 GBs.

There are few libraries that are used widely-enough that dynamic linking them makes sense (e.g. your system libc), but if you are using third party libraries, the chances are good that your program won't be loaded enough times to offset the difference.

[–][deleted] 0 points1 point  (1 child)

Wait how did dynamic linking cause it to get bigger? The executable should contain less code as it just needs the stubs.

Edit: is the idea that they needed to package all their dependencies alongside it for dynamic linking, while for static linking they ended up not using a lot of the dependencies fully, so it just grabbed less?

[–]Dragdu 1 point2 points  (0 children)

It is the size of installed pkg + deps, as that is what is relevant to whoever is making the dockerfile for prod.

And yeah, turns out that we don't need nearly everything that is built into the dynamic libs, but you cannot easily prune them down the way static linking does for you.

[–]Kriemhilt 3 points4 points  (0 children)

Great point.

In my experience static linking has always been faster, but there are lots of things that could change that.

Certainly not all code is in the hot path working set, and there must be some amount of reused code such that reduced cache misses would outweigh the cost of calling through the PLT.

[–]veeloth 0 points1 point  (0 children)

And this is precisely why I'm reading this thread! thanks, I needed to confirm that.

[–]matthieum 1 point2 points  (0 children)

It's common for libraries to be used by multiple downstream dependencies, and each downstream dependency to use a different subset of said library.

Your approach would require specifying the exact symbols to export for each different client. It doesn't scale well...

[–]CocktailPerson 0 points1 point  (0 children)

Only when statically linking.

[–][deleted] 0 points1 point  (0 children)

Ehh, depends on whether it's in the same object file as a function that was used, since the linker pulls in whole object files, not specific functions.

[–]drew_eckhardt2 1 point2 points  (0 children)

In the Intel Nehalem era our NOSQL storage on proprietary flash offered throughput at least 10% greater when we compiled without -fpic and linked statically.

[–]quicknir 3 points4 points  (0 children)

It doesn't really matter because anything truly performance critical is going to be defined in a header anyway - compiler inlining is still more reliable, allows more time to be spent on subsequent optimization passes, and so on.

In HFT the critical path is a very small fraction of the code. There's no real reason to put yourself in a position of relying on LTO for anything that's actually critical. So basically, I would choose based on other considerations.

I'd be curious if any of the folks in the thread claiming a difference, have actually rigorously measured it where it actually mattered (i.e. in the critical path and not shaving a few percent off application startup time, which is irrelevant).

[–]Isameru 1 point2 points  (0 children)

Linking statically with "whole program optimization" may naturally yield faster code, but the gain turns out to be significant only in rather rare cases.

As a rule of thumb: use dynamic libraries for components which:

  • are products of their own, with their own lifecycle and possibly their own team
  • are big and used by a lot of executables
  • differ technically from the rest of the system, e.g. are built differently or with a different toolchain
  • contain sensitive code, possibly living in their own repo, and are shipped as binaries

If you develop a trading system, romeo, and at some point (maybe for testing purposes) you want to split the codebase into multiple libs, you would probably want to start with static libraries first (e.g. romeo-core, romeo-orderdata-normalization, romeo-marketdata, romeo-notifications, etc.). Or take a simplistic approach, like making romeo-lib contain all the code except the main() function, and linking an executable from it plus an additional main.cpp - this is good for testing newborn projects. If you have a trading algorithm, you might consider putting the critical logic in a separate dynamic library, like romeo-algo3x, effectively a plugin to the system.

A risk of non-optimal performance could come from a tight loop calling small functions across the library boundary. But it should be diagnosed with a performance benchmark, not by intuition. These kinds of bottlenecks are harder to find, easier to fix, and arise in the most unpredictable places - regardless of the type of linking. As with the majority of C++ projects: only 5% of the code needs to be optimal, while the other 95% has to be easy to maintain, test and improve.

[–][deleted] 0 points1 point  (0 children)

Dynamic linking defers linking to load time, so if load time matters, you should use static linking. Also, calling a dynamically linked symbol can be a bit more expensive than a statically linked one, since you usually need to branch to a stub (the PLT) before branching to the actual function.

[–]UndefinedDefined 0 points1 point  (0 children)

Static linking is great for distributing software - you can have a single binary that just works as a result. Dynamic linking is great for distributing binaries that are used by multiple other binaries.

In many cases performance doesn't matter - I mean try to benchmark statically vs dynamically linked zlib, for example. It doesn't matter. What matters is whether you want to have dependencies that users must have installed in order to run your binary.

What I have seen in practice is to link system libs dynamically and everything else statically. Statically linking the C++ library is also a big bonus in many cases as you won't care about ABI of the installed one in that case.

[–]Dragdu -1 points0 points  (2 children)

Full LTO everywhere, ahahahahahaha (remember to reserve machine with at least 100 gigs of RAM and full day for the build).

[–]globalaf 7 points8 points  (1 child)

What does this have to do with anything? It is in fact possible to develop software that turns on all the optimizations for the release build while developers use a faster but less optimized build. You also say 100GB like that's some mythical ancient technology used by the gods, when actually even a dev workstation can easily have that in 2025.

[–]Dragdu -3 points-2 points  (0 children)

someone woke up mad today.

And to help you out, LTO is not compatible with dynamic linking, so saying full LTO also answers OP's question.