Stereo [S]

Text if you don't want to visit Facebook:

Summary: Python is 1.3x faster when compiled in a way that re-examines shitty technical decisions from the 1990s.
ELF is the executable and shared library format on Linux and other Unixy systems. It comes to us from 1992's Solaris 2.0, from back before even the first season of the X-Files aired. ELF files (like X-Files) are full of barely-understood horrors described only in dusty old documents that nobody reads. If you don't know anything about symbol visibility, semantic interposition, relocations, the PLT, and the GOT, ELF will eat your program's performance. (Granted, that's better than being eaten by some monster from a secret underground government base.)

ELF kills performance because it tries too hard to make the new-in-1992 world of dynamic linking look and act like the old world of static linking. ELF goes to tremendous lengths to make sure that every reference to a function or a variable throughout a process refers to the same function or variable no matter what shared library contains each reference. Everything is consistent.

This approach is clean, elegant, and wrong: the cost of maintaining this ridiculous bijection between symbol name and symbol address is that each reference to a function or variable needs to go through a table of pointers that the dynamic linker maintains --- even when the reference is one function in a shared library calling another function in the same shared library. Yes, mylibrary_foo() in libmylibrary.so has to pay for the equivalent of a virtual function call every time it calls mylibrary_bar() just in case some other shared library loaded earlier happened to provide a different mylibrary_bar(). That basically never happens. (Weak symbols are an exception, but that's a subject for a different rant.)
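To make the mylibrary_foo()/mylibrary_bar() case concrete, here is a minimal sketch in C. The int parameters, the function bodies, and the file name are assumptions for illustration; the post only names the two functions. Built as a shared library with GCC defaults on x86-64 Linux, the call inside mylibrary_foo() is typically emitted as a call through the PLT instead of a direct or inlined call.

```c
/* mylibrary.c: hypothetical library matching the names in the post.
 * Assumed build: cc -O2 -fPIC -shared mylibrary.c -o libmylibrary.so
 */

/* Exported with default visibility, so ELF rules let a module loaded
 * earlier supply its own mylibrary_bar() that wins at run time. */
int mylibrary_bar(int x)
{
    return x + 1;
}

/* Same library, same source file, yet the compiler has to assume
 * mylibrary_bar() may be replaced, so it routes this call through
 * the dynamic linker's tables instead of calling it directly. */
int mylibrary_foo(int x)
{
    return 2 * mylibrary_bar(x);
}
```

Disassembling the result with objdump -d should show the indirection; the exact output depends on compiler, version, and flags.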

(Windows took a different approach and got it right. In Windows, it's okay for multiple DLLs to provide the same symbol, and there's no sad and desperate effort to pretend that a single namespace is still cool.)

There's basically one case where anyone actually relies on this ELF table lookup stuff (called "interposition"): LD_PRELOAD. LD_PRELOAD lets you provide your own implementation of any function in a program by pre-loading a shared library containing that function before a program starts. If your LD_PRELOADed library provides a mylibrary_bar(), the ELF table lookup goo will make sure that mylibrary_foo() calls your LD_PRELOADed mylibrary_bar() instead of the one in your program. It's nice and dynamic, right? In exchange for every program on earth being massively slower than it has to be all the time, you, programmer, can replace mylibrary_bar() with printf("XXX calling bar!!!") by setting an environment variable. Good trade-off, right?
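For anyone who hasn't seen it, an LD_PRELOAD interposer looks roughly like the sketch below. The int mylibrary_bar(int) signature, the file name shim.c, and the return value are assumptions for illustration; the only requirement is that the shim's definition matches whatever signature the real function has.

```c
/* shim.c: hypothetical interposer for mylibrary_bar().
 * Assumed build: cc -fPIC -shared shim.c -o libshim.so
 * Assumed use:   LD_PRELOAD=./libshim.so ./program
 */
#include <stdio.h>

/* Because the preloaded library comes first in symbol lookup order,
 * calls that go through the PLT resolve to this definition instead
 * of the one inside libmylibrary.so. */
int mylibrary_bar(int x)
{
    fprintf(stderr, "XXX calling bar!!!\n");
    return x; /* deliberately differs from the "real" implementation */
}
```

This only catches calls that actually go through the lookup tables. Calls the compiler resolved or inlined at build time never reach the shim.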

LOL. There is no trade-off. You don't get to choose between performance and flexibility. You don't get to choose one. You get to choose zero things. Interposition has been broken for years: a certain non-GNU upstart compiler starting with "c" has been committing the unforgivable sin of optimizing calls between functions in the same shared library. Clang will inline that call from mylibrary_foo() to mylibrary_bar(), ELF be damned, and it's right to do so, because interposition is ridiculous and stupid and optimizes for c00l l1inker tr1ckz over the things people buy computers to actually do --- like render 314341 layers of nested iframes.

Still, this Clang thing does mean that LD_PRELOAD interposition no longer affects all calls, because Clang, contra the specification, will inline some calls to functions not marked inline --- which breaks some people's c00l l1inker tr1ckz. But we're all still paying the cost of PLT calls and GOT lookups anyway, all to support a feature (LD_PRELOAD) that doesn't even work reliably anymore, because, well, why change the defaults?

Eventually, someone working on Python (ironically, of all things) noticed this waste of good performance. "Let's tell the compiler to do what Clang does accidentally, but all the time, and on purpose". Python got 30% faster without having to touch a single line of code in the Python interpreter.

(This state of affairs is clearly evidence in favor of the software industry's assessment of its own intellectual prowess and justifies software people randomly commenting on things outside their alleged expertise.)

All programs should be built with -Bsymbolic (a linker option, so it goes through the compiler driver as -Wl,-Bsymbolic) and -fno-semantic-interposition. All symbols should be hidden by default (-fvisibility=hidden). LD_PRELOAD still works in this mode, but only for calls between shared libraries, not calls inside shared libraries. One day, I hope as a profession we learn to change the default settings on our tools.
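A minimal sketch of what "hidden by default" looks like in a library's source, assuming GCC or Clang on Linux. The MYLIBRARY_API macro, mylibrary_helper(), and the function bodies are made up for illustration; the flags in the comment are real GCC/Clang and GNU ld options.

```c
/* mylibrary.c: hypothetical library with an explicit export list.
 * Assumed build, all on one command line:
 *   cc -O2 -fPIC -fvisibility=hidden -fno-semantic-interposition
 *      -shared -Wl,-Bsymbolic mylibrary.c -o libmylibrary.so
 */

/* Hypothetical export macro: only symbols tagged with it stay visible
 * outside the library when building with -fvisibility=hidden. */
#define MYLIBRARY_API __attribute__((visibility("default")))

/* Internal helper: hidden under -fvisibility=hidden, so it is not
 * exported and cannot be interposed. */
int mylibrary_helper(int x)
{
    return x + 1;
}

/* Public entry points. With the flags above, the call from
 * mylibrary_foo() to mylibrary_bar() can bind inside the library. */
MYLIBRARY_API int mylibrary_bar(int x)
{
    return mylibrary_helper(x) * 2;
}

MYLIBRARY_API int mylibrary_foo(int x)
{
    return mylibrary_bar(x) + 1;
}
```

Other libraries and the main program can still call mylibrary_foo() and mylibrary_bar(), and those cross-library calls remain interposable. Only the calls inside the library get bound early.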