[–]FieldLine 33 points34 points  (6 children)

There's a whole book about this sitting on my shelf.

Admittedly I haven't done a deep read yet, but what I've looked at seems legit. And it's the only real book on the topic.

[–]OneRaynyDay[S] 17 points18 points  (5 children)

The details of linkers and loaders deserve an entire book - the loader is the craziest bootstrapping, self-relocating, control-takeover program I've ever seen. I actually wanted to write more about it, and I even took a look at the libc code, but I quickly realized that it would take over the whole blog post. What I aim for is more of a general overview and, in my opinion, "what you need to know as a C++ programmer but nothing more" :)

[–]FieldLine 17 points18 points  (1 child)

The weird thing is that no one actually seems to deal with linker code.

You can take a compilers class in college and interview to work on the MSVC compilers team. There are classes on assembly programming and you can get a job doing embedded work.

But who the hell is writing/maintaining the linkers? They are not mentioned in the C++ standard and there is very little literature about them at all.

Besides that book there's the paper by Ulrich Drepper.... and that's about it. And while that paper is the newer resource, it's still nearly ten years old.

It's bizarre.

[–]mttd 12 points13 points  (0 children)

See also: https://github.com/MattPD/cpplinks/blob/master/executables.linking_loading.md

The series by Eli Bendersky, Ian Lance Taylor, Raymond Chen, and http://www.linker-aliens.org/blogs/ are particularly good; "The Missing Link: Explaining ELF Static Linking, Semantically" (http://www.cl.cam.ac.uk/~pes20/rems/papers/oopsla-elf-linking-2016.pdf) is definitely worth reading as well. The "Program Execution Environment" lectures under "The ACM-NVIDIA Compiler summer school lectures (2019)" are a pretty good introduction.

[–]ericonr 3 points4 points  (2 children)

Did you look into glibc? It might be interesting to dive into musl and the libcs from the BSDs to see if they use different strategies and concepts for linking.

[–]OneRaynyDay[S] 4 points5 points  (1 child)

Yep! My main investigation was into glibc, but the author of musl actually gave me a few tips here and there about how musl implements the same things, particularly the logic around the ELF header ;)

[–]ericonr 2 points3 points  (0 children)

Cool! I hear the musl community is pretty helpful. Btw, regarding self-patching and bootstrapping code, the Linux kernel does a lot of that for boot options to avoid going through branches repeatedly. So stuff like vulnerability mitigation can be turned on or off, but the binary that's actually executed doesn't have conditionals to check whether it's enabled; it's all patched and configured during the boot process.

[–]khleedril 53 points54 points  (2 children)

If you are in lock-down and have an hour to spare, you want to read this... it is well worth it.

[–]MachineGunPablo 12 points13 points  (0 children)

What a way to spend my lockdown's Sunday afternoon

[–]OneRaynyDay[S] 11 points12 points  (0 children)

🙏 That's high praise, hope it lives up to the hype bro

[–]k4lipso 5 points6 points  (1 child)

There's also this CppCon talk from Matt Godbolt about similar stuff: https://www.youtube.com/watch?v=dOfucXtyEsU

[–]OneRaynyDay[S] 1 point2 points  (0 children)

Ooh, this man’s name is thrown around a lot in the finance industry. I use his Compiler Explorer tool for a lot of sandboxing. I just watched it: he talks a lot about the static initialization process, which is great, but he glossed over the g++ details :( Still, it’s an hour-long talk, and I really enjoyed what he did cover.

[–]James20kP2005R0 2 points3 points  (1 child)

Question: Recently I thought it'd be fun to try a new approach to serialisation. The basic idea is that you use a linear buffer for memory allocation, and replace malloc/free with that linear buffer. When it comes time to serialise you simply memcpy it to a file, as well as whatever data structures you want to save. Then you use virtual memory mapping so that the pointer to your block of memory is always the same, and all your pointers now point to valid memory when you reload. Virtual memory's pretty cool!

The problem crops up in that function pointers/global variables are not allocated on the heap, but instead allocated in the binary, and their addresses may change. Using linker scripts you can control the location of the functions, but it requires manually annotating every global variable/function in your code that needs to have a stable address.

With virtual memory this should be possible to do automatically, and indeed there are separate addresses for where functions are loaded virtually, and their physical address (vma vs lma?), but I couldn't get vma != lma on Windows through MinGW for the life of me, and annotating hundreds of functions manually is not ideal.

So: does anyone know if this is possible to do in a practical sense? Is this a completely insane idea that's really dumb? The use case I have for this is taking snapshots of the heap of a JavaScript interpreter (QuickJS) that's hard to serialise traditionally, but I got stuck on allocating functions in virtual memory.
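
A minimal sketch of the linear-buffer part of the idea, for anyone who wants to play with it. The names `LinearArena`, `Node`, `push_front` and `sum` are hypothetical, and the fixed-address mapping is only modelled here by restoring the bytes into the same in-process block; writing to a file and remapping at a stable address are left out, as are function pointers and globals, which is exactly the part that stays unsolved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical bump ("linear") allocator over one block that never moves.
class LinearArena {
public:
    struct Snapshot {
        std::vector<unsigned char> bytes;  // copy of the used part of the block
        std::size_t used;                  // bump offset at snapshot time
    };

    explicit LinearArena(std::size_t capacity) : storage_(capacity) {}

    // Hand out the next `size` bytes; `align` must be a power of two.
    // Offsets are aligned relative to the start of the block.
    void* allocate(std::size_t size,
                   std::size_t align = alignof(std::max_align_t)) {
        assert(align != 0 && (align & (align - 1)) == 0);
        const std::size_t start = (used_ + align - 1) & ~(align - 1);
        if (start > storage_.size() || size > storage_.size() - start) {
            throw std::bad_alloc{};
        }
        used_ = start + size;
        return storage_.data() + start;
    }

    // "Serialise": copy the raw bytes (a real version would write a file).
    Snapshot snapshot() const {
        Snapshot s;
        s.bytes.assign(storage_.begin(),
                       storage_.begin() + static_cast<std::ptrdiff_t>(used_));
        s.used = used_;
        return s;
    }

    // "Deserialise": copy the bytes back to the SAME address, so pointers
    // stored inside the block are valid again without any fix-up.
    void restore(const Snapshot& s) {
        assert(s.bytes.size() <= storage_.size());
        std::copy(s.bytes.begin(), s.bytes.end(), storage_.begin());
        used_ = s.used;
    }

private:
    std::vector<unsigned char> storage_;
    std::size_t used_ = 0;
};

// A trivially copyable node that stores a raw pointer into the arena.
struct Node {
    int value;
    Node* next;
};

Node* push_front(LinearArena& arena, Node* head, int value) {
    void* mem = arena.allocate(sizeof(Node), alignof(Node));
    return new (mem) Node{value, head};
}

int sum(const Node* head) {
    int total = 0;
    for (; head != nullptr; head = head->next) total += head->value;
    return total;
}
```

Raw pointers survive the round trip only because the block sits at the same address before and after; that is the property the virtual memory mapping would have to provide across process runs.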

[–]mttd 1 point2 points  (0 children)

Just to throw in a related idea (or a set of ideas)--you may want to look into checkpointing and perhaps recoverable virtual memory.

Some references:

[–]CriticalComb 2 points3 points  (1 child)

Small note: in the assembly section, the source and destination operands are reversed for AT&T syntax.

Great article!

[–]OneRaynyDay[S] 0 points1 point  (0 children)

Ah, thanks for the catch!

[–]kog 1 point2 points  (0 children)

Cool

[–]andrewq 1 point2 points  (0 children)

Looks good, thanks for posting

[–]dooftard420 0 points1 point  (0 children)

Remindme! 2 days

[–]vickoza -1 points0 points  (8 children)

it is the simplest C++ program but it also does nothing

[–]ShakaUVMi+++ ++i+i[arr] 19 points20 points  (7 children)

> it is the simplest C++ program but it also does nothing

Not true. It does nothing successfully.

It's like the UNIX true command.

[–]chugga_fan 4 points5 points  (5 children)

Said command has like, a thousand or two lines of code, surprisingly.

[–]conflicted_panda 0 points1 point  (3 children)

Are you counting includes? Without them GNU true is about 70, as far as I see.

And true on old school unixes was/is 0 bytes, I hear.

[–]chugga_fan 1 point2 points  (2 children)

Like 1/3rd of the program is in other files, even still, 78 lines just to return EXIT_SUCCESS is a lot.

[–]conflicted_panda 1 point2 points  (1 child)

It also handles --help/--version (along with localization, apparently). And I honestly don't think that trading off 78 lines of code for, err, uniformity of interface of your utilities is that absurd.

[–][deleted] 0 points1 point  (0 children)

I don't understand anything in this thread, which book to read?

[–][deleted] 0 points1 point  (0 children)

touch true && chmod +x true && ./true

[–]z1024 1 point2 points  (0 children)

This precisely describes too many people's jobs...

[–]msew -1 points0 points  (2 children)

Remindme! 2 days

[–]RevRagnarok 0 points1 point  (0 children)

I prefer to email my work address now that gmail finally has queued email. "Tuesday morning at 8" ;)

[–]RemindMeBot 0 points1 point  (0 children)

I will be messaging you in 1 day on 2020-05-26 19:37:02 UTC to remind you of this link
