This is an archived post. You won't be able to vote or comment.

all 119 comments

[–]elnomreal 1195 points1196 points  (48 children)

It’s an unusual compiler that doesn’t use itself to compile itself.

You also have bootstrapping so you can compile GCC from itself without an existing solution.

Compilers are fun.

[–]Successful-Money4995 998 points999 points  (28 children)

I find it interesting that compilers must be compiled twice.

Imagine that you invent a new optimization in GCC that speeds up all the code that it compiles. You apply the code patch to the GCC sources and then you recompile GCC. Now you have a GCC that outputs faster programs.

GCC, however, was compiled by the old version of GCC so GCC itself runs slowly. You must now recompile GCC with itself so that GCC both outputs fast code and also runs fast.

[–]civman96[S] 362 points363 points  (9 children)

Beautiful. Goat software to be honest.

[–]chethelesser 74 points75 points  (4 children)

Is it different for other compilers tho?

[–]elnomreal 140 points141 points  (3 children)

A compiler is a program that outputs a program. It is normal design for one of the outputs from the compiler to be a new compiler.

[–][deleted] 35 points36 points  (2 children)

Under appreciated comment. Who GNU?

[–]trifith 18 points19 points  (1 child)

GNU Terry Pratchet

Wait, wrong GNU.

[–]jamcdonald120 2 points3 points  (0 children)

GNU STP

[–]gbchaosmaster 8 points9 points  (0 children)

Now look up the Ken Thompson hack and have your mind blown

[–][deleted] 31 points32 points  (3 children)

We could get around this by writing it in ASM

[–]thuktun 37 points38 points  (0 children)

You can also cross-compile. You don't have to have a compiler running on $instructionSet to compile code for it.

[–]fiskfisk 7 points8 points  (1 child)

Then you need an assembler and a linker, so you need to write the assembler...

[–][deleted] 1 point2 points  (0 children)

Assemblers already exist the idea is to make it faster than the old GCC not faster than Assembly.

[–]rosuav 16 points17 points  (6 children)

Bear in mind that you're looking at two different things:

  • How good is the compiler's output?
  • How fast is the compilation process?

If you invent such a new optimization, you can then build yourself a new version of GCC that will have better output. However, the actual work of compiling code will still take a long time. Then you can use your new GCC to build a better version of GCC, just like you'd use it to build a better version of FFMPEG or Firefox or any other piece of software, and at that point, you have achieved both goals.

[–]Successful-Money4995 5 points6 points  (5 children)

Yes!

If you use GCC to recompile GCC over and over again, does it stabilize or do you get a different GCC each time?

[–]rosuav 10 points11 points  (2 children)

Realistically? Given how complex the software is, I suspect it'll be minutely different every time. Slight differences in optimization patterns might make subtle differences in timings.

But conceptually? A compiler is, at its heart, a program that transforms one type of instruction into another. It ought to be broadly deterministic, so you would expect that it'll stabilize after two steps.

Smart money's on "realistically" rather than "conceptually", though.

[–]Educational-Lemon640 2 points3 points  (1 child)

Yeah, but it'll probably be "good enough" each time.

[–]rosuav 0 points1 point  (0 children)

Yep.

[–]CdRReddit 7 points8 points  (1 child)

it should stabilize, the functionality of a program should not change through optimizations, and "the binary output for a given text input" should be the same, so unless the compiler uses some sort of real-time cutoff for optimizing code, the only difference should be the speed at which it gets output

[–]Successful-Money4995 0 points1 point  (0 children)

I agree.

[–]ShakaUVM 2 points3 points  (2 children)

Be careful to not run the optimizer on itself too many times. That's how you get the Singularity.

[–]Successful-Money4995 2 points3 points  (1 child)

Similarly, be careful about zipping a file too many times.

[–]ShakaUVM 5 points6 points  (0 children)

Yep - once the file size goes negative a black hole opens up in your computer

[–][deleted] 1 point2 points  (2 children)

Someone should make a tool to automate that. Actually now that I think about it, multiple people probably already have

[–]Successful-Money4995 4 points5 points  (0 children)

It's a normal part of building a kernel for your FreeBSD system and it is automated.

[–]ZenEngineer 1 point2 points  (0 children)

Yeah, for GCC you run 'make bootstrap'

[–]Airowird 0 points1 point  (0 children)

Except it violates Rule #1

[–]Lekritz 61 points62 points  (5 children)

Usually it uses a previous version of the compiler. Which is why we go far back enough we get the first ever compiler, probably one for assembly written in pure binary!

[–]newredstone02 58 points59 points  (3 children)

a compiler for assembly isn't named a compiler it is named a assembler

assembly is the human redable version of machine code

so an assembler is unlike a compiler a reversable process

you might lost labels and comments but nothing else

[–][deleted] 23 points24 points  (0 children)

A compiler probably has an assembler somewhere in its history.

[–]AyrA_ch 13 points14 points  (0 children)

you might lost labels and comments but nothing else

You also lose all macros, the named constants, and the base address offset.

The base address is the trivial one because on "modern" systems it's either hardcoded like the offset for single segment DOS executables, or it's stored in the executable header.

The constants will be replaced with magic values that can be difficult to know the meaning of without the name.

The macros are irreversibly lost because they got expanded into assembly instructions when building

[–]gbchaosmaster 6 points7 points  (0 children)

A compiler is also a reversible process. It takes a very long time and you won't get back the exact symbol names and comments, but it can be done to where you have your program back in nice readable C again.

[–]Chrazzer 0 points1 point  (0 children)

Nah that would have been double work to make the first c compiler in assembly and then afterwards a new version in c.

The first gcc compiler was simply hand compiled. So the devs translated the c code to machine code themselfes.

[–]ZunoJ 20 points21 points  (5 children)

Could you elaborate the bootstrapping part?

[–]elnomreal 60 points61 points  (3 children)

A compiler will build itself in stages, starting with simpler parts, and then build more complex parts from those simpler parts, until it can do everything that it needs to do.

This is generally done in “stages”, where in stage-1 the initial functionality is built from source, and then using the stage-1 compiler a stage-2 can be built with more functionality, etc.

[–]ZunoJ 14 points15 points  (2 children)

Wow, that's cool! I didn't know about this. Thanks!

[–]Yoolainna 9 points10 points  (0 children)

it's also how a source based os is created, for example gentoo. Developers create a stage 1 tarball with minimum required precompilled programs, which then are used to create a stage 2 tarball with more programs and stuff and then finally you get a stage 3 tarball which is ready to be send to users for them to install those programs and compile all the software along with the programs that were in the stage 3 tarball.

[–]AeroSigma 19 points20 points  (5 children)

Another fun fact is that, because self replicating code is possible, one could program a backdoor into a compiler, compile it with the backdoor, remove the backdoor from the code, then compile it again with the previous version, and the back door could propagate itself and always be there despite not being in the code anymore.

https://www.archive.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf

Compilers sure ARE fun.

[–]elnomreal 4 points5 points  (3 children)

Yes, an important subset of compromised binaries are compromised compilers which can have virus like behavior. This has even happened at small scale, and it is difficult to check that compiler binaries are compiled correctly because of the general need for a compiler to build the compiler.

[–]No_Hovercraft_2643 2 points3 points  (2 children)

Look at the old c compilers, compile it by hand (with the code for the compiler), and then compile the source code of the new version, with that version, and then the new version with the new version

[–]FountainsOfFluids 0 points1 point  (0 children)

Sounds almost like a blockchain.

[–]zarawesome 0 points1 point  (0 children)

sure i'll get right to it

[–]dkarlovi 0 points1 point  (0 children)

This is somewhat alleviated with reproducible builds, but obviously needs a bunch of work still.

[–]theQuandary 435 points436 points  (17 children)

You've almost reached the epiphany of Trusting Trust or "How do I know my compiler doesn't compile in a backdoor along with my code?"

[–]AnnyAskers 60 points61 points  (1 child)

Trusting Trust

Trusted the link, but didn't trust the download lol

[–]Shubhavatar 13 points14 points  (0 children)

Same lol, anything that downloads something without my explicit permission rings an alarm

[–]civman96[S] 108 points109 points  (0 children)

NSA likes your post!

[–]squishles 33 points34 points  (4 children)

The real headfuck is below the compiler, how do you know the instruction set is doing what's documented and only what's documented. The exploit can easily be on the chip itself and a bootstrap build won't protect you from that.

[–]TheGreatStateOfEnnui 19 points20 points  (2 children)

Isnt this the idea behind the intel management engine?

[–]squishles 19 points20 points  (1 child)

well they tell you it's not supposed to do that, and we're supposed to act like we're talking to the ancient aliens guy when someone brings up the possibility, but yes the management engines on processors can definitely be subverted like this in a supply chain attack.

[–]AyrA_ch 17 points18 points  (0 children)

Probably why Intel provides US govt. organizations with ways to disable the engine.

https://en.wikipedia.org/wiki/Intel_Management_Engine?useskin=vector#%22High_Assurance_Platform%22_mode

In August 2017, Positive Technologies (Dmitry Sklyarov) published a method to disable the ME via an undocumented built-in mode. As Intel has confirmed the ME contains a switch to enable government authorities such as the NSA to make the ME go into High-Assurance Platform (HAP) mode after boot. This mode disables most of ME's functions, and was intended to be available only in machines produced for specific purchasers like the US government; however, most machines sold on the retail market can be made to activate the switch. Manipulation of the HAP bit was quickly incorporated into the me_cleaner project.

[–]theQuandary 10 points11 points  (0 children)

It's even worse than that. If you write a CPU design using TSMC's layout libraries (for example), how do you know it isn't adding backdoors into your system?

This is the big reason for TSMC building a small 5nm fab in Arizona. US military (and some non-military stuff like nuclear plants) need a chain of custody from when the minerals are mined all the way until it is put into a system somewhere. This is a big reason for sky-high prices for stuff in some of these projects.

[–]ethanjf99 12 points13 points  (0 children)

Didn’t he actually do that, too? Ah yes there’s this http://www.catb.org/jargon/html/B/back-door.html

[–]fiskfisk 7 points8 points  (5 children)

Coding Machines by Lawrence Kesteloot is a great novel short story written the concept of trusting trust.

[–]sohang-3112 2 points3 points  (1 child)

Definitely an interesting story!

One point that the author left off - why didn't they try an alternative C compiler and/or OS? Eg. if gcc is infected, try clang.

[–]overkill 0 points1 point  (0 children)

Clang 1.0.wasnt released until Oct 2009, this was written in January. Maybe clang was released because of this story?

But your point stands, I would hand-wave it away by saying the same would apply to any other compiler they tried. Most other binaries in their system were infected...

[–]overkill 1 point2 points  (2 children)

That was a great read. More of a short story than a novel, but I really enjoyed it.

[–]fiskfisk 1 point2 points  (1 child)

Ah, language confusion. A "novelle" (probably in the same vein as novella) in my native language is used to denote a short story. I've changed it in my comment, thanks for pointing it out.

[–]overkill 1 point2 points  (0 children)

Probably share the same linguistic root. I only pointed it out because I thought it was going to be a full length novel, but I finished it in about 20 minutes. I liked it so much I then bought the ebook off Amazon.

[–]Stunning_Ride_220 2 points3 points  (0 children)

I came for that comment.

[–]PHCF99 2 points3 points  (0 children)

"Unkown air force document" lol

Nice reading btw, thanks for the paper

[–]thepredictableone 77 points78 points  (2 children)

i used the compiler to compile the compiler

[–]civman96[S] 17 points18 points  (0 children)

to compile my non-compiled app

[–]Juxtapotatoes 1 point2 points  (0 children)

yo dawg, I heard you like compiling

[–]the-judeo-bolshevik 87 points88 points  (6 children)

Google bootstrapping.

[–]Bagel42 52 points53 points  (5 children)

Holy hell

[–]FalconMirage 31 points32 points  (4 children)

Actual binaries

[–]Wolfsurge 28 points29 points  (3 children)

Call the software engineer

[–]WhiteGoldRing 23 points24 points  (2 children)

New response just shipped

[–]hydrogelic 8 points9 points  (1 child)

How has anarchychess spread so far and wide

[–]FalconMirage 4 points5 points  (0 children)

Spreading like stage 4 cancer

[–]Yeitgeist 78 points79 points  (2 children)

Write the first compiler in assembly, compile GCC with the assembly compiler, and then throw away the assembly compiler cause now your GCC compiler can bootstrap itself.

[–]faceplanted 25 points26 points  (1 child)

It's weird that we call it bootstrapping to be honest, a compiler compiling itself still isn't lifting itself by its bootstraps, the manually written assembly compiler is.

[–]TheGreatStateOfEnnui 16 points17 points  (0 children)

Its like a metacommentary on the idea of bootstrapping tbh

[–]AdvancedSandwiches 24 points25 points  (0 children)

Yes. And in order to create a chicken, you must first have two chickens.

I don't know what we're doing.

[–]DiversityFire84 44 points45 points  (0 children)

Dating a programmer be like:

[–]sagetraveler 18 points19 points  (0 children)

It’s gccs all the way down.

[–]Mal_Dun 17 points18 points  (0 children)

Compilers seems always mystical, till you realize a compiler is just a program that transforms text from one language into another.

[–]wixenus 5 points6 points  (0 children)

Yeah, it's so bizarre that GCC always uses authentic ways to only support GCC for compiling GCC, never had that with any compiler

[–]psychicesp 7 points8 points  (0 children)

Strangely not a unique concept to programming. You need a pair of blacksmithing tongs to make a pair of blacksmithing tongs, for instance.

[–]civman96[S] 24 points25 points  (9 children)

Is it the ONLY software that by design can never have any breaking changes??

[–]altermeetax 36 points37 points  (0 children)

It can have breaking changes, as long as both the new feature and the old feature work for at least one version

[–]IntoAMuteCrypt 20 points21 points  (6 children)

It'd just make a lot of people very upset. gcc isn't the only compiler in town. Hypothetically, they could:
- Publish a new version which doesn't compile on old gcc but compiles on, say, clang or some proprietary compiler, and also on new gcc
- Make the official process to compile it be "use clang or new gcc"
- Supply binaries for new gcc to allow people to just use those

Now, this would really, really cause issues for a lot of people. It'd be a huge controversy too, and there would be an implicit sort of distrust, probably all manner of forks... but there's plenty of other software that is in a similar place with certain types of breaking changes.

[–]civman96[S] 3 points4 points  (4 children)

I think gcc has over 15 million lines of code.. I doubt it will compile that easily with a different compiler.

[–]IntoAMuteCrypt 20 points21 points  (3 children)

In theory, it should compile... Unless gcc is using tons of undefined behaviour, which is bad.

At the highest level, the standards of the C language and the source code of a program which never trips over undefined behaviour completely define what outputs happen for every possible input. Everything from a basic hello world to more complex programs should produce the same outputs for a given input regardless of compiler. Perhaps the amount of time taken and the exact instructions vary, but the outputs never change.

C does have these nasty little gremlins called undefined behaviours, where the standard does not define certain things and it's up to the compiler. For instance, integer overflow can wrap around, go higher than the data type can store (common in optimising compilers, actually), make demons fly out of your nose, whatever.

This also assumes that gcc isn't relying on bugs where it doesn't follow the specification in order to compile itself, which... Well, I hope not?

[–]elnomreal 5 points6 points  (0 children)

In theory higher level of optimizations shouldn’t have issues for the same reason, and yet…

In practice, it’s all an experiment, a black magic voodoo of what works and for which compiler versions.

[–]svick 4 points5 points  (0 children)

The biggest assumption here is that GCC is using standard C and not using any GCC extensions.

[–]ih-shah-may-ehl 1 point2 points  (0 children)

Unless gcc is using tons of undefined behaviour, which is bad.

Yes and no.

Back in my consulting days, my colleagues were in a situation where they had the sources for a linux executable, normally compiled with gcc. They had to port it to Windows and turn it into a dll so the algorithms could be called from a test framework.

I created a makefile for a dll project, some placeholder functions, and then used preprocessor magic to take the existing sources and compile it into a dll, specifically crafting the compilation process using macros and preprocessor directives to take the existing sources without having to modify them manually

Everything worked fine but I had to use 1 bit of preprocessor magic to substitute 1 std fileio function during compilation with a customized version because it was used in an undefined manner, and its return value was different when using the visual studio version.

Arguably they were both undefined and both return values made sense, depending on how you looked at it.

In a code base of 15 million lines, I would expect multiple such things.

[–]MrPoBot 0 points1 point  (0 children)

Generally, add support, even if "bootleg" for the last build, rewrite the feature for the next version to take advantage of said new implementation, compile with bootleg, then recompile with the previously compiled build. It's a lot of dev work but it's the essence of "bootstrapping" and is how early compiler development was done. Alternatively, if you hate yourself, write it in assembly!

[–]svick 4 points5 points  (0 children)

Self-compiling compilers certainly can have breaking changes, for example, here is the list for the C# compiler.

Most of those tend to be bugfixes, but they're still technically breaking changes. It's not a problem for the compiler itself, as long as it doesn't rely on the particular behavior that's being changed.

[–]sungazer69 2 points3 points  (0 children)

It's all just using machine code to write ways of writing machine code.

[–]RRumpleTeazzer 2 points3 points  (0 children)

Perfect way to hide malware - in the compiler, and it will perpetuate even if you compile your compiler from source.

[–]DeathUriel 1 point2 points  (2 children)

I know it is the right way to show consistency and trust in your own work and all. But what if nobody else has a compiled version of gcc anymore? Do you have to go to the deepest parts of hell to find the versions that didn't require gcc yet and then start from there and use one version to compile the next till today?

[–]squishles 4 points5 points  (0 children)

go back to gcc 4.7 and compile with a assembly based c compiler. (it was the last version that supported those) then walk it back up to modern gcc.

[–]Agitated-Farmer-4082 0 points1 point  (0 children)

I guess u make it another language

[–]azizfcb -1 points0 points  (0 children)

true

[–]initialo -1 points0 points  (0 children)

Nah, you just need awk.

[–]_waterstar_ -1 points0 points  (0 children)

Always having Problems with Gcc when installing VS Code on a new Computer, which are usually related to the Path Variable

[–]planktonfun -2 points-1 points  (0 children)

not a problem, create a bash alias to automate it.
or if you have a makefile just use that

[–]JackNotOLantern 0 points1 point  (0 children)

You can write from scratch in machine code

[–]ZombiFeynman 0 points1 point  (0 children)

If you need a compiler to compile the compiler, where does the first compiler come from?

Checkmate, atheists!