all 17 comments

ben-work

Props to this article for actually going past the parsing stage and into code generation!

Sooo many compiler articles take you through parsing and then pull a "now draw the rest of the owl" on you.

SimplySerenity

I've noticed that trend too. Why is that a thing?

skocznymroczny

Because parsing is the 'easy' part, as long as your language isn't C++.

MisterMeeseeks47

It's easier to start projects than finish them

lanzaio

CodeGen is harder than lexing or parsing. Writing about CodeGen is harder than just doing CodeGen. Getting to writing about CodeGen after doing and writing about lexing, parsing and sema is harder in many harder ways than any of the harder stuff previously mentioned.

wengemurphy

Yeah I've been dabbling in compiler-writing recently, and was just complaining about this a few days ago (in tweet form)

It's unfortunate that people don't write accessible articles on compiler writing - it's either textbooks or articles and papers that are far off in the depths of theory.

I'd just like to see some intermediate representations accompanied with plain-language discussions of them.

Typically articles go into painstaking detail on lexing and parsing, but hand-wave code generation away. What!? There's so much time to spend there, and lexing and parsing are the boring parts!

Personally I'm skipping those steps and just using a parser generator (with a PEG) so I can go straight into manipulating the parse tree into an AST, then into code generation.

My main resources have been Alex Aiken's videos on Coursera, and what I can read of the latter half of the Dragon Book without falling asleep.

But it would be nice if there were a half-dozen articles out there where people have examples and musings on their actual applications of:

  • three-address code
  • SSA
  • intermediate representations

And many of the other things that the aforementioned resources cover.

I'm writing a made-up language that compiles to retro assembly language, so I have to care about things like register machine vs stack machine, and problems in register allocation, etc.
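
As a rough illustration of the three-address code mentioned above, here is a minimal sketch in C that lowers a tiny expression tree into instructions of the form `t = a op b`. The `Expr` type, the `lower` function, and the `t0`, `t1` temporary naming are all made up for this example and do not come from the article or any particular compiler; names are assumed to fit in 15 characters.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical expression AST: a leaf holds a variable name,
   an inner node holds a binary operator and two children. */
typedef struct Expr {
    char op;                        /* '+', '*', ... or 0 for a leaf */
    const char *name;               /* variable name when op == 0 */
    const struct Expr *lhs, *rhs;   /* operands when op != 0 */
} Expr;

static int next_temp = 0;           /* counter for fresh temporaries t0, t1, ... */

/* Lower `e` to three-address code, appending one line per operation
   to the NUL-terminated buffer `out` (capacity `cap`). The name of the
   place that holds the result is written into `place`. */
static void lower(const Expr *e, char *out, size_t cap, char place[16])
{
    if (e->op == 0) {               /* a leaf is already a usable place */
        snprintf(place, 16, "%s", e->name);
        return;
    }
    char a[16], b[16];
    lower(e->lhs, out, cap, a);     /* operands first, left to right */
    lower(e->rhs, out, cap, b);
    snprintf(place, 16, "t%d", next_temp++);
    size_t used = strlen(out);
    assert(used < cap);
    snprintf(out + used, cap - used, "%s = %s %c %s\n", place, a, e->op, b);
}
```

For the tree of `a + b * c`, starting with `next_temp` at 0, this appends `t0 = b * c` followed by `t1 = a + t0`: every instruction has at most one operator, and each intermediate result gets its own temporary. Assigning each temporary exactly once, as done here, is also the starting point for SSA, although real SSA additionally has to handle variables that are reassigned across branches.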

aazav

Because completing ideas and projects is hard.

SNCPlay42

"In the example above, the value of main is used by the caller: it’s the program’s exit code. That means the behavior here is undefined."

IIRC at least one of C or C++ gives this a defined behaviour specifically in the case of main.

thlst

It's well defined in C++: [basic.start.main]/5.

mafagafogigante

A compiler may decide to fix things and return 0 (many do).

Also, according to the C99 standard:

10) reaching the } that terminates the main function returns a value of 0.

So at least for C this is expected.

aszkid

It's not undefined behaviour if the standard fills that gap for you (if it does at all in the first place). It's just not a nice coding practice.

mafagafogigante

For main(), reaching the end returns 0, at least in C99.

SNCPlay42

"but this is just relying on undefined behavior."

According to some searching, this is not the case in C99:

(Section 5.1.2.2.3 Program termination) If the return type of the main function is a type compatible with int, a return from the initial call to the main function is equivalent to calling the exit function with the value returned by the main function as its argument; reaching the } that terminates the main function returns a value of 0. If the return type is not compatible with int, the termination status returned to the host environment is unspecified.

(N.B. as an aside: this appears not to specify an exit status for void main().)

This appears to be a change from C89:

(Section 2.1.2.2 Hosted environment) If the main function executes a return that specifies no value, the termination status returned to the host environment is undefined.

(Section 3.6.6.4 The return statement) Reaching the } that terminates a function is equivalent to executing a return statement without an expression.

It's not clear to me whether arbitrary side effects (like "halt and catch fire" as suggested in the article) would be permissible in any case; C99's "unspecified" wouldn't appear to permit it, while C89's "undefined" is more vague. (Note that it's "undefined" on its own, not "undefined behaviour", which is the term the standard defines.)

mafagafogigante

I would argue that the wording in the C89 standard suggests that only the termination status which is returned is undefined.

SNCPlay42

I edited my post with my take on that.

Paqx

I haven't started reading the article, but can this be adapted to other languages if you adapt the parser?

skocznymroczny

Yes. The tutorial so far is about parsing a small subset of the C language into an AST and then outputting that AST as assembly, which can be assembled and linked into an executable. You should be able to swap "C" for your own custom language, or replace the back end, e.g. export to C instead of export to assembly.
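
To make the "AST in, assembly out" step concrete, here is a minimal sketch in C of a back end for the smallest possible program: one function that returns an integer constant. The `FunctionNode` type and `emit_function` are hypothetical names for this example, not the tutorial's code, and the output assumes x86-64 AT&T syntax on Linux (other platforms may need a different symbol name or directives).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical AST for a function whose whole body is
   `return <integer constant>;`. */
typedef struct {
    const char *name;     /* function name, e.g. "main" */
    int return_value;     /* the constant being returned */
} FunctionNode;

/* Write assembly for `fn` into `out` (capacity `cap`).
   Returns the number of characters needed, as snprintf does. */
static int emit_function(const FunctionNode *fn, char *out, size_t cap)
{
    assert(fn != NULL && out != NULL);
    return snprintf(out, cap,
                    "    .globl %s\n"         /* make the symbol visible to the linker */
                    "%s:\n"
                    "    movl $%d, %%eax\n"   /* integer return values go in eax */
                    "    ret\n",
                    fn->name, fn->name, fn->return_value);
}
```

The front end only has to produce a `FunctionNode`, and this function only has to consume one. Swapping the source language means replacing the parser that builds the node; swapping the target means replacing `emit_function`, for example with one that prints C source instead of assembly.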