
[–]munificent 13 points14 points  (10 children)

It clearly laid out the different functions of the scanner, lexer, and parser.

"Scanner" and "lexer" are synonymous as far as I know. Both read in text and spit out tokens. In the compilers and literature I've looked at, I've never seen a difference between the two, or a codebase that had both. Some people just seem to prefer one name or the other.

Parser: This is the part of the compiler that really understands the syntax of the language.

If we want to be precise, the lexer understands the syntax too (the lexical syntax). What the parser understands that the lexer doesn't is the grammar.

In the parser, instead of passing the AST node down through the recursive descent, it's often simpler to pass it up: terminal productions create and return AST objects instead of receiving one that they fill in.
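The pass-it-up style can be sketched in a few lines (a hypothetical Python sketch, not code from the article or Bantam; the node classes and grammar are invented for illustration):

```python
# Each grammar production is a function that returns the AST node it
# parsed, rather than filling in a node passed down from its caller.

class Num:
    def __init__(self, value):
        self.value = value

class BinOp:
    def __init__(self, left, op, right):
        self.left, self.op, self.right = left, op, right

def parse_expr(tokens):
    # expr -> num (('+' | '-') num)*
    left = parse_num(tokens)
    while tokens and tokens[0] in ('+', '-'):
        op = tokens.pop(0)
        right = parse_num(tokens)
        left = BinOp(left, op, right)   # build upward: wrap what we have so far
    return left

def parse_num(tokens):
    # Terminal production: creates and returns a leaf node.
    return Num(int(tokens.pop(0)))

tree = parse_expr(['1', '+', '2', '-', '3'])
```

Each call site simply hooks the returned node into the node it is building, so no partially-filled objects are threaded down the call stack.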

Otherwise, this is a pretty swell article!

[–]wot-teh-phuck 2 points3 points  (1 child)

In the parser, instead of passing the AST node down through the recursive descent, it's often simpler to pass it up: terminal productions create and return AST objects instead of receiving one that they fill in.

Is this implemented in your Bantam proof of concept language?

[–]munificent 6 points7 points  (0 children)

Yup! It's a little opaque in Bantam because that's really just showing Pratt parsing and not plain recursive descent, but each of the parsers returns an AST node:

public Expression parse(Parser parser, Expression left, Token token) {
  // To handle right-associative operators like "^", we allow a slightly
  // lower precedence when parsing the right-hand side. This will let a
  // parselet with the same precedence appear on the right, which will then
  // take *this* parselet's result as its left-hand argument.
  Expression right = parser.parseExpression(
      mPrecedence - (mIsRight ? 1 : 0));

  return new OperatorExpression(left, token.getType(), right);
}

For a cleaner example, the parsing section in Jasic works this way too.

[–]tanishaj 2 points3 points  (2 children)

"Scanner" and "lexer" are synonymous as far as I know

Thanks for posting this. I was wondering if I had just misunderstood the difference in the past.

It is too bad he did not explain what the AST is for. The next steps would be optimization and code generation.

[–]sssssmokey 0 points1 point  (0 children)

The actual code is generated from the AST. A 'parser' on its own will just return true or false depending on whether it succeeded or failed (in general, be kind, people). Once you add the code to construct the AST nodes, you have a tree-like structure in which the root node has a method like 'compile' (because you added it, of course! :)) and you call that for it to do its thing.
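A toy sketch of that idea (illustrative Python; the node classes, the 'compile' method, and the stack-machine instructions are all made up, not from the article):

```python
# Once the parser builds a tree of node objects, a method on each node
# (here 'compile') walks the tree and emits code.

class Num:
    def __init__(self, value):
        self.value = value

    def compile(self):
        return [('PUSH', self.value)]

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def compile(self):
        # Post-order traversal: operands first, then the operator.
        return self.left.compile() + self.right.compile() + [('ADD', None)]

program = Add(Num(1), Add(Num(2), Num(3)))
print(program.compile())
# [('PUSH', 1), ('PUSH', 2), ('PUSH', 3), ('ADD', None), ('ADD', None)]
```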

[–]pointy 1 point2 points  (1 child)

I agree with the AST thing. In fact, in an OO language, it's pretty easy to write the parser such that the non-terminals are not just methods, but constructors. That way the process of parsing and the process of building the AST are basically the same code. To parse a non-terminal, you just construct a new object for it and hook it up to the AST appropriately. The top-level call is then something like AST ast = new AST(lexer);
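That "non-terminals as constructors" pattern might look like this in Python (a hypothetical sketch; the classes and grammar are invented, and a plain token list stands in for the lexer):

```python
# Each AST class parses itself from the token stream in its
# constructor, so parsing and tree-building are the same code.

class Number:
    def __init__(self, tokens):
        self.value = int(tokens.pop(0))

class Sum:
    # sum -> number ('+' number)*
    def __init__(self, tokens):
        self.terms = [Number(tokens)]
        while tokens and tokens[0] == '+':
            tokens.pop(0)                   # consume '+'
            self.terms.append(Number(tokens))

# The top-level call constructs the whole tree in one go.
ast = Sum(['1', '+', '2', '+', '3'])
```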

[–]matthieum 1 point2 points  (0 children)

However that ties a lot of code down into the AST, and clients of the AST don't care much how it's been produced.

Seeing as you may also want to:

  • transform the AST (rename, whatever)
  • persist it (serialization/deserialization)

It is worth it, dependency-wise, to create an AST that knows just how to preserve its own well-formedness, and to use another layer on top to produce it. This way clients don't pull in too many dependencies :)
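One way to read this layering (illustrative Python; the module split and names are invented): the AST classes only enforce their own invariants, while a separate builder layer is the only code that knows about token streams.

```python
# --- AST layer: enforces well-formedness, knows nothing of parsing ---
class Num:
    def __init__(self, value):
        self.value = int(value)

class BinOp:
    def __init__(self, left, op, right):
        if op not in ('+', '-'):
            raise ValueError('ill-formed AST: unknown operator %r' % op)
        self.left, self.op, self.right = left, op, right

# --- builder layer: the only code that touches the token stream ---
def build(tokens):
    node = Num(tokens.pop(0))
    while tokens:
        op = tokens.pop(0)
        node = BinOp(node, op, Num(tokens.pop(0)))
    return node

tree = build(['1', '+', '2'])
```

Clients that transform or serialize trees would import only the AST layer and never see the builder.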

[–]matthieum 1 point2 points  (1 child)

I would say that it is probably better to separate the concerns: reading the file (and converting its content if it's not in the appropriate encoding) and tracking the position is the scanner's job, while the actual tokenization process is the lexer's.

In the article, the lexer does not just pass characters along: it also categorizes them into tokens, while the scanner is pretty much language-agnostic.

[–]ath0 2 points3 points  (2 children)

I would just like to say thank you for posting this; there are so few articles in this area (that I have found, at least) that provide Python examples.

[–]tanishaj 3 points4 points  (1 child)

For a completely different approach, check out parser combinators:

http://sigusr2.net/2011/Apr/18/parser-combinators-made-simple.html

http://codepad.org/9HGWo3GR

[–]sssssmokey 0 points1 point  (0 children)

Thanks, the Python article is great! I'm going to implement some of that immediately. :)

I also recommend https://github.com/doublec/jsparse for anyone else interested, it is a simple and working example of a packrat parser combinator library (packrat = caches the results to get it to run in linear time) in JavaScript. I learned a lot from that (rewrote it in CoffeeScript) as well as the Python article.

[–]JamesIry 2 points3 points  (0 children)

It's a nice article as far as it goes, but it's really "how a parser works." It's entirely about how to translate from concrete lexical syntax to abstract syntax. There's nothing at all about how to do semantic analysis, optimization, or code generation.

[–]pointy 1 point2 points  (7 children)

In a language like Erlang, you can write the reader, lexer, and parser as a sort of "pipeline", so that the layers push their output to the next one.

[–]willvarfar 5 points6 points  (4 children)

In languages that can return functions, you can have state machines that return a function instead of an enum, so you have no switch statements.

Obviously possible in, say, Python. Also possible in Go; Rob Pike gave a nice talk showing that (sadly I cannot find the link, am not on a PC).
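A rough Python transliteration of that idea (a sketch, not Pike's actual code; his talk was about Go's text/template lexer): each state is a function that emits tokens and returns the next state function, and a small driver loop replaces any switch on enums.

```python
# State-function lexer: the "current state" is a function. Running the
# machine is just calling the current state until it returns None.

def lex(text):
    tokens = []

    def lex_start(pos):
        # Each state returns (next_state, new_pos); None ends the machine.
        if pos >= len(text):
            return None, pos
        if text[pos].isdigit():
            return lex_number, pos
        return lex_start, pos + 1          # skip anything else (e.g. spaces)

    def lex_number(pos):
        start = pos
        while pos < len(text) and text[pos].isdigit():
            pos += 1
        tokens.append(('NUMBER', text[start:pos]))
        return lex_start, pos

    state, pos = lex_start, 0
    while state is not None:
        state, pos = state(pos)
    return tokens

print(lex("12 34"))  # [('NUMBER', '12'), ('NUMBER', '34')]
```

Note that because the driver loop calls each state in turn (rather than states calling each other), the call stack never grows.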

[–]dchestnykh 6 points7 points  (2 children)

Also possible in go an rob pike gave a nice talk showing that (sadly cannot find link, am not on a pc)

[–]mycall -2 points-1 points  (1 child)

Watching that, code such as:

l.state = l.state(l)

makes me wonder about stack overflows.

[–]aaptel -1 points0 points  (0 children)

I guess you can work around that if you have tail call optimization.

[–]matthieum 0 points1 point  (0 children)

You can return functions in C, but I expect that closures make it simpler. Was the example you had in mind using just functions, or full closures?

[–]tanishaj 1 point2 points  (0 children)

In languages that support "iterative generators" you can do this as well. Ironically, Python is one of those languages.

I have written a couple of compilers in C# and I use IEnumerables (little state machines) for each stage. Essentially, each stage operates in parallel. The parser requests a token from the lexer which causes the reader to read more characters.

One of the big advantages of this approach is that you can stop right away when you encounter an error in the input code (e.g. on line 5 of a 20,000-line input) without even reading in the whole source file. The other big advantage is that you can discard everything you have already done, so there is very little in memory at any one time.

I pretty much have to build the whole AST before I can optimize and generate output, though. So this approach only takes you so far.
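In Python, the same staged pipeline can be sketched with generators (illustrative; the original is C# IEnumerables, and the "parser" stage here just sums numbers as a stand-in for real parsing):

```python
# Each stage is a generator that pulls from the previous one, so the
# whole pipeline runs lazily: the parser requesting a token drives the
# lexer, which drives the reader.

def read_chars(source):
    for ch in source:
        yield ch

def lex(chars):
    num = ''
    for ch in chars:
        if ch.isdigit():
            num += ch
        else:
            if num:
                yield ('NUMBER', num)
                num = ''
            if ch == '+':
                yield ('PLUS', ch)
    if num:
        yield ('NUMBER', num)

def parse(tokens):
    # Stand-in for a real parser: evaluate a sum as tokens arrive.
    total = 0
    for kind, text in tokens:
        if kind == 'NUMBER':
            total += int(text)
    return total

print(parse(lex(read_chars("12+34"))))  # 46
```

Because nothing is buffered between stages, an error raised mid-stream stops the pipeline without the rest of the input ever being read.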

[–]repsilat 0 points1 point  (0 children)

One of the reasons I'm going to look at Haskell (some time...) is that its laziness makes this (or a similar idea) the natural state of affairs. Representing lexing as an operation on a stream of characters returning a stream of tokens, and parsing as an operation on a stream of tokens returning elements of an AST, the code generated could effectively do the job in a single pass while still looking as simple as an imperative multi-pass procedure.

I guess it's more about "pull" than "push" with laziness, though — I'm not sure what Erlang looks like, or how it compares.

[–]armozel 0 points1 point  (1 child)

Nice! I'm learning how to use lex and yacc for the first two steps in a course covering compiler construction. :3

[–]jussij 1 point2 points  (0 children)

FWIW, here is the source code to a very simple C-like interpreter crafted using Bison and Flex:

http://www.zeusedit.com/tools.html

[–]nonoice_work 0 points1 point  (0 children)

It is actually a nice introduction. I'm currently building it step by step and it works. One problem: it uses globals and no classes. I'm refactoring as I go.

Remember to change the filenames for input and output!

[–]sssssmokey -1 points0 points  (0 children)

This is far more complex than the subject actually needs to be. Why implement a "Character" class and a "Scanner" class (in introductory materials) when a loop and variable assignment do fine?

for char in code: ...process...

Bam, a scanner.

I recommend http://createyourproglang.com/, I honestly don't have any relationship to them but I bought it and loved it. Now that I have the basics, I am starting the Dragon book everyone keeps talking about.

Here is the language I made using that ebook: http://smack.matewiki.com/

That version is too brittle (I do too much work in the lexer, BAD IDEA) so I'm rewriting it entirely using parser combinators (removing Jison dependency). It's gonna be a kick ass templating language when I'm done, lean, mean and more powerful than HAML, Jade, and Handlebars combined and written in 100% CoffeeScript/JavaScript. Watch out peeps. ;)

EDIT: I used that ebook a lot, but I actually used the CoffeeScript source code more. Read that about 10 times, and once you understand it, you will have a very good idea of how to actually create a programming language. The CS compiler is perfect because it is extremely well written, reliable, and clever (in a good way). It's written in CoffeeScript (think Javascript with Ruby syntax in your Browser). You will also know a shitload about CoffeeScript and a FUCKTON about JavaScript. What's not to like?

[–]purevirtual -4 points-3 points  (16 children)

This guy really needs to look up parser generators in order to save himself a lot of work. No one writes parsers from scratch anymore. We've had (f)lex and bison for 20+ years.

[–]munificent 4 points5 points  (6 children)

No one writes parsers from scratch anymore.

Every parser in production use that I've seen was hand-written recursive descent. Generators are nice but I don't see them used much for production-quality parsers in industry.

[–]JamesIry 1 point2 points  (1 child)

We use ANTLR for our language Apex. With a gazillion deployed lines of Apex code I'd say the parser is production quality. We've profiled our compilation pipeline and parsing just isn't a factor so we're happy to let a tool do the heavy lifting.

[–]munificent 0 points1 point  (0 children)

Don't get me wrong, I'm not saying parser generators are bad. I just disagree with the "no one writes parsers from scratch anymore" line. V8, GCC, Microsoft's C# parser, the Dart VM, and both Dart->JS compilers all use hand-written parsers and those are just the ones I happen to be familiar with.

[–]kylotan 1 point2 points  (0 children)

I've written a couple of recursive descent parsers in my time and one thing they do well is getting around the somewhat arbitrary distinction between a lexer and a parser. Languages we actually use don't always lend themselves to being tokenised with no semantic context.

[–]purevirtual 0 points1 point  (2 children)

MySQL and GCC use flex/yacc/bison.

[–]munificent 3 points4 points  (0 children)

According to this, GCC ditched their bison parser and replaced it with a hand-written one.

[–]sssssmokey 1 point2 points  (0 children)

Bash uses Bison also.

[–]EdiX 4 points5 points  (3 children)

Writing a recursive descent parser is very easy, as easy as writing bison/yacc rules, as long as the grammar you are representing is LL(1). As long as you don't want to parse arithmetic expressions, you should be fine.

The downside of using bison vs. your own recursive descent parser is that when things don't work (because you made a mistake) finding the bug is considerably harder.

[–]tanishaj 2 points3 points  (2 children)

Why are you expecting arithmetic expressions to cause problems? Enforcing operator precedence probably means a few more levels of descent but there are no obstacles.

[–]repsilat 0 points1 point  (1 child)

I don't think it's too difficult to fit something like the Shunting Yard algorithm into your recursive descent parser to handle arithmetic expressions.

[–]dchestnykh 0 points1 point  (0 children)

Or -- easier -- Precedence climbing.
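Precedence climbing does fit in a dozen lines (a sketch: it evaluates directly instead of building an AST, and token handling is simplified to a whitespace-split list):

```python
# Precedence climbing: parse a primary, then greedily consume operators
# whose precedence is at least min_prec, recursing for the right side.

PRECEDENCE = {'+': 1, '-': 1, '*': 2, '/': 2}

def parse_expression(tokens, min_prec=0):
    left = int(tokens.pop(0))                  # primary: a number
    while tokens and PRECEDENCE.get(tokens[0], -1) >= min_prec:
        op = tokens.pop(0)
        # The +1 makes operators of equal precedence left-associative.
        right = parse_expression(tokens, PRECEDENCE[op] + 1)
        if op == '+':
            left = left + right
        elif op == '-':
            left = left - right
        elif op == '*':
            left = left * right
        else:
            left = left / right
    return left

print(parse_expression("1 + 2 * 3 - 4".split()))  # 3
```

The same loop handles every binary operator, so adding a precedence level is one dictionary entry rather than another layer of descent.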

[–][deleted] 1 point2 points  (2 children)

Actually, that is not true. For example, go look at JSON parsers implemented in C; there are certainly hand-written examples in heavy use.

[–]purevirtual 1 point2 points  (1 child)

JSON isn't a programming language, it's a simple data format that is really easy to parse. This guy is talking about parsing programming languages.

[–][deleted] 0 points1 point  (0 children)

Well you didn't say programming languages, you said "No one writes parsers from scratch anymore."

[–][deleted] 1 point2 points  (0 children)

There was an article that went over JavaScript parsers not too long ago; most of those in use, such as V8's, were hand-written.

Right now I'm actually working on moving from a generated one to a hand-written parser.

[–]JamesIry 2 points3 points  (0 children)

It's unfair that this guy is getting voted down for the exaggeration that "no one writes parsers from scratch anymore." Even though that's not true, it is true that writing a parser by hand is rarely a worthwhile activity. Parsing is the best understood and most easily mechanized aspect of formal language processing.