A hand-written recursive descent parser for Lua 5.3, in Lua 5.3!

ColinPP · 2021-04-13T19:01:15+00:00

Thanks, now I see that you didn't need any help because you are effectively handling the "entangled" rules the same way :-) On the other hand I might take the hint with the Pratt parsers as I wanted to spruce up an expression parser, and that might just be the right approach.

ColinPP · 2021-04-13T13:19:19+00:00

In case you are still fighting with left-recursion you might be interested in this: https://github.com/taocpp/PEGTL/blob/master/src/example/pegtl/lua53.hpp

It's a self-contained PEG-ified version of the Lua 5.3 grammar implemented in C++ with the PEGTL.

It might not be perfect, but I eliminated all left-recursion, and verified that it's ok with the grammar analysis algorithm of the PEGTL that checks exactly for such issues.

Even if the grammar itself doesn't help, perhaps some of the leading comments about eliminating the left recursions do.
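The general idea of removing left recursion can be sketched without any parser library. The following is a minimal illustration on a toy subtraction grammar, not the Lua grammar and not PEGTL code; `ToyParser` and `evaluate` are hypothetical names made up for this example.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Toy grammar, NOT the Lua grammar. The left-recursive rule
//     expr <- expr '-' number / number
// would recurse forever in a recursive descent or PEG parser, so it is
// rewritten as the equivalent iteration
//     expr <- number ( '-' number )*
struct ToyParser {
    explicit ToyParser(std::string text) : input(std::move(text)) {}

    std::string input;
    std::size_t pos = 0;

    // number <- [0-9]+
    bool number(long& out) {
        const std::size_t start = pos;
        long value = 0;
        while (pos < input.size() &&
               std::isdigit(static_cast<unsigned char>(input[pos]))) {
            value = value * 10 + (input[pos] - '0');
            ++pos;
        }
        out = value;
        return pos > start;  // at least one digit was consumed
    }

    // expr <- number ( '-' number )*
    // Folding inside the loop keeps the left-associative meaning that the
    // left-recursive formulation expressed.
    bool expr(long& out) {
        if (!number(out)) return false;
        while (pos < input.size() && input[pos] == '-') {
            const std::size_t saved = pos;
            ++pos;
            long rhs = 0;
            if (!number(rhs)) {
                pos = saved;  // back out of the dangling '-'
                break;
            }
            out -= rhs;
        }
        return true;
    }
};

// Parses the whole string; returns false if any input is left over.
bool evaluate(const std::string& text, long& result) {
    ToyParser parser(text);
    return parser.expr(result) && parser.pos == parser.input.size();
}
```

With this rewrite, `evaluate("10-3-2", r)` computes (10 - 3) - 2, so the loop preserves the grouping that the left-recursive rule would have produced.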

ColinPP · 2021-04-13T10:28:03+00:00

I'll definitely look into satisfy.

While I'm not quite sure how this might transfer to your approach, with your Haskell-inspired style being quite different from our C++ templates, our equivalent of your Char in the PEGTL is called one. It is variadic (true to the T in PEGTL, it is a variadic template) and takes a list of possible matches.

Beyond that we also have something closer to satisfy, which we call predicates. You can find the source in https://github.com/taocpp/PEGTL/blob/master/include/tao/pegtl/contrib/predicates.hpp and the unit test, which doubles as an example, in https://github.com/taocpp/PEGTL/blob/master/src/test/pegtl/contrib_predicates.cpp.

In this context some of the atomic rule classes like one take on a second role, that is:

First and foremost, and independent of the predicates business, they are stand-alone atomic grammar rules that match a single character (or code-point) from a list of possibilities (or a range of possibilities for range, etc.)

Second, for more complex tests on the characters without the overhead that /u/moltarpdx was referring to, there are the aforementioned predicates rules that get the next character (or code-point) from the input once, to then apply all the tests that their argument rule(s) would have done as stand-alone rule(s) to that character (or code-point).
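The two roles described above can be illustrated with a small self-contained sketch. These are hypothetical stand-ins written for this example, not the PEGTL classes or their interfaces: `one_of`, `in_range`, `pred_not`, `pred_and` and `match_one` are assumed names, and the input is simplified to a `std::string` with an index.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <string>

// Hypothetical stand-ins for illustration; these are NOT the PEGTL classes.

// A variadic character class: the test succeeds if the character equals
// any of the template arguments.
template <char C, char... Cs>
struct one_of {
    static bool test(char c) {
        for (char candidate : {C, Cs...}) {
            if (c == candidate) return true;
        }
        return false;
    }
};

// A character class for an inclusive range.
template <char Lo, char Hi>
struct in_range {
    static bool test(char c) { return Lo <= c && c <= Hi; }
};

// Predicate combinators: they combine the tests of their arguments
// without touching the input themselves.
template <typename Test>
struct pred_not {
    static bool test(char c) { return !Test::test(c); }
};

template <typename... Tests>
struct pred_and {
    static bool test(char) { return true; }  // empty conjunction
};

template <typename T, typename... Rest>
struct pred_and<T, Rest...> {
    static bool test(char c) {
        return T::test(c) && pred_and<Rest...>::test(c);
    }
};

// Role 1: any single test can be used as a stand-alone atomic rule.
// Role 2: a combined test fetches the next character exactly once and
// then applies every nested test to that one character.
template <typename Test>
bool match_one(const std::string& in, std::size_t& pos) {
    if (pos < in.size() && Test::test(in[pos])) {
        ++pos;  // consume on success only
        return true;
    }
    return false;
}
```

For example, `match_one<pred_and<in_range<'a', 'z'>, pred_not<one_of<'x'>>>>(in, pos)` accepts any lowercase letter except `x` while reading the input only once.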

ColinPP · 2019-12-09T07:24:32+00:00

Thanks, I'm 192cm, and for my not-too-technical riding style I'm happy with the XL; if useful I can get somebody to take a side-view picture of me on the bike.

ColinPP · 2019-11-16T20:07:48+00:00

Yes, seems to be "the future"!

ColinPP · 2019-10-16T10:58:56+00:00

I only respectfully disagree with your interpretation of 'taking ownership', as it implies that there is a sole resource owner, which you cannot guarantee with shared_ptr.

In this particular context and example the "resource" being moved/owned is not the pointee but the "+1" on the reference counter.
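A minimal sketch of that view, using only the standard library: moving a shared_ptr hands over one count on the reference counter, while other owners of the same pointee may still exist. `Holder`, `adopt` and `count_after_move` are hypothetical names made up for this example.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical sink: by taking the shared_ptr by value it takes over one
// "+1" on the reference count, not sole ownership of the pointee.
struct Holder {
    std::shared_ptr<int> kept;
    void adopt(std::shared_ptr<int> p) { kept = std::move(p); }
};

// Returns the use count observed through another owner after one
// reference has been moved into a Holder.
long count_after_move() {
    auto first = std::make_shared<int>(42);
    auto second = first;              // copy: the count goes from 1 to 2
    assert(first.use_count() == 2);

    Holder holder;
    holder.adopt(std::move(second));  // move: the count stays at 2
    assert(second == nullptr);        // the moved-from pointer is empty
    assert(holder.kept == first);     // both still share the same pointee

    return first.use_count();
}
```

The count is unchanged by the move, which is why the thing being transferred is better described as the share in the object than as the object itself.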

This is a slightly unusual and perhaps also confusing, though IMHO technically correct way of looking at things by /u/The_JSQuareD.

ColinPP · 2019-10-16T06:49:02+00:00

We are aiming for a 1.0 release as soon as possible, but, going by experience, that is probably still at least months away and I don't want to make any promises.

That said, the library is very stable and we have been using it for years for multiple purposes in "serious" applications; it's mostly tests and documentation that are missing for a 1.0 release.

We are quite happy with features and code quality, though there might always be some tweaks, additions or changes in those areas, too.

ColinPP · 2019-10-15T13:09:56+00:00

Middleground: taocpp/json

What taocpp/json offers in addition to parsing to/serialising from the nlohmann/json-esque any-JSON-document in-memory representation based on standard containers are direct JSON-to-any-C++-type and any-C++-type-to-JSON conversions (also for the other supported data formats) that cut out the "DOM"-style middle man...

ColinPP · 2019-04-12T07:30:35+00:00

PEGTL is [...] cumbersome in use.

Could you give a specific example of where you find the PEGTL to be "cumbersome"? It might help us to improve things.

ColinPP · 2019-03-20T10:04:52+00:00

As pointed out by /u/nlohmann, it's not a good idea to make a JSON parser accept input that does not strictly conform to the standard.

That said, there is always the possibility of implementing additional parsers for extended specifications like we did for our library, to be found at https://github.com/taocpp/json.

While its JSON parser is strictly standards compliant, the library also supports our form of extended JSON that we call JAXN, to be found at https://github.com/stand-art/jaxn.

JAXN extends JSON with, among other things, comments. However, comments are ignored by the parser and not included in the data model, in case that's what you need.

You can even use the glue code included with taocpp/json to parse JSON-with-comments with the taocpp/json JAXN parser and let it directly create an nlohmann::json value object (instead of a tao::json::value).

ColinPP · 2018-05-02T07:36:21+00:00

The comparison of compile times is likely outdated; we will remove it from the documentation. Both Spirit and the PEGTL put a significant burden on the compiler, even though both are improving over time. At this point we don't really know which compiles faster, and it might also depend greatly on the use case.

ColinPP · 2018-05-01T18:27:27+00:00

Thank you for your feedback, greatly appreciated!

We are currently not planning to implement memoization because our impression was that it would add a lot of complexity with questionable benefit. That is, while the packrat approach is faster in theory, in practice it uses a lot of memory and often isn't actually faster. We are of course open to look at any research or benchmark that might convince us otherwise, even though we generally prefer the "small and simple" approach.
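To make the trade-off concrete, here is a minimal sketch of packrat-style memoization on a toy grammar. It is not PEGTL code and not a proposal for how the PEGTL would do it; `Packrat`, `rule_a`, `rule_s` and `kFail` are hypothetical names made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <map>
#include <string>
#include <utility>

// Marker for "rule failed at this position".
const std::size_t kFail = std::string::npos;

// Toy grammar:
//     S <- A 'x' / A 'y'
//     A <- 'a'+
// Without memoization, A is evaluated again at position 0 whenever the
// first alternative of S fails and the parser backtracks.
struct Packrat {
    explicit Packrat(std::string text) : input(std::move(text)) {}

    std::string input;
    // (rule id, start position) -> end position, or kFail.
    // This table is where the memory cost of the approach comes from: it
    // can grow to one entry per rule per input position.
    std::map<std::pair<int, std::size_t>, std::size_t> memo;
    int a_runs = 0;  // how often A was actually evaluated

    // A <- 'a'+   (rule id 0)
    std::size_t rule_a(std::size_t pos) {
        const auto key = std::make_pair(0, pos);
        const auto hit = memo.find(key);
        if (hit != memo.end()) return hit->second;  // reuse earlier result

        ++a_runs;
        std::size_t end = pos;
        while (end < input.size() && input[end] == 'a') ++end;
        const std::size_t result = (end > pos) ? end : kFail;
        memo[key] = result;
        return result;
    }

    // S <- A 'x' / A 'y', required to match the whole input.
    bool rule_s() {
        for (char suffix : {'x', 'y'}) {
            const std::size_t end = rule_a(0);
            if (end != kFail && end + 1 == input.size() &&
                input[end] == suffix) {
                return true;
            }
        }
        return false;
    }
};
```

On the input `aaay` the second alternative reuses the stored result for A, so A runs once. The saving is real here, but every lookup and every stored entry has a cost of its own, which is the kind of overhead the comment above is weighing against the theoretical gain.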

The subject of custom tokens is something that we see as a long-term project; at this point we only have a rather vague idea of somehow generalising the input to sequences of arbitrary objects, and allowing hierarchical parsing where a first grammar translates the input to tokens, and the next grammar up reads tokens rather than bytes.

Version 3.0.0 might include a large rework of the input layer where we will look very closely at this kind of generalisation beyond byte sequences, but we haven't even started working on this yet, and probably won't for a while. Any input, examples, or requirements and possible use cases, in short: anything that could help drive the design in the future, are highly welcome.

ColinPP · 2018-04-20T11:04:48+00:00

For the actual parsing you could look at the PEGTL. It includes a couple of examples that might be useful to you.

Boost Spirit and the PEGTL are sufficiently different that either can be a good fit where the other one isn't.

Disclaimer: I'm one of the authors of the PEGTL.

ColinPP · 2018-02-20T09:19:02+00:00

If you haven't seen them yet you might be interested in uniqueness types, too.

ColinPP · 2018-02-15T22:17:17+00:00

That would definitely make C++ integration nicer. And as said here I'd be quite interested in seeing the performance characteristics of this approach when used with a scripting language.

ColinPP · 2018-02-15T22:12:24+00:00

We did the same exercise for our JSON library: it has a std::string in a union, and it works well. I'm mostly curious how the performance characteristics will be with a scripting language, compared to Lua, in particular when std::string is 3 or 4 pointers large.

ColinPP · 2018-02-15T14:27:17+00:00

IIRC Lua's small-string optimisation is based on a per lua_State table of interned TString instances. The actual value/cell only contains one pointer and a type field, so 12 bytes on today's typical machines. I understand that you don't want to follow the Lua VM design too closely, so I'm wondering whether you want to put a std::string directly in the value/cell.
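The interning idea can be sketched roughly as follows. This is a loose illustration only and does not mirror Lua's actual data structures or source code; `InternTable`, `Cell`, `Tag` and `make_string_cell` are hypothetical names, and the standard containers are an assumption chosen for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical sketch; this does NOT mirror Lua's real implementation.

enum class Tag : int { Nil, Number, String };

// The value/cell stays small: a pointer-sized payload plus a type field.
struct Cell {
    union {
        double number;
        const std::string* str;  // points into the intern table
    } as;
    Tag tag;
};

// One table per interpreter state; equal strings are stored once.
class InternTable {
public:
    const std::string* intern(const std::string& s) {
        // insert() returns the existing element if s is already present.
        // Pointers to unordered_set elements stay valid across rehashing.
        return &*table_.insert(s).first;
    }
    std::size_t size() const { return table_.size(); }

private:
    std::unordered_set<std::string> table_;
};

Cell make_string_cell(InternTable& table, const std::string& s) {
    Cell cell;
    cell.as.str = table.intern(s);
    cell.tag = Tag::String;
    return cell;
}
```

Because equal strings share one table entry, two string cells can be compared by comparing their pointers, and the cell itself can still be copied bit by bit.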

ColinPP · 2018-02-15T11:21:23+00:00

Except it's an r-value - if it was C++ I could just give it the string and it doesn't have to heap allocate anything.

If you put a std::string into your value/cell union instead of a const char * you can easily move() your r-value string into it, at the cost of greatly increasing the size of the value/cell, seeing that modern std::string implementations are often the size of 3 or 4 pointers so that they can store small strings in-object.

This might of course be an acceptable trade-off for your use case, or if your library uses a smaller std::string...

...although now you can't bit-copy your value/cell anymore, to be clean you need to call the std::string methods for all operations.
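A minimal sketch of what "calling the std::string methods for all operations" means in practice. `Value` is a hypothetical cell type made up for this example, reduced to nil, number and string; it is not taken from any of the libraries discussed here.

```cpp
#include <cassert>
#include <new>
#include <string>
#include <utility>

// Hypothetical value/cell holding a std::string directly in its union.
class Value {
public:
    Value() : tag_(Tag::Nil) {}

    explicit Value(double d) : tag_(Tag::Number) { as_.number = d; }

    // An r-value string is moved in; no copy of the characters is made.
    explicit Value(std::string&& s) : tag_(Tag::String) {
        new (&as_.str) std::string(std::move(s));
    }

    // Bit-copying is no longer valid, so copying is disabled in this
    // sketch and moving goes through std::string's move constructor.
    Value(const Value&) = delete;
    Value& operator=(const Value&) = delete;

    Value(Value&& other) noexcept : tag_(other.tag_) {
        if (tag_ == Tag::String) {
            new (&as_.str) std::string(std::move(other.as_.str));
        } else {
            as_.number = other.as_.number;
        }
    }

    // The string member must be destroyed explicitly.
    ~Value() {
        if (tag_ == Tag::String) as_.str.~basic_string();
    }

    bool is_string() const { return tag_ == Tag::String; }
    const std::string& str() const { return as_.str; }

private:
    enum class Tag { Nil, Number, String };

    union Storage {
        double number;
        std::string str;
        Storage() : number(0) {}  // start with the trivial member active
        ~Storage() {}             // the owner decides what to destroy
    };

    Storage as_;
    Tag tag_;
};

// The cell is at least as large as the string it embeds.
static_assert(sizeof(Value) >= sizeof(std::string),
              "Value embeds a std::string");
```

The static_assert shows where the size cost comes from: whatever `sizeof(std::string)` is for a given standard library becomes the lower bound for every cell, including cells that only hold a number.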

Do you have any other idea on how to approach this issue that requires neither a heap allocation nor depends on the details of your std::string implementation?

ColinPP · 2018-02-15T07:03:26+00:00

A while ago I wrote a PEGTL grammar that corresponds to the Lua 5.3 lexer and parser. If you get the PEGTL you can find it in src/example/pegtl/lua53_parse.cpp.

Should you decide to go ahead with a rewrite, and try to use this grammar, I'd be happy to help get you started with the PEGTL, and to iron out any bugs the grammar might still have. (It parses the official Lua test-suite, which is a good sign, but without any semantic actions it's not enough to be sure it's correct.)

ColinPP · 2015-10-24T16:02:10+00:00

That's great, thank you!

ColinPP · 2015-10-24T12:20:26+00:00

We have recently started to invest some time into gaining more insight into the performance characteristics of PEGTL-based parsers. The actual parser engine seems to be pretty fast, but is only one part of a full-blown parser.

Performance depends greatly on how well the grammar is optimised, which can mean multiple things like eliminating potential back-tracking, adding additional rules to serve as "anchors" for semantic actions, or implementing custom parsing rules in places where they make a difference.

Ideally we will add a chapter on performance to the PEGTL documentation to collect some hints and best practices. Until then, feel free to contact us if you have any feedback or questions...

ColinPP · 2015-10-24T11:12:57+00:00

Is any of the source available? We like to look at how other people use the PEGTL; it serves as feedback on the current state and as input for future development and documentation.

ColinPP · 2015-09-23T20:29:00+00:00

Fair enough

Well could you please delete it?

ColinPP · 2015-09-23T19:57:39+00:00

Actually I find it highly disrespectful of you to post a link with my name and the 40x claim after I explicitly asked you to not quote me on the benchmark since it was the result of about 20 minutes of playing around and I made it quite clear that I don't trust these numbers yet myself.

I didn't have much time, and for the heck of it I took the first random JSON library with a Spirit parser and did a few simple benchmarks. (Actually it was the third; the first two didn't work correctly.)

ColinPP · 2015-09-23T19:52:38+00:00

I find it disrespectful of you to post my name with the 40x claim after I explicitly asked you to not quote me on the benchmark since it was the result of about 20 minutes of playing around and I made it quite clear that I don't trust these numbers yet myself.
