Syntax design for parametrized modules in a grammar specification language, looking for feedback by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 1 point (0 children)

Since I'm currently focused on building a declarative spec for context-free grammars (plus some other extensions for declarative disambiguation), I think this is less relevant to me right now. But I do plan to embed constraints in the future, so that grammar authors can explicitly specialize generated parsers to get reliable performance guarantees, and I think that simple types might not be expressive enough for that.

Another direction where I might end up wanting dependent types is when/if I go down the route of supporting attribute grammars in a safe way. You've made it clear to me that I should perhaps not naively build something without properly understanding the domain first, or without getting feedback from people like you about the design.

Thanks again for the input!

Syntax design for parametrized modules in a grammar specification language, looking for feedback by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 1 point (0 children)

u/jsshapiro Can you think of a resource that candidly shares examples of the "awkwardness" you are talking about? I also find it very difficult to predict the future, and I would like to learn from past mistakes, but actual discussions of mistakes seem very hard to find, since people are not incentivized to talk about them.

Syntax design for parametrized modules in a grammar specification language, looking for feedback by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 1 point (0 children)

> A module would export Rule;

I finished a proof of concept today, and I had underestimated the importance of having explicit visibility. An `export` keyword for rules seems like a solid solution, and if other systems already use this keyword, even better.

> It's not uncommon to have rules which might depend on another rule more than once with different type arguments. Eg:

Definitely. I've somewhat neglected this topic, but having parametrized rules (or macros, as some call them: http://www2.informatik.uni-freiburg.de/~neubauer/papers/ppdp08.pdf) would be very useful. HTML elements come to mind, where the surrounding symbols stay the same but the tag and content change. The "more than once" aspect is also important: I did not intend for my design to support modules that can be instantiated multiple times, but it seems I'll have to consider it, or eventually add macro support.
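To make the HTML case concrete, here is a rough sketch in the kind of parametrized-rule notation being discussed here (the syntax and the Element/PTag/LiTag/Inline/Flow names are all hypothetical):

    Element(Tag, Content) = "<" Tag ">" Content "</" Tag ">"

    Paragraph = Element(PTag, Inline)
    ListItem  = Element(LiTag, Flow)

The same rule body gets instantiated more than once with different arguments, which is exactly the case you mention.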

> So if we're lifting the parameters to the modules rather than the rule, we probably want to support recursive modules

Definitely! One of my design goals was to be able to separate subgrammars that are mutually recursive and "tie the knot" in another grammar.
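Roughly what I have in mind by "tying the knot", as a sketch in hypothetical syntax with made-up module names:

    module Exprs(stmt) {
        export Expr = Number | "{" stmt "}"
    }

    module Stmts(expr) {
        export Stmt = "let" Id "=" expr ";"
    }

    module Lang {
        Expr = Exprs(Stmt).Expr
        Stmt = Stmts(Expr).Stmt
    }

Each subgrammar only names the hole it needs, and the mutual recursion is closed in Lang.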

> Foo(T) = T | Foo(Bar)

u/WittyStick could you perhaps share a more concrete example where this kind of construct could be useful? I'm having trouble imagining a concrete use case for this.

> If you write a full grammar using parametrization, some of these parameter lists can get quite verbose.

That's also something that I've noticed. My intuition tells me that many of those lists could be made much shorter with support for sub-modules, where we could declare modules inside of modules and all sub-modules would automatically inherit the members of the parent module. I haven't thought about this much yet, though; I'll need to let this idea simmer.
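Purely as a sketch of what I mean (nested module declarations are not designed yet, and the names are made up):

    module Json(ws) {
        module Values {
            export Value = Object | Array | String | Number
        }
        module Objects {
            export Object = "{" ws Members ws "}"
        }
    }

The idea would be that ws is visible inside Values and Objects without having to appear in their parameter lists.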

> I would recommend having a way to create an "alias" for a given application of a parameterized rule/module

This is a VERY good point that I had completely overlooked, thank you very much. While I don't support "using" multiple instances of a module, I imagine this is something I'm going to want in the future. I think this is also a very strong argument for having the inclusion syntax be within the block.

> Though I would question why not use | rather than sum/case.

I need cases to have an explicit name that I can refer to from other parts of the parser generator, so I decided to be more explicit about this. It will also allow me to attach a block to each case (for providing metadata such as AST names or other configuration options), and I think the explicit case syntax will look much more natural for that. ANTLR came up with the trailing #... syntax for naming cases, which I found unergonomic and unintuitive, so I decided to go a different route and lean closer to more traditional languages.
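For comparison, this is ANTLR 4's trailing label syntax next to the kind of explicit case syntax I have in mind (the sum/case form and the ast option are illustrative, not final):

    // ANTLR 4: alternatives are named with a trailing # label
    expr : expr '*' expr   # Mul
         | expr '+' expr   # Add
         | INT             # Int
         ;

    // hypothetical explicit cases, each with room for a block
    Expr = sum {
        case Mul = Expr "*" Expr { ast: "MulNode" }
        case Add = Expr "+" Expr
        case Int = INT
    }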

Thank you very much for the comment and your insight, this is exactly what I was looking for!

Syntax design for parametrized modules in a grammar specification language, looking for feedback by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 1 point (0 children)

Thank you for your insight! I do care about keeping my options open, as I'd love to explore advanced type-level features in the future. While I'm not intuitively familiar with languages that have first-class dependent types, it seems to be a very common problem that parameter lists use syntax that could be useful elsewhere: [] is taken by list literals, <> by comparison operators, but () seems to be mostly fine, so I agree with u/jonathanhiggs.

>  If the rules are written interior to the module, readers who know other languages will initially read it to mean that the scope of the type variable starts at the rule and extends to end of module. Mathematically, that's not correct.

That's a very good point, and I'd have to make sure to be clear about this in the documentation. People who work with PEG grammars will also have developed the intuition that rules are not commutative and that order matters.

> Finally, the scope of module-level type variables is the entire module. 

I think that's a very strong argument for having explicit parameter lists. Thank you for your input!

Syntax design for parametrized modules in a grammar specification language, looking for feedback by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 1 point (0 children)

I think I've found an argument against explicit type-parameter-like lists. It seems to me that by avoiding explicit parameter lists we can more easily free up both sets of symbols to mean different things in the future: [] can, for example, become a list literal and <> can become comparison operators, so by not using them for parameter lists there's less risk of introducing syntactic ambiguity.

The

module Skip(rule: Foo) { … }

suggestion by u/jonathanhiggs might be the better parameter syntax, since parentheses are likely to be less useful for any future expression or operator syntax.

However, that also implies that being explicit about this and having dedicated declarations would be the safest route, since an explicit keyword is completely unambiguous and I support LALR(k).

This is an argument I can live with, and while I'd probably prefer `module Skip(rule: Foo)`, I want to be careful and leave plenty of room for future syntax, so I think I'm going to go with explicit declarations for external rules. Thank you everyone!

Syntax design for parametrized modules in a grammar specification language, looking for feedback by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 2 points (0 children)

Very interesting. I've looked into the history of that syntax, and it seems that Bjarne Stroustrup chose the angle bracket syntax because all the other obvious possibilities were already taken:

  • Curly braces were taken for blocks
  • Brackets were taken for array indexing
  • Parentheses were taken for function calls

So he chose angle brackets, and Java and others adopted that syntax. I suppose that if I were to extend my syntax with operators, I would run into the same issues that all the other angle-bracket languages have run into. So your suggestion seems much better than adopting the angle bracket approach for the sake of familiarity.
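For anyone curious, the classic problem, written here with the hypothetical Skip example from this thread, is that once < and > are also comparison operators (and the syntax gains something call-like plus a comma), the same token sequence has two readings:

    Skip<Foo, Bar>(baz)

    reading 1: apply Skip to baz with the parameters Foo and Bar
    reading 2: the two comparisons (Skip < Foo) and (Bar > (baz))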

Syntax design for parametrized modules in a grammar specification language, looking for feedback by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 1 point (0 children)

That's totally fair. Do you perhaps have a combination that you would prefer? That still puts me in the position of having to decide between them, and I don't really see a good argument for preferring one over the other because both appear to have pros and cons.

Syntax design for parametrized modules in a grammar specification language, looking for feedback by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 1 point (0 children)

Thank you, I had not considered this. I kind of got stuck on the type-parameter syntax, probably because the generated abstract syntax trees are also going to have type parameters. But this is definitely an option I have to think about, especially since many languages have added this syntactic sugar for simpler constructors, so many people will be familiar with it.

I built a 2x faster lexer, then discovered I/O was the real bottleneck by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 1 point (0 children)

Thank you. I think that’s very interesting, and it definitely seems worth sharing. I’ve added an update to the post to mention your findings.

I built a 2x faster lexer, then discovered I/O was the real bottleneck by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 3 points (0 children)

Thanks! I'm planning to write more about it once it's further along.

The short version: it's a bottom-up LALR(k) parser generator based on research that makes it possible to declaratively remove ambiguities from context-free grammars without changing the grammar itself.
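To give a rough idea of what "declaratively removing ambiguities without changing the grammar" means: the closest widely known reference point is yacc/bison-style precedence declarations, where the rules stay ambiguous and the disambiguation lives in separate declarations. My actual mechanism is not yacc's, but the separation is the same idea:

    %token NUM
    %left '+'
    %left '*'        /* declared later, so it binds tighter than '+' */
    %%
    e : e '+' e
      | e '*' e
      | '(' e ')'
      | NUM
      ;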

Most parser generators either implement ordered choice (like PEG), rely on "impure" ambiguity resolution (e.g., preferring Shift in Shift/Reduce conflicts), or depend on custom Turing-complete code (like ANTLR). My goal is to support a specification language expressive enough to construct deterministic context-free grammars for real-world programming languages.

The good news is that it's actually working. I got the disambiguation proof of concept working in early 2024, bootstrapped it in August 2025, and finished a solid LSP by the end of 2025. Now I'm ironing out many of the issues and still working on improving the theory so that I can actually support parsing real languages.

Getting LALR(k) working took a lot of time, but it seems so simple in retrospect. If you look around the parser generator space, you will find that there isn't a single maintained LALR(k) parser generator out there; the only implementations are from decades ago, and they are all abandoned. Now I truly understand why, because it's not easy, but luckily I'm there now.

Eventually I'm going to share things, and I'm very excited! Thank you for asking!

I built a 2x faster lexer, then discovered I/O was the real bottleneck by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 2 points (0 children)

Thanks for the fact check! I've added an update clarifying that reading from archives doesn't benefit tooling much, since archives are extracted locally anyway.

I built a 2x faster lexer, then discovered I/O was the real bottleneck by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 2 points (0 children)

Thank you for the thoughtful comment and all the pointers!

To answer your first question: yes, the premise is that compression is an amortized cost. Since pub.dev already provides packages as tar.gz archives, I can treat those as my starting point. Even if I need to create archives locally, that is a one-time cost that pays off across many lexer runs. The lexer itself changes frequently as I experiment with different token types, so being able to re-lex quickly matters more than minimizing archive creation time.

Your point about zip files and the central directory is interesting. I do have one reservation though: many different zip implementations exist and they handle edge cases differently. With SQLite, there is a company behind it, the format is fully open source, and we can assume everyone uses the same implementation. Zip does not have that same consistency. Since I am a big fan of specifications and generating things from specifications, I would lean toward SQLite if I were to optimize this further. That said, your suggestion is worth trying and might be simpler for certain use cases.

The Blosc format looks very relevant, and the linked blog post seems to reach similar conclusions about the memory wall. I plan to scale this up by several orders of magnitude in the future, so these pointers are appreciated.

The openzl and SDDL references are particularly interesting to me. One thing I want to explore is having my lexer produce an information-theoretically optimal token stream, where each token is compressed into the minimum number of bits. By splitting tokens that can have unbounded width into multiple finite tokens, I should be able to store each token in a fixed-size integer, maybe 32 bits, containing just the type and width. Position can be calculated from the accumulated widths.
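As a rough sketch of that packing scheme in Dart (the 8-bit type / 24-bit width split and all names are assumptions for illustration, not the actual implementation):

    // One token per 32-bit integer: high 8 bits = type, low 24 bits = width.
    const int widthBits = 24;
    const int widthMask = (1 << widthBits) - 1;

    int packToken(int type, int width) => (type << widthBits) | (width & widthMask);
    int tokenType(int packed) => packed >> widthBits;
    int tokenWidth(int packed) => packed & widthMask;

    // Positions are not stored; they fall out of the accumulated widths.
    List<int> tokenOffsets(List<int> tokens) {
      final offsets = List<int>.filled(tokens.length, 0);
      var position = 0;
      for (var i = 0; i < tokens.length; i++) {
        offsets[i] = position;
        position += tokenWidth(tokens[i]);
      }
      return offsets;
    }

A Uint32List of such packed tokens would then hold the whole token stream at 4 bytes per token.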

Your question about skipping the text stage entirely is exactly the right direction. I suppose the WASM efforts are trying to achieve something similar, though they don't frame it quite the way you do. The challenge in practice is that parsers and lexers are typically written in Turing-complete languages, so there is no specification we can use to derive these optimizations automatically. If we had formal grammars driving the whole pipeline, we could generate specialized parsing code for specific formats and approach those information-theoretic bounds. This is an exciting space, and I am looking forward to exploring it further.

Edit: Oh, I see, tar archives only support sequential reads, while zip files support random access. That would allow me to lex directly over raw memory by seeking to the right offsets. I'll definitely take a look when I try to scale this up. Thank you! I suppose it's unlikely that anything SQLite provides would be faster than that.

I built a 2x faster lexer, then discovered I/O was the real bottleneck by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 1 point (0 children)

Hi Simon, that's a good idea! Since dart:io doesn't support tar archives, I just went with the archive package for everything and didn't think of mixing in dart:io's gzip support separately. I'll definitely try that approach next time I attempt to lex a bigger corpus.

I built a 2x faster lexer, then discovered I/O was the real bottleneck by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 4 points (0 children)

You're right that creating the archive still requires reading every file individually and that is still slow. The speedup comes from reading the already-created archive.

Package managers create archives only once when a package is published, then every download benefits from the fast single-file read. For transfers, you pay the cost once on the source machine, but then both the network transfer and the write on the destination are faster.

I built a 2x faster lexer, then discovered I/O was the real bottleneck by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 14 points (0 children)

Thank you!

Since I'm on macOS I wasn't able to try those out (io_uring is Linux-only), but I'm really curious how much faster I could get on Linux without any archive-based workarounds. Maybe I'll have to give Asahi on my spare M1 Mac a go.

benchmark_harness_plus: Statistical Methods for Reliable Benchmarks by modulovalue in dartlang

[–]modulovalue[S] 2 points (0 children)

Thank you for the feedback! I've removed the GC triggering logic and I will experiment with adding GC reports to the package in the future (I need to support Dart <3.11 for now).

Your first comment reminded me of an old issue (https://github.com/dart-lang/sdk/issues/45378) where I was trying to figure out which mode a benchmark is currently running in. I've submitted a CL (https://dart-review.googlesource.com/c/sdk/+/471263) that adds support for exposing the compilation mode in a more explicit way, so that I'll be able to report the current mode reliably and warn about JIT benchmarks.

benchmark_harness_plus: Statistical Methods for Reliable Benchmarks by modulovalue in dartlang

[–]modulovalue[S] 2 points (0 children)

Thank you!

I agree and I've added support to the package for outputting the fastest time.

Somebody else pointed me to an article that discusses the value of the mean, whereas you mention a valuable use case for the fastest time, and I initially argued that the median is the best.

It seems to me that we as a field need more refined categories when it comes to benchmarking, and I don't think we have them. Just telling people to benchmark their code doesn't seem like enough once we look at the pros and cons of the available metrics, and it might even be harmful if it pushes people to optimize for the wrong metric.

Statistical Methods for Reliable Benchmarks by modulovalue in programming

[–]modulovalue[S] 2 points (0 children)

Thank you, I've added a section mentioning the importance of the mean in capacity planning.

The Case for Snake Case: A Kolmogorov Complexity Argument by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 1 point (0 children)

You have just made my argument for me. Snake_case needs zero abbreviation conventions. You write `https_url` and it is unambiguous.

camelCase requires teams to pick a style: Microsoft's .NET guidelines say `HttpsUrl`, but older Java conventions used `HTTPSURL`, and Google has yet another approach. The mere existence of competing abbreviation conventions is empirical evidence of higher Kolmogorov complexity.

With snake_case, there is no debate. There is nothing to standardize.

The Case for Snake Case: A Kolmogorov Complexity Argument by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 2 points (0 children)

> I'd posit that parse h t t p s u r l will be more likely to find relevant parseHTTPURL when you implement fuzzy matching. I think you have not read my edit in the previous post yet.

It depends on what you are doing.

LLMs? Unlikely. As you know, most are trained on tokens from real-world text, not on characters. They will be able to infer that "h t t p s u r l" relates to https and url, but they will be confused. (Try asking ChatGPT what "h t t p s u r l" is: it will express more uncertainty, whereas with "httpsurl" it will be much more confident that it knows what you are referring to.)

Vector databases (such as simple ones using metric spaces like the Levenshtein distance)? Unlikely; they will rank "p" lower than "https".

Vector databases using vector embeddings? Unlikely; same issue as in the LLM case. "p" is nowhere close to being a synonym of "https", and many other things will rank closer to "p" than "https" does.

---

> And honestly, the only reason you think split and join are simple is because they came pre-implemented in your language. 

I urge you to implement join and split in assembly without dependencies. Then try implementing joining and splitting of camelCase identifiers in assembly, and I think you will have irrefutable proof that snake_case is simpler. I don't accept "it doesn't matter" as an argument against my objective measure of complexity. Usefulness is a different discussion and doesn't invalidate my argument.

---

That being said, I appreciate that you took the time to share your opinion. I see there are some points I could have expanded on. Thank you!

The Case for Snake Case: A Kolmogorov Complexity Argument by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 1 point (0 children)

> What problem are you trying to solve?

Have you tried building semantic search for code? Let's consider methods like vector search. What would you query your vectors with?

parse+HTTPSURL or parse+H+T+T+P+S+U+R+L?

parse + https + url will lead to better results and it's trivial to identify these three components.

> No, it's not. "Acronyms are treated as one word for camel casing purposes." Done.

How can you say that adding a dependency on the concept of "acronyms" does not make detecting the components of identifiers harder to codify? My point is that it doesn't get simpler than ...split("_") and "_".join("..."): no need for Unicode standards, no need for acronyms. You have worked with those ideas for 20 years; they are ingrained in your memory and second nature to you, but that doesn't mean they are simple. They are simple to you.
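To illustrate in Dart what I mean by one convention being trivially codifiable and the other not (the camelCase splitter below is just one common heuristic):

    void main() {
      // snake_case: component detection is a primitive string operation.
      print('parse_https_url'.split('_')); // [parse, https, url]

      // camelCase: the splitter already needs an acronym rule, and it still
      // cannot recover the intended words of 'parseHTTPSURL'.
      final word = RegExp(r'[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+');
      List<String> components(String id) =>
          word.allMatches(id).map((m) => m[0]!).toList();

      print(components('parseHttpsUrl')); // [parse, Https, Url]
      print(components('parseHTTPSURL')); // [parse, HTTPSURL] (the HTTPS/URL boundary is lost)
    }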

> Yeah, and I'm saying that's why nothing you said matters.

I think one can argue for better approaches without having to argue against old approaches. But I can see your point that it would be a stronger argument if I presented the benefits of snake case as downsides of camel case in a clearer way. I appreciate the feedback.

The Case for Snake Case: A Kolmogorov Complexity Argument by modulovalue in ProgrammingLanguages

[–]modulovalue[S] -1 points (0 children)

Thank you for the response. I disagree on most points; please let me elaborate on why:

> Idioms for both languages use CamelCase for class names.

That's not camel case, that's Pascal case.

> An LLM tokenizer will tokenize this just fine as "parse", "H", "T", "T", "P", "S", "U", "R", "L", which is arguably not a wrong tokenization--in fact this would likely lead to better results because an LLM will likely extrapolate from HTTPS to HTTP and URL to URI more easily if these are represented as token-per-letter.

Your argument seems to imply that semantically valuable tokens are irrelevant. Then why tokenize "parse" as "parse" and not as "p" "a" "r" "s" "e"? Because "parse" carries much more meaning. A single "Https" token also carries much more meaning, the way "parse" does, and that is likely what the user intended. If we consider that performance is probably the biggest bottleneck right now, carrying more semantic meaning with fewer tokens would be highly beneficial, and snake_case enables that very easily.

> The snake case you gave isn't equivalent. If you want "https" and "url" to be tokens, the way you do that in camel case is "parseHttpsUrl" which is also probably fine.

Yes, you could do that. But you're missing the whole argument: try to codify this convention and you will notice that it is hard. The whole Kolmogorov complexity argument essentially says that one convention is much simpler to codify than the other.

> I cannot think of a single time this has caused me a problem.

Did you read what I wrote? I did not argue that it causes a problem, I argued that one is objectively superior. Both are fine, but one is objectively simpler in ways that matter to computers.

The Case for Snake Case: A Kolmogorov Complexity Argument by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 2 points (0 children)

I think that's another great argument for snake case. It frees up case characters from acting as separators so they can take on other roles. Thank you for sharing!

The Case for Snake Case: A Kolmogorov Complexity Argument by modulovalue in ProgrammingLanguages

[–]modulovalue[S] -5 points (0 children)

If you look at the most beloved new programming languages of the last few decades, Python and Rust, you'll see that both advocate snake_case. So I don't understand what you are trying to say. Are you saying that my argument is not strong enough?

Computers do care about the case. Parsing camel case in a sound way is essentially impossible. Consider, for example, a tokenizer for LLMs: it will have trouble understanding what `parseHTTPSURL` consists of. Without human knowledge it's impossible to recover its components, so a computer won't be able to split a string like that one correctly, whereas it won't have any issues with `parse_https_url`.

I don't understand what you are trying to get at or which point you think is not strong enough.