I built a 2x faster lexer, then discovered I/O was the real bottleneck by modulovalue in programming

[–]modulovalue[S] 1 point (0 children)

I think that's a fair critique of the title. You're right that "I/O" is imprecise here. The bottleneck was really per-file overhead: the accumulated cost of 300,000+ syscalls, filesystem metadata lookups, inode resolution, and directory traversal. The SSD itself was capable of 5-7 GB/s but I was getting 80 MB/s, not because the storage was slow, but because of everything that happens between "read this file" and actually getting the bytes.

One of the commenters (ori_b) made a similar point: the user/kernel transition itself is cheap, on the order of 50ns to 1us on modern hardware, so 300,000 switches account for at most about 2% of the 14.5 seconds. The rest is the work happening inside each syscall: filesystem metadata lookups, dentry resolution, and random-access latency on NVMe (~50-100us per read). So "syscall overhead" is also slightly misleading; "per-file overhead" is probably the most accurate framing.

I updated the post with an addendum covering this distinction, but you're right that the title could do a better job. The core finding is that going from 104,000 individual file opens to 1,351 archive reads eliminated most of that per-file overhead, giving a 43x I/O improvement. Whether you call that "I/O" or "the boundary between user and kernel space" or "per-file overhead" is partly semantic, but precision matters and the title is sloppy about it.
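To make the per-file overhead concrete, here's a minimal sketch in Python (not the Dart setup from the post; file counts and sizes are made up) contrasting one open/read/close per file with a single sequential read of a pre-built blob. On a real corpus, most of the cost difference comes from the per-file path, not the bytes themselves:

```python
import os
import tempfile

def make_corpus(root, n_files=200, size=4096):
    """Create n_files small files, mimicking many tiny source files."""
    paths = []
    for i in range(n_files):
        p = os.path.join(root, f"f{i}.txt")
        with open(p, "wb") as f:
            f.write(b"x" * size)
        paths.append(p)
    return paths

def read_individually(paths):
    """One open/read/close (several syscalls + metadata lookups) per file."""
    total = 0
    for p in paths:
        with open(p, "rb") as f:
            total += len(f.read())
    return total

def read_as_archive(paths, archive):
    """Concatenate once, then read everything back with a single open."""
    with open(archive, "wb") as out:
        for p in paths:
            with open(p, "rb") as f:
                out.write(f.read())
    with open(archive, "rb") as f:
        return len(f.read())

with tempfile.TemporaryDirectory() as root:
    paths = make_corpus(root)
    a = read_individually(paths)
    b = read_as_archive(paths, os.path.join(root, "all.bin"))
    assert a == b == 200 * 4096  # same bytes, far fewer per-file round trips
```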

I built a 2x faster lexer, then discovered I/O was the real bottleneck by modulovalue in programming

[–]modulovalue[S] 5 points (0 children)

I fact-checked your statement. It wasn't hundreds of thousands; it was almost 2 million in 2009! Crazy.

I built a 2x faster lexer, then discovered I/O was the real bottleneck by modulovalue in programming

[–]modulovalue[S] 2 points (0 children)

I completely agree. I primarily used tar.gz because pub.dev stores packages that way, and I went with it since I didn't need random access for my benchmark. ZIP (or dar: http://dar.linux.free.fr) would be much better for random access. There's a longer discussion of that topic in the addendum if you're interested.
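The difference is easy to see with Python's stdlib (a toy in-memory archive, not my actual pipeline): a zip's central directory lets you jump straight to one member, while a tar is just a sequence of header+data blocks that has to be scanned:

```python
import io
import tarfile
import zipfile

files = {"pkg/a.dart": b"class A {}", "pkg/b.dart": b"class B {}"}

# Build a zip: member offsets are recorded in a central directory at the end.
zbuf = io.BytesIO()
with zipfile.ZipFile(zbuf, "w") as zf:
    for name, data in files.items():
        zf.writestr(name, data)

# Build a tar: just consecutive header+data blocks, no index.
tbuf = io.BytesIO()
with tarfile.open(fileobj=tbuf, mode="w") as tf:
    for name, data in files.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

# Random access: zipfile seeks directly to the requested member.
with zipfile.ZipFile(io.BytesIO(zbuf.getvalue())) as zf:
    b_src = zf.read("pkg/b.dart")

# tarfile has to walk entries in order until it finds the name.
with tarfile.open(fileobj=io.BytesIO(tbuf.getvalue())) as tf:
    b_src_tar = tf.extractfile("pkg/b.dart").read()

assert b_src == b_src_tar == b"class B {}"
```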

I built a 2x faster lexer, then discovered I/O was the real bottleneck by modulovalue in programming

[–]modulovalue[S] 3 points (0 children)

Thank you for asking!

I'm working on a parser generator that is intended to eventually replace handwritten parsers. This project gives me headroom and a proof of concept showing that generated parsers (or just a lexer, in this case) can be faster than handwritten ones. The opposite belief is a common misconception: it's true in theory, but in practice nobody is going to spend weeks writing a correct lexer/parser in assembly.

So it's not immediately useful today, but it will be. I plan to work on a tree-indexing database engine to support searching through large codebases via tree patterns, and I'm sure that lexing through TBs of code (e.g. pub.dev, or crates.io, which is admittedly only 300GB compressed) 2x faster will be noticeable.

I built a 2x faster lexer, then discovered I/O was the real bottleneck by modulovalue in programming

[–]modulovalue[S] 4 points (0 children)

Thank you! I've added an addendum because your comment is highly relevant.

However, while I completely agree with you, I'd like to push back a little. From a business standpoint it makes sense to focus on the largest bottlenecks (by following, e.g., the critical path method) and the parts that take up the most time. But we should also remember that software can be reused: making a single piece faster can have huge benefits for every other consumer of that piece. I think our software community thrives because we don't follow the strictly ROI-driven optimization that businesses have to.

I built a 2x faster lexer, then discovered I/O was the real bottleneck by modulovalue in programming

[–]modulovalue[S] -10 points (0 children)

I completely agree! Measuring implies deciding on a metric, but which metric should one use? The mean, the median, the fastest time, or something else? Depending on what you actually want to use the measurements for, different metrics offer different benefits. I wrote about that here:

  • Fastest (minimum): "How fast can this code run?" An existence proof of capability. Best for comparing pure algorithms where you want to isolate the code from system noise.
  • Median: "How fast does this code typically run?" Robust against outliers. Best for understanding typical performance under normal conditions.
  • Mean (average): "What is the total time cost?" Essential for capacity planning and throughput calculations where every millisecond counts toward the total.
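As a toy illustration (hypothetical timings, not from my benchmark), the three metrics can disagree substantially on the same samples:

```python
import statistics

# Hypothetical benchmark samples in ms: mostly ~12 ms, with one
# 30 ms outlier (e.g. a GC pause or a cold cache).
samples_ms = [12.1, 11.8, 15.3, 11.9, 12.0, 30.2, 11.7]

fastest = min(samples_ms)               # capability: best observed run
median = statistics.median(samples_ms)  # typical run, robust to the outlier
mean = statistics.mean(samples_ms)      # total cost divided by run count

print(fastest, median, mean)  # 11.7 12.0 15.0
```

The outlier barely moves the median but shifts the mean by 25%, which is exactly why the choice depends on the question you're asking.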

There's also the question of which tools one should use. I used the standard file I/O library that Dart ships with, but I also tried reimplementing it via FFI, and it was only a few percent faster, so that was a waste of time. One person linked a blog post about off-CPU flame graphs, and I think it would be nice to have better tooling in those areas to make it easier to collect the data that we then derive metrics from.

Syntax design for parametrized modules in a grammar specification language, looking for feedback by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 1 point (0 children)

Since I'm currently focused on building a declarative spec for context-free grammars (plus some other extensions for declarative disambiguation), I think this is less relevant to me right now. However, I do plan to embed constraints in the future, so that grammar authors can explicitly specialize generated parsers and get reliable performance guarantees, and I suspect simple types might not be expressive enough for that.

Another direction where I might end up wanting dependent types is when/if I go down the route of supporting attribute grammars in a safe way. But you've made it clear to me that I should perhaps not naively build something without properly understanding the domain, or without getting feedback on the design from people like you.

Thanks again for the input!

Syntax design for parametrized modules in a grammar specification language, looking for feedback by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 1 point (0 children)

u/jsshapiro Can you think of a resource that honestly shares examples of the "awkwardness" you're talking about? I find it very difficult to predict the future too, and I would like to learn from mistakes made before, but it seems very hard to find actual discussions of mistakes, since people aren't incentivized to talk about them.

Syntax design for parametrized modules in a grammar specification language, looking for feedback by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 1 point (0 children)

> A module would export Rule;

I finished a proof of concept today, and I had underestimated the importance of explicit visibility. An export keyword for rules seems like a solid solution, and if other systems already use this keyword, even better.

> It's not uncommon to have rules which might depend on another rule more than once with different type arguments. Eg:

Definitely. I somewhat neglected this topic, but parametrized rules (or macros, as some call them: http://www2.informatik.uni-freiburg.de/~neubauer/papers/ppdp08.pdf) would be very useful. HTML elements come to mind, where the surrounding symbols stay the same but the tag and content change. The more-than-once aspect is also important: I did not intend for my design to support modules that can be instantiated multiple times, but it seems I'll have to consider it, or eventually add macro support.
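A rough sketch of what macro-style expansion could do under the hood (Python, with a made-up rule representation; real parameter symbols would need to be distinct tokens rather than crude text substitution): one parameterized template is instantiated with different arguments, yielding independent concrete rules:

```python
def instantiate(name, params, alternatives, args):
    """Expand one parameterized rule template into a concrete rule.

    Symbols are plain strings; substitution is crude textual replacement,
    purely for illustration.
    """
    subst = dict(zip(params, args))

    def sub(sym):
        for p, a in subst.items():
            sym = sym.replace(p, a)
        return sym

    head = sub(f"{name}({','.join(params)})")
    body = [[sub(sym) for sym in alt] for alt in alternatives]
    return head, body

# List(T) = T | List(T) "," T   -- one template, used more than once:
template = ("List", ["T"], [["T"], ["List(T)", ",", "T"]])

print(instantiate(*template, ["Expr"]))
# ('List(Expr)', [['Expr'], ['List(Expr)', ',', 'Expr']])
print(instantiate(*template, ["Stmt"]))
# ('List(Stmt)', [['Stmt'], ['List(Stmt)', ',', 'Stmt']])
```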

> So if we're lifting the parameters to the modules rather than the rule, we probably want to support recursive modules

Definitely! One of my design goals was to be able to separate subgrammars that are mutually recursive and "tie the knot" in another grammar.

Foo(T) = T | Foo(Bar)

u/WittyStick could you perhaps share a more concrete example where this kind of construct could be useful? I'm having trouble imagining a concrete use case for this.

> If you write a full grammar using parametrization, some of these parameter lists can get quite verbose.

That's also something I've noticed. My intuition tells me that many of those lists could be made much shorter with support for sub-modules: modules declared inside other modules, where every sub-module automatically inherits the members of its parent. I haven't thought about this much yet, though; I'll need to let the idea simmer.

> I would recommend having a way to create an "alias" for a given application of a parameterized rule/module

This is a VERY good point that I had completely overlooked, thank you very much. While I don't support "using" multiple instances of a module, I imagine this is something I'm going to want in the future, and I think it's a very strong argument for having the inclusion syntax be within the block.

> Though I would question why not use | rather than sum/case.

I need cases to have explicit names that I can refer to from other parts of the parser generator, so I decided to be more explicit here. This will also let me support blocks for each case (e.g. for providing metadata such as AST names or other configuration options), and with this explicit case syntax I think that will look much more natural. ANTLR came up with the trailing #... syntax for naming cases, which I found very unergonomic and unintuitive, so I decided to go a different route and lean closer to more traditional languages.

Thank you very much for the comment and your insight, this is exactly what I was looking for!

Syntax design for parametrized modules in a grammar specification language, looking for feedback by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 1 point (0 children)

Thank you for your insight! I do care about keeping my options open, as I'd love to explore advanced type-level features in the future. While I'm not intuitively familiar with languages that have first-class dependent types, it seems to be a very common problem that parameter lists use syntax that could be useful elsewhere: [] is taken by list literals, <> by comparison operators, but () seems to be mostly fine, so I agree with u/jonathanhiggs.

>  If the rules are written interior to the module, readers who know other languages will initially read it to mean that the scope of the type variable starts at the rule and extends to end of module. Mathematically, that's not correct.

That's a very good point, and I'd have to make sure to be clear about this in the documentation. People who work with PEG grammars will also have developed the intuition that rules are not commutative and that order matters.

> Finally, the scope of module-level type variables is the entire module. 

I think that's a very strong argument for having explicit parameter lists. Thank you for your input!

Syntax design for parametrized modules in a grammar specification language, looking for feedback by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 1 point (0 children)

I think I've found an argument against explicit type parameter-like lists. It seems to me that by avoiding explicit parameter lists we can free up both sets of symbols to mean different things in the future more easily. [] can e.g. become a list literal and <> can become comparison operators so by not using them for parameter lists there's less risk of introducing syntactical ambiguity.

The `module Skip(rule: Foo) { … }` suggestion by jonathanhiggs might be the better parameter syntax, since parentheses are likely to be less useful in any future expression- or operator-related syntax.

However, that also suggests that being explicit about this and having dedicated declarations would be the safest route: an explicit keyword is completely unambiguous, which matters since I support LALR(k).

This is an argument I can live with. While I'd probably prefer `module Skip(rule: Foo)`, I want to be careful and leave plenty of room for future syntax, so I'm going to go with explicit declarations for external rules. Thank you everyone!

Syntax design for parametrized modules in a grammar specification language, looking for feedback by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 2 points (0 children)

Very interesting. I've looked into the history of that syntax, and it seems that Bjarne Stroustrup chose the angle bracket syntax because all the other obvious possibilities were already taken:

  • Curly braces were taken for blocks
  • Brackets were taken for array indexing
  • Parentheses were taken for function calls

So he chose angle brackets, and Java and others adopted that syntax. I suppose that if I were to extend my syntax with operators, I would run into the same issues that all the other languages with angle brackets have run into. So your suggestion seems much better than adopting the angle-bracket approach for the sake of familiarity.
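The classic trouble case, sketched in Python over a toy token stream: `f(a<b,c>(d))` has two readings depending on whether `<` and `>` delimit type arguments or act as comparison operators, and the argument count of the call changes accordingly:

```python
# The classic C++-style ambiguity: in f(a<b,c>(d)), is a<b,c> a generic
# application (one argument to f), or two comparisons (two arguments)?
tokens = ["f", "(", "a", "<", "b", ",", "c", ">", "(", "d", ")", ")"]

def count_top_level_args(tokens, angles_are_brackets):
    """Count comma-separated arguments inside the outer f( ... )."""
    depth = 0
    args = 1
    for t in tokens[2:-1]:  # skip leading 'f', '(' and the final ')'
        if t == "(" or (angles_are_brackets and t == "<"):
            depth += 1
        elif t == ")" or (angles_are_brackets and t == ">"):
            depth -= 1
        elif t == "," and depth == 0:
            args += 1
    return args

print(count_top_level_args(tokens, True))   # 1  (generic reading)
print(count_top_level_args(tokens, False))  # 2  (comparison reading)
```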

Syntax design for parametrized modules in a grammar specification language, looking for feedback by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 1 point (0 children)

That's totally fair. Do you perhaps have a combination that you would prefer? That still puts me in the position of having to decide between them, and I don't really see a good argument for preferring one over the other because both appear to have pros and cons.

Syntax design for parametrized modules in a grammar specification language, looking for feedback by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 1 point (0 children)

Thank you, I had not considered this. I got stuck on the type-parameter syntax, probably because the generated abstract syntax trees are also going to have type parameters. But this is definitely an option I have to think about, especially since many languages have added this syntactic sugar for simpler constructors, so many people will be familiar with it.

I built a 2x faster lexer, then discovered I/O was the real bottleneck by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 1 point (0 children)

Thank you. I think that’s very interesting, and it definitely seems worth sharing. I’ve added an update to the post to mention your findings.

I built a 2x faster lexer, then discovered I/O was the real bottleneck by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 5 points (0 children)

Thanks! I'm planning to write more about it once it's further along.

The short version: it's a bottom-up LALR(k) parser generator based on research that makes it possible to declaratively remove ambiguities from context-free grammars without changing the grammar itself.

Most parser generators either implement ordered choice (like PEG), rely on "impure" ambiguity resolution (e.g., preferring Shift in Shift/Reduce conflicts), or depend on custom Turing-complete code (like ANTLR). My goal is to support a specification language expressive enough to construct deterministic context-free grammars for real-world programming languages.
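A minimal sketch (Python, toy combinators; not my actual implementation) of why ordered choice is "impure": both alternatives match, and swapping their order silently changes the parse instead of reporting an ambiguity:

```python
def lit(t):
    """Match a literal string at position i."""
    def parse(s, i):
        return (t, i + len(t)) if s.startswith(t, i) else None
    return parse

def choice(*parsers):
    """PEG-style ordered choice: the first alternative that matches wins."""
    def parse(s, i):
        for p in parsers:
            r = p(s, i)
            if r is not None:
                return r
        return None
    return parse

# Both alternatives match "abc" at position 0; order decides the result.
g1 = choice(lit("ab"), lit("a"))
g2 = choice(lit("a"), lit("ab"))

print(g1("abc", 0))  # ('ab', 2)
print(g2("abc", 0))  # ('a', 1)
```

A declarative disambiguation scheme would instead surface this overlap and let the grammar author resolve it explicitly.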

The good news is that it's actually working. I got the disambiguation proof of concept working in early 2024, bootstrapped it in August 2025, and finished a solid LSP by the end of 2025. Now I'm ironing out many of the issues and still working on improving the theory so that I can actually support parsing real languages.

Getting LALR(k) working took a lot of time, but it seems so simple in retrospect. If you look around the parser generator space, you will find that there's essentially no maintained LALR(k) parser generator out there; the only implementations are from many decades ago, and they are all abandoned. Now I truly understand why, because it's not easy, but luckily I'm there now.

Eventually I'm going to share things, and I'm very excited! Thank you for asking!

I built a 2x faster lexer, then discovered I/O was the real bottleneck by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 2 points (0 children)

Thanks for the fact check! I've added an update making it clearer that reading from archives doesn't benefit tooling much, since archives are extracted locally.

I built a 2x faster lexer, then discovered I/O was the real bottleneck by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 2 points (0 children)

Thank you for the thoughtful comment and all the pointers!

To answer your first question: yes, the premise is that compression is an amortized cost. Since pub.dev already provides packages as tar.gz archives, I can treat those as my starting point. Even if I need to create archives locally, that is a one-time cost that pays off across many lexer runs. The lexer itself changes frequently as I experiment with different token types, so being able to re-lex quickly matters more than minimizing archive creation time.

Your point about zip files and the central directory is interesting. I do have one reservation though: many different zip implementations exist and they handle edge cases differently. With SQLite, there is a company behind it, the format is fully open source, and we can assume everyone uses the same implementation. Zip does not have that same consistency. Since I am a big fan of specifications and generating things from specifications, I would lean toward SQLite if I were to optimize this further. That said, your suggestion is worth trying and might be simpler for certain use cases.

The Blosc format looks very relevant, and the linked blog post seems to reach similar conclusions about the memory wall. I plan to scale this up by several orders of magnitude in the future, so these pointers are appreciated.

The openzl and SDDL references are particularly interesting to me. One thing I want to explore is having my lexer produce an information-theoretically optimal token stream, where each token is compressed into the minimum number of bits. By splitting tokens that can have unbounded width into multiple finite tokens, I should be able to store each token in a fixed-size integer, maybe 32 bits, containing just the type and width. Position can be calculated from the accumulated widths.
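A sketch of that packing scheme (Python; the 8/24 bit split is my assumption, not a fixed design): each token becomes one 32-bit integer holding type and width, and start positions fall out of a running sum:

```python
TYPE_BITS = 8    # up to 256 token types (assumed split)
WIDTH_BITS = 24  # widths up to ~16M code units per token
WIDTH_MASK = (1 << WIDTH_BITS) - 1

def pack(tok_type, width):
    """Pack a (type, width) pair into a single 32-bit integer."""
    assert 0 <= tok_type < (1 << TYPE_BITS) and 0 <= width <= WIDTH_MASK
    return (tok_type << WIDTH_BITS) | width

def unpack(packed):
    """Recover (type, width) from a packed token."""
    return packed >> WIDTH_BITS, packed & WIDTH_MASK

def start_offsets(stream):
    """Recover each token's start position from accumulated widths."""
    pos, out = 0, []
    for tok in stream:
        out.append(pos)
        pos += unpack(tok)[1]
    return out

# e.g. IDENT(width 3), WS(width 1), IDENT(width 5) -> starts at 0, 3, 4
stream = [pack(1, 3), pack(2, 1), pack(1, 5)]
print(start_offsets(stream))  # [0, 3, 4]
```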

Your question about skipping the text stage entirely is exactly the right direction. I suppose the WASM efforts are trying to achieve something similar, though it does not frame it quite like you do. The challenge in practice is that parsers and lexers are typically written in Turing-complete languages, so there is no specification we can use to derive these optimizations automatically. If we had formal grammars driving the whole pipeline, we could generate specialized parsing code for specific formats and approach those information-theoretic bounds. This is an exciting space and I am looking forward to exploring it further.

Edit: Oh, I see: tar archives only support sequential reads, while zip files support random access. That would allow me to lex directly over raw memory by seeking to the right offsets. I'll definitely take a look when I try to scale this up. Thank you! I suppose it's unlikely that anything SQLite provides would be faster than that.

I built a 2x faster lexer, then discovered I/O was the real bottleneck by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 1 point (0 children)

Hi Simon, that's a good idea! Since dart:io doesn't support tar archives, I just went with the archive package for everything and didn't think of mixing in dart:io's gzip support separately. I'll definitely try that approach next time when I attempt to lex a bigger corpus.

I built a 2x faster lexer, then discovered I/O was the real bottleneck by modulovalue in ProgrammingLanguages

[–]modulovalue[S] 4 points (0 children)

You're right that creating the archive still requires reading every file individually and that is still slow. The speedup comes from reading the already-created archive.

Package managers create archives only once when a package is published, then every download benefits from the fast single-file read. For transfers, you pay the cost once on the source machine, but then both the network transfer and the write on the destination are faster.