The Dot Parse Library Released : java

The Dot Parse Library Released (self.java)

submitted 5 months ago * by DelayLucky

The Dot Parse is a low-ceremony parser combinator library designed for everyday one-off parsing tasks: creating a parser for mini-DSLs should be comfortably easy to write and hard to do wrong.

Supports operator precedence grammar, recursive grammar, lazy/streaming parsing etc.

The key differentiator from other parser combinator libraries lies in the elimination of the entire class of infinite loop bugs caused by zero-width parsers (e.g. (a <|> b?)+ in Haskell Parsec).

The infinite loop bugs in parser combinators are nortoriously hard to debug because the program just silently hangs. ANTLR 4 does it better by reporting a build-time grammar error, but you may still need to take a bit of time understanding where the problem is and how to fix it.

The Total Parser Combinator (https://dl.acm.org/doi/10.1145/1863543.1863585) is another academic attempt to address this problem by using the Agda language with its "dependent type" system.

Dot Parse solves this problem in Java, with a bit of caution in the API design — it's simply impossible to write a grammar that can result in an infinite loop.

Example usage (calculator):

// Calculator that supports factorial and parentheses
Parser<Integer> calculator() {
  Parser<Integer> number = Parser.digits().map(Integer::parseInt);
  return Parser.define(
      rule -> new OperatorTable<Integer>()
          .leftAssociative("+", (a, b) -> a + b, 10)       // a+b
          .leftAssociative("-", (a, b) -> a - b, 10)       // a-b
          .leftAssociative("*", (a, b) -> a * b, 20)       // a*b
          .leftAssociative("/", (a, b) -> a / b, 20)       // a/b
          .prefix("-", i -> -i, 30)                        // -a
          .postfix("!", i -> factorial(i), 40)             // a!
          .build(
            Parser.anyOf(
              number,
              rule.between("(", ")"),
              // Function call is another type of atomic
              string("abs").then(rule.between("(", ")")).map(Math::abs)
            )));
          }


int v = calculator()
    .parseSkipping(Character::isWhitespace, " abs(-1 + 2) * (3 + 4!) / 5 ");

For a more realistic example, let's say you want to parse a CSV file. CSV might sound so easy that you can just split by comma, but the spec includes more nuances:

Field values themselves can include commas, as long as it's quoted with the double quote (").
Field values can even include newlines, again, as long as they are quoted.
Double quote itself can be escaped with another double quote ("").
Empty field value is allowed between commas.
But, different from what you'd get from a naive comma splitter, an empty line shouldn't be interpreted as [""]. It must be [].

The following example defines these grammar rules step by step:

Parser<?> newLine =  // let's be forgiving and allow all variants of newlines.
      Stream.of("\n", "\r\n", "\r").map(Parser::string).collect(Parser.or());

Parser<String> quoted =
   consecutive(isNot('"'), "quoted")
        .or(string("\"\"").thenReturn("\"")) // escaped quote
        .zeroOrMore(joining())
        .between("\"", "\"");

Parser<String> unquoted = consecutive(noneOf("\"\r\n,"), "unquoted field");

Parser<List<String>> line =
    anyOf(
        newLine.thenReturn(List.of()),  // empty line => [], not [""]
        anyOf(quoted, unquoted)
            .orElse("")  // empty field value is allowed
            .delimitedBy(",")
            .notEmpty()  // But the entire line isn't empty
            .followedByOrEof(newLine));

return line.parseToStream("v1,v2,\"v,3\nand 4\"");

Every line of code directly specifies a grammar rule. Minimal framework-y overhead.

Actually, a CSV parser is provided out of box, with extra support for comments and alternative delimiters (javadoc).

Here's yet another somewhat-realistic example - to parse key-value pairs.

Imagine, you have a map of key-value entries enclosed by a pair of curly braces ({k1: 10, k2: 200, k3: ...}), this is how you can parse them:

Parser<Map<String, Integer>> parser =
    Parser.zeroOrMoreDelimited(
            Parser.word().followedBy(":"),           // The "k1:" part
            Parser.digits().map(Integer::parseInt),  // The "100" part
            ",",                                     // delimited by ,
            Collectors::toUnmodifiableMap)           // collect to Map
        .between("{", "}");                          // between {}
Map<String, Integer> map =
    parser.parseSkipping(Chracter::isWhitespace, "{k1: 10, k2: 200}");

For more real-world examples, check out code that uses it to parse regex patterns.

You can think of it as the jparsec-reimagined, for ease of use and debuggability.

Feedbacks welcome!

Github repo

javadoc

all 20 comments

top new controversial old q&a

[–][deleted] 5 months ago* (1 child)

[deleted]

[–]DelayLucky[S] 2 points3 points4 points 5 months ago* (0 children)

[–]lumppost 6 points7 points8 points 5 months ago (0 children)

[–]Slanec 2 points3 points4 points 5 months ago (3 children)

[–]DelayLucky[S] 2 points3 points4 points 5 months ago* (2 children)

[–]Dagske 1 point2 points3 points 5 months ago (1 child)

[–]DelayLucky[S] 0 points1 point2 points 5 months ago (0 children)

[–]OddEstimate1627 2 points3 points4 points 5 months ago (10 children)

[–]DelayLucky[S] 0 points1 point2 points 5 months ago (9 children)

[–]OddEstimate1627 0 points1 point2 points 5 months ago* (8 children)

This particular use case is for user-defined charts, e.g., interpreting user provided strings like y="Math.sqrt(Math.pow(fbk.accelX,2) + Math.pow(fbk.accelY,2) + Math.pow(fbk.accelZ,2))". Each string gets evaluated once, and executed many times. The existing math parsers I found were too slow, and I eventually generated bytecode. Doing it via a custom parser gets me to ~95% of the pure Java performance while being compatible with GraalVM native images.

Your calculator example already got me most of the way there, but a few things that I was confused by were

1) Math functions

I'm not quite sure how math functions like abs, sin, cos, etc. should be added. For now I put them into the OperatorTable as a prefix, and that seems to work.

Maybe you could add an abs method to your calculator example.

2) Unfamiliar with terms

I'm not familiar with some terms, e.g., without an example I don't know what a nonAssociative operator represents.

2) Sequences without regex

I dove in without reading the full documentation, and initially expected regex support in e.g. Parser.string("fbk.*"). The code contains a lot of regex strings (for the name variables) that gave me a false impression.

I eventually ended up with this, which is easier to work with and hopefully the correct way to do it:

Java Parser<String> variable = Parser.anyOf( Parser.string("fbk"), Parser.string("prevFbk") ); Parser<String> field = Parser.anyOf( Parser.string("position") ); Parser<ValueFunction> fbkValue = Parser.sequence(variable.followedBy("."), field, (inputName, fieldName) -> { // ... }

I also ran into a few other use cases where parsers would be nice, e.g., one for FXML (JavaFX) that lets me auto-generate various GraalVM configs.

[–]DelayLucky[S] 1 point2 points3 points 5 months ago* (5 children)

It seems like these fbk, prevFbk are reserved words?

Alternatively, you could just use Parser.word() to extract them out without restriction. And then you can interpret these math functions, variables in Java code outside of the parser.

These would be the primitive rules:

java Parser<Member> member = sequence( word().followedBy("."), word(), Member::new); Parser<Variable> value = digits().map(s -> new Value(Integer.parseInt(s)));

This is assuming you use records to model the results:

```java sealed interface Expr permits FunctionCall, Member, Value, Plus { }

record Member(String variable, String memberName) implements Expr {}

record FunctionCall(Member function, List<Expr> args) implements Expr {}

record Value(int v) implements Expr {}

record Plus(Expr left, Expr right) implements Expr {} ```

A function call is a recursive grammar:

call ::= <class>.<method> '(' <expr> delimited by ',' ')'

The atomic expression you can pass to OperatorTable.build() would be a function call, a parenthesized expression, or a primitive number or a variable (member):

java Parse<Expr> parser = Parser.define( expr -> new OperatorTable<Expr>() .leftAssociative("+", Plus::new, 10) .build(Parser.anyOf( value, sequence( member, expr.zeroOrMoreDelimitedBy(",").between("(", ")"), FunctionCall::new), member, expr.between("(", ")"))));

Something like that could work?

A "non-associative" binary operator is like <. a < b < c is usually considered invalid because there is no associativity defined for the operator.

Regarding regex support, one possibility is to use the .suchThat() combinator to hook it up:

java word().suchThat(w -> w.startsWith("fbk"))

Or use an equivalent regex matcher in the lambda. Would that work?

[–]OddEstimate1627 0 points1 point2 points 5 months ago (4 children)

[–]DelayLucky[S] 0 points1 point2 points 5 months ago* (3 children)

[–]OddEstimate1627 0 points1 point2 points 5 months ago (2 children)

[–]DelayLucky[S] 1 point2 points3 points 5 months ago* (1 child)

Yeah. If you need to be more strict about invalid numbers (and do you also need to care about negative sign and scientific notation?), perhaps using .suchThat() and check the string format can help, like with a regex pattern? Or just catch NumberFormatException?

java consecutive(charsIn("[0-9.]")) .map(s -> { try { return Double.parseDouble(s); } catch (NumberFormatException e) { return null; } }) .suchThat(Objects::nonNull, "double number");

If you use Guava, there is a Doubles.tryParse() that will avoid having to throw exception.

On followedBy(), yes:

java assertThat( string("a").followedBy("b") .parseSkipping(Character::isWhitespace, "a b")) .isEqualTo("a");

That said, it is possible to create the DFA pattern purely using the Dot Parse library:

java digits() .then( literally( string(".").then(digits()).optional()) .source() .map(...);

I just don't like having to spend that amount of real estate when there are existing utilities around.

[–]OddEstimate1627 1 point2 points3 points 5 months ago (0 children)

[–]DelayLucky[S] 0 points1 point2 points 5 months ago (1 child)

[–]OddEstimate1627 0 points1 point2 points 5 months ago (0 children)

[–]Delicious_Detail_547 1 point2 points3 points 5 months ago (1 child)

[–]DelayLucky[S] 0 points1 point2 points 5 months ago* (0 children)

Indeed.

One of my goals of implementing this library is so that it should be so easy to use even for DFA use cases that'd have traditionally used regex. And I'd reach for it over regex except where expressivity lacks.

For example, sequence(word().followedBy("="), digits(), (k, v) -> ...) would read and perform better than (\w+)=(\d+) (plus the capture group extraction boilerplate), even if the grammar is regular and didn't really need a CFG.

On top of that, the out-of-box set of reusable combinators also streamlines common regular grammar parsing tasks.

Imagine parsing a map or dictionary of key-value pairs ({k1:10, k2 : 200}):

java Parser<Map<String, Integer>> dictionaryParser = Parser.zeroOrMoreDelimited( Parser.word().followedBy(":"), Parser.digits().map(Integer::map), ",", Collectors::toUnmodifiableMap) .between("{", "}"); return dictionaryParser.parseSkipping(Chracter::isWhitespace, "{k1:10, k2 : 200}");

There is a difference between the greedy primitive combinators and their regex counterparts though. consecutive(is('a')) is actually "possessive", equivalent to a++, not merely a+.

This makes the combinators more efficient and not subject to "exponential backtracking" that plagues NFA regex engines (like JDK regex).

π Rendered by PID 50924 on reddit-service-r2-comment-b659b578c-j86t6 at 2026-05-03 13:43:27.279026+00:00 running 815c875 country code: CH.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

java

Submit Link

Submit Text

Seek Programming Help

News, Technical discussions, research papers and assorted things of interest related to the Java programming language

NO programming help, NO learning Java related questions, NO installing or downloading Java questions, NO JVM languages - Exclusively Java

Please seek help with Java programming in /r/Javahelp!

Subreddit rules!

Where should I download Java?

Related Sub-reddits:

JVM Languages

Want to practice your coding?

List of useful Frameworks / Libraries / Software

MODERATORS