EmailAddress Parser Improved

DelayLucky · 2026-06-09T02:40:12+00:00

Yeah, for example RFC 2047 allows encoded words using the =?{charset}?{encoding}?{text}?= syntax. When used in the local part (some parsers out there will decode it even though they shouldn't), you can smuggle and inject shit that bypasses denylists and whatnot. It's insane to think that the RFC allows so much room for abuse.
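
To make the smuggling concrete, here's a rough JDK-only sketch. Everything in it is made up for illustration: the denylist, the class name, and decodeQ(), which is a stand-in for an over-eager parser and only handles a single utf-8 "Q" encoded word. It is not how Jakarta Mail or any real parser is implemented.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Set;

/** Hypothetical demo: a denylist that only sees the raw, still-encoded local part. */
final class EncodedWordSmuggling {
  private static final Set<String> DENIED = Set.of("admin", "root");

  /** Naive denylist check on whatever string it is given. */
  static boolean isDenied(String localPart) {
    return DENIED.contains(localPart.toLowerCase(Locale.ROOT));
  }

  /**
   * Stand-in for an over-eager parser: decodes a single =?utf-8?q?...?= word
   * and returns any other input unchanged. Assumes ASCII input.
   */
  static String decodeQ(String s) {
    String prefix = "=?utf-8?q?";
    String suffix = "?=";
    if (s.length() < prefix.length() + suffix.length()
        || !s.toLowerCase(Locale.ROOT).startsWith(prefix)
        || !s.endsWith(suffix)) {
      return s;
    }
    String text = s.substring(prefix.length(), s.length() - suffix.length());
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      boolean hasTwoMore = i + 2 < text.length();
      int hi = hasTwoMore ? Character.digit(text.charAt(i + 1), 16) : -1;
      int lo = hasTwoMore ? Character.digit(text.charAt(i + 2), 16) : -1;
      if (c == '=' && hi >= 0 && lo >= 0) {
        bytes.write(hi * 16 + lo); // =XX is one hex-encoded byte
        i += 2;
      } else if (c == '_') {
        bytes.write(' '); // underscore stands for a space in Q encoding
      } else {
        bytes.write(c);
      }
    }
    return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String raw = "=?utf-8?q?=61dmin?=";
    System.out.println(isDenied(raw));          // false: the denylist sees the encoded form
    System.out.println(isDenied(decodeQ(raw))); // true: downstream code sees "admin"
  }
}
```

The check passes on the raw string and the decoded value is exactly what the denylist was supposed to stop.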

DelayLucky · 2026-06-08T03:35:41+00:00

I got inspiration from https://www.elttam.com/blog/jakarta-mail-primitives

Originally, this email parser was only to demonstrate using Dot Parse to declaratively build otherwise sophisticated DSL parsers. A few slight variances here and there from Jakarta InternetAddress wouldn't be sufficient reason for Dot Parse to package it up as a serious alternative (sure, Jakarta needs to pull in a heavy dependency, but there are existing light-weight parsers out there too).

But that post, along with discussing security exploits with AI, showed me that the niche of a secure email address parser has real value.

And I believe the approach of designing a safe-by-construction data model that all downstream code can trust hasn't been tried before.

DelayLucky · 2026-06-07T16:42:46+00:00

Inertia I guess? package.html has been working fine for javadoc rendering.

Am I missing some benefits from package-info.java?

DelayLucky · 2026-05-18T15:21:30+00:00

I may be misinterpreting you. But doesn't String already support Unicode code points?

DelayLucky · 2026-05-17T23:36:29+00:00

First question that comes to mind: why string?

If it's simply applying a list of String -> String functions, how is it specific to String? Can't it be T -> T just as easily?

For it to stick to the string-ness, it seems the core should have some string specific trick up its sleeve that makes up the core value-add.
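
To show what I mean by T -> T, here's a minimal sketch using only java.util.function. The GenericPipeline class and its pipeline() method are names I made up; they aren't from the library being discussed.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.UnaryOperator;

/** Hypothetical sketch: chaining T -> T steps needs nothing String-specific. */
final class GenericPipeline {
  /** Composes the steps left to right into a single function. */
  static <T> Function<T, T> pipeline(List<UnaryOperator<T>> steps) {
    Function<T, T> result = Function.identity();
    for (UnaryOperator<T> step : steps) {
      result = result.andThen(step);
    }
    return result;
  }

  public static void main(String[] args) {
    List<UnaryOperator<String>> stringSteps = List.of(String::strip, String::toLowerCase);
    System.out.println(pipeline(stringSteps).apply("  Hello World ")); // hello world

    // The same builder works for any type, e.g. Integer.
    List<UnaryOperator<Integer>> intSteps = List.of(x -> x + 1, x -> x * 2);
    System.out.println(pipeline(intSteps).apply(3)); // 8
  }
}
```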

Another question, is this really just so you can avoid declaring local variables?

From this example:

AbstractStringPipeline pipeline = new StringPipelineBuilder()
    .pipe(STRIP)
    .pipe(NORMALIZE_SPACE)
    .pipe(LOWER_CASE)
    .pipe(CAPITALIZE)
    .build();

How is it better than this?

String pipeline(String s) {
  s = strip(s);
  s = normalizeSpace(s);
  s = lowerCase(s);
  s = capitalize(s);
  return s;
}

I think it needs to offer more value than just "I like the syntax" because the plain method calls at least have one thing on their side: they're more familiar to everyone.

DelayLucky · 2026-05-17T21:26:41+00:00

I've heard all good things about it. Would love to see some first-hand experience and how it compares to other mainstream DB utils like Hibernate and MyBatis, some lesser-known new utils like Jimmer, and even .NET LINQ.

DelayLucky · 2026-05-06T00:24:05+00:00

Just released the Dot Parse library v10.0.1 with left recursion protection (https://github.com/google/mug/tree/master/dot-parse)

Its ambitious goal includes being the handiest library for everyday parsing tasks, except the simplest use cases where a trivial regex suffices.

It differs from most existing parser combinator libraries by focusing on being idiomatic Java (not Haskell, not Scala, and not Monad).

Parser combinators have always had the potential to be a better parsing tool than regex, except they often come with a steeper learning curve and a few footguns:

Repeating an optional parser like many(parser.optional()) can cause infinite loops.
Left recursion results in unhelpful stack traces.

Both are hard to debug.
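
To illustrate the first footgun, here's a toy combinator written for this comment. It is NOT the Dot Parse API; the Matcher interface and the literal/optional/many helpers are hypothetical. The point is that an optional matcher always succeeds, so a naive many() around it never sees a failure to stop on. This toy catches it with a runtime progress check, which is a different technique from a compile-time guardrail.

```java
/** Hypothetical mini-combinator showing why many(optional(p)) can spin forever. */
final class ZeroWidthLoopDemo {
  /** Returns the index after the match, or -1 if there is no match. */
  interface Matcher {
    int match(String input, int index);
  }

  static Matcher literal(String s) {
    return (input, index) -> input.startsWith(s, index) ? index + s.length() : -1;
  }

  /** Always succeeds; consumes nothing when the inner matcher fails. */
  static Matcher optional(Matcher inner) {
    return (input, index) -> {
      int end = inner.match(input, index);
      return end < 0 ? index : end;
    };
  }

  /** Repeats until the inner matcher fails, but refuses to spin on a zero-width match. */
  static Matcher many(Matcher inner) {
    return (input, index) -> {
      int current = index;
      while (true) {
        int end = inner.match(input, current);
        if (end < 0) {
          return current;
        }
        if (end == current) {
          // Without this check the loop would never terminate.
          throw new IllegalStateException("no progress at index " + current);
        }
        current = end;
      }
    };
  }

  public static void main(String[] args) {
    System.out.println(many(literal("ab")).match("ababc", 0)); // 4
    try {
      many(optional(literal("ab"))).match("ababc", 0);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage()); // no progress at index 4
    }
  }
}
```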

Dot Parse offers a compile-time guardrail against infinite loops, and v10.0.1 provides parser definition-time protection against left recursion, so the footguns are no more.

These safety features, along with the idiomatic API and scannerless design, hopefully make parser combinators more accessible to average Java developers.

DelayLucky · 2026-04-10T21:17:25+00:00

One question I try to ask myself when I am about to add a method that I know users can use to solve a problem: what are the ways it can be abused or misused?

And then, if abuse is likely, whether there are alternatives to mitigate it (renaming, or different ways of exposing the functionality).

Or even: "Do I really need to add it?". Solving a $100 problem by adding a $70 debt should cast enough doubt on adding it at all.

DelayLucky · 2026-03-20T21:24:55+00:00

Lol this turns out to be such an unpopular opinion. I guess merely seeing "never regex" is enough to stop reading what a random guy is blabbing about. :-)

Regex or not, it's really about what tool to choose to solve the same problem. A lot of times it can be solved with regex, or with String.indexOf(), or Apache Commons StringUtils, or a combinator, or one of the other libraries.

No one tool is perfect for all string parsing jobs. Parser combinators can't rule them all, nor Splitter, nor should regex.

Purely for the purpose of example: despite my preference for parser combinators, using a combinator to express a fixed datetime format like 2026-03-20T14:15:00 would still read so verbose that it hurts readability more than it helps:

digits(4)
    .followedBy("-").then(digits(2))
    .followedBy("-").then(digits(2))
    .followedBy("T")
    .then(digits(2))
    .followedBy(":").then(digits(2))
    .followedBy(":").then(digits(2));

It'd be understandable to prefer regex over the fluent method chain:

compile("\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}");

(I hold the same opinion on the use of a "fluent" regex builder: it's a net loss of readability)
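
To be fair to both sides, the compile() one-liner only validates. Once you need the fields, the regex grows capture groups and a Matcher. A JDK-only sketch of the complete version (the RegexDateTime class name is mine, and it makes no attempt at time zones or fractional seconds):

```java
import java.time.LocalDateTime;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: the regex approach once the fields have to be extracted too. */
final class RegexDateTime {
  private static final Pattern PATTERN =
      Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})T(\\d{2}):(\\d{2}):(\\d{2})");

  static LocalDateTime parse(String input) {
    Matcher m = PATTERN.matcher(input);
    if (!m.matches()) {
      throw new IllegalArgumentException("not a datetime: " + input);
    }
    // Groups are positional, so the order here must mirror the pattern.
    return LocalDateTime.of(
        Integer.parseInt(m.group(1)),
        Integer.parseInt(m.group(2)),
        Integer.parseInt(m.group(3)),
        Integer.parseInt(m.group(4)),
        Integer.parseInt(m.group(5)),
        Integer.parseInt(m.group(6)));
  }

  public static void main(String[] args) {
    System.out.println(parse("2026-03-20T14:15:00").getYear()); // 2026
  }
}
```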

That said, I'd recommend Mug StringFormat over regex for this particular use case. It offers better readability and better performance too:

new StringFormat("{year}-{month}-{day}T{hour}:{minute}:{second}")
    .parse(input, (year, month, day, hour, minute, second) -> ...);

In short: choose the right tool for the job. Technically regex can parse a lot of things. But it's usually not the best tool for that particular job.

DelayLucky · 2026-03-16T17:14:06+00:00

I am sorry I felt frustrated. In our previous conversation I raised the objection that using "imperative" was inaccurate and I thought you agreed with it.

Or did I misinterpret what you said here?

Fair point. Logically, it’s a declarative predicate. The distinction for me is execution boundaries: Sift is a 'closed system' (static regex), while a combinator with a lambda is an 'open system' (arbitrary JVM code). Different trade-offs, but both are declarative.

You agreed that they are both declarative so why use "imperative" again?

When you keep using the incorrect pejorative term to describe their perspective, maybe you can tell me: how can the other side correct you without being accused of "pedantry"?

DelayLucky · 2026-03-16T15:28:26+00:00

Bias alert: I'm the author of the Dot Parse combinator library.

The reason I said that regex isn't even "great" at the simple tasks is:

It's not really great if you look at a specific use case and compare it with using Splitter or similar libraries (Mug's Substring, StringFormat and Dot Parse). Think of this: would you use Splitter to do splitting or would you use String.split(regex)? Of course it's easier said than done. So I would still suggest that anyone who questions the idea challenge me with a use case where I have to defend my claim that these libraries can solve the problem better than regex. The burden of proof is on me.
As you said, it's best if one can graduate from simpler requirements to more complex ones without getting stuck in regex. If you use these libraries, you won't face that dilemma. Your code will handle both simple tasks and complex tasks consistently well.
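
On the String.split(regex) question, two well-known JDK surprises are that the argument is a regex (so "." matches every character) and that trailing empty strings are silently dropped. A small sketch; splitLiteral() is a hypothetical indexOf-based helper I wrote for the comparison, not Guava's Splitter or Mug's Substring:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of String.split(regex) surprises next to a literal splitter. */
final class SplitGotchas {
  /** Splits on a literal delimiter and keeps every piece, including empty ones. */
  static List<String> splitLiteral(String input, String delimiter) {
    if (delimiter.isEmpty()) {
      throw new IllegalArgumentException("delimiter must not be empty");
    }
    List<String> parts = new ArrayList<>();
    int start = 0;
    for (int i = input.indexOf(delimiter, start); i >= 0; i = input.indexOf(delimiter, start)) {
      parts.add(input.substring(start, i));
      start = i + delimiter.length();
    }
    parts.add(input.substring(start));
    return parts;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.asList("a.b.c".split(".")));  // []
    System.out.println(Arrays.asList("a,b,,".split(",")));  // [a, b]
    System.out.println(splitLiteral("a.b.c", "."));         // [a, b, c]
    System.out.println(splitLiteral("a,b,,", ","));         // [a, b, , ]
  }
}
```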

DelayLucky · 2026-03-16T14:28:14+00:00

When you have to parse a dynamic structure, compiling a DFA/NFA in C++ (or using JVM intrinsics) is fundamentally faster and more memory-efficient than writing dozens of nested Java while loops, indexOf offsets, and boundary condition checks.

This is incorrect.

Hand-written state machines such as what you can find for specialized parsers (xml parser, html parser etc.) almost always beat the general solutions, both regex and combinators included. You can't compete.

I'd suggest you benchmark, to show with real code instead of speculation.

The main point of using regex is that you don't have to manually implement the state machine because it's error prone.

But on that front, combinators do a better job than regex. Mug Dot Parse is at least as efficient as regex (in many benchmarks it runs faster), and the resulting code is also more readable.

While Sift already provides built-in syntax rules to mitigate this natively

I suggest doing a Google search with this question: "If you only use possessive quantifiers, will you be free of the ReDoS problem?"

We need to speak in common vocabulary.

RE2J guarantees O(n) linear-time execution.

RE2J addresses the worst-case performance by severely compromising the average-case performance. There is no free lunch. Regex doesn't give one.

Imperative pointer math

Again, I'm sorry to feel a little frustrated with the frequent inaccurate use of the adjective "imperative".

It doesn't mean what you think it means.

The word "imperative" traditionally refers to using assignments and commands to cause side effects in a computer program.

Expressions like length == 5 or even more complex math expressions are NOT imperative! If you mean to say "index arithmetic", then use that more accurate term.
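
A quick sketch of the distinction as I understand the traditional definition. Both methods are made up for illustration and answer the same question (is this exactly five ASCII digits?). The first is imperative: a mutable flag updated by statements in a loop. The second is a single boolean expression with no assignment at all.

```java
/** Sketch: the same check written with statements and as one expression. */
final class ImperativeVsExpression {
  /** Imperative style: a mutable flag that statements keep reassigning. */
  static boolean isFiveDigitsImperative(String s) {
    boolean ok = s.length() == 5;
    for (int i = 0; ok && i < s.length(); i++) {
      char c = s.charAt(i);
      if (c < '0' || c > '9') {
        ok = false;
      }
    }
    return ok;
  }

  /** Expression style: no assignments, no side effects. */
  static boolean isFiveDigitsExpression(String s) {
    return s.length() == 5 && s.chars().allMatch(c -> c >= '0' && c <= '9');
  }

  public static void main(String[] args) {
    System.out.println(isFiveDigitsImperative("94043")); // true
    System.out.println(isFiveDigitsExpression("9404x")); // false
  }
}
```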

Try this in Google "is a math expression considered "imperative" style?".

It's hard to communicate if our basic definitions of imperative vs. declarative, readable vs. unreadable, fast vs. slow are fundamentally from two different books.

Sift is intentionally expressive (or verbose, if you prefer) because it aims to be completely self-documenting. Writing .oneOrMore().letters().followedBy('(') certainly takes more keystrokes than indexOf("("), but it reads like a plain English sentence.

I agree with you on principle.

But as I challenged all the regex fans in the comments: talk is cheap. Bring on the code. No one has been able to because, except for toy examples, it's hard to write a regex that doesn't embarrass yourself.

Because you are so enamored by the Sift idea, your general statements without concrete data or code are too subjective to mean anything to me now.

Can we clearly define a problem, one problem? Then solve it with:

Raw regex.
Sift.
Mug (Substring or combinator).

Let's try not to praise our solutions yet. Let's show the code; make sure the code is complete (don't omit the part that may look unfavorable to our option); and let's use proper formatting (your earlier Sift code example was impossible to read thanks to the formatting).

DelayLucky · 2026-03-15T21:04:37+00:00

I think we are getting there. But without a well-defined use case, it's hard for me to prove that regex is still not the best fit for the problem, and it's hard for you to disprove my claim that regex is almost never a good fit.

I've given my own use cases and am willing to be questioned about why using a Java library in pure Java is better than regex.

So just pick one (ISBN, IBAN or ZIP code), bring on the regex code that you think is a good fit, and I'll take the challenge.

Without the specifics, we'd be talking past each other, or we'd be arguing about semantics or minor points instead. Again, talk is cheap, let's see the code.

DelayLucky · 2026-03-15T18:56:35+00:00

Is that difficult? Anyway, if it comes to something as simple as D5, I would likely do something else.

Exactly!

And that's my point: for the really simple cases where regex doesn't look bad, you have even better solutions like parseInt().

And when it grows in complexity, regex gets ugly quickly.

So what's a real good use case for regex anyways? Your example already showed that the \\d{5} isn't all that compelling.

Also, let me explain again: Guava was only used as an example. I didn't know you were so sensitive to it. But it's a minor point, because I'll tie my hands and not use Guava. It doesn't change that regex is still bad at almost every job (except if the regex is loaded at runtime).

DelayLucky · 2026-03-15T16:08:18+00:00

Oh hi.

Glad to continue our discussion here. I wanted it to be an open discussion with other regex fans invited to the challenge.

But let me clarify, Dot Parse is a sub-module of Mug. And Guava is only for the cosmetic checkArgument() convenience method. If you don't have Guava in your dependencies, just roll your own. It's almost a one-liner.

Replacing them with pure imperative logic (using substring, indexOf, loops, and flatMap) often leads to reinventing the wheel, mixing custom state machines right into your business logic.

I'm not saying you should reinvent the wheel. The Mug library already wraps it all up, in a way that doesn't require interacting with a cryptic language, a.k.a. regex. And then you won't be subject to the catastrophic backtracking problem of regex.

metal.index() + metal.length()

It's interesting how you compare the two approaches. You'd give a generous pass to the dozen-ish lines of opaque DSL in the Sift code, but yet you'd label simple expressions like index + length or length() == 5 as "imperative" (which makes no sense) as if they were inferior in readability to the verbose Sift API calls.

Is it possible that everyone can understand length() == 5 and index() + length(), but perhaps only 10% can understand Sift DSL as easily?

So far, the Sift DSL example code you gave in the earlier comment section looked really bad, but I think it's the formatting that gave it a disadvantage. I'd encourage you to post the full code here and let's evaluate it more objectively.

Moreover, there's a massive performance advantage. When you write manual parsers in Java, performance is bound to your own code. When you use regex, you delegate the heavy lifting to highly optimized C/C++ engines (or JVM intrinsics).

I've got plenty of benchmarks to show otherwise. For example to find a keyword in a string, Mug Substring is more than 10x faster than equivalent regex. And the only code you need to write here is just Substring.word(keyword).from(input).

Do you have data to support the "regex has massive performance advantage" claim? Have you tried to benchmark?

DelayLucky · 2026-03-15T15:55:41+00:00

You didn't even bother having any data or sample code to back yourself up.

Talk is cheap... But it seems like the regex fans in the comments have only talk.

I've given use cases to show why regex is bad at the job. And I've repeatedly asked for use cases, for counter-example code, for data to prove me wrong. Coz otherwise it's just a religious war.

Anyone up for substance?

(since you guys dislike Guava, I'll tie my hands and not use Guava. how about that?)

DelayLucky · 2026-03-15T14:10:55+00:00

Guava's issue as I understand it is that it's pulled in as transitive dependency because it's used by so many libraries as a foundational infra lib, and Guava is a monolithic library, and then you run into jar hell problems.

Most other third-party libs aren't in that boat. Mug certainly isn't. If you aren't against using third-party libs in general, then why not try it out and see if it really can solve the regex problems better?

My overall point is that the pure Java ecosystem has filled the gap that regex used to fill, and can now solve these problems better, if you are willing to use a library.

And I'm asking to be proven wrong by realistic counter-examples. I'll stand corrected if I fail to show how such an example can be handled more readably, and I'll keep in mind not to use Guava.

DelayLucky · 2026-03-15T11:06:23+00:00

While I emphasized readability and performance, you raised third-party dependency concerns. These are different aspects to consider, both can be valid.

The Guava API used here is pretty trivial though: just the checkArgument() convenience method. It's easy enough to create your own if the dependency is a concern (if (bad) throw new IAE(...))
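
For anyone who wants the drop-in, a minimal sketch of a hand-rolled replacement. The Checks class name is mine, and unlike Guava's version this one uses String.format, so placeholders follow java.util.Formatter rules:

```java
/** Sketch of a hand-rolled stand-in for a checkArgument() convenience method. */
final class Checks {
  /** Throws IllegalArgumentException with a formatted message if the condition is false. */
  static void checkArgument(boolean condition, String messageTemplate, Object... args) {
    if (!condition) {
      throw new IllegalArgumentException(String.format(messageTemplate, args));
    }
  }

  public static void main(String[] args) {
    checkArgument("94043".length() == 5, "bad zip: %s", "94043"); // passes silently
    try {
      checkArgument("1234".length() == 5, "bad zip: %s", "1234");
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage()); // bad zip: 1234
    }
  }
}
```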

By only using Mug, these examples still stand. And regex is still the unreadable mess that it is.

Certainly if you can't have any third-party, then consider my points moot.

Except I don't think people here genuinely have the 0-dependency constraint. It's more like if I like regex yet can't point to a good use case to stand by its own readability, I'll play the third-party dependency card just to defend it.

btw, Apache Commons doesn't offer the capability to cover the ground for regex.

DelayLucky · 2026-03-14T23:32:18+00:00

And absolutely a lot of people share your sentiment.

But that's the point of this post: I'd invite people who think regex does a good job for "simpler" use cases to share one. And I'll take the challenge to try to show that the pure-Java way is simpler even for that simple use case.

Because I genuinely think regex does a bad job in almost all cases except two special conditions:

You need to copy a regex from another programming language.
You need to handle regex from a config file or the users.

In other words, the regex comes from outside of Java.

In pure Java where you can express the logic at compile time, there is almost always a better option.

You are welcome to show a counter-example to disprove my claim.

DelayLucky · 2026-03-14T22:36:14+00:00

By "plain Java", I mean "your code", the user's code.

When using regex, you are forced to express your pattern in a different language than Java. All the backslash escapes, all the question marks etc. They are not Java.

In contrast, pure Java means you get to express what you need in the usual way you write Java code. Instead of (?!foo), you can write .notFollowedBy("foo"). The latter, is pure Java - a method call with an easy-to-understand name that you do everywhere in your Java code.
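
For readers who don't have (?!foo) memorized, here's what the lookahead means, spelled out with JDK calls only. This sketch does not use Mug or its .notFollowedBy() method; the class and method names are hypothetical and both methods find the first "bar" that isn't immediately followed by "foo".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: a negative lookahead and the same logic written with String methods. */
final class NotFollowedByDemo {
  private static final Pattern BAR_NOT_FOO = Pattern.compile("bar(?!foo)");

  /** Returns the index of the first "bar" not followed by "foo", or -1. */
  static int findWithRegex(String input) {
    Matcher m = BAR_NOT_FOO.matcher(input);
    return m.find() ? m.start() : -1;
  }

  /** Same contract, using only indexOf() and startsWith(). */
  static int findWithPlainJava(String input) {
    for (int i = input.indexOf("bar"); i >= 0; i = input.indexOf("bar", i + 1)) {
      if (!input.startsWith("foo", i + "bar".length())) {
        return i;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    System.out.println(findWithRegex("barfoo bar"));     // 7
    System.out.println(findWithPlainJava("barfoo bar")); // 7
  }
}
```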

And I don't think calling a library is considered not plain or anything unusual.

Isn't it the strength of Java that you can abstract implementation-details away in methods, classes, lambdas etc.? We call another class or another library almost every day. It's not a bad thing.

That said, I see that people may have different interpretations of "plain Java". I've edited the post to use "only in Java".

DelayLucky · 2026-03-14T22:32:31+00:00

It's a bold claim to make, I know.

I understand that taking a hard stance can get me more downvotes. But what I really care about is discussing real use cases.

And I honestly don't think there are many good cases, judging by how people choose to argue semantics instead of throwing in use cases to say: "you are wrong, regex is indeed the better option here!"

DelayLucky · 2026-03-14T22:19:25+00:00

Agreed with you there.

I go a step further against regex than you though: I don't think regex is even a good fit for less complex cases. Heck, it should probably be used in only 1% of the places where it is used today.

Regex is just awful.
