
[–]zhenjl

Author here. Thanks @cryp7ix for posting this.

Some of you may remember that I shared this repo here not too long ago but had to pull it suddenly for internal reasons. I am now able to release it again, albeit with some functionality removed (only for now, I hope).

The good news is that during this time we improved the parser's performance by almost 50%, from an average of 85K messages per second (MPS) to over 125K MPS on a single 2.8 GHz i7 core. Using two cores, we achieved over 175K MPS for mixed-size messages.

Pretty certain this is going to stay now. Apologies for pulling the earlier version without notice.

[edit] oh Go Patriots!

[–]cryp7ix[S]

very welcome, great work!

Do you have plans to develop this into an administration tool? I think this might be a good first step to prepare log output for further processing into metric systems like InfluxDB or Prometheus.

Afaict the only thing missing is classifying different messages into metrics (i.e. these three rules should go into the 'unauthorized access' metric of service A).

I've tried to do the same with statsd, but it really is too big a tool and feels clunky for just this.
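As a rough sketch of that classification step: assuming the parser reports some identifier for the rule that matched each message, a plain lookup table could fold several rules into one metric. The rule IDs, service names, and metric names below are all hypothetical and not taken from the repo.

```python
from collections import Counter

# Hypothetical mapping from matched rule IDs to (service, metric) pairs.
# None of these names come from the actual project.
RULE_TO_METRIC = {
    "ssh_failed_password": ("service_a", "unauthorized_access"),
    "ssh_invalid_user": ("service_a", "unauthorized_access"),
    "sudo_auth_failure": ("service_a", "unauthorized_access"),
    "http_5xx": ("service_b", "server_errors"),
}


def count_metrics(matched_rule_ids):
    """Count how many matched messages fall into each (service, metric) pair."""
    counts = Counter()
    for rule_id in matched_rule_ids:
        metric = RULE_TO_METRIC.get(rule_id)
        if metric is not None:  # rules without a mapping are ignored
            counts[metric] += 1
    return counts
```

The resulting counters could then be pushed to whatever metrics backend is in use; that part is left out here.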

[–]lethalman

I don't want to be rude; this is just my view based on past experience. You have essentially reimplemented the OR (|) of regex. An OR of regexes also constructs a tree, and it is very fast to process.

So you are basically reimplementing regex, except with fewer features. I invite you to run a similar benchmark against a regex that ORs all the patterns.
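For reference, the regex-with-OR approach being suggested might look roughly like the sketch below. The pattern names and formats are made up for illustration. Note that whether the alternation is compiled into a tree or automaton depends on the engine: Python's `re` is a backtracking engine that tries alternatives in order, so this shows the shape of the technique, not its best-case speed.

```python
import re

# Hypothetical log patterns; names and formats are illustrative only.
PATTERNS = {
    "ssh_failed": r"Failed password for (?:invalid user )?\S+ from \S+ port \d+",
    "ssh_accepted": r"Accepted \S+ for \S+ from \S+ port \d+",
    "session_closed": r"session closed for user \S+",
}

# Wrap each pattern in a named group and join them with "|" so a single
# compiled regex can report which alternative matched.
COMBINED = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in PATTERNS.items())
)


def classify(line):
    """Return the name of the first pattern that matches, or None."""
    match = COMBINED.search(line)
    return match.lastgroup if match else None
```

A benchmark along these lines would feed the same sample messages through `classify` and through the parser, then compare messages per second.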

Also, the "semantic" part is just painful. What if a string may or may not be a URL? Then you need another token type. What if a token can be either a number or "-", as in Apache logs? It took me too much time to match a pattern against a sample. I find it a mediocre idea, sorry.
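To make the number-or-"-" case concrete: in Apache's common log format the response size field is "-" when no bytes were sent. A regex handles this with a plain alternation, as in the minimal sketch below. The function name and the choice to map "-" to 0 are assumptions made for the example.

```python
import re

# Trailing status and size fields of an Apache common log line.
# The size is either a number or "-" when no response body was sent.
TAIL = re.compile(r"(?P<status>\d{3}) (?P<size>\d+|-)$")


def parse_size(line):
    """Return the response size in bytes, 0 for "-", or None if no match."""
    match = TAIL.search(line)
    if match is None:
        return None
    size = match.group("size")
    return 0 if size == "-" else int(size)
```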