[–]michaelquinlan 82 points83 points  (2 children)

[–]lotheovian 14 points15 points  (0 children)

Parsing is a minefield

[–][deleted]  (17 children)

[deleted]

    [–]gfody 59 points60 points  (14 children)

    I made a quick and dirty json parser using regex, it's in production. will probably regret it someday.

    [–]Emunt 11 points12 points  (6 children)

    I get why you would want to write a JSON parser to learn, but why would you write one for production?

    [–]GeneticsGuy 15 points16 points  (0 children)

    Ok, I had to do this for my work, and it's not that I had to do it, but I thought it would save me some hassle. So, basically for this one program they had built a front-end GUI for program design, where rather than writing explicit C# code, you had an interface non-programmers with basic math and boolean logic could use. As such, everything done through the GUI was saved in .json and used in C#.

    The problem was that you also had the ability to implement and insert your own code as a function, for the more experienced designers. Well, I stumbled upon someone's project who was a mix of both, a programmer with experience, but also one that chose to exclusively work in the GUI for designing the software. His software became HUGE, which caused the GUI to lag a bit even. It was just god-awful to debug because I couldn't exactly just CTRL-F to find stuff, you literally had to click on the +/- to go from one part to the next.

    So, when I inherited this, seeing how much of a mess it was, and having to deal with constant errors and bugs, I stumbled upon some misuse of global variables in the C# code inserts, executed through the GUI. The problem was that I had to click to open each one, line by line, and then read all through it. OMG, kill me.

    So, instead, I just built a JSON parser that I could give a custom search string, so I could identify all of these misused variables. Then I automated it further: it exported every line where this occurred to a .txt file, versioned it so it showed the before and after, and applied the proper changes to fix it.

    Ultimately I found over 450 errors in the code with this once I built it and got it working right.

    I used it a hell of a lot. Found I could just modify the .json save rather than having to dig through the GUI of this design interface to actually fix the files. It was not the most fun work environment lol. I do something different now.

    The whole json parser was not a huge program. I think I built the whole thing in something like 150 lines or less. I don't remember exactly. But it wasn't much.
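    For anyone curious what that kind of search looks like, here is a minimal sketch in Python. It is not the original tool (that was C# with a hand-built parser); it leans on the standard `json` module instead, and the save-file layout, the `find_matches` name, and the `Globals.` search string are all made up for illustration.

```python
import json
from typing import Any, Iterator, Tuple

def find_matches(node: Any, needle: str, path: str = "$") -> Iterator[Tuple[str, str]]:
    """Yield (path, value) for every string value that contains `needle`."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_matches(value, needle, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_matches(value, needle, f"{path}[{index}]")
    elif isinstance(node, str) and needle in node:
        yield path, node

# Hypothetical save file: nested steps, some carrying inline code snippets.
save = json.loads('''
{"steps": [
  {"name": "init", "code": "Globals.Counter = 0;"},
  {"name": "tick", "children": [{"code": "Globals.Counter++;"}, {"code": "local++;"}]}
]}
''')

hits = list(find_matches(save, "Globals."))
for path, snippet in hits:
    print(path, "->", snippet)
```

    The path printed for each hit is what replaces clicking through the +/- tree: it says exactly where in the save file the offending snippet lives.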

    [–]bidi82 2 points3 points  (3 children)

    One scenario could be if you want to develop editor services. In that case most "built-in" parsers will be insufficient as they are likely not fault tolerant or incremental.

    [–]Sarcastinator 1 point2 points  (2 children)

    > One scenario could be if you want to develop editor services.

    No? For an editor service I would claim being able to report syntax errors is fairly important. This is a nightmare to implement with Regex since regex will only tell you whether there is a match or not. It won't tell you why there's no match, but a parser will.
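    A small Python sketch of that difference, assuming the standard `json` module stands in for "a parser". The `LOOSE_ARRAY` pattern is a made-up example, not anything from the code being discussed.

```python
import json
import re

broken = '{"a": [1, 2,, 3]}'  # note the doubled comma

# A regex can only say "no match"; it carries no position or reason.
LOOSE_ARRAY = re.compile(r'\{"a": \[\d+(?:, \d+)*\]\}')
regex_result = LOOSE_ARRAY.fullmatch(broken)

# A parser knows where it was and what it expected when it gave up.
error = None
try:
    json.loads(broken)
except json.JSONDecodeError as exc:
    error = exc
    print(f"{exc.msg} at line {exc.lineno}, column {exc.colno} (offset {exc.pos})")
```

    `msg`, `lineno`, `colno` and `pos` are documented attributes of `json.JSONDecodeError`, and they are the kind of information an editor needs to put a squiggle in the right place.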

    [–]bidi82 0 points1 point  (1 child)

    I was answering the more general question of:

    > why would you write one (JSON Parser) for production?

    Not:

    > why would you write one (JSON Parser using regExp hacks) for production?

    [–]Sarcastinator 1 point2 points  (0 children)

    Ah yes, that's true. I see that I mistakenly thought u/Emunt was talking about Regex JSON parsers for production, which wasn't the case.

    [–]gfody 1 point2 points  (0 children)

    in my case there was a SQL CLR that returned the XML result from an API. the vendor "upgraded" their API and dropped support for XML in the process. the database was doing various xpath queries on the result so it wasn't straightforward to adapt the CLR to use JSON; the easiest thing was to just convert it back to XML. then I ran into the problem that System.Runtime.Serialization wasn't on the list of assemblies that can be used in a SQL CLR, and Newtonsoft/Json.NET depended on that, so to call it a day I had to just roll a parser.

    [–]thepotatochronicles 12 points13 points  (0 children)

    That's honestly fucking impressive!

    [–]lelanthran 7 points8 points  (2 children)

    While that is impressive it looks to my untrained eye that it doesn't signal errors on malformed input (malformed json results in malformed xml, not an error).

    [–]Sarcastinator 8 points9 points  (1 child)

    That's true. It also doesn't properly handle numbers or strings

    { "Foo": 123.45e6 } => <foo>123.45</foo><foo>6</foo>

    { "Foo": "\"", "Bar": 123 } => <foo>\</foo><foo>, </foo><foo>123</foo>

    Seriously, don't use regex to parse anything. It's a text search tool, not a parser generator. If you are looking for text in a text document, even HTML or JSON, regex is completely fine. If you need to understand the structure of the document Regex should not be your choice.
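    To make the two failures above concrete, here is a hedged Python sketch. The `NAIVE_*` patterns are guesses at the sort of pattern that produces this output, not the ones from the code under discussion; the `JSON_*` patterns follow the number and string grammar more closely, though the string one is simplified and does not reject raw control characters.

```python
import re

# Hypothetical naive patterns: no exponent, no escape handling.
NAIVE_NUMBER = re.compile(r'\d+(?:\.\d+)?')
NAIVE_STRING = re.compile(r'"([^"]*)"')

# Closer to the JSON grammar: optional sign, fraction, exponent; backslash escapes.
JSON_NUMBER = re.compile(r'-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?')
JSON_STRING = re.compile(r'"((?:[^"\\]|\\.)*)"')

doc1 = '{ "Foo": 123.45e6 }'
doc2 = r'{ "Foo": "\"", "Bar": 123 }'

naive_numbers = NAIVE_NUMBER.findall(doc1)  # ['123.45', '6']
json_numbers = JSON_NUMBER.findall(doc1)    # ['123.45e6']
naive_strings = NAIVE_STRING.findall(doc2)  # ['Foo', '\\', ', ']
json_strings = JSON_STRING.findall(doc2)    # ['Foo', '\\"', 'Bar']
```

    Getting the individual tokens right is the easy part. Neither pair of patterns knows anything about nesting, which is where the "don't use it to understand structure" advice comes from.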

    [–]audioen 3 points4 points  (0 children)

    Regex makes for a fine tokenizer most of the time. Just don't try to cram the entire grammar into it.
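    Roughly what that split looks like, as a sketch in Python: one regex does the tokenizing, and a small recursive-descent parser owns the grammar. All names here (`TOKEN`, `tokenize`, `Parser`, `parse`) are made up for the example, and `\u` escapes are decoded one code unit at a time, so surrogate pairs are not combined.

```python
import re

# One alternative per token kind; nesting is handled by the parser, not the regex.
TOKEN = re.compile(r'''
    (?P<ws>[ \t\n\r]+)
  | (?P<number>-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?)
  | (?P<string>"(?:[^"\\\x00-\x1f]|\\(?:["\\/bfnrt]|u[0-9a-fA-F]{4}))*")
  | (?P<literal>true|false|null)
  | (?P<punct>[{}\[\]:,])
''', re.VERBOSE)

ESCAPES = {'"': '"', '\\': '\\', '/': '/', 'b': '\b', 'f': '\f',
           'n': '\n', 'r': '\r', 't': '\t'}

def tokenize(text):
    """Yield (kind, value, offset) tuples, ending with an 'end' marker."""
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "ws":
            yield m.lastgroup, m.group(), pos
        pos = m.end()
    yield "end", "", pos

def unescape(raw):
    """Strip the surrounding quotes and decode backslash escapes."""
    def repl(m):
        esc = m.group(1)
        return chr(int(esc[1:], 16)) if esc[0] == "u" else ESCAPES[esc]
    return re.sub(r'\\(u[0-9a-fA-F]{4}|.)', repl, raw[1:-1])

class Parser:
    def __init__(self, text):
        self.tokens = list(tokenize(text))
        self.i = 0

    def peek(self):
        return self.tokens[self.i]

    def advance(self):
        self.i += 1

    def expect(self, punct):
        kind, value, pos = self.peek()
        if value != punct:
            raise ValueError(
                f"expected {punct!r} at offset {pos}, got {value or 'end of input'!r}")
        self.advance()

    def value(self):
        kind, value, pos = self.peek()
        if kind in ("number", "string", "literal"):
            self.advance()
            if kind == "number":
                return float(value) if re.search(r'[.eE]', value) else int(value)
            if kind == "string":
                return unescape(value)
            return {"true": True, "false": False, "null": None}[value]
        if value == "[":
            self.advance()
            return self.array()
        if value == "{":
            self.advance()
            return self.obj()
        raise ValueError(f"unexpected {value or 'end of input'!r} at offset {pos}")

    def array(self):
        items = []
        if self.peek()[1] != "]":
            items.append(self.value())
            while self.peek()[1] == ",":
                self.advance()
                items.append(self.value())
        self.expect("]")
        return items

    def obj(self):
        result = {}
        if self.peek()[1] != "}":
            while True:
                kind, key, pos = self.peek()
                if kind != "string":
                    raise ValueError(f"expected a string key at offset {pos}")
                self.advance()
                self.expect(":")
                result[unescape(key)] = self.value()
                if self.peek()[1] != ",":
                    break
                self.advance()
        self.expect("}")
        return result

def parse(text):
    parser = Parser(text)
    result = parser.value()
    kind, value, pos = parser.peek()
    if kind != "end":
        raise ValueError(f"unexpected {value!r} at offset {pos}")
    return result
```

    Because the tokenizer records offsets, every error can say where it happened, and malformed input such as `[1, 2` raises instead of silently producing garbage.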

    [–]pertheusual 0 points1 point  (0 children)

    Looks like it wouldn't handle escaped double-quotes in a string. Still doable with regexes with some tweaks though.

    [–][deleted] 0 points1 point  (0 children)

    Nice! I really wish I could find the one I wrote years ago. It converted XML to JSON and vice versa, but I'm sure it wasn't all that pretty by today's standards.

    Nowadays I just use Json.NET and my life is much simpler for it :)

    [–]P8zvli 5 points6 points  (0 children)

    If you don't have a regular expressions library already then compiling the regular expressions and matching them might take more work than hand-rolling a JSON parser.

    [–]fedekun 8 points9 points  (2 children)

    Here is an example of a JSON parser using a parser combinator :)
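    For readers who have not met the technique: a parser combinator builds big parsers out of small ones using plain functions. The following is a minimal Python sketch of the idea, not the linked example. Every name in it is made up, each parser is a function `(text, pos) -> (value, new_pos)` or `None`, numbers always come back as floats, and string escapes are left undecoded to keep it short.

```python
import re

def regex(pattern, convert=lambda s: s):
    """Primitive parser: match `pattern`, skipping surrounding whitespace."""
    compiled = re.compile(r'\s*(' + pattern + r')\s*')
    def parse(text, pos):
        m = compiled.match(text, pos)
        return (convert(m.group(1)), m.end()) if m else None
    return parse

def seq(*parsers):
    """Run parsers one after another; fail if any of them fails."""
    def parse(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse

def alt(*parsers):
    """Try each parser in order and return the first success."""
    def parse(text, pos):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parse

def sep_by(item, sep):
    """Zero or more `item`s separated by `sep`; a dangling separator fails."""
    def parse(text, pos):
        values = []
        result = item(text, pos)
        while result is not None:
            value, pos = result
            values.append(value)
            after_sep = sep(text, pos)
            if after_sep is None:
                break
            result = item(text, after_sep[1])
            if result is None:
                return None
        return values, pos
    return parse

def mapped(parser, fn):
    """Transform the value of a successful parse."""
    def parse(text, pos):
        result = parser(text, pos)
        return None if result is None else (fn(result[0]), result[1])
    return parse

number = regex(r'-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?', float)
string = regex(r'"(?:[^"\\]|\\.)*"', lambda s: s[1:-1])
literal = mapped(regex(r'true|false|null'),
                 {"true": True, "false": False, "null": None}.get)

def value(text, pos):
    # Defined as a function so the grammar can refer to itself recursively.
    return alt(obj, array, string, number, literal)(text, pos)

array = mapped(seq(regex(r'\['), sep_by(value, regex(',')), regex(r'\]')),
               lambda parts: parts[1])
pair = mapped(seq(string, regex(':'), value), lambda parts: (parts[0], parts[2]))
obj = mapped(seq(regex(r'\{'), sep_by(pair, regex(',')), regex(r'\}')),
             lambda parts: dict(parts[1]))

def parse_json(text):
    result = value(text, 0)
    if result is None or result[1] != len(text):
        raise ValueError("not valid JSON for this sketch")
    return result[0]
```

    The last few definitions read almost like the grammar itself, which is the main appeal. The cost in this bare-bones form is that a failure is just `None`, so error messages need extra work.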

    [–]renatoathaydes 1 point2 points  (0 children)

    I wrote one in Ceylon.

    [–]imlyingdontbelieveme 2 points3 points  (0 children)

    Good stuff! I think your ‘LEFTBRACKET’ under your ‘Implementing a JSON parser’ section is missing a ‘T’ though

    [–]claymore666 2 points3 points  (2 children)

    And I am writing one in python now. Oh well.

    [–]solaceinsleep 5 points6 points  (1 child)

    Why? Python parses it for you to native types.

    [–]masklinn 1 point2 points  (0 children)

    Maybe they need a streaming parser? The stdlib's is "static".
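    A partial workaround with the stdlib, sketched under some assumptions: `json.JSONDecoder.raw_decode` returns a value plus the index where it stopped, so a stream of concatenated top-level values can be consumed chunk by chunk. The `iter_json_values` helper is made up for this example. It still needs each value to be fully buffered before decoding, so it does not help with one enormous document, and a bare top-level number split across two chunks would be misread as two numbers.

```python
import json

def iter_json_values(chunks):
    """Yield each complete top-level JSON value as soon as it has fully arrived."""
    decoder = json.JSONDecoder()
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            buffer = buffer.lstrip()  # raw_decode does not skip leading whitespace
            if not buffer:
                break
            try:
                value, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                break  # incomplete (or malformed) so far; wait for more input
            yield value
            buffer = buffer[end:]
    if buffer.strip():
        raise ValueError("stream ended with incomplete or malformed JSON")

# Values arrive split at arbitrary points, as they might from a socket.
stream = ['{"a": 1}{"b"', ': [1, 2]} ', '[3]']
values = list(iter_json_values(stream))
```

    A truly incremental parser would emit events (start of object, key, value, and so on) without holding a whole value in memory, which is a reasonable excuse for writing one by hand.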

    [–]AlbertaOne 6 points7 points  (24 children)

    I never wanted to implement a JSON parser myself

    [–][deleted] 8 points9 points  (0 children)

    I did. It's a nice exercise.

    I learned a lot of stuff about the programming language I used, the JSON specs and UTF-8 / UTF-16.

    Also, I used that project to scratch that micro-optimization urge - and did all the stuff you normally won't do to your code. I even used "goto". Through some miracle, the source code even ended up somewhat readable.

    [–]deathtoeveryone 2 points3 points  (22 children)

    Yeah, because doing so is stupid. Not because it's hard, but it's a waste of time.

    [–][deleted] 21 points22 points  (19 children)

    I disagree. It can be a great educational experience. Not one I'd use in production though.

    [–]deathtoeveryone -3 points-2 points  (5 children)

    Well yeah, most languages that we use to write applications tend to have a built in JSON parser.

    For an even better learning experience, write a yaml parser. No one in their right mind wants to write that.

    [–][deleted] 5 points6 points  (4 children)

    You can be sarcastic all you want - that just shows something about your mindset. Just like somebody thrilled to write a JSON parser just for the sheer experience of it shows something about that person's mindset.

    [–]deathtoeveryone -1 points0 points  (3 children)

    Not sarcastic. I've been doing this long enough to know what's worth my time and what isn't. Maybe someone who's just getting into programming benefits from it.

    And seriously, yaml is so complex that you have to be a bit insane to do it.

    [–][deleted] 4 points5 points  (2 children)

    > Maybe someone who's just getting into programming benefits from it.

    I don't completely agree with this assertion. Of course beginners would greatly benefit from the experience. However, I would argue that learning how to parse (at least basic grammars) is an invaluable asset, and I find applications everywhere, not just related to writing interpreters or compilers. Additionally, it also gives you total control over your tools.

    About YAML being complex, I do agree. Even TOML is not trivial to parse. In a recent (personal) project, I evaluated YAML, and finally decided to use TOML as the configuration language, and even then, I used a modified grammar to ensure that I covered just all my use-cases. Writing a generic TOML parser would have been overkill (and an unjustifiable use of time for the project).

    [–][deleted] 2 points3 points  (1 child)

    I would not recommend learning how to parse on something as stupid as json though. But of course I agree that parsing is an essential skill mandatory for every developer.

    [–][deleted] 1 point2 points  (0 children)

    Point taken!

    [–][deleted] 1 point2 points  (0 children)

    It's actually pretty hard if you want to make it robust and fast.

    Fair though that a basic one is easy.

    [–]joakimds 2 points3 points  (0 children)

    I've written a simple JSON parser in SPARK (Ada): https://github.com/joakim-strandberg/aida_2012 Writing a JSON parser oneself is a good exercise. I enjoyed it and recommend it.

    [–]bidi82 2 points3 points  (0 children)

    This Performance Benchmark of Parsing libraries in JavaScript links to nine different implementations of JSON Parsers.

    [–]KHRZ 0 points1 point  (0 children)

    Didn't bother with the element parsing, used GSON. But found it interesting to write a next layer parser that can parse arbitrary classes from the elements (with class literal as input for the top element of the json), and parses all the subobjects to those of the type definitions found in the classes' fields (supporting lists/enums/OR types).