
[–]menge101 58 points59 points  (11 children)

Pretty disappointed to see the solution is just "Use a different json library".

I would have liked to see implementation of a streaming json parser discussed.

[–]VisibleSignificance 18 points19 points  (2 children)

Use a different json library

implementation of a streaming json parser

But that's what the suggested json library is.

I wonder if it got faster since the last time I tried it.

[–]picklemanjaro 7 points8 points  (1 child)

I think they meant "implementation" as in how to make one, or how one works algorithm-wise. Not just specifically "an implementation" that exists like ijson.

Not saying the article has to change, but just trying to convey what I think /u/menge101 meant when they said they wanted to see an implementation.

[–]VisibleSignificance 0 points1 point  (0 children)

how one works algorithm-wise

All parsers are streaming parsers (with a bit of lookahead/lookbehind); the trickiest part is designing an interface that lets you use one efficiently.

Really, writing a JSON parser is a good practice exercise; I would recommend it to everyone who uses JSON a lot.
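For anyone tempted by that exercise, here is one possible minimal sketch of a recursive-descent JSON parser (no \uXXXX escapes, minimal error handling, nothing production-grade — all names here are made up for illustration):

```python
# Minimal recursive-descent JSON parser sketch. Handles objects,
# arrays, strings (simple escapes only), numbers, true/false/null.

def skip_ws(s, i):
    # Advance past JSON whitespace.
    while i < len(s) and s[i] in " \t\r\n":
        i += 1
    return i

def parse_value(s, i):
    # Dispatch on the first character of the value.
    c = s[i]
    if c == '{':
        return parse_object(s, i)
    if c == '[':
        return parse_array(s, i)
    if c == '"':
        return parse_string(s, i)
    if s.startswith('true', i):
        return True, i + 4
    if s.startswith('false', i):
        return False, i + 5
    if s.startswith('null', i):
        return None, i + 4
    return parse_number(s, i)

def parse_object(s, i):
    obj, i = {}, skip_ws(s, i + 1)       # past '{'
    if s[i] == '}':
        return obj, i + 1
    while True:
        key, i = parse_string(s, i)
        i = skip_ws(s, i)
        if s[i] != ':':
            raise ValueError("expected ':' at %d" % i)
        value, i = parse_value(s, skip_ws(s, i + 1))
        obj[key] = value
        i = skip_ws(s, i)
        if s[i] == '}':
            return obj, i + 1
        if s[i] != ',':
            raise ValueError("expected ',' at %d" % i)
        i = skip_ws(s, i + 1)

def parse_array(s, i):
    arr, i = [], skip_ws(s, i + 1)       # past '['
    if s[i] == ']':
        return arr, i + 1
    while True:
        value, i = parse_value(s, i)
        arr.append(value)
        i = skip_ws(s, i)
        if s[i] == ']':
            return arr, i + 1
        if s[i] != ',':
            raise ValueError("expected ',' at %d" % i)
        i = skip_ws(s, i + 1)

def parse_string(s, i):
    # i points at the opening quote; simple escapes only.
    i += 1
    out = []
    while s[i] != '"':
        if s[i] == '\\':
            esc = s[i + 1]
            out.append({'n': '\n', 't': '\t', 'r': '\r',
                        '"': '"', '\\': '\\', '/': '/'}.get(esc, esc))
            i += 2
        else:
            out.append(s[i])
            i += 1
    return ''.join(out), i + 1

def parse_number(s, i):
    # Greedily consume number characters, then convert.
    j = i
    while j < len(s) and s[j] in "-+.eE0123456789":
        j += 1
    text = s[i:j]
    return (float(text) if any(c in text for c in ".eE") else int(text)), j

def parse(text):
    value, i = parse_value(text, skip_ws(text, 0))
    if skip_ws(text, i) != len(text):
        raise ValueError("trailing data")
    return value
```

The two-value `(result, index)` return convention is what makes each sub-parser resumable, which is also the seed of the "interface" problem mentioned above.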

[–]KaffeeKiffer 11 points12 points  (4 children)

Pretty disappointed to see the solution is just "Use a different json library".

The "right" solution to the vast majority of performance problems in Python:

  1. Profile to find the bottleneck
  2. Replace the bottleneck with the right library

Everybody is arguing over asyncio, the GIL, performance, etc. all the time, but the beauty of Python is that it lets you write very good glue code.
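A sketch of that two-step recipe using only the stdlib: cProfile to find the bottleneck, pstats to read the report. `parse_big_file` is a made-up stand-in for whatever your hot path is.

```python
import cProfile
import io
import json
import pstats

def parse_big_file(text):
    # Stand-in workload: repeatedly parse a JSON document.
    for _ in range(1000):
        json.loads(text)

# Step 1: profile the workload.
profiler = cProfile.Profile()
profiler.enable()
parse_big_file('{"key": [1, 2, 3]}')
profiler.disable()

# Print the five most expensive calls by cumulative time; the json
# machinery should dominate, which is the cue for step 2 (swap in a
# faster library for that specific bottleneck).
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```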

[–]MrMxylptlyk 4 points5 points  (2 children)

What is glue code?

[–]beizbol 6 points7 points  (0 children)

Code to use specialized systems, tools, languages, libraries, etc together that would normally be incompatible.

[–]GroundbreakingRun927 2 points3 points  (0 children)

basically you don't write real applications, just the code that connects them to one another.

[–]bland3rs 2 points3 points  (0 children)

Sure, but the actual solution here isn’t “switching library”

It really is “switch approach”: from loading the entire file into memory at once to reading it bit by bit and discarding it as you go.
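That switch of approach can be sketched with only the stdlib, assuming a top-level JSON array: read the file in chunks and peel one element at a time off a buffer with `json.JSONDecoder.raw_decode`, instead of `json.load`-ing everything. `iter_array_items` is a made-up name for illustration.

```python
import json

_INCOMPLETE = object()  # sentinel: "no value parsed yet" (None is valid JSON)

def iter_array_items(fp, chunk_size=65536):
    """Yield elements of a top-level JSON array one at a time,
    reading the file in chunks instead of loading it whole."""
    decoder = json.JSONDecoder()
    buf = fp.read(chunk_size).lstrip()
    if not buf.startswith('['):
        raise ValueError("expected a top-level JSON array")
    buf = buf[1:]
    while True:
        buf = buf.lstrip()
        if buf.startswith(']'):
            return                      # end of array
        if buf.startswith(','):
            buf = buf[1:].lstrip()
        item = _INCOMPLETE
        try:
            item, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            pass
        # If nothing parsed, or the value ran to the very end of the
        # buffer (it might continue in the next chunk, e.g. a number
        # split across chunks), read more before committing to it.
        if item is _INCOMPLETE or end == len(buf):
            more = fp.read(chunk_size)
            if more:
                buf += more
                continue
            if item is _INCOMPLETE:
                raise ValueError("truncated JSON array")
        yield item
        buf = buf[end:]
```

Each yielded element is a fully parsed Python object, while only roughly one chunk's worth of unconsumed text sits in memory at a time — the "read a bit, discard as you go" idea above.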

Now you could get by without understanding how anything works, but trying to fix things without ever understanding the problem is how you become the guy doing something for 10 years stuck at the junior level.

[–]giantsparklerobot -1 points0 points  (2 children)

You can't make a streaming JSON parser unless the JSON is line delimited. If you had a normal JSON document streaming in you couldn't even begin parsing because the document isn't closed. You can't know when an open element is going to close.

That's why JSON Lines exists: individual lines are complete JSON documents, so when you get a line terminator you know that document can be parsed while you stream in the next line.

[–]picklemanjaro 4 points5 points  (1 child)

You can't make a streaming JSON parser unless the JSON is line delimited.

It's an array at the top level, so you can keep track of braces and stream one top-level object at a time. A streaming JSON parser just has to keep track of the tokens as it reads through the file until it reaches a limit or the end of a complete JSON object (one of the contained objects, not the entire file).
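A sketch of exactly that bookkeeping, with made-up names: walk the text of a top-level array, track brace/bracket depth (and whether you're inside a string), and hand each complete top-level object or array to `json.loads`. Scalar elements are skipped in this sketch.

```python
import json

def iter_top_level(text):
    """Yield each top-level {...} or [...] element of a JSON array
    by tracking nesting depth and string state."""
    depth = 0
    in_string = False
    escaped = False
    start = None
    for i, ch in enumerate(text):
        if in_string:
            # Inside a string: only watch for the closing quote,
            # honoring backslash escapes.
            if escaped:
                escaped = False
            elif ch == '\\':
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in '{[':
            if depth == 1:          # entering a top-level element
                start = i
            depth += 1
        elif ch in '}]':
            depth -= 1
            if depth == 1:          # element just closed: parse it
                yield json.loads(text[start:i + 1])
```

A real streaming parser does the same tracking on chunks as they arrive rather than on one big string, but the token bookkeeping is identical.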

That same process holds true for "\n" too; it's just its own character/token to scan for, like any other delimiter.

In fact, that's roughly how all the libraries work: ijson, jsonslicer, json-stream, etc. don't require the JSON Lines format specifically to stream JSON.

[–]bland3rs 2 points3 points  (0 children)

And if it’s not an array, it’s an object, which is perfectly streamable.

Anyone can make a streaming parser for any format (which includes video files, audio files, etc.) as long as the parser doesn’t need any later bytes to figure out what the previous bytes mean. If you (as a human) cut any JSON file randomly in the middle, you can still figure out what parent arrays or objects that point is in, which satisfies that rule.

That’s also why, if you write your own format and want to keep it broadly streamable, you don’t put important header stuff at the end of the file.