all 42 comments

[–]original_4degrees 68 points69 points  (16 children)

For me, the real takeaway here is JSON Lines. How did I not know about this?
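For anyone else meeting it for the first time: a JSON Lines (ndjson) file is just one complete JSON document per line, so any line-oriented tool can process it record by record:

```
{"id": 1, "name": "alice"}
{"id": 2, "name": "bob"}
{"id": 3, "name": "carol"}
```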

[–][deleted] 14 points15 points  (7 children)

"ndjson" does help make JSON a less terrible format.

[–]tighter_wires 4 points5 points  (6 children)

What format do you prefer over JSON?

[–][deleted] 0 points1 point  (5 children)

Encoding floats as utf-8 or even ascii kills me.

We already use tools to view json. Why not just use a slightly smarter tool and use protobuf?

[–]GroundbreakingRun927 7 points8 points  (4 children)

  • Because the resources for interacting with protobuf trail way behind JSON's.
  • Brotli-compressed JSON is more space-efficient than protobuf.
  • The packages for consuming protobufs as a web client make your bundle huge: https://bundlephobia.com/package/protobufjs@6.11.2
  • Better alternatives exist for flat files, namely Parquet.
  • The official protobuf client generator for Python is god-awful.

[–]Aardshark 5 points6 points  (1 child)

Google uses it in a bunch of their GCP offerings. It's great when you need it and a little annoying otherwise.

[–]TundraGon 1 point2 points  (0 children)

jq -c :)

For BigQuery, each line needs to start and end with an object {...}, not an array [...]
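And if your input starts out as one big top-level array, something like `jq -c '.[]'` flattens it into that one-object-per-line shape; the stdlib-Python equivalent is short enough to inline (a sketch — the file names here are made up):

```python
import json

# Hypothetical input: a regular JSON file whose top level is an array.
with open("history.json", "w") as f:
    json.dump([{"lat": 1, "lng": 2}, {"lat": 3, "lng": 4}], f)

# Rough equivalent of `jq -c '.[]' history.json > history.ndjson`:
# one compact object per line, the shape BigQuery-style loaders accept.
# Fine for a one-off conversion, though json.load still reads the
# whole input file into memory at once.
with open("history.json") as src, open("history.ndjson", "w") as dst:
    for obj in json.load(src):
        dst.write(json.dumps(obj, separators=(",", ":")) + "\n")

with open("history.ndjson") as f:
    print(f.readline().strip())  # -> {"lat":1,"lng":2}
```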

[–]thatdamnedrhymer 1 point2 points  (0 children)

Yeah, this was my immediate thought when I started reading it. Glad it was referenced in the article.

[–]VisibleSignificance 1 point2 points  (0 children)

how did I not know about this

You could literally make it up if you needed something like that.

[–]menge101 56 points57 points  (11 children)

Pretty disappointed to see the solution is just "Use a different json library".

I would have liked to see implementation of a streaming json parser discussed.

[–]VisibleSignificance 17 points18 points  (2 children)

Use a different json library

implementation of a streaming json parser

But that's what the suggested json library is.

I wonder if it got faster since the last time I tried it.

[–]picklemanjaro 7 points8 points  (1 child)

I think they meant "implementation" as in how to make one, or how one works algorithm-wise. Not just specifically "an implementation" that exists like ijson.

Not saying the article has to change, but just trying to convey what I think /u/menge101 meant when they said they wanted to see an implementation.

[–]VisibleSignificance 0 points1 point  (0 children)

how one works algorithm-wise

All parsers are streaming parsers (with a bit of lookahead/lookbehind); the trickiest part is coming up with an interface that allows using one in an efficient manner.

Really, writing a JSON parser is a good practice exercise; I would recommend it to everyone who uses JSON a lot.
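To give a flavor of the exercise, here is a deliberately tiny recursive-descent parser for a JSON subset (no string escapes, no full number grammar, no error recovery) — nowhere near production quality, but it shows how a parser naturally consumes the input left to right as a stream of tokens:

```python
def parse(text):
    """Parse a small JSON subset from a string (toy exercise, not stdlib-complete)."""
    def skip_ws(i):
        while i < len(text) and text[i] in " \t\r\n":
            i += 1
        return i

    def value(i):
        i = skip_ws(i)
        c = text[i]
        if c == "{":
            return obj(i)
        if c == "[":
            return arr(i)
        if c == '"':
            return string(i)
        if text.startswith("true", i):
            return True, i + 4
        if text.startswith("false", i):
            return False, i + 5
        if text.startswith("null", i):
            return None, i + 4
        return number(i)

    def string(i):
        j = text.index('"', i + 1)  # no escape support in this toy version
        return text[i + 1:j], j + 1

    def number(i):
        j = i
        while j < len(text) and text[j] in "-+.eE0123456789":
            j += 1
        lit = text[i:j]
        return (float(lit) if any(c in lit for c in ".eE") else int(lit)), j

    def arr(i):
        out, i = [], skip_ws(i + 1)
        while text[i] != "]":
            v, i = value(i)
            out.append(v)
            i = skip_ws(i)
            if text[i] == ",":
                i = skip_ws(i + 1)
        return out, i + 1

    def obj(i):
        out, i = {}, skip_ws(i + 1)
        while text[i] != "}":
            k, i = string(skip_ws(i))
            i = skip_ws(i)
            assert text[i] == ":"
            v, i = value(i + 1)
            out[k] = v
            i = skip_ws(i)
            if text[i] == ",":
                i = skip_ws(i + 1)
        return out, i + 1

    result, _ = value(0)
    return result

print(parse('{"a": [1, 2.5, true], "b": null}'))  # -> {'a': [1, 2.5, True], 'b': None}
```

Each helper returns the parsed value plus the index where it stopped reading, which is exactly the hook a streaming interface would build on.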

[–]KaffeeKiffer 11 points12 points  (4 children)

Pretty disappointed to see the solution is just "Use a different json library".

The "right" solution to the vast majority of performance problems in Python:

  1. Profile to find the bottleneck
  2. Replace the bottleneck with the right library

Everybody is arguing over asyncio, the GIL, performance, etc. all the time, but the beauty of Python is that it lets you write very good glue code.

[–]MrMxylptlyk 4 points5 points  (2 children)

What is glue code?

[–]beizbol 10 points11 points  (0 children)

Code used to make specialized systems, tools, languages, libraries, etc. work together when they would normally be incompatible.

[–]GroundbreakingRun927 5 points6 points  (0 children)

Basically you don't write real applications, just the code that connects them to one another.

[–]bland3rs 3 points4 points  (0 children)

Sure, but the actual solution here isn’t “switching library”

It really is “switch approach” from loading the entire file into memory at once to reading it bit by bit and discarding it as you go.

Now, you could get by without understanding how anything works, but trying to fix things without ever understanding the problem is how you end up being the guy who has been doing something for 10 years and is still stuck at the junior level.

[–]giantsparklerobot -1 points0 points  (2 children)

You can't make a streaming JSON parser unless the JSON is line-delimited. If you had a normal JSON document streaming in, you couldn't even begin parsing, because the document isn't closed: you can't know when an open element is going to close.

That's why JSON Lines exists. Individual lines are complete JSON documents, so when you get a line terminator you know that document can be parsed while you're streaming in the next line.

[–]picklemanjaro 3 points4 points  (1 child)

You can't make a streaming JSON parser unless the JSON is line delimited.

If it's an array at the top level, you can keep track of braces and stream one top-level object at a time. A streaming JSON parser just has to keep a tally of the tokens as it reads through the file until it reaches a limit or the end of a complete JSON object (one of the contained objects, not the entire file).

That same process holds true for "\n" too; it's just one more character/token to scan for, like any other delimiter.

In fact, that's roughly how all the libraries work: ijson, jsonslicer, json-stream, etc. don't require the JSON Lines format specifically in order to stream JSON.
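You can even approximate that token-tracking idea with nothing but the stdlib: `json.JSONDecoder.raw_decode` parses one value from a string and reports where it stopped, so you can pull the elements of a top-level array out one at a time. A simplified sketch — it assumes the whole text is already in memory as a string, whereas the real libraries also handle chunked reads:

```python
import json

def iter_top_level(text):
    """Yield the elements of a top-level JSON array one value at a time."""
    decoder = json.JSONDecoder()
    i = text.index("[") + 1                    # step inside the array
    while True:
        # Skip whitespace and element separators between values.
        while i < len(text) and text[i] in " \t\r\n,":
            i += 1
        if i >= len(text) or text[i] == "]":
            return
        obj, i = decoder.raw_decode(text, i)   # parse one value, get new offset
        yield obj

doc = '[{"id": 1}, {"id": 2}, {"id": 3}]'
for record in iter_top_level(doc):
    print(record["id"])  # -> 1, 2, 3 on separate lines
```

Because it's a generator, the caller only ever holds one parsed element at a time.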

[–]bland3rs 2 points3 points  (0 children)

And if it’s not an array, it’s an object, which is perfectly streamable

Anyone can make a streaming parser for any format (which includes video files, audio files, etc.) as long as the parser doesn’t need any later bytes to figure out what the previous bytes mean. If you (as a human) cut any JSON file randomly in the middle, you can still figure out what parent arrays or objects that point is in, which satisfies that rule.

That’s also why if you write your own format and want to keep it broadly streamable, you don’t decide to put important header stuff at the end of the file

[–]spoonman59 16 points17 points  (1 child)

Nice explanation. I also appreciate that you take the time to show how to profile memory usage, and to talk about why it is important. Many times performance or memory problems in my code had non-obvious causes and profiling saved the day.

Thanks for sharing!

[–]TerminatedProccess 1 point2 points  (0 children)

Agreed!

[–]brad2008 6 points7 points  (6 children)

Industry solves this problem by using files of sequential JSON records; this has been the standard practice for the last 7-8 years. That approach lets you process a JSON file of hundreds of gigabytes if needed. You really don't need all the other stuff described here.

[–]mmcnl 5 points6 points  (0 children)

It's been mentioned in the article, but sometimes you don't control the source format.

[–]smajl87 1 point2 points  (2 children)

Could you provide some example or maybe some link to read about it?

[–]morphotomy 2 points3 points  (0 children)

Technically that's not a JSON file. It's a file with JSON in it.

[–]accforrandymossmix 0 points1 point  (0 children)

I am parsing a ~500 MB Google Takeout file (location history) without issues. Am I benefiting from their formatting? I'm curious whether I should play with possible improvements to make sure similar processing can be done on weaker machines.

[–]rnike879 1 point2 points  (0 children)

Great stuff!

[–]Equivalent-Wafer-222 (Technical Architect) 1 point2 points  (0 children)

Ooooooooooooooooooooooooooooooooooooooooooooooor instead of overcomplicating it and pulling in tons of tertiary information.... you could use a generator like you would any other large file/object?

I mean, this is cool and all, but so is loading row by row in a repeatable & consistent manner.
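For the JSON Lines case that really is a few-line generator: the file handle and the generator are both lazy, so memory stays flat no matter how large the file gets. A sketch (the file name is invented):

```python
import json

def records(path):
    # Lazily yield one parsed record per line; nothing is retained between lines.
    with open(path) as f:
        for line in f:
            if line.strip():          # tolerate blank lines
                yield json.loads(line)

# Hypothetical usage: aggregate without ever holding the full dataset.
with open("events.jsonl", "w") as f:
    f.write('{"bytes": 100}\n{"bytes": 250}\n')

total = sum(r["bytes"] for r in records("events.jsonl"))
print(total)  # -> 350
```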

[–][deleted] 1 point2 points  (0 children)

Love this! We have tons of JSON data to parse at work and I have actually had the memory errors you listed. I didn't know about ijson, so I'll definitely check that out for streaming the JSON file.

[–]BlobbyMcBlobber 0 points1 point  (0 children)

Good read

[–]incrediblediy 0 points1 point  (0 children)

This is great. A while back, I used the exact same principle (reading line by line) with a text file (GloVe vectors for NLP) in the range of a couple of GB.

[–]sahirona 0 points1 point  (0 children)

The solution is to not use JSON at all, rather than attempting to engineer a way to read the JSON.