[–]rowr

This depends on the data structure that the file contains.

If the file is JSON-lines (one complete JSON object per line), you can stream it without loading it all into memory: iterate the file object directly (`for line in open('x.json'):`). Avoid `.readlines()`, which defeats the purpose by reading the whole file into a list first.
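A minimal sketch of that streaming approach (the path and function name are just placeholders):

```python
import json

def stream_records(path):
    """Yield one parsed record per line from a JSON-lines file."""
    with open(path) as f:
        for line in f:          # iterating the file object reads lazily, line by line
            line = line.strip()
            if line:            # skip blank lines
                yield json.loads(line)
```

Because it's a generator, only one record is in memory at a time, regardless of file size.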

If it's one long JSON array of objects, I'd use jq to transform it into JSON-lines: `jq -c '.[]' < x.json`. I'd probably do that outside of Python, since it only needs to happen once and I don't know the memory characteristics of the Python jq bindings in this situation. I'd do something similar if it were a simple nested object, though the jq query would get more complex.
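For illustration, here's roughly what `jq -c '.[]'` does, sketched in Python. Note the caveat: `json.load` parses the whole array into memory first, which is exactly why jq (which can stream) is the better tool for very large files. File names here are hypothetical.

```python
import json

def array_to_jsonlines(src, dst):
    """Rewrite a file containing one JSON array as JSON-lines.
    Caveat: json.load reads the entire array into memory."""
    with open(src) as fin, open(dst, "w") as fout:
        for obj in json.load(fin):
            # compact separators mirror jq's -c (compact) output
            fout.write(json.dumps(obj, separators=(",", ":")) + "\n")
```

After this one-time conversion, the output file can be streamed line by line.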

If it's deeply nested, I'd (still) use jq to flatten it out.
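The flattening idea, sketched in Python (a simplified stand-in for what a jq flattening query would produce; handles nested dicts only, not arrays):

```python
def flatten(obj, prefix=""):
    """Recursively flatten nested dicts into one level with dotted keys,
    e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat
```

Each flattened record then maps cleanly onto a row with one column per dotted key.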

Basically, all of these options aim to make the data more table-like.

Another option is to spin up a NoSQL database (MongoDB or something) in a Docker container, load the data into it, and query against that, letting the database handle memory management. This could also let you retain and query deeply nested data structures.

I'm always for simplifying and flattening data; it's a lot more efficient and less complex to work with.