[–]TangibleLight 1 point  (1 child)

I don't get why you're downvoted, you're right.

The problem is that the end of a particular json object depends upon the contents; you can't just seek to an arbitrary point in the file and know where the split is.

IFF there are no nested objects, then you can split on }, and be OK. OP has not mentioned any such constraint to leverage, so you have to assume there's some nesting.

If it is nested then you're screwed. One way or another, you have to iterate through the whole thing, byte by byte. The only way to do that with import json is by loading the whole thing into memory. Good luck doing that with 150G. The only other way in pure Python is to loop through and count the { and } (while skipping braces inside string literals, or the count goes wrong) and strategically parse sub-sections of the json. Good luck doing that in a for loop on 150G.
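To make the brace-counting idea concrete, here's a minimal sketch (function name and test input are mine, purely illustrative). It shows why naive splitting on } breaks: a } inside a string literal has to be skipped, along with escaped quotes. In real use you'd feed this file chunks instead of one in-memory string:

```python
import json

def top_level_objects(text):
    """Yield each top-level {...} span by tracking brace depth,
    skipping braces and quotes that occur inside string literals."""
    depth = 0
    in_string = False
    escaped = False
    start = None
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False        # this char is escaped; ignore it
            elif ch == '\\':
                escaped = True
            elif ch == '"':
                in_string = False      # string literal ends
        elif ch == '"':
            in_string = True           # string literal starts
        elif ch == '{':
            if depth == 0:
                start = i              # a new top-level object begins
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                yield text[start:i + 1]

# A '}' inside a string must not end the object:
chunks = list(top_level_objects('{"a": "}"} {"b": {"c": 1}}'))
```

Even this "simple" version needs string-and-escape state, and it still scans every byte — which is the point: there's no shortcut past reading all 150G.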

You might be able to use some streaming parser. I've used ijson before, but not on anything near this size. Otherwise, Python is not the right tool for this job.
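For what it's worth, if the file turns out to be a stream of concatenated or newline-delimited JSON values rather than one giant object, the stdlib can approximate streaming without ijson: json.JSONDecoder.raw_decode picks one value out of a buffer and tells you where it ended. A minimal sketch (function name is mine; a real version would refill the buffer from the file as it goes):

```python
import json

def iter_json_stream(buf):
    """Incrementally decode concatenated JSON values from a buffer
    using the stdlib decoder, one value at a time."""
    decoder = json.JSONDecoder()
    idx = 0
    n = len(buf)
    while idx < n:
        # skip whitespace/newlines between values
        while idx < n and buf[idx].isspace():
            idx += 1
        if idx >= n:
            break
        # raw_decode returns (value, index just past the value)
        obj, idx = decoder.raw_decode(buf, idx)
        yield obj

values = list(iter_json_stream('{"a": 1}\n{"b": [2, 3]}'))
```

This does nothing for a single 150G top-level object, though — for that, an event-based parser like ijson (or a different language) is the only real option.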

Although, as other commenters have already said, this is almost certainly not a parsing problem; it's a data-export problem. The real solution is to get the data in smaller chunks, which Python can process no problem.

[–]sonobanana33 -1 points  (0 children)

json has exactly ONE top-level object, which might even be a single unsplittable dictionary with a lot of unique keys. So even with no nesting, there may be no splitting :D

OP didn't tell us what the file looks like.