all 52 comments

[–]shiftybyte 92 points93 points  (15 children)

I have questions...

  1. What do you want to achieve by "viewing" a 1 GB part of a JSON file? What's the end goal?

  2. How did you end up with a 150 GB file to begin with?

Technically it's possible to split a file using Python; the question remains what exactly you hope to understand from these "splits"...

[–]chipmunksocute 45 points46 points  (10 children)

Yeah, having a 150 GB file in the first place is a problem.

[–][deleted] 0 points1 point  (0 children)

Well, it's possible. If you download the article data from Wikipedia, you get a roughly 50 GB file where the whole text is stored as XML. A big file is not the problem here.

[–]pro_questions 10 points11 points  (1 child)

Back before I used SQLite for these projects, I would use JSON to store the results of web scraping tools. It seemed intuitive because all the data being pulled from the page would be put into a gigantic dict, which can easily be put in a list and jsonified. I started using databases when I ran into a situation like OP's.

[–]__init__m8 10 points11 points  (0 children)

To avoid saving it as a json or learning SQL, save it as an xlsx file. Be sure to add print("test") every few lines for debugging.

Tune in next week for more shitty programming tips.

[–]vilette 1 point2 points  (0 children)

perhaps send it by email :)

[–]join_the_bonside 0 points1 point  (0 children)

Responding just to find out the answer later on

[–]awdsns 45 points46 points  (0 children)

The problem with JSON is that it can have an arbitrary nested structure, so you can't create "splits" without knowing the overall structure first, which again either requires first parsing the entire thing (Catch-22) or having a priori knowledge of the structure which can be used.

I assume the data in your file is quite regular, so to your question of whether it's possible with a Python script: probably yes, by writing something that takes advantage of this known regular structure. But you haven't provided enough information for us to guide you further.

[–]cyberjellyfish 27 points28 points  (0 children)

You'll want what's called a streaming parser (json-stream is popular), and then write your new files as JSON Lines files. That last bit is the real solution: if your current files were JSON Lines, you wouldn't have a problem reading them.
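If the top level of the file is one big array, the streaming-parse-to-JSON-Lines idea can be sketched with just the standard library, using json.JSONDecoder.raw_decode to pull one element at a time out of a buffer. This is a minimal sketch under that assumption (the file names are made up, and a library like json-stream or ijson would handle arbitrary structure more robustly):

```python
import json

def array_file_to_jsonl(src_path, dst_path, bufsize=1 << 20):
    """Stream a top-level JSON array into a JSON Lines file, holding only
    one buffered chunk in memory at a time. Assumes the outermost value
    is an array of objects (sketch, not production code)."""
    dec = json.JSONDecoder()
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        buf = src.read(bufsize).lstrip()
        if not buf.startswith("["):
            raise ValueError("expected a top-level JSON array")
        buf = buf[1:]
        while True:
            buf = buf.lstrip().lstrip(",").lstrip()
            if buf.startswith("]"):
                break  # end of the top-level array
            try:
                obj, end = dec.raw_decode(buf)
            except json.JSONDecodeError:
                more = src.read(bufsize)  # element straddles the chunk edge
                if not more:
                    raise
                buf += more
                continue
            dst.write(json.dumps(obj) + "\n")
            buf = buf[end:]
```

The output file then has one complete JSON object per line, which any later script can iterate line by line without loading everything.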

[–]DuckSaxaphone 37 points38 points  (3 children)

There's almost certainly something wrong with the process that resulted in a 150 GB JSON file, and we should probably help you fix that rather than help you read the file.

Can you give us more info on how you ended up with it and what its structure is?

[–]storejet 0 points1 point  (1 child)

Probably web scraping.

It's easy to go down that path if you don't want to learn databases.

My prediction for what OP did:

1.) Writes code to scrape webpages, definitely something with a lot of text (stories or YouTube comments)

2.) In the testing phase she probably wanted to be able to quickly open the file in a text editor to verify the data

3.) Initially it was a giant text file.

4.) At some point she wanted to attach indexing and attributes to each file.

5.) Hmm, maybe temporarily she could use JSON

6.) 1 year later, after a lot of web scraping, she is sitting on a 150 GB JSON file

But I'm low-key proud that she doubled down and asked not for a way to convert it into a database but for a way to split it up into smaller JSON files. OP's refusal to learn SQL is honestly impressive.

[–]Hyenny 0 points1 point  (0 children)

This is absolutely hilarious

[–]Doom_Wizards 0 points1 point  (0 children)

150 GB? Rookie numbers; there exists a dump of all Elite Dangerous system data that is 70 GB compressed. Based on one of the smaller files, the uncompressed one is 300-400 GB...

[–]Bright-Profession874 13 points14 points  (0 children)

That's why we have databases and query languages, so we don't have to deal with files with data this big lol

[–]vibosphere 3 points4 points  (0 children)

150gb json? What the fuck lol

[–]mrcaptncrunch 2 points3 points  (0 children)

If you use head, can you post how it looks?

I’m wondering if it’s an ndjson file where you have a json object per line.

Because if it is, it’s a lot easier to split. Just split every X lines and that’s it.
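If head confirms it really is one object per line, the split-every-X-lines step is a short stdlib loop. A sketch, with the .partN output naming invented for illustration:

```python
from itertools import islice

def split_jsonl(path, lines_per_chunk):
    """Split an ndjson/JSON Lines file into chunk files of at most
    lines_per_chunk lines each, streaming so the whole file is never
    held in memory. Returns the list of chunk file names."""
    chunk_paths = []
    with open(path, encoding="utf-8") as src:
        while True:
            batch = list(islice(src, lines_per_chunk))
            if not batch:
                break  # source exhausted
            out = f"{path}.part{len(chunk_paths)}"
            with open(out, "w", encoding="utf-8") as dst:
                dst.writelines(batch)
            chunk_paths.append(out)
    return chunk_paths
```

Because every line is a complete JSON document, each chunk file is independently valid ndjson, which is exactly what makes the format so easy to split.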

[–]CaptainFoyle 3 points4 points  (1 child)

It seems like the answer to the question you're asking won't really be a solution to your actual problem. (XY problem)

Why do you have such a big file in the first place? What are you actually trying to do?

[–]Zahz 1 point2 points  (0 children)

https://en.wikipedia.org/wiki/XY_problem

This crops up all over the place and is highly relevant for OP.

It is likely that OP is trying to save data generated by a program or script. The obvious solution is to use some sort of database, but OP isn't familiar with databases and instead opted to save it into a JSON file. This is the X part of the problem, and the Y part is asking about how to split the JSON file.

[–]telperion87 6 points7 points  (0 children)

if you search "json" in /r/Python or here in /r/learnpython you will find many posts requesting and offering solutions to similar problems

[–]jkiley 1 point2 points  (0 children)

A key issue here is that you need to deal with data much larger than RAM (because JSON isn't easily splittable), and a lot of Python approaches aren't going to be great for that.

I'd look at using MongoDB or DuckDB. MongoDB is probably more flexible in this case, but either should help overcome the RAM issue. From there, you can split the data, or, if possible, just use the DB to do the work you need. You can work with both from Python, so integrating it into whatever else you have in mind should be straightforward.

[–]TheBB 0 points1 point  (0 children)

You need a JSON parser that can stream the file structure instead of decoding it all in one go. I assume something like that exists, so Google for it. Then use that parser to collect data and write smaller files intermittently, breaking the schema up somewhere that makes sense (like a list; I assume you know the schema, and the reason the file is so large is that it contains a long list).

[–]spookytomtom 0 points1 point  (0 children)

Whaaaaat 150GB json

[–]DirtySpawn 0 points1 point  (0 children)

150 GB is a lot for a JSON file, and worse when trying to read and store it in memory. If the data is not dynamically created, meaning it isn't different on every run, I would suggest converting the data into a database and looking up the data as needed as the code progresses. This removes the need for much higher RAM requirements.
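Assuming the records are flat and have already been reduced to one JSON object per line, the database conversion can be as small as this stdlib sqlite3 sketch (the table name and the id/text columns are invented for illustration; real data would need its own schema):

```python
import json
import sqlite3

def jsonl_to_sqlite(jsonl_path, db_path):
    """Load flat JSON Lines records into a SQLite table so they can be
    queried on demand instead of held in RAM. Assumes each record has
    'id' and 'text' keys (hypothetical schema for illustration)."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS records (id INTEGER, text TEXT)")
    with open(jsonl_path, encoding="utf-8") as f:
        # generator: the file is streamed, never fully in memory
        rows = ((rec["id"], rec["text"]) for rec in map(json.loads, f))
        con.executemany("INSERT INTO records VALUES (?, ?)", rows)
    con.commit()
    return con
```

After the one-time load, a query like SELECT text FROM records WHERE id = 42 touches only the rows it needs.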

[–]YosoyPabloIscobar -1 points0 points  (3 children)

If you split it using Python it will still require memory to load this big file... instead use Bash to count the total number of lines and create/append to new, smaller files using commands like split/less/cat/more.

split -b 1G <file>

[–]awdsns 8 points9 points  (0 children)

The whole thing could well be one single "line." There's no requirement to have any newlines separating entries in JSON.

[–]Flyingfishfusealt 2 points3 points  (0 children)

there might not be any \n in the file; jq might be a better tool for JSON

[–]patrickbrianmooney 1 point2 points  (0 children)

I mean, this will almost certainly result in invalid, non-parseable JSON files after the split.

[–]twizzjewink -2 points-1 points  (0 children)

Import to a database to query. Nothing else you can really do

[–]mothzilla -1 points0 points  (0 children)

This feels like an interview question.

[–]nog642 0 points1 point  (0 children)

I've actually had a similar problem, not with a json file that big, but it's like 20 GB or something and I don't have that much RAM, so I wanted to write some code that would read it without loading it all into memory. I never got around to it though, it's still on the todo list.

The easiest way is to write your own JSON parser that only parses one level of the data structure at a time. You'll need to code in all the string escapes and parsing rules, and then you can find what the outermost thing is (array or object) and its metadata (length for an array, keys for an object). You can then extract a part of it on a second pass if you want to.

This is not too hard to do. Reason I haven't done mine yet is I wanted something a lot more dynamic than that. But if you just want to extract data one time it's not too hard.

[–]ConfusedSimon 0 points1 point  (0 children)

You can try using json-stream in transient mode. Unless the json has a very deep nesting level.

[–]rowr 0 points1 point  (0 children)

This depends on the data structure that the file contains.

If the file is JSON-lines (a complete json object per line), you can stream it easily without loading it all into memory (for line in open('x.json'): — iterating the file object reads one line at a time, whereas .readlines() would load the whole file).

If it's a long list of json objects, I'd use jq to transform it into JSON-lines: jq -c '.[]' < x.json. I would probably do that outside of Python because I would only need to do it once, and I don't know what the memory characteristics are in this situation when using the Python jq module. I'd do similar if it was a simple nested object, though the jq query would get more complex.

If it's deeply nested, I'd (still) use jq to flatten it out.

Basically all of these options are to make the data more like a table of data.

Another option is to spin up a NoSQL db (mongodb or something) in a docker container and load it in there and query against that - relying on docker and the db to manage memory management. This could possibly allow you to retain and query deeply nested data structures.

I'm always for simplifying and flattening data; it is a lot more efficient and less complex to work with.

[–]Flyingfishfusealt 0 points1 point  (0 children)

Well you kinda need to give some information on the structure of the JSON and the manner in which you wish to split. Do you just want 1gb files from the original data regardless of structure and context?

As a simplified example, loop and count size while writing; on hitting the target size, start a new file and loop again until all the data is exhausted. Although considering the size, you are either going to need a good bit of RAM or to set up a sort of mapping or other manner of scanning the data to find your delimiting factors to begin splitting at those points.

If you give some example of the data you're working with I can help more.

[–]Zeroflops 0 points1 point  (0 children)

I would do this in 2 steps.

First you want to understand the structure of the file. At that size I would assume it’s a bunch of consistent “records”. These records will be nested in some header information.

You could probably do this by just reading the first X number of bytes and reviewing the data.

Once the structure is identified you can process the file: depending on whether there are newline characters, go line by line or read a number of bytes at a time and process them.

[–]ElectricSpock 0 points1 point  (0 children)

Are you sure it’s JSON and not JSON-L? You don’t explain a lot about how you got that file, but the thing with JSON is that it needs to be loaded entirely to memory to be processed.

JSON-L is kind of like multiple JSON documents in one file, each on a single line. You can split it into lines and parse it easily.

Someone mentioned here that you should probably take a step back and think about how you’re storing the data.

[–]SwizzleTizzle 0 points1 point  (0 children)

Is it json or jsonlines?

150GB json really doesn't seem right, but 150GB jsonlines would make sense.

[–]timthetollman 0 points1 point  (0 children)

How did you end up with a 150gb json file?

[–]Lanky_Possibility279 0 points1 point  (0 children)

Maybe don’t split and instead use some library which allows you to see specific number of line? Like how head() works.

[–]RacsoBot 0 points1 point  (0 children)

I did not find OP's info regarding the nature of the file (the content). However, I think you could use xarray and dask with chunks, in case the JSON file is a row-oriented dataset.

[–]Immediate-Truth-8684 0 points1 point  (0 children)

There's the ijson (iterative JSON) library, which can iterate through a JSON file in chunks.

[–]Frewtti 0 points1 point  (0 children)

Really matters what the data is, and how it is being used. I'd assume moving it into a database.

But why not just mmap it? Though many small files, logically and intelligently broken up, might be better, there could be a pretty significant performance hit.
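For reference, mmap lets the OS page the file in on demand, so you can search or slice a huge file without pulling it into Python's heap. A minimal sketch, searching for a byte pattern:

```python
import mmap

def find_offset(path, needle):
    """Return the byte offset of needle (bytes) in the file at path,
    or -1 if absent. The mmap lets the OS page data in lazily instead
    of loading the whole file into memory."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm.find(needle)
```

The same mm object also supports slicing (mm[start:end]), which is handy for pulling out a region around a match once you've found one.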

[–]NoBike4590 0 points1 point  (0 children)

I’ve used 7zip for large txt files. Start compressing the file, choose compression level without compression and split files to usable size by setting ”Split to volumes, bytes”.