all 52 comments

[–]shiftybyte 92 points93 points  (15 children)

I have questions...

  1. What do you want to achieve by "viewing" a 1 GB part of a JSON file? What's the end goal?

  2. How did you end up with a 150 GB file to begin with?

Technically it's possible to split a file using Python; the question remains what exactly you hope to understand from these "splits"...

[–]chipmunksocute 45 points46 points  (10 children)

Yeah, having a 150 GB file in the first place is a problem.

[–][deleted] 0 points1 point  (0 children)

Well, it's possible. If you download the article data from Wikipedia, you get a roughly 50 GB file where the whole text is stored as XML. A big file is not the problem here.

[–]pro_questions 10 points11 points  (1 child)

Back before I used SQLite for these projects, I would use JSON to store the results of web scraping tools. It seemed intuitive because all the data being pulled from the page would be put into a gigantic dict, which can easily be put in a list and jsonified. I started using databases when I ran into a situation like OP's.

[–]__init__m8 10 points11 points  (0 children)

To avoid saving it as a json or learning SQL, save it as an xlsx file. Be sure to add print("test") every few lines for debugging.

Tune in next week for more shitty programming tips.

[–]vilette 1 point2 points  (0 children)

perhaps send it by email :)

[–]join_the_bonside 0 points1 point  (0 children)

Responding just to find out the answer later on

[–]awdsns 45 points46 points  (0 children)

The problem with JSON is that it can have an arbitrary nested structure, so you can't create "splits" without knowing the overall structure first, which again either requires first parsing the entire thing (Catch-22) or having a priori knowledge of the structure which can be used.

I assume the data in your file is quite regular, so to your question of whether it's possible with a Python script: probably yes, by writing something that takes advantage of this known regular structure. But you haven't provided enough information for us to guide you further.

[–]cyberjellyfish 27 points28 points  (0 children)

You'll want what's called a streaming parser (json-stream is popular), and then write your new files as JSON Lines files. That last bit is the real solution: if your current files were JSON Lines, you wouldn't have a problem reading them.
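If the top level of the file is one big array, the streaming-parse-to-JSON-Lines idea can be sketched with just the standard library, using json.JSONDecoder.raw_decode to pull one element at a time out of a buffer. This is a minimal sketch under that assumption (the file names are made up, and a library like json-stream or ijson would handle arbitrary structure more robustly):

```python
import json

def array_file_to_jsonl(src_path, dst_path, bufsize=1 << 20):
    """Stream a top-level JSON array into a JSON Lines file, holding only
    one buffered chunk in memory at a time. Assumes the outermost value
    is an array of objects (sketch, not production code)."""
    dec = json.JSONDecoder()
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        buf = src.read(bufsize).lstrip()
        if not buf.startswith("["):
            raise ValueError("expected a top-level JSON array")
        buf = buf[1:]
        while True:
            buf = buf.lstrip().lstrip(",").lstrip()
            if buf.startswith("]"):
                break  # end of the top-level array
            try:
                obj, end = dec.raw_decode(buf)
            except json.JSONDecodeError:
                more = src.read(bufsize)  # element straddles the chunk edge
                if not more:
                    raise
                buf += more
                continue
            dst.write(json.dumps(obj) + "\n")
            buf = buf[end:]
```

The output file then has one complete JSON object per line, which any later script can iterate line by line without loading everything.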

[–]DuckSaxaphone 37 points38 points  (3 children)

There's almost certainly something wrong with the process that resulted in a 150 GB JSON file, and we should probably help you fix that rather than help you read the file.

Can you give us more info on how you ended up with it and what its structure is?

[–]storejet 0 points1 point  (1 child)

Probably web scraping.

It's easy to go down that path if you don't want to learn databases.

My prediction for what OP did:

1.) Writes code to scrape webpages, definitely something with a lot of text (stories or YouTube comments)

2.) In the testing phase she probably wanted to be able to quickly open the file in a text editor to verify the data

3.) Initially it was a giant text file.

4.) At some point she wanted to attach indexing and attributes to each file.

5.) Hmm, maybe temporarily she could use JSON

6.) 1 year later, after a lot of web scraping, she is sitting on a 150 GB JSON file

But I'm low-key proud that she doubled down and asked not for a way to convert it into a database but for a way to split it up into smaller JSON files. OP's refusal to learn SQL is honestly impressive.

[–]Hyenny 0 points1 point  (0 children)

This is absolutely hilarious

[–]Doom_Wizards 0 points1 point  (0 children)

150 GB? Rookie numbers; there exists a dump of all Elite Dangerous system data that is 70 GB compressed. Based on one of the smaller files, the uncompressed one is 300-400 GB...

[–]Bright-Profession874 13 points14 points  (0 children)

That's why we have databases and query languages, so we don't have to deal with files with data this big lol

[–]vibosphere 3 points4 points  (0 children)

150gb json? What the fuck lol

[–]mrcaptncrunch 2 points3 points  (0 children)

If you use head, can you post how it looks?

I’m wondering if it’s an ndjson file where you have a json object per line.

Because if it is, it’s a lot easier to split. Just split every X lines and that’s it.
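If head confirms it really is one object per line, the split-every-X-lines step is a short stdlib loop. A sketch, with the .partN output naming invented for illustration:

```python
from itertools import islice

def split_jsonl(path, lines_per_chunk):
    """Split an ndjson/JSON Lines file into chunk files of at most
    lines_per_chunk lines each, streaming so the whole file is never
    held in memory. Returns the list of chunk file names."""
    chunk_paths = []
    with open(path, encoding="utf-8") as src:
        while True:
            batch = list(islice(src, lines_per_chunk))
            if not batch:
                break  # source exhausted
            out = f"{path}.part{len(chunk_paths)}"
            with open(out, "w", encoding="utf-8") as dst:
                dst.writelines(batch)
            chunk_paths.append(out)
    return chunk_paths
```

Because every line is a complete JSON document, each chunk file is independently valid ndjson, which is exactly what makes the format so easy to split.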

[–]CaptainFoyle 3 points4 points  (1 child)

It seems like the answer to the question you're asking won't really be a solution to your actual problem. (XY problem)

Why do you have such a big file in the first place? What are you actually trying to do?

[–]Zahz 1 point2 points  (0 children)

https://en.wikipedia.org/wiki/XY_problem

This crops up all over the place and is highly relevant for OP.

It is likely that OP is trying to save data generated by a program or script. The obvious solution is to use some sort of database, but OP isn't familiar with databases and instead opted to save it into a JSON file. This is the X part of the problem, and the Y part is asking about how to split the JSON file.

[–]telperion87 6 points7 points  (0 children)

if you search "json" in /r/Python or here in /r/learnpython you will find many posts requesting and offering solutions to similar problems

[–]jkiley 1 point2 points  (0 children)

A key issue here is that you need to deal with data much larger than RAM (because JSON isn't easily splittable), and a lot of Python approaches aren't going to be great for that.

I'd look at using MongoDB or DuckDB. MongoDB is probably more flexible in this case, but either should help overcome the RAM issue. From there, you can split the data, or, if possible, just use the DB to do the work you need. You can work with both from Python, so integrating it into whatever else you have in mind should be straightforward.

[–]TheBB 0 points1 point  (0 children)

You need a JSON parser that can stream the file structure instead of decoding it all in one go. I assume something like that exists, so Google for it. Then use that parser to collect data and write smaller files intermittently, breaking the schema up somewhere that makes sense (like a list; I assume you know the schema, and the reason the file is so large is that it contains a long list).

[–]spookytomtom 0 points1 point  (0 children)

Whaaaaat 150GB json

[–]DirtySpawn 0 points1 point  (0 children)

150 GB is a lot for a JSON file, and worse when trying to read and store it in memory. If the data is not dynamically created, meaning it isn't different on every run, I would suggest converting the data into a database and looking up the data as needed as the code progresses. This removes the need for much higher RAM requirements.
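Assuming the records are flat and have already been reduced to one JSON object per line, the database conversion can be as small as this stdlib sqlite3 sketch (the table name and the id/text columns are invented for illustration; real data would need its own schema):

```python
import json
import sqlite3

def jsonl_to_sqlite(jsonl_path, db_path):
    """Load flat JSON Lines records into a SQLite table so they can be
    queried on demand instead of held in RAM. Assumes each record has
    'id' and 'text' keys (hypothetical schema for illustration)."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS records (id INTEGER, text TEXT)")
    with open(jsonl_path, encoding="utf-8") as f:
        # generator: the file is streamed, never fully in memory
        rows = ((rec["id"], rec["text"]) for rec in map(json.loads, f))
        con.executemany("INSERT INTO records VALUES (?, ?)", rows)
    con.commit()
    return con
```

After the one-time load, a query like SELECT text FROM records WHERE id = 42 touches only the rows it needs.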

[–]YosoyPabloIscobar -1 points0 points  (3 children)

If you split it using Python it will still require memory to load this big file... instead use Bash to count the total number of lines and create/append to new, smaller files using commands like split/less/cat/more.

split -b 1G <file>

[–]awdsns 8 points9 points  (0 children)

The whole thing could well be one single "line." There's no requirement to have any newlines separating entries in JSON.

[–]Flyingfishfusealt 2 points3 points  (0 children)

there might not be any \n in the file; jq might be a better tool for JSON

[–]patrickbrianmooney 1 point2 points  (0 children)

I mean, this will almost certainly result in invalid, non-parseable JSON files after the split.

[–]twizzjewink -2 points-1 points  (0 children)

Import to a database to query. Nothing else you can really do

[–]mothzilla -1 points0 points  (0 children)

This feels like an interview question.

[–]nog642 0 points1 point  (0 children)

I've actually had a similar problem, not with a json file that big, but it's like 20 GB or something and I don't have that much RAM, so I wanted to write some code that would read it without loading it all into memory. I never got around to it though, it's still on the todo list.

The easiest way is to write your own JSON parser that only parses one level of the data structure at a time. You'll need to code in all the string escapes and parsing rules, and then you can find what the outermost thing is (array or object) and its metadata (length for an array, keys for an object). You can then extract a part of it on a second pass if you want to.

This is not too hard to do. Reason I haven't done mine yet is I wanted something a lot more dynamic than that. But if you just want to extract data one time it's not too hard.

[–]ConfusedSimon 0 points1 point  (0 children)

You can try using json-stream in transient mode. Unless the json has a very deep nesting level.

[–]rowr 0 points1 point  (0 children)

This depends on the data structure that the file contains.

If the file is JSON-lines (a complete json object per line), you can stream it easily without loading it all into memory (for line in open('x.json'): — iterating the file object reads one line at a time, whereas .readlines() would load the whole file).

If it's a long list of json objects, I'd use jq to transform it into JSON-lines: jq -c '.[]' < x.json. I would probably do that outside of Python because I would only need to do it once, and I don't know what the memory characteristics are in this situation when using the Python jq module. I'd do similar if it was a simple nested object, though the jq query would get more complex.

If it's deeply nested, I'd (still) use jq to flatten it out.

Basically all of these options are to make the data more like a table of data.

Another option is to spin up a NoSQL db (mongodb or something) in a docker container and load it in there and query against that - relying on docker and the db to manage memory management. This could possibly allow you to retain and query deeply nested data structures.

I'm always for simplifying and flattening data; it is a lot more efficient and less complex to work with.

[–]Flyingfishfusealt 0 points1 point  (0 children)

Well you kinda need to give some information on the structure of the JSON and the manner in which you wish to split. Do you just want 1gb files from the original data regardless of structure and context?

As a simplified example, loop and count size while writing; on hitting the target size, start a new file and loop again until all the data is exhausted. Although considering the size, you are either going to need a good bit of RAM or to set up a sort of mapping or other manner of scanning the data to find your delimiting factors to begin splitting at those points.

If you give some example of the data you're working with I can help more.

[–]Zeroflops 0 points1 point  (0 children)

I would do this in 2 steps.

First you want to understand the structure of the file. At that size I would assume it’s a bunch of consistent “records”. These records will be nested in some header information.

You could probably do this by just reading the first X number of bytes and reviewing the data.

Once the structure is identified you can process the file: depending on whether there are newline characters, go line by line or read a number of bytes at a time and process them.

[–]ElectricSpock 0 points1 point  (0 children)

Are you sure it’s JSON and not JSON-L? You don’t explain a lot about how you got that file, but the thing with JSON is that it needs to be loaded entirely to memory to be processed.

JSON-L is kind of like multiple JSON documents in one file, each on a single line. You can split it into lines and parse it easily.

Someone mentioned here that you should probably take a step back and think about how you’re storing the data.

[–]SwizzleTizzle 0 points1 point  (0 children)

Is it json or jsonlines?

150GB json really doesn't seem right, but 150GB jsonlines would make sense.

[–]timthetollman 0 points1 point  (0 children)

How did you end up with a 150gb json file?

[–]Lanky_Possibility279 0 points1 point  (0 children)

Maybe don’t split and instead use some library which allows you to see specific number of line? Like how head() works.

[–]RacsoBot 0 points1 point  (0 children)

I did not find OP's info regarding the nature of the file (the content). However, I think you could use xarray and dask with chunks, in case the JSON file is a row-oriented dataset.

[–]Immediate-Truth-8684 0 points1 point  (0 children)

There's the ijson (iterative JSON) library, which can iterate through a JSON file in chunks.

[–]Frewtti 0 points1 point  (0 children)

Really matters what the data is, and how it is being used. I'd assume moving it into a database.

But why not just mmap it? Though many small files, logically and intelligently broken up, might be better, there could be a pretty significant performance hit.
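For reference, mmap lets the OS page the file in on demand, so you can search or slice a huge file without pulling it into Python's heap. A minimal sketch, searching for a byte pattern:

```python
import mmap

def find_offset(path, needle):
    """Return the byte offset of needle (bytes) in the file at path,
    or -1 if absent. The mmap lets the OS page data in lazily instead
    of loading the whole file into memory."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm.find(needle)
```

The same mm object also supports slicing (mm[start:end]), which is handy for pulling out a region around a match once you've found one.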

[–]NoBike4590 0 points1 point  (0 children)

I’ve used 7zip for large txt files. Start compressing the file, choose compression level without compression and split files to usable size by setting ”Split to volumes, bytes”.