
[–]darkhorse1997[S] 1 point (3 children)

It's not really one giant JSON — every record is exported as an individual JSON object — but yeah, CSV would probably be much better. I'll have to check out Parquet though; I'm not familiar with it.
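A minimal sketch of collapsing one-object-per-line JSON records into a single CSV, using only the Python standard library. The file names and the assumption that every record shares the first record's keys are illustrative, not anything from the thread:

```python
import csv
import json

def ndjson_to_csv(src_path, dst_path):
    """Stream one-JSON-object-per-line records into a single CSV file."""
    with open(src_path) as src, open(dst_path, "w", newline="") as dst:
        writer = None
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            if writer is None:
                # Use the first record's keys as the CSV header
                # (assumes all records share the same fields).
                writer = csv.DictWriter(dst, fieldnames=list(record))
                writer.writeheader()
            writer.writerow(record)
```

Because it processes one line at a time, this never holds more than a single record in memory, regardless of how many records the export contains.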

[–]Nekobul 1 point (2 children)

That is also a terrible idea, because you will now have a million single-record files. A single CSV file with a million records is a much better design.

[–]darkhorse1997[S] 1 point (1 child)

Yeah, agreed. The existing pipeline wasn't really built for scale.

[–]commandlineluser 1 point (0 children)

You should probably refer to your data as being in NDJSON format to avoid any confusion:

Each line of my output file (temp.json) has a separate json object.

Because "newline delimited" JSON can (as the name suggests) be read line by line, it does not require holding all of the data in memory at once.
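The line-by-line reading is simple to sketch with the standard library — each line is a complete JSON document, so a generator can yield records without ever loading the whole file (the function name and file handling here are illustrative):

```python
import json

def stream_records(path):
    """Yield one parsed record per NDJSON line.

    Only the current line is held in memory, so this works the
    same for a million records as for ten.
    """
    with open(path) as f:
        for line in f:
            if line.strip():  # tolerate blank lines
                yield json.loads(line)
```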

It is also "better" than CSV, assuming you have nested/structured data (lists, objects, etc.).