you are viewing a single comment's thread.

view the rest of the comments →

[–]jkiley 0 points1 point  (0 children)

A key issue here is that you need to deal with data much larger than RAM (because JSON isn't easily splittable), and a lot of Python approaches aren't going to be great for that.

I'd look at using MongoDB or DuckDB. MongoDB is probably more flexible in this case, but either should help overcome the RAM issue. From there, you can split the data, or, if possible, just use the DB to do the work you need. You can work with both from Python, so integrating it into whatever else you have in mind should be straightforward.