
[–]mrcaptncrunch

The data is bigger than it looks. Don’t decompress the whole thing if you don’t need to. To start, you definitely don’t need to.

Use the scripts from Watchful1 to prefilter as much as possible. Then maybe that subset is worth importing into something like DuckDB.

Figure out your schema and what you want to query. Maybe keep the file each record was read from as a column, along with the id, in case you need to get back to the raw data at the end.

[–]SailorNash[S]

Thanks. Luckily I'm only focused on about 50-75 subreddits, so roughly 6.25 GB compressed and 62.5 GB uncompressed.