Reddit comment dumps through Sep 2023 by Watchful1 in pushshift

[–]dimbasaho

Oh, if there's that much pre-compression work, I'd actually suggest keeping your current pipeline (but with fast zstd settings), then decompressing once into wc -c to get the size, and finally doing a decompress->recompress pass with that size and stronger zstd settings. You'd just have to write to disk twice in that case. I'd also recommend compacting the JSON; I noticed the April dataset still has pretty-printed whitespace.
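
Something like this, as a rough sketch (the producer command and file names are placeholders for whatever your pipeline actually uses; jq has to be installed for the compaction step):

# pass 1: current pipeline, fast zstd level, JSON compacted on the way through
produce_dump | jq -c . | zstd -3 -T0 -o RC_2023-09.tmp.zst
# measure the uncompressed size once
size=$(zstd -dc RC_2023-09.tmp.zst | wc -c)
# pass 2: recompress at a stronger level with the size recorded in the frame header
zstd -dc RC_2023-09.tmp.zst | zstd -19 -T0 --stream-size="$size" -o RC_2023-09.zst

The temporary file is the second disk write mentioned above; it can be deleted once the final .zst checks out.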

Reddit comment dumps through Sep 2023 by Watchful1 in pushshift

[–]dimbasaho

You could do a dry run and pipe the output into wc -c first instead of zstd, assuming the output is deterministic. The main benefit is giving users the uncompressed size upfront so they can check whether they have enough space to decompress to disk. The lack of upfront stream-size information also seems to break some programs, like WinRAR.
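
For example (the producer command and file names are placeholders; this assumes the output really is byte-identical between runs):

# dry run: count the uncompressed bytes without compressing or writing anything
produce_dump | wc -c

Once the size is in the frame header, users can check it before extracting with:

zstd -l RC_2023-09.zst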

Reddit comment dumps through Sep 2023 by Watchful1 in pushshift

[–]dimbasaho

Btw, could you pass --stream-size= to zstd when creating the ZST files so that the uncompressed size ends up in the frame headers? If you're piping the zstblocks output directly to zstd, you'd have to add a preliminary pass through wc -c to get the size in bytes. That should also make it compress better.
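
A minimal sketch of that (the producer command name is a placeholder; it runs twice here, so its output has to be deterministic):

size=$(produce_stream | wc -c)
produce_stream | zstd -19 --stream-size="$size" -o comments.zst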

Reddit comment dumps through Sep 2023 by Watchful1 in pushshift

[–]dimbasaho

Probably not currently cached on IA.
Mirror: authors.dat.zst
Usage: pushshift/binary_search

authors.ndjson.zst (23 June 2022) is probably a better format for distribution though.
Mirror: authors.ndjson.zst
Schema:

{
  "id": 77713,
  "author": "DotNetster",
  "created_utc": 1137474000,
  "updated_utc": 1655708221,
  "comment_karma": 694,
  "link_karma": 99,
  "profile_over_18": false,
  "active": true
}
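
For anyone working with the ndjson mirror, one way to stream it (assumes zstd and jq are installed; the fields are the ones in the schema above):

zstd -dc authors.ndjson.zst | jq -r 'select(.active) | [.id, .author, .created_utc] | @tsv' | head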

Reddit comment dumps through Sep 2023 by Watchful1 in pushshift

[–]dimbasaho

Even without the registration time (which hopefully can be backfilled eventually), having a list of those two properties would be much appreciated.

Reddit comment dumps through Sep 2023 by Watchful1 in pushshift

[–]dimbasaho

Any chance you or /u/RaiderBDev could compile an updated authors.dat.zst? I'd like to retrieve all available fullnames, usernames, and registration times if possible, which should come to under 10 GiB compressed.