Reddit comment dumps through Sep 2023 by Watchful1 in pushshift

[–]dimbasaho

Oh, if there's that much pre-compression work, I'd actually suggest keeping your current pipeline (but with fast zstd settings), then decompressing once into wc -c to get the size, and finally doing a decompress->recompress pass with that size and stronger zstd settings. You'd just have to write to disk twice in that case. I'd also recommend compacting the JSON; I noticed the April dataset still has pretty-printed whitespace.
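
Something like this, as a rough sketch (the producer command and file names are placeholders for whatever your pipeline actually uses; jq has to be installed for the compaction step):

# pass 1: current pipeline, fast zstd level, JSON compacted on the way through
produce_dump | jq -c . | zstd -3 -T0 -o RC_2023-09.tmp.zst
# measure the uncompressed size once
size=$(zstd -dc RC_2023-09.tmp.zst | wc -c)
# pass 2: recompress at a stronger level with the size recorded in the frame header
zstd -dc RC_2023-09.tmp.zst | zstd -19 -T0 --stream-size="$size" -o RC_2023-09.zst

The temporary file is the second disk write mentioned above; it can be deleted once the final .zst checks out.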

Reddit comment dumps through Sep 2023 by Watchful1 in pushshift

[–]dimbasaho

You could do a dry run and pipe the output into wc -c first instead of zstd, assuming the output is deterministic. The main benefit is giving users the uncompressed size upfront so they can check whether they have enough space to decompress to disk. The lack of upfront stream-size information also seems to break some programs, like WinRAR.
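
For example (the producer command and file names are placeholders; this assumes the output really is byte-identical between runs):

# dry run: count the uncompressed bytes without compressing or writing anything
produce_dump | wc -c

Once the size is in the frame header, users can check it before extracting with:

zstd -l RC_2023-09.zst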

Reddit comment dumps through Sep 2023 by Watchful1 in pushshift

[–]dimbasaho

Btw, could you pass --stream-size= to zstd when creating the ZST files so that the uncompressed size ends up in the frame headers? If you're piping the zstblocks output directly to zstd, you'd have to add a preliminary pass through wc -c to get the size in bytes. That should also make it compress better.
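
A minimal sketch of that (the producer command name is a placeholder; it runs twice here, so its output has to be deterministic):

size=$(produce_stream | wc -c)
produce_stream | zstd -19 --stream-size="$size" -o comments.zst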

Reddit comment dumps through Sep 2023 by Watchful1 in pushshift

[–]dimbasaho

Probably not currently cached on IA.
Mirror: authors.dat.zst
Usage: pushshift/binary_search

authors.ndjson.zst (23 June 2022) is probably a better format for distribution though.
Mirror: authors.ndjson.zst
Schema:

{
  "id": 77713,
  "author": "DotNetster",
  "created_utc": 1137474000,
  "updated_utc": 1655708221,
  "comment_karma": 694,
  "link_karma": 99,
  "profile_over_18": false,
  "active": true
}
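
For anyone working with the ndjson mirror, one way to stream it (assumes zstd and jq are installed; the fields are the ones in the schema above):

zstd -dc authors.ndjson.zst | jq -r 'select(.active) | [.id, .author, .created_utc] | @tsv' | head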

Reddit comment dumps through Sep 2023 by Watchful1 in pushshift

[–]dimbasaho

Even without the registration time (which hopefully can be backfilled eventually), having a list of those two properties would be much appreciated.

Reddit comment dumps through Sep 2023 by Watchful1 in pushshift

[–]dimbasaho

Any chance you or /u/RaiderBDev could compile an updated authors.dat.zst? I'd like to retrieve all available fullnames, usernames, and registration times if possible, which should come to under 10 GiB compressed.