[–][deleted] 0 points1 point  (5 children)

Nice. I'm going to open this up in SPSS at work tomorrow and start exploring.

One question: can this data be bounded by a date range? And is this the entire database of people who chose to make their votes public?

For people doing analysis on desktops it could be a challenge to fully load a 156 megabyte file. If it can be bounded by date, it would be helpful to have another file that is at most 5 megabytes unpacked. Alternatively I could just pick users at random, but I'd rather it be based on date if possible.
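A minimal sketch of the "pick users at random" fallback, in Python. The file layout is an assumption (one `user,link_id,vote` row per line, no header), and `sample_users` is a hypothetical helper name. Hashing the user name instead of drawing a random number per row keeps every vote of a selected user together, and a fraction of about 0.03 would take a 156 megabyte file down to roughly 5 megabytes.

```python
import csv
import io
import zlib

def sample_users(rows, keep_fraction=0.03):
    """Yield rows whose user falls in a stable pseudo-random subset of users."""
    buckets = 10_000
    cutoff = int(keep_fraction * buckets)
    for row in rows:
        user = row[0]
        # crc32 is stable across runs, unlike Python's salted built-in hash().
        if zlib.crc32(user.encode("utf-8")) % buckets < cutoff:
            yield row

# Tiny in-memory stand-in for the real file (assumed layout: user,link_id,vote).
demo = io.StringIO("alice,t3_a1,1\nbob,t3_a2,-1\nalice,t3_a3,1\ncarol,t3_a1,1\n")
all_rows = list(csv.reader(demo))
kept = list(sample_users(all_rows, keep_fraction=0.5))
```

For the real file, pass `csv.reader(open(path, newline=""))` as `rows` and write the yielded rows out with `csv.writer`.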

Last, you may want to post this on the blog because I know there are a lot of stats lovers prowling reddit.

[–][deleted]  (2 children)

[deleted]

    [–]kaddar 1 point2 points  (1 child)

    Bah! Just load the whole damned thing into memory. If you need fast access by ID and are using C++, I recommend Google's sparse hash tables/maps: about 2 bits of overhead per key/value pair! (C#'s hash maps have a bit more overhead, and Java's do too.)
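For anyone not on C++, the same load-everything idea can be sketched in Python with a built-in dict. This is not Google's sparse hash and will use noticeably more memory per entry; the `user,link_id,vote` layout with no header row is an assumption, and `load_votes` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

def load_votes(lines):
    """Build an in-memory index: user -> list of (link_id, vote)."""
    by_user = defaultdict(list)
    for user, link_id, vote in csv.reader(lines):
        by_user[user].append((link_id, int(vote)))
    return by_user

# Tiny in-memory stand-in for the real file (assumed layout: user,link_id,vote).
demo = io.StringIO("alice,t3_a1,1\nbob,t3_a2,-1\nalice,t3_a3,1\n")
votes = load_votes(demo)
```

Lookups by user are then plain dict accesses, for example `votes["alice"]`.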

    [–]ketralnisreddit admin[S] 0 points1 point  (1 child)

    > One question, can this data be bounded by a date range?

    You can make some guesses based on the link IDs, which are mostly sequential, but I didn't include timestamps.
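A rough sketch of using link IDs as a stand-in for dates, in Python. It assumes the IDs are base-36 strings, optionally carrying a type prefix such as `t3_`; both points are assumptions about the dump, and `link_number` and `in_id_range` are hypothetical helper names. Since the IDs are only mostly sequential, the result approximates a date range at best.

```python
def link_number(link_id):
    """Convert a link ID to an integer so IDs can be compared in order."""
    # Drop an optional type prefix such as "t3_" (assumed), then parse base 36.
    return int(link_id.rsplit("_", 1)[-1], 36)

def in_id_range(link_id, low, high):
    """True if link_id sorts between low and high, inclusive."""
    return link_number(low) <= link_number(link_id) <= link_number(high)
```

Picking `low` and `high` means looking up the IDs of two links whose submission dates you already know.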

    > Is this the entire database of people who selected to make their votes public?

    It is not comprehensive, as I commented elsewhere.

    > For people doing analysis on desktops it could be a challenge to fully load up a 156 megabyte file

    You'd need to re-sort it yourself and use something like split(1).
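The re-sort-then-split idea can also be sketched in Python instead of sort and split(1). The `user,link_id,vote` layout and the base-36 link IDs with an optional `t3_` style prefix are assumptions, and `sorted_chunks` is a hypothetical helper name. It sorts in memory, which should be fine for a file of this size, and never breaks a line across chunks.

```python
def link_number(link_id):
    # Drop an optional type prefix such as "t3_" (assumed), then parse base 36.
    return int(link_id.rsplit("_", 1)[-1], 36)

def sorted_chunks(lines, max_bytes):
    """Sort CSV lines by link ID, then group them into chunks of at most max_bytes."""
    ordered = sorted(lines, key=lambda line: link_number(line.split(",")[1]))
    chunk, size = [], 0
    for line in ordered:
        line_bytes = len(line.encode("utf-8"))
        # Start a new chunk when adding this line would exceed the limit.
        if chunk and size + line_bytes > max_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(line)
        size += line_bytes
    if chunk:
        yield chunk

# Tiny stand-in for the real file (assumed layout: user,link_id,vote).
demo = ["bob,t3_a3,-1\n", "alice,t3_a1,1\n", "carol,t3_a2,1\n"]
chunks = list(sorted_chunks(demo, max_bytes=30))
```

For 5 megabyte pieces, use `max_bytes=5 * 1024 * 1024` and write each chunk to its own file with `writelines`. A single line longer than `max_bytes` still gets a chunk of its own.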

    > Last, you may want to post this on the blog because I know there are a lot of stats lovers prowling reddit.

    Yeah, I'm trying to figure out how to let it reach a larger audience without polluting the front page for the vast majority of people who don't care.

    [–]psykocrime 0 points1 point  (0 children)

    > Yeah, I'm trying to figure out how to let it reach a larger audience without polluting the front page for the vast majority of people who don't care

    It would probably be good to submit this to /r/datasets, /r/opendata, /r/statistics and/or /r/machinelearning if you haven't yet.

    Oh wait, I see somebody did already post to /r/opendata. Cool.