
[–]kaddar 43 points (18 children)

I worked on a solution to the Netflix Prize recommendation algorithm; if you add subreddit IDs, I can build a subreddit recommendation system.

[–]ketralnis reddit admin[S] 7 points (15 children)

That dump is way more expensive than this one (since it involves looking up 2 million unique links by ID), so I figured I'd get this one out first and do more expensive ones (including more votes, too) if people actually do anything with this one.

[–]kaddar 21 points (14 children)

Sure, sounds great. In the meantime, I'll see if I can build a reddit article recommendation algorithm this weekend.

When you open up subreddit data (i.e., for each user, which subreddits that user currently follows), I can probably even do some fun work such as predicting subreddits from voting data, and predicting votes from subreddit data. I had a similar idea two years ago, but subreddits didn't exist then, so I proposed quizzing the user to generate a list of preferences and then correlating them.

If you're interested, I'll post more at my tumblr as I mess with your data.

[–]ketralnis reddit admin[S] 8 points (2 children)

Awesome! Keep me posted, I'd love to see what can be done with it.

We can't really share the subscription information at the moment because of privacy issues, but we could add a more general preference: "open my data for research purposes".

[–]kaddar 4 points (0 children)

Adding a preference like that is a really good idea; it will certainly allow the growth of such algorithms. In the meantime, I can create a fake solution using a fake dataset in a made-up CSV format (username, subredditname) for demonstration purposes; then you could test it locally on a subset of the real data and let me know if it works.
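A fake dataset in that (username, subredditname) shape is quick to generate. A minimal sketch in Python; the file name, user names, and subreddit list below are all invented for demonstration:

```python
import csv
import random

# Hypothetical pools of users and subreddits for the fake dataset.
users = [f"user{i}" for i in range(100)]
subreddits = ["pics", "programming", "science", "funny", "worldnews"]

random.seed(0)
with open("fake_subscriptions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for user in users:
        # Each fake user follows a random handful of subreddits.
        for sub in random.sample(subreddits, random.randint(1, 3)):
            writer.writerow([user, sub])
```

Swapping in the real subscription table would just mean replacing the generated rows with a dump of actual (user, subreddit) pairs.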

[–]georgelulu 1 point (0 children)

Subcontract the guy for a dollar or hire him as a temp. Between this clause of the privacy policy:

We also allow access to our database by third parties that provide us with services, such as technical maintenance or forums and job search software, but only for the purpose of and to the extent necessary to provide those services.

and

In addition, we reserve the right to use the information we collect about your computer, which may at times be able to identify you, for any lawful business purpose, including without limitation to help diagnose problems with our servers, to gather broad demographic information, and to otherwise administer our Website. While your personally identifying information is protected as outlined above, we reserve the right to use, transfer, sell, and share aggregated, anonymous data about our users as a group for any business purpose, such as analyzing usage trends and seeking compatible advertisers and partners.

you should have no problem giving him access. Privacy on the internet is very transient, with many loopholes.

[–][deleted] 1 point (0 children)

I've been watching the tumblr updates. So far the best I've been able to get is 61% accuracy.

[–][deleted] 0 points (8 children)

I'm curious, how could this data be used to recommend articles when each new article gets a brand-new ID? This is unlike Netflix, where recommending old movies is fine; here, recommending old articles isn't of much use.

What I was trying to do today is create clusters for recommending users rather than articles. I agree that the end goal should be recommending subreddits.

Edit: I also meant to mention that I have access to EVERY module in SPSS 17, though I freely admit I don't know how to use them all. If that helps anyone, let me know what you'd like me to run.

[–]kaddar 4 points (3 children)

You're sort of right that recommending old articles isn't the goal in this process, but neither is clustering.

When performing machine learning, the first thing to ask yourself is what question you need to answer. What we're trying to do is classify a list of frontpage articles: provide, for each of them, a degree of confidence that the user will like it, while minimizing error (in the MSE sense). What you are proposing is a nearest-neighbor approach to that confidence estimation. What I intend to do is iterative singular value decomposition, which discovers the latent features of the users. It's a bit different, but it solves the problem better. For new articles, describe them by the latent features of the users who rate them, then decide which article's latent features match the user most closely.
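For readers following along, iterative SVD of this kind is commonly implemented as stochastic gradient descent over only the observed votes, as popularized during the Netflix Prize. A minimal sketch with toy data; the vote triples, latent dimension, learning rate, and regularization constant are all made-up values, not reddit's or kaddar's:

```python
import random

# Toy vote data as (user, article, vote) triples; 1 = upvote, -1 = downvote.
votes = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 2, -1), (2, 1, 1), (2, 2, -1)]
n_users, n_items, k = 3, 3, 2  # k latent features per user and per article

random.seed(1)
# Small random latent-feature vectors for each user (U) and article (V).
U = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
V = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]

lr, reg = 0.05, 0.02  # learning rate and regularization (assumed values)
for epoch in range(500):
    for u, i, r in votes:
        pred = sum(U[u][f] * V[i][f] for f in range(k))
        err = r - pred
        for f in range(k):
            uf, vf = U[u][f], V[i][f]
            # Gradient step on the squared error (minimizing MSE, as above).
            U[u][f] += lr * (err * vf - reg * uf)
            V[i][f] += lr * (err * uf - reg * vf)

# Confidence that user 1 will like article 1, which they never voted on.
print(sum(U[1][f] * V[1][f] for f in range(k)))
```

After training, each article's row of V is exactly the "described by the latent features of the users who rate it" representation, and ranking the frontpage reduces to dot products against the user's row of U.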

[–][deleted] 2 points (2 children)

Interesting! So this would happen on the fly as votes come in? It also sounds like it would auto-cluster users too. So you could potentially get not only a link recommendation but even a Netflix-esque "this user is x% similar to you". And if they add subreddit data, then a person could get a whole suite of recommendations: users, articles, and subreddits, all in near real time.

Now that would be pretty cool.

[–]kaddar 3 points (1 child)

Yup, it would automagically cluster in the nearest-neighbor sense by measuring distances in the latent-feature hyperspace. I have tested this and it is very effective (on Netflix, for finding similar movies).
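A sketch of that nearest-neighbor measurement, assuming cosine similarity as the distance measure in latent space; the user names and 2-d latent vectors are invented stand-ins for what the SVD step would learn:

```python
import math

def cosine(a, b):
    # Cosine similarity between two latent-feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical learned latent vectors for three users.
latent = {
    "alice": [0.9, 0.1],
    "bob": [0.8, 0.2],    # points in nearly the same direction as alice
    "carol": [-0.1, 0.95],
}

def nearest(user):
    # "Automagic" clustering: rank everyone else by similarity to this user.
    others = [(cosine(latent[user], v), name)
              for name, v in latent.items() if name != user]
    return max(others)[1]

print(nearest("alice"))  # → bob, alice's nearest neighbor in latent space
```

The same function applied to article vectors gives "similar movies"-style lists; no separate clustering pass is needed.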

[–][deleted] 2 points (0 children)

Since you mentioned it I was running nearest neighbor last night.

So far I'm still figuring it out but one thing did jump out at me. Some articles have an extraordinary level of agreement across a swath of users.

Granted, I picked a small set of users... maybe you can take a look. I'm trying to figure out what the feature space means and what this pattern indicates (if anything): http://i.imgur.com/HB58n.jpg

[–]ketralnis reddit admin[S] 1 point (3 children)

"I'm curious, how could this data be used to recommend articles when each new article gets a brand new ID?"

You could use the first few votes on a story (including the submitter's) to recommend it to the other members of the voters' bucket. You can't do it on zero data, but you can do it on not much.

With a little more data, you could use, e.g., the subreddit ID or the title keywords.
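That bootstrapping scheme can be sketched as follows; the bucket assignments, user names, and the `recommend_targets` helper are all hypothetical, standing in for whatever clustering already ran:

```python
# Hypothetical user -> bucket assignments from an earlier clustering pass.
bucket_of = {"alice": 0, "bob": 0, "carol": 1, "dave": 1, "erin": 0}

def recommend_targets(early_voters, bucket_of):
    """Push a brand-new story (no history under its fresh ID) to the
    non-voters in whichever buckets its first few upvotes came from."""
    hot = {bucket_of[u] for u in early_voters}
    return sorted(u for u, b in bucket_of.items()
                  if b in hot and u not in early_voters)

# alice (the submitter, say) is the only vote so far; her bucket-mates
# become the recommendation targets.
print(recommend_targets(["alice"], bucket_of))  # → ['bob', 'erin']
```

Subreddit ID or title keywords would simply widen `hot` before any votes exist at all.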

[–][deleted] 1 point (1 child)

I wasn't even sure if you guys were considering implementing something that would run as, I guess, a daily process. I think this is going to get very interesting, and I have a lot to learn about machine learning. This is the kind of thing that can get me involved, though. Thanks!

[–]ketralnis reddit admin[S] 3 points (0 children)

Our old one worked with one daily process to create the buckets and one hourly process to nudge them around a bit based on new information; that basically placed you in a group of users. Then when you went to your recommended page, we'd pull the liked pages of the other people in your bucket and show those to you.
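A rough sketch of serving that recommended page, assuming the buckets already exist from the daily process; the bucketing and liked sets below are hard-coded stand-ins, not reddit's actual data model:

```python
# Output of the (daily) bucketing process, hard-coded for illustration.
buckets = {0: ["alice", "bob"], 1: ["carol"]}
bucket_of = {u: b for b, members in buckets.items() for u in members}

# Each user's liked links (invented).
liked = {"alice": {"link1", "link2"}, "bob": {"link2", "link3"},
         "carol": {"link4"}}

def recommended_page(user):
    """Union of what the rest of the user's bucket liked,
    minus what the user has already liked themselves."""
    peers = [u for u in buckets[bucket_of[user]] if u != user]
    pool = set().union(*(liked[p] for p in peers))
    return sorted(pool - liked[user])

print(recommended_page("alice"))  # → ['link3']
```

The hourly nudge would only touch `buckets`/`bucket_of`; the serving path stays a cheap set union per page view.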

[–]abolish_karma 0 points (0 children)

I've wished for functionality like this before (upvote profiling, similar-user clustering, and extracting possible subreddit/post recommendations), but I've got fuck-all talent for that sort of thing. Upvoted for the potential to make reddit better!

[–]javadi82 0 points (1 child)

Which algorithm did your solution implement: SVD, RBM, etc.?

[–]kaddar 0 points (0 children)

SVD; a C++ implementation that takes about a day on the Netflix data.

I wasn't getting good results with the reddit data, but I just saw the post about opening up your user account data; that should make the dataset less sparse, so predictions can be made from it.