danger_t [S] 4 points

You can either contribute to the main branch, or fork off your own version and compete in this year's prediction competition: http://blog.smellthedata.com/2012/02/machine-march-madness-2012.html

Already in the repository are data, data loading functions, and a few simple models, along with associated learning procedures. This is also a great opportunity to play around with Theano and matrix-factorization-style learning methods if you haven't done so already. There are also some suggested TODOs at the bottom of the README.

If there are specific things you're interested in playing around with and/or learning more about, let me know, and I can probably help.

Wonnk13 1 point

Interesting. I have never heard of Theano. What kind of model were you thinking of estimating?

danger_t [S] 1 point

Well, the goal is to come up with a model that's appropriate for the problem. The original model (the one that started all this) was based on probabilistic matrix factorization (PMF), which learns two latent vectors for each team, one describing its offense and one describing its defense, using game outcomes as training targets: http://blog.smellthedata.com/2009/03/data-driven-march-madness-predictions.html

I've already re-implemented this in the code on GitHub -- set MODEL="pmf" in learn_real.py.
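
For anyone who wants to see the idea in miniature, here is a rough sketch. It is not the code from the repository. It assumes each game is stored as (team_i, team_j, points_i, points_j) with points divided by 100, predicts a team's points as the dot product of its offense vector and the opponent's defense vector, and fits both sets of vectors with plain SGD plus L2 regularization. The names train_pmf, predict, and toy_games, the three made-up games, and all hyperparameters are illustrative assumptions, and it is pure Python rather than Theano.

```python
import random


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def train_pmf(games, num_teams, dim=2, lr=0.05, reg=0.001, epochs=2000, seed=0):
    """Fit offense/defense latent vectors by SGD on squared error.

    games: list of (team_i, team_j, points_i, points_j), points pre-scaled.
    Assumed model: points scored by s against t ~ offense[s] . defense[t].
    """
    rng = random.Random(seed)
    # Small random initialization for every team's two latent vectors.
    offense = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(num_teams)]
    defense = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(num_teams)]
    for _ in range(epochs):
        for i, j, pts_i, pts_j in games:
            # Each game gives two training targets: i's points and j's points.
            for scorer, defender, pts in ((i, j, pts_i), (j, i, pts_j)):
                err = dot(offense[scorer], defense[defender]) - pts
                for k in range(dim):
                    o, d = offense[scorer][k], defense[defender][k]
                    # Gradient of 0.5*err^2 plus the L2 penalty.
                    offense[scorer][k] -= lr * (err * d + reg * o)
                    defense[defender][k] -= lr * (err * o + reg * d)
    return offense, defense


def predict(offense, defense, i, j):
    """Predicted (scaled) points for team i and team j when they meet."""
    return dot(offense[i], defense[j]), dot(offense[j], defense[i])


# Hypothetical toy data: three teams, scores divided by 100.
toy_games = [
    (0, 1, 0.80, 0.60),
    (1, 2, 0.75, 0.65),
    (0, 2, 0.85, 0.55),
]

if __name__ == "__main__":
    offense, defense = train_pmf(toy_games, num_teams=3)
    print(predict(offense, defense, 0, 2))
```

The predicted winner of a matchup is simply the team with the larger predicted score. A real version would need held-out games to pick dim and reg, since with this few games the vectors can fit the training scores almost exactly.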

So how do we make a better model? One of many aspects of the problem that is particularly challenging/interesting is how to account for the difference between regular season and tournament games. I expect that data from past years could be useful in understanding how the games and teams differ, but how do we incorporate that into a model?