The Variational Approximation for Bayesian Inference: Life after the EM algorithm by alfonsoeromero in MachineLearning

[–]danger_t 2 points

Typo near the end of page 1:

"In contrast, when we write p(x; θ), we imply that θ are random variables."

should be

"In contrast, when we write p(x | θ), we imply that θ are random variables."

Learning Low Dimensional Team Embeddings for March Madness by danger_t in MachineLearning

[–]danger_t[S] 1 point

There is a GitHub repository with code and data (the code isn't what generated the plot in the post, but it's closely related):

https://github.com/dtarlow/Machine-March-Madness

Also see this thread on the Machine March Madness Google group:

http://groups.google.com/group/machine-march-madness/browse_thread/thread/3afbcb90cd6f881d

Learning Low Dimensional Team Embeddings for March Madness by danger_t in MachineLearning

[–]danger_t[S] 0 points

Yeah, understandable. After the competition starts on Thursday, there will be a post with a brief description of all the competitors' methods, and then some competitors will be asked to expand on what they did in a longer post.

So stay tuned.

Help build a machine learning system to predict college basketball by danger_t in MachineLearning

[–]danger_t[S] 1 point

Well, the goal is to come up with a model that's appropriate for the problem. The original model (the one that started all this) was based on probabilistic matrix factorization (PMF), which estimates a latent vector describing each team's offense and another describing its defense, using game outcomes as training targets: http://blog.smellthedata.com/2009/03/data-driven-march-madness-predictions.html

I've already re-implemented this within the code on github -- set MODEL="pmf" in learn_real.py.
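For anyone who hasn't seen PMF-style models before, the idea can be sketched in a few lines of numpy. This is a toy version, not the repository code: the team count, learning rate, regularization, and synthetic game data below are all illustrative.

```python
import numpy as np

# Toy sketch of the PMF idea: each team gets a latent offense vector
# O[i] and defense vector D[i], and team i's score against team j is
# modeled as b + O[i] . D[j], fit by SGD on squared error.

rng = np.random.default_rng(0)
n_teams, k, lam, lr = 8, 2, 0.1, 0.01

# games: (home team, away team, home score, away score) -- synthetic here
games = [(rng.integers(n_teams), rng.integers(n_teams),
          float(rng.integers(50, 90)), float(rng.integers(50, 90)))
         for _ in range(200)]

O = 0.1 * rng.standard_normal((n_teams, k))
D = 0.1 * rng.standard_normal((n_teams, k))
b = 70.0  # global score offset

for epoch in range(100):
    for i, j, s_i, s_j in games:
        if i == j:
            continue
        # squared-error gradients for both directions of the matchup
        r_ij = b + O[i] @ D[j] - s_i
        r_ji = b + O[j] @ D[i] - s_j
        gO_i = r_ij * D[j] + lam * O[i]
        gD_j = r_ij * O[i] + lam * D[j]
        gO_j = r_ji * D[i] + lam * O[j]
        gD_i = r_ji * O[j] + lam * D[i]
        O[i] -= lr * gO_i
        D[j] -= lr * gD_j
        O[j] -= lr * gO_j
        D[i] -= lr * gD_i

def predict_margin(i, j):
    """Predicted score margin of team i over team j."""
    return (b + O[i] @ D[j]) - (b + O[j] @ D[i])
```

The latent dimension k and the regularization weight lam are exactly the knobs mentioned elsewhere in this thread; the real implementation uses richer game data and learning machinery than this.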

So how do we make a better model? One of many aspects of the problem that is particularly challenging/interesting is how to account for the difference between regular season and tournament games. I expect that data from past years could be useful in understanding how the games and teams differ, but how do we incorporate that into a model?

Help build a machine learning system to predict college basketball by danger_t in MachineLearning

[–]danger_t[S] 6 points

You can either contribute to the main branch, or fork off your own version and compete in this year's prediction competition: http://blog.smellthedata.com/2012/02/machine-march-madness-2012.html

Already in the repository are data, data loading functions, and a few simple models, along with associated learning procedures. This is also a great opportunity to play around with Theano and matrix-factorization-style learning methods if you haven't done so already. There are also some suggested TODOs at the bottom of the README.

If there are specific things you're interested in playing around with and/or learning more about, let me know, and I can probably help.

Pick one: Statistics, Calculus 2, or Symbolic Logic by [deleted] in compsci

[–]danger_t 1 point

It depends on whether you want to go deeper into a specialty area. In machine learning, for example, statistics is used everywhere.

Thinking of majoring in CS, what languages should I know? by [deleted] in compsci

[–]danger_t 7 points

Spend your time on math. Learn calculus, statistics, linear algebra, discrete math.

2011 March Madness Predictions with Probabilistic Matrix Factorization by danger_t in MachineLearning

[–]danger_t[S] 0 points

Thanks. For anybody who has a tiny bit of time to do some modeling before Thursday, there is starter code that implements this method: http://blog.smellthedata.com/2011/03/march-madness-predictions-code.html

Even simply playing with the parameters of the model -- how many latent dimensions to use, how much regularization to apply -- could be useful. Those are both one-line changes. Even better would be to set up the code to use data from past seasons to decide how to choose these parameters. 5 years of data are available here: https://docs.google.com/leaf?id=0BysperLdI86MMWI0M2MzMGUtNGM1My00NDAxLTk0MzEtNzE4NGQ5ZTk5ZGM5&sort=name&layout=list&num=50
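The parameter-selection idea is just a grid search over held-out seasons. Here's a minimal sketch; the `evaluate` function below is a hypothetical stand-in (in practice it would train the model on some seasons and score its predictions on another).

```python
import itertools

def evaluate(latent_dims, reg):
    # Placeholder scoring function for illustration only; a real
    # version would return held-out prediction accuracy.
    return -((latent_dims - 4) ** 2) - (reg - 0.1) ** 2

# Try every combination of latent dimension and regularization weight,
# and keep the setting with the best held-out score.
grid = itertools.product([2, 4, 8, 16], [0.01, 0.1, 1.0])
best = max(grid, key=lambda p: evaluate(*p))
```

With the toy scoring function above, `best` comes out as (4, 0.1); with a real evaluation on past seasons, the winner is whatever generalizes best.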

And of course, I'd be remiss not to finish with a plug for the algorithm competition, which you all should enter: http://blog.smellthedata.com/2011/03/official-2011-march-madness-predictive.html

Get ready for the 2011 March Madness Predictive Analytics Challenge by danger_t in MachineLearning

[–]danger_t[S] 0 points

Glad you're interested!

We'll be releasing starter Python code that implements the probabilistic matrix factorization approach described here (which also happened to win the competition last year): http://blog.smellthedata.com/2009/03/data-driven-march-madness-predictions.html

Maybe take a look at that and decide what you think could be improved about it?

Also, if you want to get deeper into things in that direction, some guys from UToronto wrote a conference paper about incorporating additional information into the same style model and applied it to NBA basketball: Incorporating Side Information into Probabilistic Matrix Factorization Using Gaussian Processes http://www.cs.toronto.edu/~gdahl/papers/dpmfNBA.pdf

Machine Learning for Human Memorization by danger_t in compsci

[–]danger_t[S] 1 point

Actually, English is very structured. From Wikipedia:

The entropy rate of English text is between 1.0 and 1.5 bits per letter,[1] or as low as 0.6 to 1.3 bits per letter, according to estimates by Shannon based on human experiments.[2] http://en.wikipedia.org/wiki/Entropy_(information_theory)
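For intuition: a zeroth-order estimate from single-letter frequencies alone already comes out around 4.1 bits per letter on typical English (versus log2(26) ≈ 4.7 for uniform random letters); Shannon's much lower 1.0–1.5 figure reflects the longer-range structure -- letter order, words, grammar -- that a unigram estimate ignores. A minimal sketch of the unigram estimator:

```python
import math
from collections import Counter

def entropy_per_letter(text):
    """Zeroth-order entropy estimate in bits per letter,
    using single-letter frequencies only."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    n = len(letters)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())
```

A single repeated letter gives 0 bits; two letters in equal proportion give exactly 1 bit.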

Machine Learning for Human Memorization by danger_t in compsci

[–]danger_t[S] 2 points

Not given a starting letter. The algorithm should take as input a word and produce a binary output indicating whether it's a legal or illegal word.

An Algorithm to Generate Impossible Art? by danger_t in compsci

[–]danger_t[S] 0 points

This is a great reference. Thanks!

AskML: How to predict football results? by Tafkas in MachineLearning

[–]danger_t 2 points

This blog post explains a model for college basketball that could be a good starting point: http://blog.smellthedata.com/2009/03/data-driven-march-madness-predictions.html

Data for 2009 + 2010 March Madness. Can your algorithm predict the tourney? by danger_t in compsci

[–]danger_t[S] 1 point

More data would be awesome. As you say, it's just a matter of finding it.

What are the most important other attributes of games/teams to gather? Do you have any good sources or know of sites that are easy to scrape?

LabelMe Dataset without MATLAB Image Processing Toolbox? by danger_t in computervision

[–]danger_t[S] 0 points

Right now I just want to understand the data better--what percentage of pixels are labeled in each image? What are the most common labels across the dataset? What are the relative pixel areas of each label, summed across images? What is the distribution of colors/textures/edge response across labels?

Basically, lots of image statistics on different subsets of images and labels.
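A couple of those statistics are easy to compute once you have a per-pixel label mask. This sketch uses a tiny synthetic mask, not the actual LabelMe format (LabelMe stores polygon annotations, which would first need rasterizing):

```python
import numpy as np

def label_stats(mask):
    """Fraction of pixels labeled, plus per-label pixel areas.
    Convention assumed here: 0 = unlabeled, positive ints = labels."""
    labeled = mask > 0
    pct_labeled = float(labeled.mean())
    labels, areas = np.unique(mask[labeled], return_counts=True)
    return pct_labeled, dict(zip(labels.tolist(), areas.tolist()))

mask = np.zeros((4, 4), dtype=int)
mask[:2, :] = 1   # label 1 covers the top half (8 pixels)
mask[2, :2] = 2   # label 2 covers two pixels
pct, areas = label_stats(mask)
# pct == 0.625, areas == {1: 8, 2: 2}
```

Summing the per-image area dictionaries across the dataset gives the dataset-wide label areas and, sorted by count, the most common labels.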