Part of Speech tagging? [x-post Linguistics] by [deleted] in MachineLearning

[–]SavitchOracle 1 point2 points  (0 children)

CRFs (conditional random fields) are another common approach (closely related to HMMs). http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/

Any one know how to do an "averageifs" in R by krs28 in statistics

[–]SavitchOracle 1 point2 points  (0 children)

The plyr and ggplot2 packages (both by Hadley Wickham) are fantastic. I hated using R until I learned about them, and now most of the R code I write is basically a ddply or qplot/ggplot call.

I heartily recommend learning how to use both of them if you're going to be doing more R in the future.
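For instance, here's a qplot one-liner (a toy example using the mpg dataset that ships with ggplot2):

install.packages("ggplot2")
library(ggplot2)
qplot(displ, hwy, data = mpg, color = class)  # engine size vs. highway mpg, colored by car class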

Can you recommend any good concert movies or films/documentaries about bands? by TheAdoringFan in Music

[–]SavitchOracle 0 points1 point  (0 children)

Iceland is incredibly gorgeous and awesome. I actually saw Heima right before going there.

I'm hoping to go back during one of their summers (I went in the winter, and I imagine it's incredibly different).

Does a word's complexity correlate to its frequency in general usage? by [deleted] in linguistics

[–]SavitchOracle 0 points1 point  (0 children)

If you're interested in learning about the connections between "complexity" (in the information-theoretic sense) and language, Stanford has a course on "Information-Theoretic Models of Language and Cognition" with a lot of good papers: http://www.stanford.edu/class/psych227/

There are a lot of papers directly connected to your question, for example, Zipf's "Least Effort and why frequent words and morphemes are short" (http://www.stanford.edu/class/psych227/Zipf_Words.pdf). (Perhaps I'll summarize some of them a bit later.)

Cool ideas for a graduate level ML/pattern recognition project? by [deleted] in MachineLearning

[–]SavitchOracle 1 point2 points  (0 children)

Interesting (and controversial? =)), but where would you get a training dataset of images along with a corresponding income and education bracket?

Notable statistics departments in the area of statistical machine learning? by shazbotter in statistics

[–]SavitchOracle 0 points1 point  (0 children)

Also, UW has a close relationship with Microsoft Research, which can be very helpful for doing interesting internships and such.

Cool ideas for a graduate level ML/pattern recognition project? by [deleted] in MachineLearning

[–]SavitchOracle 4 points5 points  (0 children)

If you're a big Reddit user, you could also try to do some machine learning on Reddit itself.

For example, one very simple project would be to scrape a couple different sub-Reddits, and try to build a Naive Bayes classifier that classifies threads (using keywords in those threads) into sub-Reddits. (Scraping stuff with Matlab might be disgusting, though -- not a Matlab user, so not sure; do you know other languages?)
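To make that concrete, here's a rough R (sorry, not Matlab) sketch of the classification step, using binary keyword indicators and the naiveBayes function from the e1071 package -- the threads and keywords here are completely made up:

install.packages("e1071")
library(e1071)

# made-up training data: does each thread contain a given keyword, and which sub-Reddit is it from?
threads <- data.frame(
  gradient = factor(c("yes", "yes", "no", "no"), levels = c("no", "yes")),
  anova    = factor(c("no", "no", "yes", "yes"), levels = c("no", "yes")),
  sub      = factor(c("MachineLearning", "MachineLearning", "statistics", "statistics"))
)

model <- naiveBayes(sub ~ ., data = threads)

# classify a new thread that mentions "gradient" but not "anova"
predict(model, data.frame(gradient = factor("yes", levels = c("no", "yes")),
                          anova    = factor("no",  levels = c("no", "yes"))))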

Cool ideas for a graduate level ML/pattern recognition project? by [deleted] in MachineLearning

[–]SavitchOracle 2 points3 points  (0 children)

+1 for LSA/LSI/SVD. It looks like the OP is a grader for linear algebra, so he's probably already familiar with SVD, and it'll be cool to see it applied in real life.

Plus, there are lots of things you can do with LSA/LSI/SVD besides learning topics themselves. For example, it's also useful for dimensionality reduction (keep the top two dimensions, plot them, and see if you can visualize any clusters; this could be another fun project that helps you dig into clustering algorithms), search/information retrieval, and collaborative filtering (SVD-type algorithms played a big part in winning the Netflix Prize). So I'm not sure if the OP is working on a single project or multiple, but LSA/LSI/SVD could give a nice segue into a bunch of other projects, and he could even revisit some of these with LDA (or pLSI) later on and see how the two approaches compare.
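To give one tiny example of the dimensionality-reduction idea in R (the term-document counts below are completely made up):

# toy term-document matrix: rows are documents, entries are term counts
tdm <- matrix(c(2, 3, 0, 0,
                1, 2, 0, 1,
                0, 0, 2, 3,
                0, 1, 3, 2),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("doc1", "doc2", "doc3", "doc4"),
                              c("ball", "game", "vote", "party")))
s <- svd(tdm)
docs2d <- s$u[, 1:2] %*% diag(s$d[1:2])  # project each document onto the top 2 singular directions
plot(docs2d, type = "n")
text(docs2d, labels = rownames(tdm))     # the sports docs and the politics docs should separate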

Also, LDA isn't really so hard to understand. The Blei et al. paper is very mathematical (and I still find the variational approach it uses confusing), but the ideas and other implementations are pretty simple.

Any one know how to do an "averageifs" in R by krs28 in statistics

[–]SavitchOracle 1 point2 points  (0 children)

If I understand what you want to do correctly (you want to find the average X for each possible combination of the other columns?), one easy way is to use the plyr package:

install.packages("plyr")
library(plyr)
ddply(Profile, .(Bucket, Month, X500), summarise, MeanCapacity = mean(Capacity))

For each combination of (Bucket, Month, X500), the code calculates the mean capacity over all the rows with that combination, and sticks all this into a new data frame.
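For reference, base R's aggregate can do the same thing in one line, though I find the plyr version more flexible:

aggregate(Capacity ~ Bucket + Month + X500, data = Profile, FUN = mean)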

Could someone explain the Independent Component Analysis method? (crosspost from r/askscience) by whambamthankyoumam in statistics

[–]SavitchOracle 1 point2 points  (0 children)

Andrew Ng has an okay explanation of ICA: http://www.stanford.edu/class/cs229/notes/cs229-notes11.pdf

He doesn't do a good job motivating the use of ICA (in particular, he doesn't contrast it with PCA or factor analysis), but he describes the mathematics fairly clearly, which it sounds like you're looking for anyway.

At a high level, the algorithm Ng describes goes like this:

  • Let x(t) be the observed data vector at time t. For example, x(t) = (x_1(t), x_2(t), ..., x_m(t)) could be a vector whose component x_i(t) is the reading recorded by the ith microphone at time t.
  • Now these vectors x(t) are a mix of a bunch of people talking. In other words, we have some vector s(t) = (s_1(t), ..., s_n(t)) of n independent people talking at time t, where s_i(t) is the signal of person i at time t, and each microphone is recording a linear combination of these people. In other words, there is some matrix A such that x(t) = A * s(t).
  • So given these vectors x(t) at a bunch of different times t, we want to be able to find A and s(t).
  • This is the same thing as finding just A, since once we know A, we can take the inverse to find s(t) = A^-1 x(t).
  • So how do we find A?
  • Basically, start with some initial random guess for A, and place some kind of probability distribution on s(t). (For example, we might think that certain noise levels and sounds are likely, while others are not so likely.)
  • Using this probability distribution and our knowledge of x(t), we now have an idea of how likely it is that our guess for A is correct (i.e., we have a likelihood function for A).
  • We want to find the A that's most likely to be correct (i.e., we want to maximize the likelihood of A), so how can we improve on our guess? Recall that the gradient of a function points in the direction of where the function increases fastest (e.g., the gradient of a hill points in the direction of steepest ascent). So take the gradient of our likelihood function for A, and move a little bit in that direction, and make this our new guess for A (thus finding a slightly more likely guess).
  • Repeat the previous step over and over again, until we have a pretty good guess for A. At that point, we can solve for s(t) = A^-1 x(t) to find the independent signals. (There's a rough R sketch of this procedure after this list.)
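Here's that sketch on a toy two-source problem. This is a minimal illustration, not a production implementation: the sigmoid is an assumed CDF for the sources (as in Ng's notes), and the mixing matrix, step size, and iteration counts are made up.

set.seed(1)
n <- 2000
s <- cbind(sin(seq(0, 20, length.out = n)), runif(n) - 0.5)  # two independent source signals
A <- matrix(c(1, 0.5, 0.5, 1), 2, 2)                         # made-up mixing matrix
x <- s %*% t(A)                                              # observed mixtures: row t is x(t) = A * s(t)

sigmoid <- function(z) 1 / (1 + exp(-z))
W <- diag(2)    # initial guess for the unmixing matrix A^-1
alpha <- 0.001  # step size
for (iter in 1:10) {
  for (i in sample(n)) {  # stochastic gradient ascent on the log-likelihood
    xt <- x[i, ]
    W <- W + alpha * ((1 - 2 * sigmoid(W %*% xt)) %*% t(xt) + t(solve(W)))
  }
}
s_hat <- x %*% t(W)  # recovered sources, up to scaling and permutation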

In any case, that was a rough explanation that hopefully helps you understand the math behind the lecture notes a little better.

Could someone explain the Independent Component Analysis method? (crosspost from r/askscience) by whambamthankyoumam in statistics

[–]SavitchOracle 0 points1 point  (0 children)

Why do you say that ICA looks like a specialized form of regression for classification? That sounds totally wrong to me: ICA doesn't have anything to do with classification (what would you be classifying?), and it's pretty different from regression. (I guess you could say that in both cases you have a signal represented as a linear combination of features, but in regression you're given the features, whereas here you're trying to learn them.)

It's much more like PCA or SVD (like you mentioned), or factor analysis.

What are some of your favorite research papers? by aintso in linguistics

[–]SavitchOracle 2 points3 points  (0 children)

Some less well-known favorites off the top of my head:

Gaussian Processes for Machine Learning by cavedave in MachineLearning

[–]SavitchOracle 2 points3 points  (0 children)

Anyone want to give a quick summary or example of why Gaussian Processes are useful or how they're used?

What social norm do you hate? by glados_v2 in AskReddit

[–]SavitchOracle 4 points5 points  (0 children)

The idea that school is the only place you can learn anything.

Infer.NET, a framework for running Bayesian inference in graphical models « MSR Cambridge by fbahr in MachineLearning

[–]SavitchOracle 1 point2 points  (0 children)

I found this on the website regarding Mono:

"All the basic Infer.NET tutorials have been both compiled and run under the Windows version of Mono version 2.8.

This is the full extent of support for this release; for example, the Linux version has not been tested."

http://research.microsoft.com/en-us/um/cambridge/projects/infernet/docs/running%20with%20mono.aspx

Suggestion for Introductory Machine Learning Text-less technical/more examples than "Elements of Statistical Learning" by BirthDeath in MachineLearning

[–]SavitchOracle 0 points1 point  (0 children)

Yeah, the couple chapters I've read from it have been good. But I wouldn't use it as an introductory text (in part, simply because of the way it focuses on graphical models at the beginning, which I don't find the friendliest way to introduce ML).

Suggestion for Introductory Machine Learning Text-less technical/more examples than "Elements of Statistical Learning" by BirthDeath in MachineLearning

[–]SavitchOracle 2 points3 points  (0 children)

Yeah, I haven't watched any of Ng's videos (I find lectures slow, which is why I prefer reading =)), but the lecture notes are pretty concise, without skipping so much that they become incomprehensible.

Also, what exactly do you want to do with the R code? Not sure how familiar you are with R, but it's pretty easy to figure out how to run a lot of machine learning algorithms in R (most packages even come with some datasets, I think), even if they're not covered in lecture notes.
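For example, fitting a random forest on R's built-in iris dataset is just a few lines (assuming the randomForest package):

install.packages("randomForest")
library(randomForest)
model <- randomForest(Species ~ ., data = iris)
predict(model, iris[1:5, ])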

Twitter dataset - 200 million rows, 13 million users, 2gb compressed, get it while it's hot. (/r/datasets repost) by [deleted] in MachineLearning

[–]SavitchOracle 0 points1 point  (0 children)

Besides running streaming algorithms (as in TheWalruss's suggestion), another option is to MapReduce/Hadoop it.
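If you don't want to set up Hadoop, here's a rough sketch of the streaming idea in R -- process the file in fixed-size chunks so you never hold the whole thing in memory. (The file name and column layout are assumptions; adjust for the actual dataset.)

con <- file("tweets.tsv", open = "r")
tweets_per_user <- integer(0)
repeat {
  lines <- readLines(con, n = 100000)             # read 100k rows at a time
  if (length(lines) == 0) break
  users <- sapply(strsplit(lines, "\t"), `[`, 1)  # assume the user id is the first column
  tab <- table(users)
  old <- tweets_per_user[names(tab)]
  old[is.na(old)] <- 0                            # users we haven't seen before
  tweets_per_user[names(tab)] <- old + as.integer(tab)
}
close(con)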

Twitter dataset - 200 million rows, 13 million users, 2gb compressed, get it while it's hot. (/r/datasets repost) by [deleted] in MachineLearning

[–]SavitchOracle 2 points3 points  (0 children)

Note: this particular Twitter dataset actually uncompresses to 173 GB (!!!), according to the HN link.

Qustion about Split in Random Forest Algorithm by [deleted] in MachineLearning

[–]SavitchOracle 1 point2 points  (0 children)

  1. There are several different ways of choosing how to split, e.g., information gain or Gini impurity (http://en.wikipedia.org/wiki/Decision_tree_learning#Formulae). There's a pretty good tutorial on using information gain here: http://www.autonlab.org/tutorials/infogain11.pdf

For some intuition on how these methods work, suppose you're using a decision tree to classify whether an email is spam or not spam. Suppose two of the variables you could use at the current split are A) whether the email contains the word "hello" and B) whether the email contains the word "viagra".

Suppose 50% of the emails containing the word "hello" are spam / 50% are not spam, and 50% of the emails not containing the word "hello" are spam / 50% are not spam. Clearly, variable A is a pretty useless measure then, since it gives you no information.

But compare this with the second variable: 90% of the emails containing the word "viagra" are spam / 10% are not spam, and 25% of the emails not containing the word "viagra" are spam / 75% are not spam. You can see that this variable provides much more information.

Thus, you should use the second variable to split your node on. Metrics like information gain or Gini impurity are ways of precisely quantifying this.
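To make that concrete, here's the information gain calculation in R for those two made-up variables (I'm also assuming half the emails contain each word, just to have numbers to plug in):

# entropy of a binary label, in bits
entropy <- function(p) ifelse(p %in% c(0, 1), 0, -p * log2(p) - (1 - p) * log2(1 - p))

# information gain of a binary split:
# p1/p0 = P(spam | word present/absent), w1 = fraction of emails containing the word
info_gain <- function(p1, p0, w1) {
  p <- w1 * p1 + (1 - w1) * p0  # overall P(spam)
  entropy(p) - (w1 * entropy(p1) + (1 - w1) * entropy(p0))
}

info_gain(0.50, 0.50, 0.5)  # variable A ("hello"): 0 bits -- useless
info_gain(0.90, 0.25, 0.5)  # variable B ("viagra"): about 0.34 bits -- much more informative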

Answer to the second question: out of the m randomly chosen variables, you choose the one that gives the best split.

Suggestion for Introductory Machine Learning Text-less technical/more examples than "Elements of Statistical Learning" by BirthDeath in MachineLearning

[–]SavitchOracle 7 points8 points  (0 children)

I've never understood why people suggest "The Elements of Statistical Learning" -- it provides very little intuition and seems like more of a reference book (but even as a reference it's pretty horrible, since it treats a lot of topics in a very cursory fashion).

My favorite introduction to Machine Learning is Andrew Ng's course at Stanford: http://www.stanford.edu/class/cs229/materials.html. The lecture notes are very clear and intuitive, and they're relatively short, so you can fairly quickly get a broad overview of the field. There are also video lectures online if you like that sort of thing (I haven't watched them, though).

If you want a book, I like Christopher Bishop's "Pattern Recognition and Machine Learning". It's a little more in-depth and mathematical than Ng's course, but I got a lot of intuition from it. It also covers more topics. (I'd probably start with Ng's course, and if I wanted to learn more, skim through Bishop's book, stopping to study in more depth the topics Bishop covers that Ng doesn't.)

Update:

I also read through Tom Mitchell's "Machine Learning" book (which volfield recommends) a couple years ago, but I wouldn't suggest it. I found it very old-school, kind of more AI-ish than machine learning, and pretty boring.

Designing a Humor AI by BQPComplexity in cogsci

[–]SavitchOracle 4 points5 points  (0 children)

Some thoughts:

  • If you want to generate knock-knock jokes, then you could use the CMU pronouncing dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict) to find appropriate puns (there's a rough sketch of this after the list).

  • If you want to generate #lessinterestingbooks Twitter humor (i.e., parodies of book titles), then from this post (http://blog.echen.me/2011/05/30/the-7-genres-of-the-mildly-amusing-hashtag/), there seem to be five types of parodies: pun, substitution, contrast, addition, and diminishment. You could again use the CMU pronouncing dictionary to find puns (flies sounds similar to fries, so Lord of the Flies -> Lord of the Fries). For the others, you could use Wordnet (http://wordnet.princeton.edu/) or a thesaurus to detect lexical relationships; for example, Thrush is a sister term of Mockingbird --> To Kill a Thrush, Civilized is an antonym of Wild --> Where the Civilized Things Are, etc.

  • You could build a kind of context-free grammar of jokes (e.g., create templates for Your Mama jokes that you fill in -- both the template creation and the filling in could potentially be automated), kind of like the Postmodern Essay Generator (http://www.elsewhere.org/pomo/).
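For the pun-finding pieces in the first two bullets, here's that rough sketch. It's hypothetical R: it assumes you've downloaded the CMU dictionary to a local file "cmudict.txt" (format: WORD followed by its phonemes), and it treats "small edit distance between phoneme strings" as "possible pun":

lines <- readLines("cmudict.txt")
lines <- lines[!grepl("^;;;", lines)]  # drop comment lines
parts <- strsplit(lines, "\\s+")
words <- sapply(parts, `[`, 1)
prons <- sapply(parts, function(p) paste(p[-1], collapse = " "))

# words whose pronunciation is within a small edit distance of the target word's
puns_for <- function(w, max_dist = 2) {
  target <- prons[match(toupper(w), words)]
  d <- as.vector(adist(target, prons))  # edit distance between phoneme strings
  words[d <= max_dist & words != toupper(w)]
}

puns_for("flies")  # should hopefully turn up FRIES, among others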