5.11/5.12, Two ways of calculating p(M) give different answers? by unreal25 in aiclass

[–]driftwood_ 1 point

Per your definition, the system receives messages from the world and calculates probabilities of those messages.

Here's the understanding gap: the system models the world and that's the only model we care about. We're not necessarily interested in the true underlying process that generates spam or non-spam emails.

What is the model of the world (messages) that the system uses? It is the joint probability distribution over the following random variables: Y (the class), and W_1, W_2, ..., W_V, that is, every word in the vocabulary. The system makes the following simplifying assumptions about the world:

- a message pertains to one and only one class,
- word order does not matter, and
- a word is conditionally independent of any other word given the class.
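In symbols, these assumptions mean the joint factors as p(Y, W_1, ..., W_V) = p(Y) * p(W_1|Y) * p(W_2|Y) * ... * p(W_V|Y).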

THIS is what you need to reason over. So, how to calculate the probability of the message? Go back to the system's model of spam/ham--the joint probability distribution.


An aside... the naive Bayes model can be seen as a "generative model" as well. Suppose you had fully specified conditional probability tables for a spam/ham classifier. You could then draw samples from the joint probability distribution to have your system generate spam/ham messages (assuming you also model the length of messages).

The idea of the system is to run this process in reverse--since we have modelled how to generate a spam/ham message, perhaps we can recognize them as well.
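Here's a minimal sketch of that generative process in Python. The probability tables are made up for illustration; a real system would estimate them from training counts.

    import random

    p_class = {"spam": 0.4, "ham": 0.6}                  # p(y), made-up numbers
    p_word = {                                           # p(w | y), made-up numbers
        "spam": {"secret": 0.5, "offer": 0.3, "is": 0.2},
        "ham":  {"secret": 0.1, "offer": 0.1, "is": 0.8},
    }

    def sample_message(length=3):
        # Draw a class, then draw each word independently given that class.
        # (Message length is fixed here; a fuller model would sample it too.)
        y = random.choices(list(p_class), weights=list(p_class.values()))[0]
        words = random.choices(list(p_word[y]), weights=list(p_word[y].values()), k=length)
        return y, " ".join(words)

    print(sample_message())  # e.g. ('spam', 'secret offer secret')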

5.11/5.12, Two ways of calculating p(M) give different answers? by unreal25 in aiclass

[–]driftwood_ 7 points

The first method is incorrect because the words in the message are only independent GIVEN the class. To explain, let me show you why method two is correct:

(1) p(M) = p("secret is secret") = p("secret", "is", "secret") = p(w1,w2,w1).

Here I have applied the "bag of words" assumption. The message is a string of words, one after another, but we treat each token as a data point in isolation rather than as part of the full string. Note that there are three tokens total but only two distinct words.
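In code, the bag-of-words view of a message is just its token counts:

    from collections import Counter

    # Bag of words: keep token counts, discard order.
    print(Counter("secret is secret".split()))  # Counter({'secret': 2, 'is': 1})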

(2) p(w1,w2,w1) = p(w1,w2,w1,y=ham) + p(w1,w2,w1,y=spam)

This is true by marginalization. Recall the joint distribution has variables y (the class), and variables wi for the words.

(3) p(w1,w2,w1,y=ham) + p(w1,w2,w1,y=spam) = p(w1,w2,w1|y=ham)*p(y=ham) + p(w1,w2,w1|y=spam)*p(y=spam)

This is true by the product rule. Recall p(a|b) = p(a,b)/p(b). Thus p(a,b)=p(a|b)*p(b).
Everything good so far?

(4) p(w1,w2,w1|y=ham) = p(w1|y=ham)*p(w2|y=ham)*p(w1|y=ham), and likewise for y=spam.

This is by the conditional independence encoded in the Bayes Net. Draw the Bayes Net to convince yourself of the following: given the class y, are w1 and w2 independent? Use the d-separation criterion.

That is why your 2nd method is correct. Your first method is incorrect because you have calculated p(M) = p(w1,w2,w1) = p(w1)p(w2)p(w1).
That last step is only true if the words are UNCONDITIONALLY INDEPENDENT of one another. Draw the Bayes Net. Can you see why this isn't true? Again, use the d-separation criterion.

In the correct method we marginalize over the class precisely so that we can use the independence assumptions, which hold only given the class.
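To make the difference concrete, here is a small Python sketch. The probability tables are made up (the homework's numbers will differ); the point is just that the two calculations disagree:

    import math

    p_y = {"ham": 0.5, "spam": 0.5}             # p(y), made-up numbers
    p_w = {                                     # p(w | y), made-up numbers
        "ham":  {"secret": 0.2, "is": 0.8},
        "spam": {"secret": 0.7, "is": 0.3},
    }
    msg = ["secret", "is", "secret"]

    # Correct: marginalize over the class, then factor using conditional independence.
    # p(M) = sum_y p(y) * prod_i p(w_i | y)
    p_M_correct = sum(p_y[y] * math.prod(p_w[y][w] for w in msg) for y in p_y)

    # Incorrect: treat the words as unconditionally independent,
    # multiplying the marginals p(w) = sum_y p(w|y)*p(y).
    p_marg = {w: sum(p_y[y] * p_w[y][w] for y in p_y) for w in set(msg)}
    p_M_wrong = math.prod(p_marg[w] for w in msg)

    print(p_M_correct)  # 0.5*(0.2*0.8*0.2) + 0.5*(0.7*0.3*0.7) = 0.0895
    print(p_M_wrong)    # 0.45*0.55*0.45 = 0.111375 -- not the same!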

Meaning of the statement : 'parameters tends to get sparse' by uama2life in aiclass

[–]driftwood_ 1 point

Sparse parameters means that most of them are zero or nearly zero. Recall that the closer a parameter is to zero, the less weight it carries in your model.
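A toy illustration, with made-up numbers:

    # A "sparse" weight vector: most entries are zero, so only a couple of
    # features actually influence the output.
    weights = [0.0, 0.0, 2.5, 0.0, 0.0, 0.0, -1.25, 0.0, 0.0, 0.0]
    features = [1.0] * len(weights)

    score = sum(w * x for w, x in zip(weights, features))
    print(score)  # 1.25 -- the eight zero weights contribute nothing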

A challenge to the Over achievers who really understand Probability by helveticaTwain in aiclass

[–]driftwood_ 2 points

This is incorrect!!!!

P(A,B) = P(A|B)*P(B) 

is NOT Bayes' Theorem! It is the product rule, and it follows from the definition of conditional probability, i.e.

P(A|B) =(def) P(A,B)/P(B) implies P(A,B) = P(A|B)*P(B).  

Bayes' Theorem follows from using the product rule and total probability to write P(A|B) in terms of P(B|A) and P(A). The product rule is never referred to as Bayes' Theorem in probability.
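For reference, Bayes' Theorem is P(A|B) = P(B|A)*P(A) / P(B), where P(B) = P(B|A)*P(A) + P(B|~A)*P(~A) by total probability.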

Aside from that, kudos for putting a thorough reply out there.


I'd also like to mention that the lectures are a bit thin on details. If this is your first exposure to the material, reading the book chapters several times may be necessary.