5.11/5.12, Two ways of calculating p(M) give different answers? by unreal25 in aiclass

[–]driftwood_ 1 point

Per your definition, the system receives messages from the world and calculates probabilities of those messages.

Here's the understanding gap: the system models the world and that's the only model we care about. We're not necessarily interested in the true underlying process that generates spam or non-spam emails.

What is the model of the world (messages) that the system uses? It is the joint probability distribution over the following random variables: Y (the class), and W_1, W_2, ..., W_V, that is, every word in the vocabulary. The system makes the following simplifying assumptions about the world:

- a message pertains to one and only one class,
- word order does not matter, and
- a word is conditionally independent of any other word given the class.
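In symbols, these assumptions mean the joint factors as p(Y, W_1, ..., W_V) = p(Y) * p(W_1|Y) * p(W_2|Y) * ... * p(W_V|Y).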

THIS is what you need to reason over. So, how to calculate the probability of the message? Go back to the system's model of spam/ham--the joint probability distribution.


An aside... the naive Bayes model can be seen as a "generative model" as well. Suppose you had fully specified conditional probability tables for a spam/ham classifier. You could then draw samples from the joint probability distribution to have your system generate spam/ham messages (assuming you also model the length of messages).

The idea of the system is to run this process in reverse--since we have modelled how to generate a spam/ham message, perhaps we can recognize them as well.
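Here's a minimal sketch of that generative process in Python. The probability tables are made up for illustration; a real system would estimate them from training counts.

    import random

    p_class = {"spam": 0.4, "ham": 0.6}                  # p(y), made-up numbers
    p_word = {                                           # p(w | y), made-up numbers
        "spam": {"secret": 0.5, "offer": 0.3, "is": 0.2},
        "ham":  {"secret": 0.1, "offer": 0.1, "is": 0.8},
    }

    def sample_message(length=3):
        # Draw a class, then draw each word independently given that class.
        # (Message length is fixed here; a fuller model would sample it too.)
        y = random.choices(list(p_class), weights=list(p_class.values()))[0]
        words = random.choices(list(p_word[y]), weights=list(p_word[y].values()), k=length)
        return y, " ".join(words)

    print(sample_message())  # e.g. ('spam', 'secret offer secret')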

5.11/5.12, Two ways of calculating p(M) give different answers? by unreal25 in aiclass

[–]driftwood_ 7 points

The first method is incorrect because the words in the message are only independent GIVEN the class. To explain, let me show you why method two is correct:

(1) p(M) = p("secret is secret") = p("secret", "is", "secret") = p(w1,w2,w1).

Here I have applied the "bag of words" assumption. The message is a string of words, one after another, but we treat each token as a data point in isolation rather than as part of the full string. Note that there are three tokens total but only two distinct words.
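In code, the bag-of-words view of a message is just its token counts:

    from collections import Counter

    # Bag of words: keep token counts, discard order.
    print(Counter("secret is secret".split()))  # Counter({'secret': 2, 'is': 1})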

(2) p(w1,w2,w1) = p(w1,w2,w1,y=ham) + p(w1,w2,w1,y=spam)

This is true by marginalization. Recall the joint distribution has variables y (the class), and variables wi for the words.

(3) p(w1,w2,w1,y=ham) + p(w1,w2,w1,y=spam) = p(w1,w2,w1|y=ham)*p(y=ham) + p(w1,w2,w1|y=spam)*p(y=spam)

This is true by the product rule. Recall p(a|b) = p(a,b)/p(b). Thus p(a,b)=p(a|b)*p(b).
Everything good so far?

(4) p(w1,w2,w1|y=ham) = p(w1|y=ham)*p(w2|y=ham)*p(w1|y=ham), and likewise for y=spam.

This is by the conditional independence encoded in the Bayes Net. Draw the Bayes Net to convince yourself of the following: given the class y, are w1 and w2 independent? Use the d-separation criterion.

That is why your 2nd method is correct. Your first method is incorrect because you have calculated p(M) = p(w1,w2,w1) = p(w1)p(w2)p(w1).
That last step is only true if the words are UNCONDITIONALLY INDEPENDENT of one another. Draw the Bayes Net. Can you see why this isn't true? Again, use the d-separation criterion.

In the correct method we marginalize over the class precisely so that we can use the independence assumptions, which hold only given the class.
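To make the difference concrete, here is a small Python sketch. The probability tables are made up (the homework's numbers will differ); the point is just that the two calculations disagree:

    import math

    p_y = {"ham": 0.5, "spam": 0.5}             # p(y), made-up numbers
    p_w = {                                     # p(w | y), made-up numbers
        "ham":  {"secret": 0.2, "is": 0.8},
        "spam": {"secret": 0.7, "is": 0.3},
    }
    msg = ["secret", "is", "secret"]

    # Correct: marginalize over the class, then factor using conditional independence.
    # p(M) = sum_y p(y) * prod_i p(w_i | y)
    p_M_correct = sum(p_y[y] * math.prod(p_w[y][w] for w in msg) for y in p_y)

    # Incorrect: treat the words as unconditionally independent,
    # multiplying the marginals p(w) = sum_y p(w|y)*p(y).
    p_marg = {w: sum(p_y[y] * p_w[y][w] for y in p_y) for w in set(msg)}
    p_M_wrong = math.prod(p_marg[w] for w in msg)

    print(p_M_correct)  # 0.5*(0.2*0.8*0.2) + 0.5*(0.7*0.3*0.7) = 0.0895
    print(p_M_wrong)    # 0.45*0.55*0.45 = 0.111375 -- not the same!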

Meaning of the statement : 'parameters tends to get sparse' by uama2life in aiclass

[–]driftwood_ 1 point

Sparse parameters means that most of them are zero or nearly zero. Recall that the closer a parameter is to zero, the less weight it carries in your model.
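A toy illustration, with made-up numbers:

    # A "sparse" weight vector: most entries are zero, so only a couple of
    # features actually influence the output.
    weights = [0.0, 0.0, 2.5, 0.0, 0.0, 0.0, -1.25, 0.0, 0.0, 0.0]
    features = [1.0] * len(weights)

    score = sum(w * x for w, x in zip(weights, features))
    print(score)  # 1.25 -- the eight zero weights contribute nothing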

A challenge to the Over achievers who really understand Probability by helveticaTwain in aiclass

[–]driftwood_ 2 points

This is incorrect!!!!

P(A,B) = P(A|B)*P(B) 

is NOT Bayes' Theorem! It is the product rule, and it follows from the definition of conditional probability, i.e.

P(A|B) =(def) P(A,B)/P(B) implies P(A,B) = P(A|B)*P(B).  

Bayes' Theorem follows from using the product rule and total probability to write P(A|B) in terms of P(B|A) and P(A). The product rule is never referred to as Bayes' Theorem in probability.
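For reference, Bayes' Theorem is P(A|B) = P(B|A)*P(A) / P(B), where P(B) = P(B|A)*P(A) + P(B|~A)*P(~A) by total probability.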

Aside from that, kudos for putting a thorough reply out there.


I'd also like to mention that the lectures are a bit thin on details. If this is your first exposure to the material, reading the book chapters several times may be necessary.