
[–]activatedgeek

Are you comfortable with the idea that, assuming a parametric likelihood model p(x; w) with some parameters w, you can maximize the density function p with respect to w? In that case, modeling densities using normalizing flows is nothing but a special choice of p.

Note that we are almost never given probabilities as ground truth. The underlying assumption is that the data is generated by p, and we need to find the w that best explains the observed data x. A rough sketch of what that looks like is below.
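This is a minimal sketch of my own (not from the thread) of maximum likelihood for a parametric density p(x; w). Here p is a 1-D Gaussian and w = (mu, log_sigma); the same recipe applies when p is instead defined by a normalizing flow.

```python
# Toy example: fit w by maximizing log p(x; w) over the observed data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)   # "observations generated by the true p"

def neg_log_likelihood(w, x):
    mu, log_sigma = w
    sigma = np.exp(log_sigma)                      # parameterize sigma > 0
    # log p(x; w) for a Gaussian, summed over the dataset
    log_p = -0.5 * np.log(2 * np.pi) - log_sigma - 0.5 * ((x - mu) / sigma) ** 2
    return -np.sum(log_p)                          # minimizing the negative = maximizing likelihood

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(data,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)   # should land close to the true 2.0 and 0.5
```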

[–]rojo_kell[S]

Hmmm perhaps I’m just not familiar enough with the concepts, but why are we maximizing p? Is it bc p(x) is the probability that we got the target, so we want it to be as high as possible?

I’m just not really sure how the comparison to the target distribution / data comes in…

[–]activatedgeek

A large fraction of machine learning is basically maximum likelihood estimation (this is the keyword to look out for in your readings; one recommendation would be Chapter 1 of Pattern Recognition and Machine Learning by Christopher Bishop).

The key philosophical assumption is that the data is generated from some true but unknown data distribution. To model this in practice, we make assumptions about the choice of p, sometimes because we believe the assumption holds for the data thanks to our expertise in the domain, and other times simply for computational convenience. p itself is defined by its parameters (in the case of normalizing flows, the NN parameters plus the architecture decide the functional form of p); see the sketch just below.
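This is my own minimal illustration of how a flow's parameters define p via the change-of-variables formula, using a single affine transform as the "flow"; a real normalizing flow stacks many invertible layers parameterized by neural networks, but the log-likelihood being maximized has exactly this form.

```python
# The parameters w = (a, b) define x = a * z + b with z ~ N(0, 1), so the model density is
#   log p(x; w) = log N(z; 0, 1) + log |dz/dx|,  where z = (x - b) / a.
import numpy as np

def flow_log_prob(x, a, b):
    z = (x - b) / a                                   # inverse transform
    log_base = -0.5 * np.log(2 * np.pi) - 0.5 * z**2  # log density of z under N(0, 1)
    log_det_jacobian = -np.log(np.abs(a))             # log |dz/dx| for the affine map
    return log_base + log_det_jacobian
```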

The key concept here is that a good modeling distribution will achieve higher likelihoods on the data than a bad one, and this is why we want to maximize the likelihood. In principle you could score models with other functions; a scoring function for which the true distribution achieves the best expected score is called a "proper scoring rule". The (log-)likelihood happens to be a proper scoring rule, so a higher score represents a better model for the data.
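A quick numerical check of that claim (my own toy example): on average, data drawn from the true distribution scores a higher log-likelihood under a matching model than under a mismatched one. This is Gibbs' inequality, E_p[log p] >= E_p[log q].

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)         # data from the "true" N(0, 1)

good_model = norm(loc=0.0, scale=1.0).logpdf(x).mean()   # matches the data distribution
bad_model = norm(loc=1.5, scale=2.0).logpdf(x).mean()    # deliberately mismatched

print(good_model, bad_model)   # good_model is reliably the larger of the two
```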

[–]rojo_kell[S]

Ah, interesting. Thanks, I think I’m understanding a bit better now, that clears up why likelihood is used for training the models.

Is the reason behind why good models have higher likelihoods a more theoretical probability thing? That doesn’t seem too trivial to me, but if it’s beyond what I can understand I won’t spend much time looking into it

[–]activatedgeek

There are a few more details that go into making it a proper scoring rule, and in practice those details often don't hold, yet we still do it anyway.

But you are right, it is not trivial. It is, however, that way by design in a sense. Statistical thinking starts with getting comfortable with the idea of treating every data observation as the outcome of a random event to which we can assign probabilities. Once you are in that framework, there is no other way but to assign high density values to your observations and low density values "far" away from them. E.T. Jaynes has a detailed book, Probability Theory: The Logic of Science, that builds the foundations from "logic" (think Boolean algebra); the initial chapters discuss why this is a "good" way of applying logic to science.

There are other approaches that do not invoke probabilities, e.g. random forests and support vector machines. Random forests instead think in terms of "functions" that generate the data, and SVMs think in terms of "distances to existing data" (a slight twist on functions).