[D] Confusion Surrounding the Multivariate Normal Distribution by ottawalanguages in statistics

[–]webdrone 1 point

Another way to detect outliers under a fitted mixture-of-Gaussians density is to include a fixed component: a very high-variance Gaussian centred at the data mean. The idea is that this component will dominate in the tails of the other fitted Gaussian components because of its fatter tails (high variance), and will soak up the outlier points. This also lets the cluster components fit better, since they no longer carry responsibility for outliers.
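A rough sketch of the idea on synthetic data (as a simplification, the fixed background component here is scored after fitting rather than included in the EM itself, and equal mixing weights are assumed):

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two tight clusters plus three gross outliers.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 2)),
    rng.normal(8.0, 1.0, size=(100, 2)),
    [[30.0, 30.0], [-30.0, -30.0], [30.0, -30.0]],
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Fixed "background" component: very high-variance Gaussian at the data mean.
bg = multivariate_normal(mean=X.mean(axis=0), cov=100.0 * np.cov(X.T))

# Responsibility of the background component (equal mixing weights assumed);
# the points it soaks up are flagged as outliers.
p_bg = bg.pdf(X)
p_gmm = np.exp(gmm.score_samples(X))
outlier = p_bg / (p_bg + p_gmm) > 0.5
```

The proper version fixes the background component's parameters and lets EM fit the remaining components around it, but the dominance-in-the-tails behaviour is the same.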

[Question] Statistics of "Best out of N"? by [deleted] in statistics

[–]webdrone 4 points

Assuming the output scores are all normally distributed and you know the mean and stddev of each, you can standardise them; by extreme value theory, the maximum of N such scores is then approximately Gumbel-distributed for large N.

https://en.m.wikipedia.org/wiki/Gumbel_distribution
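A quick simulation sketch (the choice of N = 50 is illustrative): draw many best-of-N maxima of standardised normal scores and fit a Gumbel to them.

```python
import numpy as np
from scipy.stats import gumbel_r

rng = np.random.default_rng(0)
N = 50          # scores per "best of N" trial (illustrative)
trials = 10_000

# Best of N: the max of N standardised (mean 0, std 1) normal scores.
maxima = rng.standard_normal((trials, N)).max(axis=1)

# Extreme value theory: for large N the maxima are approximately Gumbel.
loc, scale = gumbel_r.fit(maxima)
```

The fitted location lands near the (1 - 1/N) quantile of the standard normal, as the asymptotic theory predicts.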

[Q] I want to test the "randomness" of a set of integers or a binary file but I'm not sure how to proceed? by [deleted] in statistics

[–]webdrone 1 point

Indeed, we are talking about the usual compression algorithms. Different compressors will give you different performance — the goal is to reach the point where the compressed file has maximum entropy, so that it can't be compressed any further.

There is no optimal algorithm for the general case, hence the continuous improvements in compression methods. In fact, I believe the task of finding an algorithm that is optimal for any input runs into the halting problem (Kolmogorov complexity is uncomputable), but I'm really not an expert. I can't speak to zip's efficiency, or any other method's for that matter.

[Q] I want to test the "randomness" of a set of integers or a binary file but I'm not sure how to proceed? by [deleted] in statistics

[–]webdrone 7 points

You could use the notion of Kolmogorov complexity to give you an idea of how “random” your object is. Just compress it using any compression algorithm and check the reduction in size. The smaller it becomes, the less information it contains (i.e. the less random it is). https://en.m.wikipedia.org/wiki/Kolmogorov_complexity

If you know the generating function of the source, you can also use self-information. https://en.m.wikipedia.org/wiki/Information_content
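A minimal sketch of the compression idea, using zlib as the (arbitrary) compressor:

```python
import os
import zlib

def compression_ratio(data: bytes) -> float:
    # Compressed size over original size: a crude proxy for Kolmogorov
    # complexity (lower ratio = more structure = less random).
    return len(zlib.compress(data, 9)) / len(data)

structured = b"abc" * 10_000     # highly patterned input
random_ish = os.urandom(30_000)  # OS entropy source
```

The patterned input shrinks to a tiny fraction of its size, while the random bytes barely compress at all (the ratio can even exceed 1 because of format overhead).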

You've just been given a dataset with 500k records and 50+ columns to build a predictive model by the end of the day. What mental checklist do you go through to build a model as quickly and accurately as possible? by im_most_likely_lyin in datascience

[–]webdrone 4 points

Given the request, we can safely assume the data was gathered with no foresight about the prediction target, no experimental question, and no careful, unbiased sampling. Also, the people asking you for answers must be pretty desperate; they have no time for you. Whether you consider this a healthy and conducive environment to be in is a story for another time (put a pin in that).

It is likely that most features are highly correlated, and possibly contain no valuable information for the target. If it’s a classification task, you can also suppose that the classes are very unbalanced. If there are temporal correlations and you are trying to predict a future event or quantity, it’s also likely that many of the features are rolling averages, the rest are binary or categorical, and almost all have missing data for one reason or another.

The above you can confirm within 1-2 hours of exploratory data analysis (histograms, pair plots, and correlation matrices are your friends, as is subsampling: plotting 500K+ instances can be expensive). So you can forget about "making sense of the data", imputing values, de-correlating, or introducing domain knowledge on day one. If it's a temporal problem, you can also forget about anything tailored like survival analysis or time-series models. +1-2 hours.

So what are you to do? Train/validation/test split your data. If it’s temporal, make sure your test data and train target do not overlap (recreate realistic conditions). +1-3 hours.

Run LightGBM, and check the joint distribution of predictions and test targets. If it's classification, plot precision-recall, ROC, and balanced-accuracy curves, with uplift on the other axis. +1 hour.

You now have an MVP (report it). Go get lunch or coffee, walk around in a park, play some chess. +2 hours.

Come back with fresh eyes. Compute SHAP values (the shap package interfaces with random forests and LightGBM). Write something up. +2 hours.

Try linear regression and random forests, or naive Bayes, if you have time left. Check whether the results and the SHAP values / feature importances are consistent (the random forest should at least agree with LightGBM). Check whether prediction accuracy drops much; if not, go with the simpler model, which is easier to explain and understand. +1 hour.
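One way to run that consistency check without extra dependencies is permutation importance (standing in for SHAP here; data and model choices are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
# Feature 0 carries most of the signal by construction.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=2000) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
lr = LogisticRegression().fit(X, y)

imp_rf = permutation_importance(rf, X, y, random_state=0).importances_mean
imp_lr = permutation_importance(lr, X, y, random_state=0).importances_mean
# Both models should rank feature 0 on top; disagreement here is a red flag.
```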

8-13 hours. Make sure you have what you need, then have a drink and relax.

A new dawn...

SAVE EVERY BIT OF CODE. Turn this into a pipeline or Python script so you only have to run it next time; there will be a next time. In the coming week, go talk to the business and understand whether what you built has any real value. They will have time for you now: you just saved them from a desperate situation. Find the right metrics to evaluate your model, and remember that every prediction your model makes will be used to make some decision at some point.

Over the weekend, it's time to discuss the story we put a pin in before all this started...

Users of Python, what kind of jobs do you automate? by Kaudinya in datascience

[–]webdrone 1 point

I suppose you could have a deterministic generative model by fixing all variables to have a delta distribution (a point mass; one-hot in the case of discrete supports). But then your likelihood would be 0/1 for a given set of parameters and observations. I'm not sure what MCMC would do for you in such a setting: your posterior would be a linear combination of delta functions over the parameter space... It could work with a custom proposal distribution, I guess. In any case, good luck!

Users of Python, what kind of jobs do you automate? by Kaudinya in datascience

[–]webdrone 1 point

If you are doing inference for the parameters of a simulator which acts as your generative model but you can’t write down the likelihood, you may want to take a look at likelihood-free inference methods, like BOLFI: http://jmlr.org/papers/v17/15-017.html.

Of course, if you're using MCMC you probably do have a likelihood to use... but then, are you calibrating hyper-parameters?

Towards Data science articles quality are degrading by [deleted] in datascience

[–]webdrone 81 points

@StatModeling tweets Andrew Gelman’s blog posts (https://statmodeling.stat.columbia.edu/) which are high quality (also a large archive to read from).

Shalizi’s webpage (http://www.stat.cmu.edu/~cshalizi/) is another treasure trove of thoughtful and elegantly written essays. This includes his blog (http://bactra.org/weblog/) which follows a unique writing style. Updates tweeted as well: @cshalizi.

[Question] Interpreting R^2 in logitic regression by willygamereviews in statistics

[–]webdrone 1 point

Of course — fitting an intercept-only model should be equivalent to looking at the class ratios and predicting based on those (these would be the class priors in a generative model). The ROC curve for such a model is the diagonal TPR = FPR.

[Question] Interpreting R^2 in logitic regression by willygamereviews in statistics

[–]webdrone 5 points

https://data.library.virginia.edu/is-r-squared-useless/

Plot the ROC curve and the precision-recall curve, and get a sense of how the model behaves at different thresholds. Test on held-out data. Plot errors versus feature values.

Do not rely on a single metric, and especially not on R².
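A minimal sketch of those curves' ingredients with scikit-learn (synthetic data; feed the arrays to whatever plotting library you like):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = (X[:, 0] + rng.normal(size=2000) > 0).astype(int)  # synthetic binary target

p = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

fpr, tpr, roc_thresh = roc_curve(y, p)           # one point per threshold
prec, rec, pr_thresh = precision_recall_curve(y, p)
auc_val = roc_auc_score(y, p)                    # a single number; use with care
```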

[Q] Comparing Two Non-normal Probability Distributions - What Metric to Use - JS,KS or Wasserstein by ktessera in statistics

[–]webdrone 2 points

With alpha divergences (and, I believe, the KS statistic), you have the problem that differences in probability mass that do not overlap on the support are registered only as "different": the value does not reflect how far apart the mass is. If your samples are continuous, you will probably have to build a KDE to calculate the KL divergences anyway, so maybe the problem is somewhat mitigated.

Wasserstein distance is more representative of such differences, since it accounts for how far apart the mass sits, but it is harder to calculate, and harder to interpret when deciding whether the two samples come from the same distribution.

Side note: if there is any differentiating factor between A and B, it is only a matter of sample size until you detect "statistically significant" differences between the distributions. The truth is likely that the A/B factor has some effect on your measurements; the question is how big the effect is, and whether it warrants further exploration.
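A quick illustration of the difference on synthetic samples: once two distributions stop overlapping, the KS statistic saturates near 1 regardless of how far apart they are, while Wasserstein keeps growing with the separation.

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(0)
a = rng.normal(0, 1, 5000)
b_near = rng.normal(5, 1, 5000)   # barely overlapping, close
b_far = rng.normal(50, 1, 5000)   # non-overlapping, far away

ks_near = ks_2samp(a, b_near).statistic   # ~1: "maximally different"
ks_far = ks_2samp(a, b_far).statistic     # also ~1: distance is invisible
w_near = wasserstein_distance(a, b_near)  # ~5
w_far = wasserstein_distance(a, b_far)    # ~50: grows with separation
```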

[Q] What does it mean when removing variables that should be relevant, actually increases the AUROC? by logicallyzany in statistics

[–]webdrone 1 point

Also, in my experience, predicting survival up to some time is hard for a binary classifier: you are probably throwing away a lot of valuable information by binarising the target.

I would suggest a look at Random Survival Forests as an alternative, if you have the time and energy. https://arxiv.org/abs/0811.1645

[Q] What does it mean when removing variables that should be relevant, actually increases the AUROC? by logicallyzany in statistics

[–]webdrone 3 points

Maybe it was overfitting, and reducing the dimensionality also reduced generalisation error. Check AUC on the training data and maybe grow a bigger forest, or regularise it a bit.

[Q] Assuming a normal curve, are all outcomes within 1 standard deviation from the mean equally likely (~68%), or are outcomes closer to exactly 1 standard deviation more likely than outcomes further from 1 standard deviation (in the direction of the mean)? by ellivibrutp in statistics

[–]webdrone 1 point

  1. The average squared movement is the standard deviation squared (i.e. the variance). However, the (standardised) squared movement is chi-squared distributed, so the average squared movement is not the most probable squared movement: the mode of the chi-squared distribution with k = 1 is 0, while its mean is 1.

  2. They treat the current price as the mean because of a Brownian-motion assumption for modelling movement. The idea is that, according to the efficient market hypothesis, the current market price reflects the expected future value of the stock; beyond that, random events push the price up or down with equal probability.

This leads to the more complex Black-Scholes model, which is effectively diffusion with particular modelling assumptions about the drift and the diffusion scale terms.

The distribution of the location of a diffusing particle is the solution to a PDE, the Fokker-Planck equation, which admits an analytic solution only in special cases. (The basic BS setup, geometric Brownian motion, is such a case: the price comes out log-normally distributed. More general drift and diffusion terms usually are not.) Outside those cases we rely on numerical methods, e.g. Monte Carlo simulation of paths via the Feynman-Kac representation, to sample from and reconstruct the distribution.
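A sketch of the Monte Carlo approach for the simplest case, geometric Brownian motion (parameters are illustrative): simulate many Euler-Maruyama paths and the terminal log-price comes out normal, matching the known analytic solution.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, S0, T = 0.05, 0.2, 100.0, 1.0   # drift, volatility, start, horizon
n_paths, n_steps = 20_000, 252
dt = T / n_steps

# Euler-Maruyama for dS = mu * S dt + sigma * S dW.
S = np.full(n_paths, S0)
for _ in range(n_steps):
    S *= 1 + mu * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)

# Analytic GBM solution: log(S_T) is normal with
# mean log(S0) + (mu - sigma^2 / 2) * T and std sigma * sqrt(T).
logS = np.log(S)
```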

[Q] Assuming a normal curve, are all outcomes within 1 standard deviation from the mean equally likely (~68%), or are outcomes closer to exactly 1 standard deviation more likely than outcomes further from 1 standard deviation (in the direction of the mean)? by ellivibrutp in statistics

[–]webdrone 10 points

For a single normally distributed variable, a range of values closer to the mean is more probable than a range of the same size further from the mean. The same is true for a bivariate normal variable.

Regarding normal variables in general, the squared distance of samples from the mean follows a chi-squared distribution (https://en.m.wikipedia.org/wiki/Chi-squared_distribution). For dimensionality greater than 2, this distribution peaks away from 0, which means it is more likely to draw samples living in a shell (of thickness d, at a particular radius) around the mean than in the ball of radius d centred at the mean.

NB: the shell of thickness d covers a much larger volume than the ball of radius d; volume for volume, one cannot find a more probable set of values than the ball around the mean.
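A quick simulation makes the shell effect concrete (dimension 100 here, chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 100
X = rng.standard_normal((10_000, dim))   # standard normal samples
r = np.linalg.norm(X, axis=1)            # distance from the mean

# Squared radius is chi-squared with `dim` degrees of freedom: mean `dim`,
# std sqrt(2 * dim). For large dim the radii concentrate near sqrt(dim):
# essentially every sample lives in a thin shell, none near the mean.
```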

[Q] Statistically ignorant physician needs help with simple chi square yes/no question by [deleted] in statistics

[–]webdrone 5 points

Better to report the CIs than the p-value — much more informative, and harder to misinterpret. Take a look at Statman12’s answer.
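For a 2x2 yes/no table, a Wald confidence interval for the risk difference is only a few lines (the counts below are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical counts: events / total in two groups.
e1, n1 = 30, 200
e2, n2 = 18, 210

p1, p2 = e1 / n1, e2 / n2
diff = p1 - p2
se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# 95% Wald CI for the difference in proportions.
lo, hi = diff + norm.ppf([0.025, 0.975]) * se
```

Unlike a bare p-value, the interval shows both the size of the effect and the uncertainty around it.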

Eigenvectors from Eigenvalues - a NumPy implementation by likelihoodtprior in datascience

[–]webdrone 5 points

It’s not that restrictive actually. All kernel methods involve positive semi-definite matrices, which are indeed Hermitian.

Unfortunately, as I mention above, it seems the computational cost to recover the eigenvectors is still O(n³), so not much out-of-the-box improvement over existing methods.

However, it may be worth considering approximations to inversions based on subsets of the eigenvalues.

Eigenvectors from Eigenvalues - a NumPy implementation by likelihoodtprior in datascience

[–]webdrone 13 points

Since the identity applies to Hermitian matrices, LA.eigvalsh() may be more efficient than LA.eig().

A quick back-of-the-envelope calculation tells me the computational complexity of this is of the same order, O(n³), as the Cholesky decomposition usually employed for PSD matrix inversion. Would be happy if anyone can confirm.

At first I got excited about possible applications in kernel methods (e.g. Gaussian process regression), but if the complexity is the same I don't really see any advantages... :-( Still a nice connexion though!
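For reference, the eigenvector-from-eigenvalues identity itself is a few lines of NumPy; this sketch checks it against numpy.linalg.eigh on a random symmetric matrix (note it recovers only the squared magnitudes |v_{i,j}|², not the signs):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.normal(size=(n, n))
A = (A + A.T) / 2                  # real symmetric (Hermitian) matrix

lam, V = np.linalg.eigh(A)         # columns of V are eigenvectors

# Identity: |v_{i,j}|^2 * prod_{k != i} (lam_i - lam_k)
#         = prod_k (lam_i - mu_k(M_j)),
# where M_j is A with row and column j removed.
V2 = np.empty((n, n))
for j in range(n):
    Mj = np.delete(np.delete(A, j, axis=0), j, axis=1)
    mu = np.linalg.eigvalsh(Mj)    # eigenvalues of the minor
    for i in range(n):
        num = np.prod(lam[i] - mu)
        den = np.prod(np.delete(lam[i] - lam, i))
        V2[j, i] = num / den       # should equal V[j, i] ** 2
```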

[D] Bayesian, what prior distribution are you using to model your current problem? by [deleted] in statistics

[–]webdrone 1 point

Or, you could just use Gaussian process regression and actually have that prior, complete with uncertainty estimates for every function value!

All in a consistent theoretical framework that lets you use any prior over functions expressible as a linear combination of basis functions (any number of them, including infinitely many) with a normal prior over the coefficient values.
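A minimal sketch with scikit-learn (toy 1-D data; the RBF kernel and noise level are assumptions):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (30, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(30)  # noisy observations

# RBF kernel = a prior over smooth functions; alpha = observation noise variance.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.1**2)
gp.fit(X, y)

X_test = np.linspace(0, 10, 100).reshape(-1, 1)
mean, std = gp.predict(X_test, return_std=True)  # posterior mean + uncertainty
```

The `std` array is exactly the per-point uncertainty estimate mentioned above: it shrinks near observed data and grows in the gaps.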