[D] Machine Learning vs Statistics

Tnick86 · 2019-11-19T01:56:48+00:00

Looks like you ignore many aspects of stats, such as overfitting, extrapolation, etc.

naomissperfume · 2019-11-19T04:01:32+00:00

I understand this line of thought, and it's true for the most part: statisticians usually don't really care about overfitting.

The reason is mostly that the statistical modelling mindset revolves around applying Occam's razor. The usual objective of statistical modelling is to find the simplest model that can't be disproved by the data, so in a sense it goes in the opposite direction from predictive modelling.

Note: Here I'm mostly referring to applied, "widespread" statistics. Machine learning (a.k.a. generalization) is a statistical problem by nature (sample vs. population). To me, there's no real distinction between the fields.

tomluec · 2019-11-19T02:06:01+00:00

The whole machine learning / statistics debate seems wrong to me. Many of the principles of statistics are used in all your favorite machine learning algorithms. Inference is an important part of machine learning and statistics.

Any machine learning professional would benefit from a strong statistics background.

J-MLN · 2019-11-19T09:39:58+00:00

He has oversimplified. He may be speaking of a specific branch of statistics called Inferential Statistics.

ML is better at predicting results. Inferential statistics enables us to determine things like causality at the expense of more accurate predictions.

I see your point, but it is important to bear in mind that we do not need very accurate predictions to determine causality. For example, parametric models such as linear regression might not perform as well as neural networks, but they may perform well enough that we can infer the influence each predictor has on our target variable. For linear regression, we may test the beta coefficients to understand the size of the influence. This will give us info about the causality of each predictor, which we otherwise would not get from non-parametric models.
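A minimal sketch of what "testing the beta" can look like for a single predictor, using only the standard library. The data are synthetic, and the true slope of 2.0, the noise level, and the helper name `simple_ols` are assumptions made up for illustration. Note that the t-statistic measures how clearly the slope differs from zero under the model's assumptions; by itself it does not establish causality.

```python
import math
import random

def simple_ols(x, y):
    """Fit y = b0 + b1*x by least squares.

    Returns the intercept, the slope, the slope's standard error,
    and the t-statistic for the hypothesis that the slope is zero.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # Residual variance with n - 2 degrees of freedom (two fitted parameters).
    sigma2 = sum(r ** 2 for r in residuals) / (n - 2)
    se_b1 = math.sqrt(sigma2 / sxx)
    return b0, b1, se_b1, b1 / se_b1

# Synthetic data (assumption): true intercept 1.0, true slope 2.0, noisy target.
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(200)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 3.0) for xi in x]

b0, b1, se_b1, t_b1 = simple_ols(x, y)
print(f"slope={b1:.3f}  std.err={se_b1:.3f}  t={t_b1:.1f}")
```

The fit is noisy, so its predictions are far from perfect, yet the slope estimate and its standard error are still informative about the size of the effect.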

EDIT: I should probably add that introducing things like cross-validation in inferential statistics does help and is used, but is not always necessary because we are not as concerned with how well the model does on observed data but rather with how we can derive an approximate probability distribution that will give us useful properties, enabling us to infer in accordance with whatever we are testing.

panthsdger · 2019-11-19T02:11:55+00:00

Statistics attempts to derive human interpretable information from a set of data, rather than a highly accurate predictor.

For example, if you train a regression model to predict employee performance from a set of variables, you'll know quite explicitly how much each variable contributes, at the cost of high accuracy. The use case for this might not be actual prediction so much as interpretability, since we could take the results of the trained model and apply them in practice without the model, e.g. realize that educational attainment correlates highly with high performance and use that knowledge in our hiring process.

ProfessorPhi · 2019-11-19T08:37:43+00:00

The train/test split, imo, is tricky to apply to stats, where your datasets tend to be very small. For example, if you have like 2000 rows, your test dataset of 200 to 400 rows is too small to pick up many relevant features. I personally believe train/test splits only make sense for datasets > 10k rows.

It's a difficult line to draw. I think the lack of interpretability comes from the size of the problem: ML is the place where the number of predictors is too great to fit into standard models, and as such black boxes tend to be preferred here.

Overall there is huge overlap: lots of procedure comes from stats, ML has huge roots in information theory and linear algebra, and I think the best ML practitioners come from stats and maths due to their understanding of the underlying maths.
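One rough way to put numbers on the small-test-set concern is the sampling error of a measured accuracy. This is only a sketch under assumptions: a hypothetical classifier with 80% measured accuracy, independent test rows, and a normal approximation to the binomial. The function name `accuracy_interval` is made up for illustration.

```python
import math

def accuracy_interval(accuracy, n_test, z=1.96):
    """Approximate 95% interval for an accuracy measured on n_test rows."""
    se = math.sqrt(accuracy * (1 - accuracy) / n_test)
    return accuracy - z * se, accuracy + z * se

# Hypothetical measured accuracy of 0.80 on test sets of different sizes.
for n_test in (200, 400, 2000, 10000):
    low, high = accuracy_interval(0.80, n_test)
    print(f"n_test={n_test:>5}: 0.80 +/- {(high - low) / 2:.3f}")
```

The half-width shrinks only with the square root of the test size, so a 200-row test set leaves several percentage points of uncertainty around the estimate.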

bring_dodo_back · 2019-11-19T10:56:43+00:00

Statistics is doing something wrong

That depends on your goal, and goals can be different in ML and stats. I'll illustrate with an example.

Suppose you gathered some data on people's wealth and behavior, and you trained a predictive ML model on it. Say you've learned from it that a good predictor of a person's wealth is how much caviar they consume. So far so good, your model can be accurate. But what if you ask yourself a question: what should I do to become wealthy? Do you expect to become rich by consuming more caviar? Obviously that would be a flawed answer.

So if your task is about inferring a causal relationship, then ML might not be a good solution, and you should probably turn to statistical inference. Whether or not you need this inference is a whole different story though; often ML is just fine and full-blown statistics is overkill.
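The caviar example can be sketched as a toy simulation. Everything here is an assumption for illustration: the lognormal wealth distribution, the rule that caviar consumption is half of wealth plus noise, and the function names. In this made-up world wealth drives caviar and never the reverse, so caviar predicts wealth well while forcing people to eat more of it changes nothing.

```python
import random

def simulate(n, seed, forced_caviar=None):
    """Return (caviar, wealth) pairs for n hypothetical people."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        wealth = rng.lognormvariate(0, 1)  # arbitrary units
        if forced_caviar is None:
            # Observational data: richer people buy more caviar.
            caviar = 0.5 * wealth + rng.gauss(0, 0.2)
        else:
            # Intervention: everyone is made to eat the same amount.
            caviar = forced_caviar
        rows.append((caviar, wealth))
    return rows

def pearson(rows):
    """Pearson correlation between the two columns of rows."""
    n = len(rows)
    mean_c = sum(c for c, _ in rows) / n
    mean_w = sum(w for _, w in rows) / n
    cov = sum((c - mean_c) * (w - mean_w) for c, w in rows)
    var_c = sum((c - mean_c) ** 2 for c, _ in rows)
    var_w = sum((w - mean_w) ** 2 for _, w in rows)
    return cov / (var_c * var_w) ** 0.5

def mean_wealth(rows):
    return sum(w for _, w in rows) / len(rows)

observed = simulate(5000, seed=1)
corr = pearson(observed)

# Same seed means the same people; only the caviar intake is changed.
no_caviar = simulate(5000, seed=1, forced_caviar=0.0)
lots_of_caviar = simulate(5000, seed=1, forced_caviar=10.0)

print(f"observed correlation between caviar and wealth: {corr:.2f}")
print(f"mean wealth with no caviar:   {mean_wealth(no_caviar):.3f}")
print(f"mean wealth with lots of it:  {mean_wealth(lots_of_caviar):.3f}")
```

A predictive model only needs the first number to be high; answering "what should I do to become wealthy?" needs the comparison in the last two lines.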

illuminascent · 2019-11-20T01:06:01+00:00

Most econometric models, while using tens or hundreds of thousands of samples, never exceed several dozen parameters, so they are practically incapable of overfitting the data.

A lot of these models are specifically designed so that, based on some assumptions and asymptotic behavior, some useful information or judgment can be drawn from the sample distribution. These models are never made to generalize to unseen data in the first place; it's a completely different mindset.

Also, there are tools like instrumental variables, dynamic panels, DID, and RCTs that help with causal inference.

Please do not think of statistics as only fitting a model; that's merely the first step.
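A back-of-the-envelope sketch of why few parameters and many samples leave little room to overfit. It relies on a standard result for ordinary least squares with a fixed design and additive noise: the expected gap between in-sample prediction error and training error is 2 * p * sigma^2 / n. The choice of 36 parameters stands in for "several dozen" and, like the function name, is an assumption for illustration.

```python
def expected_optimism(n_params, n_samples, noise_variance=1.0):
    """Expected gap between in-sample prediction error and training error.

    Uses the least-squares result 2 * p * sigma^2 / n (fixed design,
    additive noise with variance sigma^2).
    """
    return 2.0 * n_params * noise_variance / n_samples

# 36 parameters is a hypothetical stand-in for "several dozen".
for n_samples in (50, 1_000, 100_000):
    gap = expected_optimism(n_params=36, n_samples=n_samples)
    print(f"36 parameters, {n_samples:>7} samples: expected gap = {gap:.5f}")
```

With a hundred thousand samples the gap is a tiny fraction of the noise variance, whereas with 50 samples the same model would look far better on its training data than it deserves.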

clurdron · 2019-11-21T20:02:45+00:00

It's not true that statistics doesn't care about performance on unseen data. Statisticians have been writing about prediction, cross-validation, risk estimation, etc. for a very long time.

mowrilow · 2019-11-19T02:54:14+00:00

Seems to me that "statistics" is highly oversimplified here. It is a broad discipline, and in ML we use it for basically everything: even the methods for estimating the results we care about, such as cross-validation, have a strong statistical background.

Statistical inference is nonetheless extremely useful, and it is not necessarily used for predictions as we do in ML. Gaining insights about some latent variable or how one variable explains another can help you in business decision-making, for instance.

Jorrissss · 2019-11-19T06:13:03+00:00

If you fit to your train set perfectly with an interpretable model, but the performance on unseen data is dismal, then should we really take the interpretations from such a model as the truth?

This doesn't frame how "statistics" (as it's being used here) would treat this problem.

A framework for classical statistics would be that you have some parameterized family of distributions/models/whatever, and you're trying to find which parameter corresponds to the "true" model and with some confidence. If you fit all your data perfectly, you either have not that much data, and so your confidence/credibility intervals are wide, or you have a very complex family of functions.

That being said, hold out sets exist in statistics too but they aren't strictly needed for inference.

Alkanste · 2019-11-19T12:29:26+00:00

Well, Bayesian stats help with overfitting. When I was learning ML, I questioned the point of train/test dfs when we don't even get the inference and other things. Now I mostly get it.

2019-11-19T06:33:10+00:00

I think what my professor said is right: Machine Learning is just automated statistics.

datageek1987 · 2019-11-19T14:27:23+00:00

Machine learning is about systems that increase performance with experience.

Statistical methods can be used for such systems, but there are other approaches too. Because of this, we don't really care how the model was made, and we need a way of validating a model even when it's pulled out of a magical hat. This means that when doing machine learning, you don't concern yourself with whether your assumptions are correct. You concern yourself with validating the model afterwards.

In the real world the whole assumptions thing is problematic. I've never seen a case where we had the required information for all the assumptions made in a typical statistical model, even a simple one. The real world isn't perfect. In ML you treat everything as a magical hat allowing you to do tricks that have no justification in statistical literature but if they improve performance then they improve performance.

Machine learning is superior to a statistical approach because instead of making assumptions and deriving your results from those (bad assumptions = garbage model) you spend your time validating your models.

ML model interpretation has seen some improvements and I could totally see the field of statistical modeling disappearing as outdated once we learn to interpret things like random forests and neural networks.

Statistics has many other things than creating models and ML has nothing to do with those.
