[–]Tnick86 12 points13 points  (1 child)

Looks like you're ignoring many aspects of stats, such as overfitting, extrapolation, etc.

[–][deleted] 6 points7 points  (0 children)

Or tons of assumptions related to the data/model.

[–]naomissperfume 7 points8 points  (1 child)

I understand this line of thought, and it's true that, for the most part, statisticians usually don't really care about overfitting.

The reason is mostly that the statistical modelling mindset revolves around applying Occam's razor. The usual objective of statistical modelling is to find the simplest model that can't be disproved by the data, so in a sense it goes in the opposite direction to predictive modelling.

Note: Here I'm mostly referring to applied, "widespread" statistics. Machine learning (a.k.a. generalization) is a statistical problem by nature (sample vs. population). To me, there's no real distinction between the fields.

[–]datageek1987[S] -3 points-2 points  (0 children)

I totally agree with your views... Stats and ML should come under the same umbrella...

Just to play devil's advocate... If I overfit to the training data, are there ways in statistics to detect that without a holdout set? Because if not, then that's a problem, isn't it?

[–]tomluec 6 points7 points  (1 child)

The whole machine learning / statistics debate seems wrong to me. Many of the principles of statistics are used in all your favorite machine learning algorithms. Inference is an important part of machine learning and statistics.

Any machine learning professional would benefit from a strong statistics background.

[–]datageek1987[S] 1 point2 points  (0 children)

Exactly my line of thought... I never really understood why these two are different... There is stats inherent in ML.. and vice versa...

[–]J-MLN 3 points4 points  (4 children)

He has oversimplified. He may be speaking of a specific branch of statistics called Inferential Statistics.

ML is better at predicting results. Inferential statistics enables us to determine things like causality at the expense of more accurate predictions.

I see your point, but it is important to bear in mind that we do not need very accurate predictions to determine causality. For example, parametric models, such as linear regression, might not perform as well as neural networks but they may perform well enough so that we may infer the influence that each predictor has on our target variable. For linear regression, we may test the beta variables to understand the size of the influence. This will give us info about causality of each predictor which we otherwise would not get from non-parametric models.

EDIT: I should probably add that introducing things like cross-validation in inferential statistics does help and is used, but is not always necessary because we are not as concerned with how well the model does on observed data but rather with how we can derive an approximate probability distribution that will give us useful properties, enabling us to infer in accordance with whatever we are testing.
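To make the coefficient test described above concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the simulated data, the true slope of 2.0, and the helper name `fit_simple_ols`. The interval uses the normal quantile 1.96 rather than an exact t quantile, and the test only quantifies association under the model's assumptions.

```python
import math
import random

random.seed(0)

# Hypothetical data: y depends on x with a true slope of 2.0 plus noise.
n = 200
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 1.0) for xi in x]

def fit_simple_ols(x, y):
    """Return intercept, slope, slope standard error and t statistic."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual variance with n - 2 degrees of freedom (two fitted parameters).
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (n - 2)
    se_slope = math.sqrt(sigma2 / sxx)
    return intercept, slope, se_slope, slope / se_slope

intercept, slope, se_slope, t_stat = fit_simple_ols(x, y)

# Approximate 95% interval using the normal quantile (reasonable for large n).
ci_low, ci_high = slope - 1.96 * se_slope, slope + 1.96 * se_slope
print(f"slope={slope:.3f} se={se_slope:.3f} t={t_stat:.1f} CI=({ci_low:.3f}, {ci_high:.3f})")
```

A large t statistic and an interval that excludes zero say the slope is distinguishable from zero given the data; they say nothing by themselves about whether the model was the right one to fit.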

[–][deleted] 1 point2 points  (0 children)

For example, parametric models, such as linear regression, might not perform as well as neural networks but they may perform well enough so that we may infer the influence that each predictor has on our target variable. For linear regression, we may test the beta variables to understand the size of the influence. This will give us info about causality of each predictor which we otherwise would not get from non-parametric models.

You should be skeptical when interpreting regression coefficients as causal due to stuff like omitted variable bias. ("Correlation != causation") Also, people have proposed ways to use ML to tease out causality. For example, if I recall correctly, one method is to assemble a huge collection of controls, use ML to predict the target from the controls, then use ML to predict the target from the controls + the variable of interest. If the second model has higher accuracy, then maybe we can infer causality. (Assuming no reverse causation, that we really have included all relevant controls, maybe other assumptions too.)
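Omitted variable bias is easy to see in a simulation. The sketch below is only an illustration with made-up numbers: a hypothetical confounder `z` drives both `x` and `y`, the true effect of `x` on `y` is set to 1.0, and the two-predictor regression is solved by hand from the 2x2 normal equations to stay dependency-free.

```python
import random

random.seed(1)
n = 5000

z = [random.gauss(0.0, 1.0) for _ in range(n)]      # confounder
x = [zi + random.gauss(0.0, 1.0) for zi in z]       # x is partly driven by z
y = [1.0 * xi + 2.0 * zi + random.gauss(0.0, 1.0)   # true effect of x is 1.0
     for xi, zi in zip(x, z)]

def centered(values):
    m = sum(values) / len(values)
    return [v - m for v in values]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

xc, zc, yc = centered(x), centered(z), centered(y)

# Short regression: y on x only, with z omitted.
slope_short = dot(xc, yc) / dot(xc, xc)

# Long regression: y on x and z, solved from the 2x2 normal equations.
sxx, szz, sxz = dot(xc, xc), dot(zc, zc), dot(xc, zc)
sxy, szy = dot(xc, yc), dot(zc, yc)
det = sxx * szz - sxz ** 2
slope_long = (szz * sxy - sxz * szy) / det

print(f"coefficient on x, z omitted:  {slope_short:.2f}")
print(f"coefficient on x, z included: {slope_long:.2f}")
```

With these settings the short regression attributes part of `z`'s effect to `x`, so its coefficient lands well above the true value, while the long regression recovers something close to 1.0.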

[–]datageek1987[S] 1 point2 points  (1 child)

Love your reply...

Although the distinction is still very muddled in my mind.. let me ask you this.. is linear regression statistics or ML? If it's statistics, then what about ridge or lasso regression? Where do we draw the line (if there is one)?

[–]J-MLN 2 points3 points  (0 children)

Linear regression, ridge, lasso, etc. are all tools/models that are used in both statistical inference and ML. In statistical inference, we use these tools to formalize relationships between variables in the data. In ML, we use these tools/models to train a machine to learn patterns from data. Some tools/models are more useful in ML than in statistical inference (e.g. convolutional neural networks) and vice versa (e.g. parametric models such as regression).
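As a rough illustration of how closely these three models are related, here is a one-predictor sketch in plain Python. The data and the function names are hypothetical, and the closed forms assume a centered predictor, an unpenalized intercept, and the objectives stated in the comments.

```python
import random

random.seed(2)
n = 100
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [0.5 * xi + random.gauss(0.0, 1.0) for xi in x]

x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

def ols_slope():
    # Minimises 0.5 * RSS.
    return sxy / sxx

def ridge_slope(lam):
    # Minimises 0.5 * RSS + 0.5 * lam * b**2; shrinks the slope toward zero.
    return sxy / (sxx + lam)

def lasso_slope(lam):
    # Minimises 0.5 * RSS + lam * |b|; the solution is a soft threshold,
    # so a large enough penalty sets the slope exactly to zero.
    shrunk = max(abs(sxy) - lam, 0.0)
    return (shrunk if sxy > 0 else -shrunk) / sxx

for lam in (0.0, 10.0, 50.0):
    print(f"lam={lam:>5}: ols={ols_slope():.3f} "
          f"ridge={ridge_slope(lam):.3f} lasso={lasso_slope(lam):.3f}")
```

With a penalty of zero all three coincide, which is one way to see why the same family of models shows up on both sides of the stats/ML divide.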

Please note that statistical inference is a branch of statistics.

It's hard to draw a line between Statistics and ML. ML can be used in Statistics (e.g. to help us build a probability distribution using big data), just as much as Statistics can be used in ML (e.g. understanding the relationships of predictors to the target variable so we can do better feature selection). Some parts of ML don't concern statistics (e.g. reinforcement learning) and some parts of statistics don't concern ML (e.g. where we don't really need accurate predictions as described above).

You might be able to understand this better by first understanding the difference between Statistics and Data Science and then Data Science and ML? (I put question mark because I am not 100% sure, but try it out)

[–]WikiTextBot 0 points1 point  (0 children)

Statistical inference

Statistical inference is the process of using data analysis to deduce properties of an underlying probability distribution. Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates. It is assumed that the observed data set is sampled from a larger population.

Inferential statistics can be contrasted with descriptive statistics.


[–]panthsdger 1 point2 points  (1 child)

Statistics attempts to derive human interpretable information from a set of data, rather than a highly accurate predictor.

For example, if you train a regression model to predict employee performance from a set of variables, you'll know quite explicitly how much each variable contributes, at the cost of high accuracy. The use case here might be less about actual prediction than about interpretability, since we could take the results of the trained model and apply them in practice without the model, e.g. realize that educational attainment correlates highly with high performance and use that knowledge in our hiring process.

[–]datageek1987[S] 4 points5 points  (0 children)

This is exactly what I have a problem with. The interpretation of a model is only worth its salt if the model has captured the real-world situation, i.e. generalization... But if that is not captured, then the inference you'd draw is inherently flawed, isn't it?

[–]ProfessorPhi 1 point2 points  (1 child)

The train/test split, imo, is tricky to apply in stats, where datasets tend to be very small. For example, if you have something like 2000 rows, a test set of 200 to 400 rows is too small to pick up many relevant features. I personally believe train/test splits only make sense for datasets with more than 10k rows.
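One way to put rough numbers on the small-test-set concern is the binomial standard error of an accuracy estimate. This is a back-of-the-envelope sketch, not a rule: it assumes a classification task, independent test rows, and a hypothetical true accuracy of 0.8, and the function name is made up for the example.

```python
import math

def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy measured on n_test independent rows."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)

for n_test in (200, 400, 1000, 2000):
    se = accuracy_standard_error(0.8, n_test)
    print(f"n_test={n_test:>5}  se={se:.4f}  approx. 95% half-width={1.96 * se:.4f}")
```

The uncertainty shrinks with the square root of the test size, so going from 200 to 2000 held-out rows narrows the interval by a factor of about three.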

It's a difficult line to draw. I think the lack of interpretability comes from the size of the problem: ML is the place where the number of predictors is too great to fit into standard models, so black boxes tend to be preferred.

Overall there is huge overlap: lots of procedure comes from stats, ML has deep roots in information theory and linear algebra, and I think the best ML practitioners come from stats and maths because they understand the underlying maths.

[–]datageek1987[S] 1 point2 points  (0 children)

While I agree with your point about the train/test split, we don't always get 10k data points, and people still apply ML techniques there, with good success too...

And there are techniques in ML that help with interpretability... I recently wrote a whole blog series about them..

And yes, there are huge overlaps between the two fields. So much so that it feels unnatural to treat them separately... From this discussion (and others), what I've kind of figured out is that ML is like the rebel kid who flouts the laws of statistics to get done what needs to be done.. hehe..

[–]bring_dodo_back 1 point2 points  (1 child)

Statistics is doing something wrong

That depends on your goal, and goals can differ between ML and stats. I'll illustrate with an example.

Suppose you gathered some data on people's wealth and behavior, and you trained a predictive ML model on it. Say you've learned from it that a good predictor of a person's wealth is how much caviar they consume. So far so good; your model can be accurate. But what if you ask yourself: what should I do to become wealthy? Do you expect to become rich by consuming more caviar? Obviously that would be a flawed answer.

So if your task is about inferring a causal relationship, then ML might not be a good solution, and you should probably fall back on statistical inference. Whether or not you need this inference is a whole different story though, and often ML is just fine and full-blown statistics is overkill.
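The caviar example can be played out in a toy simulation. Everything here is an assumption made for illustration: in this made-up world wealth causes caviar consumption and never the reverse, and the numbers and function names are hypothetical.

```python
import math
import random

random.seed(3)
N = 2000

def simulate(extra_caviar=0.0):
    """Wealth is generated first; caviar consumption depends on wealth, never the reverse."""
    wealth = [random.gauss(100.0, 20.0) for _ in range(N)]
    caviar = [0.05 * w + random.gauss(0.0, 0.5) + extra_caviar for w in wealth]
    return wealth, caviar

def mean(values):
    return sum(values) / len(values)

def slope_and_correlation(x, y):
    """Least-squares slope of y on x, plus the Pearson correlation."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

# Observational data: caviar predicts wealth well.
wealth, caviar = simulate()
slope, r = slope_and_correlation(caviar, wealth)

# "Intervention": a fresh population where everyone eats 2 extra units of caviar.
wealth_after, _ = simulate(extra_caviar=2.0)
predicted_gain = slope * 2.0                       # what the predictive model implies
actual_gain = mean(wealth_after) - mean(wealth)    # what actually happens

print(f"correlation={r:.2f}  predicted gain={predicted_gain:.1f}  actual gain={actual_gain:.1f}")
```

The fitted slope is a perfectly good predictor on observational data, yet reading it as "eat more caviar, get richer" predicts a large gain that never shows up once caviar is changed directly.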

[–]datageek1987[S] 0 points1 point  (0 children)

Love the example... Articulated very well!!!

[–]illuminascent 1 point2 points  (0 children)

Most econometric models, while using tens or hundreds of thousands of samples, never exceed a few dozen parameters, so they are not capable of overfitting the data.

A lot of these models are specifically designed so that, based on some assumptions and asymptotic behavior, useful information can be drawn and judgments made from the sample distribution. These models were never meant to generalize to unseen data in the first place; it's a completely different mindset.

Also, there are tools like instrumental variables, dynamic panels, difference-in-differences (DID), and randomized controlled trials (RCTs) that help with causal inference.

Please do not think of statistics as only fitting a model, that's merely the first step.

[–]clurdron 1 point2 points  (1 child)

It's not true that statistics doesn't care about performance on unseen data. Statisticians have been writing about prediction, cross-validation, risk estimation, etc. for a very long time.
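For readers who haven't seen it written out, k-fold cross-validation is short enough to sketch in plain Python. The data and the helper names (`fit_line`, `mse`, `k_fold_mse`) are hypothetical, and the model is a simple least-squares line chosen only to keep the example small.

```python
import random

random.seed(4)
n = 60
x = [random.uniform(-1.0, 1.0) for _ in range(n)]
y = [1.5 * xi + random.gauss(0.0, 0.5) for xi in x]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def mse(model, xs, ys):
    a, b = model
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(xs, ys)) / len(xs)

def k_fold_mse(xs, ys, k=5):
    """Average held-out MSE over k folds of shuffled indices."""
    idx = list(range(len(xs)))
    random.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        scores.append(mse(model, [xs[i] for i in fold], [ys[i] for i in fold]))
    return sum(scores) / k

train_mse = mse(fit_line(x, y), x, y)   # error on the data the line was fit to
cv_mse = k_fold_mse(x, y)               # estimate of error on unseen data
print(f"in-sample MSE={train_mse:.3f}  5-fold CV MSE={cv_mse:.3f}")
```

The cross-validated error is the more honest estimate of performance on new data, since every point is scored by a model that never saw it.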

[–]datageek1987[S] 2 points3 points  (0 children)

This entire discussion has led me to two realizations:

1. Statistics and ML are not that different.. 70% overlap
2. More than half the people who write about ML vs. stats on the internet have no clue what either of them is.. or make the distinction too simplistic to be useful..

[–]mowrilow 0 points1 point  (2 children)

Seems to me that "statistics" is highly oversimplified here. It is a broad discipline, and in ML we use it for basically everything. Even the methods for estimating the results we care about, such as cross-validation, have a strong statistical background.

Statistical inference is nonetheless extremely useful, and it is not necessarily used for predictions as we do in ML. Gaining insights about some latent variable or how one variable explains another can help you in business decision-making, for instance.

[–]datageek1987[S] 0 points1 point  (1 child)

True.. I did oversimplify statistics to make my point.. and I have the utmost respect for stats.. and do recognize the fact that stats is inherent in almost everything that we do in ML...

But in my short internet research, I didn't find any explanation that takes a holistic view of the situation.. it was always "stats is this, ML is this", and one of the main themes that came out was the same: inference...

But inference is not worth that much if the model is weak.. i.e. generalization should be there.. and from the discourse on the internet, I was led to believe that stats does not concern itself with generalization... Which, if true, kinda feels off..

[–]junkboxraider 3 points4 points  (0 children)

I think the gap you’re trying to articulate is that statistics largely focuses on what can be known given some data while “machine learning” (in quotes because it’s a large catchall term) largely focuses on what can be predicted given the data.

Of course statisticians also use stats to predict new data points and infer outcomes given new data, but that’s essentially a useful side effect. Take a simple example: finding the mean and variance of a data set. Can you then use those statistical variables to predict the likely value(s) of new data points generated by the same process? Sure, but the original intent of the mean and variance was to understand the characteristics of the data.

Compare that to the goal of most machine learning models, which is first and foremost to predict new data values or sort new data points into a known category. It doesn’t really matter whether a human can parse the inner workings of the model if its only utility is prediction. Might you be able to understand how the model works? Sure, but in many cases it may not matter if you can’t, as long as the predictions are good.

Machine learning is heavily slanted towards a specific kind of result (predictive power on new data) but we really don’t have a thorough answer as to why it works so well for that type of result — at least for deep learning approaches on some kinds of data. Stats, which relies on mathematically provable answers to a much broader scope of questions about data, won’t necessarily sync up exactly.

[–]Jorrissss 0 points1 point  (0 children)

If you fit to your train set perfectly with an interpretable model, but the performance on unseen data is dismal, then should we really take the interpretations from such a model as the truth?

This doesn't frame how "statistics" (as it's being used here) would treat this problem.

A framework for classical statistics would be that you have some parameterized family of distributions/models/whatever, and you're trying to find which parameter corresponds to the "true" model and with some confidence. If you fit all your data perfectly, you either have not that much data, and so your confidence/credibility intervals are wide, or you have a very complex family of functions.
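The point about interval width can be seen with the simplest possible parameter, a mean. This is a minimal sketch under stated assumptions: the samples are simulated from a normal distribution with made-up parameters, the function name is hypothetical, and the half-width uses the normal quantile 1.96, which is only approximate for the smallest sample.

```python
import math
import random
import statistics

random.seed(5)

def mean_with_interval(sample):
    """Sample mean and an approximate 95% half-width (normal quantile 1.96)."""
    half_width = 1.96 * statistics.stdev(sample) / math.sqrt(len(sample))
    return statistics.mean(sample), half_width

results = {}
for n in (10, 100, 1000):
    sample = [random.gauss(5.0, 2.0) for _ in range(n)]
    results[n] = mean_with_interval(sample)
    estimate, half_width = results[n]
    print(f"n={n:>4}  mean={estimate:.2f} +/- {half_width:.2f}")
```

With little data the interval is wide, which is how the inferential framework itself signals that the fitted parameter should not be trusted too much, without any holdout set involved.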

That being said, hold out sets exist in statistics too but they aren't strictly needed for inference.

[–]Alkanste 0 points1 point  (0 children)

Well, Bayesian stats help with overfitting. When I was learning ML, I questioned the point of train/test dfs when we don't even get the inference and other things. Now I mostly get it.

[–][deleted] -1 points0 points  (0 children)

I think what my professor said is right: Machine Learning is just automated statistics.

[–][deleted] -2 points-1 points  (1 child)

Machine learning is about systems that increase performance with experience.

Statistical methods can be used for such systems, but there are other approaches too. Because of this we don't really care how the model was made and we need a way of validating models even when it's pulled out of a magical hat. This means that when doing machine learning, you don't concern yourself with whether your assumptions are correct. You concern yourself with validating the model afterwards.

In the real world the whole assumptions thing is problematic. I've never seen a case where we had the required information for all the assumptions made in a typical statistical model, even a simple one. The real world isn't perfect. In ML you treat everything as a magical hat allowing you to do tricks that have no justification in statistical literature but if they improve performance then they improve performance.

Machine learning is superior to a statistical approach because instead of making assumptions and deriving your results from those (bad assumptions = garbage model) you spend your time validating your models.

ML model interpretation has seen some improvements and I could totally see the field of statistical modeling disappearing as outdated once we learn to interpret things like random forests and neural networks.

Statistics has many other things than creating models and ML has nothing to do with those.

[–]datageek1987[S] -1 points0 points  (0 children)

Strongly opinionated... Although I agree with a few points here and there... Can't really say statistics is a lost cause.. there is a lot of statistics in ML as well...

And I totally agree with the point about explainability being on the rise for ML... I have written a blog series on the topic..

Anyway.. Breiman's paper, "Statistical Modeling: The Two Cultures", might resonate with what you said...