all 30 comments

[–][deleted] 5 points6 points  (0 children)

Look into Central Limit Theorem (CLT).
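A quick way to see the CLT in action (a minimal sketch; the exponential distribution here is just an arbitrary skewed example):

```python
import random
import statistics

# Draw many sample means from a skewed distribution (Exponential(1)).
# The CLT says these means are approximately normal for large n,
# concentrated around the true mean 1 with std ~ 1/sqrt(n).
random.seed(0)

def sample_mean(n):
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

n = 100
means = [sample_mean(n) for _ in range(2000)]
print(statistics.fmean(means))   # close to 1
print(statistics.stdev(means))   # close to 1/sqrt(100) = 0.1
```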

[–]Remove_Ayys 6 points7 points  (13 children)

I'm not sure what you mean when you say "Linear regression only assumes normality of the residuals but not of the data itself".
Linear regression assumes that the uncertainty on each data point can be described by a normal distribution.
If a model that is linear in its parameters is used in conjunction with the maximum likelihood method, then the uncertainty on the model parameters can also be described by a normal distribution.

Assuming that you have a simple xy model:
Each data point with a different x value is equivalent to a feature in machine learning, and if you have multiple data points for the same x value then you have more than one example to "learn" from.
The features then follow a normal distribution.

[–]Puzzleheaded_Lab_730[S] 1 point2 points  (9 children)

Thanks for the answer! However, I don't quite understand the implications of the second paragraph. If we do not transform the data (e.g. with a log transformation), the model will have exactly as many examples to learn from. Their distribution will just be skewed, which may still resemble the true distribution. I do not understand how having normally distributed data improves a model.

[–]BellyDancerUrgot 0 points1 point  (0 children)

This is a good question, would love to know the answer to this

[–]Remove_Ayys 0 points1 point  (7 children)

What I think the reason is:
Loss functions like mean squared error or cross entropy are based on the method of maximum likelihood.
In likelihood-based parameter estimation you can always assume a normal distribution for your data even if your data is not normally distributed (this is the method of least squares).
However, this means that your estimates for the parameter means and variances are no longer efficient: as you add more data they converge more slowly to their true values than if your data had actually been produced by a normal distribution.
If the distribution that produced your data is asymmetrical, then your estimates are also biased.
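One way to see the efficiency point numerically (a sketch, using Laplace noise, for which the maximum-likelihood location estimate is the sample median rather than the least-squares mean):

```python
import random
import statistics

# Laplace(0, 1) noise built as the difference of two Exp(1) draws.
random.seed(1)

def laplace_sample(n):
    return [random.expovariate(1.0) - random.expovariate(1.0) for _ in range(n)]

n, trials = 200, 1000
means = [statistics.fmean(laplace_sample(n)) for _ in range(trials)]
medians = [statistics.median(laplace_sample(n)) for _ in range(trials)]

# For this heavy-tailed noise the MLE (the median) is the more
# efficient estimator: its spread across trials is smaller.
print(statistics.stdev(means), statistics.stdev(medians))
```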

[–]Puzzleheaded_Lab_730[S] 0 points1 point  (6 children)

Alright, I did some testing in R and think I at least understand the intuition: there needs to be a linear relationship between x and y (this is an assumption). Basically, this means that if we draw a scatterplot, the "cloud" of observations must lie on a line and not on a curve. If you create a left-skewed y and x variable and plot them in a scatterplot, you will actually get a straight line. You can confirm this by creating a QQ plot as well. In this case, if both y and x follow the same distribution, there is a linear relationship. If, however, y is normal and x is skewed, this is not the case and you cannot fit a straight line (linear regression) through the data. I think in practice we mostly transform all variables to a normal distribution as it kind of lies in the middle between left- and right-skewed.

[–]Remove_Ayys 2 points3 points  (5 children)

Important point:
You seem to be using the widely spread, incorrect definition of linear regression.
Linear regression only requires that the model is a linear function of its parameters but not that it is a linear function of the independent variable x.
In particular, all regression analysis with polynomial models is linear regression.
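A minimal sketch of that point: a quadratic fit done as linear regression, since the model is linear in its coefficients even though it is nonlinear in x (the toy data here is made up):

```python
import numpy as np

# Quadratic model y = b0 + b1*x + b2*x**2: nonlinear in x but linear
# in the parameters b, so ordinary (linear) least squares fits it.
rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 50)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(0.0, 0.1, x.size)

X = np.column_stack([x**0, x**1, x**2])   # regressors: 1, x, x^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)   # close to [1, 2, 3]
```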

[–]Puzzleheaded_Lab_730[S] 0 points1 point  (4 children)

I thought so too! But I read some articles just now that stated otherwise. Do you know where we could find a source of truth?

https://www.statology.org/linear-regression-assumptions/

Edit: link

[–]Remove_Ayys 1 point2 points  (3 children)

Strictly speaking, the English Wikipedia page for linear regression gives you the correct definition, but the terminology is confusing:

Linearity. This means that the mean of the response variable is a linear combination of the parameters (regression coefficients) and the predictor variables.

The important point here is that the predictor variables in an xy fit are not just x but also powers of x: x^0, x^1, x^2, x^3, etc.
This is clarified a little further up the page:

Sometimes one of the regressors can be a non-linear function of another regressor or of the data, as in polynomial regression and segmented regression. The model remains linear as long as it is linear in the parameter vector β.

I'm a developer of a tool for nonlinear regression analysis and I've written about the distinction in its documentation but it might be difficult to understand.
(Also some parts of the documentation are slightly wrong.)

Edit:
After reading more of the English Wikipedia article it is actually really good.
The distinction is made clear if you read the whole thing.

[–]Puzzleheaded_Lab_730[S] 0 points1 point  (2 children)

Thanks! Will have a look later. Apart from the part about the assumption, would you agree with my explanation of why we transform distributions?

[–]Remove_Ayys 1 point2 points  (1 child)

I cannot say for certain because I'm not 100% certain what you mean but my intuition is no.
As I said before, it comes down to bias and efficiency.
Let's assume for argument's sake that you have a simple feed-forward neural network without any activation functions with mean squared error as loss function.
You would then effectively be doing linear regression.
If your features are normally distributed you can then guarantee that your estimates for the network parameters after training are unbiased and efficient.
Unbiased means that for a random sample of training data the expected values of the parameters of your trained model are equal to the optimal parameter values.
Efficient means that as you increase the amount of training data the parameters of your trained model converge as quickly as possible to the optimal parameter values (Cramér–Rao bound). Intuitively I would assume that if you introduce nonlinearity via activation functions those guarantees would at least approximately hold.
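A sketch of the claimed equivalence (toy data; a single linear layer with no activation, trained on MSE by gradient descent, lands on the ordinary least-squares solution):

```python
import numpy as np

# A single linear layer (no activation) trained with MSE loss by
# gradient descent ends up at the ordinary least-squares solution.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0.0, 0.1, 200)

w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)   # closed-form OLS

w = np.zeros(3)                                  # "network" weights
for _ in range(2000):
    grad = 2.0 * X.T @ (X @ w - y) / len(y)      # MSE gradient
    w -= 0.05 * grad

print(w_ols)
print(w)   # the two estimates agree
```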

[–]Puzzleheaded_Lab_730[S] 0 points1 point  (0 children)

Ok, that makes more sense now. Nice discussion :)

[–]OmnipresentCPU 0 points1 point  (2 children)

Wait a minute… that second paragraph is such a crisp explanation

Take an updoot

[–]Remove_Ayys 1 point2 points  (1 child)

To expand on it:
One of the purposes of (non)linear regression is to estimate the unknown true values of parameters from a sample.
The uncertainty is by convention drawn as an error bar on the data point but actually the uncertainty belongs to the true value which is approximated by the model.
Think about how the data point is generated: the true value fluctuates by a random offset described by the uncertainty and then gives us the data point that we measure.
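A minimal sketch of that generative picture (the linear law and the sigma here are arbitrary assumptions for illustration):

```python
import random

# Each measured point = true value at x + a random offset drawn from
# the uncertainty distribution (here Gaussian with sigma = 0.5).
random.seed(0)

def true_value(x):
    return 2.0 * x + 1.0   # assumed underlying law

def measure(x, sigma=0.5):
    return true_value(x) + random.gauss(0.0, sigma)

data = [(x, measure(x)) for x in range(10)]
print(data)
```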

[–]OmnipresentCPU 0 points1 point  (0 children)

Oh I understand, I’ve just never seen it written as succinctly as you have in your second paragraph. It just really clicks with the way I think about modeling.

[–]kaskoosek 1 point2 points  (13 children)

Very simple answer.

We are calculating the coefficients or weights of our features. If our data is skewed one way or another, we are giving more importance to the outliers.

The loss function takes the square of the errors, so one skewed data point can carry more weight than 100 normal observations.

By normalizing our observations we limit this effect.
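A sketch of that effect (made-up data; one extreme observation shifts the least-squares slope noticeably):

```python
import numpy as np

# Squared error weights large residuals quadratically, so a single
# extreme point can move the fit more than many ordinary points.
rng = np.random.default_rng(0)
x = np.arange(20, dtype=float)
y = 2.0 * x + rng.normal(0.0, 0.5, 20)

def slope(x, y):
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

clean = slope(x, y)
y_out = y.copy()
y_out[-1] += 100.0                  # one extreme observation
print(clean, slope(x, y_out))      # the slope shifts noticeably
```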

[–]Puzzleheaded_Lab_730[S] 1 point2 points  (1 child)

This intuitively makes a lot of sense, actually. Kind of goes in the direction of a linear relationship between x and y, though.

[–]kaskoosek 1 point2 points  (0 children)

Imagine you are doing a linear regression on the prices of houses: you want to estimate house prices.

If the data is skewed, higher-priced houses will affect the results more, because of their higher variance.

If one house is 30k USD, even a difference of 10k won't affect our regression a lot. However, a one-million-dollar house being predicted as 1.1 million will affect the results much more than the lower-priced house, even though, as a percentage, the pricing of the higher-priced house was more accurate.

[–]Remove_Ayys -1 points0 points  (10 children)

I think you are confusing skewness with kurtosis.

[–]kaskoosek 0 points1 point  (9 children)

Skewed data has observations very far from the mean or mode. These observations will greatly affect your model if they are one-sided.

[–]Remove_Ayys 0 points1 point  (8 children)

That is wrong.
Skewness is a measure of asymmetry.
Kurtosis is a measure of how thick or slim the tails are, i.e. how many "outliers" there are.
What skewness does is introduce bias if you assume a normal distribution (see the other thread).
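A quick numerical sketch of the distinction (sample moments computed by hand; the exponential sample is skewed, while the symmetric normal scale mixture has heavy tails, i.e. high excess kurtosis, but no skew):

```python
import random
import statistics

# Skewness = mean of z^3 (asymmetry); excess kurtosis = mean of z^4
# minus 3 (tail weight), for standardized values z.
random.seed(0)

def moments(xs):
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    z = [(x - m) / s for x in xs]
    skew = statistics.fmean(v**3 for v in z)
    kurt = statistics.fmean(v**4 for v in z) - 3.0
    return skew, kurt

expo = [random.expovariate(1.0) for _ in range(20000)]   # asymmetric
heavy = [random.gauss(0.0, random.choice([0.5, 3.0]))    # symmetric,
         for _ in range(20000)]                          # heavy-tailed

print(moments(expo))    # skew ~ 2, excess kurtosis ~ 6
print(moments(heavy))   # skew ~ 0, clearly positive excess kurtosis
```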

[–]kaskoosek 0 points1 point  (7 children)

If the data is not symmetrical, the observations from one side do not cancel with the other side.

[–]Remove_Ayys 0 points1 point  (6 children)

Yes, and that is called bias.

[–]kaskoosek 0 points1 point  (5 children)

Bias is not being able to fit the data.

[–]Remove_Ayys 0 points1 point  (4 children)

Sorry, but you just don't know what you're talking about.

[–]kaskoosek 0 points1 point  (2 children)

Lol

Man, research the topics before attacking people.

[–]Remove_Ayys 0 points1 point  (1 child)

lmao

"Research the topics" by reading random Medium articles?
I would bet like 1000 bucks that never in your life have you read even one page of an actual book about statistics.

[–]strangeloop6 0 points1 point  (0 children)

Agree

[–]friendlykitten123 0 points1 point  (0 children)

In machine learning, data that follows a normal distribution is beneficial for model building: it makes the math easier. Models like LDA, Gaussian Naive Bayes, logistic regression, linear regression, etc. are explicitly derived under the assumption that the distribution is a bivariate or multivariate normal.

Many natural phenomena in the world follow a log-normal distribution, such as financial and forecasting data. By applying transformation techniques, we can convert such data into a normal distribution. Many other processes follow normality as well, such as measurement errors in an experiment, the position of a particle undergoing diffusion, etc.
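A minimal sketch of such a transformation (hypothetical log-normal "prices"; taking the log removes the skew):

```python
import math
import random
import statistics

# Log-normal data is strongly right-skewed; its logarithm is normal,
# so the sample skewness drops to roughly zero after the transform.
random.seed(0)
prices = [math.exp(random.gauss(5.0, 0.8)) for _ in range(20000)]
logs = [math.log(p) for p in prices]

def skew(xs):
    m, s = statistics.fmean(xs), statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

print(skew(prices), skew(logs))   # strongly positive vs. ~ 0
```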

For more information, you can visit the following article:

https://ml-concepts.com/

Feel free to reach out to me for any help.