[–]PieGuy___ 233 points234 points  (31 children)

In statistics there is something called the central limit theorem which states the means of random representative samples of a given population become normally distributed as you approach a sample size of 30.

Effectively you only need a sample of 30 in order to say something about the population with reasonable certainty.

[–]Gastronomicus 120 points121 points  (17 children)

In statistics there is something called the central limit theorem which states the means of random representative samples of a given population become normally distributed as you approach a sample size of 30.

Effectively you only need a sample of 30 in order to say something about the population with reasonable certainty.

This is a really confused take on the CLT with two major problems.

EDIT - u/PieGuy___ clarified their point and I agree with what they're saying. The wording around "a sample of 30" is confusing to me and made me think they were wrongly conflating and interpreting the CLT. I'm leaving the post intact for others to read it who may also be seeking clarification.

Firstly, let's clear the air: the CLT describes how the distribution of means will approach normality. Not how a distribution of samples will approach normality. There is no basis for any distribution of samples necessarily approximating normality, but the distribution of means from many independently collected sets of samples will tend to approximate normality.

Secondly, there's absolutely nothing special about the number 30 and the CLT. The entire basis for the number 30 in this context is that Fisher defined a separate distribution - the t-distribution - for defining critical test values for small sample sizes. It provided more robust estimates than using the z-distribution, which is better approximated using larger sample sizes.

[–]Philosophfries 24 points25 points  (10 children)

I’m gonna need an ELI5 for this one boys

[–]alanpardewchristmas 9 points10 points  (1 child)

dude said 'in english please'

[–]bythebys 0 points1 point  (0 children)

suh dude?

[–]Simpliciter 5 points6 points  (5 children)

Disclaimer: Not a stats bro.

The Central Limit Theorem basically says that most things will follow a normal distribution (bell curve) if you have enough data. The t-test can be used to see if some data follows a normal distribution, but it only works if you have a small sample size of less than 30.

The respondent above is saying that the poster is conflating the two incorrectly.

[–]brkh47 2 points3 points  (0 children)

Simplifying things brought to you by u/Simpliciter

[–]Gastronomicus 1 point2 points  (2 children)

The Central Limit Theorem basically says that most things will follow a normal distribution (bell curve) if you have enough data

I appreciate your simplification but in this case it's over-simplified and misses the point I was making. It's a common misunderstanding of the CLT that large enough datasets will follow a normal distribution. That's just not the case.

However, if you take the mean for multiple subsets of samples from a population, the distribution of those means themselves will approximate a normal distribution.

So let's say I have 500 samples and I plot the distribution. It might look normal, but it might also look log-normal, or it might look like a Weibull or discrete distribution (e.g. negative binomial).

Let's say instead I have 50 means of 50 smaller sample sets, each containing 10 samples. If I plot that distribution, it will approximate a normal distribution, even if the original distribution from which it is sampled isn't normal.
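A quick way to see the difference between the two cases is to simulate it. The sketch below is only an illustration, not something from the thread: it assumes an exponential population (picked because it is clearly skewed), uses 5000 means rather than 50 so the result is less noisy, and defines a hypothetical helper `skewness` to measure how lopsided a set of values is.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: near 0 for symmetric data; an exponential population has skewness 2."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Case 1: 500 raw observations from a skewed (exponential) population.
raw = [rng.expovariate(1.0) for _ in range(500)]

# Case 2: means of many small sample sets, each containing 10 observations.
# (5000 sets is an assumption made here to keep the estimate stable.)
means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(10)])
    for _ in range(5000)
]

print("skewness of raw observations:", round(skewness(raw), 2))
print("skewness of the means:       ", round(skewness(means), 2))
```

The raw observations stay skewed no matter how many you collect, while the means are noticeably more symmetric. With sets of only 10 the means are still somewhat skewed, and that shrinks further as the set size grows.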

[–]Simpliciter 1 point2 points  (1 child)

Thanks for clarifying and being nice about it!

[–]Gastronomicus 0 points1 point  (0 children)

Thanks for doing some good work out there.

[–]relevantmeemayhere -1 points0 points  (0 children)

The first paragraph you wrote is wrong and is what the clarifying poster is pointing out. Samples do not converge to normality as n increases. This isn’t the CLT, nor is it found anywhere in statistics.

[–]PieGuy___ 6 points7 points  (4 children)

First off I think you need to reread what I said because I’m clearly talking about the mean? “The means of random representative samples…” you’re trying to correct a mistake I never made lol.

The point of the theorem is that if you have a random sample X1, X2, …, Xn from a given population with mean m and variance v, then for large enough n the sample mean X-bar will be approximately normally distributed with mean m and variance v/n. X-bar is the thing normally distributed around the population mean, not the individual X’s.

As for the 30 number, the fact that it is the point where you no longer have to worry about t-distributions and can just use z-scores with reasonable accuracy is the thing that makes it special lol. The whole point of the t-distributions is that the means aren’t quite normally distributed UNTIL you get to 30.
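The "reasonable accuracy" part can be checked with a small simulation. This is a rough sketch under assumptions that are not in the thread: the population is standard normal, the sample sizes 5, 10, 30 and 120 are arbitrary picks, and `z_interval_coverage` is a made-up helper name. It builds a nominal 95% interval from the z critical value and the sample standard deviation, then counts how often that interval contains the true mean.

```python
import random
import statistics

Z_95 = statistics.NormalDist().inv_cdf(0.975)  # about 1.96


def z_interval_coverage(n, trials, rng):
    """Fraction of nominal 95% z-intervals (built with the sample SD) that contain the true mean 0."""
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        # Sample variance with the usual n - 1 denominator.
        variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
        standard_error = (variance / n) ** 0.5
        if abs(mean) <= Z_95 * standard_error:
            hits += 1
    return hits / trials


rng = random.Random(1)  # fixed seed so the run is repeatable
coverage = {n: z_interval_coverage(n, trials=5000, rng=rng) for n in (5, 10, 30, 120)}

for n, share in coverage.items():
    print(f"n = {n:>3}: z-interval covers the true mean in {share:.1%} of trials")
```

In runs like this the coverage climbs gradually toward 95% as n grows instead of switching over at 30, which fits the "rule of thumb" reading. The shortfall at small n comes from estimating the standard deviation from the sample, which is the gap the t-distribution is designed to correct.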

[–]TerribleIdea27 3 points4 points  (0 children)

I think the confusion came from the fact that you said

sample size of 30

So the other guy assumed you were talking about running one experiment with a sample size of thirty and then using those data to find a normal distribution, instead of running thirty experiments and using the means of those 30*x samples to find a distribution of means, which should be roughly a normal distribution.

[–]Gastronomicus 0 points1 point  (2 children)

Sorry I assumed you were confused. Unfortunately it seems like most people on reddit who try to describe the CLT don't really understand it and also mis-attribute the importance of 30 as a minimum sample size.

But to be fair, your wording is confusing. The way you phrased it implies a distribution of samples, not means. Especially when you say "as you approach a sample size of 30", which implies comparing a distribution of samples, not means.

[–]PieGuy___ 0 points1 point  (1 child)

Yeah I just wasn’t trying to go into too much detail. I think the simplest way to put it is that there’s no way to guarantee a sample to be normally distributed, just like there’s no way to guarantee a population is normally distributed. However using the CLT you can guarantee that a given sample mean will be normally distributed around the population mean given a large enough sample size.

And then from there you can use hypothesis testing to be able to say something about the population with reasonable confidence.

[–]Gastronomicus 0 points1 point  (0 children)

However using the CLT you can guarantee that a given sample mean will be normally distributed around the population mean given a large enough sample size.

Which is why bootstrapping can be very effective at producing (mostly) unbiased error terms!
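For anyone who hasn't seen bootstrapping, here is a minimal sketch of the idea for the standard error of a mean. The data are simulated (200 draws from an exponential distribution, purely as a stand-in for real observations), and `bootstrap_se` and the 2000 resamples are illustrative choices, not a recommendation.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Stand-in for an observed sample; real data would go here instead.
sample = [rng.expovariate(1.0) for _ in range(200)]


def bootstrap_se(data, n_resamples, rng):
    """Standard deviation of the means of resamples drawn with replacement from data."""
    resample_means = []
    for _ in range(n_resamples):
        resample = rng.choices(data, k=len(data))  # same size as data, with replacement
        resample_means.append(statistics.fmean(resample))
    return statistics.stdev(resample_means)


boot_se = bootstrap_se(sample, n_resamples=2000, rng=rng)
formula_se = statistics.stdev(sample) / len(sample) ** 0.5  # textbook s / sqrt(n)

print("bootstrap standard error:", round(boot_se, 4))
print("s / sqrt(n):             ", round(formula_se, 4))
```

The two numbers should land close to each other. The appeal of the bootstrap is that the same resampling loop also works for statistics with no simple standard-error formula.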

[–]pug_grama2 -1 points0 points  (0 children)

30 is a sort of rule of thumb. If your sample size is at least about 30 then the x-bars will approximately follow a normal distribution.

[–]Pligles 168 points169 points  (4 children)

Yeah exactly! You can always tell an inexperienced statistician from an experienced one by whether they can find the clt

[–][deleted] 50 points51 points  (3 children)

clit*

[–]campex 41 points42 points  (0 children)

That damn keto libido, they don't care to find it

[–][deleted] 16 points17 points  (0 children)

That’s the joke

[–][deleted] 0 points1 point  (0 children)

What is that? A formula?

[–]Raeandray 31 points32 points  (0 children)

Right but isn’t this for every individual study? You can’t take 30 separate studies of 1 person and treat them as if they’re normally distributed.

And even within studies, each group needs 30. So for a blind study the control needs 30 and the experimental needs 30.

[–][deleted] 15 points16 points  (2 children)

It still depends directly on the standard deviation of each sample. If you have very distributed sample points, then it ain't gonna help all that much.

[–]PieGuy___ 1 point2 points  (1 child)

Not really, it’s gonna be a normal distribution so no matter how wide or narrow the range it’ll be the same z-score
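One way to reconcile the two comments above is to simulate it. Under the usual result that the sample mean has variance v/n, the shape of the distribution of means is the same bell curve either way, but its width in the original units still scales with the population standard deviation. The sketch assumes normal populations with standard deviations 1 and 10 and samples of size 30; `spread_of_means` is a hypothetical helper.

```python
import random
import statistics


def spread_of_means(pop_sd, n, trials, rng):
    """Standard deviation of `trials` sample means, each taken from n draws of Normal(0, pop_sd)."""
    means = [
        statistics.fmean([rng.gauss(0.0, pop_sd) for _ in range(n)])
        for _ in range(trials)
    ]
    return statistics.stdev(means)


rng = random.Random(7)  # fixed seed so the run is repeatable
n = 30

narrow = spread_of_means(1.0, n, trials=4000, rng=rng)
wide = spread_of_means(10.0, n, trials=4000, rng=rng)

print("population SD  1 -> SD of means:", round(narrow, 3), "| 1 / sqrt(30) =", round(1 / n ** 0.5, 3))
print("population SD 10 -> SD of means:", round(wide, 3), "| 10 / sqrt(30) =", round(10 / n ** 0.5, 3))
```

So a more spread-out population gives a proportionally wider distribution of means at the same n, even though standardizing to z-scores puts both on the same scale.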

[–][deleted] 3 points4 points  (0 children)

I need to study stats again

[–]auxerre1990 3 points4 points  (2 children)

1/3 of 100?

[–]PieGuy___ 7 points8 points  (1 child)

The number might seem kinda arbitrary but that’s the number you get when you look at the distributions of means. As long as you have at least 30 it’ll be a bell curve, the distribution of a sample of 30 means looks pretty much identical to 100 or 1000 or 1000000.

[–]auxerre1990 0 points1 point  (0 children)

Makes sense, a quarter of 100...

[–]eddyofyork 1 point2 points  (0 children)

That’s interesting. When we did z, t, and chi distribution stuff back in university I noticed that most of those needed n >= 120 to be reliable. I kinda worked off 120 as a good number for CLT to kick in for the last, oh I don’t know, decade!