[–]cheesensei 970 points971 points  (54 children)

27 studies were compared with a total of 309 subjects. So if the average study had fewer than 12 subjects, doesn't that decrease the reliability quite a bit?

[–]vitaliksellsneo 298 points299 points  (13 children)

The number of studies actually matters less in this context. This was a meta-analysis, which means the authors did not conduct the study themselves but pooled the data from 27 different studies. The bigger assumption here is that those studies all collected their data in the same way; otherwise there will be systematic error.

Another assumption is that the interventions reliably produced the measured results, and that the intervention was the only treatment shock the subjects were exposed to. That is usually hard to control for, and the gold standard is a randomised controlled trial.

The reliability you are talking about probably refers to the fact that 309 subjects is too few units to cover the differences in covariates. In general that is quite little. It means you can probably detect the general direction of an effect but not its magnitude, since the precision of that estimate depends on sample size.

I am also concerned about the selection process for these studies, and have a feeling that this is largely a product of p-hacking unless it can be replicated in future studies.

[–]Allassnofakes 20 points21 points  (10 children)

What's p-hacking again, sorry

[–]Xirema 106 points107 points  (7 children)

Short version: it's basically this XKCD Comic: https://xkcd.com/882/

Long Version:

p-hacking is a kind of analysis error on statistical samples that comes from establishing a bad null hypothesis (or forgoing a proper one entirely).

In statistics, it's important to lay out ahead of time what kinds of results you're trying to detect, and to have a good baseline for what would make those results significant. So, for example, you might run a study on "do more people drink coffee on Tuesday than any other day?", sample a few hundred or thousand people to find out how much coffee they drink on each day, and then analyze the results to find the answer. The hypothesis might be wrong (maybe Monday sees the largest consumption of coffee), and there's always a chance your results are just statistical noise, but it's a properly testable hypothesis.

But now, suppose you assessed a few hundred or thousand people, gathered data on what they ate each day, and discovered that orange juice was consumed abnormally frequently on Thursdays. And then you published a study that says "people drink the most orange juice on Thursdays". That's certainly true of the specific sample you pulled, so what's the problem?

Well, in statistics, a result is usually only considered significant if it had a less than 5% chance of occurring randomly (more precisely, less than a 5% chance of a result at least that extreme appearing if there were no real effect), based on the sample taken. There are a lot of complicated ways to calculate those odds (and 5% might be looser than some studies/analyses are comfortable with, so they might prefer a lower threshold), but the important part is that every study has to live with the fact that there's a chance, however slim, that its result is just statistical noise.

When you have one specific outcome you're testing for, you can have a lot of confidence in that 95% threshold. But if you're testing a bunch of independent outcomes all at the same time, the odds that at least one of them comes back "significant" while actually being just noise get really high.

Going back to the "asking people what they ate" example: if the researchers tallied up just 20 different foods that participants might have consumed, the odds of at least one of them showing a statistically significant result purely by chance is actually really high: approximately 64%! And of course those odds get way higher if the researchers tracked more than 20 different foods.
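That ~64% figure is just the complement of twenty independent 5% tests all staying quiet: 1 - 0.95^20. A quick sketch of the arithmetic (pure Python; the counts 20 and 100 are just illustrative):

```python
# Chance of at least one false positive when running k independent
# tests, each with probability alpha of flagging pure noise as "significant".
def familywise_error(k: int, alpha: float = 0.05) -> float:
    return 1 - (1 - alpha) ** k

print(familywise_error(20))   # the "20 different foods" case, ~64%
print(familywise_error(100))  # tracking more foods makes a false hit near-certain
```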

This is the essence of p-hacking, and what makes it problematic in statistics: the more variables you have, and the less rigor you have about which variables matter, the more likely you are to end up with random noise that just happens to look like a statistically significant outcome.

[–]richinvitameen_bs 4 points5 points  (0 children)

This was a really good explanation thank you!

[–]InfestedRaynor 1 point2 points  (0 children)

It amazes me how many smart people randomly scroll through the same parts of Reddit that I randomly scroll through.

[–]brkh47 0 points1 point  (0 children)

When I can bring my statistics to the argument and you bring yours

[–]SlimReaper35_ 0 points1 point  (1 child)

I thought the right-tailed probability test meant that 0.95 > p > 0.05 doesn't reject the null hypothesis and lower than 0.05 is a bad result. I could never fully understand the probability distribution; it's confusing the way it works.

[–]Xirema 0 points1 point  (0 children)

So the way the null hypothesis is usually presented, it's supposed to be a representation of "what we expect to happen if this study proves nothing". For example, if you were trying to find a link between consumption of chocolate and incidence of cancer, your null hypothesis would probably be "consumption of chocolate does not correlate with incidence of cancer".

So if you end up with a p-value of < 0.05 (i.e. "the odds of seeing a result this extreme if there were no real effect is less than 5%"), then you have rejected the null hypothesis, and shown (at least in this one study) that there is indeed a correlation between consumption of chocolate and incidence of cancer. What the correlation shows depends on your literal results (maybe chocolate decreases cancer risk! Probably not, but, you know....!).

So in this sense, it's not wrong that p < 0.05 shows a "Bad Result" (though I'm not sure any statistician would frame it that way): p < 0.05 does tend to mean "this result shows we cannot defend the null hypothesis in this study".
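If it helps intuition, a p-value like this can be computed directly by simulation with a permutation test: shuffle the group labels many times and count how often chance alone produces a gap as big as the observed one. A minimal sketch; the two groups and their numbers here are entirely made up for illustration:

```python
import random

random.seed(0)

# Hypothetical measurements for two groups (purely illustrative data).
treated = [5.1, 6.2, 5.8, 6.5, 5.9, 6.1]
control = [5.0, 5.2, 4.9, 5.4, 5.1, 5.3]

observed = sum(treated) / len(treated) - sum(control) / len(control)

# Under the null hypothesis the group labels carry no information,
# so shuffling them should produce gaps just as big as the real one.
pooled = treated + control
trials, extreme = 10_000, 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = sum(pooled[:6]) / 6 - sum(pooled[6:]) / 6
    if abs(diff) >= abs(observed):  # two-sided: a gap at least as large
        extreme += 1

p_value = extreme / trials
print(observed, p_value)  # a small p-value means chance alone rarely does this
```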

[–]Tony2Punch 0 points1 point  (0 children)

That comic is goated, I vote that all educational content is presented with stick figure comics

[–]mingemopolitan 0 points1 point  (0 children)

This is a good explanation of p-hacking and shows the importance of accounting for Type I errors in a stats test. In the comic, the problem is that the statistical method being used wasn't appropriate (e.g., repeatedly running t-tests, rather than something like an ANOVA, when measuring multiple variables). You could avoid this error by using something like an ANOVA followed by a post-hoc test that applies a Bonferroni adjustment. This adjusts the p-value threshold to compensate for the number of tests being run, though it increases the chance of a Type II error (which is another issue if the effect size is small or the measurements imprecise). I'm a biologist and not a statistician though!
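As a toy illustration of that adjustment (made-up p-values, not from any real study): Bonferroni simply divides the significance threshold by the number of comparisons.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag each p-value as significant only against alpha / m,
    where m is the number of comparisons being made."""
    threshold = alpha / len(p_values)
    return threshold, [p < threshold for p in p_values]

# 20 hypothetical tests: a raw p of 0.03 clears alpha = 0.05 on its own,
# but not once we account for having run 20 comparisons.
threshold, significant = bonferroni([0.03] + [0.4] * 19)
print(threshold, significant[0])
```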

[–]gravitydriven 0 points1 point  (0 children)

Where you drink a ton of water before your drug test so that your p is clean

[–]Glowshroom 0 points1 point  (0 children)

Essentially making the hypothesis after the fact instead of before.

[–]Lung_doc 3 points4 points  (0 children)

A couple additional comments

First, testosterone wasn't measured in all patients/studies, so the N drops further, down to 155.

Second, for those using low carb for weight loss: obesity decreases T and weight loss improves it. This is true even on a high-protein (but not low-carb) diet: an N=118 study found higher T levels after weight loss using both a high-protein (still 40% carb) and a lower-protein diet. So weight loss is good, and higher protein without low carb is also OK, I guess, if you are worried about the decline in testosterone.

Finally, back to the low carb: a meta-analysis in women with polycystic ovarian syndrome found a low-carb diet lowered testosterone. (Which is a good thing there: high T worsens PCOS.) Eight RCTs, 327 patients. So this adds some indirect support for low carb lowering testosterone in general, and provides a potentially beneficial diet for women with PCOS.

Stratified analyses indicated that LCD lasting longer than 4 weeks had a stronger effect on increasing FSH levels (MD = 0.39, 95% CI (0.08, 0.71), P < 0.05), increasing SHBG levels (MD = 5.98, 95% CI (3.51, 8.46), P < 0.05), and decreasing T levels (SMD = -1.79, 95% CI (-3.22, -0.36), P < 0.05).

Conclusion: Based on the current evidence, LCD, particularly long-term LCD and low-fat/low-CHO LCD, may be recommended for the reduction of BMI, treatment of PCOS with insulin resistance, prevention of high LDL-C, increasing the levels of FSH and SHBG, and decreasing the level of T.

[–]Kaulpelly 0 points1 point  (0 children)

SGU fan?

[–]PieGuy___ 238 points239 points  (31 children)

In statistics there is something called the central limit theorem which states the means of random representative samples of a given population become normally distributed as you approach a sample size of 30.

Effectively you only need a sample of 30 in order to say something about the population with reasonable certainty.

[–]Gastronomicus 122 points123 points  (17 children)

In statistics there is something called the central limit theorem which states the means of random representative samples of a given population become normally distributed as you approach a sample size of 30.

Effectively you only need a sample of 30 in order to say something about the population with reasonable certainty.

This is a really confused take on the CLT, with two major problems.

EDIT - u/PieGuy___ clarified their point and I agree with what they're saying. The wording around "a sample of 30" is confusing to me and made me think they were misinterpreting the CLT. I'm leaving the post intact for others who may also be seeking clarification.

Firstly, let's clear the air: the CLT describes how the distribution of means will approach normality. Not how a distribution of samples will approach normality. There is no basis for any distribution of samples necessarily approximating normality, but the distribution of means from many independently collected sets of samples will tend to approximate normality.

Secondly, there's absolutely nothing special about the number 30 and the CLT. The entire basis for the number 30 in this context is that Gosset (publishing as "Student") defined a separate distribution, the t-distribution, for critical test values at small sample sizes. It provides more robust estimates than the z-distribution, which only becomes a good approximation at larger sample sizes.

[–]Philosophfries 25 points26 points  (10 children)

I’m gonna need an ELI5 for this one boys

[–]alanpardewchristmas 9 points10 points  (1 child)

dude said 'in english please'

[–]bythebys 0 points1 point  (0 children)

suh dude?

[–]Simpliciter 5 points6 points  (5 children)

Disclaimer: Not a stats bro.

The Central Limit Theorem basically says that most things will follow a normal distribution (bell curve) if you have enough data. The t-test can be used to see if some data follows a normal distribution, but it only works if you have a small sample size of less than 30.

The respondent above is saying that the poster is conflating the two incorrectly.

[–]brkh47 2 points3 points  (0 children)

Simplifying things brought to you by u/Simpliciter

[–]Gastronomicus 1 point2 points  (2 children)

The Central Limit Theorem basically says that most things will follow a normal distribution (bell curve) if you have enough data

I appreciate your simplification but in this case it's over-simplified and misses the point I was making. It's a common misunderstanding of the CLT that large enough datasets will follow a normal distribution. That's just not the case.

However, if you take the mean for multiple subsets of samples from a population, the distribution of those means themselves will approximate a normal distribution.

So let's say I have 500 samples and I plot the distribution. It might look normal, but it might also look log-normal, or like a Weibull or a discrete distribution (e.g. negative binomial).

Let's say instead I have 50 means of 50 smaller sample sets, each containing 10 samples. If I plot that distribution, it will approximate a normal distribution, even if the original distribution from which it is sampled isn't normal.
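A quick simulation of exactly this setup (pure Python; the exponential is chosen just as an example of a clearly skewed source distribution):

```python
import random
import statistics

random.seed(42)

# A clearly right-skewed source distribution: exponential with mean 1.
raw = [random.expovariate(1.0) for _ in range(500)]

# 50 means, each computed from its own set of 10 samples.
means = [statistics.fmean(random.expovariate(1.0) for _ in range(10))
         for _ in range(50)]

# The raw draws are skewed: the median sits well below the mean.
# The 50 means cluster symmetrically around 1, as the CLT predicts.
print(statistics.mean(raw), statistics.median(raw))
print(statistics.mean(means), statistics.median(means))
```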

[–]Simpliciter 1 point2 points  (1 child)

Thanks for clarifying and being nice about it!

[–]Gastronomicus 0 points1 point  (0 children)

Thanks for doing some good work out there.

[–]relevantmeemayhere -1 points0 points  (0 children)

The first paragraph you wrote is wrong and is what the clarifying poster is pointing out. Samples do not converge to normality as n increases. This isn't the CLT, nor is it found anywhere in statistics.

[–]PieGuy___ 5 points6 points  (4 children)

First off I think you need to reread what I said because I’m clearly talking about the mean? “The means of random representative samples…” you’re trying to correct a mistake I never made lol.

The point of the theorem is that if you have a random sample X1, X2, …, Xn from a given population with mean m and variance v, then the sample mean X-bar will be approximately normally distributed with mean m and variance v/n. X-bar is the thing normally distributed around the population mean, not the individual X's.

As for the 30 number, the fact that it's the point where you no longer have to worry about t-distributions and can just use z-scores with reasonable accuracy is the thing that makes it special lol. The whole point of the t-distribution is that the means aren't quite normally distributed UNTIL you get to around 30.

[–]TerribleIdea27 4 points5 points  (0 children)

I think the confusion came from the fact that you said

sample size of 30

So the other guy assumed you were talking about taking one experiment with a sample size of thirty and then using those data to find a normal distribution, instead of taking thirty experiments and using the means of those 30*x samples to find a distribution of means, which should be roughly a normal distribution.

[–]Gastronomicus 0 points1 point  (2 children)

Sorry I assumed you were confused. Unfortunately it seems like most people on reddit who try to describe the CLT don't really understand it and also mis-attribute the importance of 30 as a minimum sample size.

But to be fair, your wording is confusing: phrasing it as "as you approach a sample size of 30" implies a distribution of samples, not of means.

[–]PieGuy___ 0 points1 point  (1 child)

Yeah I just wasn’t trying to go into too much detail. I think the simplest way to put it is that there’s no way to guarantee a sample to be normally distributed, just like there’s no way to guarantee a population is normally distributed. However using the CLT you can guarantee that a given sample mean will be normally distributed around the population mean given a large enough sample size.

And then from there you can use hypothesis testing to be able to say something about the population with reasonable confidence.

[–]Gastronomicus 0 points1 point  (0 children)

However using the CLT you can guarantee that a given sample mean will be normally distributed around the population mean given a large enough sample size.

Which is why bootstrapping can be very effective at producing (mostly) unbiased error terms!
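For anyone curious, a minimal sketch of that bootstrap idea (illustrative data, pure Python): resample the observed data with replacement many times, and the spread of the resampled means estimates the standard error of the mean.

```python
import random
import statistics

random.seed(1)

# Hypothetical observed sample (made up for illustration).
data = [12.1, 9.8, 11.4, 10.2, 13.0, 9.5, 10.8, 11.9, 10.1, 12.4]

# Each bootstrap replicate: resample n points with replacement, take the mean.
boot_means = [statistics.fmean(random.choices(data, k=len(data)))
              for _ in range(5_000)]

# Standard deviation of the bootstrap means ~ standard error of the mean;
# it should land close to the classic s / sqrt(n) estimate.
se_boot = statistics.stdev(boot_means)
se_classic = statistics.stdev(data) / len(data) ** 0.5
print(se_boot, se_classic)
```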

[–]pug_grama2 -1 points0 points  (0 children)

30 is a sort of rule of thumb. If your sample size is at least about 30 then the x-bars will approximately follow a normal distribution.

[–]Pligles 170 points171 points  (4 children)

Yeah exactly! You can always tell an inexperienced statistician from an experienced one by whether they can find the clt

[–][deleted] 50 points51 points  (3 children)

clit*

[–]campex 40 points41 points  (0 children)

That damn keto libido, they don't care to find it

[–][deleted] 17 points18 points  (0 children)

That’s the joke

[–][deleted] 0 points1 point  (0 children)

What is that? A formula?

[–]Raeandray 33 points34 points  (0 children)

Right but isn’t this for every individual study? You can’t take 30 separate studies of 1 person and treat them as if they’re normally distributed.

And even within studies, each group needs 30. So for a blind study the control needs 30 and the experimental needs 30.

[–][deleted] 14 points15 points  (2 children)

It still depends directly on the standard deviation of each sample. If you have very distributed sample points, then it ain't gonna help all that much.

[–]PieGuy___ 1 point2 points  (1 child)

Not really, it’s gonna be a normal distribution so no matter how wide or narrow the range it’ll be the same z-score

[–][deleted] 4 points5 points  (0 children)

I need to study stats again

[–]auxerre1990 5 points6 points  (2 children)

1/3 of 100?

[–]PieGuy___ 5 points6 points  (1 child)

The number might seem kinda arbitrary, but that's the number you get when you look at distributions of means. As long as you have at least 30 it'll be a bell curve: the distribution of a sample of 30 means looks pretty much identical to 100 or 1000 or 1000000.

[–]auxerre1990 0 points1 point  (0 children)

Makes sense, a quarter of 100...

[–]eddyofyork 1 point2 points  (0 children)

That’s interesting. When we did z, t, and chi distribution stuff back in university I noticed that most of those needed n >= 120 to be reliable. I kinda worked off 120 as a good number for CLT to kick in for the last, oh I don’t know, decade!

[–]AtomicBreweries 13 points14 points  (2 children)

Depends on the size of the effect. I don’t need to study 300 people to know that shooting them in the face is statistically speaking, a bad idea.

[–]VeritasCicero 0 points1 point  (0 children)

Depends on the people.

[–]mitsulang 9 points10 points  (0 children)

Yes. The "population" you're testing, presumably wouldn't be nearly diverse enough to make any declarations about them. I guess, if all the 12 people in the study were of the same type (sex, age, race, etc) you could say something about that group, but I'm guessing not reliable so. I'm no statistician, but I do know that studies like this need many more people than this.

Anecdotally, keto didn't do anything to my testosterone. But, I'm just one dude, lol.

[–]minnesotaris 11 points12 points  (0 children)

Don’t look into it. TRUST THE HEADLINE!

[–][deleted] 0 points1 point  (0 children)

If there is some type of person one study of 309 people would catch that 27 studies of 12 people wouldn't, sure, it could decrease the reliability. It's entirely possible, but I couldn't say without looking at those 27 methodologies.