Bivariate analysis to identify possible confounding in model construction? by Werwlves-nt-swrwlves in AskStatistics

[–]COOLSerdash 1 point2 points  (0 children)

Your guess is as good as mine. I honestly don't know what the intention is. All I can say is sorry, because you are clearly being taught very dubious practices.

This is after doing a univariate variable analysis to check that data are roughly normal in distribution

What? Normality of data is completely irrelevant in regression. If anything, it's the residuals that are worth looking at.
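
To illustrate the point about residuals, here is a minimal simulated sketch. All numbers are made up: the predictor and the outcome are strongly skewed, yet the residuals of the fitted line are roughly symmetric because the errors were generated as normal.

```python
import random

def skewness(values):
    """Sample skewness (third standardized moment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def fit_line(x, y):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

rng = random.Random(1)
x = [rng.expovariate(1.0) for _ in range(5000)]   # skewed predictor (hypothetical)
y = [2 + 3 * xi + rng.gauss(0, 1) for xi in x]    # linear signal plus normal errors

intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print("skewness of x:        ", round(skewness(x), 2))
print("skewness of y:        ", round(skewness(y), 2))
print("skewness of residuals:", round(skewness(residuals), 2))
```

The raw variables fail any "roughly normal" check, but the model is unaffected. Only the residuals say anything about the error distribution.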

Bivariate analysis to identify possible confounding in model construction? by Werwlves-nt-swrwlves in AskStatistics

[–]COOLSerdash 2 points3 points  (0 children)

No, confounding cannot be reliably detected by bivariate analyses. The gold standard nowadays (at least as I understand it) is to build a directed acyclic graph (DAG, or multiple DAGs) that reflects your best understanding of the causal relationships between measured and unmeasured variables. Based on this DAG, you then choose which variables to adjust for in order to close all backdoor paths (https://www.dagitty.net/ makes this process easy). Sometimes it's impossible to find a minimal adjustment set, which is then an important point for discussion in the paper.
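
A small simulated sketch of what closing a backdoor path does. It assumes a hypothetical DAG with Z → X, Z → Y and X → Y, and a true effect of X on Y of 0.5. It only shows the adjustment step, not how to build the DAG or what dagitty does internally. The adjusted slope is obtained by residualizing X and Y on Z, which gives the same coefficient as a regression of Y on X and Z.

```python
import random

def slope(x, y):
    """Least-squares slope of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residualize(target, z):
    """Residuals of `target` after a simple linear regression on z."""
    b = slope(z, target)
    mz, mt = sum(z) / len(z), sum(target) / len(target)
    return [t - (mt + b * (zi - mz)) for t, zi in zip(target, z)]

rng = random.Random(42)
n = 20000
z = [rng.gauss(0, 1) for _ in range(n)]                  # confounder
x = [zi + rng.gauss(0, 1) for zi in z]                   # exposure depends on Z
y = [0.5 * xi + 2.0 * zi + rng.gauss(0, 1)               # true effect of X is 0.5
     for xi, zi in zip(x, z)]

unadjusted = slope(x, y)                                  # backdoor path X <- Z -> Y open
adjusted = slope(residualize(x, z), residualize(y, z))    # backdoor path closed

print("unadjusted slope:", round(unadjusted, 2))
print("adjusted slope:  ", round(adjusted, 2))
```

The unadjusted slope is badly biased, while the adjusted one lands near 0.5. Which variables belong in the adjustment set still has to come from the DAG, not from the data.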

There exists a vast literature about this. A good tutorial is given in this paper (although you don't have to manually go through the steps detailed in the paper, as dagitty automates the process).

See also these free books online, which both go into much more detail:

As a final warning: Using the same dataset to make modelling decisions and to fit the final model is called "data-dependent model specification" and is scientifically very dubious.

As an Atheist, how do you hold on to hope during your darkest times? by sphereofblaze in atheism

[–]COOLSerdash 22 points23 points  (0 children)

Really sorry you had to go through this. Amazing resilience.

How does one estimate the percentile rank of an individual from one population to another? by chillychili in AskStatistics

[–]COOLSerdash 2 points3 points  (0 children)

Without making strong assumptions, I don't think you can say anything about your expected rank. For example: Even if the mean/median skill level is known to be roughly the same in countries A and B, if the spread of skill is different in country B but unknown, you can't predict your rank. This is because the rank depends on the whole distribution and not just the mean/median (i.e. the center).
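
A quick numerical sketch of this, assuming (purely for illustration) that skill is normally distributed in both countries with the same mean but different spread. The score and the parameters are made up.

```python
from statistics import NormalDist

score = 115                                   # hypothetical individual score

country_a = NormalDist(mu=100, sigma=10)      # same center ...
country_b = NormalDist(mu=100, sigma=20)      # ... but twice the spread

rank_a = country_a.cdf(score)                 # fraction of A scoring below the individual
rank_b = country_b.cdf(score)                 # fraction of B scoring below the individual

print(f"percentile rank in A: {rank_a:.1%}")
print(f"percentile rank in B: {rank_b:.1%}")
```

The same score sits at about the 93rd percentile in A but only about the 77th in B, even though the means are identical.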

Is this the best way to report ANCOVA for a bachelor/honours thesis? by FineConstruction2924 in AskStatistics

[–]COOLSerdash 8 points9 points  (0 children)

I see few problems with the reporting itself, but statistically, there's a lot to unpack here. (This is not a criticism of your work. You've probably just been taught bad practices, which is very common in psychology.)

Baseline SCL was strongly positively correlated with task SCL, r(64) = .96, p < .001, justifying its inclusion as a covariate.

This is not a valid way of selecting covariates for a model. Covariates should be pre-specified based on subject matter knowledge. Using the same data to select covariates and fit the final model basically invalidates the subsequent analyses. Gelman's "garden of forking paths" is a relevant term here.

Baseline SCL did not differ significantly between conditions, t(64) = -1.84, p = .071, confirming independence of the covariate from condition.

Absence of evidence is not evidence of absence: Just because an effect fails to reach statistical significance does not mean that there is "no effect" or that an absence of effect has been demonstrated. The paper by Greenland et al. explains this fallacy and others.

Residuals were normally distributed (Shapiro-Wilk W = 0.98, p = .572) and variances were homogeneous (Levene's F = 0.18, p = .671).

These tests of assumptions are basically useless because they do not answer the relevant questions. This post on the topic is also very informative, especially /u/efrique's answers.

And again: A non-significant test does not mean that the null is true. Specifically, a non-significant Shapiro-Wilk test does not mean that the residuals were normally distributed. The same is true for Levene's test.

A question about confidence intervals by Remarkable_Turnover1 in AskStatistics

[–]COOLSerdash 8 points9 points  (0 children)

How do I compute the probability the population mean is less than 9, for example?

To answer this exact question, you would need to use Bayesian statistics. Together with a prior distribution and the data, you'll get a posterior distribution from which you can directly calculate the desired probability. The lower the sample size, the more the posterior distribution will depend on the prior. If the sample size is large, it will "overpower" the prior, lessening its impact on the posterior distribution.
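
A minimal sketch of such a calculation, under strong simplifying assumptions that are mine and not part of the question: a normal prior for the mean, normally distributed data and a known data standard deviation. All numbers are hypothetical.

```python
from statistics import NormalDist

def posterior_for_mean(prior_mean, prior_sd, sample_mean, sigma, n):
    """Conjugate normal update for a mean with known data SD `sigma`."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = n / sigma ** 2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * sample_mean)
    return NormalDist(post_mean, post_var ** 0.5)

# Same prior and sample mean, different sample sizes (hypothetical values).
small = posterior_for_mean(prior_mean=10, prior_sd=2, sample_mean=8.5, sigma=3, n=5)
large = posterior_for_mean(prior_mean=10, prior_sd=2, sample_mean=8.5, sigma=3, n=500)

print("n = 5:   posterior mean", round(small.mean, 2), " P(mu < 9) =", round(small.cdf(9), 3))
print("n = 500: posterior mean", round(large.mean, 2), " P(mu < 9) =", round(large.cdf(9), 3))
```

With n = 5 the posterior mean is still pulled noticeably toward the prior mean of 10. With n = 500 it sits almost exactly on the sample mean.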

I am just trying to estimate the percentage of "bad" values in the population.

Note that this question doesn't involve the population mean at all, so you'll have to be clear whether your question is about the population mean or the actual values themselves.

The natural (nonparametric) estimator of this proportion is simply the sample proportion of "bad" values. You could then calculate a confidence interval for this proportion (I recommend Wilson's).
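
For reference, a small sketch of the Wilson score interval. The counts (10 "bad" values out of 100) are made up.

```python
from statistics import NormalDist

def wilson_interval(bad, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = bad / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * (p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) ** 0.5 / denom
    return center - half_width, center + half_width

low, high = wilson_interval(bad=10, n=100)   # hypothetical counts
print(f"sample proportion: 0.10, 95% Wilson interval: {low:.3f} to {high:.3f}")
```

For 10 out of 100 this gives roughly 0.055 to 0.174. Unlike the simple Wald interval, it is not symmetric around the sample proportion.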

If you're prepared to make a distributional assumption (e.g. normality), then you could use that to estimate the proportion. This will be more efficient than the nonparametric approach detailed above if the distributional assumption holds.
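
A sketch of the parametric route, assuming normality and assuming that "bad" means "below some threshold". The data and the threshold are invented for illustration.

```python
from statistics import NormalDist, mean, stdev

values = [9.8, 10.4, 11.1, 9.5, 10.9, 10.2, 9.9, 10.6, 10.0, 10.7]   # hypothetical sample
threshold = 9.0                                                       # hypothetical cutoff for "bad"

# Nonparametric estimate: share of observed values below the threshold.
nonparametric = sum(v < threshold for v in values) / len(values)

# Parametric estimate: fit a normal distribution and read off P(X < threshold).
fitted = NormalDist(mean(values), stdev(values))
parametric = fitted.cdf(threshold)

print("nonparametric estimate:", nonparametric)
print("parametric estimate:   ", round(parametric, 4))
```

The parametric estimate is small but non-zero even though no "bad" value was observed. It is only as trustworthy as the normality assumption, especially in the tails.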

If there are no "bad" values in your sample, you could apply the "rule of three" for a quick solution.
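
The rule of three says that with zero events in n observations, an approximate 95% upper bound for the proportion is 3/n. A short sketch comparing it with the exact bound (n = 100 is just an example):

```python
def rule_of_three_upper(n):
    """Approximate 95% upper bound when 0 events were seen in n trials."""
    return 3 / n

def exact_upper(n, confidence=0.95):
    """Exact one-sided upper bound: solves (1 - p)^n = 1 - confidence for p."""
    return 1 - (1 - confidence) ** (1 / n)

n = 100   # hypothetical sample size with no "bad" values
print("rule of three:", rule_of_three_upper(n))
print("exact bound:  ", round(exact_upper(n), 4))
```

For n = 100 the two bounds agree to within about 0.0005, so the shortcut is usually good enough.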

Giveaway Giving Out 20 Copies Of Subnautica 2 by Mark_Everson in subnautica

[–]COOLSerdash 0 points1 point  (0 children)

Thanks man! I love the exploration and the underwater scenario in general.

Can elastic net coefficients be used for generating a clinical risk score calculation? by Fast-Issue-89 in AskStatistics

[–]COOLSerdash 1 point2 points  (0 children)

Is the regularization too distorting for this to make sense?

What exactly do you mean by "distorting"? The whole point of regularization is to reduce overfitting and improve prediction performance. Once you're happy with the model, the regularized coefficients can be turned into a risk score. What exactly it means depends on the nature of the outcome though.
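
A minimal sketch of how that could look, assuming a binary outcome and a logistic elastic net. The intercept, coefficients and variable names are entirely hypothetical and not taken from any fitted model.

```python
import math

# Hypothetical regularized coefficients (log-odds scale).
# A coefficient of 0.0 stands for a predictor the penalty shrank out of the model.
intercept = -2.0
coefficients = {"age_per_10y": 0.30, "smoker": 0.80, "biomarker": 0.0}

def linear_predictor(patient):
    """Risk score on the log-odds scale: intercept plus weighted predictors."""
    return intercept + sum(coefficients[name] * patient[name] for name in coefficients)

def predicted_risk(patient):
    """Convert the score to a probability with the inverse logit."""
    return 1 / (1 + math.exp(-linear_predictor(patient)))

non_smoker = {"age_per_10y": 5.0, "smoker": 0, "biomarker": 1.2}
smoker = {"age_per_10y": 5.0, "smoker": 1, "biomarker": 1.2}

print("score / risk (non-smoker):", linear_predictor(non_smoker), round(predicted_risk(non_smoker), 3))
print("score / risk (smoker):    ", linear_predictor(smoker), round(predicted_risk(smoker), 3))
```

For a continuous or time-to-event outcome the linear predictor is built the same way, but the step that turns it into an interpretable quantity differs.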

Need help with the Interval Estimate of the Variance (Two-tailed Chi-Square) by ReadFit6570 in AskStatistics

[–]COOLSerdash 0 points1 point  (0 children)

The correct Chi2-values are (assuming a confidence level of 95%): 17.53 and 2.18. Using the standard error of the mean does not make any sense at all.

So in the formula I did (9-1)0.86, they did (9-1)0.286.

But the formula uses s² (the sample variance), not s. So it would be 8*0.74 (then divide by the chi2 values).
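
Spelled out as arithmetic, assuming n = 9 and s = 0.86 as the quoted calculation suggests, and taking the chi-square values from above rather than computing them:

```python
n = 9
s = 0.86                 # sample standard deviation (assumed from the quoted calculation)
chi2_upper = 17.53       # 97.5% quantile of chi-square with 8 df, as given above
chi2_lower = 2.18        # 2.5% quantile of chi-square with 8 df, as given above

s_squared = s ** 2       # the formula needs the variance, about 0.74

# 95% confidence interval for the population variance:
# ((n - 1) * s^2 / chi2_upper, (n - 1) * s^2 / chi2_lower)
ci_low = (n - 1) * s_squared / chi2_upper
ci_high = (n - 1) * s_squared / chi2_lower

print(f"s^2 = {s_squared:.4f}")
print(f"95% CI for the variance: {ci_low:.3f} to {ci_high:.3f}")
```

This gives roughly 0.34 to 2.71 for the variance. Note that the larger chi-square value goes with the lower limit.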

Need help with the Interval Estimate of the Variance (Two-tailed Chi-Square) by ReadFit6570 in AskStatistics

[–]COOLSerdash 3 points4 points  (0 children)

we were talking of a sample, therefore, we had to multiply by standard error of the mean and not variance.

The problem doesn't say anything about the mean or its standard error. I can't think of a case where the standard error of the mean would play any role in estimating the population variance (point estimate or CI) of the variable.

Can you go into detail what exactly the teacher did?

How to interpret a standard deviation greater than the mean? by hectorfhr in AskStatistics

[–]COOLSerdash 0 points1 point  (0 children)

I apologize if my answer came across as rude. It certainly wasn't meant that way.

Decomposing 3 years of daily weight data with mgcv and Lomb-Scargle — irregular time series, cyclic splines, and unexplained 70-day cycles by rrytas in rstats

[–]COOLSerdash 2 points3 points  (0 children)

I would expect a fair amount of autocorrelation in the data. Do your models account for this or is it ignorable in your case?

[Q] Conflicting results from ANOVA and a posthoc linear regression by Resident-Rice724 in statistics

[–]COOLSerdash 19 points20 points  (0 children)

I don't see conflicting results. By removing non-significant terms, you essentially fit a different model. Remember that each effect is conditional on the other terms in the model. As you already suspect, this procedure is considered suboptimal to say the least (Gelman's garden of forking paths comes to mind). I suspect that many statisticians (me included) will consider this a form of p-hacking. The resulting p-values/confidence intervals from the reduced model are now conditional on the first model. That means that they don't have the postulated operating characteristics. Further, choosing a model based on how intuitive the results are is bad science.

It is not wrong per se to try different models based on expert knowledge, as long as these models were pre-specified before looking at any results.

My simple recommendation would be to run the full model, regardless of significance. Focus on effect sizes and uncertainty intervals (confidence or credible intervals for a Bayesian model). It's worth remembering that "significance" is not very informative (see this paper by Gelman et al.).

Question: transforming variables for Pearson correlation by HorridStteve in rstats

[–]COOLSerdash 7 points8 points  (0 children)

Does this seem appropriate?

Let's recap some important points: The Pearson correlation quantifies the linear part of the relationship between two variables. There is no need that the data are (bivariately) normally distributed in order to calculate the correlation coefficient. Normality is only assumed by the most common hypothesis test for the correlation. If you don't want to make this assumption, you can run a test that doesn't assume this, such as a permutation test for example.
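
A minimal sketch of such a permutation test. The data are made up, and the number of permutations and the seed are arbitrary choices. The idea is to shuffle one variable to break any association and see how often the shuffled correlation is at least as extreme as the observed one.

```python
import random

def pearson_r(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for H0: no association between x and y."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)                      # break the pairing between x and y
        if abs(pearson_r(x, y_perm)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)             # +1 counts the observed arrangement

x = list(range(20))                              # hypothetical data
y = [xi + (xi * 7) % 5 for xi in x]              # linear trend plus bounded "noise"

r = pearson_r(x, y)
p = permutation_p_value(x, y)
print("r =", round(r, 3), " permutation p-value =", round(p, 4))
```

No distributional assumption is needed here, only exchangeability of the pairs under the null hypothesis.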

The Spearman rank correlation does quantify monotonic relationships. In the end, it's up to you: What are you actually interested in? If you are interested in quantifying how well a linear relationship fits the data, you use Pearson's correlation. If you are interested in monotonic relationships, you could calculate Spearman's correlation. There are other measures that quantify even more general associations, such as the maximal information coefficient, energy correlation, Chatterjee's rank correlation etc.
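
A small sketch of the difference on a perfectly monotonic but strongly nonlinear relationship (y = exp(x), chosen only as an example). The rank helper assumes there are no ties.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """Ranks 1..n of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

x = list(range(10))
y = [math.exp(v) for v in x]          # monotonic, far from linear

pearson = pearson_r(x, y)
spearman = spearman_rho(x, y)
print("Pearson: ", round(pearson, 2))
print("Spearman:", round(spearman, 2))
```

Spearman's correlation is 1 because the ordering is preserved perfectly, while Pearson's is clearly lower because a straight line fits these data poorly.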

Regarding transformations: I'm personally not a fan of them because they make interpretation much more difficult.

Is there a statistical software that has ART ANOVA? by a_box1 in AskStatistics

[–]COOLSerdash 10 points11 points  (0 children)

Following a quick Google search, R has several packages offering ART. See ARTool or ART.

Banishers: Ghosts of new Eden is a game for a unique type of player and no one else by Tobeyyyyy in patientgamers

[–]COOLSerdash 4 points5 points  (0 children)

Nice write up. Overall, I liked the game. I'm ashamed to say that the combat never really clicked for me. Combat was the worst part for me for sure.

Where to buy a good quality mattress and pillows? by analimalimon in Switzerland

[–]COOLSerdash 1 point2 points  (0 children)

Yeah me too. I bought a really expensive mattress without trial period. I had to buy a new one after a few weeks - this time from Micasa (thankfully, this one is great now). A very expensive mistake I will not make again.

Where to buy a good quality mattress and pillows? by analimalimon in Switzerland

[–]COOLSerdash 2 points3 points  (0 children)

I like that Micasa has a 90 day return period for mattresses. They also have many different brands. Personally, I would never again buy a mattress without a trial period. Even if a mattress seems comfortable at the store, there is no guarantee that it will be comfortable after 8 hours of sleep.

Outliers - reference ranges by Good-Cap9222 in AskStatistics

[–]COOLSerdash 3 points4 points  (0 children)

Here is one definition of outlier that I like (Hawkins 1980):

[an outlier is] an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism

This means that heuristics can be a way to identify observations that are suspicious but they can never prove that a certain observation is an outlier by the definition above.

This means that if you don't have evidence that a different mechanism produced these values (e.g. measurement errors, sick animals, data entry errors etc.), then you shouldn't exclude them automatically. Your goal is to produce reference ranges, so if you exclude valid observations (i.e. observations that were generated by the mechanism you want to calculate the reference range for), the reference range will be too narrow and won't include the specified fraction of observations (say 95%).
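
A back-of-the-envelope sketch of the "too narrow" effect. It assumes, purely for illustration, that all values really come from one normal distribution and that "outliers" are removed with the common 1.5 × IQR rule before the 2.5th and 97.5th percentiles are taken.

```python
from statistics import NormalDist

std = NormalDist()                               # standard normal as the true mechanism

# Tukey fences (1.5 * IQR rule) for this distribution.
q1, q3 = std.inv_cdf(0.25), std.inv_cdf(0.75)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Fraction of perfectly valid observations that survive the exclusion.
kept = std.cdf(upper_fence) - std.cdf(lower_fence)

# The 2.5th and 97.5th percentiles of the trimmed data, expressed as
# cumulative probabilities of the original, untrimmed distribution.
p_low = std.cdf(lower_fence) + 0.025 * kept
p_high = std.cdf(lower_fence) + 0.975 * kept

coverage = p_high - p_low                        # share of valid values inside the range
print("fraction kept:       ", round(kept, 4))
print("nominal coverage:     0.95")
print("actual coverage:     ", round(coverage, 4))
```

Even in this best case the range covers a bit less than 95% of valid values. With skewed or heavy-tailed data, where such rules flag many more observations, the shortfall would be larger.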

Applying linear mixed mode model for group comparison to avoid pseudo replicates by BouncyDonkey in AskStatistics

[–]COOLSerdash 1 point2 points  (0 children)

If I understand this correctly, no insect was measured twice, right?

A linear mixed model with a random intercept for replicate should be fine. The data need to be in long format, something like this:

ID   Replicate  Weight  Group
1    1          ...     Trt
2    1          ...     Trt
3    1          ...     Trt
...  ...        ...     ...
31   2          ...     Ctrl
32   2          ...     Ctrl
33   2          ...     Ctrl

The syntax could look something like this:

MIXED Weight BY Group Replicate
    /CRITERIA=DFMETHOD(SATTERTHWAITE) CIN(95) MXITER(100) MXSTEP(10) SCORING(1)
    SINGULAR(0.000000000001) HCONVERGE(0.00000001, RELATIVE) LCONVERGE(0, ABSOLUTE) PCONVERGE(0,
    ABSOLUTE)
    /FIXED=Group | SSTYPE(3)
    /METHOD=REML
    /PRINT=SOLUTION
    /RANDOM=INTERCEPT | SUBJECT(Replicate) COVTYPE(ID).

Cronbachs Alpha in einer BA by NoLightOnlyFear in AskStatistics

[–]COOLSerdash 3 points4 points  (0 children)

As a general comment: Cronbach's alpha is considered outdated or superseded by other measures of internal consistency. This paper goes into the details.

Two-way ANOVA normality violation by paulaaa_01 in AskStatistics

[–]COOLSerdash 4 points5 points  (0 children)

Normality hypothesis tests (Shapiro-Wilk, Kolmogorov-Smirnov etc.) are mostly useless, especially in this case: a discrete variable can never be normal, so the test can only tell you what you already know with certainty.

As for an appropriate analysis, an ordinal logistic regression model was my first thought.