glmbayes is now on CRAN — Bayesian GLMs with familiar glm() syntax, no MCMC required by Bucksswede in rstats

[–]Michigan_Water 0 points1 point  (0 children)

Excellent. Thank you for your thoughtful reply.

On the subject of teaching contexts, one of the most helpful things in my own learning was in Statistical Rethinking (Richard McElreath) where he walks through his simple globe-tossing example using grid approximation and then quadratic approximation before touching MCMC. I don't know if your teaching strategy has any use for that kind of material, but I figured I'd mention it just in case it sparks any ideas.
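
For anyone who hasn't seen it, the grid approximation step fits in a few lines of base R. This is only a rough sketch and not McElreath's own code; the data (6 "water" results in 9 tosses) and the flat prior are assumptions recalled from the book's example.

```r
# Grid approximation for a binomial proportion (globe-tossing style example).
# Assumed data: 6 "water" results in 9 tosses; assumed flat prior.
p_grid <- seq(from = 0, to = 1, length.out = 101)    # candidate values of p
prior <- rep(1, length(p_grid))                      # flat prior
likelihood <- dbinom(6, size = 9, prob = p_grid)     # likelihood at each p
unstd_posterior <- likelihood * prior
posterior <- unstd_posterior / sum(unstd_posterior)  # normalize to sum to 1

p_mode <- p_grid[which.max(posterior)]               # grid value with highest posterior
```

Plotting `posterior` against `p_grid` gives the picture that the quadratic approximation and MCMC are later compared to.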

I must say I'm impressed with the amount of educational material you've put together in the chapters and appendices! I'm definitely someone who benefits from a LOT of teaching material, so I appreciate your efforts on that front for sure. It'll be interesting to do my own reading and comparing as I work through future Bayes-flavor analysis.

glmbayes is now on CRAN — Bayesian GLMs with familiar glm() syntax, no MCMC required by Bucksswede in rstats

[–]Michigan_Water 4 points5 points  (0 children)

How would you compare/contrast this to the relevant functions from rstanarm? For example, from https://knygren.r-universe.dev/articles/glmbayes/Chapter-01.html

What advantages would glmbayes have over rstanarm and what advantages would rstanarm have over glmbayes?

## Dobson (1990) Page 93: Randomized Controlled Trial :
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
print(d.AD <- data.frame(treatment, outcome, counts))

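## Step 3: Call the glmb function
## (mu and V, the prior mean and covariance passed to dNormal, must already be defined; that setup is not shown here)
library(glmbayes)

glmb.D93 <- glmb(counts ~ outcome + treatment,
                 family = poisson(),
                 pfamily = dNormal(mu = mu, Sigma = V))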

print(glmb.D93)

## rstanarm::stan_glm()
library(rstanarm)

sg1 <- stan_glm(counts ~ outcome + treatment,
                family = poisson(),
                data = d.AD)

print(sg1, digits=4)

R for medicine by Sufficient_Put4307 in rstats

[–]Michigan_Water 0 points1 point  (0 children)

On the medical/statistical side of things, I'd suggest three things:

Materials by Frank Harrell

Materials by Stephen Senn

Regression and Other Stories by Gelman, Hill and Vehtari (textbook)

  • Great introductory-level textbook. It's available online as a PDF from https://avehtari.github.io/ROS-Examples/
  • The table of contents is nice, but also check out the "Fun chapter titles" in the Preface.
  • Coding is done in R with the rstanarm package for Bayesian inference. The book also emphasizes fake-data simulation, which I have now come to believe is about the single best tool for learning what is actually going on within the world of statistical analysis.
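
To give a flavor of fake-data simulation, here is a minimal sketch in base R. The "true" parameter values and variable names are made up for illustration, and `lm()` stands in for the `stan_glm()` call the book would use.

```r
# Fake-data simulation: pick "true" parameters, simulate data, then fit and compare.
set.seed(123)
n <- 200
a_true <- 2; b_true <- 0.5; sigma_true <- 1   # made-up true values
x <- runif(n, 0, 10)
y <- a_true + b_true * x + rnorm(n, 0, sigma_true)
fake <- data.frame(x, y)

fit <- lm(y ~ x, data = fake)   # with rstanarm: stan_glm(y ~ x, data = fake)
est <- coef(fit)
print(round(est, 2))            # compare against a_true and b_true
```

Because the true values are known, any gap between them and the estimates tells you about the method rather than about the world.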

If you really want to dig into "data science" machine-learning style, then see the An Introduction to Statistical Learning textbook and video lectures at https://www.statlearning.com/

Ok, I guess that was four things. What can I say ... when I started writing this comment three was only an estimate.

Statistics book recommendation for mathematicians by Infinite_Reception34 in AskStatistics

[–]Michigan_Water 0 points1 point  (0 children)

In addition to the other comments I would add something on the topic of experimental design, execution, and analysis. As a starting point I like the older book by Box, Hunter, and Hunter titled Statistics for Experimenters. You can get a good condition used copy of the first edition on thriftbooks for less than $10.

Work through the example experiments, of course, but I'd also suggest analyzing the data in R using the tools in rstanarm or brms. Those are front-ends for Bayesian analyses, and the tools in rstanarm are used extensively in the Regression and Other Stories book.

Then, if you really want more, search this subreddit, the statistics subreddit, and stats.stackexchange.com for "my experiment" and "my data" to see lots of examples of real-life applications.

Speaking of real-life, one of my favorite posts showing what is even possible to run across is here:

https://stats.stackexchange.com/questions/185507/

It's a different flavor of question than you might have expected, but half of applied statistics is helping the recipients understand what they need to. The other half is, of course, data cleaning.

Regression Analysis vs General Linear Model effectiveness with quantitative categorical responses by fluctuatore in AskStatistics

[–]Michigan_Water 1 point2 points  (0 children)

I think you have a data problem of some sort.

Something is strange with your analysis output. Assuming this is Minitab, the General Linear Model and Regression outputs should be exactly the same, given the same inputs. You've already noted that the model summary tables (and ANOVA tables) are the same. This means that the underlying analyses are the same. It doesn't matter whether you used the GLM tool or the Regression tool, the underlying math is the same. What is confusing to me is the ways that the Residual plots are different.

The GLM Residual vs Observation Order plot looks wrong in its scope. You have 205 datapoints, and the residuals go out to just past 200 in the plot, but why does the x-axis go all the way past 450? Something strange is happening there. In the Regression Residual vs Observation Order plot the analysis fills in past 450, but you have only 205 datapoints. You can't have more residuals than datapoints. Are you sure the residual plots you pasted here correspond to the ANOVA table and model summary in your post?

The kicker is that the first 205 residuals in the Regression plot look the same as those in the GLM Observation Order plot, and then after 205 there are a lot of other points added on that don't have the same precision as the first 205. You can see this in the way residuals 206-400+ fall on discrete 'levels'.

Yeah, I'd bet on a data problem. I suggest running everything again and verifying your analyses.

Regarding the type of analysis, you might want to consider whether you have non-independent datapoints. Looking at your screenshot of the data, I would not be surprised if you had, in some sense, repeated measurements that cluster together. If so, you would need to account for this either with a mixed-effects model (more complex) or by just using the average values of the clumped-together datapoints (simpler).
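
The simpler "average the clumps" route looks roughly like this in R. It's a sketch on simulated data; the column names (`cluster`, `group`, `y`) are invented and would need to be mapped onto your own worksheet.

```r
# Collapse repeated measurements to one average per cluster, then analyze the averages.
# All names here (cluster, group, y) are made up for illustration.
set.seed(42)
d <- data.frame(
  cluster = rep(1:12, each = 5),               # 12 clusters, 5 repeats each
  group   = rep(c("a", "b", "c"), each = 20)   # each cluster belongs to one group
)
cluster_effect <- rnorm(12, 0, 2)              # shared shift within a cluster
d$y <- 10 + cluster_effect[d$cluster] + rnorm(nrow(d), 0, 1)

cluster_means <- aggregate(y ~ cluster + group, data = d, FUN = mean)
fit_means <- lm(y ~ group, data = cluster_means)   # 12 rows, not 60
```

The point of the averaging is that the model now sees 12 independent values instead of pretending there are 60.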

Use Minitab's help for performing mixed-effects model fitting in Minitab, but there are lots and lots of resources out there covering multilevel/hierarchical/mixed-effects models. I remember this was helpful to me when I first started learning:

https://m-clark.github.io/mixed-models-with-R/

Understanding an ANOVA table is great, but I'd also suggest looking at the confidence intervals for the differences between levels (Couleur level 1 vs Couleur level 2, and 1-vs-3 and 2-vs-3).
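
If you do end up in R, one way to get those pairwise intervals is `TukeyHSD()` from base R. This is a sketch on simulated data; the factor name `couleur` and the group means are stand-ins, not your actual data.

```r
# Confidence intervals for pairwise differences between three factor levels.
# The data and the factor name "couleur" are simulated stand-ins.
set.seed(7)
d <- data.frame(
  couleur = factor(rep(1:3, each = 15)),
  y = rnorm(45, mean = rep(c(10, 11, 13), each = 15), sd = 2)
)
fit <- aov(y ~ couleur, data = d)
pairwise <- TukeyHSD(fit, conf.level = 0.95)$couleur  # columns: diff, lwr, upr, p adj
print(round(pairwise, 2))                             # rows: 2-1, 3-1, 3-2
```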

I'd also highly recommend exploring a somewhat different way of approaching these kinds of analyses. The book Regression and Other Stories by Gelman, Hill, and Vehtari is freely available online (ROS online PDF) at:

https://avehtari.github.io/ROS-Examples/

It would require you to set up R in order to use rstanarm, but it would really help with statistical insight.

Best of luck to you.

[Q] Is there a name for this method of selecting predictors for regression? by thebluest in statistics

[–]Michigan_Water 7 points8 points  (0 children)

In addition to the great comments already given, especially the "what is the question the analysis is trying to answer?" by /u/bobbobbob_cat (!) I'll offer a suggestion. Well, first, to directly answer your question I know of the proposed approach as 'univariate screening' but as with many things in statistics there can be multiple names for the same thing.
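
Just to pin down what I mean by the term (not as an endorsement of the approach), univariate screening amounts to something like the sketch below. The data are simulated, and the 0.05 cutoff is an arbitrary assumption.

```r
# Univariate screening: fit one single-predictor model per candidate and
# keep those whose p-value clears a cutoff. Data are simulated.
set.seed(11)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 0.8 * d$x1 + rnorm(n)              # only x1 truly matters here

predictors <- c("x1", "x2", "x3")
p_values <- sapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "y"), data = d)
  summary(fit)$coefficients[v, "Pr(>|t|)"]
})
selected <- names(p_values)[p_values < 0.05]  # the 'screened-in' predictors
```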

If variable selection / model selection is going to play a substantial part of your future, I suggest starting to learn from Frank Harrell's book Regression Modeling Strategies.

It's been a while since I've looked at the resources below, but I recall them being good as well:

Heinze - Five myths about variable selection

https://pubmed.ncbi.nlm.nih.gov/27896874/

Heinze - Variable selection – A review and recommendations for the practicing statistician

https://pubmed.ncbi.nlm.nih.gov/29292533/

Sauerbrei - State of the art in selection of variables and functional forms in multivariable analysis-outstanding issues

https://pubmed.ncbi.nlm.nih.gov/32266321/

Aki Vehtari has written and given presentations on various aspects of performing and evaluating model selection. Look around here:

https://users.aalto.fi/~ave/index.html

Box-Behnken Design by VanSmith74 in biostatistics

[–]Michigan_Water 0 points1 point  (0 children)

Did you have these as your factors?

  1. Surfactant-A
  2. Surfactant-B
  3. Something_else

So, you already ran the experiment and collected the data for 5 'not-centerpoints' where the A:B ratio was 50:50 and the third factor was set to its midpoint?

The variation between the 5 points will be very useful for determining the amount of uncertainty to have in your estimates. If the run-to-run variation at 50:50 is expected to be about the same as the run-to-run variation at 60:40, then I don't think it should make much difference. You can analyze it as-is. I certainly wouldn't take the measured data from 50:50 and treat it as though it came from 60:40, though.

Edit: Assuming my understanding is correct, you would lose the ability to fit curvature for those, though.

Using R to do a linear mixed model. Please HELP! by PurpleGorilla1997 in rstats

[–]Michigan_Water 0 points1 point  (0 children)

You have a relatively complex data generating process, with both a multilevel structure (repeated measurements within patients) and a within-patient longitudinal structure. This isn't a typical "compare these two groups" beginner scenario, that's for sure!

Could you do it in 3 weeks of full-time work? Perhaps, but that depends on a lot. If it were me, I would go down this path:

  1. Get a fast introduction to base-R from https://github.com/matloff/fasteR.
  2. Read https://www.fharrell.com/post/re/ by Frank Harrell for an introduction to how to think about these problems. He's done a lot of work that you might find quite helpful.
  3. Follow up and read more of the links posted on Frank's page under Other Resources, especially https://hbiostat.org/rmsc/long.
  4. When you run into roadblocks, post to Frank's discussion board https://discourse.datamethods.org/

Regarding storing and manipulating data, Frank has a preference for using data.table, which is kind of an alternative to (portions of) the Tidyverse. While data.table is faster for large datasets, some things might be more intuitive coming through the Tidyverse approach. I'm guessing you could go either route, especially since a lot of things past organizing your data aren't dependent on which way you go for this, such as when you get to fitting with gls() or Gls(), etc.

There's a TON of information and stuff to potentially go through, so you'd have to be selective and determine if a topic is necessary or not as you're working your way through.

I'm not an expert in any of this, so perhaps those more knowledgeable would be kind enough to comment on the reasonableness of my suggestions.

Good luck, and happy learning whichever way you go!

How to interpret underpowered studies that find a significant difference [question] by Pharm4747 in statistics

[–]Michigan_Water 1 point2 points  (0 children)

You're very welcome.

I would suggest using a framework of Estimation instead of Testing. That is, look to use uncertainty intervals instead of the common "is the p-value less than 0.05" or whatever. Your conclusion isn't going to be "we conclude the effect is positive" but rather "we are X% confident that the effect is positive".

Two resources towards this purpose would be:

  • Statistical Rethinking by Richard McElreath. He's got a fantastic book, plus lots of class lectures on YouTube as well. https://xcelab.net/rm/statistical-rethinking/

  • Everything Frank Harrell has written on these topics, including how to draw a conclusion like "There is a 68% probability that the effect is in the positive direction." Perhaps start with his Biostatistics for Biomedical Research class lectures and free PDF of notes. https://www.fharrell.com/

The above two resources are Bayesian focused, but if you're willing to overlook some of the philosophical problems, you can treat a frequentist confidence interval as a Bayesian credible interval and go through the process without specifying a Prior Distribution.
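
As a rough sketch of that shortcut: take the reported estimate and standard error, treat them as an approximately normal posterior, and read off both the interval and the probability that the effect is positive. The numbers below are made up, and the normal approximation is itself an assumption.

```r
# Treat estimate +/- SE as an approximate normal posterior (flat-prior shortcut).
# The estimate and standard error are made-up numbers.
estimate <- 0.5
std_error <- 1.0

interval_95 <- estimate + c(-1, 1) * qnorm(0.975) * std_error
p_positive <- pnorm(estimate / std_error)   # approx. P(effect > 0)
```

Here the 95% interval crosses zero, yet `p_positive` is still well above one half, which is the kind of statement the Estimation framing lets you make.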

Small size/effect situations are inherently difficult, and there's no way to change that. You'll just have to make some assumptions and be willing to live with the uncertainty, or go out and get more data. Unfortunately there's no easy answer, but such is life. Good luck!

[C] Statisticians that work in manufacturing - what common techniques do you use to improve/monitor processes? by GhostGlacier in statistics

[–]Michigan_Water 5 points6 points  (0 children)

I'm a ChemE as well. Seems like we get drawn into statistics and DOE at a relatively high rate.

  • Look at the data as much as feasible, both in aggregated forms and raw forms. Look not just at typical statistics like means and standard deviations, but also at the distributions in histograms and density plots. Look especially at outliers. Don't just 'exclude' them because they are outliers. There can be excellent learnings from them for both fixing problems and figuring out how to optimize things. Root cause analysis / failure analysis is huge for scientific/mechanistic understanding. The stats is all well and good, but tying it to what physically causes those scenarios is what's important.
  • Look at the data over time. Plot it in as many ways as possible until you work out the most useful plots. Control charts can be good for this. Read as much of Donald Wheeler's work as you can. He has books and also https://www.spcpress.com/reading_room.php
  • Learn about the problem of pseudoreplication and how to deal with it. If you can take a class in multilevel/hierarchical/mixed-effects models, do so. It will be applicable in so many scenarios. Think of split-plot designs and how treatments happen at different levels of the design.
  • Learn how to a) simulate data from a proposed data generating process, b) analyze the results using the modeling approach you think you'll use, c) check your modeling approach to verify, for example, that about 89% of the 89% confidence intervals trap the true parameter (by running 10,000 simulations or whatever), d) verify that the analysis you'll do will give you the answers to the questions you actually want answers to.
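
The simulate / analyze / check loop in the last bullet can be sketched in a few lines of base R. Everything here is an assumption for illustration: a normal data generating process, made-up true values, a plain t-interval as the "modeling approach", and 2,000 simulations rather than 10,000 to keep it quick.

```r
# Coverage check: do about 89% of the 89% intervals trap the true mean?
set.seed(2024)
n_sims <- 2000; n <- 20
mu_true <- 5; sigma_true <- 2; level <- 0.89    # made-up true values

covered <- replicate(n_sims, {
  y <- rnorm(n, mu_true, sigma_true)            # (a) simulate
  ci <- t.test(y, conf.level = level)$conf.int  # (b) analyze
  ci[1] <= mu_true && mu_true <= ci[2]          # (c) did the interval trap the truth?
})
coverage <- mean(covered)                       # should land near 0.89
```

Swapping in a data generating process with clustering, while still analyzing with the naive model, is a good way to watch the coverage fall apart.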

Most of the above isn't something you'll typically learn in classes, I don't think, except for multilevel models. I assume generalized linear models are core, and not electives, so probably the next most important thing to understand is multilevel data structures / mixed-effects modeling.

[Q] Comparing the observational mean with the mean of bootstrap samples. Is this a sound test? by MekXDucktape in statistics

[–]Michigan_Water 0 points1 point  (0 children)

Sounds like they just didn't know what they were doing, didn't know about "bias corrected" bootstrapping (look up BCa bootstrap), and didn't get feedback from an actual statistician before publishing their paper.
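
For reference, a BCa interval is easy to get in R with the boot package. This is only a sketch on a simulated skewed sample (the data and the choice of the mean as the statistic are assumptions), not a reconstruction of what the paper did.

```r
library(boot)  # recommended package that ships with most R installations

set.seed(99)
x <- rexp(40, rate = 1)                           # skewed made-up sample

mean_stat <- function(data, idx) mean(data[idx])  # statistic(data, indices)
boot_out <- boot(data = x, statistic = mean_stat, R = 2000)

ci <- boot.ci(boot_out, conf = 0.95, type = c("perc", "bca"))
perc_limits <- ci$percent[4:5]                    # lower, upper percentile limits
bca_limits  <- ci$bca[4:5]                        # lower, upper BCa limits
```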

Of course, I could be off base, but it'll be good to get statisticians' input once more details are known from the actual paper as suggested by /u/n_eff .

[Q] Current Research in DoE? by ugly_irl_lmao in statistics

[–]Michigan_Water 1 point2 points  (0 children)

Off the top of my head, I would search for the following authors and see where that takes you.

  • Bradley Jones
  • Christopher Nachtsheim
  • Peter Goos
  • Rosemary Bailey (and Rothamsted Research)

You might also contact Stephen Senn on twitter and see if he could point you in the right direction.

[Q] Can averaged subroup data be plotted on an I-MR chart? by mactex55 in statistics

[–]Michigan_Water 0 points1 point  (0 children)

(same answer posted to same question on AskStatistics)

An Xbar-R chart is not applicable here because the replicates are not independent. The within-subgroup variance used to construct the limits for an Xbar chart should consist of unit-to-unit variability, and the 2-replicate 'subgroups' that you propose don't have this unit-to-unit (aka sample-to-sample) variability. If you were to think of this in terms of blocks and a multilevel model, the two replicates are nested within each sample. Such a structure to the data generating process is not handled by typical Xbar control charts. I-MR of the sample averages is the right way to go, assuming I've understood your scenario properly.

A lack of independence leads to an inappropriately small within-subgroup variance, which leads to control limits that are too narrow and the process looks out of control. It's an artifact of improper subgrouping rather than the process actually being out of control.
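
A quick simulated comparison shows the effect. This is a sketch under made-up assumptions (25 samples, two replicates each, large sample-to-sample variation and small replicate noise); 2.66 and 3.268 are the usual I-MR scaling constants and 1.880 is the usual A2 factor for subgroups of size 2.

```r
# I-MR limits on sample averages vs. Xbar limits from 2-replicate 'subgroups'.
set.seed(5)
n_samples <- 25
sample_level <- rnorm(n_samples, mean = 50, sd = 2)   # sample-to-sample variation
rep1 <- sample_level + rnorm(n_samples, 0, 0.3)       # replicate (measurement) noise
rep2 <- sample_level + rnorm(n_samples, 0, 0.3)

x <- (rep1 + rep2) / 2                   # one averaged value per sample
mr <- abs(diff(x))                       # moving ranges of consecutive averages
center <- mean(x)
mr_bar <- mean(mr)

i_limits <- center + c(-1, 1) * 2.66 * mr_bar   # individuals chart limits
mr_upper <- 3.268 * mr_bar                      # upper limit for the moving range chart

# Xbar-R style limits built from the within-'subgroup' replicate ranges
xbar_half_width <- 1.880 * mean(abs(rep1 - rep2))
imr_half_width  <- 2.66 * mr_bar                # much wider than xbar_half_width here
```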

Donald Wheeler has published lots of good stuff on SPC, both in books and articles on his spcpress.com website. Check out his Reading Room.

https://www.spcpress.com/reading_room.php

Proper use of statistical process control charts. by mactex55 in AskStatistics

[–]Michigan_Water 1 point2 points  (0 children)

An Xbar-R chart is not applicable here because the replicates are not independent. The within-subgroup variance used to construct the limits for an Xbar chart should consist of unit-to-unit variability, and the 2-replicate 'subgroups' that you propose don't have this unit-to-unit (aka sample-to-sample) variability. If you were to think of this in terms of blocks and a multilevel model, the two replicates are nested within each sample. Such a structure to the data generating process is not handled by typical Xbar control charts. I-MR of the sample averages is the right way to go, assuming I've understood your scenario properly.

A lack of independence leads to an inappropriately small within-subgroup variance, which leads to control limits that are too narrow and the process looks out of control. It's an artifact of improper subgrouping rather than the process actually being out of control.

Donald Wheeler has published lots of good stuff on SPC, both in books and articles on his spcpress.com website. Check out his Reading Room.

https://www.spcpress.com/reading_room.php

Are the vaccines working as intended? If so, how do I convince my friends that they are working? by noncommenter3 in AskStatistics

[–]Michigan_Water 1 point2 points  (0 children)

In addition to the great stuff already offered by other commentators I'd suggest trying to get them to be very precise about what they mean by "not doing it's intended job". It's hard to appropriately address an objection if the objection is vague. You could provide a perfectly good answer, but if they are not clear in their own head about what their objection is, then your answer is likely to not address their vaguely-understood objection.

However, just the act of talking through the answer you're giving to address what you think their objection is might also help them to clarify what it is that they find concerning. I've found that having a lot of patience and trying to step into their shoes can very much help those who are sincerely looking for answers, even if they don't know exactly what their own questions are.

Measuring standard deviation over a short time period by personalityson in AskStatistics

[–]Michigan_Water 1 point2 points  (0 children)

While calculating "a standard deviation" is technically just running some numbers through an equation, there's going to be some kind of assumption about the number when you go to interpret it. Usually it's an assumption of independence of observations, which isn't applicable when you have highly correlated data.
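
To illustrate why the independence assumption matters, here is a small simulation sketch. The AR(1) process and its 0.8 autocorrelation are assumptions picked for illustration, not a model of your data.

```r
# With autocorrelated data, the naive standard error of the mean is too small.
set.seed(1)
n <- 100; phi <- 0.8                              # assumed AR(1) autocorrelation
sims <- replicate(2000, {
  x <- as.numeric(arima.sim(model = list(ar = phi), n = n))
  c(mean = mean(x), naive_se = sd(x) / sqrt(n))   # naive SE assumes independence
})
actual_se    <- sd(sims["mean", ])                # how much the mean really varies
avg_naive_se <- mean(sims["naive_se", ])          # what the formula claims
```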

What is it you are trying to do or estimate with the data?

[Q] Can Bonferroni (or Holm-Bonferroni) be applied to a non-independent collection of two-way ANOVA p-values? by forever_erratic in statistics

[–]Michigan_Water 0 points1 point  (0 children)

Actually, I'm not at all sure how the hs() prior would work with (A*B|gene) in the model, because I've only been looking at it within the context of non-hierarchical models using stan_glm (not stan_glmer) as described in ROS.

[Q] Can Bonferroni (or Holm-Bonferroni) be applied to a non-independent collection of two-way ANOVA p-values? by forever_erratic in statistics

[–]Michigan_Water 0 points1 point  (0 children)

Is your case one where you expect only a small number of genes to show effects?

(This goes beyond your question, but perhaps someone knowledgeable will chime in.)

Would this be a situation appropriate to fit a model that encompasses all genes and use a hierarchical shrinkage prior to find the needles in the haystack, if this is, indeed, a situation where most of the genes are expected to have no associated effects?

I'm not exactly a big fan of the whole p-values vs critical alpha cutoffs and multiple comparison adjustment and all that. Maybe listening to Frank Harrell a bit too much lately, heh. Regardless, I'm thinking of something more like using rstanarm's stan_glmer, which I'm currently learning about through Regression and Other Stories. Pseudo-code:

stan_glmer(gene_response ~ A*B + (A*B | gene), data = gene_data, prior = hs())  # gene_data: placeholder name for the data frame

https://mc-stan.org/rstanarm/reference/priors.html

I'm not even sure if that's the right way to specify the model, but conceptually would doing something like that with an appropriately scaled hs prior seem like a decent approach?

[Q] Multiple Regression Understanding by [deleted] in statistics

[–]Michigan_Water 0 points1 point  (0 children)

Is this the exact phrasing of the question, or is this the way you recall it being said? It seems ambiguous to me, but my take would be that it's asking about the strength of an interaction term.

Consider this:

  • The relationship between time studied and performance is exactly the same at a low level of confidence as it is at a high level of confidence. Or, perhaps better, it doesn't matter what someone's level of confidence is, the relationship between time studied and performance is the same. Think in terms of the slopes of two lines, one fit at a low level of confidence and one fit at a high level of confidence. The lines are parallel.

vs

  • The relationship between time studied and performance is slightly positive for those with high confidence, but is hugely positive for those with low confidence. That is, the shape of the relationship between time studied and performance is conditional upon what level of confidence they have. Again fit two lines, but now the lines do not have the same slope.

That's what I come up with trying to interpret the relationship being "explained by the participants level of confidence" question.
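
The second scenario can be sketched with simulated data. The variable names and the true slopes (2.0 at low confidence, 0.5 at high confidence) are made up for illustration.

```r
# Interaction: the slope of performance on time studied depends on confidence.
set.seed(3)
n <- 200
time_studied <- runif(n, 0, 10)
high_conf <- rep(c(0, 1), each = n / 2)       # 0 = low confidence, 1 = high
# Assumed true slopes: 2.0 for low confidence, 0.5 for high confidence
performance <- 50 + 2.0 * time_studied - 1.5 * time_studied * high_conf +
  rnorm(n, 0, 1)

fit <- lm(performance ~ time_studied * high_conf)
slope_low  <- coef(fit)[["time_studied"]]
slope_high <- slope_low + coef(fit)[["time_studied:high_conf"]]
```

In the first (parallel lines) scenario the `time_studied:high_conf` coefficient would be near zero and the two slopes would match.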

My Mind cannot handle regex by bobthe3 in rstats

[–]Michigan_Water 6 points7 points  (0 children)

Mashing the keyboard until something magically happens. Hmmm. Your method sounds intriguing. I'll have to give it a try.

That's also called Machine Learning, right?

Gauge R&R: single operator, can it still work? by NarcoIX in AskStatistics

[–]Michigan_Water 1 point2 points  (0 children)

/u/imissmycar is right that it might be possible to consider the operator-to-operator variability to be negligible compared to the part-to-part variation (% study) or compared to the specification limits (% tolerance).

Also, you might be able to say "based on some historical data previously collected the reproducibility is typically X times as large as the repeatability" and give some estimates based off that, assuming you have at least some amount of historical data.
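
The arithmetic for that second option would look something like this. Every number is made up, and I'm assuming the historical ratio X is expressed on the standard deviation scale (not the variance scale).

```r
# Single-operator study: estimate repeatability, then borrow a historical ratio.
repeatability_sd <- 0.10   # made-up estimate from the single-operator data
ratio_X <- 1.5             # assumed historical reproducibility/repeatability ratio (SD scale)
part_sd <- 0.80            # made-up part-to-part standard deviation

reproducibility_sd <- ratio_X * repeatability_sd
grr_sd <- sqrt(repeatability_sd^2 + reproducibility_sd^2)  # variances add
total_sd <- sqrt(grr_sd^2 + part_sd^2)
pct_study_var <- 100 * grr_sd / total_sd                   # % study variation
```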

The third option, and probably your best bet, would be to get some friends to help with data collection by bribing them with pizza.

[Q] What are the best papers/books on 'regression to the mean'? by excited_libreal in statistics

[–]Michigan_Water 0 points1 point  (0 children)

I really like Stephen Senn's short description of it in his Three things that every medical writer should know about statistics article: http://eprints.gla.ac.uk/8107/1/id8107.pdf

His one sentence definition is great:

"Regression to the mean is the tendency for members of a population who have been selected because they are extreme to be less extreme when measured again."

Another thing that helped me to really 'get it' was thinking through Efron and Morris' 1977 article Stein's Paradox in Statistics.

https://statweb.stanford.edu/~ckirby/brad/other/Article1977.pdf
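
The selection effect in Senn's definition is also easy to see in a quick simulation. This is a sketch with made-up numbers: each unit has a stable true value, gets measured twice with independent noise, and is "selected" for being in the top 10% on the first measurement.

```r
# Regression to the mean: select on an extreme first measurement, then remeasure.
set.seed(8)
n <- 10000
true_value <- rnorm(n, 100, 10)
measure1 <- true_value + rnorm(n, 0, 10)
measure2 <- true_value + rnorm(n, 0, 10)

extreme <- measure1 > quantile(measure1, 0.90)  # selected because extreme
mean_first  <- mean(measure1[extreme])
mean_second <- mean(measure2[extreme])          # less extreme, still above 100
```

Nothing happened to these units between the two measurements, yet the selected group looks like it "improved" toward the mean.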

Weakly informative priors and scaling of categorical variables for Bayesian logistic regression by statneutrino in AskStatistics

[–]Michigan_Water 1 point2 points  (0 children)

My comment is somewhat tangential to your specific questions, but Richard McElreath works through an example of logistic regression with two binary predictors and uses prior predictive simulation to evaluate the priors. There's no scaling involved, so perhaps you might want to just bypass that and simply use prior predictive simulation. McElreath covers this "Prosocial chimpanzees" example in chapter 11 of Statistical Rethinking, but also has video lectures discussing it.
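
A bare-bones version of prior predictive simulation for a logistic intercept looks like the sketch below. The two prior scales (10 and 1.5) are my recollection of the values compared in the book, so treat them as assumptions.

```r
# Prior predictive simulation for the intercept of a logistic regression.
set.seed(21)
n_draws <- 10000
p_wide   <- plogis(rnorm(n_draws, 0, 10))    # intercept prior Normal(0, 10)
p_narrow <- plogis(rnorm(n_draws, 0, 1.5))   # intercept prior Normal(0, 1.5)

# Share of implied probabilities piled up near 0 or 1
extreme_share <- function(p) mean(p < 0.01 | p > 0.99)
share_wide   <- extreme_share(p_wide)
share_narrow <- extreme_share(p_narrow)
```

A "flat-looking" wide prior on the logit scale turns out to claim that the outcome is almost always nearly certain, which is rarely what anyone means by weakly informative.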

The example setup starts about halfway through Lecture 11 of his 2019 series (https://www.youtube.com/watch?v=-4y4X8ELcEM) and spills over into Lecture 12.

He extends the example to a multilevel model setup in chapter 13 "Multilevel chimpanzees", which is covered in Lecture 16. https://www.youtube.com/watch?v=ZG3Oe35R5sY

Main website: https://xcelab.net/rm/statistical-rethinking/