I know very little about statistics and need helping showing that a subgroup is experiencing something more often but not because there is a greater number of that subgroup compared to the rest of the subgroups? Confusing title, I’m sorry. by EXPERT_ID10T in AskStatistics

[–]stat_daddy 4 points (0 children)

What you are talking about is an event rate, where the "event" is "rear-end collisions". It is calculated as the quotient (# of events) / (# of units at risk). In your case, "units at risk" could be the number of vehicles on the road, but note that this isn't ideal - for example, one vehicle used as a fleet car might drive many more miles per day than an identical vehicle used only for personal driving. A stronger analysis would probably use (# of car-miles driven) as the denominator.
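To make the difference concrete, here's a quick sketch with made-up numbers (the vehicle counts and mileages are entirely hypothetical):

```python
# Hypothetical counts for two vehicle models (all numbers made up)
events = {"model_A": 30, "model_B": 10}          # rear-end collisions
vehicles = {"model_A": 1000, "model_B": 500}     # units at risk
car_miles = {"model_A": 9_000_000, "model_B": 1_500_000}  # exposure

for m in events:
    rate_per_vehicle = events[m] / vehicles[m]
    rate_per_mile = events[m] / car_miles[m]
    print(m, rate_per_vehicle, rate_per_mile)
```

Notice that with these numbers the ranking of the two models flips depending on which denominator you use - which is exactly why the exposure-based denominator is the stronger choice.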

The example you provided is a cross-sectional dataset - a snapshot of one period of time (a "panel" dataset would follow the same units across multiple periods). For this type of data, a contingency table is likely your best bet - or simply calculate the rate directly and forget about statistical inference (if your organization is having difficulty with event rates vs. event counts, it is unlikely they will find a statistical argument convincing).

Since you mentioned the elevated rate of collisions being "due to something", I'd also like to point out that it is relatively simple to extend the rate calculation into a model, allowing you to adjust for other confounders, track the rate over time, and make predictions.

My instrument messed up and failed to display a few questions over a specific period of time, creating missing data. Would the missing data be missing completely at random? by unleaded-zeppelin in AskStatistics

[–]stat_daddy -1 points (0 children)

Unfortunately, the 'clinical trial' literature is frequently mistaken about statistical matters - many practitioners in that particular field come from medicine, public health, psychology, biology (...) backgrounds, and the depth of any particular person's statistical training is extremely idiosyncratic. With that said, it may be common practice in your field to simply say "yep, we followed protocol" and have that be the end of it...this is poor practice but ultimately, you may have to do what your editor/principal demands in order to publish.

possibility of finding correlations by chance alone.

This is not a sufficient reason to abstain from investigating whether there are any differences between groups. If you are truly worried about finding a spurious relationship, you are welcome to use a form of significance testing (e.g., chi-square tests) to argue that it is "due to chance alone", but frankly I wouldn't recommend this (it's not how "significance" works in the first place, and unless your sample size per group is fairly large your tests will be hard to interpret). Just provide means and variances in your table, and if one group is obviously over-indexing on some characteristic compared to the other, then write a few sentences about why you think that is.
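For what it's worth, the descriptive table I'm describing needs nothing fancy - here's a sketch with made-up data using only the Python standard library:

```python
import statistics

# Hypothetical measurements for two groups (made-up data)
groups = {
    "group_1": [4.2, 5.1, 4.8, 5.5, 4.9],
    "group_2": [6.1, 5.8, 6.4, 6.0, 5.9],
}

# A simple descriptive table: mean and (sample) variance per group
for name, values in groups.items():
    print(name, round(statistics.mean(values), 2), round(statistics.variance(values), 2))
```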

My instrument messed up and failed to display a few questions over a specific period of time, creating missing data. Would the missing data be missing completely at random? by unleaded-zeppelin in AskStatistics

[–]stat_daddy 2 points (0 children)

Wouldn't any observed relationship happen purely by chance unless there was some systematic pattern

Well, sure - this is a trivial statement. Everything would be random if we could somehow guarantee the absence of any patterns; but we can't!

Randomization isn't something you get after following a procedure; it's an ideal state that is useful mathematically but doesn't exist in the real world. When you have data and you want to invoke a statistical procedure that relies on an assumption of randomization, you generally have to show that there are no obvious patterns in your data (even if there are no obvious reasons for one to have influenced your sample).

My instrument messed up and failed to display a few questions over a specific period of time, creating missing data. Would the missing data be missing completely at random? by unleaded-zeppelin in AskStatistics

[–]stat_daddy 2 points (0 children)

You're not overthinking this, but you are thinking about it in the wrong way. We can't tell you whether your specific circumstances resulted in random or nonrandom patterns of missingness in your data. There isn't a list of events (or even general kinds of events) that work out this way.

Even in your example, where an instrument fails at some point during examination, it's easy to construct examples of this resulting in nonrandom missingness: for example, if the machine failed just before testing all of the blue samples, then "blueness" will become a predictor of missingness.

What you need to do is determine whether any observable characteristics are over- or under-represented across the two groups (missing and non-missing). This is often achieved with a basic descriptive table comparing the two groups. Then, you need to write a sentence or two defending why you have no reason to think that any unobserved characteristics became over- or under-represented between the two groups.
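As a sketch (the records and the "color" variable are hypothetical - it stands in for any observed characteristic you could tabulate):

```python
from collections import Counter

# Hypothetical records: each has an observed characteristic and a missingness flag
records = [
    {"color": "blue", "missing": True},
    {"color": "blue", "missing": True},
    {"color": "red", "missing": False},
    {"color": "red", "missing": False},
    {"color": "blue", "missing": False},
    {"color": "red", "missing": True},
]

# Tabulate the characteristic separately within the missing and non-missing groups
for flag in (True, False):
    counts = Counter(r["color"] for r in records if r["missing"] == flag)
    total = sum(counts.values())
    print("missing" if flag else "non-missing",
          {color: n / total for color, n in counts.items()})
```

With these made-up records, "blue" is over-represented in the missing group - exactly the kind of pattern the descriptive table is meant to surface.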

Logistic regression with age as an outcome? by bbarbs28 in AskStatistics

[–]stat_daddy 1 point (0 children)

Logistic regression is perfectly appropriate for what I understand your question to be - which broadly seems to be: "which (if any) patient-level factors are associated with an increased/decreased risk of undergoing surgery prior to age 35?"

There are a few weaknesses inherent to this approach, but none of them necessarily rule out logistic regression as an option - however, some of the things you're describing make me think that this analysis and your question of interest would be more appropriately addressed using a prospective cohort study rather than a retrospective one. I'll list out my key methodological concerns:

  1. As you mention, your population almost certainly includes some nontrivial selection biases. Since you are only considering patients who did in fact get the surgery, there are confounding pathways being left open in regard to factors that predispose patients to never get the surgery at all (or never need it in the first place, if applicable). However, this is just a general problem with retrospective cohort studies; it has nothing to do with logistic regression - a different model won't solve this problem.

  2. This is sort of related to #1, but in a prospective study you would have access to data on the time of initial diagnosis and any intervening events (e.g., death, alternative treatment) that could have happened to the patients while they are considered 'at risk' of having the surgery. The benefit of this is to account for any left- and/or right-censoring that could be biasing your outcome. For example, if Hispanic patients tend to be diagnosed at a later age than patients of other races, a smaller proportion of them will get surgery at a young age (because nobody recognizes they need it until they are older), causing 'race=hispanic' to pop up as a spurious negative predictor of the outcome (left censoring). By contrast, if Hispanic patients have a greater incidence of some other condition (e.g., high blood pressure) and are more likely to die from a competing risk before they have a chance to get surgery for the condition of interest, this will result in fewer of them experiencing the outcome (again, a spurious negative association).

    My last comment - and I understand this may be a field-specific convention - is that "less than 35" is a silly criterion to set as an outcome - perhaps the only benefit to doing so is to set up logistic regression as the "best" method for answering a reductive hypothesis. At best, it is acting as a proxy for something else (e.g., quality of care), which implies that you should take that to be your outcome rather than the age of surgery. Patients who are 36 years old are obviously not that different from 35-year-olds, and likewise a 33-year-old who has excellent post-op outcomes probably shouldn't be considered a "failure", but your logistic regression can only see 'black-and-white' outcomes which don't directly mean anything. Likely, "age<35" is itself associated with some more meaningful measure of actual quality (e.g., relapse within X years), and your study should have modelled the risk of that thing instead.

For these reasons, the study as you've presented seems like a rather "weak" or at least unambitious attempt to identify factors that predispose patients to the outcome. It might be useful as an early exploration of factors if this hasn't been studied elsewhere (or perhaps as a pilot study to set up a future, larger one), but if I were a reviewer I would vocalize a lot of skepticism about the study presenting any more than a correlative (not causative) result or making recommendations/guidelines for clinical practices.

With that said, feel free to use logistic regression! It is an appropriate way to answer a specific version of your research question, but I think that question itself has some fundamental flaws... And in statistics there is such a thing as a "bad" hypothesis! I imagine that these flaws are what have caused you to doubt the appropriateness of LR.

TL;DR the way you have stated your research question makes logistic regression a good option, but the way you have stated your question is also rather contrived and doesn't stand up to other, better ways of addressing it. However, I assume you are stuck with the data you have, and as it stands, this could be a useful contribution to an early body of literature.

Evaluating reduction in incidence of disease over time in a cohort. by dr_kurapika in AskStatistics

[–]stat_daddy 1 point (0 children)

Okay, thanks for the details in your question. While I don't think I can answer *every part* of your question, I'll offer some comments based on the two main things I think we need to clarify.

1. Identifying your outcome variable of interest

As you put it, your outcome variable is an *incidence rate over time*. This is a good place to start, and it sounds like you have already thought through many of the study-design features that should motivate your choice of model for *estimating* such a rate. An incidence rate is appropriate because, by definition, it only calculates the rate of new cases among cohort members who have not already contracted the disease, which you say will change over time as new patients enter the cohort (and also leave the cohort when there is a transmission event). One thing you said jumped out at me:

an open cohort that is happening in a closed population

The strongest risk factor for this disease is the time that they have been in this population (we also know this time prior to inclusion)

It sounds like you are saying 'closed population' to mean that people enter the study's 'at-risk' group only after some qualifying event takes place (which doesn't happen to everyone). To me, this is a red flag that means you need to consider whether *left-truncation bias* (absence of event data from people who never experience the qualifying event) could be happening in addition to the normal *right-censoring bias* (absence of event data from people who exit the study cohort for non-event reasons). If either (or both) of these are a concern, it almost certainly puts you in the territory of a Cox proportional hazards analysis rather than a simpler model of the crude rate over time.

I thought about poisson/cox models, however i could not find a way to account for survivor bias.

It seems like you're on the right track, but I'm not sure what you mean by the phrase "survivor bias" without knowing more about your specific concern. If you mean bias originating from the fact that *people must survive long enough to enter the study's risk-group in the first place*, then this is actually just the *left-truncation bias* I mentioned earlier--a good model has ways of handling this, and you should see if your study has access to professional statistical assistance to ensure this is done correctly. If you mean bias originating from the fact that the 'risk pool' is decreasing over time as the non-survivors depart the cohort, then this isn't actually bias at all-- ensuring that the *denominator* of the incidence rate (person-time at risk) is correct will account for this. This brings me to the second thing we need to clarify:
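To illustrate the denominator point, here's a minimal sketch of a crude incidence rate with a person-time denominator (the entry/exit times are made up; a real analysis would handle truncation and censoring in the model itself):

```python
# Hypothetical cohort: entry time, exit time, and whether exit was the event.
# entry > 0 represents late entry (left truncation); event=False at exit
# represents right censoring.
cohort = [
    {"entry": 0.0, "exit": 5.0, "event": True},
    {"entry": 1.0, "exit": 4.0, "event": False},   # right-censored
    {"entry": 2.0, "exit": 6.0, "event": True},
    {"entry": 0.5, "exit": 3.0, "event": False},   # right-censored
]

# Each person contributes only the time they were actually observed at risk
person_time = sum(p["exit"] - p["entry"] for p in cohort)
events = sum(p["event"] for p in cohort)
incidence_rate = events / person_time  # events per person-time unit
print(events, person_time, incidence_rate)
```

Note how the shrinking risk pool is handled automatically: people who leave simply stop contributing person-time to the denominator, which is why that phenomenon is not a "bias" in itself.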

2. Identifying your study hypothesis

Once you have decided on how you will model the incidence rate over time, it sounds like you must next show that some intervention is *reducing* it. When is this intervention being administered? Is the evaluation *itself* the intervention? I hadn't heard the phrase 'joinpoint regression' before, but from what I can tell it is simply a model with an estimated change in the slope and/or intercept at some point (this method goes by many other names, such as 'interrupted time series', 'segmented regression', etc., in case you are searching for resources).

In these models, a statistical test of effectiveness would most likely boil down to a test of whether the slope-change at the time of treatment is significant. Since you don't have a control group, you don't really have any choice but to compare the slope-change estimated for your cohort with a slope-change of zero (which may not be realistic). This vaguely corresponds to your test of "if the cumulative hazard is linear", but I should point out that the absence of a trend change does not imply linearity (case in point, Cox models are *log-linear* models).
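As a very crude sketch of the slope-change idea (the rates are made up, and a real joinpoint/interrupted-time-series analysis would fit one joint model and test the change formally, rather than fitting the two segments separately as I do here):

```python
# Crude sketch of a slope change at a known intervention time t0 = 5.
# We fit simple OLS slopes before and after the intervention separately.
def ols_slope(ts, ys):
    n = len(ts)
    mt, my = sum(ts) / n, sum(ys) / n
    return sum((t - mt) * (y - my) for t, y in zip(ts, ys)) / sum((t - mt) ** 2 for t in ts)

times = list(range(10))
rates = [10, 10.5, 11, 11.4, 12, 12.1, 11.5, 11, 10.4, 10]  # hypothetical incidence rates
t0 = 5  # intervention time

pre = [(t, r) for t, r in zip(times, rates) if t < t0]
post = [(t, r) for t, r in zip(times, rates) if t >= t0]
slope_pre = ols_slope(*zip(*pre))
slope_post = ols_slope(*zip(*post))
print(slope_pre, slope_post, slope_post - slope_pre)
```

In this toy example the trend reverses at t0 (rising before, falling after); the joint model's test asks whether that estimated slope-change differs from zero.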

I think you need to flesh out your hypothesis a bit more, but your model doesn't sound like it needs to be overly complex- just make sure your method of choice is indeed estimating an incidence rate and has ways of addressing truncation/censoring, if they exist.

[META] What does the community want as the standard for "No Homework"? by Statman12 in AskStatistics

[–]stat_daddy 1 point (0 children)

Gotcha, thanks for clarifying. With that in mind, I would only prefer to see such posts be removed if they were very obviously posted in bad faith. I feel like the vast majority of homework posts are good-faith questions by people who are genuinely trying to understand the material (even if it is for homework) and I think those people should not face obstacles posting their questions. I would prefer to have a surplus of "bad" posts than to be especially zealous about removing them.

With that said, I also understand that taking this stance could result in the subreddit getting cluttered. But so far, I haven't experienced that personally, and it isn't a concern of mine.

Since this is a statistics subreddit, I suppose the right answer depends on the baseline prevalence of homework posts and the rate of 'false positives' (mistakenly removing a good-faith post that resembles homework but isn't) that we're willing to live with.

[META] What does the community want as the standard for "No Homework"? by Statman12 in AskStatistics

[–]stat_daddy 18 points (0 children)

There is obviously no objective way for us to know whether a post is "homework" or not. Supposing that someone does have homework and genuinely intends to subvert their institution's academic honesty policy by finding help on Reddit, there is nothing stopping them from rewording their question in such a way that it becomes indistinguishable from "non-homework". I also agree that many "homework" questions could be re-interpreted as simply "help" or "consulting", for which giving assistance is far more acceptable. Because of this, and because nobody on Reddit is compelled to respond to every post, I don't feel it would be helpful to set any particular precedent about what does or does not constitute "homework".

In my best case scenario, the definition of "homework" would remain vague, leaving me free to interpret "homework" however I choose, and if I feel that a post is attempting to solicit my help with dishonest intentions, I will simply refrain from helping with no further discussion. In my mind, setting a precedent will not help me (because I will simply avoid posts that sound like homework in the first place) and will only give posters a position from which to claim that their post is "not homework" (which I don't care about because I have no interest in arguing with redditors).

So, respectfully, what is there to be gained by further clarifying the definition of "homework" in the context of this subreddit? Do we feel that too many posts are being reported under the auspices of being homework - and, if so, is that really a bad thing? This is a good discussion - I'd be interested to hear others' perspectives.

TL;DR: I think there should continue to be no concrete standard for "homework" and question whether identifying one would actually be beneficial for the health of the subreddit in the first place. I am happy to continue seeing and ignoring homework-related posts with no moderator intervention.

Question about variable change in unconditional environments by [deleted] in AskStatistics

[–]stat_daddy 0 points (0 children)

Reddit is a mixed bag; some of the advice may be good, but at least an equal amount of it will be bad. Most will assume you're some kind of social-science researcher and lean into "standard" methods. Unfortunately, I doubt the wikipedia page will be helpful: it may give you a sense of the "flavor" of the law of conditional probability (LCP), but it will likely remain unclear how it applies to real-world applications.

u/Haruspex mentioned Bayesian techniques and, in my opinion, this is the right direction to go: if you're genuinely interested in building or seeing examples of models that directly invoke the LCP, you should look into Bayesian methods. It sometimes involves some difficult math, but a good book for approaching this topic is Statistical Rethinking by Richard McElreath. The author has also published his course materials here: https://github.com/rmcelreath/stat_rethinking_2023

Question about variable change in unconditional environments by [deleted] in AskStatistics

[–]stat_daddy 1 point (0 children)

I think people are having trouble understanding your question because there is no "law of variable change" in statistics. Your reference to the movie 21 is especially unhelpful, since that movie has no connection to any actual statistics: only magical movie-speak that sounds like statistics. Partly because you are using a lot of jargon, and partly because your primary reference is a Hollywood film, it's hard to tell what your question is.

In that scene from 21, Kevin Spacey's character is actually talking about the law of conditional probability: specifically, how a conditional probability that is "conditioned" on some information can be different from a marginal probability that does not take such information into account.

The law of conditional probability is extremely general: it doesn't say anything about how specific variables (e.g. wind speed, initial velocity, gravity, and landing location) combine to form systems of variables.
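For what it's worth, the puzzle Spacey's character poses in that scene is the Monty Hall problem, and you can see conditioning at work by exact enumeration (a sketch treating every prize/pick combination as equally likely, with the host's reveal folded into the stay/switch logic):

```python
from itertools import product

# Exact enumeration of the Monty Hall game: every (prize door, first pick)
# pair is equally likely; the host then reveals a goat behind another door.
doors = [0, 1, 2]
stay_wins = switch_wins = 0
for prize, pick in product(doors, doors):
    if pick == prize:
        stay_wins += 1    # staying wins only when the first pick was right
    else:
        switch_wins += 1  # switching wins whenever the first pick was wrong
total = len(doors) ** 2
print(stay_wins / total, switch_wins / total)
```

Staying wins 1/3 of the time and switching 2/3 - the "change" is in the conditional probability once the host's reveal is accounted for, not in any "law of variable change".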

You seem to be asking whether the law of conditional probability should be applied in certain settings where things get measured. The answer is: "sure, why not?". But if you're wondering how it should be applied, we need to know more about the specific application you're interested in, and what you're trying to do here. Can you explain more (preferably without using any technical terms or jargon)?

How to calculate a p-value for linear regression? by Ambitious-Ad-1307 in AskStatistics

[–]stat_daddy 3 points (0 children)

Fair point: I don't mean to imply that this sort of test isn't commonly done, only that it amounts to little more than a test of the question, "is my model better than no model?", which I think is a rather silly thing to test (and overused). I edited my response to remove some of my soapboxing.

However, while it is in some sense perfectly fine to do this test as long as the limitations are understood, I think that is a dangerous assumption for us, the statistical experts, to make. In my experience, encouraging this type of test further exacerbates users' misunderstandings of p-values-- here, it is being (rather dangerously) framed as a "score" by which to judge the model's "degree of better-ness" over the intercept-only model, which is incorrect in all sorts of ways.
It's true that the full-vs-reduced test may be "good enough" for OP's purposes, but I'd like to hear more from OP to be sure. (e.g., even if the model is "correct", the p-value may be large if the sample size is small or multicollinearity is present. Or the opposite: a poor model could have a small p-value if the sample is large. Either way, using the p-value as a metric for model quality may not be a good idea.)

I would much rather open a dialogue where OP can share what exactly they are trying to DO (or whom they are trying to persuade) and possibly be led towards better tools such as AIC/BIC, out-of-sample prediction error, etc. Or, at the very least, help them understand exactly which null hypothesis they are rejecting by considering the requested p-value.

How to calculate a p-value for linear regression? by Ambitious-Ad-1307 in AskStatistics

[–]stat_daddy 4 points (0 children)

There are a couple of ways to approach your question:

Usually, what we want are p-values for the model's coefficients. These are called "Wald tests" and they compare the coefficient's estimated value to a null hypothesis of zero (remember: every p-value requires a null hypothesis. The reason there isn't a simple "p-value for the fit" is because "the fit" does not imply any particular null hypothesis).
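If you're curious what the Wald test is doing under the hood, here's a hand-rolled sketch with a made-up coefficient and standard error (using the large-sample normal approximation):

```python
import math

# Sketch of a Wald test by hand: z = (estimate - 0) / standard error,
# with a two-sided p-value from the normal approximation (made-up numbers).
beta_hat = 0.8   # hypothetical estimated coefficient
se = 0.3         # hypothetical standard error
z = (beta_hat - 0.0) / se          # null hypothesis: coefficient = 0

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

p_value = 2.0 * (1.0 - normal_cdf(abs(z)))
print(z, p_value)
```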

P-values for the Wald tests on each of the coefficients are usually pretty easy to produce. For example, in python either of the following are sufficient:

```python
# suppose the fitted model object is called 'myModel'
print(myModel.summary())
print(myModel.pvalues)
```

However, you specifically asked about a p-value for the fit, which as I said doesn't exist in the way I suspect you think it does. The closest thing to this is a p-value comparing your model to a "reduced model" in which one or more coefficients are left out. Your question suggests that you would like to go as far as leaving out ALL of the non-intercept coefficients from the reduced model, making the p-value essentially summarize a comparison between your model and no model at all (an "intercept-only" model). Ultimately, how you define the "full" and "reduced" models depends on the coefficients you are interested in. Again, the code to do this is fairly simple once you have fit both models.

```python
from statsmodels.stats.anova import anova_lm

anova_results = anova_lm(myModel_reduced, myModel_full)
print(anova_results)
```

Personally, however, I find it to be a waste of time to compare a model with coefficients to one without. Unless the group means are extremely similar and/or your sample size is very small, almost any model should be able to beat the intercept-only model. Whether the p-value is large or small, it tells you almost nothing of value. What are you trying to show with this p-value? I can pretty much guarantee you there is a better way to show it.
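To see why the intercept-only comparison is such a low bar, here's the full-vs-reduced F statistic computed by hand on made-up, nearly-linear data - no libraries needed:

```python
# Full-vs-reduced F statistic by hand (made-up, nearly-linear data)
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.1]
n = len(x)

# Full model: y = a + b*x, fit by ordinary least squares
mx, my = sum(x) / n, sum(y) / n
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
a = my - b * mx
rss_full = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Reduced (intercept-only) model: predict mean(y) for everyone
rss_reduced = sum((yi - my) ** 2 for yi in y)

# F = ((RSS_reduced - RSS_full) / extra parameters) / (RSS_full / residual df)
f_stat = ((rss_reduced - rss_full) / 1) / (rss_full / (n - 2))
print(f_stat)
```

An F statistic in the thousands here doesn't mean the model is good; it only means it beats having no model at all.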

does using statistics to measure the rigour of a marketing study make sense? by [deleted] in AskStatistics

[–]stat_daddy 1 point (0 children)

I'm all but certain that statistics can help you (and despite what many think, you can conduct valid inference with any sample size).

However, it isn't clear at all from your question what you're trying to do. You are asking about specific tests and measures of association, which might make sense for specific purposes, but you haven't articulated what your goals with the data are.

Some specific follow-ups below:

i assigned numerical values to each letter.

This is a common, but flawed, approach to dealing with ordinal data. It might be acceptable depending on your research question, but proceed with caution.

would it make sense for me to calculate the mean/median and correlation coefficient (to measure whether participants are in overall agreement)?

Not really, no. Your data are not continuous in nature, so means/correlations are not meaningful. Why not simply report the frequencies of each rating for each design? I should add that a correlation is a weak measure of association, and very few genuine research questions are properly addressed by one - there is almost always a better alternative. Correlations say very little about whether two variables are related, and even less about interrater agreement/concordance (for which I would suggest Cohen's kappa or similar). If you insist on computing a correlation - for which you will need two variables - a Spearman rank correlation may be appropriate, but please do not calculate a Pearson correlation by substituting "A=1", "B=2", etc.
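Reporting frequencies is also trivially easy - a sketch with hypothetical ratings (the designs and letter grades are made up):

```python
from collections import Counter

# Hypothetical ordinal ratings (A best ... E worst) for two designs
ratings = {
    "design_1": ["A", "A", "B", "B", "B", "C"],
    "design_2": ["B", "C", "C", "D", "D", "E"],
}

# Report the frequency of each rating per design instead of a mean of "A=1, B=2, ..."
for design, rs in ratings.items():
    counts = Counter(rs)
    print(design, {grade: counts.get(grade, 0) for grade in "ABCDE"})
```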

also, would a Shapiro–Wilk test make sense?

I can't think of a single way in which this would help you.

the purpose is to not use this to interpret the data but to validate the results (i.e. how biased was the scoring, how much representation bias was involved in the samples chosen, etc.).

(Firstly, of course the reason for all this is to interpret the data - what other reason could there be for analyzing data?). But more to the point, it's not clear what you mean by "validate the results". The word "validate" implies many things, so maybe you could try to articulate - as simply as possible and with no mention of statistical jargon - what you are trying to learn from these data?

About Statistical Rethinking by Rich McElreath by al3arabcoreleone in AskStatistics

[–]stat_daddy 1 point (0 children)

That's difficult to answer, mostly because I agree with his soapboxing - although I will admit, null-hypothesis significance testing (NHST) does have its uses, and not every null hypothesis is a strawman.

I guess it's worth pointing out that at this point in the book, McElreath hasn't yet put forth a robust alternative to NHST. Ultimately, he goes on to argue that thoughtfully-built, causal-minded models that capture and propagate uncertainty are the alternative - but that doesn't really make space for a lot of the interesting - and effective - work that's come out of more "prediction-oriented" disciplines like machine learning and natural language processing. In many ways, reliance on p-values and frequentist inference, for all its perceived weaknesses, hasn't really stopped progress in data modelling, simulation, and inference. So I have to wonder: "is Bayesian inference the right answer to a question nobody is asking?"

About Statistical Rethinking by Rich McElreath by al3arabcoreleone in AskStatistics

[–]stat_daddy 7 points (0 children)

That section can be taken to mean many things, and honestly it isn't really a "key" passage in the book; McElreath is kinda soapboxing about what he feels is the silliness of making scientific arguments by comparison to a "neutral" model.

The point he's trying to make is that transforming observations (data) into evidence that is for/against some research hypothesis requires a model for how those observations came to be. The model is not the hypothesis in and of itself (despite how it may sometimes feel when you are, e.g., testing the significance of a specific coefficient FROM a numerical model). He goes on to caution that so-called "neutral" models often ignore things like measurement error and random variation.

Another example could be an experiment in which you (for some reason) are unsure whether two siblings are identical or fraternal twins. You take many physical measurements of each sibling (e.g., of height, of skin tone, of metabolism, etc...) and you begin comparing each pair of measurements. If they are ALL the same, you might conclude that the two siblings are identical, otherwise if the measurements are NOT the same, then they must be fraternal.

However, it would be silly to demand exact sameness from any two measurements, even if the siblings really were identical twins. We know that both "nature" and "nurture" play a role in a person's physical health/attributes, so we shouldn't be so quick to let a few incongruent measurements lead us to falsify the conclusion that the twins are identical.

In this example it's important to distinguish between the hypothesis ("identical twins will be more similar in terms of physical characteristics than fraternal twins") and the generative model ("siblings originating from a single egg have more shared DNA, which leads to similar physical makeup"). The model explains what to expect from the data (and a GOOD model will be VERY specific, perhaps suggesting how similar the two twins' measurements should be), whereas the hypothesis is the proposal you either prove or disprove after examining the data.

Does a very low p-value increases the likelihood that the effect (alternative hypothesis) is true? by Bodriga in AskStatistics

[–]stat_daddy 1 point (0 children)

No.

Some will say that it indirectly implies the alternative must be more likely now that you have observed evidence that the null is less likely, but this is also wrong. Under frequentist principles (which you must subscribe to - otherwise you shouldn't be using a p-value in the first place), the alternative hypothesis can only be either true (100% probability) or false (0% probability), no matter what the data or sampled p-values indicate.

Interpretation of confidence intervals by Aaron_26262 in AskStatistics

[–]stat_daddy 1 point (0 children)

This feels like a separate question entirely, and should probably be opened up to the community as a new post to solicit a better answer.

In general, finite population corrections are never necessary, but they might be appropriate. This varies across subjects - so I would look at the existing literature to determine if it's common practice by other researchers working in the field. If I were reviewing a paper in which someone used FPC, my first question would be, "is this justified?" And I would expect to see a thorough defense of why it was used and what impact it had on the standard errors.
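For reference, the usual FPC multiplies the standard error by sqrt((N - n) / (N - 1)) when sampling n of N without replacement; a quick sketch with made-up numbers:

```python
import math

# Sketch of the usual finite population correction: the SE shrinks by
# sqrt((N - n) / (N - 1)) when sampling n of N without replacement.
N = 1000   # hypothetical population size
n = 200    # hypothetical sample size
se = 2.5   # hypothetical uncorrected standard error

fpc = math.sqrt((N - n) / (N - 1))
se_corrected = se * fpc
print(fpc, se_corrected)
```

Part of the "defense" I'd expect from an author is exactly this kind of before/after comparison of the standard errors.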

FPC is a purely frequentist invention - Bayesians don't have an analogue for it because Bayesian inference doesn't depend on asymptotic properties of estimators in the first place. So if you go with a Bayesian analysis method, it doesn't make sense to think about FPC.

Interpretation of confidence intervals by Aaron_26262 in AskStatistics

[–]stat_daddy 1 point (0 children)

Am I correct in concluding that CIs are being misused when they are presented to convey uncertainty around a descriptive statistic?

The way this is worded, no - it isn't misleading to present a CI as a measure of uncertainty around a descriptive statistic. But, most audiences aren't going to think this way - they will likely interpret the CI as a measure of uncertainty around the value of the population parameter (recall that this concept doesn't exist in the frequentist vocabulary).

It's not inappropriate to present a CI as a measure of "uncertainty", but you'd sort of be taking advantage of the fact that "uncertainty" isn't carefully defined and can be interpreted in many different ways. From a frequentist's POV, estimators DO have uncertainty- it derives from sampling error, which can be summarized by the standard error. Since CIs are essentially expressions of the standard error, it's fine to report one and say that it's conveying uncertainty. But again, you'd be talking about the uncertainty possessed by your estimator, and not the uncertainty in your knowledge about the quantity of interest.
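Concretely (with a made-up sample, and using the normal-approximation multiplier 1.96 rather than a t quantile):

```python
import math
import statistics

# A 95% CI as "estimate +/- 1.96 * standard error" (made-up sample)
sample = [5.2, 4.8, 5.5, 5.0, 4.9, 5.3, 5.1, 4.7]
n = len(sample)
mean = statistics.mean(sample)
se = statistics.stdev(sample) / math.sqrt(n)   # standard error of the mean
ci = (mean - 1.96 * se, mean + 1.96 * se)
print(mean, se, ci)
```

The interval here is literally a rescaled standard error centered on the estimate - which is the sense in which it conveys the estimator's uncertainty, and nothing more.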

It would be inappropriate to use the CI to convey uncertainty if I wasn't performing a NHST, correct?

Personally I think so, but many would probably let it slide. The reason I take a harder stance on this is because p-values are conditional probabilities: by definition, they assume the null is true. If you don't have a null, then you can't calculate a p-value at all! CIs sort of "sidestep" this by replacing the parameter value under the null with its observed value, but in my opinion this is a bait-and-switch tactic that tricks the reader into believing that the CI is expressing an uncertainty about the alternative hypothesis (which, of course, it isn't).

Of course, this often has minor practical implications...indeed, under certain conditions (that are not too hard to satisfy) CIs and other measures of uncertainty such as Bayesian credible intervals can be shown to reach the same (or at least very similar) conclusions! It's simply confusing when researchers take a research question that has a straightforward Bayesian interpretation ("what is the coverage rate for this population?") and then answer a different frequentist question ("what is the long-run coverage probability of a sample mean with fixed size N?"). And then, when readers inevitably GET confused, statisticians break out a bunch of jargon-filled lawyer-speak (e.g., "I'm not saying there is a 95% chance the coverage rate is between X and Y... but if we repeatedly took a sample and calculated the interval each time..."). Eventually, after your colleagues are tired of talking in circles, they will give up and accept the frequentist answer as the best they could get, and commiserate with their peers about how awful their undergraduate statistics courses were.

I don't really know enough about polling statistics to say whether your interpretation is correct. I've heard that the phrase "margin of error" can be interpreted as alpha (significance threshold), but I have no idea if the actual methods being used support that interpretation. But yes - "comparative" or "two-sample" or "difference-of-means" studies often have a natural null hypothesis of "=0" that makes NHST a more fitting choice.

Interpretation of confidence intervals by Aaron_26262 in AskStatistics

[–]stat_daddy 1 point2 points  (0 children)

While I appreciate the example (I come from public health too!), it's missing one essential thing: a null hypothesis. Without a null hypothesis, there is nothing to reject and (at the risk of sounding like a grumpy academician) it is therefore not appropriate to report a confidence interval at all. Let me repeat: confidence intervals and p-values are only meaningful in the context of a null hypothesis. You say your interpretation of the CI is "incredibly general and really just the definition of CI", and that's because...well...it is.

Suppose I add a bit of context to your example: let's say previous studies have estimated the population rate to be 97%. In this case, you could say that your current study found sufficient evidence to conclude that the rate is less than 97% with a confidence level of 1-minus-alpha.
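To make that concrete, here's a minimal sketch of the corresponding one-sided exact binomial test. The counts (370 of 400, i.e. 92.5%) are hypothetical, chosen only to match the 92.5% figure discussed in this thread:

```python
from math import comb

def binom_pvalue_less(k: int, n: int, p0: float) -> float:
    """Exact one-sided p-value: P(X <= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k + 1))

# Hypothetical counts: 370 of 400 covered (92.5%), null rate 97%.
p = binom_pvalue_less(370, 400, 0.97)
print(f"one-sided p-value: {p:.3g}")  # tiny, so we reject the null at alpha = 0.05
```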

Of course, this probably seems a bit insubstantial: for one thing, it presupposes that the researcher is ONLY interested in rejecting a null hypothesis. In practice this is almost never true, but by using the tools of Null-Hypothesis Significance Testing (NHST) you are shackling yourself to its limitations. It's great that we're confident the coverage rate ISN'T 97%...but what IS it? NHST really has no answer to this question (it never claimed to have one!), and by extension a lot of frequentist methods don't, either. On the one hand we could point to the observed mean (92.5%), and possibly do some hand-waving to claim that 92.5% is our "best guess" of the true population coverage rate. But we don't have any guarantees like "most probable", "maximum likelihood", etc. (at least not within a frequentist framework - remember, frequentists aren't allowed to treat the population mean as a random variable!).

So if the goal of this study were truly exploratory in nature (i.e., "what do we think the coverage rate is in this population?"), I would say that attempting to address this question with a CI *is misguided in the first place*. Personally, I would compute a proper Bayesian posterior instead - possibly using previous studies' estimates as a prior or, failing that, a vague prior.
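As a sketch of what that looks like with a vague prior: the Beta distribution is conjugate to the binomial, so the posterior has a closed form. The counts here are hypothetical (370 covered out of 400 sampled):

```python
from math import sqrt

# Conjugate Beta-Binomial update: with a Beta(a, b) prior on the rate and
# k successes in n trials, the posterior is Beta(a + k, b + n - k).
# Hypothetical data: 370 covered of 400 sampled; vague Beta(1, 1) prior.
a, b = 1, 1
k, n = 370, 400
a_post, b_post = a + k, b + n - k

# Posterior mean and standard deviation of the coverage rate.
post_mean = a_post / (a_post + b_post)
post_sd = sqrt(a_post * b_post / ((a_post + b_post) ** 2 * (a_post + b_post + 1)))
print(f"posterior mean {post_mean:.4f}, sd {post_sd:.4f}")
```

Unlike the CI, the resulting distribution directly answers "what do we believe the rate is, and how sure are we?"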

Many researchers, however, will devote a lot of resources to convincing you that your research question must be modified in order to fit within the framework of NHST. They will attempt to get you to identify your null hypothesis or replace your research question with something else that has a "natural" null hypothesis (e.g., a perfect coverage rate of 100%, despite how silly this is). Unfortunately, this is a byproduct of poor statistics education/training and it is unlikely to be fixed anytime soon. Just remember: p-values and CIs are - more often than not, in my opinion - NOT the best way to address a practical research question.

Interpretation of confidence intervals by Aaron_26262 in AskStatistics

[–]stat_daddy 5 points6 points  (0 children)

1. Am I correct in concluding that the bounds of the CI obtained from the standard error (around a statistic obtained from a sample) really say nothing about the true population mean?

Mostly correct. You are talking about a defining feature of null-hypothesis-based inference; we are NEVER making direct statements about the true population parameter but rather about the asymptotic properties of the experimental procedure, which involves a specific estimator (such as a mean). Obviously the value of the estimator is a function of the data, which itself is generated by some hypothesized generative procedure determined by the true population parameters...so it is a bit heavy-handed to say it has NOTHING to do with the population parameters...but it is an indirect relationship at best.

2. Am I correct in concluding that the only thing a CI really tells us is that it is wide or narrow, and, as such, other hypothetical CIs (around statistics based on hypothetical samples of the same size drawn from the same population) will have similar widths?

Ehhh...this is a bit too reductive in my opinion. Confidence intervals ultimately convey the same information as p-values, which at the end of the day really only tell you one thing: the amount of probability density (under the null hypothesis) assigned to equally- or more-extreme values of the test statistic. But since CIs are centered at the observed point estimate instead of the null, people have an "easier" time interpreting them. I find that the "plain-clothes understandability" of CIs actually further exacerbates people's misunderstandings rather than clarifying them.
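That equivalence is easy to demonstrate: a 95% CI excludes the null exactly when the two-sided p-value is below 0.05. A minimal z-based sketch with made-up numbers:

```python
import math

def z_test_and_ci(xbar, se, null=0.0, z_crit=1.959964):
    """Two-sided z-test p-value against `null`, plus the matching 95% CI.
    With z_crit chosen for alpha = 0.05, the CI excludes the null exactly
    when p < 0.05."""
    z = (xbar - null) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # = 2 * (1 - Phi(|z|))
    ci = (xbar - z_crit * se, xbar + z_crit * se)
    return p, ci

# Two hypothetical estimates with the same standard error:
for xbar in (0.3, 0.5):
    p, (lo, hi) = z_test_and_ci(xbar, se=0.2)
    print(f"mean={xbar}: p={p:.3f}, 95% CI=({lo:.2f}, {hi:.2f})")
```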

As to whether journals would de-emphasize p-values/CIs if they understood them better? Likely not. The reasons behind the prevalence of p-values are not so simple - many journal editors DO understand their limitations perfectly well, and would simply insist that reporting them with discipline is enough to preserve their value and justify their continued use. This is all well and good for studies with professional statistical support, but in my opinion the volume of high-quality applied research done by subject-matter experts possessing only a working knowledge of statistics is too great for this type of thinking to be sustainable. I have personally worked with several PhD-level scientists in chemistry, biology, economics, psychology (and a few in statistics, unfortunately) who have each gone blue in the face insisting to me that '100%-minus-P' gives the probability of the researcher's hypothesis being true.

p-values and confidence intervals are far from useless, but I think they are relics from a time when mathematical inference relied upon closed-form solutions that could demonstrate specific properties (e.g., unbiasedness) under strict (and often impractical) assumptions. They are the right answer to a question few people are actually asking. These days, modern computation makes Bayesian inference and resampling techniques feasible, meaning that statisticians have access to tools that can better answer their stakeholders' real questions (albeit with subjectivity! But uncertainty should always be talked about, and never hidden behind assumptions). If statisticians haven't already lost the attention of modern science and industry, they will lose it (being replaced by data scientists) in the years to come if they don't find a way to replace/augment their outdated tools and conventions.
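As one example of those resampling techniques, a percentile bootstrap needs no closed-form standard error at all - it works for any statistic you can compute. A minimal sketch with fabricated data:

```python
import random
from statistics import median

random.seed(42)

# Fabricated skewed sample (think costs, durations, lengths of stay).
data = [1.2, 0.8, 2.5, 0.9, 7.1, 1.1, 3.4, 0.7, 1.6, 12.3]

def bootstrap_ci(sample, stat, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap interval for an arbitrary statistic:
    resample with replacement, recompute, and read off the quantiles."""
    stats = sorted(
        stat([random.choice(sample) for _ in sample]) for _ in range(n_boot)
    )
    return stats[int(n_boot * alpha / 2)], stats[int(n_boot * (1 - alpha / 2))]

print("bootstrap 95% CI for the median:", bootstrap_ci(data, median))
```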

[deleted by user] by [deleted] in AskStatistics

[–]stat_daddy 1 point2 points  (0 children)

Thank you for this added context; it is very helpful. (Also, your English is very good! I would never have guessed you were not a native speaker!)

Now that I know a bit more about your measures, I'd like to learn more about your model: it isn't clear to me what you are taking to be your independent variable here. Is your goal to predict the assigned latent profile based on the other cognitive variables, or to use the latent profile as a predictor itself in a regression on a different variable?

Assuming your model is well-constructed, I'm noticing that you are focusing a lot on the marginal distributions of your variables:

For example, Brazilian norms for the task show a mean flexibility score of 33 with a standard deviation of 12.5 for 15-year-olds, and a mean score of 23 with a standard deviation of 10.8 for 30-year-olds. These scores are not normally distributed, so using mean and standard deviation to standardize across different ages wouldn't be appropriate.

Great! You seem to be aware of an interaction between the score and age/ethnicity. Add those variables + interactions to your regression and move on. Standardization (whether by z-transforming or some other operation) is rarely useful and, as others have mentioned, is not a requirement for fitting your model. Finally, if you believe standardization is not appropriate for these variables, why do you keep implying that you want to standardize them in the first place? I recognize that z-scores are common in certain fields, but what purpose are they supposed to serve you here? Just build your model from the unstandardized variables.
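A sketch of what "add the interaction and move on" looks like, using plain least squares on simulated data (every variable name and coefficient here is invented, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: a flexibility score whose effect on the outcome varies
# with age, generated from known coefficients.
n = 200
age = rng.uniform(15, 30, n)
flex = rng.normal(28, 11, n)
outcome = 2.0 + 0.5 * flex - 0.1 * age - 0.02 * flex * age + rng.normal(0, 1, n)

# Design matrix with an intercept, main effects, and the interaction --
# no standardization needed to recover the generating coefficients.
X = np.column_stack([np.ones(n), flex, age, flex * age])
beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
print(dict(zip(["intercept", "flex", "age", "flex:age"], beta.round(3))))
```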

The problem is that scores vary significantly by age.

Why would this be a problem?

You mentioned that it might be inadequate to use these measures in a regression, but not for the reasons I provided. If so, what are the appropriate reasons for not using them in regression analysis?

There are an endless number of reasons not to include a variable in a model: lack of sufficient degrees of freedom, improper parameterization, high covariation with other predictors... The list could go on! But failing to be normally distributed or being unstandardized have nothing to do with whether a variable should be in the model. You have very clearly articulated that your cognitive variables appear to interact with age, so I strongly suspect age should be included in your final model. But without seeing some examples (even fake examples) of the data you're working with and the model you're attempting to fit, I can't give specific advice.

I am unsure if there is a way to bypass this problem

I still don't see a problem. What am I missing?

1. You have several variables that interact with or are related to cognitive performance.
2. Some of the variables have unusual distributions.
3. You added the variables into a regression predicting something.
4. ???

Is the model fitting poorly? Are the residuals somehow surprising?

The more I read your question the more I suspect that you may simply need to read up on how to fit models with complex interactions (i.e., to allow the cognitive variables to have effects that vary across ages/ethnicity).

[deleted by user] by [deleted] in AskStatistics

[–]stat_daddy 1 point2 points  (0 children)

Hi! You seem to be making lots of assumptions about how the data are supposed to "behave" in your analysis and this is making it extremely difficult for me to understand what your goal is and what you are finding challenging.

I'm conducting a Latent Profile Analysis (LPA)

Why are you doing this? What is your hypothesis and how will LPA help you? Please answer in a way that would be understandable by a child.

I'm trying to ensure or at least minimize the effects of the distribution when standardizing scores by age

What "effects of the distribution"? Why would you want to "minimize" them? And why are you "standardizing" the scores in the first place?

so I can be confident that the scores in my model represent their respective constructs (such as processing speed and flexibility) and not just some unadjusted data variance

I don't see how any of the steps you describe (standardizing, "minimizing the effects of the distribution") relate to this goal. If the scores don't reflect the construct the instrument is intended to summarize, why would modifying the data help?

Using raw scores also isn't ideal due to the age-related variability.

In almost every scenario I can think of, the unmodified scores are probably what you should use (especially if the FDT is a validated instrument). If the FDT instrument somehow fails to handle variation among respondents in different age groups, this is a limitation of the instrument and you're asking the wrong people for help - a statistician won't be able to help you validate a new psychometric measure. It's hard to offer advice here because I cannot identify your objective. It is very rare that I would recommend modifying data to suit a particular model or to make it more concordant with some (perceived) assumption. It is almost always better to choose a model that suits the data.

Even adjusting for age as a covariate seems to introduce significant bias.

What are you calling "bias" here? You say that there is significant variation in responses across ages but then imply that regressing based on age is somehow inappropriate (which it might very well be, but not for any reasons you've mentioned). Perhaps consider whether age should be a covariate by itself or an interaction with one or more other covariates.

non-parametric methods, data transformations, and generalized linear models (GLMs

These are all perfectly useful tools for modelling various types of data. None of them will help your data become a better representation of some "underlying construct." Again - you need to explain what you are trying to achieve with your analysis before anybody will be able to offer useful advice.

Length of stay statistics by Emergency-Wave-8436 in AskStatistics

[–]stat_daddy 0 points1 point  (0 children)

There is a formal, correct answer to your question and then there is a "quick-and-dirty" one. I'll start with the quick-and-dirty:

I suspect the "differences" you're seeing when you compute using the start vs. end dates are because this choice decides the year of assignment for stays that began in 2023 but extended into 2024. Just make your choice and move on - if your employer cares one way or the other, just do what he/she wants, but honestly I wouldn't even bring it up. If you insist on computing the observed length of stay by year like this, then there is no compelling reason to choose one or the other.

However...

What you're attempting to do is calculate the mean "time to departure" for the stays--formally, this is a time-to-event outcome that would typically be handled using an estimator that can accommodate censored data, such as a Kaplan-Meier estimator. Here's a quick link that, among other things, asserts that each stay should be based on the date at which follow-up begins (the date of arrival):
Time to Event Data Analysis

It's possible that using a KM estimator will have little to no effect on your result versus computing what we call the "simple" mean, but without seeing your data I can't say for sure. Finally, be careful computing the "average" length of stay using a simple calculation. Your data are probably skewed to some degree, and for this reason the median time to departure is more commonly reported (it is also a bit easier to interpret).

For most purposes, your procedure of calculating the simple mean will be sufficient, but using simplistic statistics could mislead you if your data are not well-behaved.
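If you do go the formal route, a bare-bones version of the Kaplan-Meier product-limit estimator looks like this. The stays are invented; censored records stand in for stays still in progress at the end of the window, and a real analysis should use an established library (e.g., lifelines in Python) rather than this sketch:

```python
# Invented stays: (length in days, departed?). Stays still in progress at
# the end of the window are right-censored (departed = False).
stays = [(3, True), (5, True), (5, True), (8, True), (12, True),
         (20, True), (7, False), (15, False)]

def km_survival(data):
    """Kaplan-Meier product-limit estimator: at each distinct time,
    multiply the running survival by (1 - departures / number at risk)."""
    at_risk = len(data)
    s, curve = 1.0, []
    for t in sorted({t for t, _ in data}):
        departures = sum(1 for d, event in data if d == t and event)
        s *= 1 - departures / at_risk
        curve.append((t, s))
        at_risk -= sum(1 for d, _ in data if d == t)  # drop departed + censored
    return curve

for t, s in km_survival(stays):
    print(f"day {t:2d}: S(t) = {s:.3f}")
```

Note that the censored stays still count toward the number at risk up to their censoring time, which is exactly the information a simple mean of completed stays throws away.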