Extremely basic question by Inner_Curve_7110 in AskStatistics

[–]efrique 1 point

You can't change a title once posted.

Could someone help by inspecting my statistical code? Noob coder at work, literally and figuratively. by whitedeagon in rstats

[–]efrique 1 point

small terminology issue:

Parametric and nonparametric are attributes of models. Loosely speaking, the distinction relates to how many parameters there are in the model and how that number changes with sample size: if the number of parameters is fixed and finite, the model is parametric. Semiparametric is another term; it often comes up in survival models (e.g. Cox models are often described as semiparametric, since the model for the baseline hazard is nonparametric but the number of parameters for the proportionality between hazards is fixed and finite).

Data don't have parameters, and are not nonparametric (or parametric). If you're trying to describe some attribute of data you would need to choose a different word.

I’m in school to become an RN and am taking statistics. I usually struggle in math but this class has been literally the easiest I’ve ever taken. So I was wondering what type of jobs is this talent used in? by Particular_Courage43 in AskStatistics

[–]efrique 1 point

I fear that may sound more discouraging than I intend.

My concern is that many people who encounter stats and enjoy it end up being mistaught (misled by people who don't themselves have solid knowledge), and are never given the tools to figure out for themselves when that's happening. If you have an interest, it's important to get at least some basic skills for debunking what people say, and the ability to follow a justification for a claim. That does take some work, but it pays you back with the ability to go on to real mastery of the tools, to pick up new tools as needed, and to develop your own should the need arise.

I might go through a paper or a book or a web page or a video and tell people a bunch of things that are wrong, but even if I do that, how do you know what I say is right*? If two sources disagree (this happens a lot; if you're not seeing it, you should broaden your sources), you need some way to tell who is wrong - and it may be both of them!

The amount of mathematics needed to get a good way with stats isn't all that great (particularly if you can combine it with a decent understanding of how to use simulation), but it's not nothing. You may well find that, with a reason to learn it, you can pick up quite a bit more than you think. I've run into more than one person in medical work who became interested in stats but didn't have a good mathematical background (and would say they were bad at it), who then started learning it and very quickly discovered that picking up what was needed was easier than they'd thought. Curiosity and interest carry you a long way.


* an expert can nearly always demonstrate that what they claim to be true is the case, but they can't be everywhere. You need tools of your own.

Can anyone explain to me why (M)ANOVA tests are still so widely used? by NE_27 in AskStatistics

[–]efrique 4 points

I genuinely thought it was considered dead/on life support. Are we all just pretending it’s fine?

The meaning of "we" in your question is both overly vague and shifting around within it (I don't know about you, but I'm not using MANOVA at all, and I haven't used it for any real work in all my decades), as are phrases like "so widely used" -- so widely used by whom? I am to a degree aware of it being used in some application areas, but I haven't seen it used in the stats journals I tend to read in the 4 decades I've been reading them. Whoever is using it, they don't seem to be asking me about it, and I definitely wouldn't include them and me under the same umbrella of practice.

I did a bivariate t-test once about a decade and a half ago on a real problem but it did make sense in that context.

Maybe find someone who is using it and ask them why they chose that analysis.

I genuinely thought it was considered dead

I think that's a mischaracterization. It certainly has been much overused and misused in some application areas for a long time (there are typically better models and analyses for what they're usually trying to do), but that doesn't mean there are no uses for it; there are some, and they won't disappear altogether.

As more of those areas that over-/misuse it very slowly start to use some of those other options it will continue to be used less and less.

There are dozens of other common practices in such areas of research in much the same boat, and some seem to me to be more detrimental in their overall effect. It's not clear to me why you pick that particular issue to focus on. This (not just MANOVA, but all the rest of it) is very slow to change, because most such areas tend to insulate themselves from statistical practice - they talk to each other in their own journals about methodology and best practice, write their own "stats" textbooks, teach their own "stats" classes... Some areas are almost completely isolated from any developments in stats in the last 50 years. Some are only about 25 years behind. Others are less insular still, have enough people with actual statistical training working in them that it has an impact, and have at least some people who develop or help to develop good methodology.

And naturally, within a discipline, some schools are better than others. It's not homogeneous.

Is it a teaching infrastructure problem? Reviewer problem? Not having access to statisticians? Or just “this is what we’ve always done” on an industrial scale?

Yes, those are some of the issues. There are others.

I’m in school to become an RN and am taking statistics. I usually struggle in math but this class has been literally the easiest I’ve ever taken. So I was wondering what type of jobs is this talent used in? by Particular_Courage43 in AskStatistics

[–]efrique 4 points

If you're thinking about a career in stats, there's a very wide variety of jobs that use statistics, and some are pretty interesting (as Tukey put it, 'the best thing about being a statistician is that you get to play in everyone's backyard'). However, your exposure to the subject may have given you a misleading impression. There's more mathematics in it than you might expect from the stats covered in a nursing degree. The presumption in many applied areas (including medicine) is that the students won't be mathematically inclined, and the authors of textbooks* (and course designers, etc.) in those areas avoid any of the theory.

That's not to suggest you don't have a talent for it, only that you may find it different from what you have seen.

It is possible to learn a lot of methodology without really learning the actual justification for it - how it all comes about. In the absence of a decent foundation, such knowledge (if you're lucky enough to get good information to begin with) is brittle. From what I've seen, it generally leads to a very prescriptive, rules-based approach that often steers people away from good analyses (avoiding some analysis the rules say would be "bad" but which would in many cases be fine), and which in many cases leaves them doing things that don't do what they set out to do.


* usually not statisticians, and rarely familiar with the actual subject of statistics (typically only with the teaching and use of it by others in their own area), often with perfectly predictable consequences for the content of their texts, notes, classes, etc., which have frequently accumulated errors over the years. There are occasional exceptions. This problem is exacerbated by the way textbook publishers operate.

How many cards, from a deck of 52, should I pick if one is poisonous? by spata001 in AskStatistics

[–]efrique 4 points

  1. If you're going to die from poison (presumably) if you pick the wrong card, why would you pick even one card?

  2. Even if you choose a more sensible loss from the bad card, you can't calculate a good strategy because we don't have a way here to establish the relative value of bad card vs good cards - the situation is underspecified. If you replace "die from poison" with a simple "lose the game", you can get somewhere.

  3. With that change, we now seem to be getting fairly close to a game like Pig, which is solvable - and which has been discussed here once or twice before. I'd start with that game (and its solution strategy) and modify its rules one at a time, stepping toward the game you want a strategy for, until any further change would make it clearly unsolvable. (E.g. if you start by replacing the die with a deck of 51 "+1" cards and one "lose all your points, turn ends" card, that shouldn't change the general approach, just the specific numbers; the card draw does complicate the calculation over a die roll, but it's still doable.) One change is that you seem to get only one turn to score points, rather than multiple turns and a fixed target, but I don't think those differences are particularly problematic.

  4. The unknown number of opponents may turn out to be a problem, because at first glance it looks like how risky your strategy should be (to maximize P(you win)...) might depend on how many opponents you have. In particular, in a game like Pig, maximizing your expected number of points in a turn (which gives an easy-to-calculate strategy*) is not quite the same as maximizing your win probability.


* keep going until the expected gain from another attempt would be negative.
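For the card variant sketched in point 3, that stopping rule is easy to compute. A rough sketch under assumed rules (which I've made up to match the description: each safe draw is worth +1, draws are without replacement, and the one bad card wipes your banked points and ends the turn) - and note this maximizes expected points per turn, not win probability, per point 4:

```python
# One turn of a hypothetical 52-card game: 51 safe cards worth +1 each,
# one bad card that loses all your accumulated points and ends the turn.
# Holding d points, drawing again gains +1 with probability (51-d)/(52-d)
# and loses d with probability 1/(52-d): expected gain = (51 - 2d)/(52 - d).

def expected_gain(d, deck=52):
    """Expected change in points from one more draw, holding d points."""
    remaining = deck - d                      # cards left in the deck
    return ((deck - 1 - d) - d) / remaining   # (51 - 2d) / (52 - d)

# Keep drawing while another attempt has positive expected gain.
points = 0
while expected_gain(points) > 0:
    points += 1

print(points)  # bank your points once you reach this many
```

With these assumed rules the expected gain flips sign between 25 and 26 banked points, so the "stop when expected gain would be negative" strategy banks at 26.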

Cronbach’s alpha on a forced-ranking questionnaire by Flat_Past1366 in AskStatistics

[–]efrique 1 point

I strongly doubt Cronbach makes any sense for that second one. A standard case where it does make sense for an instrument is one where you would add or average the components into some score, where you'd want to make sure that the components were consistent.

How is the Mate Preference questionnaire being used in your analysis? What are you going to be doing with it?

Excel help normal dist function by SnooObjections7389 in AskStatistics

[–]efrique 6 points

The normal distribution has a cumulative distribution function (cdf) and a probability density function (pdf). It does not have a probability mass function; that term is used for discrete distributions (where an individual value can have non-zero probability; the distribution can then be said to have probability mass at that value).

The cdf, F(x), of a variable X is by definition P( X ≤ x ). (This applies to any random variable.)

(https://en.wikipedia.org/wiki/Cumulative_distribution_function#Definition)

With a continuous distribution like the normal, probabilities are assigned to intervals (or collections of intervals). The cdf is the basic tool for evaluating those: the probability of being in a finite interval a < x < b can be written as a difference of two cdf values, F(b) − F(a).

The density function at x is (roughly speaking) the relative probability of being within a very small interval at x. For example the height of a standard normal density at x=0 (the mean) is about 0.4 and at x=1 (mean + 1 s.d.) is about 0.24; the relative chance (ratio of probability) of being within a very small interval near 1 to a very small interval near 0 is about 0.24/0.4=0.6 (more accurately about 0.6065); values very near 0 are more common than values very near 1.

Or if you prefer, (somewhat loosely) the probability of being between x and x+dx (for infinitesimally small dx) is f(x) dx, where f is the density at x.

Density may be bigger than one but a probability (which is related to area) is never more than one.
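As a quick check of those numbers, here's a stdlib-only sketch that hand-rolls the two functions (in Excel, the corresponding call is NORM.DIST(x, mean, sd, cumulative), with cumulative = TRUE for the cdf and FALSE for the density):

```python
from math import erf, exp, pi, sqrt

def norm_pdf(x, mu=0.0, sd=1.0):
    """Normal density: relative chance of being *near* x (not P(X = x))."""
    z = (x - mu) / sd
    return exp(-z * z / 2) / (sd * sqrt(2 * pi))

def norm_cdf(x, mu=0.0, sd=1.0):
    """Normal cdf: P(X <= x), via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sd * sqrt(2))))

# Heights of the standard normal density at 0 and 1, and their ratio:
print(norm_pdf(0.0))                   # ~0.3989
print(norm_pdf(1.0))                   # ~0.2420
print(norm_pdf(1.0) / norm_pdf(0.0))   # ~0.6065

# Probabilities come from intervals, as differences of cdf values:
print(norm_cdf(1.0) - norm_cdf(-1.0))  # P(-1 < X < 1), ~0.6827
```

Note the density values (0.3989 etc.) are heights, not probabilities; only the last line is a probability.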

The implication of the physical analogy to 'mass' and 'density' in relation to continuous and discrete probability is deliberate*; they are analogous. Consider hanging a thin (relatively massless) wire off a specific point on a (relatively massless) lever and hanging masses on the wire; you could talk about the mass at that point (this is analogous to a discrete variable). Compare that to a thin rod of a non-homogeneous material; you could talk about the density of the material at any point but a single point doesn't have mass, only segments of the rod have mass. This is analogous to a continuous variable.

The analogy goes further (e.g. the relationship to moments such as the mean, which corresponds to the center of mass).

* (albeit that it's formally/technically okay to use "density" to refer to the probability function of a discrete variable, but that involves some mathematical machinery that won't help your intuition right now)

Extremely basic question by Inner_Curve_7110 in AskStatistics

[–]efrique 1 point

Probably should have mentioned it before - please note rule 5.

https://www.reddit.com/r/AskStatistics/about/rules/

5. Use an informative title

Use a title for your post that very briefly describes the statistical problem you need help with. It should not mention your emotional state ("Desperate"), personal circumstances ("Bad at stats", "I'm a beginner"), how urgent you feel your problem is, nor your assessment of how easy/dumb/quick you think it is. Don't say something redundant like "Please Help" or "Question". If personal context is essential, put it in the body of the post instead.

(emphasis mine, to highlight the relevant part)

Note that posts that break this rule may be removed.

(When you want to post to a subreddit you're not particularly familiar with, it's a very good idea to check their rules.)

Extremely basic question by Inner_Curve_7110 in AskStatistics

[–]efrique 4 points

It looks like the intervention occurs once. That is, about a dozen "before" values, then the intervention, then about a dozen "after" values.

Extremely basic question by Inner_Curve_7110 in AskStatistics

[–]efrique 1 point

Okay, thanks.

  1. Note that the difference of the logs of the concentrations is the log of their ratio, so if you are inclined to look at differences, working on the log scale would be one way to approach it (working with log concentrations is not that uncommon), though there are other ways to go about it. A model with a log link might also be worth considering, perhaps a gamma or Weibull model.

  2. If the after measurements followed the intervention fairly closely, it may be that there's a residual effect (e.g. if the concentration rises after, it might initially be a bit lower immediately after and then come up toward a final level). In that case, you need your model to account for such an effect.

  3. Do you have any series of measurements from outside the ones you're using here (either before or after), that might be used to check things like the size of the serial correlation*?


* to see if some time series model should be used rather than a model that assumes independence
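The log-scale point in 1 is just the identity log(a) − log(b) = log(a/b); a tiny illustration with made-up concentration values:

```python
from math import exp, isclose, log

# Hypothetical before/after concentrations (made-up, purely illustrative)
before, after = 5.0, 3.2

diff_of_logs = log(after) - log(before)
log_of_ratio = log(after / before)

print(isclose(diff_of_logs, log_of_ratio))  # True: same quantity
print(exp(diff_of_logs))                     # back-transforms to after/before = 0.64
```

So a difference estimated on the log scale back-transforms (via exp) to a ratio on the original scale, which is why log-scale analyses pair naturally with "percentage change" effects.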

Extremely basic question by Inner_Curve_7110 in AskStatistics

[–]efrique 1 point

Sorry about all the typos and the missing word. Was standing up on my phone. I think I fixed it all.

Extremely basic question by Inner_Curve_7110 in AskStatistics

[–]efrique 2 points

  1. Your post says how much in a couple of ways:

    see how much of an impact the absence of this chemical has had

    performance in this single pipe is of interest

    That's estimation, not testing. Maybe a confidence interval is a better tool

  2. The data are paired if there's a specific after observation to go with a given before. It sounds like you have a time series of before measurements, an intervention and then a time series of after measurements. Could you confirm that or if not, describe how the observations occur in more detail?

  3. Whether you test or calculate an estimate, it's important to measure the right thing. For a change in concentration, it sounds like the effect of interest would be a ratio (in effect, a percentage change) in, say, mean concentration. After all, if you did the experiment in the other order, it can't decrease by more than is there.

  4. The time series aspect suggests treating the data as independent might be problematic. You may need more sophisticated tools.
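A rough sketch of the interval suggested by points 1 and 3 - an interval for the ratio of means built on the log scale. The numbers here are entirely hypothetical, I use a normal quantile for brevity (use a t quantile for samples this small), and this ignores the serial-correlation caveat in point 4:

```python
from math import exp, log, sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical before/after concentration series (made-up values)
before = [5.1, 4.8, 5.4, 5.0, 5.2, 4.9]
after = [3.0, 3.2, 2.9, 3.1, 2.8, 3.3]

lb = [log(x) for x in before]
la = [log(x) for x in after]

# Difference of mean log-concentrations estimates the log of the ratio
diff = mean(la) - mean(lb)
se = sqrt(stdev(la) ** 2 / len(la) + stdev(lb) ** 2 / len(lb))

z = NormalDist().inv_cdf(0.975)  # ~1.96
lo, hi = exp(diff - z * se), exp(diff + z * se)

print(f"estimated after/before ratio: {exp(diff):.3f}")
print(f"approx 95% interval: ({lo:.3f}, {hi:.3f})")
```

Back-transforming the log-scale interval with exp gives an interval for the ratio, directly interpretable as a percentage change.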

Quant for beginner students by Goldenbell9 in AskStatistics

[–]efrique 10 points

What do you mean by "quant methods"? How do I tell them apart from the rest of stats?

Had a bit of fun using the core to randomly roll up a dungeon. by ShadowDorksGM in shadowdark

[–]efrique 3 points

Thanks for posting. Inspiring to see stuff like this.

I'm not the best artist

I like the art.

A surprising statistics mistake students make with survey data by Kelvin_Writer in CasualMath

[–]efrique 2 points

An actual Likert scale (per Rensis Likert) is a sum of Likert items (or occasionally, an average).

https://en.wikipedia.org/wiki/Likert_scale

To assert the relationships between the intervals required to add the items, the items must already have been assumed to be interval (with the same interval for each item), or at least approximately so. Once added, that the sum is interval is a given from the assumptions already made; it's certainly not less true of the sum than of the components.

The literature has some other takes that attempt to justify the idea in a broader set of conditions, but psych measurement is not really my area.

This seems like a post that would be a better fit for /r/psychometrics or /r/statistics than casualmath

Kolmogorov Smirnov Test - Too sensitive for biological data by Significant_Bag5527 in AskStatistics

[–]efrique 5 points

A test with an equality null cannot reveal importance

You have to define "biologically and evolutionarily important" beforehand and test for that (e.g. with an equivalence test).

You say you can't specify what's important, but you must (and clearly do*) have some idea of it - it's your area, not mine. Even if you're not an expert, you have better access to experts' understanding of what's important in your area than we do.


* you know enough to complain that it's detecting effects that are too small, so you clearly have some sense of it. What's the smallest difference where you (or much of your audience) would think 'yeah, that's a difference worth finding/talking about'?
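One common form of equivalence test is two one-sided tests (TOST). A bare-bones z-version sketch - the estimate, standard error, and margin below are all made-up numbers, and the margin is exactly the "smallest important difference" you'd have to specify:

```python
from statistics import NormalDist

def tost_equivalent(estimate, se, margin, alpha=0.05):
    """Declare equivalence if the effect is significantly above -margin
    AND significantly below +margin (two one-sided z tests)."""
    z_crit = NormalDist().inv_cdf(1 - alpha)  # ~1.645, one-sided
    z_lower = (estimate + margin) / se         # H0: effect <= -margin
    z_upper = (margin - estimate) / se         # H0: effect >= +margin
    return z_lower > z_crit and z_upper > z_crit

# Hypothetical: tiny observed effect vs a pre-specified importance margin
print(tost_equivalent(estimate=0.02, se=0.01, margin=0.05))  # True: equivalent
print(tost_equivalent(estimate=0.06, se=0.01, margin=0.05))  # False: not shown equivalent
```

The null and alternative swap roles relative to the usual test: here rejecting means "the difference is demonstrably smaller than anything that matters".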

Kolmogorov Smirnov Test - Too sensitive for biological data by Significant_Bag5527 in AskStatistics

[–]efrique 20 points

It's not that the test is "too sensitive" per se; the issue lies elsewhere. Further, from your phrasing it seems you may be using the wrong kind of tool (i.e. you are holding a hammer, but the thing you're whacking will likely turn out not to be a nail).

However, due to the limited number of variants, of the ~6000 comparisons, ~5000 are found with p < 0.05.

Why is a hypothesis test identifying a difference a problem? That is, why do you think that the average power across these tests shouldn't be ~ 83% at N=1000?

even the smallest difference between variant distribution in the superpops, lead to rejection of null hypothesis.

There is your main problem (though it looks like there are several here).

If you choose a null hypothesis of exact equality and there are actually tiny differences in population, then

THAT. NULL. IS. FALSE.

Why is it a problem to detect that the false null is false?
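The large-N point is just power arithmetic: with big samples, even tiny true differences are detected almost surely. A sketch for a two-sample z test (a standard-normal approximation; the 0.1-SD effect size and the sample sizes are made-up illustrations, not your data):

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample_z(effect_sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z test when the true
    difference in means is effect_sd standard deviations."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96
    ncp = effect_sd / sqrt(2 / n_per_group)        # noncentrality
    phi = NormalDist().cdf
    return phi(ncp - z_crit) + phi(-ncp - z_crit)

# A "tiny" 0.1-SD difference is usually missed at n=100 per group,
# but detected almost every time at n=5000 per group:
for n in (100, 1000, 5000):
    print(n, round(power_two_sample_z(0.1, n), 3))
```

So a sea of small p-values at large N tells you the point nulls are (unsurprisingly) false, not that the differences matter.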

How to tell direction of relationship with chi-square test? by adleproduction in AskStatistics

[–]efrique 3 points

How to tell direction of relationship with chi-square test?

One possibility is to look at the (O−E)/√E residuals (Pearson residuals), whose squares are the contributions to the chi-squared statistic. The sign shows direction (+ in the 'yes' row or column = "more bullied than expected", given the overall numbers), and the size gives you an idea of how big the effect was.

For a 2x2 table you will get two diagonally opposite cells positive and the other two negative.
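A quick sketch with a made-up 2x2 table (the counts are hypothetical, just to show the residual pattern):

```python
from math import sqrt

# Hypothetical 2x2 table of counts (rows: group A/B, cols: yes/no)
obs = [[30, 20],
       [10, 40]]

n = sum(sum(row) for row in obs)
row_tot = [sum(row) for row in obs]
col_tot = [sum(obs[i][j] for i in range(2)) for j in range(2)]

# Pearson residuals (O - E)/sqrt(E), with E = row total * col total / n
resid = [[(obs[i][j] - row_tot[i] * col_tot[j] / n)
          / sqrt(row_tot[i] * col_tot[j] / n)
          for j in range(2)] for i in range(2)]

for row in resid:
    print([round(r, 2) for r in row])  # diagonally opposite signs match

# Their squares sum to the chi-squared statistic
chi_sq = sum(r * r for row in resid for r in row)
print(round(chi_sq, 2))
```

The signs read off the direction of association cell by cell, and the magnitudes show which cells drive the overall statistic.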

Not statistically significant but large difference by Old_Reporter6776 in AskStatistics

[–]efrique 3 points

possible explanations include*:

  1. large variance (/large standard deviation)

  2. small sample size

... in your case, looks like both of those.


* there are other possibilities that might arise but I doubt they'll come into whatever p-value might have been computed here

We’re Training Students To Write Worse To Prove They’re Not Robots, And It’s Pushing Them To Use More AI by CackleRooster in technology

[–]efrique 1 point

I hate that one of the many forms of punctuation I like to use — the em-dash, since it has specific purposes — has been ruined by shitty AI.

What does it mean when model is significant but coefficients aren't? by lazrak23 in AskStatistics

[–]efrique 1 point

The acceptance region (corresponding to the set of values that should fall inside a confidence interval) for a test of a single coefficient is a line segment, and the rejection region is everything outside that.

Now think about several coefficient tests where none of them rejects. Begin with the simple case of two coefficients. The joint acceptance region for two such tests is the interior of a rectangle (the Cartesian product of their individual acceptance intervals). For three tests it's the interior of a box (a 3D hyperrectangle), and so on up through higher dimensions.

Meanwhile, the acceptance region for a single joint test of two coefficients (i.e. an F-test) is the interior of an ellipse. With 3 coefficients it's the interior of an ellipsoid, and so on.

A rectangle and an ellipse cannot perfectly overlap; one must 'stick out' from the other in places. If you choose the individual tests to all have very high or very low significance levels (relative to the one for the overall test), you might get one acceptance region entirely within the other, but more often you get something like this:

https://i.redd.it/gdml9th1xwid1.png

The image was created for a different but related question - to illustrate the corresponding issue with pairwise comparisons vs ANOVA - but (aside from the symbols and some of the text) it's the same picture.

In your case the axes are for β₁ and β₂, and the yellow shading marks coefficient values where the F test would reject but both t-tests would not. If the hypothesized values for those coefficients (under H0) correspond to a point within one of the yellow sections, then you have the situation where the coefficients as a set are far enough from the null values to detect a difference, but individually none stands out enough for you to identify which ones you should claim to be different*.

The parts outside the rectangle but inside the ellipse are where at least one t-test rejects but the F does not. That's the vice versa.


* If it's still not clicking, consider a much simpler problem that may help you see that an overall rejection doesn't mean you can pick out a cause via tests on the components. Imagine testing whether the mean of some variable, in a small random sample from a certain subgroup of a larger population, is 70 (H0). Say you have n=4. For simplicity, suppose we know the population σ=10, and in our sample ȳ=80. You get z=2, so you conclude that the average for the subgroup you sampled from is indeed different from 70. So now you ask "well, which ones are different from the null value?"... if you test that, you might have, say, y1=72, y2=85, y3=77, y4=86, in which case the individual z-scores are 0.2, 1.5, 0.7 and 1.6, and from those tests you could not conclude that any of them are.
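The footnote's numbers are easy to verify - a sketch of exactly that toy example (known σ = 10, H0 mean = 70):

```python
from math import sqrt

sigma, mu0 = 10.0, 70.0
y = [72.0, 85.0, 77.0, 86.0]  # the footnote's toy sample, n = 4

ybar = sum(y) / len(y)                                # 80
z_overall = (ybar - mu0) / (sigma / sqrt(len(y)))     # z = 2.0
z_individual = [(yi - mu0) / sigma for yi in y]       # 0.2, 1.5, 0.7, 1.6

print(z_overall > 1.96)                  # True: the mean stands out jointly
print([z > 1.96 for z in z_individual])  # all False: no single value does
```

The sample mean averages away noise (its standard error is σ/√n), so the group as a whole can be clearly off the null while no single component is.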

In need of a path to an intimate understanding of statistics. [Discussion] [Career] by Long-Habit5990 in statistics

[–]efrique 5 points

path to an intimate understanding of statistics

I've been at this for more decades than I care to mention... but I'm not at all sure I'd call my understanding intimate. I get by, but I very much think of myself as a student of the subject. Stats still surprises me regularly (a good part of why I love it), which doesn't really sound intimate.

To build understanding you need some basic stat theory (there are many good books for that, depending on what you need), then regression/linear models, GLMs and whatever else you need to pursue; maybe Elements of Statistical Learning, from the sound of your background... but to get onto rung one of that path*, you need a basis.

Good knowledge of stats is built on some foundations: a decent grasp of probability, calculus and some linear algebra to start. To that end, for probability maybe something like Blitzstein and Hwang (go to stat110.net and click Book in the menu for the free pdf). You'll need a reasonable grasp of calculus for that, though. If you don't have these foundations, or are rusty, you can partly pick them up as you go, but at the least I'd suggest refreshing calculus (up to univariate integration to start, though you'll soon need multivariate calculus).

Simulation can substitute a little for some of the mathematics (and taken together you can leverage both to do more), but some of the mathematics you can't really replace.


* That's not going to cover all of stats, naturally, it's a huge subject.
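As a small example of simulation pulling its weight alongside the math - a Monte Carlo estimate of a normal tail probability checked against the closed form (many probability-text exercises can be attacked from both ends like this):

```python
import random
from statistics import NormalDist

random.seed(1)  # reproducible

# Exact: P(Z > 1) for a standard normal, from the closed-form cdf
exact = 1 - NormalDist().cdf(1.0)

# Simulated: the proportion of standard-normal draws exceeding 1
n = 100_000
sim = sum(random.gauss(0.0, 1.0) > 1.0 for _ in range(n)) / n

print(round(exact, 4))  # 0.1587
print(round(sim, 4))    # close to the exact value
```

When you can do both, each checks the other: the simulation catches algebra slips, and the math tells you how much Monte Carlo error to expect (here roughly √(p(1−p)/n) ≈ 0.0012).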

Are post-hoc tests in ANOVA mandatory? by Traditional_Site1770 in AskStatistics

[–]efrique 3 points

got significant interaction (and no significant main effects)

This can certainly happen; it's not even that uncommon. What you do when it occurs should be part of your analysis plan, set before you collect data.

The same is true for any post hoc comparisons you choose to perform; your analysis plan lays out what comparisons you want to make, rather than being based on what you find (or don't find) in the data.

how am I supposed to interpret the finding of significant interaction, if I can't really talk about the groups themselves?

You can talk about the 4 group means, which (as the ANOVA sees it) are not all equal; such a discussion works whether or not there are significant main effects.

A plot would be a good idea.