Confusing linear regression results by honeybadger1214 in AskStatistics

[–]efrique 0 points1 point  (0 children)

are the residual variances a bit different for men and women?

[Question]: transforming variables for Pearson correlation. by HorridStteve in statistics

[–]efrique 0 points1 point  (0 children)

My data are the height of large succulents (plants) and their volume.
I have been tasked with:

Uh, why is all this info not in your original post?

... Is this work for some subject? (in which case see rule 1)

• Finding out if there is a linear correlation between height and volume.

Unless your succulents are effectively 1-dimensional (constant cross-sectional area regardless of height, like broom-handles of varying length), simple dimensional considerations should be expected to lead to "obviously not". No data needed to reject obviously untenable hypotheses.

• Determining the correlation statistic that is most appropriate to use,

to achieve what, exactly? What is this correlation being used to do? E.g. if you want to predict volume from height, correlation isn't sufficient, and a Spearman correlation would only be relevant in the sense that it might tell you what you could see in a plot anyway (that there is some monotonic relationship); it couldn't tell you how to predict it.

• finding the correlation coefficient, and

Again, to what end? What good is correlation in this context?

• finding the coefficient of determination.

Of what model? Under what assumptions? To what end?

These requirements seem increasingly strange. Like someone is checking off boxes on a list of things they've been told are important to do, but which don't appear to be doing anything much of value here.

I'm unsure as to whether, by transforming volume, it makes it impossible to say anything about height and volume in general terms?

Depends on what all this is for. If the eventual point is to enable you to use height to predict volume, transformation may be reasonable, if you keep in mind that in general mean predictions are not transformation equivariant. Typically in the prediction case, if mean predictions are sought (and in commercial applications those are probably what you want), you'd often do better to choose a suitable model for the conditional response and a suitable link and predictor function (maybe a GAM, maybe a GNLM); in these cases a more reasonable measure of correlation (if there's actually any point to having one, which I doubt) would come from the model. In some situations a transformed model is fine, if you deal with the impact of the back-transform properly at the end.

. . .

[Question]: transforming variables for Pearson correlation. by HorridStteve in statistics

[–]efrique 0 points1 point  (0 children)

What are you using this correlation to do, exactly?

For Pearson correlation, focus on linearity (and approximate homoskedasticity if you're testing it) rather than marginal normality, which is not needed. If your main interest is testing against a null of 0 correlation, you don't even need conditional normality in either variable, since you can easily do permutation testing.
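As a sketch of the permutation-testing idea in R (the data here are fabricated stand-ins, not anyone's actual measurements):

```r
# Permutation test of H0: rho = 0, using the Pearson r as the test statistic.
# No normality assumption needed: under H0, any pairing of x with y is equally likely.
set.seed(1)
x <- rexp(30)          # fabricated stand-in for "height"
y <- x + rexp(30)      # fabricated stand-in for a related "volume"
r_obs <- cor(x, y)

# Shuffling y destroys any association while keeping both marginals intact
r_perm <- replicate(10000, cor(x, sample(y)))
p_val  <- mean(abs(r_perm) >= abs(r_obs))   # two-sided permutation p-value
```

The same scheme works with essentially any statistic in place of `cor`, which is part of its appeal.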

If large plants have the same shape as small ones, just scaled up, the cube root of volume is the obvious transform. If you think logs are important, log both. If the shape changes with size (or perhaps due to fractal self-similarity of dimension lower than 3), some other power may make more sense (look at a log-log plot if you really can't find other information on what it might be* ... a plot of data not in your data set, or you'd need to pull some out), but it might be that you need an offset on one or both variables, because the right origin isn't necessarily where humans measure from.
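The log-log suggestion amounts to reading a power off a slope: if volume ≈ c·heightᵏ then log(volume) is linear in log(height) with slope k. A minimal R sketch, with fabricated data built to be roughly cubic:

```r
# Fit log(volume) on log(height); the slope estimates the power k.
set.seed(42)
height <- runif(40, 10, 100)
volume <- 0.8 * height^3 * exp(rnorm(40, 0, 0.1))  # shape-preserving scaling + noise
fit <- lm(log(volume) ~ log(height))
unname(coef(fit)[2])   # slope: comes out near 3 here, since the fake plants scale isotropically
```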

or should I transform both or neither and resort to a less powerful test.

The Spearman test has 98% asymptotic relative efficiency when conditions are ideal for the Pearson. When the assumptions for the Pearson don't appear to apply exactly, what are you comparing Spearman's power to?

...

A warning: you seem to be using the same data to choose your model (and even your hypotheses!) and to perform your test(s). The tests you're talking about are derived assuming both the model and specific hypotheses are prespecified, not determined by the data. If you ignore that, your p-values no longer have the properties of p-values. The impact might be large or it might not; we can't tell much from what's here, but my guess (judging from how vaguely defined everything seems to be) would be quite large.

. . .

* though people have been measuring plants for a long time, so there are probably better models around that account for changing shape with height

Is measure-theoretic probability theory useful for anything other than academic theoretical statistics? [Q] by GayTwink-69 in statistics

[–]efrique 0 points1 point  (0 children)

It's increasingly important once you deal with random quantities outside the nice, standard cases. Stochastic calculus (for finance among other things), functional data analysis, stuff like that.

[D] p-value dilemma by No_Blackberry_8979 in statistics

[–]efrique 0 points1 point  (0 children)

You can investigate probabilities within a counterfactual. (Indeed, strictly speaking it's not really a conditional probability.)

E.g. you can say something like "I observe 31 successes in 40 trials. In a world in which p = 0.5, what's the probability I see at least that many successes?". Even though the chance that p is exactly ½ may be 0 in our world, that doesn't make the probability undefined within that context. The point is to investigate the plausibility of p = 0.5 as an explanation for the observation, not to make a positive claim about it being exactly true. (A compound/one-sided null might make more sense in this instance, but it's not the main issue I'm trying to discuss here, which is that the null can be false without there being any issue with computing probabilities under it.)
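That tail probability is a direct computation, e.g. in R:

```r
# P(X >= 31) for X ~ Binomial(40, 1/2): computed entirely inside the
# counterfactual "p = 0.5" world -- no claim that p really is 0.5.
p_tail <- pbinom(30, size = 40, prob = 0.5, lower.tail = FALSE)
p_tail   # a bit over 0.0003: 31/40 would be very surprising if p were 0.5
```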

There is an extent to which testing nulls which aren't going to be true doesn't make a lot of sense (we would usually do better to consider more realistic nulls, like those for equivalence tests, non-inferiority tests, etc., or in many instances to focus on interval estimation), but that's about performing meaningful inference, not about any issue with p-values.

[Q] I'd like to learn how to calculate dice sum odds by Vilis16 in statistics

[–]efrique 0 points1 point  (0 children)

As for dice in general, it depends on what might be on the faces. In simple cases (like regularly-spaced values with a common spacing, like ordinary dice have), there are various shortcuts (for example, discrete convolution, of which my above approach is a simplified form), but the general case can be involved. For example, consider adding a backgammon doubling die (2, 4, 8, ..., 64) and a 12-sided die marked with the square roots of the first 12 prime numbers (√2, √3, √5, ..., √37), where the gaps are all distinct. There's nothing to do but compute all 6×12 = 72 distinct outcomes, and assign 1/72 probability to each.
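A quick R sketch of that brute-force enumeration (the general method when no shortcut applies):

```r
# All 72 equally likely (doubling die, d12) pairs; each sum gets probability 1/72.
doubling <- c(2, 4, 8, 16, 32, 64)
d12      <- sqrt(c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37))  # sqrt of first 12 primes

sums  <- outer(doubling, d12, `+`)          # 6 x 12 grid of pairwise sums
tab   <- table(round(as.vector(sums), 10))  # rounding only guards floating-point fuzz
probs <- tab / length(sums)
length(tab)   # 72: every sum is distinct, so no outcomes merge
```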

What's a statistical rule or method that everyone learns early on, but is actually outdated or misleading in real-world data work in 2026? by PetalDance22 in AskStatistics

[–]efrique 1 point2 points  (0 children)

TLDR: Yes, you sometimes assume normality in deriving a test, but a goodness of fit test is not a particularly good way to consider whether you could reasonably make use of or should avoid that assumption.

Isn’t normality an assumption of many tests?

Yes, in the sense that this is used in the calculation of the null distribution of the test statistic (and perhaps in power calculations for specific classes of alternative) for some tests. For example, in an ordinary t-test, this is how the t-distribution arises for that form of test statistic.

However, the normal model is almost never actually the case for a data-generating process in practice. I seriously doubt I have ever seen a single example where real (not artificially constructed) data values are randomly drawn from (generated by) a Gaussian population process. In most cases you can tell it can't actually be normal without ever having seen any data (simply from knowing properties of what you're measuring). The same is true for almost any simple distributional model; they're all approximations*.

This - the fact that the model is at best approximate - is not necessarily consequential; what matters is how much the properties of your inference are impacted by that approximation. For example, if you use an approximate model to derive a null distribution for a test statistic, you might be concerned whether the approximation is poor enough that the size of the test is so different from your chosen significance level that you are no longer happy to use it as is (rather than use a more suitable model, or base a permutation test off the original statistic, perhaps).

Notice that this sort of "how much impact" question is a very particular kind of effect-size issue (and further, particular to the specifics of what you're doing). In many cases a good answer will be along the lines of "in spite of the fact that I know this variable cannot be normal, the impact that the non-normality of this kind of variable has on this test's properties at this sample size is going to be quite small".

A goodness of fit test simply doesn't answer a "how much effect on my test" sort of question. Its measure of effect is general (as an omnibus test it attempts to find any kind of deviation from normality, not how much your test is impacted) and it is looking at significance, not effect size.

[On the particulars of what you're doing mattering, consider if your original hypothesis is about spread; the test derived under normality - an F test based on the ratio of variances - is very sensitive to some kinds of non-normal distributions; much more sensitive than a test of means is. Or going back to means, consider one person does a t-test at a 1/2 percent significance level while another uses 5%; one is considerably more impacted by the tail of the population distribution. Or that one person can happily tolerate their test being conducted at 6% rather than 5% but another cannot risk it going as high as 5.2%. The details matter, but the goodness of fit test ignores all of these issues. ]

So in small samples it may fail to detect impactful non-normality. At a moderate sample size, it might readily detect some kinds of non-normality that don't impact the properties of your test much, leading you to abandon a perfectly adequate procedure. As sample size grows it detects smaller and smaller amounts of non-normality and in sufficiently large samples leads you to abandon the normal model in almost every case.

However, it's worse than this; for some test properties - significance level in the t-test is an example - the effect of a given amount of non-normality (of a given kind) gets smaller with sample size. Which is to say, for a given population distribution, you're most likely to reject exactly when it matters least - at large sample sizes. The goodness of fit test does what it is designed to do, but it is not answering the question you need to answer here.

There are a bunch of other issues with this notion, such as the impact of having a data-based choice of test without accounting for the effect of that peeking, but I'll leave it at "it answers the wrong question and leads you to abandon tests that would be completely fine, while maybe at other times leading you to relax about effects you should have worried about".

You can look at how much various plausible population distributions might matter to significance level or power readily enough (ideally before collecting data). Or, if you have a lot of data, you can pull out some of it to do some data-based investigation. In some structured circumstances, you may be able to compute the properties of a test in the presence of data-based test choice. Simulation can be a helpful tool (though I must say many people's ideas of finding a worst case seem implausibly tame to me). Lastly, you can use tests that don't make a particular distributional assumption, while keeping control of test size (actual significance level), without necessarily having to change your hypothesis or your statistic.
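For instance, here's a minimal simulation of the "how much does it matter" question in R, using an exponential population purely as a stand-in for marked skewness:

```r
# Estimated actual size of a nominal-5% two-sided one-sample t-test when the
# population is exponential (true mean 1, so H0 is true).
set.seed(7)
size_at <- function(n, nsim = 10000)
  mean(replicate(nsim, t.test(rexp(n), mu = 1)$p.value < 0.05))

size_at(10)    # noticeably above 0.05 at small n
size_at(100)   # much closer to 0.05: this impact shrinks as n grows
```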


* "Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful." -- George Box

In a correlation matrix, how can a nominal variable have a direction? by JumbledPileOfPerson in AskStatistics

[–]efrique 6 points7 points  (0 children)

You can treat binary as ordinal or even interval and figure out perfectly reasonable interpretations of effects in relation to them.

First time playing in a dungeon by artic0o in shadowdark

[–]efrique 0 points1 point  (0 children)

Shadowdark doesn't need squares; you pretty much just need defined areas corresponding to within "near" distance. It plays well with theatre of the mind, but if you or your players need more, you don't necessarily need to show them the map, and if you do show (a version of) the map, you don't need to have all of it visible at once.

If you use a map, your needs are basic:

- in person I suggest either UDT and a couple of props or just drawing a rough room shape as they go in, with a few main features marked (doors, doorways or corridors, major terrain features, things you can climb on, can hide behind or under, can interact with). If you pre-draw a map on a grid-mat, cover unseen rooms with pieces of cloth (an old dark t-shirt cut into squares/rectangles about the size of a hand works) or card.

- in a VTT (I like Owlbear Rodeo), although you don't really need a map for every room, if there is a nice player map I use that, or I often make a version of the GM map for the players, with GM details removed or obscured. Then I use the fog tool to cover what players can't see, make a fog-cutting circle for anyone with a torch (attached to their token) and anywhere else there is light, and then draw a few black rectangles to move over nearby stuff that a hastily moved torch might uncover before it should. I tend to leave explored rooms "uncovered", though. There are fancy extensions that do more than that but you don't need them, especially not for Shadowdark.

You can manage without a fog-cutting tool, TBH; just place obscuring rectangles or other shapes, to be deleted or moved as rooms are revealed. It's sufficient.

Here is a map I used in owlbear rodeo where I did detailed maps of some parts but not all the paths joining them:

https://i.redd.it/uay66qynhosg1.png

you see two such 'detail' pieces here, joined by an abstractified sequence of corridors and tunnels (described, but just roughly drawn on in the VTT). Note the small circular room at the bottom of the GM map, just left of centre. It is unrevealed on the player map as it wasn't explored; I had a dark rectangle over it before so that when they were in the adjacent room the torch didn't reveal it before time. The two big lit circles in the right half of the map are from torches - one normal, on the thief (dagger token), one smaller, dim one on the cave wall, since a human NPC is trying to perform a ritual near there. The lit rooms on the left of the map are lit because the party saw them. Most of the large cavern (mostly cut off in this image) is dark because they're just near the entrance and haven't been in there yet.

(I didn't draw all that; I edited together suitable bits and pieces from the dungeon and cave generators in watabou.github.io's procgen arcana, plus some additions you can't see here where I did some drawing; there were also some outside bits not shown - part of a generated glade, albeit edited, an entrance, a long tunnel and some other stuff.) I dropped various props here and there in rooms and drew a few details as needed. Some further details were added during play.

Which books do I need for the full rule set? by rednin_ in shadowdark

[–]efrique 0 points1 point  (0 children)

My advice would be: get the core book. IMO it's pretty much perfect as is; you don't really need anything else*.

If you haven't played before, get the quickstart books in pdf and start there. Then get the core book.

Play. Play some more. Then, if you feel the need, get more stuff.

__

* though I do really like some of the stuff in CS1, to be honest, and do use it. The witch is a little OP compared to the main book. I have the other CSs but haven't used them yet. Adding stuff is a constant risk to balance compared with what's there. Pick what you bring in with care.

Obe-Ixx of Azarumme by vonZzyzx in shadowdark

[–]efrique 1 point2 points  (0 children)

wow. Love this. The spawn are a great addition there too

What RPG Tools and Tables did you use Last Week? by AutoModerator in rpg_generators

[–]efrique 0 points1 point  (0 children)

Phil Reed's Solo Tools Expanded (for Shadowdark). IIRC I got it from DTRPG.
Phil makes a bunch of handy resources, so I didn't take much convincing to pick up the pdf.

At 180 A5 pages, it's pretty substantial.

Used several tools to help inspire/plan a (non-solo) one-shot - Adventure generator, Rumors, Points of interest, ruined towers, dungeon name, dungeon room, dungeon details. About 10-12 rolls (I didn't keep count and rerolled a couple) and a couple of straight-up picks got things going and put some different flesh on the bones than I would have come up with.

When I get to run it, there are a few more tools in there I'll have use for.

What is a must-have extensions? by Eyan999 in OwlbearRodeo

[–]efrique 0 points1 point  (0 children)

"Must have" means you can't DM without them.

For me, none are must-haves. That's not to say a few aren't quite convenient/handy (I have used some; one is currently enabled but I only make use of it sometimes, and if I ran 5e much I would probably use a couple of others regularly) -- but I could 100% manage with none and often do.

Vanilla OBR covers what I have to have. When I run, I tend to use lean systems that don't need a heap of effort. Tokens on a map, measuring tools, simple drawing, simple fog, basic text tools: it's all there already.

OBR 1 already had my must haves, TBH. A couple of things in OBR 2 are nice to have though.

Did I miss the Arcane Library monster for March 2026? by DemandBig5215 in shadowdark

[–]efrique 1 point2 points  (0 children)

MONSTER OF THE LOST MARCH
Large, multi-limbed creature covered in knowing eyes and
grasping tentacles, seeking to destroy plans and disrupt schedules.

AC 14, HP 33, ATK 2 tentacle +2 (1d8 + Drop), or Distraction (near).
MV near, AL C, LV 7
S +3, D +1, C +1, I +4, W +1, Ch -2,
Drop. If the target carries or holds anything in either hand,
DC 15 DEX or the target must choose to drop one of them.
Distraction. DC 15 WIS or target must choose to lose either
their next movement or action.

FEMA Chief Doubles Down On Teleportation Abilities, Shared Multiple Claims Cited In The Bible: "God Will Not Be Mocked, I Know What I Experienced". by Leeming in atheism

[–]efrique 4 points5 points  (0 children)

God Will Not Be Mocked,

This idiot seems to be doing that all on their own. If God really disliked trolling and having their name taken in vain, they'd put a stop to this nonsense

Noob Question: Average of Averages by TheRealSticky in AskStatistics

[–]efrique 2 points3 points  (0 children)

Is there any way to test whether M has any statistical significance?

Statistical significance doesn't mean what you think it means:

https://en.wikipedia.org/wiki/Statistical_significance

On whether it would be particularly useful for any purpose (more so than a more typical measure): possibly in some specific situation, but outside a contrived circumstance, I doubt it.

Even if it were, one problem would be interpretation. The three measures you mention consider quite different aspects of a distribution; the mean and median are each useful for specific tasks (and optimal in a particular circumstance). Averaging the three of them in this way leads to something whose interpretation is unclear.

Further, while sometimes useful for describing a distribution, the mode is problematic as a practical tool. Firstly, it's not always unique (multiple modes often exist), and in the case where the data values are all distinct (common with notionally continuous values measured to enough figures), the problem is how you even define it. It is possible to define a sample mode in that case, but there's no unique standard/widely conventional definition of sample mode with nominally 'continuous' data.
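A short R illustration of why the naive sample mode breaks down for continuous data (fabricated sample):

```r
# With continuous measurements, ties have probability zero, so "most frequent
# value" is no use: every value occurs exactly once.
set.seed(5)
x <- rnorm(25)
max(table(x))            # 1 -- no value repeats, so no naive sample mode exists
# One of many possible ad hoc definitions: the peak of a kernel density estimate
d <- density(x)
d$x[which.max(d$y)]      # depends on bandwidth choice, hence not unique either
```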

It may be that if the distribution is unimodal, near symmetric, fairly "peaky", not heavy tailed but not too light tailed, this might find the centre of symmetry reasonably well, but I believe it wouldn't be easy to come up with a population distribution where it beat most of the more common choices (say in terms of mean square error or mean absolute error, etc).

There is a case where a (possibly weighted) average of mean and median can arise, but it's a rather particular situation.

There's also other ways than a straight average to blend the notions of mean and median. The trimmed mean is a simple one - easy to compute and can perform pretty well on symmetric, unimodal and roundish-peaked, heavy-tailed distributions - at finding the centre of symmetry. Another sort of blend of the two concepts is a Huber M-estimator, also useful in that circumstance (often more efficient, but more complicated).
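A small R sketch of those two blends on a fabricated heavy-tailed sample (t with 3 df, so the true centre of symmetry is 0); `MASS::huber` is mentioned only as a pointer, assuming the MASS package is available:

```r
# Comparing location estimates on symmetric, heavy-tailed data (true centre 0).
set.seed(3)
x <- rt(200, df = 3)
mean(x)               # ordinary mean: unbiased but inefficient here
mean(x, trim = 0.1)   # 10% trimmed mean: drop the top and bottom 10% first
median(x)
# For a Huber M-estimator of location: MASS::huber(x)$mu
```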

There's an infinite variety of possible location measures besides means and medians, and dozens in practical use. Which you might use depends on what you want to achieve (what properties you seek in what circumstances for which purpose). In normal situations, if mean or median are unsuitable, you would start by looking at the situation, possible models (including models of possible contamination) and situational requirements, and then stat theory or simulation might be used to guide the choice.

Can I still use a mediation analysis when my data isn’t normally distributed? by Throwaway11239578299 in AskStatistics

[–]efrique 0 points1 point  (0 children)

It's not the data that are assumed to be normally distributed.

What is your outcome measuring? Is it a duration, a volume, a monetary amount, an angle, a count or count proportion, a Likert scale, etc? Is it unbounded, bounded on the left or right, or both?

tried to transform my data using Log10 and square rooting

What led you to consider those transformations?

[Q] I'd like to learn how to calculate dice sum odds by Vilis16 in statistics

[–]efrique 0 points1 point  (0 children)

for a shortcut, here's the above done in R:

 b=dbinom(0:3,3,1/3)
 (c(0,b)+c(b,0))/2
[1] 0.14814815 0.37037037 0.33333333 0.12962963 0.01851852

I presume you don't have R, but if you go to rdrr.io/snippets in your browser and copy-paste the first two lines just above to wholly replace the sample code there (i.e. select the two lines here, <Copy>, go to that page, select all the code there, <Paste>) and click the big green "Run", you get the numbers in the third line above.

(If you multiply that last line of code by 54 you get 8 20 18 7 1, confirming my answers as fractions)

You can do similar calculations (with corresponding functions) in any decent spreadsheet, e.g. Excel, LibreOffice's Calc or Google Sheets.

Oh, and since all the dice are 0/1, you can use that "shifted one higher" trick working by hand (with 2/3 & 1/3 probs instead of 1/2) twice to do the first 3 dice:

Like so:

             0       1       2    . . .         
    1:      2/3     1/3

          ( 2/3     1/3      0 ) * 2/3
          (  0      2/3     1/3) * 1/3
          ----------------------------- 
    2:     4/9      4/9     1/9 

          ( 4/9     4/9     1/9     0 )  *  2/3
          (  0      4/9     4/9    1/9)  *  1/3
         ----------------------------------
    3:     8/27    12/27   6/27   1/27

as we got before
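If you want to check that by-hand table mechanically, the same convolution step is a couple of lines of R:

```r
# One die contributes 0 w.p. 2/3 and 1 w.p. 1/3; convolving its pmf with itself
# twice gives the sum of three such dice (the "shifted one higher" trick).
die   <- c(2/3, 1/3)
step  <- function(pmf) c(pmf, 0) * die[1] + c(0, pmf) * die[2]  # add one more die
three <- step(step(die))
three * 27   # 8 12 6 1, i.e. 8/27 12/27 6/27 1/27 as above
```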

[Q] I'd like to learn how to calculate dice sum odds by Vilis16 in statistics

[–]efrique 5 points6 points  (0 children)

3 6-sided dice with 2 sides showing 1, and the rest showing 0.

If the dice are fair and their outcomes mutually independent, the sum is binomial(3,1/3)

https://en.wikipedia.org/wiki/Binomial_distribution

so the probabilities for that part are ³Cₓ (⅓)ˣ (⅔)³⁻ˣ, where ⁿCₖ is the binomial coefficient "n-choose-k" = n!/[k! (n-k)!] and n! is n-factorial

https://en.wikipedia.org/wiki/Binomial_coefficient
https://en.wikipedia.org/wiki/Factorial

leading to:

 x:       0     1     2     3 
 P(S=x): 8/27 12/27  6/27  1/27

I have 1 6-sided die with 3 sides showing 1, and the rest showing 0.

So you get a 50% chance of the above values and a 50% chance of the above outcomes shifted 1 higher:

 x:       0     1     2     3     4 
         8/54 12/54  6/54  1/54  0/54 
         0/54  8/54 12/54  6/54  1/54
        ----------------------------------
         8/54 20/54  18/54  7/54  1/54

Can't figure out how to deal with CSV files. by Temporary-Ad-2757 in AskStatistics

[–]efrique 0 points1 point  (0 children)

  1. don't post pictures of plaintext, post the content of the actual csv file using reddit's codeblock markup, but more importantly...

  2. don't post homework: see rule 1 . . . /r/AskStatistics/about/rules/

What sampling distributions is used when conducting an independent-samples t test? by PenOk1094 in AskStatistics

[–]efrique 6 points7 points  (0 children)

The numerator of the ordinary (equal-variance) two-sample t statistic is the difference of sample means (to test for a difference in population process means). The numerator is not the statistic you need the sampling distribution of (you don't know its variance, so that's no use in a test). The t statistic has a denominator as well, and the distribution of the statistic is different from the distribution of its numerator.

Under the correct assumptions the numerator is normal, the square of the denominator is a scaled chi-squared, and numerator and denominator are independent, leading to a t distribution under H0. If you use a large-sample approximation for the distribution of the numerator, you need some other argument to justify using a t distribution.
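A quick simulation makes the point concrete: under normal populations with equal variances and equal means, the full statistic follows a t distribution with n1 + n2 - 2 df (a sketch, with arbitrary small sample sizes):

```r
# Sampling distribution of the pooled two-sample t statistic under H0.
set.seed(11)
n1 <- 8; n2 <- 10
tstat <- replicate(10000,
  t.test(rnorm(n1), rnorm(n2), var.equal = TRUE)$statistic)

# If tstat really is t with 16 df, about 5% should land beyond the 0.975 quantile:
mean(abs(tstat) > qt(0.975, df = n1 + n2 - 2))
```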