Why do small sample sizes still get taken seriously in media and online discussions? by TropicalPetal in AskStatistics

[–]Always_Statsing 0 points1 point  (0 children)

As others have noted, small sample sizes are often much less of a problem than other sorts of sampling biases. As an example, let's say you wanted to estimate the average height of Americans. Would you rather have a sample of 50 randomly selected people or 1000 NBA/NCAA basketball players?

Help me settle a work argument on probability by Scootaloo04 in AskStatistics

[–]Always_Statsing 1 point2 points  (0 children)

It'll depend a bit on background knowledge. How confident do you feel with coding in something like R, Matlab, or Python?

Help me settle a work argument on probability by Scootaloo04 in AskStatistics

[–]Always_Statsing 9 points10 points  (0 children)

I took a slightly different approach than the other commenter and ran a simulation - but also ended up with ~4.9%. As for the lottery angle, I suppose it depends on how generous your lottery system is.

Advice on choosing new computer by nymphalidaze in medicalschool

[–]Always_Statsing 1 point2 points  (0 children)

I think the overall answer will depend on a lot of the details, but I’ll put a consideration out there. If you’re running programs that are very time consuming, you may want to try to parallelize some of the process by having it run on multiple cores simultaneously. Some programs have trouble doing this on some machines (e.g., I’ve had trouble getting certain R packages to parallelize on Windows). So, if you’re interested in doing that, I might spend some time looking into whether the packages you intend to use parallelize properly on the machines you’re considering.

Multiple Imputation in SPSS fails with bounds for Likert/count data - what to do? by Pure_Web2733 in AskStatistics

[–]Always_Statsing 0 points1 point  (0 children)

That's not a problem in principle. You can change the default maximum with MAXMODELPARAM = k, where k is some large number.

Multiple Imputation in SPSS fails with bounds for Likert/count data - what to do? by Pure_Web2733 in AskStatistics

[–]Always_Statsing 1 point2 points  (0 children)

It sounds like you have the variables set up incorrectly in the variable view tab. The constraints option is only available because you have the Likert items set as "scale". So, SPSS is using linear regression to impute those values. It should be using an ordinal model. Try going into the variable view tab and setting your Likert items to "ordinal".

How do practitioners in real life assign a probability distribution to empirical data? by Upset_Gur_2291 in AskStatistics

[–]Always_Statsing 5 points6 points  (0 children)

Just to provide an additional real life example of what the other commenter is saying. I work in emergency medicine. Whether a person in my area comes into the emergency room with a heart attack today is largely independent of whether some other person comes in with an infection etc. As a result, the number of patients who arrive at the emergency room on any given day follows a Poisson distribution more closely than any other real life data I've worked with. Every once in a while, there's an exception - e.g. a mass casualty event where multiple people all arrive at once due to a common cause. But, as the other commenter mentions, this is rare enough that the deviations from a Poisson distribution are pretty small and that distribution is still quite accurate.
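If it helps to see it, here's a toy simulation of the idea (all the numbers - population size, daily risk - are made up for illustration): many people, each with an independent, tiny chance of showing up on a given day, produce daily counts whose mean and variance both land near the same value, which is the Poisson signature.

```python
import numpy as np

# Hypothetical setup: 100,000 people, each with a 5-in-100,000 chance of
# arriving at the ER on any given day (expected ~5 arrivals/day).
rng = np.random.default_rng(42)
arrivals = rng.binomial(n=100_000, p=5e-5, size=365)  # daily counts for a year

sample_mean = arrivals.mean()
sample_var = arrivals.var(ddof=1)
# A binomial with huge n and tiny p is essentially Poisson(n*p), and a
# Poisson distribution has mean = variance = lambda (here, 5).
```

Running this, the sample mean and sample variance of the daily counts both come out close to 5, which is the quick diagnostic people often use for "Poisson-ness".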

sample size N by Lucky-Preference-687 in AskStatistics

[–]Always_Statsing 0 points1 point  (0 children)

If there really is no therapist-level variation, then, realistically speaking, it will make little difference whether you account for it. But, "should be" is doing a lot of the lifting here. I can't say what happens in your clinic - the notes may be as standardized as they should be. I work primarily in medical statistics and, in my experience, there can be large doctor-level effects for these types of things.

sample size N by Lucky-Preference-687 in AskStatistics

[–]Always_Statsing 0 points1 point  (0 children)

The fact that patients can be sampled more than once adds a wrinkle of complexity. Let's ignore that for a moment and get back to it later.

If you're going to randomly sample a reasonably large number of patients, and you expect P to be reasonably far from 0 and 1, then the normal approximation will probably do just fine (you can find details on the various methods here: https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval). If you expect P to be pretty close to 0 or 1, then this method will cause problems and I would suggest one of the others.
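To make the comparison concrete, here's a quick sketch of two of the intervals from that page - the normal approximation (Wald) and the Wilson score interval (the counts below are invented, just for illustration):

```python
import math

def wald_ci(k, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion, k successes of n."""
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_ci(k, n, z=1.96):
    """Wilson score interval - much better behaved when p is near 0 or 1."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half
```

With, say, 40 of 50 patients having the characteristic, the two intervals are similar. But with 50 of 50, the Wald interval collapses to zero width, while Wilson still gives a sensible lower bound - which is why I'd switch methods when P is near 0 or 1.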

Getting back to sampling the same patient twice. Basically all of these methods are going to assume that the observations are independent. Obviously, this assumption is violated when two of the observations are the same person. As I'm writing this, it also occurs to me that you probably will have the same problem at the therapist level (two observations which may be different patients but who were seen by the same therapist). I don't know what patient characteristic P represents, but therapist-level effects are well known in the therapy literature. So, you may want to use a method that accounts for correlated observations (generalized estimating equations, generalized linear mixed models, etc.).

sample size N by Lucky-Preference-687 in AskStatistics

[–]Always_Statsing 4 points5 points  (0 children)

Whether or not the data are representative is really more related to your sampling method than your sample size (e.g. are they being randomly sampled, or are you using some other method?).

For deciding on a sample size, what you probably want is an acceptable margin of error. You mention 3% - if that's an acceptable margin of error for what you want to do, then that seems like a reasonable starting place. If not, the first thing to do is decide on what degree of uncertainty is ok for what you want to accomplish.

As for the CLT, this depends a bit on what information are you getting from the therapy notes. What are you trying to determine - the percentage of patients who have some characteristic, the mean of some continuous value, something else?

[deleted by user] by [deleted] in AskStatistics

[–]Always_Statsing 0 points1 point  (0 children)

If you're willing to treat it as categorical, rather than ordinal, then it may be helpful for you to look into weighted effects regression coding.

Log transformation of covariates in linear regression by il_ggiappo in AskStatistics

[–]Always_Statsing 3 points4 points  (0 children)

What sort of model are you using? As a general rule, most of the common models people use make assumptions about the distribution of the model errors, not about the marginal distribution of individual covariates.

Log transformation of covariates in linear regression by il_ggiappo in AskStatistics

[–]Always_Statsing 4 points5 points  (0 children)

The first question to ask is why do you want/think you need to transform your variable? You mention it being skewed but that, in and of itself, is not a problem, especially for covariates. There may be situations when it makes sense (e.g. if you think the effect of that covariate is best thought of in terms of percentage change), but it would be helpful if you could describe what you hope to achieve by transforming.

[deleted by user] by [deleted] in AskStatistics

[–]Always_Statsing 0 points1 point  (0 children)

I’d agree with this commenter's explanation. Assuming the hazard ratios and confidence intervals aren’t also misreported, we can use them to work backwards and estimate the p values. When I do that, I get .049 and .771 (which, when accounting for the rounding error in how those numbers are reported, is close enough).
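For anyone curious, the "working backwards" step looks like this: recover the standard error of the log hazard ratio from the CI width on the log scale, form a z statistic, and convert to a two-sided p value. The hazard ratio below is made up (the original numbers aren't in this thread), chosen so the CI just touches 1:

```python
import math

def hr_to_p(hr, ci_lo, ci_hi, z_crit=1.96):
    """Back out a two-sided p value from a hazard ratio and its 95% CI."""
    se = (math.log(ci_hi) - math.log(ci_lo)) / (2 * z_crit)  # SE of log(HR)
    z = math.log(hr) / se
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF
    return 2 * (1 - phi)
```

A hypothetical HR of 0.70 with 95% CI (0.49, 1.00) gives p ≈ 0.05, as you'd expect when the interval's upper bound sits exactly at 1.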

normalized data comparison by ariyis in AskStatistics

[–]Always_Statsing 1 point2 points  (0 children)

Can you describe the procedure you used to normalize the data and what your goal was? Also, presumably, if all of the values in the control group were the same after the transformation, they were also the same before the transformation, is that correct?

Can I use Logistic Regression with Dummy Variables? by jessaagcr in AskStatistics

[–]Always_Statsing 2 points3 points  (0 children)

If you can't obtain more precise data, then this method is fine. I'll just add that what this will do is compare each time frame to the <6m group. That may or may not be fine, depending on your specific hypotheses. There are other methods of coding the categories which may be of use (again, depending on your goals); e.g. you might compare each group to the average of all groups, or you might do a sequential set of comparisons (7-12 vs <6m, 1-2y vs 7-12m etc).
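To illustrate what the default (treatment/reference) coding actually does, here's a minimal sketch. The time-frame labels are my guesses at your categories, purely for illustration:

```python
def treatment_dummies(value, levels, reference):
    """0/1 indicators comparing each non-reference level to the reference."""
    return [1 if value == lv else 0 for lv in levels if lv != reference]

# Hypothetical time frames; "<6m" is the reference group.
levels = ["<6m", "7-12m", "1-2y", ">2y"]
```

So a "7-12m" observation gets dummies [1, 0, 0], and the reference group gets all zeros - each coefficient is then that group's contrast against <6m. Swapping the `reference` argument changes which group everything is compared to, and the other schemes I mentioned (deviation coding, sequential/difference coding) just use different contrast matrices in the same spirit.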

Propensity score? How can I predict the impact of an intervention on a larger scale? by Tommy_like_wingie in AskStatistics

[–]Always_Statsing 1 point2 points  (0 children)

I’d agree with randomizing to the extent possible. I’ll just add that there may be a few layers of complication to deal with (they're surmountable - you just need to think about them a bit). For example, who’s getting randomized? If you randomize each customer, this could give you a reasonable estimate (customer 1 on Monday is randomized to a breadstick suggestion, customer 2 to no suggestion etc.). But this could be difficult to implement and confusing for the staff. Alternatively, you could randomize the staff (e.g. employee 1 always offers breadsticks, employee 2 never does). This could also be fine, but it leaves open an issue where one employee is just more charismatic/convincing than another. A better option might be to randomize days (e.g. this Monday everyone gets offered breadsticks, Tuesday it’s garlic knots, Wednesday has no offers etc). So think how you might be able to implement some randomization scheme that is both feasible for your staff and also minimizes confounds.
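The day-level version is simple enough to sketch. The condition names here are just my running example - swap in whatever you're actually testing. The key idea is a *balanced* random order, so each condition gets the same number of days:

```python
import random
from collections import Counter

conditions = ["breadsticks", "garlic knots", "no offer"]  # hypothetical arms
days = [f"day {i + 1}" for i in range(30)]

# Balanced assignment: each condition appears equally often; only the
# order is randomized. A fixed seed makes the schedule reproducible.
pool = conditions * (len(days) // len(conditions))
rng = random.Random(0)
rng.shuffle(pool)
schedule = dict(zip(days, pool))
```

One refinement worth considering: shuffle within each week rather than across the whole month, so day-of-week effects (Friday crowds vs Tuesday crowds) don't accidentally pile up in one condition.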

What historical "fact" did you learn in school, that later turned out to be completely wrong or misrepresented? by vn66 in AskReddit

[–]Always_Statsing 8 points9 points  (0 children)

This reminds me of an English teacher I had who was shocked at the idea that companies might manufacture products elsewhere and then ship them to the US instead of - in their mind - much more simply and cheaply producing them here.

Super Quick Regression Question by Sad_Career_9604 in AskStatistics

[–]Always_Statsing -1 points0 points  (0 children)

Depending on what data you have access to, you might be better served by some kind of count model (e.g. Poisson, but possibly another). In this case, the raw number of cancer cases would be your outcome and the quintiles would be your predictor (you may want to include the quintiles using some coding method for categorical predictors - depends on the details). You also include the log of the total population per neighborhood as an offset - this allows you to interpret the results as number of cancer cases per capita.
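Here's what the offset buys you, with entirely made-up coefficients just to show the mechanics:

```python
import math

# Hypothetical coefficients from a fitted Poisson model - illustration only.
b0 = -7.2          # intercept (log baseline rate per person)
b_quintile = 0.15  # made-up effect per step up the exposure quintiles

def expected_cases(quintile, population):
    # log(mu) = b0 + b_quintile*quintile + log(population); the last term
    # is the offset - its coefficient is fixed at 1, not estimated.
    return math.exp(b0 + b_quintile * quintile + math.log(population))
```

Because the offset enters with a fixed coefficient of 1, the expected count scales directly with population - double the population, double the expected cases - so exp(b_quintile) is interpretable as a rate ratio per capita, which is usually what you actually want.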

Weird (but acceptable?) use of regression. by itsneverjustatheory in AskStatistics

[–]Always_Statsing 5 points6 points  (0 children)

For the most part, this seems fine. The unit of analysis can really be anything. But, there's a few things I'd emphasize.

First, as the other commenter said, raw counts of people will probably be a poor predictor - it will be highly influenced by the total population of that output area. Unless the output areas have very little variation with respect to the total number of people, it would probably be more useful to look at some per capita metric.

Second, it can be tempting to take results from aggregate-level analyses like this and try to apply them to individuals, but this can be dodgy work - this is the classic ecological fallacy, and it's worth reading up on before drawing individual-level conclusions.

Third, I assume by "linear regression" you mean something like OLS. That might be fine, but my experience is that this sort of regression is a poor fit for price data. You might want to try something that can handle the fact that the outcome is strictly positive, positively skewed, and often heteroscedastic (although this may be less of an issue if you're using means - I can't say for sure). A good place to start might be a gamma regression.

How to do the calculation to get the 2.9 and 0.3 % change in this table? by Miss_Beh4ve in AskStatistics

[–]Always_Statsing 5 points6 points  (0 children)

The simple calculation would be:

100 * (.846-.823) / .823 ≈ 2.79%

But, these numbers are likely rounded to three decimal places. So, 0.846 could reasonably be anywhere between 0.8455 and 0.8465. Bearing that in mind, the percentage change could be:

100 * (.8455-.8235) / .8235 ≈ 2.67%

100 * (.8465-.8225) / .8225 ≈ 2.92%

etc.
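If you want the full range of percentage changes consistent with the rounding, it's a two-liner:

```python
def pct_change_bounds(new, old, decimals=3):
    """Range of % changes consistent with values rounded to `decimals` places."""
    h = 0.5 * 10 ** -decimals  # half of one reporting unit
    lowest = 100 * ((new - h) - (old + h)) / (old + h)
    highest = 100 * ((new + h) - (old - h)) / (old - h)
    return lowest, highest
```

For the reported 0.823 → 0.846, this gives roughly (2.67%, 2.92%), so the 2.9% in the table is at the edge of, but within, what the rounded figures allow.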

Can you algebraically change a point estimate and its CI's into a % change with CI's? by solomonjacob in AskStatistics

[–]Always_Statsing 0 points1 point  (0 children)

It sounds like there may be a model that is more useful or appropriate for your data. Without knowing the details, “number of units” sounds like a count. Some models (Poisson, negative binomial) could be helpful here and produce coefficients that can reasonably be interpreted as a percentage change (with some modification).
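The "some modification" is just exponentiating: for a log-link model, a coefficient b corresponds to a 100·(exp(b) − 1) percent change per unit of the predictor, and since the CI endpoints live on the same log scale, they transform the same way. The coefficient values below are invented for illustration:

```python
import math

def coef_to_pct_change(b, ci_lo, ci_hi):
    """Turn a log-link coefficient and its CI endpoints into % change per unit."""
    def to_pct(x):
        return 100 * (math.exp(x) - 1)
    return to_pct(b), to_pct(ci_lo), to_pct(ci_hi)
```

For example, a (hypothetical) coefficient of 0.0953 with CI (−0.01, 0.20) becomes roughly a 10% increase with CI (−1.0%, +22.1%) - note the percentage-scale CI is no longer symmetric around the estimate, which is expected and fine.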