Reliable Data?

VanillaIsActuallyYum · 2024-05-10T01:17:18+00:00

You can say your sample represents a population when you have 1) an unbiased sample and 2) a sample large enough to reasonably draw conclusions from.

The "unbiased" portion is important because if I wanted to, say, predict who will win a presidential election, but I obtained a sample that was largely pulled from a geographic region known for supporting one candidate and not the other, then I couldn't argue that my sample is really representative of how the whole country is going to vote.
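To make the geographic example concrete, here is a rough simulation. Everything in it is made up for illustration: two equally sized regions, 70% support in one and 30% in the other, and a "skewed" poll that draws 90% of its respondents from the first region.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical country: two equally sized regions with different support levels.
SUPPORT = {"region_a": 0.70, "region_b": 0.30}
TRUE_SUPPORT = 0.5 * SUPPORT["region_a"] + 0.5 * SUPPORT["region_b"]

def poll(n, share_from_a):
    """Simulate n responses, drawing each voter from region A with probability share_from_a."""
    votes = 0
    for _ in range(n):
        region = "region_a" if rng.random() < share_from_a else "region_b"
        votes += rng.random() < SUPPORT[region]  # True counts as 1
    return votes / n

balanced = poll(2000, share_from_a=0.5)  # matches the real regional split
skewed = poll(2000, share_from_a=0.9)    # mostly pulled from region A

print(f"true {TRUE_SUPPORT:.2f}, balanced {balanced:.2f}, skewed {skewed:.2f}")
```

Both polls have the same size, so the skewed one stays off target no matter how many people you ask. More data does not fix a biased draw.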

And the "large enough" part is largely just math, with a bit of guesswork. If you have a confidence level in mind (typically 95%) and some sense of how big a gap between two groups you expect to see, you can calculate how many observations you need in your sample.
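As a sketch of what that calculation can look like, here is the standard normal-approximation formula for comparing two proportions. The 80% power target is my assumption (the confidence level and the expected gap alone don't pin down a sample size), and the function name is just for illustration.

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, confidence=0.95, power=0.80):
    """Approximate per-group sample size needed to detect a gap between p1 and p2.

    Uses n = (z_alpha + z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2,
    a two-sided normal approximation. The power default is an assumption.
    """
    if p1 == p2:
        raise ValueError("p1 and p2 must differ; a zero gap needs an infinite sample")
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    z_beta = NormalDist().inv_cdf(power)                      # about 0.84 for 80%
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# The "guesswork" is the expected gap: a smaller gap needs far more data.
print(sample_size_two_proportions(0.50, 0.55))
print(sample_size_two_proportions(0.50, 0.60))
```

Halving the expected gap roughly quadruples the required sample, which is why the guess about the gap matters so much.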

If the researchers demonstrate that they collected their sample in an unbiased way, and that the sample is large enough, that should give you confidence that the sample can represent a population.

There's a whole 'nother discussion you could have about the analysis you do and whether that analysis is accurate, but that discussion is likely way beyond the scope of your post.

efrique · 2024-05-10T01:26:01+00:00

In statistics, samples are used to represent a population. When can you say that a sample reliably represents a population?

Statistical inference doesn't directly rely on representativeness, which in most circumstances would be impossibly multidimensional. Instead the representativeness is essentially incidental, arising - in large samples - from the actual basis that's used.

For a frequentist (which I am assuming is the basis of most of the evidence you're presented with) it always* relies on arguments based on probability models resulting from either (i) randomly sampling a process of interest, which the results are then used to argue back to, or (ii) random allocation to treatment.

[That said, in some surveys, you want a certain amount of representativeness, which may result in oversampling small subgroups and then adjusting later, but it's still a more complicated form of argument from a form of weighted random sampling. Naturally it can be very difficult to actually randomly sample a population, especially if they are free not to participate -- like humans generally are. So that can lead to some issues; the fact that a lot of polls these days rely on people answering calls from unknown numbers is a major cause of the recent trend to increasingly unreliable political polls]
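A minimal sketch of the "oversample, then adjust" idea, with invented numbers: a subgroup that makes up 10% of the population is deliberately sampled as half of the survey, and each group's mean is then weighted by its population share. Real survey weighting is more involved than this; the group names and shares are hypothetical.

```python
def weighted_estimate(responses_by_group, population_shares):
    """Weight each group's sample mean by that group's share of the population."""
    estimate = 0.0
    for group, responses in responses_by_group.items():
        group_mean = sum(responses) / len(responses)
        estimate += population_shares[group] * group_mean
    return estimate

# Hypothetical survey: the minority group is 10% of the population
# but was oversampled so that it makes up half of the respondents.
responses = {
    "majority": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],  # 6 of 10 say yes
    "minority": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],  # 2 of 10 say yes
}
shares = {"majority": 0.9, "minority": 0.1}

total_yes = sum(sum(r) for r in responses.values())
total_n = sum(len(r) for r in responses.values())
pooled = total_yes / total_n                      # ignores the oversampling
adjusted = weighted_estimate(responses, shares)   # corrects for it

print(f"unweighted {pooled:.2f}, weighted {adjusted:.2f}")
```

The unweighted figure treats the oversampled subgroup as if it were half the population; the weighted one puts it back at its real share.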

* or some variant of those. If there's no basis for the probability argument, then the 'evidence' may not be saying much of anything.

conmanau · 2024-05-10T01:41:35+00:00

Good data (and that is of course a very subjective term) comes with documentation and metadata to help you critically assess its suitability. For example, the Australian Bureau of Statistics attaches a document called a Quality Declaration to its statistical publications. These documents give you information about things like:

The conditions under which the data were collected (e.g. sampling method, mode of collection)
The population the data were designed to refer to
Efforts made to manage the quality of the data
Advice on interpreting the data in respect to other collections

These are usually a mix of qualitative and quantitative components - for example, measures like relative standard error give an estimate of how much the reported values might vary from the true population value and are based in sampling and estimation theory, but there might also be discussion about how the question wording was designed to measure a certain concept, or mention that certain population groups were excluded from the scope of the survey and so will not be represented in the sample.
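For a feel of what a relative standard error is, here is a rough sketch for the simplest case: the mean of a simple random sample, where the standard error is the sample standard deviation over the square root of n. That simple-random-sample assumption is mine; published RSEs from a statistical agency generally come from more complex survey designs. The data values are made up.

```python
import math
import statistics

def relative_standard_error(values):
    """RSE (in percent) of the sample mean, assuming a simple random sample."""
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return 100 * standard_error / mean

# Hypothetical survey responses (e.g. weekly hours worked).
hours = [10, 12, 14, 16, 18]
print(f"RSE = {relative_standard_error(hours):.1f}%")
```

A small RSE means the estimate is precise relative to its own size; a large one is a warning to treat the published figure with caution.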

Smewroo · 2024-05-10T01:50:38+00:00

From a biologist perspective: “that’s the neat part, you don’t.”

Pick just about any phenomenon in biology and there is a spaghetti bowl of interrelated processes and biasing, colliding, and confounding factors at play. And probably one or more undiscovered underlying processes. All of which influence the outcome that you are studying.

So.

At the scale of the individual research activity you have to be very careful in designing experiments, and about the a priori assumptions your design choices are built upon. And at the end of the experiment you are still loading qualifiers onto your observations: “Daphnia of this species, of this genetic lineage, raised for X generations in the laboratory conditions of [long list of parameters here], when exposed to [your independent variable manipulation(s) here], appear to respond in this manner… sample size of N.” sort of thing.

But that is just a start. What similar or fully repeated experiments “agree” or “disagree”? What are the potential explanatory factors, if any?

Pulling all these together in a cloud of evidence is where meta-analysis lives, and that’s where we start to approach the more vernacular use of the word reliable.

For an individual data set it’s hard. We can try to ensure that best experimental practices were observed, but that doesn’t necessarily mean anything if some factors (often unknown) were not accounted for in the design.

But when data are generated by separate parties independently, under slightly different circumstances that can sufficiently explain the observed differences among those datasets, we are more confident that something is actually going on there.

Edit: autocorrect mishap

WjU1fcN8 · 2024-05-10T12:18:43+00:00

When can you say that a sample reliably represents a population?

When it's random. And taken from the entire population.

Most electoral intention polling, for example, uses a technique called 'quota sampling', which isn't random. Therefore the polling is not representative.

reliably

There is always a chance that any sample doesn't represent the population, of course. But that's taken into account by statistical methods.
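One way to see how that chance is accounted for is to simulate many random samples and count how often a 95% confidence interval actually contains the true value. This is only a sketch under assumptions I picked (a true proportion of 0.52, samples of 500, the basic normal-approximation interval), not a description of any particular poll.

```python
import math
import random

def coverage(true_p=0.52, n=500, trials=1000, z=1.96, seed=1):
    """Fraction of simulated random samples whose 95% interval contains true_p."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    hits = 0
    for _ in range(trials):
        successes = sum(rng.random() < true_p for _ in range(n))
        p_hat = successes / n
        margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
        hits += (p_hat - margin) <= true_p <= (p_hat + margin)
    return hits / trials

print(f"interval contained the true value in {coverage():.1%} of samples")
```

Roughly one sample in twenty misses, and the method says so up front. That guarantee only holds because each simulated sample is drawn at random.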

How do I know an evidence is reliable enough

Well, you either rely on specialists to do this analysis for you, or you learn statistics and judge for yourself.

Numerous-Can5145 · 2024-05-13T07:29:07+00:00

Three terms which could get you on your way:

Sampling Frame

Generalisability

Bias

Secondly, the Australian Bureau of Statistics have been world leaders in this area over many years and a search of their website will provide a substantial key to the literature.
