The photo shows chocolate
© Ewa Julia Zyablova, unsplash

Directly Inferring Statistics – A wholesome way to understand statistics

Doing statistics works, for example, like this: we run a survey and ask 100 people whether they like chocolate. Then we transfer the result to the whole population.

But what do we know about the people we didn’t ask? Think back to your days as a student: did you understand why statistics works at all? How can it give you information about people you have never seen?

With directly inferring statistics, we can now do statistics in a mathematically correct and intuitive way from day one of stochastics class, because all we need at the beginning is to count blue and red balls in boxes.

Directly inferring statistics is so easy to understand because it replicates what we humans do every day anyway: We assign probabilities to possible populations.

Moreover, directly inferring statistics is easy to teach because the principle remains exactly the same from beginning to end: we divide the number of samples of a certain kind by the number of all samples.

When I explain this method, I always start with 5 boxes, each containing 4 balls, which are either blue or red. All possible proportions of red balls are represented, so the boxes B0 to B4 contain 0 to 4 red balls. (We could just as well track the proportion of blue balls; choosing the red balls is purely arbitrary.)

I then draw 3 balls from one of the boxes with replacement and ask: From which box did I draw? If, for example, 1 blue and 2 red balls have been drawn, the audience says: "It was probably drawn from B3. It could also have been drawn from B1, but that would be improbable." If the sample consists of 3 blue balls, the answer is always: "B0 is most likely and B3 is least likely."

We humans think like this. We assign probabilities to populations.
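
The audience's intuition can be checked by plain counting. The following Python sketch is only an illustration and rests on two assumptions that are not spelled out in the text: the box label gives the number of red balls (B0 has none, B4 has four), and every box is equally likely to be picked before the draw. The function names `matching_samples` and `box_probabilities` are made up for this example. For each box, it lists all 4 × 4 × 4 = 64 ordered draws with replacement, counts those that match the observed sample, and divides by the number of matching samples across all boxes.

```python
from fractions import Fraction
from itertools import product

BALLS_PER_BOX = 4  # each box holds 4 balls
DRAWS = 3          # we draw 3 balls with replacement


def matching_samples(red_in_box, red_drawn):
    """Count the ordered draws from one box that show exactly red_drawn red balls."""
    box = ["red"] * red_in_box + ["blue"] * (BALLS_PER_BOX - red_in_box)
    return sum(
        1
        for sample in product(box, repeat=DRAWS)  # all 64 ordered draws
        if sample.count("red") == red_drawn
    )


def box_probabilities(red_drawn):
    """Probability of each box, assuming all boxes were equally likely beforehand."""
    counts = {
        f"B{k}": matching_samples(k, red_drawn)
        for k in range(BALLS_PER_BOX + 1)
    }
    total = sum(counts.values())  # all samples that look like the observed one
    return {name: Fraction(count, total) for name, count in counts.items()}


# Observed sample: 2 red balls and 1 blue ball
for name, prob in box_probabilities(red_drawn=2).items():
    print(name, prob)
```

Under these assumptions, a sample of 2 red and 1 blue matches 9 draws from B1, 24 from B2 and 27 from B3, so B3 gets 27/60 and B1 only 9/60, which agrees with what the audience says. Calling `box_probabilities(red_drawn=0)` gives B0 a probability of 64/100 and B3 a probability of 1/100, while B4 is ruled out entirely.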

Normally we use more elaborate methods, such as maximum likelihood estimation and confidence intervals, to infer from the sample to the population. These methods are well established and lead to reasonable results. But it is still quite hard to understand why they work at all.

This difficulty also shows in the frequently encountered erroneous assumption that the confidence level indicates the probability with which the actual population proportion lies in the confidence interval. In the same way, the p-value in hypothesis testing is often misinterpreted as the probability that the hypothesis is true.
The American Statistical Association has apparently also recognized this problem: in 2016 it felt compelled to publish a statement on the existing misunderstandings.
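
To make the distinction concrete, here is a small Python sketch. It is an illustration under stated assumptions, not a method taken from the statement: it uses the simple normal-approximation (Wald) interval, a survey size of 100 as in the chocolate example, and a hypothetical true proportion of 0.5. The function names `wald_interval` and `coverage` are chosen for this example only. The sketch goes through every possible survey result and adds up the probability of those results whose interval contains the true proportion.

```python
from math import comb, sqrt


def wald_interval(k, n, z=1.96):
    """Normal-approximation interval for a proportion (assumed here for illustration)."""
    p_hat = k / n
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def coverage(p_true, n, z=1.96):
    """Probability, over all possible samples, that the interval contains p_true."""
    total = 0.0
    for k in range(n + 1):
        lower, upper = wald_interval(k, n, z)
        if lower <= p_true <= upper:
            # probability of observing exactly k "yes" answers
            total += comb(n, k) * p_true**k * (1 - p_true) ** (n - k)
    return total


# Property of the procedure: how often the interval covers p_true across samples
print(coverage(p_true=0.5, n=100))

# A single realized interval either contains p_true or it does not
lower, upper = wald_interval(62, 100)
print(lower <= 0.5 <= upper)
```

The first number describes the procedure across all samples that could have been drawn. The second line shows that one concrete interval simply contains the true proportion or does not, so the confidence level cannot be read as a probability about that one interval.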

(American Statistical Association Releases Statement on Statistical Significance and p-Values:
http://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108#.Vt2XIOaE2MN)

On these misunderstandings, see also:
"A confidence interval is not a probability, and therefore it is not technically correct to say the probability is 95% that a given 95% confidence interval will contain the true value of the parameter being estimated." (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2947664/)
and
"A 95% confidence level does not mean that for a given realized interval there is a 95% probability that the population parameter lies within the interval (i.e., a 95% probability that the interval covers the population parameter)."
(https://en.wikipedia.org/wiki/Confidence_interval#Common_misunderstandings)