During the fall of 2016, I took an online course on statistical inference on Coursera (‘Improving your statistical inferences’, taught by Daniël Lakens of Eindhoven University of Technology), which I highly recommend to anyone who is confused about terms like p-values, Type I and Type II error rates, effect sizes, etc. I learned a lot about my own misconceptions of these terms and I thought my findings might be useful to someone else. In this text, however, I will focus on only one of these terms: the almighty p-value.
So, what is a p-value? The formal definition of a p-value is the probability of obtaining the observed or more extreme data, assuming the null hypothesis is true (Lakens, 2016). A small p-value therefore means that the obtained data is surprising, given there is no true effect. Another essential term that is often wrongly equated with the p-value is the alpha level: the highest proportion of false positives that one is willing to accept, set before conducting the experiment. In practice, the obtained p-value is compared to the alpha level, and if it is smaller, the result is called statistically significant. The most widely used alpha level is probably 5%. This means that if there is no real effect in the population and the study were repeated many times, then, in the long run, at most 5% of the repetitions would yield a statistically significant result. It is, however, a rather arbitrarily chosen limit. Although it can be said to represent some sort of compromise between falsely rejecting the null hypothesis (Type I error) and not detecting an effect when there is one to be found (Type II error), it is important to determine this balance based on the characteristics of a specific study. Neyman & Pearson (1933, in: Lakens, 2016) support this in stating that: “determining how the balance must be struck should be left to the investigator”.
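To make the comparison between a p-value and the alpha level concrete, here is a minimal sketch. It assumes the simplest possible setting, a two-sided one-sample z-test with a known standard deviation of 1; the function name `z_test_p_value` and the data are made up for illustration and do not come from the course.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_test_p_value(sample, null_mean=0.0, sigma=1.0):
    """Two-sided p-value of a one-sample z-test with known sigma.

    It is the probability, under the null hypothesis, of a sample mean
    at least as far from null_mean as the one observed.
    """
    z = (mean(sample) - null_mean) / (sigma / sqrt(len(sample)))
    return 2 * (1 - NormalDist().cdf(abs(z)))


ALPHA = 0.05  # chosen before looking at the data

# Made-up measurements, for illustration only.
sample = [0.9, -0.2, 1.4, 0.3, 1.1, 0.7, -0.5, 1.6, 0.8, 0.4]

p = z_test_p_value(sample)
print(f"p = {p:.4f}; statistically significant at alpha = {ALPHA}: {p < ALPHA}")
```

Real studies rarely know the population standard deviation, so a t-test would normally be used instead; the z-test is chosen here only because it needs nothing beyond the Python standard library.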
A Type I error occurs when an effect is found although none is present in the population (the Type I error rate is conceptually the same as the alpha level). A Type II error is the opposite: an effect is not found although it exists in the population. Controlling the error rate in any experiment or study is very important, because if it isn’t controlled (for example, when multiple comparisons are carried out or when optional stopping is used), the error rate can inflate dramatically and one cannot be sure whether an effect has truly been found or not. In the past, the scientific literature put a lot of emphasis on reducing Type I errors and very little on reducing Type II errors, but lately more emphasis has been put on the latter. This is because replications are nowadays very common in the scientific literature, if not an absolute must. Whenever an experiment is conducted, it will probably eventually be repeated by other people; if the original result was a false positive, they will not be able to replicate it, and it will become clear that the original effect was found because of random fluctuations. However, if an effect is not found in a specific study when there actually is one, the idea might not be tested further because others will think it is not important, when, in reality, it might be! Essentially, one could miss out on a result that could have considerable real-life consequences.
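The inflation caused by optional stopping can be illustrated with a small simulation. This is only a sketch under stated assumptions: there is no true effect, the data are standard normal, a z-test with known standard deviation is used, and the researcher "peeks" after every 10 observations up to 100. The function names and the schedule of looks are hypothetical choices for the illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_from_sum(total, n, sigma=1.0):
    """Two-sided z-test p-value computed from a running sum of n observations."""
    z = (total / n) / (sigma / sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))


def type_i_error_rate(looks, n_sims=2000, alpha=0.05, seed=7):
    """Share of null experiments declared significant at any of the looks.

    `looks` lists the sample sizes at which the data are tested; data
    collection stops at the first significant result.
    """
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for look in looks:
            while n < look:  # collect more data up to the next look
                total += rng.gauss(0.0, 1.0)  # null is true: mean 0
                n += 1
            if p_from_sum(total, n) < alpha:
                false_positives += 1
                break  # stop as soon as "an effect" is found
    return false_positives / n_sims


fixed = type_i_error_rate(looks=[100])                       # one planned test
peeking = type_i_error_rate(looks=list(range(10, 101, 10)))  # test after every 10
print(f"single test at n = 100: {fixed:.3f}")
print(f"testing at every look:  {peeking:.3f}")
```

With a single planned test the false positive rate stays close to the alpha level, whereas testing repeatedly and stopping at the first significant result pushes it well above alpha.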
At this point, I would like to highlight an important phrase from a definition I gave earlier, the definition of the alpha level. This phrase is “in the long run”. It is important to include it in the interpretation of p-values because when a p-value less than the set alpha is found in an experiment, it is not possible to know whether the result reflects a true effect or is simply a false positive. What is known, however, is that if there were no true effect and the exact same experiment were repeated a very large number of times, the proportion of repetitions yielding a false positive would, in the long run, be at most the alpha level that was set at the beginning of the study. In other words, if the alpha level is set at 5%, then in the long run one would wrongly declare an effect in at most 5% of the studies in which there is no true effect. However, one cannot say whether the current experiment is one of those false positives, because it is simply not possible to know. So, in a single study, a true effect was either found or not; there is no in-between. Only when results are examined in the long run does the probability start to make sense: it refers to a long-term frequency (Lakens, 2016).
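The long-run reading can be checked by simulation. The sketch below assumes that the null hypothesis is true (standard normal data with mean 0), that each experiment has 20 observations, and that a z-test with known standard deviation is used; the names and numbers are illustrative assumptions, not part of the course material.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def z_test_p_value(sample, null_mean=0.0, sigma=1.0):
    """Two-sided p-value of a one-sample z-test with known sigma."""
    z = (mean(sample) - null_mean) / (sigma / sqrt(len(sample)))
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_experiments=5000, n=20, alpha=0.05, seed=1):
    """Repeat the same null experiment many times and count significant results."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_experiments):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # no true effect
        if z_test_p_value(sample) < alpha:
            significant += 1
    return significant / n_experiments


rate = false_positive_rate()
print(f"share of false positives over many repetitions: {rate:.3f}")
```

Any single simulated experiment is either a false positive or not; only the share across many repetitions settles near the alpha level.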
So, what does the phrase “in the long run” mean in practice? It means that one can never bet one’s money on the existence of a true effect on the basis of a single study. Science has, after all, been built in such a way that findings can be tested multiple times. Only after an effect has been shown repeatedly, over a long period of time, can it be claimed that a true effect exists. Fisher, for example, stated that “A phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (Fisher, 1937, p. 13 in Lakens, 2016). He goes on to say that “No isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon.” (Fisher, 1937, p. 13 in Lakens, 2016).
What does it then mean if a p-value less than the set alpha level is found? Well, let’s first see what happens when the p-value is greater than the alpha level. First of all, this does not mean that there is no effect in the population. There could be an effect, but it may be small, and large samples are needed to detect small effects. So, one possible reason why an effect wasn’t found is that the sample size was too small. A larger follow-up study might then yield a significant finding and reveal that there is, in fact, a small effect. Another possible explanation is the random variation of p-values: even when a study would usually detect a true effect, a non-significant result will occasionally be observed. An identical second study might then yield a significant finding, suggesting that there is a true effect.
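How strongly the chance of detecting a small effect depends on sample size can be shown with a short calculation. The sketch assumes a two-sided one-sample z-test with known standard deviation and a standardized effect size of 0.2, which is a value picked purely for illustration; `z_test_power` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect_size, n, alpha=0.05):
    """Probability that a two-sided z-test rejects the null hypothesis.

    effect_size is the true mean divided by the (known) standard deviation.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    shift = effect_size * sqrt(n)               # expected value of the z statistic
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


for n in (20, 50, 200, 800):
    print(f"n = {n:4d}: power = {z_test_power(0.2, n):.2f}")
```

Under these assumptions a study with 20 observations detects the effect only rarely (power of roughly 0.15), while a study with 200 observations detects it most of the time (power of roughly 0.8). A non-significant result from the small study therefore says very little about whether the effect exists.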
Consequently, when an effect isn’t detected, this prevents one from being able to answer the original question of whether there is a true effect or not. While one can answer this question when the p-value is smaller than the alpha level, one cannot really answer this question when the p-value is bigger than the alpha level – even though it might be convenient to say that there is no effect, this is incorrect: in reality, it is simply not known.
What if an effect is found? One can “act” as if the data is not noise and infer that there probably is an effect in the population, because the probability of a false positive is relatively low (but it is not zero!).
However, what cannot be said is that if the p-value is less than 0.05, this means the effect is 95% likely to be true. If one is using p-values, it can only be said that the data is surprising, given the null hypothesis is true (which is the definition of a p-value). To find the probability of a theory being true, one has to use Bayesian statistics.
Furthermore, when it comes to testing data, an important distinction should be made between two different probabilities. The probability of a hypothesis being true, given certain data, is not the same as the probability of finding the data or more extreme data, given a hypothesis. Besides the fact that these two probabilities can differ widely, the first one can only be obtained using Bayesian statistics, as mentioned above. In practice, this means that it is wrong to say that, after finding a p-value that is less than 0.05, a certain theory is true or that a certain hypothesis has been proven. What one can say, however, is that, for example, the data are consistent with the hypothesis. One should therefore be careful when using the words “theory” and “data” in defining different probabilities.
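A small worked example shows how far apart the two probabilities can be. It applies Bayes' rule to the event "the result is significant" and needs two quantities that a p-value alone does not provide: the prior probability that the effect exists and the power of the study. Both are assumptions picked for illustration, and the function name is hypothetical.

```python
def prob_effect_given_significant(prior, power, alpha=0.05):
    """P(effect exists | significant result), by Bayes' rule.

    prior: assumed probability that the effect exists before the study
    power: P(significant | effect exists)
    alpha: P(significant | no effect)
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)


# Same alpha and power, different prior plausibility of the hypothesis.
for prior in (0.5, 0.1):
    posterior = prob_effect_given_significant(prior, power=0.8)
    print(f"prior = {prior:.1f} -> P(effect | significant) = {posterior:.2f}")
```

With a prior of 0.5 the probability that a significant result reflects a true effect is about 0.94, but with a prior of 0.1 it drops to 0.64, even though the alpha level is 0.05 in both cases. This is why a p-value below 0.05 cannot be read as "95% likely to be true".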
Lastly, there is a more efficient way to address the question of whether there is a true effect in the population: p-curve analysis. It is a way to see whether it is more probable that there is a certain effect in the population or that there isn’t. The basic idea is that simulation studies can show what the distribution of p-values looks like when there is an effect (Figure 2) and what it looks like when there is no effect (Figure 1). When multiple studies have been carried out to find a specific effect (usually by different researchers), one can combine them into a distribution of p-values (only p-values that are 0.05 or smaller are used) and compare it to these pre-determined distributions. It can then be seen which pre-determined distribution the observed one resembles more closely.
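The kind of simulation behind such reference distributions can be sketched as follows. This is not the procedure used on p-curve.com; it is a simplified illustration that assumes z-tests, draws the z statistic of each study directly, keeps only significant p-values, and sorts them into five equal bins between 0 and 0.05. The expected z statistic of 2.8 for the "true effect" case is an arbitrary choice that corresponds to a power of roughly 0.8.

```python
import random
from statistics import NormalDist


def significant_p_values(expected_z, n_studies, alpha=0.05, seed=3):
    """Simulate studies and keep only the significant two-sided p-values.

    expected_z is effect size times sqrt(sample size); 0 means no true effect.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    kept = []
    for _ in range(n_studies):
        z = rng.gauss(expected_z, 1.0)  # z statistic of one simulated study
        p = 2 * (1 - std_normal.cdf(abs(z)))
        if p < alpha:
            kept.append(p)
    return kept


def p_curve(p_values, bins=5, alpha=0.05):
    """Share of significant p-values in each of `bins` equal bins on [0, alpha)."""
    counts = [0] * bins
    for p in p_values:
        counts[min(int(p / alpha * bins), bins - 1)] += 1
    return [count / len(p_values) for count in counts]


no_effect = p_curve(significant_p_values(0.0, 20000))
true_effect = p_curve(significant_p_values(2.8, 20000))
print("no effect:  ", [f"{share:.2f}" for share in no_effect])
print("true effect:", [f"{share:.2f}" for share in true_effect])
```

Without a true effect the significant p-values are spread roughly evenly over the five bins, while with a true effect most of them fall into the lowest bin (below 0.01). A set of published p-values can be compared against these two shapes.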
Figure 1. The distribution of p-values when there is no true effect in the population (Lakens, 2016).
Figure 2. The distribution of p-values when there is a true effect in the population (Lakens, 2016).
Figure 3. Comparing the distributions (Lakens, 2016).
In Figure 3, the black line shows the distribution of p-values that were reported on the topic of elderly priming. As can be seen, the pattern does not follow the distribution expected when there is a true effect in the population. However, it is important to note that this conclusion does not necessarily mean that there is no true effect in the population. It only means that there is little support for the existence of this effect in the literature. In this way, p-curve analysis is a very useful method to see which effects are worth investigating in more detail.
I would like to end this text with a small disclaimer about p-values. Although they can be very useful if used correctly, and although their use has more or less become the norm, p-values can also be very unreliable. I’ve tried to very lightly hint at this throughout the text. Finally, I deem it appropriate to end this text with the following quote: “Statistical tests should be used with discretion and understanding, and not as instruments which themselves give the final verdict.” (Neyman & Pearson, 1928 in Lakens, 2016).
Lakens, D. (2016). Improving your statistical inferences. Coursera. Retrieved from: https://www.coursera.org/learn/statistical-inferences
 Because otherwise this text would never end.
 Optional stopping is a practice where you first collect data and perform a statistical test. If you don’t have a significant finding, you collect more data and perform a statistical test and again if you don’t find an effect, then you collect more data and you continue this until you have a significant finding. Although it is appealing, this practice highly inflates your Type I error rate (if uncontrolled).
 This concept of “in the long run” is also known as the frequentist approach, which includes the statistical procedures used most often today (null-hypothesis significance testing, confidence intervals, etc.) as opposed to, for example, the likelihood approach or the Bayesian approach.
 You can perform the analysis on the website: p-curve.com
 The YouTube video “Dance of the p-values” demonstrates this point very lucidly: https://www.youtube.com/watch?v=5OL1RqHrZQ8&t=1s
*Note: This article is published in English because it is also intended to reach other language groups (including individuals who do not speak Slovenian).