
How to choose the level of significance: Illustrative examples

Updated: Mar 21, 2023



In statistical testing, it is conventional to adopt the 0.05 level of significance. While other conventional levels such as 0.01 and 0.10 are also common, 0.05 is the most widely and almost routinely used. The question is whether the (almost) universal use of the 0.05 level is scientific or even sensible. The answer is no.

In this and subsequent articles, I would like to explain how the level of significance should be chosen, in careful consideration of the various factors of hypothesis testing. In this article, the question is explained using three illustrative examples.

Choosing the level of significance means controlling the probability of Type I error, which is the probability of rejecting a true null hypothesis. This choice is highly consequential to the outcome of testing and subsequent decision-making, because it sets the critical values that determine whether the null hypothesis is rejected. For example, for the z-test (two-tailed), the critical values at the 0.05 (or 5%) level of significance are ±1.96, and the critical values at the 10% level are ±1.64. That is, a lower level of significance sets a higher bar to reject the null hypothesis. An important point to note is that these levels of significance are simply benchmarks, and that they are not based on any scientific justification or principle whatsoever. Many statistical textbooks and lectures say virtually nothing about how a rational researcher should make such a choice, given the circumstances and context of the research. These conventional values are simply arbitrary, and their mindless application can lead to misleading decisions with serious consequences.
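The critical values quoted above can be reproduced with a few lines of Python. This is only a sketch using the standard library; the helper name `two_tailed_critical_value` is my own and is not part of any package.

```python
from statistics import NormalDist


def two_tailed_critical_value(alpha: float) -> float:
    """Return z such that P(|Z| > z) = alpha for a standard normal Z."""
    # Half of alpha sits in each tail, so we need the (1 - alpha/2) quantile.
    return NormalDist().inv_cdf(1 - alpha / 2)


for alpha in (0.10, 0.05, 0.01):
    z = two_tailed_critical_value(alpha)
    print(f"alpha = {alpha:.2f}: reject H0 if |z| > {z:.2f}")
```

The smaller the value of alpha, the larger the critical value, which is the "higher bar" described above.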


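Consider a decision in a court of law as a hypothesis testing problem, where H0: the defendant is not guilty; H1: the defendant is guilty. There are two types of errors in this decision-making as the following table shows:

                        Defendant is not guilty    Defendant is guilty
Verdict: not guilty     Correct decision           Type II error
Verdict: guilty         Type I error               Correct decision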


Type I error occurs when a “guilty” verdict is delivered to an innocent person, and Type II error is made if a “not guilty” verdict is delivered to a defendant who is in fact guilty. Their probabilities can be defined as:

α ≡ Prob(Type I error) and β ≡ Prob(Type II error).

Hence, if one sets α to 0.05, the probability of Type I error is controlled at 5%. That is, out of 1000 trials in which the defendant is in fact innocent, incorrect decisions of Type I are allowed to occur 50 times on average.

A trade-off between α and β is well-known. A lower (higher) value of α means a higher (lower) bar to reject H0, so it will increase (reduce) the probability of Type II error (β). That is, one cannot reduce both error probabilities at the same time, given the other factors (such as sample size) held constant.
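The trade-off can be illustrated numerically. The sketch below assumes a two-tailed z-test whose test statistic is normal with unit variance and a mean of `shift` when H1 is true; the value `shift = 1.5` is a hypothetical choice made only for illustration, and the function name is my own.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def type_ii_error(alpha: float, shift: float) -> float:
    """Probability of not rejecting H0 when the statistic is N(shift, 1).

    `shift` is the standardized true effect (an assumed quantity here).
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # H0 is not rejected when the statistic falls inside [-z_crit, z_crit].
    return STD_NORMAL.cdf(z_crit - shift) - STD_NORMAL.cdf(-z_crit - shift)


shift = 1.5  # hypothetical standardized effect, held fixed
for alpha in (0.001, 0.01, 0.05, 0.10, 0.50, 0.80):
    beta = type_ii_error(alpha, shift)
    print(f"alpha = {alpha:.3f} -> beta = {beta:.3f}")
```

With the effect held fixed, each increase in alpha shrinks the non-rejection region and so lowers beta, and vice versa.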

Suppose the trial is held at a criminal court where a guilty verdict leads to the death penalty. The consequence of Type I error is dire, with an innocent defendant being sent to death row. Type II error also has a consequence, but it is not as dire as that of Type I error. If the court adopts the 0.05 level of significance, this means that it allows 5 innocent defendants to be sent to death row out of 100 such trials. In order to avoid such mistakes occurring too often, the criminal court adopts a much higher bar for a guilty verdict, which is “beyond reasonable doubt”. Such a heavy burden of proof may translate into a value of α as small as 0.001. That is, the court tolerates one incorrect decision out of 1000 trials. The 0.05 level of significance may simply be too lenient to meet the burden of “beyond reasonable doubt”, and its consequences will be too serious, sending too many innocent defendants to death row.


Now, as the second example, consider the following null and alternative hypotheses: H0: Climate is not changing; H1: Climate is changing.

Type I error in this case is judging that the climate is changing when it actually is not, while Type II error is judging that the climate is not changing when in fact it is. Type II error could have far more serious consequences than Type I error. Type II error will encourage little or inadequate action against climate change that is actually happening, while Type I error may prompt actions that can save our planet, even if the climate is not changing.


Consider two researchers facing the same hypothesis testing problem. Suppose that if α is set at 0.05 (the choice of Researcher 1), a Type II error occurs with probability 0.80, whereas if α is set at 0.80 (the choice of Researcher 2), a Type II error occurs with a much lower probability of 0.05.

Which researcher is more rational in this situation? Given that Type II error is much more costly than Type I error, it is more reasonable to control the Type II error with a lower probability. Hence, in this case, the 0.80 level of significance should be chosen, which will lower the bar for rejecting H0 substantially. The implication is that we should be alarmed by any faint sign of climate change, since rejection of H0 will introduce many beneficial actions to save the planet, irrespective of whether the climate is actually changing.
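One way to make this comparison concrete is to weigh each error probability by its cost. The sketch below is only an illustration and not a formal procedure: the loss values (1 for Type I, 10 for Type II) and the equal prior probabilities of H0 and H1 are hypothetical assumptions, and `expected_loss` is a name of my own.

```python
def expected_loss(alpha, beta, loss_type_i, loss_type_ii, prior_h0=0.5):
    """Expected loss = P(H0) * alpha * L1 + P(H1) * beta * L2."""
    prior_h1 = 1 - prior_h0
    return prior_h0 * alpha * loss_type_i + prior_h1 * beta * loss_type_ii


# Hypothetical losses: a Type II error is assumed to be 10 times as costly.
loss_type_i, loss_type_ii = 1.0, 10.0

researcher_1 = expected_loss(0.05, 0.80, loss_type_i, loss_type_ii)
researcher_2 = expected_loss(0.80, 0.05, loss_type_i, loss_type_ii)

print(f"Researcher 1 (alpha = 0.05): expected loss = {researcher_1:.3f}")
print(f"Researcher 2 (alpha = 0.80): expected loss = {researcher_2:.3f}")
```

Under these assumed losses, Researcher 2 has the smaller expected loss. If the losses were reversed, so that Type I error were the costly one, the ranking would flip, which is the situation of the criminal court in the first example.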


As the third and final example, consider the following null and alternative hypotheses: H0: Patient is not pregnant; H1: Patient is pregnant.

Type I error here is judging that the patient is pregnant when in fact she is not; and Type II error is judging that the patient is not pregnant when in fact she is. Type II error in this case is more serious because it can endanger the lives and welfare of both mother and baby. Suppose there are two medical tests (Tests A and B) for pregnancy available for the doctor with the following error probabilities:

In making a choice between the two alternative tests, the doctor should be highly cautious of making Type II error, due to its serious consequences. As such, it is more reasonable to use Test B, because it is associated with a lower chance of Type II error, which is more consequential than Type I error. Hence, a more sensible choice for the level of significance is 0.20 in this case, which is associated with Test B.


The main message of this article is that a popular level of significance (0.05, 0.10, 0.01) may not necessarily be the optimal choice that will bring a desirable outcome. Mindless or routine use of a conventional level is arbitrary and can lead to a decision with serious consequences.

Researchers should choose this value carefully in their decision-making, in full consideration of the context of the research. The key factors include the sample size, the losses from Type I and II errors, and prior belief. In the second article to follow, a more mathematical approach will be taken to the optimal choice of the level of significance, under a range of factors of hypothesis testing.





