Non-Parametric Tests for Beginners (Part 1: Rank and Sign Tests)

With examples and R code

Non-parametric testing is an important branch of inferential statistics. Yet it is neither widely used nor fully understood by many data scientists and analysts. It is a natural alternative to parametric tests such as the conventional t-test, with a range of advantages and strong potential for modern applications such as A/B testing.

Non-parametric tests are constructed from the ranks or signs of data points, or by using re-sampling methods such as the bootstrap. In this post, those based on ranks and signs are discussed with examples and R code. The bootstrap methods will be discussed in the second part of the series. I would like to thank Venkat Raman, whose recent LinkedIn post motivated this post.

1. Parametric vs. Non-parametric tests

Inferential statistics, or hypothesis testing, is conducted with the following key elements:

The null and alternative hypotheses (H0 and H1)
Test statistic
Sampling distribution of the test statistic under H0
Decision rule (p-value or critical value, at a given level of significance)

Parametric tests

They include well-known tests such as the t-test, F-test, and chi-squared test. A typical feature of a parametric test is that

it requires estimation of the unknown parameters such as the mean and variance; and
its sampling distribution follows a normal distribution or another distribution derived from it (e.g., an F-distribution or a chi-squared distribution).

For the sampling distribution to be exactly normal, the population should follow a normal distribution. If the population is non-normal, the sampling distribution is approximated by a normal distribution when the sample size is sufficiently large. This is called an asymptotic approximation, and its validity rests on the central limit theorem under a range of parametric assumptions.

Non-parametric tests

Non-parametric tests calculate the test statistics and their sampling distributions in different ways from their parametric counterparts:

They are obtained using fully data-dependent methods such as the ranks and signs of data points, without having to estimate the population parameters.
A non-parametric test has an exact sampling distribution, in the sense that it can be obtained without resorting to any approximation. The distribution is either fully known analytically or can be obtained computationally using a Monte Carlo simulation.

The advantages of a non-parametric test include the following:

It does not require strong parametric assumptions such as the normality of the population;
It does not require an asymptotic approximation to its sampling distribution;
Since the sampling distribution is exact, the level of significance (the probability of Type I error) is always correct in repeated sampling (no size distortion);
Its p-values and critical values are also exact; and
Its power (the probability of rejecting a false null hypothesis) is often higher than that of its parametric alternatives, especially when the sample size is small.

Its main disadvantage is that computing the exact sampling distribution (and hence the exact p-value and critical values) can be time-consuming when the sample size is large or massive. However, this is a minor issue now that computing power keeps increasing. In addition, many non-parametric tests come with analytic formulae or efficient algorithms that accurately approximate the exact p-values or critical values when the computational burden is heavy.

2. Simple non-parametric tests

1. Sign test for the median

Consider a variable X generated purely randomly from its population. Using its sample realization (X1, …, Xn), a researcher wishes to test for

H0: median = 0; H1: median ≠ 0.

Under H0, each value of X is positive (or negative) with probability 0.5. Equivalently, the expected number of positive values of X under H0 is n/2.

Let the test statistic T(X,n) be the total number of cases where X > 0. Under H0, the sampling distribution of T(X,n) is a binomial distribution with n trials, each with probability of success (p) equal to 0.5, denoted B(n, p = 0.5). The distribution B(n = 20, p = 0.5) is plotted below:

The above is the exact sampling distribution of the test statistic T(X,n) under H0, when n = 20. If the observed value of T(X,n) is close to 10, then the null hypothesis cannot be rejected. The exact p-value of the test can be calculated using the binom.test function in R.
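As a minimal sketch of how this distribution and its p-values can be produced, the R code below plots B(n = 20, p = 0.5) with dbinom and computes a two-sided exact p-value by doubling the smaller tail, which is valid here because the distribution is symmetric. The helper name sign_pvalue and the two observed counts (10 and 16) are assumptions made for illustration only.

```r
# Exact sampling distribution of T(X, n) under H0: B(n = 20, p = 0.5)
n <- 20
k <- 0:n
pmf <- dbinom(k, size = n, prob = 0.5)

barplot(pmf, names.arg = k, xlab = "Number of positive values",
        ylab = "Probability", main = "B(n = 20, p = 0.5)")

# Two-sided exact p-value. Doubling the smaller tail is valid because
# B(n, 0.5) is symmetric. sign_pvalue is a hypothetical helper name.
sign_pvalue <- function(t_obs, n) {
  lower <- pbinom(t_obs, size = n, prob = 0.5)          # P(T <= t_obs)
  upper <- 1 - pbinom(t_obs - 1, size = n, prob = 0.5)  # P(T >= t_obs)
  min(1, 2 * min(lower, upper))
}

sign_pvalue(10, n)  # observed count at the centre of the distribution
sign_pvalue(16, n)  # observed count far in the upper tail
```

A count at the centre gives a p-value of 1, while a count far from n/2 gives a small p-value; the values should agree with binom.test for p = 0.5.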

As an example, consider the following X and Y values with n = 20.

Table 1 (Positive = 1 if X > 0; Positive = 0 otherwise)

From Table 1 above, T(X) = 12 and T(Y) = 18, with the median of X being 0.36 and that of Y being 1.67. It is clear that X is highly compatible with H0 while Y is not. The exact p-value of the test for X is 0.5034 and that for Y is 0.0004, which can be obtained using the R code below:

x = c(-0.63, 0.18,-0.84,1.60,0.33, -0.82,0.49,0.74,0.58,-0.31,
      1.51,0.39,-0.62,-2.21,1.12,-0.04,-0.02,0.94,0.82,0.59)

y=c(1.14,0.54,0.01,-0.02,1.26,-0.29,0.43,0.82,1.90,1.51,
    1.83,2.01,1.37,2.54,3.55, 3.99,5.28,5.41,3.69,2.85)

# Test statistics: number of positive observations
Tx=sum(x > 0); Ty=sum(y > 0)

# Sign test: exact two-sided p-values from B(20, 0.5)
binom.test(x=Tx,n=20,p=0.5); binom.test(x=Ty,n=20,p=0.5)

When n is large, the sampling distribution still exactly follows B(n, p = 0.5). The distribution, however, approaches a normal distribution with the mean 0.5n and variance 0.25n. As a result, a normal distribution can be used as an approximation to the exact distribution B(n, p=0.5), when n is large.
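The quality of this approximation can be checked with a short sketch. The code below compares the exact binomial p-value with a normal approximation that uses the mean 0.5n and variance 0.25n. The continuity correction of 0.5, the helper names sign_p_exact and sign_p_normal, and the counts used are assumptions chosen for illustration.

```r
# Exact two-sided p-value from B(n, 0.5)
sign_p_exact <- function(t_obs, n) {
  binom.test(t_obs, n, p = 0.5)$p.value
}

# Normal approximation with mean 0.5 * n and variance 0.25 * n.
# The 0.5 subtracted in the numerator is a continuity correction (an assumption
# of this sketch; the approximation can also be used without it).
sign_p_normal <- function(t_obs, n) {
  z <- (abs(t_obs - 0.5 * n) - 0.5) / sqrt(0.25 * n)
  2 * pnorm(-max(z, 0))
}

# Small sample: 12 positives out of 20
c(exact = sign_p_exact(12, 20), normal = sign_p_normal(12, 20))

# Large sample: 1050 positives out of 2000 (hypothetical counts)
c(exact = sign_p_exact(1050, 2000), normal = sign_p_normal(1050, 2000))
```

The two p-values should already be close at n = 20 and closer still at n = 2000.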

2. Rank test for randomness

A simple test for randomness of a set of time series observations can be conducted using the ranks of the data. The ranks are obtained by ordering the sample observations (X1, …, Xn) in ascending order. That is, the value 1 is assigned to the smallest value of X; the value 2 is assigned to the next smallest value of X; and so on, until the value n is assigned to the largest value.

The null hypothesis is that a time series is purely random against the alternative that it is not. Bartels (1982) proposed a test statistic of the following form:

$$RV = \frac{\sum_{i=1}^{n-1} (R_i - R_{i+1})^2}{n(n^2 - 1)/12} \tag{1}$$

where Ri is the rank of the ith value (Xi) in the sequence of n observations. Under the null hypothesis, (R1, …, Rn) is equally likely to be any permutation of (1, …, n). This is because if the time series observations are purely random, their ranks should also be purely random.

Based on this, the exact distribution of RV can be simulated with the following R code:

nit=50000   # number of Monte Carlo iterations
n=20        # Sample size

# Calculating the RV statistic for random permutations of the ranks (H0)
RV=numeric(nit)
for (i in 1:nit) {
  ranking <- sample(1:n, n, replace = FALSE)
  RV[i] = sum(diff(ranking)^2)/(n*(n^2-1)/12)
}

# Histogram
hist(RV)

# Critical Values replicating the values in Table 2 of Bartels (1982)
quantile(RV,probs = c(0.01,0.05,0.10))

1%       5%      10% 
1.013534 1.285714 1.439098

The above R code simulates the exact sampling distribution of RV under H0 when n = 20, which is plotted below:

Exact Sampling Distribution of RV

The exact p-value or critical values are obtained from the above distribution in the usual way. Note that the critical values (given with the R code above) are almost identical to those tabulated by Bartels (1982).

The null hypothesis of pure randomness is rejected if the RV statistic calculated from equation (1) is less than the critical value at the chosen level of significance. This is because the ranks of a purely random series are also purely random, so that successive ranks differ widely and the RV statistic takes a large value (see the example below).
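The decision rule can be sketched in R as a left-tail Monte Carlo p-value: the proportion of simulated RV values that are no larger than the observed one. This is an illustration rather than the method used by any particular package; the function names rv_stat and rv_null, the seed, the number of iterations, and the trending series are all assumptions.

```r
# RV statistic for a series z (assumes no ties, as in equation (1))
rv_stat <- function(z) {
  r <- rank(z)
  n <- length(z)
  sum(diff(r)^2) / (n * (n^2 - 1) / 12)
}

# Simulated null distribution: RV for random permutations of 1..n
rv_null <- function(n, nit = 20000) {
  replicate(nit, rv_stat(sample.int(n)))
}

set.seed(1)                 # assumed seed, for reproducibility only
null_rv <- rv_null(n = 20)

# Hypothetical series with a clear upward trend
trending <- c(1, 3, 2, 4, 6, 5, 7, 9, 8, 10,
              12, 11, 13, 15, 14, 16, 18, 17, 19, 20)

rv_obs <- rv_stat(trending)
p_left <- mean(null_rv <= rv_obs)   # left-tail Monte Carlo p-value

rv_obs
p_left
```

Because neighbouring ranks of the trending series are close together, its RV statistic falls far in the left tail of the simulated distribution and the p-value is close to zero.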

When the sample size is large or massive, the above simulation can still be conducted to generate the exact sampling distribution without a heavy computational burden. Bartels (1982) also provides approximation formulae for these exact critical values.

As an example, consider X and Y in Table 1, plotted below:

The variable X appears to be random around 0, while Y shows an upward trend which is a feature of a time series that is not purely random. The following R code plots X and Y, and calculates the RV statistics with their p-values:

# plots
plot.ts(x,col="red",lwd=2,main="X"); abline(h=0)
plot.ts(y,col="red",lwd=2,main="Y"); abline(h=0)

# RV statistics and p-values
library(trend)
bartels.test(x); bartels.test(y)

The RV statistic for X is 2.21 with a p-value of 0.6844, while that for Y is 0.32 with a p-value of 0.0000 (to four decimal places). This means that the null hypothesis that X is purely random cannot be rejected at a conventional level of significance, but that for Y is rejected.

The calculation is also illustrated in the table below:

As expected of a purely random series, the ranks of X are highly variable from one observation to the next, which gives a large value in the numerator of RV. In contrast, Y is not purely random and its ranks change little between successive observations. As a result, the RV statistic for X is much larger than that for Y.
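The calculation can also be reproduced step by step. The sketch below builds, for each series, a small table of values, ranks, and squared changes in rank, and then forms RV as in equation (1). The helper names rv_table and rv_from_table are assumptions for illustration; x and y are the data of Table 1.

```r
x <- c(-0.63, 0.18, -0.84, 1.60, 0.33, -0.82, 0.49, 0.74, 0.58, -0.31,
       1.51, 0.39, -0.62, -2.21, 1.12, -0.04, -0.02, 0.94, 0.82, 0.59)
y <- c(1.14, 0.54, 0.01, -0.02, 1.26, -0.29, 0.43, 0.82, 1.90, 1.51,
       1.83, 2.01, 1.37, 2.54, 3.55, 3.99, 5.28, 5.41, 3.69, 2.85)

# One row per observation: value, rank, and squared change in rank
rv_table <- function(z) {
  r <- rank(z)
  data.frame(value = z, rank = r, sq_diff = c(NA, diff(r)^2))
}

# RV = sum of squared rank changes divided by n(n^2 - 1)/12
rv_from_table <- function(tab) {
  n <- nrow(tab)
  sum(tab$sq_diff, na.rm = TRUE) / (n * (n^2 - 1) / 12)
}

tab_x <- rv_table(x)
tab_y <- rv_table(y)

head(tab_x)                          # first rows of the table for X
sum(tab_x$sq_diff, na.rm = TRUE)     # numerator of RV for X
sum(tab_y$sq_diff, na.rm = TRUE)     # numerator of RV for Y
round(c(X = rv_from_table(tab_x), Y = rv_from_table(tab_y)), 2)
```

The numerator for X is several times larger than that for Y, while the denominator is the same for both because n = 20 in each case.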

3. Wilcoxon tests

Wilcoxon tests (McDonald, 2014) are non-parametric alternatives to the two-sample t-test (such as Welch’s) and the paired t-test. The null hypothesis is that the median values of two populations are equal, against the alternative that they are not. There are two versions of the test:

Wilcoxon rank-sum test (also called Mann–Whitney–Wilcoxon test), when X and Y are independent; and
Wilcoxon signed-rank test, when X and Y are paired.

Let (X1, …, Xn) and (Y1, …, Ym) be the random samples from the respective populations. The test statistic for independent samples (Wilcoxon rank-sum test) is given by

$$U = \sum_{i=1}^{n} \sum_{j=1}^{m} S(X_i, Y_j)$$

where S(X,Y) = 1 if X > Y; S(X,Y) = 0.5 if X = Y; and S(X,Y) = 0 if X < Y.

The statistic for the case of dependent samples (Wilcoxon signed-rank test) is calculated as

$$T = \sum_{i=1}^{n} \operatorname{sgn}(Z_i) R_i$$

where Zi = Xi - Yi; sgn(Zi) = 1 if Zi > 0 and sgn(Zi) = -1 otherwise; and Ri is the rank of |Zi| (the absolute value of Zi). Note that different versions of the T statistic are in use, but they are all equivalent.

For both U and T statistics, the exact sampling distributions under H0 can be obtained by a Monte Carlo simulation or can be approximated.
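Both statistics can be computed directly from their definitions. The sketch below does so for the X and Y of Table 1. The variable names are assumptions, and the last step uses the fact that, when no difference is exactly zero, the sum of the ranks of the positive differences equals (T + n(n+1)/2)/2, which is one of the equivalent versions of the signed-rank statistic.

```r
x <- c(-0.63, 0.18, -0.84, 1.60, 0.33, -0.82, 0.49, 0.74, 0.58, -0.31,
       1.51, 0.39, -0.62, -2.21, 1.12, -0.04, -0.02, 0.94, 0.82, 0.59)
y <- c(1.14, 0.54, 0.01, -0.02, 1.26, -0.29, 0.43, 0.82, 1.90, 1.51,
       1.83, 2.01, 1.37, 2.54, 3.55, 3.99, 5.28, 5.41, 3.69, 2.85)

# Rank-sum statistic U: compare every X with every Y.
# S = 1 if X > Y, 0.5 if X = Y, 0 if X < Y
S <- outer(x, y, ">") + 0.5 * outer(x, y, "==")
U <- sum(S)

# Signed-rank statistic T: signed ranks of the paired differences
z <- x - y
r <- rank(abs(z))               # R_i, the rank of |Z_i|
sgn <- ifelse(z > 0, 1, -1)     # sgn(Z_i)
T_signed <- sum(sgn * r)

# Equivalent version: sum of the ranks of the positive differences
n <- length(z)
V_pos <- sum(r[z > 0])

U
T_signed
c(V_pos = V_pos, from_T = (T_signed + n * (n + 1) / 2) / 2)
```

The last line shows the two versions of the signed-rank statistic side by side; they carry the same information, so a test based on either leads to the same conclusion.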

For X and Y given in Table 1, the R code for the Wilcoxon tests is given by


# Wilcoxon rank-sum test (U)
wilcox.test(x,y,mu=0,paired = FALSE,exact=TRUE)

# Wilcoxon signed rank test (T)
wilcox.test(x,y,mu=0,paired = TRUE,exact=TRUE)

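where H0: μ = 0 and μ = median(X) - median(Y). The U statistic (reported by R as W) is 67.5 with a p-value of 0.0004, and the T statistic (reported by R as V, the sum of the ranks of the positive differences) is 11 with a p-value of 0.0001. Note that X and Y share some values, and when ties are present wilcox.test cannot compute the exact p-value for the rank-sum test; it issues a warning and uses a normal approximation instead. At the 5% level of significance, both tests reject the null hypothesis that the medians are equal.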

This post has reviewed three simple non-parametric tests based on ranks and signs. The main difference between non-parametric and parametric tests is how the test statistics and their sampling distributions under H0 are calculated. That is,

The test statistic and the sampling distribution of a non-parametric test are obtained using fully data-dependent methods such as ranks and signs, without having to estimate the unknown population parameters.
They are obtained without resorting to any parametric assumptions or asymptotic approximations based on the central limit theorem.
The sampling distribution of a non-parametric test is exact. As a result, the test is conducted without any size distortion, and its p-value and critical values are exact.
A non-parametric test often shows better statistical properties than its parametric counterparts (e.g., a higher statistical power and no size distortion), especially when the sample size is small or when the assumptions of parametric tests are violated.

Researchers are strongly encouraged to consider these non-parametric tests in their applications (such as A/B tests) as an alternative to parametric tests.

References

Bartels, R. (1982). The rank version of von Neumann’s ratio test for randomness. Journal of the American Statistical Association, 77(377), 40–46.

McDonald, J. H. (2014). Handbook of biological statistics. New York. http://www.biostathandbook.com/wilcoxonsignedr

Jae H. Kim