Non-Parametric Tests

Wilcoxon Rank Sum Test

The Wilcoxon Rank Sum test is a non-parametric hypothesis test where the null hypothesis is that there is no difference in the populations (i.e., they have equal medians).

This test assumes that the two samples are independent and that both $n_1$ and $n_2$ are at least $10$. It should not be used if either assumption is not met.

The test involves first ranking the data in both samples, taken together. Each data element is given a rank, $1$ through $n_1 + n_2$, from lowest to highest -- with ties resolved by ranking tied elements arbitrarily at first, and then replacing rankings of tied elements with the average rank of those tied elements.

So for example, ranking the data below $$\begin{array}{l|cccccccc} \textrm{Sample A} & 12 & 15 & 17 & 18 & 18 & 20 & 23 & 24\\\hline \textrm{Sample B} & 14 & 15 & 18 & 20 & 20 & 20 & 24 & 25\\ \end{array}$$ results in the following ranks $$\begin{array}{ccc} \textrm{value} & \textrm{initial rank} & \textrm{final rank}\\\hline 12 & 1 & 1\\ 14 & 2 & 2\\ 15 & 3 & 3.5\\ 15 & 4 & 3.5\\ 17 & 5 & 5\\ 18 & 6 & 7\\ 18 & 7 & 7\\ 18 & 8 & 7\\ 20 & 9 & 10.5\\ 20 & 10 & 10.5\\ 20 & 11 & 10.5\\ 20 & 12 & 10.5\\ 23 & 13 & 13\\ 24 & 14 & 14.5\\ 24 & 15 & 14.5\\ 25 & 16 & 16\\ \end{array}$$
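The ranking procedure described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `average_ranks` and the variable names are assumptions made for this example, not part of any library.

```python
def average_ranks(values):
    """Return 1-based ranks for values; tied values share their average rank."""
    # Indices of the values, sorted from lowest to highest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Sorted positions start..end hold initial ranks start+1..end+1;
        # every tied element receives the average of those initial ranks.
        shared = (start + 1 + end + 1) / 2
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


sample_a = [12, 15, 17, 18, 18, 20, 23, 24]
sample_b = [14, 15, 18, 20, 20, 20, 24, 25]

# Rank both samples together, then split the ranks back out by sample.
ranks = average_ranks(sample_a + sample_b)
ranks_a = ranks[:len(sample_a)]
ranks_b = ranks[len(sample_a):]
print(ranks_a)
print(ranks_b)
```

As a sanity check, the ranks of all $n_1 + n_2 = 16$ elements should always sum to $16 \cdot 17 / 2 = 136$, whether or not ties are present.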

Suppose $n_1$ denotes the size of the smaller sample and $n_2$ denotes the size of the other sample. Now define the following: $$\mu_R = \frac{n_1(n_1+n_2+1)}{2} \quad \textrm{ and } \quad \sigma_R = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}$$ If $R$ is the sum of the ranks associated with elements from the sample of size $n_1$, then $$z = \frac{R - \mu_R}{\sigma_R}$$ is a test statistic that, when the null hypothesis is true, approximately follows a standard normal distribution.
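To make the arithmetic concrete, the sketch below evaluates $\mu_R$, $\sigma_R$, and $z$ for the ranked example above, where the final ranks belonging to Sample A sum to $R = 61.5$ and $n_1 = n_2 = 8$. Note that these sample sizes fall short of the stated requirement of at least $10$, so this illustrates the calculation only, not a valid application of the test. The function name `rank_sum_z` is an assumption for this example, and the two-tailed $p$-value is included on the assumption that the alternative hypothesis is two-sided.

```python
import math


def rank_sum_z(r, n1, n2):
    """Return the z statistic for rank sum r taken from the sample of size n1."""
    mu_r = n1 * (n1 + n2 + 1) / 2
    sigma_r = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (r - mu_r) / sigma_r


# Rank sum of Sample A in the worked example (illustration only: n < 10).
z = rank_sum_z(61.5, 8, 8)

# Two-tailed p-value from the standard normal distribution,
# using P(|Z| > |z|) = erfc(|z| / sqrt(2)).
p_two_tailed = math.erfc(abs(z) / math.sqrt(2))

print(round(z, 4), round(p_two_tailed, 4))
```

Here $\mu_R = 68$, so the negative $z$ simply reflects that Sample A's rank sum is somewhat below what would be expected if the two populations did not differ.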

Kruskal-Wallis Test (i.e., H Test)

The Kruskal-Wallis Test (named after William Kruskal and W. Allen Wallis) can be used to test the claim (a null hypothesis) that there is no difference in the populations (i.e., they have equal medians) when there are 3 or more independent samples, provided they meet the additional assumption that the sample sizes are all at least 5.

To perform the test, we first rank all of the samples together, and then add the ranks associated with each sample.

Letting $R_i$ be the sum of the ranks for sample $i$, of size $n_i$, $N$ be the sum of all sample sizes $n_i$, and $k$ be the number of samples, the following test statistic

$$H = \frac{12}{N(N+1)}\left[\sum_{i=1}^k \frac{R^2_i}{n_i} \right] - 3(N+1)$$ approximately follows, when the null hypothesis is true, a $\chi^2$ distribution with $k-1$ degrees of freedom.

This is a right-tailed test.
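As an illustration of the steps above, the following sketch computes $H$ for three small samples. The data are invented for this example, and the function name `kruskal_wallis_h` is an assumption rather than a library routine. Ranks are computed by counting, which also yields the average rank for tied values.

```python
def kruskal_wallis_h(samples):
    """Return the Kruskal-Wallis H statistic for a list of samples."""
    pooled = [x for sample in samples for x in sample]
    n_total = len(pooled)

    def rank(x):
        # Values tied with x occupy ranks less+1 .. less+equal,
        # whose average is less + (equal + 1) / 2.
        less = sum(1 for y in pooled if y < x)
        equal = sum(1 for y in pooled if y == x)
        return less + (equal + 1) / 2

    total = 0.0
    for sample in samples:
        r_i = sum(rank(x) for x in sample)  # rank sum R_i for this sample
        total += r_i ** 2 / len(sample)
    return 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)


# Hypothetical data: three independent samples, each of size 5.
samples = [
    [27, 31, 35, 38, 42],
    [29, 33, 40, 44, 47],
    [36, 41, 45, 49, 52],
]
h = kruskal_wallis_h(samples)

# Right-tailed critical value of chi-square with k - 1 = 2 degrees of
# freedom at the 0.05 significance level.
critical_value = 5.991
print(round(h, 2), h > critical_value)
```

For these made-up samples the rank sums are $26$, $38$, and $56$, giving $H = 4.56$. Since this is below the critical value $5.991$, the null hypothesis would not be rejected at the $0.05$ level.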

To see why this test statistic takes the form it does, consider the following:

Recall that a $\chi^2$-distribution is the distribution of a sum of the squares of independent standard normal random variables.

Under a presumption that the sample sizes, $n_i$, are not too small (remember, we required $n_i \ge 5$ for each sample), the average ranks $\overline{R_i} = R_i / n_i$ of the samples will jointly be approximately normally distributed.

(Note that we have relaxed our typical requirement that $n \ge 30$ down to $n \ge 5$ because the associated population of ranks is uniform.)

To make $H$ a sum of squares of standard normal random variables, we use $z$-scores for each observed average rank in a natural way:

$$H \approx \sum_{i=1}^k \left( \frac{\textrm{observed average rank} - \textrm{expected average rank}}{\displaystyle{\left(\frac{\textrm{standard deviation of ranks}}{\sqrt{n_i}}\right) }} \right)^2$$

Given the null hypothesis that there is no difference between the populations with regard to their medians, we can expect the ranks $1$ to $N$ seen in the samples to be distributed uniformly. Recalling that the expected value and variance of such a uniform distribution $X$ are given by $$E(X) = \frac{N+1}{2} \quad \quad \textrm{ and } \quad \quad Var(X) = (SD(X))^2 = \frac{N^2-1}{12}$$ we make the following substitutions:

$$H \approx \sum_{i=1}^k \frac{n_i \left[\overline{R_i} - \frac{N+1}{2} \right]^2}{\frac{N^2 - 1}{12}}$$
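The expected value and variance of the ranks used in this substitution can be checked directly for a specific $N$. The sketch below does so with exact fractions. The choice $N = 15$ and the function name `rank_moments` are assumptions for illustration only.

```python
from fractions import Fraction


def rank_moments(n_total):
    """Return the exact mean and variance of the ranks 1..n_total."""
    ranks = range(1, n_total + 1)
    mean = Fraction(sum(ranks), n_total)
    # Population variance: average squared deviation from the mean.
    variance = sum((r - mean) ** 2 for r in ranks) / n_total
    return mean, variance


n_total = 15  # hypothetical total sample size
mean, variance = rank_moments(n_total)

# Compare against the closed forms (N + 1) / 2 and (N^2 - 1) / 12.
print(mean == Fraction(n_total + 1, 2))
print(variance == Fraction(n_total ** 2 - 1, 12))
```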

Multiplying by a factor of $(N-1)/N$ to correct bias (much like Bessel's correction), we have:

$$H = \frac{N-1}{N}\sum_{i=1}^k \frac{n_i \left[\overline{R_i} - \frac{N+1}{2} \right]^2}{\frac{N^2 - 1}{12}}$$

From here, we just use algebra to rewrite $H$ in a form more convenient for calculation:

$$\begin{array}{rcl} H &=& \displaystyle{\frac{N-1}{N}\sum_{i=1}^k \frac{n_i \left[\frac{R_i}{n_i} - \frac{N+1}{2} \right]^2}{\frac{N^2 - 1}{12}}}\\\\ &=& \displaystyle{\frac{12}{N^2-1} \cdot \frac{N-1}{N} \cdot \sum_{i=1}^k \left[ n_i \left( \frac{R_i^2}{n_i^2} - \frac{R_i}{n_i}(N+1) + \frac{(N+1)^2}{4}\right) \right]}\\\\ &=& \displaystyle{\frac{12}{N(N+1)} \cdot \sum_{i=1}^k \left(\frac{R_i^2}{n_i} - R_i (N+1) + \frac{(N+1)^2}{4} n_i \right)}\\\\ &=& \displaystyle{\frac{12}{N(N+1)} \cdot \left[ \sum_{i=1}^k \frac{R_i^2}{n_i} - \sum_{i=1}^k R_i(N+1) + \sum_{i=1}^k\frac{(N+1)^2}{4}n_i \right]}\\\\ &=& \displaystyle{\frac{12}{N(N+1)} \cdot \left[ \sum_{i=1}^k \frac{R_i^2}{n_i} - (N+1)\sum_{i=1}^k R_i + \frac{(N+1)^2}{4}\sum_{i=1}^kn_i \right]}\\\\ &=& \displaystyle{\frac{12}{N(N+1)} \cdot \left[ \sum_{i=1}^k \frac{R_i^2}{n_i} - (N+1) \cdot \frac{N(N+1)}{2} + \frac{(N+1)^2}{4} \cdot N \right]}\\\\ &=& \displaystyle{\frac{12}{N(N+1)} \cdot \left[ \sum_{i=1}^k \frac{R_i^2}{n_i} \right] - 6(N+1) + 3(N+1)}\\\\ &=& \displaystyle{\frac{12}{N(N+1)} \cdot \left[ \sum_{i=1}^k \frac{R_i^2}{n_i} \right] - 3(N+1)}\\\\ \end{array}$$
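The algebra above can be spot-checked numerically by evaluating both the sum-of-squares form and the final computational form of $H$ on the same inputs. In this sketch the rank sums and sample sizes are hypothetical, chosen so that the rank sums total $N(N+1)/2$ as the derivation requires. The function names are likewise assumptions for illustration.

```python
def h_sum_of_squares(rank_sums, sizes):
    """H as a bias-corrected sum of squared z-scores of the average ranks."""
    n_total = sum(sizes)
    expected = (n_total + 1) / 2          # expected average rank
    variance = (n_total ** 2 - 1) / 12    # variance of the ranks 1..N
    total = sum(
        n * (r / n - expected) ** 2 / variance
        for r, n in zip(rank_sums, sizes)
    )
    return (n_total - 1) / n_total * total


def h_computational(rank_sums, sizes):
    """H in the form more convenient for calculation."""
    n_total = sum(sizes)
    total = sum(r ** 2 / n for r, n in zip(rank_sums, sizes))
    return 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)


# Hypothetical rank sums for samples of sizes 5, 6, and 7 (N = 18).
# They total 171 = 18 * 19 / 2, as the ranks 1..N must.
rank_sums = [40, 55, 76]
sizes = [5, 6, 7]

h_a = h_sum_of_squares(rank_sums, sizes)
h_b = h_computational(rank_sums, sizes)
print(abs(h_a - h_b) < 1e-9)
```

The two forms agree only when the rank sums add up to $N(N+1)/2$, since that identity is what allows the middle term of the derivation to simplify.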