Chebyshev's Theorem

Amazingly, even when the mean and the standard deviation are not appropriate measures of center and spread, there is an algebraic relationship between them that can be exploited in any distribution whatsoever.

This relationship is described by Chebyshev's Theorem:

For every population of $n$ values and any real number $k \gt 1$, the proportion of values within $k$ standard deviations of the mean is at least

$$1 - \frac{1}{k^2}$$

As an example, taking $k = 2$: in any population, at least $1 - \frac{1}{2^2} = 75\%$ of the values lie in the interval $(\mu - 2\sigma, \mu + 2\sigma)$.
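As a quick empirical check, here is a minimal sketch in Python (using a small, made-up population) that counts the proportion of values within $2$ standard deviations of the mean and compares it to the guaranteed $75\%$:

```python
import statistics

# a small, arbitrary (made-up) population of values
population = [2, 3, 3, 5, 7, 8, 12, 13, 21, 40]

mu = statistics.fmean(population)      # population mean
sigma = statistics.pstdev(population)  # population standard deviation

k = 2
p_within = sum(abs(x - mu) < k * sigma for x in population) / len(population)

print(f"proportion within {k} standard deviations: {p_within:.2f}")  # 0.90
print(f"Chebyshev lower bound: {1 - 1/k**2:.2f}")                    # 0.75
```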

To see why this is true, suppose a population of $n$ values consists of $n_1$ values of $x_1$, $n_2$ values of $x_2$, etc. (i.e., $n_i$ values of each different $x_i$ in the population).

Now suppose $\sigma$ is the standard deviation of this population, $\mu$ is its mean, and $k \gt 1$ is a real number.

Consider the variance of the population:

$$\begin{array}{rcl} \sigma^2 &=& \displaystyle{\frac{\sum (x_i - \mu)^2 \cdot n_i}{n} \quad \textrm{where the sum ranges over all distinct } x_i}\\ &\ge& \displaystyle{\frac{\sum (x_i - \mu)^2 \cdot n_i}{n} \quad \textrm{where the sum ranges over only those } x_i \textrm{ where } |x_i - \mu| \ge k\sigma}\\ &\ge& \displaystyle{\frac{\sum k^2 \sigma^2 \cdot n_i}{n} \quad \textrm{since, if } |x_i - \mu| \ge k\sigma \textrm{ then } (x_i - \mu)^2 \ge k^2 \sigma^2}\\ &=& \displaystyle{k^2 \sigma^2 \cdot \frac{\sum n_i}{n}}\\ &=& \displaystyle{k^2 \sigma^2 \cdot p_{outside} \quad \textrm{ where } p_{outside} \textrm{ is the proportion of the population outside } k \textrm{ standard deviations of } \mu} \end{array}$$
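Each comparison in the chain above can be checked numerically. The sketch below (reusing the made-up population from before) computes the full variance, the sum restricted to values at least $k\sigma$ from the mean, and $k^2 \sigma^2 \cdot p_{outside}$:

```python
import statistics

population = [2, 3, 3, 5, 7, 8, 12, 13, 21, 40]  # made-up data, as before
n = len(population)
mu = statistics.fmean(population)
var = statistics.pvariance(population)  # population variance, sigma^2
sigma = var ** 0.5

k = 2
outside = [x for x in population if abs(x - mu) >= k * sigma]
restricted_sum = sum((x - mu) ** 2 for x in outside) / n  # sum over outside values only
p_outside = len(outside) / n

# variance >= restricted sum >= k^2 * sigma^2 * p_outside
print(var, ">=", restricted_sum, ">=", k**2 * var * p_outside)
```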

Dividing both sides by $\sigma^2$, we have

$$1 \ge k^2 \cdot p_{outside}$$

Equivalently, if $p_{within}$ is the proportion of the population within $k$ standard deviations of the mean,

$$1 \ge k^2 \cdot (1-p_{within})$$

Solving for $p_{within}$, we have

$$p_{within} \ge 1 - \frac{1}{k^2}$$

Of course, if $k \le 1$, then $1 - \frac{1}{k^2} \le 0$ and the result is trivial, as every proportion is at least zero. As such, the result is typically stated in the context of $k \gt 1$.
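The sketch below extends the earlier check to several values of $k \gt 1$ at once, asserting that $p_{within} \ge 1 - \frac{1}{k^2}$ each time (again with the same made-up population):

```python
import statistics

population = [2, 3, 3, 5, 7, 8, 12, 13, 21, 40]  # made-up data, as before
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

for k in [1.5, 2, 3, 4]:
    p_within = sum(abs(x - mu) < k * sigma for x in population) / len(population)
    bound = 1 - 1 / k**2
    print(f"k = {k}: p_within = {p_within:.2f} >= {bound:.2f}")
    assert p_within >= bound
```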


The result can be easily extended to say something similar about the proportion of values within $k$ standard deviations of $\overline{x}$ in a sample. We begin with a sample that has $n_i$ occurrences of each different $x_i$ in the sample. Then, apart from replacing every occurrence of $\mu$ with $\overline{x}$ and every occurrence of $\sigma$ with $s$, the only essential difference lies in the second step below:

$$\begin{array}{rcl} s^2 &=& \displaystyle{\frac{\sum (x_i - \overline{x})^2 \cdot n_i}{n-1} \quad \textrm{where the sum ranges over all distinct } x_i}\\ &\ge& \displaystyle{\frac{\sum (x_i - \overline{x})^2 \cdot n_i}{n} \quad \textrm{since } n-1 \lt n}\\ &\ge& \displaystyle{\frac{\sum (x_i - \overline{x})^2 \cdot n_i}{n} \quad \textrm{where the sum ranges over only those } x_i \textrm{ where } |x_i - \overline{x}| \ge ks}\\ &\ge& \displaystyle{\frac{\sum k^2 s^2 \cdot n_i}{n} \quad \textrm{since, if } |x_i - \overline{x}| \ge ks \textrm{ then } (x_i - \overline{x})^2 \ge k^2 s^2}\\ &=& \displaystyle{k^2 s^2 \cdot \frac{\sum n_i}{n}}\\ &=& \displaystyle{k^2 s^2 \cdot p_{outside} \quad \textrm{ where } p_{outside} \textrm{ is the proportion of the sample outside } k \textrm{ standard deviations of } \overline{x}} \end{array}$$

Dividing both sides by $s^2$, we have

$$1 \ge k^2 \cdot p_{outside}$$

The rest of the argument is identical to that used for the entire population.
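To check the sample version, the only change needed in the earlier sketch is to use the sample standard deviation $s$ (which divides by $n - 1$) in place of $\sigma$. Treating the same made-up values as a sample:

```python
import statistics

sample = [2, 3, 3, 5, 7, 8, 12, 13, 21, 40]  # made-up data, now treated as a sample
xbar = statistics.fmean(sample)
s = statistics.stdev(sample)  # sample standard deviation, divides by n - 1

for k in [1.5, 2, 3]:
    p_within = sum(abs(x - xbar) < k * s for x in sample) / len(sample)
    assert p_within >= 1 - 1 / k**2
    print(f"k = {k}: p_within = {p_within:.2f} >= {1 - 1/k**2:.2f}")
```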