## Tech Tips: Descriptive Statistics

To find statistics and other information that can be used to describe a data set -- including measures of center (mean, median, mode, midrange), measures of spread (variance and standard deviation, interquartile range, range), Pearson's Skewness Index, and the identification of outliers -- one can use either R or Excel, as described below.

• R:

#### Measures of Center

Presuming the data is numeric and has been stored in a vector named data, one can find the mean with

mean(data)

and the median with
median(data)
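As a quick example, using a small (illustrative) vector of values:

```r
data = c(1,4,2,6,3,3,4,4,2,5,2)

mean(data)     # 3.272727  (i.e., 36/11)
median(data)   # 3  (the middle value when the 11 values are sorted)
```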

There is not a built-in function in R to calculate the midrange, but one can easily write such a function:
midrange = function(v) {
  return((max(v) + min(v))/2)
}


A moment should be taken here to point out the (hopefully obvious) nature of the functions used in the above definition:

max(v)

returns the maximum element of a vector $v$

min(v)

returns the minimum element of a vector $v$
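Putting these together, a quick sanity check of the midrange() function defined above (using an illustrative vector of values):

```r
midrange = function(v) {
  return((max(v) + min(v))/2)
}

data = c(1,4,2,6,3,3,4,4,2,5,2)
midrange(data)   # (6 + 1)/2 = 3.5
```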

There is also no built-in function in R to calculate the mode(s) of a data set, but writing one again proves straightforward (albeit a bit longer). We do, unfortunately, need to name our function something besides mode(), as R already has a function with this name (it has to do with the "storage mode" of an object, as opposed to a measure of center for a data set).

getmode = function(v) {
  uniques = unique(v)
  counts = tabulate(match(v,uniques))
  max.count = max(counts)
  return(uniques[counts == max.count])
}

Nicely, the above function works even when the vector is not numeric.

As two examples:

> data = c(1,4,2,6,3,3,4,4,2,5,2)
> getmode(data)
[1] 4 2

> names = c("bob","fred","fred","susan","chuck","alice","chuck")
> getmode(names)
[1] "fred"  "chuck"


Let's take a moment and look at what the functions used in the above definition actually do.

unique(v)

When $v$ is a vector, this function returns a vector identical to $v$, except with duplicate elements removed.
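For example:

```r
unique(c(1,4,2,6,3,3,4,4,2,5,2))   # returns 1 4 2 6 3 5
```

Note that the first occurrence of each element is kept, so the original order is preserved.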

match(v1,v2)

This function returns a vector giving the position in v2 of the first match for each element of v1.
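As an example:

```r
match(c(20,10,10,30), c(10,20,30))   # returns 2 1 1 3
```

That is, 20 first appears at position 2 of the second vector, 10 at position 1 (twice), and 30 at position 3.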

tabulate(v)

This function counts how many 1's, 2's, 3's, etc., appear in $v$. As an example,

> tabulate(c(2,3,3,5))
[1] 0 1 2 0 1   # <-- i.e., there are 0 ones, 1 two, 2 threes, 0 fours, and 1 five


To now see how the function we defined to calculate the mode(s) works, consider the values and vectors it computes during its execution when it is passed the following vector $v$ as its parameter:
v = c(1,4,2,6,3,3,4,4,2,5,2)

uniques = unique(v)                   # <-- uniques   = c(1, 4, 2, 6, 3, 5)
counts = tabulate(match(v,uniques))   # <-- counts    = c(1, 3, 3, 1, 2, 1)
max.count = max(counts)               # <-- max.count = 3
return(uniques[counts == max.count])  # returns c(4,2)

So, not unlike how one finds modes by hand, R first decides which (unique) elements are in the vector $v$ given to it; counts how many of each of these there are in $v$; determines the maximum number of times an element is seen; and then returns those unique values that occur that maximum number of times.

#### Measures of Spread

Again presuming the numeric data of interest has been stored in a vector named data, one can calculate the (sample) variance, (sample) standard deviation, and interquartile range (IQR) with var(data), sd(data), and IQR(data), as shown in the example below:

> data = c(1,2,2,20,4,5,7,3)

> var(data)
[1] 38

> sd(data)
[1] 6.164414

> IQR(data)
[1] 3.5


Of course, if one needed the population variance or population standard deviation, one could simply multiply the sample variance by (n-1)/n and, for the standard deviation, take the square root of the result. The length() function in R, which returns the number of elements in a vector, and the sqrt() function, which calculates square roots, both prove useful to this end, as shown in the example:

> data = c(1,2,2,20,4,5,7,3)

> n = length(data)  # note, here n is 8
> pop.var = var(data) * (n-1) / n
> pop.var
[1] 33.25

> pop.sd = sqrt(pop.var)
> pop.sd
[1] 5.766281

As for the range of values in a data set, one can of course use the aforementioned min() and max() functions, or the range() function, as shown below:
> data = c(1,2,2,20,4,5,7,3)

> min(data)
[1] 1

> max(data)
[1] 20

> range(data)
[1]  1 20

To mention one last way to get a quick feel for the nature of data stored in a given vector, one can also use the summary() function in R to display all of the following values at once: the minimum, Q1, median, mean, Q3, and maximum values. Below is an example using the built-in rivers dataset:
> rivers
[1]  735  320  325  392  524  450 1459  135  465  600
[11]  330  336  280  315  870  906  202  329  290 1000
[21]  600  505 1450  840 1243  890  350  407  286  280
[31]  525  720  390  250  327  230  265  850  210  630
[41]  260  230  360  730  600  306  390  420  291  710
[51]  340  217  281  352  259  250  470  680  570  350
[61]  300  560  900  625  332 2348 1171 3710 2315 2533
[71]  780  280  410  460  260  255  431  350  760  618
[81]  338  981 1306  500  696  605  250  411 1054  735
[91]  233  435  490  310  460  383  375 1270  545  445
[101] 1885  380  300  380  377  425  276  210  800  420
[111]  350  360  538 1100 1205  314  237  610  360  540
[121] 1038  424  310  300  444  301  268  620  215  652
[131]  900  525  246  360  529  500  720  270  430  671
[141] 1770

> summary(rivers)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
135.0   310.0   425.0   591.2   680.0  3710.0


#### Identifying Outliers

To identify any data that fall more than 1.5 IQR's below $Q_1$ or above $Q_3$, one can use the quantile() function, which gives the cutoff data value under which some percentage of the data lie.

As an example of how this function works, suppose the variable grades gives us scores earned on some test in a statistics class, as shown below:

> grades = c(86,92,100,93,89,95,79,98,68,62,71,75,88,86,93,81,100,86,96,52,15)

We can find $Q_1$ with quantile(grades,0.25), since 25% of the data lies below $Q_1$.

Similarly, quantile(grades,0.75) will give $Q_3$.
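For the grades above, these yield the following (using R's default quantile method, which can differ slightly from hand calculations of Q1 and Q3):

```r
grades = c(86,92,100,93,89,95,79,98,68,62,71,75,88,86,93,81,100,86,96,52,15)

quantile(grades, 0.25)   # Q1 = 75
quantile(grades, 0.75)   # Q3 = 93
```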

Using these, we can quickly identify any outliers with the following:

> lower.bound = quantile(grades,0.25) - 1.5*IQR(grades)
> upper.bound = quantile(grades,0.75) + 1.5*IQR(grades)
> c(lower.bound,upper.bound)
 25%  75% 
  48  120 


Then, using subsetting, we can identify any specific outliers that might be present:

> grades[grades < lower.bound | grades > upper.bound]
[1] 15

Thus, $15$ is an outlier, as it falls below the lower fence of $Q_1 - 1.5 \cdot IQR = 48$.

Of course, another test for outliers involves finding those data values that fall outside of 3 standard deviations from the mean. This is also done in a straight-forward way, as shown in the example below:

> grades = c(86,92,100,93,89,95,79,98,68,62,71,75,88,86,93,81,100,86,96,52,15)

> lower.bound = mean(grades) - 3*sd(grades)
> upper.bound = mean(grades) + 3*sd(grades)
> c(lower.bound,upper.bound)
[1]  21.58463 140.79632

> grades[grades < lower.bound | grades > upper.bound]
[1] 15


#### Pearson's Skewness Index

The following function can be used to calculate Pearson's Skewness Index for a given data set:

pearson.skew = function(v) {
  return(3*(mean(v)-median(v))/sd(v))
}
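As an example, applying this function to the grades data used earlier:

```r
pearson.skew = function(v) {
  return(3*(mean(v)-median(v))/sd(v))
}

grades = c(86,92,100,93,89,95,79,98,68,62,71,75,88,86,93,81,100,86,96,52,15)
pearson.skew(grades)   # approximately -0.726
```

The negative value reflects a mean (about 81.2) pulled below the median (86) by the low score of 15 -- a left-skewed distribution.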


• Excel:

The most important functions in Excel, when it comes to describing data, are:

• AVERAGE()
• MIN()
• MAX()
• QUARTILE.INC() [or QUARTILE.EXC()]
• MEDIAN()
• VAR.S() [or VAR.P()]
• STDEV.S() [or STDEV.P()] and
• MODE.MULT()

The following worksheet illustrates how these functions can be used to calculate measures of center (mean, median, mode, midrange), measures of spread (standard deviation, interquartile range, range), Pearson's Skewness Index, and the "fences" used in determining if a value represents an outlier in accordance with both the means test and IQR test:

Here are the relevant formulas for the cells in column F:

F3:"=AVERAGE(C:C)"           # calculates the mean of all values in column C
F5:"=MIN(C:C)"               # calculates the minimum value in column C

F6:"=QUARTILE.INC(C:C,1)"    # calculates Q1 for the data in column C.
# note: Excel includes a QUARTILE.EXC function
# as well, although both of these may differ
# slightly from hand calculations of Q1 and Q3

F7:"=MEDIAN(C:C)"            # calculates the median of all values in column C

F8:"=QUARTILE.INC(C:C,3)"    # calculates Q3 for the data in column C.
# similar to F6 above

F9:"=MAX(C:C)"               # calculates the maximum value in column C


And here are the formulas for cells in column I:

I3:"=VAR.S(C:C)"             # calculates the sample variance of column C values
# note: if column C had represented a population
# instead, one should use the VAR.P() function
# for the population variance

I4:"=STDEV.S(C:C)"           # calculates the sample standard deviation of
# column C values. if this column had represented
# a population instead of a sample, one should use
# STDEV.P(C:C) instead for the population standard
# deviation

I6:"=F9-F5"                  # the range of the data (i.e., max - min)

I7:"=CONCATENATE("(",F5," to ",F9,")")"    # builds a string of text from the
# smaller strings and/or values given
# to it

I9:"=F8-F6"                  # the IQR of the data (i.e., Q3 - Q1)

I11:"=F6-1.5*I9"             # the lower outlier fence, in accordance with the IQR
# cut-off of Q1 - 1.5*IQR

I12:"=F8+1.5*I9"             # the upper outlier fence, in accordance with the IQR
# cut-off of Q3 + 1.5*IQR

I14:"=F3-3*I4"               # the lower outlier fence, in accordance with the
# sample mean outlier test cut-off of x.bar - 3*s

I15:"=F3+3*I4"               # the upper outlier fence, in accordance with the
# sample mean outlier test cut-off of x.bar + 3*s

I17:"=3*(F3-F7)/I4"          # Pearson's skewness index: 3*(mean-median)/standard dev.


Finally, column K shows the modes of the data (remember, there can be more than one).

The way some versions of Excel handle modes can be a bit clunky. As of September 2018, a better way was being investigated -- but until then, here's how you can replicate the above:

First, make a generous guess for the number of modes your data might have (i.e., guess higher than you think it should be), and then highlight that many cells in a column. In the worksheet above, I allowed for up to 9 modes, selecting K4:K12.

Then, while all of these cells are selected, click in the Formula Bar. You are now in a position to type an "array formula". Type "=MODE.MULT(C:C)", and then to signify this should be treated as an array formula, hit Ctrl-Shift-Enter. The result is to display all of the modes in the cells you selected. If the number of cells selected is too small, some modes are not shown. If the number of cells is too big, you will see some "#N/A" error values. Like I said -- clunky.