R Factors and Tables

One often has to deal with categorical variables in statistics (i.e., variables at the nominal or ordinal level of measurement). In R, these are best dealt with through the use of factors.

For example, fertilizers typically have three main ingredients: nitrogen (N), phosphorus (P), and potassium (K). Perhaps one is conducting an experiment to determine which of these ingredients best promotes root development, and has four treatment groups (one for each ingredient, and a control group that receives none of the ingredients).

Plants numbered 1 through 12 are randomly assigned to one of the four treatment groups so that each group ends up with 3 members. We could represent this process with the vector named f, as shown below -- where the treatment given to plant $i$ corresponds to the $i^{th}$ element of the vector:

f = c("K","K","none","N","P","P","N","N","none","P","K","none")

To make R aware that the values listed are values of a categorical variable (R calls the distinct values of such a variable its levels), we convert this vector into a factor with the factor() function:

fertilizer = factor(f)

Asking R to show the contents of f and fertilizer suggests there is a subtle difference between the two variables, as shown below:

> f
 [1] "K"    "K"    "none" "N"    "P"    "P"    "N"    "N"    "none" "P"    "K"    "none"

> fertilizer
 [1] K    K    none N    P    P    N    N    none P    K    none
Levels: K N none P

First, it is clear that R is no longer considering the elements of the factor as strings of characters, given the absence of double-quotes. Second (and more importantly), additional information in the form of "Levels: K N none P" is given. The levels shown correspond to the unique values seen in the vector $f$ (i.e., the categories that represent the treatment groups).

There are other differences between a vector and a factor, which we can see if we use the str(x) function. This function in R displays a compact representation of the internal structure of any R variable $x$. Let's see what happens when we apply it to both f and fertilizer:

> str(f)
 chr [1:12] "K" "K" "none" "N" "P" "P" "N" "N" "none" "P" "K" "none"

> str(fertilizer)
 Factor w/ 4 levels "K","N","none",..: 1 1 3 2 4 4 2 2 3 4 ...

Note how in the factor fertilizer, the levels "K", "N", "none", and "P" are replaced by numbers 1, 2, 3, and 4, respectively. So internally, R only stores the numbers (indicating the level of each vector element) and (separately) the names of each unique level. (Interestingly, even if the vector's elements had been numerical, the levels are stored as strings of text.)
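The two pieces R stores can be inspected directly. The following is a small sketch, assuming the vector f from above; as.integer(), levels(), and nlevels() are base R functions not otherwise used in this text, and the numeric factor nums is a made-up example added only to illustrate the parenthetical remark about numerical elements.

```r
f = c("K","K","none","N","P","P","N","N","none","P","K","none")
fertilizer = factor(f)

# the integer codes R stores, one per element of the factor
codes = as.integer(fertilizer)
print(codes)

# the level names, stored only once (and always as strings of text)
lv = levels(fertilizer)
print(lv)
print(nlevels(fertilizer))       # 4

# indexing the level names by the codes recovers the original strings
rebuilt = lv[codes]
print(identical(rebuilt, f))     # TRUE

# hypothetical numeric example: the levels are still text, so one goes
# through as.character() to get the original numbers back
nums = factor(c(10, 20, 10, 30))
print(levels(nums))                      # "10" "20" "30"
print(as.numeric(as.character(nums)))    # 10 20 10 30
```

Note that calling as.numeric() directly on nums would return the integer codes (1 2 1 3) rather than the original numbers.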

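The way R internally stores factors is important when we want to combine them. Consider the following attempt to combine factors a.fac and b.fac, which fails in versions of R before 4.1.0 (from R 4.1.0 on, c() applied to factors returns a factor that keeps the level names, so the output shown here reflects the older behavior):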

> a.fac = factor(c("X","Y","Z","X"))
> b.fac = factor(c("X","X","Y","Y","Z"))

> factor(c(a.fac,b.fac))
[1] 1 2 3 1 1 1 2 2 3
Levels: 1 2 3

Notice how we lost the names associated with the different levels. There is a way to restore them -- but it would be better not to lose them in the first place! The as.character() function can help here. This function can be used to force a factor back into a vector whose elements are the corresponding strings of text associated with its levels. For example, as.character(factor(c("X","Y"))) returns a vector equivalent to c("X","Y").

To combine two factors (with the same levels), we force them both back to vectors in the way just described, combine the vectors with c(), and then convert the result back into a factor -- as shown below:

> a.fac = factor(c("X","Y","Z","X"))
> b.fac = factor(c("X","X","Y","Y","Z"))

> factor(c(as.character(a.fac),as.character(b.fac)))
[1] X Y Z X X X Y Y Z
Levels: X Y Z

You can, of course, also change the names of the levels associated with a factor using levels(), as the following suggests.

> a.fac = factor(c("X","Y","Z","X"))
> a.fac
[1] X Y Z X
Levels: X Y Z

> levels(a.fac) = c("A","B","C")
> a.fac
[1] A B C A
Levels: A B C

How does the addition of factors as a data type in R help us do statistical work, you ask? Well, let us continue with the fertilizer example above as we attempt to answer that question. Suppose that the increase in root growth (measured in millimeters) for each plant is recorded after 3 weeks of treatment. These lengths are recorded in a vector named growth:

growth = c(10,12,8,13,18,19,11,11,9,21,10,10)

Suppose we are interested in the mean growth in each of our treatment groups.

Armed with only vectors f and growth, we would need to create four additional vectors of positions in $f$ that matched each string "K", "N", "none", and "P", and then use these to create four more vectors of growth values for these categories via subsetting. Then, we would need to find the mean of each of these vectors. Sounds like a lot of work, right? Well, factors -- and one of the "apply" functions -- make this process simple in the extreme...
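For comparison, the manual approach just described might look something like the following sketch. The variable names (pos.K, growth.K, manual.means, and so on) are invented here for illustration only.

```r
f = c("K","K","none","N","P","P","N","N","none","P","K","none")
growth = c(10,12,8,13,18,19,11,11,9,21,10,10)

# four vectors of positions in f, one per treatment
pos.K    = which(f == "K")
pos.N    = which(f == "N")
pos.none = which(f == "none")
pos.P    = which(f == "P")

# four more vectors of growth values, obtained by subsetting
growth.K    = growth[pos.K]
growth.N    = growth[pos.N]
growth.none = growth[pos.none]
growth.P    = growth[pos.P]

# finally, the mean of each of these vectors
manual.means = c(K = mean(growth.K), N = mean(growth.N),
                 none = mean(growth.none), P = mean(growth.P))
print(manual.means)
```

Every new treatment group would require two more vectors and another call to mean(), which is exactly the bookkeeping a factor removes.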

We can use the tapply(x,f,g) function in R to apply a function to the values of a vector belonging to different categories, as determined by a factor. Here, $x$ is the vector, $f$ is the factor, and $g$ is the function to be applied. In other words, all we need to type is:

> tapply(growth,fertilizer,mean)
       K        N     none        P 
10.66667 11.66667  9.00000 19.33333 

Voila! It looks like phosphorus (P) really helps with root growth!

If we wanted to stop short of finding the means associated with the four fertilizer levels, and instead simply split up the growth vector into four vectors, each consisting only of the values in growth associated with a specific fertilizer level, we can use the split() function:

> split.data = split(growth,fertilizer)

Then, we can access the vectors associated with each level of fertilizer by following the variable name split.data with a dollar sign ($) and the level name (P, N, none, or K), as shown below:

> split.data$P
[1] 18 19 21

> split.data$N
[1] 13 11 11

> split.data$none
[1]  8  9 10

> split.data$K
[1] 10 12 10
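
Since split() returns a list, the "apply" family can be used on its result as well. The sketch below assumes the growth and fertilizer variables from above and uses sapply(), a base R function not otherwise covered in this text, to show that splitting and then averaging gives the same group means as tapply().

```r
f = c("K","K","none","N","P","P","N","N","none","P","K","none")
fertilizer = factor(f)
growth = c(10,12,8,13,18,19,11,11,9,21,10,10)

split.data = split(growth, fertilizer)

# double brackets with a level name work just like the dollar sign
print(split.data[["P"]])      # 18 19 21

# applying mean() to each piece reproduces the tapply() result
group.means = sapply(split.data, mean)
print(group.means)

# other summaries work the same way, e.g., the size of each group
group.sizes = sapply(split.data, length)
print(group.sizes)
```
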

Another useful feature of factors is that one can impose an order on the levels when the factor is created. This is frequently useful when creating a factor to represent an ordinal variable. An example is shown below:

> p = c("Bears","Bears","Tigers","Bears","Lion","Tigers","Lion")

> prizes = factor(p,levels=c("Lion","Tigers","Bears"),ordered=TRUE)
> prizes
[1] Bears  Bears  Tigers Bears  Lion   Tigers Lion  
Levels: Lion < Tigers < Bears

> sort(prizes)  # among other things, we can now sort the prizes factor in
                # accordance with this explicit order (instead of the default 
                # alphabetical order)

[1] Lion   Lion   Tigers Tigers Bears  Bears  Bears 
Levels: Lion < Tigers < Bears
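
Sorting is only one of the "other things" an explicit order allows. As a brief sketch, assuming the prizes factor defined above, comparison operators and functions such as min() and max() also follow the stated order of the levels (the name better.than.lion is invented here for illustration):

```r
p = c("Bears","Bears","Tigers","Bears","Lion","Tigers","Lion")
prizes = factor(p, levels=c("Lion","Tigers","Bears"), ordered=TRUE)

# comparisons respect the order Lion < Tigers < Bears
better.than.lion = prizes > "Lion"
print(better.than.lion)   # TRUE TRUE TRUE TRUE FALSE TRUE FALSE

# so do max() and min()
print(max(prizes))
print(min(prizes))

# is.ordered() reports whether a factor carries an order
print(is.ordered(prizes))       # TRUE
print(is.ordered(factor(p)))    # FALSE
```

With an unordered factor, a comparison such as factor(p) > "Lion" is not meaningful; R returns NA values and issues a warning.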

Finally, before transitioning to a discussion about tables in R, consider one more useful function (although there are many others): cut(). It allows us to create factors from numerical data by cutting up the continuum containing the data into different "bins", much like the breaks argument of the hist() function is used to establish the rectangular bars/bins shown in a histogram. Indeed, the corresponding argument to the cut() function is also called breaks. Here's an example:

> xs = runif(10)
> xs
 [1] 0.8675517 0.2721003 0.5774774 0.3887704 0.8033977 0.3176221 0.7910806
 [8] 0.6419176 0.8728865 0.4013302

> bin.ids = cut(xs,breaks=seq(from=0,to=1,by=0.1),labels=FALSE)
> bin.ids   # i.e., which bin did each x in xs fall into?
 [1] 9 3 6 4 9 4 8 7 9 5

Note, the labels=FALSE parameter above makes R return a vector of simple integer codes that can then be turned into a factor.

If the labels argument is instead left at its default (i.e., it is omitted), a factor is returned with levels expressed in interval notation (i.e., "(a,b]" form). Alternatively, the levels can be named by passing a vector $v$ of names, one per bin, through the labels=v parameter of the cut() function, as seen below:

> xs   # the same values generated above
 [1] 0.8675517 0.2721003 0.5774774 0.3887704 0.8033977 0.3176221 0.7910806
 [8] 0.6419176 0.8728865 0.4013302

> bin.ids = cut(xs,breaks=seq(from=0,to=1,by=0.1),
+                  labels=c("A","B","C","D","E","F","G","H","I","J"))
> bin.ids
 [1] I C F D I D H G I E
Levels: A B C D E F G H I J

If one wishes the result of cut() to have an implicit order (like the "Lions, Tigers, and Bears" example above), one merely needs to add the argument ordered_result = TRUE when it is called.
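As a sketch of both points (the default interval labels and the ordered_result argument), the following hard-codes the xs values shown above rather than calling runif() again, so that the result is reproducible. The variable name bins is invented here for illustration.

```r
xs = c(0.8675517, 0.2721003, 0.5774774, 0.3887704, 0.8033977,
       0.3176221, 0.7910806, 0.6419176, 0.8728865, 0.4013302)

# no labels argument, so the levels are named in "(a,b]" form;
# ordered_result=TRUE makes the resulting factor an ordered one
bins = cut(xs, breaks=seq(from=0, to=1, by=0.1), ordered_result=TRUE)
print(bins)

print(is.ordered(bins))         # TRUE
print(as.character(bins[1]))    # "(0.8,0.9]"

# with an order in place, bins can be compared with one another
print(bins[1] > bins[2])        # TRUE
```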

The result of applying cut() can then be turned into a table showing the frequency of data values in each bin. "What's a table?", you say -- funny you should ask...


Factors can also be used to create tables in R, another important data type in terms of its relationship to statistics.

As an example, suppose that each of the 7 people in a sample is asked two questions, Q1 and Q2, in a study of workplace risk of tetanus infections.

The answers to each question for subjects 1 through 7 are given by the following factors:

Q1 = factor(c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes"))
Q2 = factor(c("Maybe","Maybe","Yes","Maybe","No","Yes","No"))

Thinking that there might be a relationship between these two variables, we wish to construct a contingency table -- where the levels of one variable form the column headers and the levels of the other variable form the row headers, with the body of the table indicating how many subjects were associated with each possible pair of levels.

To create such a table in R, we simply use the table() command, as shown below:

> t = table(Q1,Q2)
> t
           Q2
Q1          Maybe No Yes
  Always        1  1   0
  Never         0  0   1
  Sometimes     2  1   1

Tables can be made from 1, 2, or many more factors. Recalling the example used in the previous section to show what the cut() function does, note how a table made from the single resulting factor gives the frequency count for each level of this factor:

> xs
 [1] 0.8675517 0.2721003 0.5774774 0.3887704 0.8033977 0.3176221 0.7910806
 [8] 0.6419176 0.8728865 0.4013302

> bin.ids = cut(xs,breaks=seq(from=0,to=1,by=0.1),labels=FALSE)
> bin.ids
 [1] 9 3 6 4 9 4 8 7 9 5

> table(factor(bin.ids))
3 4 5 6 7 8 9 
1 2 1 1 1 1 3 
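
Notice that bins 1, 2, and 10 are missing from the table above, since factor() only creates levels for values that actually occur. One possible way to keep the empty bins, sketched below, is to state all ten levels explicitly when the factor is created (the names all.bins and counts are invented here for illustration):

```r
bin.ids = c(9, 3, 6, 4, 9, 4, 8, 7, 9, 5)

# listing every possible bin as a level keeps the empty ones
all.bins = factor(bin.ids, levels=1:10)

counts = table(all.bins)
print(counts)     # bins 1, 2, and 10 now appear with a count of 0
```
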

Getting back to the two-dimensional table t resulting from the answers to questions Q1 and Q2, let us explore some more ways tables can be used:

First -- and very similar to vectors -- one can extract individual values (or subsets of values) from a table. As an example, note that t[3,1] gives one the value $2$, located in the 3rd row, 1st column.

If one wishes to extract the entire 1st column, one simply leaves out the row number (but still uses the comma):

> t[,1]
   Always     Never Sometimes 
        1         0         2 

If one desires instead to extract (as a new table) columns 2 and 3 of t, one can use:

> t[,2:3]
           Q2
Q1          No Yes
  Always     1   0
  Never      0   1
  Sometimes  1   1

If only the 3rd row is wanted, one simply leaves out the column number, and so on...

> t[3,]
Maybe    No   Yes 
    2     1     1

Note, the results above for t[,1] and t[3,] are actually vectors -- the extra words that are shown result because the vector elements have been given names. This is not something peculiar to tables, however -- any vector can have its elements given names using the names() function, as the following suggests:

> x = c(1,2,3)
> x
[1] 1 2 3

# executing the names() function tells us that x 
# currently has no names attached to it
> names(x)   
NULL

# the following gives the elements of x names "a", "b", and "c"
# which we can see here results in x being displayed differently
> names(x) = c("a","b","c")  
> x    
a b c 
1 2 3 

# executing the names() function after giving x names,
# reveals the names given to it
> names(x)   
[1] "a" "b" "c"

# a vector that has names allows one to subset not by 
# numerical position, but by name
> x["b"]     
b 
2 

# one can remove the names using a NULL assignment
> names(x) = NULL  
> x
[1] 1 2 3
> names(x)
NULL
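
The rows and columns of a table carry names in much the same way, so cells can be picked out by name as well as by position. The following sketch assumes the Q1 and Q2 factors from earlier; rownames(), colnames(), and dimnames() are base R functions not otherwise used in this text.

```r
Q1 = factor(c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes"))
Q2 = factor(c("Maybe","Maybe","Yes","Maybe","No","Yes","No"))
t = table(Q1,Q2)

# the same cell as t[3,1], chosen by row and column name
print(t["Sometimes","Maybe"])    # 2

# an entire row, chosen by name
print(t["Sometimes",])

# the names attached to the rows and columns
print(rownames(t))    # "Always" "Never" "Sometimes"
print(colnames(t))    # "Maybe" "No" "Yes"
print(dimnames(t))
```
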

One can also produce new tables from existing ones. For example, suppose we wanted to see a table of relative frequencies instead of counts. Much like one might do with a vector, we simply divide the table by the sum of its elements:

> t/sum(t)
           Q2
Q1              Maybe        No       Yes
  Always    0.1428571 0.1428571 0.0000000
  Never     0.0000000 0.0000000 0.1428571
  Sometimes 0.2857143 0.1428571 0.1428571
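
Base R also provides prop.table(), a function not mentioned above, which performs this same division and can additionally compute proportions within each row or each column. A short sketch, assuming the table t from above (the names rel, row.rel, and col.rel are invented here for illustration):

```r
Q1 = factor(c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes"))
Q2 = factor(c("Maybe","Maybe","Yes","Maybe","No","Yes","No"))
t = table(Q1,Q2)

# overall relative frequencies, the same values as t/sum(t)
rel = prop.table(t)
print(rel)

# proportions within each row (each row sums to 1)
row.rel = prop.table(t, 1)
print(row.rel)

# proportions within each column (each column sums to 1)
col.rel = prop.table(t, 2)
print(col.rel)
```

For example, row.rel shows that half of the subjects who answered "Sometimes" to Q1 answered "Maybe" to Q2.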

When working with contingency tables we often have need of marginal totals (i.e., either row or column sums in a two-dimensional table). One way to accomplish this is through the use of the apply() function, which allows us to apply any given function (here the sum() function) to the values in the table associated with each value of a given variable.

> apply(t,1,sum)
   Always     Never Sometimes 
        2         1         4

Note, the second parameter being a 1 above tells R to find the sums of the values in the table associated with each value of the first table variable, Q1. That is to say, when the second parameter is a 1, R finds the row totals. Had we used a 2 instead, we would see the column totals:

> apply(t,2,sum)
Maybe    No   Yes 
    3     2     2

However, R supplies another function, called addmargins(), that can find both of these vectors (and the grand total) in one command:

> addmargins(t)
           Q2
Q1          Maybe No Yes Sum
  Always        1  1   0   2
  Never         0  0   1   1
  Sometimes     2  1   1   4
  Sum           3  2   2   7
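
If only the marginal totals themselves are wanted, base R offers some shortcuts not mentioned above: rowSums(), colSums(), and margin.table(). The sketch below, which assumes the table t from above, is meant only to show that they agree with the apply() results (the names row.totals and col.totals are invented here for illustration).

```r
Q1 = factor(c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes"))
Q2 = factor(c("Maybe","Maybe","Yes","Maybe","No","Yes","No"))
t = table(Q1,Q2)

# row and column totals as named vectors
row.totals = rowSums(t)
col.totals = colSums(t)
print(row.totals)
print(col.totals)

# the same totals, returned as one-dimensional tables
print(margin.table(t, 1))
print(margin.table(t, 2))
```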

As one final useful function, note that as.vector() can collapse a table or factor into a vector. In the case of factors or one-dimensional tables, the result is obvious:

> f = factor(c("bob","fred","bob","bob","alice"))
> as.vector(f)
[1] "bob"   "fred"  "bob"   "bob"   "alice"

Q1 = factor(c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes"))

> table(Q1)
   Always     Never Sometimes 
        2         1         4

> as.vector(table(Q1))
[1] 2 1 4

In the case that a table has two dimensions, its columns are concatenated together (from left to right) to form one long vector, as seen below:

Q1 = factor(c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes"))
Q2 = factor(c("Maybe","Maybe","Yes","Maybe","No","Yes","No"))

> table(Q1,Q2)
           Q2
Q1          Maybe No Yes
  Always        1  1   0
  Never         0  0   1
  Sometimes     2  1   1

> as.vector(table(Q1,Q2))
[1] 1 0 2 1 0 1 0 1 1
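
The collapsed vector no longer says which count belongs to which pair of levels. If that information is needed, one option (a sketch using the base R function as.data.frame(), which is not covered above) is to convert the table into a data frame with one row per pair of levels. The names counts and long are invented here for illustration.

```r
Q1 = factor(c("Sometimes","Sometimes","Never","Always","Always","Sometimes","Sometimes"))
Q2 = factor(c("Maybe","Maybe","Yes","Maybe","No","Yes","No"))

counts = as.vector(table(Q1,Q2))

# one row per (Q1, Q2) pair, with the count in a column named Freq
long = as.data.frame(table(Q1,Q2))
print(long)

# the Freq column lists the counts in the same order as as.vector()
print(identical(long$Freq, counts))    # TRUE
```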