R Data Frames

Consider the following data set, which might be used to find variables that correlate well with scoring high on a statistics test. (These might be causal variables, but without an experiment we can't say.)

Initials  Days In Class  Hrs Studied  Had Stat in H.S.  Grade
AR        10             6            FALSE             A
BT        10             5            FALSE             A
DY        8              7            TRUE              C
SW        9              6            FALSE             A
DW        10             3            TRUE              A
CC        5              1            TRUE              D
BO        10             7            FALSE             A
LL        9              6            FALSE             A
BW        3              0            TRUE              F

Note that each column has a "value" for every student (and thus every row). However, the types of "values" differ from column to column. In some columns the values are numerical. In others, they are logical (i.e., TRUE/FALSE). In still others, they are categorical and expressible as factors.

A data frame is the ideal variable type in R for such a set of data. A data frame in R is a list whose elements are equal-length vectors. (Here, each vector would correspond to a column above.) Importantly, while all of the elements within any one vector share a single type, different vectors can hold different types (e.g., numerical, logical, or strings).

We can create the above data set as a data frame using the data.frame function, as shown below:

> initials = c("AR","BT","DY","SW","DW","CC","BO","LL","BW")
> days.in.class = c(10,10,8,9,10,5,10,9,3)
> hrs.studied = c(6,5,7,6,3,1,7,6,0)
> had.stat = c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE,FALSE,FALSE,TRUE)
> grade = c("A","A","C","A","A","D","A","A","F")
> df = data.frame(initials,days.in.class,hrs.studied,had.stat,grade)

> df
  initials days.in.class hrs.studied had.stat grade
1       AR            10           6    FALSE     A
2       BT            10           5    FALSE     A
3       DY             8           7     TRUE     C
4       SW             9           6    FALSE     A
5       DW            10           3     TRUE     A
6       CC             5           1     TRUE     D
7       BO            10           7    FALSE     A
8       LL             9           6    FALSE     A
9       BW             3           0     TRUE     F
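
The description of a data frame as a list of equal-length vectors can be checked directly. The following is a minimal sketch that rebuilds only a few rows and columns of the data set so it can run on its own; the variable name students is chosen here for illustration and is not used elsewhere in this section.

```r
# Rebuild a smaller version of the student data so this sketch is self-contained
students = data.frame(
  initials      = c("AR", "BT", "DY"),
  days.in.class = c(10, 10, 8),
  had.stat      = c(FALSE, FALSE, TRUE)
)

is.list(students)          # a data frame is a list under the hood
length(students)           # one list element per column
sapply(students, class)    # the type of each column
sapply(students, length)   # every column has the same length (the row count)
dim(students)              # number of rows, then number of columns
str(students)              # compact summary of structure and types
```

What class() reports for the initials column depends on your version of R: since R 4.0, data.frame() leaves strings as character vectors by default, while earlier versions converted them to factors unless told otherwise.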

Accessing Data Frame Values

We can access entire rows or columns of a data frame in a straightforward way, reminiscent of similar work with tables, as the examples below demonstrate:

> df$had.stat    # <-- here, we access the entire had.stat column
[1] FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE


> df[,3]         # <-- we can specify which column using position
                 #     as well.  here, we display the 3rd column
[1] 6 5 7 6 3 1 7 6 0


> df[5,]         # <-- we can also pull all of the data for a 
                 #     particular subject (i.e., a row) by specifying
                 #     the number of the row
  initials days.in.class hrs.studied had.stat grade
5       DW            10           3     TRUE     A
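
Beyond whole rows and columns, the same bracket notation can pick out single cells or several rows at once. The sketch below uses a small stand-in data frame named scores (a hypothetical name, so that the df used in this section is left alone) to show a few such forms.

```r
# A small stand-in data frame; "scores" is an illustrative name only
scores = data.frame(
  initials    = c("AR", "BT", "DY"),
  hrs.studied = c(6, 5, 7),
  grade       = c("A", "A", "C")
)

scores[2, "hrs.studied"]                # one cell: row 2 of the hrs.studied column
scores[["hrs.studied"]]                 # a whole column as a vector, like scores$hrs.studied
scores[, "hrs.studied"]                 # also a plain vector (the result is simplified)
scores[, "hrs.studied", drop = FALSE]   # drop = FALSE keeps a one-column data frame
scores[c(1, 3), ]                       # rows 1 and 3, with all columns
```

The drop = FALSE form is useful when later code expects a data frame rather than a bare vector.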

One can both retrieve the names of the components of a data frame and change them using the names() function. Note that the component names associated with a data frame may contain spaces. When this happens, one should surround the component name with single quotes (backticks also work) when using it to access a subset of a data frame.

> df = data.frame(a=c(1,2,3,4,5),b=factor(c("A","A","B","C","A")))
> df
  a b
1 1 A
2 2 A
3 3 B
4 4 C
5 5 A

> names(df)
[1] "a" "b"

> names(df)=c("id","letter used")  # we rename the "a" component to be "id"
                                   # and the "b" component to be "letter used"

> df[df$'letter used'=="A",]
  id letter used
1  1           A
2  2           A
5  5           A
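
There are a few other ways to refer to a component whose name contains a space, and names() can also be used to rename a single component. The following sketch illustrates both under the same setup as the example above; the variable name lookup is an illustrative choice.

```r
# Same shape as the example above, under a different (illustrative) variable name
lookup = data.frame(a = 1:5, b = c("A", "A", "B", "C", "A"))
names(lookup) = c("id", "letter used")

lookup$`letter used`                           # backticks quote a name containing a space
lookup[["letter used"]]                        # double brackets take the name as a string
lookup[lookup[["letter used"]] == "A", "id"]   # ids of the rows whose letter is "A"

names(lookup)[2] = "letter"                    # rename only the second component
names(lookup)
```

Renaming such a component to something without spaces, as in the last step, means that no quoting is needed afterwards.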

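Importantly, and using the same mechanisms already seen with matrices, data frames allow one to filter data. The examples that follow return to the student data frame built at the start of this section (recreate it first if df has since been reassigned, as it was in the renaming example). For example, to create a new data frame that consists of all of the data for students that were in class for fewer than 7 days, we can do the following: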

> df[df$days.in.class < 7,]
  initials days.in.class hrs.studied had.stat grade
6       CC             5           1     TRUE     D
9       BW             3           0     TRUE     F

Note the comma after the 7. Leaving the column position after it empty tells R to display all of the (column) variables for each such student.

If we are only interested in the days.in.class and grade variables for these students, we can easily display just this information as well:

> df[df$days.in.class < 7,c("days.in.class","grade")]
  days.in.class grade
6             5     D
9             3     F
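
Filters can also combine several conditions, and base R's subset() function offers a somewhat more readable way to write them. The sketch below is only an illustration: it rebuilds a five-student portion of the data under the name students so that it runs on its own.

```r
# A five-student portion of the data set, rebuilt so the sketch is self-contained
students = data.frame(
  initials      = c("AR", "DY", "DW", "CC", "BW"),
  days.in.class = c(10, 8, 10, 5, 3),
  hrs.studied   = c(6, 7, 3, 1, 0),
  had.stat      = c(FALSE, TRUE, TRUE, TRUE, TRUE),
  grade         = c("A", "C", "A", "D", "F")
)

# Students who had statistics in high school AND studied at least 3 hours
students[students$had.stat & students$hrs.studied >= 3, ]

# The earlier filter written with subset(): condition first, then the columns to keep
subset(students, days.in.class < 7, select = c(days.in.class, grade))
```

Inside subset(), column names can be written without the students$ prefix, which keeps longer conditions easier to read.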

As there are the same number of elements in each row and column of a data frame -- making it very "matrix-like" -- many of the mechanisms of matrices also apply to data frames. For example, we can cbind() a new component to an existing data frame and rbind() the rows of two data frames together, as the following two examples demonstrate:

> df = data.frame(nums=1:3,logicals=c(TRUE,TRUE,FALSE))
> df
  nums logicals
1    1     TRUE
2    2     TRUE
3    3    FALSE
 
> chars = c("a","b","c")
> df = cbind(df,chars)
> df
  nums logicals chars
1    1     TRUE     a
2    2     TRUE     b
3    3    FALSE     c
 
> df.more = data.frame(nums=4:6,logicals=c(FALSE,TRUE,FALSE),chars=c("d","e","f"))
> df.more
  nums logicals chars
1    4    FALSE     d
2    5     TRUE     e
3    6    FALSE     f
 
> df = rbind(df,df.more)
> df
  nums logicals chars
1    1     TRUE     a
2    2     TRUE     b
3    3    FALSE     c
4    4    FALSE     d
5    5     TRUE     e
6    6    FALSE     f
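
Two related details may be worth knowing. A new column can also be added by assigning to a new name with $, and rbind() lines up the columns of data frames by name rather than by position. The sketch below shows both, using the illustrative names base.df, extra, and combined.

```r
base.df = data.frame(nums = 1:3, logicals = c(TRUE, TRUE, FALSE))

# Assigning to a new name with $ adds a column, much like cbind()
base.df$chars = c("a", "b", "c")

# rbind() matches columns by name, so the second data frame's column order may differ
extra = data.frame(chars = c("d", "e"), logicals = c(FALSE, TRUE), nums = 4:5)
combined = rbind(base.df, extra)

combined
dim(combined)     # number of rows, then number of columns
```

The column names themselves must still agree between the two data frames, and a vector added as a new column must have one element per row.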

Importing and Exporting Data Frames

A tab-delimited text file similar to what is shown below can be easily imported as a data frame in R.

   X    Y
1  1.7  A
2  1.3  A
3  9.2  B
4  2.1  A
5  8.7  B
6  5.6  C

If you are using RStudio and the text file is on your local machine, you can simply choose "Import Dataset: From Text (base)" in the Environment panel (top right panel, by default), where you will be prompted to locate the file in question. At this point, you can tweak the available settings so that the data is imported the way you want it. Notably, you can set the name of the variable that will store the resulting data frame. Finally, click "Import".

If your data is in an Excel file, no worries: you can import that too.

In the "Files" panel of RStudio (bottom right panel, by default), navigate to your Excel file, left-click it, and choose "Import Dataset...". Then click the "Update" button to preview the data and ensure that R is interpreting things correctly. This also gives you the chance to set the name of the variable that will hold the data frame, the sheet of your workbook to import, the range you wish to consider, etc. Finally, click "Import" when you are done tweaking the settings to make this data frame available for the rest of your R calculations. (Note: the first time you do this, you may be prompted to install a package named "readxl". This is normal and safe to do, and once done, you shouldn't have to do it again.)

If you wish to import a text file located on the web at some URL, you can use the read.table() function, as suggested by the following:

df = read.table("http://math.oxford.emory.edu/site/math117/rDataFrames/example_data.txt")
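
To see how read.table() interprets a file like the sample shown earlier without needing a file or a network connection, the text can be passed in directly through the text argument. This is a sketch for experimentation only; the variable names raw, xy, raw.with.id, and xy2 are illustrative.

```r
# The sample data from earlier in this section, supplied inline
raw = "X Y
1 1.7 A
2 1.3 A
3 9.2 B
4 2.1 A
5 8.7 B
6 5.6 C"

# The first line has one fewer field than the others, so read.table() treats it
# as a header and uses the first column of each data row as the row names
xy = read.table(text = raw)
names(xy)
nrow(xy)

# When every column has its own heading, ask for the header explicitly
raw.with.id = "id X Y
1 1.7 A
2 1.3 A"
xy2 = read.table(text = raw.with.id, header = TRUE)
names(xy2)
```

Without header = TRUE, the second form would be read with its headings as an ordinary first row of data, so it is worth checking names() after any import.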

To export a data frame as a text file, use the write.table() function instead. As an example, suppose we have a data frame stored in the variable my.data.frame. We can export this with:

write.table(my.data.frame,"my.data.frame.txt",sep="\t")

Note that "my.data.frame.txt" will be the name of the file produced, written to R's current working directory. Also, the "\t" tells R that the values in the file should be tab-delimited (i.e., separated by tab characters).
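
A quick way to confirm that an export worked is to write the file and then read it back. The following is a minimal sketch of such a round trip. It writes to a temporary file created with tempfile() so that nothing is left behind, and it adds two optional arguments (row.names and quote) that are not part of the example above.

```r
my.data.frame = data.frame(X = c(1.7, 1.3, 9.2), Y = c("A", "A", "B"))

out.file = tempfile(fileext = ".txt")   # a temporary path, used here for illustration

# row.names = FALSE omits the row labels; quote = FALSE omits quotes around strings
write.table(my.data.frame, out.file, sep = "\t", row.names = FALSE, quote = FALSE)

readLines(out.file)                     # the raw lines as written to disk

# Read the file back, stating the header and separator explicitly
round.trip = read.table(out.file, header = TRUE, sep = "\t")
round.trip
```

Leaving out row.names = FALSE and quote = FALSE, as in the original call, also produces a file that read.table() can read. The difference is that the file then contains the row labels and has quotation marks around the strings.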