Exercises - Data Frames

  1. Create a data frame named dframe in accordance with the following table. Note, person should be a vector of strings, while sex and funny should be factors with the observed nominal levels and ordinal levels, respectively.

    $$\begin{array}{l|l|l} \textrm{person} & \textrm{sex} & \textrm{funny}\\\hline \textrm{Stan} & M & \textrm{High}\\ \textrm{Francine} & F & \textrm{Med}\\ \textrm{Steve} & M & \textrm{Low}\\ \textrm{Roger} & M & \textrm{High}\\ \textrm{Abigail} & F & \textrm{High}\\ \textrm{Klaus} & M & \textrm{Med} \end{array}$$

    Now do or find the following:

    1. Add an age column where the ages of Stan, Francine, Steve, Roger, Hayley, and Klaus are 41, 41, 15, 1600, 21, and 60, respectively. (Yes, Roger is very old.)

    2. Reorder the columns of the data frame so that they occur in the following order: person, age, sex, funny.

    3. Make a new data frame similarly constructed and named dframe2 in accordance with the table below.

      $$\begin{array}{l|l|l|l} \textrm{person} & \textrm{age} & \textrm{sex} & \textrm{funny}\\\hline \textrm{Fredrico} & 42 & M & \textrm{High}\\ \textrm{Roberta} & 37 & F & \textrm{Med}\\ \textrm{Alfred} & 19 & M & \textrm{Low}\\ \textrm{Bruce} & 35 & M & \textrm{High} \end{array}$$

      Then combine these two data frames into a single data frame called mydataframe.

    4. Write a single line of code that will extract from mydataframe just the names and ages of any records where the individual is female and has a level of funniness of "Med" or "High".

    # (a)
    > p = c("Stan","Francine","Steve","Roger","Abigail","Klaus")
    > s = factor(c("M","F","M","M","F","M"))
    > f = factor(c("High", "Med", "Low", "High", "High", "Med"),
                 ordered = TRUE,
                 levels=c("Low","Med","High"))
    
    > dframe = data.frame(person=p, sex=s, funny=f)
    > dframe
        person sex funny
    1     Stan   M  High
    2 Francine   F   Med
    3    Steve   M   Low
    4    Roger   M  High
    5  Abigail   F  High
    6    Klaus   M   Med
    
    > # (a) 
    > dframe$age = c(41,41,15,1600,21,60)
    > dframe
        person sex funny  age
    1     Stan   M  High   41
    2 Francine   F   Med   41
    3    Steve   M   Low   15
    4    Roger   M  High 1600
    5  Abigail   F  High   21
    6    Klaus   M   Med   60
    
    > # (b) 
    > dframe = dframe[TRUE,c(1,4,2,3)]
    > dframe
        person  age sex funny
    1     Stan   41   M  High
    2 Francine   41   F   Med
    3    Steve   15   M   Low
    4    Roger 1600   M  High
    5  Abigail   21   F  High
    6    Klaus   60   M   Med
    
    > # (c) 
    > p = c("Fredrico","Roberta","Alfred","Bruce")
    > a = c(42,37,19,35)
    > s = factor(c("M","F","M","M"))
    > f = factor(c("High","Med","Low","High"),
    +            ordered=TRUE,
    +            levels=c("Low","Med","High"))
     
    > dframe2 = data.frame(person=p, age=a, sex=s, funny=f)
    > dframe2
        person age sex funny
    1 Fredrico  42   M  High
    2  Roberta  37   F   Med
    3   Alfred  19   M   Low
    4    Bruce  35   M  High
     
    > mydataframe = rbind(dframe,dframe2)
    > mydataframe
         person  age sex funny
    1      Stan   41   M  High
    2  Francine   41   F   Med
    3     Steve   15   M   Low
    4     Roger 1600   M  High
    5   Abigail   21   F  High
    6     Klaus   60   M   Med
    7  Fredrico   42   M  High
    8   Roberta   37   F   Med
    9    Alfred   19   M   Low
    10    Bruce   35   M  High
    
    > # (d) 
    > mydataframe[(mydataframe$sex == "F") & (mydataframe$funny > "Med"),1:2]
       person age
    5 Abigail  21
    
  2. A data frame named diet has components gender (M or F), age (an integer), treatment.group (diet or placebo), weight.before (possibly a decimal value), and weight.after (also possibly a decimal value).

    1. Write a statement in R that adds a component weight.loss to the data frame diet whose values are the amounts of weight each subject lost during the program.

    2. Describe what the following code will do:

      tapply(diet$weight.loss, diet$treatment.group, hist)

    1. diet$weight.loss = diet$before - diet$after

    2. This will split the data in the diet$weight.loss vector into two vectors -- one containing the values associated with the diet treatment and another containing the values associated with the placebo treatment group. Then, the hist() function will be applied to each, producing two histograms of weight loss -- one for each treatment group.

  3. Import the Excel file "mydata.xlsx" into R as a data frame named apples.

    1. Download the the "mydata.xlsx" file to your local machine
    2. Click on the "File" tab in the panel in the lower right corner of R-Studio
    3. Navigate to the "mydata.xlsx" file
    4. Left-click the file and choose "Import Dataset..."
    5. Click the "Update" button to see the spreadsheet in the "Import Excel Data" dialog box
    6. Change the "Name:" field to apples
    7. Click the "Import" button
    8. Type "apples" in the console window to verify the import happened correctly
  4. Import the text file people.txt into R as a data frame named people.

    First copy the link and save it to a variable named url. Then, use the read.table() function to create a data frame. Importantly, as the text file has column headers in the first line, use the argument header=TRUE so that R treats these as component names in the data frame, rather than the data itself.

    url = "https://math.oxford.emory.edu/site/math117/..."  # <-- full address 
                                                            #     not shown as
                                                            #     it is quite
                                                            #     long 
    people = read.table(url,header=TRUE)
    
  5. After creating the data frame people above, display the age and birth month (in that order) of all males in the dataset.
    people[people$Sex == "M",c("Age","BirthMonth")]
    
  6. Points earned for several students are given below. Construct a data frame to hold this data, and then add a column to that data frame called grade where the grades are assigned in the following way: A for the top 10% of point totals, B for the next 20%, C for the middle 40%, D for the 20% below that, and F for the bottom 10%.

     studentId pts
             1  86
             2  84
             3  81
             4  90
             5  85
             6  79
             7  79
             8  75
             9  66
            10  86
            11  73
            12  84
            13  83
            14  68
            15  78
            16  93
            17  82
            18  75
            19  80
            20  87
            21  82
            22  66
            23  91
            24  79
            25  81
    
    # first, create the data frame...
    
    > studentId = 1:25
    > pts = c(86, 84, 81, 90, 85, 79, 79, 75, 66, 86, 73, 84, 83, 
              68, 78, 93, 82, 75, 80, 87, 82, 66, 91, 79, 81)
    > df = data.frame(studentId, pts)
    
    # then, use cut to create a factor for the grades
    
    > mini = min(pts)-1
    > maxi = max(pts)+1
    > grade = cut(pts,breaks=c(mini,quantile(df$pts,c(0.10,0.30,0.70,0.90)),maxi),
                  labels=c("F","D","C","B","A"))
    
    # finally, add the grades to the data frame
    
    > df$grade = grade
    > df
       studentId pts grade
    1          1  86     B
    2          2  84     C
    3          3  81     C
    4          4  90     A
    5          5  85     C
    6          6  79     D
    7          7  79     D
    8          8  75     D
    9          9  66     F
    10        10  86     B
    11        11  73     F
    12        12  84     C
    13        13  83     C
    14        14  68     F
    15        15  78     D
    16        16  93     A
    17        17  82     C
    18        18  75     D
    19        19  80     C
    20        20  87     B
    21        21  82     C
    22        22  66     F
    23        23  91     A
    24        24  79     D
    25        25  81     C