R and Excel Review A2

What is the output if the following is executed in R?

x = list(1:3,1:5,1:7)
sapply(x,median)

2 3 4

What is the output if the following is executed in R?

m=matrix(1:8, nrow=2)
m=m[,1:3]
apply(m,2,sum)

3  7 11

What is the output if the following is executed in R?

f=factor(c("A","A","A","B","B","C"))
x=1:6
tapply(x,f,length)

A B C
3 2 1

Describe what the function mystery() does. Your description should specify the roles of x, n, and a in the function in the context of a statistics course.

mystery=function(x,n,a){
   z=-qnorm(a/2)
   se=sqrt((x/n)*(1-x/n)/n)
   lb=x/n-z*se
   ub=x/n+z*se
   return(c(lb,ub))
}

The function returns a confidence interval for a proportion. The sample has x successes in n trials. The confidence level is 1-a.
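
As a quick check of this description, the sketch below calls mystery() on a hypothetical sample (40 successes in 100 trials with a = 0.05; these numbers are illustrative and not part of the original question). The function body is repeated only so the block runs on its own.

```r
# mystery() as defined above, repeated so this block is self-contained
mystery = function(x, n, a) {
   z = -qnorm(a/2)                    # critical value for confidence level 1-a
   se = sqrt((x/n)*(1 - x/n)/n)       # standard error of the sample proportion x/n
   lb = x/n - z*se                    # lower bound
   ub = x/n + z*se                    # upper bound
   return(c(lb, ub))
}

# Hypothetical sample: 40 successes in 100 trials, 95% confidence (a = 0.05)
ci = mystery(40, 100, 0.05)
print(round(ci, 3))   # roughly 0.304 to 0.496
```

The interval is centered at the sample proportion 0.4, and a smaller a (a higher confidence level) widens it.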

Describe what the function enigma() does. In particular, explain what the variables o, e, and a represent.

enigma = function(o,e,a) {
  x = sum((o-e)^2/e)
  p = 1-pchisq(x,df=length(o)-1)
  return(ifelse(any(e < 5),
                "don't ask me!",
                ifelse(p < a,
                       "seems different",
                       "doesn't seem different")))
}

The function runs a chi-square goodness-of-fit test at significance level a, where o is the vector of observed frequencies and e is the vector of expected frequencies. It returns "don't ask me!" if any expected frequency is below 5 (the test's assumptions are not met), "seems different" if the null hypothesis is rejected, and "doesn't seem different" if the null hypothesis is not rejected.
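
To see the three possible outcomes in action, here is a minimal sketch that calls enigma() twice. The first call uses the loaded-die frequencies that appear in a later question of this review; the second uses made-up counts (an assumption, chosen only so that the expected frequencies fall below 5).

```r
# enigma() as defined above, repeated so this block is self-contained
enigma = function(o, e, a) {
  x = sum((o - e)^2 / e)                   # chi-square test statistic
  p = 1 - pchisq(x, df = length(o) - 1)    # right-tail p-value
  return(ifelse(any(e < 5),
                "don't ask me!",
                ifelse(p < a,
                       "seems different",
                       "doesn't seem different")))
}

# Observed die frequencies and the expected counts under the claimed probabilities
observed = c(5, 6, 3, 8, 5, 21)
expected = c(.12, .12, .12, .12, .12, .4) * sum(observed)
res_die = enigma(observed, expected, 0.05)

# Made-up small sample: every expected count is below 5
res_small = enigma(c(1, 2, 3), c(2, 2, 2), 0.05)

print(res_die)     # "doesn't seem different"
print(res_small)   # "don't ask me!"
```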

A 6-sided die lands showing a six with probability 40% and shows the other 5 sides each with 12% probability. Write R code that produces a vector which simulates rolling this die 100 times.

sample(1:6, size=100, replace=TRUE, prob=c(0.12,0.12,0.12,0.12,0.12,0.4))

Write an R function, coins(n,p) that returns a sequence of "H" and "T" representing $n$ random flips of a coin that lands heads with probability p.

coins=function(n,p){
    sample(c("H","T"), n, replace=TRUE, prob=c(p,1-p))
}

Write an R function named spinner( ) that returns a sequence of "R", "G", or "B" representing spins of a spinner whose arrow lands on red 60% of the time, green 30% of the time, and blue 10% of the time. The sequence should continue until the arrow lands on blue.

spinner=function( ){
   sequence=c()    
   while (TRUE){
      sequence = c(sequence, sample(c("R","G","B"), 1, prob=c(0.6,0.3,0.1)))
      if (sequence[length(sequence)]=="B") 
         return(sequence)
   }
}

Write an R function spades(n) that simulates drawing n cards from a standard deck and returns a table showing the number of each suit drawn.

spades=function(n){
   deck=rep(c("S","H","D","C"), 13)
   hand=sample(deck, n, replace=FALSE)
   return(table(hand))
}

Consider the following game:

You have a coin that lands heads 60% of the time and a fair six-sided die. The coin is flipped.

If the coin lands heads, then the coin is flipped again and you win $2 if it lands heads again, nothing if it lands tails.

Otherwise (the first flip was tails), the die is rolled and you win the number showing on the die.

Write a function game() in R that simulates playing this game one time and returns your winnings.
Write a function approximated.game.winnings(n) that approximates the expected winnings by simulating the game $n$ times and finding the mean winnings.

game=function(){
   if (runif(1) < 0.6) {
      return(ifelse(runif(1) < 0.6, 2,0))
   } else return(sample(1:6,1))
}

approximated.game.winnings=function(n){
   sum(replicate(n, game()))/n
}
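
As a sanity check on the simulation, the expected winnings can also be worked out exactly: the player wins $2 with probability 0.6 × 0.6, and otherwise, with probability 0.4, wins the die value, whose mean is 3.5. The sketch below compares that exact value with a simulated one. The seed and the number of simulated games are arbitrary choices, not part of the original answer.

```r
# game() and approximated.game.winnings() as defined above, repeated for a self-contained block
game = function() {
   if (runif(1) < 0.6) {
      return(ifelse(runif(1) < 0.6, 2, 0))
   } else return(sample(1:6, 1))
}

approximated.game.winnings = function(n) {
   sum(replicate(n, game())) / n
}

# Exact expected winnings: P(HH)*2 + P(HT)*0 + P(T)*E(die)
exact = 0.6*0.6*2 + 0.6*0.4*0 + 0.4*mean(1:6)

set.seed(1)                                   # arbitrary seed for reproducibility
approx = approximated.game.winnings(100000)   # arbitrary number of simulated games

print(exact)    # 2.12
print(approx)   # should land close to the exact value
```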

Suppose that grades is a vector of (numerical) grades earned on a recent class test.

Define a corresponding factor position that is one of: "low", "middle", or "high" depending on whether a grade is more than one standard deviation below the mean, within one standard deviation of the mean, or more than one standard deviation above the mean, respectively.
Construct a data frame with components grades and position.

breaks=c(min(grades)-1, mean(grades)-sd(grades), mean(grades)+sd(grades), max(grades)+1)
position=cut(grades, breaks=breaks, labels=c("low","middle","high"))

df=data.frame(grades, position)
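
The sketch below runs the same steps on a small hypothetical grades vector (the values are invented for illustration) and tabulates how many grades fall in each position.

```r
# Hypothetical test grades, invented for this example
grades = c(55, 62, 70, 71, 75, 78, 80, 84, 90, 98)

# Cut points: one standard deviation below and above the mean,
# padded on both ends so the minimum and maximum grades are included
breaks = c(min(grades) - 1, mean(grades) - sd(grades),
           mean(grades) + sd(grades), max(grades) + 1)
position = cut(grades, breaks = breaks, labels = c("low", "middle", "high"))

df = data.frame(grades, position)
counts = table(df$position)   # number of grades in each category
print(counts)
```

Because cut() uses intervals that are open on the left and closed on the right by default, padding the lowest break with min(grades)-1 keeps the smallest grade from being turned into NA.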

Write both R code and an Excel formula that finds each of the following:

a) the $p$-value for a 2-tailed t-test with test statistic 2.34 and 10 degrees of freedom.
b) the critical values for a 2-tailed t-test with 10 degrees of freedom and significance level 0.05.
c) the $p$-value for a Kruskal-Wallis test of 5 samples with test statistic 3.7.
d) the critical value for a Kruskal-Wallis test of 5 samples with significance level 0.05.
e) the $p$-value for a 2-tailed variance test with $dfN=10$, $dfD=12$, and test statistic 2.5.
f) the critical value for an ANOVA test with $dfN=3$, $dfD=24$, and significance level 0.05.
g) the $p$-value for a 1-tailed z-test with test statistic -1.56.
h) the critical value for a left-tailed z-test with significance level 0.01.

a) R:      (1-pt(2.34,10))*2
   Excel:  (1-T.DIST(2.34,10,TRUE))*2

b) R:      qt(0.025,10), qt(0.975,10)
   Excel:  T.INV(0.025,10), T.INV(0.975,10)

c) R:      1-pchisq(3.7, 4)
   Excel:  1-CHISQ.DIST(3.7, 4, TRUE)

d) R:      qchisq(0.95,4)
   Excel:  CHISQ.INV(0.95,4)

e) R:      (1-pf(2.5,10,12))*2
   Excel:  (1-F.DIST(2.5,10,12,TRUE))*2

f) R:      qf(0.95,3,24)
   Excel:  F.INV(0.95,3,24)

g) R:      pnorm(-1.56)
   Excel:  NORM.DIST(-1.56,0,1,TRUE)

h) R:      qnorm(0.01)
   Excel:  NORM.INV(0.01,0,1)
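
The R answers above all follow one pattern: a p-function gives the area to the left of a test statistic, and the matching q-function inverts it. The sketch below illustrates this for a few of the parts. The lower.tail argument is not used in the answers above and is shown only as an equivalent alternative.

```r
# Part (a): three equivalent ways to get the 2-tailed p-value
p1 = (1 - pt(2.34, 10)) * 2                 # right-tail area, doubled
p2 = 2 * pt(-2.34, 10)                      # left-tail area at -2.34, by symmetry
p3 = 2 * pt(2.34, 10, lower.tail = FALSE)   # ask for the right tail directly

# Part (b): qt() inverts pt(), so the area left of the upper critical value is 0.975
crit = qt(0.975, 10)
area = pt(crit, 10)

# Part (c): the right-tail chi-square area, two equivalent ways
p_kw1 = 1 - pchisq(3.7, 4)
p_kw2 = pchisq(3.7, 4, lower.tail = FALSE)

print(c(p1, p2, p3))
print(c(crit, area))
print(c(p_kw1, p_kw2))
```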

A farmer is considering increasing the amount of time the lights in his hen house are on. Ten hens were selected, and the number of eggs each produced over a week was recorded under normal and increased lighting conditions.

The data is shown below as two vectors in R (where the $i^{th}$ element in each corresponds to the eggs produced by hen $i$).

normal_light=c(4,3,8,7,6,4,9,7,6,5)
increased_light=c(6,5,9,7,4,5,10,6,9,6)

At $\alpha=.05$, can it be concluded that the increased light time changed egg production? Write code in R that would test this claim.

This is paired data, as the $i^{th}$ element in each vector corresponds to the same hen.

We can proceed in either of the two ways shown below:

differences=normal_light - increased_light
qqnorm(differences)
t.test(differences, mu=0)

or (more simply)...

t.test(normal_light, increased_light, mu=0, paired=TRUE)
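
To confirm that the two approaches agree, the sketch below runs both and compares the test statistics and p-values. The names res_diff and res_paired are introduced here for illustration only.

```r
normal_light = c(4, 3, 8, 7, 6, 4, 9, 7, 6, 5)
increased_light = c(6, 5, 9, 7, 4, 5, 10, 6, 9, 6)

# Approach 1: one-sample t-test on the differences
res_diff = t.test(normal_light - increased_light, mu = 0)

# Approach 2: paired t-test on the two vectors
res_paired = t.test(normal_light, increased_light, mu = 0, paired = TRUE)

# Both give the same test statistic and p-value
print(c(res_diff$statistic, res_paired$statistic))
print(c(res_diff$p.value, res_paired$p.value))

# Reject the null hypothesis of no change only if the p-value is below 0.05
reject = res_paired$p.value < 0.05
print(reject)
```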

To determine whether a significant difference exists in the lengths of fish from two hatcheries, 11 fish were randomly selected from hatchery A, and 10 fish were randomly selected from hatchery B. Their lengths, in centimeters, are given. You wish to do the following things:

Check the normality of each data set with a Q-Q plot.
Use a variance test to select an appropriate method.
Then, test the claim that there is no difference in the fish lengths for the two hatcheries with $\alpha=.05$.

However, exactly how you proceed depends on the way in which you are given the data...

Write R code to accomplish the tasks above if given the data in vector form as shown below:

hatchery_A=c(12.4,12.7,12.9,13.3,14.2,14.3,14.3,14.8,14.8,15.3,15.3)
hatchery_B=c(10.7,12.2,12.8,13.9,14.1,14.3,14.6,15.6,16.8,18.1)

Write R code to accomplish the tasks given above if given the data as a data frame fish with 2 components. The component in this data frame named "length" gives the length of the fish and the component "hatchery" is a factor that gives the hatchery where it was raised (i.e. A or B), as shown below.

   length hatchery
1    12.4        A
2    12.7        A
3    12.9        A
4    13.3        A
5    14.2        A
6    14.3        A
7    14.3        A
8    14.8        A
9    14.8        A
10   15.3        A
11   15.3        A
12   10.7        B
13   12.2        B
14   12.8        B
15   13.9        B
16   14.1        B
17   14.3        B
18   14.6        B
19   15.6        B
20   16.8        B
21   18.1        B

Construct the data frame fish described in part (b) from the vectors given in part (a).

qqnorm(hatchery_A)
qqnorm(hatchery_B)
var.test(hatchery_A,hatchery_B)
wilcox.test(hatchery_A,hatchery_B)

fish.list=split(fish$length, fish$hatchery)
qqnorm(fish.list$A)
qqnorm(fish.list$B)
var.test(fish.list$A,fish.list$B)
wilcox.test(fish.list$A,fish.list$B)

# Note: the following is an alternate way to do the hypothesis test:
var.test(fish$length ~ fish$hatchery)
wilcox.test(fish$length ~ fish$hatchery)

hatchery=factor(rep(c("A","B"),c(length(hatchery_A),length(hatchery_B))))
length=c(hatchery_A,hatchery_B)
fish=data.frame(length,hatchery)

A researcher wishes to try three different techniques to lower the blood pressure of individuals diagnosed with high blood pressure. The subjects are randomly assigned to three groups; the first group takes medication, the second group exercises, and the third group follows a special diet. After four weeks, the reduction in each person's blood pressure is recorded. Assume that the data in each group is approximately normal.

The data is shown below. Use R to decide if there is a significant difference between the techniques used to lower blood pressure, at $\alpha=0.05$. (Don't forget to check the assumptions of the appropriate hypothesis test!) If you find a significant difference, use R to perform pairwise tests to determine where the difference lies.

medication=c(9,10,12,13,15)
exercise=c(0,2,3,6,8)
diet=c(4,5,8,9,12)

var.test(medication,exercise)
var.test(medication,diet)
var.test(exercise,diet)
group=factor(rep(c("medication","exercise","diet"),c(5,5,5)))
pressure=c(medication,exercise,diet)
summary(aov(pressure ~ group))
library(DescTools)
ScheffeTest(aov(pressure ~ group))

Random samples of 3 brands of chocolate chip cookies are obtained and the number of chips in each cookie is recorded. Assume the distributions are approximately normal.

A=c(12,13,13,14,14,15,17)
B=c(10,12,14,15,18,20,21)
C=c(9,10,10,11,13,14,14)

What R code will determine if the variance of Brand A is significantly different from the variance of Brand B at $\alpha=.05$?
What R code will test the claim that the number of chocolate chips differs among the 3 brands at $\alpha=.05$? Choose an appropriate test based on the fact that variances are significantly different.

var.test(A,B)

kruskal.test(list(A,B,C))

A study was conducted to determine whether there is a relationship between strength and speed. A sample of 20-year-old males was selected. Each was asked to do push-ups and to run a specific course. The number of push-ups and the time it took to run the course (in seconds) are given below.

pushups=c(5,8,10,10,11,15,18,23)
time=c(61,65,45,56,62,48,49,50)

What R code will produce a scatter plot for this data and draw the regression line?
Use R to determine whether there is a significant relationship between the number of push-ups and the course time at 0.05 significance level. Assume the assumptions for the parametric test are met.
Use R to predict the course time of a 20-year-old male who can do 18 push-ups.

plot(pushups,time)
abline(lm(time ~ pushups))

cor.test(pushups,time)

# no significant correlation (p-value > 0.05), so the best prediction is the mean time
mean(time)
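
The decision behind that last answer can be written out explicitly. The sketch below is one possible way to code it, not the only one: it uses the regression line for the prediction only when the correlation is significant, and falls back to the mean course time otherwise. The names fit, result, and prediction are introduced here for illustration.

```r
pushups = c(5, 8, 10, 10, 11, 15, 18, 23)
time = c(61, 65, 45, 56, 62, 48, 49, 50)

fit = lm(time ~ pushups)           # least-squares regression line
result = cor.test(pushups, time)   # test for a significant linear correlation

if (result$p.value < 0.05) {
   # significant relationship: predict from the regression line
   prediction = predict(fit, newdata = data.frame(pushups = 18))
} else {
   # no significant relationship: the mean is the best available prediction
   prediction = mean(time)
}
print(prediction)   # 54.5 for this data, since the correlation is not significant
```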

A consumer group compared ratings of toaster ovens (on a scale of 1 to 10) to price for a random sample of 7 ovens. Use R to determine, at significance level 0.05, if there is a correlation between ratings and prices.

rating=c(3,5,5,5,7,9,9)
price=c(25,49,30,59,55,35,70)

cor.test(rating, price, method="spearman")

The times (in minutes) it took ten white mice and ten brown mice to navigate a maze are stored in vectors named white and brown in R, as shown.

white = c(13,15,17,18,19,21,22,24,25,28)
brown = c(10,14,15,16,17,17,19,20,23,26)

Use R to produce a Q-Q plot to check the normality of each sample.
Use R to test the claim that $\sigma_{white}^2 = \sigma_{brown}^2$.
Use R to test the claim that there is no difference in the navigation times for the two kinds of mice.

qqnorm(white)
qqnorm(brown)

var.test(white,brown)

t.test(white,brown,var.equal = TRUE)

The manufacturer of a 6-sided die claims that the die lands showing a six with probability 40% and shows the other 5 sides each with 12% probability.

An experiment using one of these dice yields the observed frequencies given below. Write R code that tests the manufacturer's claim.
$$\begin{array}{l|cccccc} \hbox{Number showing}&1&2&3&4&5&6\cr \hbox{Frequency}&5&6&3&8&5&21 \end{array}$$
The data (values from 1 to 6) from an experiment using one of these dice are stored in the variable rolls. Write R code that tests the manufacturer's claim.

observed=c(5,6,3,8,5,21)
E=c(.12,.12,.12,.12,.12,.4)*sum(observed)
all(E>=5)
chisq.test(observed,p=c(.12,.12,.12,.12,.12,.4))

chisq.test(table(rolls),p=c(.12,.12,.12,.12,.12,.4))

A random sample of times for a swimming event is given below. Use R to find a 90% confidence interval for the mean time for this event.

times=c(154.61,158.03,164.22,165.19,165.64,168.62,170.08,173.17,174.48,
        175.62,175.82,176.47,176.58,177.68,180.33,183.63,185.71,186.49)

t.test(times,conf.level = 0.90)$conf.int
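
For comparison, the same interval can be built by hand from the formula $\bar{x} \pm t_{\alpha/2} \cdot s/\sqrt{n}$ with $n-1$ degrees of freedom. The sketch below does this and checks it against t.test(). The names n, margin, manual, and builtin are introduced here for illustration.

```r
times = c(154.61, 158.03, 164.22, 165.19, 165.64, 168.62, 170.08, 173.17, 174.48,
          175.62, 175.82, 176.47, 176.58, 177.68, 180.33, 183.63, 185.71, 186.49)

n = length(times)
# A 90% confidence level leaves 5% in each tail, so use the 0.95 quantile
margin = qt(0.95, df = n - 1) * sd(times) / sqrt(n)
manual = c(mean(times) - margin, mean(times) + margin)

builtin = t.test(times, conf.level = 0.90)$conf.int

print(manual)
print(as.numeric(builtin))   # matches the manual interval
```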

A sample of coin flips is collected from three different coins.

The results are below. Use R to conduct one hypothesis test to test the claim that all three coins have the same probability of landing heads.
$$\begin{array}{cccc} &\textrm{Coin A}&\textrm{Coin B}&\textrm{Coin C}\\ \textrm{Heads}&88&93&110 \\ \textrm{Tails} &112&107&90 \end{array}$$
Suppose the results from the sample are stored in a data frame flips with 2 factor components. The component "coin" gives the coin that was used (A, B, or C) and the component "side" gives the result of the flip (H or T).

Use R to conduct one hypothesis test to test the claim that all three coins have the same probability of landing heads.

coin.flips=matrix(c(88,93,110,112,107,90),nrow=2,byrow=TRUE)
chisq.test(coin.flips)$expected   #check the assumptions
chisq.test(coin.flips)

chisq.test(table(flips$coin,flips$side))$expected  #check assumptions
chisq.test(table(flips$coin,flips$side))
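
The two answers above should agree when flips holds the same results as the table of counts. The sketch below checks this by building a flips data frame from the counts (the construction is an assumption made for illustration, since the original flips data is not given) and comparing the two tests.

```r
# Counts from the table: row 1 = heads, row 2 = tails; columns = coins A, B, C
coin.flips = matrix(c(88, 93, 110, 112, 107, 90), nrow = 2, byrow = TRUE)

# Hypothetical reconstruction of flips: one row per flip, heads first, then tails
flips = data.frame(
   coin = factor(rep(c("A", "B", "C", "A", "B", "C"), c(88, 93, 110, 112, 107, 90))),
   side = factor(rep(c("H", "T"), c(88 + 93 + 110, 112 + 107 + 90)))
)

from_matrix = chisq.test(coin.flips)
from_frame = chisq.test(table(flips$coin, flips$side))

# The table built from flips is the transpose of coin.flips,
# which does not change the test statistic or the p-value
print(c(from_matrix$statistic, from_frame$statistic))
print(c(from_matrix$p.value, from_frame$p.value))
```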

In a 2004 survey of undergraduate students throughout the United States, 89% of the respondents said they owned a cell phone. Recently, in a survey of 1200 randomly selected undergraduate students across the United States, it was found that 1098 of such students own a cell phone.

Use R to decide if the proportion of undergraduate students who own a cell phone is now significantly different from 89%.

x=1098
n=1200
p=.89
n*p>=5
n*(1-p)>=5
binom.test(x,n,p)

A study investigated survival rates for in-hospital patients who suffered cardiac arrest. Among 58,593 patients who had cardiac arrest during the day, 11,604 survived and were discharged. Among 28,593 patients who suffered cardiac arrest at night, 4139 survived and were discharged. Use R to test, at a 0.01 significance level, the claim that the survival rates are the same for day and night.

all(c(11604,4139,58593-11604,28593-4139)>=5)  #check assumptions
prop.test(c(11604,4139),c(58593,28593),conf.level=.99)