Tech Tips: Correlation

To find the correlation coefficient $r$ and test whether the correlation is significant ...

• R: One can find the correlation coefficient $r$, as defined below, in R by using the cor() function.

$$r=\frac{\sum z_x z_y}{n-1} = \frac{s_{xy}}{s_x s_y}$$

As an example, suppose we are interested in measuring the correlation between the price of pizza and the price of a subway ticket in New York. We compile our observations into two vectors, as shown below:

> pizza = c(0.15,0.35,1.00,1.25,1.75,2.00)
> subway = c(0.15,0.35,1.00,1.35,1.50,2.00)
> r = cor(pizza,subway)
> r
[1] 0.9878109
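
As a check on the definition of $r$ given earlier, the same value can be rebuilt by hand from z-scores and from the sample covariance. This is only a sketch; the names z_x, z_y, r_from_z, and r_from_cov are illustrative and not part of the original example.

```r
pizza = c(0.15, 0.35, 1.00, 1.25, 1.75, 2.00)
subway = c(0.15, 0.35, 1.00, 1.35, 1.50, 2.00)
n = length(pizza)

# Standardize each variable using the sample standard deviation
z_x = (pizza - mean(pizza)) / sd(pizza)
z_y = (subway - mean(subway)) / sd(subway)

# First form of the definition: sum of z-score products over n - 1
r_from_z = sum(z_x * z_y) / (n - 1)

# Second form: sample covariance over the product of standard deviations
r_from_cov = cov(pizza, subway) / (sd(pizza) * sd(subway))

# Both should agree with cor() up to floating-point error
all.equal(r_from_z, cor(pizza, subway))
all.equal(r_from_cov, cor(pizza, subway))
```

Both comparisons should return TRUE, since the two forms of the formula are algebraically identical.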


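Now, to see whether the correlation coefficient is significantly different from zero, we compute the appropriate test statistic, which follows a $t$-distribution with $n-2$ degrees of freedom when the true correlation is zero (after checking the appropriate assumptions, of course):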

> t = r*sqrt((length(pizza)-2)/(1-r^2))
> t
[1] 12.69203
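
Written out with the values above, the arithmetic behind this statistic looks like the following (a worked check, using the $n = 6$ observations and the rounded value of $r$ printed earlier):

$$t = r\sqrt{\frac{n-2}{1-r^2}} = 0.9878109\sqrt{\frac{6-2}{1-0.9878109^2}} \approx 12.69$$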


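Finally, we can compute a two-sided $p$-value for the test by using the pt() function, which gives cumulative probabilities for a $t$-distribution. Since $t$ is positive here, we double the area in the upper tail.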

> p.value = 2*(1-pt(t,length(pizza)-2))
> p.value
[1] 0.0002219544
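
The calculation above relies on the test statistic being positive. A slightly more general sketch, which also works when the correlation is negative, doubles the area below -|t| instead. The names t_stat, p_upper, p_general, and p_wrong are illustrative only.

```r
t_stat = 12.69203   # test statistic from the pizza and subway example
df = 4              # degrees of freedom, n - 2

# Form used above: double the upper-tail area (valid only for positive t)
p_upper = 2 * (1 - pt(t_stat, df))

# Sign-independent form: double the area in the tail beyond |t|
p_general = 2 * pt(-abs(t_stat), df)

# The two agree when the statistic is positive
all.equal(p_upper, p_general)

# With a negative statistic the first form breaks down:
# it returns a value close to 2, which is not a valid p-value
p_wrong = 2 * (1 - pt(-t_stat, df))
p_wrong > 1
```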


Seeing a $p$-value substantially smaller than $\alpha = 0.05$, we reject the null hypothesis of no correlation. The correlation between the price of a slice of pizza and the subway fare in New York is highly significant.

Of course, R provides a quicker way to carry out this parametric test of the significance of a correlation, using the cor.test() function:

> pizza = c(0.15,0.35,1.00,1.25,1.75,2.00)
> subway = c(0.15,0.35,1.00,1.35,1.50,2.00)
> cor.test(pizza,subway)

Pearson's product-moment correlation

data:  pizza and subway
t = 12.692, df = 4, p-value = 0.000222
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8886647 0.9987251
sample estimates:
cor
0.9878109
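
When the individual numbers are needed for later calculations, they can be pulled out of the object that cor.test() returns rather than read off the printed summary. A minimal sketch, where the variable name result is an arbitrary choice:

```r
pizza = c(0.15, 0.35, 1.00, 1.25, 1.75, 2.00)
subway = c(0.15, 0.35, 1.00, 1.35, 1.50, 2.00)

# Store the test result instead of just printing it
result = cor.test(pizza, subway)

result$estimate    # sample correlation coefficient r
result$statistic   # t test statistic
result$parameter   # degrees of freedom (n - 2)
result$p.value     # two-sided p-value
result$conf.int    # 95 percent confidence interval for the correlation
```

These components should match the values computed step by step earlier.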


The same function can also perform the non-parametric Spearman's rank correlation test, which is useful when the assumptions of Pearson's correlation test are not met.

(In Spearman's test, the $x$- and $y$-values are first ranked separately, and then Pearson's correlation coefficient is computed for the ranks. This coefficient is then used as a test statistic. We'll discuss this test in greater detail later.)

As an example, suppose a consumer group compares the ratings of toaster ovens with their prices for a random sample of ovens, shown below.

$$\begin{array}{r|ccccccc} \hbox{Model} &A&B&C&D&E&F&G\\\hline \hbox{Rating (1-10)} & 3 & 4 & 6 & 5 & 7 & 10 & 9\\ \hbox{Price (\$)}&25&49&30&59&55&35&70\\ \end{array}$$

We want to know whether ratings and prices are correlated. Since the ratings are ordinal, and Pearson's correlation test assumes interval- or ratio-level data, we use Spearman's rank correlation test instead:

> rating = c(3,4,6,5,7,10,9)
> price = c(25,49,30,59,55,35,70)
> cor.test(rating,price,method="spearman")

Spearman's rank correlation rho

data:  rating and price
S = 34, p-value = 0.3956
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
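0.3928571

With a $p$-value of 0.3956, well above $\alpha = 0.05$, we fail to reject the null hypothesis. This sample does not provide significant evidence of a correlation between toaster oven ratings and prices.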
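
The description of Spearman's test given earlier (rank each variable, then compute Pearson's coefficient on the ranks) can be checked by hand on the toaster oven data. The sketch below also uses the shortcut formula $1 - 6\sum d^2 / (n(n^2-1))$, which applies when there are no ties, as is the case here. The names rank_rating, rank_price, S, rho_from_ranks, and rho_from_formula are illustrative.

```r
rating = c(3, 4, 6, 5, 7, 10, 9)
price = c(25, 49, 30, 59, 55, 35, 70)
n = length(rating)

# Rank each variable separately
rank_rating = rank(rating)
rank_price = rank(price)

# Pearson's correlation computed on the ranks gives Spearman's rho
rho_from_ranks = cor(rank_rating, rank_price)

# Sum of squared rank differences, the S reported by cor.test()
S = sum((rank_rating - rank_price)^2)   # 34

# Shortcut formula, valid when there are no tied values
rho_from_formula = 1 - 6 * S / (n * (n^2 - 1))

# Both should agree with the built-in Spearman option of cor()
all.equal(rho_from_ranks, cor(rating, price, method = "spearman"))
all.equal(rho_from_formula, rho_from_ranks)
```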