
The Line of Best Fit

The line of best fit is a linear function used to predict the $y$-value for a given $x$-value after a significant correlation has been found between the two random variables they represent. We denote the prediction for a given $x_i$ by $\hat{y}_i$. Given that the relationship between any $\hat{y}$ and $x$ is a linear one, there must be constants $m$ and $b$ such that:

$$\hat{y} = mx + b$$

Because this is a line of "best fit", the constants $m$ and $b$ are chosen to minimize the sum of the squared errors:

$$E = \sum_i (\hat{y}_i - y_i)^2$$
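To make the quantity being minimized concrete, here is a minimal Python sketch that evaluates the sum of squared errors for a candidate line. The function name `sum_squared_errors` and the five data points are made up for illustration; they are not part of the original derivation.

```python
def sum_squared_errors(xs, ys, m, b):
    """Return E, the sum over all points of (m*x_i + b - y_i)^2."""
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys))


# Illustrative data (hypothetical, chosen only for this example).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Two candidate lines: the smaller E is, the better the line fits.
print(sum_squared_errors(xs, ys, 1.0, 1.0))
print(sum_squared_errors(xs, ys, 0.6, 2.2))
```

Comparing the two printed values shows how different choices of the slope and intercept change the total error.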

To find these constants $m$ and $b$, we note the following:

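$$
\begin{aligned}
E &= \sum_i (\hat{y}_i - y_i)^2 \\
  &= \sum_i (m x_i + b - y_i)^2 \\
  &= \sum_i \left( m^2 x_i^2 + 2bm x_i + b^2 - 2m x_i y_i - 2b y_i + y_i^2 \right) \\
  &= m^2 \sum_i x_i^2 + 2bm \sum_i x_i + n b^2 - 2m \sum_i x_i y_i - 2b \sum_i y_i + \sum_i y_i^2
\end{aligned}
$$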

That might look intimidating, but remember that all of the sigmas are just constants, formed by adding up various combinations of the $x$- and $y$-coordinates of the original points. In fact, collecting like terms reveals that $E$ is just a parabola with respect to $m$ (holding $b$ fixed) or with respect to $b$ (holding $m$ fixed):

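$$
\begin{aligned}
E(m) &= \left( \sum_i x_i^2 \right) m^2 + \left( 2b \sum_i x_i - 2 \sum_i x_i y_i \right) m + \left( n b^2 - 2b \sum_i y_i + \sum_i y_i^2 \right) \\
E(b) &= n b^2 + \left( 2m \sum_i x_i - 2 \sum_i y_i \right) b + \left( m^2 \sum_i x_i^2 - 2m \sum_i x_i y_i + \sum_i y_i^2 \right)
\end{aligned}
$$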

Further, both of these parabolas open upward since the coefficients on the $m^2$ and $b^2$ terms are both positive (the sum of the $x_i^2$ must be positive unless all of the $x$-coordinates are 0, and of course $n$, the number of points, is positive).

Since the parabolas open upwards, each one has a minimum at its vertex. Recalling that the vertex of $y = ax^2 + bx + c$ occurs at $x = -b/(2a)$, we have a vertex at:

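$$
\begin{aligned}
m &= \frac{-2b \sum_i x_i + 2 \sum_i x_i y_i}{2 \sum_i x_i^2} = \frac{\sum_i x_i y_i - b \sum_i x_i}{\sum_i x_i^2} \\
b &= \frac{-2m \sum_i x_i + 2 \sum_i y_i}{2n} = \frac{\sum_i y_i - m \sum_i x_i}{n}
\end{aligned}
$$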
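The two vertex conditions can be written as small functions, which makes it easier to see that each one fixes one constant in terms of the other. This is only a sketch: the names `vertex_m` and `vertex_b` and the data are hypothetical, and the pair $(0.6, 2.2)$ used at the end is simply a pair that happens to satisfy both conditions for this data.

```python
def vertex_m(xs, ys, b):
    """Value of m at the vertex of E(m), treating b as fixed."""
    sum_x = sum(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return (sum_xy - b * sum_x) / sum_xx


def vertex_b(xs, ys, m):
    """Value of b at the vertex of E(b), treating m as fixed."""
    return (sum(ys) - m * sum(xs)) / len(xs)


# Illustrative data (hypothetical).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Starting from b = 0, the best m leads to a best b that is no longer 0,
# so (m0, 0) does not satisfy both conditions simultaneously.
m0 = vertex_m(xs, ys, 0.0)
print(m0, vertex_b(xs, ys, m0))

# The pair (0.6, 2.2) satisfies both conditions at once, up to rounding.
print(vertex_m(xs, ys, 2.2), vertex_b(xs, ys, 0.6))
```

The line of best fit is the one pair of constants for which both functions agree with each other.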

Now we have two linear equations in $m$ and $b$. Substituting one into the other (say, the second into the first) solves this system of equations, and the solution is revealed:

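$$
m = \frac{n \sum_i x_i y_i - \left( \sum_i x_i \right)\left( \sum_i y_i \right)}{n \sum_i x_i^2 - \left( \sum_i x_i \right)^2}
\quad \text{and} \quad
b = \frac{\left( \sum_i x_i^2 \right)\left( \sum_i y_i \right) - \left( \sum_i x_i \right)\left( \sum_i x_i y_i \right)}{n \sum_i x_i^2 - \left( \sum_i x_i \right)^2}
$$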
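For readers who want the intermediate step, here is a sketch of the substitution for $m$, placing the vertex expression for $b$ into the vertex expression for $m$ after multiplying the latter through by $\sum_i x_i^2$:

$$
\begin{aligned}
m \sum_i x_i^2 &= \sum_i x_i y_i - \left( \frac{\sum_i y_i - m \sum_i x_i}{n} \right) \sum_i x_i \\
n m \sum_i x_i^2 &= n \sum_i x_i y_i - \left( \sum_i x_i \right)\left( \sum_i y_i \right) + m \left( \sum_i x_i \right)^2 \\
m \left[ n \sum_i x_i^2 - \left( \sum_i x_i \right)^2 \right] &= n \sum_i x_i y_i - \left( \sum_i x_i \right)\left( \sum_i y_i \right)
\end{aligned}
$$

Dividing both sides by the bracketed factor gives the formula for $m$. That factor is nonzero as long as the $x$-coordinates are not all equal.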

The formula for $m$ is bad enough, but the formula for $b$ is a monstrosity. Fortunately, there is no need to use (or even derive) this expression for $b$: our earlier, far simpler expression for $b$ had $m$ as its only unknown, and now $m$ is known! Consequently:

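$$
m = \frac{n \sum_i x_i y_i - \left( \sum_i x_i \right)\left( \sum_i y_i \right)}{n \sum_i x_i^2 - \left( \sum_i x_i \right)^2}
\quad \text{and} \quad
b = \frac{\sum_i y_i - m \sum_i x_i}{n}
$$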
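These two formulas translate almost directly into code. The following Python sketch computes the slope first and then reuses it for the intercept, as suggested above; the function name `fit_line` and the sample data are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least-squares line through the given points."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)

    denominator = n * sum_xx - sum_x ** 2
    if denominator == 0:
        # Happens when every x-coordinate is the same (or there are no
        # points), in which case the slope is undefined.
        raise ValueError("x-coordinates must not all be equal")

    m = (n * sum_xy - sum_x * sum_y) / denominator
    b = (sum_y - m * sum_x) / n  # the simpler formula, now that m is known
    return m, b


# Illustrative data (hypothetical).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, b = fit_line(xs, ys)
print(f"y-hat = {m:.3f} x + {b:.3f}")
```

For this data the sums are $\sum x_i = 15$, $\sum y_i = 20$, $\sum x_i y_i = 66$ and $\sum x_i^2 = 55$, so $m = 30/50 = 0.6$ and $b = (20 - 9)/5 = 2.2$.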

With a little more algebra, we can express $m$ and $b$ in the following way (see if you can prove it):

$$
m = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}
\quad \text{and} \quad
b = \bar{y} - m\bar{x}
$$

where $\bar{x}$ and $\bar{y}$ are the averages of all the $x$-coordinates and all the $y$-coordinates, respectively.
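A numerical check is not a proof, but it can build confidence that the mean-centered form agrees with the form written in terms of raw sums. The sketch below implements both under hypothetical names (`fit_line_sums`, `fit_line_centered`) and compares them on made-up data.

```python
import math


def fit_line_sums(xs, ys):
    """Slope and intercept from the raw-sum formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    b = (sum_y - m * sum_x) / n
    return m, b


def fit_line_centered(xs, ys):
    """Slope and intercept from the mean-centered formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    m = numerator / denominator
    b = y_bar - m * x_bar
    return m, b


# Illustrative data (hypothetical).
xs = [1.5, 2.0, 3.7, 4.1, 6.3]
ys = [0.9, 2.2, 2.8, 4.4, 5.0]

m1, b1 = fit_line_sums(xs, ys)
m2, b2 = fit_line_centered(xs, ys)
print(math.isclose(m1, m2), math.isclose(b1, b2))
```

The two forms should agree up to floating-point rounding, which is why the comparison uses `math.isclose` rather than `==`.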

Finally, recalling the formula for the covariance $s_{xy}$, and writing $s_x^2$ for the variance of the $x$-coordinates, we can also write this as:

$$
m = \frac{s_{xy}}{s_x^2}
\quad \text{and} \quad
b = \bar{y} - m\bar{x}
$$
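As a final sketch, the covariance form can be computed with Python's standard `statistics` module. This assumes $s_{xy}$ and $s_x^2$ are the sample covariance and sample variance (both dividing by $n - 1$); any common divisor cancels in the ratio, so the slope is the same either way. The helper names `sample_covariance` and `fit_line_cov` are hypothetical.

```python
import statistics


def sample_covariance(xs, ys):
    """Sample covariance s_xy, dividing by n - 1 (an assumed convention)."""
    n = len(xs)
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def fit_line_cov(xs, ys):
    """Slope as covariance over variance; intercept from the two means."""
    m = sample_covariance(xs, ys) / statistics.variance(xs)
    b = statistics.mean(ys) - m * statistics.mean(xs)
    return m, b


# Illustrative data (hypothetical).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, b = fit_line_cov(xs, ys)
print(f"y-hat = {m:.3f} x + {b:.3f}")
```

Here $s_{xy} = 6/4 = 1.5$ and $s_x^2 = 10/4 = 2.5$, so $m = 0.6$ and $b = 4 - 0.6 \cdot 3 = 2.2$.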