The Line of Best Fit

The line of best fit is a linear function used to predict the $y$-value for a given $x$-value after a significant correlation has been found between the two random variables they represent. We denote the prediction for a given $x_i$ by $\widehat{y}_i$. Given that the relationship between any $\widehat{y}$ and $x$ is a linear one, there must be constants $m$ and $b$ such that:

$$\widehat{y} = mx + b$$

This being a line of "best fit", the particular values of these constants $m$ and $b$ are such that they minimize the sum of the squared errors:

$$E = \sum_i (\widehat{y}_i - y_i)^2$$
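
As a quick illustration, here is a minimal Python sketch that evaluates this error for a candidate line; the data points and the candidate $m$ and $b$ are made up purely for the example.

```python
# A minimal sketch of the squared-error sum E, using made-up data
# and a made-up candidate line.
def sse(m, b, xs, ys):
    """Sum of squared errors for the candidate line y-hat = m*x + b."""
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys))

# Hypothetical data points, purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 2.9, 4.2, 4.8]

print(sse(1.0, 1.0, xs, ys))  # E for the candidate line y-hat = x + 1
```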

To find these constants $m$ and $b$, we note the following:

$$\begin{array}{rcl} E &=& \displaystyle{\sum_i (\widehat{y}_i - y_i)^2}\\\\ &=& \displaystyle{\sum_i (m x_i + b - y_i)^2}\\\\ &=& \displaystyle{\sum_i (m^2 x_i^2 + 2bm x_i + b^2 - 2m x_i y_i - 2b y_i + y_i^2)}\\\\ &=& \displaystyle{m^2 \sum_i x_i^2 + 2bm\sum_i x_i + nb^2 - 2m\sum_i x_i y_i - 2b\sum_i y_i + \sum_i y_i^2} \end{array}$$
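
If the expansion looks suspicious, it is easy to spot-check numerically. Continuing the sketch above (same hypothetical data), the expanded form should agree with the direct sum up to rounding error:

```python
def sse_expanded(m, b, xs, ys):
    # E rewritten in terms of the five sums in the last line above.
    n = len(xs)
    Sx, Sy = sum(xs), sum(ys)
    Sxx = sum(x * x for x in xs)
    Sxy = sum(x * y for x, y in zip(xs, ys))
    Syy = sum(y * y for y in ys)
    return m**2 * Sxx + 2*b*m*Sx + n*b**2 - 2*m*Sxy - 2*b*Sy + Syy

print(abs(sse(1.0, 1.0, xs, ys) - sse_expanded(1.0, 1.0, xs, ys)) < 1e-12)  # True
```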

That might look intimidating, but remember that all of the sigmas are just constants, formed by adding up various combinations of the $x$- and $y$-coordinates of the original points. In fact, collecting like terms reveals that $E$ is just a parabola with respect to $m$ or $b$:

$$E(m) = \left( \sum_i x_i^2 \right)m^2 + \left(2b\sum_i x_i - 2 \sum_i x_i y_i \right) m + \left(nb^2 - 2b\sum_i y_i + \sum_i y_i^2 \right)$$ $$E(b) = nb^2 + \left( 2m\sum_i x_i - 2\sum_i y_i \right) b + \left( m^2 \sum_i x_i^2 - 2m\sum_i x_i y_i + \sum_i y_i^2 \right)$$

Further, both of these parabolas open upward since the coefficients on the $m^2$ and $b^2$ terms are both positive (the sum of $x_i^2$ must be positive unless all of the $x$-coordinates are $0$, and of course $n$, the number of points, is positive).

Since the parabolas open upwards, each one has a minimum at its vertex. Recalling that the vertex of $y=ax^2+bx+c$ occurs at $x=-b/(2a)$ (here $a$, $b$, and $c$ are the generic coefficients of a quadratic, not our slope and intercept), the vertices occur at:

$$m = \frac{-2b\sum x_i + 2\sum x_i y_i}{2\sum x_i^2} = \frac{\sum x_i y_i - b \sum x_i}{\sum x_i^2}$$ $$b = \frac{-2m\sum x_i + 2\sum y_i}{2n} = \frac{\sum y_i - m\sum x_i}{n}$$
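
Continuing the sketch, here is a quick numerical sanity check of the first vertex formula: for any fixed $b$, the slope it produces should do no worse than nearby slopes. The fixed $b$ and the nudge size below are arbitrary choices.

```python
def best_m_given_b(b, xs, ys):
    # Vertex of E(m): m = (sum of x*y - b * sum of x) / (sum of x^2)
    Sx = sum(xs)
    Sxx = sum(x * x for x in xs)
    Sxy = sum(x * y for x, y in zip(xs, ys))
    return (Sxy - b * Sx) / Sxx

b_fixed = 1.0
m_star = best_m_given_b(b_fixed, xs, ys)
for nudge in (-1e-3, 1e-3):
    print(sse(m_star, b_fixed, xs, ys) <= sse(m_star + nudge, b_fixed, xs, ys))  # True
```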

Now we have two linear equations in terms of $m$ and $b$. Substitute one into the other to solve this system of equations -- perhaps the second into the first -- and the solution is revealed:

$$m = \frac{n \sum x_i y_i - (\sum x_i)(\sum y_i)}{n\sum x_i^2 - (\sum x_i)^2} \quad \textrm{and} \quad b=\frac{(\sum x_i^2)(\sum y_i)-(\sum x_i)(\sum x_i y_i)}{n\sum x_i^2 - (\sum x_i)^2}$$
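
To see where these come from, rewrite the first vertex equation as $m\sum x_i^2 = \sum x_i y_i - b\sum x_i$, substitute the second equation's expression for $b$, and multiply through by $n$:

$$\begin{array}{rcl} nm\displaystyle{\sum x_i^2} &=& n\displaystyle{\sum x_i y_i} - \left(\displaystyle{\sum y_i} - m\displaystyle{\sum x_i}\right)\displaystyle{\sum x_i}\\\\ m\left(n\displaystyle{\sum x_i^2} - \left(\displaystyle{\sum x_i}\right)^2\right) &=& n\displaystyle{\sum x_i y_i} - \left(\displaystyle{\sum x_i}\right)\left(\displaystyle{\sum y_i}\right) \end{array}$$

Dividing both sides by $n\sum x_i^2 - \left(\sum x_i\right)^2$ yields the formula for $m$; the expression for $b$ follows symmetrically by substituting the first equation into the second.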

The formula for $m$ is bad enough, but the formula for $b$ is a monstrosity. However, there is no need to deal with (or even derive in the first place) this expression for $b$: our earlier, far simpler expression for $b$ had $m$ as its only unknown, and now $m$ is known! Consequently:

$$m = \frac{n \sum x_i y_i - (\sum x_i)(\sum y_i)}{n\sum x_i^2 - (\sum x_i)^2} \quad \textrm{and} \quad b=\frac{\sum y_i - m\sum x_i}{n}$$
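
These two formulas drop straight into code. Here is a self-contained Python sketch (hypothetical data again), with NumPy's `np.polyfit` as an independent cross-check on the result:

```python
import numpy as np

def fit_line(xs, ys):
    # Least-squares slope and intercept via the summation formulas above.
    n = len(xs)
    Sx, Sy = sum(xs), sum(ys)
    Sxx = sum(x * x for x in xs)
    Sxy = sum(x * y for x, y in zip(xs, ys))
    m = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)
    b = (Sy - m * Sx) / n
    return m, b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 2.9, 4.2, 4.8]
m, b = fit_line(xs, ys)
print(m, b)
print(np.polyfit(xs, ys, 1))  # returns [slope, intercept]; should match
```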

With a little more algebra, we can express $m$ and $b$ in the following way (see if you can prove it):

$$m = \frac{\sum (x_i-\overline{x})(y_i - \overline{y})}{\sum (x_i - \overline{x})^2} \quad \textrm{and} \quad b = \overline{y} - m\overline{x}$$

where $\overline{x}$ and $\overline{y}$ are the averages of all the $x$-coordinates and all the $y$-coordinates, respectively.
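
Verifying the centered form numerically takes only a few more lines in the same sketch; both the slope and the intercept should match the earlier values:

```python
xbar, ybar = sum(xs) / len(xs), sum(ys) / len(ys)
m_centered = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
              / sum((x - xbar) ** 2 for x in xs))
b_centered = ybar - m_centered * xbar
print(abs(m_centered - m) < 1e-12, abs(b_centered - b) < 1e-12)  # True True
```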

Finally, recalling the formulas for the covariance $s_{xy}$ and the variance $s_x^2$, we can also write this as:

$$m = \frac{s_{xy}}{s_x^2} \quad \textrm{and} \quad b = \overline{y} - m\overline{x}$$
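
In code, this last form is a one-liner with NumPy. One caveat worth a comment: the covariance and the variance must be computed with the same denominator ($n$ or $n-1$) so that it cancels in the ratio, hence the matching `ddof` arguments below.

```python
# Slope as covariance over variance; ddof must match in both calls
# so the shared denominator cancels.
m_cov = np.cov(xs, ys, ddof=1)[0, 1] / np.var(xs, ddof=1)
print(abs(m_cov - m) < 1e-12)  # True
```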