Quick Sort Analysis

Best Case

The best case for the quick sort occurs when each partition splits the array into two equal halves. The overall cost for sorting $n$ items then resolves into the roughly $n$ comparisons made during the partitioning step plus the cost of recursively sorting the two halves.

Since the size of each half is $n/2$, the recurrence relation for the cost function (for comparisons) is closely related to the one governing the merge sort:

$$C(n) = 2 C(\frac{n}{2}) + n \quad ; \quad C(1) = 0$$

The result is similar: the cost function in terms of comparisons for the quick sort is likewise $O(n \ln n)$.
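For instance, unrolling the recurrence (assuming $n$ is a power of $2$, so the halves stay exact) makes the count explicit:

$$\begin{array}{rcl} C(n) &=& 2\,C(n/2) + n\\\\ &=& 4\,C(n/4) + 2n\\\\ &=& 8\,C(n/8) + 3n\\\\ &=& \cdots\\\\ &=& n\,C(1) + n \log_2 n\\\\ &=& n \log_2 n \end{array}$$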

Worst Case

The worst case for the quick sort occurs when the partition does not split the array (i.e., when one set has no elements at all). Ironically, this happens when the array is sorted!
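To make this concrete, here is a minimal quick sort sketch in Python (an illustration, not any canonical implementation) that always takes the first element as its pivot, the choice assumed in this worst-case scenario. On an already-sorted array, every partition leaves the "smaller than the pivot" set empty:

```python
# Minimal quick sort sketch using the first element as the pivot.
def quick_sort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:                          # zero or one element: already sorted
        return
    p = partition(a, lo, hi)              # pivot settles at index p
    quick_sort(a, lo, p - 1)              # recursively sort the left set
    quick_sort(a, p + 1, hi)              # recursively sort the right set

def partition(a, lo, hi):
    pivot = a[lo]                         # first element as the pivot
    i = lo + 1                            # boundary of the "< pivot" block
    for j in range(lo + 1, hi + 1):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]       # grow the "< pivot" block
            i += 1
    a[lo], a[i - 1] = a[i - 1], a[lo]     # drop the pivot between the blocks
    return i - 1

# On sorted input, a[lo] is the minimum, so the left set is always empty
# and each recursive call handles n - 1 elements: the worst case.
```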

In this situation, the cost of the sort resolves into the $(n+1)$ comparisons made while partitioning plus the cost of recursively sorting the one remaining set of $n-1$ elements.

Thus, the recurrence relation in this case is slightly different:

$$C(n) = C(n-1) + n + 1\quad ; \quad C(1) = 0$$

Solving the recurrence relation is straightforward:

$$\begin{array}{rcll} C(n) &=& C(n-1) + (n + 1)& \\\\ &=& [C(n-2) + n] + (n + 1)&\\\\ &=& [C(n-3) + (n-1)] + n + (n + 1)&\\\\ &=& \cdots\\\\ &=& C(1) + 3 + \cdots + (n-2) + (n-1) + n + (n + 1)&\\\\ &=& 0 + 3 + \cdots + (n-2) + (n-1) + n + (n + 1)& \scriptsize{\textrm{now steal $3$ from the $n+1$ to produce a familiar sum}}\\\\ &=& [1 + 2 + 3 + \cdots + (n-2) + (n-1) + n] + (n-2)&\\\\ &=& \displaystyle{\frac{n(n+1)}{2} + (n-2)}&\\\\ &\sim& \displaystyle{\frac{n^2}{2}}&\\\\ \end{array}$$
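As a quick sanity check, a few lines of Python (hypothetical, just for verification) confirm that this closed form agrees with the recurrence:

```python
# Check C(n) = C(n-1) + (n + 1) with C(1) = 0 against n(n+1)/2 + (n-2).
C = 0                                     # C(1) = 0
for n in range(2, 1000):
    C += n + 1                            # apply the recurrence
    assert C == n * (n + 1) // 2 + (n - 2)
print("closed form matches the recurrence for n = 2..999")
```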

In big-oh notation, the cost in this worst case is then $O(n^2)$.

Since the worst case occurs when the array is ordered (or highly ordered), we can give ourselves a probabilistic guarantee that this doesn't happen by purposely shuffling the elements of the array before using the quick sort algorithm. The shuffle comes at a slight extra (linear) cost, but it is worth it: it makes the possibility of an $O(n^2)$ sort negligible.
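A sketch of that defensive shuffle, reusing the hypothetical quick_sort from the earlier sketch:

```python
import random

def quick_sort_shuffled(a):
    random.shuffle(a)   # linear-time Fisher-Yates shuffle
    quick_sort(a)       # worst-case orderings are now vanishingly unlikely
```

Now the running time depends on the shuffle's random choices rather than on the order of the input.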

Average Case

To derive the average number of comparisons needed to quicksort an array of $n$ distinct keys, consider the following:

First, let us denote this average cost by $C_n$ instead of the more traditional $C(n)$. This will help us visually parse the next few expressions without getting lost in a bunch of nested parentheses.

Now, we again consider the cost due to partitioning versus the recursive cost. Partitioning costs $(n+1)$ comparisons, and with $n$ distinct keys in random order, the pivot is equally likely to land in any of the $n$ final positions, splitting the rest into pieces of sizes $k$ and $n-1-k$ for $k = 0, 1, \ldots, n-1$. Averaging the recursive cost over these $n$ equally likely splits, each $C_k$ shows up twice (once from a left piece and once from a right piece).

As such, we have a recurrence relation defined by $C_0 = C_1 = 0$ and for $n \ge 2$:

$$C_n = (n+1) + \frac{2(C_0 + C_1 + C_2 + \cdots + C_{n-1})}{n}$$

Let's clear the fractions by multiplying by $n$:

$$n C_n = n(n+1) + 2(C_0 + C_1 + \cdots + C_{n-2} + C_{n-1})$$

We can eliminate the great majority of the $C_k$ terms above by observing what this recurrence implies about $C_{n-1}$:

$$(n-1) C_{n-1} = (n-1)n + 2(C_0 + C_1 + \cdots + C_{n-2})$$

Seeing the great overlap in the terms of these last two equations, let us subtract one from the other:

$$n C_n - (n-1) C_{n-1} = 2n + 2 C_{n-1}$$

Now comes the interesting part. Note that we can rearrange the terms and divide by $n(n+1)$ to obtain

$$\frac{C_n}{n+1} = \frac{C_{n-1}}{n} + \frac{2}{n+1}$$

Look at the first term on the right side. Do you see how, structurally, it is identical to the term on the left: a cost for $k$ elements divided by $(k+1)$? This gives us the ability to make multiple substitutions until we arrive at the base case:

$$\begin{array}{rcl} \displaystyle{\frac{C_n}{n+1}} &=& \displaystyle{\frac{C_{n-1}}{n} + \frac{2}{n+1}}\\\\ &=& \displaystyle{\left[\frac{C_{n-2}}{n-1} + \frac{2}{n}\right] + \frac{2}{n+1}}\\\\ &=& \displaystyle{\left[ \left[ \frac{C_{n-3}}{n-2} + \frac{2}{n-1} \right] + \frac{2}{n}\right] + \frac{2}{n+1}}\\\\ &\cdots&\\\\ &=& \displaystyle{\frac{C_1}{2} + \frac{2}{3} + \frac{2}{4} + \frac{2}{5} + \cdots + \frac{2}{n+1}} \end{array}$$

Recalling that $C_1 = 0$, and then multiplying both sides by $(n+1)$, we find

$$C_n = 2(n+1) \left(\frac{1}{3} + \frac{1}{4} + \frac{1}{5} + \cdots + \frac{1}{n+1} \right)$$

Notice the right-most factor $(1/3+1/4+1/5+\cdots+1/(n+1))$ can be approximated by the area under the curve $y = 1/x$, as the image below suggests (for $n=7$).

Finding such areas is routine in calculus, and here it results in the following: $$\begin{array}{rcll} C_n &\sim& \displaystyle{2(n+1) \int_3^{n+1} \frac{1}{x}\,dx}&\\\\ &\sim& 2(n+1) \ln n&\\\\ &\approx& 1.39 n \log_2 n& \scriptsize{\textrm{recalling the change of base theorem and observing that } 2 \ln 2 \approx 1.39} \end{array}$$

Thus, on average the quick sort requires $\sim 1.39\, n \log_2 n$ comparisons.

That means quick sort makes about $39$ percent more comparisons than merge sort. However, quick sort is faster than merge sort in practice, because there is less data movement during the sort.
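As an illustrative experiment (hypothetical code, whose partition makes a couple fewer comparisons per call than the $(n+1)$ assumed in the analysis), one can count comparisons on a randomly ordered array and compare the total against $1.39\, n \log_2 n$:

```python
import math
import random

def count_comparisons(a):
    """Quick sort a in place, returning the number of key comparisons."""
    count = 0
    def sort(lo, hi):
        nonlocal count
        if lo >= hi:
            return
        pivot = a[lo]                     # first element as the pivot
        i = lo + 1
        for j in range(lo + 1, hi + 1):
            count += 1                    # one comparison per scanned key
            if a[j] < pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        a[lo], a[i - 1] = a[i - 1], a[lo]
        sort(lo, i - 2)
        sort(i, hi)
    sort(0, len(a) - 1)
    return count

n = 10_000
a = list(range(n))
random.shuffle(a)                         # random order models the average case
print("observed: ", count_comparisons(a))
print("predicted:", round(1.39 * n * math.log2(n)))
```

The observed count approaches the estimate from below as $n$ grows, since the approximation drops lower-order (linear) terms.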

Improvements

We've already mentioned above that initially shuffling the array helps protect us from the worst-case comparison cost. We can additionally improve on the quick sort algorithm with a few other modifications: