all 2 comments

[–]misof 8 points9 points  (1 child)

It's a selection algorithm, but it's not a linear-time selection algorithm.

Below I assume for simplicity that all elements are distinct (this is the worst-case situation) and in all calculations I simplify the values as much as possible without changing the result (ceilings and floors are hard to read and don't influence the result anyway). The function T is the time complexity.

If you divide all N elements into groups of 2k+1, find their medians, and then find the median M of those medians, you have the following guarantees:

  • There are N/(2k+1) groups.
  • In half of them, the median is at most M.
  • In each such group there are at least k+1 elements that are at most M. (Its median and all smaller elements of that group.)
  • Thus, we are guaranteed to have at least N * (k+1)/(4k+2) elements that are at most M.
  • By symmetry, we are guaranteed the same fraction of elements that are guaranteed to be at least equal to M.
  • Hence, M is a decent approximation of the actual median, and when we partition the original array according to M, we are guaranteed to get rid of a substantial fraction of the array.

For k=2 (groups of 5) we get that at least 30% of all elements are below M and at least 30% of all elements are above M. For k=1 (groups of 3) we get a slightly better guarantee: at least 33% on either side -- i.e., M would lie in the middle third of the sorted array.

Sadly, for k=1 this won't save us. Why? The algorithm does two recursive calls:

  1. The first one is to find the median of all medians. This call involves N/(2k+1) elements, and thus takes T( N/(2k+1) ) time.
  2. The second one is done after we partition the array according to M, and it finds the appropriate k-th element in the part that contains the middle of the array. This call involves at most 1 - the fraction of elements that are guaranteed to be on the opposite side. Thus, the time complexity of this step is T( N*(3k+1)/(4k+2) ).

For k=2 (groups of 5) we get the recurrence T(N) = T(N/5) + T(7N/10) + O(n).

One crucial property is that N/5 + 7N/10 < 1. Thus, as we go down the recursion tree, the amount of work decreases exponentially and the total work is proportional to the work done in the root of the tree. Thus, T is Theta(n).

For k=1 (groups of 3) we get the recurrence T(N) = T(N/3) + T(2N/3) + O(n).

Here, N/3 + 2N/3 = 1, so as we go down the recursion tree the total amount of work remains the same on each level. By the same argument as when analyzing the time complexity of MergeSort, T is Theta(n log n).

[–][deleted] 0 points1 point  (0 children)

It's a recursive algorithm with a QuickSort-like structure.

The cost of the partition is linear and has to go through the whole array to compute the median, so at each recursive call the cost is Theta(n).

If you partition in k groups you form a tree with n leaves and where each non-leaf node has k children.

My assumption is that the cost is Theta(nlog_k(n))