Understanding Merge Sort

tl,dr: Merge Sort relies on the efficiency of combining two already-sorted lists to sort lists in a quick and reasonably understandable manner.

The Basic Algorithm

Sorting lists is hard, but combining two sorted lists is easy. That's the guiding principle of Merge Sort, one of the fundamental sorting algorithms we use today. Unlike insertion sort, Merge Sort algorithms don't like to insert new items into a list - in fact, they split the entire input into lists of length 1, and then merge them all together, just to avoid having to!

The Merge Step

The most important part of a Merge Sort is the Merge Step, where the computer takes two sorted lists of near-equal length and combines them to produce a longer sorted list. Because Merge Sort relies on merges instead of insertion, it performs a lot of merging, so it's important that we make this step as efficient as we can.

My merge implementation expects an input array, along with the start and end indexes of two of its consecutive subarrays, which are already sorted. It creates a temporary array (outputv), merges the two subarrays into into it, and then copies the contents of outputv back into inputv. The result is the same array as before, but now the two subarrays are combined into one!

void merge(
       long * inputv, // the input array
       const int first_index, // the start index of the first subarray
       const int last_of_first_subarray, // its last index
       const int last_index // the last index of the second subarray
       // note that we don't need the start index of the second subarray,
       // because this function will only be run on consecutive subarrays
    ) {

    // we create these two variables so that we can change them during the
    // algorithm, but still have the originals stored for later.
    int i1 = first_index;
    int i2 = last_of_first_subarray + 1; // this is the start index
                                         // of the second subarray

    long outputv[outputc]; 
    // a temporary array to construct the sorted subarray in

    int outputc = last_index -
                  first_index + 1;
    // the number of elements that will end up in outputv

    for (int i = 0; i < outputc; i++) {
        // this if condition is a bit complex
        // because it needs to ensure that i1
        // and i2 never overflow their respective
        // subarrays
        if (
            (i1 <= last_of_first_subarray &&
            inputv[i1] < inputv[i2]) ||
            !(i2 <= last_index)
        ) {
            // either the first subarray has a smaller first
            // element, or all elements of the second are gone
            outputv[i] = inputv[i1];
            i1++; // move on to next element of first subarray
        } else {
            // either the second subarray has a smaller first
            // element, or all elements of the first are gone
            outputv[i] = inputv[i2];
            i2++; // move on to next element of second subarray
        }
    }

    // finally, copy the elements of the temporary buffer 
    // back in to the original array
    for (int i = 0; i < outputc; i++) {
        inputv[first_index + i] = outputv[i];
        // we make sure to start at the first index -
        // we're only overwriting the original two subarrays
    }

}

The Splitting Step

Now that we have a merge function, ask yourself a question. "How would I go about sorting a list if I was only allowed to call merge a bunch of times?" Turns out, not only is this possible, but it's actually one of the fastest ways to sort a list!

To perform our merge-only sort, we're going to need to somehow divide the input list into its individual elements, and then merge all of those into lists of length two, and then merge those into lists of length four, and so on. If we do this right, we'll eventually end up with just one list: the sorted list!

This is a great place to use recursion --- calling merge recursively will allow us to perform both the splitting and the merging, with surprisingly little code! Let's look at the pseudocode in the CLRS book.

MERGE-SORT(A, p, r)
  if p < r // if this array has more than one element
    q = floor((p + r)/2) // find the middle
    MERGE-SORT(A, p, q) // sort the first half
    MERGE-SORT(A, q + 1, r) // sort the second half
    MERGE(A, p, q, r) // merge the two halves

The MERGE-SORT procedure expects an array A, a start index p, and an end index r. It will only operate on the values between p and r, so we can use those values to control which part of the array it works on--- when we call MERGE-SORT for the first time, we'll set p to 0 and r to the last index of A, so it sorts the whole list!

When I say we're going to "divide" the input list into its individual elements, I don't mean we're going to actually split it into separate lists---it's much faster if we just use our p and r values for that. Let me explain. When we call MERGE-SORT on a list, it doesn't create any more lists; it just calls itself two more times, and assigns one half of its territory to each. Well, each of those child MERGE-SORTS will do the same thing, and now we'll have four grandchild MERGE-SORT calls, each operating on one fourth of the input list! Eventually, we'll have one MERGE-SORT call for every item in the list.

This is where our if p < r condition comes in. When MERGE-SORT is called with equal p and r values, it knows that it only has one item to work on, because the first item in its subarray is also the last. In that case, its work is done, so it returns! Once a parent MERGE-SORT knows that its child MERGE-SORTs are done, it moves on to the next step: it takes the subarrays that its children have sorted, and merges them. For example, if we call MERGE-SORT on a list of two elements, it will call itself twice, once for the first element and once for the second. Those calls to MERGE-SORT will realize they're being called on lists of one element, and return, because a list of one element is already sorted. Once they finish, the 2-item MERGE-SORT will take the two one-element subarrays and merge them, which will sort the list!

If MERGE-SORT is called on a list of length 4, it will divide it into 2 sections, call MERGE-SORT on each, and then merge the two, leaving the 4-item list sorted. What's more, even though MERGE-SORT works by dividing each subarray into halves, it can sort a list with an odd number of items just as easily, if you set it up right. More on that later. For now, let's put this together in C.

void merge_sort(long * inputv, int p, int r) {
    if (p >= r) return;
    // early return if the input subarray is length 1 or less

    int q = (p + r) / 2;
    // note that C defaults to floor division on integers
    merge_sort(inputv, p, q);
    // sort the first subarray
    merge_sort(inputv, q + 1, r);
    // sort the second subarray
    merge(inputv, p, q, r);
    // combine the two
}

To sort a long * called inputv with length inputc, we use: merge_sort(inputv, 0, inputc - 1). Putting it all together in a program:

#include "../clilonglist.h" // my command-line input utility functions
#include <stdlib.h>
#include <stdio.h>

void merge(long * inputv,
       const int first_index,
       const int last_of_first_subarray,
       const int last_index) {

    int i1 = first_index; 
    int i2 = last_of_first_subarray + 1;

    int outputc = last_index -
        first_index + 1;
    long outputv[outputc];

    for (int i = 0; i < outputc; i++) {
        if ( 
              (i1 <= last_of_first_subarray &&
              inputv[i1] < inputv[i2]) ||
              !(i2 <= last_index)
              ) {
            outputv[i] = inputv[i1];
            i1++;
        } else {
            outputv[i] = inputv[i2];
            i2++;
        }
    }

    for (int i = 0; i < outputc; i++) {
        inputv[first_index + i] = outputv[i];
    }

}

void merge_sort(long * inputv, int p, int r) {
    if (p >= r) return;

    int q = (p + r) / 2;
    merge_sort(inputv, p, q);
    merge_sort(inputv, q + 1, r);
    merge(inputv, p, q, r);
}

int main(int argc, char * argv[]) {

    int inputc = argc - 1;
    long * inputv = extract_long_list(inputc, argv);
    // use my utility functions to grab the 
    // command line arguments as an array of longs

    inputv = merge_sort(inputv, 0, inputc - 1);

    print_long_list(inputc, inputv);

    free(inputv);
}

As always, you can find this code and all my helper code on Github.

Let's run our merge sort.

./merge_sort 7 4 5 2 3 6 1 1 2 3 4 5 6 7

It works!

Avoiding Off-By-One errors

The most difficult part of this algorithm is making sure all of you indices line up---making sure that the two subarrays that you sort with merge_sort are the same ones that are passed to merge. It's pretty easy to end up with a mismatch, one more or one less item where there shouldn't be, and a demonstrably non-sorted output list. So, let's make some things clear to make sure this doesn't happen.

1. merge and merge_sort expect array indices, not array lengths.

We define merge as

void merge(long * inputv,
       const int first_index,
       const int last_of_first_subarray,
       const int last_index) {

I try to name these variables in such a was as to make it obvious that they are intended to be indices---we expect to be able to access those positions in inputv and find something there. It's easy to mess this up. For example, I could accidentally call merge(inputv, 0, inputc / 2, inputc), where inputc is the size of the input list. The problem is, the last parameter of merge is not supposed to be the length of inputv; it's supposed to be the last index in inputv. If I make the incorrect call above, the computer will try to access inputv[inputc], which will not be one of the values in the list. Needless to say, this will not be good.

The same is true of the merge_sort function, which leads us to our next potential footgun:

2. Make sure merge_sort divides the list fairly

Let's take a look at our splitting code in merge_sort.

int q = (p + r) / 2;
merge_sort(inputv, p, q);
merge_sort(inputv, q + 1, r);

This little tidbit of code can cause huge issues if you don't take the time to think about what exacly q is. Is it the end of the first subarray, or the start of the second subarray? Treating it like the wrong one will break our code, so let's look at it. We see that q is defined as (p + r) / 2. We're dividing the integer p + r by the integer 2, so our C code will default to floor division: C wants the result to be an integer, so it rounds down. As such, if we apply this function to the entirety of a 7-element array, meaning that p is 0 and r is 7, q will end up being 0 + 7 = 7 divided by 2, rounded down, which will equal 3. Well, that means that we should use it as the end of the first subarray, so we are correct to call the first merge_sort as merge_sort(inputv, p, q) and the next, which should start at the index after q, as merge_sort(inputv, q + 1, r). I'm sure you can see that calling

int q = (p + r) / 2;
merge_sort(inputv, p, q - 1);
merge_sort(inputv, q, r); // q is the start of the second subarray?

would be a disaster, because, although our merge implementation could probably handle lists of very unequal length (albeit inefficiently), things would go all haywire when we tried to split a list of length 2 into... a list of length 0 and a list of length 2, which would create an infinite loop, and a stack overflow (the crash, not the website named after it).

Be aware of these pitfalls, and your journey towards Merge Sort will be much more pleasant.

Conclusion

Merge Sort is a strong contender for the title of "Current Best Sorting Algorithm", because of its theoretically-optimal O(n lg n) running time. Our implementation of it can still be improved; for example, one could note that our merge function creates a new buffer (outputv) every time it is called, and modify our code to simply create one outputv array, for all calls to merge to use. The extra buffer requirement is, in fact, one of the main downsides of merge sort over other algorithms like quicksort, and the reduction of that requirement is an area of research. Maybe you can be the one to defeat it!