The Basic Algorithm
Sorting lists is hard, but combining two sorted lists is easy. That's the guiding principle of Merge Sort, one of the fundamental sorting algorithms we use today. Unlike insertion sort, Merge Sort algorithms don't like to insert new items into a list - in fact, they split the entire input into lists of length 1, and then merge them all together, just to avoid having to!
The Merge Step
The most important part of a Merge Sort is the Merge Step, where the computer takes two sorted lists of near-equal length and combines them to produce a longer sorted list. Because Merge Sort relies on merges instead of insertion, it performs a lot of merging, so it's important that we make this step as efficient as we can.
My merge
implementation expects an input array, along with the start and end indexes of two of its consecutive subarrays, which
are already sorted.
It creates a temporary array (outputv
), merges the two subarrays into into it, and then copies the contents of outputv
back
into inputv
. The result is the same array as before, but now the two subarrays are combined into one!
void merge(
long * inputv, // the input array
const int first_index, // the start index of the first subarray
const int last_of_first_subarray, // its last index
const int last_index // the last index of the second subarray
// note that we don't need the start index of the second subarray,
// because this function will only be run on consecutive subarrays
) {
// we create these two variables so that we can change them during the
// algorithm, but still have the originals stored for later.
int i1 = first_index;
int i2 = last_of_first_subarray + 1; // this is the start index
// of the second subarray
long outputv[outputc];
// a temporary array to construct the sorted subarray in
int outputc = last_index -
first_index + 1;
// the number of elements that will end up in outputv
for (int i = 0; i < outputc; i++) {
// this if condition is a bit complex
// because it needs to ensure that i1
// and i2 never overflow their respective
// subarrays
if (
(i1 <= last_of_first_subarray &&
inputv[i1] < inputv[i2]) ||
!(i2 <= last_index)
) {
// either the first subarray has a smaller first
// element, or all elements of the second are gone
outputv[i] = inputv[i1];
i1++; // move on to next element of first subarray
} else {
// either the second subarray has a smaller first
// element, or all elements of the first are gone
outputv[i] = inputv[i2];
i2++; // move on to next element of second subarray
}
}
// finally, copy the elements of the temporary buffer
// back in to the original array
for (int i = 0; i < outputc; i++) {
inputv[first_index + i] = outputv[i];
// we make sure to start at the first index -
// we're only overwriting the original two subarrays
}
}
The Splitting Step
Now that we have a merge
function, ask yourself a question.
"How would I go about sorting a list if I was only allowed to
call merge
a bunch of times?" Turns out, not only is this
possible, but it's actually one of the fastest ways to sort
a list!
To perform our merge-only sort, we're going to need to somehow divide the input list into its individual elements, and then merge all of those into lists of length two, and then merge those into lists of length four, and so on. If we do this right, we'll eventually end up with just one list: the sorted list!
This is a great place to use recursion --- calling merge
recursively
will allow us to perform both the splitting and the merging, with surprisingly
little code! Let's look at the pseudocode in the CLRS book.
MERGE-SORT(A, p, r)
if p < r // if this array has more than one element
q = floor((p + r)/2) // find the middle
MERGE-SORT(A, p, q) // sort the first half
MERGE-SORT(A, q + 1, r) // sort the second half
MERGE(A, p, q, r) // merge the two halves
The MERGE-SORT
procedure expects an array A
, a start index p
, and an
end index r
. It will only operate on the values between p
and r
,
so we can use those values to control which part of the array it works on---
when we call MERGE-SORT
for the first time, we'll set p
to 0
and r
to
the last index of A
, so it sorts the whole list!
When I say we're going to "divide" the input list into its individual elements,
I don't mean we're going to actually split it into separate lists---it's much faster
if we just use our p
and r
values for that. Let me explain. When we call
MERGE-SORT
on a list, it doesn't create any more lists; it just calls itself
two more times, and assigns one half of its territory to each. Well, each of those
child MERGE-SORTS
will do the same thing, and now we'll have four grandchild MERGE-SORT
calls,
each operating on one fourth of the input list! Eventually, we'll have one MERGE-SORT
call
for every item in the list.
This is where our if p < r
condition comes in. When MERGE-SORT
is called with equal p
and r
values, it knows that it only has one item to work on, because the first item in its subarray is
also the last. In that case, its work is done, so it returns! Once a parent MERGE-SORT
knows that
its child MERGE-SORT
s are done, it moves on to the next step: it takes the subarrays that its children
have sorted, and merges them. For example, if we call MERGE-SORT
on a list of two elements, it will
call itself twice, once for the first element and once for the second. Those calls to MERGE-SORT
will
realize they're being called on lists of one element, and return, because a list of one element is already
sorted. Once they finish, the 2-item MERGE-SORT
will take the two one-element subarrays and merge
them,
which will sort the list!
If MERGE-SORT
is called on a list of length 4, it will divide it into 2 sections, call MERGE-SORT
on each,
and then merge
the two, leaving the 4-item list sorted. What's more, even though MERGE-SORT
works by dividing
each subarray into halves, it can sort a list with an odd number of items just as easily, if you set it up right.
More on that later. For now, let's put this together in C.
void merge_sort(long * inputv, int p, int r) {
if (p >= r) return;
// early return if the input subarray is length 1 or less
int q = (p + r) / 2;
// note that C defaults to floor division on integers
merge_sort(inputv, p, q);
// sort the first subarray
merge_sort(inputv, q + 1, r);
// sort the second subarray
merge(inputv, p, q, r);
// combine the two
}
To sort a long *
called inputv
with length inputc
, we use: merge_sort(inputv, 0, inputc - 1)
. Putting it
all together in a program:
#include "../clilonglist.h" // my command-line input utility functions
#include <stdlib.h>
#include <stdio.h>
void merge(long * inputv,
const int first_index,
const int last_of_first_subarray,
const int last_index) {
int i1 = first_index;
int i2 = last_of_first_subarray + 1;
int outputc = last_index -
first_index + 1;
long outputv[outputc];
for (int i = 0; i < outputc; i++) {
if (
(i1 <= last_of_first_subarray &&
inputv[i1] < inputv[i2]) ||
!(i2 <= last_index)
) {
outputv[i] = inputv[i1];
i1++;
} else {
outputv[i] = inputv[i2];
i2++;
}
}
for (int i = 0; i < outputc; i++) {
inputv[first_index + i] = outputv[i];
}
}
void merge_sort(long * inputv, int p, int r) {
if (p >= r) return;
int q = (p + r) / 2;
merge_sort(inputv, p, q);
merge_sort(inputv, q + 1, r);
merge(inputv, p, q, r);
}
int main(int argc, char * argv[]) {
int inputc = argc - 1;
long * inputv = extract_long_list(inputc, argv);
// use my utility functions to grab the
// command line arguments as an array of longs
inputv = merge_sort(inputv, 0, inputc - 1);
print_long_list(inputc, inputv);
free(inputv);
}
As always, you can find this code and all my helper code on Github.
Let's run our merge sort.
- ./merge_sort 7 4 5 2 3 6 1
1 2 3 4 5 6 7
It works!
Avoiding Off-By-One errors
The most difficult part of this algorithm is making sure all of you indices line up---making sure that the two
subarrays that you sort with merge_sort
are the same ones that are passed to merge
. It's pretty easy to
end up with a mismatch, one more or one less item where there shouldn't be, and a demonstrably non-sorted output list.
So, let's make some things clear to make sure this doesn't happen.
1. merge and merge_sort expect array indices, not array lengths.
We define merge
as
void merge(long * inputv,
const int first_index,
const int last_of_first_subarray,
const int last_index) {
I try to name these variables in such a was as to make it obvious that they are intended to be indices---we expect to be able
to access those positions in inputv
and find something there. It's easy to mess this up. For example, I could accidentally call
merge(inputv, 0, inputc / 2, inputc)
, where inputc
is the size of the input list. The problem is, the last parameter of
merge
is not supposed to be the length of inputv
; it's supposed to be the last index in inputv
. If I make the incorrect
call above, the computer will try to access inputv[inputc]
, which will not be one of the values in the list. Needless to say,
this will not be good.
The same is true of the merge_sort
function, which leads us to our next potential footgun:
2. Make sure merge_sort divides the list fairly
Let's take a look at our splitting code in merge_sort
.
int q = (p + r) / 2;
merge_sort(inputv, p, q);
merge_sort(inputv, q + 1, r);
This little tidbit of code can cause huge issues if you don't take the time to think about what exacly q
is. Is it the end
of the first subarray, or the start of the second subarray? Treating it like the wrong one will break our code, so let's look
at it. We see that q
is defined as (p + r) / 2
. We're dividing the integer p + r
by the integer 2
, so our C code will
default to floor division: C wants the result to be an integer, so it rounds down. As such, if we apply this function
to the entirety of a 7-element array, meaning that p
is 0
and r
is 7
, q
will end up being 0 + 7 = 7
divided by 2
,
rounded down, which will equal 3
. Well, that means that we should use it as the end of the first subarray, so we are correct
to call the first merge_sort
as merge_sort(inputv, p, q)
and the next, which should start at the index after q
, as
merge_sort(inputv, q + 1, r)
. I'm sure you can see that calling
int q = (p + r) / 2;
merge_sort(inputv, p, q - 1);
merge_sort(inputv, q, r); // q is the start of the second subarray?
would be a disaster, because, although our merge
implementation could probably handle lists of very unequal length (albeit inefficiently),
things would go all haywire when we tried to split a list of length 2 into... a list of length 0 and a list of length 2, which would create
an infinite loop, and a stack overflow (the crash, not the website named after it).
Be aware of these pitfalls, and your journey towards Merge Sort will be much more pleasant.
Conclusion
Merge Sort is a strong contender for the title of "Current Best Sorting Algorithm", because of its theoretically-optimal O(n lg n)
running time.
Our implementation of it can still be improved; for example, one could note that our merge
function creates a new buffer (outputv
) every time it
is called, and modify our code to simply create one outputv
array, for all calls to merge
to use. The extra buffer requirement is, in fact, one
of the main downsides of merge sort over other algorithms like quicksort, and the reduction of that requirement is an
area of research. Maybe you can be the one to defeat it!