The New, Improved Bubble Sort
If you dismissed the bubble sort as slow and old fashioned, you were right. But that was the bubble sort before Butterfield. In this article, Jim blows the dust off this old sorting method and teaches it some powerful new tricks.
Recently, I was writing a program that needed to do some sorting. A simple sorting method, well known to beginners—the bubble or exchange sort—had features that I liked, but it became slower and slower as the number of items increased. My problem was to find a way to modify the basic bubble sort to make it faster. In this article, we'll examine the nature of the bubble sort and explore some methods to improve its performance.
The Basic Bubble
The idea behind a bubble sort is quite simple: Sweep through the items, comparing each adjacent pair. If you find a pair out of order, swap them and continue the sweep. When a sweep is finished, ask yourself if you did any swaps that time. If the answer is yes, do the sweep again. If the answer is no, you're finished—the items are sorted.
An example might illustrate this method. Suppose we wish to alphabetize the following words:
AN APPLE EACH DAY MAKES THE DEALER HAPPY
Sweeping from left to right (we could go either way), we first compare AN with APPLE. They're in the right order, so we move on to APPLE and EACH. Still OK, but the next pair (EACH and DAY) are out of order, so we swap them. The next comparison will be between EACH and MAKES (the word EACH has moved, remember), and since they're OK, we move along. Eventually, our first sweep yields:
AN APPLE DAY EACH MAKES DEALER HAPPY THE
The highest word, THE, has bubbled up to the top of the list. On the next sweep, the next highest word, MAKES, will bubble to the top. You can see where the name bubble sort comes form.
Computer scientists do not think well of the bubble sort. Most simple sorting methods are classified as N Squared sorts. This means that as you double the number of items to be sorted, the time required to do the sort is increased by a factor of four. Big numbers make this type of sort impractical—it works fine on a dozen items, but it's hopelessly slow for sorting a thousand.
Here's why: A bubble sort compares each item against almost every other item. If we had a dozen items, we might need to make up to 11 sweeps through the data, making 11 comparisons on each sweep. Total comparisons: up to 121. We can live with that, but the arithmetic shows us what happens when we have 1000 units—999 sweeps with 999 comparisons each makes it obvious that timing will be disastrously slow.
That's why computer scientists have come up with a number of other sorting methods that will lessen this crushing time barrier. The newer generation of sorts include Quicksort (generally agreed to be fastest), Heapsort, and Selective Replacement. The number of comparisons made by these sorting methods will grow much more slowly as the data increases. They are classified as N LOG N sorts. For a dozen items, the number of comparisons required might be about 45. Increasing the number of items to 1000 might call for about 10,000 comparisons. That's a lot, but it's much better than the huge numbers called for by the bubble sort.
There's another criticism of the bubble sort that's not completely fair. It's said that the bubble sort moves data around too much. Data movement is time-consuming and may cause your program to run afoul of the dreaded garbage collection problem, which is a major time waster. But that problem is easy to eliminate from the bubble sort or any other sort. Here's the method: Instead of moving the data, we move an index that points to the data. We'll use this method in our example below.
An index array becomes very useful when your data has a number of fields in each record. For each record, you might have such elements as date, account, and amount. If you don't use an index, you have to move the data itself, and that can become clumsy.
I was writing an accounting program and I wanted to use the bubble sort despite its slow speed. Why? Let me outline some of the advantages that concerned me.
First, the bubble sort is very good on items that are almost in the correct order before the sort starts. For my application, the accounting data would normally have been entered in order by date, and I expected that many of the sorted reports would still be at least partially in chronological order. There are many other types of sorts that derive no advantage from a nearly sorted set of data, but the bubble sort might straighten things out in two or three sweeps.
Second, the bubble behaves well when there are a lot of "don't care" situations in the sorting order. If my accounting system contained, say, four accounts (auto, food, house, miscellaneous), and the user wanted to sort by account, there would be many situations where we would compare similar items (auto versus auto). In such a case, the bubble sort would just skip along, leaving the items as they were found.
Third, I wanted to use a sort in which output could take place before the sort was finished. I was concerned with the user's perception of the system here. Is it better to wait for a full sort—say, five minutes—with nothing happening on the screen? Or would it be preferable to have the first item printed out in 30 seconds or so with the remaining items following at suitable intervals? You can argue the point either way. I chose the latter.
Reverse Sweep And Flags
It doesn't matter if you sweep from bottom to top or from top to bottom. For me, the top-down method works better, since each sweep guarantees at least one new item to be output (the next lowest item will bubble down to the bottom).
Here's where the speed improvement comes in. Every time a swap takes place, the upper item is marked as having been moved (a flag is set on that item). We don't need to worry about marking the lower item: We're sweeping in a downward direction so we'll test that against something new almost immediately.
On the next sweep, only the items that have moved up will need to be tested against the next higher piece of data. (If an item moves to the top, it won't need this kind of test, of course). So, the following sweep will compare only those items that need it.
An example should clear things up. We'll show flagged items in uppercase. At the beginning, all items are flagged (except the one at the top), since all pairs will need to be compared.
AN APPLE EACH DAY MAKES THE DEALER happy
Sweeping from the top, we compare DEALER with HAPPY. No swap there, so we keep going, comparing THE with DEALER. Yes: We swap and flag the higher value (THE). Completing the sweep, we get:
an apple day EACH dealer MAKES THE happy
Note the flags. The words EACH, MAKES, and THE have moved up, and they're marked as candidates for the next sweep. Only these three words will be compared with the words above.
By the way, we can also mark EACH as the bottom point in our next sweep. We'll never need to go below this. In fact, we can now output the words AN, APPLE, and DAY—that part of the sort is now complete.
Continuing on the next sweep, THE and HAPPY are out of order and are swapped. MAKES and HAPPY are also out of order, so that exchange takes place, also. DEALER is not flagged, so it's not compared with happy. Instead, we move on and find that EACH and DEALER are out of order. The result:
an apple day dealer EACH happy MAKES the
At this point, we know that the sort is complete up to and including the word DEALER. In fact, the whole sort is complete, but we don't know that yet. We'll find that out when we make the last two comparisons (MAKES versus THE, and EACH versus HAPPY).
Below is a simple demonstration program showing the method. Keep in mind that even with these revisions, the bubble sort is not in a league with the N Log N sorting methods mentioned above. It does, however, run quite a bit faster than it would otherwise.
The program invites you to input a number of names (or words). It places these words in an array (or table) called N$. In a practical data processing operation, it's likely these names would be input from a file.
At line 180, we start the sort. J tells us how many items are completely sorted (and output) so far. Its initial value is 0. Lines 190–220 build the index array. With no sorting information so far, the index is simple: The first item will be 1, the next will be 2, the next, 3, and so on.
One special aspect of the index array: It also holds the flag that tells us whether or not a value needs to be compared with the next higher value. It does this by taking on a negative value. At the start, we want to compare all values except the top one, so all elements of array I are made negative.
Last adjustment before we go into the sort proper: When we do a sweep, how far down should we go? Variable J8 holds this value, and at the beginning, we set this to value 1, since we want to sweep all the way down the first time.
Here, we are at line 240. We'll come back here to start a new sweep. Take the value of J8 and copy it to J7. J8 will be set above the top of the list. As we sweep, we'll update it.
The loop from line 250 to 330 performs the sweep by itself. We're working at position J9 in the index table. From this table, we extract the identity of the actual strings to be compared from positions J9 and J9 + 1. These identity numbers are called X2 and J3—but wait—X2 might be negative (the flag). Indeed, we'll only do the comparison if X2 is negative. Let's get the positive value by using the absolute value function, ABS, calling the result J2.
If X2 is positive, we don't need to do a comparison and can skip to the NEXT statement at line 330. Otherwise, we compare items J2 and J3. If they're in the wrong order, we need do several things. We swap the index entries (not the data), remembering to flag the upper value by making it negative. And we note, in variable J8, that on the next sweep, we must come down at least this far.
After completing a sweep, we clear the flag in the topmost entry, again using the ABS function. At this point, we would expect a conventional bubble sort to go back and do another sweep, if necessary. Not this one. We'll do some output first.
At line 350, we allow our output pointer (J) to almost catch up with outsweep pointer (J8), sending output as we go. We'll always output something on each sweep. When we've caught up to J8, back we go to do another sweep—unless we're finished and have already output everything.
The new, improved bubble sort does what I wanted to do in my program. By adding extra logic, I was able to reduce the long sorting time and make this sort practical for my application.