Super SortersPart II: Mixed Numbers Sorting Routine
By KEVIN PECK
This concludes a two-part series of powerful sorting routines for intermediate programmers, which began with Multikey Sort in the April 1988 issue. Mixed Numbers is a machine language routine that sorts strings containing mixed numeric data types--positive numbers, negative numbers and floating decimal point numbers. Also included is a general-purpose substring finder routine that does multikey sorts within the Mixed Number sorter. This BASIC program works on all 8-bit Atari computers of any memory size with disk or cassette.
Why won't all your ASCII text sort routines work with numbers in strings? The problem is that normal sort routines look at data one character at a time. As soon as the routine finds two unequal characters, it thinks it's done searching and decides whether or not to swap the two data elements, depending on the sort order. This is fine for text, but simply does not work with numbers.
Let's look at an example using a character-by-character sort on the names SMITH and JONES. The computer finds that the ASCII value of S is greater than J. Thus Smith is greater than Jones--so the two last names must be swapped. This works as expected. We don't care about the rest of the text field. We know a swap is necessary after examining just the first character in each data element.
With the numbers 24 and 156, if we go character by character, the computer will first decide that since 1 is less than 2, then 156 must be less than 24. Oops! We need another sorting method that looks at the whole number first before sorting character by character.
My Mixed Numbers sort routine first determines the signs of the two numbers. If they're not equal--one is positive and one is negative--then we don't even have to look at the rest of the number.
We can decide right away if we need to swap them. If the signs of the two numbers are equal, then we move to the next test.
The second test involves finding the lengths of the two numbers. We'll examine each number, character by character, until we find a decimal point or a space. We do this because 1.2345 is less than 45, although 1.2545 has more digits. Using this method, 1.2345 has a length of 1, and 45 has a length of 2. If the two lengths are unequal, then we can decide right now whether or not to swap them. Otherwise, we must go to the third step.
Our third step goes back to looking at the two numbers on a character-by-character basis. This only works for numbers of equal lengths--the only kind we'll have if we make it this deep into the testing. But what if they're the same length before the decimal point but have a different number of digits after?
According to part two of the test, the numbers 34.567 and 34.5 both have a length of 2. The character-by-character comparison will take care of problem with the digits after the decimal point.
The computer will find the first four characters of each number to be equal. When it gets to the fifth character it will find a 6 in the first number and a space for the second. According to the computer's internal table, the value of a space is less than the value of any number so the computer will correctly decide that 34.5 is less than 34.567.
The only problem with this three-prong numeric sort arises when you have two negative numbers. While 45 is greater than 38, -45 is less than -38. Mixed Numbers checks to see if it's sorting two negative numbers, and if so, it reverses its swap decision.
Type in Listing 1, NUMSORT.DEM, check it with TYPO II and SAVE a copy before you RUN it. If you have trouble typing in the special characters in lines 2010-2100, don't type them in. Listing 2 will create them for you. Type Listing 2, checking it with TYPO II, and SAVE a copy to disk. When RUN, Listing 2 creates these hard-to-type lines, and stores them in a file called LINES.LST.
To merge the two programs, disk users LOAD "D:NUMSORT.DEM" and then ENTER "D:LINES.LST." Cassette users: CLOAD Listing 1, then insert the separate cassette used for Listing 2 and ENTER "C:".
Finally, remember to SAVE the completed program before you RUN it. You should also keep a copy of LINES.LST, the machine language sort routines, for use in your own BASIC programs.
HOW TO USE THEM
Here is the proper format for using this routine in your own BASIC programs:
A= USR(ADR(SN$), FIRST, LAST, FIELDLEN, OFFSET, RECLEN, ORDER)
And here is an explanation of each part of the statement:
A: A USR call may return one value to your BASIC program. That value is sent here. BASIC demands that all USR calls be written this way, even if no value is returned.
ADR(SN$): Address of the string containing the sort routine. This string appears in lines 2030-2100 of the demo program.
FIRST: Memory address where we will start the sort. It will always be greater than or equal to the address or the string containing the sorted data.
LAST: Address where the sort ends. It must be greater than FIRST, or the computer will lock up.
FIELDLEN: Length of the numeric field we'll be sorting on. The data string must be put in fixed-length record format. This means that extra characters in the numeric field must be padded with spaces for the routine to work properly.
OFFSET: Amount of character spaces into the record where the numeric field starts.
RECLEN: Length of each record within the data string.
ORDER: Sort order. Use 0 for an ascending sort, or 1 (or any non-zero value) for a descending sort.
The demo program uses string DS$ set up as shown below:
|Last Name||First Name||Income||Tax Owed|
Line 1070 in your completed demo program enters the subroutine to set up the two machine language strings.
Lines 1120-1200 read the sample data from lines 1590-1780 into one big string, padding any missing characters with spaces.
Then the string is printed as is, using the routine starting at line 1470. The bottom of the screen will tell you that you are now viewing the raw, unsorted data. Press the [SPACEBAR] to continue.
Now we'll actually sort some data. First we'll do an ascending sort by income. This is done in line 1230: ADR(SN$) is the address of the machine language string, ADR(DS$) is the address of our data string, and ADR(DS$) + LEN(DS$) is the ending address of the string in memory.
The income field length is 8. And 13 is the "offset" from the start of the record to the sort field. The offset is obtained by adding the lengths of all fields before the field you're manipulating--as explained last month in SuperSorts: Part I.
Both last name (length = 7) and first name (length = 6) appear before income. Since the sum is 13, the offset of income is 13. Figure 1 gives us the record length. Finally, let's do an ascending sort, giving ORDER a value of 0.
Line 1240 display the new sorted information, tells you the current format and waits for you to press the [SPACEBAR] again.
Line 1250 sorts the data string again, but by tax owed this time--in descending order. The 8 in line 1230 is changed to a 7, so the length of the sort field is now 7 characters. The offset is changed from 13 to 21. The new format is printed to the screen in line 1260.
I've found that I usually sort the database by a normal string field and then sort numeric data within smaller sections. For example, let's use a database containing all of our programs, the language they're written in, and their length in bytes so we know how much memory we need for each.
Let's sort them by language first: BASIC, ACTION!, Logo, etc. Then let's sort the programs by memory length within each language. My Multikey sort routine won't work because of the numeric memory length field. We need the starting and ending address of each language within the main data string so we can properly sort the memory length.
The second machine language subroutine is stored in FS$. It finds the first and last occurrence of a field within your data string, then it returns the actual address of these items, which lets you pass the values to the numeric sort routine--which requires a little more work to use it properly.
This is the format of the routine:
START= USR(ADR(FS$, ADR(D$), ADR(FLD$), LENFIELD, OFFSET, LENRECORD, RECCNT)
LAST=PEEK(205)+ 256 * PEEK(206)
I used the variable START instead of A = USR(...) because the routine is passing information back to us this time. In this case, it's returning the address of the first element in the sort.
Note that START will equal zero (0) if the field string is not found within the main data string. Be sure to check START for a 0 value before calling the sorting routine. Otherwise, if START is equal to 0, the computer will attempt to sort Page 0 memory and lock up.
You should also check to see if START+RECORD LENGTH is equal to LAST. If so, then only one record containing the given field string was found. And sorting one field will get you nowhere. The second line retrieves the second bit of information supplied by the routine.
Here are the variables:
ADR(FS$): Address of the machine language routine.
ADR(D$): Starting address of the data to be searched.
ADR(FLD$): Address of the string containing the field information we're looking for.
LENFIELD: length of the field we're looking for. This should be set to LEN(FLD$) for the routine to work properly.
OFFSET: Number of character spaces into the record where the field starts.
LENRECORD: Length of each record within the data string.
RECCNT: record count--the number of records to search within the main data string. To search the whole data string, this will be set to LEN(D$)/LENRECORD. You could sort a subset of the main data string by passing a value between 2 and the actual record count.
The demo program uses the String Find routine in line 1560. We'll use a data string with a fixed record length, using two fiends: field one is the programming language and field two is the memory length in bytes:
|Length in Bytes||3|
Line 1420 sorts on the length In Bytes of each program. Notice that the parameters are passed from the String Find routine we call the sort this time. We don't want to sort the entire data string here, but other a subsection--only programs written in BASIC.
To sort each program language subsection of the data string we'd use these same lines over again for each possible value of program language. We only have to change the value of FLD$ each time--the rest of the lines remain the same. But this is cumbersome if there are several possibilities for the key sort field.
While using a database program, you might want to examine a subsection of the database. Using the program language database example again, let's say you want to see all Logo programs, with their memory length in bytes. If you've sorted the string by program language using the Multikey Sort routine, then a quick call to the String Find routine will supply the addresses necessary to display the requested information. Here's another example:
200 FLD$ = "LOGO"
210 FIRST=USR(ADR(FS$), ADR(D$), ADR(FL$), LEN(FLD$), 0, 15, LEN(D$)/15)
220 IF FIRST=0 THEN PRINT "No matching data found.":GOTO 300
230 FOR I= FIRST-ADR(D$)+1 TO LAST-ADR(D$) STEP 11
240 PRINT D$(I,I+10):NEXT I
250 REM Program continues here
These program lines would display all Logo programs and their memory lengths. You could also use the String Find routine to make sure that you haven't entered any duplicate data. Some database applications don't let you duplicate certain fiends. If you enter ''Paper Clips'' while in the Add Category function of an inventory database, the program must check to see if that category already exists. If so, then the program must tell you so and instruct you to enter the Adjust Inventory mode to add the newly purchased paper clips to the database.
To test for duplication, enter the new field data and use that data as the FLD$ parameter for the String Find routine. If START=0 after calling the routine, then the program can add the new information to the list. Otherwise the program should warn the user of the duplication and let you exit from the Add Category routine.
Kevin Peck wrote the Word Searcher puzzle solver (Antic, March 1987) as well as Super Sorter: Part I (April 1988). He is a computer science major from Salina, Kansas.