Classic Computer Magazine Archive COMPUTE! ISSUE 157 / OCTOBER 1993 / PAGE 66

Data under pressure: no matter how much storage space you have, you need more. Compression can help. (data compression)
by Paul C. Schuytema

Cyril Northcote Parkinson gave us a mantra for the modern day: "Work expands to fit the available time." That same principle holds equally true for the innards of our computers: Data will expand to fill every nook and cranny of a hard disk, no matter how many precautions you take.

Record a few seconds of 16-bit audio, update a customer database, or make an editable copy of a novel, and soon that expanse of free megabytes becomes a claustrophobic region to be protected at any cost. Data grows to fit the space, a truism just as certain as death and taxes.

Fortunately, since the infancy of computer technology and information science, mathematicians and computer scientists have been diligently battling this problem. In the late 1940s Claude Shannon began the study of data compression as he explored the entropy, or information richness, of a quantity of data. Mathematically speaking, the higher the entropy of a data file is, the more information will be in that file. Shannon explored ways in which to store data as efficiently as possible, to get the most information into a few bits.
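To make the idea of entropy concrete, here is a minimal Python sketch that measures the average information per byte of a block of data. The function name shannon_entropy and the two sample inputs are illustrative assumptions, not anything from Shannon's work or from the products discussed here.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Average information per byte, in bits (0.0 up to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum p * log2(1/p) over every byte value that actually occurs.
    return sum((n / total) * math.log2(total / n) for n in counts.values())

varied = bytes(range(256))   # every byte value exactly once
uniform = b"A" * 256         # one value repeated 256 times

print(shannon_entropy(varied))   # 8.0
print(shannon_entropy(uniform))  # 0.0
```

A file that scores near 8 bits per byte has little redundancy left to squeeze out; a file that scores near 0 is almost all redundancy.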

Since that time, the abstract gyrations of compression schemes have found their way into nearly every aspect of daily computing. Load a new program or game onto your hard disk, and you must run an installation program that decompresses the information held on the floppy disks. Download a utility or file from a BBS, and chances are that you must extract the file with PKUNZIP or some other decompression program. And now, in today's world of monster data files and multimedia information, data compression is even being factored into the most basic levels of file storage formats.

Gospel Truth

The most basic gospel of any data compression scheme is to get more into less space. To shrink data, a program must examine the data and then apply a compression algorithm to the most basic information--the bits and bytes that make up the data. This algorithm shrinks the size of a data file by combing out any redundancy in the information, thus making the output a more concise, information-rich piece of data.

Compression techniques vary widely, from different mathematical approaches to the same data to entirely different schemes for wildly different data types. For example, a spreadsheet and a realtime video file will be best served by different compression techniques. While some techniques are specialized, there are generic compression algorithms that work at the most basic data level, oblivious to whether the data is a text file or a scanned image of the Mona Lisa.

The simplest form of data compression is called run length encoding (RLE). The PCX image format employs RLE as its basic storage scheme. RLE compresses data by eliminating redundancy. Imagine a single frame of Disney's Snow White, for example. The image is made up of large fields of simple colors--blue for her dress and red for the nose of Sneezy, the dwarf. If we cut this picture up into horizontal strips, we can see that the picture consists of a series of color areas. Imagine these to be data bytes, and we can easily compress the image. If the strip showing Snow White's dress is a field of blue, then the file storing the image can represent it as a series of bytes signifying blue. But for greater efficiency, we can replace the series of blue bytes with a pair of bytes, one indicating blue and the other indicating how many blue bytes are in the row. In this way, data can be much more efficiently stored.
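The byte-pair idea described above can be sketched in a few lines of Python. This is the generic scheme only, under my own assumptions: the names rle_encode and rle_decode, the palette values, and the cap of 255 bytes per run are illustrative, and the layout is not the PCX file format's actual byte encoding.

```python
def rle_encode(data: bytes) -> bytes:
    """Replace each run of identical bytes with a (value, count) pair."""
    out = bytearray()
    i = 0
    while i < len(data):
        value = data[i]
        run = 1
        # Extend the run while the next byte matches; a count must fit in one byte.
        while i + run < len(data) and data[i + run] == value and run < 255:
            run += 1
        out += bytes((value, run))
        i += run
    return bytes(out)

def rle_decode(encoded: bytes) -> bytes:
    """Expand (value, count) pairs back into the original bytes."""
    out = bytearray()
    for k in range(0, len(encoded), 2):
        value, count = encoded[k], encoded[k + 1]
        out += bytes([value]) * count
    return bytes(out)

BLUE, RED = 1, 2                        # hypothetical palette entries
strip = bytes([BLUE] * 60 + [RED] * 4)  # one horizontal strip of the picture
packed = rle_encode(strip)

assert rle_decode(packed) == strip
print(len(strip), "bytes ->", len(packed), "bytes")  # 64 bytes -> 4 bytes
```

Note that a strip with no repeated neighbors comes out of this encoder twice as long as it went in, since every single byte still costs a full pair.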

RLE, however, is not the best method for compressing a moving video file or a photorealistic image. The constant variations of hue and luminosity make RLE actually less efficient than storing this type of data normally. What is needed is another form of compression better suited for this type of visual data.

Generally, data compression comes in two flavors, "lossy" and "lossless." Lossy compression is a data compression scheme that represents a near match of the data, not the exact data. In a video image, for example, the human eye won't notice if a few pixels are removed or ten levels of blue are cut to eight. The JPEG (for still images) and MPEG (for video images) standards are two types of lossy data compression that are specifically designed to handle visual image files. Lossless compression is a data compression scheme that compresses and represents the data exactly. Information such as a spreadsheet or a haiku poem would become useless if any of the information was omitted or substituted. Lossless compression is the type of compression offered by DoubleSpace (which comes with DOS 6), Stacker, and SuperStor Pro.
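The difference between the two flavors can be shown with a small Python sketch. It uses the standard zlib module as a stand-in lossless compressor and a simple rounding step as a stand-in lossy one. The function quantize, the step size of 32, and the sample gradient are assumptions made for illustration; JPEG and MPEG work quite differently inside.

```python
import zlib

def quantize(levels: bytes, step: int) -> bytes:
    """Lossy step: snap each 0-255 level down to a multiple of `step`."""
    return bytes((v // step) * step for v in levels)

gradient = bytes(range(256)) * 4  # hypothetical strip with 256 brightness levels

# Lossless: the round trip returns every byte exactly.
packed = zlib.compress(gradient)
assert zlib.decompress(packed) == gradient

# Lossy: fewer distinct levels survive, so the data is no longer the original,
# but what remains is far more repetitive and compresses further.
coarse = quantize(gradient, 32)
assert coarse != gradient
print(len(set(gradient)), "levels reduced to", len(set(coarse)))  # 256 levels reduced to 8
print(len(packed), "bytes lossless,", len(zlib.compress(coarse)), "bytes after quantizing")
```

The quantized data can never be turned back into the original gradient, which is acceptable for a picture and fatal for a spreadsheet.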

As a hard disk fills to capacity, it's tempting to turn to one of these products for some much-needed disk real estate. But how do they work? Are they safe? Do they change the way we use our computers? These are some topics we will explore in order to arm ourselves with the information necessary to make an intelligent choice whether or not to compress.

Profit Without Loss

DoubleSpace, Stacker, and SuperStor Pro all use variations of the same generic lossless compression algorithm called Lempel-Ziv. The algorithm is named for its creators, Abraham Lempel and Jacob Ziv, who introduced the algorithm in a paper entitled "A Universal Algorithm for Sequential Data Compression" in 1977. While the three implementations of the Lempel-Ziv approach offer different interfaces and utilities, on the whole, the two most important factors, the compression ratio and the performance, are remarkably similar.

When one of these generic compression programs is installed on a hard drive, it will create two drives. One will operate the same as an uncompressed hard drive, but it will have approximately double the size of the original drive (I expanded a 170MB hard drive into approximately 310MB, not including a 5MB permanent swap file for Windows). The other drive will contain information important for the compression program, as well as a single file which physically contains all of the hard disk's files, in compressed form.

The compression program's device driver is loaded into memory during the boot-up process, and it intercepts the data going to or from the hard disk. As the data streams into a buffer, the Lempel-Ziv algorithm scans the data in a "sliding window," sending off unique sections of data but looking for repeated patterns. When a redundant piece of data is encountered, an offset pointer is sent instead of the data proper. This pointer points to the first instance of that data. In this way, the Lempel-Ziv algorithm is a dictionary-based compression system, creating a table of repeating data patterns and substituting a pointer to the data's location in the dictionary, rather than the actual data. By trimming out the redundancy at the binary level, Lempel-Ziv can consistently offer about a 2 : 1 compression ratio.
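The sliding-window search described above can be sketched in Python as follows. This is a deliberately slow teaching version, not the code any of the three products use. The name lz_compress, the 2048-byte window, and the three-byte minimum match are my own assumptions.

```python
from typing import List, Tuple, Union

# A token is either a literal byte value or an (offset, length) pointer.
Token = Union[int, Tuple[int, int]]

def lz_compress(data: bytes, window: int = 2048, min_match: int = 3) -> List[Token]:
    """Emit literals for new data and back-pointers for repeated data."""
    tokens: List[Token] = []
    i = 0
    while i < len(data):
        best_len, best_off = 0, 0
        # Search only the part of the data still inside the window.
        for j in range(max(0, i - window), i):
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:  # strict ">" keeps the earliest longest match
                best_len, best_off = length, i - j
        if best_len >= min_match:
            tokens.append((best_off, best_len))  # pointer back into the window
            i += best_len
        else:
            tokens.append(data[i])               # unique data goes out as-is
            i += 1
    return tokens

print(lz_compress(b"abcabcabcabc"))  # [97, 98, 99, (3, 9)]
```

In the sample, the first three letters go out as literals, and the remaining nine bytes collapse into a single pointer that says "go back three bytes and copy nine."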

However, the Lempel-Ziv algorithm used in today's generic compression programs is sophisticated enough to create an integrated dictionary--one that is contained within the compressed file. Because of this, the compression and decompression routines are executed faster, and there is no need for a separate dictionary file. The information in a Lempel-Ziv compressed file consists of a stream of actual data and pointers (set off by a code to let the decompression routine know that the information following is a pointer and not another instance of data), in which the pointers indicate an offset location in the file where the "real" instance of the data lives.
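To show how a decompression routine can tell data from pointers, here is a minimal Python decoder for a made-up stream layout. The flag codes LITERAL and POINTER, the one-byte offset and length fields, and the name decode_stream are all hypothetical; the actual on-disk layouts of Stacker, SuperStor Pro, and DoubleSpace are not described in this article.

```python
LITERAL, POINTER = 0x00, 0x01  # hypothetical flag codes

def decode_stream(stream: bytes) -> bytes:
    """Rebuild the original data from flagged literals and pointers."""
    out = bytearray()
    pos = 0
    while pos < len(stream):
        flag = stream[pos]
        if flag == LITERAL:
            out.append(stream[pos + 1])        # one byte of actual data follows
            pos += 2
        elif flag == POINTER:
            offset, length = stream[pos + 1], stream[pos + 2]
            if offset == 0 or offset > len(out):
                raise ValueError("pointer reaches outside the data decoded so far")
            for _ in range(length):
                out.append(out[-offset])       # copy from the earlier instance
            pos += 3
        else:
            raise ValueError(f"unknown flag {flag:#x} at byte {pos}")
    return bytes(out)

stream = bytes([LITERAL, ord("a"), LITERAL, ord("b"), LITERAL, ord("c"),
                POINTER, 3, 6])
print(decode_stream(stream))  # b'abcabcabc'
```

The decoder needs no separate dictionary file: everything a pointer refers to has already been rebuilt in the output by the time the pointer is read.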

All of this data manipulation operates transparently to the user. It works directly with the read and write calls to the hard disk. On the surface, everything operates normally, with the exception that the capacity of the hard disk is doubled. If you were to examine the amount of compression taking place on a per-file basis, there would be much more variation. Executable files are the least compressible, while database files can easily see compression ratios as great as 7 : 1.
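You can see this per-file variation for yourself with a short Python experiment. It uses the standard zlib module, which belongs to the same Lempel-Ziv family but is not the algorithm inside any of the three products. The sample records are invented, and random bytes stand in for the worst case rather than for a real executable.

```python
import os
import zlib

def ratio(data: bytes) -> float:
    """Original size divided by compressed size."""
    return len(data) / len(zlib.compress(data))

# Hypothetical fixed-width database records: highly repetitive.
records = b"".join(b"%05d,SMITH     ,IL,0000.00\n" % n for n in range(2000))
# Random bytes: essentially no redundancy to remove.
noise = os.urandom(len(records))

print(f"database-like records: {ratio(records):.1f} : 1")
print(f"random bytes:          {ratio(noise):.2f} : 1")
```

The records shrink dramatically, while the random bytes come out slightly larger than they went in, so a whole-disk figure of about 2 : 1 is only an average across very different files.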

Ready to Commit

By committing to a program that compresses an entire drive, do users set themselves up for any unnatural risks? Possibly. But there are two sides to the story (and considerable middle ground).

On the paranoid side, compressing a disk using Lempel-Ziv means putting your data at risk. Since Lempel-Ziv builds a dictionary on the fly from information contained in the compressed file, one wrong byte could create a cascade of disaster. Since the algorithm relies on the absolute accuracy of everything it has read to build the file, garbled information could lead to any number of mistakes, like data's being interpreted as a pointer or a pointer's pointing to a wrong instance of data, resulting in the retrieval of irrelevant data. Fortunately, when an entire disk is compressed, it's not treated as a single file, though, technically, it is a single file. The Lempel-Ziv algorithm looks at the disk file in sectors and builds a fresh dictionary for each unit of data read into the algorithm's buffer (generally 2048 bytes), which might contain only parts of a file or might contain several small files. If some data is misread, only that sector's data will be lost.
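The value of starting fresh for each unit of data can be demonstrated with a small Python sketch. It compresses a file in independent 2048-byte blocks with the standard zlib module, damages one block, and shows that the others still come back intact. The function names and sample data are assumptions, and zlib's built-in checksum is standing in for whatever error detection the commercial products use.

```python
import zlib
from typing import List, Optional

BLOCK = 2048  # unit size mentioned above

def compress_blocks(data: bytes) -> List[bytes]:
    """Compress each block on its own, with no shared dictionary."""
    return [zlib.compress(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def recover_blocks(blocks: List[bytes]) -> List[Optional[bytes]]:
    """Decompress block by block; a damaged block yields None."""
    result: List[Optional[bytes]] = []
    for blob in blocks:
        try:
            result.append(zlib.decompress(blob))
        except zlib.error:
            result.append(None)
    return result

data = b"".join(b"record %06d\n" % n for n in range(1000))
blocks = compress_blocks(data)

damaged = bytearray(blocks[2])
damaged[len(damaged) // 2] ^= 0xFF   # garble one byte in the third block
blocks[2] = bytes(damaged)

recovered = recover_blocks(blocks)
lost = [i for i, block in enumerate(recovered) if block is None]
print("blocks:", len(blocks), "lost:", lost)
```

Had the whole file been compressed as one long stream, everything after the garbled byte would have been at risk, not just one block.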

On the other side of the coin, since data is compressed into much less physical space on the disk, the hard disk itself has to do less work to access a file, so the probability of an error's occurring is less than when accessing an uncompressed file.

The middle road, though, is truly the most sensible approach to take. Since the compression algorithm is performing an extra operation on your data, backing up regularly is essential (backup programs such as Central Point Backup work fine with compressed disks; in fact, Central Point's backup compression algorithm is licensed from Stac Electronics). With regular backups, it's safe to say that the inherent risks of whole-disk compression are minimized such that the benefits far outweigh any dangers.

Turn of the Screw

So how do Stacker, SuperStor Pro, and DoubleSpace measure up? Compressionwise, it's a tossup (see accompanying table), with differences being very minor indeed. They all perform at roughly the same level, slowing your computer down a bit (with the exception of Stacker), but hardly enough to complain about. Each supports Windows' permanent swap file (placing it in the uncompressed drive), and each boasts Windows interfaces, though each interface is passable at best. In short, the similarities far outweigh the differences, but there are a few points worth noting.

Stacker 3.0 and 3.1

Stacker 3.1 is essentially the same product as 3.0, but it's configured specially for DOS 6, replacing Microsoft's DoubleSpace and loading the needed drivers as part of the DOS operating system and not in the CONFIG.SYS file. Also, 3.1 allows a user who has already set up a DoubleSpace drive to easily convert it to a Stacker drive. Other than that, there are no real differences between versions 3.0 and 3.1.

Stacker is the easiest of the three to set up, yet the installation takes a while to defragment and compress the disk (about 45 minutes to one hour for a 170MB hard drive). Once Stacker is in place, it works transparently.

Stacker offers a wide array of utilities, accessible at the command line or through Windows or DOS interfaces. In Windows, the user has the option of seeing a graphical dashboard--the "Stackometer"--showing the compression ratio, the amount of free space on the hard drive, and the amount of fragmentation. Stacker also features an optimized version of Norton's SpeedDisk to defragment the compressed files.

Stacker handles a Windows swap file very well, placing it in the uncompressed drive. If you want to change the size, though, it's slightly tricky. If you wish to make it smaller, you have to exit to DOS and change the size of the Stacker drive (an option which should be available in Windows). If you wish to make it larger, you have to exit to DOS and shrink the size of the Stacker drive before performing the operation in Windows.

Stacker allows a user to compress a floppy or removable hard disk with Stacker Anywhere, a transparent utility that will allow the disk to work on a system that doesn't already have Stacker installed.

SuperStor Pro

AddStor's SuperStor Pro is similar to Stacker in many respects, although installing SuperStor Pro is much more demanding for the user (the newest versions of SuperStor Pro are bundled with 1.01 Enhancements, making installation a little easier). Once the system is installed, you have access to both DOS and Windows command interfaces. SuperStor Pro's Windows utility, while not as graphically pleasing as Stacker's, allows you to perform more operations, such as setting up a floppy or removable disk. The utilities allow the user to see the compression ratios and storage savings in a number of ways, even down to the statistics of an individual file.

SuperStor Pro features its own disk optimization program, as well as an additional program, JPEG Workshop, which allows users to compress color and black-and-white image files using the JPEG standard for lossy compression (achieving an average 20 : 1 compression ratio).

SuperStor Pro also allows removable media to be outfitted with AddStor's version of UDE (Universal Data Exchange), which enables the disks to be fully functional on systems that don't have AddStor's product already installed.

AddStor also plans to offer DoubleTools, a compression program which, like Stacker 3.1, will supplant DoubleSpace.

DoubleSpace

DOS 6, when purchased as an upgrade, is the most cost-effective way to double a hard disk. DoubleSpace is a compression utility based on an algorithm licensed from Vertisoft (Stac Electronics is currently suing Microsoft for patent infringement; Microsoft first approached Stac to use its compression technology in DOS 6, but a deal could not be struck).

DoubleSpace is not automatically activated when you install DOS 6; it must be installed separately. When DoubleSpace compresses a drive, it creates a CVF (Compressed Volume File), which holds the compressed contents of the entire disk. DoubleSpace conforms to the Microsoft Realtime Compression Interface (MRCI), which is a standard that Microsoft hopes will be a common ground for all future software and hardware compression schemes (Stacker 3.1 and AddStor's DoubleTools conform to the MRCI standard).

DoubleSpace offers performance similar to that of Stacker and SuperStor Pro, but it has the advantage of being a component of the operating system. A drive compressed with the other products must maintain two copies of the CONFIG.SYS file, while DoubleSpace works with a single instance of the file.

A disadvantage of DoubleSpace is that, at the time of this writing, the included optimization software was not configured to handle the compressed disk (the CVF), so it will not actually perform an optimization at all.

DoubleSpace suffers from the fact that it's the only one of the three products that doesn't offer an uninstall feature. To unDoubleSpace a drive, you must back up the entire drive, delete the compressed drive, and retrieve the information from the backup. Also, if you want to move from Stacker to DoubleSpace, you'll want to purchase a $5 (plus $5 for handling) utility from Microsoft called The MS-DOS 6 Stacker Conversion Kit.

Conclusion

Hard disk compression utilities are a very exciting solution to a shrinking hard disk. The cost is far lower than that of a new hard drive, and the technology is advanced enough to install and forget. While some increased risk is incurred with disk compression, a prudent schedule of backups will protect important data.

Of the three programs mentioned above, any can be a wise and safe choice to double the capacity of a hard disk. Stacker offers the edge in ease of use, with effortless installation. SuperStor Pro provides the easiest access to removable media, in which the user can compress a floppy right from the Windows interface. DoubleSpace offers the cost edge, as well as the solidity of being an integral part of the operating system. Alternatives include Infinite Disk from Chili Pepper Software, which selectively compresses and archives files based on frequency of access.

Any way you go, a compressed disk can give you that much-needed breathing room: a new allotment of megabytes to conquer.

Compression Facts and Figures

Whole-Disk Compression Performance (170MB Hard Drive(*))

Compression      Total Storage    Space Used    Free Space
None             166,276K         89,160K       77,540K
Stacker          317,656K         93,360K       224,296K
SuperStor Pro    315,588K         91,706K       224,882K
DoubleSpace      298,334K         88,102K       210,232K

(*) Disk is set up with a 5104K Windows permanent swap file.
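As a quick piece of arithmetic on the figures above, the Python sketch below divides each product's reported total storage by the uncompressed total to show how close each comes to a true doubling. The function name expansion_factor is my own, and the totals are simply the numbers from the table, which are themselves estimates made by each utility.

```python
UNCOMPRESSED_TOTAL_K = 166_276  # "None" row, total storage

totals_k = {
    "Stacker": 317_656,
    "SuperStor Pro": 315_588,
    "DoubleSpace": 298_334,
}

def expansion_factor(total_k: int, baseline_k: int = UNCOMPRESSED_TOTAL_K) -> float:
    """Reported capacity of the compressed drive relative to the bare drive."""
    return total_k / baseline_k

for name, total in totals_k.items():
    print(f"{name}: {expansion_factor(total):.2f}x")
```

On these numbers, Stacker and SuperStor Pro land at roughly 1.9 times the original capacity and DoubleSpace at roughly 1.8 times.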

Performance Comparison

Test A: copying a 1183K directory from an uncompressed floppy to a compressed hard disk (directory is a mixture of executable and data files)

Test B: copying a 1183K directory from a compressed hard disk to an uncompressed floppy (directory is a mixture of executable and data files)

Test C: opening a 70K Ami Pro 3.0 file from compressed hard disk (file is a mixture of text, tables, and simple graphics)

Time measured in seconds

Compression      Test A    Test B    Test C
None             95        69        8
Stacker          74        83        6
SuperStor Pro    106       98        8
DoubleSpace      101       98        11

Hints

Here are some rules of thumb to help you live with disk compression.

* Be sure to back up your data before installing a hard disk compression product. Also, be sure to back up your data before you uninstall the compressed drive, since chances of errors are magnified as the program decompresses megabyte upon megabyte of data.

* If you're using DOS 6, either with DoubleSpace or with any other compression product, turn off SMARTDrive's lazy write feature. When DOS 6 is installed, SMARTDrive is set up so that it will not always write data to disk immediately, but will wait for an opportune moment. It's possible to lose data if you just switch off your computer.

* If you have additional drives on your system, such as a removable hard disk or a CD-ROM drive, don't expect the compression program to have the intelligence to figure it all out. You might have to go back and let your programs know the lay of the land. (My CD-ROM drive was changed from drive E to drive F during compression installation.)

* Be sure to have all of your manuals handy during installation. During each of my installations on two different computers (a 386 and a 486), I had problems. They were minor problems--not fatal ones--but having the operating manuals handy let me track down some of the more esoteric conundrums (such as losing my 386 enhanced driver for Windows).

* Be aware that not all games will work in compressed form. If you're a serious game player, it might be a good idea to make a drive partition, creating an uncompressed logical drive for your games, and compress only your more standard applications and files.

* If you have a removable hard drive, compressing one of the cartridges makes for an extremely simple backup option. I used a 90MB removable compressed to nearly 180MB for easy whole-disk backups. In Windows, I created a macro that drags the C drive over to my removable and copies the entire thing in roughly 15 minutes.

* When you see an indication of how much free space is on a disk, assume that it's an educated guess rather than the actual truth, since different files compress at different rates. In one instance Windows' File Manager told me I had 229MB free, the compression program's utility informed me that I had 210MB free, and DOS informed me that I actually had 234MB free.

* Don't use a standard disk optimizer on a compressed drive. Chances are that it won't hurt anything, but it will see the entire compressed drive as a single file. Use an optimizer designed for compressed disks.

* Copying a file (or moving a directory in Windows) within a compressed disk takes longer than copying that file to an uncompressed disk because the file must be decompressed and then recompressed.

* A hard disk compression utility is a perfect addition to a roving laptop computer. Consider a utility that will allow compressed floppies to be used on other systems for maximum efficiency.

* A compressed file can rarely be compressed further. Sometimes you can achieve an additional percentage or two of compression, but usually a compressed file actually becomes larger when compressed a second time. For this reason, one of the techniques for saving space on an uncompressed hard drive--using PKZIP to compress large files and directories--is useless on a drive compressed with Stacker or one of its competitors.
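The last hint is easy to check with a short Python experiment that compresses some text once and then compresses the result again. The standard zlib module stands in for PKZIP and the disk compressors here, and the word list, the seed, and the name shrink_report are invented for the example.

```python
import random
import zlib
from typing import Tuple

def shrink_report(data: bytes) -> Tuple[int, int, int]:
    """Return the original size, the size after one pass, and after two."""
    once = zlib.compress(data)
    twice = zlib.compress(once)
    return len(data), len(once), len(twice)

WORDS = [b"disk", b"compress", b"backup", b"floppy", b"sector", b"driver",
         b"window", b"pointer", b"megabyte", b"archive", b"buffer", b"file"]
rng = random.Random(1993)  # fixed seed so the sample text is repeatable
text = b" ".join(rng.choice(WORDS) for _ in range(5000))

original, once, twice = shrink_report(text)
print(f"original {original} bytes, one pass {once} bytes, two passes {twice} bytes")
```

The first pass removes nearly all of the redundancy, so the second pass has almost nothing left to work with and the file stays about the same size or grows slightly.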

Compress and Back Up

One strategy for keeping your hard disk clear is to compress the files you rarely use and archive the files you never use. Chili Pepper Software has automated this process with Infinite Disk. It monitors your hard disk use and leaves often-used files uncompressed, compresses the files you only occasionally use, and prompts you to archive to a floppy the files that you haven't accessed during a specified period of time. You place a special sticker on the floppy to identify it. As far as your operating system is concerned, the archived file is still on your hard disk. The only difference is that when you access that file, Infinite Disk prompts you to insert the floppy containing its archive. It will be accessed as if it were still on your hard disk. There is no theoretical limit to the number of floppies you could use, so this method of hard disk management could yield an infinite hard disk, hence the name. The only practical limit would be your ability to maintain an orderly collection of floppies.
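The three-tier policy described above might be sketched in Python like this. It is only an illustration of the general idea. The function name classify and the 30-day and 180-day thresholds are assumptions of mine, and nothing here reflects how Infinite Disk actually makes its decisions.

```python
from datetime import date

def classify(last_access: date, today: date,
             compress_after_days: int = 30,
             archive_after_days: int = 180) -> str:
    """Pick a storage tier from how long a file has sat untouched."""
    idle_days = (today - last_access).days
    if idle_days >= archive_after_days:
        return "archive to floppy"
    if idle_days >= compress_after_days:
        return "compress"
    return "leave uncompressed"

today = date(1993, 10, 1)
print(classify(date(1993, 9, 28), today))  # leave uncompressed
print(classify(date(1993, 7, 4), today))   # compress
print(classify(date(1992, 12, 25), today)) # archive to floppy
```

Both thresholds are parameters because the paragraph above says the idle period is something the user specifies.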

Products Under Pressure

Remember that these are list prices. Many of these products are available at significantly lower prices either through their manufacturers or through retailers.

SuperStor Pro                                       $149.95
DoubleTools for DoubleSpace                         $99.00
ADDSTOR
1040 Marsh Rd., Ste. 100
Menlo Park, CA 94025
(800) 732-3133

DOS 6                                               $129.95
The MS-DOS 6 Stacker Conversion Kit                 $5.00
MICROSOFT
P.O. Box 3018
Bothell, WA 98041-3018
(800) 228-7007

PKZIP for DOS 2.04G                                 $47.00
PKWARE
9025 N. Deerwood Dr.
Brown Deer, WI 53223-2437
(414) 354-8699

Stacker 3.0                                         $149.00
Stacker 3.1                                         $99.95
Stacker Special Edition (only for DOS 6 users)      $129.95
STAC ELECTRONICS
5993 Avenida Encinas
Carlsbad, CA 92008
(800) 522-7822

Infinite Disk                                       $189.00
CHILI PEPPER SOFTWARE
1630 Pleasant Hill Rd., Ste. 180-200
Atlanta, GA 30136-7411
(404) 339-1812