Achieving Better Compression with lrzip and rzip

I recently upgraded to Mozilla Thunderbird 3.0, which got me thinking about cleaning up my local mail folders. All of my mail from the past few years is stored on my IMAP server, but I still have a few gigabytes of mail from my POP3 days stored in Thunderbird's Local Folders.

I decided to do some spring cleaning and stop carrying my old POP3 mail around. It also seemed like a good time to store a copy of that old mail with my long-term backups.

My Old Friend rzip

I have been using rzip for quite a few years. Its job is to find and encode large chunks of duplicated data over very large distances, up to 900 MB. Once that is complete, it runs the resulting data through bzip2. For large datasets, I've found it to be much faster than bzip2, and it usually produces an archive that is about 30% smaller than bzip2 alone.
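The two-stage idea can be illustrated with a toy sketch in Python. This is not rzip's actual algorithm: rzip uses a rolling hash to find matches at arbitrary offsets, while the fixed 64 KB chunking, the `dedupe` and `restore` helpers, and the random test data below are all assumptions made for illustration. The point is only that bzip2 on its own works in blocks of at most 900 kB, so it cannot see a duplicate that sits megabytes away, while a long-range pass in front of it can.

```python
import bz2
import hashlib
import os

CHUNK = 64 * 1024  # toy chunk size (assumption); rzip does not use fixed chunks


def dedupe(data):
    """Replace repeated fixed-size chunks with back-references.

    Returns (ops, literals). Each op is ("lit", length) or
    ("ref", offset, length); literals holds the bytes seen for the first time.
    """
    seen = {}  # chunk digest -> offset of its first occurrence
    ops = []
    literals = bytearray()
    for off in range(0, len(data), CHUNK):
        chunk = data[off:off + CHUNK]
        key = hashlib.sha1(chunk).digest()
        if key in seen:
            ops.append(("ref", seen[key], len(chunk)))
        else:
            seen[key] = off
            ops.append(("lit", len(chunk)))
            literals += chunk
    return ops, bytes(literals)


def restore(ops, literals):
    """Rebuild the original bytes from dedupe() output."""
    out = bytearray()
    pos = 0
    for op in ops:
        if op[0] == "lit":
            out += literals[pos:pos + op[1]]
            pos += op[1]
        else:
            _, offset, length = op
            out += out[offset:offset + length]
    return bytes(out)


def demo():
    """Compare bzip2 alone with dedupe-then-bzip2 on data with a distant repeat."""
    block = os.urandom(1024 * 1024)
    # The same 1 MB block appears twice, 2 MB apart.
    data = block + os.urandom(1024 * 1024) + block
    plain = len(bz2.compress(data))
    ops, literals = dedupe(data)
    staged = len(bz2.compress(literals))
    return plain, staged, restore(ops, literals) == data
```

Calling `demo()` returns the two compressed sizes and a round-trip check. Because the repeated block is random data, bzip2 alone has to store it twice, while the staged version stores it once.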

Unfortunately, rzip can't operate on pipes. All of my automated backup scripts run along pipelines, usually from tar to bzip2 to gpg. The data never touches the disk unencrypted. That precaution is probably not always necessary, but rzip would force me to give it up.
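A pipeline of that shape can be sketched in Python with `subprocess`. The `run_pipeline` and `backup_commands` helpers are hypothetical names, and the directory, recipient, and output path in the usage comment are placeholders, not the author's actual script. Each stage reads the previous stage's output through a pipe, so only the final, encrypted stream is written to disk.

```python
import subprocess


def run_pipeline(commands, output_path):
    """Chain commands with pipes; only the last stage writes to output_path.

    Returns the list of exit codes, one per stage.
    """
    procs = []
    prev_stdout = None
    with open(output_path, "wb") as out:
        for i, cmd in enumerate(commands):
            last = i == len(commands) - 1
            proc = subprocess.Popen(
                cmd,
                stdin=prev_stdout,
                stdout=out if last else subprocess.PIPE,
            )
            if prev_stdout is not None:
                # Close our copy so the upstream stage gets SIGPIPE
                # if a downstream stage exits early.
                prev_stdout.close()
            prev_stdout = proc.stdout
            procs.append(proc)
        return [p.wait() for p in procs]


def backup_commands(source_dir, recipient):
    """Build a tar -> bzip2 -> gpg pipeline as argument lists."""
    return [
        ["tar", "-cf", "-", source_dir],                 # archive to stdout
        ["bzip2", "-c"],                                 # compress stdin to stdout
        ["gpg", "--encrypt", "--recipient", recipient],  # encrypt stdin to stdout
    ]


# Hypothetical usage (paths and recipient are placeholders):
# run_pipeline(backup_commands(".thunderbird", "me@example.com"),
#              "mail-backup.tar.bz2.gpg")
```

A tool that cannot read from standard input, as rzip cannot, has no place in this chain without first writing the unencrypted tarball to a file.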

My New Friend lrzip

I recently discovered Con Kolivas' lrzip. lrzip takes rzip a couple of steps further. It lets you choose a compressor other than bzip2 for the second stage of compression. It can also be used in a pipe.

Unfortunately, when used in a pipe it generates a large temp file. This can be a problem if you are trying to generate a large archive and don't have a lot of free disk space, or if you don't want unencrypted data being written to the disk.

Some Benchmarks

I compressed the tarball of my .thunderbird directory every way I could think of that made sense. With its default settings, lrzip kept erroring out on me at about 30%, so I had to use the -w switch to reduce the window size from 20. The -w value is in units of 100 MB. I chose 12, which gives a 1200 MB window, about 30% larger than rzip's 900 MB window.

              Size (MB)    Minutes    Ratio
uncompressed       5761         na    1.0:1
lrzip zpaq         1207        265    4.8:1
lrzip lzma         1262         60    4.6:1
lrzip bzip2        1401         27    4.1:1
rzip               1362         20    4.2:1
lzma               1441         97    4.0:1
bzip2              1748         38    3.3:1

Both rzip and lrzip with bzip2 achieved a smaller file size in less time than plain bzip2. lrzip with zpaq took over 13 times as long as rzip for a savings of 155 MB, or about 11%.
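The comparison can be recomputed from the table with a few lines of Python. The sizes and times are copied from the table above; the `compare` helper is just an illustrative name.

```python
# Sizes in MB and times in minutes, copied from the benchmark table.
results = {
    "lrzip zpaq": (1207, 265),
    "lrzip lzma": (1262, 60),
    "lrzip bzip2": (1401, 27),
    "rzip": (1362, 20),
    "lzma": (1441, 97),
    "bzip2": (1748, 38),
}


def compare(candidate, baseline):
    """Return (MB saved, percent saved, time multiplier) versus baseline."""
    c_size, c_min = results[candidate]
    b_size, b_min = results[baseline]
    saved = b_size - c_size
    return saved, 100 * saved / b_size, c_min / b_min


for candidate, baseline in [("lrzip zpaq", "rzip"), ("rzip", "bzip2")]:
    saved, pct, factor = compare(candidate, baseline)
    print(f"{candidate} vs {baseline}: {saved} MB saved "
          f"({pct:.1f}%), {factor:.2f}x the time")
```

The same helper works for any other pair of rows, for example `compare("lrzip lzma", "lzma")`.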

Why Would Anyone Wait for zpaq to Finish?

Most of the time it isn't worth the wait, but I'm a huge fan of smaller backups. Backups become significantly more expensive every time a single backup has to span a second (or third, or fourth…) piece of media. It's another floppy, CD, DVD, Blu-ray, tape, hard drive, or flash drive to manually swap around and keep safe.

I like flash drives for my personal backups. I have too many CDs and DVDs that are unreadable. I've accidentally run an old compact flash drive through the laundry and it still worked. I'm sure not all flash drives would survive that, but they do tend to be very durable.

Unfortunately, lrzip with zpaq did not get the file size down enough for the archive to fit on the backup flash drive that I keep around the house. Another 100 MB or so would have done the trick and would have saved me quite a bit of effort.
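To put numbers on the media-spanning cost, here is a small sketch that counts how many pieces of media each archive would need. The capacity of my flash drive isn't part of the benchmark, so the example assumes a nominal 700 MB CD instead; `media_needed` is a hypothetical helper, and the sizes are the ones from the benchmark table.

```python
import math

# Archive sizes in MB, taken from the benchmark table.
sizes_mb = {
    "uncompressed": 5761,
    "bzip2": 1748,
    "rzip": 1362,
    "lrzip zpaq": 1207,
}

CD_MB = 700  # assumed nominal CD capacity, for illustration only


def media_needed(archive_mb, capacity_mb):
    """Number of pieces of media required to hold one archive."""
    return math.ceil(archive_mb / capacity_mb)


for name, size in sizes_mb.items():
    print(f"{name}: {media_needed(size, CD_MB)} CD(s)")
```

The count only changes when an archive crosses a capacity boundary, which is why a slow compressor pays off only when its savings happen to push the archive under one.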

Which One Should You Use?

For most archives I would probably just choose bzip2. It does a very good job, and a decompressor is always readily available.

For almost every very large archive, I will definitely be sticking with rzip. It is faster and more space-efficient than bzip2. It is also easier to find than lrzip; my Ubuntu machine has an rzip package available in apt.

I will be sure to keep lrzip with zpaq in mind, though. Sometimes an extra couple hundred MB will save the time, effort, and cost of a second piece of media. The other downside to zpaq is that decompression is also very slow.
