zip vs. tar + gzip

Just had the need to create an archive of a folder containing 91 large text files, totaling 370 MB. Decided to pit zip against tar + gzip in a little speed test, using these commands:

tar cvzf awstats.tgz awstats
zip -9ry awstats.zip awstats

On the server in question, these were the elapsed times for the two commands:

zip: One minute, 21 seconds
tar: 41 seconds

This is in part because gzip only has to compress once, after tar has concatenated all the files into a single stream (but that’s not the full story). In contrast, zip has to compress each file individually. And the resulting archive sizes?

-rw-r--r-- 1 cdt cdt 141877473 Mar 8 10:31 awstats.tgz
-rw-r--r-- 1 cdt cdt 140081519 Mar 8 10:29 awstats.zip
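The per-file versus single-stream difference can be seen in a small experiment. What follows is a rough sketch, not the test run above: it uses made-up sample files in a temporary directory, and it stands in for zip by running gzip on each file separately, which only approximates zip's per-file behavior. The sample is deliberately extreme (20 copies of the same block), so the gap is far larger than anything you'd see on real log files.

```shell
#!/bin/sh
# Sketch: per-file compression vs. one compressed tar stream.
# All file names here are invented for the demonstration.
set -e
work=$(mktemp -d)
mkdir "$work/sample"

# One ~4 KB block of poorly compressible text, copied into 20 files, so
# nearly all of the redundancy is between files rather than inside them.
head -c 3000 /dev/urandom | base64 > "$work/block.txt"
i=1
while [ "$i" -le 20 ]; do
    cp "$work/block.txt" "$work/sample/file$i.txt"
    i=$((i + 1))
done

# Per-file: compress every file on its own and add up the sizes.
per_file=0
for f in "$work"/sample/*.txt; do
    n=$(gzip -9 -c "$f" | wc -c | tr -d ' ')
    per_file=$((per_file + n))
done

# Single stream: tar concatenates first, then gzip compresses once.
stream=$(tar cf - -C "$work" sample | gzip -9 | wc -c | tr -d ' ')

echo "per-file total: $per_file bytes"
echo "single stream:  $stream bytes"
rm -rf "$work"
```

The single stream should come out much smaller here, because gzip can refer back to the earlier copies, while the per-file runs each start from scratch.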

So zip did have a slight advantage in the output size. But wait... no fair! We used the “-9” option with zip for maximum compression. To make it more fair, let’s use the “-9” flag with gzip as well. tar’s z flag doesn’t take a compression level, so one straightforward way to do that is to run two consecutive commands:

$ tar cvf awstats.tar awstats ; gzip -9 awstats.tar

This caused the compression time for gzip to go way up; that command took 1 minute 17 seconds to run. But now the file sizes are nearly identical:

-rw-r--r-- 1 cdt cdt 140090837 Mar 8 10:42 awstats.tar.gz
-rw-r--r-- 1 cdt cdt 140081519 Mar 8 10:29 awstats.zip

Of course these kinds of things are very circumstantial – doing a similar test on a folder full of pre-compressed files like MP3s would yield very different results (in that case you’d be way better off just using tar without gzip, and definitely not zip). But the upshot is that when trying to decide whether to use zip or tar + gzip, compression times and output sizes are close enough to just not matter in general usage.
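To see why already-compressed input changes the picture, here is a small sketch that uses random bytes as a stand-in for an MP3. That is an assumption: random data is about as incompressible as well-compressed media, but it is not real audio. The file name is invented for the example.

```shell
#!/bin/sh
# Sketch: gzip on data that is already incompressible.
set -e
work=$(mktemp -d)

# 1 MiB of random bytes stands in for a pre-compressed file.
head -c 1048576 /dev/urandom > "$work/fake.mp3"

raw=$(wc -c < "$work/fake.mp3" | tr -d ' ')
packed=$(gzip -9 -c "$work/fake.mp3" | wc -c | tr -d ' ')

echo "original: $raw bytes"
echo "gzip -9:  $packed bytes"
# The gzip output should be no smaller than the input: gzip adds its
# header and block overhead but finds nothing to squeeze out.
rm -rf "$work"
```

On input like this, the compression step costs CPU time and buys nothing, which is why plain tar is the better choice for such folders.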

Update: I did end up doing a later test on the same dir with bzip2. Result: significantly smaller file size:

-rw-r--r-- 1 cdt cdt 104698994 Mar 8 14:17 awstats.tar.bz2

but at the expense of much longer compression times. If I use gzip and bzip2 side by side on the same 370MB tar file, I get these times:

gzip: 41 seconds
bzip2: 1 minute 36 seconds

Making bzip2 more than twice as slow as gzip (though it does generate smaller output files).
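As a quick check on the trade-off, the ratios can be worked out from the figures quoted in this post. This is only arithmetic on the reported values, with variable names chosen for the example; the size comparison is against the gzip -9 archive, since that was the smallest gzip result.

```shell
#!/bin/sh
# Figures copied from the timings and listings in this post.
gzip_secs=41
bzip2_secs=96            # 1 minute 36 seconds
gz_bytes=140090837       # awstats.tar.gz (gzip -9)
bz2_bytes=104698994      # awstats.tar.bz2

# Time ratio: how many times longer bzip2 took than gzip.
slowdown=$(awk -v a="$bzip2_secs" -v b="$gzip_secs" \
    'BEGIN { printf "%.2f", a / b }')

# Size saving: how much smaller the bzip2 archive is, in percent.
saved=$(awk -v a="$bz2_bytes" -v b="$gz_bytes" \
    'BEGIN { printf "%.1f", (1 - a / b) * 100 }')

echo "bzip2 took ${slowdown}x as long as gzip"
echo "bzip2 output was ${saved}% smaller"
```

With these numbers, bzip2 takes about 2.34 times as long for an archive about 25.3% smaller.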

5 Replies to “zip vs. tar + gzip”

  1. bzip2 will create smaller files at the expense of time and CPU cycles.

    7zip will really lower the file size, and speed is about on par with bzip2.

  2. Just updated with a quickie comparison between gzip and bzip2. Wow – bzip2 is WAY slower. Sounds like 7zip would be the way to go unless time is of the essence, in which case you’d just stick with gzip.

  3. hi there,

    just had a question: will creating a tar archive truncate folder names that contain spaces, and is there any restriction on the length of folder names in tar?

    thanks in advance, saha

  4. Saha – The answers to these questions depend on the particular implementation of tar you’re using – what operating system and what version of tar. I just tested on OS X and CentOS 5 and neither of them had a problem with folders with spaces in the names. As for very long filenames, that’s again platform dependent – you’ll have to either google it or devise a test to see how long a filename it will handle.

  5. When you compressed the tarball, you used twice as much I/O as needed.

    I don’t think this matters for small files, as they will be cached in memory by most operating systems, but for large files it should matter since they won’t fit in memory:

    tar cvf - awstats | gzip -9 > awstats.tgz

    You could also do the following on linux:

    echo /bin/gzip -9 > comaxpress
    chmod +x comaxpress
    tar --use-compress-program ./comaxpress -cvf awstats.tgz awstats

    This is again to prevent the duplicate I/O on large files.

    Just my 2 cents.
