Thursday, February 16, 2012

GZip will always compress text - no matter how random

I've recently been experimenting with gzip to check how efficiently it compresses in various cases. One conclusion that might not be obvious: gzip will always compress text files, even if they are completely random.

The test that showed this was very simple: generate 1 MB of random text and compress it. The text was generated like this:

cat /dev/urandom | tr -cd 'a-zA-Z0-9' | head -c $(( 1024 * 1024 ))
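The whole experiment can be sketched as a small script. This is a rough reconstruction, not the exact commands used for the post: the temporary file from mktemp, the LC_ALL=C setting (which makes tr treat the input as raw bytes) and the variable names are assumptions added for illustration.

```shell
#!/bin/sh
# Sketch: generate 1 MiB of random alphanumeric text, gzip it, compare sizes.
# The temporary file and variable names are illustrative assumptions.
size=$(( 1024 * 1024 ))
txt=$(mktemp)

# Keep only a-z, A-Z, 0-9 from the random byte stream, stop after $size bytes.
LC_ALL=C tr -cd 'a-zA-Z0-9' < /dev/urandom | head -c "$size" > "$txt"

# $(( ... )) strips the padding some wc implementations add.
orig=$(( $(wc -c < "$txt") ))
comp=$(( $(gzip -c "$txt" | wc -c) ))

# Savings in tenths of a percent, using integer arithmetic only.
saved=$(( (orig - comp) * 1000 / orig ))
echo "original: $orig bytes, gzipped: $comp bytes, saved: $(( saved / 10 )).$(( saved % 10 ))%"

rm -f "$txt"
```

The exact compressed size will differ a little from run to run, since the input is random.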

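The resulting file compressed to 769 kB, which is 24.9% smaller. The set [a-zA-Z0-9] contains 62 characters, so each one fits in 6 bits and 2 of the 8 bits in a byte are unused; 2 / 8 is exactly 25%. Strictly speaking, a uniformly random character out of 62 carries log2(62) ≈ 5.95 bits, so the theoretical limit is slightly higher, about 25.6%. Either way, gzip's 24.9% is very close to the best you can get with fully random characters.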
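The 6-bit figure assumes fixed-width packing of each character. The information-theoretic bound for 62 equally likely symbols is log2(62) bits per character, which can be checked with a quick awk calculation. The variable names here are only illustrative.

```shell
#!/bin/sh
# Bits of information per character drawn uniformly from 62 symbols,
# and the resulting upper bound on savings versus 8 bits per character.
# LC_ALL=C keeps "." as the decimal separator.
bits=$(LC_ALL=C awk 'BEGIN { printf "%.3f", log(62) / log(2) }')
bound=$(LC_ALL=C awk 'BEGIN { printf "%.2f", (1 - (log(62) / log(2)) / 8) * 100 }')
fixed=$(LC_ALL=C awk 'BEGIN { printf "%.2f", (1 - 6 / 8) * 100 }')

echo "bits per character:       $bits"    # 5.954
echo "entropy bound on savings: $bound%"  # 25.57%
echo "fixed 6-bit packing:      $fixed%"  # 25.00%
```

So the measured 24.9% sits just under both the simple 25% estimate and the slightly higher entropy bound.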

For comparison, when I generated a random binary file, without filtering through tr, it was not compressed at all. The output was actually 196 bytes larger.
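The binary case can be checked the same way. This is a minimal sketch that reads from a pipe rather than a named file, so the overhead it reports will probably not match the 196 bytes above exactly (gzip stores no file name in that case, and versions differ). The variable names are illustrative.

```shell
#!/bin/sh
# Sketch: gzip 1 MiB of unfiltered random bytes and report the size change.
size=$(( 1024 * 1024 ))

# $(( ... )) strips the padding some wc implementations add.
bin_comp=$(( $(head -c "$size" /dev/urandom | gzip -c | wc -c) ))

overhead=$(( bin_comp - size ))
echo "original: $size bytes, gzipped: $bin_comp bytes, overhead: $overhead bytes"
```

With all 256 byte values equally likely there is no redundancy to remove, so the output is the input plus gzip's header, trailer and block framing.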