Marcin Okraszewski Tech Blog: 2012

I've recently been experimenting with gzip to check how efficiently it compresses in various cases. One conclusion that might not be obvious: gzip will always compress text files, even if their content is completely random.

The test that showed this was very simple: generate 1 MB of random text and compress it. The text was generated like this:

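# The set is quoted so the shell cannot glob-expand it; the brackets are dropped because POSIX tr treats [ and ] as literal characters.
tr -cd 'a-zA-Z0-9' < /dev/urandom | head -c $(( 1024 * 1024 ))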

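The file compressed to 769 kB, which is 24.9% smaller. The set [a-zA-Z0-9] has 62 characters, so each character fits in 6 bits and 2 of the 8 bits in a byte are unused. 2 / 8 is exactly 25%, so with a fixed 6-bit code 25% is the most you can save. The exact limit for fully random characters is slightly higher: log2(62) is about 5.95 bits per character, which allows a saving of about 25.6%. Gzip achieved 24.9%.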
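To make the arithmetic easy to check, here is a minimal sketch that repeats the measurement and prints it next to the 25% fixed-width figure and the entropy limit 1 - log2(62) / 8. The temporary directory, the file names, the default gzip level and the use of awk for the floating-point math are my assumptions, not part of the original test, so the measured number may differ slightly from the one above.

```shell
#!/bin/sh
# Sketch only: file names and the temporary directory are illustrative assumptions.
size=$(( 1024 * 1024 ))
tmp=$(mktemp -d)

# LC_ALL=C makes tr handle the input as raw bytes regardless of the locale.
LC_ALL=C tr -cd 'a-zA-Z0-9' < /dev/urandom | head -c "$size" > "$tmp/random.txt"

# Compress with gzip's default level and measure the result in bytes.
gzip -c "$tmp/random.txt" > "$tmp/random.txt.gz"
compressed=$(( $(wc -c < "$tmp/random.txt.gz") ))

# awk's log() is the natural logarithm, so log2(62) = log(62) / log(2).
awk -v c="$compressed" -v s="$size" 'BEGIN {
    printf "gzip saving:        %.1f%%\n", (1 - c / s) * 100
    printf "fixed 6-bit code:   %.1f%%\n", (1 - 6 / 8) * 100
    printf "entropy limit:      %.1f%%\n", (1 - (log(62) / log(2)) / 8) * 100
}'

rm -rf "$tmp"
```

The first line of output varies from run to run because the input is random; the other two are constants that depend only on the alphabet size.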

Just to add: when I generated a random binary file, without filtering it through tr, it wasn't compressed at all. In fact, it grew by 196 bytes.
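The binary case can be sketched the same way. This version pipes the data straight into gzip, which is an assumption on my part; the exact overhead depends on the gzip version and on whether a file name is stored in the header, so it will not necessarily be 196 bytes.

```shell
#!/bin/sh
# Sketch only: compresses 1 MB of unfiltered random bytes and reports the size change.
size=$(( 1024 * 1024 ))

# No tr filter here, so all 256 byte values occur and nothing is redundant.
compressed_bin=$(( $(head -c "$size" /dev/urandom | gzip -c | wc -c) ))

# A positive difference means gzip made the data larger.
echo "original:   $size bytes"
echo "compressed: $compressed_bin bytes"
echo "difference: $(( compressed_bin - size )) bytes"
```

Since every byte value is equally likely, there is nothing for gzip to remove, and the header, trailer and block framing are pure overhead.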

Thursday, February 16, 2012

GZip will always compress text - no matter how random