[Beowulf] Compressor LZO versus others
Ellis H. Wilson III
ellis at cse.psu.edu
Sun Aug 19 08:45:29 PDT 2012
On 08/19/2012 09:09 AM, Vincent Diepeveen wrote:
> Here are the results:
> The original file used for every compressor. A small EGTB of 1.8GB:
> -rw-rw-r--. 1 diep diep 1814155128 Aug 19 10:37 knnknp_w.dtb
> LZO (default compression):
> -rw-rw-r--. 1 diep diep 474233006 Aug 19 10:37 knnknp_w.dtb.lzo
> 7-zip (default compression):
> -rw-rw-r--. 1 diep diep 160603822 Aug 18 19:33 ../7z/33p/knnknp_w.dtb.7z
> Andrew Kadatch:
> -rw-rw-r--. 1 diep diep 334258087 Aug 19 14:37 knnknp_w.dtb.emd
> We see Kadatch is 140MB smaller in size than LZO; that's a lot at
> 474MB total size for the LZO
> and it's 10% of total size of the original data.
> So LZO in fact is so bad it doesn't even beat another Huffman
> compressor. A fast bucket compressor not using a dictionary at all is
> hammering it.
Thanks for these insightful findings, Vincent. Unless I missed
something, I didn't see timings for these algorithms. I would be very
interested to see these compressions wrapped in a 'time' command, and
please make sure to flush your buffer cache between runs. In Hadoop, LZO
seems to be the de facto standard because of its widespread use, its
speed in both compression and decompression, and its relatively high
compression ratio compared to very bare-bones compressors.
So seeing these results alongside 1) the time to compress when the data
is solely on HDD and 2) the time to decompress when the data is solely on
HDD would be really, really helpful.
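
To make the kind of measurement requested above concrete, here is a minimal
sketch in Python. It is only an illustration under stated assumptions: the
codecs are standard-library stand-ins (zlib, bz2, lzma), not LZO, 7-zip, or
Kadatch's compressor, and the helper names (drop_page_cache, time_codec,
benchmark_file) are hypothetical. The cache flush uses the Linux
/proc/sys/vm/drop_caches interface, which needs root; without it the helper
simply reports failure and the read may be served from cache.

```python
import bz2
import lzma
import os
import time
import zlib

# Stand-in codecs from the standard library (assumption: the compressors
# discussed in this thread are not available, so these only show the method).
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def drop_page_cache():
    """Try to flush the Linux page cache; return True on success.

    Needs root on Linux. Elsewhere, or without permission, returns False.
    """
    try:
        os.sync()  # write dirty pages out first
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("3\n")
        return True
    except (OSError, AttributeError):
        return False


def time_codec(name, data):
    """Return (compressed_size, compress_seconds, decompress_seconds)."""
    compress, decompress = CODECS[name]
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    unpacked = decompress(packed)
    t2 = time.perf_counter()
    if unpacked != data:
        raise ValueError("round trip failed for " + name)
    return len(packed), t1 - t0, t2 - t1


def benchmark_file(path):
    """Time each codec on one file, attempting a cache flush before each read.

    Returns {codec: (read_seconds, compressed_size, comp_s, decomp_s)}.
    """
    results = {}
    for name in CODECS:
        drop_page_cache()
        t0 = time.perf_counter()
        with open(path, "rb") as f:
            data = f.read()  # whole file in memory; fine for a sketch
        read_seconds = time.perf_counter() - t0
        results[name] = (read_seconds,) + time_codec(name, data)
    return results
```

Calling benchmark_file("knnknp_w.dtb") would give one row per stand-in codec.
Reading the whole file into memory keeps the sketch short; a streaming
version would be closer to what a command-line compressor actually does.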
For Hadoop, since compression is mainly used to "package" data up prior
to network transfer (and obviously it gets "unpackaged" on the other
side if it needs to be used), the trade-off between speed and
compression ratio is a fine one, dependent on your network and CPU
capabilities.
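
The trade-off described above can be sketched as a back-of-the-envelope
model. This is an assumption-laden illustration, not how Hadoop schedules
anything: it assumes compression, transfer, and decompression happen one
after another with no overlap, and the function names and every throughput
figure are hypothetical, since no timings have been posted. Only the file
sizes come from the quoted results.

```python
# File sizes in bytes, taken from the quoted 'ls' output.
ORIGINAL_BYTES = 1814155128
COMPRESSED_BYTES = {
    "lzo": 474233006,
    "7z": 160603822,
    "kadatch": 334258087,
}


def compression_ratio(original_bytes, compressed_bytes):
    """Original size divided by compressed size (higher is better)."""
    return original_bytes / compressed_bytes


def end_to_end_seconds(original_bytes, compressed_bytes, link_bytes_per_s,
                       compress_bytes_per_s, decompress_bytes_per_s):
    """Compress, send, then decompress, strictly in sequence (assumption).

    Codec throughputs are expressed in uncompressed bytes per second.
    """
    compress = original_bytes / compress_bytes_per_s
    send = compressed_bytes / link_bytes_per_s
    decompress = original_bytes / decompress_bytes_per_s
    return compress + send + decompress


def raw_seconds(original_bytes, link_bytes_per_s):
    """Time to send the data uncompressed."""
    return original_bytes / link_bytes_per_s


def worth_compressing(original_bytes, compressed_bytes, link_bytes_per_s,
                      compress_bytes_per_s, decompress_bytes_per_s):
    """True if the compressed path beats sending raw under this model."""
    packed = end_to_end_seconds(original_bytes, compressed_bytes,
                                link_bytes_per_s, compress_bytes_per_s,
                                decompress_bytes_per_s)
    return packed < raw_seconds(original_bytes, link_bytes_per_s)
```

With measured throughputs plugged in, the same function shows why a
compressor with a better ratio can still lose on a fast network: its
compress and decompress terms grow while the send term is already small.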
Please let me know if you get around to running these experiments. If
you find another compressor out there that is excellent, I'll have to
consider it for my use in Hadoop!