[Beowulf] Compressor LZO versus others
diep at xs4all.nl
Sun Aug 19 06:09:56 PDT 2012
On Aug 18, 2012, at 3:19 PM, Ellis H. Wilson III wrote:
> As a side-note -- Hadoop provides support for compression on transfers
> that might help you immensely. You can pick from a few, but LZO tends
> to be the best one for speed/compression for my workloads. This could
> really help you when you need to do that 1 pass where all nodes are
> exchanging with each other.
> Best of luck!
I compared compressors for my dataset. Let me first of all mention that
the official compressor benchmarks are a joke.
They're tiny benchmarks over lots of different sorts of data, as if
that is what matters.
To compress LARGE data you suddenly need a totally different sort of
compressor. Speed sure matters too; you can't use too slow a compressor.
For whatever I do, 7-zip is totally superior in compression and
relatively fast, too. LZO beats it speed-wise big time, no surprises there.
Now in the compression world, even doing something tiny differently
already gets called a different algorithm, so if I quoted names
I'd get it all wrong, of course.
Huffman coding and similar stuff that compresses independent buckets,
that's pretty simple of course.
The 'new generation' compressors, which have been around for 15+ years
by now, do multidimensional compression,
something far superior for huge datasets of course, as there are
always multidimensional relationships.
Most such compressors are rather slow. 7-zip is relatively fast.
I'm a bit amazed that there is no good 7-zip port for Linux. In fact
I had to compile something myself for Scientific Linux,
and it is some totally crappy 'decompress only' sort of 7-zip.
The builds I found via Google all failed to work: wrong ELF binaries
here, wrong glibc there, etc.
So then I compared LZO with another really fast compressor, written
by Andrew Kadatch. It's also from the 90s, and the huge advantage for
computer chess is that it decompresses into buckets. I'm not sure
whether the compressor is 'free'. The decompressor code can be found
in several EGTB programs, among them Eugene Nalimov's.
It's also Huffman-based, and in contrast to LZO, Andrew Kadatch's code
is a tad better at analyzing the data. Realize it doesn't use a
dictionary (dictionaries explain much of today's better compression
standards); it really compresses things in buckets, which is an
advantage if you just want those 64 kilobytes of data out of the
database at some random spot in the file...
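The bucket idea can be sketched like this. This is not Kadatch's actual code, just a minimal illustration using Python's zlib as a stand-in codec: each 64 KiB bucket is compressed independently and indexed, so a random probe decompresses one bucket instead of the whole file.

```python
import io
import zlib

BUCKET = 64 * 1024  # 64 KiB buckets, matching the EGTB access pattern

def compress_in_buckets(data: bytes):
    """Compress each 64 KiB bucket independently and record an index of
    (offset, length) pairs so any bucket can be fetched on its own."""
    out = io.BytesIO()
    index = []
    for pos in range(0, len(data), BUCKET):
        blob = zlib.compress(data[pos:pos + BUCKET])
        index.append((out.tell(), len(blob)))
        out.write(blob)
    return out.getvalue(), index

def read_bucket(compressed: bytes, index, n: int) -> bytes:
    """Decompress only bucket n; the rest of the file is never touched."""
    off, length = index[n]
    return zlib.decompress(compressed[off:off + length])
```

The price is a slightly worse ratio (each bucket starts with empty statistics), which is exactly the trade-off a dictionary-free bucket compressor makes for random access.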
Here are the results:
The original file used for every compressor. A small EGTB of 1.8GB:
-rw-rw-r--. 1 diep diep 1814155128 Aug 19 10:37 knnknp_w.dtb
LZO (default compression):
-rw-rw-r--. 1 diep diep 474233006 Aug 19 10:37 knnknp_w.dtb.lzo
7-zip (default compression):
-rw-rw-r--. 1 diep diep 160603822 Aug 18 19:33 ../7z/33p/knnknp_w.dtb.7z
Kadatch EMD:
-rw-rw-r--. 1 diep diep 334258087 Aug 19 14:37 knnknp_w.dtb.emd
We see Kadatch is about 140MB smaller in size than LZO. That's a lot
at 474MB total size for the LZO file; the difference alone is roughly
8% of the original data.
So LZO in fact is so bad it doesn't even beat another Huffman
compressor: a fast bucket compressor that doesn't use a dictionary at all.
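The arithmetic behind that, taken straight from the file sizes above:

```python
original = 1814155128  # knnknp_w.dtb
lzo      = 474233006   # knnknp_w.dtb.lzo
sevenzip = 160603822   # knnknp_w.dtb.7z
emd      = 334258087   # knnknp_w.dtb.emd (Kadatch)

# compressed size as a fraction of the original file
for name, size in [("LZO", lzo), ("7-zip", sevenzip), ("EMD", emd)]:
    print(f"{name}: {size / original:.1%} of original")

# how much smaller the EMD output is than the LZO output
diff = lzo - emd
print(f"EMD is {diff / 1e6:.0f} MB smaller than LZO")
```

That prints roughly 26.1% for LZO, 8.9% for 7-zip and 18.4% for EMD, with EMD about 140 MB smaller than LZO.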
As for decompression speeds (not compression, the DECOMPRESSION
speed is what matters to me), EMD is roughly the same speed as LZO.
Maybe EMD is a tad faster, but you won't notice it. Of course, being a
really old program, EMD has a 32-bit limit: it won't work over 2GB,
and the files I process will be quite a lot larger. More like 162.5GB
each :)
So I'll search some more for which program to use.
Note that for really good compression there are of course much better
compressing programs than 7-zip.
Much, much better. There are some guys busy with this. Most of them
really take a long time to decompress though (besides compression time),
which makes them less interesting.