[Beowulf] Compressor LZO versus others

Vincent Diepeveen diep at xs4all.nl
Sun Aug 19 06:09:56 PDT 2012


On Aug 18, 2012, at 3:19 PM, Ellis H. Wilson III wrote:

[snip]

>
> As a side-note -- Hadoop provides support for compression on transfers
> that might help you immensely.  You can pick from a few, but LZO tends
> to be the best one for speed/compression for my workloads.  This could
> really help you when you need to do that 1 pass where all nodes are
> exchanging with each other.
>
> Best of luck!
>
> ellis

Hi,

I compared compressors on my own dataset. Let me first of all mention
that the official compressor benchmarks are a joke.
They are tiny benchmarks with lots of different sorts of stuff to
compress.

Now we all realize how important all that is.

To compress LARGE data, you suddenly need a totally different sort of
compressor. Speed sure matters; you can't use too slow a compressor.

For whatever I do, 7-zip is totally superior in compression ratio and
relatively fast at compressing.
LZO beats it speedwise, big time. No surprises there.
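
For anyone who wants to repeat this kind of comparison on their own
data, here is a minimal sketch using only the Python standard library.
It is not what I ran for the numbers below. LZO is not in the standard
library, so zlib at level 1 stands in as the "fast" compressor, and the
lzma module stands in for 7-zip's default method. The helper names
(measure, compare) and the sample data are made up for illustration.

```python
import lzma
import time
import zlib


def measure(name, compress, decompress, data):
    """Return (name, compressed size, compress seconds, decompress seconds)."""
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    unpacked = decompress(packed)
    t2 = time.perf_counter()
    assert unpacked == data  # the round trip must be lossless
    return name, len(packed), t1 - t0, t2 - t1


def compare(data):
    """Run every stand-in codec over the same data."""
    codecs = [
        # zlib level 1: stand-in for a fast compressor (NOT LZO itself)
        ("zlib-1 (fast stand-in)", lambda d: zlib.compress(d, 1), zlib.decompress),
        # lzma: same algorithm family as 7-zip's default method
        ("lzma (7-zip family)", lzma.compress, lzma.decompress),
    ]
    return [measure(n, c, d, data) for n, c, d in codecs]


if __name__ == "__main__":
    # Hypothetical sample; replace it with a large slice of your real data.
    sample = bytes(range(256)) * 4096
    for name, size, t_c, t_d in compare(sample):
        print(f"{name:24s} {size:10d} bytes  comp {t_c:.3f}s  decomp {t_d:.3f}s")
```

The point is only that you measure size, compression time and
decompression time separately, on your own data rather than on a tiny
mixed benchmark.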

Now even doing something tiny differently already gets called a
different algorithm in the compression world, so if I quote
algorithm names,
I will have it all wrong of course.

Huffman and similar stuff that's basically able to compress buckets,
that's pretty simple of course.
The 'new generation' compressors, which have been here for some 15+
years by now, are doing multidimensional compression,
something far superior for huge datasets of course, as there are
always multidimensional relationships.

Most such compressors are rather slow. 7-zip is relatively fast.

I'm a bit amazed that there is no good 7-zip port for Linux. In fact
I had to compile something myself for Scientific Linux,
and it is some total crap 'decompress only' sort of 7-zip.
The stuff I found via Google all didn't work. Wrong ELF binaries
this or wrong glibc that, etc.

So then I compared LZO with another really fast compressor, programmed
by Andrew Kadatch. It's also from the 90s, and the huge advantage for
computer chess is that it decompresses into buckets. Not sure whether
the compressor is 'free'. The decompressor code you can find in
several EGTB programs, among them the one from Eugene Nalimov.

It's also Huffman, and unlike LZO, Andrew Kadatch's compressor is a tad
better at analyzing the data. Realize it doesn't use a dictionary,
which is what explains much of today's better compression standards; it's
really compressing things in buckets, which is an advantage if you just
want those 64 kilobytes of data out of the database from some random
spot in the file...
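
To make the bucket idea concrete, here is a rough sketch of
bucket-wise compression with random access. This is NOT Kadatch's
format or code: zlib is swapped in as the per-bucket compressor, and
the index layout and the names pack, read_bucket and read_at are
assumptions made up for the example. The only thing it shows is that
each 64 KB bucket is compressed on its own, so one lookup decompresses
one bucket and nothing else.

```python
import zlib

BUCKET = 64 * 1024  # 64 KiB buckets, the size mentioned above


def pack(data, bucket=BUCKET):
    """Compress each bucket independently; return (blob, offsets).

    offsets[i] is where compressed bucket i starts inside blob,
    and one final entry marks the end of the last bucket.
    """
    parts, offsets, pos = [], [0], 0
    for start in range(0, len(data), bucket):
        chunk = zlib.compress(data[start:start + bucket])
        parts.append(chunk)
        pos += len(chunk)
        offsets.append(pos)
    return b"".join(parts), offsets


def read_bucket(blob, offsets, index):
    """Decompress only bucket `index`; no other bucket is touched."""
    return zlib.decompress(blob[offsets[index]:offsets[index + 1]])


def read_at(blob, offsets, position, bucket=BUCKET):
    """Return the byte at uncompressed `position`."""
    index, within = divmod(position, bucket)
    return read_bucket(blob, offsets, index)[within]
```

The price is that no bucket can borrow redundancy from its neighbours,
which is one reason a whole-file compressor can end up smaller.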

Here are the results:

The original file used for every compressor, a small EGTB of 1.8GB:

-rw-rw-r--. 1 diep diep 1814155128 Aug 19 10:37 knnknp_w.dtb

LZO (default compression):

-rw-rw-r--. 1 diep diep  474233006 Aug 19 10:37 knnknp_w.dtb.lzo

7-zip (default compression):

-rw-rw-r--. 1 diep diep 160603822 Aug 18 19:33 ../7z/33p/knnknp_w.dtb.7z

Andrew Kadatch:

-rw-rw-r--. 1 diep diep  334258087 Aug 19 14:37 knnknp_w.dtb.emd

We see Kadatch is about 140MB smaller than LZO. That's a lot against
474MB total size for the LZO file,
and it's roughly 8% of the total size of the original data.

So LZO in fact is so bad it doesn't even beat a simple Huffman
compressor. A fast bucket compressor not using a dictionary at all is
hammering it.
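
The arithmetic behind the comparison can be redone straight from the
byte counts in the listings above. A small sketch; only the sizes come
from the listings, the helper name percent_of_original is made up:

```python
ORIGINAL = 1814155128  # knnknp_w.dtb, uncompressed

# Compressed sizes in bytes, copied from the ls output above.
sizes = {
    "LZO": 474233006,
    "7-zip": 160603822,
    "Kadatch (EMD)": 334258087,
}


def percent_of_original(size, original=ORIGINAL):
    """Size expressed as a percentage of the uncompressed file."""
    return 100.0 * size / original


for name, size in sizes.items():
    print(f"{name:14s} {size:>11d} bytes  {percent_of_original(size):5.1f}% of original")

# How much smaller the Kadatch output is than the LZO output.
saved = sizes["LZO"] - sizes["Kadatch (EMD)"]
print(f"Kadatch saves {saved} bytes over LZO, "
      f"{percent_of_original(saved):.1f}% of the original file")
```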

As for decompression speed (not compression, but the DECOMPRESSION
speed is what matters to me), EMD is roughly the same speed as LZO.
Maybe EMD is a tad faster, but you won't notice it. Of course, being a
really old program, EMD has a 32-bit limit: it won't work over 2GB, and
the type of files
I process will be quite a bit larger. More like 162.5GB each :)
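
Why 2GB: presumably the file offsets are signed 32-bit integers, which
top out at 2^31 - 1 bytes. That is my assumption, I did not check the
EMD source. A quick sketch of the check, taking 162.5GB as decimal
gigabytes (also an assumption); fits_in_int32 is a made-up helper:

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit offset can hold


def fits_in_int32(n):
    """True if a file size or offset is addressable with a signed 32-bit int."""
    return 0 <= n <= INT32_MAX


small = 1814155128          # the 1.8GB test file above
large = int(162.5 * 10**9)  # assumption: 162.5GB means decimal gigabytes

print(fits_in_int32(small))  # the test file still fits
print(fits_in_int32(large))  # the real files do not
print(large.bit_length())    # bits needed for an offset into the big file
```

So the small test file squeezes in under the limit, but the real files
need 64-bit offsets throughout.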

So I'll search some further for what proggie to use.

Note that as for really good compression, there are of course much
better compressing programs than 7-zip.
Much, much better. There are some guys busy with this. Most of them
really take a long time to decompress though (besides compression time),
which makes them less interesting.










