[Beowulf] Accelerator for data compressing

Vincent Diepeveen diep at xs4all.nl
Fri Oct 3 03:53:49 PDT 2008


Hi Carsten,

Ah, you googled for two seconds and found some outdated homepage.

Try this site instead: www.maximumcompression.com

Far better testing over there. Note, though, that it's the same
test set there that gets compressed every time. In real life,
database-type data has all kinds of patterns that PPM-type
compressors pick up.

My experience is that at the terabyte level, the better compressors
at maximumcompression.com are either a bit too slow (PAQ) or not
actually better than simple tools like 7-zip.

Look especially at compressed sizes and decompression times.

The only thing you want to limit is the amount of bandwidth going
over your network. Really good compression is very helpful there.
How long compression takes is hardly relevant, as long as it doesn't
take a near-infinite amount of time (I remember a New Zealand
compressor which took 24 hours to compress 100MB of data). Note that
we have already reached the point where compression time hardly
matters; you can buy a GPU to offload that work from your servers.

Query time (i.e. decompression time) is important, though.
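
To see that asymmetry for yourself, here is a minimal sketch (Python
standard library, with bz2 standing in for the real tools, and
made-up sample data) that compresses once and decompresses
repeatedly, the way a query workload would:

import bz2, time

# Made-up sample: repetitive "database-like" records, compressed once
# and decompressed many times -- the access pattern of query traffic.
data = b"record,field1,field2,field3\n" * 200_000   # roughly 5-6 MB

t0 = time.perf_counter()
packed = bz2.compress(data, compresslevel=9)
t1 = time.perf_counter()
for _ in range(10):                  # queries hit the data repeatedly
    unpacked = bz2.decompress(packed)
t2 = time.perf_counter()

assert unpacked == data
print("ratio: %.1fx" % (len(data) / float(len(packed))))
print("compress once:  %.2f s" % (t1 - t0))
print("decompress 10x: %.2f s" % (t2 - t1))

Each decompression pass should come out well ahead of the single
compression, which is exactly the property you want on the query side.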

If we look at the results table there:


Pos  Program       Switches            Size (bytes)  Reduction %  Bits/byte
026  7-Zip 4.60b   -m0=ppmd:o=4             764420       81.58      1.4738
..
094  BZIP2 1.0.5   -9                       890163       78.55      1.7162
..
158  PKZIP 2.50    -exx                    1250536       69.86      2.4110
159  HIT 2.10      -x                      1250601       69.86      2.4111
160  GZIP 1.3.5    -9                      1254351       69.77      2.4184
161  ZIP 2.2       -9                      1254444       69.77      2.4185
162  WINZIP 8.0    (Max Compression)       1254444       69.77      2.4185

Note that a real supercompressor shrinks it even further:
003  WinRK 3.0.3   PWCM 912MB               568919       86.29      1.0969

Again, all these tests are at the micro level: just a few megabytes
of data get compressed. You don't build a big infrastructure for a
few megabytes, so that's not so relevant.

The traffic over your network dominates there, and there are plenty
of idle server cores; in fact, there are so many companies now
buying dual cores because they do not know how to keep the cores in
quad cores busy.

This is all at the micro level. Things really change when you have
terabytes to compress and HUGE files. Bzip2 is painfully slow on
gigabyte-sized files; 7-zip totally beats it there.
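
For huge files the key is streaming. Here is a minimal sketch
(Python standard library; lzma is the same algorithm family 7-zip
uses by default, and "big.dat" is just a placeholder name) that
pushes a file through the compressor in chunks:

import lzma

# Minimal sketch: stream a big file through LZMA in 1 MiB chunks,
# so the file never has to fit in memory.
# "big.dat" is a placeholder name, not from the original post.
CHUNK = 1 << 20

comp = lzma.LZMACompressor(preset=6)
with open("big.dat", "rb") as src, open("big.dat.xz", "wb") as dst:
    while True:
        block = src.read(CHUNK)
        if not block:
            break
        dst.write(comp.compress(block))
    dst.write(comp.flush())   # emit whatever is still buffered

The chunked loop means a multi-gigabyte file passes through in
constant memory; the same pattern works with bz2.BZ2Compressor if
you want to compare the two directly.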

Vincent


On Oct 3, 2008, at 11:27 AM, Carsten Aulbert wrote:

> Hi all
>
> Bill Broadley wrote:
>>
>> Another example:
>> http://bbs.archlinux.org/viewtopic.php?t=11670
>>
>> 7zip compress: 19:41
>> Bzip2 compress:  8:56
>> Gzip compress:  3:00
>>
>> Again 7zip is a factor of 6 and change slower than gzip.
>
> Have you looked into threaded/parallel bzip2?
>
> freshmeat has a few of those, e.g.
>
> http://freshmeat.net/projects/bzip2smp/
> http://freshmeat.net/projects/lbzip2/
>
> (with the usual disclaimer that I haven't tested them myself).
>
> HTH
>
> carsten
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit  
> http://www.beowulf.org/mailman/listinfo/beowulf



