[Beowulf] Accelerator for data compressing

Carsten Aulbert carsten.aulbert at aei.mpg.de
Fri Oct 3 04:49:04 PDT 2008


Hi Vincent,

Vincent Diepeveen wrote:
> Ah you googled 2 seconds and found some oldie homepage.

Actually no, I just looked at my freshmeat watchlist of items still to
look at :)

> 
> Look especially at compressed sizes and decompression times.

Yeah, I'm currently looking at
http://www.maximumcompression.com/data/summary_mf3.php

We have a Gbit network, so for us this test is moot: 7-zip takes close
to 5 minutes to compress the 311 MB data set, which we could push over
the network uncompressed in less than 5 seconds, i.e. in this case plain
tar would be our favorite ;)

> 
> The only thing you want to limit over your network is the amount of
> bandwidth over your network.
> A real good compression is very helpful then. How long compression time
> takes is nearly not relevant,
> as long as it doesn't take infinite amounts of time (i remember a new
> zealand compressor which took 24
> hours to compress a 100MB data). Note that we are already at a phase
> that compression time hardly
> matters, you can buy a GPU for that to offload your servers for that.
> 

No, quite the contrary. I would like to use a compressor within a pipe
to increase the throughput over the network, i.e. to get around the
~ 120 MB/s limit.
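A minimal sketch of what I mean (the paths are demo-only and gzip -1
merely stands in for whatever compressor is fast enough; in a real copy
the part after the compressor would run on the receiving host, e.g.
behind ssh or netcat):

```shell
#!/bin/sh
set -e
# Compression inside the pipe. Here both halves run locally in one
# pipeline; over the wire the sender would run "tar cf - ... | gzip -1"
# and the receiver "gunzip | tar xf -".
src=$(mktemp -d); dst=$(mktemp -d)
dd if=/dev/urandom of="$src/chunk" bs=1M count=4 2>/dev/null
tar cf - -C "$src" chunk | gzip -1 | gunzip | tar xf - -C "$dst"
cmp -s "$src/chunk" "$dst/chunk" && echo "round-trip OK"
```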

> Query time (so decompression time) is important though.
> 

No, for us that number is at least as important as the compression time.

Imagine the following situation:

We have a file server holding close to 10 TB of data in nice chunks of
about 100 MB each. We buy a new server with new disks which can hold
20 TB, and we would like to copy the data over. So for us the more
important figure is the compression/decompression speed, which should be
>> 100 MB/s on our systems.

If 7-zip can only compress data at a rate of, say, less than 5 MB/s
(input data), copying the data over uncompressed is much, much faster,
regardless of how many unused cores the system has. Exactly for these
cases I would like to use all available cores to compress the data fast
enough to increase the throughput.
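To put numbers on it (the figures below are hypothetical, not
measurements): the effective input-data throughput of such a pipe is
bounded by its slowest stage, i.e. min(compress speed, decompress speed,
link speed divided by compression ratio):

```shell
#!/bin/sh
# Effective input-data throughput of a compress pipe = slowest stage:
# min(compress MB/s, decompress MB/s, link MB/s / ratio), where
# ratio = compressed size / original size. All numbers hypothetical.
eff() { awk -v c="$1" -v d="$2" -v l="$3" -v r="$4" \
  'BEGIN { m = c; if (d < m) m = d; if (l/r < m) m = l/r; print m }'; }

link=120                          # ~ Gbit wire speed, MB/s
slow=$(eff 5 50 "$link" 0.25)     # 7-zip-like: tight 4:1 ratio, 5 MB/s in
fast=$(eff 300 400 "$link" 0.5)   # fast compressor: loose 2:1 ratio
echo "7-zip-like pipe: $slow MB/s, fast pipe: $fast MB/s, plain copy: $link MB/s"
```

So the slow-but-tight compressor caps the copy at its own 5 MB/s, while a
fast compressor with only a 2:1 ratio would beat the bare wire.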

Am I missing something vital?

Cheers

Carsten


