[Beowulf] Southampton's RPi cluster is cool but too many cables?

Dr Tim Cutts tjrc at sanger.ac.uk
Tue Sep 25 22:27:28 PDT 2012



On 25 Sep 2012, at 18:01, Jesse Becker <beckerjes at mail.nih.gov> wrote:

> The .2bit FASTA[1] format specifically compresses the ACGT data into
> 2 bits (T:00, C:01, A:10, G:11), plus some header/metadata information.
> Other formats such as 'VCF' specifically store variants against known
> references[2].  The current *human* reference genome is several
> gigabytes...compressed[0].
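
To make the 2-bit packing concrete, here is a minimal Python sketch using the mapping quoted above. The function names are hypothetical, and the bit order within each byte (first base in the high bits) is an assumption of this sketch. It does not model the header/metadata part of the format, only the base packing.

```python
# Mapping as quoted above: T:00, C:01, A:10, G:11
CODE = {"T": 0b00, "C": 0b01, "A": 0b10, "G": 0b11}
BASE = "TCAG"  # index i holds the base whose 2-bit code is i


def pack_bases(seq):
    """Pack an ACGT string into bytes, four bases per byte.

    Assumption: the first base of each group goes in the high bits.
    The final byte is padded with zero bits if len(seq) is not a
    multiple of four, so the caller must remember the true length.
    """
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # zero-pad a short final chunk
        out.append(byte)
    return bytes(out)


def unpack_bases(data, n):
    """Recover the first n bases from bytes produced by pack_bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(byte >> shift) & 0b11])
    return "".join(bases[:n])
```

With this, `pack_bases("GATTACA")` needs 2 bytes instead of 7, and `unpack_bases(packed, 7)` returns the original string.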

As Jesse says, a lot of work has gone into compressing the actual called data (including a recent competition, won narrowly by James Bonfield, who works here).  He used reference-based compression, amongst other techniques.
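
For anyone unfamiliar with the idea, reference-based compression can be sketched roughly as follows. This is only a toy illustration, not the method used in the competition entry: it assumes the read is already aligned to a known position and differs from the reference by substitutions only, and the function names are hypothetical.

```python
def encode_against_reference(reference, read, pos):
    """Describe a read as (pos, length, mismatches) relative to a reference.

    Assumptions of this sketch: the read aligns at `pos` with no
    insertions or deletions, so only substituted bases are stored.
    """
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs off the end of the reference")
    diffs = [(i, b) for i, (r, b) in enumerate(zip(window, read)) if r != b]
    return pos, len(read), diffs


def decode_against_reference(reference, record):
    """Rebuild the read from the reference plus the stored mismatches."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)
```

A read that matches the reference exactly is reduced to a position and a length; each mismatch adds one small (offset, base) pair.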

The called bases are the easy bit to store and compress, though.  Rather harder are the quality values, which give a measure of the confidence in each called base.  These take up rather more bits than the called bases themselves, and you can't discard them because they're vital for the QC and assembly processes.
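
A back-of-envelope way to see the size difference, as a small Python sketch. The figure of 41 distinct quality values (scores 0 to 40) is an assumption for illustration only; the real range depends on the instrument and pipeline. The helper names are hypothetical.

```python
import math
from collections import Counter


def fixed_width_bits(alphabet_size):
    """Bits per symbol needed for a fixed-width code over the alphabet."""
    return math.ceil(math.log2(alphabet_size))


def entropy_bits_per_symbol(symbols):
    """Shannon entropy of the observed symbol frequencies, in bits.

    This is a lower bound for any coder that treats symbols as
    independent; it says nothing about what context modelling can do.
    """
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


base_bits = fixed_width_bits(4)      # A, C, G, T
quality_bits = fixed_width_bits(41)  # assumed: quality scores 0..40
```

Under that assumption a quality value needs 6 bits at fixed width against 2 for the base, and running `entropy_bits_per_symbol` over a real quality string shows how much of that a simple entropy coder could claw back.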

Tim

-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


