[Beowulf] Re: [Bioclusters] servers for bio web services setup

Thu Jan 13 15:18:27 PST 2005

> He wants to set up a bio facility which provides web/grid services
> (probably Axis or GT3/4) to a substantial user community (UK-wide but
> with access control, so probably in the region of hundreds or perhaps
> thousands of potential users). Services will include the usual things
> like BLAST, ClustalW, protein structure analysis etc. -- probably
> a small subset of what EBI offers.

A couple of things to consider in general:

1.  Some of these back end jobs can generate enormous output
files.  If you let somebody queue up a 1000 entry FASTA file and
use the default BLAST format with 50 alignments each to search
the nt database - ugh!  You definitely don't want those coming
back through your front end machines if at all possible.  You
might, for instance, set up the back end nodes to email the
results directly, or to email a page with a link to the results.
Unless a job's results are tiny, the most you're probably going to
want the front end machine to present is a page that looks like:

   Your XXXXX job finished at 21:09 GMT
   Results (link)
   Error messages (link)
   Other (link)
   Parameters (link)

where all the links go out to different machines, to spread the
load around.
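
As a rough illustration of how small that page can be, here is a
minimal Python sketch that builds it from a job name, a finish time,
and a set of links.  The function name, the job number, and the
example.org host names are all made up for the example; nothing here
is tied to a particular web framework.

```python
def results_page(job_name, finished, links):
    """Return the short plain-text summary the front end would show.

    `links` maps a label (e.g. "Results") to a URL that lives on some
    other machine, so the front end never serves the large files itself.
    """
    lines = [f"Your {job_name} job finished at {finished}"]
    for label, url in links.items():
        lines.append(f"{label} ({url})")
    return "\n".join(lines)


# Hypothetical hosts: each link points at a different back end machine.
page = results_page(
    "BLAST",
    "21:09 GMT",
    {
        "Results": "http://node07.example.org/jobs/1234/results.txt",
        "Error messages": "http://node08.example.org/jobs/1234/errors.txt",
        "Other": "http://node09.example.org/jobs/1234/other.txt",
        "Parameters": "http://node10.example.org/jobs/1234/params.txt",
    },
)
print(page)
```

The same text could just as well go out in the notification email.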

2.  Even if the result is only a million bytes or so, you do not
want the users loading those pages directly in their browsers.
Browsers can take a really long time to open a file like that, but
they can typically download it very fast.  Have them right-click,
download the file, and then open it in a faster text viewer.  (Most
of the results will be text.)  This may not change the load on
your server much, but it can make a big difference in the end users'
perception of the speed of your service.

3.  Sanity check everything for valid parameters and expected run
times.  Let's say you provide an interface to Phylip.  Do you
really want to let somebody stuff a 200 sequence alignment
into DNAPENNY?  Not unless you want to lock up the back end
machine for the next hundred years.  It can be pretty tricky to
figure out ahead of time how long a job may run, but do the best you
can so that at least in some cases the web interface can tell
the users up front to change the job parameters.  And on the
back end absolutely set some maximum CPU time limit
for jobs.  Better an email "your job was terminated after one
hour" than annoyed end users constantly emailing you asking where
their jobs went.
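
A minimal Python sketch of both halves of this point, under some
loud assumptions: the per-program sequence limits below are invented
numbers, not recommendations, and the CPU cap uses the Unix-only
resource module rather than whatever limit your batch system offers.
The names check_job and run_limited are hypothetical.

```python
import resource
import subprocess

# Invented limits on the number of input sequences per program.
# Real values would have to come from timing runs on your own hardware.
MAX_SEQUENCES = {"dnapenny": 12, "blast": 100}
CPU_LIMIT_SECONDS = 3600  # one hour, matching the example message above


def check_job(program, n_sequences):
    """Return an error message for the user, or None if the job looks sane."""
    limit = MAX_SEQUENCES.get(program)
    if limit is None:
        return f"unknown program: {program}"
    if n_sequences > limit:
        return f"{program} accepts at most {limit} sequences, got {n_sequences}"
    return None


def run_limited(argv, cpu_seconds=CPU_LIMIT_SECONDS):
    """Run argv with a CPU-time limit set in the child process (Unix only)."""
    def set_limit():
        # Soft and hard limit are the same, so the kernel ends the job
        # once it has used this much CPU time.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
    return subprocess.run(argv, preexec_fn=set_limit)


# The web interface can reject the job before it ever reaches the queue.
print(check_job("dnapenny", 200))
```

A job killed by the limit comes back from run_limited with a negative
return code, which is the cue to send the "your job was terminated"
email.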

4.  If at all possible provide the run time parameters back to
the end users.  People tend to just print the result off the web
page and, if the program doesn't echo the parameters, when they
go back later they can never remember how they ran a particular
program.  It's also useful for catching bugs in the web interface.
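
One cheap way to do this, sketched in Python, is to generate a small
parameter header that can be prepended to the result or served as its
own file.  The function name and the parameter names are made up for
the example; the point is only that the record is produced by the
same code that launches the job, so it shows what was actually run.

```python
def parameter_header(program, params):
    """Return a text block recording how a job was run.

    Parameters are sorted by name so two runs are easy to compare.
    """
    lines = [f"# program: {program}"]
    for name in sorted(params):
        lines.append(f"# {name} = {params[name]}")
    return "\n".join(lines) + "\n"


# Hypothetical parameter names for a BLAST search against nt.
header = parameter_header("blastn", {"database": "nt", "alignments": 50})
print(header, end="")
```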

5.  If the load is really significant you're going to want at least
two, and maybe more, front end web servers.  I.e., www.yourservice.org
connects at random to www01.yourservice.org,
www02.yourservice.org, etc. That will both split the load and
reduce the effect of a downed front end server.  If all the
computation is going out onto a grid these machines won't
need much local storage but would presumably need reasonably
fast network connections.
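
The selection logic itself is tiny.  Here is a Python sketch of "pick
a front end at random, skipping any that are down"; in practice this
job is often handled by round-robin DNS or a load balancer rather
than by your own code, and the is_up check is a stand-in for whatever
health test you have.

```python
import random

FRONT_ENDS = ["www01.yourservice.org", "www02.yourservice.org"]


def pick_front_end(hosts, is_up, rng=random):
    """Choose one live front end at random.

    `is_up` is a caller-supplied function (hypothetical here) that
    returns True if the given host is currently answering requests.
    """
    alive = [h for h in hosts if is_up(h)]
    if not alive:
        raise RuntimeError("no front end server is available")
    return rng.choice(alive)


# With www01 down, every request lands on www02.
print(pick_front_end(FRONT_ENDS, lambda h: h != "www01.yourservice.org"))
```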

> Would a single high-spec machine be sufficient for this kind of thing? 
> Or would one have several servers doing the same thing in parallel? 

Depends on what the front end server is doing.  If it's just shuffling
smallish requests off to the end compute nodes it needn't be
very large.  If it's spooling hundreds of 10 MB result files
per second and then sending those off to the end users interactively
it's going to have to be monstrously large (ditto for your
network connections).

That is, we can't really answer that question specifically until
you tell us how much data needs to be stored locally, processed
locally, and shipped in and out through the network.
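
To show why those numbers matter, here is the back-of-the-envelope
arithmetic for the extreme case above, as a short Python sketch.  It
assumes "hundreds" means 100 files per second and that each file is
10 megabytes (10 million bytes); both figures are placeholders for
whatever your real workload turns out to be.

```python
def required_bandwidth_gbit(files_per_second, file_size_mb):
    """Outbound bandwidth in gigabits per second for a steady stream of files."""
    bytes_per_second = files_per_second * file_size_mb * 1_000_000
    return bytes_per_second * 8 / 1e9


# 100 files/s at 10 MB each is 1 GB/s, i.e. 8 Gbit/s out of the front end.
print(required_bandwidth_gbit(100, 10))
```

That is before counting the same data arriving from the compute
nodes, which is the reason to keep results off the front end in the
first place.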

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech