From jdmelo at leca.ufrn.br  Fri Jun  2 06:10:54 2000
From: jdmelo at leca.ufrn.br (Jorge Dantas de Melo)
Date: Fri, 02 Jun 2000 10:10:54 -0300
Subject: Mirinet and Beowulf
Message-ID: <3937B25E.C556CD45@leca.ufrn.br>

Hi,
We are interested in building a Beowulf cluster using Myrinet. Could anyone
give us some information about research groups that have done the
same?
Thanks,
Prof. Jorge Dantas de Melo
Computing Engineering and Automation Laboratory
Federal University of Rio Grande do Norte


From fryman at lw.net  Fri Jun  2 08:09:28 2000
From: fryman at lw.net (J. Fryman)
Date: Fri, 02 Jun 2000 11:09:28 -0400
Subject: Jobs with Beowulf systems?
Message-ID: <3937CE28.99FD99DB@lw.net>

Hi all,

Is there any location of jobs available working on/with/etc beowulf and
similar cluster systems?  I've checked the obvious places, but haven't
turned up anything positive.

Tips and pointers would be welcome.

Josh Fryman
fryman at lw.net


From glindahl at hpti.com  Fri Jun  2 08:22:24 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Fri, 2 Jun 2000 11:22:24 -0400
Subject: Jobs with Beowulf systems?
In-Reply-To: <3937CE28.99FD99DB@lw.net>
Message-ID: <003801bfcca6$5acd3aa0$0932fea9@hptilap.hpti.com>

> Is there any location of jobs available working on/with/etc beowulf and
> similar cluster systems?  I've checked the obvious places, but haven't
> turned up anything positive.

This would be a cool thing to have -- I'd like to be able to point my
recruiting people at a website and say "advertise *here*, find candidates
*here*..."

-- greg


From rgb at phy.duke.edu  Fri Jun  2 08:59:16 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Fri, 2 Jun 2000 11:59:16 -0400 (EDT)
Subject: Jobs with Beowulf systems?
In-Reply-To: <003801bfcca6$5acd3aa0$0932fea9@hptilap.hpti.com>
Message-ID: <Pine.LNX.4.10.10006021155030.17745-100000@ganesh.phy.duke.edu>

On Fri, 2 Jun 2000, Greg Lindahl wrote:

> > Is there any location of jobs available working on/with/etc beowulf and
> > similar cluster systems?  I've checked the obvious places, but haven't
> > turned up anything positive.
> 
> This would be a cool thing to have -- I'd like to be able to point my
> recruiting people at a website and say "advertise *here*, find candidates
> *here*..."

Well, I personally certainly don't object to a limited amount of
recruiting via the list (having used it myself to that effect in the
past).  Posting beowulf-specific job openings one time, or posting an
announcement one time that you are beowulf-skilled and looking for work
(see attached CV) should likely be tolerated, as the community in either
direction is likely to be small and in many cases connected only via
this list.

However, I agree that a "beowulf bulletin board" would be a useful thing
to have and is more appropriate in the long run.  Dwight was planning to
set one up on www.supercomputer.org; or it might be a good idea to add
an associated pair of lists to the mailman server on scyld.com when this
is all running.  beowulf-jobs is a reasonable list to have, and because
mailman archives the list in web-accessible form, it would be very
simple for employers or jobseekers to post there or search for recent
posts.

    rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From feldy at myri.com  Fri Jun  2 09:17:25 2000
From: feldy at myri.com (Bob Felderman)
Date: Fri, 2 Jun 2000 09:17:25 -0700 (PDT)
Subject: Mirinet and Beowulf
Message-ID: <200006021617.JAA21988@myri.com>

=> Hi,
=> We are interested in building a Beowulf cluster using Myrinet. Could anyone
=> give us some information about research groups that have done the
=> same?
=> Thanks,
=> Prof. Jorge Dantas de Melo
=> Computing Engineering and Automation Laboratory
=> Federal University of Rio Grande do Norte

Here's an old list of some projects
http://www.myri.com/myrinet/customer_projects/index.html


From vor+ at pitt.edu  Fri Jun  2 16:09:44 2000
From: vor+ at pitt.edu (Victor Ortega)
Date: Fri, 02 Jun 2000 19:09:44 -0400 (EDT)
Subject: Jobs with Beowulf systems?
In-Reply-To: <003801bfcca6$5acd3aa0$0932fea9@hptilap.hpti.com>
Message-ID: <Pine.GSO.3.96L.1000602190627.4406E-100000@unixs2.cis.pitt.edu>

On Fri, 2 Jun 2000, Greg Lindahl wrote:
> > Is there any location of jobs available working on/with/etc beowulf and
> > similar cluster systems?  I've checked the obvious places, but haven't
> > turned up anything positive.
> 
> This would be a cool thing to have -- I'd like to be able to point my
> recruiting people at a website and say "advertise *here*, find candidates
> *here*..."

I agree--sounds like a good idea.  Perhaps in a corner on beowulf.org.

Victor


From jok707s at mail.smsu.edu  Sat Jun  3 03:55:53 2000
From: jok707s at mail.smsu.edu (jok707s at mail.smsu.edu)
Date: Sat, 3 Jun 2000 05:55:53 -0500
Subject: Stock Trading &c
Message-ID: <39357206@caliber>

Does anyone have info on the porting of stock trading software to clusters?  
For example, there is a list of financial/stock programs at:

http://linux.com/links/Software/Financial/

How many of these programs are worth parallelizing?  Who has actually tried 
it?

Joel


From deadline at plogic.com  Sat Jun  3 07:53:16 2000
From: deadline at plogic.com (Douglas Eadline)
Date: Sat, 3 Jun 2000 10:53:16 -0400 (EDT)
Subject: Jobs with Beowulf systems?
In-Reply-To: <Pine.GSO.3.96L.1000602190627.4406E-100000@unixs2.cis.pitt.edu>
Message-ID: <Pine.LNX.4.10.10006031052210.8277-100000@lisa.plogic.com>

On Fri, 2 Jun 2000, Victor Ortega wrote:

> On Fri, 2 Jun 2000, Greg Lindahl wrote:
> > > Is there any location of jobs available working on/with/etc beowulf and
> > > similar cluster systems?  I've checked the obvious places, but haven't
> > > turned up anything positive.
> > 
> > This would be a cool thing to have -- I'd like to be able to point my
> > recruiting people at a website and say "advertise *here*, find candidates
> > *here*..."
> 
> I agree--sounds like a good idea.  Perhaps in a corner on beowulf.org.

Or perhaps on Beowulf Underground

Doug
-------------------------------------------------------------------
Paralogic, Inc.           |     PEAK     |      Voice:+610.814.2800
130 Webster Street        |   PARALLEL   |        Fax:+610.814.5844
Bethlehem, PA 18015 USA   |  PERFORMANCE |    http://www.plogic.com
-------------------------------------------------------------------


From seth at hogg.org  Sat Jun  3 09:46:07 2000
From: seth at hogg.org (Simon Hogg)
Date: Sat, 03 Jun 2000 17:46:07 +0100
Subject: Mixed distros in one  cluster
Message-ID: <4.3.1.2.20000603174137.00b365d0@icex5.cc.ic.ac.uk>

Is there an inherent drawback in using different distributions in one 
cluster (apart from more complicated maintenance)?

They should all work together, anyway, right?

Suse, Redhat and Debian is what I've got - are there any special 
considerations for this combination that anyone can think of?

Of course, I will migrate everything to one distro at some time (probably 
Debian) but different people want to 'play' with different distros, and 
this is not a production cluster, so it might even make things more 
interesting!

--
Simon Hogg


From rbross at parl.ces.clemson.edu  Sat Jun  3 11:07:18 2000
From: rbross at parl.ces.clemson.edu (Rob Ross)
Date: Sat, 3 Jun 2000 14:07:18 -0400 (EDT)
Subject: Jobs with Beowulf systems?
In-Reply-To: <Pine.LNX.4.10.10006031052210.8277-100000@lisa.plogic.com>
Message-ID: <Pine.LNX.4.10.10006031406350.16551-100000@hell>

People are welcome to use the "Announcements" section of Beowulf
Underground to announce job openings.

Rob

On Sat, 3 Jun 2000, Douglas Eadline wrote:

> On Fri, 2 Jun 2000, Victor Ortega wrote:
> 
> > On Fri, 2 Jun 2000, Greg Lindahl wrote:
> > > > Is there any location of jobs available working on/with/etc beowulf and
> > > > similar cluster systems?  I've checked the obvious places, but haven't
> > > > turned up anything positive.
> > > 
> > > This would be a cool thing to have -- I'd like to be able to point my
> > > recruiting people at a website and say "advertise *here*, find candidates
> > > *here*..."
> > 
> > I agree--sounds like a good idea.  Perhaps in a corner on beowulf.org.
> 
> Or perhaps on Beowulf Underground
> 
> Doug


From rgb at phy.duke.edu  Sat Jun  3 12:22:43 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Sat, 3 Jun 2000 15:22:43 -0400 (EDT)
Subject: Mixed distros in one  cluster
In-Reply-To: <4.3.1.2.20000603174137.00b365d0@icex5.cc.ic.ac.uk>
Message-ID: <Pine.LNX.4.10.10006031316490.19246-100000@ganesh.phy.duke.edu>

On Sat, 3 Jun 2000, Simon Hogg wrote:

> Is there an inherent drawback in using different distributions in one 
> cluster (apart from more complicated maintenance)?
> 
> They should all work together, anyway, right?
> 
> Suse, Redhat and Debian is what I've got - are there any special 
> considerations for this combination that anyone can think of?

Better you than me, is all I can say;-).  Actually, I agree that it
might be fun to play with and compare the different distros in a
lab/beowulf setting -- if one had nothing else to do (like real work to
do ON the clusters), and I've suggested to some Intel Brass that they
consider funding such an effort at a public facility set up for that
very purpose.  However, I predict that you'll end up doing nearly three
times as much work solving the same problems three different ways and
building stuff for (possibly) three different library sets.  Actually,
RH and SuSE will probably coexist (both RPM based, similar libraries)
but I think that Debian and RH/SuSE will fight in various ways that will
require a lot of work, at least if you plan to make the software
offerings and user environment identical on all the platforms.

For truly large operations of any sort, heterogeneity is evil.  The more
that is different, the more that is nonstandard or custom, the more work
you have to do to provide a degree of homogeneity to benighted and
ignorant users.  I fought this fight for years with different Unices
(e.g. SunOS, Irix, AIX) in a single LAN and the distilled wisdom from
the experience is summarized as:

One person can do a pretty good job of installing, administering and
maintaining one operating system on one LAN.  If things are well set up
(that is, set up scalably with a fair degree of automation and
reasonably homogeneous hardware) the SIZE of the LAN can be pretty large
(hundreds of hosts) and one person can still manage the
hardware/software end of things.  However, user support doesn't scale so
well and a standalone systems person usually gets used up by users at
the expense of hardware before getting to that many hosts (unless a lot
of them are in a beowulf cluster so there are more machines than users).

One person CAN usually do two OS's (or two LANs in different
buildings/departments) but only if they do a less than perfect job on
one.  Too much to master, too much to duplicate, too much glue (or too
far to go and one place/group of people that suits you better).

One person can not generally do a good job with three.  Usually, having
three to keep running "acceptably" prevents one from having even one of
them running "excellently well".

Now with three Linuces you're not quite equivalent to three different
general Unices.  However, I'll bet that /etc is laid out differently,
that startup scripts are different, that different variables are set and
used, that they have different install tools, that different sets of
things are provided in a "default" installation and that different
packages are collected in different ways to support things like X,
gnome, WM's in general, news and mail tools, and possibly even compilers
and basic libraries.  It won't do to have one version of Gnome running
on RH and SuSE and a different one on Debian, or to have different
compiler revisions or kernels or module sets.  Just moving between
Slackware and Red Hat, I had to learn a huge amount and make fundamental
changes in the way I did various things.  Mostly for the better, I might
add, although there are certainly still things that irritate me about
Red Hat.
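
As a rough illustration of the "different install tools" point, here is a
minimal shell sketch of the kind of glue a mixed cluster tends to need. It
assumes the usual marker files (/etc/debian_version, /etc/SuSE-release,
/etc/redhat-release) are present on the respective systems; the function
names detect_distro and list_packages_cmd are made up for this example.

```shell
#!/bin/sh
# detect_distro [ROOT]
# Guess the distribution from well-known marker files under ROOT
# (ROOT defaults to the live filesystem). Assumption: the marker
# files exist on the nodes in question.
detect_distro() {
    root=${1:-}
    if [ -f "$root/etc/debian_version" ]; then
        echo debian
    elif [ -f "$root/etc/SuSE-release" ]; then
        echo suse
    elif [ -f "$root/etc/redhat-release" ]; then
        echo redhat
    else
        echo unknown
    fi
}

# list_packages_cmd DISTRO
# Print the command that lists installed packages on that distro.
# Returns non-zero for a distro this sketch does not know about.
list_packages_cmd() {
    case $1 in
        debian)      echo "dpkg -l" ;;
        redhat|suse) echo "rpm -qa" ;;
        *)           return 1 ;;
    esac
}
```

Every admin script that touches packages, startup scripts or paths ends up
with a switch like this, which is where the extra work comes from.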

> Of course, I will migrate everything to one distro at some time (probably 
> Debian) but different people want to 'play' with different distros, and 
> this is not a production cluster, so it might even make things more 
> interesting!

Remember the Chinese curse:  "May you live in interesting times";-) I
personally hope that your experience is interesting in only the best of
ways...

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From gordan at dcs.rhbnc.ac.uk  Sat Jun  3 16:37:40 2000
From: gordan at dcs.rhbnc.ac.uk (Gordan Bobic)
Date: Sun, 4 Jun 2000 00:37:40 +0100 (BST)
Subject: Stock Trading &c
Message-ID: <Pine.OSF.4.21.0006040021270.19549-100000@platon.cs.rhbnc.ac.uk>

> Does anyone have info on the porting of stock trading 
> software to clusters?  
> For example, there is a list of financial/stock programs at:
> 
> http://linux.com/links/Software/Financial/
> 
> How many of these programs are worth parallelizing?  Who 
> has actually tried 
> it?

It depends on your exact needs, really. The software on the page you have
mentioned is all for monitoring performance of stocks. As such, it
requires very little processing power, so clusters are not really a
terribly useful platform to be porting it to.

There are other things you can do on a cluster, though.

I am currently working on a stock market trading and signalling system,
and when you think about it the right way, the parallelism is very
obvious. If you consider that there are in excess of 10,000 companies
being traded world wide, then analysing the trends in those can be
performed in parallel as 10,000 jobs running at the same time, each using
whatever your method of choice is, be it ridge/least squares regression,
support vector machines, or neural networks.

The point is, if that is the sort of thing you are working on, then you
could quite simply run all of these in parallel. The tasks involved in
detailed analysis, such as the methods mentioned above, are extremely CPU
intensive but cause very little I/O traffic to the disk, and hence to the
network. This means that your spawning/migration times are going to be
negligible compared to the CPU time consumed.

Seeing as that is the case, you might as well just slap a few machines
together and use Mosix to load balance the tasks.
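
A minimal shell sketch of the "one independent job per company" idea
follows. It only shows the job structure: analyse_one is a hypothetical
placeholder for the real CPU-heavy model fit, and run_all, the ticker file
and the batch size are names invented for this example. On a Mosix cluster
the background processes would presumably be what gets migrated, but
nothing here depends on Mosix.

```shell
#!/bin/sh
# analyse_one TICKER
# Hypothetical stand-in for the real analysis (regression, SVM, ...).
# Here it just prints the ticker and the length of its symbol.
analyse_one() {
    ticker=$1
    echo "$ticker ${#ticker}"
}

# run_all TICKER_FILE OUT_DIR MAX_JOBS
# Start one background job per ticker, at most MAX_JOBS per batch.
# Each job writes only to its own result file, so the jobs never
# need to talk to each other.
run_all() {
    tickers=$1; out=$2; max=$3
    running=0
    while read -r t; do
        [ -n "$t" ] || continue
        analyse_one "$t" > "$out/$t.result" &
        running=$((running + 1))
        if [ "$running" -ge "$max" ]; then
            wait            # block until the whole batch has finished
            running=0
        fi
    done < "$tickers"
    wait                    # collect the final partial batch
}
```

Usage would be something like: run_all tickers.txt results 8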

If you are comparing the performance of companies, and comparing each one
of them with each of the others, then you again have the situation where
you are running a bunch of identical tasks in parallel on different data.

What you could potentially save on is using the same code section with
varying data section in your program, and using this to minimize memory
usage. This is often quite effective in conserving memory on a single CPU
system, but when you start trying to spread the program over the entire
cluster, you need the program code to be running on all machines, so you
will either not save anything, or you will cause enough IO traffic between
machines to make the whole exercise not worth your while due to horrendous
overheads.

As far as the stock trading problem goes, the explanation given here is
rather trivial, but I hope that it does illustrate the kind of problem you
are likely to be facing.

Hope this helps.

Gordan


From lkchu at cs.ucsb.edu  Sat Jun  3 22:05:28 2000
From: lkchu at cs.ucsb.edu (Lingkun Chu)
Date: Sat, 3 Jun 2000 22:05:28 -0700
Subject: Multicast on channel-bonding
Message-ID: <016801bfcde2$80b93e70$017610ac@sweeper>

Hi all,

Our Beowulf cluster was recently channel bonded using the latest kernel, 2.2.15.
Everything seems okay except IP multicasting.
Using "tcpdump -i bond0", I can see that the multicast packets do reach the
relevant nodes, but the application does not always receive the corresponding
packets. Most of the packets are dropped. This happens when I connect a socket
to a MC group and then use send. When I use sendto and specify the group/port,
things work fine.

Any comments are appreciated.

Thank you.

-Lingkun


From covenant at dirac.org  Sun Jun  4 12:24:09 2000
From: covenant at dirac.org (Peter Jay Salzman)
Date: Sun, 4 Jun 2000 12:24:09 -0700 (PDT)
Subject: automating commands on nodes
Message-ID: <Pine.LNX.4.10.10006041222460.24753-100000@dirac.org>

dear all,

i'd like to:

edit /etc/profile to include /sbin and /usr/sbin in PATH
adduser jobrun

on our 40 nodes.  is there a way of doing this without telnetting 40 times?

thanks!
pete


From jakob at ostenfeld.dk  Sun Jun  4 16:23:44 2000
From: jakob at ostenfeld.dk (Jakob Østergaard)
Date: Mon, 5 Jun 2000 01:23:44 +0200
Subject: automating commands on nodes
In-Reply-To: <Pine.LNX.4.10.10006041222460.24753-100000@dirac.org>
References: <Pine.LNX.4.10.10006041222460.24753-100000@dirac.org>
Message-ID: <20000605012344.V770@ostenfeld.dk>

On Sun, 04 Jun 2000, Peter Jay Salzman wrote:

> dear all,
> 
> i'd like to:
> 
> edit /etc/profile to include /sbin and /usr/sbin in PATH
> adduser jobrun
> 
> on our 40 nodes.  is there a way of doing this without telnetting 40 times?

If you haven't already, you should set up SSH on all the nodes.  Then you can
put your public key in ~root/.ssh/authorized_keys  to allow instant login from
anywhere provided your passphrase is entered correctly at your workstation.

If you don't know about SSH, Secure Shell, you should read about it (good
pointers anyone ?)

Once you've done that, it should be a simple matter to do what you asked:
Provided the names of all your hosts are in the file /etc/hostfile:

[start up a shell under ssh-agent, type in passphrase to ssh-add]

for i in `cat /etc/hostfile`; do
  # quote the whole remote command: the remote shell parses it a second time
  ssh -l root $i "perl -pi -e 's/(PATH=\"[^\"]+)\"/\$1:\/usr\/sbin:\/sbin\"/' /etc/profile"
  ssh -l root $i adduser jobrun
done

You might want to experiment with copies of /etc/profile when doing tricks
like that....   This time I actually managed to get it right at first shot,
but your mileage might vary  ;)

-- 
................................................................
: jakob at ostenfeld.dtu.dk  : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:


From covenant at dirac.org  Sun Jun  4 17:11:58 2000
From: covenant at dirac.org (Peter Jay Salzman)
Date: Sun, 4 Jun 2000 17:11:58 -0700 (PDT)
Subject: automating commands on nodes
In-Reply-To: <20000605012344.V770@ostenfeld.dk>
Message-ID: <Pine.LNX.4.10.10006041706170.25604-100000@dirac.org>

hi jakob,

we have ssh on the beowulf frontend, but not on the nodes.  any ideas on
automating installing ssh on the nodes?  i haven't seen redhat 6.1 ssh rpms,
i guess that's a remnant of the USA's moronic crypto export policy (which i
understand was mostly lifted).

i've tried to use mandrake's ssh packages on a redhat 6.1, but redhat balked
at the mandrake rpms.

btw, i didn't know ssh had the capability to run stuff non-interactively.
that is a very cool thing to know!  thank you very much!

pete


> Date: Mon, 5 Jun 2000 01:23:44 +0200
> From: Jakob Østergaard <jakob at ostenfeld.dk>
> To: Peter Jay Salzman <covenant at dirac.org>
> Cc: Beowulf Mailing List <beowulf at beowulf.org>
> Subject: Re: automating commands on nodes
> 
> On Sun, 04 Jun 2000, Peter Jay Salzman wrote:
> 
> > dear all,
> > 
> > i'd like to:
> > 
> > edit /etc/profile to include /sbin and /usr/sbin in PATH
> > adduser jobrun
> > 
> > on our 40 nodes.  is there a way of doing this without telnetting 40 times?
> 
> If you haven't already, you should set up SSH on all the nodes.  Then you can
> put your public key in ~root/.ssh/authorized_keys  to allow instant login from
> anywhere provided your passphrase is entered correctly at your workstation.
> 
> If you don't know about SSH, Secure Shell, you should read about it (good
> pointers anyone ?)
> 
> Once you've done that, it should be a simple matter to do what you asked:
> Provided the names of all your hosts are in the file /etc/hostfile:
> 
> [start up a shell under ssh-agent, type in passphrase to ssh-add]
> 
> for i in `cat /etc/hostfile`; do
>   # quote the whole remote command: the remote shell parses it a second time
>   ssh -l root $i "perl -pi -e 's/(PATH=\"[^\"]+)\"/\$1:\/usr\/sbin:\/sbin\"/' /etc/profile"
>   ssh -l root $i adduser jobrun
> done
> 
> You might want to experiment with copies of /etc/profile when doing tricks
> like that....   This time I actually managed to get it right at first shot,
> but your mileage might vary  ;)


From jakob at ostenfeld.dk  Sun Jun  4 18:19:17 2000
From: jakob at ostenfeld.dk (Jakob Østergaard)
Date: Mon, 5 Jun 2000 03:19:17 +0200
Subject: automating commands on nodes
In-Reply-To: <Pine.LNX.4.10.10006041706170.25604-100000@dirac.org>
References: <20000605012344.V770@ostenfeld.dk> <Pine.LNX.4.10.10006041706170.25604-100000@dirac.org>
Message-ID: <20000605031917.W770@ostenfeld.dk>

On Sun, 04 Jun 2000, Peter Jay Salzman wrote:

> hi jakob,
> 
> we have ssh on the beowulf frontend, but not on the nodes.  any ideas on
> automating installing ssh on the nodes?  i haven't seen redhat 6.1 ssh rpms,
> i guess that's a remnant of the USA's moronic crypto export policy (which i
> understand was mostly lifted).

I was wondering about the restrictions myself, as RH6.2 seems to ship with
Kerberos5...    Anyway, you can find RedHat-Crypto directories at your favourite
FTP site holding ssh-1.2.27-7i as src.rpm which works nicely with RH6.1 and 6.2
at least.

I was considering OpenSSH, now that it supports the ssh-2 protocol. I never
got around to migrating to ssh-2 because of the lame license.  OpenSSH may
well be worth investigating if you're about to install SSH on a lot of
machines anyway.

No, I don't know how to automate SSH installation on a lot of nodes where
you don't have remote access (except for telnet).  Maybe you could write
up an expect script to telnet into a node and run rpm -U /mnt/somewhere/ssh-...

Actually, even if you haven't got the faintest idea about how to write
an expect script, the autoexpect program should get you started.  I managed
to write an expect script for logging into a Cisco and pulling BGP tables
in some 5 minutes or so, without _ever_ having used expect before.  The
autogenerated script will need light editing, but that should be fairly
easy once you have the basic script written for you.  Check out
autoexpect.

> 
> i've tried to use mandrake's ssh packages on a redhat 6.1, but redhat balked
> at the mandrake rpms.
> 
> btw, i didn't know ssh had the capability to run stuff non-interactively.
> that is a very cool thing to know!  thank you very much!

SSH provides the same features as rsh (but in a secure manner!), and then
some.   Really nice.

-- 
................................................................
: jakob at ostenfeld.dtu.dk  : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:


From karsten.petersen at informatik.tu-chemnitz.de  Sun Jun  4 23:32:15 2000
From: karsten.petersen at informatik.tu-chemnitz.de (Karsten Petersen)
Date: Mon, 5 Jun 2000 08:32:15 +0200 (CEST)
Subject: automating commands on nodes
In-Reply-To: <Pine.LNX.4.10.10006041706170.25604-100000@dirac.org>
Message-ID: <Pine.LNX.4.21.0006050828250.1051-100000@lola.kapet.vpn>

On Sun, 4 Jun 2000, Peter Jay Salzman wrote:
> i haven't seen redhat 6.1 ssh rpms,

you can find crypto-related RPMS for RedHat 6.1 and 6.2 on the FTP 
of RedHat Germany:
	ftp.redhat.de

besides ssh there are pgp, gpg, openssh, stunnel, openssl, mod_ssl, ...

Greets, Karsten
-- 
,-,  Student of Computer Science at Chemnitz University of Technology  ,-,
| |    EMail:  Karsten at kapet.de          WWW:  http://www.kapet.de/    | |
'-'  Home: kapet at dollerup.csn   V72 / 230    Phone: +49-177-82 35 136  '-'


From c.best at fz-juelich.de  Mon Jun  5 06:07:16 2000
From: c.best at fz-juelich.de (Christoph Best)
Date: Mon,  5 Jun 2000 15:07:16 +0200 (CEST)
Subject: Benchmarking L2 cache on the Alpha 21264
Message-ID: <14651.41221.672255.829879@verne.local>

Hi everybody,

I am having a problem benchmarking the L2 cache performance on some
Alpha 21264 systems from our clusters and am wondering if anybody else
has seen this. We use a benchmark that models the kernel of our main
application (computational physics/lattice gauge theory). When running
in L1 cache or beyond L2 cache, it gives perfectly consistent readings
with deviations of 1% or less. But in L2 cache, the numbers from
different runs may be off by as much as 20%, for which I cannot find a
good explanation. If I plot performance vs. memory footprint, there is
a clear shoulder from the L1 cache (64 KB), but then a kind of
logarithmic behavior (double the memory use loses 30 MFlops).
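
Written out, and assuming the loss per doubling really is constant over the
L2 range, the trend would look roughly like the expression below. P is the
measured performance and M the memory footprint; M_0 is a reference
footprint just past the L1 shoulder, introduced here only for illustration.

```latex
P(M) \;\approx\; P(M_0) \;-\; 30\,\log_2\!\left(\frac{M}{M_0}\right)\ \text{MFlops},
\qquad M \ge M_0
```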

The benchmark consists of a completely deterministic set of
floating-point operations, and I use a version that accesses memory
completely consecutively. The systems are Compaq DS10 (466 MHz single
proc.), ES40 (666 MHz 4-proc.), and API UP2000 (666 MHz dbl. proc.)
under Linux. I did not see this effect under Tru64 on a XP1000 (666
MHz single proc.).

The question is: Is there anything either in Linux or the 21264 that
could account for such behavior? Could the cache be polluted by other
processes that effectively? (The machines were basically idle during
benchmarks).

In particular, it seems that code running just inside the L2 cache (4
MB on the UP2000 and ES40) is not performing much better than code in
main memory, which would be a pity. We expect cache performance to be
a major determinant of total performance for our application: in L1
cache, the performance is about 600 MFlops, outside L2 cache it drops
to about 200 MFlops. Inside L2 it varies between 300 and 450 MFlops.

Thanks
-Chris
-- 
Christoph Best                                        c.best at computer.org
John von Neumann Institute for Computing/DESY   http://www.oche.de/~cbest


From RSchilling at affiliatedhealth.org  Mon Jun  5 07:59:49 2000
From: RSchilling at affiliatedhealth.org (Schilling, Richard)
Date: Mon, 5 Jun 2000 07:59:49 -0700 
Subject: Beowulf metric postings.
Message-ID: <51FCCCF0C130D211BE550008C724149E8FEC9A@mail1.affiliatedhealth.org>


Some months ago, there was a discussion about hosting beowulf metrics.  I
offered to host them on my private web site, but in the midst of that
discussion, I changed jobs (got a promotion!).  So, I lost track.  

Have metrics been posted, and if not is there still an interest?  I'd still
be happy to host the list.

Richard Schilling

From glindahl at hpti.com  Mon Jun  5 08:10:59 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Mon, 5 Jun 2000 11:10:59 -0400
Subject: Benchmarking L2 cache on the Alpha 21264
In-Reply-To: <14651.41221.672255.829879@verne.local>
Message-ID: <000601bfcf00$41706da0$f69cfea9@hptilap.hpti.com>

> But in L2 cache, the numbers from
> different runs may be off by as much as 20%, for which I cannot find a
> good explanation.

Page coloring?

> In particular, it seems that code running just inside the L2 cache (4
> MB on the UP2000 and ES40) is not performing much better than code in
> main memory, which would be a pity.

Which would be a smoking gun.

-- greg


From hjin at ceng.usc.edu  Mon Jun  5 10:54:07 2000
From: hjin at ceng.usc.edu (Hai Jin)
Date: Mon, 05 Jun 2000 10:54:07 -0700
Subject: CC-TEA'2000, Las Vegas - Online Proceedings
Message-ID: <393BE93F.CF79A5FC@ceng.usc.edu>

Dear All,

The program and online proceedings of:
                The 2000  International Workshop on
  "Cluster Computing - Technologies, Environments, and Applications
(CC-TEA'2000)"
      to be held in conjunction with PDPTA-2000
       Las Vegas, Nevada, USA, June 26th-29th, 2000
  In Co-operation with the "IEEE Task Force on Cluster Computing (TFCC)"

can be found at:
   http://www.dgs.monash.edu.au/~rajkumar/CC-TEA2000/
   http://www.dgs.monash.edu.au/~rajkumar/CC-TEA2000/program.html
   OR:
     http://ceng.usc.edu/~hjin/cc-tea2000.html

Happy reading.
--
Best wishes,
CC-TEA'2000 organisers
Raj, Hai, Toni


From salim at ee.fit.edu  Mon Jun  5 13:01:32 2000
From: salim at ee.fit.edu (Salim Mounir AlAoui)
Date: Mon, 5 Jun 2000 16:01:32 -0400 (EDT)
Subject: automating commands on nodes
In-Reply-To: <Pine.LNX.4.21.0006050828250.1051-100000@lola.kapet.vpn>
Message-ID: <Pine.GSO.3.96.1000605155730.15337D-100000@yacht.ee.fit.edu>


for remote commands you can go to:
ftp.remotesensing.org then you go to /pub/sadm/rpms
there you will find the cfm rpm which, once installed, lets you use the
"scmd" command. You type scmd "<command>" <host list>, and it will run
whatever command you want on any node of the beowulf. cfm is also very
useful for managing and keeping track of your beowulf modifications.


--------------------------------------------------------------------------
Salim Mounir Alaoui					salim at ee.fit.edu
Computer Science  Dept.					salaoui at cs.fit.edu
Research Assistant.					salim at ieee.org
Florida Institute of Technology
Melbourne, Florida
Voice: (407) 537-8025.
--------------------------------------------------------------------------


From goebel at his.com  Mon Jun  5 17:04:01 2000
From: goebel at his.com (John Goebel)
Date: Mon, 5 Jun 2000 20:04:01 -0400 (EDT)
Subject: automating commands on nodes
In-Reply-To: <Pine.LNX.4.10.10006041222460.24753-100000@dirac.org>
Message-ID: <Pine.BSI.4.05L.10006051959180.13169-100000@herndon10.his.com>

On Sun, 4 Jun 2000, Peter Jay Salzman wrote:

> dear all,
> 
> i'd like to:
> 
> edit /etc/profile to include /sbin and /usr/sbin in PATH
> adduser jobrun
> 
> on our 40 nodes.  is there a way of doing this without telnetting 40 times?
> 

Take a look at cfengine. You can maintain system state rather than just
pushing out mistakes, you can have each node pull changes instead of
pushing them out, and the syntax (AlthoughRatherSelfDocumenting) is
straightforward. You can also do it through a DES-encrypted transfer.

Also, prsh is handy. It beats writing 'for' and 'foreach' shell scripts.

Or you can use rsync (rsync -e ssh).
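
A minimal sketch of the rsync route, assuming ssh access as root is already
working on every node and that the node names sit one per line in a host
file. The function name push_file and the DRY_RUN switch are invented for
this example.

```shell
#!/bin/sh
# push_file HOSTFILE SRC DEST
# Copy SRC to DEST on every host listed in HOSTFILE (one name per line).
# With DRY_RUN=1 the rsync commands are printed instead of executed.
push_file() {
    hostfile=$1; src=$2; dest=$3
    while read -r host; do
        [ -n "$host" ] || continue
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "rsync -a -e ssh $src root@$host:$dest"
        else
            # </dev/null stops ssh from swallowing the rest of HOSTFILE
            rsync -a -e ssh "$src" "root@$host:$dest" < /dev/null
        fi
    done < "$hostfile"
}
```

Running it once with DRY_RUN=1 shows exactly what would be copied where
before anything is changed on the nodes.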

The world is your oyster.
John


From joysarkar at jncasr.ac.in  Tue Jun  6 10:04:29 2000
From: joysarkar at jncasr.ac.in (Mr.Joy Sarkar)
Date: Tue, 6 Jun 2000 12:04:29 -0500 (GMT+5)
Subject: Announcing the existence of kamadhenu@jncasr, INDIA.
Message-ID: <Pine.LNX.4.04.10006061157560.15926-100000@jncasr.ac.in>


Hi Folks,
	This is to announce the birth of kamadhenu, the first beowulf
cluster at JNCASR, India. It's an 8-node Pentium III cluster built for
molecular dynamics simulation. 

	We have been successful with the project for which we thank the
Open Source Community!

For more info, you can visit kamadhenu at:

http://www.jncasr.ac.in/kamadhenu

Expecting you!

Joy and Bala.


***********************************************************************
			   Joy Sarkar
Currently: Summer Research Fellow,
	   Beowulf Cluster Project and Brillouin Scattering Lab,
	   Jawaharlal Nehru Centre for Advanced Scientific Research,
	   Jakkur, Bangalore.
	   INDIA.
Also(!)  : Student, Dept of Physics, Indian Institute of Technology,
           Kharagpur, INDIA-721302.
***********************************************************************
		

From covenant at dirac.org  Tue Jun  6 00:17:49 2000
From: covenant at dirac.org (Peter Jay Salzman)
Date: Tue, 6 Jun 2000 00:17:49 -0700 (PDT)
Subject: automating commands on nodes
In-Reply-To: <Pine.GSO.3.96.1000605155730.15337D-100000@yacht.ee.fit.edu>
Message-ID: <Pine.LNX.4.10.10006060015210.32677-100000@dirac.org>

hi salim!

this looks really good -- but i'm having trouble finding references to cfm
on the net.  before i install it, i'd like to take a look at some man pages
and/or documentation.   can't find it on freshmeat or gnu.org.

can you give me its homepage?

thanks!
pete


On Mon, 5 Jun 2000, Salim Mounir AlAoui wrote:

> Date: Mon, 5 Jun 2000 16:01:32 -0400 (EDT)
> From: Salim Mounir AlAoui <salim at ee.fit.edu>
> To: beowulf at beowulf.org
> Subject: Re: automating commands on nodes
> 
> 
> 
> for remote commands you can go to:
> ftp.remotesensing.org then you go to /pub/sadm/rpms
> there you will find the cfm rpm which, once installed, will permit you to use
> the "scmd" command. You type scmd "<command>" <host list>, and it will run
> whatever command you want on any node of the beowulf. cfm is also very
> useful to manage and keep track of your beowulf modifications.


From wildfire at progsoc.uts.edu.au  Tue Jun  6 00:41:41 2000
From: wildfire at progsoc.uts.edu.au (Anand Kumria)
Date: Tue, 6 Jun 2000 17:41:41 +1000
Subject: automating commands on nodes
In-Reply-To: <Pine.LNX.4.10.10006041706170.25604-100000@dirac.org>; from covenant@dirac.org on Sun, Jun 04, 2000 at 05:11:58PM -0700
References: <20000605012344.V770@ostenfeld.dk> <Pine.LNX.4.10.10006041706170.25604-100000@dirac.org>
Message-ID: <20000606174141.J8460@ftoomsh.progsoc.uts.edu.au>

On Sun, Jun 04, 2000 at 05:11:58PM -0700, Peter Jay Salzman wrote:
> hi jakob,
> 
> we have ssh on the beowulf frontend, but not on the nodes.  any ideas on
> automating installing ssh on the nodes?  i haven't seen redhat 6.1 ssh rpms,

Unless all of your nodes are exposed on the public internet, do you need
ssh on them? I wouldn't have thought so.

> i guess that's a remnant of the USA's moronic crypto export policy (which i
> understand was mostly lifted).

For source code, mostly. Binaries are still troublesome.

> i've tried to use mandrake's ssh packages on a redhat 6.1, but redhat balked
> at the mandrake rpms.

oh well. so much for a single packaging system. 

> btw, i didn't know ssh had the capability to run stuff non-interactively.
> that is a very cool thing to know!  thank you very much!

Something the original poster hasn't taken into account is that some
programs will require a pty to be allocated, and you'll need to use the -t
switch to ssh.
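
A small sketch of building the non-interactive ssh command line, with
the -t switch added only when a program needs a pty. The helper name
ssh_argv and the host name node01 are hypothetical; nothing is executed
here.

```python
def ssh_argv(host, command, tty=False):
    """Build the argv for running `command` on `host` through ssh."""
    argv = ["ssh"]
    if tty:
        # -t forces pseudo-terminal allocation for programs that need one.
        argv.append("-t")
    argv += [host, command]
    return argv


# A plain command needs no pty; a full-screen program such as top does.
print(ssh_argv("node01", "adduser jobrun"))
print(ssh_argv("node01", "top", tty=True))
```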

Anand


From wildfire at progsoc.uts.edu.au  Tue Jun  6 01:24:11 2000
From: wildfire at progsoc.uts.edu.au (Anand Kumria)
Date: Tue, 6 Jun 2000 18:24:11 +1000
Subject: automating commands on nodes
In-Reply-To: <Pine.LNX.4.10.10006060015210.32677-100000@dirac.org>; from covenant@dirac.org on Tue, Jun 06, 2000 at 12:17:49AM -0700
References: <Pine.GSO.3.96.1000605155730.15337D-100000@yacht.ee.fit.edu> <Pine.LNX.4.10.10006060015210.32677-100000@dirac.org>
Message-ID: <20000606182411.L8460@ftoomsh.progsoc.uts.edu.au>

On Tue, Jun 06, 2000 at 12:17:49AM -0700, Peter Jay Salzman wrote:
> hi salim!
> 
> this looks really good -- but i'm having trouble finding references to cfm
> on the net.  before i install it, i'd like to take a look at some man pages
> and/or documentation.   can't find it on freshmeat or gnu.org.
> 
> can you give me its homepage?

www.remotesensing.org; go to the CVS repository, choose sadm then cfm.

Anand


From he at Physics.usyd.edu.au  Tue Jun  6 02:23:23 2000
From: he at Physics.usyd.edu.au (Hao He)
Date: Tue, 6 Jun 2000 19:23:23 +1000 (EST)
Subject: Invited speaker request 
Message-ID: <Pine.SOL.3.96.1000606192242.5474B-100000@suphys.physics.usyd.edu.au>

Dear Beowulf experts,

There will be a Conference on Computational Physics (CCP2000) at the end
of this year in Queensland, Australia. As the Chair of Open Source
Session, I need to find an invited speaker from the open source community
urgently. There will be a guy from MS talking about NT clustering. It
would be great if we had someone from open source talking about Linux
clustering or any similar open source projects. The conference will pay
this person's return air ticket to Australia and accommodations. If you
are interested in being this person or would like to recommend someone,
please email me at he at physics.usyd.edu.au.

For more information about CCP2000, visit www.physics.uq.edu.au/CCP2000.

Thank you.

Dr. Hao He 


From gerry at cs.tamu.edu  Tue Jun  6 04:50:41 2000
From: gerry at cs.tamu.edu (Gerry Creager N5JXS)
Date: Tue, 06 Jun 2000 06:50:41 -0500
Subject: Invited speaker request
References: <Pine.SOL.3.96.1000606192242.5474B-100000@suphys.physics.usyd.edu.au>
Message-ID: <393CE591.7BEF14EC@cs.tamu.edu>

Hao He wrote:
> 
> Dear Beowulf experts,
> 
> There will be a Conference on Computational Physics (CCP2000) at the end
> of this year in Queensland, Australia. As the Chair of Open Source
> Session, I need to find an invited speaker from the open source community
> urgently. There will be a guy from MS talking about NT clustering. It
> would be great if we have someone from open source talking about Linux
> clustering or any similar open source projects. The conference will pay
> this person's return air ticket to Australia and accommodations. If you
> are interested to be this person or would like to recommend someone,
> please email me at he at physics.usyd.edu.au.
> 
> For more information about CCP2000, visit www.physics.uq.edu.au/CCP2000.

Greg?  RGB?  You guys sound like naturals.
--
Gerry Creager		gerry at cs.tamu.edu, gerry at page4.cs.tamu.edu
Network Engineering			|Geodesy
Computer Science Department		|Satellite Geodesy and Control
Texas A&M University			|
979.458.4020


From Tim.Tenhave at compaq.com  Tue Jun  6 05:21:27 2000
From: Tim.Tenhave at compaq.com (Tenhave, Tim)
Date: Tue, 6 Jun 2000 08:21:27 -0400 
Subject: Benchmarking L2 cache on the Alpha 21264
Message-ID: <21ECC6E090DCD21180D20000F809A18B03B7C2BF@exctay-02.tay.dec.com>

Hi Chris,

I posed your question to some folks in Compaq.  The resounding answer was
the lack of page coloring in Linux.  There are some linker optimizations in
Tru64 UNIX, but page coloring was the most likely reason.

I was also told that Greg Lindahl and Joe Martin have posted patches to help
fix this.

Sorry I do not have a link right now.  I could find one if you cannot.

Hope this helps,

Tim


From rgb at phy.duke.edu  Tue Jun  6 06:36:58 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Tue, 6 Jun 2000 09:36:58 -0400 (EDT)
Subject: automating commands on nodes
In-Reply-To: <20000606174141.J8460@ftoomsh.progsoc.uts.edu.au>
Message-ID: <Pine.LNX.4.10.10006060856260.19246-100000@ganesh.phy.duke.edu>

On Tue, 6 Jun 2000, Anand Kumria wrote:

> On Sun, Jun 04, 2000 at 05:11:58PM -0700, Peter Jay Salzman wrote:
> > hi jakob,
> > 
> > we have ssh on the beowulf frontend, but not on the nodes.  any ideas on
> > automating installing ssh on the nodes?  i haven't seen redhat 6.1 ssh rpms,
> 
> Unless all of your nodes are exposed on the public internet, do you need
> ssh on them? I wouldn't have thought so.

Goodness.  We just had an extended discussion of this, and it should be
in the archives from just last week or two weeks ago. The answer is "no,
but it often won't matter and makes good sense".  ssh provides certain
services (notably forwarding of ports and a universally portable
environment in /etc/environment) that can be very useful to a beowulf
user at the expense of about 0.15 seconds per connection (plus any time
spent encrypting traffic, which is usually negligible for small files).

In terms of net load, bproc (being actively worked on by scyld.com) is
by far the most efficient way to run remote shell commands and so forth
on a beowulf (I haven't yet tested it personally but they report times
of 0.01 seconds for a file copy, IIRC from last week), but integrates
deeply with the kernel to accomplish this and so isn't for everyone.
rsh costs ~0.11-0.15 seconds for a (small) file copy (or any other kind
of connection) but provides "no" security, no cross-network encryption,
no forwarding of ports or preloading of environment. ssh costs ~0.25-0.29
seconds for a small file copy.
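
To put those per-connection figures in perspective, here is a rough
sketch of the total cost of touching every node once, one after the
other. It assumes 40 nodes (the size mentioned at the start of this
thread) and takes the midpoint of each reported range; the bproc figure
is the reported, untested one.

```python
# Per-connection cost in seconds, from the figures in this message.
COST = {"bproc": 0.01, "rsh": 0.13, "ssh": 0.27}


def sequential_total(per_connection, nodes):
    """Total time to contact `nodes` hosts one after another."""
    return per_connection * nodes


for tool, cost in COST.items():
    print(f"{tool:5s} {sequential_total(cost, 40):5.1f} s for 40 nodes")
```

Even for ssh the whole sweep stays on the order of ten seconds, which
is why the overhead rarely matters for occasional administrative use.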

If you are running ssh on the head node (presumably bundled into an RPM
or ready-to-install tarball) then the effort required to install it on
the nodes via e.g. kickstart, rsync, or whatever is essentially zero.
If all you use remote shells for is to synchronize a few /etc files,
enable MPI and PVM to (infrequently) spawn remote processes, allow login
access to the nodes from hosts outside the gateway node and so forth
there is really no reason to avoid using ssh and (strictly IMHO) there
are several good reasons to use it.  If you use remote shells a LOT for
a LARGE true beowulf, you should almost certainly use bproc as it is
likely to be on a track that will evolve into a true distributed beowulf
kernel (peering into my crystal ball with a wink at the Scyld folks) and
you can probably contribute to the development.  

Perhaps there is some ground in between for rsh, but I personally would
like to see it killed dead as it is a brainless and obsolete security
incident waiting to happen IN ADDITION TO having been designed back when
issues like the passing of environments and forwarding of ports hadn't
yet come to the foreground.  Even if you configure ssh to use no
encryption and not to verify connections at all (making it "just like"
rsh) you still get /etc/environment and port forwarding.

> 
> > i guess that's a remnant of the USA's moronic crypto export policy (which i
> > understand was mostly lifted).
> 
> For source code, mostly. Binaries are still troublesome.

There are several issues associated with ssh distribution.  One is the
RSA patent that is due to expire in September.  However, I've heard that
they've applied for an extension and that extensions are usually
knee-jerk granted.  Hopefully this time sanity will prevail and the knee
won't jerk.

The RSA patent is NOT international because it is directly based on work
published almost 100 years ago, and international patents are not
granted for ideas based on published work.  Finally yes, there was/is
the USA's moronic crypto policy.

For all of these reasons, many crypto-concerned software companies are
finding it expedient to become multinational (even if they are totally
home-grown) and to distribute their encryption software from a European
office.  IBM has just played this trick.  Looks like Red Hat is right in
there.  It is perfectly legal for them to produce and distribute
RSA-based software in Europe.  I actually have no idea if one is
breaking the law (nominally) if one purchases RH or SuSE linux "packaged
in Germany" that contains ssh with all the RSA stuff included, or if one
downloads it from a European site.  I must say that I don't much care,
either -- US software patents are often nonsense because the folks in
the patent office are utterly ignorant of what is de facto in the public
domain.  At this moment I could do something like say: "Hmmm, perhaps
neural networks can be used to identify clown faces in bank cameras".  I
can go find and build and train an utterly prosaic NN for that purpose.
If I then file a patent for a "NN clown-face identification engine for
use in the banking industry" there is an excellent chance that it will
be granted.

If suddenly the banking world realizes that nearly fifty percent of
their customers in clown faces are there to rob the bank and not to make
a deposit after working a kid's birthday party and my company "CF-ID
Inc." takes off, I can then squash possible competitors when they go to
the SAME books I went to to build my NN to duplicate the idea.  It
doesn't matter that the patent is stupid and indefensible.  Unless a big
player tries to get into the market and has the capital for a court
fight, I'm pretty safe and can run my own little monopoly for many, many
years.

Think it can't happen?  It has.  The "idea" of using NN's in credit card
fraud detection is patented this very day, in spite of it being an
utterly prosaic application of the NN.  Although it is indefensible, it
worked long enough for the company that obtained the patent to build
themselves a de facto monopoly that still has very few competitors.

Probably oversimplified, but I assure you -- if Sterling, Becker et al.
had tried to PATENT the beowulf concept, the pre-existence of PVM and
MPI and/or Gnu and Linux would very likely not have been enough to keep
it from being granted.  Companies like paralogic and alta tech would
have to license the "technology" from S&B Inc.  A software patent is
much stronger protection, in its way, than a software copyright, as one
can generally reverse engineer a copyrighted software product from an
API, but one has to really fight to show that a patent, once granted, is
invalid.

> 
> > i've tried to use mandrake's ssh packages on a redhat 6.1, but redhat balked
> > at the mandrake rpms.
> 
> oh well. so much for a single packaging system.

The issue is usually how they interface with e.g. pam.  ssh is pretty
complicated stuff.  A "perfectly built" RPM would probably remain
portable, but a sloppily built one might well fail simply because it has
dependencies that weren't correctly established (by the builder) at
build time.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From marini at pcmenelao.mi.infn.it  Tue Jun  6 06:55:21 2000
From: marini at pcmenelao.mi.infn.it (Francesco Marini)
Date: Tue, 6 Jun 2000 15:55:21 +0200
Subject: Problems with MPICH 1.2 and Beowulf/Linux
Message-ID: <200006061355.PAA26685@pcmenelao.mi.infn.it>

Hi all,

    I've got a really weird problem with MPICH 1.2.
    The system consists of a server and 16 computing nodes, all
diskless, mounting root via NFS from the server.  It works very well
with pvm and LAM-MPI.
    Now, I'm trying to compile the latest source of MPICH, the make
process goes well, but when I try to "make testing" I get this output
(repeated for all tests using more than 1 machine) :

*** Testing MPI_Test ***
pcwalhalla : Mon May 29 16:27:09 CEST 2000
/work/staff/marini/mpich-1.2.0/bin/mpicc -DUSE_SOCKLEN_T
-DUSE_U_INT_FOR_XDR -DFORTRANUNDERSCORE -DHAVE_MPICHCONF_H
-DHAVE_STDLIB_H=1 -DUSE_STDARG=1 -DHAVE_LONG_DOUBLE=1
-DHAVE_LONG_LONG_INT=1 -DHAVE_PROTOTYPES=1 -DHAVE_SIGNAL_H=1
-DHAVE_SIGACTION=1   -c persistent.c
/work/staff/marini/mpich-1.2.0/bin/mpicc  -o persistent persistent.o
*** Testing MPI_Recv_init ***
Differences in persistent.out
2,5c2,8
< rm_3383:  p4_error: rm_start: net_conn_to_listener failed: 3165
< p0_20161:  p4_error: Timeout in making connection to remote process on
node1: 0
< bm_list_20162:  p4_error: interrupt SIGINT: 2
< rm_l_1_20168:  p4_error: interrupt SIGINT: 2
---
> Receiving message 1
> Received message 1
> Receiving message 2
> Received message 2
> Receiving message 3
> Received message 3
> Completed all receives
7d9
< rm_20167:  p4_error: interrupt SIGINT: 2
pcwalhalla : Mon May 29 16:32:12 CEST 2000
/work/staff/marini/mpich-1.2.0/bin/mpicc -DUSE_SOCKLEN_T
-DUSE_U_INT_FOR_XDR -DFORTRANUNDERSCORE -DHAVE_MPICHCONF_H
-DHAVE_STDLIB_H=1 -DUSE_STDARG=1 -DHAVE_LONG_DOUBLE=1
-DHAVE_LONG_LONG_INT=1 -DHAVE_PROTOTYPES=1 -DHAVE_SIGNAL_H=1
-DHAVE_SIGACTION=1   -c persist.c
/work/staff/marini/mpich-1.2.0/bin/mpicc  -o persist persist.o
*** Testing MPI_Startall/Request_free ***
Differences in persist.out
2,5c2
< rm_3388:  p4_error: rm_start: net_conn_to_listener failed: 3171
< p0_20318:  p4_error: Timeout in making connection to remote process on
node1: 0
< bm_list_20319:  p4_error: interrupt SIGINT: 2
< rm_l_1_20325:  p4_error: interrupt SIGINT: 2
---
> No errors
7d3
< rm_20324:  p4_error: interrupt SIGINT: 2
pcwalhalla : Mon May 29 16:37:14 CEST 2000
/work/staff/marini/mpich-1.2.0/bin/mpicc -DUSE_SOCKLEN_T
-DUSE_U_INT_FOR_XDR -DFORTRANUNDERSCORE -DHAVE_MPICHCONF_H
-DHAVE_STDLIB_H=1 -DUSE_STDARG=1 -DHAVE_LONG_DOUBLE=1
-DHAVE_LONG_LONG_INT=1 -DHAVE_PROTOTYPES=1 -DHAVE_SIGNAL_H=1
-DHAVE_SIGACTION=1   -c persist2.c
/work/staff/marini/mpich-1.2.0/bin/mpicc  -o persist2 persist2.o
*** Testing MPI_Startall(Bsend)/Request_free ***
Differences in persist2.out
2,5c2
< rm_3391:  p4_error: rm_start: net_conn_to_listener failed: 3177
< p0_20473:  p4_error: Timeout in making connection to remote process on
node1: 0
< bm_list_20474:  p4_error: interrupt SIGINT: 2
< rm_l_1_20480:  p4_error: interrupt SIGINT: 2
---

    Seems like MPICH cannot start the remote process or cannot establish
the connection. The crazy thing is that with pvm and LAM-MPI all goes
well.

    Any idea ?

    Second: I've got some problems compiling ScaLAPACK with LAM-MPI, gcc
and pgf77 (f77 compiler from Portland Group); it gives a lot of
unresolved symbols regarding MPI. Has anyone succeeded in compiling it
under the same configuration?

    Thank you all in advance,


Franz Marini


---------------------------------------------
Franz Marini
Sys Admin and Software Analyst,
Dept. of Physics, University of Milan, Italy.
email : marini at pcmenelao.mi.infn.it
---------------------------------------------


From c.best at fz-juelich.de  Tue Jun  6 07:27:52 2000
From: c.best at fz-juelich.de (Christoph Best)
Date: Tue,  6 Jun 2000 16:27:52 +0200 (CEST)
Subject: Benchmarking L2 cache on the Alpha 21264
In-Reply-To: <200006052111.RAA10427@orourke.mclinux.com>
References: <14651.41221.672255.829879@verne.local>
	<200006052111.RAA10427@orourke.mclinux.com>
Message-ID: <14653.2007.88779.708786@verne.local>

Hi everybody,

thanks for all the help. I think the problem I am seeing is the lack
of page coloring. I will try Joseph Martin's kernel patch asap - we
are very interested in making efficient use of the L2 cache as it is
so big (4 MB on some of our machines). 

In particular, page coloring should be a very good idea for cluster
nodes where we do not care about the actual performance of the kernel
page allocator (just running one process a long time in a fixed page
setup), but the penalties for cache misses are very high. We easily
see a factor of three in MFlops numbers between L1 cache and memory.
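
A rough sketch of why page placement matters for a physically indexed
cache. It assumes an 8 KB page size and a direct-mapped L2 cache
(associativity 1); the 4 MB cache size is the figure given above, and
the function names are only illustrative.

```python
def num_colors(cache_bytes, page_bytes, associativity=1):
    """Number of distinct page-sized slots ("colors") in the cache."""
    return cache_bytes // (page_bytes * associativity)


def page_color(physical_page_number, colors):
    """Pages with the same color compete for the same cache lines."""
    return physical_page_number % colors


colors = num_colors(4 * 1024 * 1024, 8 * 1024)
print(colors)

# Two physical pages exactly `colors` apart map onto the same cache slot,
# so a process holding both can evict its own data even when its working
# set is far smaller than the cache.
print(page_color(3, colors) == page_color(3 + colors, colors))
```

A page-coloring allocator tries to hand a process physical pages whose
colors are spread evenly, so that a working set smaller than the cache
actually fits in it.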

BTW, we use the Compaq compiler which gives about 20% more MFlops than 
the gnu compiler in L1 cache.

Thanks again
-Chris
-- 
Christoph Best                                        c.best at computer.org
John von Neumann Institute for Computing/DESY   http://www.oche.de/~cbest


From glindahl at hpti.com  Tue Jun  6 07:33:13 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Tue, 6 Jun 2000 10:33:13 -0400
Subject: Benchmarking L2 cache on the Alpha 21264
In-Reply-To: <21ECC6E090DCD21180D20000F809A18B03B7C2BF@exctay-02.tay.dec.com>
Message-ID: <001b01bfcfc4$255f3780$f69cfea9@hptilap.hpti.com>

> I was also told that Greg Lindahl and Joe Martin have posted
> patches to help
> fix this.

And btw, here is our status:

My patch doesn't quite work right, but I think I know how to fix it.

Joe's patch, different approach, doesn't work quite right either. He
probably has some ideas...

We know what the right answer is (from Tru64), we know some tests that
reveal if it is working well or not.

What we could use would be a volunteer to drive this thing home. I'm way too
busy.

-- g


From c.best at fz-juelich.de  Tue Jun  6 07:42:10 2000
From: c.best at fz-juelich.de (Christoph Best)
Date: Tue,  6 Jun 2000 16:42:10 +0200 (CEST)
Subject: Benchmarking L2 cache on the Alpha 21264
In-Reply-To: <393D0BAD.E26F1301@quadrics.com>
References: <14651.41221.672255.829879@verne.local>
	<200006052111.RAA10427@orourke.mclinux.com>
	<14653.2007.88779.708786@verne.local>
	<393D0BAD.E26F1301@quadrics.com>
Message-ID: <14653.3273.414147.224625@verne.local>

Hi,

I have been asked to post where I found the patch.

Joseph Martin posted this to the linux-kernel list on April 18:

   http://www.uwsg.indiana.edu/hypermail/linux/kernel/0004.2/0503.html

But if he should be reading this, maybe he has a more recent version?

-Chris
-- 
Christoph Best                                        c.best at computer.org
John von Neumann Institute for Computing/DESY   http://www.oche.de/~cbest


From vor+ at pitt.edu  Tue Jun  6 08:58:05 2000
From: vor+ at pitt.edu (Victor Ortega)
Date: Tue, 06 Jun 2000 11:58:05 -0400 (EDT)
Subject: automating commands on nodes
In-Reply-To: <Pine.LNX.4.10.10006060856260.19246-100000@ganesh.phy.duke.edu>
Message-ID: <Pine.GSO.3.96L.1000606114216.16146D-100000@unixs1.cis.pitt.edu>

On Tue, 6 Jun 2000, Robert G. Brown wrote:
> Even if you configure ssh to use no encryption and not to verify
> connections at all (making it "just like" rsh) you still get
> /etc/environment and port forwarding.

I'm glad someone said it.  I was going to say it myself otherwise.

Although I'll admit I haven't done this, it should be possible to
configure ssh such that outside connections to the head node are
encrypted, but connections within the cluster are unencrypted (for the
sake of those worried about performance degradation within the cluster
due to ssh).  Internal authentication need not be TOTALLY disabled;
simply set up public and private keys on all the nodes and there'll
still be a level of security--even some bad guy who brings in a
computer and attaches it to the internal network will not be able to
just log into the other nodes without at least having an authorized key.

Also, the security and convenience features of ssh make it almost a
must for those wishing to connect to a cluster from an external
location; at that point, having just ssh (and not both ssh and rsh)
will make administration and configuration of the cluster easier.  I
will grant that those who absolutely refuse to have ssh on their system
can still get away with using SRP for secure connections to the
cluster and then use rsh within the cluster (and therefore still have
both security and high performance), but again, that's still two
packages that need to be maintained instead of just one.

Victor
p.s. check out http://srp.stanford.edu/srp/ for information on SRP,
     a backwards-compatible, secure replacement for telnet and ftp.


From glindahl at hpti.com  Tue Jun  6 09:20:03 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Tue, 6 Jun 2000 12:20:03 -0400
Subject: automating commands on nodes
In-Reply-To: <Pine.GSO.3.96L.1000606114216.16146D-100000@unixs1.cis.pitt.edu>
Message-ID: <002101bfcfd3$11b6b960$f69cfea9@hptilap.hpti.com>

> Although I'll admit I haven't done this, it should be possible to
> configure ssh such that outside connections to the head node are
> encrypted, but connections within the cluster are unencrypted

This is a pain. You have to recompile sshd to allow unencrypted connections.
Then there is no existing policy option to enforce external connections
being encrypted. Gaah.

-- greg


From bnh at dimension6.com  Tue Jun  6 10:32:26 2000
From: bnh at dimension6.com (brad)
Date: Tue, 6 Jun 2000 11:32:26 -0600
Subject: alpha multia beowulf cluster -- ideas
Message-ID: <NDBBJEGKELFODAAMMNHIIECICDAA.bnh@dimension6.com>

Hello, I was considering building a beowulf cluster based on alpha multia's.
has anyone tried this? what kind of performance can i generally expect? does
anyone know of any resources online regarding this?

Thanks,
Brad


From rgb at phy.duke.edu  Tue Jun  6 11:49:37 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Tue, 6 Jun 2000 14:49:37 -0400 (EDT)
Subject: automating commands on nodes
In-Reply-To: <Pine.GSO.3.96L.1000606114216.16146D-100000@unixs1.cis.pitt.edu>
Message-ID: <Pine.LNX.4.10.10006061439260.19246-100000@ganesh.phy.duke.edu>

On Tue, 6 Jun 2000, Victor Ortega wrote:

> On Tue, 6 Jun 2000, Robert G. Brown wrote:
> > Even if you configure ssh to use no encryption and not to verify
> > connections at all (making it "just like" rsh) you still get
> > /etc/environment and port forwarding.
> 
> I'm glad someone said it.  I was going to say it myself otherwise.
> 
> Although I'll admit I haven't done this, it should be possible to
> configure ssh such that outside connections to the head node are
> encrypted, but connections within the cluster are unencrypted (for the
> sake of those worried about performance degradation within the cluster
> due to ssh).  Internal authentication need not be TOTALLY disabled;
> simply set up public and private keys on all the nodes and there'll
> still be a level of security--even some bad guy who brings in a
> computer and attaches it to the internal network will not be able to
> just log into the other nodes without at least having a public key.

I agree, although my measurements (published last week on the list) do
show that the bulk of the "cost" of ssh relative to rsh comes from the
original RSA handshake, not from the encryption.  If ssh is built with
--with-none defined, one can call ssh as ssh -c none wherever whatever
to skip encrypting the net traffic.

I believe that you are right, though, in that ssh could be set up to do
full RSA authentication on connections to the head node and then do
basically no host authentication and no encryption between nodes on the
private network (in)side.  I'll see if I can work out the appropriate
configuration files and/or wrappers and if I can I'll publish them back
to the list and in the book under construction.  I should probably do an
rshbench of ssh when RSA host authentication is turned off anyway to see
what fraction of the overhead is associated with reading
/etc/environment and managing any forwarded ports.

I agree with the rest of your note as well.  Net snooping has been
responsible for the bulk of the successful cracks into our department
over the last fifteen years or so.  It is easiest to maintain just one
of ssh/rsh (and not both) and given this choice, ssh is the obvious one.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From rgb at phy.duke.edu  Tue Jun  6 11:51:14 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Tue, 6 Jun 2000 14:51:14 -0400 (EDT)
Subject: automating commands on nodes
In-Reply-To: <002101bfcfd3$11b6b960$f69cfea9@hptilap.hpti.com>
Message-ID: <Pine.LNX.4.10.10006061450090.19246-100000@ganesh.phy.duke.edu>

On Tue, 6 Jun 2000, Greg Lindahl wrote:

> > Although I'll admit I haven't done this, it should be possible to
> > configure ssh such that outside connections to the head node are
> > encrypted, but connections within the cluster are unencrypted
> 
> This is a pain. You have to recompile sshd to allow unencrypted connections.
> Then there is no existing policy option to enforce external connections
> being encrypted. Gaah.

And the marginal gain in performance is very small unless you are
regularly using ssh to send large files.  Most of the cost is in the
original RSA connection, not the encryption.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From vor+ at pitt.edu  Tue Jun  6 12:06:26 2000
From: vor+ at pitt.edu (Victor Ortega)
Date: Tue, 06 Jun 2000 15:06:26 -0400 (EDT)
Subject: automating commands on nodes
In-Reply-To: <Pine.LNX.4.10.10006061439260.19246-100000@ganesh.phy.duke.edu>
Message-ID: <Pine.GSO.3.96L.1000606145519.16146O-100000@unixs1.cis.pitt.edu>

On Tue, 6 Jun 2000, Robert G. Brown wrote:
> I agree, although my measurements (published last week on the list) do
> show that the bulk of the "cost" of ssh relative to rsh comes from the
> original RSA handshake, not from the encryption.

But I believe that your benchmarks were done with copying small files;
I am worried that forwarding a full X connection, encrypted, over ssh
from some internal node (ssh into head node, ssh into some internal
node, load up some big GUI) will incur a big performance penalty.  I
tried this yesterday with a simple two-hop connection, and the GUI was
twice as slow (it was slow enough already with just a single encrypted
X connection going over our external 10base-T network).  Unfortunately
I have no benchmarks for this.

Aside from that, I agree that the cost of encrypting communications
within the internal network is probably negligible.

Victor


From rgb at phy.duke.edu  Tue Jun  6 12:08:18 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Tue, 6 Jun 2000 15:08:18 -0400 (EDT)
Subject: diskless alphalinux nodes
In-Reply-To: <393D3B4C.32DD7025@okstate.edu>
Message-ID: <Pine.LNX.4.10.10006061451590.19246-100000@ganesh.phy.duke.edu>

On Tue, 6 Jun 2000, Mathew Lee wrote:

> ....additionally I searched the archives for diskless and found a reference from
> Oct. 1998, where you talk about a diskless booting sequence ...I have attached it
> to refresh your memory.  I was wondering if the diskless.tar.gz is still available,
> and/or if it has been updated.....also, is there a place that I could find
> additional information on diskless booting...mounting root-nfs or by other
> means....(ramdisk or coda possibly)

I largely abandoned this particular approach because new kernels came
out that supported much better methods.  The most intriguing is Greg
Warnes' NFS hack that permits the normal installation of just one server
to support N nodes without creating any host-specific export directories
at all -- I think he posted it last week.  However, I believe that there
are other packages out there as well.

There are three different levels of problems to solve setting up and
running diskless systems.  The first is getting a kernel to load (via
the net with special proms on a NIC or from a boot floppy).  The second
is getting the kernel you boot to NFS mount your root (and other) file
system(s).  The third is efficiently laying out exports on a server so
that you provide writeability where a given system really has to have
it.  Pretty much all unixoid systems will be unhappy unless they can
write /var, /tmp, /etc and /dev, although one can often rig /etc with
symlinks to writeable space in e.g. /var/etc to fake it.  

Greg's NFS hack allows a single fs to be exported but gives writability
and remapped identity to files via an IP-based tag, so e.g.
/etc/ld.so.cache as mounted on host xxxxxxxx is really exported as
/etc/ld.so.cache_xxxxxxxx on the server.  IIRC, that is (he may correct
me).  I'd expect that his changes are moderately portable since they are
likely well above the machine hardware layer of the kernel.
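
The remapping as recalled above can be sketched as follows. This only
illustrates the idea of a per-host suffix; the real suffix convention of
the NFS modification may well differ, and the function name, the tag
c0a80105 and the list of per-host files are hypothetical.

```python
def server_side_name(client_path, host_tag, per_host_paths):
    """Map a path requested by a client to the file stored on the server.

    Paths listed in `per_host_paths` get a host-specific copy; every other
    path is shared by all clients unchanged.
    """
    if client_path in per_host_paths:
        return f"{client_path}_{host_tag}"
    return client_path


per_host = {"/etc/ld.so.cache", "/etc/mtab"}
print(server_side_name("/etc/ld.so.cache", "c0a80105", per_host))
print(server_side_name("/bin/ls", "c0a80105", per_host))
```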

However, once you've figured out how to build a NFS lilo boot floppy
(which isn't that difficult from the current howtos and e.g. mkinitrd)
it is also pretty simple to go diskless by just giving each node e.g.
/exports/[b1,b2,b3...] exported to each host as its root and then
cloning everything BUT /usr into it (that is, make /usr a separate
filesystem on the server, usually, and export it RO to all the hosts).
This wastes a bit of space but space is cheap.  You'll still need to
periodically rsync the node roots with a carefully determined exclusion
list, as otherwise e.g. RPM installs on the server won't properly
propagate to the nodes.
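
A sketch of the export layout this implies, generating /etc/exports
lines for the server. The host names b1..b3 mirror the directory names
above and are assumptions, as are the rw,no_root_squash options chosen
for the per-node roots; adjust both to the local setup.

```python
def exports_lines(node_names, root_base="/exports", usr_path="/usr"):
    """Build /etc/exports lines: one writable root per node, shared RO /usr."""
    # Each node gets its own root directory, exported only to that node.
    lines = [f"{root_base}/{n} {n}(rw,no_root_squash)" for n in node_names]
    # /usr is a single filesystem exported read-only to every node.
    lines.append(f"{usr_path} " + " ".join(f"{n}(ro)" for n in node_names))
    return lines


for line in exports_lines(["b1", "b2", "b3"]):
    print(line)
```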

I might tackle an NFS/diskless installation again one day, but if I do
I'm almost certainly going to work from Greg's or one of the others that
have been posted/advertised on the list in the last few months.  Use the
search engine to find them.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From warnes at biostat.washington.edu  Tue Jun  6 12:40:18 2000
From: warnes at biostat.washington.edu (Gregory R. Warnes)
Date: Tue, 6 Jun 2000 12:40:18 -0700 (PDT)
Subject: diskless alphalinux nodes
In-Reply-To: <Pine.LNX.4.10.10006061451590.19246-100000@ganesh.phy.duke.edu>
Message-ID: <Pine.GSO.4.21.0006061227440.7326-100000@atlas.biostat.washington.edu>

On Tue, 6 Jun 2000, Robert G. Brown wrote:

  RGB>> On Tue, 6 Jun 2000, Mathew Lee wrote:
  RGB>> 
  RGB>> > ....additionally I searched the archives for diskless and found a reference from
  RGB>> > Oct. 1998, where you talk about a diskless booting sequence 

	[snip]

  RGB>> 
  RGB>> I largely abandoned this particular approach because new kernels came
  RGB>> out that supported much better methods.  The most intriguing is Greg
  RGB>> Warnes NFS hack that permits the normal installation of just one server
  RGB>> to support N nodes without creating any host-specific export directories
  RGB>> at all -- I think he posted it last week.

The NFS mod that Robert mentions is called ClusterNFS and has a home
page at http://ClusterNFS.sourceforge.net 

  RGB>> Greg's NFS hack allows a single fs to be exported but gives writability
  RGB>> and remapped identity to files via an IP-based tag, so e.g.
  RGB>> /etc/ld.so.cache as mounted on host xxxxxxxx is really exported as
  RGB>> /etc/ld.so.cache_xxxxxxxx on the server.  IIRC, that is (he may correct
  RGB>> me).  I'd expect that his changes are moderately portable since they are
  RGB>> likely well above the machine hardware layer of the kernel.

Actually, ClusterNFS runs entirely in userspace, so that *no* kernel
modifications are required on either the server or the client. Since
ClusterNFS is a simple extension to the standard Universal-NFS server,
which is reported to work on a wide variety of OSes and CPUs, I expect it
will compile and work out-of-the-box with Alpha-Linux.  (Of course, I
haven't actually tried anything but Intel Linux.  Let me know if something
doesn't work.)
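
The remapping idea quoted above can be sketched in a few lines of
Python.  This is only an illustration of the concept: the "_tag" suffix
follows Robert's recollection in the quoted text, and the real naming
scheme used by ClusterNFS may well differ (see its home page).  The
function name and the injectable exists() check are hypothetical.

```python
import os


def resolve_for_client(path, client_tag, exists=os.path.exists):
    """Return the client-specific copy of `path` if the server has one,
    otherwise fall back to the shared file.

    `client_tag` stands in for whatever per-host identifier is used
    (assumed here to be a plain suffix after an underscore).
    """
    candidate = f"{path}_{client_tag}"
    return candidate if exists(candidate) else path


if __name__ == "__main__":
    # Fake server contents so the example runs anywhere.
    server_files = {"/etc/ld.so.cache", "/etc/ld.so.cache_node01", "/etc/passwd"}
    lookup = server_files.__contains__
    print(resolve_for_client("/etc/ld.so.cache", "node01", exists=lookup))
    print(resolve_for_client("/etc/passwd", "node01", exists=lookup))
```

Files without a host-specific copy stay shared, which is what lets a
single exported tree serve every node.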

  RGB>> However, once you've figured out how to build a NFS lilo boot floppy
  RGB>> (which isn't that difficult from the current howtos and e.g. mkinitrd)
  RGB>> it is also pretty simple to go diskless by just giving each node e.g.
  RGB>> /exports/[b1,b2,b3...] exported to each host as its root and then
  RGB>> cloning everything BUT /usr into it (that is, make /usr a separate
  RGB>> filesystem on the server, usually, and export it RO to all the hosts).
  RGB>> This wastes a bit of space but space is cheap.  You'll still need to
  RGB>> periodically rsync the node roots with a carefully determined exclusion
  RGB>> list, as otherwise e.g. RPM installs on the server won't properly
  RGB>> propagate to the nodes.

I created ClusterNFS explicitly to get away from the need to keep separate
directories for each client (either on NFS or on the client itself).  
Keeping {track of, propagating} changes to all of the appropriate
directories gets hairy fast.  Even if you use rsync, it is quite difficult
to figure out everything that should be {in,ex}cluded.  I used to do this
on our cluster, and every couple of months I'd discover that something
else needed to be excluded that wasn't.  In addition, particularly painful
things happen when rsync tries to update the libraries it is using on the
clients....

-Greg


From rgb at phy.duke.edu  Tue Jun  6 13:18:50 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Tue, 6 Jun 2000 16:18:50 -0400 (EDT)
Subject: automating commands on nodes
In-Reply-To: <Pine.GSO.3.96L.1000606145519.16146O-100000@unixs1.cis.pitt.edu>
Message-ID: <Pine.LNX.4.10.10006061607190.26900-100000@ganesh.phy.duke.edu>

On Tue, 6 Jun 2000, Victor Ortega wrote:

> On Tue, 6 Jun 2000, Robert G. Brown wrote:
> > I agree, although my measurements (published last week on the list) do
> > show that the bulk of the "cost" of ssh relative to rsh comes from the
> > original RSA handshake, not from the encryption.
> 
> But I believe that your benchmarks were done with copying small files;

Big ones too.  I tested a 1M file copy at 0.67 sec for ssh (using
default IDEA encryption) vs 0.2 sec for rsh.  I also tested e.g.
blowfish and one can interpolate a bit and still get encryption.  All
this also depends strongly on the speed of the CPUs.  A rough estimate
of 1 (extra) second for each 2 MB sent is probably not unreasonable,
although you might get 3 MB in a second on a good day or even four or
five with blowfish.  Beyond that you're approaching wirespeed.
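
The arithmetic behind that estimate, as a small Python sketch.  It
assumes the extra cost grows linearly with file size, using only the two
1 MB timings quoted above; that is a simplification, since part of the
ssh cost is the fixed handshake.

```python
def extra_seconds(size_mb, ssh_s_per_mb=0.67, rsh_s_per_mb=0.2):
    """Estimated extra transfer time for ssh over rsh, assuming the
    per-megabyte difference measured on a 1 MB copy scales linearly."""
    return size_mb * (ssh_s_per_mb - rsh_s_per_mb)


def mb_per_extra_second(ssh_s_per_mb=0.67, rsh_s_per_mb=0.2):
    """How many megabytes can be sent for each extra second spent."""
    return 1.0 / (ssh_s_per_mb - rsh_s_per_mb)


if __name__ == "__main__":
    print(f"extra time for 1 MB: {extra_seconds(1.0):.2f} s")
    print(f"MB per extra second: {mb_per_extra_second():.1f}")
```

With these inputs the difference is a bit under half a second per
megabyte, which is where the rough "one extra second per 2 MB" figure
comes from.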

> I am worried that forwarding a full X connection, encrypted, over ssh
> from some internal node (ssh into head node, ssh into some internal
> node, load up some big GUI) will incur a big performance penalty.  I
> tried this yesterday with a simple two-hop connection, and the GUI was
> twice as slow (it was slow enough already with just a single encrypted
> X connection going over our external 10base-T network).  Unfortunately
> I have no benchmarks for this.

Hmm, hadn't thought about this, as I try not to run graphics-heavy X
apps over any kind of shell connection -- with linux one can usually run
them locally -- although e.g. xterms and simple Tk-ish apps work fine.
I'll have to see if I can set this up to measure it.  However, at an
extra 0.5 seconds per megabyte, I agree that you won't want to play a
hi-res video game this way and that netscape should be significantly
delayed.  "Simple" X apps, though (e.g.  xterm) should be ok, and I
can't think of why one would need to run e.g.  netscape on a node.

I also have no idea if a double ssh doubles this overhead.  It might be
that a->b is encrypted and then b->c is REencrypted.  Or it might be
that the b->c transfer forwards the keys (so to speak).  I'll try to
test this as well.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From jakob at ostenfeld.dk  Tue Jun  6 14:39:16 2000
From: jakob at ostenfeld.dk (=?iso-8859-1?Q?Jakob_=D8stergaard?=)
Date: Tue, 6 Jun 2000 23:39:16 +0200
Subject: [Announce] The jobd Load Balancer
Message-ID: <20000606233916.E770@ostenfeld.dk>

Hi all!

You may remember that I asked for a simple load balancing system for parallel
make jobs, about a week (or two) ago.

Some suggestions came up, and I was even offered the opportunity to beta-test a
commercial queuing system.  I decided to first play around with some of the
queuing systems already freely available out there, as well as various parallel
make variants.  I was certain that parallel make was something a lot of people
did and therefore there would be mature and well functioning tools for the job
- as usual.

I need GNU Make, so BSD pmake is out.  Customs GNU Make doesn't work properly
and isn't maintained.  PVM GNU Make may work, but didn't for me; I also have the
feeling that this is too much of a hack to be relied upon.  GNU Queue was
close, except that it breaks under load.  Generic NQS was too big (too slow for
short jobs, too complex).

Fixing GNU Queue wasn't an option for me; with all due respect, that is by far
the ugliest code I've seen in a long time.

So I did what I originally wanted to avoid:  Wrote up a new load balancing
system from scratch.

It's very simple, providing an rsh-like command ``jsh'' which instead of taking
a hostname argument (as rsh does) takes a job-type argument.  It communicates
with the jobd daemon running on the local host, and finds the best host for the
job-type given.  The job is then executed on this best host for the job.

For example, running the hostname command as a gcc-type job:
[joe at eagle joe]$ jsh -t gcc hostname
eagle
[joe at eagle joe]$ jsh -t gcc hostname
albatros
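
A rough Python sketch of the "pick the best host for this job type"
step.  The actual selection policy of jobd is not described in this
announcement; this assumes, purely for illustration, that each host
advertises the job types it accepts and a load figure, and that the
least loaded eligible host wins.  All names here are hypothetical.

```python
def best_host(job_type, hosts):
    """Pick the least loaded host that accepts `job_type`.

    `hosts` maps a hostname to {"types": set of job types, "load": float}.
    Ties are broken by hostname so the result is deterministic.
    """
    candidates = [
        (info["load"], name)
        for name, info in hosts.items()
        if job_type in info["types"]
    ]
    if not candidates:
        raise LookupError(f"no host accepts job type {job_type!r}")
    return min(candidates)[1]


if __name__ == "__main__":
    cluster = {
        "eagle": {"types": {"gcc"}, "load": 0.1},
        "albatros": {"types": {"gcc"}, "load": 0.5},
    }
    print(best_host("gcc", cluster))
    cluster["eagle"]["load"] = 1.0  # first job is now running on eagle
    print(best_host("gcc", cluster))
```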

jobd is simple, efficient, and even somewhat secure. (I believe it is secure if
the network is physically secure and the nodes in the /etc/jobd.hosts file can
be trusted).

It is available at  http://ostenfeld.dk/~jakob/jobd/

The current version is 0.1, which should indicate that there is still work to
be done.  However, the system seems to work for me, and I'll be using it at
work the next few days to see how it fares.   There will be one major update
for the resource handling soon, but all in all I think the system is ready
for some use and feedback.  Hence this notice    :)

So if anyone besides me is sick and tired of waiting for those half-hour C++
compilations, here's a chance to justify a Beowulf for your boss   ;)

Cheers,
-- 
................................................................
: jakob at ostenfeld.dtu.dk  : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:


From jgscribner at riversidepaper.com  Wed Jun  7 07:48:43 2000
From: jgscribner at riversidepaper.com (Justin Scribner)
Date: Wed, 7 Jun 2000 09:48:43 -0500 
Subject: Please help me unsubscribe
Message-ID: <11251BCC86FCD1118CFA00805F6F91B92C0CCF@CBC>

I truly apologize for posting this type of message to the list but feel that I
have no other recourse.  I have been trying to unsubscribe for weeks by
sending messages to both beowulf-request at beowulf.gsfc.nasa.gov and
Majordomo at beowulf.gsfc.nasa.gov but get nothing but undeliverable messages
(even when sent from other completely disparate addresses).  All messages to
other addresses work fine and I doubt there is a problem with the
aforementioned addresses.  I tried to unsubscribe from the web page, but
haven't received a response from there either.  I do appreciate the
discussions thus far and learned that I need to consider Mosix rather than
Beowulf.  Thank you in advance and my sincerest apologies to those who have
seen one-too-many unsubscribes posted to mailing-lists.

Justin G. Scribner
MIS - Technician
jgscribner at riversidepaper.com


From rgb at phy.duke.edu  Wed Jun  7 08:23:05 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed, 7 Jun 2000 11:23:05 -0400 (EDT)
Subject: Please help me unsubscribe
In-Reply-To: <11251BCC86FCD1118CFA00805F6F91B92C0CCF@CBC>
Message-ID: <Pine.LNX.4.10.10006071117500.28357-100000@ganesh.phy.duke.edu>

On Wed, 7 Jun 2000, Justin Scribner wrote:

> I truly apologize for posting this of message to the list but feel that I
> have no other recourse.  I have been trying to unsubscribe for weeks by
> sending messages to both beowulf-request at beowulf.gsfc.nasa.gov and
> Majordomo at beowulf.gsfc.nasa.gov but get nothing but undeliverable messages
> (even when sent from other completely disparate addresses).  All messages to
> other addresses work fine and I doubt there is a problem with the
> aforementioned addresses.  I tried to unsubscribe from the web page, but
> haven't received a response from there either.  I do appreciate the
> discussions thus far and learned that I need to consider Mosix rather than
> Beowulf.  Thank you in advance and my sincerest apologies to those who have
> seen one-too-many unsubscribes posted to mailing-lists.

It isn't working because the beowulf.gsfc.nasa.gov address is defunct
and obsolete.  This is one (of many) reasons to use the proper domain
address:  www.beowulf.org.  This is a "portable" entity and has followed
Don Becker, Erik Hendriks, and many of the rest of the NASA Goddard
folks to Scyld.  I believe that www.beowulf.org is currently actually at
scyld.com but I'm not sure and the reason for using the domain name is
that it won't matter.  Wherever it really is, that's where you'll go.

So, try sending your unsubscribe message to majordomo at beowulf.org.

I believe that we are VERY close to having the beowulf list managed by
mailman, which will be a very Good Thing (tm).  If/when this finally
occurs, you can subscribe and unsubscribe and generally control the flow
of list traffic directly from a password protected web interface.  This
is a very desirable thing...

    rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From josip at icase.edu  Wed Jun  7 08:35:13 2000
From: josip at icase.edu (Josip Loncaric)
Date: Wed, 07 Jun 2000 11:35:13 -0400
Subject: Athlon + PC133: no ECC?
Message-ID: <393E6BB1.2557EF4@icase.edu>

Athlons do well on floating point, so we've been looking at building
some Athlon nodes for our cluster, using PC133 memory of course.  This
requires VIA's KX133 chipset (now) or KT133 (near future) or AMD's 760
(more distant future).

On June 5th, AMD finally released Athlons with full speed on-chip cache
(see http://www.amd.com/news/prodpr/20108.html).  These will come in
OEM 'Slot A' packaging for the existing Athlon motherboards (e.g. those
based on VIA's KX133 chipset), but 'Socket A' packaging will be
preferable.  The 'Socket A' Athlons will require the KT133 chipset from
VIA (see http://www.viatech.com/news/00kt133launch.htm), at least until
AMD gets its 760 chipset out the door.

So far so good.  Unfortunately, while VIA's KX133 datasheet at least
mentioned 'optional' ECC capability, the KT133 datasheet (VT8363 North
Bridge Controller, see http://www.viatech.com/pdf/productinfo/kt133.pdf)
makes no pretense of having any ECC features.

Our applications require a lot of RAM (16-32GB or so), and we expect
individual node uptimes of several months.  Windows users who reboot
their 128MB machines daily would not even see a problem, but we need
ECC. It makes me very uneasy to even think about tracking down an
intermittent memory problem in 32GB of RAM without ECC capability.

Am I correct in concluding that the new 'Socket A' chipset KT133 will
have *no* DRAM data integrity features?  Does anyone know if the current
motherboards based on the KX133 (the 'Slot A' chipset) actually *use*
ECC?  My reading of the Asus K7V manual is that while this motherboard
will accept an ECC memory module, there is *no* way to tell BIOS to use
DRAM ECC (only an L2 cache ECC mode is mentioned).  Moreover, the
datasheets talk about '64-bit system memory interface' in both cases, so
it seems that the KX133 optional ECC feature is external to the VIA
VT8371 chip.  Do any KX133 motherboards actually implement ECC on DRAM?

If ECC is indeed unavailable on VIA's chipsets, and AMD's 760 chipset
remains unavailable, things do not look so good for Athlons at our end.
How concerned should we be about the lack of ECC with fast Athlons? 
This issue may even force us to go back to Pentiums.  BTW, some
Linux compatibility issues with Athlons were also reported, such as the
MTRR setup and even DMA problems with certain ATA drives, but unlike the
ECC situation, those compatibility issues are presumably resolvable in
software.

Sincerely,
Josip

-- 
Dr. Josip Loncaric, Senior Staff Scientist        mailto:josip at icase.edu
ICASE, Mail Stop 132C           PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center             mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134


From deadline at plogic.com  Wed Jun  7 09:54:14 2000
From: deadline at plogic.com (Douglas Eadline)
Date: Wed, 7 Jun 2000 12:54:14 -0400 (EDT)
Subject: Athlon + PC133: no ECC?
In-Reply-To: <393E6BB1.2557EF4@icase.edu>
Message-ID: <Pine.LNX.4.10.10006071242550.13821-100000@lisa.plogic.com>

On Wed, 7 Jun 2000, Josip Loncaric wrote:

> Athlons do well on floating point, so we've been looking at building
> some Athlon nodes for our cluster, using PC133 memory of course.  This
> requires VIA's KX133 chipset (now) or KT133 (near future) or AMD's 760
> (more distant future).
> 
> On June 5th, AMD finally released Athlons with full speed on-chip cache
> (see http://www.amd.com/news/prodpr/20108.html).  These will come in
> OEM 'Slot A' packaging for the existing Athlon motherboards (e.g. those
> based on VIA's KX133 chipset), but 'Socket A' packaging will be
> preferable.  The 'Socket A' Athlons will require the KT133 chipset from
> VIA (see http://www.viatech.com/news/00kt133launch.htm), at least until
> AMD gets its 760 chipset out the door.
>

As I understand it, the "new Athlons" will only be available
as socket A parts to the general public. Slot A parts will
only be sold to OEMs and will not work with the KX chipset
in any case.
(This was my understanding anyway, perhaps I am wrong)
 
-snip-
> 
> If ECC is indeed unavailable on VIA's chipsets, and AMD's 760 chipset
> remains unavailable, things do not look so good for Athlons at our end.
> How concerned should we be about the lack of ECC with fast Athlons? 
> This issue may even force us to go back to Pentiums.  BTW, some
> Linux compatibility issues with Athlons were also reported, such as the
> MTRR setup and even DMA problems with certain ATA drives, but unlike the
> ECC situation, those compatibility issues are presumably resolvable in
> software.

ECC is nice. We are integrating ECC capabilities in our monitoring
tool to detect possible problems.

Doug

-------------------------------------------------------------------
Paralogic, Inc.           |     PEAK     |      Voice:+610.814.2800
130 Webster Street        |   PARALLEL   |        Fax:+610.814.5844
Bethlehem, PA 18015 USA   |  PERFORMANCE |    http://www.plogic.com
-------------------------------------------------------------------


From SeanWard at msn.com  Wed Jun  7 10:15:54 2000
From: SeanWard at msn.com (Sean Ward)
Date: Wed, 7 Jun 2000 13:15:54 -0400
Subject: Athlon + PC133: no ECC?
References: <Pine.LNX.4.10.10006071242550.13821-100000@lisa.plogic.com>
Message-ID: <003f01bfd0a4$09cff1e0$120010ac@alex1.va.home.com>

    I'm currently using two Athlon 750 systems based on the VIA KX133 (Abit
KA7 mobos).  The board does indeed support ECC RAM, although finding PC133 ECC
RAM is rather difficult, as the Athlons are very particular about the RAM they
use.  Several name-brand RAMs, including Mushkin and Crucial, turned out to be
unstable.  However, if you can find a good provider of PC133 ECC RAM, the
~450 MB/s STREAM figures the KA7 and an Athlon 750 turn in are not bad for
commodity parts.  As for other items, once you get decent RAM in the systems,
they are rock solid.  My one box has been up for 30 days so far (that's when I
built it).  As for the MTRR and DMA drive support, just grab the newest
patches from www.linux-ide.org for ULTRA66/100 support, and use a recent
kernel revision (such as 2.2.15) to get Athlon MTRR support.
-Sean
----- Original Message -----
From: Douglas Eadline <deadline at plogic.com>
To: Josip Loncaric <josip at icase.edu>
Cc: Beowulf mailing list <beowulf at beowulf.org>
Sent: Wednesday, June 07, 2000 12:54 PM
Subject: Re: Athlon + PC133: no ECC?


> On Wed, 7 Jun 2000, Josip Loncaric wrote:
>
> > Athlons do well on floating point, so we've been looking at building
> > some Athlon nodes for our cluster, using PC133 memory of course.  This
> > requires VIA's KX133 chipset (now) or KT133 (near future) or AMD's 760
> > (more distant future).
> >
> > On June 5th, AMD finally released Athlons with full speed on-chip cache
> > (see http://www.amd.com/news/prodpr/20108.html).  These will come in
> > OEM 'Slot A' packaging for the existing Athlon motherboards (e.g. those
> > based on VIA's KX133 chipset), but 'Socket A' packaging will be
> > preferable.  The 'Socket A' Athlons will require the KT133 chipset from
> > VIA (see http://www.viatech.com/news/00kt133launch.htm), at least until
> > AMD gets its 760 chipset out the door.
> >
>
> As I understand it, the "new Athlons" will only be available
> as socket A parts to the general public. Slot A parts will
> only be sold to OEMs and will not work with with the KX chipset
> in any case.
> (This was my understanding anyway, perhaps I am wrong)
>
> -snip-
> >
> > If ECC is indeed unavailable on VIA's chipsets, and AMD's 760 chipset
> > remains unavailable, things do not look so good for Athlons at our end.
> > How concerned should we be about the lack of ECC with fast Athlons?
> > This issue may even force us to go back to Pentiums.  BTW, some
> > Linux compatibility issues with Athlons were also reported, such as the
> > MTRR setup and even DMA problems with certain ATA drives, but unlike the
> > ECC situation, those compatibility issues are presumably resolvable in
> > software.
>
> ECC is nice. We are integrating ECC capabilities in our monitoring
> tool to detect possible problems.
>
> Doug
>
> -------------------------------------------------------------------
> Paralogic, Inc.           |     PEAK     |      Voice:+610.814.2800
> 130 Webster Street        |   PARALLEL   |        Fax:+610.814.5844
> Bethlehem, PA 18015 USA   |  PERFORMANCE |    http://www.plogic.com
> -------------------------------------------------------------------


From rgb at phy.duke.edu  Wed Jun  7 11:00:55 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed, 7 Jun 2000 14:00:55 -0400 (EDT)
Subject: Easier said than done
In-Reply-To: <200006071720.e57HKpm18031@axe1.med.upenn.edu>
Message-ID: <Pine.LNX.4.10.10006071336540.28357-100000@ganesh.phy.duke.edu>

On Wed, 7 Jun 2000 axelsen at axe1.med.upenn.edu wrote:

> 
> Dear Robert,
> 
> I have also been trying to post, and failing that, to unsubscribe
> and resubscribe so that I could post a message.  I've sent messages
> to beowulf-admin at beowulf.org about this, but no response.

Ah, this is a good thing (really!).  I went to the beowulf page at
www.beowulf.org and lo, the beowulf list is ALREADY a mailman mediated
list.  This is infinitely better than it being a majordomo list.

SO, here are the revised instructions for getting on (or off) the
beowulf list.

The "website" of the mailman-mediated beowulf mailing list (not the
beowulf website per se, just that of the list) is now:

http://www.beowulf.org/mailman/listinfo/beowulf

Start by visiting this site with your favorite browser.  To subscribe,
well, subscribe.  To unsubscribe (if you've already been on the list a
while and want to get off) scroll down to the "subscribers" section
(since you are already subscribed:-).  Enter your email address EXACTLY
AS IT WAS GIVEN IN YOUR ORIGINAL SUBSCRIPTION and click the "edit
options" button.

This puts you into your own personal configuration page.  Right there
before you is the unsubscribe option.  HOWEVER, to unsubscribe you need
a password.  Generally speaking, there is a maintenance script that runs
on the mailman server that mails all subscribed persons a reminder of
their passwords once a month, but if it has been set up by the list
administrators it obviously hasn't been run yet.

No matter.  Right there underneath the unsubscribe panel is a "Forgotten
your Password?" panel.  If you click the "Email my password to me"
button, you will get the standard password/instruction set mailed to you
in a second or so.  Go retrieve it in your mail program and you can
unsubscribe.

BUT, think -- do you really want to?  Note all the options below on this
page.  One of them is to disable mail (while remaining subscribed).  You
can stay on the list and turn it on and off like a spigot with this
option.  You can then re-enable list delivery, ask a question, stay
online for a week to get all the responses, and when the thread plays
out turn the list "off" again.  It actually might be more efficient to
work this way than to subscribe and unsubscribe over and over again to
get a week's worth of traffic when you need it.

Then there is digest mode.  If you "like" getting the list traffic but
just can't handle getting mail every twenty minutes (or whatever the
MTBM is), you can try this for a while.  In digest mode, you get the
entire day's traffic in a single message, once a day, with a
header/table of contents.  If you see anything in the TOC that interests
you, you can read it.  Otherwise, hit the ol' "d" button and move on.  I
digest all the lists I'm not myself active on, which cuts their
effective burden on me to near zero.

You can even control whether or not you want to receive MIME messages or
plaintext only.

Stuff like this is what makes mailman a very nice thing indeed.  Even
those without procmail installed (which can simulate parts of this,
poorly) can now control the delivery of list traffic very nicely,
although you'll still need procmail to filter out certain prolific
contributors (like rgb:-) or the occasional spammer that targets the
list if they annoy you...

Note that EVERYBODY on the list can (and at their convenience probably
should) check out their subscription options and retrieve/save their
password information (as well as bookmark the subscription page).  The
URL above is also the most direct route for subscribing to the beowulf
list at this point.

   rgb

> 
> 
> |>>>  From beowulf-admin at beowulf.org Wed Jun  7 11:28:58 2000
> |>>>  
> |>>>  It isn't working because the beowulf.gsfc.nasa.gov address is defunct
> |>>>  and obsolete.  This is one (of many) reasons to use the proper domain
> |>>>  address:  www.beowulf.org.  This is a "portable" entity and has followed
> |>>>  Don Becker, Erik Hendriks, and many of the rest of the NASA Goddard
> |>>>  folks to Scyld.  I believe that www.beowulf.org is currently actually at
> |>>>  scyld.com but I'm not sure and the reason for using the domain name is
> |>>>  that it won't matter.  Whereever it really is, that's where you'll go.
> |>>>  
> |>>>  So, try sending your unsubscribe message to majordomo at beowulf.org.
> 
> 
> 
> When I did this, I got the following back ...
> 
> 
> 
> |>>>  From MAILER-DAEMON at axe1.med.upenn.edu Wed Jun  7 13:13:13 2000
> |>>>  Received: from localhost (localhost)
> |>>>  	by axe1.med.upenn.edu (8.10.0/8.10.1) id e57HDDN17984;
> |>>>  	Wed, 7 Jun 2000 13:13:13 -0400 (EDT)
> |>>>  Date: Wed, 7 Jun 2000 13:13:13 -0400 (EDT)
> |>>>  From: Mail Delivery Subsystem <MAILER-DAEMON at axe1.med.upenn.edu>
> |>>>  Message-Id: <200006071713.e57HDDN17984 at axe1.med.upenn.edu>
> |>>>  To: axelsen at axe1.med.upenn.edu
> |>>>  MIME-Version: 1.0
> |>>>  Content-Type: multipart/report; report-type=delivery-status;
> |>>>  	boundary="e57HDDN17984.960397993/axe1.med.upenn.edu"
> |>>>  Subject: Returned mail: see transcript for details
> |>>>  Auto-Submitted: auto-generated (failure)
> |>>>  
> |>>>  This is a MIME-encapsulated message
> |>>>  
> |>>>  --e57HDDN17984.960397993/axe1.med.upenn.edu
> |>>>  
> |>>>  The original message was received at Wed, 7 Jun 2000 13:13:13 -0400 (EDT)
> |>>>  from axelsen at localhost
> |>>>  
> |>>>     ----- The following addresses had permanent fatal errors -----
> |>>>  majordomo at beowulf.org
> |>>>      (reason: 550 <majordomo at beowulf.org>... User unknown)
> |>>>  
> |>>>     ----- Transcript of session follows -----
> |>>>  ... while talking to blueraja.scyld.com.:
> |>>>  >>> RCPT To:<majordomo at beowulf.org>
> |>>>  <<< 550 <majordomo at beowulf.org>... User unknown
> |>>>  550 5.1.1 majordomo at beowulf.org... User unknown
> |>>>  
> |>>>  --e57HDDN17984.960397993/axe1.med.upenn.edu
> |>>>  Content-Type: message/delivery-status
> |>>>  
> |>>>  Reporting-MTA: dns; axe1.med.upenn.edu
> |>>>  Arrival-Date: Wed, 7 Jun 2000 13:13:13 -0400 (EDT)
> |>>>  
> |>>>  Final-Recipient: RFC822; majordomo at beowulf.org
> |>>>  Action: failed
> |>>>  Status: 5.1.1
> |>>>  Remote-MTA: DNS; blueraja.scyld.com
> |>>>  Diagnostic-Code: SMTP; 550 <majordomo at beowulf.org>... User unknown
> |>>>  Last-Attempt-Date: Wed, 7 Jun 2000 13:13:13 -0400 (EDT)
> |>>>  
> |>>>  --e57HDDN17984.960397993/axe1.med.upenn.edu
> |>>>  Content-Type: message/rfc822
> |>>>  
> |>>>  Return-Path: <axelsen>
> |>>>  Received: (from axelsen at localhost)
> |>>>  	by axe1.med.upenn.edu (8.10.0/8.10.1) id e57HDCO17982
> |>>>  	for majordomo at beowulf.org; Wed, 7 Jun 2000 13:13:13 -0400 (EDT)
> |>>>  Date: Wed, 7 Jun 2000 13:13:13 -0400 (EDT)
> |>>>  From: axelsen
> |>>>  Message-Id: <200006071713.e57HDCO17982 at axe1.med.upenn.edu>
> |>>>  To: majordomo at beowulf.org
> |>>>  
> |>>>  
> |>>>  unsubscribe
> |>>>  
>  
> ---------------------------------------------------------------------------- 
> 
>  Paul H. Axelsen MD, Associate Professor          ....   .....  .   .  .   .
>  Departments of Pharmacology and                  .   .  .      ..  .  ..  .
>    Medicine, Infectious Diseases Section          ....   ....   . . .  . . .
>  University of Pennsylvania School of Medicine    .      .      .  ..  .  ..
>  Rooms 130/131 John Morgan Bldg                   .      .....  .   .  .   .
>  3620 Hamilton Walk
>  Philadelphia, PA 19104-6084                      --------------------------
>                                                     215-898-9238  (office)
>  Email: axe at pharm.med.upenn.edu                     215-898-9766   (lab)
>  WWW: http://axe2.med.upenn.edu                     215-573-2236   (fax)
> 
> ----------------------------------------------------------------------------
> 

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From axelsen at axe1.med.upenn.edu  Wed Jun  7 11:39:55 2000
From: axelsen at axe1.med.upenn.edu (axelsen at axe1.med.upenn.edu)
Date: Wed, 7 Jun 2000 14:39:55 -0400 (EDT)
Subject: Scalability of CHARMM on various architectures
Message-ID: <200006071839.e57Idtx18234@axe1.med.upenn.edu>


We are designing a cluster for which the most important code will be the
computational chemistry program, CHARMM.  In our preliminary tests on
an existing cluster, we have confirmed our expectation that interthread 
communications will be our first bottleneck.  A test run on a typical 
problem did not scale beyond 4 nodes on a cluster composed of P2-450
processors and fast ethernet interconnections.  In contrast, the same
code scales almost perfectly up to at least 12 processors on an SGI-Unix 
SMP machine.  CHARMM uses PVM.

I would appreciate contact from anyone who has run CHARMM on a cluster
and has considered the best way to make this code scale better.  From 
anyone, I would appreciate general guidance on several issues:

 * If we go with pentium II/III processors, how far is Myrinet likely to
   permit us to scale these calculations?

 * With Myrinet, would alpha nodes tend to scale any better than pentium nodes?

   (a single 500 MHz alpha processor is about 1.6-fold faster than a 
    single 450 MHz P2 on a typical problem, but it is about 3-fold the
    cost.  With this question, I am looking for any additional advantages 
    of alpha to justify this cost)

 * Are there differences between different pentium chip sets that will impact
   this problem?

 * Is there any advantage to buying dual-processor machines, either alpha
   or pentium, with respect to scaling?  Any advantage to dedicating one 
   processor in each dual-box to communications?  If we dedicate processors
   in this way, would both processors have to have the same clock speed?
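
The cost side of the Alpha question above can be made explicit with a
little arithmetic, sketched here in Python.  The only inputs are the two
ratios given in the parenthetical (about 1.6-fold faster, about 3-fold
the cost); the function names are hypothetical and nothing else about
the hardware is assumed.

```python
def relative_value(speedup, cost_ratio):
    """Performance per unit cost of the faster node, relative to the
    cheaper node (1.0 means equal value for money)."""
    return speedup / cost_ratio


def break_even_gain(speedup, cost_ratio):
    """Additional factor (e.g. from better scaling) the faster node would
    need to deliver to match the cheaper node's value for money."""
    return cost_ratio / speedup


if __name__ == "__main__":
    print(f"Alpha value relative to P2: {relative_value(1.6, 3.0):.2f}")
    print(f"Extra gain needed to break even: {break_even_gain(1.6, 3.0):.2f}x")
```

On single-processor speed alone the Alpha delivers roughly half the
performance per unit cost, so it would need close to a further 2-fold
advantage from scaling or other factors to break even.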


------------- axe at pharm.med.upenn.edu -----------------
                                                       
Paul H. Axelsen               ....   ....  .   .  .   .
Department of Pharmacology    .   .  .     ..  .  ..  .
University of Pennsylvania    ....   ...   . . .  . . .
3620 Hamilton Walk            .      .     .  ..  .  ..
Philadelphia, PA 19104-6084   .      ....  .   .  .   .

-------------------------------------------------------


From glindahl at hpti.com  Wed Jun  7 11:56:01 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Wed, 7 Jun 2000 14:56:01 -0400
Subject: Scalability of CHARMM on various architectures
In-Reply-To: <200006071839.e57Idtx18234@axe1.med.upenn.edu>
Message-ID: <005701bfd0b2$0647fb40$f69cfea9@hptilap.hpti.com>

> I would appreciate contact from anyone who has run CHARMM on a cluster
> and has considered the best way to make this code scale better.  From
> anyone, I would appreciate general guidance on several issues:
>
>  * If we go with pentium II/III processors, how far is Myrinet likely to
>    permit us to scale these calculations?
>
>  * With Myrinet, would alpha nodes tend to scale any better than
> pentium nodes?

I only have *old* Alpha/Myrinet numbers, but they're considerably better on
scaling than Intel ethernet, not that this is a surprise.

Michael Crowley was supposed to send me a new copy of the code so I could do
a complete comparison, but it hasn't happened yet.

http://legion.virginia.edu/centurion/Applications.html

-- greg


From alan at dasher.wustl.edu  Wed Jun  7 12:11:49 2000
From: alan at dasher.wustl.edu (Alan Grossfield)
Date: Wed, 07 Jun 2000 14:11:49 -0500
Subject: Scalability of CHARMM on various architectures 
In-Reply-To: Your message of "Wed, 07 Jun 2000 14:39:55 EDT."
             <200006071839.e57Idtx18234@axe1.med.upenn.edu> 
Message-ID: <200006071911.OAA0000473920@dasher.wustl.edu>

:We are designing a cluster for which the most important code will be the
:computational chemistry program, CHARMM.  In our preliminary tests on
:an existing cluster, we have confirmed our expectation that interthread 
:communications will be our first bottleneck.  A test run on a typical 
:problem did not scale beyond 4 nodes on a cluster composed of P2-450
:processors and fast ethernet interconnections.  In contrast, the same
:code scales almost perfectly up to at least 12 processors on an SGI-Unix 
:SMP machine.  CHARMM uses PVM.

    Actually, all modern versions of CHARMM use MPI.  The scaling behavior
    is more or less well-known, though, and quite frustrating.  You can do
    a bit better if you use Josip Loncaric's TCP patches (8 processors is
    maybe 6x faster than 1 processor for MD using PME with ~10K atoms), but
    that's about it.
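
To put the "8 processors is maybe 6x faster" figure in perspective, here is a small sketch that turns a measured speedup into a parallel efficiency and an effective serial fraction (the Karp-Flatt metric, e = (1/S - 1/N) / (1 - 1/N)). The function names are illustrative, and the 6x figure is only the approximate one quoted above.

```shell
#!/bin/sh
# $1 = measured speedup S, $2 = number of processors N

# Parallel efficiency: S / N
efficiency() {
    awk -v s="$1" -v n="$2" 'BEGIN { printf "%.2f\n", s / n }'
}

# Effective serial fraction (Karp-Flatt): (1/S - 1/N) / (1 - 1/N)
serial_fraction() {
    awk -v s="$1" -v n="$2" 'BEGIN { printf "%.3f\n", (1/s - 1/n) / (1 - 1/n) }'
}

# efficiency 6 8         prints 0.75
# serial_fraction 6 8    prints 0.048
```

An effective serial fraction near 5% caps the achievable speedup at roughly 20x under Amdahl's law, however many nodes are added, unless communication overhead is reduced.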

: * If we go with pentium II/III processors, how far is Myrinet likely to
:   permit us to scale these calculations?

    It should be better -- I know Bernie Brooks' lab is using gigabit
    ethernet to connect LoBoS (they're one of the primary sites for CHARMM
    development), but I haven't checked their benchmarks recently.

:
: * With Myrinet, would alpha nodes tend to scale any better than pentium
:   nodes?
:

    Probably, because of the better bus speeds, but I haven't seen data on
    this (maybe the paralogic guys can comment, if they've done benchmarks).


Alan Grossfield
-------------------------------------
| New email:  alan at dasher.wustl.edu |
| Update accordingly                |
-------------------------------------


From cgreer1 at midsouth.rr.com  Wed Jun  7 21:12:36 2000
From: cgreer1 at midsouth.rr.com (Chris Greer)
Date: Wed, 07 Jun 2000 23:12:36 -0500
Subject: managing user accounts without NIS
References: <Pine.GSO.3.96L.1000520121157.25277B-100000@unixs2.cis.pitt.edu>
Message-ID: <393F1D34.6870A997@midsouth.rr.com>

We are in the process of migrating away from NIS to an rsync based
system.  We've got some scripts to help manage a centralized password
system but each machine only gets the specific "political groups" of 
users that are assigned to it.  You change your password via a web interface.
I know this probably has some people cringing (I cringed at the idea myself
for a while), but the web interface allows us to take things a step
or two further.  We are working on scripts that will also integrate
into the Novell/NT side of our Lan so that we truly have a single
account system.   The PC side is still in the works, and obviously
if you are just reading this group for the beowulf aspects this
isn't important to you, but I deal not only with a beowulf type
setup from an admin perspective, but we also have 100+ UNIX servers
of varying flavors not including our 20 node cluster.  

Chris G.

Another option we used at a previous site was a smart script that would
gather the password files from all the nodes, figure out if you changed 
it on any of them, update the password map with the changed password, 
and then re-push out the new password map to all of the servers.  It 
ran once an hour, so that changes weren't immediate, but were propagated
in a reasonable time.  Of course if you are using a beowulf for high end
computing, you probably don't want to interrupt things every hour just
to see if things changed and such.  

I haven't had experience with kerberos, but it might help you.  I don't
know if it can be used in place of the password authentication for user
accounts though.


Victor Ortega wrote:
> 
> I have looked at the archives searching for a good way to manage user
> accounts on a beowulf cluster.  Some people suggested using rsync, but
> my question is, how?  rsync is nothing more than an efficient version
> of rcp; it doesn't really "synchronize" files--by that I mean that as
> soon as (or soon after) one file gets modified, the other files get
> updated.  In particular, I want my users to be able to change their
> passwords or their login shells from any node and have the relevant
> files in /etc updated on all nodes, without the users having to do
> anything else on their part (like running some "update" script).  I
> would really rather not write setuid-root wrappers to passwd and chsh,
> as I don't want to inadvertently introduce a security hole to my
> system.  I have considered writing a PAM module, but I don't think
> this would cover the chsh case.  I also don't want to hack the kernel
> or the file system to manage user accounts.  Any suggestions?
> 
> Victor
> 
> _______________________________________________
> Beowulf mailing list
> Beowulf at beowulf.org
> http://www.beowulf.org/mailman/listinfo/beowulf


From cgreer1 at midsouth.rr.com  Wed Jun  7 21:17:39 2000
From: cgreer1 at midsouth.rr.com (Chris Greer)
Date: Wed, 07 Jun 2000 23:17:39 -0500
Subject: managing user accounts without NIS
References: <39278C80.9CBEF081@supercomputer.org>
Message-ID: <393F1E63.576C317B@midsouth.rr.com>

rsync -e ssh is the option we use.  It's rsync over ssh.
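
A minimal sketch of what such a push might look like, not the actual scripts used at this site. The node names, the file list, and the assumption that root can ssh to each node with a key are all hypothetical; only the rsync options -a and -e are taken as given.

```shell
#!/bin/sh
# Hypothetical defaults; override them in the environment for a real cluster.
NODES="${NODES:-node01 node02}"
FILES="${FILES:-/etc/passwd /etc/group /etc/shadow}"
RSYNC="${RSYNC:-rsync}"    # point this at a stub command to rehearse the run

push_accounts() {
    status=0
    for node in $NODES; do
        # -a keeps owner, mode and timestamps; -e ssh carries the copy over ssh
        $RSYNC -a -e ssh $FILES "root@$node:/etc/" || {
            echo "push to $node failed" >&2
            status=1
        }
    done
    return $status
}
```

Run from cron on the master after a password change, this copies the account files to every node. It does nothing about changes made on the nodes themselves, which is why a central change point such as the web interface matters.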

dwight wrote:
> 
> Victor Ortega wrote:
> 
> > NIS and NFS are insecure and incur performance penalties.  I'm looking
> > for better alternatives.  My idea of setuid-root wrappers (using rsync
> > for distribution of relevant files) already provides a more secure,
> > high-performance, high-availability alternative; I just want to make
> > sure that there isn't something better out there already, and that I'm
> > not overlooking some potential security hole.
> 
> Just using rsync per se might well subject you to a man-in-the-middle
> attack, or a spoofing attack. ssh/scp would be a better tool.
> 
> Or just set up Kerberos and simply use it for authentication.
> 
> Best Regards,
> 
>     -dwight-
> 
> ---------------------------------------------------------------------------
> The Beowulf Mailing list archives can now be searched by visiting:
>         http://www.supercomputer.org/Search/
> The Calendar of Events in supercomputering can be found at:
>         http://www.supercomputer.org/calendar/
> 


From covenant at dirac.org  Wed Jun  7 23:16:57 2000
From: covenant at dirac.org (Peter Jay Salzman)
Date: Wed, 7 Jun 2000 23:16:57 -0700 (PDT)
Subject: managing user accounts without NIS
In-Reply-To: <393F1D34.6870A997@midsouth.rr.com>
Message-ID: <Pine.LNX.4.10.10006072315200.7902-100000@dirac.org>

chris,

i'm about to configure NIS on our cluster.   i'd be very interested in
hearing why your group is moving away from NIS.   we have a very
homogeneous 40 node cluster which is pretty secure at the moment.

before continuing with the NIS howto, i'd love to hear your comments.  :)

pete

> Date: Wed, 07 Jun 2000 23:12:36 -0500
> From: Chris Greer <cgreer1 at midsouth.rr.com>
> To: Victor Ortega <vor+ at pitt.edu>
> Cc: Beowulf mailing list <beowulf at beowulf.org>
> Subject: Re: managing user accounts without NIS
> 
> We are in the process of migrating away from NIS to an rsync based
> system.  We've got some scripts to help manage a centralized password
> system but each machine only gets the specific "political groups" of 
> users that are assigned to it.  You change your password via a web interface.
> I know this probably has some people cringing (I cringed at the idea myself
> for a while), but the web interface allows us to take things a step
> or two further.  We are working on scripts that will also integrate
> into the Novell/NT side of our Lan so that we truly have a single
> account system.   The PC side is still in the works, and obviously
> if you are just reading this group for the beowulf aspects this
> isn't important to you, but I deal not only with a beowulf type
> setup from an admin perspective, but we also have 100+ UNIX servers
> of varying flavors not including our 20 node cluster.  
> 
> Chris G.
> 
> Another option we used at a previous site was a smart script that would
> gather the password files from all the nodes, figure out if you changed 
> it on any of them, update the password map with the changed password, 
> and then re-push out the new password map to all of the servers.  It 
> ran once an hour, so that changes weren't immediate, but were propagated
> in a reasonable time.  Of course if you are using a beowulf for high end
> computing, you probably don't want to interrupt things every hour just
> to see if things changed and such.  
> 
> I haven't had experience with kerberos, but it might help you.  I don't
> know if it can be used in place of the password authentication for user
> accounts though.
> 
> 
> Victor Ortega wrote:
> > 
> > I have looked at the archives searching for a good way to manage user
> > accounts on a beowulf cluster.  Some people suggested using rsync, but
> > my question is, how?  rsync is nothing more than an efficient version
> > of rcp; it doesn't really "synchronize" files--by that I mean that as
> > soon as (or soon after) one file gets modified, the other files get
> > updated.  In particular, I want my users to be able to change their
> > passwords or their login shells from any node and have the relevant
> > files in /etc updated on all nodes, without the users having to do
> > anything else on their part (like running some "update" script).  I
> > would really rather not write setuid-root wrappers to passwd and chsh,
> > as I don't want to inadvertently introduce a security hole to my
> > system.  I have considered writing a PAM module, but I don't think
> > this would cover the chsh case.  I also don't want to hack the kernel
> > or the file system to manage user accounts.  Any suggestions?
> > 
> > Victor


From brua at paralline.com  Thu Jun  8 04:29:50 2000
From: brua at paralline.com (Pierre Brua)
Date: Thu, 08 Jun 2000 13:29:50 +0200
Subject: [Announce] The jobd Load Balancer
References: <20000606233916.E770@ostenfeld.dk>
Message-ID: <393F83AE.D23B9255@paralline.com>

Jakob Østergaard wrote:
> It is available at  http://ostenfeld.dk/~jakob/jobd/

Won't work. The good one is http://www.ostenfeld.dk/~jakob/jobd/

> So if anyone besides me is sick and tired of waiting for those half-hour C++
> compilations, here's a chance to justify a Beowulf for your boss   ;)

There may be a naming conflict with the jobd at
http://bond.imm.dtu.dk/jobd/.
Maybe this other jobd already solves your problems and is more mature
than yours?

Hope it helps,
-- 
Pierre Brua     PARALLINE Sarl      Parallelism & Linux Solutions
71,av. des Vosges Phone:+33 388 141 740 mailto:brua at paralline.com
F-67000 STRASBOURG  Fax:+33 388 141 741  http://www.paralline.com


From demeler at bioc09.v19.uthscsa.edu  Thu Jun  8 07:45:58 2000
From: demeler at bioc09.v19.uthscsa.edu (Borries Demeler)
Date: Thu, 8 Jun 2000 09:45:58 -0500 (CDT)
Subject: [Announce] The jobd Load Balancer
In-Reply-To: <20000606233916.E770@ostenfeld.dk> from "=?iso-8859-1?Q?Jakob_=D8stergaard?=" at Jun 06, 2000 11:39:16 PM
Message-ID: <200006081445.JAA25167@bioc09.v19.uthscsa.edu>

> I need GNU Make, so BSD pmake is out.  Customs GNU Make doesn't work properly
> and isn't maintained. PVM GNU Make may work, but didn't for me, I also have the
> feeling that this is too much of a hack to be relied upon.  GNU Queue was
> close, except that it breaks under load.  Generic NQS was too big (too slow for
> short jobs, too complex).
> 
> Fixing GNU Queue wasn't an option for me, with all due respect that is by far
> the ugliest code I've seen in a long time.
> 

I haven't tried it, but maybe someone else has and can comment: Doesn't
Mosix allow for automatic process migration such that you could invoke
a compilation with make -j <number_of_nodes> and have it compile in parallel?

In any case, I would like to know if this is a feasible route for parallel
compilation. Has anybody tried this and how would it compare in speed to
something like jobd?

Thanks for all responses!

-Borries


From josip at icase.edu  Thu Jun  8 14:58:45 2000
From: josip at icase.edu (Josip Loncaric)
Date: Thu, 08 Jun 2000 17:58:45 -0400
Subject: TCP patch for Red Hat 6.2 kernel 2.2.14-12
Message-ID: <39401715.904DF2D@icase.edu>

Hello,

my TCP patch is now available for Red Hat 6.2 kernel 2.2.14-12:

  http://www.icase.edu./~josip/tcp-patch-for-2.2.14-12

The web page http://www.icase.edu/coral/LinuxTCP2.html explains what
this patch does.

Patches for older Red Hat 6.2 kernels are:

  http://www.icase.edu./~josip/tcp-patch-for-2.2.14-6.0.1
  http://www.icase.edu./~josip/tcp-patch-for-2.2.14-5.0

Patches for Linux kernels 2.2.12 and 2.2.13 are:

  http://www.icase.edu./~josip/tcp-patch-for-2.2.13
  http://www.icase.edu./~josip/tcp-patch-for-2.2.12

Please verify the md5 checksum after downloading these files to make
certain they did not get corrupted in transit.  Here are the correct md5
checksums:

3fc16704ac99651a18e47b7a3eccc675 *tcp-patch-for-2.2.12
f72305c7800552b2449d8288bc63b975 *tcp-patch-for-2.2.13
4841c4c21a3bc10e5fa5d04cfd6288ac *tcp-patch-for-2.2.14-12
4a2d599a5b07676808fe3c6e1769efea *tcp-patch-for-2.2.14-5.0
4138e1c13fd6c3895e56ac6b97773e40 *tcp-patch-for-2.2.14-6.0.1
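
One way to do the check, as a sketch assuming GNU md5sum is installed (the helper name verify_md5 is made up for illustration):

```shell
#!/bin/sh
# Compare a file's md5 digest against the expected value.
# $1 = file, $2 = expected md5 hex digest
verify_md5() {
    actual=$(md5sum "$1" | awk '{ print $1 }')
    if [ "$actual" = "$2" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1 (got $actual)" >&2
        return 1
    fi
}
```

For example: verify_md5 /tmp/tcp-patch-for-2.2.14-12 4841c4c21a3bc10e5fa5d04cfd6288ac
Saving the checksum list above to a file and running md5sum -c on it in the download directory should do the same for all five files in one step.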

To apply the patch, download the patch file to /tmp then do the
following:

(1) create a new kernel source tree from original kernel files,
    e.g. cp -a /usr/src/linux-2.2.14 /usr/src/linux-2.2.14-12tcp

(2) cd /usr/src/linux-2.2.14-12tcp
    patch -p1 </tmp/tcp-patch-for-2.2.14-12

(3) verify that the patch was applied correctly, then replace the link
    /usr/src/linux (removing the old link first) with
    ln -s /usr/src/linux-2.2.14-12tcp /usr/src/linux

(4) configure, build and install the new kernel

(5) update network card driver modules as needed

(6) add the following to your /etc/rc.d/rc.local script:

#
# Based on Fast Ethernet tests with tulip.c:v0.92 driver,
# the recommended strategy is '3' with faster timeouts
#
if [ -f /proc/sys/net/ipv4/tcp_delack_strategy ]; then
    echo 3 >/proc/sys/net/ipv4/tcp_delack_strategy
fi
if [ -f /proc/sys/net/ipv4/tcp_faster_timeouts ]; then
    echo 1 >/proc/sys/net/ipv4/tcp_faster_timeouts
fi
#
# Some generally useful network features
#
if [ -f /proc/sys/net/core/netdev_max_backlog ]; then
    echo 1000 >/proc/sys/net/core/netdev_max_backlog
fi
if [ -f /proc/sys/net/ipv4/icmp_echo_ignore_broadcasts ]; then
    echo 1 >/proc/sys/net/ipv4/icmp_echo_ignore_broadcasts
fi

(7) Re-check your work, run /sbin/lilo, then reboot with the new kernel

After reboot, files /proc/sys/net/ipv4/tcp_delack_strategy and
/proc/sys/net/ipv4/tcp_faster_timeouts should exist and have the values
you specified in your rc.local script.  All TCP sockets which turn on
the TCP_NODELAY socket option (e.g. MPI sockets) will activate the
patch, while all other connections should remain unaffected.  BTW,
tcp_delack_strategy=10 and tcp_faster_timeouts=0 turn off the patch
completely.  These are the defaults after boot, so the patch will not be
active unless these values are changed.
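
A quick post-reboot check could look like the sketch below. The helper name check_tunable is hypothetical; the paths and values are the ones given in the rc.local fragment above.

```shell
#!/bin/sh
# Report whether a /proc/sys entry exists and holds the expected value.
# $1 = path to the entry, $2 = expected value
check_tunable() {
    if [ ! -f "$1" ]; then
        echo "missing: $1 (kernel not patched?)"
        return 1
    fi
    val=$(cat "$1")
    if [ "$val" = "$2" ]; then
        echo "ok: $1 = $val"
    else
        echo "unexpected: $1 = $val (wanted $2)"
        return 1
    fi
}
```

Usage, with the recommended settings:

    check_tunable /proc/sys/net/ipv4/tcp_delack_strategy 3
    check_tunable /proc/sys/net/ipv4/tcp_faster_timeouts 1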

Sincerely,

Josip

-- 
Dr. Josip Loncaric, Senior Staff Scientist        mailto:josip at icase.edu
ICASE, Mail Stop 132C           PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center             mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134


From jsquyres at lsc.nd.edu  Thu Jun  8 15:12:57 2000
From: jsquyres at lsc.nd.edu (Jeff Squyres)
Date: Thu, 8 Jun 2000 18:12:57 -0400 (EDT)
Subject: Problems with MPICH 1.2 and Beowulf/Linux
In-Reply-To: <200006061355.PAA26685@pcmenelao.mi.infn.it>
Message-ID: <Pine.LNX.4.21.0006081811280.6900-100000@tigger.lsc.nd.edu>

On Tue, 6 Jun 2000, Francesco Marini wrote:

>     Second : I've got some prob compiling ScaLapack with LAM-MPI, gcc
> and pgf77 (f77 compiler from Portland Group), it gives a lot of
> unresolved symbols regarding MPI. Anyone succeeded in compiling them
> under same configuration ?

I'm afraid that I can't help you with your MPICH problem, but for
instructions for compiling ScaLAPACK with LAM, see:

	http://www.mpi.nd.edu/lam/3rd-party/scalapack.php3

{+} Jeff Squyres
{+} squyres at cse.nd.edu
{+} Perpetual Obsessive Notre Dame Student Craving Utter Madness
{+} "I came to ND for 4 years and ended up staying for a decade"


From alex at santafe.edu  Thu Jun  8 17:56:16 2000
From: alex at santafe.edu (Alex Lancaster)
Date: 08 Jun 2000 18:56:16 -0600
Subject: Please help me unsubscribe
References: <Pine.LNX.4.10.10006071117500.28357-100000@ganesh.phy.duke.edu>
Message-ID: <cfd7lroeof.fsf@lucero.santafe.edu>

>>>>> "RB" == Robert G Brown <rgb at phy.duke.edu> writes:

[...]

RB> I believe that we are VERY close to having the beowulf list
RB> managed by mailman, which will be a very Good Thing (tm).  If/when
RB> this finally occurs, you can subscribe and unsubscribe and
RB> generally control the flow of list traffic directly from a
RB> password protected web interface.  This is a very desirable
RB> thing...

Yep, I agree it is a very desirable thing for most folks.  *Provided*
one thing: that you can still [un]subscribe via majordomo if you so
desire.  I'm loath to start up a web browser just to do mailing list
management.  I know most people have trouble with majordomo, but when
you've been using it for as long as I have, you get used to its
quirks, and at least I can manage my mailing lists using `gnus' inside
emacs slogged-in to a terminal over a modem line without having to
fire up lynx or netscape...

Here's hoping majordomo doesn't go away completely...  My $0.02.

A.
-- 
Alex Lancaster * alex at santafe.edu * www.santafe.edu/~alex * 505 984-8800 x242
Santa Fe Institute (www.santafe.edu) & Swarm Development Group (www.swarm.org)


From deadline at plogic.com  Fri Jun  9 04:30:39 2000
From: deadline at plogic.com (Douglas Eadline)
Date: Fri, 9 Jun 2000 07:30:39 -0400 (EDT)
Subject: Please help me unsubscribe
In-Reply-To: <cfd7lroeof.fsf@lucero.santafe.edu>
Message-ID: <Pine.LNX.4.10.10006090729390.17094-100000@lisa.plogic.com>

FYI:

http://www.crn.com/dailies/digest/breakingnews.asp?ArticleID=17350

Doug
-------------------------------------------------------------------
Paralogic, Inc.           |     PEAK     |      Voice:+610.814.2800
130 Webster Street        |   PARALLEL   |        Fax:+610.814.5844
Bethlehem, PA 18015 USA   |  PERFORMANCE |    http://www.plogic.com
-------------------------------------------------------------------


From Gianluca.Cecchi at Italy.ACNielsen.com  Fri Jun  9 04:59:20 2000
From: Gianluca.Cecchi at Italy.ACNielsen.com (Cecchi, Gianluca)
Date: Fri, 9 Jun 2000 12:59:20 +0100 
Subject: Please help me unsubscribe
Message-ID: <67288CD5E8C0D211B1B30001FAD4F04AB31274@ACN039MILMSX01>

FYI too:

http://www.hptechcomp.com/index.asp?sessionid=560396771316524984562&navi=5&arnr=0600_042_linux

Gianluca Cecchi


-----Original Message-----
From: Douglas Eadline [mailto:deadline at plogic.com]
Sent: Friday, 9 June 2000 13:31
To: beowulf at beowulf.org
Subject: Re: Please help me unsubscribe


FYI:

http://www.crn.com/dailies/digest/breakingnews.asp?ArticleID=17350

Doug
-------------------------------------------------------------------
Paralogic, Inc.           |     PEAK     |      Voice:+610.814.2800
130 Webster Street        |   PARALLEL   |        Fax:+610.814.5844
Bethlehem, PA 18015 USA   |  PERFORMANCE |    http://www.plogic.com
-------------------------------------------------------------------




From rgb at phy.duke.edu  Fri Jun  9 05:19:21 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Fri, 9 Jun 2000 08:19:21 -0400 (EDT)
Subject: Please help me unsubscribe
In-Reply-To: <Pine.LNX.4.10.10006090729390.17094-100000@lisa.plogic.com>
Message-ID: <Pine.LNX.4.10.10006090807440.1259-100000@ganesh.phy.duke.edu>

On Fri, 9 Jun 2000, Douglas Eadline wrote:

> 
> FYI:
> 
> http://www.crn.com/dailies/digest/breakingnews.asp?ArticleID=17350

    Wolff said he questions whether security issues will be totally
    solved as Linux scales upward. "I'm always afraid to jump on any fad
    until I see where it plays," he said. "Linux is a great platform for
    some things, but IBM jumping in will help."

<chortle> 
I would have written this as "IBM was once a great company in a lot of
ways, and adopting linux across the board will help."
</chortle>

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From Jose_Maria_Gonzalez at dell.com  Fri Jun  9 07:04:46 2000
From: Jose_Maria_Gonzalez at dell.com (Jose_Maria_Gonzalez at dell.com)
Date: Fri, 9 Jun 2000 09:04:46 -0500 
Subject: network performance tool
Message-ID: <06E1DE556A23D411825A0090273BF1C82AC49E@LIMXMMF204>

Hi there,

My sincere apologies for this silly question, but does anybody know of a
tool, program, script, or native Red Hat utility to measure the real
traffic on my network? I have set up a COW (8 nodes) and I am using
MPICH 1.1.2 to run a parallel program. When I run this program the network
traffic is very high, so I wonder whether it could be a bottleneck.

I have used netperf and the latest version of SCMS which displays some
network device information too, but it does not really help me much.

Any ideas?
I would really appreciate any information.

Thank you very much.
Sincerely Yours,
Jose

_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/

Jose Maria Gonzalez Martin,
System Engineer,
ASC Lab, Dell Computer Corporation, Castletroy, Limerick, Ireland.
Tel. 353 61 502100
Jose_Maria_Gonzalez at dell.com
http://www.dell.com/asc
_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/


From wsb at paralleldata.com  Fri Jun  9 10:44:49 2000
From: wsb at paralleldata.com (W Bauske)
Date: Fri, 09 Jun 2000 12:44:49 -0500
Subject: Please help me unsubscribe
References: <Pine.LNX.4.10.10006090807440.1259-100000@ganesh.phy.duke.edu>
Message-ID: <39412D11.B5419830@paralleldata.com>

"Robert G. Brown" wrote:
> 
> 
> <chortle>
> I would have written this as "IBM was once a great company in a lot of
> ways, and adopting linux across the board will help."
> </chortle>
> 

I guess the fact businesses spend $60-70 billion on them each year makes
them a has been and Linux will add huge amounts more revenue to their
pitiful bottom line. Get real...


Wes Bauske


From rgb at phy.duke.edu  Fri Jun  9 11:02:32 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Fri, 9 Jun 2000 14:02:32 -0400 (EDT)
Subject: network performance tool
In-Reply-To: <06E1DE556A23D411825A0090273BF1C82AC49E@LIMXMMF204>
Message-ID: <Pine.LNX.4.10.10006091321230.2415-100000@ganesh.phy.duke.edu>

On Fri, 9 Jun 2000 Jose_Maria_Gonzalez at dell.com wrote:

> Hi there,
> 
> My sincere apologies for this silly question, but does anybody know of a
> tool, program, script, or native Red Hat utility to measure the real
> traffic on my network? I have set up a COW (8 nodes) and I am using
> MPICH 1.1.2 to run a parallel program. When I run this program the network
> traffic is very high, so I wonder whether it could be a bottleneck.
> 
> I have used netperf and the latest version of SCMS which displays some
> network device information too, but it does not really help me much.

I'm not sure whether you are asking for tools like netperf or netpipe
for measuring your network's capacity or tools for monitoring the real
network traffic while your program is running.  Netperf (or netpipe)
is about as good as it gets for measuring raw performance -- setting up a
socket connection and measuring what you can jam through it is basically
what either one does (with various flags controlling this and that).

I believe that both MPICH and PVM come with some examples and tools for
measuring effective network performance, but I'm not certain if those
tools (at least as described in the postscript guide) made it into the
Red Hat powertools mpich RPM.  One could presumably retrieve the sources
(which is all that you need) from a full MPICH tarball easily enough.
PVM's examples do include timing programs with the RH installation (on
the main Red Hat CD these days).

If you're looking for tools to measure the packet flow during dynamical
operation, there are tracing tools (upshot/nupshot for MPICH and
more, xpvm for PVM), some commercial tools (check the MPI/PVM websites),
and a variety of tools for monitoring raw network loads on independent
computers (e.g. procmeters of various sorts).  The load meters that come
with RH are remarkably few and poor, for whatever reason, even
including the powertools cd.  You can always try shopping on rufus 

  http://rufus.w3.org/linux/RPM/

which is what I do when I want a program for some task.  This server has
100+ GB of RPM's in a massive, cross-referenced database.  You should
likely shop the beowulf underground site as well, as they may have other
tools that have been registered that are more beowulf specific.

Finally, there is procstatd, which comes with a simple perl-tk tool and
has a template web interface in either of its source packages (it's hard
to package the web interface because webserver setups vary so widely).
Eventually I hope to add a few other interfaces (or hope that somebody
else does for me, being lazy).  procstatd has a simple interface you can
use to build your own GUI or tty tool(s) from any scripting or
programming language.  As it is currently set up, it lets you monitor up
to four ethernet interfaces (virtually every aspect of the interface as
recorded in /proc) on a whole network of hosts simultaneously.  The
provided perlTk tool (watchman) is adequate for monitoring 8-32 hosts at
once (depending on the resolution of your display) and of course you can
have multiple instances of watchman running to do more.  If I ever have
time I'll build a scrolling display, probably abandoning perlTk for Gtk
and C.  You can find the current rpm, a source rpm, and a ready-to-make
tarball, on

  http://www.phy.duke.edu/brahma

(look for the procstatd links, which point to symlinks to the current
release).  You will need to get and install a perlTk rpm on top of your
existing perl -- one that should work is provided on the brahma page but
isn't guaranteed to be particularly current.
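
For a quick look at the raw interface counters without installing anything, the same /proc data can be read directly. This is a minimal sketch, not part of procstatd: the function names are made up, and it assumes the 16-column /proc/net/dev layout (8 receive fields, then 8 transmit fields), which is worth checking against the header on your own kernel.

```shell
#!/bin/sh
# Print "rx_bytes tx_bytes" for one interface.
# $1 = interface name, $2 = stats file (defaults to /proc/net/dev)
iface_bytes() {
    awk -v ifc="$1" '
        { sub(/:/, " ") }              # the colon may be glued to the first counter
        $1 == ifc { print $2, $10 }    # field 2 = rx bytes, field 10 = tx bytes
    ' "${2:-/proc/net/dev}"
}

# Sample the counters twice and print the average rate in bytes per second.
# $1 = interface name, $2 = interval in seconds (default 5)
iface_rate() {
    ifc=$1
    secs=${2:-5}
    first=$(iface_bytes "$ifc")
    [ -n "$first" ] || { echo "no such interface: $ifc" >&2; return 1; }
    sleep "$secs"
    second=$(iface_bytes "$ifc")
    set -- $first $second
    echo "$ifc: rx $(( ($3 - $1) / secs )) B/s, tx $(( ($4 - $2) / secs )) B/s"
}
```

Running iface_rate eth0 10 on a node while the parallel job is active gives a number to compare with the roughly 12.5 MB/s theoretical ceiling of Fast Ethernet. Counters on older kernels may wrap, so short intervals are safer.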

  Hope this helps.  If it doesn't, be a bit more specific about what you
are looking for.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From rgb at phy.duke.edu  Fri Jun  9 11:39:36 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Fri, 9 Jun 2000 14:39:36 -0400 (EDT)
Subject: Please help me unsubscribe
In-Reply-To: <39412D11.B5419830@paralleldata.com>
Message-ID: <Pine.LNX.4.10.10006091419100.2415-100000@ganesh.phy.duke.edu>

On Fri, 9 Jun 2000, W Bauske wrote:

> "Robert G. Brown" wrote:
> > 
> > 
> > <chortle>
> > I would have written this as "IBM was once a great company in a lot of
> > ways, and adopting linux across the board will help."
> > </chortle>
> > 
> 
> I guess the fact businesses spend $60-70 billion on them each year makes
> them a has been and Linux will add huge amounts more revenue to their
> pitiful bottom line. Get real...

Whoa, I was just kidding, in a wry sort of way.  That'll teach ME to be
terse;-)

To explain my remark further, I really do think that Linux is an
essential part of IBM's strategy for maintaining those huge revenues.
OS/2 tanked after they were basically betrayed by Microsoft, and it is
difficult and expensive for IBM to maintain their own "private" systems
group with non-mainstream operating systems that are not compatible or
portable across all their various platforms, however lucrative the fish
they have shot in these particular barrels have been in the past.  Ask
DEC, Honeywell, etc (long list) just how long a multibillion dollar
mostly-hardware company lasts when their software is too nonstandard or
their price point too non-competitive.

I love IBM.  I bought IBM stock back when I was nine or ten.  Learned to
type on an IBM Selectric typewriter.  Learned to program on IBM
mainframes with IBM fortran IV and HASP on IBM card punches and IBM card
readers, programmed mastermind in APL on an IBM 5100 (still have the
program on an archaic tape somewhere), owned a 64K motherboard IBM PC
(and am still kicking myself for donating the aged husk of a chassis to
a school as it would have been a kick to refill it with modern
motherboards and use it as a desktop).  I think of them as the huge,
immensely rich and powerful, multinational monopoly with a heart (just
kidding again!).  Seriously, IBM has succeeded in reinventing themselves
a number of times where their competitors have failed and fallen by
various waysides.

I'm very pleased that they've overwhelmingly adopted linux and that the
adoption appears to be migrating quite aggressively from their small
computer and netfinity business into their other small mainframe and
supercomputing operations. I live not far from their Research Triangle
operation, which used to be home to OS/2 development -- folks out there
are often militant about linux, these days (as is a lot of their
netfinity group).  

On the other hand, IBM does nothing except in the hope of making money
(while providing good services, of course) and aren't moving to linux
out of dreams of revenge on Microsoft or because their other OS's aren't
currently profitable.  They're betting on the horse they think will win
the race and preserve those lovely revenues, while (I'm sure)
anticipating that in the long run they can eventually save a lot of
money by NOT having multiple competing incompatible mainline software
operations.  IBM's hardware has always been excellent, but their
software over the years has not infrequently left something to be
desired and some of it would never have sold at all if their hardware
customers had had a choice.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From wsb at paralleldata.com  Fri Jun  9 13:20:55 2000
From: wsb at paralleldata.com (W Bauske)
Date: Fri, 09 Jun 2000 15:20:55 -0500
Subject: Please help me unsubscribe
References: <s941042a.065@mail2.allegro.net>
Message-ID: <394151A7.32FD3273@paralleldata.com>

Erik Mullinix wrote:
> 
> I read this post for educated information.
> The "real" you're looking for is that IBM indeed has the sense to adopt
> what is becoming commonplace before it loses its edge.  Their AS400 arena
> has really helped out in data storage and processing.  However, even
> they admit the minicomputers are a bit pricey, and the Beowulf is a
> great arena to step into, especially bringing their experience with
> parallel platform design into the mix.
> 

My point was IBM is quite healthy as is. Linux is an interesting
horse and I don't discount it doing well. I use Linux on Alphas
myself and I see no reason not to have IBM get linux working on
mainframes or whatever they like. I seriously doubt many people knew
that an IBM mainframe could run vast numbers of Linux images. They 
have some really good technology and many good qualities that
businesses appreciate and vote on using their pocketbooks.

As to pricing, you just have to know how to get the deal done.

Enough about IBM, now back to beowulfs please.


Wes

> Erik Mullinix
> 
> >>> "W Bauske" <wsb at paralleldata.com> 06/09/00 01:44PM >>>
> "Robert G. Brown" wrote:
> >
> >
> > <chortle>
> > I would have written this as "IBM was once a great company in a lot of
> > ways, and adopting linux across the board will help."
> > </chortle>
> >
> 
> I guess the fact businesses spend $60-70 billion on them each year makes
> them a has-been and Linux will add huge amounts more revenue to their
> pitiful bottom line. Get real...
> 
> Wes Bauske
> 
> _______________________________________________
> Beowulf mailing list
> Beowulf at beowulf.org
> http://www.beowulf.org/mailman/listinfo/beowulf
> 


From david.lombard at mscsoftware.com  Fri Jun  9 13:47:20 2000
From: david.lombard at mscsoftware.com (David Lombard)
Date: Fri, 09 Jun 2000 13:47:20 -0700
Subject: Please help me unsubscribe
References: <Pine.LNX.4.10.10006091419100.2415-100000@ganesh.phy.duke.edu>
Message-ID: <394157D8.C00B2EAF@mscsoftware.com>

"Robert G. Brown" wrote:
> 
> On the other hand, IBM does nothing except in the hope of making money
> (while providing good services, of course) ...

I'm thinking this is a common motivation among *all* non-profit orgs...
;^)

-- 
David N. Lombard
MSC.Software


From wsb at paralleldata.com  Fri Jun  9 13:55:31 2000
From: wsb at paralleldata.com (W Bauske)
Date: Fri, 09 Jun 2000 15:55:31 -0500
Subject: IBM (was Re: Please help me unsubscribe)
References: <Pine.LNX.4.10.10006091419100.2415-100000@ganesh.phy.duke.edu>
Message-ID: <394159C3.E9A61FEA@paralleldata.com>

"Robert G. Brown" wrote:
> 
> On Fri, 9 Jun 2000, W Bauske wrote:
> 
> > "Robert G. Brown" wrote:
> > >
> > >
> > > <chortle>
> > > I would have written this as "IBM was once a great company in a lot of
> > > ways, and adopting linux across the board will help."
> > > </chortle>
> > >
> >
> > I guess the fact businesses spend $60-70 billion on them each year makes
> > them a has-been and Linux will add huge amounts more revenue to their
> > pitiful bottom line. Get real...
> 
> Whoa, I was just kidding, in a wry sort of way.  That'll teach ME to be
> terse;-)
> 

OK. Just pointing out how your comment sounds to those folks
who look at computing from a business perspective.

> To explain my remark further, I really do think that Linux is an
> essential part of IBM's strategy for maintaining those huge revenues.
> OS/2 tanked after they were basically betrayed by Microsoft, and it is
> difficult and expensive for IBM to maintain their own "private" systems
> group with non-mainstream operating systems that are not compatible or
> portable across all their various platforms, however lucrative the fish
> they have shot in these particular barrels have been in the past.  Ask
> DEC, Honeywell, etc (long list) just how long a multibillion dollar
> mostly-hardware company lasts when their software is too nonstandard or
> their price point too non-competitive.
> 

See my other comments. Pricing is fine for people who
have a business case and know how to work with IBM.

> 
> On the other hand, IBM does nothing except in the hope of making money
> (while providing good services, of course) and aren't moving to linux
> out of dreams of revenge on Microsoft or because their other OS's aren't
> currently profitable.  They're betting on the horse they think will win
> the race and preserve those lovely revenues, while (I'm sure)
> anticipating that in the long run they can eventually save a lot of
> money by NOT having multiple competing incompatible mainline software
> operations.  IBM's hardware has always been excellent, but their
> software over the years has not infrequently left something to be
> desired and some of it would never have sold at all if their hardware
> customers had had a choice.
> 

Exactly what is wrong with making a profit? IBM's interest in Linux is,
first, to provide customers with another option for using their systems
and, second, perhaps to persuade companies outgrowing Linux-based small
servers to look at IBM equipment as the next step. Most business types
are more comfortable when you say you're buying a solution with IBM in it.

As to their SW, I've also used much of IBM's SW and HW and while
at times they make life difficult for a programmer, particularly
on the mainframe side, I've found their AIX systems quite standards
conforming and have had little difficulty with them. Maybe I got
lucky or something but I've used AIX for 10 years or so. Pretty
much each UNIX flavor has its own quirks and AIX or Linux are
no better or worse than anything else I've seen. In fact, Linux/GNU
is not a single system. Each packager makes things slightly different,
so Linux/GNU is itself quirky for each company that releases a
version of it.

Back to Beowulf things now.


Wes Bauske


From per at computer.org  Fri Jun  9 15:54:55 2000
From: per at computer.org (Per Jessen)
Date: Fri, 09 Jun 2000 23:54:55 +0100
Subject: Please help me unsubscribe
Message-ID: <200006092247.WAA00797@venus.netherwood.road>

On Fri, 9 Jun 2000 14:39:36 -0400 (EDT), Robert G. Brown wrote:

[snip]
>operations.  IBM's hardware has always been excellent, but their
>software over the years has not infrequently left something to be
>desired and some of it would never have sold at all if their hardware
>customers had had a choice.

Robert,

perhaps you have not (as yours truly) built a career first running
IBM software, then supporting it, then debugging it, then ending up
writing software for it. I used to say "if it's blue, I can run it".

Let me tell you, as an 'insider' on IBM (mainframe) software - I have
few complaints. Granted, IBM makes mistakes and silly and incomprehensible 
decisions. But as you have already pointed out, their hardware kicks
ass (pardon my language), and frankly so does their software. Does
anyone know of a better transaction processor than TPF? Or CICS/MVS?
I don't think so. MVS or OS/390 itself is rock-solid - a customer
having to take a stand-alone dump - we are talking a rare occasion.

I think we're going slightly off-topic here, so ...


regards,
Per Jessen

PS. The above is ONLY related to software running on the S/390
platform. I really have no idea about the AS400, RS6000 etc. :-)


From siegert at sfu.ca  Fri Jun  9 18:48:32 2000
From: siegert at sfu.ca (Martin Siegert)
Date: Fri, 9 Jun 2000 18:48:32 -0700 (PDT)
Subject: updating the Linux kernel (was: Please help me unsubscribe)
In-Reply-To: <200006092247.WAA00797@venus.netherwood.road> from "Per Jessen" at Jun 09, 2000 11:54:55 PM
Message-ID: <200006100148.SAA12877@fraser.sfu.ca>

After all that talk about the quality of IBM products I'd like to get
back to more beowulf like stuff:

Yesterday a bug was found (or made public on bugtraq) in the Linux
kernel (in all 2.2 versions up to and including 2.2.15) that allows
local users to gain root.

highly recommended remedy: upgrade to 2.2.16

My question now is how do you handle such an issue?
Our beowulf is fully loaded with jobs.
Some of these jobs run for about 30 days.
Upgrading the kernel means killing those jobs ... and gives you some
very unhappy users.
If the bug allowed a remote root exploit, I would have no choice
but to upgrade immediately.

In this situation:
(1) do you upgrade immediately?
(2) do you say "I trust my local users they won't do anything bad"
    and do nothing?
(3) do you wait until RedHat comes out with patches?
(4) something else (e.g., disable logins and upgrade node after node
    when no jobs are running on them anymore).

Cheers,
Martin

========================================================================
Martin Siegert
Academic Computing Services                        phone: (604) 291-4691
Simon Fraser University                            fax:   (604) 291-4242
Burnaby, British Columbia                          email: siegert at sfu.ca
Canada  V5A 1S6
========================================================================


From gsl at linuxcolombia.com.co  Fri Jun  9 19:57:01 2000
From: gsl at linuxcolombia.com.co (gsl at linuxcolombia.com.co)
Date: Fri, 9 Jun 2000 21:57:01 -0500 (COT)
Subject: updating the Linux kernel (was: Please help me unsubscribe)
In-Reply-To: <200006100148.SAA12877@fraser.sfu.ca>
Message-ID: <Pine.LNX.4.21.0006092154310.3486-100000@krusty.linuxcolombia.com.co>

I guess doing it node by node is better, but what I think is better
still is applying the patch and renicing some of the jobs that eat the
machine, or moving jobs from one computer to another that is up and free
from load.

That's what I think; maybe I'm wrong.

GERARDO


On Fri, 9 Jun 2000, Martin Siegert wrote:

> After all that talk about the quality of IBM products I'd like to get
> back to more beowulf like stuff:
> 
> Yesterday a bug was found (or made public on bugtraq) in the Linux
> kernel (in all 2.2 versions up to and including 2.2.15) that allows
> local users to gain root.
> 
> highly recommended remedy: upgrade to 2.2.16
> 
> My question now is how do you handle such an issue?
> Our beowulf is fully loaded with jobs.
> Some of these jobs run for about 30 days.
> Upgrading the kernel means killing those jobs ... and gives you some
> very unhappy users.
> If the bug allowed a remote root exploit, I would have no choice
> but to upgrade immediately.
> 
> In this situation:
> (1) do you upgrade immediately?
> (2) do you say "I trust my local users they won't do anything bad"
>     and do nothing?
> (3) do you wait until RedHat comes out with patches?
> (4) something else (e.g., disable logins and upgrade node after node
>     when no jobs are running on them anymore).
> 
> Cheers,
> Martin
> 
> ========================================================================
> Martin Siegert
> Academic Computing Services                        phone: (604) 291-4691
> Simon Fraser University                            fax:   (604) 291-4242
> Burnaby, British Columbia                          email: siegert at sfu.ca
> Canada  V5A 1S6
> ========================================================================


From dunna001 at bama.ua.edu  Fri Jun  9 19:45:50 2000
From: dunna001 at bama.ua.edu (Crutcher Dunnavant)
Date: Fri, 9 Jun 2000 21:45:50 -0500
Subject: updating the Linux kernel
Message-ID: <200006100248.VAA14337@bama.ua.edu>

Now, I might completely miss something here, but shouldn't all *distributed*
parallel programs assume that a node may not return? After all, what do you
assume about hardware failures? So, while it may not be a *good* way to do it,
in a properly parallelized application, shouldn't you be able to take down any
random node other than the job allocation node, AT ANY TIME, and have that job
reallocated? (Yeah, you lose the local work, but those tasks should be
checkpointed frequently.) I just don't think that you should EVER be able to
lose more than 5-10 minutes' worth of work on a given node, and if you can, you
should re-examine your program design. So just kill the boxes, and update
them, one at a time.

-Crutcher Dunnavant
"Elegant, Documented, On Time; Choose 2"
Email: dunna001 at bama.ua.edu
Resume: http://resumes.dice.com/crutcher
Home:(256)-232-7883


From rauch at inf.ethz.ch  Sat Jun 10 05:47:46 2000
From: rauch at inf.ethz.ch (Felix Rauch)
Date: Sat, 10 Jun 2000 14:47:46 +0200 (CEST)
Subject: Myrinet LANai7 with Linux 2.3.99?
Message-ID: <Pine.LNX.4.21.0006101443220.13170-100000@maloney.inf.ethz.ch>

I hope Myrinet questions are not too off topic for this list.

Is anybody on this list able to load self-written MCPs to the new
LANai7 cards on Intel-based PCs running Linux 2.3.x?

We just got new LANai7 Myrinet cards and would like to test them with
our own MCPs, but the Myrinet drivers (3.25) from Myricom seem to only
work with 2.2.x kernels. We already ported the old 3.22 drivers to the
new kernel, but before porting the newer 3.25 drivers again (which
contain important changes because of the new card layout), I wanted to
ask if somebody already did the job, or how people (if they do this at
all) load their own MCPs with newer kernels.

- Felix
-- 
Felix Rauch                      | Email: rauch at inf.ethz.ch
Institute for Computer Systems   | Homepage: http://www.cs.inf.ethz.ch/~rauch/
ETH Zentrum / RZ H18             | Phone: ++41 1 632 7489
CH - 8092 Zuerich / Switzerland  | Fax:   ++41 1 632 1307


From iorfr00 at student.vxu.se  Sat Jun 10 09:09:04 2000
From: iorfr00 at student.vxu.se (Nacho Ruiz)
Date: Sat, 10 Jun 2000 18:09:04 +0200
Subject: Some basic Questions...
Message-ID: <012301bfd2f6$33973ac0$396d2fc2@lyan.vxu.se>

Hello to everybody,

I'm building a Beowulf cluster with a bunch of outdated computers and I have
some questions.
The cluster consists of 8 nodes:
- 486 DX33
- 16 MB Ram
- 200 MB HD
- SMC 8013 Ethernet NIC (Coax connection)

And a server (that also acts as master node):
- Pentium 133
- 32 MB Ram
- 2x500 MB HDs & 200 MB HD
- 2 Ethernet NICs: the SMC for the cluster and a 3c509B for the outside
world.

I have already managed to set up RedHat 6.2 on all machines, but now I'm
facing more Beowulf-oriented questions...
Considering my system:

- Does it make sense to have PVM/MPI installed on each node, or is it better
to have it via NFS?

- RSH or SSH for the cluster?
   Nobody has access to the node machines, in theory. The only one that
could access them is root, for administration reasons... and if somebody
gets the root password, why care about the 486s without Internet access?

- I haven't set up NIS yet, nor MPI/PVM on the nodes. In order to
make them work, should I install NIS?
   To access the Beowulf you should access the master node and log in; then
you should be able to use the cluster via MPI/PVM.

- Here we already have a COW of Suns; is it possible to use this COW together
with the Beowulf?
   I mean, if I configure the PVM/MPI node file to include the COW nodes and
the Beowulf (master node), can the Beowulf master node use its nodes
automatically without having to include the client nodes in the node file
and make them "known" to the world?
  In other words, can I treat the cluster as a whole instead of as a cluster
of client nodes?

- Any other hint or help?

Thank you to all of you.

Nacho Ruiz
The Retro-Beowulf Master.


From glindahl at hpti.com  Sun Jun 11 11:47:13 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Sun, 11 Jun 2000 14:47:13 -0400
Subject: updating the Linux kernel (was: Please help me unsubscribe)
In-Reply-To: <200006100148.SAA12877@fraser.sfu.ca>
Message-ID: <001901bfd3d5$755283e0$45a2fea9@hptilap.hpti.com>

> Yesterday a bug was found (or made public on bugtraq) in the Linux
> kernel (in all 2.2 versions up to and including 2.2.15) that allows
> local users to gain root.

It does not. It just allows buffer overflow attacks to be as likely to
succeed as on other OSes. That was the most misleading CERT advisory I've ever
read.

-- greg


From glindahl at hpti.com  Sun Jun 11 11:49:25 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Sun, 11 Jun 2000 14:49:25 -0400
Subject: Myrinet LANai7 with Linux 2.3.99?
In-Reply-To: <Pine.LNX.4.21.0006101443220.13170-100000@maloney.inf.ethz.ch>
Message-ID: <001a01bfd3d5$c3c314e0$45a2fea9@hptilap.hpti.com>

> I hope Myrinet questions are not too off topic for this list.

No, but there is a myrinet mailing list at OSC...

> We just got new LANai7 Myrinet cards and would like to test them with
> our own MCPs, but the Myrinet drivers (3.25) from Myricom seem to only
> work with 2.2.x kernels.

The "myriapi" drivers are very obsolete. GM is the replacement. But I don't
know what people who write custom MCPs do these days to load them. Since there
are only a few groups that do this, you may want to contact them directly:
BIP, RWCP, HPVM...

-- greg


From david.lombard at mscsoftware.com  Mon Jun 12 08:11:08 2000
From: david.lombard at mscsoftware.com (David Lombard)
Date: Mon, 12 Jun 2000 08:11:08 -0700
Subject: updating the Linux kernel
References: <200006100248.VAA14337@bama.ua.edu>
Message-ID: <3944FD8C.D9429CBF@mscsoftware.com>

Crutcher Dunnavant wrote:
> 
> Now, I might completely miss something here, but shouldn't all *distributed*
> parallel programs assume that a node may not return? After all, what do you
> assume about hardware failures? ...

Um, no.

It all depends upon the software.  PVM does provide the ability to
recover from a node failure, while an MPI program will just tank.

> ... So, while it may not be a *good* way to do it,
> in a properly parallelized application, shouldn't you be able to take down any
> random node other than the job allocation node, AT ANY TIME, and have that job
> reallocated? (Yeah, you lose the local work, but those tasks should be
> checkpointed frequently.)...

As for checkpointing, that too is an "it depends" answer. 
Application-level checkpointing may be available to varying degrees --
it can be a non-trivial task.  System-level checkpointing generally
can't handle sockets, and that rules out both PVM and MPI.

-- 
David N. Lombard
MSC.Software


From K.Haigh-Hutchinson at Bradford.ac.uk  Mon Jun 12 08:35:15 2000
From: K.Haigh-Hutchinson at Bradford.ac.uk (Kathy Haigh Hutchinson)
Date: Mon, 12 Jun 2000 16:35:15 +0100 (BST)
Subject: updating the Linux kernel
In-Reply-To: <3944FD8C.D9429CBF@mscsoftware.com>
Message-ID: <Pine.GSO.3.96.1000612162522.20657B-100000@kite.cen.brad.ac.uk>

On Mon, 12 Jun 2000, David Lombard wrote:

> Crutcher Dunnavant wrote:
> > 
> > Now, I might completely miss something here, but shouldn't all *distributed*
> > parallel programs assume that a node may not return? After all, what do you
> > assume about hardware failures? ...
> 

How can you tell the difference between 'never return' and 'take a very
long time to return'?

Assuming you have done that, you could have the master periodically check
whether the whole machine was still there.


If a node has failed what do you do about it?

A message passing parallel program often does this by splitting an array
over several machines, each machine then processing part of the array.
When machines need data from their neighbours the appropriate segment of
data is passed between them.

The bulk of the data remains on the nodes on which it is originally
distributed. The intermediate results of computation remain on those nodes
until they are ready to report back to the master.

If a node goes down it is generally catastrophic, requiring a restart of
the program.

I do have a program which factorises large numbers; each node is
independent and it can cope with a failed node.

But in a finite-difference time-domain problem, if a node crashes, all
the intermediate results of a segment of the problem space are lost. A
node failure requires restarting the entire problem.

How do I solve this?

Do I save all results at every program statement to a file?
Do I have the slaves send the results back to the master after every step
in the calculation?

Do I duplicate the entire program so that a copy of the data set can be
picked up from another node? Thus only effectively use half of the nodes?

The way parallelism is done, the way data is distributed, generally means
that when a node crashes its data is lost. There is then no way the rest
can carry on.

Task based parallelism could recover, my factorisation is essentially task
based. But parallelism is more often about data parallelism. Preserving
copies of the data every time something changes would be so much effort in
storage and in time to do the storing, that benefits of distributing the
data in the first place would be significantly reduced, or am I missing
something?

Kathy HH


From glindahl at hpti.com  Mon Jun 12 09:47:10 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Mon, 12 Jun 2000 12:47:10 -0400
Subject: updating the Linux kernel
In-Reply-To: <Pine.GSO.3.96.1000612162522.20657B-100000@kite.cen.brad.ac.uk>
Message-ID: <000001bfd48d$da745ec0$198bfea9@hptilap.hpti.com>

> Do I save all results at every program statement to a file?
> Do I have the slaves send the results back to the master after every step
> in the calculation?

There are 2 solutions.

The traditional one is to have the program save its state every Nth timestep
to disk. Since failures are rare, the fact that you lose some work and have
to wait for disk I/O is acceptable.

A more clever but expensive solution is to use only 1/2 of your RAM, and
save a copy of all the state to memory of a different node. You can do this
more frequently than saving to disk since you have a lot more network
bandwidth than disk bandwidth. The bonus is that you lose less work when you
do have a failure, and you never wait for (wasted) disk I/O to save the
state in the first place.
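
The traditional scheme can be sketched roughly as below. This is only a
minimal illustration in Python, not anybody's production code: the function
name run(), the toy update rule, and the parameters every and crash_at are
all made up for the example, and a real code would save its actual arrays
rather than a single float.

```python
import os
import pickle


def run(n_steps, checkpoint_path, every=10, crash_at=None):
    """Advance a toy state n_steps times, saving it every `every` steps."""
    # Resume from the last checkpoint if one exists, otherwise start fresh.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            step, state = pickle.load(f)
    else:
        step, state = 0, 1.0

    while step < n_steps:
        if crash_at is not None and step == crash_at:
            raise RuntimeError("simulated node failure")
        state = 0.5 * state + 1.0  # stand-in for one timestep of real work
        step += 1
        if step % every == 0:
            # Write to a temporary file and rename it, so that a crash in
            # the middle of a save cannot corrupt the previous checkpoint.
            tmp = checkpoint_path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump((step, state), f)
            os.replace(tmp, checkpoint_path)
    return state
```

Calling run() again with the same checkpoint_path after a failure picks up
from the last saved step, so at most `every` timesteps of work are repeated.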

-- greg


From rgb at phy.duke.edu  Mon Jun 12 10:40:57 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Mon, 12 Jun 2000 13:40:57 -0400 (EDT)
Subject: updating the Linux kernel
In-Reply-To: <3944FD8C.D9429CBF@mscsoftware.com>
Message-ID: <Pine.LNX.4.10.10006121149400.8551-100000@ganesh.phy.duke.edu>

On Mon, 12 Jun 2000, David Lombard wrote:

> Crutcher Dunnavant wrote:
> > 
> > Now, I might completely miss something here, but shouldn't all *distributed*
> > parallel programs assume that a node may not return? After all, what do you
> > assume about hardware failures? ...
> 
> Um, no.
> 
> It all depends upon the software.  PVM does provide the ability to
> recover from a node failure, while an MPI program will just tank.

> > ... So, while it may not be a *good* way to do it,
> > in a properly parallelized application, shouldn't you be able to take down any
> > random node other than the job allocation node, AT ANY TIME, and have that job
> > reallocated? (Yeah, you lose the local work, but those tasks should be
> > checkpointed frequently.)...
> 
> As for checkpointing, that too is an "it depends" answer. 
> Application-level checkpointing may be available to varying degrees --
> it can be a non-trivial task.  System-level checkpointing generally
> can't handle sockets, and that rules out both PVM and MPI.

Depends indeed, and on the hardware as much as (or even more than) the software
per se.  In many, if not most, cases, checkpointing in particular (or any
other means of ensuring a fully robust capability of a parallel program
to recover from a node failure) can be mathematically shown to
>>increase<< on the average, the time to completion, possibly
considerably.  This is therefore a decision that should be left up to
the developer after considering their own particular task(s), parallel
environment, and associated scaling properties.

This is purely a statistical cost-benefit problem.  In many/most cases
the additional (expectation value of the) costs of developing and
running a fully node-robust parallel application will outweigh the
(expectation value of the) benefits and it should NOT be done;
checkpointing code (parallel or otherwise) is not something done either
lightly or as standard practice because "usually" the probability of a
failure that might be "rescued" by the checkpoint is small and the cost
of the checkpoint(s) large.  

In others, either nonlinearities in the perceived benefits (for example,
in a distributed database supporting 911 operations for a
police/fire/emergency group where the "cost" of failure might be a human
life) or a statistical analysis of the expected time of completion make
checkpointing desirable, or even essential if you wish to complete the
calculation at all (have an expected time of completion much less than
infinity).

To do the decision analysis (checkpoint/don't checkpoint -- make code
node robust or not) correctly, one needs a set of parameters.  From
these parameters one can see roughly how to make the decision.  One
parameter -- the additional time required to develop a robust
application -- is semi-heuristic. Making an application robust is silly
if it takes you weeks to develop it and you plan to use it only one time
and it will complete in at most a few days even if you have to deal with
a few node failures.  

The others are (as a function of number of nodes N):

  a) The time required for the calculation to complete normally on N
nodes T_calc(N).  
  b) The time required to checkpoint the calculation (e.g.  store all
active memory onto a RAID server) T_save(N)
  c) The time interval selected between checkpoints T_check (which is a
parameter chosen optimally as a function of N, but which doesn't depend
directly on N).
  d) The probability density (probability per unit time) of node failure
p.  With luck one can assume that node failures are independent events
and that all nodes have the same probability density (presuming that
node design is homogeneous and that nodes have independent UPS and so
forth).

In a nutshell, if one massages these numbers (several of which are
themselves highly nonlinear functions of e.g. the number of nodes N and
other hardware design parameters) and guesstimates that the probability
of overall node-based failure of a calculation is very small in the time
T_calc(N), the estimated time of completion is unlikely to be
significantly reduced by checkpointing and indeed may be very
significantly increased if the time cost of checkpointing itself,
T_save(N)*T_calc(N)/T_check, is at all comparable to T_calc(N).
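
A seat-of-the-pants version of this estimate might look like the sketch
below. It assumes (my assumption, for illustration only) that failures are
independent with a constant rate p per node per unit time, so that the
chance of a failure-free run is exp(-p*N*T_calc). The function names and
the sample numbers (thirty nodes, about one crash a day across the cluster,
a week-long job, T_save of 0.01 day, T_check of half a day) are invented.

```python
import math


def checkpoint_overhead(t_calc, t_save, t_check):
    """Total time spent saving state: T_save * T_calc / T_check."""
    return t_save * t_calc / t_check


def prob_no_failure(p, n_nodes, t_calc):
    """Chance that none of n_nodes fails during t_calc, assuming
    independent failures at a constant rate p per node per unit time."""
    return math.exp(-p * n_nodes * t_calc)


# Hypothetical cluster: 30 nodes, one crash per day overall, so the
# per-node rate is 1/30 per day.  All times are in days.
p = 1.0 / 30.0
survive = prob_no_failure(p, n_nodes=30, t_calc=7.0)
overhead = checkpoint_overhead(t_calc=7.0, t_save=0.01, t_check=0.5)

print(f"chance of a failure-free week: {survive:.4%}")
print(f"time spent checkpointing:      {overhead:.2f} days")
```

With these made-up inputs the uncheckpointed run almost never finishes,
while the checkpointing cost is a small fraction of T_calc; shrink p by a
few orders of magnitude and the conclusion reverses.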

If the overall probability of a node-based failure is "high" (above some
threshold clearly related to the additional time cost of checkpointing)
checkpointing is worth it if the additional time required to develop the
code isn't unreasonable compared to the amount of time one expects to
use the program.  In some cases (for example, on a cluster based on an
unstable and unnamed operating system that might have one node crash per
day in a thirty-node cluster, while the parallel calculation takes at
least a week to complete) code that is NOT checkpointed will simply
"never" complete.

I would therefore NOT make auto-migration, robustness under node failure
or checkpointing a "required feature of well-written parallel programs",
or even the much weaker "advised/nice/suggested feature".  Sometimes
this sort of bet-hedging is appropriate.  Sometimes it is essential.
Sometimes it would be idiotic.  Only the programmer/designer, with a
clear picture of the scaling properties and probabilities of failure for
their own application and cluster hardware, is in a position to
determine which category >>their<< particular application/cluster falls
into.

BTW, this sort of tradeoff occurs in a lot of other areas in
computational optimization design -- e.g. TCP (robust against packet
loss) vs UDP (not so robust against packet loss but somewhat faster).
There are also plenty of analogs in quality control in manufacturing and
so forth.  Statistical cost/benefit analysis rules, and good programmers
should do (and hence should be able to do) at least seat-of-the-pants
estimates as standard practice in code design. IMHO, anyway...;-)

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From david.lombard at mscsoftware.com  Mon Jun 12 10:56:18 2000
From: david.lombard at mscsoftware.com (David Lombard)
Date: Mon, 12 Jun 2000 10:56:18 -0700
Subject: updating the Linux kernel
References: <Pine.LNX.4.10.10006121149400.8551-100000@ganesh.phy.duke.edu>
Message-ID: <39452442.E88069D@mscsoftware.com>

"Robert G. Brown" wrote:
> 
[deletia]
> 
> I would therefore NOT make auto-migration, robustness under node failure
> or checkpointing a "required feature of well-written parallel programs",
> or even the much weaker "advised/nice/suggested feature".  Sometimes
> this sort of bet-hedging is appropriate.  Sometimes it is essential.
> Sometimes it would be idiotic.  Only the programmer/designer, with a
> clear picture of the scaling properties and probabilities of failure for
> their own application and cluster hardware, is in a position to
> determine which category >>their<< particular application/cluster falls
> into.
> 

AGREED!

-- 
David N. Lombard
MSC.Software


From j at cnb.uam.es  Mon Jun 12 11:23:01 2000
From: j at cnb.uam.es (j at cnb.uam.es)
Date: Mon, 12 Jun 2000 20:23:01 +0200 (DST)
Subject: updating the Linux kernel
In-Reply-To: <Pine.GSO.3.96.1000612162522.20657B-100000@kite.cen.brad.ac.uk>
Message-ID: <200006121823.e5CIN581153662@embnet.cnb.uam.es>

-----BEGIN PGP SIGNED MESSAGE-----

Kathy Haigh Hutchinson <K.Haigh-Hutchinson at Bradford.ac.uk> wrote:
> On Mon, 12 Jun 2000, David Lombard wrote:
> 
> Task based parallelism could recover, my factorisation is essentially task
> based. But parallelism is more often about data parallelism. Preserving
> copies of the data every time something changes would be so much effort in
> storage and in time to do the storing, that benefits of distributing the
> data in the first place would be significantly reduced, or am I missing
> something?
> 
> Kathy HH
> 

I guess the answer is "it depends".

It depends on whether you have a copy of the data somewhere else and
whether you may do without analyzing some data. Two examples:

If the data is kept at a redundant site, you can always restart the
job. An example might be a central repository that handles data to
workers and wait for answers, or a distributed data set split into
pieces replicated on different nodes. You might start with a configuration
like
	data = union of sets 1..n
	store set i at node i AND node[ (i+1) % n] <AND node...>
then you start your job allocating node[i] for set[i].
If node[i] fails, you can then lookup which other nodes have a
copy and restart its part at one of those (a suboptimal situation
since one node would do double work, but still feasible). Then
you would code it as:

    for i = 1 to n do {
*       node[i] = allocate_one_node();
*       send_data(node[i]);
*       FD_SET(socket[i], myset);
    }
    set_timeout(long_enough);
    do {
        done = YEPP;
        select(n, myset, NULL, NULL, long_enough);
        for i = 1 to n do {
            if (! FD_ISSET(socket[i], myset)) { /* this node hasn't finished yet */
                if (! isalive(node[i])) { /* failed to answer, e.g. UDP echo */
*                   node[i] = allocate_one_node();
*                   send_data(node[i]);
*                   FD_SET(socket[i], myset);
                    done = NOT_YET;
                }
                else {
                    /* Node is OK, assume it just takes longer */
                    done = NOT_YET;
                }
            }
        }
    }
    until (done == YEPP);

YMMV. You would gain much more efficiency if you split the data into
n*n sets instead of n sets and replicated it such that if one node
fails, its start data is already replicated among all the other nodes. Then,
instead of only one node taking up the work of the deceased, all
nodes would take 1/n-th of the dead node's work, and the load would be
split evenly. E.g. simplified:

make n pieces
for i = 0 to n-1
	allocate piece i to node i
	divide piece i into n-1 sub-pieces
	for j = 0 to i-1
		allocate sub-piece j to node j
	for j = i to n-2
		allocate sub-piece j to node j+1

It could be made more elegant, I know, and more complex schemes may be devised.
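
For illustration only, the two placement schemes (a second copy on the next
node, and sub-pieces spread over every other node) might be written out in
Python as below. The function names are hypothetical and the code only
computes who holds what; it does not move any data.

```python
def ring_replicas(n):
    """Set i lives on node i and on node (i + 1) % n."""
    return {i: [i, (i + 1) % n] for i in range(n)}


def spread_subpieces(n):
    """Piece i is cut into n - 1 sub-pieces, one for every node except i.

    Returns {owner: {sub_piece_index: backup_node}}.
    """
    placement = {}
    for i in range(n):
        others = [node for node in range(n) if node != i]
        placement[i] = {j: node for j, node in enumerate(others)}
    return placement


def takeover(placement, failed):
    """Nodes that each pick up one sub-piece of the failed node's work."""
    return sorted(placement[failed].values())
```

With four nodes, takeover(spread_subpieces(4), 2) names nodes 0, 1 and 3,
each of which redoes one third of node 2's piece, whereas under
ring_replicas(4) node 3 alone would have to redo all of it.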

Then, on failure of node i you know that its data is n*(i-1) to n*i and
that it is split into sub-pieces spread evenly over all other nodes, so
there you go:
	run job/iter on n nodes with n pieces
	wait for answers
	if one node dies, collect all n-1 answers
	start job again on n-1 nodes with n-1 subpieces 
	wait for sub-answers
	consolidate n-1 answers with n-1 sub-answers
	done.

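And the run time, relative to a serial run, would be 1/n in the best
case, or about 1/n+1/n^2 if one node fails (the second term being the
extra pass over the dead node's sub-pieces).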
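
The estimate can be worked through with exact fractions. The sketch
below assumes n equally fast nodes, equal-sized pieces, n-1 equal
sub-pieces per piece, a failure noticed only after the first pass, and
no communication overhead; under those assumptions the extra pass costs
1/(n(n-1)) of the serial time, which is close to the 1/n^2 quoted above
for large n. The function names are made up for the example.

```python
from fractions import Fraction


def run_time(n, failures=0):
    """Wall time as a fraction of the serial run time on n equal nodes."""
    if failures not in (0, 1):
        raise ValueError("only zero or one failure is modelled")
    t = Fraction(1, n)  # every node works through its own piece
    if failures:
        t += Fraction(1, n * (n - 1))  # second pass: one sub-piece per survivor
    return t


def double_work_time(n):
    """Wall time when a single neighbour redoes the whole lost piece."""
    return Fraction(2, n)


if __name__ == "__main__":
    for n in (4, 10):
        print(n, run_time(n), run_time(n, 1), double_work_time(n))
```

Under these assumptions 1/n + 1/(n(n-1)) simplifies to 1/(n-1), i.e. the
job finishes as if it had been run on n-1 nodes from the start, whereas
the single-replica scheme takes 2/n.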

Then for the next job/iter you may check if the node is back and
repeat the scheme, or, if it isn't, enlarge the working dataset
of each remaining node with its copy of the sub-piece.


The second example is when you may as well try to survive with a partial
answer, e.g. you are looking for an answer and it *might* be close enough
or even already found, e.g. in a database search:

	start_subjobs();
	set_timeout(long_enough, my_sig_handler);
	wait_for_answers
	when an answer arrives {
		save_it_for_later
	}
	consolidate_all_answers
	show_user
...
	my_sig_handler() {
		check_all_nodes_are_alive
		if someone_failed {
			consolidate_answers_to_date
			show_user
			if (user_is_glad)
				OK
			else
				die_miserably
		}
	}

A signal handler could just as well have been used in the first case,
and an asynchronous multiplexer such as select() would do just as well
in the second. It's up to you to decide which suits you best. A signal
handler may be combined with a setjmp/longjmp jump if needed.
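
For comparison, here is a small Python sketch of the "survive with a
partial answer" idea. Note that it swaps the technique: instead of a
signal handler it uses a thread pool with a deadline, and each sub-job
is a plain callable standing in for a remote node. The name
collect_partial and the job names are invented for the illustration.

```python
import concurrent.futures as cf


def collect_partial(jobs, timeout):
    """Start every job, wait at most `timeout` seconds, return (answers, missing).

    `jobs` maps a sub-job name to a zero-argument callable (a stand-in for
    a remote node). Sub-jobs still running at the deadline are listed in
    `missing`, sorted by name.
    """
    pool = cf.ThreadPoolExecutor(max_workers=max(1, len(jobs)))
    futures = {pool.submit(job): name for name, job in jobs.items()}
    done, not_done = cf.wait(list(futures), timeout=timeout)
    answers = {futures[f]: f.result() for f in done}
    missing = sorted(futures[f] for f in not_done)
    pool.shutdown(wait=False)  # do not block on the stragglers
    return answers, missing


if __name__ == "__main__":
    import time

    jobs = {"node1": lambda: 17, "node2": lambda: 42, "node3": lambda: time.sleep(1)}
    answers, missing = collect_partial(jobs, timeout=0.3)
    print("answers so far:", answers, "still missing:", missing)
```

The caller can then show the consolidated answers to the user and let
them decide whether the partial result is good enough.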

The point is that there is always a way out. 

				j


-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBOUUqh7gsTQLvQjxFAQGVRAf/SrfS9NfCvGNMbHpiUXbITA6s5xOycmwx
Id9UpymyQm26D73bbF5loAOhmOQ/OBvhK3yfRahyD92Ubugdb7LSp1N6v28NRhBw
VdzKq8mPWFl4Wgbpumm2grzJmDZLTRXsRwuYiXq12DtH/Yf+SnlhTbI7/OxLBWxO
/J1wEzzPv+VHzeXhpBqZ86nb/97TZ118PPw+zgnvOxDt+SBNzbIrN/SWf4euZaxo
engcTE1RV5su8zSr3bbd/gbIsWpPn7tmuz6oVTB5nBU3ioKzdnVI4dyqh1SeZfQ6
yY0v0XOno0Tn5qMW6qJFNpOIxVK3fAZdlx7IU3P0fVUDYE6sy44Lfg==
=HuIX
-----END PGP SIGNATURE-----


From j at cnb.uam.es  Mon Jun 12 12:34:04 2000
From: j at cnb.uam.es (j at cnb.uam.es)
Date: Mon, 12 Jun 2000 21:34:04 +0200 (DST)
Subject: updating the Linux kernel
In-Reply-To: <200006121823.e5CIN581153662@embnet.cnb.uam.es>
Message-ID: <200006121934.e5CJY6X1123594@embnet.cnb.uam.es>

-----BEGIN PGP SIGNED MESSAGE-----

Sic, that'll teach me not to write messages when I'm so tired. Of course
in the last example the signal handler was incomplete, as I was thinking
of setjmp/longjmp's. It might work out like

<j at cnb.uam.es> wrote:
> 
> 	start_subjobs();

	  setjmp(start_waiting);	/* long explanation here please */

> 	set_timeout(now+long_enough, my_sig_handler);
> 	wait_for_answers
> 	when an answer arrives {
> 		save_it_for_later
		  answers_got++		/* note possible race condition here */

> 	}
> 	consolidate_all_answers
> 	show_user
> ...
> 	my_sig_handler() {
	     if (answers_got < n)	/* we may get caught after the
					answer but before the increment */

> 		check_all_nodes_are_alive
> 		if someone_failed {
> 			consolidate_answers_to_date
> 			show_user
> 			if (user_is_glad)
> 				OK
> 			else
> 				die_miserably
> 		}

		  else
			/* start another waiting loop */
			longjmp(start_waiting);

> 	}
> 

Resolving the race condition gracefully is a nice treat, but I won't go
into more depth in my current mental condition (too tired). It is an
easy task if the other logic is correctly coded.

The nice thing about such a scheme is that in principle it should be
independent of other underlying facilities, it's not that hard to code,
and it should work. OTOH it uses some arcane, largely forgotten coding
tricks like setjmp/longjmp and may be considered as ugly as a GOTO.

Seeya,
				j


-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBOUU7LLgsTQLvQjxFAQGH4gf/SgbLkxujsmlEQChhc5U+3xxtytEHhMPJ
xwT07jBiziaWvSoPtKS94smoXagwJwJYdOg6QkLcubrQH4YZMxVSTsnWy//Tbl0J
2ZF2pyvaajUTpxSkqDNfd4GAOfSt3dzz8MWBkrIsD7k5omFZXI4Q3fXmJoWE51uv
NCP/7ws369vJotOMk/RClUgoheJyC6tOJFg1aJ3/pvLwELhvn6pgIVeH3N9s5l4t
VbZsf0s3Hzy8FZ3lBfJqfe6AtgDjzuN07B1QTSVgaMoYZjgeIstHHK2V2XW+nq0j
OEDJnBV5bTder2QUapZZ2RqwkNdItEX80YjVGk6PMUL1CFeRX2W6oA==
=h4qS
-----END PGP SIGNATURE-----


From siegert at sfu.ca  Mon Jun 12 13:55:48 2000
From: siegert at sfu.ca (Martin Siegert)
Date: Mon, 12 Jun 2000 13:55:48 -0700 (PDT)
Subject: Linux kernel bug
In-Reply-To: <001901bfd3d5$755283e0$45a2fea9@hptilap.hpti.com> from "Greg Lindahl" at Jun 11, 2000 02:47:13 PM
Message-ID: <200006122055.NAA04573@fraser.sfu.ca>

Martin Siegert wrote:
> > Yesterday a bug was found (or made public on bugtraq) in the Linux
> > kernel (in all 2.2 versions up to and including 2.2.15) that allows
> > local users to gain root.

Greg Lindahl wrote:
> It does not. It just allows buffer overflow attacks to be as likely to
> succeed as other OSes. That was the most misleading CERT advisory I've ever
> read.

I'm not sure whether we are talking about the same thing: there hasn't been
a CERT advisory on this (yet).
Nevertheless, the bug is real, the exploits are published.
[see www.securityfocus.com -> Forums -> mailing lists -> bugtraq -> archive
 there are numerous articles on this starting Jun. 7 and several exploits]
I have tried one of the exploits myself (published by W. Purczynski on Jun. 9)
and it is trivial to gain root.

I'm afraid there is no alternative other than upgrading to 2.2.16.

Cheers,
Martin


From glindahl at hpti.com  Mon Jun 12 14:29:19 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Mon, 12 Jun 2000 17:29:19 -0400
Subject: Linux kernel bug
In-Reply-To: <200006122055.NAA04573@fraser.sfu.ca>
Message-ID: <000801bfd4b5$449d6cc0$198bfea9@hptilap.hpti.com>

> I'm not sure whether we are talking about the same thing: there
> hasn't been
> a CERT advisory on this (yet).

You're right; what I remember reading was at sendmail.com and it was
extremely incomplete at the time. It is user-exploitable, ah well. Just goes
to show that capabilities in Linux are about as useful as oxygen masks in
airplanes, but I digress.

-- g


From jim at ks.uiuc.edu  Mon Jun 12 15:06:45 2000
From: jim at ks.uiuc.edu (Jim Phillips)
Date: Mon, 12 Jun 2000 17:06:45 -0500 (CDT)
Subject: Scalability of CHARMM on various architectures
In-Reply-To: <200006071839.e57Idtx18234@axe1.med.upenn.edu>
Message-ID: <Pine.GSO.4.10.10006121704240.5696-100000@verdun.ks.uiuc.edu>

Hi,

Depending on the features you require, NAMD is compatible with CHARMM and
should scale better, especially on clusters.  New version out soon.

http://www.ks.uiuc.edu/Research/namd/

-Jim


On Wed, 7 Jun 2000 axelsen at axe1.med.upenn.edu wrote:

> 
> 
> 
> 
> We are designing a cluster for which the most important code will be the
> computational chemistry program, CHARMM.  In our preliminary tests on
> an existing cluster, we have confirmed our expectation that interthread 
> communications will be our first bottleneck.  A test run on a typical 
> problem did not scale beyond 4 nodes on a cluster composed of P2-450
> processors and fast ethernet interconnections.  In contrast, the same
> code scales almost perfectly up to at least 12 processors on an SGI-Unix 
> SMP machine.  CHARMM uses PVM.
> 
> I would appreciate contact from anyone who has run CHARMM on a cluster
> and has considered the best way to make this code scale better.  From 
> anyone, I would appreciate general guidance on several issues:
> 
>  * If we go with pentium II/III processors, how far is Myrinet likely to
>    permit us to scale these calculations?
> 
>  * With Myrinet, would alpha nodes tend to scale any better than pentium nodes?
> 
>    (a single 500 MHz alpha processor is about 1.6-fold faster than a 
>     single 450 MHz P2 on a typical problem, but it is about 3-fold the
>     cost.  With this question, I am looking for any additional advantages 
>     of alpha to justify this cost)
> 
>  * Are there differences between different pentium chip sets that will impact
>    this problem?
> 
>  * Is there any advantage to buying dual-processor machines, either alpha
>    or pentium, with respect to scaling?  Any advantage to dedicating one 
>    processor in each dual-box to communications?  If we dedicate processors
>    in this way, would both processors have to have the same clock speed?
> 
> 
> 
> 
> ------------- axe at pharm.med.upenn.edu -----------------
>                                                        
> Paul H. Axelsen               ....   ....  .   .  .   .
> Department of Pharmacology    .   .  .     ..  .  ..  .
> University of Pennsylvania    ....   ...   . . .  . . .
> 3620 Hamilton Walk            .      .     .  ..  .  ..
> Philadelphia, PA 19104-6084   .      ....  .   .  .   .
> 
> -------------------------------------------------------
> 
> _______________________________________________
> Beowulf mailing list
> Beowulf at beowulf.org
> http://www.beowulf.org/mailman/listinfo/beowulf
> 


From jsquyres at lsc.nd.edu  Mon Jun 12 18:54:45 2000
From: jsquyres at lsc.nd.edu (Jeff Squyres)
Date: Mon, 12 Jun 2000 20:54:45 -0500 (EST)
Subject: Some basic Questions...
In-Reply-To: <012301bfd2f6$33973ac0$396d2fc2@lyan.vxu.se>
Message-ID: <Pine.LNX.4.21.0006122046250.11716-100000@pokey.lsc.nd.edu>

On Sat, 10 Jun 2000, Nacho Ruiz wrote:

> - It makes sense to have on each node PVM/MPI installed or is better
> to have it via NFS?

The LAM/MPI FAQ talks about this.  See the section "Typical Setup of LAM",
and the questions "How should I setup LAM for multiple users?", "Do I need
a common filesystems on all my nodes?", and "What directory do I install
LAM to?"

These questions are obviously LAM-specific, but apply to most current
parallel run-time systems.

> - RSH o SSH for the cluster?

There are arguments both ways; advantages and disadvantages to each.  
It's almost a religious debate.  You pretty much have to decide what is
right for your site.

>    Nobody has acces to the nodes machines, in theory. The only one
> that could acces them is the root for administration reasons... and if
> somebody get the root password, why care about the 486 without
> internet acces?

This is a Very Bad position to take.  A firewall is *not* complete
protection; it is only one level in a protection system.  Your 486 nodes
are only as safe as your gateway machine.

Just my $0.02.

{+} Jeff Squyres
{+} squyres at cse.nd.edu
{+} Perpetual Obsessive Notre Dame Student Craving Utter Madness
{+} "I came to ND for 4 years and ended up staying for a decade"


From walt at parl.ces.clemson.edu  Tue Jun 13 06:11:32 2000
From: walt at parl.ces.clemson.edu (Walter B. Ligon III)
Date: Tue, 13 Jun 2000 09:11:32 -0400
Subject: Some basic Questions... 
In-Reply-To: Your message of "Mon, 12 Jun 2000 20:54:45 CDT."
             <Pine.LNX.4.21.0006122046250.11716-100000@pokey.lsc.nd.edu> 
Message-ID: <200006111416.KAA14250@krang.parl.clemson.edu>

--------

> >    Nobody has acces to the nodes machines, in theory. The only one
> > that could acces them is the root for administration reasons... and if
> > somebody get the root password, why care about the 486 without
> > internet acces?
> 
> This is a Very Bad position to take.  A firewall is *not* complete
> protection; it is only one level in a protection system.  Your 486 nodes
> are only as safe as your gateway machine.

This is not necessarily a bad position to take.  This is not the same as a
firewall situation.  In a network protected by a firewall there is useful
data and/or functionality on a node behind the firewall and the firewall
attempts to filter packets routed to that node in order to provide protection.
In a properly configured beowulf there isn't anything of value on a node that
isn't on the master node AND the master node does not route packets.  Thus,
in order to attack the node the attacker has to compromise the master first,
and having done so has already gained access to the useful parts of the
system.

The reason for secure shell on a beowulf is in the case where users do not
trust one another.  If users are working with information that must be
secured from other users, and thus need to be sure that their passwords
are not compromised and one user cannot masquerade as another - THEN security
is an issue.  Of course *I* would contend a beowulf is inherently insecure
in that situation and should not be used as such, but that's *my* opinion.

To take this a step further, a "well-designed" beowulf shouldn't allow logins
to the nodes anyway - they should not have rsh or ssh or telnet or FTP or
password files or any of that.  Nodes should exist as slave processors for
executing processes under the control of the master node.  Inter-node
security becomes a non-issue.  Think about it, do you need security to
keep someone who gains access to one processor of an SMP from getting access
to another processor?

Walt

-- 
Dr. Walter B. Ligon III
Associate Professor
ECE Department
Clemson University


From ying at almaden.ibm.com  Tue Jun 13 09:37:58 2000
From: ying at almaden.ibm.com (ying at almaden.ibm.com)
Date: Tue, 13 Jun 2000 09:37:58 -0700
Subject: channel bonding
Message-ID: <872568FD.005B5E46.00@d53mta03h.boulder.ibm.com>


Hi,

Has anyone extended channel bonding to provide high availability?
Can someone give me a pointer to the latest channel bonding package?
Thanks a lot in advance.
Would channel bonding work with the latest development kernels 2.3.xx or
2.4.-testxx?

Thanks!

Ying


From demeler at bioc09.v19.uthscsa.edu  Tue Jun 13 20:11:45 2000
From: demeler at bioc09.v19.uthscsa.edu (Borries Demeler)
Date: Tue, 13 Jun 2000 22:11:45 -0500 (CDT)
Subject: rsh probl. with Slackware 7
Message-ID: <200006140311.WAA19566@bioc09.v19.uthscsa.edu>

Sorry if this is a bit off-topic:
I am trying to use rsh without passwords on Slackware 7, and I can't get it to
work. The system is isolated, and I am the only user, so I won't need to 
worry about security. The problem I have is that if I try to login with rsh
I am always prompted for a password. Here is what I have changed:

1. edited $HOME/.rhosts
2. edited /etc/hosts.equiv
3. edited /etc/securetty to also allow root to connect without passwd
   from remote terminal.
4. edited /etc/inetd.conf:

shell   stream  tcp     nowait  root    /usr/sbin/tcpd  in.rshd -L -h -a

(from the installation it had only option -L)

and killed -HUP inetd.pid

I am still prompted for a password when doing a simple `rsh host`.
Strangely enough, I can do something like this:

rsh remotehost ls /

without being prompted for a password. But why can't I log in?

Any help would be very much appreciated. I guess I could use ssh, but I'd
rather not since the encryption probably slows down the mpi/pvm calls.

Thanks for any help with this, -Borries


From shahin at labf.org  Wed Jun 14 01:46:56 2000
From: shahin at labf.org (Mofeed Shahin)
Date: Wed, 14 Jun 2000 18:16:56 +0930
Subject: Java.
Message-ID: <39474680.4525F7D@labf.org>

G'day Fella's

I was wondering whether anyone here has had any experience with
distributed Java programs.

I wanted to write an application that could run on a beowulf.

I was interested in things like RMI, etc....

Ideally I would like to be able to get a class and tell it to run on a
particular machine and every now and then, poke it, and give it a
message, etc....

Any ideas will be appreciated, thanks.

Cheers Mof.

-- 
"Never underestimate the power of a dark clown"


From RSchilling at affiliatedhealth.org  Wed Jun 14 10:14:44 2000
From: RSchilling at affiliatedhealth.org (Schilling, Richard)
Date: Wed, 14 Jun 2000 10:14:44 -0700
Subject: Java.
Message-ID: <51FCCCF0C130D211BE550008C724149E8FECBE@mail1.affiliatedhealth.org>

I've been working with distributed Java for some time now (experimentally,
for a couple of years).  Hopefully, with Java 1.3, and embedded Java, the
speed will pick up enough to warrant some real EDI transaction processing.

Check out JPVM on the PVM web site http://www.epm.ornl.gov/pvm/  .  It's
basically a version of PVM written completely in Java.  On my own projects,
I've invented a distributed intelligent agent system in Java.

Java's a good choice to do prototyping of Beowulf systems.  Because Java
classes are complete entities (they contain code and data), programming to
get data passed between nodes is easy with serializable classes.  You just
convert the class to a byte stream, send it down the TCP/IP socket
connection, decode it on the other end, and run the class.

Hope that helps.

Richard Schilling
Web Integration Programmer
Affiliated Health Services
Mount Vernon, WA



From waldow at rainier.chem.plu.edu  Wed Jun 14 11:31:31 2000
From: waldow at rainier.chem.plu.edu (Dean Waldow)
Date: Wed, 14 Jun 2000 11:31:31 -0700
Subject: Motherboard / Benchmark Questions...
Message-ID: <3947CF83.E11E70C7@rainier.chem.plu.edu>

(I hope this only arrives on the list once since there were some mailing problems.)

Hi,  I am trying to settle on hardware for building our cluster and have
some motherboard / benchmark related questions.  (Sorry if I missed any
discussions/FAQ's...)

I have benchmarked my Monte Carlo code on the machines I could access.
The Monte Carlo code is in the 'embarrassingly parallel' class, with some
potential for parallelization in another version which includes
significant numbers of Fourier transforms. (Given our small budget, we
looked at a Celeron 400 MHz, a dual PIII Katmai 600 MHz with 512K cache,
and a PIII Coppermine 550 MHz with 256K cache, and not Alphas.)  The
benchmark was simply how long it took my actual simulation to complete.
The results seem a little surprising, since I thought Celerons might be
more competitive...

We compared the three basic systems mentioned above, normalizing within a
processor class to clock speed.  We also tried to estimate a benchmark
time for Athlons, which is most likely a poor estimate (assuming equal
performance to a PIII and scaling to clock speed).  We looked at our
budget, price estimates from pricewatch.com, and the benchmark numbers.
We then calculated the number of simulations we could run on the
hypothetical cluster in a day: Celeron systems were the slowest, with
PIII and PIII duals basically equivalent at about a 25% increase in
throughput, and lastly Athlons (using the estimate) hypothetically
came out the highest, with an additional ~15% increase in throughput over
the PIIIs.  I am not confident in that estimate, but it is interesting
and would likely be heavily code specific.

On one level, the differences in throughput are not terribly significant
compared to the increase I will be able to get on a cluster vs. the
current machines I have.  Thus, I am left with a few questions; if
anyone has comments on them, that would be great.  If these questions
do not have much general interest, I could summarize off-list
comments later.

1)  Since my tests indicate little difference in throughput for single
CPU vs. dual CPU nodes, are there other advantages one way or the other
in using dual vs. single CPU nodes?

2) In the case of the PIII processor, the question seems to be one of
mainboard choice, which in turn is mostly about chipsets - right?  From
what I have read...  The two chipsets that seem prevalent are the 440BX
and the VIA Apollo 133A.  The newer Intel chipsets (i8xx) make me a
little cautious from what I have read, though that may be mostly due to
the i820.  The 440BX boards sound like well tested and stable performers,
but on the "older side" without much difference in price.  The VIA Apollo
133A seems like it would have advantages if the code benefits from the
133 MHz FSB and PC133 memory. Does this summary make sense? And are there
folks successfully using the newer chipsets? :)

3) Since I have not been able to benchmark my code on an Athlon, does
anyone have experience in comparing performance on Athlons versus
PIIIs for a real world example?  Or, are the performance differences
really so code dependent that it is difficult to "generalize."  8-)  The
potential for increased throughput is tempting, but without better
estimates the stability/certainty of the PIII may be more important in
the long run.

Thanks for any input and I hope these questions are not too simple...

Dean W.
-- 
-----------------------------------------------------------------------------
Dean Waldow, Associate Professor      (253) 535-7533 
Department of Chemistry               (253) 536-5055 (FAX)
Pacific Lutheran University           waldowda at plu.edu
Tacoma, WA  98447   USA               http://www.chem.plu.edu/waldow.html
-----------------------------------------------------------------------------
---> CIRRUS and the Chemistry homepage: http://www.chem.plu.edu/         <---
-----------------------------------------------------------------------------


From siegert at sfu.ca  Wed Jun 14 11:48:29 2000
From: siegert at sfu.ca (Martin Siegert)
Date: Wed, 14 Jun 2000 11:48:29 -0700 (PDT)
Subject: rsh probl. with Slackware 7
In-Reply-To: <200006140311.WAA19566@bioc09.v19.uthscsa.edu> from "Borries Demeler" at Jun 13, 2000 10:11:45 PM
Message-ID: <200006141848.LAA29525@fraser.sfu.ca>

Hi,
> 
> Sorry if this is a bit off-topic:
> I am trying to use rsh without passwords on Slackware 7, and I can't get it to
> work. The system is isolated, and I am the only user, so I won't need to 
> worry about security. The problem I have is that if I try to login with rsh
> I am always prompted for a password. Here is what I have changed:
> 
> 1. edited $HOME/.rhosts
> 2. edited /etc/hosts.equiv
> 3. edited /etc/inetd.conf:
> 4. edited /etc/securetty to also allow root to connect without passwd 
> from remote terminal.
> 
> shell   stream  tcp     nowait  root    /usr/sbin/tcpd  in.rshd -L -h -a
> 
> (from the installation it had only option -L)
> 
> and killed -HUP inetd.pid
> 
> I am still prompted for a password when doing a simple `rsh host`.
> Strangely enough, I can do something like this:
> 
> rsh remotehost ls /
> 
> without being prompted for a password. But why can't I log in?


From rgb at phy.duke.edu  Wed Jun 14 13:29:51 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed, 14 Jun 2000 16:29:51 -0400 (EDT)
Subject: Motherboard / Benchmark Questions...
In-Reply-To: <3947CF83.E11E70C7@rainier.chem.plu.edu>
Message-ID: <Pine.LNX.4.10.10006141608370.995-100000@ganesh.phy.duke.edu>

On Wed, 14 Jun 2000, Dean Waldow wrote:

> We then calculated the number simulations we could run on the
> hypothetical cluster in a day: celeron's systems were the slowest, with
> PIII and PIII duals basically equivalent at about a 25% increase in
> throughput, and lastly (using an estimate for athalons) hypothetically
> came out the highest with an additional ~15% increase in throughput over
> the PIII's.  I am not confident in that estimate but it is interesting
> and would likely be heavily code specific.

A lot of this depends on how cache-local your code is.  From the numbers
you post (presuming you've adjusted for clock speed differences, since
you were comparing Celerons and PIII's at different clocks), it sounds
like the application is very NONlocal -- the larger L2 cache on the PIII
and its faster memory seem to make a significant difference.  If your
application were a bit more local, you would likely see much more nearly
equivalent performance between these two.  The Athlon has a different
(and presumably faster) cache, so it might well outperform the PIII on
moderately nonlocal code.

> On one level, the differences in throughput are not terribly significant
> compared to the increase  I will be able to get on a cluster vs. the
> current machines I have.  Thus, I am left with a few questions that if
> anyone might have comments on that would be great.  If these questions
> may not have as much general interest, I could summarize off-list
> comments later.   
> 
> 1)  Since my tests indicate little difference in throughput for single
> cpu vs. dual cpu nodes, are there other advantages one way or the other
> in using dual vs. a single cpu nodes?

This, too, depends on how memory intensive the applications are.  The
major "weakness" of a dual is that two processors running flat out on
memory access can saturate the memory bus of Intel systems.  If the
program does enough computation per memory access, the memory accesses
will antibunch and your applications will still complete (nearly) twice
as fast on a dual system.  My embarrassingly parallel Monte Carlo code
works like this -- I get nearly perfect scaling on duals as well as
across the cluster.  However, on memory-intensive code performance can
drop off so that it takes (for example) 1.3-1.5x as long to complete a
job on a dual running two jobs.  You still generally get gain relative
to one processor running two jobs, but two separate nodes will be
faster (completing 2 jobs in 1x the single CPU time).
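
The arithmetic behind this comparison can be written out as a small
Python helper. It is only a sketch: the function name is made up, time
is measured in units of one job on one otherwise idle CPU, and the
slowdown factor is the 1.3-1.5x range mentioned above for
memory-intensive code.

```python
def jobs_per_unit_time(slowdown):
    """Throughput, in jobs per single-CPU job time, for three layouts.

    `slowdown` is how much longer one job takes when two run at once on a
    dual-CPU node (1.0 means perfect scaling).
    """
    return {
        "one CPU, two jobs back to back": 2 / 2.0,
        "one dual node": 2 / slowdown,
        "two single-CPU nodes": 2 / 1.0,
    }


if __name__ == "__main__":
    for s in (1.0, 1.3, 1.5):
        print(s, jobs_per_unit_time(s))
```

For any slowdown between 1 and 2 the dual node lands between the single
CPU and the pair of separate nodes, so the choice comes down to price
per node and how memory-bound the code really is.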

> 2) In the case of the PIII processor, the question seems to be one of a
> mainboard choice which in turn is mostly about chipsets - right?  From
> what I have read...  The two chipsets that seem prevalent are the 440BX
> and the VIA Apollo 133A.  The newer intel chipsets (i8xx) make me a
> little cautious from what I have read though that may be mostly due to
> the i820.  The 440BX boards sound well tested and stable performers but
> on the "older side" without much difference in price.  The VIA Apollo
> 133A seems like it would have advantages if code is benefited by the
> 133FSB and PC133 memory. Does this summary make sense? And are there
> folks successfully using the newer chipsets? :)

I have no comment on stability.  As far as performance goes, since your
application >>seems<< to be fairly memory intensive based on the
celeron-PIII differentiation, the faster memory might well make a
difference.  The only way to know for sure is to test it (or understand
the memory access pattern of your code in detail).  Is your Monte Carlo
algorithm doing a random site update (and hence jumping all over
memory)?  Is there any way to organize it to operate more locally?

> 3) Since I have not been able to benchmark my code on an athalon, does
> anyone have experience in comparing performance on athalons versus
> PIII's for a real world example?  Or, are the performance differences
> really so code dependent that it is difficult to "generalize."  8-)  The
> potential for increased throughput is tempting but without better
> estimates the stability/certainty of the PIII maybe more important in
> the long run.

The only safe way to compare is to test it.  My own tests of Athlons
with my Monte Carlo code were very disappointing -- I get by far the
best price performance on Celerons, as my code is generally local enough
to run satisfactorily with a 128 K L2 cache (even allowing for slower
memory).  The benchmarks I've run suggest that the Athlon's real
strength is its cache and memory subsystem.  However, your mileage may
vary considerably.

You can "generalize" (perhaps) only after you understand your code and
the things that are determining its effective speed.  As a rule, a CPU
bound process is primarily affected by clock more than anything else.
As a process becomes memory bound, speeds are very nonlinearly affected
by stride and memory access pattern and so forth.  This can all be
understood and guestimated, but it is difficult to predict what the
answers will be for your application without the source code or a
description of the algorithm.

  Hope this helps,

        rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From rbw at networkcs.com  Wed Jun 14 13:39:41 2000
From: rbw at networkcs.com (Richard Walsh)
Date: Wed, 14 Jun 2000 15:39:41 -0500 (CDT)
Subject: Stream numbers from Tyan Tiger 133 with VIA Pro 133a chipset ...
Message-ID: <200006142039.PAA26098@us.networkcs.com>

All,

I wanted to get some feedback from others on the 
reasonableness of the following stream benchmark results
for a Tyan Tiger 133 based cluster we have assembled. 
They are substantially lower than those reported for 
a PIIIEB_600 on the stream site and are about the 
same as those reported for the 440_BX chipset with 
a Katmai 600. I was expecting better numbers.  Perhaps 
others have run the benchmark as well with a different 
compiler or options. 

Interested in your comments ...

rbw

System Specifications:

1. Dual PIII/EB 667 CPUs
2. 256 MB PC133 Memory (Micron no ECC)
3. VIA Pro 133a chipset with 133 FSB.

Compilation:

gcc -O3 stream_d.c second_wall.c -lm  
(under RHAT Linux 6.2/ kernel 2.2.15)

Results:

-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 5000000, Offset = 0
Total memory required = 114.4 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 217494 microseconds.
   (= 217494 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:         335.6732       0.2385       0.2383       0.2391
Scale:        336.2616       0.2380       0.2379       0.2380
Add:          436.9053       0.2747       0.2747       0.2748
Triad:        309.8261       0.3877       0.3873       0.3882
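
For readers comparing such tables, the rate column can be reproduced
from the array size and the minimum time. The Python sketch below
assumes the usual STREAM accounting: 8-byte words, Copy and Scale
touching two arrays per iteration, Add and Triad touching three, rates
in units of 10**6 bytes per second, and the memory total in units of
2**20 bytes. The helper names are made up for the example.

```python
BYTES_PER_WORD = 8

# Arrays touched per loop iteration: Copy and Scale read one and write one;
# Add and Triad read two and write one.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def stream_rate_mb_s(kernel, array_size, min_time_s):
    """Bandwidth in MB/s (1 MB = 10**6 bytes) from the best time of a kernel."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_WORD * array_size
    return bytes_moved / min_time_s / 1e6


def total_memory_mib(array_size):
    """Memory for the three arrays, in units of 2**20 bytes."""
    return 3 * BYTES_PER_WORD * array_size / 2**20


if __name__ == "__main__":
    times = {"Copy": 0.2383, "Scale": 0.2379, "Add": 0.2747, "Triad": 0.3873}
    for kernel, t in times.items():
        print(kernel, stream_rate_mb_s(kernel, 5_000_000, t))
    print(total_memory_mib(5_000_000))
```

The recomputed rates agree with the table to within the rounding of the
printed minimum times.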

#---------------------------------------------------
#
# Richard Walsh
# NetAPSx, Inc. 
# 1200 Washington Ave. So. 
# Minneapolis, MN 55415
# VOX:    612-337-3467
# FAX:    612-337-3400
# EMAIL:  rbw at networkcs.com
#
#---------------------------------------------------
# "What you can do, or dream you can, begin it;
#  Boldness has genius, power, and magic in it."
#                                       -Goethe
#---------------------------------------------------


From brian at chpc.utah.edu  Wed Jun 14 13:55:23 2000
From: brian at chpc.utah.edu (Brian D. Haymore)
Date: Wed, 14 Jun 2000 14:55:23 -0600
Subject: Stream numbers from Tyan Tiger 133 with VIA Pro 133a chipset ...
References: <200006142039.PAA26098@us.networkcs.com>
Message-ID: <3947F13B.204A0A93@chpc.utah.edu>

Here are our results from our new AMD 950MHz boxes using the ABIT KA7
motherboard and the latest beta "RY" BIOS.

-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 10000000, Offset = 0
Total memory required = 228.9 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 268355 microseconds.
   (= 268355 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:         492.3896       0.3263       0.3249       0.3280
Scale:        492.1473       0.3265       0.3251       0.3296
Add:          557.8127       0.4317       0.4303       0.4348
Triad:        570.6840       0.4240       0.4205       0.4256


The numbers you got are close to what we saw before we started using the
"memory interleaving" features in this beta BIOS.

Here are the pre-interleaving numbers:

-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 10000000, Offset = 0
Total memory required = 228.9 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 283765 microseconds.
   (= 283765 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:         380.1740       0.4215       0.4209       0.4224
Scale:        381.4227       0.4219       0.4195       0.4243
Add:          446.3248       0.5391       0.5377       0.5406
Triad:        441.2017       0.5460       0.5440       0.5472
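
To put a number on the interleaving effect, here is a small sketch that
computes the percentage gain per kernel from the two tables in this
message. The rates are copied from those tables; the helper names are
hypothetical.

```python
# Rates in MB/s, copied from the two STREAM tables in this message.
WITH_INTERLEAVING = {"Copy": 492.3896, "Scale": 492.1473,
                     "Add": 557.8127, "Triad": 570.6840}
WITHOUT_INTERLEAVING = {"Copy": 380.1740, "Scale": 381.4227,
                        "Add": 446.3248, "Triad": 441.2017}


def percent_gain(after, before):
    """Relative improvement of `after` over `before`, in percent."""
    return 100.0 * (after - before) / before


def interleaving_gains():
    """Per-kernel gain from enabling memory interleaving."""
    return {kernel: percent_gain(WITH_INTERLEAVING[kernel],
                                 WITHOUT_INTERLEAVING[kernel])
            for kernel in WITH_INTERLEAVING}


if __name__ == "__main__":
    for kernel, gain in interleaving_gains().items():
        print(f"{kernel:6s} {gain:5.1f}% faster with interleaving")
```

On these figures every kernel gains somewhere between roughly a quarter
and a third in bandwidth.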


I'm not sure, but it seems that the makers of some, if not many, of the
KX133-based boards are still working on stability and have not put all
the performance options in yet.


We have been working without issues on this latest beta BIOS for about
3 weeks now.


Richard Walsh wrote:
> 
> All,
> 
> I wanted to get some feedback from others on the
> reasonableness of the following stream benchmark
> for a Tyan Tiger 133 based cluster we have assembled.
> They are substantially less than those reported for
> a PIIIEB_600 on the stream site and are about the
> same as those reported for the 440_BX chipset with
> a Katmai 600. I was expecting better numbers.  Perhaps
> others have run the benchmark as well with a different
> compiler or options.
> 
> Interested in your comments ...
> 
> rbw
> 
> System Specifications:
> 
> 1. Dual PIII/EB 667 CPUs
> 2. 256 MB PC133 Memory (Micron no ECC)
> 3. VIA Pro 133a chipset with 133 FSB.
> 
> Compilation:
> 
> gcc -O3 stream_d.c second_wall.c -lm
> (under RHAT Linux 6.2/ kernel 2.2.15)
> 
> Results:
> 
> -------------------------------------------------------------
> This system uses 8 bytes per DOUBLE PRECISION word.
> -------------------------------------------------------------
> Array size = 5000000, Offset = 0
> Total memory required = 114.4 MB.
> Each test is run 10 times, but only
> the *best* time for each is used.
> -------------------------------------------------------------
> Your clock granularity/precision appears to be 1 microseconds.
> Each test below will take on the order of 217494 microseconds.
>    (= 217494 clock ticks)
> Increase the size of the arrays if this shows that
> you are not getting at least 20 clock ticks per test.
> -------------------------------------------------------------
> WARNING -- The above is only a rough guideline.
> For best results, please be sure you know the
> precision of your system timer.
> -------------------------------------------------------------
> Function      Rate (MB/s)   RMS time     Min time     Max time
> Copy:         335.6732       0.2385       0.2383       0.2391
> Scale:        336.2616       0.2380       0.2379       0.2380
> Add:          436.9053       0.2747       0.2747       0.2748
> Triad:        309.8261       0.3877       0.3873       0.3882
> 
> #---------------------------------------------------
> #
> # Richard Walsh
> # NetAPSx, Inc.
> # 1200 Washington Ave. So.
> # Minneapolis, MN 55415
> # VOX:    612-337-3467
> # FAX:    612-337-3400
> # EMAIL:  rbw at networkcs.com
> #
> #---------------------------------------------------
> # "What you can do, or dream you can, begin it;
> #  Boldness has genius, power, and magic in it."
> #                                       -Goethe
> #---------------------------------------------------
> 
> _______________________________________________
> Beowulf mailing list
> Beowulf at beowulf.org
> http://www.beowulf.org/mailman/listinfo/beowulf

--
Brian D. Haymore
University of Utah
Center for High Performance Computing
155 South 1452 East RM 405
Salt Lake City, Ut 84112-0190

Email: brian at chpc.utah.edu - Phone: (801) 585-1755 - Fax: (801) 585-5366


From waldow at rainier.chem.plu.edu  Thu Jun 15 00:08:54 2000
From: waldow at rainier.chem.plu.edu (Dean Waldow)
Date: Thu, 15 Jun 2000 00:08:54 -0700
Subject: Motherboard / Benchmark Questions...
References: <Pine.LNX.4.10.10006141608370.995-100000@ganesh.phy.duke.edu>
Message-ID: <39488104.F38923C@rainier.chem.plu.edu>

> > the PIII's.  I am not confident in that estimate but it is interesting
> > and would likely be heavily code specific.
> 
> A lot of this depends on how cache-local your code is.  From the numbers
> you post (presuming you've adjusted for clock speed differences, since
> you were comparing Celerons and PIII's at different clocks), it sounds
> like the application is very NONlocal -- the larger L2 cache on the PIII
> and its faster memory seem to make a significant difference.  If your
> application were a bit more local, you would likely see much more nearly
> equivalent performance between these two.  The Athalon has a different
> (and presumably faster) cache, so it might well outperform the PIII on
> moderately nonlocal code.

Thanks for the comments.  The raw numbers are in the table below and in
our calculation of runs/cluster-day we did correct the clock speed to
the processor in the hypothetical cluster.  So, our numbers had both raw
speed and economic factors in there...  Basically, what we could buy
for our budget.  Here are the raw numbers without the current price data
in the calculation.   

For my Monte Carlo code (no overclocking):

system                                       one run   normalized to
                                              (min)     1GHz (min)
-------------------------------------------------------------------------
Celeron 400MHz (128K cache, 128MB RAM)         153         61
PIII 550MHz Cu-mine (256K cache, 128MB RAM)     76         42
  Asus P3B-F board with PC100 memory
PIII 600MHz Katmai/dual (512K cache, 504MB RAM)
    1 process running only                      75         45
    2 concurrent processes on dual              79         47
  Asus P2B-D board

I think it is still the same basic conclusion if I am understanding your
argument...  pretty significant difference between celeron and PIII but
smaller differences between PIII's / dual.
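
For reference, the normalized column in the table is consistent with a
simple linear rescaling of the measured time by clock speed. A minimal
sketch, assuming run time is inversely proportional to clock (which is
exactly the assumption a memory-bound code may violate); the function
name is made up for illustration.

```python
def normalize_to_1ghz(run_minutes, clock_mhz):
    """Scale a measured run time to a hypothetical 1 GHz clock.

    Assumes run time is inversely proportional to CPU clock.
    """
    return run_minutes * clock_mhz / 1000.0


if __name__ == "__main__":
    measurements = [
        ("Celeron 400MHz", 153, 400),
        ("PIII 550MHz Cu-mine", 76, 550),
        ("PIII 600MHz Katmai, 1 process", 75, 600),
        ("PIII 600MHz Katmai, 2 processes", 79, 600),
    ]
    for name, minutes, mhz in measurements:
        print(f"{name:34s} {normalize_to_1ghz(minutes, mhz):6.1f} min at 1GHz")
```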
 
> > 1)  Since my tests indicate little difference in throughput for single
> > cpu vs. dual cpu nodes, are there other advantages one way or the other
> > in using dual vs. a single cpu nodes?
> 
> This, too, depends on how memory intensive the applications are.  The
> major "weakness" of a dual is that two processors running flat out on
> memory access can saturate the memory bus of Intel systems.  If the
> program does enough computation per memory access, the memory accesses
> will antibunch and your applications will still complete (nearly) twice
> as fast on a dual system.  My embarrassingly parallel Monte Carlo code
> works like this -- I get nearly perfect scaling on duals as well as
> across the cluster.  However, on memory-intensive code performance can
> drop off so that it takes (for example) 1.3-1.5x as long to complete a
> job on a dual running two jobs.  You still generally get gain relative
> to one processor running two jobs, but two separate nodes will be
> faster (completing 2 jobs in 1x the single CPU time).

I think the dual performance is similar to your experience regarding the
duals given the numbers above, though maybe not quite as perfect
scaling (~5% slower), but I don't think that is too bad.  It just seems
that, when going through the complete calculation for the number of nodes
given the budget, the throughput doesn't come out much different.
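
The ~5% figure and the resulting throughput can be checked with a few
lines of arithmetic. This is only a sketch using the 75 and 79 minute
timings from the table above; the helper names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def dual_slowdown_percent(single_minutes, concurrent_minutes):
    """How much longer one run takes when two run concurrently on a dual."""
    return 100.0 * (concurrent_minutes - single_minutes) / single_minutes


def runs_per_node_day(run_minutes, concurrent_jobs):
    """Completed runs per node per day with `concurrent_jobs` running at once."""
    return concurrent_jobs * MINUTES_PER_DAY / run_minutes


if __name__ == "__main__":
    print(f"slowdown: {dual_slowdown_percent(75, 79):.1f}%")
    print(f"one CPU busy:   {runs_per_node_day(75, 1):.1f} runs/day")
    print(f"both CPUs busy: {runs_per_node_day(79, 2):.1f} runs/day")
```

With these timings a dual node delivers about 1.9 times the throughput
of a single processor, so the choice between dual and single nodes comes
down mostly to price per node.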

> > 133FSB and PC133 memory. Does this summary make sense? And are there
> > folks successfully using the newer chipsets? :)
> 
> I have no comment on stability.  As far as performance goes, since your
> application >>seems<< to be fairly memory intensive based on the
> celeron-PIII differentiation, the faster memory might well make a
> difference.  The only way to know for sure is to test it (or understand
> the memory access pattern of your code in detail).  Is your Monte Carlo
> algorithm is doing a random site update (and hence jumping all over
> memory)?  Is there any way to organize it to operate more locally?

I think you are right in that it seems memory dependent with 128MB being
enough and likely memory speed influenced at the least. The algorithm
does pick a random spot in my 3D lattice and consequently I would say it
does likely jump all over memory.  As to organizing the code to operate
more locally, I don't know a simple way. I would have to really study
the implications to the results to feel confident about that relative
to the simulation time savings.

> > the long run.
> 
> The only safe way to compare is to test it.  My own tests of Athalons
> with my Monte Carlo code were very disappointing -- I get by far the
> best price performance on Celerons, as my code is generally local enough
> to run satisfactorily with a 128 K L2 cache (even allowing for slower
> memory).  The benchmarks I've run suggest that the Athalon's real
> strength is its cache and memory subsystem.  However, your mileage may
> vary considerably.

I hope to have an Athlon test in the near future, and it will be
interesting to see where it falls.

> You can "generalize" (perhaps) only after you understand your code and
> the things that are determining its effective speed.  As a rule, a CPU
> bound process is primarily affected by clock more than anything else.
> As a process becomes memory bound, speeds are very nonlinearly affected
> by stride and memory access pattern and so forth.  This can all be
> understood and guestimated, but it is difficult to predict what the
> answers will be for your application without the source code or a
> description of the algorithm.

The (non)linearity with clock speed is much more understandable now.  I
also have some benchmarks on a 733MHz PIII but am not confident in them
yet, since I don't know much about the system they were run on, and they
seemed to be almost the same as the 550MHz/600MHz PIII tests I did.  If I
get confident in that number, it would be consistent with the memory
intensive nature of the code.  The interesting question then seems to be
connected to memory bus speed: processor speed can keep going up, but if
the memory bus is the limiting factor then you might not see much
difference.  A benchmark on a system with a faster bus might be pretty
informative also.

>   Hope this helps,

It certainly does! And hope it can help others also. Thanks...

> Robert G. Brown                        http://www.phy.duke.edu/~rgb/

Dean W.
-- 
-----------------------------------------------------------------------------
Dean Waldow, Associate Professor      (253) 535-7533 
Department of Chemistry               (253) 536-5055 (FAX)
Pacific Lutheran University           waldowda at plu.edu
Tacoma, WA  98447   USA               http://www.chem.plu.edu/waldow.html
-----------------------------------------------------------------------------
---> CIRRUS and the Chemistry homepage: http://www.chem.plu.edu/         <---
-----------------------------------------------------------------------------


From rgb at phy.duke.edu  Thu Jun 15 07:10:26 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Thu, 15 Jun 2000 10:10:26 -0400 (EDT)
Subject: Motherboard / Benchmark Questions...
In-Reply-To: <39488104.F38923C@rainier.chem.plu.edu>
Message-ID: <Pine.LNX.4.10.10006150948390.2419-100000@ganesh.phy.duke.edu>

On Thu, 15 Jun 2000, Dean Waldow wrote:

> > the memory access pattern of your code in detail).  Is your Monte Carlo
> > algorithm is doing a random site update (and hence jumping all over
> > memory)?  Is there any way to organize it to operate more locally?
> 
> I think you are right in that it seems memory dependent with 128MB being
> enough and likely memory speed influenced at the least. The algorithm
> does pick a random spot in my 3D lattice and consequently I would say it
> does likely jump all over memory.  As to organizing the code to operate
> more locally, I don't know a simple way. I would have to really study
> the implications to the results to feel confident about that relative
> the simulation time savings.

Well, Monte Carlo is my bag in physics, and I've done quite a few
comparative studies of e.g. quench times (and autocorrelation times in
general) comparing the averages obtained and relaxation times and sample
independence times and so forth.  I'd be happy to look over your
problem/solution if you like to see if I think that it would make any
difference to use e.g. a typewriter or checkerboard algorithm instead
of random site selection.  The general rule is that if you are
evaluating macroscopic thermodynamic averages it doesn't -- if you are
simulating a time-dependent process and are measuring e.g. relaxation
rates it does, because autocorrelation relaxation is very much dependent
on thermalization model.  If you're doing something other than
importance sampling Monte Carlo, though, I'd have to look at it and
think about it (or you could just copy your source, alter the core loop
that uses random site selection to use typewriter instead, and do a test
run on a small lattice to compare the averages you obtain).
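
To make the site selection rules concrete, here is a minimal sketch of
the visiting orders, assuming a cubic L x L x L lattice stored as a flat
array with site index x + L*y + L*L*z. That layout and all function
names are assumptions for illustration, not anyone's actual code.

```python
import random


def typewriter_order(L):
    """Raster scan: visit sites 0, 1, ..., L**3 - 1 in memory order."""
    return list(range(L ** 3))


def checkerboard_order(L):
    """Visit every site with even x+y+z first, then every odd one.

    With nearest-neighbor interactions and even L, sites of one color
    have no neighbors of the same color.
    """
    even, odd = [], []
    for site in range(L ** 3):
        x, y, z = site % L, (site // L) % L, site // (L * L)
        (even if (x + y + z) % 2 == 0 else odd).append(site)
    return even + odd


def shuffled_order(L, rng):
    """Random without replacement: every site exactly once per sweep."""
    order = list(range(L ** 3))
    rng.shuffle(order)
    return order


def random_order(L, rng):
    """Random with replacement: L**3 independent uniform picks."""
    n = L ** 3
    return [rng.randrange(n) for _ in range(n)]
```

Only the typewriter order walks memory sequentially; the checkerboard
order strides through it regularly, and the two random orders jump all
over the array.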

As a general rule, by the way, random site selection is the SLOWEST
method to converge, slower even than a shuffled (random without
replacement) selection strategy.  This is because the Poissonian process
leaves a lot of sites unvisited in any given Monte Carlo sweep.  In
fact, for a large lattice, there are often sites that aren't visited for
MANY sweeps.  These sites significantly delay the thermalization
process.
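
A quick way to see how many sites get missed: with N independent picks
on N sites, each site is skipped with probability (1 - 1/N)^N, which
tends to 1/e (about 37%) for large N. A minimal sketch; the function
names are made up for illustration.

```python
import math
import random


def expected_unvisited_fraction(n_sites):
    """Probability that a given site is missed in n_sites random picks."""
    return (1.0 - 1.0 / n_sites) ** n_sites


def simulated_unvisited_fraction(n_sites, rng):
    """Fraction of sites missed in one sweep of random-with-replacement picks."""
    visited = set(rng.randrange(n_sites) for _ in range(n_sites))
    return 1.0 - len(visited) / n_sites


if __name__ == "__main__":
    n = 100000
    print("exact    :", expected_unvisited_fraction(n))
    print("simulated:", simulated_unvisited_fraction(n, random.Random(2000)))
    print("1/e      :", math.exp(-1.0))
```

By the same argument about N*exp(-k) sites are still untouched after k
sweeps, so it takes on the order of ln(N) sweeps before every site has
been visited at least once.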

Anyway, I should probably not bore the list folks with statistical
physics... the remarks above were relevant enough for the list simply
because they emphasize the point that in many cases the "speed" of a
program depends strongly on the algorithm, and the algorithm of choice
need not be the "physical" one as long as it can be shown that one gets
the same (correct) answers.

> > > the long run.
> > 
> > The only safe way to compare is to test it.  My own tests of Athalons
> > with my Monte Carlo code were very disappointing -- I get by far the
> > best price performance on Celerons, as my code is generally local enough
> > to run satisfactorily with a 128 K L2 cache (even allowing for slower
> > memory).  The benchmarks I've run suggest that the Athalon's real
> > strength is its cache and memory subsystem.  However, your mileage may
> > vary considerably.
> 
> I hope to have an athlon test in the near future and will be interesting
> to see where it falls.  

I have access to one and can run your code for you if you send me a
tarball and instructions.  Or I can likely arrange for you to have an
"account for a day" to play with it if you send me an encrypted passwd
line to stick into our passwd file on the host.  I'm curious myself and
we got the (900 MHz) Athlon mostly to test anyway.

> The (non)linearity with clock speed is much more understandable now.  I
> also have some benchmarks on a 733MHz PIII but have not been confident
> in them yet since I don't know much about the system they were run on
> yet and they seemed to be almost the same as the 550MHz/600MHz PIII
> tests I did.  If I get confident in that number, it would be consistent
> with the memory intensive nature of the code.  The interesting question
> then seems to be connected to memory bus speed.  Hence, processor speed
> can keep going up but if the memory bus is the limiting factor then you
> might not see much difference.  When I then get a benchmark on a system
> with faster bus, it might be pretty informative also. 

This really sounds like it is the case.  Random site selection brings
out the worst in your caching subsystem -- very few of the memory
references, especially for a large program, will be in cache so you are
slowed down to 40-150 nanosecond rates per reference, which typically
will leave your CPU twiddling its proverbial thumbs while waiting for
data to arrive.  There is a very nice mental image of this process in
Pfister's "In Search of Clusters" -- he compares CPUs to clerks working
away on a desk at whatever is there.  Every time they need a number or
instruction that isn't there, they have to call out to an old geezer
sitting propped in a chair who shuffles off to the main filing cabinet
(main memory) and eventually drops it on their desk.  I'd guess your
program is constantly waiting for the old codger, and so is relatively
insensitive to CPU clock but very sensitive to memory subsystem.

This is where the PIII most certainly beats the Celeron, although
neither of them comes close to the alpha family.  You might want to
borrow time on an alpha to run your benchmarks there as well.  Its
memory subsystem is MUCH faster than that of the Intel family.  The Athlon
might do much better for you as well (highly nonlinearly in the nominal
clock), although in ALL cases if you can reorganize your code to be 80-90%
cache-local (like "most" code is) you'll regain CPU clock sensitivity
and reduce the clock-equivalent gap between Intel, Athlon and Celeron.
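
The effect of cache locality can be sketched with the usual weighted
average of cache and main-memory latency. The 5 ns cache latency below is
a purely hypothetical figure chosen for illustration; the 100 ns memory
latency sits inside the 40-150 ns range mentioned above.

```python
def effective_access_ns(hit_fraction, cache_ns, memory_ns):
    """Average time per memory reference for a given cache hit fraction."""
    return hit_fraction * cache_ns + (1.0 - hit_fraction) * memory_ns


if __name__ == "__main__":
    CACHE_NS = 5.0     # hypothetical cache latency, for illustration only
    MEMORY_NS = 100.0  # within the 40-150 ns range quoted above
    for hits in (0.0, 0.5, 0.8, 0.9):
        t = effective_access_ns(hits, CACHE_NS, MEMORY_NS)
        print(f"{100 * hits:3.0f}% cache-local: {t:6.1f} ns per reference")
```

Under these assumed numbers, going from almost no cache hits to 90%
cuts the average reference time by roughly a factor of seven, which is
why the CPU clock only starts to matter once the code is mostly
cache-local.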

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From charwel at chthry.chem.lsu.edu  Thu Jun 15 08:30:38 2000
From: charwel at chthry.chem.lsu.edu (Chris)
Date: Thu, 15 Jun 2000 10:30:38 -0500 (CDT)
Subject: Motherboard / Benchmark Questions...
In-Reply-To: <Pine.LNX.4.10.10006150948390.2419-100000@ganesh.phy.duke.edu>
Message-ID: <Pine.LNX.4.21.0006151025480.5269-100000@chthry.chem.lsu.edu>

On Thu, 15 Jun 2000, Robert G. Brown wrote:
[snip]
> independence times and so forth.  I'd be happy to look over your
> problem/solution if you like to see if I think that it would make any
> difference to use an e.g. typewriter or checkerboard algorithm instead
> of random site selection.  The general rule is that if you are
[snip]
> 
> As a general rule, by the way, random site selection is the SLOWEST
> method to converge, slower even than a shuffled (random without
> replacement) selection strategy.  This is because the Poissonian process
> leaves a lot of sites unvisited in any given Monte Carlo sweep.  In
> fact, for a large lattice, there are often sites that aren't visited for
> MANY sweeps.  These sites significantly delay the thermalization
> process.
[snip]

Dr. Brown,

Could you discuss the typewriter, checkerboard and
random site selection algorithms for MC a bit more?

I think that, to the extent they affect the time to completion
because of cache locality, more information on the algorithms
is on topic for the list.  I am certainly interested :>

thanks,

chris
charwel at chthry.chem.lsu.edu


From JSherman at dainrauscher.com  Thu Jun 15 10:12:56 2000
From: JSherman at dainrauscher.com (Sherman, Jay)
Date: Thu, 15 Jun 2000 12:12:56 -0500
Subject: Motherboard / Benchmark Questions...
Message-ID: <C8E57CE8F221D411A24600A0C99D90FD2D5F31@mail2.ROG.COM>

Could someone tell me how to unsubscribe from this
list, or could the list owner please unsubscribe me?

I'll be gone for 3 weeks and don't
want you nice folks to get my "out of office"
message.

Thanks!  -Jay


From llonergan at hpti.com  Thu Jun 15 11:09:38 2000
From: llonergan at hpti.com (Luke Lonergan)
Date: Thu, 15 Jun 2000 11:09:38 -0700
Subject: Stream numbers from Tyan Tiger 133 with VIA Pro 133a chipset ...
In-Reply-To: <3947F13B.204A0A93@chpc.utah.edu>
Message-ID: <NDBBLCBEOIGADDKKNNMGGENICLAA.llonergan@hpti.com>

> Here are our results from our new AMD 950Mhz boxes using the ABIT KA7
> Motherboard and the latest Beta "RY" bios.

OK, interesting effect of the BIOS feature, but what compiler are you using?
Are you running the "C" version of stream or the FORTRAN? In order to get
the "correct" memory results, you have to max out the instruction units
first.

Luke


From brian at chpc.utah.edu  Thu Jun 15 12:10:15 2000
From: brian at chpc.utah.edu (Brian D. Haymore)
Date: Thu, 15 Jun 2000 13:10:15 -0600
Subject: Stream numbers from Tyan Tiger 133 with VIA Pro 133a chipset ...
References: <NDBBLCBEOIGADDKKNNMGGENICLAA.llonergan@hpti.com>
Message-ID: <39492A17.BA4BAEA6@chpc.utah.edu>

Luke Lonergan wrote:
> 
> > Here are our results from our new AMD 950Mhz boxes using the ABIT KA7
> > Motherboard and the latest Beta "RY" bios.
> 
> OK, interesting effect of the BIOS feature, but what compiler are you using?
> Are you running the "C" version of stream or the FORTRAN? In order to get
> the "correct" memory results, you have to max out the instruction units
> first.
> 
> Luke
> 
> _______________________________________________
> Beowulf mailing list
> Beowulf at beowulf.org
> http://www.beowulf.org/mailman/listinfo/beowulf


Correct on your last comment.  We are using the C version of Stream and
the Portland Group C Compiler, although GCC does nearly as well.  There
is, to me at least, reason to believe there is a lot of headroom for
performance gains as compilers are made more aware of Athlon
optimizations and the same for Intel systems.

--
Brian D. Haymore
University of Utah
Center for High Performance Computing
155 South 1452 East RM 405
Salt Lake City, Ut 84112-0190

Email: brian at chpc.utah.edu - Phone: (801) 585-1755 - Fax: (801) 585-5366


From rgb at phy.duke.edu  Thu Jun 15 12:48:21 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Thu, 15 Jun 2000 15:48:21 -0400 (EDT)
Subject: Motherboard / Benchmark Questions...
In-Reply-To: <Pine.LNX.4.21.0006151025480.5269-100000@chthry.chem.lsu.edu>
Message-ID: <Pine.LNX.4.10.10006151429150.1447-100000@ganesh.phy.duke.edu>

On Thu, 15 Jun 2000, Chris wrote:

> On Thu, 15 Jun 2000, Robert G. Brown wrote:
> [snip]
> > independence times and so forth.  I'd be happy to look over your
> > problem/solution if you like to see if I think that it would make any
> > difference to use an e.g. typewriter or checkerboard algorithm instead
> > of random site selection.  The general rule is that if you are
> [snip]
> > 
> > As a general rule, by the way, random site selection is the SLOWEST
> > method to converge, slower even than a shuffled (random without
> > replacement) selection strategy.  This is because the Poissonian process
> > leaves a lot of sites unvisited in any given Monte Carlo sweep.  In
> > fact, for a large lattice, there are often sites that aren't visited for
> > MANY sweeps.  These sites significantly delay the thermalization
> > process.
> [snip]
> 
> Dr. Brown,
> 
> Could you discuss the typewrite, checkerboard and
>  random site selection algorithms for MC a bit more? 
>  
> I think to the extent they would effect the time to completion
> because of cache locality that more information on the algorithm
> is on topic for the list.  I am certainly interested :>

Hmmm, the best thing to do is to start with a practical reference, like
Binder and Heermann's book on Computational Monte Carlo (or some such, my
copy is out on loan so I don't have the exact title handy -- it's a
Springer-Verlag).

The following is a very short (really:-) survey on importance sampling
Monte Carlo methodology.  Many folks on the list will probably want to
just hit "d" about now, if they haven't already.

What?  Still there?  OK, you asked for this, so here it is.

If one is using Monte Carlo to do statistical physics, one generally is
trying to do an average over a distribution weighted with the Boltzmann
factor, \exp{-E_i/kT} where E_i is the energy of the ith configuration
(or state), k is the Boltzmann constant, and T the absolute temperature
in appropriate units (E/kT must obviously be dimensionless).  The
normalization factor for the weights, and the quantities being averaged,
must technically be summed over >>all states<< (or configurations).

There are, as it happens, rather a lot of them (states/configurations).
Even when their number is finite (and often it is infinite), as finite
numbers go, this is one that really really wants to be infinity.
Consequently, doing the actual sum over all states is out.
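
To see why the full sum is hopeless, here is a minimal sketch of the
brute-force Boltzmann average for a toy system. The model (a ring of N
Ising spins with E = -J * sum of s_i * s_{i+1}) is chosen purely for
illustration and is not taken from anyone's code; it has 2^N states.

```python
import itertools
import math


def ring_energy(spins, J=1.0):
    """Energy of a periodic chain of +1/-1 spins with nearest-neighbor coupling."""
    n = len(spins)
    return -J * sum(spins[i] * spins[(i + 1) % n] for i in range(n))


def boltzmann_average_energy(n_spins, kT, J=1.0):
    """Return (<E>, Z) by summing exp(-E/kT) over all 2**n_spins states."""
    z = 0.0
    weighted_energy = 0.0
    for spins in itertools.product((-1, 1), repeat=n_spins):
        energy = ring_energy(spins, J)
        weight = math.exp(-energy / kT)
        z += weight
        weighted_energy += weight * energy
    return weighted_energy / z, z


if __name__ == "__main__":
    for n in (4, 8, 12, 16):
        avg_e, _ = boltzmann_average_energy(n, kT=2.0)
        print(f"N={n:2d}  states={2 ** n:6d}  <E>/N={avg_e / n:.4f}")
```

Each extra spin doubles the work, so even a modest 10 x 10 x 10 lattice
of two-state spins would need 2^1000 terms.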

The next best thing to do is to try to take advantage of the fact that
at any given temperature, the normalized weight factor is nearly zero
(>>very<< nearly zero) for very nearly all of the configurations/states
that can occur.  If only we could find the >>important<< ones and
average over them, we'd be able to (maybe) get a decent thermodynamic
average in less than infinite time.

By great good fortune (or some very elegant mathematics, you choose:-)
it just so happens that there exists something called detailed balance
between all of these configurations.  That basically means that if one
defines a transition operator that carries one from one state or
configuration to another, at equilibrium the flow (of probability
weight) into state a from all other states must be counterbalanced by
the flow out of state a to all other states.  By writing a differential
equation for this greatly-to-be-desired condition and doing a bit of
algebra, one can show that this occurs when the transitions themselves
are weighted according to the Boltzmann distribution, but just on the
>>local<< change in the energy of the configuration(s).

Voila!  Thus the possibility of a Markov Process is born -- if one can
create a Markov Process that carries one from state a (any
configuration) to any other state b with a Boltzmann weight, one can
iterate the process to go from state a->b->c->... AND one can show that
after "a while" the states one ends up in nearly all of the time are
precisely the ones that contribute the bulk of the statistical weight of
the full sum over all states.  In fact, the Markov Chain thus
constructed wanders about through all non-forbidden states and
ergodically populates them according to their density in the original
weighted sum.

However, there are many, many possible Markov Processes that satisfy the
rather weak condition on the ratio of transition probabilities required
to lead to equilibrium and a valid importance sampling sequence.
Without going into detail as to what they are or how they work, some of
the best-known of them are:

Metropolis, which is an accept-reject method that takes configuration a,
randomly generates a new configuration b that is typically "close" to a
in some sense, and compares their energies.  If E_b is less than E_a,
the new configuration is "accepted", and becomes the new configuration
a.  If it is greater than E_a, a (pseudo)random number between 0 and 1 is
compared to \exp{(E_a - E_b)/kT} (which is always between 0 and 1), and
if the random number is less, the move is still accepted.  Running back
and forth between a and b with this algorithm, one can show that in time
they will be relatively populated by precisely the ratio required by the
Boltzmann weight, and ditto as more and more states are opened up as
candidates for moves.
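
The accept-reject rule just described fits in a few lines. A minimal
sketch, with a two-state toy chain to check that the populations come
out in the Boltzmann ratio; the function names and the toy energies are
assumptions for illustration.

```python
import math
import random


def metropolis_accept(e_a, e_b, kT, u):
    """Decide whether to move from energy e_a to e_b.

    u is a uniform random number in [0, 1).  Downhill (or equal) moves are
    always accepted; uphill moves with probability exp((e_a - e_b)/kT).
    """
    if e_b <= e_a:
        return True
    return u < math.exp((e_a - e_b) / kT)


def two_state_populations(e_a, e_b, kT, n_steps, rng):
    """Run back and forth between two states; return visit counts [a, b]."""
    energies = (e_a, e_b)
    state = 0
    counts = [0, 0]
    for _ in range(n_steps):
        other = 1 - state
        if metropolis_accept(energies[state], energies[other], kT, rng.random()):
            state = other
        counts[state] += 1
    return counts


if __name__ == "__main__":
    counts = two_state_populations(0.0, 1.0, 1.0, 200000, random.Random(12345))
    print("measured ratio b/a :", counts[1] / counts[0])
    print("Boltzmann ratio    :", math.exp(-1.0))
```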

Heat Bath, which actually generates a new configuration b (still "close"
to configuration a) by directly selecting it from a Boltzmann
distribution in a sub-ensemble (sorry about the big words).  This is
possible only if one can invert the known probability distribution for
the entity(s) being changed while leaving the rest of the state alone.
This in turn is only possible for certain models -- the inverse function
is frequently a nasty hypergeometric thingie that can only be evaluated
numerically so one might as well use Metropolis instead.
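
For the simplest case the inversion is trivial. A minimal sketch,
assuming a two-state (Ising-like) spin s = +1 or -1 with energy
E(s) = -h*s in the local field h of its frozen neighbors; this toy model
and the function name are assumptions, and continuous spins are where
the nasty inverse functions show up.

```python
import math


def heat_bath_spin(local_field, kT, u):
    """Draw a new spin directly from its Boltzmann distribution.

    For E(s) = -local_field * s the weights are exp(+h/kT) and exp(-h/kT),
    so P(s = +1) = 1 / (1 + exp(-2h/kT)) = (1 + tanh(h/kT)) / 2.
    u is a uniform random number in [0, 1).
    """
    p_up = 0.5 * (1.0 + math.tanh(local_field / kT))
    return 1 if u < p_up else -1
```

Unlike Metropolis, the new value does not depend on the old one at all:
there is no rejection, only a draw from the local distribution.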

Both of these use a "local" move -- they typically act on whatever is at
a single point in a lattice while holding the rest of the lattice
frozen.  In my work, for example, I'd alter the state of a single spin
at a lattice vertex in the field of its neighbors, which remain fixed.
This causes the change to be "small", which makes it a lot more likely
that one stays "near" the equilibrium region.

Cluster Methods are a relatively new Markov Process that alters an
entire block of objects (e.g. spins) at once.  They work by applying the
Metropolisish accept-reject decision not to just one object, but to a
cluster.  In a nutshell, a random spin is moved into a new state by
means of an operator (perhaps rotating the spin through some fixed
angle).  A spin next to this spin is selected.  The same operator is
applied to it, and the energy change with only those spins on its
>>boundary<< evaluated (and a random number compared to a Boltzmann
factor).  If the random number is less, the spin is added to the cluster.  This
process is repeated for >>all spins connected to the cluster by an
interaction bond<< until the "reject" decision terminates the process.

This process selects a whole cluster of spins and moves them identically
in just such a way that the new energy of the whole system has changed
according to the Boltzmann weight.  Note that the "bond" energies inside
the cluster and outside the cluster are unchanged -- the only thing that
changes is the energy of the surface layer between the cluster and the
outside, which was thermalized in detail by the accept-reject procedure.
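
As one concrete instance of a cluster move, here is a sketch of the
well-known Wolff single-cluster update for a 2D Ising model, where the
"operator" is simply a spin flip. It is swapped in as an illustration
and is not necessarily the exact variant described above; the flat
L x L periodic layout and the function name are assumptions.

```python
import math
import random


def wolff_update(spins, L, J, kT, rng):
    """Grow and flip one Wolff cluster on an L x L periodic Ising lattice.

    spins is a flat list of +1/-1 values of length L*L, modified in place.
    Assumes ferromagnetic coupling J > 0.  Returns the cluster size.
    """
    # Probability of extending the cluster across a bond to an aligned spin.
    p_add = 1.0 - math.exp(-2.0 * J / kT)
    seed = rng.randrange(L * L)
    cluster_spin = spins[seed]
    spins[seed] = -cluster_spin      # flip as we go: flipped means "in cluster"
    stack = [seed]
    size = 1
    while stack:
        site = stack.pop()
        x, y = site % L, site // L
        neighbors = (
            (x + 1) % L + y * L,
            (x - 1) % L + y * L,
            x + ((y + 1) % L) * L,
            x + ((y - 1) % L) * L,
        )
        for nb in neighbors:
            # Only still-unflipped spins aligned with the cluster can join.
            if spins[nb] == cluster_spin and rng.random() < p_add:
                spins[nb] = -cluster_spin
                stack.append(nb)
                size += 1
    return size
```

At very low temperature p_add approaches 1 and the whole aligned lattice
flips at once; at high temperature the cluster is usually a single spin
and the move reduces to a local update.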

In all cases, when one has iterated long enough that one believes (or
rather can show) that "equilibrium" has been achieved, one takes one or
more steps in the Markov Chain, evaluates all the quantities that one
wishes to thermodynamically average, and adds them (and generally a
higher moment or two) to the running sum(s) required to generate the
averages.

This now leads us to an important property of this sort of Markov Chain
and statistics.  As noted repeatedly above, the "moves" selected are
deliberately small ones -- they typically return a new configuration
that is very "close" to the old one.  So close, in fact, that as a
general rule it is NOT statistically independent -- its autocorrelation
with the previous state is very high.  This is Bad.  The (in my opinion
anyway) principal theorem of statistics (the Central Limit Theorem, from
which averages and variances and the like obtain their practical
meaning) requires independent, identically distributed samples from the
beginning, and the Markov Chain produces samples that are not
independent.

However, the further one advances from the initial state, the "more
independent" they get, in the sense that the average autocorrelation
either exponentially vanishes or exponentially approaches its limiting
value, depending on the underlying state of thermal order, as a function
of the number of steps.  The different methods above have VERY different
autocorrelation properties.  In fact, the autocorrelation timescales
can even scale differently with lattice size, especially near a
"critical" temperature.
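
One way to see the decay is to measure the autocorrelation of some observable
as a function of lag.  The sketch below does this for a synthetic series (an
AR(1) process with coefficient 0.9) standing in for a real Monte Carlo
measurement history; the stand-in and the names are assumptions made purely
for illustration.

```python
import random

def autocorrelation(series, lag):
    """Normalized autocorrelation of a measurement series at a given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    if var == 0.0:
        raise ValueError("series is constant")
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag)) / (n - lag)
    return cov / var

def ar1_chain(n, a, rng):
    """Synthetic correlated series: x[t+1] = a * x[t] + Gaussian noise."""
    x, out = 0.0, []
    for _ in range(n):
        x = a * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

chain = ar1_chain(20000, 0.9, random.Random(42))
rho1 = autocorrelation(chain, 1)    # neighboring samples: strongly correlated
rho20 = autocorrelation(chain, 20)  # 20 steps apart: much less so
```

Samples only become usable as "independent" for error estimates once the lag
between them is a few multiples of the decay time of this function.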

This leads us (at last) to define the terms requested above and indicate
why and how they are used.

If one views the weighted transitions that are used in the Markov Chain
as being motivated by physics (for example, a random quantum transition
of a spin's state) then of course the location of the transition will be
random.  One might then be tempted to use a random site selection rule
in the Monte Carlo model of the physical process.

If one is measuring autocorrelation relaxation properties, this is a
good idea.  As previously noted, a random site selection will typically
>>miss<< a large fraction of the N lattice sites in N random selections
with replacement.  The missed sites don't have their state changed at
all, and of course contribute strongly to the autocorrelation.  The
autocorrelation times will thus be very different (and far, far longer)
for a random site selection than from almost any other alternative.
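
The size of the "missed" fraction is easy to check.  Each site is skipped by
one random pick with probability (1 - 1/N), so after N picks with replacement
the expected missed fraction is (1 - 1/N)^N, which tends to 1/e (about 37%)
for large N.  A small sketch, with illustrative names:

```python
import math
import random

def missed_fraction(n_sites, rng):
    """Fraction of sites never selected in n_sites picks with replacement."""
    hit = [False] * n_sites
    for _ in range(n_sites):
        hit[rng.randrange(n_sites)] = True
    return hit.count(False) / n_sites

n = 10000
expected = (1.0 - 1.0 / n) ** n          # analytic value, close to exp(-1)
observed = missed_fraction(n, random.Random(2000))
limit = math.exp(-1.0)
```

So roughly a third of the lattice is left untouched in every nominal "sweep".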

On the other hand, if one is NOT concerned with time at all, but just
wants the best possible average in the shortest possible time, it is a
very poor way to proceed.  The Markov Process doesn't care at all if you
select sites randomly or not -- it will inexorably move you toward and
then sample equilibrium provided only that the method of generating new
states doesn't leave any part of the phase space inaccessible (in the
sense of being disconnected -- cannot get there from here in any
possible set of moves).

SO, an alternative is to use a "typewriter" or "left-right" site
selection method -- simply go in loop order through the lattice and
change the state of each spin (or other object) as its index comes up.
In N small moves, every spin is in a new state if heat bath is used.  If
Metropolis is used, unfortunately accept-reject methods have the same
problem that random site selection produced -- a lot of moves were
presumably rejected so some fraction of the spins are unchanged and the
autocorrelation decay thereby proceeds undesirably slowly.

A typewriter heat bath isn't bad at bond thermalization.  Every bond
energy gets changed >>twice<< per "sweep" of N moves.  However,
structures in the lattice still change relatively slowly, which is what
motivates cluster methods that rearrange whole clusters at once.
Cluster methods, however, only thermalize bonds on the cluster surface,
leaving one with a surface to volume problem that slows them down.  The
best thing to do is typically mix cluster methods with a full
local-lattice sweep, ideally heat bath if possible.

A shuffled heat bath is random site selection without replacement.  It
is basically like shuffling the sites like a deck of cards and then
going through the deck one after another, putting the sites completed
into the "discard pile" and NOT back into the deck.  It's a lot more
work, it doesn't give you a valid autocorrelation time, and the
autocorrelation time it DOES give you is very, very close to the
typewriter time.  Ergo, it is a waste of time (generally speaking) to
shuffle, although there may be an exception somewhere that I haven't yet
encountered.
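
For reference, the three site-selection rules discussed so far differ only in
how the list of site indices for one sweep is generated.  A minimal sketch
(function names are mine, not from any library):

```python
import random

def typewriter_order(n_sites):
    """Visit sites in plain loop order: 0, 1, 2, ..."""
    return list(range(n_sites))

def shuffled_order(n_sites, rng):
    """Random selection WITHOUT replacement: every site exactly once."""
    order = list(range(n_sites))
    rng.shuffle(order)
    return order

def random_order(n_sites, rng):
    """Random selection WITH replacement: some sites repeat, some are missed."""
    return [rng.randrange(n_sites) for _ in range(n_sites)]
```

The shuffled version pays for a shuffle every sweep and still touches each
site exactly once, which is all the typewriter does for free.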

A "checkerboard" method is an improvement on typewriter designed to make
vector (and sometimes cache) operations work better.  The energy of the
ith site typically depends on the state of its nearest neighbors.  On a
lattice, they might be the dark squares surrounding a selected light
square.  If one evaluates the "field" on the light sites due to the dark
sites in one loop pass, one can use the results in a single pass to
update all the light sites (leaving the dark ones untouched).  One then
does the dark sites.  Indeed, one can usually do the site update and the
field update at the same time (with diffs) and just rip through dark
light dark light... hence the term checkerboard.
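
A checkerboard sweep might be sketched as follows, again assuming a 2D Ising
lattice with periodic boundaries and an even lattice size (so that every
neighbor of a light site is dark).  The heat bath rule used here picks the new
spin directly from its Boltzmann distribution in the field of its neighbors.
This is plain scalar code; the point of the ordering is that each color pass
has no internal dependencies and so could be vectorized.

```python
import math
import random

def checkerboard_sweep(spins, L, beta, rng, J=1.0):
    """One heat-bath sweep: all sites of one color, then the other color."""
    if L % 2:
        raise ValueError("L must be even so neighbors have the other color")
    for color in (0, 1):
        for i in range(L):
            for j in range(L):
                if (i + j) % 2 != color:
                    continue
                # Field from the four neighbors, all of the opposite color.
                h = (spins[(i + 1) % L][j] + spins[(i - 1) % L][j]
                     + spins[i][(j + 1) % L] + spins[i][(j - 1) % L])
                p_up = 1.0 / (1.0 + math.exp(-2.0 * beta * J * h))
                spins[i][j] = 1 if rng.random() < p_up else -1
```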

Clearly both typewriter and checkerboard have considerable locality,
depending on the dimensionality and size of your lattice.  Equally
clearly, the lattice can be blocked and reordered in various ways to
achieve all sorts of superlinear speedups at e.g. cache boundaries, if
it weren't for the fact that it is so damn difficult to write portable
code that is that smart.  Finally, lattices can typically be split
across nodes with IPC's (for local interactions) that scale like surface
to volume.  Put all that together, and lattice Monte Carlo becomes an
interesting parallel computation problem indeed.
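
The surface to volume scaling can be put in numbers.  Assuming each node
holds a cubic block of side n in d dimensions with nearest neighbor
interactions, it updates n^d sites but only exchanges about 2*d*n^(d-1)
boundary sites, a ratio of 2*d/n.  A tiny sketch (the function name is
illustrative):

```python
def halo_fraction(side, dim):
    """Boundary sites exchanged per site updated, for a cubic block."""
    volume = side ** dim
    surface = 2 * dim * side ** (dim - 1)
    return surface / volume

# Doubling the block side halves the relative communication cost.
ratios = {side: halo_fraction(side, 3) for side in (8, 16, 32)}
# ratios == {8: 0.75, 16: 0.375, 32: 0.1875}
```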

My one wish at this point would be for the ATLAS project to spawn a
kernel module that does, once and for all, certain benchmark
measurements (basically the ones used in ATLAS to tune things up, plus
others that might suggest themselves) and publishes them in /proc.  One
could then write some simple system calls in a library to return them
and (re)use the values portably in code.  It is, after all, rather silly
to have to run the entire ATLAS autotuning suite to build the libraries
-- the key parameters should be evaluated separately and used as just
that -- parameters -- that can be used elsewhere to tune similar things
up.  IMHO, of course.  I think that ATLAS might end up being the most
important single performance enhancing concept to hit computing in the
last four or five years and would truly love to see it extended and
generalized.

 I hope this helps.  I know that it is long, but BELIEVE ME -- it's
short.

    rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From rgb at phy.duke.edu  Thu Jun 15 12:50:13 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Thu, 15 Jun 2000 15:50:13 -0400 (EDT)
Subject: Motherboard / Benchmark Questions...
In-Reply-To: <C8E57CE8F221D411A24600A0C99D90FD2D5F31@mail2.ROG.COM>
Message-ID: <Pine.LNX.4.10.10006151549500.10522-100000@ganesh.phy.duke.edu>

On Thu, 15 Jun 2000, Sherman, Jay wrote:

> 
> Could someone tell me how to unsubscribe from this
> list, or could the list owner please unsubscribe me?
> 
> I'll be gone for 3 weeks and don't
> want you nice folks to get my "out of office"
> message.

I covered this, in detail, last week.  Check the list archives for full
instructions and suggestions.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From axelsen at axe1.med.upenn.edu  Fri Jun 16 11:00:52 2000
From: axelsen at axe1.med.upenn.edu (axelsen at axe1.med.upenn.edu)
Date: Fri, 16 Jun 2000 14:00:52 -0400 (EDT)
Subject: Checking out a motherboard
Message-ID: <200006161800.e5GI0qH14878@axe1.med.upenn.edu>


We're planning to make our first hardware purchases next week,
and after a lot of benchmarking with our code, it looks like a
16-node (dual-processor x 8)/myrinet system is the way for us
to go.  We've been quoted an attractive-looking price for systems
that include (among other things):

     Intel Lancewood Motherboard
     PIII-700 x 2
     256 RAM ECC PC100
     Embedded U2 SCSI, fast ethernet, graphics

Does anyone have experience with these or advice regarding pros,
cons, or better alternatives within this design paradigm?


Thanks,


------------- axe at pharm.med.upenn.edu -----------------
                                                       
Paul H. Axelsen               ....   ....  .   .  .   .
Department of Pharmacology    .   .  .     ..  .  ..  .
University of Pennsylvania    ....   ...   . . .  . . .
3620 Hamilton Walk            .      .     .  ..  .  ..
Philadelphia, PA 19104-6084   .      ....  .   .  .   .

-------------------------------------------------------


From Tom.Morris at alpha-processor.com  Fri Jun 16 13:25:41 2000
From: Tom.Morris at alpha-processor.com (Tom Morris)
Date: Fri, 16 Jun 2000 16:25:41 -0400
Subject: Page Coloring in Alpha Linux (was  Benchmarking L2 cache on the A
	lpha 21264 )
Message-ID: <278EEF4F1348D211940600A0C95BCF7FDF59EA@yellow-fin>

There was a discussion last week about run to run performance
variability for codes which are L2 cache resident.  As a couple of 
folks said this is almost certainly due to the lack of page coloring
support in the Linux memory allocator.

We've done some testing with both Greg's and Joe's patches and,
as was indicated on the kernel mailing list, there are side effects
in terms of both allocation time and pool fragmentation.  However,
for machines which are dedicated to codes which are L2 cache
sensitive and which won't be using the memory allocator a lot, 
this could be a perfectly acceptable tradeoff.  I suspect they could
be improved upon though.

Another thing to point out is that page coloring makes the allocator
more deterministic, but not completely deterministic.  Even on 
Tru64 it's possible to get variations between runs, particularly if
the layout of stuff in memory has changed significantly (for example
after a reboot).  Good page coloring support in Linux would make
performance of cache sensitive codes both more predictable and faster.

Tom


From alain.coetmeur at icdc.caissedesdepots.fr  Mon Jun 19 00:47:42 2000
From: alain.coetmeur at icdc.caissedesdepots.fr (Coetmeur, Alain)
Date: Mon, 19 Jun 2000 09:47:42 +0200
Subject: Checking out a motherboard
Message-ID: <40C4228EC468D211B04800A0C9DF1D664FB9F3@tsexchange.idt.cdc.fr>

Note that I've had problems with some
motherboards, and that
"memtest86" and "cpuburn" helped me a lot
to diagnose the hardware problems...

You should run these tests for a few days to check everything...

If there is a design flaw or
some component failure, those burn tests should pinpoint it.

-----Original Message-----
From: axelsen at axe1.med.upenn.edu [mailto:axelsen at axe1.med.upenn.edu]
Date: Friday, 16 June 2000 20:01
To: Beowulf at beowulf.org
Subject: Checking out a motherboard


We're planning to make our first hardware purchases next week,
and after a lot of benchmarking with our code, it looks like a
16-node (dual-processor x 8)/myrinet system is the way for us
to go.  We've been quoted an attractive-looking price for systems
that include (among other things):

     Intel Lancewood Motherboard
     PIII-700 x 2
     256 RAm ECC PC100
     Embedded U2 SCSI, fast ethernet, graphics

Does anyone have experience with these or advice regarding pro's,
con's, or better alternatives within this design paradigm?


Thanks,


------------- axe at pharm.med.upenn.edu -----------------
                                                       
Paul H. Axelsen               ....   ....  .   .  .   .
Department of Pharmacology    .   .  .     ..  .  ..  .
University of Pennsylvania    ....   ...   . . .  . . .
3620 Hamilton Walk            .      .     .  ..  .  ..
Philadelphia, PA 19104-6084   .      ....  .   .  .   .

-------------------------------------------------------

_______________________________________________
Beowulf mailing list
Beowulf at beowulf.org
http://www.beowulf.org/mailman/listinfo/beowulf


From rubeena at sutra.math.iitb.ernet.in  Mon Jun 19 09:08:37 2000
From: rubeena at sutra.math.iitb.ernet.in (Rubina(ASI 2001))
Date: Mon, 19 Jun 2000 21:38:37 +0530 (IST)
Subject: performance.
Message-ID: <Pine.LNX.4.21.0006192136380.30990-100000@sutra.math.iitb.ernet.in>

Hello!
    I am doing my M.Sc. project on MPI. I have installed MPICH and LAM.
Regarding the performance of the installed MPI environments, please reply
ASAP and tell me which environment has higher performance. Also send me the
names of sites where this information is available. By high performance
I mean that the time taken by the MPI environment in various
functions (synchronous and asynchronous) should be low.

Thanks in advance..
Rubina Memon
Msc(Mathematics).  


-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
      One concrete problem is worth a thousand unapplied abstractions.
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-


From tony at MPI-Softtech.Com  Mon Jun 19 11:45:57 2000
From: tony at MPI-Softtech.Com (Tony Skjellum)
Date: Mon, 19 Jun 2000 13:45:57 -0500 (CDT)
Subject: performance.
In-Reply-To: <Pine.LNX.4.21.0006192136380.30990-100000@sutra.math.iitb.ernet.in>
Message-ID: <Pine.GSO.4.10.10006191345390.9620-100000@mpi.mpi-softtech.com>

You may find our free MPI - MPI/Pro for TCP+SMP for Linux - interesting.

Anthony Skjellum, PhD, President (tony at mpi-softtech.com) 
MPI Software Technology, Inc., Ste. 33, 101 S. Lafayette, Starkville, MS 39759
+1-(662)320-4300 x15; FAX: +1-(662)320-4301; http://www.mpi-softtech.com
"Best-of-breed Software for Beowulf and Easy-to-Own Commercial Clusters."

On Mon, 19 Jun 2000, Rubina(ASI 2001) wrote:

> Hello!
>     I am doing my Msc.project on MPI.I have install MPICH and LAM.
> For the perforamance of the installed MPI environment,please reply ASAP
> which environment has higher performance.Also send me the name of sites on
> which these information can be available.Here the performance is high
> mean to me that the time taken by the MPI environment in various
> functions(synchronous and asynchronous) should be less.
> 
> Thanks in advance..
> Rubina Memon
> Msc(Mathematics).  
> 
> 
> -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
>       One concrete problem is worth a thousand unapplied abstractions.
> -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-


From wardwe at nswcphdn.navy.mil  Mon Jun 19 12:22:40 2000
From: wardwe at nswcphdn.navy.mil (Ward William E PHDN)
Date: Mon, 19 Jun 2000 15:22:40 -0400
Subject: FW:  [Slightly OT] 6.1 Root Login troubles
Message-ID: <AF67AB108F16D21196F600805F19516D02E95F5F@phdnex01.nswcphdn.navy.mil>

This is slightly OT, because it doesn't concern one of my actual Beowulf
nodes, and instead is one of the workstations I've set aside, but it's close
to the issues that concern Beowulf security, so I thought I'd throw it out
here to be looked at.

One of my workstations recently had to be reinstalled (my fault... I
accidentally hit the power switch during an upgrade from 5.2 to 6.1) and so,
after a complete install, I needed to reset the machine's login
capabilities, specifically, I need to allow root to login and telnet in.
Since the machine is NOT in my cluster, I don't want to allow standard rsh
or ssh, but I do want to allow rlogin (yes, I know, I should be using ssh
and slogin) since it can be seen by anyone on the internal network (it's on
a secure network, but is still much more exposed than in a cluster).  By
using my own knowledge, I was able to modify the /etc/pam.d/rlogin file to
allow root logins... but I ran into a problem.  If I set up pam to be
permissive, it will allow normal users to simply type their name without
requiring a password (during LOGIN, not rlogin... I'm talking someone at the
console, here).  Root needs to have the root password, but can login
normally...  I reverted to the original /etc/pam.d/rlogin file, and modified
it to be less permissive, and voila, mission accomplished.  Normal users can
login as normal, and root can do an rlogin... BUT, there's a catch.  When
root does an rlogin I get the following:

wew at otherhost> su
passwd:
[root at otherhost]# rlogin pigpen
passwd:
passwd:
[root at pigpen]#

In other words, it asks for the password twice (but only for root) before
accepting the password and letting me in.  If I don't properly enter the
password, I cannot login.  While this is an annoyance for a user, it's not
an unlivable situation, except that I also have a cron job that goes to
every one of my machines to do remote backups (Veritas Netbackup) and this
breaks those scripts for pigpen (and since they are commercial, I can't
modify them...)
I finally broke down (when all else fails, read the manual) and checked the
Beowulf howto... and I'm exactly correct as near as I can tell with what
I've done, i.e., I'm "by the book", if I was trying to open up the node but
not putting in the remote hosts in my /etc/hosts.equiv nor putting in
.rhosts files for root, which would imply that I should only require the
user to login.

Oh, and since it's an obvious question, the reason I can't do a restore is
that Netbackup can't logon to the machine... a perfect Catch-22.  The
backups are perfect, I just can't get them to the machine that needs them.

This all worked properly under 5.2, but doesn't work under 6.1 with the
fresh install... anyone have any ideas?   Note, I haven't upgraded any
packages; this machine doesn't have internet access, but I can get the rpms
onto it if that's the final verdict.

Sorry for straying somewhat off-topic, but thanks in advance.

R/William Ward


From Scott.Delinger at ualberta.ca  Tue Jun 20 08:36:27 2000
From: Scott.Delinger at ualberta.ca (Scott L. Delinger)
Date: Tue, 20 Jun 2000 09:36:27 -0600
Subject: Athlon + PC133: no ECC?
Message-ID: <p0431010cb5753f4f946b@[129.128.2.254]>

I've got an Athlon 700 on the ASUS K7V. I've got PC133 ECC memory, but when
I set the BIOS to ECC for the RAM, the machine refuses to boot. Set it back
to believing it has just PC133 SDRAM, the machine runs fine. The manual
states support for ECC SDRAM, and the BIOS agrees, but I cannot get it to
run in that state.

I'll want to sort this out before buying 64 of them. 8-)
-- 

Scott L. Delinger, Ph.D.
Senior System Administrator
Department of Chemistry, University of Alberta
Edmonton, Alberta, Canada  T6G 2G2
Scott.Delinger at ualberta.ca


From cltkbrust at carolina.rr.com  Tue Jun 20 09:10:21 2000
From: cltkbrust at carolina.rr.com (Kurt Brust)
Date: Tue, 20 Jun 2000 12:10:21 -0400
Subject: quick question
Message-ID: <00062012113501.00504@linux1.peppercornplace.com>

Hello, I am sure you are busy, so I will not take up much of your time.

In regards to clustering, is it possible to set up a Beowulf cluster to
help process a log file (text based) over multiple processors to help
distribute the load? Right now it's at 1.5 gigs a day and takes 12 hours to
process; I am looking to cut that down as much as possible.

Thanks for your time!!!


From david.lombard at mscsoftware.com  Tue Jun 20 11:37:56 2000
From: david.lombard at mscsoftware.com (David Lombard)
Date: Tue, 20 Jun 2000 11:37:56 -0700
Subject: quick question
References: <00062012113501.00504@linux1.peppercornplace.com>
Message-ID: <394FBA04.6E6F97CC@mscsoftware.com>

Kurt Brust wrote:
> 
> Hello, I am sure you are busy, so i will not take up much of your time.
> 
> In regards to clustering, Is it possible to setup a beowulf cluster, to
> help process a log file (txt based) over multiple processer's to help
> distrube the load? Right now its at 1.5 gigs a day, takes 12 hours to
> process, I am looking to cut that down as much as possible.
> 

It depends.  That's always the standard answer to a question this vague.

It depends upon what you mean by "help process a log file".

What is being logged?

How is the log file processed today?

Be specific.

-- 
David N. Lombard
MSC.Software


From alangrimes at starpower.net  Tue Jun 20 12:40:40 2000
From: alangrimes at starpower.net (Alan Grimes)
Date: Tue, 20 Jun 2000 15:40:40 -0400
Subject: Athlon + PC133: no ECC?
References: <p0431010cb5753f4f946b@[129.128.2.254]>
Message-ID: <394FC8B8.8B615841@starpower.net>

Scott L. Delinger wrote:
> 
> I've got a Athlon 700 on the ASUS K7V. I've got PC133 ECC memory, but
> when I set the BIOS to ECC for the RAM, the machine refuses to boot. Set
> it back to believing it has just PC133 SDRAM, the machine runs fine. The
> manual states support for ECC SDRAM, and the BIOS agrees, but I cannot
> get it to run in that state.

Ditto! I got the same board with an 800 MHz CPU and 256 MB of the same RAM
with the same problem. 

Here is my theory: The DIMM slots on the board are mislabeled. 

My fix: Move the DIMM to the slot nearest to the CPU ignoring the labels on
the board. =)

Please tell me if this works for you too.... I'm kinda worried though if I
ever decide to actually use those other three slots whether they might
simply not work for being too distant from the northbridge. =(((
 
> I'll want to sort this out before buying 64 of them. 8-)

yeah, you should bug ASUS about this...

-- 
Brigands clobber alies.

http://users.erols.com/alangrimes/


From wsb at paralleldata.com  Tue Jun 20 13:03:02 2000
From: wsb at paralleldata.com (W Bauske)
Date: Tue, 20 Jun 2000 15:03:02 -0500
Subject: quick question
References: <00062012113501.00504@linux1.peppercornplace.com> <394FBA04.6E6F97CC@mscsoftware.com>
Message-ID: <394FCDF6.B8D9C68F@paralleldata.com>

David Lombard wrote:
> 
> Kurt Brust wrote:
> >
> > Hello, I am sure you are busy, so i will not take up much of your time.
> >
> > In regards to clustering, Is it possible to setup a beowulf cluster, to
> > help process a log file (txt based) over multiple processer's to help
> > distrube the load? Right now its at 1.5 gigs a day, takes 12 hours to
> > process, I am looking to cut that down as much as possible.
> >
> 
> It depends.  That's always standard answer to a question this vague.
> 
> It depends upon what you mean by "help process a log file".
> 
> What is being logged?
> 
> How is the log file processed today?
> 
> Be specific.

Also, do you control the source code that does the processing?
If not, then the only way to split the work is split the log into
chunks and run the log processing on each chunk. Then you have 
the question of is the data partitionable such that you get the
same analysis when it's split. Should be since you already split
it on daily boundaries. In general, using 100Mbit Enet, you can 
distribute a 1.5GB log to multiple machines in a few minutes
so if you're taking 12 hours now, the transfer cost is a nit.
Might consider tuning your log analysis code before worrying
about parallelizing it. 12 hours seems like a long time to
process a log, even at 1.5GB. Maybe faster HW is a better
solution, cpu/disk/network.
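
To put rough numbers on that, here is a small sketch of the transfer-time
arithmetic and of a naive contiguous split of the log lines.  The function
names are illustrative, and the estimate ignores protocol overhead unless you
pass an efficiency below 1.

```python
def transfer_seconds(n_bytes, megabits_per_second, efficiency=1.0):
    """Time to push n_bytes over a link of the given nominal speed."""
    return n_bytes * 8 / (megabits_per_second * 1e6 * efficiency)

def split_into_chunks(lines, n_chunks):
    """Split a list of log lines into contiguous, nearly equal chunks."""
    size, extra = divmod(len(lines), n_chunks)
    chunks, start = [], 0
    for k in range(n_chunks):
        end = start + size + (1 if k < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks

# 1.5 GB over 100 Mbit Ethernet: about 120 s at wire speed.
wire_time = transfer_seconds(1.5e9, 100)
```

Even at a fraction of wire speed this is minutes, against 12 hours of
processing.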

Just some thoughts.


Wes


From kragen at pobox.com  Tue Jun 20 13:37:14 2000
From: kragen at pobox.com (Kragen Sitaker)
Date: Tue, 20 Jun 2000 16:37:14 -0400 (EDT)
Subject: quick question
Message-ID: <Pine.GSO.4.21.0006201625001.3318-100000@kirk.dnaco.net>

W Bauske writes:
> Also, do you control the source code that does the processing?
> If not, then the only way to split the work is split the log into
> chunks and run the log processing on each chunk. Then you have 
> the question of is the data partitionable such that you get the
> same analysis when it's split.

This is not correct.  There are several ways to partition problems in
general, and log-processing problems in particular, and splitting up
the input data is only one of them.

Some examples:  

- if you're running a pipelinable problem --- separable, sequential
  stages, each with a relatively high computation-to-data ratio (say, a
  billion or more instructions for every twelve megabytes, thus a
  thousand instructions for every twelve bytes or so) --- you can build
  a pipeline with different stages on different machines.  In an ideal
  world, you'd be able to migrate pipeline stages between machines to
  load-balance.
- if you want to generate ten reports for ten different web sites whose
  logs are interleaved in the same log file, you can run the log into
  one guy whose job it is to divvy it up, line by line, among ten
  machines doing analysis, one for each web site.
- if you're looking for several different kinds of information in the
  log file --- again, with a high computation-to-data ratio --- you can
  send a copy of the log file to several processes, each extracting one
  of the kinds of information.
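
The second case (one process divvying up an interleaved log) can be sketched
in a few lines.  The log format here, with the site name as the first
whitespace-separated field, is purely hypothetical, and in a real setup each
bucket would be streamed to a different machine rather than held in memory.

```python
from collections import defaultdict

def first_field(line):
    """Hypothetical key function: site name is the first field of the line."""
    return line.split(None, 1)[0]

def demux_by_site(lines, site_of=first_field):
    """Route each log line to the bucket for its site, preserving order."""
    buckets = defaultdict(list)
    for line in lines:
        buckets[site_of(line)].append(line)
    return dict(buckets)

log = ["siteA /index.html 200", "siteB /a 404", "siteA /b 200"]
per_site = demux_by_site(log)
```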

Of course, all of this depends on the problem.  My guess is that the
original querent can, as you suggested, rewrite his log-processing
script in C instead of Perl and get the performance boost he needs, and
it will be easier than parallelizing by anything but the simplistic
split-the-log-into-chunks approach.

[I'm just guessing that the log-processing code is currently in Perl. :) ]
-- 
<kragen at pobox.com>       Kragen Sitaker     <http://www.pobox.com/~kragen/>
The Internet stock bubble didn't burst on 1999-11-08.  Hurrah!
<URL:http://www.pobox.com/~kragen/bubble.html>
The power didn't go out on 2000-01-01 either.  :)


From rgb at phy.duke.edu  Tue Jun 20 14:03:39 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Tue, 20 Jun 2000 17:03:39 -0400 (EDT)
Subject: quick question
In-Reply-To: <Pine.GSO.4.21.0006201625001.3318-100000@kirk.dnaco.net>
Message-ID: <Pine.LNX.4.10.10006201655140.8969-100000@ganesh.phy.duke.edu>

On Tue, 20 Jun 2000, Kragen Sitaker wrote:

> This is not correct.  There are several ways to partition problems in
> general, and log-processing problems in particular, and splitting up
> the input data is only one of them.
> 
> Some examples:  
> 
> - if you're running a pipelinable problem --- separable, sequential
>   stages, each with a relatively high computation-to-data ratio (say, a
>   billion or more instructions for every twelve megabytes, thus a
>   thousand instructions for every twelve bytes or so) --- you can build
>   a pipeline with different stages on different machines.  In an ideal
>   world, you'd be able to migrate pipeline stages between machines to
>   load-balance.
> - if you want to generate ten reports for ten different web sites whose
>   logs are interleaved in the same log file, you can run the log into
>   one guy whose job it is to divvy it up, line by line, among ten
>   machines doing analysis, one for each web site.
> - if you're looking for several different kinds of information in the
>   log file --- again, with a high computation-to-data ratio --- you can
>   send a copy of the log file to several processes, each extracting one
>   of the kinds of information.
> 

All good points.  Another good point is that if the reports are the
result of syslogd output, a sensible /etc/syslog.conf can often achieve
a lot of partitioning for you.  If the reports are the result of a
centralized syslog loghost that receives all the syslog output of (say)
100+ hosts, you might look into "syslog-ng", which basically filters
input as it comes into the loghost and squirrels it away in a nice set
of host/loglevel-specific files according to your specification.

Either of these will result in significantly smaller files to process
and a lot of the processing will already be done.

> Of course, all of this depends on the problem.  My guess is that the
> original querent can, as you suggested, rewrite his log-processing
> script in C instead of Perl and get the performance boost he needs, and
> it will be easier than parallelizing by anything but the simplistic
> split-the-log-into-chunks approach.
> 
> [I'm just guessing that the log-processing code is currently in Perl. :) ]

Agreed and agreed.

    rgb

> -- 
> <kragen at pobox.com>       Kragen Sitaker     <http://www.pobox.com/~kragen/>
> The Internet stock bubble didn't burst on 1999-11-08.  Hurrah!
> <URL:http://www.pobox.com/~kragen/bubble.html>
> The power didn't go out on 2000-01-01 either.  :)

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From wsb at paralleldata.com  Tue Jun 20 17:12:13 2000
From: wsb at paralleldata.com (W Bauske)
Date: Tue, 20 Jun 2000 19:12:13 -0500
Subject: quick question
References: <Pine.GSO.4.21.0006201625001.3318-100000@kirk.dnaco.net>
Message-ID: <3950085D.11E69968@paralleldata.com>

Kragen Sitaker wrote:
> 
> W Bauske writes:
> > Also, do you control the source code that does the processing?
> > If not, then the only way to split the work is split the log into
> > chunks and run the log processing on each chunk. Then you have
> > the question of is the data partitionable such that you get the
> > same analysis when it's split.
> 
> This is not correct. 

We'll see, see comments below.

> There are several ways to partition problems in
> general, and log-processing problems in particular, and splitting up
> the input data is only one of them.
> 
> Some examples:
> 
> - if you're running a pipelinable problem --- separable, sequential
>   stages, each with a relatively high computation-to-data ratio (say, a
>   billion or more instructions for every twelve megabytes, thus a
>   thousand instructions for every twelve bytes or so) --- you can build
>   a pipeline with different stages on different machines.  In an ideal
>   world, you'd be able to migrate pipeline stages between machines to
>   load-balance.

Pipelining is good if the processing stages are dependent. 
The original request is too vague to say whether it would work 
though. One could always call the "chunk" the whole file and 
give it to separate programs on separate machines, depending 
on whether the processing is dependent or not on previous 
steps, similar to your last example below.

> - if you want to generate ten reports for ten different web sites whose
>   logs are interleaved in the same log file, you can run the log into
>   one guy whose job it is to divvy it up, line by line, among ten
>   machines doing analysis, one for each web site.

This is just chunking it in a special way. I didn't specify
HOW to chunk it. You assumed I meant a simple chunking.
One can always specify many ways to split the data up, depending
on specific processing requirements.

> - if you're looking for several different kinds of information in the
>   log file --- again, with a high computation-to-data ratio --- you can
>   send a copy of the log file to several processes, each extracting one
>   of the kinds of information.

Same problem as above. It's just another form of chunking.
I was vague about what I meant by chunking on purpose figuring
there would be more questions.

> 
> Of course, all of this depends on the problem.  My guess is that the
> original querent can, as you suggested, rewrite his log-processing
> script in C instead of Perl and get the performance boost he needs, and
> it will be easier than parallelizing by anything but the simplistic
> split-the-log-into-chunks approach.
> 

You assumed the split method. I didn't specify an implementation.
Most likely the log is already partitioned in a simple time dependent 
manner so it can be processed offline. I doubt it's done in real
time. So, if one can tolerate time splitting already, then it is 
likely one can partition into 12/6/4/3/2/1/etc. hour chunks and combine
those results to get a picture of what happened for the whole log
time frame.
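
A minimal sketch of the split-and-combine idea, not anyone's actual log
processor. It assumes (hypothetically) that each log line starts with the
hour of day and ends with a URL, and that the analysis is a plain hit count,
which is the kind of result that can be merged across time chunks. The
function names are made up for illustration.

```python
from collections import Counter

def split_by_hour(lines, hours_per_chunk):
    """Group log lines into chunks covering `hours_per_chunk` hours each.

    Assumes (hypothetically) that each line begins with the hour of day
    as an integer, e.g. "13 GET /index.html".
    """
    chunks = {}
    for line in lines:
        hour = int(line.split()[0])
        chunks.setdefault(hour // hours_per_chunk, []).append(line)
    return [chunks[key] for key in sorted(chunks)]

def analyze(chunk):
    """Per-chunk analysis: count hits per URL (last field of the line)."""
    return Counter(line.split()[-1] for line in chunk)

def combine(partials):
    """Merge per-chunk results into one result for the whole log."""
    total = Counter()
    for partial in partials:
        total += partial
    return total

log = ["0 GET /a", "1 GET /b", "7 GET /a", "13 GET /a", "23 GET /b"]
chunks = split_by_hour(log, hours_per_chunk=6)
# Each analyze() call is independent, so each could run on its own node.
combined = combine(analyze(chunk) for chunk in chunks)
whole = analyze(log)
```

Here `combined` equals `whole` because counting is additive. An analysis
that depends on state carried across chunk boundaries (sessions spanning
midnight, say) would need a smarter combine step.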

We agree on tuning. Spending 12 hours running Perl is not
such a good plan. Again, though, I was not specific on purpose.
Just tune it, whatever that means for the specific problem.
If one doesn't know how to "tune it", they should describe
the problem and ask for advice.
 

Wes


From pratte at lincweb.com  Tue Jun 20 17:45:26 2000
From: pratte at lincweb.com (Robert Pratte)
Date: Tue, 20 Jun 2000 19:45:26 -0500
Subject: quick question
References: <00062012113501.00504@linux1.peppercornplace.com>
Message-ID: <39501026.CB884931@lincweb.com>

You have probably gone through this list already, but sometimes it is
helpful to check off the basics at least.

1) Examine the hardware you are using.  I wouldn't be surprised if the
biggest bottleneck you are facing is disk access.  What type of file
system are you using, what OS, etc.?  I would guess that you are dealing
with disk-intensive processes (unless you are stuffing 1.5 gig of data
into your free memory...:)...), so increasing processor throughput via
threading the application, running it in parallel, etc. may not gain much.
I have seen HUGE differences with processes like this, though, by
upgrading drive arrays.  If you aren't using them, look at EMCs or
similar products.

2) How is the data set up?  Are you processing one giant log for a small
group of processes, or is this some concatenated/conglomerated log(s)
that can easily be divided?  In the case of the latter, distributing the
logs (not necessarily the process) may be a quick answer.

3) Examine the script running the process.  You probably have some
regular expression matching going on if this is a shell
script/perl/python/etc.  Read the O'Reilly Regular Expression book, if
you haven't already....quite elucidating.  If you are using a binary,
check the source code (if available); there are lots of performance
tweaks for C/C++/etc. that may be useful.  Recompiling with different
flags may also be useful.

4) Examine processes running on the box.  Unnecessary daemons, etc. just
drag down performance....and create security hazards.  Is the kernel
optimized?

5) See #1.....I am really suspicious that disk may be chewing up a lot of
your time.
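
One rough way to check points 1 and 3 before buying hardware is to time the
read and the matching separately on a sample of the log. This is only a
sketch: the function name is made up, it reads the whole sample into memory
(so use a slice of the log, not the full 1.5 gig), and the read time will
look better than reality if the file is already in the OS cache.

```python
import re
import time

def time_read_vs_process(path, pattern):
    """Return (read_seconds, match_seconds, matches) for one log file."""
    regex = re.compile(pattern)  # compile once, outside the loop

    # Phase 1: pull the file off disk.
    start = time.perf_counter()
    with open(path, "r", errors="replace") as handle:
        lines = handle.readlines()
    read_seconds = time.perf_counter() - start

    # Phase 2: pure CPU work on data already in memory.
    start = time.perf_counter()
    matches = sum(1 for line in lines if regex.search(line))
    match_seconds = time.perf_counter() - start

    return read_seconds, match_seconds, matches
```

If the read phase dominates, more CPUs will not help much and the disk
advice above applies; if the matching phase dominates, tuning the
expressions or splitting the work becomes more interesting.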

Kurt Brust wrote:

> Hello, I am sure you are busy, so i will not take up much of your time.
>
> In regards to clustering, Is it possible to setup a beowulf cluster, to
> help process a log file (txt based) over multiple processer's to help
> distrube the load? Right now its at 1.5 gigs a day, takes 12 hours to
> process, I am looking to cut that down as much as possible.
>
> Thanks for your time!!!
>
> _______________________________________________
> Beowulf mailing list
> Beowulf at beowulf.org
> http://www.beowulf.org/mailman/listinfo/beowulf


From kragen at pobox.com  Tue Jun 20 19:56:34 2000
From: kragen at pobox.com (Kragen Sitaker)
Date: Tue, 20 Jun 2000 22:56:34 -0400 (EDT)
Subject: quick question
Message-ID: <Pine.GSO.4.21.0006202201510.9642-100000@kirk.dnaco.net>

Bradley Alexander writeth:
> Lets take this a step further. I was, a long time ago, looking at an
> application to parallelize the SHADOW IDS' analysis station. Rather than
> simply running a separate analyzer on each node, I thought that parallelizing
> the process would actually be more efficient. I thought that it could handle a
> number of separate sensors etc. (I should note that SHADOW is a client/server
> setup in the form of sensors that use tcpdump to capture traffic, and an
> analysis station that analyzes these tcpdump files.)
> 
> Unfortunately other duties have kept me from pursuing this as yet, but one of
> the problems I found that I had was getting the logs (9+GB/hour) back to the
> cluster at anything resembling reasonable time, especially since it would have
> to be an out-of-band transfer to keep from choking the network its supposed to
> be watching. ("Your IDS just caused a DoS, so GET OUT." :-)

There may be other ways to do this.

Suppose we have a set of categories C1, C2, C3, etc., and a set of
sensors S1, S2, S3, etc., producing a set of events E1, E2, E3, etc.
Each event En is produced by one sensor Sn and belongs to some set Cn,
Cm, Cp of categories.

Now, suppose you have a set of machines M1, M2, M3, etc., each of which
is devoted to analyzing one category of events: M1 analyzes events in
C1, M2 analyzes events in C2; in general Mn analyzes events in Cn.
Then, when a sensor produces an event, instead of sending it to a
central choke point, it determines which categories (Cn, Cm, Cp) it
belongs to, and sends it to the appropriate machines Mn, Mm, Mp.
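
A minimal sketch of that routing step, as it might run on each sensor. The
categorisation rule (by destination port), the category names and the
machine names are all hypothetical, and the list append stands in for a
network send to the analysis machine.

```python
from collections import defaultdict

def categories_of(event):
    """Return the set of categories an event belongs to.

    Hypothetical rule for illustration: categorise by destination port.
    """
    cats = set()
    if event["dport"] in (80, 443):
        cats.add("web")
    if event["dport"] == 25:
        cats.add("mail")
    if event["dport"] < 1024:
        cats.add("privileged")
    return cats

# One analysis machine per category (names are made up).
MACHINE_FOR = {"web": "M1", "mail": "M2", "privileged": "M3"}

def route(events):
    """Decide at the sensor which machines get each event."""
    outbox = defaultdict(list)
    for event in events:
        for cat in categories_of(event):
            outbox[MACHINE_FOR[cat]].append(event)
    return dict(outbox)

events = [
    {"sensor": "S1", "dport": 80},
    {"sensor": "S2", "dport": 25},
    {"sensor": "S1", "dport": 8080},
]
outbox = route(events)
```

An event that falls into several categories is sent once per category, and
an event in no category is not sent anywhere, so the sensor does the
filtering that the central choke point would otherwise have to do.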

This way, no machine receives more traffic than belongs in a single
category; you might have 20 megabits of aggregate event traffic ---
9GB/hour --- but each analysis machine will only have a fraction of
that level of traffic flowing into it.  If your network aggregate
bandwidth is much bigger than 20 megabits --- say, you have a 36-port
100BaseT switch with a 7.2 gigabit backplane bandwidth --- you have
SOLVED THIS PROBLEM.
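
The arithmetic behind those figures, as a quick check. The even split over
ten categories is an assumption made only for illustration, and reading the
7.2 gigabit backplane as 36 full-duplex 100 Mbit/s ports is one plausible
interpretation, not something stated above.

```python
def gb_per_hour_to_mbit_per_s(gb_per_hour):
    """Convert decimal gigabytes per hour to megabits per second."""
    bits_per_hour = gb_per_hour * 1e9 * 8
    return bits_per_hour / 3600 / 1e6

aggregate = gb_per_hour_to_mbit_per_s(9)   # total event traffic, Mbit/s
per_machine = aggregate / 10               # assuming 10 equal categories
backplane = 36 * 100 * 2                   # 36 full-duplex 100 Mbit/s ports
```

With these assumptions `aggregate` comes to 20 Mbit/s, `per_machine` to
2 Mbit/s and `backplane` to 7200 Mbit/s. Events that belong to several
categories are sent more than once, so the real per-machine figures would
add up to somewhat more than the aggregate.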

As a further refinement, M1, M2, M3, etc., can be the same machines
that run the sensors; this prevents you from having to buy and admin a
separate analysis cluster.  You can actually have the Ms be "virtual
machines" that move dynamically from one sensor machine to another for
load-balancing.

> > [I'm just guessing that the log-processing code is currently in Perl. :) ]
> 
> Since most of SHADOW is written in Perl, isn't there a parallelized Perl module?

Not that I know of.  Profile SHADOW and see if you can double its speed
by rewriting 4% of it in C, making that 4% 10-200 times faster :)
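
Why speeding up a small part so much only about doubles the whole: this is
Amdahl's law. The sketch below assumes (hypothetically; a profile would
give the real number) that the 4% of the code in question accounts for half
of the run time.

```python
def overall_speedup(hot_fraction, hot_speedup):
    """Amdahl's law: speedup of the whole program when only the part
    taking `hot_fraction` of the run time gets `hot_speedup` times faster."""
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / hot_speedup)

for factor in (10, 200):
    print(factor, round(overall_speedup(0.5, factor), 2))
```

Under that assumption the overall speedup is about 1.82 for a 10x faster
hot spot and about 1.99 for 200x; it can never reach 2, because the
untouched half of the run time remains.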

-- 
<kragen at pobox.com>       Kragen Sitaker     <http://www.pobox.com/~kragen/>
The Internet stock bubble didn't burst on 1999-11-08.  Hurrah!
<URL:http://www.pobox.com/~kragen/bubble.html>
The power didn't go out on 2000-01-01 either.  :)


From walt at parl.ces.clemson.edu  Wed Jun 21 06:45:52 2000
From: walt at parl.ces.clemson.edu (Walter B. Ligon III)
Date: Wed, 21 Jun 2000 09:45:52 -0400
Subject: quick question 
In-Reply-To: Your message of "Tue, 20 Jun 2000 19:12:13 CDT."
             <3950085D.11E69968@paralleldata.com> 
Message-ID: <200006191344.JAA17965@krang.parl.clemson.edu>

--------
> Kragen Sitaker wrote:
> > 
> > W Bauske writes:
> > > Also, do you control the source code that does the processing?
> > > If not, then the only way to split the work is split the log into
> > > chunks and run the log processing on each chunk. Then you have
> > > the question of is the data partitionable such that you get the
> > > same analysis when it's split.
> > 
> > This is not correct. 
> 
> We'll see, see comments below.
> 
> > There are several ways to partition problems in
> > general, and log-processing problems in particular, and splitting up
> > the input data is only one of them.
> > 
> > Some examples:
> > 
> > - if you're running a pipelinable problem --- separable, sequential
> >   stages, each with a relatively high computation-to-data ratio (say, a
> >   billion or more instructions for every twelve megabytes, thus a
> >   thousand instructions for every twelve bytes or so) --- you can build
> >   a pipeline with different stages on different machines.  In an ideal
> >   world, you'd be able to migrate pipeline stages between machines to
> >   load-balance.
> 
> Pipelining is good if the processing stages are dependent. 
> The original request is too vague to say whether it would work 
> though. One could always call the "chunk" the whole file and 
> give it to separate programs on separate machines, depending 
> on whether the processing is dependent or not on previous 
> steps, similar to your last example below.

I'll agree that the other issues raised essentially reduce to forms of
data parallelism, which your original post claimed was the "only" way
to split the work.  But not this one.  This amounts to what used to (a
long time ago) be called MISD processing - the same data is processed by
multiple programs and it is NOT a form of data parallelism - it is a form
of control parallelism.  Your argument that 'One could always call the
"chunk" the whole file' is really weak.  You implied data parallelism
was the ONLY option and it ISN'T.  Let's just admit that.

On the other hand, data parallelism is generally the better approach,
and the rest of your comments were valuable.  No point in beating this
dead horse further, but the lesson should be learned: there are always
other ways and it is sometimes worth considering them if only to
convince yourself the given solution is best.
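
To make the distinction concrete, here is a small sketch of the two styles.
The two "programs" are toy functions invented for the example, and threads
stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

def count_errors(lines):
    return sum(1 for line in lines if "ERROR" in line)

def count_bytes(lines):
    return sum(len(line) for line in lines)

def data_parallel(lines, workers=2):
    """Same program, different pieces of the data."""
    size = max(1, (len(lines) + workers - 1) // workers)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_errors, chunks))

def control_parallel(lines):
    """Different programs, each given the same, whole data set."""
    programs = [count_errors, count_bytes]
    with ThreadPoolExecutor(max_workers=len(programs)) as pool:
        futures = [pool.submit(program, lines) for program in programs]
        return [future.result() for future in futures]

log = ["ok", "ERROR disk", "ok", "ERROR net", "ok"]
```

In `data_parallel` the data is what gets divided; in `control_parallel`
nothing is divided, and the parallelism comes only from running different
code at the same time.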

Walt

-- 
Dr. Walter B. Ligon III
Associate Professor
ECE Department
Clemson University


From wsb at paralleldata.com  Wed Jun 21 14:26:03 2000
From: wsb at paralleldata.com (W Bauske)
Date: Wed, 21 Jun 2000 16:26:03 -0500
Subject: quick question
References: <200006191344.JAA17965@krang.parl.clemson.edu>
Message-ID: <395132EB.2EA9E658@paralleldata.com>

"Walter B. Ligon III" wrote:
> 
> --------

Amazing how one word creates so much discussion.
Should have said "probably the best way".
No more wasted bandwidth from me on this thread.


Wes


From iorfr00 at student.vxu.se  Thu Jun 22 00:20:33 2000
From: iorfr00 at student.vxu.se (Nacho Ruiz)
Date: Thu, 22 Jun 2000 09:20:33 +0200
Subject: Beowulf: A theorical approach
Message-ID: <000b01bfdc1a$5f6e7c80$396d2fc2@lyan.vxu.se>

Hi,

I'm doing my final year project and I'm writing about the Beowulf project
and Beowulf clusters.
I've been reading several documents about Beowulf clusters, but I would
like to ask all of you some questions about them.

As I see it, the main objective behind any Beowulf cluster is
price/performance, especially when compared to supercomputers. But as
network hardware and commodity systems are becoming faster and faster
(getting closer to GHz and Gigabit speeds), could you think of competing
directly with supercomputers?

As I see it, the Beowulf cluster idea is based on both distributed
computing and parallel computing: you add more CPUs to get more speedup,
but as you can't have all the CPUs in the same machine you use several. So
the Beowulf cluster could fit in between distributed computing and the
supercomputers (vector computers, parallel computers, etc.). You have
advantages from both sides: parallel programming and high scalability; but
you also have several drawbacks, mainly interconnection problems. Do you
think that with 10 Gb connections (OC-192 bandwidth), SMP on a chip (Power 4)
and massive primary and secondary memory devices at low cost, you could
have a chance to beat most of the traditional supercomputers? Or is that
not your "goal"?

And about the evolution of the Beowulf clusters, do you all follow some kind
of guidelines, or has the project divided into several flavors and objectives?
Are the objectives the same today as at the beginning, or do you now plan to
have something like a "super SMP computer" in a distributed way (with good
communication times)? I've seen that a lot of you are focusing on the GPID
and whole machine idea; do you think that is reachable? What are the main
objectives vs the MPI/PVM message passing idea?
And what about shared memory (at the HD level or the RAM level), do you take
advantage of having this amount of resources?

Is this idea trying to reach the objective of making parallel programs
"independent" of the programmer? I mean that, instead of having to program
with a parallel machine in mind, you can program in a
"normal" way and the compiler will divide/distribute the code over the
cluster. Is this reachable or just a dream? Is somebody working on this?

And what about the administration of a cluster? Having all the machines of
the cluster under control, so you know which are available to send
work to, is a hazardous but necessary task. It is not as easy as in an SMP
machine, where you know or assume that all the CPUs inside are working; in a
cluster you can't do that, as the CPU might work but the HD, NIC or memory may
fail. How much computational time do you spend on this task? Is somebody
working on a better way to manage this?

I know that some time ago HP had a machine with several faulty processors
working and achieving high computational speeds without any error. They used
some kind of "control algorithm" that manages to use only the good CPUs. Do
you have something like this, or is there no point? Does it make sense?

That's all for now, thanks to all of you.
If you know of some sources where I can get more information, please let me
know.

Nacho Ruiz.


From walt at parl.ces.clemson.edu  Thu Jun 22 06:43:53 2000
From: walt at parl.ces.clemson.edu (Walter B. Ligon III)
Date: Thu, 22 Jun 2000 09:43:53 -0400
Subject: Beowulf: A theorical approach 
In-Reply-To: Your message of "Thu, 22 Jun 2000 09:20:33 +0200."
             <000b01bfdc1a$5f6e7c80$396d2fc2@lyan.vxu.se> 
Message-ID: <200006201342.JAA22575@krang.parl.clemson.edu>

--------
> Hi,
> 
> I'm doing my final year project and I'm writting about the Beowulf project
> an the Beowulf clusters.
> I've been reading several documents about the beowulf clusters, but I would
> like to ask all of you some questions about them.
> 
> As I've seen the main objective behind any Beowulf cluster is the
> price/performance tag, specially when compared to supercomputers. But as
> network hardware and commodity systems are becoming faster and faster
> (getting closer to GHz and Gigabit speeds), could you think on competting
> directly with supercomputers?

If it becomes possible to compete with a "supercomputer" in all applications
using COTS HW that would be great!  I don't see that in the near future, but
we may get there eventually.  Depends on how much Beowulf cuts into the
development of "supercomputers."
 
> As I see it the Beowulf cluster idea could be based in the distributed
> computign and the parallel computing: you put more CPUs to get more speedup,
> but as you can't have all the CPUs in the same machine you use several. So
> the Beowulf cluster could fit in between the distributed computing and the
> supercomputers (vetorial computers, parallel computers,..etc). You have
> advantages from both sides: parallel programming and high scalability; but
> you also have several drawbacks: mainly interconection problems. Do you
> think that with 10 Gb conections (OC-192 bandwith), SMP in chip (Power 4)
> and  massive primary and secondary memory devices at low cost, you could
> have a chance to beat most of the traditional supercomputers? or is not your
> "goal"?

The "goal" is to do the best we can with COTS HW.  The problem right now really
isn't in link speeds (though better link speeds are good), it's in how close/far
the network interface is from the CPU.  COTS HW doesn't place a high value
on direct access to IO devices - there is a higher value on a standardized
bus interface to allow different system components to be integrated and updated
independently.  A "supercomputer" can have the network engineered directly
into the node architecture.  This is a huge advantage.  Luckily, this advantage
has the most effect in only some programs.  Beowulf attempts to exploit
those programs where that advantage isn't as much of an issue.
 
> And about the evolution of the Beowulf clusters, do you all follow a kind of
> guideness or the project have divided in several flavors and objectives?
> Are the objectives of the beggining the same as today or now you plan to
> have something like a "super SMP computer" in a distributed way (with good
> communications times). I've seen that a lot of you are focusing in the GPID
> and whole machine idea, do you think that is reachable? What are the main
> objectives vs the MPI/PVM message passing idea?
> And what about shared memory (in the HD level or the RAM level), do you take
> advantage of having this amount  of resouces?

No, there isn't a single approach.  Beowulf is all about "roll your own"
technology.  There are those who would like to see some kind of
standardization, but no one can quite agree on what that standard should look
like.  This indicates to *me* that we aren't ready for a standard yet.

GPID is really close.  We are running that every day around here and I feel
like it makes a *huge* impact on the usability and programmability of the
system.  I still don't know if the way we are doing it will emerge as THE
way to do it, but that's what this community is all about - do it and then
let the rest of the world decide what it thinks.
 
> Is this idea trying to reach the objective of making parallel programs
> "independent" to the programmer? I mean, that instead of having to program
> having in mind that you are using a parallel machine you can program in a
> "normal" way and the compiler will divide/distribute the code over the
> cluster. Is this reachable or just a dream? Is somebody working on this?

No, that's really a separate issue.  There are (and have been) many, many
people working on this.  My *personal* opinion is we will never have a
perfect - or even really good - parallelizing compiler (no flames guys, this
is MY opinion).  I think in the end we will evolve until programmers are
capable of programming in parallel.
 
> And what about the administration of a cluster. Having all the machine of
> the cluster under control, so you can know which are avaliable to send some
> work, is an hazarous task but necessary. Is not as easy as in a SMP machine
> where you know or assume that all the CPUs inside are working, in a cluster
> you can't do that as the CPU might work but the HD, NIC or memory may fail.
> How much computational time do you spend in this task? There's somebody
> working in a better way to manage with this?

Several people are working on this.  There are some good approaches out there.
I think we'll see some achievements in this in the next couple of years and
this will make a big difference in getting Beowulf off the ground as a "real"
computing technology used in a wide range of applications.
 
> I know that sometime ago HP had a machine woth several faulty processors
> working and achiving high computational speeds without any error. They used
> some kind of  "control algorithm" that manages to use only the good CPUs. Do
> you have something like this or there is no point? Does it make sense?

Good idea - again, this is an area people are (and should be) working in.


So, anyway, this is ONE answer to your questions.  I'm sure you will get
several others.

Walt

-- 
Dr. Walter B. Ligon III
Associate Professor
ECE Department
Clemson University


From dsg at super.org  Thu Jun 22 08:45:53 2000
From: dsg at super.org (David S. Greenberg)
Date: Thu, 22 Jun 2000 15:45:53 +0000
Subject: Beowulfs can compete with Supercomputers [was Beowulf: A theorical 
 approach]
References: <000b01bfdc1a$5f6e7c80$396d2fc2@lyan.vxu.se>
Message-ID: <395234B1.5B45622A@super.org>

I'm going to use Mr. Ruiz' question as a springboard for a little pontificating
and some conference advertising, so caveat lector.
I paraphrase Mr. Ruiz' question: Can commodity clusters compete with
supercomputers on a performance basis and not just on a price-performance
basis?  Some of us who believe that the answer is yes have been promoting the
idea through the Extreme Linux mailing list, website (www.extremelinux.org),
and workshops.  The challenge is to answer many of Mr. Ruiz' questions and more
and when necessary to work together to fill in missing pieces.

Now a brief digression for a conference advertisement.  The Extreme Linux Track
will be part of the Atlanta Linux Showcase and Conference,
http://www.linuxshowcase.org/, from October 12-14, 2000.   The refereed papers
portion of the track has been determined but we are leaving two 90 minute
sessions open for participation.  The first of these sessions will be devoted
to working cluster updates.  Everyone is encouraged to send us one-page
formatted descriptions of their cluster (typically a black-and-white picture
with text telling how big it has grown and what cool things it has done over
the last year - sort of a Christmas/New Years card from your cluster).  We will
publish the one-pagers in the proceedings and give as many folks as possible
(in order of submission) a few minutes to present to the workshop.  Similarly
there will be a session for one-page descriptions of applications.  Here we'd
like to hear about how well your application runs on clusters, about what you'd
like most to have added to clusters, and  about comparisons with runs on
classic supercomputers.  Send your one-pagers to me, dsg at super.org, with the
subject, EL2000 one-pager.  Remember, first come, first served.

Back to the question at hand, how to make a supercomputer from commodity
parts.  Many of us have determined that not only should it be possible to
"build your own" supercomputer but it is likely to be the only way to do so
since "supercomputer" companies are quickly disappearing.  There are several
approaches:
1) Design it yourself and build it yourself.  The example I'm most familiar
with is the CPlant project at Sandia (www.cs.sandia.gov/~cplant).  Based on the
success of the ASCI Red 9000+ processor Intel machine the Sandia team set out
to duplicate/surpass its performance, usability, and extensibility with
high-end but "commodity" parts.  They chose Alpha processors and Myrinet
interconnect.  They have been a regular in the top third of the top 500 for
several years and continue to grow bigger each year.
2) Customize a stock design and get someone else to build it for you.  There
are several small to mid-size companies which specialize in this.  I've been
meaning to update my list (perhaps some readers will help).  The list includes
at least Altatech, Atlantec, Aspen, DCG, HPTi, Paralogic, TurboLinux, VALinux.
3) Convince a vendor to make a product out of the idea.  The two biggest
examples of this are the Compaq SC series which clusters 4-way Alpha boxes
using the Quadrics interconnect and the IBM move toward Linux clusters, see in
particular the Roadrunner cluster at UNM, www.alliance.unm.edu and the Chiba
City cluster at Argonne, http://www-unix.mcs.anl.gov/chiba/.

A big advantage of clusters is that it is possible to customize to your exact
needs.  Of course, as is often mentioned on these lists, you must first
understand your needs which can take some time.  The range of choices can
sometimes seem overwhelming but the nice thing is that there are many solutions
which will be in some sense 90% optimal.  The real trick is to pick something
reasonable and get it up and running your applications while it is still "hot"
hardware.  You can modify and upgrade later as you learn more about your needs.

One note of interest is that the cost of a supercomputer used to be mostly in
the processors.  Then the cost moved to the memory.  We are currently seeing a
move to putting money in the interconnect (both the memory to processor bus and
the internode network).  Each such change in focus is difficult for buyers to
make since it seems like "too much is being spent on specialized hardware".
My advice is to go with the trend.

Two major software issues for really large machines are system
administration/fault tolerance and parallel IO.  Don't miss the panels and
papers on these topics at the Extreme Linux Track.

David




From jakob at ostenfeld.dtu.dk  Thu Jun 22 04:59:08 2000
From: jakob at ostenfeld.dtu.dk (=?iso-8859-1?Q?Jakob_=D8stergaard?=)
Date: Thu, 22 Jun 2000 13:59:08 +0200
Subject: Beowulf: A theorical approach
In-Reply-To: <000b01bfdc1a$5f6e7c80$396d2fc2@lyan.vxu.se>
References: <000b01bfdc1a$5f6e7c80$396d2fc2@lyan.vxu.se>
Message-ID: <20000622135908.C762@ostenfeld.dtu.dk>

On Thu, 22 Jun 2000, Nacho Ruiz wrote:

> Hi,
> 
> I'm doing my final year project and I'm writting about the Beowulf project
> an the Beowulf clusters.
> I've been reading several documents about the beowulf clusters, but I would
> like to ask all of you some questions about them.
> 
> As I've seen the main objective behind any Beowulf cluster is the
> price/performance tag, specially when compared to supercomputers. But as
> network hardware and commodity systems are becoming faster and faster
> (getting closer to GHz and Gigabit speeds), could you think on competting
> directly with supercomputers?

Competing on what terms ?

SMP supercomputers are very convenient to work with, because they have both a
lot of CPU power and a lot of memory, and you can use either or both as you see
fit.  In other words, a lazy programmer with poor tools can make almost
anything run well on an SMP supercomputer.

Clusters are a different story. They have a lot of CPU power as well, and a lot
of memory, but one CPU can't easily access all the memory. Several CPUs can't
easily share the same memory.    Clusters are a lot less convenient than one
huge SMP machine, they require more effort on the side of the programmer for
general problem solving, if they are to run as well as the supercomputer.
There are special problems though, which are extremely well suited for
clusters, and I believe that a large number of problems _could_ be solved well
on clusters.  Since we have the Beowulf list, I guess I'm not the only one
thinking that  :)

``very'' parallel problems (extreme example:  SETI@home or distributed.net)
will run as well on a cluster as they will on a traditional SMP
supercomputer.  This kind of problem is well suited for clusters, the
sub-problems solved by each CPU are completely isolated so the CPUs in the
cluster (or in the supercomputer) need not communicate.   This type of problem
is rare though.  I will consider ``general'' problems in the following,
problems that can parallelize, but where sub-problems are not independent.


Large SMP supercomputers are (should I say ``usually'' ?) NUMA architectures.
You can think of them as a cluster with incredibly high network bandwidth,
very low latency, and where the hardware (with some help from the operating
system) emulates shared memory between CPUs.   If you want a cluster to work
the same way, you will not only need some software to help you, you will also
need a very high bandwidth network, and even then you will see that the network
latency is becoming a problem.  MOSIX (www.mosix.org) is a piece of software
that tries to emulate a large SMP machine on distributed systems.  They still
lack functionality as far as I know (especially wrt. memory shared between
threads), but the software can already now give you an idea about how hard it
is to compete with SMP supercomputers on _their_ terms.  If you want your
cluster to give any program the impression that it's in fact running on a giant
SMP machine, well, you're in trouble.   It can be done, but it can't be done
well.   Not because the software isn't good (MOSIX _is_ good) but because you
just don't set up switched networks with multiple GByte/s bandwidth and very
low latency.   Gigabit networks are _nothing_ compared to the crossbar switch
in your average Origin system.

Clusters can't fake a large SMP system _generally_.   I mean, you can do
it yes, but you cannot get good speed - generally.  If the operating system
and the hardware are the parts that work together to give a program the
impression that it can run on 16 CPUs sharing the same memory, you will
need the speed of the backbone in the supercomputers.

Having the abstraction layer at the hardware or operating system level is
sub-optimal.   The application knows (or could know) when it is going
to move data, when it can use spare CPUs, etc. etc.   The hardware and the
operating system can never know.   The only reason why this works so well
in supercomputers, is because they have an _incredibly_ fast ``network''
between their CPUs, so the cost of moving information from CPU to CPU in this
suboptimal manner, is acceptable.

> 
> As I see it the Beowulf cluster idea could be based in the distributed
> computign and the parallel computing: you put more CPUs to get more speedup,
> but as you can't have all the CPUs in the same machine you use several. So
> the Beowulf cluster could fit in between the distributed computing and the
> supercomputers (vetorial computers, parallel computers,..etc). You have
> advantages from both sides: parallel programming and high scalability; but
> you also have several drawbacks: mainly interconection problems. Do you
> think that with 10 Gb conections (OC-192 bandwith), SMP in chip (Power 4)
> and  massive primary and secondary memory devices at low cost, you could
> have a chance to beat most of the traditional supercomputers? or is not your
> "goal"?

To me that's not a goal.  It's a game we've already lost if we start playing it.

We could set up 10Gbit/s networks, but your average supercomputer (SMP) has
a 10GByte/s ``network'' _today_.  Wait another five years, set up your 100Gbit/s
network, and guess what your SMP competitor has.   We're not even talking about
latency here...  I don't have latency numbers handy for a typical NUMA machine
interconnect, but I think it's safe to assume that it's pretty damn good and that
we're not going to get there ever with a traditional network.  If for nothing
else then because our wires are longer.
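To make the gap concrete, here is a back-of-the-envelope sketch using the
usual first-order model (transfer time = latency + size / bandwidth).  The
two bandwidth figures are the ones mentioned above; the latencies and the
message size are purely illustrative assumptions, not measurements.

```python
def transfer_time(size_bytes, bandwidth_bytes_per_s, latency_s):
    """First-order cost of one message: fixed latency plus streaming time."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


GBIT = 1e9 / 8   # bytes per second carried by 1 Gbit/s
GBYTE = 1e9      # bytes per second carried by 1 GByte/s

message = 1_000_000  # assumed message size in bytes

# 10 Gbit/s cluster network; the 50 microsecond latency is an assumption.
cluster = transfer_time(message, 10 * GBIT, latency_s=50e-6)

# 10 GByte/s SMP interconnect; the 1 microsecond latency is an assumption.
smp = transfer_time(message, 10 * GBYTE, latency_s=1e-6)

print(f"cluster: {cluster * 1e6:.1f} us, SMP: {smp * 1e6:.1f} us")
```

For small messages the latency term dominates, which is why raw link
bandwidth alone does not close the gap.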

One way to work around some of the shortcomings of clusters in general problem
solving is to make clusters more SMP-like.  E.g. use two or four CPUs in each box,
then connect a number of those.   Sure, it's one way to go, if you want to
improve things, but you will still not be taking a lead.

I don't mean to sound negative about clusters, really.  I just want to make
it clear that I don't think it's wise to spend money and effort trying to
beat the supercomputers on their terms - eg. having fairly stupid parallel
applications believing they're on shared memory.   I think the way to go is
to build smarter applications that _know_ that they aren't on shared memory.
This is what people do with MPI and PVM.  And I think that could be taken
much further.

> 
> And about the evolution of the Beowulf clusters, do you all follow a kind of
> guideness or the project have divided in several flavors and objectives?
> Are the objectives of the beggining the same as today or now you plan to
> have something like a "super SMP computer" in a distributed way (with good
> communications times). I've seen that a lot of you are focusing in the GPID
> and whole machine idea, do you think that is reachable? What are the main
> objectives vs the MPI/PVM message passing idea?
> And what about shared memory (in the HD level or the RAM level), do you take
> advantage of having this amount  of resouces?

Again, I think GPID and fake shared memory are interesting ideas, and we might
even end up being able to compete with supercomputers in terms of
price/performance, for _some_ problems.

But trying to build a shared memory system from memory that just isn't shared
is hard enough for the SMP Supercomputer people (ccNUMA architectures are
very complicated hardware and software systems).   We can't do that better than
they do (and I'd love to eat those words, but I don't think it will happen).

> 
> Is this idea trying to reach the objective of making parallel programs
> "independent" to the programmer? I mean, that instead of having to program
> having in mind that you are using a parallel machine you can program in a
> "normal" way and the compiler will divide/distribute the code over the
> cluster. Is this reachable or just a dream? Is somebody working on this?

Lots of people are working on new ways to use clusters.  An hour of surfing
from the beowulf site, or any national laboratory should give you plenty of
pointers     :)

And I'm working on something.  It will take time before I have results, but my
basic idea is:
*)   The programmer writes a serial program (in a language designed for this
     purpose).  There must be no way in the language to represent parallelism.
     The programmer should not be concerned with it.  Besides, he doesn't know
     whether there are one or a hundred idle nodes in the cluster when he runs his
     program.
*)   The program is submitted to a parallel virtual machine running on all
     nodes of the cluster.   This virtual machine can automatically parallelize
     the program, it can be fault tolerant, it can predict/guess where/when
     data might be needed on what nodes and move data before it's needed.
*)   The hardware and the operating system should do nothing but provide
     base services (network and disk I/O) on each node in the cluster. The OS
     doesn't know about the flow of the program being run, so it should
     basically just stay out of the way doing what it does well already.

The goal is to make the network bandwidth/latency less important, by moving
data to remote nodes before the data are being waited for there.  Also,
parallelizing with regard to the current state of the cluster should improve
the usage of the cluster, making sure that all nodes are busy at all times.  As
each sub-task consists of a well defined set of input data, the system can
simply re-send a task to some other node, in case of node failure.
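The re-send idea can be sketched in a few lines.  This is only a toy model,
not the TONS implementation: the node names, the failure probability and the
round-robin choice of the next node are all assumptions made for
illustration, and node failure is simulated with a random number generator.

```python
import random


def run_with_resend(tasks, nodes, fail_prob, rng, max_attempts=100):
    """Run every task, re-sending it to another node when a node fails.

    tasks: dict mapping task id -> (function, input data).  Each task is
           fully described by its inputs, so a resend needs no other state.
    nodes: list of (hypothetical) node names.
    Returns a dict mapping task id -> (node that finished it, result).
    """
    results = {}
    for task_id, (func, data) in tasks.items():
        for attempt in range(max_attempts):
            node = nodes[attempt % len(nodes)]  # try the next node in turn
            if rng.random() < fail_prob:
                continue  # simulated node failure: just send the task again
            results[task_id] = (node, func(data))
            break
        else:
            raise RuntimeError(f"task {task_id} failed on every attempt")
    return results


work = {i: (sum, list(range(i + 1))) for i in range(5)}
done = run_with_resend(work, ["node0", "node1", "node2"], 0.3, random.Random(0))
print({task_id: result for task_id, (node, result) in done.items()})
```

Because a task carries all of its inputs, the scheduler never has to recover
partial state from the failed node; it only has to repeat the work.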

This is fiction today.  Hang on for another year and I might have something  :)

(there's a report available at http://ostenfeld.dtu.dk/~jakob/TONS-1/
 describing some of the work - but a lot changed since then.  You can also get
 the software from http://sslug.dk/TONS/, but really, unless you like parallel
 Fibonacci number calculations there's little you can do with this software
 today)

> 
> And what about the administration of a cluster. Having all the machine of
> the cluster under control, so you can know which are avaliable to send some
> work, is an hazarous task but necessary. Is not as easy as in a SMP machine
> where you know or assume that all the CPUs inside are working, in a cluster
> you can't do that as the CPU might work but the HD, NIC or memory may fail.
> How much computational time do you spend in this task? There's somebody
> working in a better way to manage with this?

There are various queue systems available that can take some of this into
account.   This is not really my area, so I'll leave that for someone else  :)

> 
> I know that sometime ago HP had a machine woth several faulty processors
> working and achiving high computational speeds without any error. They used
> some kind of  "control algorithm" that manages to use only the good CPUs. Do
> you have something like this or there is no point? Does it make sense?

Fault tolerance is important in clusters.  If you have 100 nodes (or like
Google, 4000) then some of them are going to have problems.  A previous
discussion here at the Beowulf list covered that subject pretty well I think.
Especially a good point was made that if the cost of fault tolerance is high,
you might as well achieve ``fault-tolerance'' by just restarting the program
that failed.

-- 
................................................................
: jakob at ostenfeld.dtu.dk  : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:


From david.lombard at mscsoftware.com  Thu Jun 22 08:15:50 2000
From: david.lombard at mscsoftware.com (David Lombard)
Date: Thu, 22 Jun 2000 08:15:50 -0700
Subject: Beowulf: A theorical approach
References: <200006201342.JAA22575@krang.parl.clemson.edu>
Message-ID: <39522DA6.7718CD6C@mscsoftware.com>

"Walter B. Ligon III" wrote:
> 
> --------
> > Hi,
> >
> > I'm doing my final year project and I'm writting about the Beowulf project
> > an the Beowulf clusters.
> > I've been reading several documents about the beowulf clusters, but I would
> > like to ask all of you some questions about them.
> >
> > As I've seen the main objective behind any Beowulf cluster is the
> > price/performance tag, specially when compared to supercomputers. But as
> > network hardware and commodity systems are becoming faster and faster
> > (getting closer to GHz and Gigabit speeds), could you think on competting
> > directly with supercomputers?
> 
> If it becomes possible to compete with a "supercomputer" in all applications
> using COTS HW that would be great!  I don't see that in the near future, but
> we may get there eventually.  Depends on how much Beowulf cuts into the
> development of "supercomputers."

It's already happened.  A current cluster composed of Intel,
Intel-compatible, or Alpha hardware can certainly provide absolute
performance advantages over current supercomputer offerings for some
problems.  I don't just mean useless demonstration problems, but
real-life industrial applications.  Remember, I said some, not all --
but the classes of suitable problems and programs are growing.

As for price-performance, well there's nothing much to say...

-- 
David N. Lombard
MSC.Software


From jcownie at etnus.com  Thu Jun 22 09:18:06 2000
From: jcownie at etnus.com (James Cownie)
Date: Thu, 22 Jun 2000 17:18:06 +0100
Subject: Beowulf: A theorical approach 
In-Reply-To: Your message of "Thu, 22 Jun 2000 09:43:53 EDT."
             <200006201342.JAA22575@krang.parl.clemson.edu> 
Message-ID: <043815218161660PCOW024M@blueyonder.co.uk>

>  The problem right now really isn't in link speeds (though better
> link speeds are good), its in how close/far the network interface is
> from the CPU.  COTS HW doesn't place a high value on direct access
> to IO devices - there is a higher value on a standardized bus
> interface to allow different system components to be integrated and
> updated independently.  A "supercomputer" can have the network
> engineered directly into the node architecture.  This is a huge
> advantage.  Luckily, this advantage has the most effect in only some
> programs.

If Infiniband does all that it is supposed to do, then it will rapidly
become the network of choice, since it _does_ have support for direct
(user-space) access to the comms, and has some nifty switches.

Of course in the short term it will be limited by the CPU side
interfaces being PCI, but that's only the same limitation as
for Quadrics, Myrinet, SCI and so on. 

Once it becomes the standard for connection to storage it _should_ be
cheap, and a standard component of any "server-class" commodity
machine, whether IA* or other architecture. (IBM announced that
they'll be selling their interface chips and switches just today).

So, I expect that Infiniband will be engineered intimately into the
node architecture of COTS hardware, and that will help a lot.

It'll be interesting to see how long it takes before the Linux drivers are
available !

-- Jim 

James Cownie	<jcownie at etnus.com>
Etnus, Inc.     +44 117 9071438
http://www.etnus.com


From rgb at phy.duke.edu  Thu Jun 22 09:20:19 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Thu, 22 Jun 2000 12:20:19 -0400 (EDT)
Subject: Beowulf: A theorical approach
In-Reply-To: <000b01bfdc1a$5f6e7c80$396d2fc2@lyan.vxu.se>
Message-ID: <Pine.LNX.4.10.10006221042060.12915-100000@ganesh.phy.duke.edu>

On Thu, 22 Jun 2000, Nacho Ruiz wrote:

> Hi,
> 
> I'm doing my final year project and I'm writting about the Beowulf project
> an the Beowulf clusters.
> I've been reading several documents about the beowulf clusters, but I would
> like to ask all of you some questions about them.
> 
> As I've seen the main objective behind any Beowulf cluster is the
> price/performance tag, specially when compared to supercomputers. But as
> network hardware and commodity systems are becoming faster and faster
> (getting closer to GHz and Gigabit speeds), could you think on competting
> directly with supercomputers?

They already do, in many arenas.  Greg Lindahl has given some wonderful
talks where he has shown alpha/myrinet beowulves that outperformed a
number of the big iron systems.  His company (and some other turnkey
beowulf companies) are winning big contracts because their systems
aren't just cheaper, they are also in many cases faster AND cheaper.

> As I see it the Beowulf cluster idea could be based in the distributed
> computign and the parallel computing: you put more CPUs to get more speedup,
> but as you can't have all the CPUs in the same machine you use several. So
> the Beowulf cluster could fit in between the distributed computing and the
> supercomputers (vetorial computers, parallel computers,..etc). You have
> advantages from both sides: parallel programming and high scalability; but
> you also have several drawbacks: mainly interconection problems. Do you
> think that with 10 Gb conections (OC-192 bandwith), SMP in chip (Power 4)
> and  massive primary and secondary memory devices at low cost, you could
> have a chance to beat most of the traditional supercomputers? or is not your
> "goal"?

You should skim over some of the talks and documents on
http://www.phy.duke.edu/brahma (near the top).  In particular, read the
sections that describe the scaling of parallel computer performance
(Amdahl's Law and various improved estimates thereof).  Then meditate
upon the fact that many of the big iron supercomputers have what amounts
to a beowulf architecture, even if they are SMP in that all the
processors reside in a single box (that is, they still use a de facto
"network" to communicate).  It might help to read Pfister's book "In
search of clusters", especially his discussion of (CC-)NUMA to see the
primary alternative, where shared memory is used to communicate between
processors/tasks.  It would be worthwhile for you to check out the
Trapeze project as well, which seeks to extend virtual memory over a
"beowulf-style" network architecture, exploiting the fact that a network
connection, however slow, is still three orders of magnitude faster than
disk access (so swapping or paging to EVEN an NFS-mounted RAM disk on a
second system might well be considerably faster than swapping or paging
to a real disk in the same cabinet -- emphasis because NFS is not a
particularly fast or efficient protocol).
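For reference, the basic form of Amdahl's Law is easy to write down.  The
sketch below covers only the textbook law, not the improved estimates that
also account for communication costs; the 90% parallel fraction is just an
example value.

```python
def amdahl_speedup(parallel_fraction, n_cpus):
    """Textbook Amdahl's Law: speedup on n_cpus when only a fraction of the
    work can be parallelized and the rest stays serial."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cpus)


# With 90% of the work parallelizable, the speedup can never exceed 10,
# no matter how many CPUs are added.
for n in (1, 2, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.9, n), 2))
```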

Finally, recognize that there is no "goal" shared by all the people on
this list other than getting our work done.  We are process oriented,
not goal oriented;-).  For nearly everybody on the list, beowulfs
ALREADY "beat" traditional supercomputers in one or more critical
dimensions of the highly multidimensional cost/benefit decision all
humans with work to do face when trying to decide how to go about doing
it.  If my goal is to surf the web, I don't buy a Cray, I buy a PC.  If
my goal is to complete as many independent Monte Carlo simulations as
possible per unit time per dollar, I ALSO don't buy a Cray, I buy a
network of PC's organized into a "beowulf".  If my goal is to solve CFD
problems, weather problems, cosmology problems (closer to the "grand
challenge" level) then maybe I buy a Cray (or whatever) or maybe I build
or buy a beowulf of advanced and competitive design (which is likely to
cost a lot more than a simple network of PC's, but a lot less than the
Cray). Even the required time to problem completion matters -- if I
>>must<< solve the problem as rapidly as possible whatever the cost I'll
probably pick a different system than I'd pick if my goal is to solve
the problem as rapidly as possible given a fixed budget of X.

Driven by a mix of the overwhelming cost-benefit advantage of
beowulf/cluster supercomputing for MANY problems and the fact that
playing with this in the open source world is damn good and interesting
and useful computer science (Bell prizes have been awarded for
contributions in beowulfery) beowulfs have evolved into the
supercomputer architecture of choice for thousands of sites, many of
them "tiny" by comparison with the big sites.  I have a five CPU beowulf
in my >>home<< office (might get it up to eight this year, with luck and
another $1.5K or so invested).  A joke by the standards of a T3x or SPx,
but even so, I get excellent performance on some problems I'm interested
in AND it gives me a very convenient laboratory for my amateurish forays
into computer science.  Lots of physics, or chemistry, or computer
science, or engineering departments in universities have built small
16-32 node beowulfs.  

This needs to be compared to the far more serious laboratories run by
Don Becker, Erik Hendriks, et al. at Scyld.com, Greg Lindahl's systems,
the systems run by e.g. Walter Ligon and Rob Ross at Clemson, the
thousand node system doing genetic code development (at Stanford?) and
more, where they focus on the underlying computer science and
development of advanced infrastructure and software support.  Between
the high road (interesting computer science that might enable
interesting problems to be tackled one day) and the low road (USEFUL
results from the interesting computer science being applied to solve
real world problems right now) beowulfery has a "mind of its own" and a
unique pattern of evolution.

Really, the way the beowulf "movement" has proceeded is a fascinating
concept and well worth writing about, but don't start off by ascribing
to it a fixed goal.  It is an amalgam, a hodgepodge, it is like damascus
steel with soft grains of iron intermixed with hard grains of high
carbon cementation producing something amazingly tough >>and<< flexible.
Its evolution pattern involves the open exchange of ideas, the hard nosed
acceptance or rejection of those ideas on the basis of the economic
benefit that derives from them, the alteration of bad ideas into good
ones and good ones into better ones as clever ideas are batted around
between smart people.

Ahhh, but I digress.  Or is it regress.  In a minute I'll be spouting
poetry, "Ode to the Beowulf...";-)

> And about the evolution of the Beowulf clusters, do you all follow a kind of
> guideness or the project have divided in several flavors and objectives?

As I said, there isn't really a "project".  There isn't even a
"consortium" or "IEEE standards committee".  There is only a website
(really a LOT of websites) and an informal, unmoderated list that anyone
can join (and leave again if it doesn't suit them).  Well, it is
moderated by common consensus and occasional flames -- anybody who gets
egregiously off-topic or out of line gets razzed or roasted or both.  Or
worse, just ignored.

There are lots of places (on the list and on websites) where successful
solutions are posted and documented, and more are being developed every
day.  People participate because they want to, because they have a
>>use<< for the idea.  The process is stabilized by those folks that
have participated for a long time and by now are invested in
beowulf-oriented development as part of their research, their business
plan, their professional focus, or just their hobby.  There are a number
of folks on the list who have been doing one sort of distributed
parallel computing or another for a LONG time (which in this business is
anything over five years:-) including some of the original inventors of
the term "beowulf".

The only two flavors that have emerged on the list that are worth
mentioning are the "true beowulf" flavor, which is by definition a
collection of COTS computers interconnected by a COTS (and usually
private) network with (usually) a single "head".  The idea is that a
"beowulf" is a supercomputer assembled out of COTS parts, and so it can
be given a single "name", usually the hostname of its head node, and the
internal nodes are viewed as "parts of the supercomputer" and are not
generally accessed or utilized as separate compute entities.

However, many (possibly even most, I don't know) of the folks on the
list are interested in or use more general "clusters" of computers --
things that have been dubbed NOWs, COWs, POPs -- or hybrids, where there
is a cluster that is architected much like a beowulf and used for
relatively fine grained synchronous tasks but it is part of and
node-accessible from a larger network/cluster that might be viewed as a
NOW/COW/POP or just a plain old LAN where coarse grain or embarrassingly
parallel tasks can be farmed out and harvested.

Both groups have similar problems to solve and share a common language
of IPC's, latencies, bandwidths, communication patterns, and so forth.
Tools developed for use in one context can often be used in the other.
I'd say that the computer scientists tend to focus on the "true beowulf"
side of things (as that is where the more interesting and tougher
problems are) and the stuff they develop filters down into the more
general cluster world as appropriate to a task.

These two groups generally coexist amicably on the list, provided that
one doesn't try to call a sloppy old compute cluster (or anything
running WinNT or Win2k) a "beowulf";-).

> Are the objectives of the beggining the same as today or now you plan to
> have something like a "super SMP computer" in a distributed way (with good
> communications times). I've seen that a lot of you are focusing in the GPID
> and whole machine idea, do you think that is reachable? What are the main
> objectives vs the MPI/PVM message passing idea?
> And what about shared memory (in the HD level or the RAM level), do you take
> advantage of having this amount  of resouces?

I could write a book to answer all these questions (the ones that in
fact can be answered).  Hopefully the discussion above indicates that
many of them don't have answers (or have obvious/silly answers -- of
course the objectives change with time as the evolution of hardware and
software opens up new possibilities).  There are groups doing work on
most of the things you mention.  Many of those groups are not directly
engaged with "beowulfery" and may not even have a member on the list,
but that doesn't stop the work they do from percolating in, as long as
it is "open".  "Open" is as much a part of the definition of the beowulf
as "COTS".

> Is this idea trying to reach the objective of making parallel programs
> "independent" to the programmer? I mean, that instead of having to program
> having in mind that you are using a parallel machine you can program in a
> "normal" way and the compiler will divide/distribute the code over the
> cluster. Is this reachable or just a dream? Is somebody working on this?

I personally don't even think that it is a dream at this point -- it is
a fantasy. The space of possible program parallelizations and
reorganizations is a >>very, very complex one<<.  Sure, one can write
parallel libraries that will autoparallelize some simple "atomic"
operations (like multiplying two vectors or doing a sort).  However, to
>>optimally<< parallelize (or even execute as a single threaded task)
even something this "simple", one's compiler/library simply has to know
all sorts of things about the parallel computer on which they are to be
run.  How fast are the IPC's?  Where are the L1 and L2 cache boundaries?
How fast is memory?  What's the cost of a context switch?  This problem
is difficult enough on an SMP system, where many of the answers are
homogeneous and can in principle be made known to a parallel compiler --
in a beowulf, where there are no guarantees of homogeneity or even a
common standard of design it is nearly "impossible".

Nevertheless, I wouldn't say that nobody is working on this.  The
problem is that they are working on the tools that will enable the tools
that will enable the tools to be built that MIGHT one day permit at
least the efficient and automatic parallelization of a small subset of
the standard operations one might wish to perform -- e.g. matrix
operations and sorts and the like.  Even then I'd guess that the
programmer will have to be aware that they are programming for parallel
operation as there are issues like data organization and separability
that I doubt any compiler can manage.

Consequently I think that the day will never come when one can take, for
example, the off-the shelf source code to, say, netscape or ls or sort,
and compile it with the -parallel flag and produce a binary that will
automatically parallelize itself at all, let alone efficiently.  The
compiler will also be utterly unable to make the most important decision
of all -- that it is STUPID to parallelize netscape or ls, STUPID to run
sort in parallel for small problems, but that it MIGHT be smart to
parallelize sort for problems bigger than some system- and
network-speed-dependent threshold!
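A crude cost model shows why such a threshold exists and why it depends on
the network.  Everything here is an assumption made for illustration: the
model (scatter from a head node, local sorts, gather, p-way merge), the 10 ns
comparison cost, the 100 us latency and the two bandwidths.  It is a sketch
of the decision, not a tuned predictor.

```python
import math


def serial_sort_time(n, t_cmp):
    """Estimated time for one node to sort n items (n log2 n comparisons)."""
    return t_cmp * n * math.log2(n) if n > 1 else 0.0


def parallel_sort_time(n, p, t_cmp, latency, bandwidth, item_bytes=8):
    """Estimated time to scatter n items to p nodes, sort locally, gather
    the sorted chunks and merge them on the head node."""
    chunk = n / p
    local = t_cmp * chunk * math.log2(chunk) if chunk > 1 else 0.0
    comm = 2 * (p * latency + n * item_bytes / bandwidth)  # scatter + gather
    merge = t_cmp * n * math.log2(p)                       # p-way merge
    return local + comm + merge


def worth_parallelizing(n, p, t_cmp, latency, bandwidth):
    """True when the model predicts the parallel sort finishes first."""
    return parallel_sort_time(n, p, t_cmp, latency, bandwidth) < serial_sort_time(n, t_cmp)


FAST_NET = 1.25e9  # bytes/s, roughly 10 Gbit/s (assumed)
SLOW_NET = 1.25e7  # bytes/s, roughly 100 Mbit/s (assumed)

for n in (1_000, 100_000_000):
    for name, bw in (("fast", FAST_NET), ("slow", SLOW_NET)):
        print(n, name, worth_parallelizing(n, 8, 1e-8, 1e-4, bw))
```

Under these assumptions a small sort never pays for its messages, and on the
slow network even a very large one does not, because shipping an item costs
more than the comparisons it saves.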

> And what about the administration of a cluster. Having all the machine of
> the cluster under control, so you can know which are avaliable to send some
> work, is an hazarous task but necessary. Is not as easy as in a SMP machine
> where you know or assume that all the CPUs inside are working, in a cluster
> you can't do that as the CPU might work but the HD, NIC or memory may fail.
> How much computational time do you spend in this task? There's somebody
> working in a better way to manage with this?
> 
> I know that sometime ago HP had a machine woth several faulty processors
> working and achiving high computational speeds without any error. They used
> some kind of  "control algorithm" that manages to use only the good CPUs. Do
> you have something like this or there is no point? Does it make sense?

There are plenty of people working on administrative tools, and a number
of tools already exist to solve many of the problems you mention
(although the tools may still need work).

Fault tolerance is a whole different question.  There are certainly
folks interested in the problem and there are very likely people working on
it, but fault tolerance is >>also<< a cost-benefit problem and it has
two general KINDS of solutions.  One is to engineer the tolerance into
the underlying systems architecture.  Dual power supplies.  RAID 5 run
by dedicated controllers.  ECC memory.  This is VERY difficult to extend
to beowulf architectures and isn't easy even on SMP systems -- it is
hard to design a computer system that cannot be brought down in its
entirety by >>any<< failure of its parts, especially when one of the
parts in question is a CPU.

The second approach is to engineer the tolerance into the software using
the hardware you've got.  In many cases this involves checkpointing the
code, one way or another (whether the checkpoint goes to memory, to
disk, to another system is almost irrelevant).  The point is that code
checkpointing takes quite a lot of time regardless of the medium
compared to the work that would be done in that time, and that this time
reduces the efficiency of the program.  This is really true even if the
redundancy is engineered in at the systems level, but there clever
engineering can sometimes hide the extra work or do IT in parallel.

One then has to examine the cost-benefit equation.  It costs you X
amount of time to checkpoint at some frequency.  During the intervals
you are at risk, with some probability.  If the probability of failure is
low, the intervals that lead to the smallest expected value for the time
to completion will be large, often "infinite" (you should just run the
damn program and not bother checkpointing, and risk having to run it
over again once every five thousand or so times, because checkpointing
it even once costs you one part in a thousand in performance).  In other
cases (for example, when you're going to run a program that takes a year
to complete on 1000 nodes at once) the interval may need to be
relatively short just to ensure that the problem EVER completes.
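One common first-order way to put numbers on this trade-off is Young's
approximation for the checkpoint interval, sqrt(2 * checkpoint cost * mean
time between failures).  It is brought in here as an illustration and is not
the calculation used above; the checkpoint cost and failure rates below are
made-up example values.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order estimate of the best time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def overhead_fraction(interval, checkpoint_cost, mtbf):
    """Approximate fraction of run time lost: time spent writing checkpoints
    plus expected rework (on average half an interval per failure)."""
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


def should_checkpoint(run_time, checkpoint_cost, mtbf):
    """If the best interval is longer than the whole run, just run the job
    and accept the small risk of having to start over."""
    return young_interval(checkpoint_cost, mtbf) < run_time


HOUR, DAY, YEAR = 3600.0, 86400.0, 365 * 86400.0

# One-hour job, 60 s per checkpoint, a failure every 30 days (assumed values).
print(should_checkpoint(HOUR, 60.0, 30 * DAY))

# Year-long job on many nodes, 60 s per checkpoint, a failure every day.
print(should_checkpoint(YEAR, 60.0, DAY))
```

The model ignores restart time and assumes failures are rare compared with
the checkpoint interval, so treat its output as a rough guide only.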

Confronted with this, for many folks on the list engineering
fault-tolerance into their programs is a total waste of time and a
cost-benefit loss.  One can write fault tolerant parallel software NOW
with existing tools, but it is a lot of work and will slow down your
job, possibly quite a bit.  Failover is usually more interesting to
folks who use clusters in different ways than are discussed on the list
-- for example corporate parallelized database servers or really big
(multisystem) webservers like yahoo.  In these cases, the "cost of
failure" (even one failure every umpty-ump days) may be so high as to be
"unacceptable", even compared to the relatively high costs of fault
tolerance.  These operations aren't really "beowulfs" although of course
they are "close" and their operators/designers would be welcome on the
list.

At a guess, this is the kind of problem that will -- eventually -- be at
least partly addressed by work being done at a number of places.  I
believe that there is at least one group working on certain core pieces
of software that will build beowulf support directly into the kernel,
where it can benefit from increases in speed and efficiency and where
one can BEGIN to think about issues like fault tolerance at a lower
level than program design.  This is the kind of thing the "true beowulf"
computer science groups think about.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From demeler at bioc09.v19.uthscsa.edu  Thu Jun 22 09:22:36 2000
From: demeler at bioc09.v19.uthscsa.edu (Borries Demeler)
Date: Thu, 22 Jun 2000 11:22:36 -0500 (CDT)
Subject: beowulf apps for bioinformatics
Message-ID: <200006221622.LAA22440@bioc09.v19.uthscsa.edu>

Hi everyone,

I have a poorly defined question, and I am hoping that people on this list
can help me put a little more focus into it: We have an opportunity to
get our hands on about $150,000 in our biochemistry dept. to be used for
"bioinformatics". The exact meaning of "bioinformatics" is poorly defined
by the sponsor, and we are essentially free to define what this overused
term exactly means for our use. We need to come up with a consensus on
how to spend the money for bioinformatics-related applications/computers,
whatever that means.

My idea was to find a list of applications that can be run on a beowulf 
system and may be useful to researchers in our dept. We have X-ray
crystallographers, molecular biologists, kineticists, geneticists and all
flavors of biochemistry represented in our dept. Once we have a good list
of available software that may address a subject dealt with by one or the
other faculty, we could come up with a viable proposal for a medium scale
beowulf system, geared towards the software packages most useful to us.

Can you share your ideas for a list of software packages that can be run
on a beowulf system that may address subjects of interest to the above
mentioned research directions?

Thanks for your help, -Borries
*******************************************************************************
* Borries Demeler                                                             *
* The University of Texas Health Science Center at San Antonio                *
* Dept. of Biochemistry, 7703 Floyd Curl Drive, San Antonio, Texas 78284-7760 *
* Voice: 210-567-6592, Fax: 210-567-4575, Email: demeler at biochem.uthscsa.edu  *
*******************************************************************************


From tony at MPI-Softtech.Com  Thu Jun 22 09:54:44 2000
From: tony at MPI-Softtech.Com (Tony Skjellum)
Date: Thu, 22 Jun 2000 11:54:44 -0500 (CDT)
Subject: Beowulf: A theorical approach 
In-Reply-To: <043815218161660PCOW024M@blueyonder.co.uk>
Message-ID: <Pine.GSO.4.10.10006221154250.7281-100000@mpi.mpi-softtech.com>

Rumor has it that Infiniband is only a 64-way maximum size
infrastructure...  perhaps that will change over time.

Anthony Skjellum, PhD, President (tony at mpi-softtech.com) 
MPI Software Technology, Inc., Ste. 33, 101 S. Lafayette, Starkville, MS 39759
+1-(662)320-4300 x15; FAX: +1-(662)320-4301; http://www.mpi-softtech.com
"Best-of-breed Software for Beowulf and Easy-to-Own Commercial Clusters."

On Thu, 22 Jun 2000, James Cownie wrote:

> 
> >  The problem right now really isn't in link speeds (though better
> > link speeds are good), its in how close/far the network interface is
> > from the CPU.  COTS HW doesn't place a high value on direct access
> > to IO devices - there is a higher value on a standardized bus
> > interface to allow different system components to be integrated and
> > updated independently.  A "supercomputer" can have the network
> > engineered directly into the node architecture.  This is a huge
> > advantage.  Luckily, this advantage has the most effect in only some
> > programs.
> 
> If Infiniband does all that it is supposed to do, then it will rapidly
> become the network of choice, since it _does_ have support for direct
> (user-space) access to the comms, and has some nifty switches.
> 
> Of course in the short term it will be limited by the CPU side
> interfaces being PCI, but that's only the same limitation as
> for Quadrics, Myrinet, SCI and so on. 
> 
> Once it becomes the standard for connection to storage it _should_ be
> cheap, and a standard component of any "server-class" commodity
> machine, whether IA* or other architecture. (IBM announced that
> they'll be selling their interface chips and switches just today).
> 
> So, I expect that Infiniband will be engineered intimately into the
> node architecture of COTS hardware, and that will help a lot.
> 
> It'll be interesting how long it takes before the Linux drivers are
> available !
> 
> -- Jim 
> 
> James Cownie	<jcownie at etnus.com>
> Etnus, Inc.     +44 117 9071438
> http://www.etnus.com
> 
> 
> _______________________________________________
> Beowulf mailing list
> Beowulf at beowulf.org
> http://www.beowulf.org/mailman/listinfo/beowulf
> 


From billm at troikanetworks.com  Thu Jun 22 10:14:19 2000
From: billm at troikanetworks.com (Bill Moshier)
Date: Thu, 22 Jun 2000 10:14:19 -0700
Subject: Infiband (was RE: Beowulf: A theorical approach)
Message-ID: <C7CA595F9B9FD311A40D009027DC4A856C9201@host03.troikanetworks.com>

Tony - by 64-way maximum size are you implying that Infiniband
has a 64-node limit?  I was under the impression that, at least
from the hardware point of view, it was similar to VI Architecture,
which is more-or-less unlimited in its interconnections.

Bill



From rgb at phy.duke.edu  Thu Jun 22 10:22:56 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Thu, 22 Jun 2000 13:22:56 -0400 (EDT)
Subject: Beowulf: A theorical approach 
In-Reply-To: <043815218161660PCOW024M@blueyonder.co.uk>
Message-ID: <Pine.LNX.4.10.10006221256230.13157-100000@ganesh.phy.duke.edu>

On Thu, 22 Jun 2000, James Cownie wrote:

> 
> >  The problem right now really isn't in link speeds (though better
> > link speeds are good), it's in how close/far the network interface is
> > from the CPU.  COTS HW doesn't place a high value on direct access
> > to IO devices - there is a higher value on a standardized bus
> > interface to allow different system components to be integrated and
> > updated independently.  A "supercomputer" can have the network
> > engineered directly into the node architecture.  This is a huge
> > advantage.  Luckily, this advantage has the most effect in only some
> > programs.
> 
> If Infiniband does all that it is supposed to do, then it will rapidly
> become the network of choice, since it _does_ have support for direct
> (user-space) access to the comms, and has some nifty switches.
> 
> Of course in the short term it will be limited by the CPU side
> interfaces being PCI, but that's only the same limitation as
> for Quadrics, Myrinet, SCI and so on. 

Just as a matter of curiosity -- Once upon a time some two or three
years ago I suggested on the list that a development company consider
building a network communications device that plugged into the second
CPU slot of a dual CPU board.  After all, one would guess that a
cleverly designed controller (which might even have a full CPU on it and
be in a sense a "network-specialized single board computer") would then
be able to access and be accessible to the entire CPU/memory subsystem
at full memory speeds and latencies (that is, sub-microsecond latencies
and 100-200 MBytes (not bits) per second at least onto the device
itself, where it would presumably bottleneck through the actual
communications channel and switch).

I would >>think<< that such a design, with just the right firmware on
the "network communications processor" plugged into the slot and a
kernel module or two, could be made to provide de facto CC-NUMA
pseudo-smp operation.  After all, even on a dual system cache coherence
is already addressed, all that is needed in addition is an algorithm for
extending that across the attached network.  I'd bet that one could
design such a device/system to run with existing dual (Intel) CPU MoBo
chipsets and -- provided that my repeated (and hence documented)
description of the idea on this list suffices to prevent somebody else
from being able to patent it -- even qualify as a COTS technology.

Dual CPU motherboards are cheap - a few tens of dollars more than a
single.  The chipsets and firmware are well documented.  CPUs can
presumably generate interrupts and the like.  I'd expect that the
engineering would be straightforward and a marketable new device (or
marketable variant of an existing device that is bottlenecked at the PCI
bus) could be developed quickly and sold relatively inexpensively.  If
the idea is successfully implemented in this way, it might even spawn a
specialized interface on motherboards derived from the existing second
CPU slot but even better suited toward commodity supercomputer assembly.

Is anybody even THINKING of doing this?  Yet?  Or is there something I'm
ignoring that makes this impossible or hideously expensive?

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From tony at MPI-Softtech.Com  Thu Jun 22 10:48:57 2000
From: tony at MPI-Softtech.Com (Tony Skjellum)
Date: Thu, 22 Jun 2000 12:48:57 -0500 (CDT)
Subject: Infiband (was RE: Beowulf: A theorical approach)
In-Reply-To: <C7CA595F9B9FD311A40D009027DC4A856C9201@host03.troikanetworks.com>
Message-ID: <Pine.GSO.4.10.10006221247330.9669-100000@mpi.mpi-softtech.com>

Well, this is scuttlebutt you hear at a lot of meetings and around
water coolers.  Let me not embarrass myself by saying anything other than
that I have heard it a bunch of times.   Not having paid the $10,000 to be
NDAd on Infiniband, I can't and won't say more.

We should invite an appropriate Intel or other leader in Infiniband to
provide a public briefing of what's real.

I will say that several people have mentioned that Infiniband is for
server area networks, not system area networks (i.e., clusters).

Tony

Anthony Skjellum, PhD, President (tony at mpi-softtech.com) 
MPI Software Technology, Inc., Ste. 33, 101 S. Lafayette, Starkville, MS 39759
+1-(662)320-4300 x15; FAX: +1-(662)320-4301; http://www.mpi-softtech.com
"Best-of-breed Software for Beowulf and Easy-to-Own Commercial Clusters."

On Thu, 22 Jun 2000, Bill Moshier wrote:

> Tony - by 64-way maximum size are you implying that infiniband
> has a 64-node limit?  I was under the impression that, from at
> least the hw point of view it was similar to VI Architecture,
> which is more-or-less unlimited in its interconnections.
> 
> Bill
> 


From glindahl at hpti.com  Thu Jun 22 11:10:32 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Thu, 22 Jun 2000 14:10:32 -0400
Subject: Beowulf: A theorical approach 
In-Reply-To: <Pine.LNX.4.10.10006221256230.13157-100000@ganesh.phy.duke.edu>
Message-ID: <000001bfdc75$27c0e2e0$e4844b89@hptilap.hpti.com>

> Just as a matter of curiosity -- Once upon a time some two or three
> years ago I suggested on the list that a development company consider
> building a network communications device that plugged into the second
> CPU slot of a dual CPU board.

It's hard to build something that plugs into an Intel-designed CPU bus.
Scali tried it and it didn't work so hot.

You could do it for the Alpha/AMD cpu bus, but it's a high speed design, and
the estimates I got (I was seriously considering this) were 12 months and $2
million.

In 12 months, you'll be able to get either PCI-X machines or Infiniband
machines. PCI-X has split transactions, and Chuck Seitz, who knows more
about networking than all of us combined, says that PCI-X will get rid of
most of the PCI latency.

So, sit back, pop open a cold one, and wait.

-- greg


From glindahl at hpti.com  Thu Jun 22 11:25:42 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Thu, 22 Jun 2000 14:25:42 -0400
Subject: Infiband (was RE: Beowulf: A theorical approach)
In-Reply-To: <Pine.GSO.4.10.10006221247330.9669-100000@mpi.mpi-softtech.com>
Message-ID: <000101bfdc77$45dd1940$e4844b89@hptilap.hpti.com>

> I will say that several people have mentioned that Infiniband is for
> server area networks, not system area networks (ie clusters).

This is true, if the rumors I've heard are true. And server networks are for
accessing storage, with few conversations and huge block sizes, not for tiny
messages to any of thousands of hosts in the entire machine. There will be
an annoying limit on the # of connections, and the standard only guarantees
1 outstanding non-connection message.

The one thing that Infiniband will do is provide a much better bus than PCI.
PCI-X is better in ways other than just large-transfer bandwidth, and
Infiniband is still better. I have yet to see any sign that "native"
Infiniband switches are going to be good. IBM's announcement of 8-way 6
megabit switches (perhaps I've got the details wrong, doesn't really matter)
available 12 months from now just doesn't excite me much.

So what will we do? We'll stick Myrinet cards (or Gigabit Ethernet or
Quadrics or whatever) on Infiniband just like we stick Myrinet on PCI today.

-- g


From billings at helix.nih.gov  Thu Jun 22 11:35:15 2000
From: billings at helix.nih.gov (Eric Billings)
Date: Thu, 22 Jun 2000 14:35:15 -0400
Subject: beowulf apps for bioinformatics
In-Reply-To: <200006221622.LAA22440@bioc09.v19.uthscsa.edu>
Message-ID: <Pine.SGI.4.21.0006221415000.85049-100000@helix.nih.gov>

On Thu, 22 Jun 2000, Borries Demeler wrote:

> My idea was to find a list of applications that can be run on a beowulf 
> system and may be useful to researchers in our dept. We have X-ray
> crystallographers, molecular biologists, kineticists, geneticists and all
> flavors of biochemistry represented in our dept. Once we have a good list
> of available software that may address a subject dealt with by one or the
> other faculty, we could come up with a viable proposal for a medium scale
> beowulf system, geared towards the software packages most useful to us.

You can find a terrific list of software applications at:

http://cmm.info.nih.gov/modeling/software.html

The page is maintained by Peter Steinbach of the Center for Molecular
Modeling.  Many of the applications are becoming available for Beowulf-class
machines and several of the free applications are already running under
Linux.  Of particular interest to the molecular biologists and geneticists
is the BLAST suite (http://www.ncbi.nlm.nih.gov/Tools/index.html).
We've been running the BLAST suite under Linux for a year or so.  It is
quite parallelizable and several of the algorithms fit into the
"embarrassingly parallel" category for essentially independent
searches.
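
A minimal sketch of what "essentially independent searches" looks like
in practice, assuming a toy substring match in place of a real BLAST
call.  The names toy_search and search_all and the two-entry database
are hypothetical, and the thread pool merely stands in for cluster
nodes:

```python
from concurrent.futures import ThreadPoolExecutor


def toy_search(query, database):
    """Hypothetical stand-in for one search: names of sequences containing query."""
    return [name for name, seq in database.items() if query in seq]


def search_all(queries, database, workers=4):
    """Run every query independently; no query needs another query's result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with queries.
        hits = list(pool.map(lambda q: toy_search(q, database), queries))
    return dict(zip(queries, hits))


if __name__ == "__main__":
    db = {"seqA": "ACGTACGT", "seqB": "TTTTGGGG"}
    print(search_all(["ACG", "GGG", "CCC"], db))
```

Because no query depends on another, the queries can be dealt out to
any number of nodes with no communication until the results are
collected.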

- Eric 


----------------------------------------------------------------------
Dr. Eric Billings                                Voice: (301) 496-6520
Laboratory of Biophysical Chemistry              FAX:   (301) 496-2172
National Institutes of Health            email: billings at helix.nih.gov
LoBoS: Lots of Boxes on Shelves              http://www.lobos.nih.gov/
----------------------------------------------------------------------


From glindahl at hpti.com  Thu Jun 22 11:38:02 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Thu, 22 Jun 2000 14:38:02 -0400
Subject: Beowulfs can compete with Supercomputers [was Beowulf: A theorical approach]
In-Reply-To: <395234B1.5B45622A@super.org>
Message-ID: <000201bfdc78$ff3df520$e4844b89@hptilap.hpti.com>

> 2) Customize a stock design and get someone else to build it for you. There
> are several small to mid-size companies which specialize in this. I've been
> meaning to update my list (perhaps some readers will help).  The list includes
> at least Altatech, Atlantec, Aspen, DCG, HPTi, Paralogic, TurboLinux, VALinux.

In the case of HPTi and some of the other companies on this list, this is
not quite right. HPTi provides appropriate solutions to solve your problem,
just like a traditional supercomputer vendor. We do not encourage our
customers to pick out which motherboard they like best; we prefer to take
relevant benchmarks and then pick the right CPU, interconnect, and so forth
to give you the biggest bang for your $. Our design space is general purpose
enough that it fits what most people want out of a supercomputer.

-- greg


From jakob at ostenfeld.dtu.dk  Thu Jun 22 11:59:19 2000
From: jakob at ostenfeld.dtu.dk (Jakob Østergaard)
Date: Thu, 22 Jun 2000 20:59:19 +0200
Subject: Beowulf: A theorical approach
In-Reply-To: <000001bfdc75$27c0e2e0$e4844b89@hptilap.hpti.com>
References: <Pine.LNX.4.10.10006221256230.13157-100000@ganesh.phy.duke.edu> <000001bfdc75$27c0e2e0$e4844b89@hptilap.hpti.com>
Message-ID: <20000622205918.D762@ostenfeld.dtu.dk>

On Thu, 22 Jun 2000, Greg Lindahl wrote:

> > Just as a matter of curiosity -- Once upon a time some two or three
> > years ago I suggested on the list that a development company consider
> > building a network communications device that plugged into the second
> > CPU slot of a dual CPU board.
> 
> It's hard to build something that plugs into an Intel-designed CPU bus.
> Scali tried it and it didn't work so hot.

At least on Intel hardware, SMP systems have cache coherency in the hardware,
and that's _expensive_ wrt. inter-cpu communication.  There's write snooping
and all sorts of the strangest things happening.  I cannot imagine how this
could work in any way over a network with any reasonable performance. Even
given infinite bandwidth, I guess the latency from ten meters of copper wiring
would kill performance.  (any sub-c interconnect would and tunneling is a few
years away I guess  ;)
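
A rough back-of-the-envelope check of the wire latency alone, assuming
a signal velocity of about two thirds of c in copper and a 500 MHz CPU
clock (both figures are illustrative assumptions, not numbers from this
thread):

```python
C_M_PER_S = 299_792_458.0  # speed of light in vacuum


def propagation_delay_ns(length_m, velocity_factor=0.66):
    """One-way signal delay over a cable, in nanoseconds.

    velocity_factor is the assumed fraction of c at which the signal travels.
    """
    return length_m / (velocity_factor * C_M_PER_S) * 1e9


def delay_in_cycles(delay_ns, clock_mhz=500.0):
    """Express a delay as CPU clock cycles at an assumed clock rate."""
    return delay_ns * clock_mhz / 1000.0


if __name__ == "__main__":
    one_way = propagation_delay_ns(10.0)
    print(f"one way: {one_way:.1f} ns, {delay_in_cycles(one_way):.0f} cycles")
    print(f"round trip: {2 * one_way:.1f} ns")
```

Under these assumptions ten meters costs roughly 50 ns each way, about
25 clock cycles, before any switch, serialization or protocol overhead
is counted.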

> 
> You could do it for the Alpha/AMD cpu bus, but it's a high speed design, and
> the estimates I got (I was seriously considering this) was 12 months and $2
> million.

I don't know the EV6 architecture.  But do you think cache coherency could
be done reasonably over a network among several CPUs ??

Cache coherency is evil as h*ll to implement. ccNUMA machines even rely on
kernel support, they can't build it in hardware alone  (at least that's how
it's done on the Origin series AFAIK).  Linux is getting NUMA support in the
VM,  do you think this could be used with the special hardware you mention to
actually build a ccNUMA system out of a cluster ?

> 
> In 12 months, you'll be able to get either PCI-X machines or Infiniband
> machines. PCI-X has split transactions, and Chuck Seitz, who knows more
> about networking than all of us combined, says that PCI-X will get rid of
> most of the PCI latency.
> 
> So, sit back, pop open a cold one, and wait.
> 

I'm skeptical, but I'll take your advice here   :)

-- 
................................................................
: jakob at ostenfeld.dtu.dk  : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:


From rgb at phy.duke.edu  Thu Jun 22 12:08:10 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Thu, 22 Jun 2000 15:08:10 -0400 (EDT)
Subject: Beowulf: A theorical approach
In-Reply-To: <3952527A.24089C62@bickleywest.com>
Message-ID: <Pine.LNX.4.10.10006221410200.13157-100000@ganesh.phy.duke.edu>

On Thu, 22 Jun 2000, Lyle Bickley wrote:

> Thanks Robert for all your comments, but especially those regarding
> fault tolerance.

You're more than welcome.

> Cost/benefit analysis is a very difficult issue.  How many Beowulf runs
> that take days to complete fail?  What is the cost?  I wish I had a
> better handle on this.  It's a LOT easier to understand the cost of the
> NY Stock Exchange going down for 20 minutes than a Beowulf failure after
> three days....

I'm hoping to tackle this in a chapter in the eternal book I'm working
on.  Part of the answer is objective, and that part can be explained.
In fact, it is mathematically described by e.g. game theory or insurance
company actuarial statistics -- one is selecting a strategy to optimize
some expected return (maximize benefit or minimize cost) based on your
best guess of certain probabilities and cost weights.  There are even
ways to create a feedback correction cycle and tune to a global optimum
based on observed rates of failures and observed costs instead of
guesses, if one gets very fancy and it matters.

The other part of the answer, as you note, is subjective.  What's the
"cost" of a beowulf failure after three days?  Probably very little, if
you are in the middle of a six month project and it doesn't happen
again.  On the other hand, if you have a publication deadline in two
days and needed just one more hour to complete the three day run that
would finish things off in time to write them up...

I worry about the same thing here during the academic year.  During the
bulk of the semester a server failure in the physics department is an
annoyance, but probably isn't "critical".  Every semester, though, there
is a ten day or so period where a server failure could literally be a
disaster -- when I'm writing my final exams (on the computer) and so is
everybody else, or evaluating my gradebooks (on the computer) and so is
everybody else.  If those go away right before I was going to print out
an exam or tally up the grades, the entire academic Universe comes to an
ugly end as no final exam can be given in their one and only final exam
slot, or their failing grade doesn't get in until after they've
graduated.  Heads roll.  Angry students storm your office carrying
torches.

So, we do what we can to guard against this -- keep good backups,
architect things so there is a replacement box that could be turned into
the primary server in a few hours.  This costs some money and time but
is worth it.  On the other hand, what if there is a fire?  Can't say
that our measures are adequate for that.  Insurance for that would
involve off-site storage, and in fact I tend to do just that and try to
keep my entire CVS tree sync'd between home and work so if a (small:-)
meteor landed on the physics building tonight (when I wasn't there) my
sources and writings and papers and so forth would survive.  Even this
wouldn't help if there was a hurricane like Fran -- electricity itself
went away for more than a week at my house, and my laptop won't run that
long and I can't afford an adequate solar recharger...;-)

Backup strategy (the underlying reasoning) is basically the same as
failover strategy -- you basically determine the amount of work you are
willing to lose given scenario X and work cost/value Y and take
preventative measures with that period.  You then cross your fingers
concerning scenario Z that you can't afford to deal with.  After all,
even tandems will go down if they are vaporized in a nuclear blast.
Unless perhaps they are failover protected at sites separated by (say)
several tens of miles and a lot of EMP protection.  Military scenarios
probably require failover protection at even this level, but most of the
rest of us don't.

A lot of people doing beowulfish calculations do failover protection of
sorts without even knowing consciously that that is what they are doing.
For example, why is it a "three day run"?  In most cases, one can pick a
(scientific) calculation size that will run in an hour, or a day, or a
week, or a year (and all would yield interesting results).  You pick a
size that you can afford and that finishes in a "reasonable" amount of
time.  Larger sizes wait for Moore's Law to catch up to them.  What's
reasonable?  A size that you're pretty sure will finish before a system
is likely to fail, which may be as low as the interval between area
thunderstorms in the summer (this was the case at my house before I
installed UPS on everything).

In many cases one can do better -- For example, it may be possible to
do a year's worth of calculation safely by breaking it into chunks
completed a day at a time, or a week at a time, without having to really
"checkpoint" the code.  In Monte Carlo, for example, one can just run a
large number of independent simulations and do stats to recombine the
results.  One even gains from doing this as the variance of the truly
independent runs is an absolutely reliable measure of error in the mean
(which isn't generally the case for the variance generated by importance
sampling a single Markov chain with internal autocorrelation times, but
I digress:-).
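
A minimal sketch of the recombination step, assuming a toy problem
(estimating pi by dart throwing) in place of a real simulation.  The
names one_run and combine, the seeds and the sample counts are all
hypothetical; each run could just as well be a separate job on a
separate node:

```python
import math
import random
import statistics


def one_run(seed, n_samples=10_000):
    """One independent simulation: estimate pi from random points in the unit square."""
    rng = random.Random(seed)  # a distinct seed per run keeps the runs independent
    hits = sum(
        1 for _ in range(n_samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * hits / n_samples


def combine(estimates):
    """Mean of the run results and the standard error of that mean."""
    mean = statistics.fmean(estimates)
    stderr = statistics.stdev(estimates) / math.sqrt(len(estimates))
    return mean, stderr


if __name__ == "__main__":
    results = [one_run(seed) for seed in range(20)]
    mean, stderr = combine(results)
    print(f"pi ~ {mean:.4f} +/- {stderr:.4f}")
```

If one run is lost to a crash, only that run is repeated; the other
results remain valid and the error estimate still comes straight from
the scatter between runs.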

I personally try to time things so that chunk completion times are on
the order of one day, because I'm always willing to lose a day's worth
of compute time as long as it doesn't happen too often.  Sometimes I've
gone as high as a week.  I basically NEVER do three week long runs if
there is any way to rearrange things so I don't have to -- systems don't
break, linux rarely fails, but somehow "something" (lightning, human
error, power fluctuations, somebody tripping over a cord) not
infrequently intervenes somewhere within the timeframe of months.  This
very coarse chunking of work is all the "failover checkpointing" that I
(or, I suspect most beowulf folks) do, and it works quite effectively,
although I'm sure that it isn't always possible to coarsely chunk like
this without writing a lot of nasty code to save a truly restartable
checkpoint state...
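
The trade-off between chunk length and lost work can be sketched under
some simplifying assumptions: failures arrive at random with a fixed
mean time between failures (a Poisson process), an interrupted chunk is
rerun from its beginning, and the restart itself costs nothing.  The
expected wall-clock time to finish a chunk of length T with mean time
between failures M is then M * (exp(T/M) - 1).  The 30-day figure below
is an illustrative assumption, not a measurement:

```python
import math


def expected_wallclock(chunk_days, mtbf_days):
    """Expected days to finish one chunk that restarts from scratch on failure."""
    return mtbf_days * math.expm1(chunk_days / mtbf_days)


def overhead(chunk_days, mtbf_days):
    """Fraction of extra wall-clock time lost to reruns, relative to the chunk length."""
    return expected_wallclock(chunk_days, mtbf_days) / chunk_days - 1.0


if __name__ == "__main__":
    for chunk in (1, 7, 21):
        print(f"{chunk:2d}-day chunks: {overhead(chunk, 30.0):6.1%} expected overhead")
```

With a 30-day mean time between failures, one-day chunks lose under 2%
to reruns in this model, while three-week chunks lose over 40%.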

> > At a guess, this is the kind of problem that will -- eventually -- be at
> > least partly addressed by work being done at a number of places.  I
> > believe that there is at least one group working on certain core pieces
> > of software that will build beowulf support directly into the kernel,
> > where it can benefit from increases in speed and efficiency and where
> > one can BEGIN to think about issues like fault tolerance at a lower
> > level than program design.  This is the kind of thing the "true beowulf"
> > computer science groups think about.
> > 
> 
> I have been considering the possibility of a single Tandem like system
> which is TRULY fault tolerant, bringing "true" fault tolerance to an
> entire Beowulf cluster via heartbeats, progress monitoring, process
> checkpoints, etc.
> 
> But who would buy such a critter??  Is there really a need??  What
> percentage of the total cost of a Beowulf would be a reasonable cost for
> such a beast??

Well, Tandem systems do sell, of course, so there is a market for this
kind of fault tolerance.  The military might even need it on a small
scale -- a tank might be made more robust if its battle computer was
really a fault tolerant beowulf networked to four or five hard sites
within the tank.  A non-fatal hit might take out one or two nodes, but
not the whole thing.  Ditto the space program (plagued with failures
already and with a very high cost of failure).  Financial markets and 
webservice markets both have a high cost of failure.  Something like an
EMS computer system supporting a 911 center cannot afford to go down in
any dimension, even during a natural or unnatural disaster.

In many of these cases, the people buying the fault tolerance have DEEP
pockets and the cost of failure is VERY high.  However, their needs are
also very, very specific, so one has to basically simultaneously
engineer the system and the software to match.  The one thing bringing
this sort of fault tolerance to beowulfery (at the systems level, with
open source components and COTS hardware) would do is significantly
lower the cost of the dedicated/custom software development.  I think
that is the goal of some of the folks working on the problem.

A very interesting subject, I agree.  Go for it.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From billm at troikanetworks.com  Thu Jun 22 12:09:12 2000
From: billm at troikanetworks.com (Bill Moshier)
Date: Thu, 22 Jun 2000 12:09:12 -0700
Subject: Infiniband (was RE: Beowulf: A theorical approach)
Message-ID: <C7CA595F9B9FD311A40D009027DC4A856C9203@host03.troikanetworks.com>

Greg, Tony - 

Looking at the infiniband info 
(at http://www.infinibandta.org/data/press/illuminata.pdf)
it is intended to be a 'network approach to IO', and is
limited to more-or-less local access (<1000 meters).  As
such it is oriented to clustering of systems - not just
server area networks or storage area networks.  One of the
goals, apparently, is to have VI Architecture implementation
on top of Infiniband - which is a fast, efficient cluster
messaging mechanism.

Bill




From rgb at phy.duke.edu  Thu Jun 22 12:10:04 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Thu, 22 Jun 2000 15:10:04 -0400 (EDT)
Subject: Beowulf: A theorical approach 
In-Reply-To: <000001bfdc75$27c0e2e0$e4844b89@hptilap.hpti.com>
Message-ID: <Pine.LNX.4.10.10006221509270.13157-100000@ganesh.phy.duke.edu>

On Thu, 22 Jun 2000, Greg Lindahl wrote:

> So, sit back, pop open a cold one, and wait.

The best advice I've had in a long, long time.  How could I not accept
it?

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From glindahl at hpti.com  Thu Jun 22 12:29:55 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Thu, 22 Jun 2000 15:29:55 -0400
Subject: Beowulf: A theorical approach
In-Reply-To: <20000622205918.D762@ostenfeld.dtu.dk>
Message-ID: <000701bfdc80$3e608220$e4844b89@hptilap.hpti.com>

> > It's hard to build something that plugs into an Intel-designed CPU bus.
> > Scali tried it and it didn't work so hot.
>
> At least on Intel hardware, SMP systems have cache coherency in
> the hardware,

Scali's design (and the one I was considering) do not do cache coherency.
They are for message passing systems. As you may gather, I'm a message
passing bigot. Er, evangelist.

Doing cache coherency would be like the SGI O2000, and that would be far
more expensive than what I was talking about. There are approaches that
attack that problem -- see the Isotach research group's work. They are at
UVa, which is why I am aware of them. But not my cup of tea.

-- greg


From covenant at dirac.org  Thu Jun 22 12:33:47 2000
From: covenant at dirac.org (Peter Jay Salzman)
Date: Thu, 22 Jun 2000 12:33:47 -0700 (PDT)
Subject: rdist
Message-ID: <Pine.LNX.4.10.10006221227220.21132-100000@dirac.org>

hello all,

i'd like to transfer the following files to all the nodes (running redhat):

ssh-server-1.2.27-1.i386.rpm
ssh-1.2.27-1.i386.rpm
ssh2-2.0.13-1i.i386.rpm

each node has the rdist rpm installed, so i'd like to use that.  when i try
to run rdistd, it pauses for a second and returns (which is what i'd expect
a daemon to do).  unfortunately, when i do a ps ax, rdistd does not show up
in the list of processes.

does rdistd have to run on the remote host for rdist to work correctly?
can someone give me an example distfile that would place all these
	files on the nodes?
is there some way to use rsh to cat a localfile (the rpm's i want to
	transfer) to a remote file on the nodes?

thanks!
pete


From jakob at ostenfeld.dtu.dk  Thu Jun 22 12:39:13 2000
From: jakob at ostenfeld.dtu.dk (Jakob Østergaard)
Date: Thu, 22 Jun 2000 21:39:13 +0200
Subject: Beowulf: A theorical approach
In-Reply-To: <000701bfdc80$3e608220$e4844b89@hptilap.hpti.com>
References: <20000622205918.D762@ostenfeld.dtu.dk> <000701bfdc80$3e608220$e4844b89@hptilap.hpti.com>
Message-ID: <20000622213913.E762@ostenfeld.dtu.dk>

On Thu, 22 Jun 2000, Greg Lindahl wrote:

> > > It's hard to build something that plugs into an Intel-designed CPU bus.
> > > Scali tried it and it didn't work so hot.
> >
> > At least on Intel hardware, SMP systems have cache coherency in
> > the hardware,
> 
> Scali's design (and the one I was considering) do not do cache coherency.
> They are for message passing systems. As you may gather, I'm a message
> passing bigot. Er, evangelist.

  :)    It makes sense then.

> 
> Doing cache coherency would be like the SGI O2000, and that would be far
> more expensive than what I was talking about. There are approaches that
> attack that problem -- see the Isotach research group's work. They are at
> UVa, which is why I am aware of them. But not my cup of tea.

I wouldn't hold my breath for something like that either.

-- 
................................................................
: jakob at ostenfeld.dtu.dk  : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:


From rgb at phy.duke.edu  Thu Jun 22 13:12:27 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Thu, 22 Jun 2000 16:12:27 -0400 (EDT)
Subject: Beowulf: A theorical approach
In-Reply-To: <20000622205918.D762@ostenfeld.dtu.dk>
Message-ID: <Pine.LNX.4.10.10006221549360.13157-100000@ganesh.phy.duke.edu>

On Thu, 22 Jun 2000, Jakob Østergaard wrote:

> On Thu, 22 Jun 2000, Greg Lindahl wrote:
> 
> > > Just as a matter of curiosity -- Once upon a time some two or three
> > > years ago I suggested on the list that a development company consider
> > > building a network communications device that plugged into the second
> > > CPU slot of a dual CPU board.
> > 
> > It's hard to build something that plugs into an Intel-designed CPU bus.
> > Scali tried it and it didn't work so hot.
> 
> At least on Intel hardware, SMP systems have cache coherency in the hardware,
> and that's _expensive_ wrt. inter-cpu communication.  There's write snooping
> and all sorts of the strangest things happening.  I cannot imagine how this
> could work in any way over a network with any reasonable performance. Even
> given infinite bandwidth, I guess the latency from ten meters of copper wiring
> would kill performance.  (any sub-c interconnect would and tunneling is a few
> years away I guess  ;)

Not arguing with the difficulty of doing the local engineering, but from
whence the remarks concerning copper and latency?  It could be an
optical interconnect, or a three meter copper interconnect for all I
care.  Even ten meters is only 33 nanoseconds at c which is a WHOLE lot
smaller than the other sources of latency in a normal network
interconnect. I thought signals propagated on copper at an appreciable
fraction of c... am I missing something here?
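
As a rough check of the 33 nanosecond figure, here is a minimal sketch of
the arithmetic.  The 0.66 velocity factor for copper is an assumed,
illustrative value (real cables vary); only the 10 meter length and the
"at c" case come from the discussion above.

```python
# Back-of-the-envelope one-way propagation delay over a cable.
C = 299_792_458.0  # speed of light in vacuum, m/s


def propagation_delay_ns(length_m, velocity_factor=1.0):
    """One-way delay in nanoseconds for a signal travelling length_m
    metres at velocity_factor * c."""
    return length_m / (velocity_factor * C) * 1e9


if __name__ == "__main__":
    # 1.0 is the "at c" case; 0.66 is an assumed figure for copper.
    for vf in (1.0, 0.66):
        print(f"10 m at {vf:.2f}c: {propagation_delay_ns(10, vf):.1f} ns")
```

Either way the wire itself contributes tens of nanoseconds, which is
small next to the microsecond-scale latencies of an ordinary network
interface.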

Never mind.  I'll just go get the beer Greg suggested instead;-)

> > In 12 months, you'll be able to get either PCI-X machines or Infiniband
> > machines. PCI-X has split transactions, and Chuck Seitz, who knows more
> > about networking than all of us combined, says that PCI-X will get rid of
> > most of the PCI latency.
> > 
> > So, sit back, pop open a cold one, and wait.
> > 
> 
> I'm skeptical, but I'll take your advice here   :)

Yeah!

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From covenant at dirac.org  Thu Jun 22 13:20:13 2000
From: covenant at dirac.org (Peter Jay Salzman)
Date: Thu, 22 Jun 2000 13:20:13 -0700 (PDT)
Subject: pbs quickstart?
Message-ID: <Pine.LNX.4.10.10006221317330.21335-100000@dirac.org>

does anyone know of a pbs quickstart file?

any good pbs resources?  the admin guide is leaving a lot of questions
unanswered and has too much info that doesn't pertain to our setup; it's
making the reading a bit slow.  :(

help?

pete


From walt at parl.ces.clemson.edu  Thu Jun 22 13:51:52 2000
From: walt at parl.ces.clemson.edu (Walter B. Ligon III)
Date: Thu, 22 Jun 2000 16:51:52 -0400
Subject: Beowulf: A theorical approach 
In-Reply-To: Your message of "Thu, 22 Jun 2000 17:18:06 BST."
             <043815218161660PCOW024M@blueyonder.co.uk> 
Message-ID: <200006202049.QAA24367@krang.parl.clemson.edu>

--------

Well, yeah, but it's the PCI interface I'm talking about.  Robert Brown's
posting was really more to the point.  Build a NIC that interfaces
directly to the CPU and memory.

Walt

> 
> >  The problem right now really isn't in link speeds (though better
> > link speeds are good), it's in how close/far the network interface is
> > from the CPU.  COTS HW doesn't place a high value on direct access
> > to IO devices - there is a higher value on a standardized bus
> > interface to allow different system components to be integrated and
> > updated independently.  A "supercomputer" can have the network
> > engineered directly into the node architecture.  This is a huge
> > advantage.  Luckily, this advantage has the most effect in only some
> > programs.
> 
> If Infiniband does all that it is supposed to do, then it will rapidly
> become the network of choice, since it _does_ have support for direct
> (user-space) access to the comms, and has some nifty switches.
> 
> Of course in the short term it will be limited by the CPU side
> interfaces being PCI, but that's only the same limitation as
> for Quadrics, Myrinet, SCI and so on. 
> 
> Once it becomes the standard for connection to storage it _should_ be
> cheap, and a standard component of any "server-class" commodity
> machine, whether IA* or other architecture. (IBM announced that
> they'll be selling their interface chips and switches just today).
> 
> So, I expect that Infiniband will be engineered intimately into the
> node architecture of COTS hardware, and that will help a lot.
> 
> It'll be interesting to see how long it takes before the Linux drivers
> are available!
> 
> -- Jim 
> 
> James Cownie	<jcownie at etnus.com>
> Etnus, Inc.     +44 117 9071438
> http://www.etnus.com
> 

-- 
Dr. Walter B. Ligon III
Associate Professor
ECE Department
Clemson University


From jakob at ostenfeld.dtu.dk  Thu Jun 22 14:09:31 2000
From: jakob at ostenfeld.dtu.dk (Jakob Østergaard)
Date: Thu, 22 Jun 2000 23:09:31 +0200
Subject: Beowulf: A theorical approach
In-Reply-To: <Pine.LNX.4.10.10006221549360.13157-100000@ganesh.phy.duke.edu>
References: <20000622205918.D762@ostenfeld.dtu.dk> <Pine.LNX.4.10.10006221549360.13157-100000@ganesh.phy.duke.edu>
Message-ID: <20000622230931.F762@ostenfeld.dtu.dk>

On Thu, 22 Jun 2000, Robert G. Brown wrote:

> On Thu, 22 Jun 2000, Jakob Østergaard wrote:
> 
> > On Thu, 22 Jun 2000, Greg Lindahl wrote:
> > 
> > > > Just as a matter of curiosity -- Once upon a time some two or three
> > > > years ago I suggested on the list that a development company consider
> > > > building a network communications device that plugged into the second
> > > > CPU slot of a dual CPU board.
> > > 
> > > It's hard to build something that plugs into an Intel-designed CPU bus.
> > > Scali tried it and it didn't work so hot.
> > 
> > At least on Intel hardware, SMP systems have cache coherency in the hardware,
> > and that's _expensive_ wrt. inter-cpu communication.  There's write snooping
> > and all sorts of the strangest things happening.  I cannot imagine how this
> > could work in any way over a network with any reasonable performance. Even
> > given infinite bandwidth, I guess the latency from ten meters of copper wiring
> > would kill performance.  (any sub-c interconnect would and tunneling is a few
> > years away I guess  ;)
> 
> Not arguing with the difficulty of doing the local engineering, but from
> whence the remarks concerning copper and latency?  It could be an
> optical interconnect, or a three meter copper interconnect for all I
> care.  Even ten meters is only 33 nanoseconds at c which is a WHOLE lot
> smaller than the other sources of latency in a normal network
> interconnect. I thought signals propagated on copper at an appreciable
> fraction of c... am I missing something here?

I don't know exactly how often the CPUs go ``what are you doing ?  nothing ?
oh, keep up the good work!'' in an SMP configuration to maintain cache
coherency.  But with 33ns of latency each way (66ns per round trip) you can
only do that 15 million times a second.  While that might seem like a lot,
it's not if the CPUs need to do that
every time a new cache line is accessed.   I don't know if that's exactly the
case, and I'd appreciate if anyone in the know could comment on that.
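
A minimal sketch of that estimate, assuming each coherency exchange is a
full request/reply round trip, so that 33 ns each way costs 66 ns per
exchange.  Whether real SMP hardware needs one such exchange per cache
line is exactly the open question above; the function name is made up for
illustration.

```python
def max_exchanges_per_second(one_way_latency_ns, round_trip=True):
    """Upper bound on strictly serialized exchanges per second when each
    one has to wait out the wire latency (no pipelining assumed)."""
    per_exchange_ns = one_way_latency_ns * (2 if round_trip else 1)
    return 1e9 / per_exchange_ns


if __name__ == "__main__":
    rt = max_exchanges_per_second(33)
    ow = max_exchanges_per_second(33, round_trip=False)
    print(f"round trip: {rt / 1e6:.1f} million/s")
    print(f"one way:    {ow / 1e6:.1f} million/s")
```

The round-trip case gives about 15 million exchanges per second, and the
one-way case about 30 million.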

> Never mind.  I'll just go get the beer Greg suggested instead;-)

Right on !   In any case, this little discussion is sort of moot as the CPU
interconnect was for message passing and not cache coherency anyway.       :)

-- 
................................................................
: jakob at ostenfeld.dtu.dk  : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:


From dsg at super.org  Thu Jun 22 15:17:50 2000
From: dsg at super.org (David S. Greenberg)
Date: Thu, 22 Jun 2000 22:17:50 +0000
Subject: Beowulfs can compete with Supercomputers [was Beowulf: A theorical 
 approach]
References: <000201bfdc78$ff3df520$e4844b89@hptilap.hpti.com>
Message-ID: <3952908E.D203004C@super.org>

Thanks Greg, you made my description much clearer.
My point was that companies like HPTi will help you build a machine with the
attributes appropriate for your problem space as opposed to the traditional
supercomputer vendor who would help you change your applications to be
appropriate to their machine.
Who exactly does what portion of the design was not really my point but
rereading the post makes it seem that way.  The spectrum goes from do
everything yourself (pick components, do wiring, write drivers, load OS, etc.)
to get help customizing a system (tell someone about your apps and what you
know about how they work with various hardware and let them build you an
appropriate machine) to buy an off-the-shelf machine.   All points on this
spectrum can be met today with cluster technology.
David

Greg Lindahl wrote:

> > 2) Customize a stock design and get someone else to build it for you. There
> > are several small to mid-size companies which specialize in this. I've been
> > meaning to update my list (perhaps some readers will help).  The list includes
> > at least Altatech, Atlantec, Aspen, DCG, HPTi, Paralogic, TurboLinux, VALinux.
>
> In the case of HPTi and some of the other companies on this list, this is
> not quite right. HPTi provides appropriate solutions to solve your problem,
> just like a traditional supercomputer vendor. We do not encourage our
> customers to pick out which motherboard they like best; we prefer to take
> relevant benchmarks and then pick the right CPU, interconnect, and so forth
> to give you the biggest bang for your $. Our design space is general purpose
> enough that it fits what most people want out of a supercomputer.
>
> -- greg


From thakur at mcs.anl.gov  Thu Jun 22 14:54:46 2000
From: thakur at mcs.anl.gov (Rajeev Thakur)
Date: Thu, 22 Jun 2000 16:54:46 -0500
Subject: CFP: Parallel I/O on Clusters Workshop, Brisbane, Australia
Message-ID: <200006222154.QAA23010@abacus.mcs.anl.gov>

			   CALL FOR PAPERS

     The 2001 International Workshop on Parallel I/O on Clusters
			      (PIC 2001)

		http://www.mcs.anl.gov/~thakur/pic2001

			   Organized at the
 IEEE Int'l Symposium on Cluster Computing and the Grid (CCGrid'2001)

			 Brisbane, Australia
			   May 16--18, 2001

     In cooperation with the IEEE Task Force on Cluster Computing


With powerful microprocessors and high-performance networks being
available as commodity components, clusters of computers are
increasingly being used as a low-cost alternative to traditional
parallel machines. While clusters may match parallel machines in
computation and communication capabilities, support for
high-performance parallel I/O on clusters still lags behind in all
respects -- performance, usability, reliability. Significant research
and development challenges remain in improving the parallel I/O
capabilities of clusters.

PIC 2001 solicits research papers that focus on any aspect of
storage-related parallel I/O specifically on clusters. Relevant topics
include:

Parallel file systems
Runtime libraries
Language and compiler support
Storage architecture
Network-attached storage
I/O characterization
I/O-intensive applications
Parallel I/O for databases
Reliability and availability of storage

Papers submitted to PIC 2001 must be unpublished and must not be
submitted for publication elsewhere.  The manuscript must be written
in English, at most ten pages long (including figures and tables, but
excluding references), single- or double-spaced, using an 11-point
font. Electronic submission in PostScript or PDF format is required;
papers should be submitted by email to thakur at mcs.anl.gov. All papers
will be reviewed.  The deadline for submissions is November 4,
2000. Decisions will be announced by December 20, 2000. Accepted
papers will appear in the proceedings of CCGrid'2001, to be published
by IEEE Computer Society Press.


Workshop Chair
==============

Rajeev Thakur
Mathematics and Computer Science Division
Argonne National Laboratory
9700 South Cass Avenue
Argonne, IL 60439, USA
(630) 252-1682
thakur at mcs.anl.gov

Vice Chair
==========

Rob Ross
105 Riggs Hall, Box 340915 
Clemson University
Clemson, SC 29634-0915, USA
(864) 656-7223 
rbross at parl.clemson.edu

Program Committee
=================

Peter Braam, TurboLinux and Carnegie Mellon University
Peter Brezany, University of Vienna
Walt Ligon, Clemson University
Tara Madhyastha, University of California, Santa Cruz
Rob Ross, Clemson University
Liddy Shriver, Bell Laboratories
Rajeev Thakur, Argonne National Laboratory

Important Dates
---------------

Papers due: November 4, 2000 
Notification of Acceptance: December 20, 2000 
Camera Ready Papers due: January 24, 2001 
CCGrid'2001 Symposium: May 16-18, 2001


From timothy.g.mattson at intel.com  Thu Jun 22 15:05:47 2000
From: timothy.g.mattson at intel.com (Mattson, Timothy G)
Date: Thu, 22 Jun 2000 15:05:47 -0700
Subject: Infiniband (was RE: Beowulf: A theorical approach)
Message-ID: <F70F37F77E9FD211AC3F00A0C96B78DA03F8C314@orsmsx47.jf.intel.com>

Bill, Greg and Tony,

This discussion has put me in a difficult position.  

I work for Intel and know quite a bit about what's going on with Infiniband.
However, I am not a member of any of the groups working on Infiniband, and
therefore I don't know what the official word is concerning the issues
you've raised.  

I know I can say this much.  It is our intent to have Infiniband products
that are appropriate for the cluster market.  As for how scalable these
clusters will be, I don't know the answer.  If I did know, I couldn't say
right now -- it's just too long before the launch of our Infiniband products.
I hope you all understand.

As the cluster-specific details emerge and go public, I will try to stay on
top of them and pass them on to this group. I'm sorry I can't say more, but
it's really too early to get more detailed.

--Tim Mattson


From rgb at phy.duke.edu  Thu Jun 22 15:30:29 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Thu, 22 Jun 2000 18:30:29 -0400 (EDT)
Subject: Beowulf: A theorical approach 
In-Reply-To: <200006202049.QAA24367@krang.parl.clemson.edu>
Message-ID: <Pine.LNX.4.10.10006221718180.13495-100000@ganesh.phy.duke.edu>

On Thu, 22 Jun 2000, Walter B. Ligon III wrote:

> --------
> 
> Well, yeah, but it's the PCI interface I'm talking about.  Robert Brown's
> posting was really more to the point.  Build a NIC that interfaces
> directly to the CPU and memory.

Sure, and memory is indeed another way to do it.  Build a small
communications computer that "fits" into a memory chip slot.  I'd guess
that one could make the actual interface a real (but small and very fast
-- SRAM?) memory chip that was on TWO memory buses -- the one in the
computer in question and on the "computer" built into the interface
whose only function is to manage communications and which would be
strictly responsible for avoiding timing collisions -- possibly with a
harness that allows it to generate interrupts to help even more. (Can a
memory chip per se generate trappable interrupts now? Don't know.) Then
accompany it with a kernel module that maps those memory addresses into
a dedicated interface space and manages the interrupts, so the CPU only
tries to write the memory when it is writable and read when it is
readable.
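
To make the ownership rule concrete (the CPU writes only when the buffer
is writable, the comm side reads only when there is something to read),
here is a toy software model.  It is purely a hypothetical illustration:
the class name, the single-slot design and the boolean flag standing in
for the interrupt are all assumptions, not a description of any real
hardware or kernel module.

```python
class DualPortMailbox:
    """Toy model of a small buffer visible to both the host CPU and a
    communications processor.  Hypothetical; single slot for simplicity."""

    def __init__(self, size):
        self.size = size
        self.buf = bytearray(size)
        self.length = 0
        self.writable = True  # True: host owns the slot; False: comm side owns it

    def host_write(self, data):
        """Host side: write only when the slot is writable."""
        if not self.writable:
            return False  # host must wait until the slot is handed back
        if len(data) > self.size:
            raise ValueError("message larger than mailbox")
        self.buf[:len(data)] = data
        self.length = len(data)
        self.writable = False  # hand ownership to the comm processor
        return True

    def comm_drain(self):
        """Comm-processor side: take the pending message and free the slot."""
        if self.writable:
            return None  # nothing pending
        data = bytes(self.buf[:self.length])
        self.writable = True  # stands in for the interrupt back to the host
        return data


if __name__ == "__main__":
    box = DualPortMailbox(64)
    print(box.host_write(b"hello"))  # accepted
    print(box.host_write(b"again"))  # refused: comm side still owns the slot
    print(box.comm_drain())          # message handed to the comm processor
    print(box.host_write(b"again"))  # accepted now
```

A real design would need proper arbitration of the shared memory; the
flag here only shows who is allowed to touch the buffer at each step.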

Control the interface with e.g. headers/trailers on the writes -- write
a block of data to it at memory speed (possibly even with DMA transfers
and the CPU doing something else).  Then write an address into a byte
block that initiates the transfer.  Reads require an interrupt -- the
data comes in and is buffered in a generous buffer (perhaps the
post-prom leftovers of an onboard 64 MB or 128 MB SDRAM chip -- build
the thing on top of the DIMM it replaces) and at the appropriate time an
interrupt is generated to tell the kernel/system to execute a read DMA
from the memory buffer into "real" memory.  I think this is pretty much
the way memory mapped, DMA capable network interfaces operate now,
except that they are bottlenecked at a Gbps (PCI @ 32(bits)*33(MHz)) and
generally have a much higher latency.
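
The "a Gbps" figure is just bus width times clock rate.  A minimal sketch
of that arithmetic follows; it gives a theoretical peak only, and
sustained PCI throughput is lower once arbitration and protocol overhead
are counted.  The function name is made up for illustration.

```python
def bus_peak_bandwidth(width_bits, clock_hz):
    """Theoretical peak of a parallel bus moving width_bits per clock.
    Returns (Gbit/s, MB/s), using decimal units."""
    bits_per_s = width_bits * clock_hz
    return bits_per_s / 1e9, bits_per_s / 8 / 1e6


if __name__ == "__main__":
    gbps, mbytes = bus_peak_bandwidth(32, 33e6)  # 32-bit, 33 MHz PCI
    print(f"{gbps:.3f} Gbit/s, {mbytes:.0f} MB/s")
```

For 32 bits at 33 MHz this works out to about 1.06 Gbit/s, or 132 MB/s.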

This is just the same idea in sheep's clothing, only this way instead of
pulling the transferred data out of the pseudo-CPU's "cache" SRAM the
second CPU lives in an entirely different space not directly accessible
from the main CPU, where you can do with it whatever you wish.  This way
there is a single chip of "shared" memory (which is actually probably an
SRAM cache buffer living in both spaces that can convince the main
mobo's CPU that it is an SDRAM DIMM) and an attached communications
co-computer that does nothing but move stuff into and out of the SRAM
and into its own copious comm buffer (128 MB is actually probably gaudy,
but it gets the point across -- this sucker could exchange BIG blocks of
data or lots of little blocks in parallel with the main CPU because it
operates completely independently) and manage the transfers.  With <10
ns SRAM, there is probably time for it to be loaded and kept full (or
emptied) when attached to a relatively slow (presumed SDRAM) interface.
Obviously the interface would ignore or simulate memory refresh and all
that.

Even with a hellacious latency (which there is no reason to expect a
priori) one could imagine a parallel fiber connection that gives you a
VERY high bandwidth to other nodes.  For example, 32 optical fibers and
switches (operating and controlled in parallel, each responsible for one
bit) could transfer data quite rapidly.  There are plenty of problems
that would be accessible with very high internode bandwidth even if one
DID have to pay a hundred microsecond hit in latency.  For one thing, it
would be refreshing to have a network that could transfer data faster
than main memory accesses.  One could think about distributed shared
memory paradigms (e.g. NUMA) with or without the underlying CC which
could be done in software if necessary.  For lots of problems or in a
(shared memory based) message passing environment, it wouldn't be.
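
The latency versus bandwidth tradeoff can be sketched with the usual
first-order model, time = latency + size / bandwidth.  Every number below
is hypothetical and chosen only for illustration: 32 fibers at an assumed
1 Gbit/s each with the hundred microsecond latency mentioned above,
compared against an assumed 10 microsecond, 1 Gbit/s link.

```python
def transfer_time_s(n_bytes, latency_s, bandwidth_bits_per_s):
    """First-order message time: fixed latency plus serialization time."""
    return latency_s + 8 * n_bytes / bandwidth_bits_per_s


def crossover_bytes(lat_a, bw_a, lat_b, bw_b):
    """Message size at which link A (higher latency, higher bandwidth) and
    link B (lower latency, lower bandwidth) take the same time."""
    return (lat_a - lat_b) / (8 / bw_b - 8 / bw_a)


if __name__ == "__main__":
    # Hypothetical links, not measurements of any real hardware.
    fat = dict(latency_s=100e-6, bandwidth_bits_per_s=32e9)
    thin = dict(latency_s=10e-6, bandwidth_bits_per_s=1e9)
    for size in (1_000, 100_000, 1_000_000):
        t_fat = transfer_time_s(size, **fat) * 1e6
        t_thin = transfer_time_s(size, **thin) * 1e6
        print(f"{size:>9} bytes: fat {t_fat:8.1f} us, thin {t_thin:8.1f} us")
    n = crossover_bytes(100e-6, 32e9, 10e-6, 1e9)
    print(f"crossover near {n:.0f} bytes")
```

Under these assumed numbers the high latency link wins for any message
larger than roughly 12 KB, which is the sense in which plenty of problems
could tolerate the latency hit.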

I dunno, Greg, two beers into it I'm still tormented with the thought
that $2 million might be a reasonable investment, especially if one is a
company like Transmeta, with a significant investment in making tiny
computers that could -- given the right harnesses -- be assembled into a
beowulf.  But even for Intel it might make sense.

However, Greg's other point (about PCI-X making it all better) is still
well taken.  Not being an IEEE member, I've been unable to get PCI-X
specs from the web (and I've expended some effort in that direction --
it drives me bananas to have a closed forum where significant standards
development efforts occur without even a window where the world can see
in), so it may well be that they've basically snugged it up much closer to
the CPU and memory busses.

Still, in the "old days" of PC development, PC's were so obviously
horrible in many features of their design (my original IBM PC was a 64K
-- yes, K -- motherboard) that lots of small companies got very, very
creative in their designs for enhancements and made decent fortunes for
their founders. Coprocessor boards, transputers, CPU plug-ins, and lots
of multifunction boards with memory and peripherals all hit the market
and sold (sometimes "like crazy").  These days, it seems like this
particular kind of bent-coat-hanger engineering has all but disappeared,
which is a bit of a shame.  Understandable enough -- computers now are
adequate for almost any mainstream application "out of the factory" --
but a shame, as it takes away a lot of creativity.

If the "standard" interface you have is too slow and there is an
interface available that is more than fast enough and standards based,
clever engineering and supporting software SHOULD be able to create a
kludge that uses the faster one (outside of the purpose for which it was
designed, of course).

Sigh.  I'll go down for another beer now.  Maybe three will make this
idea go away.  At least I have the blessed advantage of not being an EE
so I won't be tempted to go drum up the requisite $5M (why take risks)
of VC and go for it.  One can only hope, though, that Intel (or Compaq,
or any of the main computer/CPU/motherboard folks) has a light bulb turn
on and actually designs a motherboard/CPU connection a priori, outside
of the existing bus specs, "just for communications".  Something like
the benighted and hellish AGP slot, but instead of being useless and
devoted to graphics (which tend to work just fine on PCI, for the most
part) devoted to way deep, low latency/high bandwidth communications.

How hard can it be to design a standard API spec for this?  Put it on
the memory bus.  Put EVERYTHING on the memory bus and make all device
latencies the responsibility of the peripheral designer.  Make it the
responsibility of the peripheral designer to provide something that can
be read or written at memory bus speeds and latencies (subject to
interrupt line control), and put the burden on their shoulders to
provide interrupt-driven memory bus control of when the particular
device is ready to read or write again.  All of this can be managed with
a shared memory/co-processor paradigm.  In fact, a dual CPU system,
running a shared memory interface between a running program on CPU A and
its communications program on CPU B can emulate it now, sometimes even
profitably, except that the comm program has to go through the same damn
bus to get to the NIC and that a lot of NICs do DMA transfers anyway, so
most of CPU B is effectively wasted.

It's hard (at least for a novice like me) to understand why modern
computers require "a peripheral bus" with separate and distinct
latencies and bandwidths anymore at all.  Put everything on one bus and
let the peripheral itself decide how fast it "can" interface (up to the
actual lat/bw of real memory) and leave the decision of how fast it
"does" interface up to the interrupt-driven kernel software.  I think a
lot of the "peripheral bus" concept is an archaism left over from the
old days, when one had very slow devices that couldn't possibly saturate
a memory channel (so it made sense for several to have to share).

   rgb 

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From hjin at ceng.usc.edu  Thu Jun 22 17:00:40 2000
From: hjin at ceng.usc.edu (Hai Jin)
Date: Thu, 22 Jun 2000 17:00:40 -0700
Subject: CFP: Cluster Infrastructure for Web Server and E-Commerce
Message-ID: <3952A8A7.6AB44D12@ceng.usc.edu>

                                          CALL FOR PAPERS

                               The 2001 International Workshop on
   Cluster Infrastructure for Web Server and E-Commerce (CISC'2001)

    http://ceng.usc.edu/~hjin/cisc2001/   or   http://www.buyya.com/cisc2001/

                                        Organized At

 IEEE International Symposium on Cluster Computing and the Grid (CCGrid'2001)
                    May 16 - 18, 2001, Brisbane, Australia

  In Cooperation with the IEEE Task Force on Cluster Computing (TFCC)


Scope

            The availability of high-speed networks and high-performance
microprocessors is making networks of computers an appealing vehicle for
cost-effective internet and business computing. Clusters of computers built
using commodity hardware and software are playing a major role in web server
design and E-Commerce.

            The CISC'2001 workshop organized at the CCGrid'2001 symposium
will serve as a major forum to present and share the latest research results
of the work by international researchers, developers, and users. The focus of
this workshop will be on both architecture and software aspects of web server
and E-commerce by using cluster technology. Topics of interest include, but
are not limited to:

Web Server

      Scalability of clusters as web server
      Network architecture and protocol of cluster as web server
      Caching Algorithms for web server
      Web server performance
      Web server benchmark
      Network attached storage for web server
      Database for web server
      Security for web server
      Traffic Models & Statistics
      Job scheduling and load balancing for clustered web server
      Web server design and consolidation
      Web server monitoring and management
      Quality of Service of web server
      Web server for E-Commerce
      Fault tolerance and high availability of cluster as web server
      Web server case studies by using cluster technology

E-Commerce

      Cluster, Hypercluster and Grid architecture for E-Commerce
      Security and cryptographic issues, methods and applications
      Cyber Guards
      Digital certification, identification and authentication in E-Commerce
      Information storage, retrieval and update
      Web-based storage for E-Commerce
      Commerce-oriented middleware services
      Computational market for information services
      Formation of supply chains, coalitions, and virtual enterprises
      Software requirements and architectures for E-Commerce
      User interface support for E-Commerce
      Multilevel business support
      Intrusion prevention, detection and tolerance
      Data replication, consistency and caching
      Agent technology for E-Commerce
      Parallel databases and high performance reliable-mass storage systems
      Data mining for E-Commerce
      Decision making in E-Commerce environments
      Automatic Sales forecasting and Portal Reconfiguration
      E-Commerce application case studies

Paper Submission

        Submission should include authors' names, affiliations, addresses and
email addresses on the cover page. Submission implies the willingness of at
least one of the authors to register and present the paper. Please submit an
original, full paper not exceeding 10 pages of two column text using single
spaced 10 point size type on 8.5 x 11 inch pages, as per IEEE 8.5 x 11
manuscript guidelines, see: http://www.computer.org/cspress/instruct.htm.
Authors should submit a PostScript (level 2) or PDF file that will print on a
PostScript printer to the workshop chair Hai Jin by email. Hard copies should
be sent only if electronic submission is not possible.

Proceedings

        All papers selected for this workshop by peer-review process will be
published in the CCGrid'2001 symposium proceedings through the IEEE Computer
Society Press, USA and they will also be made available online through the
IEEE digital library.

Workshop Chairs

Hai Jin
Internet and Cluster Computing Lab.
Department of EE-System, EEB-104
University of Southern California
Los Angeles, California, 90089, USA
Tel: +1-213-740-6433
Fax: +1-425-920-8937
Email: hjin at ceng.usc.edu
WWW: http://ceng.usc.edu/~hjin/

Rajkumar Buyya
School of Computing Science and Software Engineering
Monash University
Caulfield Campus, Melbourne, Vic. 3145, Australia
Phone: +61-3-9903 1969
Fax: +61-3-9903  2863
Email: rajkumar at csse.monash.edu.au
WWW: http://www.csse.monash.edu.au/~rajkumar

Program Committee Members

     Amnon Barak     (Hebrew University, Israel)
     Rajkumar Buyya    (Monash University, Australia)
     Peter Graham     (University of Manitoba, Canada)
     Kai Hwang     (University of Southern California, USA)
     Hai Jin     (University of Southern California, USA)
     Veljko Milutinovic     (University of Belgrade, Yugoslavia)
     Clifford Neuman     (ISI, University of Southern California, USA)
     Yi Pan    (University of Dayton, USA)
     Michael Rumsewicz    (University of Melbourne, Australia)
     Cho-Li Wang     (The University of Hong Kong, Hong Kong)
     Wensong Zhang     (National Laboratory for Parallel & Distributed
                        Processing, China)


Important Dates:

 Draft Papers due on:  November 4, 2000
 Notification of Acceptance:  December 20, 2000
 Camera Ready Papers and Preregistration due on:   January 24, 2001
 CCGrid'2001 Symposium: May 16 - 18, 2001


From SeanWard at msn.com  Fri Jun 23 01:48:16 2000
From: SeanWard at msn.com (Sean Ward)
Date: Fri, 23 Jun 2000 04:48:16 -0400
Subject: Beowulf: A theorical approach 
References: <Pine.LNX.4.10.10006221718180.13495-100000@ganesh.phy.duke.edu>
Message-ID: <004701bfdcef$c6658660$100010ac@alex1.va.home.com>

----- Original Message -----
From: Robert G. Brown <rgb at phy.duke.edu>
To: Walter B. Ligon III <walt at parl.ces.clemson.edu>
Cc: James Cownie <jcownie at etnus.com>; Nacho Ruiz <iorfr00 at student.vxu.se>;
Beowulf Mailing List <beowulf at beowulf.org>
Sent: Thursday, June 22, 2000 6:30 PM
Subject: Re: Beowulf: A theorical approach


> On Thu, 22 Jun 2000, Walter B. Ligon III wrote:
>
> > --------
> >
> > Well, yeah, but it's the PCI interface I'm talking about.  Robert Brown's
> > posting was really more to the point.  Build a NIC that interfaces
> > directly to the CPU and memory.
>
> Sure, and memory is indeed another way to do it.  Build a small
> communications computer that "fits" into a memory chip slot.  I'd guess
> that one could make the actual interface a real (but small and very fast
> -- SRAM?) memory chip that was on TWO memory buses -- the one in the
> computer in question and on the "computer" built into the interface
> whose only function is to manage communications and which would be
> strictly responsible for avoiding timing collisions -- possibly with a
> harness that allows it to generate interrupts to help even more. (Can a
> memory chip per se generate trappable interrupts now? Don't know.) Then
> accompany it with a kernel module that maps those memory addresses into
> a dedicated interface space and manages the interrupts, so the CPU only
> tries to write the memory when it is writable and read when it is
> readable.
[snip]
    From a software standpoint, that is doable. RAM cannot generate
interrupts, to my current understanding, which would leave a polling
architecture, probably on a fixed address to detect a "data received" bit
change. That would mean at least one word of transfer for every polling
interval of n microseconds. Ideally, a NIC driver would make that
interval configurable, so people wanting to increase memory bandwidth or
reduce CPU time lost to polling could sacrifice latency.
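
    As a rough sketch of that trade-off (not part of any real driver): the
function below assumes one 8-byte word is read per poll and that a message
arrives at a uniformly random moment within a polling interval. The name
polling_cost and the word size are made up for illustration.

```python
# Sketch: polling interval versus latency and memory traffic.
# Assumptions (illustrative only): one 8-byte word is read per poll,
# and a message arrives at a uniformly random time within an interval.

WORD_BYTES = 8  # assumed width of one poll read (64-bit data path)


def polling_cost(interval_us, word_bytes=WORD_BYTES):
    """Return (mean_latency_us, worst_latency_us, poll_traffic_bytes_per_s)."""
    mean_latency_us = interval_us / 2.0      # average wait until the next poll
    worst_latency_us = float(interval_us)    # message lands just after a poll
    polls_per_second = 1_000_000 / interval_us
    traffic = polls_per_second * word_bytes  # memory traffic spent on polling
    return mean_latency_us, worst_latency_us, traffic


if __name__ == "__main__":
    for interval in (1, 10, 100):
        mean, worst, traffic = polling_cost(interval)
        print(f"{interval:>4} us poll: mean {mean:5.1f} us, "
              f"worst {worst:5.1f} us, {traffic / 1e6:6.2f} MB/s of poll reads")
```

    A longer interval cuts the polling traffic in proportion but adds the
same proportion to the detection latency, which is why the interval is worth
exposing as a tunable.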
    At least in Linux, it would be fairly trivial to mask the memory offsets
assigned to the NIC-as-RAM module, for example by using the approach in
http://home.zonnet.nl/vanrein/badram/ and kmallocing those offsets during
kernel init. The hardware approach would be similar to those (old) 36 to 72
pin RAM converters, which stacked several old RAM SIMMs. Drop in your NIC
logic, a DIMM slot for cache, and a cable to an optical jack, and hope your
case supports your ungodly high DIMM. The main design problem would be the
interface between the SDRAM bus and whatever NIC core you use.
    Assuming you are using a 133 MHz frontside bus for RAM access, and are
stacking into a 168-pin SDRAM slot, you could theoretically be on the order
of ~1 GB per sec of IO, given the 64-bit data path on modern north bridges
like the VIA KX133. Existing fiber-based interconnects can already provide
that, albeit with a latency penalty. In the real world, once you throw in
addressing, protocol and polling overhead, and the fact that most memory
controllers require interleaving data on several DIMMs to get full
bandwidth, you might get half the theoretical "wire speed" of the SDRAM
DIMM. The fact that the receive buffer would be addressable RAM would be
useful for many interesting things ;)
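
    The ~1 GB per sec figure is just clock rate times bus width. A minimal
sketch of the arithmetic, assuming one transfer per clock and taking the
"half of wire speed" guess at face value (the function name is made up):

```python
# Peak transfer rate of a memory bus: clock rate times bytes moved per clock.
# Assumes one transfer per clock cycle and 1 MB = 10**6 bytes.


def peak_bandwidth_mb_s(clock_mhz, bus_width_bits):
    """Theoretical peak in MB/s for a bus of the given clock and width."""
    return clock_mhz * (bus_width_bits // 8)


peak = peak_bandwidth_mb_s(133, 64)  # 133 MHz bus, 64-bit data path
realistic = peak * 0.5               # the "might get half" estimate
print(f"peak {peak} MB/s, realistic guess {realistic:.0f} MB/s")
```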
    The real advantage of that type of solution is that you can hack support
into any platform which uses DIMMs, provided the OS is modifiable. A CPU
socket design requires commitment to an architecture, and hence has a
smaller market than this NIC on a DIMM. When you couple the prospect of a
mixed revenue stream (licensing the NIC on a DIMM to other chip
manufacturers) with software sales of solutions optimized to use the
hardware (think databases and filesystems initially, because they need the
fast transactional support that a high-speed write to another computer
provides), you have an attractive proposal. Finally, you would probably
have a time window in which it was more cost effective to use the NIC on a
DIMM than new bus formats such as PCI-X, since, as always, new tech costs
more than old tech, and with a NIC on a DIMM you only need a new RAM chip,
versus new IO controllers, new NICs, new motherboard designs, etc.
However, I do not know enough about PCI-X and InfiniBand to do anything but
shoot myself in the foot, especially WRT the latencies of the different
technologies.

-Sean Ward


From jcownie at etnus.com  Fri Jun 23 01:38:13 2000
From: jcownie at etnus.com (James Cownie)
Date: Fri, 23 Jun 2000 09:38:13 +0100
Subject: Beowulf: A theorical approach 
In-Reply-To: Your message of "Thu, 22 Jun 2000 13:22:56 EDT."
             <Pine.LNX.4.10.10006221256230.13157-100000@ganesh.phy.duke.edu> 
Message-ID: <03a4f1239081760PCOW029M@blueyonder.co.uk>

> Just as a matter of curiosity -- Once upon a time some two or three
> years ago I suggested on the list that a development company consider
> building a network communications device that plugged into the second
> CPU slot of a dual CPU board.

This is exactly how the Elan comms interface worked in the MEIKO CS-2,
(available circa 1992/3 ?). It plugged into the second M-Bus slot on a
dual SPARC CPU board.

It provided cache coherent remote user space access (including remote
DMA) with control from user processes in Solaris without requiring a
kernel trap. But _not_ an SMP model, although you had complete remote
store access, you had to _know_ that you wanted to access remote
store. Remote store was not memory mapped into the process address
space, so a simple load/store could never be remote. (I.e. it's a
message passing model, or what Cray called "shmem", but with full
cache coherence). There was also a little processor in there for
message sequencing and so on (which also ran in the user address
space, of course).

Been there, done that...

p.s. the current successor to this is the Quadrics' technology.

The reasons no-one does it any more are

1) You can't get the bus specs from the CPU vendors.
2) The CPU bus-specs change too fast for you to be able to keep up.

As a non-cpu vendor you're driven to working with something which has
a public specification and doesn't change every six months, which
currently means PCI.

-- Jim 

James Cownie	<jcownie at etnus.com>
Etnus, Inc.     +44 117 9071438
http://www.etnus.com


From oliva at ponza.dma.unina.it  Fri Jun 23 02:40:53 2000
From: oliva at ponza.dma.unina.it (Gennaro Oliva)
Date: Fri, 23 Jun 2000 11:40:53 +0200
Subject: pbs quickstart?
References: <Pine.LNX.4.10.10006221317330.21335-100000@dirac.org>
Message-ID: <395330A5.C220DC79@ponza.dma.unina.it>

Try
http://www.nas.nasa.gov/ACSF/Eagle/Submitting/
or
http://www.nas.nasa.gov/Groups/SciCon/Tutorials/usingpbs/

						Gennaro Oliva


Peter Jay Salzman wrote:
> 
> does anyone know of a pbs quickstart file?
> 
> any good pbs resources?  the admin guide is leaving a lot of questions
> unanswered and has too much info that doesn't pertain to our setup; it's
> making the reading a bit slow.  :(
> 
> help?
> 
> pete
> 
> _______________________________________________
> Beowulf mailing list
> Beowulf at beowulf.org
> http://www.beowulf.org/mailman/listinfo/beowulf


From scheinin at crs4.it  Fri Jun 23 04:05:00 2000
From: scheinin at crs4.it (Alan Scheinine)
Date: Fri, 23 Jun 2000 13:05:00 +0200 (METDST)
Subject: Supermicro PIIIDM3 and PIIIDR3 motherboards
Message-ID: <200006231105.NAA01626@dylandog.crs4.it>

   I am new to this mailing list, but I did try to read the Archives
of this year before posting this question.

   I want to buy (have assembled from boards) Intel-based PCs
that will give me good bandwidth with Myrinet.  In particular,
I would like a 64-bit PCI bus, so the field is narrowed to just
a few motherboards.  More specifically, there are two motherboards
from Supermicro for which I would like to hear some real-world
experience: PIIIDM3 (memory PC100) and PIIIDR3 (memory Rambus).

   On 25 April 2000 Eric Billings submitted a table that showed
the PIIIDME (same as PIIIDM3 but without built-in SCSI) as having
slow STREAM results, slower than the chipset 440BX.  Keith Underwood
replied, "That's odd.  Tim Carlson had posted STREAM benchmarks
earlier for that board that were much better (2x or more)."

   In looking over the archive, I did not come across a follow-up
article.  Eric Billings wrote, "We are continuing to test 840-based
motherboards as they become available."  So perhaps it is appropriate
to revisit the question.

   Brian Haymore wrote that he returned 5 or 7 PIIIDME boards and
runs them with memory ECC turned off.  Aside from the ECC problem,
which another contributor said has been fixed, could Eric Billings comment
on the speed of running programs in comparison to a 440BX chipset?  Does
the interleaving of PC100 memory access on the 840 chipset give a visible
speedup?  More generally, in addition to STREAM tests, does anyone have
comparisons running actual memory-intensive programs?

   With regard to Rambus, I did not see any mailing list messages
concerning the Supermicro PIIIDR3; has anyone tried it?  I read
that for the 820 chipset the memory translator that allows the use of
PC100 adds latency.  Because the 840 chipset is so new (with regard
to its market share), I have not come across comments concerning a
possible analogous problem for the 840 chipset.  If Rambus really
does give a significant speed-up, it might be worth the cost; especially
since the cost of Rambus is gradually declining.

Alan Scheinine  Email: scheinin at crs4.it


From ian_mcleod at primus.com.au  Fri Jun 23 04:20:05 2000
From: ian_mcleod at primus.com.au (Ian McLeod)
Date: Fri, 23 Jun 2000 20:50:05 +0930
Subject: Neural network on beowulf - query
In-Reply-To: <200006211600.MAA24514@blueraja.scyld.com>
References: <200006211600.MAA24514@blueraja.scyld.com>
Message-ID: <00062320523502.00969@micinca>

Hi,

I am new to all of this, and I confess my lack of experience.

I am intrigued by the concept of running a neural network simulator over a
beowulf cluster of a few friends' PCs, but as I understand it from the manual
and how-to's, this is easier said than done.

Only FORTRAN can be run in parallel if I read the literature correctly, C++
will not work (or not well at all).

Is there any hope for parallel computing?

Ian McLeod


From alvin at iplink.net  Fri Jun 23 05:13:53 2000
From: alvin at iplink.net (Alvin Starr)
Date: Fri, 23 Jun 2000 08:13:53 -0400 (EDT)
Subject: Beowulf: A theorical approach 
In-Reply-To: <Pine.LNX.4.10.10006221718180.13495-100000@ganesh.phy.duke.edu>
Message-ID: <Pine.OSF.4.05.10006230807070.30689-100000@caesar.iplink.net>

On Thu, 22 Jun 2000, Robert G. Brown wrote:

> On Thu, 22 Jun 2000, Walter B. Ligon III wrote:
> 
> > --------
> > 
> > Well, yeah, but it's the PCI interface I'm talking about.  Robert Brown's
> > posting was really more to the point.  Build a NIC that interfaces
> > directly to the CPU and memory.
> 
> Sure, and memory is indeed another way to do it.  Build a small
> communications computer that "fits" into a memory chip slot.  I'd guess
> that one could make the actual interface a real (but small and very fast
> -- SRAM?) memory chip that was on TWO memory buses -- the one in the
> computer in question and on the "computer" built into the interface
> whose only function is to manage communications and which would be
> strictly responsible for avoiding timing collisions -- possibly with a
> harness that allows it to generate interrupts to help even more. (Can a
> memory chip per se generate trappable interrupts now? Don't know.) Then
> accompany it with a kernel module that maps those memory addresses into
> a dedicated interface space and manages the interrupts, so the CPU only
> tries to write the memory when it is writable and read when it is
> readable.

One other possibility is to use video RAM. These chips have row shifters,
and it is possible to shift data into the RAM and then load a whole row in
a single operation. The advantage is that the RAM can be used for other
work while the data is being shifted in. This can also be used as a very
fast method for initializing chunks of the RAM. Write a row of zeros and
then load them multiple times to get a zeroed page. I wonder how much CPU
time is spent just writing 0's to pages of memory?

Alvin Starr                   ||   voice: (416)585-9971
Interlink Connectivity        ||   fax:   (416)585-9974
alvin at iplink.net              ||


From scheinin at crs4.it  Fri Jun 23 05:41:48 2000
From: scheinin at crs4.it (Alan Scheinine)
Date: Fri, 23 Jun 2000 14:41:48 +0200 (METDST)
Subject: Neural network on beowulf - query
Message-ID: <200006231241.OAA01671@dylandog.crs4.it>

Ian McLeod writes:
 > I am intrigued by the concept of running a neural network simulator over a
 > beowulf cluster of a few friends' PCs, but as I understand it from the manual
 > and how-to's, this is easier said than done.
 > 
 > Only FORTRAN can be run in parallel if I read the literature correctly, C++
 > will not work (or not well at all).
 > 
 > Is there any hope for parallel computing?
 > Ian McLeod

Since you will not be using shared memory (you mention a cluster
of computers), you will probably use a message-passing library
such as MPI.  Message-passing works equally well in any language.
For example, I am using MPI to parallelize a Navier-Stokes solver
written in C++.

However, it is time-consuming to write routines to pass nested classes
so you should arrange that the communication only involves simple data
as much as possible.
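
One way to keep the communication down to simple data is to flatten the
nested objects into a single buffer of doubles before sending and rebuild
them on the other side. The sketch below shows only that packing step; the
classes Cell and Vec3 are hypothetical, and it is written in Python for
brevity rather than in the C++ of my solver. No MPI call is made here.

```python
from array import array


class Vec3:
    """Hypothetical inner class: a 3-component vector."""
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z


class Cell:
    """Hypothetical nested class: a grid cell with pressure and velocity."""
    def __init__(self, pressure, velocity):
        self.pressure = pressure
        self.velocity = velocity  # a Vec3


FIELDS_PER_CELL = 4  # pressure plus three velocity components


def pack_cells(cells):
    """Flatten nested objects into one contiguous buffer of doubles."""
    buf = array("d")
    for c in cells:
        buf.extend((c.pressure, c.velocity.x, c.velocity.y, c.velocity.z))
    return buf


def unpack_cells(buf):
    """Rebuild the nested objects from the flat buffer on the receiving side."""
    return [Cell(buf[i], Vec3(buf[i + 1], buf[i + 2], buf[i + 3]))
            for i in range(0, len(buf), FIELDS_PER_CELL)]
```

With a real MPI binding, the flat buffer could then go out as one message of
doubles (for instance a single MPI_Send of MPI_DOUBLE values) instead of one
message per object or per field.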

Alan Scheinine  Email: scheinin at crs4.it


From c.best at fz-juelich.de  Fri Jun 23 06:50:03 2000
From: c.best at fz-juelich.de (Christoph Best)
Date: Fri, 23 Jun 2000 15:50:03 +0200 (CEST)
Subject: Beowulfs can compete with Supercomputers [was Beowulf: A theorical
 approach]
In-Reply-To: <3952908E.D203004C@super.org>
References: <000201bfdc78$ff3df520$e4844b89@hptilap.hpti.com>
 <3952908E.D203004C@super.org> <000b01bfdc1a$5f6e7c80$396d2fc2@lyan.vxu.se>
 <Pine.LNX.4.10.10006221042060.12915-100000@ganesh.phy.duke.edu>
Message-ID: <14674.27688.619193.170494@verne.local>

Hi,

just my two cents (of a Euro) to the "Beowulf vs. Supercomputer"
discussion:

I found that the comparison is more often "building/buying your own
group or departmental cluster" vs. "writing applications for
supercomputer time on a nationwide computer center". Even our little
12-processor cluster provides 100000 processor hrs a year, about what
you would get for a smaller project in a supercomputing center, and
the 128-processor ALiCE cluster of Wuppertal University may be a
factor 5-10 smaller than a big Cray, but there are usually many more
than 10 research institutions sharing a supercomputing center. [BTW,
the Wuppertal cluster was chosen over established mid-range
supercomputers in a competition based on price/performance for
selected application benchmarks.]  Add to this the organizational
overhead and inconvenience of a supercomputing center.
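
The arithmetic behind these numbers, as a back-of-the-envelope sketch: it
assumes the cluster runs around the clock, and the "factor 10 larger machine
shared by 10 institutions" case is only an illustration of the ranges quoted
above, not a measurement.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years and downtime


def processor_hours_per_year(processors, utilization=1.0):
    """Processor-hours delivered per year at the given utilization."""
    return processors * HOURS_PER_YEAR * utilization


def share_of_big_machine(cluster_hours, size_factor, institutions):
    """Hours each institution gets from a machine size_factor times larger."""
    return cluster_hours * size_factor / institutions


small = processor_hours_per_year(12)    # our 12-processor cluster
alice = processor_hours_per_year(128)   # the 128-processor ALiCE cluster
share = share_of_big_machine(alice, size_factor=10, institutions=10)
print(small, alice, share)
```

So 12 processors give a bit over 100000 processor hours a year, and a
machine ten times the size of ALiCE split ten ways leaves each institution
with no more than one ALiCE.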

So unless you really need O(1024) processors, many projects should be
better off on a cluster. And if you really need that amount of
computer time for a prolonged period, you probably would not be able
to pay for the supercomputer. Some subfields, like ours (Lattice Field
Theory) or astrophysics, have for quite some time resorted to
building their own supercomputers, sometimes combining the Beowulf
idea of off-the-shelf components with custom interconnects. The
closest may be QCDSP from Columbia University, which is built from
Texas Instruments Digital Signal Processors on custom printed-circuit
boards, and delivers in its largest installation about 400 GFlops
(they are aiming at 10 TFlops for their next project). Others are
CP-PACS in Japan (based on a modified HP chip), and APE in
Italy/Germany (custom designed processors for a single-instruction
multiple-data machine).

Also, when writing an application that needs O(100) GFlops-years, many
physics groups are happy to tailor their programs to the machine and
write message passing codes (as long as graduate students come cheap),
so SMP is not really missed. Cray's top-of-the-line T3E actually is
message-passing, so many programs are written for it.

Finally, we found that processor speed is increasing so quickly that
even our application, once considered network-hungry, does not exhaust
Myrinet. Myrinet gives you maybe 100 MB/s data transfer, but the
memory transfer rate may also be only 300-500 MB/s/proc. - a 10 GB/s
network would run much faster than a current memory bus.

Actually, the best argument, if any, against Beowulves that I found
was sheer size and power consumption, mainly because the average node
contains much more circuitry than needed. If someone came up with a
small board containing an Alpha processor, cache and main memory, and
a Myrinet or similar connection... But this, of course, would not be
much different from a T3E.

-Chris
-- 
Christoph Best                                        c.best at computer.org
John von Neumann Institute for Computing/DESY   http://www.oche.de/~cbest


From wasshub at ti.com  Fri Jun 23 06:42:35 2000
From: wasshub at ti.com (Christoph Wasshuber)
Date: Fri, 23 Jun 2000 08:42:35 -0500
Subject: Beowulf: A theorical approach
References: <Pine.LNX.4.10.10006221718180.13495-100000@ganesh.phy.duke.edu>
Message-ID: <3953694B.F6DD5919@ti.com>

I am trying a similar approach and would need help. I
do not have $2mil but some money is available to try
the following approach:

We are trying to design a motherboard which is a
'standard' motherboard but with an I/O interface
which directly plugs into the host interface bus
(between CPU and north bridge).

The motherboard is finished, the only thing which
remains is the host bus interface. We try to do a
simple I/O port (32 or 16 bit wide). Our goal is
to just listen in on the data and address busses
and snatch the appropriate data from the bus for
writing to the I/O port. For reading we try to
disable the northbridge and drive the data
bus. We hope to be able to do this with a high
speed FPGA. The main problems are in timing and
PCB layout.

If there are folks who would be willing to help in
the circuit design of this idea or to try the
DIMMS idea mentioned in earlier postings, please
contact me.

As I said, I am funding this myself, so there is not
a lot of cash available but enough to pay some
enthusiasts for their hours of work spent during the
night :-)

Since this posting might appear as coming 'out of the blue',
here is a very short paragraph about my intentions:

I am planning to design an affordable motherboard for Beowulfery
with a fast low latency network. The current plan is to
use the idea of PAPERS but scale it to 16, 32, or 64 bit
parallel with direct access to the CPU. Latencies of <100ns
should be possible. So the interface itself will be very
cheap because the hardware of PAPERS is trivial. Hooking this
up to the host bus, if possible with an FPGA, does not cost much
more than standard NICs today.
In case some think that designing a new motherboard is not
cost effective, I have several quotes for the layout and manufacturing
of such motherboards. Even with an initial prototype run of only
100 motherboards one could achieve a price of ~$150 per motherboard
for AMD K6 based design (only counting manufacturing and not design).

I am willing to fund such an effort, but I would need circuit designers
who are up to the challenge.

Chris....


From glindahl at hpti.com  Fri Jun 23 08:48:16 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Fri, 23 Jun 2000 11:48:16 -0400
Subject: Beowulf: A theorical approach 
In-Reply-To: <03a4f1239081760PCOW029M@blueyonder.co.uk>
Message-ID: <001501bfdd2a$724bd300$e4844b89@hptilap.hpti.com>

> It provided cache coherent remote user space access (including remote
> DMA) with control from user processes in Solaris without requiring a
> kernel trap. But _not_ an SMP model, although you had complete remote
> store access, you had to _know_ that you wanted to access remote
> store.

Today, this is called the "SALC" programming model: shared address, local
consistency. You explicitly fetch data to your local address space, and you
are responsible for making sure it's up to date.

> The reasons no-one does it any more are

That would be a surprise for those of us who are planning on doing it. The
UPC and CoArray Fortran languages use SALC, and I expect to have SALC
hardware when PCI-X gets here. MPI-2's one-sided communications can be sped
up if you have SALC hardware.

-- greg


From glindahl at hpti.com  Fri Jun 23 08:54:02 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Fri, 23 Jun 2000 11:54:02 -0400
Subject: Beowulf: A theorical approach 
In-Reply-To: <Pine.LNX.4.10.10006221718180.13495-100000@ganesh.phy.duke.edu>
Message-ID: <001701bfdd2b$409c98c0$e4844b89@hptilap.hpti.com>

> Sure, and memory is indeed another way to do it.  Build a small
> communications computer that "fits" into a memory chip slot.

People have put CPUs on memory interfaces, and it was discussed on this very
mailing list. That one was an ARM; it was done before on the Cray-[34]/SSS.

However, memory interface standards are advancing rapidly these days, so
that would be annoying. SDRAM, RDRAM, DDR SDRAM.

> I dunno, Greg, two beers into it I'm still tormented with the thought
> that $2 million might be a reasonable investment, especially if one is a
> company like Transmeta, with a significant investment in making tiny
> computers that could -- given the right harnesses -- be assembled into a
> beowulf.  But even for Intel it might make sense.

If you have the $2 million, send it to me. Intel doesn't care about high-end
supercomputing.

> One can only hope, though, that Intel (or Compaq,
> or any of the main computer/CPU/motherboard folks) has a light bulb turn
> on and actually designs a motherboard/CPU connection a priori, outside
> of the existing bus specs, "just for communications".

They believe that Infiniband meets that need.

> It's hard (at least for a novice like me) to understand why modern
> computers require "a peripheral bus" with separate and distinct
> latencies and bandwidths anymore at all.

Because the CPU bus has to change faster than I/O busses are allowed to
change.

-- greg


From jcownie at etnus.com  Fri Jun 23 09:16:30 2000
From: jcownie at etnus.com (James Cownie)
Date: Fri, 23 Jun 2000 17:16:30 +0100
Subject: Beowulf: A theorical approach 
In-Reply-To: Your message of "Fri, 23 Jun 2000 11:48:16 EDT."
             <001501bfdd2a$724bd300$e4844b89@hptilap.hpti.com> 
Message-ID: <0c6231617161760PCOW028M@blueyonder.co.uk>

> Today, this is called the "SALC" programming model: shared address,
> local consistency. You explicitly fetch data to your local address
> space, and you are responsible for making sure it's up to date.

I don't like that name much. The address space is _not_ shared,
nothing is shared, an address is always local. (Which is a good thing
if you only have a 32 bit address space, since you'd soon blow it away
in a large machine if you had all the cooperating processes' address
spaces mapped in).

Personally I always called it a "cache-coherent explicit remote store
access" model, but I suppose that's not a FLA, so insufficiently
hypeable :-)

> The reasons no-one does it any more are
>      ...
> That would be a surprise for those of us who are planning on doing it. The
> UPC and CoArray Fortran languages use SALC, and I expect to have SALC
> hardware when PCI-X gets here. MPI-2's one-sided communications can be sped
> up if you have SALC hardware.

Sorry, you misunderstand. I'm not saying that no-one uses that
_programming_ model anymore, I _am_ saying that no third parties build
NICs which connect directly to a (non-standard) CPU bus, rather than a
standard interface. (Which was, after all, where this discussion
started, with the suggestion that a good place to put a NIC would be
in the second cpu socket of a dual processor mother-board).

Indeed you seem to be agreeing, "I expect to have SALC hardware when
PCI-X gets here", so you're waiting to connect to a standard bus,
rather than trying to engineer to a CPU bus. 

In any case you don't need to wait, the Quadrics' stuff does this now
(on PCI). Not a big surprise, really, given that it is the current
version of the Meiko interconnect.

-- Jim 

James Cownie	<jcownie at etnus.com>
Etnus, Inc.     +44 117 9071438
http://www.etnus.com


From glindahl at hpti.com  Fri Jun 23 09:21:52 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Fri, 23 Jun 2000 12:21:52 -0400
Subject: Beowulf: A theorical approach 
In-Reply-To: <0c6231617161760PCOW028M@blueyonder.co.uk>
Message-ID: <002501bfdd2f$23f31d80$e4844b89@hptilap.hpti.com>

> I don't like that name much. The address space is _not_ shared,
> nothing is shared, an address is always local. (Which is a good thing
> if you only have a 32 bit address space, since you'd soon blow it away
> in a large machine if you had all the cooperating processes' address
> spaces mapped in).

You don't understand how I was using the word 'address space'. The address
space in question is NOT the memory address space. It is the namespace for
all the data. The processor hardware isn't involved and you don't use the
usual processor "load" instruction to access this data; you call a
subroutine. This subroutine interprets an argument as the name of the data
to fetch. That's shared.

> Personally I always called it a "cache-coherent explicit remote store
> access" model,

But it isn't actually cache coherent. Remember Cray shmem(): To use data,
you fetch it to be close to you, and then it isn't cache coherent while it's
local. It IS cache coherent in that the fetch gets you the right data.
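
A toy sketch of what I mean, with everything in one process for simplicity.
GlobalStore and its put/get methods are made-up names standing in for the
shared namespace and the fetch subroutine; this is not the API of shmem or
of any real library.

```python
class GlobalStore:
    """Stand-in for data reachable from every process, addressed by name."""

    def __init__(self):
        self._data = {}

    def put(self, name, value):
        """Publish a value under a shared name."""
        self._data[name] = list(value)

    def get(self, name):
        """Explicit fetch, not a processor load: returns a private copy."""
        return list(self._data[name])


store = GlobalStore()
store.put("grid", [1, 2, 3])

local = store.get("grid")     # the fetch returns the right data at that moment
store.put("grid", [9, 9, 9])  # someone else updates the shared data
# 'local' still holds the old values; nothing keeps it coherent for us
fresh = store.get("grid")     # the program has to refetch explicitly
print(local, fresh)
```

The name "grid" is what is shared. The copy in 'local' is ordinary private
memory, and keeping it up to date is the program's job.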

> In any case you don't need to wait, the Quadrics' stuff does this now
> (on PCI). Not a big surprise, really, given that it is the current
> version of the Meiko interconnect.

The PCI latency penalty is quite large. If I had to support SALC today,
Quadrics is one approach, but I have another approach that I feel would be
superior.

-- g


From glindahl at hpti.com  Fri Jun 23 13:28:24 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Fri, 23 Jun 2000 16:28:24 -0400
Subject: Beowulfs can compete with Supercomputers [was Beowulf: A theorical approach]
In-Reply-To: <Pine.GSO.4.21.0006230925290.24425-100000@rcf.rhic.bnl.gov>
Message-ID: <003b01bfdd51$94c214e0$e4844b89@hptilap.hpti.com>

> i would like to emphasize that in supercomputing sites the people are
> interested not only for computing power (though it is very important)
> but for available software. In traditional fields where supercomputers are
> used (like modeling for various reactors, weather forecast) there are
> also a number of programs and applications which can not be moved easy to
> Linux cluster.

Generalizations are always dangerous.

Weather forecasting is a field in which MPI codes are seen to be the wave of
the future. Big US weather forecasting sites, for example, include NCEP (IBM
SP), AFWA (IBM SP), GFDL (currently has an RFP for which 100% of the codes
use MPI), and FSL, which owns an AlphaLinux Myrinet cluster from my
employer. Large European sites own machines which are also programmed using
MPI.

Once you're using MPI, it's a matter of getting the right balance of
communication bandwidth, latency, and processor power. This can be done
today.

> One of the way to speed up the process is to realize the real
> supercomputing site on the base of the Linux cluster and demonstrate the
> traditional application software packages which might be attractive for
> end users in conventional supercomputing sites. It would be collaborative
> project in between cluster suppliers, software developers, universities,
> etc.

Many supercomputer sites use IBM SP machines, which are clusters, so there
isn't that much to prove. The Forecast Systems Lab (FSL) machine is a
production Linux cluster with good uptime. Whenever you fly in the US, the
FSL machine is the backup machine providing the weather forecast for your
flight.

-- greg


From billran at reciprocal.com  Fri Jun 23 14:09:15 2000
From: billran at reciprocal.com (Bill Rankin)
Date: Fri, 23 Jun 2000 17:09:15 -0400
Subject: Beowulfs can compete with Supercomputers [was Beowulf: A theorical approach]
Message-ID: <C4E826E59C02D311985B00500463D90B4250B3@SNS2XCH>

Hey Greg,


> Many supercomputer sites use IBM SP machines, which are clusters, so
> there isn't that much to prove.

One question: from what I remember the nodes within an SP
are moderate sized SMP in their own right.  Mixed mode programming
(threads within a node, MPI between nodes) was becoming popular 
as a way of increasing scalability (or rather avoiding the
immediate penalties).  I seem to remember Pete Beckman and Co. talking
about this at the SIAM PP/GS conference a couple years back.

Have you guys had any experience with comparing the performance of
similar codes on an SP versus a fully distributed cluster with similar
performance and number of processors?  

Also, speaking of weather prediction, does anyone know of any recent
advances in getting the shallow-water model to perform well on
a cluster?  I was always under the impression that this was a big
sticking point for some of the more complex models.

Thanks,

-bill


From glindahl at hpti.com  Fri Jun 23 14:38:08 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Fri, 23 Jun 2000 17:38:08 -0400
Subject: Beowulfs can compete with Supercomputers [was Beowulf: A theorical approach]
In-Reply-To: <C4E826E59C02D311985B00500463D90B4250B3@SNS2XCH>
Message-ID: <003f01bfdd5b$5263d520$e4844b89@hptilap.hpti.com>

> > Many supercomputer sites use IBM SP machines, which are clusters, so
> > there isn't that much to prove.
>
> One question: from what I remember the nodes within an SP
> are moderate sized SMP in their own right.

Right. They used to be single CPUs, though.

> Mixed mode programming
> (threads within a node, MPI between nodes) was becoming popular
> as a way of increasing scalability (or rather avoiding the
> immediate penalties).

No. The fact is that IBM initially shipped those machines so that you HAD to
use mixed mode programming to use all the CPUs. Mixed mode hardly increases
scalability, unless your MPI is pretty bad at sending local messages, as
IBM's is.

You can run models like MM5 in either mode. MM5 is faster (on an SGI) as a
pure MPI program.

The emperor has no clothes.

> Have you guys had any experience with comparing the performance of
> similar codes on an SP versus a fully distributed cluster with similar
> performance and number of processors?

No -- there is no Alpha slow enough for such a comparison ;-)

The MM5 results I keep showing include an IBM SP line. It is slower per
cpu, and scales similarly. The scaling limitation on MM5 is mostly load
imbalance, not interconnect.
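
To illustrate why load imbalance caps scaling no matter how fast the
interconnect is, here is a small sketch. It assumes a time-stepped code in
which every CPU waits for the slowest one at the end of each step; the work
numbers are invented, not MM5 measurements.

```python
def parallel_efficiency(work_per_cpu):
    """Efficiency when every CPU must wait for the slowest one each step."""
    slowest = max(work_per_cpu)
    average = sum(work_per_cpu) / len(work_per_cpu)
    return average / slowest


balanced = parallel_efficiency([10, 10, 10, 10])  # equal work everywhere
skewed = parallel_efficiency([10, 10, 10, 20])    # one CPU has double the work
print(balanced, skewed)
```

With one CPU carrying twice the work of the others, the four-CPU run is only
62.5% efficient even with zero communication cost.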

> Also, speaking of weather prediction, does anyone know of any recent
> advances in getting the shallow-water model to perform well on
> a cluster?  I was always under the impression that this was a big
> sticking point for some of the more complex models.

I may reveal my ignorance here, but:

Shallow water models are not cache friendly, so the usual problem is that
they run only as fast as main memory does. Vector machines mostly have
relatively good main memory systems, so there's a strike against non-vector
systems. Shallow water models are nicely MPI-friendly, but since they're
shallow, they often don't have very much data, which is a strike against
slower interconnects.

Most clusters have both strikes. They're cost effective, but the absolute
performance level may not be what you want.
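
The memory-bound argument above can be made concrete with a rough
roofline-style estimate. This is only a sketch: the peak rates, memory
bandwidths, and flops-per-byte figure below are made-up illustrative numbers,
not measurements of any machine or shallow-water code mentioned in this thread.

```python
def attainable_gflops(peak_gflops, mem_bw_gbytes, flops_per_byte):
    """Roofline-style bound: a kernel runs no faster than the CPU peak
    and no faster than main memory can feed it."""
    return min(peak_gflops, mem_bw_gbytes * flops_per_byte)

# Hypothetical machines (illustrative numbers only).
cache_based_node = {"peak_gflops": 1.0, "mem_bw_gbytes": 0.5}
vector_node = {"peak_gflops": 2.0, "mem_bw_gbytes": 16.0}

# Hypothetical stencil that does little arithmetic per byte moved.
stencil_intensity = 0.25  # flops per byte, assumed

for name, node in (("cache-based", cache_based_node), ("vector", vector_node)):
    rate = attainable_gflops(node["peak_gflops"], node["mem_bw_gbytes"],
                             stencil_intensity)
    print(f"{name}: {rate:.3f} GFLOP/s attainable of {node['peak_gflops']} peak")
```

With these assumed numbers the cache-based node is limited by memory to a
fraction of its peak, while the vector node is limited by its peak; that is
the first "strike" in a nutshell.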

-- greg


From kragen at pobox.com  Fri Jun 23 15:08:10 2000
From: kragen at pobox.com (Kragen Sitaker)
Date: Fri, 23 Jun 2000 18:08:10 -0400 (EDT)
Subject: Beowulfs can compete with Supercomputers
Message-ID: <Pine.GSO.4.21.0006231800470.4059-100000@kirk.dnaco.net>

Greg Lindahl writes:
> > Also, speaking of weather prediction, does anyone know of any recent
> > advances in getting the shallow-water model to perform well on
> > a cluster?  I was always under the impression that this was a big
> > sticking point for some of the more complex models.
> 
> I may reveal my ignorance here, but:
> 
> Shallow water models are not cache friendly, so the usual problem is that
> they run only as fast as main memory does. Vector machines mostly have
> relatively good main memory systems, so there's a strike against non-vector
> systems. Shallow water models are nicely MPI-friendly, but since they're
> shallow, they often don't have very much data, which is a strike against
> slower interconnects.
> 
> Most clusters have both strikes. They're cost effective, but the absolute
> performance level may not be what you want.

Vector machines and the MTA are going to survive Beowulf, I think; they
are good at some things Beowulfs are not good at, and they are easier
to program.

Massively parallel machines like the T3E (which is a bunch of Alphas!)
are going down.  They're slightly better at Beowulf stuff than
Beowulfs, but they cost too much, and they suck at the same things
Beowulfs suck at.

Keep in mind that all of the above is based on hearsay: reading
marketing littrachaw and convussations on this list, not actual
experience of my own.  :)

-- 
<kragen at pobox.com>       Kragen Sitaker     <http://www.pobox.com/~kragen/>
The Internet stock bubble didn't burst on 1999-11-08.  Hurrah!
<URL:http://www.pobox.com/~kragen/bubble.html>
The power didn't go out on 2000-01-01 either.  :)


From wsb at paralleldata.com  Fri Jun 23 16:29:39 2000
From: wsb at paralleldata.com (W Bauske)
Date: Fri, 23 Jun 2000 18:29:39 -0500
Subject: Beowulfs can compete with Supercomputers [was Beowulf: A theorical 
 approach]
References: <003f01bfdd5b$5263d520$e4844b89@hptilap.hpti.com>
Message-ID: <3953F2E3.1762A819@paralleldata.com>

Greg Lindahl wrote:
> 
> > Have you guys had any experience with comparing the performance of
> > similar codes on an SP versus a fully distributed cluster with similar
> > performance and number of processors?
> 
> No -- there's no Alpha slow enough for such a comparison ;-)
> 

What models have you benchmarked?
Power3II is quite fast, even compared to a 21264.


Wes


From jcownie at etnus.com  Sat Jun 24 05:44:38 2000
From: jcownie at etnus.com (James Cownie)
Date: Sat, 24 Jun 2000 13:44:38 +0100
Subject: Beowulf: A theorical approach 
In-Reply-To: Your message of "Fri, 23 Jun 2000 12:21:52 EDT."
             <002501bfdd2f$23f31d80$e4844b89@hptilap.hpti.com> 
Message-ID: <0cc892845121860PCOW029M@blueyonder.co.uk>

> > Personally I always called it a "cache-coherent explicit remote store
> > access" model,
> 
> But it isn't actually cache coherent. Remember Cray shmem(): To use
> data, you fetch it to be close to you, and then it isn't cache
> coherent while it's local. It IS cache coherent in that the fetch
> gets you the right data.

Wrong. In the CS2 it _was_ cache coherent both locally and remotely.
Transfers could be made from anywhere in the user address space (with
no special allocation or lock-down requirement), and were fully cache
coherent.

That's why I added "cache coherent" to the description, explicitly to
distinguish it from the Cray shmem which I would call "non-cache
coherent explicit remote store access".

Another of the reasons I don't like the SALC description is precisely
this confusion about what the "local consistency" is intended to mean.

Making the NIC properly cache coherent is one of the main reasons to
be on the processor bus, appearing as a second CPU. It allows the NIC
to implement the full coherency protocol when accessing data (either
bringing it in, or sending it out).


-- Jim 

James Cownie	<jcownie at etnus.com>
Etnus, Inc.     +44 117 9071438
http://www.etnus.com


From rgb at phy.duke.edu  Sat Jun 24 09:33:29 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Sat, 24 Jun 2000 12:33:29 -0400 (EDT)
Subject: Neural network on beowulf - query
In-Reply-To: <00062320523502.00969@micinca>
Message-ID: <Pine.LNX.4.10.10006241219530.13495-100000@ganesh.phy.duke.edu>

On Fri, 23 Jun 2000, Ian McLeod wrote:

> Hi,
> 
> I am new to all of this, and I confess my lack of experience.
> 
> I am intrigued by the concept of running a neural network simulator over a
> beowulf cluster of a few friends' PCs, but as I understand it from the manual
> and how-to's, this is easier said than done.
> 
> Only FORTRAN can be run in parallel if I read the literature correctly, C++
> will not work (or not well at all).
> 
> Is there any hope for parallel computing?

Cheeeze.  Of course there is hope for parallel computing, it's done all
the time by everybody on this list (or very nearly so).

First, I don't know what "literature" you are reading but just throw it
away if it says only FORTRAN will run in parallel.  Or use it to light
your fires next winter.  If this were an earlier, more rustic time I'd
suggest putting it in the cob box of the privy.

Second, since you have a LOT to learn before you can even ask sensible
questions, start reading.  The place to start with starting your reading
is probably the beowulf FAQ, followed by the beowulf HOWTO.  Both of
these cross reference numerous resources.  The main beowulf website
(www.beowulf.org) has additional useful links and resources.
http://www.phy.duke.edu/brahma provides both useful resources (including
a very incomplete but still useful draft book on beowulfery and several
talk/presentation/tutorial type things) as well as links to the FAQ, the
Howto, the beowulf site, the beowulf underground site, and other useful
resources.  Eventually you will likely want to buy and read "How to
Build a Beowulf" by Sterling et al. (MIT Press) and either or both of
"PVM" or "MPI" (also by MIT Press, don't remember the authors).

When you've waded through all that, come back and ask again to have
whatever you still don't understand explained.  Your guess is correct --
(the training of) neural network simulators can be run in parallel on a
beowulf cluster or more informal cluster consisting of you and your
friends' machines.  It can be programmed to do so in at least C (which
is what I use), C++ and Fortran, and with a bit more work or at a bit
lower speed one could program it to do so in perl, pascal, or pretty
much any programming environment with floating point and transcendental
support (need those nonlinear functions to morph the sum of the previous
layer's input into a neural output, and everything is in float) and ANY
kind of socket support.  Although it would be a bit insane to do so, one
could probably kludge something together in /bin/sh with awk for floats
and netpipes for the socket layer (yuk!).  In a good language and with a
fast network, one can even get gain (that is, the parallelized
application will run faster than a single-threaded, single hosted one).
This relies on your identifying parallelizable sections of the neural
training cycle (e.g. evaluating the error function on the training set)
and coding them to run in parallel.  This explicit example, by strange
chance, is the basis of one of the talks on brahma.  Which was written
in C, by the way...
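
As a rough illustration of the idea (not the code from the brahma talk, which
was written in C), here is a minimal Python sketch of evaluating the error
function on a training set in parallel. The one-neuron tanh "network", the
training data, and the function names are all invented for the example, and
the thread pool merely stands in for cluster nodes; on a real beowulf each
slice would be shipped to a node over PVM, MPI, or raw sockets.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def neuron_output(weights, bias, inputs):
    """One tanh neuron: nonlinear function of the weighted input sum."""
    return math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)

def partial_error(weights, bias, chunk):
    """Sum of squared errors over one slice of the training set."""
    return sum((neuron_output(weights, bias, x) - target) ** 2
               for x, target in chunk)

def split(data, n_workers):
    """Deal the training set out into n_workers roughly equal slices."""
    return [data[i::n_workers] for i in range(n_workers)]

def parallel_error(weights, bias, data, n_workers=4):
    """Each worker scores its own slice; the master adds up the partial sums."""
    chunks = split(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(lambda c: partial_error(weights, bias, c), chunks)
        return sum(partials)

# Tiny made-up training set: (inputs, target) pairs.
training_set = [((x / 10.0, 1.0 - x / 10.0), math.tanh(x / 10.0))
                for x in range(20)]
weights, bias = (0.5, -0.25), 0.1

serial = partial_error(weights, bias, training_set)
parallel = parallel_error(weights, bias, training_set)
print(serial, parallel)
```

The only data that has to cross the network per evaluation is the current
weights going out and one partial sum coming back from each worker, which is
why this part of the training cycle parallelizes well.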

Good luck.

    rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From rgb at phy.duke.edu  Sat Jun 24 09:41:02 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Sat, 24 Jun 2000 12:41:02 -0400 (EDT)
Subject: Beowulf: A theorical approach
In-Reply-To: <3953694B.F6DD5919@ti.com>
Message-ID: <Pine.LNX.4.10.10006241238240.13495-100000@ganesh.phy.duke.edu>

On Fri, 23 Jun 2000, Christoph Wasshuber wrote:

> I am planning to design an affordable motherboard for Beowulfery
> with a fast low latency network. The current plan is to
> use the idea of PAPERS but scale it to 16, 32, or 64 bit
> parallel with direct access to the CPU. Latencies of <100ns
> should be possible. So the interface itself will be very
> cheap because the hardware of PAPERS is trivial. Hooking this
> up to the host bus, if possible with an FPGA, does not cost much
> more than standard NICs today.
> In case some think that designing a new motherboard is not
> cost effective, I have several quotes for the layout and manufacturing
> of such motherboards. Even with an initial prototype run of only
> 100 motherboards one could achieve a price of ~$150 per motherboard
> for AMD K6 based design (only counting manufacturing and not design).
> 
> I am willing to fund such an effort, but I would need circuit designers
> who are up to the challenge.

It was suggested to me offline that one should consider the AGP bus
itself as a possible interface, as it apparently has a lot of the
desired characteristics (and since beowulf nodes typically don't need
the AGP slot anyway).  I'm not an AGP bus expert by any means (it's hard
enough trying to get PCI specs from the net:-) but this might be worth
looking into.

    rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From vor+ at pitt.edu  Sat Jun 24 11:14:13 2000
From: vor+ at pitt.edu (Victor Ortega)
Date: Sat, 24 Jun 2000 14:14:13 -0400 (EDT)
Subject: Neural network on beowulf - query
In-Reply-To: <Pine.LNX.4.10.10006241219530.13495-100000@ganesh.phy.duke.edu>
Message-ID: <Pine.GSO.3.96L.1000624140537.4190H-100000@unixs1.cis.pitt.edu>

On Sat, 24 Jun 2000, Robert G. Brown wrote:
> On Fri, 23 Jun 2000, Ian McLeod wrote:
> > Only FORTRAN can be run in parallel if I read the literature
> > correctly, C++ will not work (or not well at all).
> 
> First, I don't know what "literature" you are reading but just throw it
> away if it says only FORTRAN will run in parallel.  Or use it to light
> your fires next winter.

Not so fast!  I do believe it's true that only FORTRAN has native
support for parallelism, in the form of vector operations.  MPI and
PVM are libraries external to the languages in which they're used, and
as such, Ian's statement is correct.  It would be wasteful to burn
some good literature just because the guy misunderstood a statement in
it...

Victor


From glindahl at hpti.com  Sat Jun 24 12:28:52 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Sat, 24 Jun 2000 15:28:52 -0400
Subject: Beowulf: A theorical approach 
In-Reply-To: <0cc892845121860PCOW029M@blueyonder.co.uk>
Message-ID: <000001bfde12$6e1c76c0$e4844b89@hptilap.hpti.com>

> > But it isn't actually cache coherent. Remember Cray shmem(): To use
> > data, you fetch it to be close to you, and then it isn't cache
> > coherent while it's local. It IS cache coherent in that the fetch
> > gets you the right data.
>
> Wrong. In the CS2 it _was_ cache coherent both locally and remotely.

I wasn't discussing the CS2, I was discussing the T3E and SALC. Sorry if I
gave a different impression. I don't know of any other machines like the
CS2, nor do I think they're interesting.

> Another of the reasons I don't like the SALC description is precisely
> this confusion about what the "local consistency" is intended to mean.

You can take it up with Bob Numrich; I'm just the messenger. Bob invented
the shmem interface in the first place. shmem is interesting mainly because
it's cheaper to build hardware for it than for things like the CS2.

> Making the NIC properly cache coherent is one of the main reasons to
> be on the processor bus, appearing as a second CPU. It allows the NIC
> to implement the full coherency protocol when accessing data (either
> bringing it in, or sending it out).

That's far more expensive and difficult than the other benefit of getting on
the processor bus: reduced latency.

-- g


From glindahl at hpti.com  Sat Jun 24 12:38:25 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Sat, 24 Jun 2000 15:38:25 -0400
Subject: Beowulfs can compete with Supercomputers [was Beowulf: A theorical approach]
In-Reply-To: <3953F2E3.1762A819@paralleldata.com>
Message-ID: <000101bfde13$c3bc36a0$e4844b89@hptilap.hpti.com>

> > > Have you guys had any experience with comparing the performance of
> > > similar codes on an SP versus a fully distributed cluster with similar
> > > performance and number of processors?
> >
> > No -- there's no Alpha slow enough for such a comparison ;-)
> >
>
> What models have you benchmarked?
> Power3II is quite fast, even compared to a 21264.

No, it's considerably slower. Look at the SPEC95fp results. As for the
graph, it's

http://www.mmm.ucar.edu/mm5/mpp/helpdesk/20000106.html

The graph is a bit misleading; the Compaq system that apparently is faster
than mine at the same clock has DDR SRAM, which I now offer. The extremely
slow IBM SP result is a 375 MHz Power3.

-- greg


From glindahl at hpti.com  Sat Jun 24 12:41:46 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Sat, 24 Jun 2000 15:41:46 -0400
Subject: Beowulfs can compete with Supercomputers [was Beowulf: A theorical approach]
In-Reply-To: <395491FF.61DBCDBD@moene.indiv.nluug.nl>
Message-ID: <000201bfde14$3b85cd40$e4844b89@hptilap.hpti.com>

> :-) Indeed, but the reason why this one is wrong is easy:  Yes, it is
> hard to change codes from a shared memory threaded model (e.g. OpenMP)
> to a distributed memory model.  However, the field of weather
> forecasting is probably one of the few remaining where programmer time
> is negligible when compared to other costs, like the global observation
> system (including satellites).

This is true. However, it's not always so hard to parallelize codes. For
example, if you have a stencil code or a spectral weather code, the SMS
system from FSL can be used to parallelize a serial version in just a few
programmer days of effort. Check out:

http://www-ad.fsl.noaa.gov/ac/sms.html

SMS is a good example of a domain-specific system that provides the benefits
that HPF promised, at the promised cost.

I would encourage folks to examine systems like SMS very carefully. It has
limitations -- limited F90 support -- but if your code does what it does
well, you're in great shape. My PPM code would parallelize fairly easily
using it.

-- greg


From dek_ml at konerding.com  Sat Jun 24 14:21:48 2000
From: dek_ml at konerding.com (dek_ml at konerding.com)
Date: Sat, 24 Jun 2000 14:21:48 -0700
Subject: beowulf performance with MPI
Message-ID: <200006242121.OAA09125@adsl-63-202-25-210.dsl.snfc21.pacbell.net>

Tony Skjellum writes:
>You may find our free MPI - MPI/Pro for TCP+SMP for Linux - interesting.
>
>Anthony Skjellum, PhD, President (tony at mpi-softtech.com) 
>MPI Software Technology, Inc., Ste. 33, 101 S. Lafayette, Starkville, MS 39759
>+1-(662)320-4300 x15; FAX: +1-(662)320-4301; http://www.mpi-softtech.com

I should mention that I downloaded this software and I found that it
worked great.  I was getting crappy scaling with my software of interest
(AMBER6, see http://www.amber.ucsf.edu).  My cluster is 6 dual P-III 600MHz
w/ 256MB RAM, one of which is the master and 5 of which are compute servers.
Interconnect is simply 100BT (eepro, 2.2.16) with a 100BT 8-port switch
connecting them.  The switch was only $150, nothing impressive. 
AMBER6 is compiled using either LAM or MPICH, the latest respective versions.

AMBER was only going 4 times faster than 1 CPU using all 10 CPUs of
the system.  I was pretty much all but convinced that I needed to scale up
the interconnect to giga-net or myrinet, at very high relative cost.
However, I downloaded the MPI/Pro for TCP+SMP for Linux, and gave it a try
with AMBER.  The scaling is remarkably better!  In particular, here are the
numbers for the performance:

SIMULATION SYSTEM: DHFR in water, 23558 atoms
SIMULATION PARAMETERS: PME, 62.2x62.2x62.2 box, 1000 steps
COMPUTER SYSTEM: 6 dual Pentium-III 600MHz (100MHz bus) running Red Hat
6.2 connected by 100BT switch.  Total cost $15,000 at time of purchase,
early 2000.  Each machine approx $2460 + one 27GB hard drive ($250) + one
100BT switch ($149)

AMBER COMPILATION: g77 (egcs-1.1.2), flags:   -O3  -m486 -malign-double
-ffast-math -fno-strength-reduce

CPUs    Time (sec)    Speedup over 1 g77 CPU    MPI
 1      5539          1.00

 8      1429          3.79                      mpich
10      1358          3.99                      mpich

 8       794          6.97                      mpipro
10       692          8.00                      mpipro

"Time" is wallclock time spent actually calculating the simulation,
not any setup or I/O time. 
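
For readers who want to redo the arithmetic, a small Python sketch that
recomputes speedup and parallel efficiency from the wallclock times listed
above. The function names are just for illustration. Dividing the listed
times gives roughly 3.88 and 4.08 for the mpich rows, a little above the 3.79
and 3.99 in the table; the post does not say why, so treat the recomputed
figures as approximate.

```python
# Wallclock times (seconds) copied from the table above.
serial_time = 5539
runs = [
    ("mpich", 8, 1429),
    ("mpich", 10, 1358),
    ("mpipro", 8, 794),
    ("mpipro", 10, 692),
]

def speedup(t1, tp):
    """Speedup relative to the 1-CPU run."""
    return t1 / tp

def efficiency(t1, tp, cpus):
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup(t1, tp) / cpus

for mpi, cpus, seconds in runs:
    s = speedup(serial_time, seconds)
    e = efficiency(serial_time, seconds, cpus)
    print(f"{mpi:7s} {cpus:2d} CPUs  speedup {s:5.2f}  efficiency {e:4.0%}")
```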

I compared the profiling of the two simulations and it appears that
much of the time savings came from a significantly faster MPI_ALLGATHERV,
which AMBER uses to distribute out the new particle positions and
velocities at the end of each timestep.  The allgather occurs in a serialized
section of the code, and therefore scaling is highly dependent on the
performance of the implementation.

I have spoken with MPI/Pro to find out a little bit more.  Actually,
there is no specific SMP optimization as yet; in fact, communication
will go through the localhost network code.  However, the design is multi-
threaded and doesn't poll the way MPICH does. I suspect also more effort
has gone into optimizing some of the MPI routines which are implemented
in MPICH with fairly naive code.

Overall the program was quite easy to work with. After downloading
the RPM and installing it on the master, I just created a file called
"/etc/machines" listing all the client nodes by their hostnames,
then recompiled my app with the MPI/Pro provided "mpicc" and "mpif77"
scripts, and ran the app with the provided "mpirun" script.  The syntax
is very similar to MPICH, and it integrates straightforwardly with
our queueing system, PBS, through the use of the PBS_NODEFILE environment
variable.
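
As a sketch of the PBS side: PBS normally writes the hosts assigned to a job,
one per line and repeated once per CPU slot, to the file named by
PBS_NODEFILE. Assuming that layout, a small Python helper can turn it into
per-host process counts. The hostnames and function names below are made up,
and how MPI/Pro's own mpirun consumes the list is not shown here.

```python
import os
from collections import Counter

def slots_per_host(nodefile_text):
    """Count how many CPU slots each host got; blank lines are ignored."""
    hosts = [line.strip() for line in nodefile_text.splitlines() if line.strip()]
    return Counter(hosts)

def read_pbs_nodefile(environ=os.environ):
    """Read the file named by PBS_NODEFILE, or return None outside a PBS job."""
    path = environ.get("PBS_NODEFILE")
    if path is None:
        return None
    with open(path) as handle:
        return slots_per_host(handle.read())

# Hypothetical nodefile contents for two dual-CPU nodes.
example = "node1\nnode1\nnode2\nnode2\n"
counts = slots_per_host(example)
print(dict(counts), "total processes:", sum(counts.values()))
```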

Dave


From glindahl at hpti.com  Sat Jun 24 14:36:38 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Sat, 24 Jun 2000 17:36:38 -0400
Subject: beowulf performance with MPI
In-Reply-To: <200006242121.OAA09125@adsl-63-202-25-210.dsl.snfc21.pacbell.net>
Message-ID: <000601bfde24$46fc6c00$e4844b89@hptilap.hpti.com>

> I compared the profiling of the two simulations and it appears that
> much of the time savings came from a significantly faster MPI_ALLGATHERV,

... which is one of the small number of functions in mpich that could use a
rewrite. For a fairly small amount of effort, collective operations can be
much faster.
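
To illustrate why the algorithm behind a collective matters, here is a small
Python simulation of a ring allgather, one common textbook approach. It is a
sketch only and makes no claim about how mpich or MPI/Pro actually implement
MPI_ALLGATHERV; the function name and data are invented for the example.

```python
def ring_allgather(blocks):
    """Simulate an allgather over p ranks arranged in a ring.

    blocks[r] is the data rank r starts with. In each of p - 1 steps every
    rank passes along the block it received most recently to its right-hand
    neighbour, so no single rank has to relay all of the data.
    """
    p = len(blocks)
    gathered = [{r: blocks[r]} for r in range(p)]   # what each rank holds
    in_flight = list(range(p))                      # block index each rank sends next
    for _ in range(p - 1):
        arriving = [in_flight[(r - 1) % p] for r in range(p)]  # from left neighbour
        for r in range(p):
            gathered[r][arriving[r]] = blocks[arriving[r]]
        in_flight = arriving
    # Every rank assembles the blocks in rank order.
    return [[gathered[r][i] for i in range(p)] for r in range(p)]

# Blocks may differ in length, as with MPI_ALLGATHERV.
start = [[0.0], [1.0, 1.5], [2.0], [3.0, 3.5, 3.75]]
result = ring_allgather(start)
print(result[0])
```

In the ring every link carries one block per step, whereas a naive
gather-to-root followed by a broadcast funnels all of the traffic through one
node, which hurts on switched Ethernet.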

-- g


From wsb at paralleldata.com  Sat Jun 24 16:16:44 2000
From: wsb at paralleldata.com (W Bauske)
Date: Sat, 24 Jun 2000 18:16:44 -0500
Subject: Beowulfs can compete with Supercomputers [was Beowulf: A theorical 
 approach]
References: <000101bfde13$c3bc36a0$e4844b89@hptilap.hpti.com>
Message-ID: <3955415C.F2E0766F@paralleldata.com>

Greg Lindahl wrote:
> 
> > > > Have you guys had any experience with comparing the performance of
> > > > similar codes on an SP versus a fully distributed cluster with similar
> > > > performance and number of processors?
> > >
> > > No -- there's no Alpha slow enough for such a comparison ;-)
> > >
> >
> > What models have you benchmarked?
> > Power3II is quite fast, even compared to a 21264.
> 
> No, it's considerably slower. Look at the SPEC95fp results. 

Ah yes, SPEC, references to it in a minute.

> As for the
> graph, it's
> 
> http://www.mmm.ucar.edu/mm5/mpp/helpdesk/20000106.html
> 
> The graph is a bit misleading; the Compaq system that apparently is faster
> than mine at the same clock has DDR SRAM, which I now offer. The extremely
> slow IBM SP result is a 375 MHz Power3.
> 

To clarify, the P3II is SP-WH2, second only to Alphas at
larger numbers of processors, with this code.

Interesting, but, the SC667 is no beowulf. Neither are 
almost all the machines on the chart except that little
PIII line on the bottom. That pretty much takes us out
of the realm of this group but to continue for a short bit.
It is a realm of computing I'm interested in.

First, a plot of linear speed up would put into perspective
exactly how this code scales. That tail off tells me why
you're interested in high speed interconnects, e.g., Myrinet.
You would be better off running as two 128 node systems 
than as a 256 node system, assuming the problem can be solved 
on 128 nodes. Some of them probably can't so you suffer. Four 
64 node systems would be even better as would eight 32 node 
systems from a throughput point of view.

This leads us into comparing system interconnects. As an
example of SPEC in action, vs this chart, the SGI O2 400
manages to outperform the SP WH2 and matches the ACL/667, 
even though SPEC says it shouldn't. That is most likely
due to this code not being able to keep the cpus busy doing
useful work. In other words, the interconnect is too slow
for the processor. So, you have options. First, live with 
it. Second, alter the algorithm to require less communication.
(no easy task) Or, third, look for faster interconnects.
If you want improvement, the shortest route to speedups would
probably be to buy a faster interconnect.

Mostly this chart says to me this problem has a substantial
communication portion to it. I'd be interested in a plot of 
average cpu utilization during this code run to see exactly 
how well it used the cpus on each architecture. Unless, it uses 
a spin loop waiting on the interconnect to improve latency. 
In that case, it's difficult to see what's really going on.

One last comment. In the notes section, the SC667 is actually
using only one or two cpus per 4 cpu node. That would indicate
the SC667 nodes run out of bandwidth somewhere since they chose not 
to post the 4 processors per node run. The WH2 runs use all 4 cpus 
on each node. So, basically you pay for twice as much machine to 
get that level of performance. That conclusion is also supported 
by the streams results for an ES40 which are the nodes in an SC.
It runs out of memory bandwidth. Changing the number of processors 
used on a node can impact both processor performance (memory) and 
interconnect bandwidth per processor. To put it bluntly, looks to 
me like Compaq was less than honest in their runs. No one will buy 
a system and use it that way, at least not a real company that would 
use the system to make money.


Wes


From glindahl at hpti.com  Sat Jun 24 20:24:49 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Sat, 24 Jun 2000 23:24:49 -0400
Subject: Beowulfs can compete with Supercomputers [was Beowulf: A theorical approach]
In-Reply-To: <3955415C.F2E0766F@paralleldata.com>
Message-ID: <000001bfde54$eb96a200$e4844b89@hptilap.hpti.com>

> Interesting, but, the SC667 is no beowulf.

The SC667 is extremely similar to the IBM SP. The question was asking about
a comparison between an AlphaLinux/Myrinet cluster and an IBM SP.

> First, a plot of linear speed up would put into perspective
> exactly how this code scales. That tail off tells me why
> you're interested in high speed interconnects, e.g., Myrinet.

Not really; mm5's scaling is hurt by load imbalance more than interconnect.

> This leads us into comparing system interconnects. As an
> example of SPEC in action, vs this chart, the SGI O2 400
> manages to outperform the SP WH2 and matches the ACL/667,
> even though SPEC says it shouldn't.

That could be because it's extremely unlike SPEC. It may be scaling like one
of the component benchmarks in SPEC, which are pretty wildly different. I
assure you that the ACL/667 beat the snot out of the O2 400 mhz on the
overall FSL benchmarks.

> That is most likely
> due to this code not being able to keep the cpus busy doing
> useful work.

No, I measured for that, and it's a load imbalance. Yes, pretty much
everyone's high-speed interconnects use busy-wait loops when blocking for
messages, so CPU utilization %'s aren't useful for figuring out when
someone's hung. The mpich ch_p4 device does that on the sending side, for
example, if it can't get all the data into the kernel buffer. So I used an
mpi profiling gizmo, and compared the application cpu time for various
nodes.
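
A minimal Python sketch of that kind of comparison, assuming you already have
the application CPU time for each rank from a profiler. The numbers are
hypothetical, not MM5 measurements, and the function name is invented.

```python
def load_imbalance(cpu_seconds):
    """Return (imbalance, efficiency bound) from per-rank application CPU time.

    imbalance  = max / mean - 1   (0 means perfectly balanced)
    efficiency = mean / max       (best case if everyone waits for the slowest)
    """
    mean = sum(cpu_seconds) / len(cpu_seconds)
    slowest = max(cpu_seconds)
    return slowest / mean - 1.0, mean / slowest

# Hypothetical per-rank application CPU times in seconds (not MM5 data).
per_rank = [90.0, 100.0, 110.0, 140.0]
imbalance, bound = load_imbalance(per_rank)
print(f"imbalance {imbalance:.1%}, efficiency bound {bound:.1%}")
```

If the bound computed this way already explains most of the lost scaling, a
faster interconnect will not buy much.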

> One last comment. In the notes section, the SC667 is actually
> using only one or two cpus per 4 cpu node. That would indicate
> the SC667 nodes run out of bandwidth somewhere since they chose not
> to post the 4 processors per node run.

It is not considered correct by most folks in the industry to run a
benchmark that way. Most RFPs and formal benchmark situations prohibit such
runs, unless the actual product is a 4 cpu chassis with 2 cpus in it and 2
empty slots.

-- g


From tony at MPI-Softtech.Com  Sun Jun 25 07:20:17 2000
From: tony at MPI-Softtech.Com (Tony Skjellum)
Date: Sun, 25 Jun 2000 09:20:17 -0500 (CDT)
Subject: beowulf performance with MPI
In-Reply-To: <000601bfde24$46fc6c00$e4844b89@hptilap.hpti.com>
Message-ID: <Pine.GSO.4.10.10006250917080.9551-100000@mpi.mpi-softtech.com>

Greg, to be perfectly technical, MPI/Pro also achieves about 10% higher
large message bandwidth in our experiments compared to other MPI's over
TCP over Ethernet.  So, it is not just a matter of rewriting the
collectives, the overall middleware design and implementation matters too.

Since it works, users should take advantage of state of the art MPI,
not wait around and hope.

As may be pointed out, there is room for improvement in our product too,
in some areas, and we're working that very aggressively.

Tony

Anthony Skjellum, PhD, President (tony at mpi-softtech.com) 
MPI Software Technology, Inc., Ste. 33, 101 S. Lafayette, Starkville, MS 39759
+1-(662)320-4300 x15; FAX: +1-(662)320-4301; http://www.mpi-softtech.com
"Best-of-breed Software for Beowulf and Easy-to-Own Commercial Clusters."

On Sat, 24 Jun 2000, Greg Lindahl wrote:

> > I compared the profiling of the two simulations and it appears that
> > much of the time savings came from a significantly faster MPI_ALLGATHERV,
> 
> ... which is one of the small number of functions in mpich that could use a
> rewrite. For a fairly small amount of effort, collective operations can be
> much faster.
> 
> -- g


From glindahl at hpti.com  Sun Jun 25 13:57:17 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Sun, 25 Jun 2000 16:57:17 -0400
Subject: beowulf performance with MPI
In-Reply-To: <Pine.GSO.4.10.10006250917080.9551-100000@mpi.mpi-softtech.com>
Message-ID: <000001bfdee7$f2792320$e4844b89@hptilap.hpti.com>

> Since it works, users should take advantage of state of the art MPI,
> not wait around and hope.

Gee, Tony, I never realized this mailing list was the right place for a
sales pitch. I was talking about things in mpich that would be useful to
rewrite, not whether or not people should buy your product.

-- g


From tony at MPI-Softtech.Com  Sun Jun 25 14:01:10 2000
From: tony at MPI-Softtech.Com (Tony Skjellum)
Date: Sun, 25 Jun 2000 16:01:10 -0500 (CDT)
Subject: beowulf performance with MPI
In-Reply-To: <000001bfdee7$f2792320$e4844b89@hptilap.hpti.com>
Message-ID: <Pine.GSO.4.10.10006251600000.1395-100000@mpi.mpi-softtech.com>

Greg,

I disagree that this was a sales pitch, this is just an argumentative
response.

When you cannot win on technical grounds, you switch to netiquette.

Tony


Anthony Skjellum, PhD, President (tony at mpi-softtech.com) 
MPI Software Technology, Inc., Ste. 33, 101 S. Lafayette, Starkville, MS 39759
+1-(662)320-4300 x15; FAX: +1-(662)320-4301; http://www.mpi-softtech.com
"Best-of-breed Software for Beowulf and Easy-to-Own Commercial Clusters."

On Sun, 25 Jun 2000, Greg Lindahl wrote:

> > Since it works, users should take advantage of state of the art MPI,
> > not wait around and hope.
> 
> Gee, Tony, I never realized this mailing list was the right place for a
> sales pitch. I was talking about things in mpich that would be useful to
> rewrite, not whether or not people should buy your product.
> 
> -- g
> 


From tony at MPI-Softtech.Com  Sun Jun 25 14:16:26 2000
From: tony at MPI-Softtech.Com (Tony Skjellum)
Date: Sun, 25 Jun 2000 16:16:26 -0500 (CDT)
Subject: beowulf performance with MPI
In-Reply-To: <Pine.GSO.4.10.10006251600000.1395-100000@mpi.mpi-softtech.com>
Message-ID: <Pine.GSO.4.10.10006251614310.1395-100000@mpi.mpi-softtech.com>

Folks,

Let me clarify even further before we waste more bandwidth on this,
and people get upset by Greg...

1) I was speaking about something that is totally free
2) My comments were directly aimed at the fact that free-to-free
   comparison suggests that people use the existing free software from us
   for TCP+Linux, rather than wait and hope that someone updates some
   other free software for them.

If anyone wants to debate my judgement in responding to Greg's comments,
please send me a note offline so that the rest of the group can get back
to business.

Thanks for your time.
Tony


Anthony Skjellum, PhD, President (tony at mpi-softtech.com) 
MPI Software Technology, Inc., Ste. 33, 101 S. Lafayette, Starkville, MS 39759
+1-(662)320-4300 x15; FAX: +1-(662)320-4301; http://www.mpi-softtech.com
"Best-of-breed Software for Beowulf and Easy-to-Own Commercial Clusters."

On Sun, 25 Jun 2000, Tony Skjellum wrote:

> Greg,
> 
> I disagree that this was a sales pitch, this is just an argumentative
> response.
> 
> When you cannot win on technical grounds, you switch to netiquette.
> 
> Tony
> 
> 
> Anthony Skjellum, PhD, President (tony at mpi-softtech.com) 
> MPI Software Technology, Inc., Ste. 33, 101 S. Lafayette, Starkville, MS 39759
> +1-(662)320-4300 x15; FAX: +1-(662)320-4301; http://www.mpi-softtech.com
> "Best-of-breed Software for Beowulf and Easy-to-Own Commercial Clusters."
> 
> On Sun, 25 Jun 2000, Greg Lindahl wrote:
> 
> > > Since it works, users should take advantage of state of the art MPI,
> > > not wait around and hope.
> > 
> > Gee, Tony, I never realized this mailing list was the right place for a
> > sales pitch. I was talking about things in mpich that would be useful to
> > rewrite, not whether or not people should buy your product.
> > 
> > -- g
> > 


From glindahl at hpti.com  Sun Jun 25 14:21:43 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Sun, 25 Jun 2000 17:21:43 -0400
Subject: beowulf performance with MPI
In-Reply-To: <Pine.GSO.4.10.10006251614310.1395-100000@mpi.mpi-softtech.com>
Message-ID: <000201bfdeeb$5bfdbe20$e4844b89@hptilap.hpti.com>

> Let me clarify even further before we waste more bandwidth on this,
> and people get upset by Greg...
>
> 1) I was speaking about something that is totally free

Tony, if you knew this community better, you would be aware that "free beer"
is different from "free software", and that calling any software "totally
free" is a really bad idea.

If your product is now free in the liberty sense, that would be news, but I
think you're probably just confusing the two.

> 2) My comments were directly aimed at the fact that free-to-free
>    comparison suggests that people use the existing free software from us
>    for TCP+Linux, rather than wait and hope that someone updates some
>    other free software for them.

My comments were about modifying mpich, not about waiting or hoping. I said
nothing about either. If you wish to reply to my comments in the future, I
would suggest replying to what I said, not what you read into it.

-- g


From tony at MPI-Softtech.Com  Sun Jun 25 14:29:06 2000
From: tony at MPI-Softtech.Com (Tony Skjellum)
Date: Sun, 25 Jun 2000 16:29:06 -0500 (CDT)
Subject: beowulf performance with MPI
In-Reply-To: <000201bfdeeb$5bfdbe20$e4844b89@hptilap.hpti.com>
Message-ID: <Pine.GSO.4.10.10006251628290.1395-100000@mpi.mpi-softtech.com>

Greg, you are obviously right about knowing what you meant.  I too know
what I meant.  So, I concede your point, so you can go on to other
postings.

_Tony

Anthony Skjellum, PhD, President (tony at mpi-softtech.com) 
MPI Software Technology, Inc., Ste. 33, 101 S. Lafayette, Starkville, MS 39759
+1-(662)320-4300 x15; FAX: +1-(662)320-4301; http://www.mpi-softtech.com
"Best-of-breed Software for Beowulf and Easy-to-Own Commercial Clusters."

On Sun, 25 Jun 2000, Greg Lindahl wrote:

> > Let me clarify even further before we waste more bandwidth on this,
> > and people get upset by Greg...
> >
> > 1) I was speaking about something that is totally free
> 
> Tony, if you knew this community better, you would be aware that "free beer"
> is different from "free software", and that calling any software "totally
> free" is a really bad idea.
> 
> If your product is now free in the liberty sense, that would be news, but I
> think you're probably just confusing the two.
> 
> > 2) My comments were directly aimed at the fact that free-to-free
> >    comparison suggests that people use the existing free software from us
> >    for TCP+Linux, rather than wait and hope that someone updates some
> >    other free software for them.
> 
> My comments were about modifying mpich, not about waiting or hoping. I said
> nothing about either. If you wish to reply to my comments in the future, I
> would suggest replying to what I said, not what you read into it.
> 
> -- g
> 


From dan at quetzalcoatl.com  Sun Jun 25 13:58:58 2000
From: dan at quetzalcoatl.com (Daniel Fuka)
Date: Sun, 25 Jun 2000 14:58:58 -0600
Subject: beowulf performance with MPI
References: <000001bfdee7$f2792320$e4844b89@hptilap.hpti.com>
Message-ID: <001301bfdee8$41222940$0300a8c0@syncrasy.com>

Oddly enough, I sometimes get a little (emphasize the little) information
from the sales pitches. I just wish that we could limit it to what they
would like to personally write between the hours of 2 and 3 am (whatever
time zone they are in) on the third Sunday of each month.

Sorry for wasting space.
dan

----- Original Message -----
From: Greg Lindahl <glindahl at hpti.com>
To: Tony Skjellum <tony at MPI-Softtech.Com>
Cc: <beowulf at beowulf.org>
Sent: Sunday, June 25, 2000 2:57 PM
Subject: RE: beowulf performance with MPI


> > Since it works, users should take advantage of state of the art MPI,
> > not wait around and hope.
>
> Gee, Tony, I never realized this mailing list was the right place for a
> sales pitch. I was talking about things in mpich that would be useful to
> rewrite, not whether or not people should buy your product.
>
> -- g


From gerry at cs.tamu.edu  Sun Jun 25 16:47:28 2000
From: gerry at cs.tamu.edu (Gerry Creager N5JXS)
Date: Sun, 25 Jun 2000 18:47:28 -0500
Subject: beowulf performance with MPI
References: <Pine.GSO.4.10.10006251600000.1395-100000@mpi.mpi-softtech.com>
Message-ID: <39569A10.78B9A9ED@cs.tamu.edu>

Tony Skjellum wrote:
> 
> Greg,
> 
> I disagree that this was a sales pitch, this is just an argumentative
> response.
> 
> When you cannot win on technical grounds, you switch to netiquette.

No, Tony, it looked like a sales pith (pitch?) even to me.

OK, guys, let it drop and go on to another topic.
--
Gerry Creager		gerry at cs.tamu.edu, gerry at page4.cs.tamu.edu
Network Engineering			|Research focusing on
Computer Science Department		|Satellite Geodesy and 
Texas A&M University			|Geodetic Control
979.458.4020  (Phone) -- 979.847.8578  (Fax)


From wasshub at ti.com  Mon Jun 26 06:18:16 2000
From: wasshub at ti.com (Christoph Wasshuber)
Date: Mon, 26 Jun 2000 08:18:16 -0500
Subject: Beowulf: A theorical approach
References: <Pine.LNX.4.10.10006241238240.13495-100000@ganesh.phy.duke.edu>
Message-ID: <39575818.53C9A46@ti.com>

"Robert G. Brown" wrote:

> It was suggested to me offline that one should consider the AGP bus
> itself as a possible interface, as it apparently has a lot of the
> desired characteristics (and since beowulf nodes typically don't need
> the AGP slot anyway).  I'm not an AGP bus expert by any means (it's hard
> enough trying to get PCI specs from the net:-) but this might be worth
> looking into.

Thinking about it a little longer, this looks to me as not such a bad
idea. Has anybody more information on the AGP bus? Or is there maybe
an AGP bus expert lurking on this list?
My only concern is that AGP has to pass through the North Bridge, which
should increase latency quite a bit.

Chris....


From dsg at super.org  Mon Jun 26 09:06:34 2000
From: dsg at super.org (David S. Greenberg)
Date: Mon, 26 Jun 2000 16:06:34 +0000
Subject: Beowulf: A theorical approach
References: <000001bfde12$6e1c76c0$e4844b89@hptilap.hpti.com>
Message-ID: <39577F8A.99029146@super.org>

I'm at least partially responsible for coining the acronym SALC.  I got tired
of talking about "the T3E memory model" when I meant something more general
such as Quadrics.
Several of us  (including some hardware architects, compiler writers, and
applications writers) believe that it is important to populate the space in
between pure distributed memory and pure shared memory models.  I'll describe
why we want something different by first summarizing our problems with the
extrema models and then describing what we believe is necessary.

"A pox on both your houses"
The pure distributed memory models tend to require message passing, emphasize
portability over all else, and please those whose applications are relatively
loosely coupled -- in a word beowulfers.  The pure shared memory models tend to
push flat memory spaces, emphasize the need for global cache coherency, and
please those whose applications have not been tailored to parallelism and data
locality.
The hardware costs (in design, scalability, etc.) of a globally shared and
cache consistent memory can be huge.  Most of the mainstream shared memory
systems (SGI, Sun, IBM, Compaq) are, in fact, patinas over distributed memory.
Going beyond 64 processors gets difficult -- the systems tend to become
unstable and not worth the increased design costs.  I am convinced that if you
want shared memory of this type then you are better off following more radical
designs such as the Cray/Tera MTA.
On the other hand, as has been pointed out by several people in this thread,
classic ethernet-connected beowulfs can have limited applications scope.  They
are fantastic for some applications but quickly top out at 8 to 16 processors
for other applications.  Faster (higher bandwidth and lower latency) networks
help somewhat but the processors (and nodes) are becoming more powerful faster
than the interconnects are improving.   PCI-X will probably be a big step
forward but almost certainly not enough.  (I suspect that Infiniband, like AGP,
will be more of a marketing coup than a technical breakthrough, except that it
will take all the credit for PCI-X.)
As an example of our plight, consider that the ASCI Red machine and Cray T3E
machines had full-featured, 400 MB/s interconnects four years ago with
processing nodes which are anemic compared to today's nodes.  PCI-X, when it
appears, will maybe get us back to this level of bandwidth with many fewer nice
features in early PCI-X NICs/systems than were in the MPP machines.

We deserve better but are not greedy.
Which is where the SALC model comes in.  It is possible, Quadrics is a good
example, to augment NICs with the ability to perform remote direct memory
access.  "Standards" such as VIA make this an option for hardware.  I think we
must make it a requirement.  Furthermore, we need to, as a community, decide
what attributes are necessary.  I believe, based on my experience with Portals
at Sandia and UPC here at CCS, that fairly simple hardware is sufficient.  We
do not need all the bells and whistles of Cray's e-registers (though they are
very nice.)  We do not need to support in hardware all the notification modes
of MPI-2 one-sided (though every mode has its strong adherents.)  We do not
need the on-NIC virtual-to-real address translation of Quadrics or Matt Welsh's
Myrinet control program (though they are certainly nice also.)
Instead, we need the ability to register large chunks of contiguous physical
memory with the NIC and associate them with a tag usable from user-space on all
nodes.  The registering is likely to occur in a library or be done by the
compiler.  Next we need the ability to issue remote loads and stores by issuing
just a couple of machine instructions (ideally just two loads or a load and a
store).  These instructions should take advantage of all the processor
machinery for outstanding loads and delayed writes.  Lastly we need a
relatively efficient synchronization primitive which allows us (again usually a
library or the compiler) to create a fence guaranteeing that previous remote
operations have completed.  All this should be possible in a $100 NIC and
require little or no increase in switch complexity.
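
The three requirements above (tagged registration, remote load/store, and a
fence) can be illustrated with a toy, single-process model in C.  This is only
a sketch under stated assumptions: every name in it (salc_register, salc_store,
salc_load, salc_fence) is hypothetical and is not the API of Quadrics, VIA,
Portals, or shmem, and stores are queued until the fence purely to mimic
delayed writes.

```c
#include <assert.h>
#include <stddef.h>

#define SALC_MAX_REGIONS 8    /* hypothetical limits for the toy model */
#define SALC_MAX_PENDING 64

/* A registered chunk of contiguous memory, addressed by its tag. */
struct salc_region {
    double *base;
    size_t  len;
};

/* A store that has been issued but not yet made visible. */
struct salc_pending {
    int    tag;
    size_t index;
    double value;
};

static struct salc_region  regions[SALC_MAX_REGIONS];
static struct salc_pending pending[SALC_MAX_PENDING];
static size_t              npending;

/* Associate a chunk of memory with a tag; 0 on success, -1 on error. */
int salc_register(int tag, double *base, size_t len)
{
    if (tag < 0 || tag >= SALC_MAX_REGIONS || base == NULL)
        return -1;
    regions[tag].base = base;
    regions[tag].len  = len;
    return 0;
}

/* Issue a "remote" store; it is only queued until the next fence. */
int salc_store(int tag, size_t index, double value)
{
    assert(tag >= 0 && tag < SALC_MAX_REGIONS);
    if (regions[tag].base == NULL || index >= regions[tag].len ||
        npending == SALC_MAX_PENDING)
        return -1;
    pending[npending].tag   = tag;
    pending[npending].index = index;
    pending[npending].value = value;
    npending++;
    return 0;
}

/* "Remote" load: returns a private copy; nothing keeps it coherent later. */
double salc_load(int tag, size_t index)
{
    assert(tag >= 0 && tag < SALC_MAX_REGIONS);
    assert(regions[tag].base != NULL && index < regions[tag].len);
    return regions[tag].base[index];
}

/* Fence: all previously issued stores are completed before it returns. */
void salc_fence(void)
{
    size_t i;

    for (i = 0; i < npending; i++)
        regions[pending[i].tag].base[pending[i].index] = pending[i].value;
    npending = 0;
}
```

In this model a value fetched with salc_load() before a store and fence stays
at its old value in the caller's variable, while a fresh salc_load() after the
fence sees the new one.  That is the "no automatic notification" behaviour the
name is supposed to warn about.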

So please help us move forward.
(1) Propose a better name than SALC.  We need to keep the notion that addresses
are somehow shared, i.e. you can address most if not all of the memory of
remote nodes with simple efficient load/store type operations.  We also need to
convey the warning that once a datum is fetched there will be no automatic
notification to the fetcher when someone else has written to the fetched
location.
(2) Think about the real needs of your codes.  Can you save lots of space by
not copying boundary regions?  What are your "memes"?  Do you use shared work
queues?  Do you use ghost cells?  How do you load balance?  How do you improve
data locality and reuse?  How do you synchronize?
(3) Demand more from compiler writers, system architects, and machine
purchasers.  Beowulfs are nice but you can have more if you push for it.
I think I've passed my "one-screen per posting" limit so I'll stop here but I'd
love to continue the discussion one-on-one with anyone who cares to contact
me.  Or come to the Atlanta Linux Showcase and Conference in October.
David

Greg Lindahl wrote:

> > > But it isn't actually cache coherent. Remember Cray shmem(): To use
> > > data, you fetch it to be close to you, and then it isn't cache
> > > coherent while it's local. It IS cache coherent in that the fetch
> > > gets you the right data.
> >
> > Wrong. In the CS2 it _was_ cache coherent both locally and remotely.
>
> I wasn't discussing the CS2, I was discussing the T3E and SALC. Sorry if I
> gave a different impression. I don't know of any other machines like the
> CS2, nor do I think they're interesting.
>
> > Another of the reasons I don't like the SALC description is precisely
> > this confusion about what the "local consistency" is intended to mean.
>
> You can take it up with Bob Numrich; I'm just the messenger. Bob invented
> the shmem interface in the first place. shmem is interesting mainly because
> it's cheaper to build hardware for it than for things like the CS2.
>
> > Making the NIC properly cache coherent is one of the main reasons to
> > be on the processor bus, appearing as a second CPU. It allows the NIC
> > to implement the full coherency protocol when accessing data (either
> > bringing it in, or sending it out).
>
> That's far more expensive and difficult than the other benefit of getting on
> the processor bus: reduced latency.
>
> -- g


From deadline at plogic.com  Mon Jun 26 08:03:29 2000
From: deadline at plogic.com (Douglas Eadline)
Date: Mon, 26 Jun 2000 11:03:29 -0400 (EDT)
Subject: Beowulf: A theorical approach
In-Reply-To: <39575818.53C9A46@ti.com>
Message-ID: <Pine.LNX.4.10.10006261059510.4735-100000@plogic.com>

On Mon, 26 Jun 2000, Christoph Wasshuber wrote:

> "Robert G. Brown" wrote:
> 
> > It was suggested to me offline that one should consider the AGP bus
> > itself as a possible interface, as it apparently has a lot of the
> > desired characteristics (and since beowulf nodes typically don't need
> > the AGP slot anyway).  I'm not an AGP bus expert by any means (it's hard
> > enough trying to get PCI specs from the net:-) but this might be worth
> > looking into.
> 
> Thinking about it a little longer, this looks to me as not such a bad
> idea. Has anybody more information on the AGP bus? Or is there maybe
> an AGP bus expert lurking on this list?
> My only concern is that AGP has to pass through the North Bridge, which
> should increase latency quite a bit.

This assumes an AGP interface is present. Some new high-end
server boards do not seem to have this interface.

Doug

-------------------------------------------------------------------
Paralogic, Inc.           |     PEAK     |      Voice:+610.814.2800
130 Webster Street        |   PARALLEL   |        Fax:+610.814.5844
Bethlehem, PA 18015 USA   |  PERFORMANCE |    http://www.plogic.com
-------------------------------------------------------------------


From salim at ee.fit.edu  Mon Jun 26 10:32:38 2000
From: salim at ee.fit.edu (Salim Mounir AlAoui)
Date: Mon, 26 Jun 2000 13:32:38 -0400 (EDT)
Subject: KLAT2
In-Reply-To: <200006222154.QAA23010@abacus.mcs.anl.gov>
Message-ID: <Pine.GSO.3.96.1000626133048.22224A-100000@yacht.ee.fit.edu>

Has anyone heard about KLAT2? It is a cheaper way of supercomputing than
beowulf. It is hard to believe; does anyone have any information about it?


--------------------------------------------------------------------------
Salim Mounir Alaoui					salim at ee.fit.edu
Computer Science  Dept.					salaoui at cs.fit.edu
Research Assistant.					salim at ieee.org
Florida Institute of Technology
Melbourne, Florida
Voice: (407) 537-8025.
--------------------------------------------------------------------------


From jbh at biology.usu.edu  Mon Jun 26 11:07:07 2000
From: jbh at biology.usu.edu (John Hanks)
Date: Mon, 26 Jun 2000 12:07:07 -0600
Subject: KLAT2
Message-ID: <DFBD5F00372AD311905900A0C9ED07A1478C4E@bioserver1.biology.usu.edu>

http://www.arstechnica.com/cpu/2q00/klat2/klat2-1.html

jbh

> -----Original Message-----
> From: Salim Mounir AlAoui [mailto:salim at ee.fit.edu]
> Sent: Monday, June 26, 2000 11:33 AM
> To: beowulf at beowulf.org
> Subject: KLAT2
> 
> 
> 
> Has anyone heard about KLAT2, it is a cheaper way for 
> supercomputing than
> beowulf. It is hard to believe, does someone has any 
> information about it?
> 


From billran at reciprocal.com  Mon Jun 26 11:14:57 2000
From: billran at reciprocal.com (Bill Rankin)
Date: Mon, 26 Jun 2000 14:14:57 -0400
Subject: KLAT2
Message-ID: <C4E826E59C02D311985B00500463D90B4250BF@SNS2XCH>


From joelja at darkwing.uoregon.edu  Mon Jun 26 11:28:32 2000
From: joelja at darkwing.uoregon.edu (Joel Jaeggli)
Date: Mon, 26 Jun 2000 11:28:32 -0700 (PDT)
Subject: KLAT2
In-Reply-To: <Pine.GSO.3.96.1000626133048.22224A-100000@yacht.ee.fit.edu>
Message-ID: <Pine.LNX.4.21.0006261117500.26513-100000@twin.uoregon.edu>

klat2 meets the traditional definition of a beowulf, i.e. a cluster of
off-the-shelf hardware with a private interconnect network. their switch
interconnect is interesting but is only one of several ways to approach
the problem (e.g. multi-dimensional hypercubes using one or more quad-port
cards per machine, channel bonding, multiple routes for different
purposes and so on). it will undoubtedly be a technique that will be
used in other clusters. however, with port density on switches climbing,
switch fabrics getting faster and switches getting cheaper overall (e.g.
$375 a port for gig-ether from dlink, or $900 a port for a switch with a
32Gb/s backplane and up to 64 or more ports) and gig nics being about $300
ea, gig-ether is only slightly more expensive than fast ether was when we
built our first cluster...


On Mon, 26 Jun 2000, Salim Mounir AlAoui wrote:

> 
> Has anyone heard about KLAT2, it is a cheaper way for supercomputing than
> beowulf. It is hard to believe, does someone has any information about it?
> 
> --------------------------------------------------------------------------
> Salim Mounir Alaoui					salim at ee.fit.edu
> Computer Science  Dept.					salaoui at cs.fit.edu
> Research Assistant.					salim at ieee.org
> Florida Institute of Technology
> Melbourne, Florida
> Voice: (407) 537-8025.
> --------------------------------------------------------------------------

-- 
-------------------------------------------------------------------------- 
Joel Jaeggli				       joelja at darkwing.uoregon.edu    
Academic User Services			     consult at gladstone.uoregon.edu
     PGP Key Fingerprint: 1DE9 8FCA 51FB 4195 B42A 9C32 A30D 121E
--------------------------------------------------------------------------
It is clear that the arm of criticism cannot replace the criticism of
arms.  Karl Marx -- Introduction to the critique of Hegel's Philosophy of
the right, 1843.


From glindahl at hpti.com  Mon Jun 26 11:25:53 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Mon, 26 Jun 2000 14:25:53 -0400
Subject: KLAT2
In-Reply-To: <Pine.GSO.3.96.1000626133048.22224A-100000@yacht.ee.fit.edu>
Message-ID: <000d01bfdf9b$f618ab40$e4844b89@hptilap.hpti.com>

> Has anyone heard about KLAT2, it is a cheaper way for supercomputing than
> beowulf. It is hard to believe, does someone has any information about it?

Check out:

http://aggregate.org/KLAT2/

I wouldn't say it's "cheaper than beowulf"; most people would say it IS a
beowulf. It's the first one I've seen that uses the 32-bit SIMD instructions
on newer x86 architecture chips for a big speed win.

As for their speed numbers, they are impressive, but do keep in mind that
many codes use 64-bit floating point, and many supercomputer codes require
higher bandwidth and lower latency than KLAT2 provides. So, it's not even
the case that a cluster always provides a cheaper way to do
supercomputing... it always depends on the application in question.
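
To make the 32-bit versus 64-bit caveat concrete, here is a small C sketch
(not taken from the KLAT2 code) showing one place where single precision runs
out of resolution.  It assumes IEEE 754 float and double; the function names
are hypothetical, and volatile is used only to force each result to be rounded
to the named type.

```c
#include <assert.h>

/* Add 1 in single precision (24-bit significand). */
float add_one_float(float x)
{
    volatile float r = x + 1.0f;   /* volatile forces rounding to float */
    return r;
}

/* Add 1 in double precision (53-bit significand). */
double add_one_double(double x)
{
    volatile double r = x + 1.0;
    return r;
}

/* Self-check: 2^24 + 1 is not representable as a float, but is as a double. */
void precision_demo(void)
{
    assert(add_one_float(16777216.0f) == 16777216.0f);
    assert(add_one_double(16777216.0) == 16777217.0);
}
```

A code that accumulates many small contributions into a large sum can hit this
limit quickly in 32-bit arithmetic, which is one reason a 32-bit SIMD speed
figure does not transfer directly to a 64-bit code.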

-- g


From DellP at cbs.curtin.edu.au  Mon Jun 26 19:47:24 2000
From: DellP at cbs.curtin.edu.au (Peter Dell)
Date: Tue, 27 Jun 2000 10:47:24 +0800
Subject: Survey of Beowulf users
Message-ID: <s958864c.019@cbs.curtin.edu.au>

Hello all,

Please take a moment to complete a very short web survey of Beowulf users at http://www.dssrg.curtin.edu.au/~cluster/survey/.  The purpose of the survey is to find out about the range of clusters being used, and what they're being used for.

All responses are completely anonymous (unless you choose to identify yourself).  

Results will be made available at http://www.dssrg.curtin.edu.au/~cluster/ in September.

Regards,
Peter


-----------------------------------------------------
 Peter Dell
 School of Information Systems
 Curtin University of Technology
 GPO Box U1987
 Perth  WA  6845
 Australia

 Ph:  +618-9266-4485
 Fax: +618-9266-3076

 Check out http://www.dssrg.curtin.edu.au/~cluster/
-----------------------------------------------------


From waldow at rainier.chem.plu.edu  Tue Jun 27 01:18:00 2000
From: waldow at rainier.chem.plu.edu (Dean Waldow)
Date: Tue, 27 Jun 2000 01:18:00 -0700
Subject: Memory Testing... or problems
References: <Pine.GSO.3.96.1000626133048.22224A-100000@yacht.ee.fit.edu>
Message-ID: <395862C9.89BFBBFD@rainier.chem.plu.edu>

Greetings,

I think this is a little off topic but related to building clusters...

I have been trying to finalize our cluster and bought a board/cpu to
prototype.  I bought an asus p3v-4x board and a 667 PIII/133 cpu with
128MB PC133 ram.  I loaded linux and then ftp-ed my binary code to
benchmark.  I was quite surprised when it finished in 1/3 the time of
what I expected.  As it turned out, the program was basically returning
nonsense. (It is a Monte Carlo simulation and somehow switched labels
on beads and lost beads in the lattice...)  I checked to make sure the
file transferred correctly and it did (via binary), at least in exact
size. I tried a different distro of linux.  I tried compiling on the new
machine. I finally looked at the 10 ram sticks I got and discovered that
they gave me about 4 different brands.  I had randomly pulled one out of
the bag which turned out to be one of a kind.  I switched to another
pc133 stick I had running well on another machine (a p3b-f) and my benchmark
code ran fine using the same binary. Having thought I had figured it out, I
tried another version of the benchmark compiled with the portland group
compilers and the nonsense returned.

My question is regarding testing memory since my main conclusion was
that it must be a memory problem if switching memory fixes (or partially
fixes) the problem.  However, my 'simple minded' thought was that if it
was bad memory linux would have problems too... I don't notice any.  

I have found Robert Brown's memtest.tar, which tests timing /
performance if I am understanding it correctly.  Are there other
recommended programs which might test memory integrity or basically look
for bad memory?

Alternatively, am I missing the boat and might this be a mainboard or
other problem?  I have tried switching components with another linux box
with an asus p3b-f mainboard.  I can't get the p3b-f linux box to run
the binaries incorrectly. 

Thanks for any suggestions for what might be a pretty 'newbie' question.

Dean
-- 
-----------------------------------------------------------------------------
Dean Waldow, Associate Professor      (253) 535-7533 
Department of Chemistry               (253) 536-5055 (FAX)
Pacific Lutheran University           waldowda at plu.edu
Tacoma, WA  98447   USA               http://www.chem.plu.edu/waldow.html
-----------------------------------------------------------------------------
---> CIRRUS and the Chemistry homepage: http://www.chem.plu.edu/         <---
-----------------------------------------------------------------------------


From rauch at inf.ethz.ch  Tue Jun 27 02:03:12 2000
From: rauch at inf.ethz.ch (Felix Rauch)
Date: Tue, 27 Jun 2000 11:03:12 +0200 (CEST)
Subject: Memory Testing... or problems
In-Reply-To: <395862C9.89BFBBFD@rainier.chem.plu.edu>
Message-ID: <Pine.LNX.4.21.0006271101140.20134-100000@maloney.inf.ethz.ch>

While I can't solve your problem, I have a little hint:

On Tue, 27 Jun 2000, Dean Waldow wrote:
> I checked to make sure the file transferred correctly and it did
> (via binary) at least in exact size.

To compare binaries in a more reliable way, try the "diff" or "sum"
commands. "Diff" for binary files should at least return whether they
are identical or not, while "sum" calculates a checksum on the
file(s).
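
In the same spirit, the check can be done from C with a byte-by-byte
comparison.  This is a minimal sketch, not a replacement for the commands
above; the function name files_identical is hypothetical, and a read error is
treated like end of file.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if the files are byte-for-byte identical, 0 if they differ,
 * and -1 if either file cannot be opened. */
int files_identical(const char *path_a, const char *path_b)
{
    FILE *a, *b;
    int ca, cb;

    assert(path_a != NULL && path_b != NULL);
    a = fopen(path_a, "rb");
    if (a == NULL)
        return -1;
    b = fopen(path_b, "rb");
    if (b == NULL) {
        fclose(a);
        return -1;
    }
    /* Stop at the first differing byte or when both files end together. */
    do {
        ca = getc(a);
        cb = getc(b);
    } while (ca == cb && ca != EOF);
    fclose(a);
    fclose(b);
    return ca == cb;
}
```

Two files of the same size that differ in a single byte return 0 here, which
is exactly the case a size comparison cannot catch.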

Regards,
Felix
-- 
Felix Rauch                      | Email: rauch at inf.ethz.ch
Institute for Computer Systems   | Homepage: http://www.cs.inf.ethz.ch/~rauch/
ETH Zentrum / RZ H18             | Phone: ++41 1 632 7489
CH - 8092 Zuerich / Switzerland  | Fax:   ++41 1 632 1307


From ajl4 at EECS.Lehigh.EDU  Tue Jun 27 04:59:12 2000
From: ajl4 at EECS.Lehigh.EDU (Adam Lazur)
Date: Tue, 27 Jun 2000 07:59:12 -0400
Subject: Memory Testing... or problems
In-Reply-To: <395862C9.89BFBBFD@rainier.chem.plu.edu>; from waldow@rainier.chem.plu.edu on Tue, Jun 27, 2000 at 01:18:00AM -0700
References: <Pine.GSO.3.96.1000626133048.22224A-100000@yacht.ee.fit.edu> <395862C9.89BFBBFD@rainier.chem.plu.edu>
Message-ID: <20000627075912.A23725@calypso.eecs.lehigh.edu>

Dean Waldow (waldow at rainier.chem.plu.edu) said:
> I have found Robert Brown's memtest.tar which tests for timing /
> performance if I am understanding it correctly.  Are there other
> recommended programs which might test memory integrity or basically look
> for bad memory.  

I highly recommend memtester which can be found at
http://www.qcc.sk.ca/~charlesc/software/memtester/ as it has always
caught bad dram in my experience.

There is also another tester called memtest86 at
http://reality.sgi.com/cbrady_denver/memtest86/ which we have used as
well.
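
For a rough idea of what a userspace tester does, here is a minimal
write-and-read-back pattern test in C.  It is a sketch only and not the
algorithm of either program above: the names pattern_errors and test_memory
are hypothetical, a userspace program cannot choose which physical memory it
gets, and caches may hide some faults.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Write `pattern` to every word, read it back, and count mismatches. */
size_t pattern_errors(volatile unsigned long *buf, size_t nwords,
                      unsigned long pattern)
{
    size_t i, errors = 0;

    for (i = 0; i < nwords; i++)
        buf[i] = pattern;
    for (i = 0; i < nwords; i++)
        if (buf[i] != pattern)
            errors++;
    return errors;
}

/* Test `nwords` words of freshly allocated memory with a few patterns.
 * Returns the total number of mismatches (0 means nothing was detected),
 * or (size_t)-1 if the allocation failed and nothing was tested. */
size_t test_memory(size_t nwords)
{
    static const unsigned long patterns[] = {
        0UL, ~0UL, 0x55555555UL, 0xAAAAAAAAUL
    };
    unsigned long *buf;
    size_t p, errors = 0;

    assert(nwords > 0);
    buf = malloc(nwords * sizeof *buf);
    if (buf == NULL)
        return (size_t)-1;
    for (p = 0; p < sizeof patterns / sizeof patterns[0]; p++)
        errors += pattern_errors(buf, nwords, patterns[p]);
    free(buf);
    return errors;
}
```

A result of 0 only means this simple test found nothing; the dedicated testers
use many more patterns and passes.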

.adam

-- 
[               Adam Lazur | Lehigh Univ.             |   _ __      ]
[        icq 3354423 | http://www.lehigh.edu/~ajl4    |__( | /_     ]
   "Linux is only free if your time has no value" - Jamie Zawinski


From ex-freek at yahoo.com  Tue Jun 27 09:32:25 2000
From: ex-freek at yahoo.com (Bob P)
Date: Tue, 27 Jun 2000 09:32:25 -0700 (PDT)
Subject: Verification of a working pvm3 setup
Message-ID: <20000627163225.11231.qmail@web1102.mail.yahoo.com>

I have a 2 node linux beowulf prototype.  I have
loaded pvm3 on each on an nfs mounted directory.  PVM
loads fine on the master and can add the second node
with no problem, however, I have attempted to try a
sample master/slave app with no luck.  I run the
master program but it just enters an endless loop.  I
modified the slave so that it doesn't do calculations,
but just returns a float to the master.  Both apps
compiled without a problem, and the master shows up on
the pvm console with a 'ps -a' command.  Does anyone
have any ideas that might get me past this problem?

--
Bob Phan <ex-freek at yahoo.com>
Discovery Technologies
Neurogen Corp. <www.neurogen.com>

__________________________________________________
Do You Yahoo!?
Get Yahoo! Mail - Free email you can access from anywhere!
http://mail.yahoo.com/


From rgb at phy.duke.edu  Tue Jun 27 10:22:41 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Tue, 27 Jun 2000 13:22:41 -0400 (EDT)
Subject: Verification of a working pvm3 setup
In-Reply-To: <20000627163225.11231.qmail@web1102.mail.yahoo.com>
Message-ID: <Pine.LNX.4.10.10006271316560.22200-100000@ganesh.phy.duke.edu>

On Tue, 27 Jun 2000, Bob P wrote:

> I have a 2 node linux beowulf prototype.  I have
> loaded pvm3 on each on an nfs mounted directory.  PVM
> loads fine on the master and can add the second node
> with no problem, however, I have attempted to try a
> sample master/slave app with no luck.  I run the
> master program but it just enters an endless loop.  I
> modified the slave so that it doesn't do calculations,
> but just returns a float to the master.  Both apps
> compiled without a problem, and the master shows up on
> the pvm console with a 'ps -a' command.  Does anyone
> have any ideas that might get me past this problem?

Check your path(s).  PVM requires that the slave binary live in one of a
few very specific places in order to be able to successfully spawn it.
For a private user, for example, this might be ~/pvm3/bin/LINUX/ on a
linux system.
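
A quick way to check this from a program is to build the expected path and
test whether it is executable.  The sketch below assumes the per-user location
given above (home directory, then pvm3/bin/<arch>/); the helper names
pvm_slave_path and is_executable are hypothetical and are not part of PVM, and
your installation may search other directories as well.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Build "<home>/pvm3/bin/<arch>/<name>" into out.
 * Returns 0 on success, -1 if the result would not fit in outlen bytes. */
int pvm_slave_path(char *out, size_t outlen, const char *home,
                   const char *arch, const char *name)
{
    int n;

    assert(out != NULL && home != NULL && arch != NULL && name != NULL);
    n = snprintf(out, outlen, "%s/pvm3/bin/%s/%s", home, arch, name);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/* Returns 1 if path exists and is executable by the caller, else 0. */
int is_executable(const char *path)
{
    return access(path, X_OK) == 0;
}
```

Calling pvm_slave_path() with the value of HOME, the architecture name LINUX,
and the slave's file name, then passing the result to is_executable() on each
node, shows whether the binary is where the spawn is likely to look.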

Beyond that check out the contents of e.g. /tmp/pvmX.log to see what pvm
says when it tries to spawn the slave.  You can sometimes get some help
from running a test under xpvm -- it provides you with visual access to
a lot of the logfile's running contents.  However be warned -- an
application that is "dense" in pvm activity can put so much pressure on
xpvm (which basically sucks up memory to buffer the I/O channel) that it
can exhaust swap VERY QUICKLY and crash your system.  Really xpvm could
use some judicious hacking putting some sort of memory limits on it as
it can literally run away with your system before you can respond...

   Hope this helps (as always).  If it doesn't, try to snoop a bit more
and provide more detail about what is going on on the next pass.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From burgu at iitk.ac.in  Tue Jun 27 11:37:05 2000
From: burgu at iitk.ac.in (Burgu Praveen Kumar)
Date: Wed, 28 Jun 2000 00:07:05 +0530 (IST)
Subject: help wanted
Message-ID: <Pine.LNX.4.10.10006280003420.21579-100000@mailer.cc.iitk.ac.in>

Can anyone suggest some research problems on parallel
algorithms for image processing? If possible, please give me some references.


Burgu Praveen Kumar
Graduate Student,
Dept. of Computer Science & Engg.
Indian Institute of Technology,
Kanpur


From burgu at iitk.ac.in  Tue Jun 27 11:33:30 2000
From: burgu at iitk.ac.in (Burgu Praveen Kumar)
Date: Wed, 28 Jun 2000 00:03:30 +0530 (IST)
Subject: packing unsigned char in PVM
In-Reply-To: <Pine.LNX.4.10.10006271316560.22200-100000@ganesh.phy.duke.edu>
Message-ID: <Pine.LNX.4.10.10006272347360.21579-100000@mailer.cc.iitk.ac.in>

Can any of you tell me how to pack unsigned char in PVM?

Burgu Praveen Kumar
Graduate Student,
Dept. of Computer Science & Engg.
Indian Institute of Technology,
Kanpur


From rgb at phy.duke.edu  Tue Jun 27 14:41:26 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Tue, 27 Jun 2000 17:41:26 -0400 (EDT)
Subject: packing unsigned char in PVM
In-Reply-To: <Pine.LNX.4.10.10006272347360.21579-100000@mailer.cc.iitk.ac.in>
Message-ID: <Pine.LNX.4.10.10006271738430.22200-100000@ganesh.phy.duke.edu>

On Wed, 28 Jun 2000, Burgu Praveen Kumar wrote:

> 
> Can anyone of u tell me how to pack unsigned char in PVM. 

I thought char was always unsigned.  Signed and unsigned differentiates
integers (only), or at least so I thought -- reserving the sign bit.
You should just pack for char, at a guess.

   rgb


Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From kragen at pobox.com  Tue Jun 27 15:16:21 2000
From: kragen at pobox.com (Kragen Sitaker)
Date: Tue, 27 Jun 2000 18:16:21 -0400 (EDT)
Subject: packing unsigned char in PVM
Message-ID: <Pine.GSO.4.21.0006271813040.21839-100000@kirk.dnaco.net>

Robert Brown writes:
> I thought char was always unsigned.  Signed and unsigned differentiates
> integers (only), or at least so I thought -- reserving the sign bit.
> You should just pack for char, at a guess.

In C, a char is a kind of integer, and so there is such a thing as an
"unsigned char".

In C, "char" can mean either "unsigned char" or "signed char",
depending on the platform.  (Unlike other kinds of integers, which are
always signed by default.)  The ANSI standard added the "signed"
keyword so you could talk about signed chars on architectures where the
default char was unsigned.

Signed chars are a huge rat's nest of bugs.

You should be able to just pack for char.
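
A short C sketch of the points above, assuming 8-bit chars; the helper names
are hypothetical.  CHAR_MIN from <limits.h> tells you which way plain char
goes on a given platform, and casting through unsigned char gives the same
0..255 value either way.

```c
#include <assert.h>
#include <limits.h>

/* Returns 1 if plain char is signed on this platform, 0 if unsigned. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Read a byte stored in a plain char as a value in 0..UCHAR_MAX. */
int byte_value(char c)
{
    return (unsigned char)c;
}

/* Self-check of the two helpers above. */
void char_demo(void)
{
    char c = (char)0xFF;            /* all bits set */

    assert(byte_value(c) == 255);   /* same answer on every platform */
    assert((c < 0) == plain_char_is_signed());
}
```

Reading c directly as an int gives -1 where char is signed and 255 where it is
unsigned, which is the kind of porting surprise that makes signed chars a
source of bugs.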

-- 
<kragen at pobox.com>       Kragen Sitaker     <http://www.pobox.com/~kragen/>
The Internet stock bubble didn't burst on 1999-11-08.  Hurrah!
<URL:http://www.pobox.com/~kragen/bubble.html>
The power didn't go out on 2000-01-01 either.  :)


From dek_ml at konerding.com  Tue Jun 27 22:15:16 2000
From: dek_ml at konerding.com (dek_ml at konerding.com)
Date: Tue, 27 Jun 2000 22:15:16 -0700
Subject: packing unsigned char in PVM 
In-Reply-To: Your message of "Tue, 27 Jun 2000 17:41:26 EDT."
             <Pine.LNX.4.10.10006271738430.22200-100000@ganesh.phy.duke.edu> 
Message-ID: <200006280515.WAA09866@adsl-63-202-25-210.dsl.snfc21.pacbell.net>

"Robert G. Brown" writes:
>On Wed, 28 Jun 2000, Burgu Praveen Kumar wrote:
>
>> 
>> Can anyone of u tell me how to pack unsigned char in PVM. 
>
>I thought char was always unsigned.  Signed and unsigned differentiates
>integers (only), or at least so I thought -- reserving the sign bit.
>You should just pack for char, at a guess.

On SGIs, char is unsigned; on Intel, char is signed.  CPUs can do either
(just like endianness).  I found this out when the results I got in a program
I had ported to Linux were junk: -1's when they should have been 255s.

Dave


From salim at ee.fit.edu  Wed Jun 28 05:33:28 2000
From: salim at ee.fit.edu (Salim Mounir AlAoui)
Date: Wed, 28 Jun 2000 08:33:28 -0400 (EDT)
Subject: help wanted
In-Reply-To: <Pine.LNX.4.10.10006280003420.21579-100000@mailer.cc.iitk.ac.in>
Message-ID: <Pine.GSO.3.96.1000628083003.23405D-100000@yacht.ee.fit.edu>


I am working on image processing using a Beowulf. I had the same problem
of packing unsigned char. What I did: I used regular c.


--------------------------------------------------------------------------
Salim Mounir Alaoui					salim at ee.fit.edu
Computer Science  Dept.					salaoui at cs.fit.edu
Research Assistant.					salim at ieee.org
Florida Institute of Technology
Melbourne, Florida
Voice: (407) 537-8025.
--------------------------------------------------------------------------


From camm at enhanced.com  Wed Jun 28 06:54:34 2000
From: camm at enhanced.com (Camm Maguire)
Date: 28 Jun 2000 09:54:34 -0400
Subject: packing unsigned char in PVM
In-Reply-To: kragen@pobox.com's message of "Tue, 27 Jun 2000 18:16:21 -0400 (EDT)"
References: <Pine.GSO.4.21.0006271813040.21839-100000@kirk.dnaco.net>
Message-ID: <54itutkj1h.fsf@intech9.enhanced.com>

Greetings!  Why not pack for char, and then on the receiving end:

	unsigned char *u;

	u = (unsigned char *)pvm_received_char_buf;
	/* then read u[i]: the same bytes, seen as 0..255 */

Take care,

kragen at pobox.com (Kragen Sitaker) writes:

> Robert Brown writes:
> > I thought char was always unsigned.  Signed and unsigned differentiates
> > integers (only), or at least so I thought -- reserving the sign bit.
> > You should just pack for char, at a guess.
> 
> In C, a char is a kind of integer, and so there is such a thing as an
> "unsigned char".
> 
> In C, "char" can mean either "unsigned char" or "signed char",
> depending on the platform.  (Unlike other kinds of integers, which are
> always signed by default.)  The ANSI standard added the "signed"
> keyword so you could talk about signed chars on architectures where the
> default char was unsigned.
> 
> Signed chars are a huge rat's nest of bugs.
> 
> You should be able to just pack for char.
> 
> -- 
> <kragen at pobox.com>       Kragen Sitaker     <http://www.pobox.com/~kragen/>
> The Internet stock bubble didn't burst on 1999-11-08.  Hurrah!
> <URL:http://www.pobox.com/~kragen/bubble.html>
> The power didn't go out on 2000-01-01 either.  :)
> 
> 
> _______________________________________________
> Beowulf mailing list
> Beowulf at beowulf.org
> http://www.beowulf.org/mailman/listinfo/beowulf
> 
> 

-- 
Camm Maguire			     			camm at enhanced.com
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah


From josip at icase.edu  Wed Jun 28 12:16:22 2000
From: josip at icase.edu (Josip Loncaric)
Date: Wed, 28 Jun 2000 15:16:22 -0400
Subject: Take any two: motherboard performance, compatibility, value
Message-ID: <395A4F06.6C3F1A70@icase.edu>

Hello,

since last year, choosing a motherboard for Beowulf applications has
gotten much more complicated.  Intel has locked itself into a bizarre
RDRAM corner, VIA has only chipsets which tolerate but do not use ECC,
AMD still does not have a successor for its 750 chipset, and it may be
2001 before the motherboard market offers attractive choices.

Our needs are fairly basic:

(1) K7 or P3 processor in a compact package (->dual CPU)
(2) lots of fast RAM using ECC
(3) Linux compatibility
(4) commodity pricing

The K7 route offers only single CPU systems.  To use PC133 RAM, one must
pick KX133 chipset, which does not perform ECC (it only tolerates such
memory modules).  No SMP, no ECC yet: K7 is out for now. 

The P3 route is also problematic.  Intel's recent dual CPU chipsets are
i820 and i840, but they insist on RDRAM which (today) costs 3-4 times
more than SDRAM, without providing a consistent benefit.  RDRAM is *not*
an option for us.  However, SDRAM on i820 motherboards requires MTH chip
(now recalled because noise causes random reboots), while i840
motherboards need the MRH-S chip which is now also discontinued by Intel
(incorrect ECC operation was reported).  Since RDRAM price/performance
is so poor and i8x0 chipsets do not work correctly with SDRAM, both i8x0
chipsets are out.  

This leaves the VIA Apollo Pro 133A chipset (used by Tyan Tiger 133 S1834),
but this chipset does not use ECC.  About the only remaining options are
ServerWorks chipsets (used by Supermicro Super 370DLE), but I do not
know if anyone has made them work with Linux.

So we are back to square one, i.e. 440BX SMP motherboards and PC100
RAM.  This is Not Good.  High RAM bandwidth is essential, particularly
on dual P3/800 machines (faster clock, smaller cache)...

BTW, I see that ECC corrects about one single bit error per month in
12GB of RAM.  Our total system will have close to 40GB, so errors could
pop up weekly, which is why we need ECC.  

Sincerely,
Josip


-- 
Dr. Josip Loncaric, Senior Staff Scientist        mailto:josip at icase.edu
ICASE, Mail Stop 132C           PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center             mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134


From jakob at ostenfeld.dtu.dk  Wed Jun 28 13:28:58 2000
From: jakob at ostenfeld.dtu.dk (=?iso-8859-1?Q?Jakob_=D8stergaard?=)
Date: Wed, 28 Jun 2000 22:28:58 +0200
Subject: Take any two: motherboard performance, compatibility, value
In-Reply-To: <395A4F06.6C3F1A70@icase.edu>
References: <395A4F06.6C3F1A70@icase.edu>
Message-ID: <20000628222858.N1603@ostenfeld.dtu.dk>

On Wed, 28 Jun 2000, Josip Loncaric wrote:

> Hello,
> 
...
> 
> So we are back to square one, i.e. 440BX SMP motherboards and PC100
> RAM.  This is Not Good.  High RAM bandwidth is essential, particularly
> on dual P3/800 machines (faster clock, smaller cache)...

We bought a new dual 550 PIII at work recently, and ended up using good
old Asus P2B-D (BX based).  It seemed to be the only real affordable 
and known-stable solution.

> BTW, I see that ECC corrects about one single bit error per month in
> 12GB of RAM.  Our total system will have close to 40GB, so errors could
> pop up weekly, which is why we need ECC.  

Are you absolutely certain that ECC RAM on PC hardware actually *corrects*
bit errors ?

There was a short discussion on this subject on the linux-kernel list some
weeks ago, where someone stated that ECC RAM (for PCs) can only *detect* a
parity error and offer you an NMI when that occurs.   No one seemed to object to
this.

Yes, I know what ECC stands for, but think about it:  ECC RAM costs about
the same as normal parity RAM,  why ?    It seemed that the conclusion was
that if you wanted error correction you should go for non-PC hardware.

The statement came up in a memory-detection discussion where someone who had
hooked a logic analyzer on a motherboard found that NT detects the amount of
memory available on a system by  1)  disabling RAM parity check,  2)  writing
until it sees an error,  and then *never* enabling RAM parity again.   Someone
found it amusing that NT didn't ever enable the parity check again, then
someone else pointed out the above, that ECC didn't help you much anyway except
assuring that the kernel would die when a bitflip occurred.   The latter may of
course still be preferable to random un-noticed bitflips in data.   But the
essence of the argument was, that ECC RAM did *not* correct bit errors if it
was PC ECC RAM.

Does anyone have further information on this ?    I don't know anything about
this myself, but the price argument seems reasonable, and I guess you could
count the number of chips on your RAM modules to find out if it really has
enough bits for error correction, or only the extra one needed for parity.

-- 
................................................................
: jakob at ostenfeld.dtu.dk  : And I see the elder races,         :
:.........................: putrid forms of man                :
:   Jakob Østergaard      : See him rise and claim the earth,  :
:        OZ9ABN           : his downfall is at hand.           :
:.........................:............{Konkhra}...............:


From wasshub at ti.com  Wed Jun 28 14:31:33 2000
From: wasshub at ti.com (Christoph Wasshuber)
Date: Wed, 28 Jun 2000 16:31:33 -0500
Subject: Beowulf: A theorical approach
References: <Pine.LNX.4.10.10006261059510.4735-100000@plogic.com> <395A02EE.5A95DF25@srccomp.com>
Message-ID: <395A6EB5.293B21A8@ti.com>

> Intel's chipsets the 440LX and 440BX support AGP, but only support 2 processors
> on
> the host bus.  Their 450KX, GX and NX  varieties designed for servers support 4
> processors
> on the host bus, but lack support for AGP.
> 
> For the chipsets that support AGP, the AGP interface does indeed sit behind the
> host
> bridge as well as memory, pci devices and so on.

Reading a little bit through the AGP spec, it seems to be really powerful.
In the 4x mode it has ~8Gbit/s bandwidth - wow! Latency I assume to be
around 100ns.

Chris....


From paullu at cs.ualberta.ca  Wed Jun 28 14:50:39 2000
From: paullu at cs.ualberta.ca (Paul Lu)
Date: Wed, 28 Jun 2000 15:50:39 -0600
Subject: Take any two: motherboard performance, compatibility, value
In-Reply-To: <395A4F06.6C3F1A70@icase.edu>; from josip@icase.edu on Wed, Jun 28, 2000 at 03:16:22PM -0400
References: <395A4F06.6C3F1A70@icase.edu>
Message-ID: <20000628155039.J2705@cs.ualberta.ca>

Hello:

On Wed, Jun 28, 2000 at 03:16:22PM -0400, Josip Loncaric wrote:
> About the only remaining options are
> ServerWorks chipsets (used by Supermicro Super 370DLE), but I do not
> know if anyone has made them work with Linux.

Well, the new Compaq ProLiant (DL360, DL580) servers appear to be based
on the same (family of) ServerWorks chipsets (see
	http://www.serverworks.com/news/press/000606.html
)
and Compaq explicitly lists RedHat 6.2 as being supported on the new ProLiants.

Also, a quick grep through the kernel 2.4.0-test2 source indicates that
there are the "proper" constants defined for this chipset.

But, all this is no substitute for hands-on experience.

Does anybody have any Linux experience with this chipset and/or these
motherboards and especially with respect to 64-bit/66 MHz PCI performance?

Thanks,

	...Paul


From bob at drzyzgula.org  Wed Jun 28 15:39:54 2000
From: bob at drzyzgula.org (Bob Drzyzgula)
Date: Wed, 28 Jun 2000 18:39:54 -0400
Subject: Take any two: motherboard performance, compatibility, value
In-Reply-To: <20000628222858.N1603@ostenfeld.dtu.dk>
References: <395A4F06.6C3F1A70@icase.edu> <20000628222858.N1603@ostenfeld.dtu.dk>
Message-ID: <20000628183954.B3703@mercury.drzyzgula.org>

On Wed, Jun 28, 2000 at 10:28:58PM +0200, Jakob Østergaard wrote:
> On Wed, 28 Jun 2000, Josip Loncaric wrote:
> 
> > So we are back to square one, i.e. 440BX SMP motherboards and PC100
> > RAM.  This is Not Good.  High RAM bandwidth is essential, particularly
> > on dual P3/800 machines (faster clock, smaller cache)...
> 
> We bought a new dual 550 PIII at work recently, and ended up using good
> old Asus P2B-D (BX based).  It seemed to be the only real affordable 
> and known-stable solution.

Just one more indication of how badly Intel screwed up with the
whole RDRAM fiasco. Improvements in this arena have just been
stalled for months.

> > BTW, I see that ECC corrects about one single bit error per month in
> > 12GB of RAM.  Our total system will have close to 40GB, so errors could
> > pop up weekly, which is why we need ECC.  
> 
> Are you absolutely certain that ECC RAM on PC hardware actually *corrects*
> bit errors ?
> 
> There was a short discussion on this subject on the linux-kernel list some
> weeks ago, where someone stated that ECC RAM (for PCs) can only *detect* a
> parity error and offer you an NMI when that occurs. No one seemed to object to
> this.

The last thing I am is an expert on this, but, quoting
Intel's 440BX web page at

  http://developer.intel.com/design/intarch/techinfo/440BX/BX_arch.htm

] The Intel® 440BX AGPset also provides DIMM plug-and-play
] support via Serial Presence Detect (SPD) mechanism using
] the SMBus interface. The 82443BX provides optional
] data integrity features including ECC in the memory
] array. During reads from DRAM, the 82443BX provides
] error checking and correction of the data. The 82443BX
] supports multiple-bit error detection and single-bit error
] correction when ECC mode is enabled and single/multi-bit
] error detection when correction is disabled. During
] writes to the DRAM, the 82443BX generates ECC for the
] data on a QWord basis. Partial QWord writes require a
] read-modify-write cycle when ECC is enabled.

In these PC architectures, I don't think that there is any
ECC generation on-module like there is in some architectures,
there is only sufficient bit storage to allow the chipset
to generate the somewhat-redundant codes and store those.

Whether the motherboard manufacturers, BIOS writers and
operating systems configure the chipset properly to take
advantage of this, or do anything interesting with any
information provided by the chipset is another matter
entirely. I would expect, for example, that the chipset
would raise some sort of alert if a single-bit ECC error
was detected and corrected; certainly the OS would want
to log such an event. Depending on the motherboard, BIOS
and OS, it would certainly be possible to treat such an
alert exactly the same as one would treat a double-bit
error, or a single-bit error when ECC is turned off,
e.g. NMI. It's also possible, I suppose, that the ECC
generation and detection in the 443BX doesn't work worth
a damn and thus most 440BX designs leave it turned off.
I have no reason to believe this is true, however.

FWIW.

--Bob Drzyzgula


From mdavis at kieser.net  Wed Jun 28 16:35:22 2000
From: mdavis at kieser.net (Mike Davis)
Date: Thu, 29 Jun 2000 00:35:22 +0100
Subject: Water-cooling
Message-ID: <001b01bfe159$886320c0$6500000a@carisbrook.co.uk>

Hi,

Dunno if anyone's seen/tried this, but it sure looks good!
http://www.agaweb.com/coolcpu/build.htm

He mentions that the Coppermine chips are going to be in Chip format again
(not Slot1). Is this correct?

Mike


From deadline at plogic.com  Wed Jun 28 16:53:08 2000
From: deadline at plogic.com (Douglas Eadline)
Date: Wed, 28 Jun 2000 19:53:08 -0400 (EDT)
Subject: Take any two: motherboard performance, compatibility, value
In-Reply-To: <20000628155039.J2705@cs.ualberta.ca>
Message-ID: <Pine.LNX.4.10.10006281945080.9200-100000@lisa.plogic.com>

On Wed, 28 Jun 2000, Paul Lu wrote:

> Hello:
> 
> On Wed, Jun 28, 2000 at 03:16:22PM -0400, Josip Loncaric wrote:
> > About the only remaining options are
> > ServerWorks chipsets (used by Supermicro Super 370DLE), but I do not
> > know if anyone has made them work with Linux.
> 
> Well, the new Compaq ProLiant (DL360, DL580) servers appear to be based
> on the same (family of) ServerWorks chipsets (see
> 	http://www.serverworks.com/news/press/000606.html
> )
> and Compaq explicitly lists RedHat 6.2 as being supported on the new ProLiants.
> 
> Also, a quick grep through the kernel 2.4.0-test2 source indicates that
> there are the "proper" constants defined for this chipset.
> 
> But, all this is no substitute for hands-on experience.
> 
> Does anybody have any Linux experience with this chipset and/or these
> motherboards and especially with respect to 64-bit/66 MHz PCI performance?

We have Linux working on Serverworks systems. We are still testing, but
our dual PIII Cu-733 with 1GB PC-133 SDRAM(ecc) gave pretty good stream
numbers.  

Doug 

-------------------------------------------------------------------
Paralogic, Inc.           |     PEAK     |      Voice:+610.814.2800
130 Webster Street        |   PARALLEL   |        Fax:+610.814.5844
Bethlehem, PA 18015 USA   |  PERFORMANCE |    http://www.plogic.com
-------------------------------------------------------------------


From djholm at fnal.gov  Wed Jun 28 17:06:33 2000
From: djholm at fnal.gov (Don Holmgren)
Date: Wed, 28 Jun 2000 19:06:33 -0500 (CDT)
Subject: Take any two: motherboard performance, compatibility, value
In-Reply-To: <20000628183954.B3703@mercury.drzyzgula.org>
Message-ID: <Pine.SGI.3.95.1000628183851.15426A-100000@hppc.fnal.gov>


On Wed, 28 Jun 2000, Bob Drzyzgula wrote:

...

> > > BTW, I see that ECC corrects about one single bit error per month in
> > > 12GB of RAM.  Our total system will have close to 40GB, so errors could
> > > pop up weekly, which is why we need ECC.  
> > 
> > Are you absolutely certain that ECC RAM on PC hardware actually *corrects*
> > bit errors ?
> > 
> > There was a short discussion on this subject on the linux-kernel list some
> > weeks ago, where someone stated that ECC RAM (for PCs) can only *detect* a
> > > parity error and offer you an NMI when that occurs. No one seemed to object to
> > this.
> 
> The last thing I am is an expert on this, but, quoting
> Intel's 440BX web page at
> 
>   http://developer.intel.com/design/intarch/techinfo/440BX/BX_arch.htm
> 
> ] The Intel® 440BX AGPset also provides DIMM plug-and-play
> ] support via Serial Presence Detect (SPD) mechanism using
> ] the SMBus interface. The 82443BX provides optional
> ] data integrity features including ECC in the memory
> ] array. During reads from DRAM, the 82443BX provides
> ] error checking and correction of the data. The 82443BX
> ] supports multiple-bit error detection and single-bit error
> ] correction when ECC mode is enabled and single/multi-bit
> ] error detection when correction is disabled. During
> ] writes to the DRAM, the 82443BX generates ECC for the
> ] data on a QWord basis. Partial QWord writes require a
> ] read-modify-write cycle when ECC is enabled.
> 
> In these PC architectures, I don't think that there is any
> ECC generation on-module like there is in some architectures,
> there is only sufficient bit storage to allow the chipset
> to generate the somewhat-redundant codes and store those.
> 
> Whether the motherboard manufacturers, BIOS writers and
> operating systems configure the chipset properly to take
> advantage of this, or do anything interesting with any
> information provided by the chipset is another matter
> entirely. I would expect, for example, that the chipset
> would raise some sort of alert if a single-bit ECC error
> was detected and corrected; certainly the OS would want
> to log such an event. Depending on the motherboard, BIOS
> and OS, it would certainly be possible to treat such an
> alert exactly the same as one would treat a double-bit
> error, or a single-bit error when ECC is turned off,
> e.g. NMI. It's also possible, I suppose, that the ECC
> generation and detection in the 443BX doesn't work worth
> a damn and thus most 440BX designs leave it turned off.
> I have no reason to believe this is true, however.
> 
> FWIW.
> 
> --Bob Drzyzgula

When we ran into some memory problems on 440BX- and 440GX-based systems, I dug
into the Intel PCI chipset manuals and wrote some code to dump the information
from the memory controller registers. 

The extra 8 bits available on memory with parity - 72 bits wide, rather than 64
bits (interesting that this is now marketed as ECC memory; a couple of years ago
it was sold as parity memory) - is indeed used to do the ECC calculations and
corrections by the memory controller.  No additional circuitry is needed on the
DIMMs.  Single bit errors are all corrected transparently to the microprocessor.
Multibit errors are not correctable, and if so configured the chipset can issue
an NMI.  On Linux this NMI results in the "dazed and confused" console message:

  "Uhhuh. NMI received. Dazed and confused, but trying to continue"

We have a critical application which can't tolerate data errors, and so have
patched the NMI trap and reboot the system immediately following a multibit
error.

The memory controller has a couple of registers used to indicate whether bit
errors have been detected - a flag for a single bit error, a flag for a multiple
bit error, and the page where the error occurred.  This information is latched
at the first error.

At ftp://linux-rep.fnal.gov/pub/motherboards/ I have 3 programs you can use to
query the controller:
  chip2.c - dumps lots of information, such as CAS/RAS timings, which DIMM
            slot(s) are populated, how large the DIMMs are, whether each DIMM is
            ECC-capable or not, whether and where bit errors have occurred.
  biterror_check.c - checks and reports whether or not a single or multiple bit
            error has occurred, and the page of the occurrence.  Remember, this
            information is latched, so multiple errors may have occurred
            subsequent to the first.
  biterror_reset.c - checks and reports whether or not a single or multiple bit
            error has occurred, and the page of the occurrence.  Also resets the
            error flags.

On my motherboards there's always a single bit error after a reboot, so I
suspect the BIOS causes one to happen when sizing memory.  So, I usually do a
biterror_reset during system startup.  

On the systems we're currently monitoring - 20 L440GX+ motherboards with 512 MB
of memory each - single bit errors are extremely rare.  Perhaps 1 per month of
operation across all of the machines.  I've not seen a multiple bit error since
replacing memory last January.

To interpret the output of chip2.c you'll need the 82443BX or 82443GX host
bridge manual from Intel.

Don Holmgren
Fermilab


From joysarkar at jncasr.ac.in  Thu Jun 29 09:55:48 2000
From: joysarkar at jncasr.ac.in (Mr.Joy Sarkar)
Date: Thu, 29 Jun 2000 11:55:48 -0500 (GMT+5)
Subject: i810
Message-ID: <Pine.LNX.4.04.10006291126490.4429-100000@jncasr.ac.in>

Hi,
	I am a little scared with all this discussion about Intel
motherboards! We are using i810 with PC 100 MHz  SDRAM (viking make). Can
anyone tell us if there are any known issues (bad news, that is -:( ) with
the above combo?

TIA,

sincerely,
js.
		

From josip at icase.edu  Thu Jun 29 05:39:13 2000
From: josip at icase.edu (Josip Loncaric)
Date: Thu, 29 Jun 2000 08:39:13 -0400
Subject: Take any two: motherboard performance, compatibility, value
References: <395A4F06.6C3F1A70@icase.edu> <20000628222858.N1603@ostenfeld.dtu.dk> <20000628183954.B3703@mercury.drzyzgula.org>
Message-ID: <395B4371.EF3A3D14@icase.edu>

Bob Drzyzgula wrote:
> 
> On Wed, Jun 28, 2000 at 10:28:58PM +0200, Jakob Østergaard wrote:
> >
> > Are you absolutely certain that ECC RAM on PC hardware actually *corrects*
> > bit errors ?

[ Intel says yes ]

> I would expect, for example, that the chipset
> would raise some sort of alert if a single-bit ECC error
> was detected and corrected; certainly the OS would want
> to log such an event.

After the BIOS activates ECC, the 440BX chipset logs all corrected single
bit errors (and all detected multiple bit errors); but the Linux kernel does
*not* automatically monitor the relevant 440BX register.  You need to
download, compile and insert the 'ecc' module to get ECC logging.  This
module (which currently works *only* on uniprocessor machines) is
available from 

http://www.anime.net/~goemon/linux-ecc/

This module produces the following kind of output:

Jun  8 18:47:11 n009 kernel: ECC: monitor version 0.9 (Oct 15 1999)  
Jun 27 19:04:30 n009 kernel: ECC: SBE detected in DRAM row 3  
Jun 27 19:04:30 n009 kernel: ECC: SBE at memory address 8000  


Sincerely,
Josip

-- 
Dr. Josip Loncaric, Senior Staff Scientist        mailto:josip at icase.edu
ICASE, Mail Stop 132C           PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center             mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134


From emullinix at allegro.net  Thu Jun 29 06:03:25 2000
From: emullinix at allegro.net (Erik Mullinix)
Date: Thu, 29 Jun 2000 08:03:25 -0500
Subject: Water-cooling
Message-ID: <s95b10f9.020@www.gwmail.com>

Here is an alternative to the use of water (a known corrosive) ;-)

http://www.accsdata.com/drffreeze/Dr%20Ffreeze.htm

Cooling in another fashion. ;-)

From timm at fnal.gov  Thu Jun 29 06:39:28 2000
From: timm at fnal.gov (Steven Timm)
Date: Thu, 29 Jun 2000 08:39:28 -0500 (CDT)
Subject: Water-cooling
In-Reply-To: <001b01bfe159$886320c0$6500000a@carisbrook.co.uk>
Message-ID: <Pine.LNX.4.21.0006290838580.6574-100000@sapphire.fnal.gov>

> Hi,
> 
> Dunno if anyone's seen/tried this, but it sure looks good!
> http://www.agaweb.com/coolcpu/build.htm
> 
> He mentions that the Coppermine chips are going to be in Chip format again
> (not Slot1). Is this correct?
> 
> Mike
> 
Coppermine chips are available both in a chip format and a slot-1 
format.

Steve Timm




From josip at icase.edu  Thu Jun 29 07:00:59 2000
From: josip at icase.edu (Josip Loncaric)
Date: Thu, 29 Jun 2000 10:00:59 -0400
Subject: Take any two: motherboard performance, compatibility, value
References: <Pine.LNX.4.10.10006281945080.9200-100000@lisa.plogic.com>
Message-ID: <395B569B.F514B5D5@icase.edu>

Douglas Eadline wrote:
> 
> We have Linux working on Serverworks systems. We are still testing, but
> our dual PIII Cu-733 with 1GB PC-133 SDRAM(ecc) gave pretty good stream
> numbers.

Thanks for the info!  I dug around some more, and learned that
Serverworks is kind of stingy with documentation (see
http://boudicca.tux.org/hypermail/linux-kernel/2000week15/0768.html). 
This may imply some difficulties in fully exploiting its nicer features
(64-bit PCI bus at 66MHz, etc.)

I got/found the following references to motherboards using Serverworks
chipsets:

http://www.tyan.com/html/pr_thunders_22400.html (see Thunder 2500)
http://www.tyan.com/products/html/s1867.html (bad link?)
http://www.supermicro.com/PRODUCT/MotherBoards/RCC_LE/370DL3.htm
http://www.supermicro.com/PRODUCT/MotherBoards/RCC_LE/370DLE.htm

ASL builds a dual Pentium III Linux workstation (presumably using the
Tyan Thunder 2500 board):

http://www.aslab.com/contents/workstations/Marquis-C250S.html

The Serverworks option is interesting, but I'd need more info before
buying a system of this type.

Sincerely,
Josip

-- 
Dr. Josip Loncaric, Senior Staff Scientist        mailto:josip at icase.edu
ICASE, Mail Stop 132C           PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center             mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134


From emullinix at allegro.net  Thu Jun 29 07:26:56 2000
From: emullinix at allegro.net (Erik Mullinix)
Date: Thu, 29 Jun 2000 09:26:56 -0500
Subject: Water-cooling
Message-ID: <s95b247c.052@www.gwmail.com>

I do apologize for leading you on.  The URL has no misspellings; however, the domain seems to have stopped working within the last week or two, and I had a cached version.
I will try to get more information or reconstruct it from my cache. (Laziness pays off: I still have my cache.)

Erik Mullinix

>>> Frank Joerdens <frank at joerdens.de> 06/29/00 10:23AM >>>
hello,

there are at least a couple of typos in the url. could you send it
again? i am intrigued ;)

cheers frank

On Thu, Jun 29, 2000 at 08:03:25AM -0500, Erik Mullinix wrote:
> 
>    Here is an Alternative to the use of water (a known corrosive);-)
>    
>    
>    
>    [1]http://www.accsdata.com/drffreeze/Dr%20Ffreeze.htm 
>    
>    
>    
>    Cooling in another fashion. ;-)
> 
> References
> 
>    1. http://www.accsdata.com/drffreeze/Dr%20Ffreeze.htm 

-- 
frank joerdens               

joerdens new media
urbanstr. 116
10967 berlin
germany

e: frank at joerdens.de 
m: +49 (0)179 5174091
f: +49 (0)30 7864046 
h: http://www.joerdens.de 

pgp public key: http://www.joerdens.de/pgp/frank_joerdens.asc

From salim at ee.fit.edu  Thu Jun 29 07:40:14 2000
From: salim at ee.fit.edu (Salim Mounir AlAoui)
Date: Thu, 29 Jun 2000 10:40:14 -0400 (EDT)
Subject: make
Message-ID: <Pine.GSO.3.96.1000629103818.19532A-100000@yacht.ee.fit.edu>

Is there a tool to perform parallel make? For large software, make can
take a lot of time; a parallel make taking advantage of the Beowulf would be
interesting. Has anyone worked on that?


--------------------------------------------------------------------------
Salim Mounir Alaoui					salim at ee.fit.edu
Computer Science  Dept.					salaoui at cs.fit.edu
Research Assistant.					salim at ieee.org
Florida Institute of Technology
Melbourne, Florida
Voice: (407) 537-8025.
--------------------------------------------------------------------------


From joelja at darkwing.uoregon.edu  Thu Jun 29 10:05:11 2000
From: joelja at darkwing.uoregon.edu (Joel Jaeggli)
Date: Thu, 29 Jun 2000 10:05:11 -0700 (PDT)
Subject: i810
In-Reply-To: <Pine.LNX.4.04.10006291126490.4429-100000@jncasr.ac.in>
Message-ID: <Pine.LNX.4.21.0006290926080.1041-100000@twin.uoregon.edu>

apart from poor video performance, which you probably don't care about on a
cluster, there's nothing really wrong with the i810. the issues
surrounding the i820/i840 involve the memory translator hub (mth) for using
sdram on the i820 being faulty, and the mrh for sdram on the i840 being lower
performance than people expected. couple that with the early delay in
delivery of the i820, because intel decided it didn't work that well with
three rdram rimm sockets and thus reduced the spec to two, causing a
three month delay in the delivery of motherboards, and you have the makings
of a debacle...

regards
joelja
 
On Thu, 29 Jun 2000, Mr.Joy Sarkar wrote:

> 
> Hi,
> 	I am a little scared with all this discussion about Intel
> motherboards! We are using i810 with PC 100 MHz  SDRAM (viking make). Can
> anyone tell us if there are any known issues (bad news, that is -:( ) with
> the above combo?
> 
> TIA,
> 
> sincerely,
> js.
> 		
> 

-- 
-------------------------------------------------------------------------- 
Joel Jaeggli				       joelja at darkwing.uoregon.edu    
Academic User Services			     consult at gladstone.uoregon.edu
     PGP Key Fingerprint: 1DE9 8FCA 51FB 4195 B42A 9C32 A30D 121E
--------------------------------------------------------------------------
It is clear that the arm of criticism cannot replace the criticism of
arms.  Karl Marx -- Introduction to the critique of Hegel's Philosophy of
the right, 1843.


From gallipwc at nswcphdn.navy.mil  Thu Jun 29 10:44:49 2000
From: gallipwc at nswcphdn.navy.mil (Gallip William C PHDN)
Date: Thu, 29 Jun 2000 13:44:49 -0400
Subject: Parallelization check list
Message-ID: <AF67AB108F16D21196F600805F19516D01AB365E@phdnex01.nswcphdn.navy.mil>

We are starting to put together a check-list of things to consider when
taking a serial code and determining its parallelizability, either for a
distributed memory and/or SMP environment.  Has anyone already gone through
a similar exercise and if so could you share the results?  Any help would be
greatly appreciated.

Bill Gallip
Fleet Advanced Supercomputing Technology Center
NAVSEA Dam Neck
Virginia Beach, VA


From RSchilling at affiliatedhealth.org  Thu Jun 29 11:15:04 2000
From: RSchilling at affiliatedhealth.org (Schilling, Richard)
Date: Thu, 29 Jun 2000 11:15:04 -0700
Subject: Parallelization check list
Message-ID: <51FCCCF0C130D211BE550008C724149E8FECE5@mail1.affiliatedhealth.org>

Sure. One of the things to look for, definitely, is nested loops.  Parallelizing
these loops does not change their "big oh" complexity, but it can cut run time
by up to roughly the number of processors.  Not sure about any actual
benchmarks on the approach, however.

Richard Schilling
Web Integration Programmer
Affiliated Health Services


From rgb at phy.duke.edu  Thu Jun 29 12:09:39 2000
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Thu, 29 Jun 2000 15:09:39 -0400 (EDT)
Subject: Parallelization check list
In-Reply-To: <AF67AB108F16D21196F600805F19516D01AB365E@phdnex01.nswcphdn.navy.mil>
Message-ID: <Pine.LNX.4.10.10006291451460.28675-100000@ganesh.phy.duke.edu>

On Thu, 29 Jun 2000, Gallip William C PHDN wrote:

> We are starting to put together a check-list of things to consider when
> taking a serial code and determining its parallelizability, either for a
> distributed memory and/or SMP environment.  Has anyone already gone through
> a similar exercise and if so could you share the results?  Any help would be
> greatly appreciated.
> 
> Bill Gallip
> Fleet Advanced Supercomputing Technology Center
> NAVSEA Dam Neck
> Virginia Beach, VA

Not exactly a checklist, but look at the talks on

   http://www.phy.duke.edu/brahma

The one presented at last year's Extreme Linux Tutorial at Linux Expo is
largely a case study of this very thing.  The other talks (and the draft
book) explain the underlying math, I hope fairly clearly.  I'm working
on a paper that will describe the application of various measurement
tools to the process, to make it at least approximately quantitative
instead of qualitative.

However, be aware that your checklist will at best be an approximate
answer or guideline, not a deterministic protocol.  That is because the
parallelizability depends on algorithm, and the algorithms used in the
serial code may NOT be efficiently parallelizable.  Parallelizability
also depends (deeply) on the design of the parallel environment you're
talking about, which can vary considerably, the size of the problem, and
more.  The only way to fully answer the question for at least some
problems is to do some deep, fully informed research.  There are
references in the draft book manuscript to a URL for an online book on
parallel algorithms and program design and there are a number of
well-known books that one can buy via a bookseller as well.

The best checklist would likely be one that could be used to select
projects that are definitely parallelizable (and possibly match them
with appropriate parallel architectures) but NOT reject ones as
unparallelizable except at the functional level (the task itself simply
has no usefully parallelizable component and fails the basic Amdahl's
Law type test before you even start).  Instead refer marginal rejects to
a deeper learning and research process.  Sometimes a complete rewrite of
a program with totally different algorithms will make it run very
efficiently at certain (desirable) scales on certain parallel
architectures that you can sometimes afford.  It's like that...;-).

Also remain aware (as the talk indicates) that scaling is everything
when determining parallelization.  Applying a given program to a "small"
task (e.g. a small lattice) may yield little parallel speedup or even a
negative one (it can easily run more slowly!). Applying the SAME PROGRAM
to a "large" task (e.g. a big lattice) may yield a big speedup.  You
have to understand speedup scaling relations with respect to at least
two dimensions (number of nodes and "size" of problem) in interaction
with your hardware layout and its fundamental topology, rates, latencies
and bandwidths (which can vary considerably) to determine whether or not
a program could be parallelized.  It's not generally a binary decision,
and the decision space is generally a lot bigger than even two
dimensional and very complex.
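
The two-dimensional scaling described above (number of nodes versus problem
size) can be illustrated with a toy model.  This is only a sketch under
stated assumptions: the strip decomposition, the per-site compute time, and
the network latency and bandwidth figures are all hypothetical and are not
taken from the talks or the draft book.

```python
# Toy speedup model for one sweep over an L x L lattice split into P strips.
# All default parameter values are illustrative assumptions, not measurements.

def speedup(L, P, t_site=1e-7, latency=1e-4, bandwidth=1e7, bytes_per_site=8):
    """Return estimated speedup T_serial / T_parallel for P nodes."""
    serial = t_site * L * L              # time to update every site on one CPU
    if P == 1:
        return 1.0
    compute = serial / P                 # perfectly divided work
    # Each strip swaps two boundary rows of L sites with its neighbours.
    comm = 2 * (latency + L * bytes_per_site / bandwidth)
    return serial / (compute + comm)

if __name__ == "__main__":
    for L in (32, 256, 4096):
        row = ", ".join(f"P={P}: {speedup(L, P):.2f}" for P in (1, 2, 8, 32))
        print(f"L={L:5d}  {row}")
```

With these assumed numbers the small lattice runs more slowly on 8 nodes than
on one, while the large lattice comes close to an 8-fold speedup, which is the
qualitative behaviour the paragraph describes.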

Still, good luck.

    rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From hahn at coffee.psychology.mcmaster.ca  Thu Jun 29 12:41:29 2000
From: hahn at coffee.psychology.mcmaster.ca (Mark Hahn)
Date: Thu, 29 Jun 2000 15:41:29 -0400 (EDT)
Subject: i810
In-Reply-To: <Pine.LNX.4.21.0006290926080.1041-100000@twin.uoregon.edu>
Message-ID: <Pine.LNX.4.10.10006291534290.29068-100000@coffee.psychology.mcmaster.ca>

> apart from poor video performance which you probably don't care about on a
> cluster, there's nothing really wrong with the i810. the issues

I believe that benchmarks show the i810 to deliver fairly poor
dram bandwidth (ie, inferior to the good old bx chipset.)
I'd think that would rule them out for a lot of clusters...


From joelja at darkwing.uoregon.edu  Thu Jun 29 13:14:58 2000
From: joelja at darkwing.uoregon.edu (Joel Jaeggli)
Date: Thu, 29 Jun 2000 13:14:58 -0700 (PDT)
Subject: i810
In-Reply-To: <Pine.LNX.4.10.10006291534290.29068-100000@coffee.psychology.mcmaster.ca>
Message-ID: <Pine.LNX.4.21.0006291309410.1041-100000@twin.uoregon.edu>

On Thu, 29 Jun 2000, Mark Hahn wrote:

> > apart from poor video performance which you probably don't care about on a
> > cluster, there's nothing really wrong with the i810. the issues
> 
> I believe that benchmarks show the i810 to deliver fairly poor
> dram bandwidth (ie, inferior to the good old bx chipset.)
> I'd think that would rule them out for a lot of clusters...

eric billings' posting on the 25th of april, on this list, shows it being
incrementally slower than the bx, although I think the issue he was trying
to highlight was the abysmal showing of the 840 based board...

>                                 COPY     SCALE    ADD    TRIAD
> LoBoS2  P3/450 with 440BX      209.6     202.7   230.4   212.6
> PIIIDME P3/Coppermine/866/i840 128.4     129.5    81.5   101.2
>         AMD/850                356.3     319.2   351.6   336.2
>         P3/Coppermine/866/I810 163.9     171.5   219.5   214.3  

-- 
-------------------------------------------------------------------------- 
Joel Jaeggli				       joelja at darkwing.uoregon.edu    
Academic User Services			     consult at gladstone.uoregon.edu
     PGP Key Fingerprint: 1DE9 8FCA 51FB 4195 B42A 9C32 A30D 121E
--------------------------------------------------------------------------
It is clear that the arm of criticism cannot replace the criticism of
arms.  Karl Marx -- Introduction to the critique of Hegel's Philosophy of
the right, 1843.


From tim at santafe.edu  Thu Jun 29 13:19:33 2000
From: tim at santafe.edu (Tim Carlson)
Date: Thu, 29 Jun 2000 14:19:33 -0600 (MDT)
Subject: Any good gig ether copper NICs?
Message-ID: <Pine.GSO.4.10.10006291357460.10631-100000@pele.santafe.edu>

Hey folks,

We are going to be installing a Cisco 6509 switch full of gig-copper
blades into our cluster of Supermicro PIIIDME boards (I still like these
boards despite all the recent comments).  Anyway, Cisco is giving us this
switch so I need to find some copper NICs. The folks at Cisco have been
telling us they have used the Intel NICs. That's great, but you can't buy
those cards yet :). Has anybody on the list actually used a gig-copper NIC
that you can buy and been reasonably happy? 

I sent a query off to Alteon and never got a response. The only other
copper card that I am aware of is the Syskonnect 9821.

TIA

Tim

Tim Carlson                                  Voice:    (505) 984-8800x255
Director of Computing: Santa Fe Institute    Fax:      (505) 982-0565
WWW: http://www.santafe.edu/~tim             Email:   tim at santafe.edu


From billm at troikanetworks.com  Thu Jun 29 13:42:52 2000
From: billm at troikanetworks.com (Bill Moshier)
Date: Thu, 29 Jun 2000 13:42:52 -0700
Subject: Any good gig ether copper NICs?
Message-ID: <C7CA595F9B9FD311A40D009027DC4A856C9226@host03.troikanetworks.com>

The Intel ones are supposed to be the fastest.  

It would be interesting to hear what kind of performance
users of GigE are getting.  Both in terms of latency,
and streaming bandwidth.

Bill



From haohe at me1.eng.wayne.edu  Thu Jun 29 15:29:48 2000
From: haohe at me1.eng.wayne.edu (HE Hao)
Date: Thu, 29 Jun 2000 17:29:48 -0500
Subject: Please help me to unsubscribe
Message-ID: <200006292133.RAA16892@me1>

Hi!

Sorry to bother you by sending such a mail.
But I really need your help now.
I have tried to unsubsrcibe beowulf mailing list
for more than one month but always failed.
The mails sent to ask for unsubscribing are all kicked back.
Please help me to unsubscribe.
Any help or advice will be appreciated.
Thank you very much!

Best regards,
Hao He


From covenant at dirac.org  Thu Jun 29 15:37:30 2000
From: covenant at dirac.org (Peter Jay Salzman)
Date: Thu, 29 Jun 2000 15:37:30 -0700 (PDT)
Subject: PBS via rpm and bash variables
Message-ID: <Pine.LNX.4.10.10006291527210.12194-100000@satan.dirac.org>

dear all,

i installed pbs on our front end and am reading the docs right now.

i need to set a bash variable PBS_HOME.  however, since i installed an rpm
instead of compiling pbs from source, there might be a directory already
picked out to be PBS_HOME.

if so, i can't find it in the docs.   looking at the output of
	rpm -ql pbs-2.2-5RH6
it looks as if that directory might be /usr/spool/pbs.

is that right?  or does it not matter what i call PBS_HOME?  it's hard to
tell if the rpm binaries have been compiled with a hardcoded PBS_HOME or
not.

thanks!
pete


From seth at hogg.org  Thu Jun 29 15:59:00 2000
From: seth at hogg.org (Simon Hogg)
Date: Thu, 29 Jun 2000 23:59:00 +0100
Subject: Well, it's late ...
Message-ID: <4.3.1.2.20000629235343.05958a60@icex5.cc.ic.ac.uk>

http://news.cnet.com/news/0-1003-200-2167700.html?tag=st.ne.1002.thed.ni
tells of the ASCI White IBM machine at LLNL running "23 percent faster
than anticipated".
OK, so it's an SP machine, but I like the line: "DOE cut costs for ASCI 
White by leasing it from IBM for two years instead of purchasing it 
outright, LLNL officials said earlier."

More intriguing is the phrase "and a new interconnection scheme code-named
'Colony'".  Does anyone have any information on this?

--
Simon


From bob at drzyzgula.org  Thu Jun 29 17:10:40 2000
From: bob at drzyzgula.org (Bob Drzyzgula)
Date: Thu, 29 Jun 2000 20:10:40 -0400
Subject: i815, [was Re: i810]
In-Reply-To: <Pine.LNX.4.21.0006290926080.1041-100000@twin.uoregon.edu>
Message-ID: <NDBBKDAAGLOGAIACHKMMEEMBCDAA.bob@drzyzgula.org>

Interesting article re: The new Intel 815 chipset. It seems as if we may
finally have a real replacement for the 440BX; the motherboard manufacturers
seem to like everything but the price...

	http://www.ebnonline.com/digest/story/OEG20000623S0046

> apart from poor video performance which you probably don't care about on a
> cluster, there's nothing really wrong with the i810. the issues
> surrounding the i820/i840 involve the memory translator hub for using sdram
> on the i820 being faulty, and the mrh for sdram on the 840 being lower
> performance than people expected. couple that with the early delay in
> delivery of the i820 because intel decided it didn't work that well with
> three rdram rimm sockets, and thus reduced the spec to two, causing a
> three month delay in the delivery of motherboards and you have the makings
> of a debacle...
>
> regards
> joelja


From glindahl at hpti.com  Thu Jun 29 17:26:18 2000
From: glindahl at hpti.com (Greg Lindahl)
Date: Thu, 29 Jun 2000 20:26:18 -0400
Subject: Well, it's late ...
In-Reply-To: <4.3.1.2.20000629235343.05958a60@icex5.cc.ic.ac.uk>
Message-ID: <002d01bfe229$cf21d260$e4844b89@hptilap.hpti.com>

> OK, so it's an SP machine, but I like the line; "DOE cut costs for ASCI
> White by leasing it from IBM for two years instead of purchasing it
> outright, LLNL officials said earlier."

Yes. Many big computer bids are actually leases. The FSL machine ($15 million
over 5 years) is a lease. Leases save money and waste money at the same
time; the devil is in the details.

> More intriguing is the phrase; "and a new interconnection scheme
> code-named
> "Colony"
> does anyone have any information on this?

The Colony switch is just a means to scale up the normal SP switch to a
bigger size.

-- g


From becker at scyld.com  Thu Jun 29 21:26:02 2000
From: becker at scyld.com (Donald Becker)
Date: Fri, 30 Jun 2000 00:26:02 -0400 (EDT)
Subject: Unsubscribe administrivia
In-Reply-To: <200006292133.RAA16892@me1>
Message-ID: <Pine.LNX.4.10.10006300011440.6225-100000@vaio.greennet>

On Thu, 29 Jun 2000, HE Hao wrote:

> I have tried to unsubsrcibe beowulf mailing list

"unsubscribe"  -- misspellings are not accepted.  I deleted this account by
hand.

> for more than one month but always failed.
> The mails sent to ask for unsubscribing are all kicked back.
...
> Beowulf mailing list
> Beowulf at beowulf.org
> http://www.beowulf.org/mailman/listinfo/beowulf

Please read the trailer attached to each list message.

We added this text and switched to 'mailman' to make admin actions as simple
and obvious as possible.  The new web interface allows on-line modification
of your list preferences, such as temporarily suspending your subscription
or receiving messages in digest format.

A new aspect is that 'mailman' automatically deactivates subscriptions with
properly formatted mail failures.  A problem for me is that most "invalid
address" bounces are ugly and are not automatically handled, while the
bounces for "temporarily over quota" or transient name server failures (like
the wide-spread problem a week ago) tend to cause bogus subscription
suspensions.  So if you suddenly stop getting list mail after your mail
partition fills up with spam, check the web page to verify that your address
is still active.


Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Beowulf Clusters / Linux Installations
Annapolis MD 21403


From lardjane at cnrs-orleans.fr  Fri Jun 30 00:53:58 2000
From: lardjane at cnrs-orleans.fr (Nicolas Lardjane)
Date: Fri, 30 Jun 2000 09:53:58 +0200
Subject: Beowulf & Fluid Mechanics
Message-ID: <200006300753.JAA06904@admin.cnrs-orleans.fr>


Hello.

I'd like to know if anyone has experience using a PC cluster to solve
fluid mechanics problems by domain decomposition methods. The question is: what
performance can be expected compared to supercomputers?

Thanks. 

________________________________________

Nicolas LARDJANE
CNRS - LCSR 
1C, Avenue de la Recherche Scientifique 
45071 ORLEANS Cedex 2 
FRANCE

e-mail : lardjane at cnrs-orleans.fr 
tel      : 33-02.38.25.76.13
fax     : 33-02.38.25.78.75
________________________________________

From jok707s at mail.smsu.edu  Fri Jun 30 03:33:10 2000
From: jok707s at mail.smsu.edu (jok707s at mail.smsu.edu)
Date: Fri, 30 Jun 2000 05:33:10 -0500
Subject: Apps & Design
Message-ID: <395D783D@caliber>

Several questions:

1.  What kind of work has been done in applying Beowulf to machine 
translation?  Would parallelism help when trying to translate several texts 
into several different languages in the shortest possible time?  Can existing 
web-based translation systems (Systran, InterTran, &c) be parallelised?

2.  Have the graphics folks in Hollywood experimented with Beowulfery as a 
possible addition to &/or replacement of their traditional machines for making 
digital images in movies?  Would Pixar and that gang be able to accomplish 
more in some situations with Beowulfs than they can with the SGIs, Suns, &c?

3a. Has anyone successfully assembled a completely wireless Beowulf with all 
of the nodes and the server connected to each other only by radio (or maybe 
infrared)?  Could this be done with cellular technology?

3b. If a wireless Beowulf can be made to work at all, would it be possible for 
its performance to get into the general area of the wired versions using 
technology available in the foreseeable future?

3c. If a wireless Beowulf were constructed entirely from laptops (including 
the server), could one have a "virtual" Beowulf with the individual nodes 
moving around while the system as a whole continued to function?  How much 
would performance degrade as the nodes got farther apart?  What would be the 
distance limits beyond which the system would fail completely?  How much 
difference would be made by the application(s) that were being run?

This should keep people emailing for a while?  :-)

Joel


From Thood at ifn.com  Fri Jun 30 05:28:39 2000
From: Thood at ifn.com (Hood, Tom)
Date: Fri, 30 Jun 2000 08:28:39 -0400
Subject: Apps & Design
References: <395D783D@caliber>
Message-ID: <395C9277.6D9743A7@ifn.com>

Being an ex-communications guy, I can address question 3 anyway...

"jok707s at mail.smsu.edu" wrote:
> 
> Several questions:
<snip>
> 3a. Has anyone successfully assembled a completely wireless Beowulf with all
> of the nodes and the server connected to each other only by radio (or maybe
> infrared)?  Could this be done with cellular technology?

Wireless TCP networks are in use today.  That said, the speeds across
the links are pathetic to say the least.  The best throughput I was able
to achieve was using multiplexed microwave, and we never got more than
10Mbps.  Our experiment was done several years ago, and I imagine
someone has built a better uwave rig by now.  RF (radio) is not very
efficient and requires loads of power.  Cellular (in its commercial
configuration) has so much overhead that the percentage of data to
carrier is way too low to sustain high data rates.  As I was leaving the
job where I did this work, we were starting to play with lasers and
getting great data rates, but all the nodes had to be fixed in place,
the range was quite short (less than a mile), and there had to be clear
line of sight between transmitter and receiver.  Extremely secure
however...


> 
> 3b. If a wireless Beowulf can be made to work at all, would it be possible for
> its performance to get into the general area of the wired versions using
> technology available in the foreseeable future?

Yes, with lasers or high power uwave.

> 
> 3c. If a wireless Beowulf were constructed entirely from laptops (including
> the server), could one have a "virtual" Beowulf with the individual nodes
> moving around while the system as a whole continued to function?  How much
> would performance degrade as the nodes got farther apart?  What would be the
> distance limits beyond which the system would fail completely?  How much
> difference would be made by the application(s) that were being run?

Yes, using RF (any band) or uwave, you can have the nodes be mobile. 
Performance does not degrade until SNR gets to the point where error
correction begins to require packet re-transmission.  Depending on the
frequency used and the power of the transmitters, this can range from
yards to miles.  For most RF bands (up to Extra SHF), the range is
measured in tens of miles at best depending on antenna characteristics. 
In the uwave range, best results are achieved by bouncing the signal off
satellites to ground stations, not directly peer to peer.  The downside
to this is the power consumption.  There's also a human physical problem
with being next to a high power uwave transmitter for any length of
time.


Thomas Hood
thood at ifn.com

> 
> This should keep people emailing for a while?  :-)
> 
> Joel
> 


From joysarkar at jncasr.ac.in  Fri Jun 30 16:42:47 2000
From: joysarkar at jncasr.ac.in (Mr.Joy Sarkar)
Date: Fri, 30 Jun 2000 18:42:47 -0500 (GMT+5)
Subject: Apps & Design
In-Reply-To: <395D783D@caliber>
Message-ID: <Pine.LNX.4.04.10006301840550.14489-100000@jncasr.ac.in>

> 
> 2.  Have the graphics folks in Hollywood experimented with Beowulfery as a 
> possible addition to &/or replacement of their traditional machines for making 
> digital images in movies?  Would Pixar and that gang be able to accomplish 
> more in some situations with Beowulfs than they can with the SGIs, Suns, &c?
> 

i know, and i think a lot of people know, that some of the digital imaging
for "Titanic" was rendered on a Beowulf-like Linux cluster!


From gropp at mcs.anl.gov  Fri Jun 30 07:00:28 2000
From: gropp at mcs.anl.gov (William Gropp)
Date: Fri, 30 Jun 2000 09:00:28 -0500
Subject: Beowulf & Fluid Mechanics
In-Reply-To: <200006300753.JAA06904@admin.cnrs-orleans.fr>
Message-ID: <4.2.2.20000630085342.01a4d9e0@localhost>

At 09:53 AM 6/30/2000 +0200, Nicolas Lardjane wrote:
>Hello.
>
>I'd like to know if someone has any experience of using PC cluster for 
>solving fluid mechanics problems by domain decomposition methods. The 
>question is what performance can be expected compared to super-computers ?

A fully implicit, unstructured CFD code was the subject of 
http://www.mcs.anl.gov/~gropp/papers/sc99/final-bell-12-4.pdf ; look at the 
ASCI Red results in comparison with the other (non-vector) supercomputer 
results.  Similar results have been seen on clusters with Myrinet.

Bill


From alazur at plogic.com  Fri Jun 30 07:36:02 2000
From: alazur at plogic.com (Adam Lazur)
Date: Fri, 30 Jun 2000 10:36:02 -0400
Subject: Take any two: motherboard performance, compatibility, value
In-Reply-To: <Pine.LNX.4.10.10006281945080.9200-100000@lisa.plogic.com>; from deadline@plogic.com on Wed, Jun 28, 2000 at 07:53:08PM -0400
References: <20000628155039.J2705@cs.ualberta.ca> <Pine.LNX.4.10.10006281945080.9200-100000@lisa.plogic.com>
Message-ID: <20000630103602.A23499@calypso.eecs.lehigh.edu>

Doug Eadline (deadline at plogic.com) said:
> We have Linux working on Serverworks systems. We are still testing, but
> our dual PIII Cu-733 with 1GB PC-133 SDRAM(ecc) gave pretty good stream
> numbers.  

For the benefit of the list, here are the stream numbers from a dual PIII
Cu-733 w/PC-133 SDRAM(ecc):

ServerWorks ServerSet III LE chipset board

[uniproc stream_wall test]
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:         376.7285       0.0935       0.0934       0.0940
Scale:        373.5860       0.0942       0.0942       0.0943
Add:          483.6362       0.1092       0.1092       0.1092
Triad:        341.0235       0.1548       0.1548       0.1549

[dualproc stream_wall test]
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:         161.1795       0.2216       0.2184       0.2244
Scale:        173.3435       0.2090       0.2031       0.2118
Add:          191.2891       0.2805       0.2760       0.2839
Triad:        185.7363       0.2953       0.2843       0.3018

Tyan S1834 Tiger 133 - VIA Apollo Pro 133A Chipset board

[uniproc stream_wall test]
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:         313.4796       0.1124       0.1123       0.1131
Scale:        312.1564       0.1128       0.1128       0.1128
Add:          411.2952       0.1284       0.1284       0.1287
Triad:        312.8129       0.1688       0.1688       0.1689

[dualproc stream_wall test]
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:         146.1533       0.2432       0.2408       0.2469
Scale:        147.7545       0.2402       0.2382       0.2420
Add:          165.5157       0.3218       0.3190       0.3237
Triad:        158.7726       0.3360       0.3326       0.3385


The results from the ServerWorks chipset are quite interesting. I also
experienced issues with not being able to set ECC (the menu option is
*missing*) on the Tyan S1834 mobo.

.adam

-- 
[                  Adam Lazur <alazur at plogic.com>                     ]
[      Paralogic Inc. - www.plogic.com - www.xtreme-machines.com      ]


From Lechner at drs-esg.com  Fri Jun 30 08:17:43 2000
From: Lechner at drs-esg.com (Lechner, David)
Date: Fri, 30 Jun 2000 11:17:43 -0400
Subject: Take any two: motherboard performance, compatibility, value
Message-ID: <D6F1CB2A6FD3D211A0AD00A0C995F320B8297E@mercury.tas.drs.com>

I would like to especially thank Paralogic Corp. (Doug and Adam) for making
this posting - VERY interesting indeed - 

Do I have it right in that the STREAM functions used in the benchmark are
really better suited for a single processor, and thus when the OS tries to
allocate and divide the work then extra work has to be done coordinating the
work (looks like a 10% hit beyond linear division of the bandwidth by 2) ? 

Do I interpret the timing results to see that the single processor function
is much faster than the dual processor function for the same work?
Is this because of the nature of the data set (such as very large matrix
manipulation) that the second processor is not only unable to help at all
but actually makes things twice as bad by getting involved?

Or is it because 2 versions are running in parallel, getting into trouble,
and then taking a little longer than 2x as long to do 2x as much work?

(Adam/Doug - Were you using an auto-parallelizing compiler for the dual
processor version? Or the hack suggested to run 2 jobs at once?)
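
One way to put numbers on the "hit beyond linear division by 2" is to divide
each dual-processor rate by half of the matching single-processor rate.  The
sketch below does this for the ServerWorks figures quoted in this thread.  It
assumes, without confirmation from the original post, that the dualproc rates
are per-process figures from two copies running at once; if they are instead
aggregate rates the interpretation changes.

```python
# STREAM rates (MB/s) copied from the ServerWorks ServerSet III LE results
# posted in this thread.
uni = {"Copy": 376.7285, "Scale": 373.5860, "Add": 483.6362, "Triad": 341.0235}
dual = {"Copy": 161.1795, "Scale": 173.3435, "Add": 191.2891, "Triad": 185.7363}

def share_of_even_split(uni_rates, dual_rates, nproc=2):
    """Ratio of each dual-run rate to an even 1/nproc split of the uni rate.

    A value of 1.0 means the memory bandwidth was divided with no extra loss;
    below 1.0 means contention cost more than a plain division would.
    """
    return {k: dual_rates[k] / (uni_rates[k] / nproc) for k in uni_rates}

if __name__ == "__main__":
    for name, ratio in share_of_even_split(uni, dual).items():
        print(f"{name:6s} {ratio:.3f}")
```

On these figures the ratio is not a uniform 10% loss: it ranges from roughly
0.79 (Add) to about 1.09 (Triad).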

W/Regards/ & Thanks In Advance/ 
 Dave Lechner




From jrforsythe at msn.com  Fri Jun 30 08:16:13 2000
From: jrforsythe at msn.com (Jim Forsythe)
Date: Fri, 30 Jun 2000 09:16:13 -0600
Subject: Beowulf & Fluid Mechanics
In-Reply-To: <4.2.2.20000630085342.01a4d9e0@localhost>
Message-ID: <LOBBKPCNALHIGJHFJBHEEEJEDKAA.jrforsythe@msn.com>

     The unstructured code Cobalt60 has a Linux version now, supplied by
myself at the Air Force Academy.  The largest Linux cluster we have run on
has 44 processors on 22 nodes.  It scaled linearly on that cluster (i.e. 100%
parallel efficiency) on a 2 million cell grid.  They recently ran this code
on 1024 processors of an SP3 and got over 98% efficiency (on a 3.2 million
cell grid).  For this code, scaling seems to stay linear until you get too few
cells on a processor.  For the expensive machines, this is about 2000 cells
per processor.  For our cluster it is around 8000.  Our cluster is 500 MHz
PIII, with 100BaseT.  We were shocked that we got a linear speedup on 100BaseT -
we were expecting to have to buy Myrinet or gigabit.  The domain
decomposition is done by ParMETIS, which seems to do a great job of load
balancing and of minimizing the number of faces on the interface between
processors.
	Per processor our cluster is roughly equivalent to a more recent SP2, or a
225 MHz Origin 2000.  The SP3 is about 50% faster, and the T3E is about 50%
slower.  So with a linear speedup, and good per processor performance, we
couldn't be happier with our cluster.
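
The cells-per-processor thresholds above give a quick rule of thumb for how
far a grid can be spread before scaling stops being linear.  The sketch below
just does that arithmetic with the figures from this message; the helper
names are made up for illustration and have nothing to do with Cobalt60
itself.

```python
# Rule-of-thumb arithmetic using the thresholds quoted in this message:
# about 8000 cells/processor on the 100BaseT cluster and about 2000 on the
# expensive machines.  Function names are hypothetical.

def cells_per_proc(cells, procs):
    """Average number of grid cells handled by each processor."""
    return cells / procs

def max_linear_procs(cells, min_cells_per_proc):
    """Largest processor count that keeps every processor above the threshold."""
    return cells // min_cells_per_proc

if __name__ == "__main__":
    # 2 million cell grid on the 44-processor cluster
    print(cells_per_proc(2_000_000, 44))       # about 45,000 cells each
    print(max_linear_procs(2_000_000, 8000))   # 250 processors
    # 3.2 million cell grid on 1024 SP3 processors
    print(cells_per_proc(3_200_000, 1024))     # 3125 cells each
    print(max_linear_procs(3_200_000, 2000))   # 1600 processors
```

Both reported runs sit above their thresholds, which is consistent with the
near-linear scaling seen on each machine.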

There is a Cobalt page at:
http://www.va.afrl.af.mil/vaa/vaac/COBALT/


Jim Forsythe
USAF Academy

-----Original Message-----
From: beowulf-admin at beowulf.org [mailto:beowulf-admin at beowulf.org]On
Behalf Of William Gropp
Sent: Friday, June 30, 2000 8:00 AM
To: Nicolas Lardjane
Cc: beowulf at beowulf.org
Subject: Re: Beowulf & Fluid Mechanics


At 09:53 AM 6/30/2000 +0200, Nicolas Lardjane wrote:
>Hello.
>
>I'd like to know if someone has any experience of using PC cluster for
>solving fluid mechanics problems by domain decomposition methods. The
>question is what performance can be expected compared to super-computers ?

A fully implicit, unstructured CFD code was the subject of
http://www.mcs.anl.gov/~gropp/papers/sc99/final-bell-12-4.pdf ; look at the
ASCI Red results in comparison with the other (non-vector) supercomputer
results.  Similar results have been seen on clusters with Myrinet.

Bill




From josip at icase.edu  Fri Jun 30 08:25:57 2000
From: josip at icase.edu (Josip Loncaric)
Date: Fri, 30 Jun 2000 11:25:57 -0400
Subject: Beowulf & Fluid Mechanics
References: <200006300753.JAA06904@admin.cnrs-orleans.fr>
Message-ID: <395CBC05.6DCE04C2@icase.edu>

Nicolas Lardjane wrote:
> 
> Hello.
> 
> I'd like to know if someone has any experience of using PC cluster for
> solving fluid mechanics problems by domain decomposition methods. The
> question is what performance can be expected compared to
> super-computers ?

On coarse grained problems, our 32 single CPU Pentium II 400MHz boxes
perform about as well as a 16 CPU (R10000, 250MHz, IP28) SGI Origin
2000.  However, our cost was 5-10 times lower.  See

  http://www.icase.edu/CoralProject.html

and particularly Brian Allan's results
  
  http://www.icase.edu/~allan/coral/Nov_99/index.html
  
Fine grained problems do not work as well (our switched Fast Ethernet
network is a bottleneck for more than 10 nodes).  Recently, Giganet
loaned us some hardware that we could test, and it did improve scaling
in such cases (speedup was actually better than on SGI Origin).  Brian's
Giganet results are available at

  http://www.icase.edu/~allan/coral/June_00/index.html

To me, the most interesting conclusions based on Brian's tests concern
MPI implementations.  MPI/Pro really shows its advantages on dual CPU
machines with a very fast network, despite the fact that MVICH has much
lower latency.  We used to blame memory bottlenecks for the 25%
performance penalty typically observed on SMP machines with Fast
Ethernet; but now it appears that this penalty is primarily due to
polling in LAM, MPICH and MVICH.  With Giganet and MVICH, the SMP
performance penalty grows to about 40%, almost negating the benefit of
the second CPU.  With MPI/Pro, the SMP performance penalty is no longer
there.  On the other hand, LAM/MPICH/MVICH implementations work somewhat
better on uniprocessor nodes.  Moreover, Giganet latency with MVICH is
only 14 microseconds, much better than MPI/Pro's 86 microseconds.  While
we were limited to 16 CPUs in these tests, it appears that MPI/Pro's
higher latency may negate its SMP performance advantage when more than
about 20 CPUs are used (as the number of CPUs grows, more smaller
messages are exchanged, and the test becomes more latency sensitive).
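
To see why smaller messages make a test more latency sensitive, here is a
minimal sketch of the usual "latency plus size over bandwidth" cost model.
Only the 14 and 86 microsecond latencies come from the numbers above; the
100 MB/s bandwidth is a hypothetical value used for both libraries, and the
model ignores the SMP polling penalty entirely.

```python
def message_time_us(size_bytes, latency_us, bandwidth_mb_s):
    """Time to send one message: fixed latency plus size / bandwidth."""
    # 1 MB/s equals 1 byte per microsecond, so the quotient is in microseconds.
    return latency_us + size_bytes / bandwidth_mb_s


def latency_fraction(size_bytes, latency_us, bandwidth_mb_s):
    """Fraction of the total message time that is spent on latency."""
    return latency_us / message_time_us(size_bytes, latency_us, bandwidth_mb_s)


if __name__ == "__main__":
    ASSUMED_BANDWIDTH_MB_S = 100.0  # hypothetical, not a measured value
    for size in (100_000, 10_000, 1_000):
        for name, latency in (("MVICH", 14.0), ("MPI/Pro", 86.0)):
            frac = latency_fraction(size, latency, ASSUMED_BANDWIDTH_MB_S)
            print(f"{size:>7} bytes  {name:<8} latency share {frac:.2f}")
```

As the message size shrinks, the latency share grows for both libraries,
and it grows faster for the one with the higher latency.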
  

Sincerely,
Josip

-- 
Dr. Josip Loncaric, Senior Staff Scientist        mailto:josip at icase.edu
ICASE, Mail Stop 132C           PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center             mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134


From alazur at plogic.com  Fri Jun 30 08:30:46 2000
From: alazur at plogic.com (Adam Lazur)
Date: Fri, 30 Jun 2000 11:30:46 -0400
Subject: Take any two: motherboard performance, compatibility, value
In-Reply-To: <D6F1CB2A6FD3D211A0AD00A0C995F320B8297E@mercury.tas.drs.com>; from Lechner@drs-esg.com on Fri, Jun 30, 2000 at 11:17:43AM -0400
References: <D6F1CB2A6FD3D211A0AD00A0C995F320B8297E@mercury.tas.drs.com>
Message-ID: <20000630113046.A23920@calypso.eecs.lehigh.edu>

Lechner, David (Lechner at drs-esg.com) said:
> Do I have it right in that the STREAM functions used in the benchmark are
> really better suited for a single processor, and thus when the OS tries to
> allocate and divide the work then extra work has to be done coordinating the
> work (looks like a 10% hit beyond linear division of the bandwidth by 2) ? 
> 
> Do I interpret the timing results to see that the single processor function
> is much faster than the dual processor function for the same work?
> Is this because of the nature of the data set (such as very large matrix
> manipulation) that the second processor is not only unable to help at all
> but actually makes things twice as worse by getting involved?
> 
> Or is it because 2 versions are running in parallel, getting into trouble,
> and then taking a little longer than 2x as long to do 2x as much work?

I believe it's due to the latter. When there are two copies of stream
running (see shell hack details below) they are competing for memory
bandwidth. The dualproc numbers are approximately the same for both copies
of stream.

> (Adam/Doug - Were you using an auto-parallelizing compiler for the dual
> processor version? Or the hack suggested to run 2 jobs at once?)

We used a shell hack (similar to the one suggested on the stream web page)
to run two jobs at once. I'd be very interested in running a threaded
version if anyone has a copy.

Oh, I've been told I should have included that we compiled stream_wall
from source with egcs-2.91.66 using the -O2 option.
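
For anyone who wants to repeat the "two jobs at once" run, here is a
minimal Python sketch of the same idea.  It is not the shell hack we
actually used; the function name is made up, and the demo command is a
stand-in so the sketch runs anywhere.

```python
import subprocess
import sys


def run_copies(cmd, n_copies=2):
    """Start n_copies of cmd at the same time and collect each copy's stdout."""
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
        for _ in range(n_copies)
    ]
    # communicate() waits for each process to exit and returns (stdout, stderr).
    return [p.communicate()[0] for p in procs]


if __name__ == "__main__":
    # Stand-in command; replace it with the path to your own compiled
    # STREAM binary, e.g. ["./stream_wall"] (hypothetical location).
    demo_cmd = [sys.executable, "-c", "print('copy finished')"]
    for i, out in enumerate(run_copies(demo_cmd)):
        print(f"copy {i}: {out.strip()}")
```

All copies are started before any of them is waited on, so they compete
for memory bandwidth for the whole run.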

.adam

-- 
[                  Adam Lazur <alazur at plogic.com>                     ]
[      Paralogic Inc. - www.plogic.com - www.xtreme-machines.com      ]


From Lechner at drs-esg.com  Fri Jun 30 09:18:21 2000
From: Lechner at drs-esg.com (Lechner, David)
Date: Fri, 30 Jun 2000 12:18:21 -0400
Subject: Take any two: motherboard performance, compatibility, value
Message-ID: <D6F1CB2A6FD3D211A0AD00A0C995F320B829C1@mercury.tas.drs.com>

So then the time is a bit more than 2 times as long for dual mode since each
single copy can only access memory at half the bandwidth - 
And while there are two processors, in the dual mode you ran there is also
twice the work (but only half the memory bandwidth and contention)- 
As I said before, very interesting...
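
To put numbers on that reasoning, here is a small Python sketch.  It
assumes the benchmark is purely memory-bandwidth bound and that the roughly
10% contention hit mentioned earlier in the thread applies whenever more
than one copy runs; the function names are made up for illustration.

```python
def per_copy_bandwidth(total_bw, n_copies, contention_loss=0.10):
    """Bandwidth seen by each copy when n_copies share one memory bus."""
    if n_copies == 1:
        return total_bw
    # Even split of the bus, minus the assumed contention overhead.
    return total_bw / n_copies * (1.0 - contention_loss)


def slowdown(n_copies, contention_loss=0.10):
    """How much longer each copy takes compared with running alone."""
    return 1.0 / per_copy_bandwidth(1.0, n_copies, contention_loss)


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, "copies: each takes", round(slowdown(n), 2), "x as long")
```

With two copies this gives a bit more than 2x the single-copy time, which
matches the reading above.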

Has anyone done this test with an auto-parallelized version of STREAM doing one
work set on 1, 2, 3, and 4 processors on an SMP box with a single shared
memory bank, and then doing the same with multiple (N = # processors) copies
of the benchmark running concurrently (with contention)?

W/Regards/
 Dave Lechner




From josip at icase.edu  Fri Jun 30 09:20:58 2000
From: josip at icase.edu (Josip Loncaric)
Date: Fri, 30 Jun 2000 12:20:58 -0400
Subject: Take any two: motherboard performance, compatibility, value
References: <395A4F06.6C3F1A70@icase.edu> <20000628222858.N1603@ostenfeld.dtu.dk> <20000628183954.B3703@mercury.drzyzgula.org> <395B4371.EF3A3D14@icase.edu>
Message-ID: <395CC8EA.854872ED@icase.edu>

I need to correct myself.  When I wrote that the ecc

> module (which currently works *only* on uniprocessor machines) is
> available from
> 
> http://www.anime.net/~goemon/linux-ecc/

the part about difficulties with SMP was not quite stated right.  'ecc'
may be SMP capable, but even when SMP switches are enabled in Makefile,
'make' reports lots of compilation warnings and 'insmod ecc' complains
as follows:

ecc: unresolved symbol proc_register
ecc: unresolved symbol pci_find_class
...
ecc: unresolved symbol jiffies
ecc: unresolved symbol printk
ecc: unresolved symbol add_timer

Obviously, something has changed in the past year, so that old SMP
compilation commands no longer work.  I do not know the reason.

Sincerely,
Josip

-- 
Dr. Josip Loncaric, Senior Staff Scientist        mailto:josip at icase.edu
ICASE, Mail Stop 132C           PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center             mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134


From alazur at plogic.com  Fri Jun 30 09:46:52 2000
From: alazur at plogic.com (Adam Lazur)
Date: Fri, 30 Jun 2000 12:46:52 -0400
Subject: Take any two: motherboard performance, compatibility, value
In-Reply-To: <395CC8EA.854872ED@icase.edu>; from josip@icase.edu on Fri, Jun 30, 2000 at 12:20:58PM -0400
References: <395A4F06.6C3F1A70@icase.edu> <20000628222858.N1603@ostenfeld.dtu.dk> <20000628183954.B3703@mercury.drzyzgula.org> <395B4371.EF3A3D14@icase.edu> <395CC8EA.854872ED@icase.edu>
Message-ID: <20000630124652.A24201@calypso.eecs.lehigh.edu>

Josip Loncaric (josip at icase.edu) said:
> the part about difficulties with SMP was not quite stated right.  'ecc'
> may be SMP capable, but even when SMP switches are enabled in Makefile,
> 'make' reports lots of compilation warnings and 'insmod ecc' complains
> as follows:
> 
> ecc: unresolved symbol proc_register
> ecc: unresolved symbol pci_find_class
> ...
> ecc: unresolved symbol jiffies
> ecc: unresolved symbol printk
> ecc: unresolved symbol add_timer
> 
> Obviously, something has changed in the past year, so that old SMP
> compilation commands no longer work.  I do not know the reason.

Hmmm. I'm compiling it okay using kernel 2.2.16 and ecc-0.9 (though I
picked up an ecc-0.11.tgz somewhere as well). I edited the Makefile to
make my CFLAGS a bit more kernel_module_compile looking:

CFLAGS= -O2 -fomit-frame-pointer -fno-strict-aliasing -D__SMP__ -pipe \
	-fno-strength-reduce -m486 -malign-loops=2 -malign-jumps=2 \
	-malign-functions=2 -DCPU=686

The module seems to insert fine into the SMP kernel, though I don't have
any bit flipping occurring (though it could be doing so and somehow not
telling me).

I also cc'd the ecc mailing list in hopes of getting a definitive answer
on the SMP ability of the ecc module.

.adam

-- 
[                  Adam Lazur <alazur at plogic.com>                     ]
[      Paralogic Inc. - www.plogic.com - www.xtreme-machines.com      ]


From kragen at pobox.com  Fri Jun 30 11:29:59 2000
From: kragen at pobox.com (Kragen Sitaker)
Date: Fri, 30 Jun 2000 14:29:59 -0400 (EDT)
Subject: Apps & Design
Message-ID: <Pine.GSO.4.21.0006301406500.28182-100000@kirk.dnaco.net>

"Joel" asks:
> 1.  What kind of work has been done in applying Beowulf to machine 
> translation?  Would parallelism help when trying to translate several texts 
> into several different languages in the shortest possible time?  Can 
> existing web-based translation systems (Systran, InterTran, &c) be 
> parallelised?

Unfortunately, I don't know anything about machine translation, so I
can't answer the first question; but as for the second question,
translating multiple documents is an obviously parallelizable task, as
the results of the translations are independent.  From looking at
SYSTRAN output, it sort of looks like results of translations of
individual phrases in a document are pretty independent when they're
further than a sentence or two apart.
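
Here is a minimal Python sketch of why this is easy to parallelize: every
(text, language) pair is an independent job.  The translate() function is a
placeholder, not a real machine translation API, and the thread pool only
stands in for cluster nodes; on a real Beowulf you would hand the jobs out
with MPI, PVM, or a batch queue.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def translate(text, language):
    """Placeholder for a call to a real translation engine (hypothetical)."""
    return f"[{language}] {text}"


def translate_all(texts, languages, max_workers=4):
    """Translate every text into every language, one independent job per pair."""
    jobs = list(product(texts, languages))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the jobs were submitted.
        results = pool.map(lambda job: translate(*job), jobs)
        return dict(zip(jobs, results))


if __name__ == "__main__":
    out = translate_all(["hello", "goodbye"], ["fr", "de", "es"])
    for (text, lang), result in out.items():
        print(text, lang, result)
```

No job needs the result of any other job, so adding workers shortens the
wall-clock time until you run out of jobs.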

> 2.  Have the graphics folks in Hollywood experimented with Beowulfery as a 
> possible addition to &/or replacement of their traditional machines for 
> making digital images in movies?  Would Pixar and that gang be able to 
> accomplish more in some situations with Beowulfs than they can with the 
> SGIs, Suns, &c?

Some of the digital imagery in Titanic was rendered on a Linux cluster
at Digital Domain.  (I think it was a Beowulf, but I'm not sure.)
Pixar, I've heard, has practiced company-wide cycle harvesting for
quite some time.

Rendering and ray-tracing in general are highly parallelizable.  In
ray-tracing, each pixel is independent of all other pixels; in rendered
or ray-traced animation, each frame is independent of all other
frames.
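
Since frames are independent, handing them out to nodes is just a matter
of splitting the frame list.  Here is a minimal sketch of one way to do
it, using contiguous blocks of nearly equal size; the function name is made
up, and real render farms often use a dynamic queue instead because frames
differ in cost.

```python
def split_frames(n_frames, n_nodes):
    """Divide frame indices 0..n_frames-1 into n_nodes contiguous blocks."""
    base, extra = divmod(n_frames, n_nodes)
    blocks = []
    start = 0
    for node in range(n_nodes):
        # The first `extra` nodes take one additional frame each.
        size = base + (1 if node < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


if __name__ == "__main__":
    for node, block in enumerate(split_frames(10, 4)):
        print("node", node, "renders frames", list(block))
```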

> 3a. Has anyone successfully assembled a completely wireless Beowulf with 
> all of the nodes and the server connected to each other only by radio 
> (or maybe infrared)?  Could this be done with cellular technology?

I'm sure somebody has built radio-based Beowulfs, although I haven't
heard about it.  (Maybe they'll post. :)  At the moment, radio
networking for PCs is pretty low-bandwidth, so only EP problems will
perform well; also, non-cellular radio is broadcast, not full-duplex
point-to-point, dividing your bandwidth by the size of your cluster
times two.  (Using FDMA or CDMA, you could decrease this problem to
some extent.)
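
As a back-of-the-envelope version of that estimate, here is a tiny Python
sketch.  The 2 Mbps channel rate in the demo is a hypothetical figure, not
a measurement, and the formula is just the "divide by cluster size times
two" rule from the paragraph above.

```python
def per_node_bandwidth(channel_mbps, n_nodes):
    """Rough per-node share of one broadcast, half-duplex radio channel."""
    return channel_mbps / (2 * n_nodes)


if __name__ == "__main__":
    # Hypothetical 2 Mbps shared channel and a 16-node cluster.
    print(per_node_bandwidth(2.0, 16), "Mbps per node")
```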

Many people have built completely wireless clusters communicating via
infrared; for quite a while, it was the only way to get Gigabit
Ethernet working.  However, the fiber-optic cables they shone the
infrared light through are more expensive and touchier than wires.

I haven't heard about anybody building free-space infrared-link
Beowulfs either.  I understand that cheap off-the-shelf free-space IR
transceivers are even slower than off-the-shelf radio network cards.

> 3b. If a wireless Beowulf can be made to work at all, would it be possible 
> for its performance to get into the general area of the wired versions 
> using technology available in the foreseeable future?

Sure --- you could do it today.  You can buy off-the-shelf 155Mbps
full-duplex free-space laser transceivers from LightPointe today.  Just
mount your, say, 16 nodes on the inside of a huge sphere, equip each of
them with fifteen of these babies (one pointed at each other node), and
for something like a quarter of a million dollars plus the cost of the
sphere, you have a really kick-ass Beowulf interconnect.  (I'm assuming
each transceiver is $1000; I haven't seen actual prices for them.)
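
The arithmetic behind that estimate, as a minimal Python sketch.  The
$1000 price is the same guess as above, and the function names are made up
for illustration.

```python
def full_mesh_transceivers(n_nodes):
    """Each node needs one transceiver pointed at every other node."""
    return n_nodes * (n_nodes - 1)


def full_mesh_cost(n_nodes, price_each):
    """Total transceiver cost for a fully connected free-space mesh."""
    return full_mesh_transceivers(n_nodes) * price_each


if __name__ == "__main__":
    print(full_mesh_transceivers(16), "transceivers")
    print(full_mesh_cost(16, 1000), "dollars at the guessed price")
```

The count grows with the square of the cluster size, which is why this
only looks sane for small clusters.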

But it would be much cheaper today to build a "wireless" Beowulf with
fiber optic cables, even if my price estimate is high by an order of
magnitude.  :)

In the long run --- maybe 10 or 15 years --- cellular radio networks
will be preferable to cables of any kind for all but the
most-concentrated-bandwidth communications.  (Anything over ten
gigabits to a single point.)  Free-space optical communication might
get cheaper, but I suspect you'll still be able to get more bandwidth
by using cables, and fairly cheaply.

> 3c. If a wireless Beowulf were constructed entirely from laptops (including 
> the server), could one have a "virtual" Beowulf with the individual nodes 
> moving around while the system as a whole continued to function?  How much 
> would performance degrade as the nodes got farther apart?  What would be 
> the distance limits beyond which the system would fail completely?  How 
> much difference would be made by the application(s) that were being run?

I'm sure the answer to the first question is "yes, but it'll be harder
to run than a standard wulf."

I don't know much about the rest of the questions; the Beowulf nature
of your network is probably not very relevant to them.  You should ask
in a wireless-network place --- and tell me the answers, because I'm
curious.  :)

It's arguable that a radio network can't be "private", and thus meet
the definition of "Beowulf", but I don't think that's the case.  All
you have to do is have it sufficiently far, or sufficiently
well-shielded, from other transceivers.

-- 
<kragen at pobox.com>       Kragen Sitaker     <http://www.pobox.com/~kragen/>
The Internet stock bubble didn't burst on 1999-11-08.  Hurrah!
<URL:http://www.pobox.com/~kragen/bubble.html>
The power didn't go out on 2000-01-01 either.  :)