From rgb at phy.duke.edu  Mon Apr  1 05:27:43 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Mon, 1 Apr 2002 08:27:43 -0500 (EST)
Subject: DHCP Help 
In-Reply-To: <LAW2-F146wdUTaGmvSV000068b4@hotmail.com>
Message-ID: <Pine.LNX.4.44.0204010812280.28647-100000@lucifer.rgb.private.net>

On Sat, 30 Mar 2002, Adrian Garcia Garcia wrote:
> Hello everybody, I'm a beginner and I have been having problems with my
> DHCP server. I can't assign IPs to the clients, and I don't know exactly
> whether it is the server or the client that is not working. I am working
> with Red Hat 7.1 and my DHCP client is dhcpcd, because I tried pump but it
> did not work. Please, can anybody give me some help? What can I do? Sorry
> for my poor English; in fact, I speak Spanish. Please help. Thanks a lot.
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> 

Mr. Garcia,

Please find below an example of the configuration that I use at home
for my private beowulf.  This is for dhcpd, in /etc/dhcpd.conf, and it is
for a private internal network with 192.168 IP numbers.  Note well the
three sections.  This works fine for computers that boot into Windows or
Linux or anything else with a DHCP client -- some of my computers at home
boot both.  Note also the addresses:

        range 192.168.1.192 192.168.1.224;

These are used only for computers that the server does not know, that is,
computers without a registered ethernet number and a static address.

I hope that this helps a little.  And forgive my bad Spanish; it is
(I am sure) worse than your English, but I need the practice.

   rgb

##############################################################################
#
# /etc/dhcpd.conf - configuration file for our DHCP/BOOTP server
#

###########################################################
# Global Parameters
###########################################################

option domain-name              "rgb.private.net";
option domain-name-servers      152.3.250.1;
option subnet-mask              255.255.255.0;
option broadcast-address        192.168.1.255;
use-host-decl-names on;

###########################################################
# Subnets
###########################################################
shared-network RGB {
  subnet 192.168.1.0 netmask 255.255.255.0 {
        range 192.168.1.192 192.168.1.224;
        default-lease-time              43200;
        max-lease-time                  86400;
        option routers                  192.168.1.1;
        option domain-name              "rgb.private.net";
        option domain-name-servers      152.3.250.1;
        option broadcast-address        192.168.1.255;
        option subnet-mask              255.255.255.0;
  }

}

###########################################################
# Static IP addresses managed by DHCP server
###########################################################

# Personal Computers (MSDOS/Win-3.x/WfW/Win-95/Win-NT/MacOS)
#host hostname {
#       hardware ethernet xx:xx:xx:xx:xx:xx;
#       fixed-address 152.3.xxx.xxx;
#       option host-name hostname;
#       option routers 152.3.xxx.250;
#}

# UNIX systems
#host hostname {
#       hardware ethernet xx:xx:xx:xx:xx:xx;
#       fixed-address 152.3.xxx.xxx;
#       option host-name hostname;
#       option routers 152.3.xxx.250;
#}


# adam future gateway redux? 300MHz Celeron
host adam {
        hardware ethernet       00:20:18:58:27:1a;
        fixed-address           192.168.1.1;
        next-server             192.168.1.131;
        option domain-name      "rgb.private.net";
        option host-name        "adam";
}

# tyrial (Linux/Windows workstation)
host tyrial {
        hardware ethernet       00:a0:cc:59:45:9b;
        fixed-address           192.168.1.134;
        next-server             192.168.1.131;
        option routers          192.168.1.1;
        option domain-name      "rgb.private.net";
        option host-name        "tyrial";
}

etc...
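
A quick way to sanity-check a layout like the one above is to confirm
that no fixed-address falls inside the dynamic range. The following is
only a sketch: the addresses are copied from the example configuration,
while the function names and the idea of checking this in Python are
illustrative assumptions, not part of dhcpd itself.

```python
import ipaddress

# Values copied from the example dhcpd.conf above.
SUBNET = ipaddress.ip_network("192.168.1.0/24")
POOL_START = ipaddress.ip_address("192.168.1.192")
POOL_END = ipaddress.ip_address("192.168.1.224")
STATIC_HOSTS = {
    "adam": ipaddress.ip_address("192.168.1.1"),
    "tyrial": ipaddress.ip_address("192.168.1.134"),
}


def in_dynamic_pool(addr):
    """True if addr lies inside the 'range' statement (both ends inclusive)."""
    return POOL_START <= addr <= POOL_END


def check_config():
    """Return a list of problems; an empty list means the layout is consistent."""
    problems = []
    for name, addr in STATIC_HOSTS.items():
        if addr not in SUBNET:
            problems.append(f"{name}: {addr} is outside {SUBNET}")
        if in_dynamic_pool(addr):
            problems.append(f"{name}: {addr} collides with the dynamic pool")
    return problems


# Number of addresses available to hosts the server does not know.
pool_size = int(POOL_END) - int(POOL_START) + 1
```

Calling check_config() on these values returns an empty list, since both
static hosts sit below the start of the pool.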

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From eugen at leitl.org  Mon Apr  1 13:19:31 2002
From: eugen at leitl.org (Eugen Leitl)
Date: Mon, 1 Apr 2002 23:19:31 +0200 (CEST)
Subject: FY;) Google's secret clustering technology
Message-ID: <Pine.LNX.4.33.0204012318320.20926-100000@hydrogen.leitl.org>


	http://www.google.com/technology/pigeonrank.html


From emiller at techskills.com  Mon Apr  1 17:41:18 2002
From: emiller at techskills.com (Eric Miller)
Date: Mon, 1 Apr 2002 20:41:18 -0500
Subject: Syntax for executing 
Message-ID: <NMELJLHHFNGMNFFNGJAEIEJHCDAA.emiller@techskills.com>

Hey all, got a five-node cluster up running 27-z9, preparing for a 30 node
cluster.

- What is the syntax to run an executable in the cluster environment?  For
example, I run

NP=5 mpi-mandel

to run the test fractal program.  How would I execute say, SETI, using the
cluster?  Assume that the SETI executable is in the PATH.  Also, the older
version of Scyld had some test code in /usr/mpi-beowulf/*.  Is that gone?

-  What would cause all but one of the processors to show usage in
beostatus?  The node shows "up" in every other way (identical hardware,
memory, swap, network, etc.), but when I run something, that one
processor on one node shows no % usage.

-ETM

  .~.
  /V\
 // \\
/(   )\
 ^'~'^


From hanzl at noel.feld.cvut.cz  Tue Apr  2 00:28:38 2002
From: hanzl at noel.feld.cvut.cz (hanzl at noel.feld.cvut.cz)
Date: Tue, 02 Apr 2002 10:28:38 +0200
Subject: Newest RPM's?
In-Reply-To: <002c01c1d933$ec02a3c0$c31fa6ac@xp>
References: <NMELJLHHFNGMNFFNGJAEMEIBCDAA.emiller@techskills.com>
	<1017612125.19271.20.camel@vhwalke.mathsci.usna.edu>
	<002c01c1d933$ec02a3c0$c31fa6ac@xp>
Message-ID: <20020402102838A.hanzl@unknown-domain>

> I am using RH7.2 on my master node and would like to RPM the latest stable
> version of Scyld, instead of using the CD (I have 27Bz-7, based on RH6.2)

I am not sure there is an RH7.2-based Scyld system available yet,
though it is quite possible I missed something.

You may consider Clustermatic - it is similar to Scyld but smaller
(and therefore easier), rpm install on top of RH7.2 works great and
you may download iso images if you want.

  http://www.clustermatic.org


See my previous post "Clustermatic: smooth upgrade to new version" for
rpm-install microhowto:

  http://www.beowulf.org/pipermail/beowulf/2002-March/002969.html

HTH

Vaclav Hanzl


From daniel.kidger at quadrics.com  Tue Apr  2 01:10:20 2002
From: daniel.kidger at quadrics.com (Dan Kidger)
Date: Tue, 2 Apr 2002 10:10:20 +0100
Subject: FY;) Google's secret clustering technology
References: <Pine.LNX.4.33.0204012318320.20926-100000@hydrogen.leitl.org>
Message-ID: <002601c1da27$10987090$0100a8c0@spot>

----- Original Message -----
"Eugen Leitl" <eugen at leitl.org> wrote:
>To: <Beowulf at beowulf.org>
>Sent: Monday, April 01, 2002 10:19 PM
>Subject: FY;) Google's secret clustering technology
>

> http://www.google.com/technology/pigeonrank.html

>

This is a very interesting article. However, there is no mention of them
using the Quadrics Interconnect, nor for that matter Myrinet, Scali or even
plain ethernet.

I can only assume the whole cluster is run by just using cereal lines.

Yours,
Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger at quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
----------------------- www.quadrics.com --------------------


From hanzl at noel.feld.cvut.cz  Tue Apr  2 01:51:11 2002
From: hanzl at noel.feld.cvut.cz (hanzl at noel.feld.cvut.cz)
Date: Tue, 02 Apr 2002 11:51:11 +0200
Subject: 27z-9 ? (Was: Syntax for executing )
In-Reply-To: <NMELJLHHFNGMNFFNGJAEIEJHCDAA.emiller@techskills.com>
References: <NMELJLHHFNGMNFFNGJAEIEJHCDAA.emiller@techskills.com>
Message-ID: <20020402115111Y.hanzl@unknown-domain>

> Hey all, got a five-node cluster up running 27-z9

I can see just a few files at

  ftp://ftp.scyld.com/pub/beowulf/27z-9/

Can anybody please comment on the status of 27z-9?

Thanks

Vaclav


From rbw at ahpcrc.org  Tue Apr  2 07:48:50 2002
From: rbw at ahpcrc.org (Richard Walsh)
Date: Tue, 2 Apr 2002 09:48:50 -0600
Subject: Uptime data/studies/anecdotes ... ?
Message-ID: <200204021548.g32Fmod13276@mycroft.ahpcrc.org>

All,

What information is available on typical uptimes
of large-scale clusters ... say, more than 256
processors and running a multi-user workload? What
gains do single-point-of-administration tools like
SCYLD provide?  Clearly, there are a great number
of things one can do to maximize uptime/utilization
(not the same thing, really).  What are the essentials
from the list's point of view?

If a good figure is, say, 80% utilization over an
8760-hour year today, what will this number be in
three years?  Annual utilization for the 1088-processor
T3E we run here is about 95%.  How long until a similarly
sized cluster typically yields the same value?
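
To put the two percentages in concrete terms, here is a small sketch of
the arithmetic. It assumes "utilization" means the fraction of available
processor-hours spent on user work, and it compares the T3E against a
hypothetical cluster with the same 1088 processors; the function name is
illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, the figure used above


def delivered_cpu_hours(processors, utilization, hours=HOURS_PER_YEAR):
    """Processor-hours of user work delivered in one year."""
    return processors * utilization * hours


t3e = delivered_cpu_hours(1088, 0.95)      # the T3E figure quoted above
cluster = delivered_cpu_hours(1088, 0.80)  # same size at the "good" 80% figure
gap = t3e - cluster                        # processor-hours lost to the 15-point gap
```

Under these assumptions the 15-point difference is roughly 1.43 million
processor-hours per year on a machine of this size.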

Regards,

rbw

#---------------------------------------------------
#
# Richard Walsh
# Project Manager, Cluster Computing, Computational
#                  Chemistry and Finance
# netASPx, Inc.
# 1200 Washington Ave. So.
# Minneapolis, MN 55415
# VOX:    612-337-3467
# FAX:    612-337-3400
# EMAIL:  rbw at networkcs.com, richard.walsh at netaspx.com
#
#---------------------------------------------------
# "What you can do, or dream you can, begin it;
#  Boldness has genius, power, and magic in it."
#                                  -Goethe
#---------------------------------------------------
# "Without mystery, there can be no authority."
#                                  -Charles DeGaulle
#---------------------------------------------------
# "Why waste time learning when ignorance is
#  instantaneous?"                 -Thomas Hobbes
#---------------------------------------------------


From roger at ERC.MsState.Edu  Tue Apr  2 08:15:00 2002
From: roger at ERC.MsState.Edu (Roger L. Smith)
Date: Tue, 2 Apr 2002 10:15:00 -0600
Subject: Uptime data/studies/anecdotes ... ?
In-Reply-To: <200204021548.g32Fmod13276@mycroft.ahpcrc.org>
Message-ID: <Pine.SGI.4.44.0204021008580.75196-100000@Downforce.ERC.MsState.Edu>

We currently average about 75% utilization on our 586-processor
(293-node) cluster.  We probably have about one node per week crash or
hang, for various reasons.

We have occasional problems with memory leaks or PBS hangups which require
large scale reboots of the cluster. (Actually, PBS just died as I'm typing
this, but our pbs heartbeat script should restart it automatically in a
few minutes).  I'd say we have to do a full reboot of the cluster about
every 3-4 months.

For a bunch of PC hardware running a free OS, this seems like a pretty
good number to me.  It's not in the same class as our Sun servers (nor
even our SGIs!), but then, none of those systems are this large, either.
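
For scale, the failure rate above can be turned into a rough per-node
figure. This is a back-of-the-envelope sketch that assumes failures are
independent and spread evenly over all 293 nodes, which the post does
not claim; the names are illustrative.

```python
NODES = 293               # node count from the post
FAILURES_PER_WEEK = 1.0   # "about one node per week"
WEEKS_PER_YEAR = 365 / 7


def per_node_mtbf_weeks(nodes, failures_per_week):
    """Mean weeks between failures for a single node, assuming a uniform rate."""
    return nodes / failures_per_week


mtbf_weeks = per_node_mtbf_weeks(NODES, FAILURES_PER_WEEK)
mtbf_years = mtbf_weeks / WEEKS_PER_YEAR
```

On those assumptions each individual node goes about 293 weeks, or
roughly 5.6 years, between crashes.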


On Tue, 2 Apr 2002, Richard Walsh wrote:

>
> All,
>
> What information is available on typical uptimes
> of large-scale clusters ... say, more than 256
> processors and running a multi-user workload? What
> gains do single-point-of-administration tools like
> SCYLD provide?  Clearly, there are a great number
> of things one can do to maximize uptime/utilization
> (not the same thing, really).  What are the essentials
> from the list's point of view?
>
> If a good figure is, say, 80% utilization over an
> 8760-hour year today, what will this number be in
> three years?  Annual utilization for the 1088 processor
> T3E we run here is about 95%.  How long until a similarly
> sized cluster typically yields the same value?
>
> Regards,
>
> rbw


 _\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_
| Roger L. Smith                        Phone: 662-325-3625               |
| Systems Administrator                 FAX:   662-325-7692               |
| roger at ERC.MsState.Edu                 http://WWW.ERC.MsState.Edu/~roger |
|                       Mississippi State University                      |
|_______________________Engineering Research Center_______________________|


From rbw at ahpcrc.org  Tue Apr  2 10:24:22 2002
From: rbw at ahpcrc.org (Richard Walsh)
Date: Tue, 2 Apr 2002 12:24:22 -0600
Subject: Uptime data/studies/anecdotes ... ?
Message-ID: <200204021824.g32IOMa14409@mycroft.ahpcrc.org>

On Tue, 2 Apr 2002 10:15:00 Roger Smith wrote:

>We currently run an average of about 75% utilization on our 586 processor
>(293 node)  cluster.  We probably have about one node per week crash and
>hang for various reasons.
>
>We have occasional problems with memory leaks or PBS hangups which require
>large scale reboots of the cluster. (Actually, PBS just died as I'm typing
>this, but our pbs heartbeat script should restart it automatically in a
>few minutes).  I'd say we have to do a full reboot of the cluster about
>every 3-4 months.
>
>For a bunch of PC hardware running a free OS, this seems like a pretty
>good number to me.  It's not in the same class as our Sun servers (nor
>even our SGIs!), but then, none of those systems are this large, either.

Thanks for the estimate.  Do you use SCYLD or another
pseudo-single-system-image tool? I assume that 75% is a steady-state
number ... how long did it take your group to reach that state?  If a
full reboot is required only every 3-4 months, is single-node failure
your main source of cycle loss? Or are other things, like inefficient
scheduling and lack of checkpoint/restart, important?

75% does seem like a reasonably good number.

rbw


From roger at ERC.MsState.Edu  Tue Apr  2 10:46:07 2002
From: roger at ERC.MsState.Edu (Roger L. Smith)
Date: Tue, 2 Apr 2002 12:46:07 -0600
Subject: Uptime data/studies/anecdotes ... ?
In-Reply-To: <200204021824.g32IOMa14409@mycroft.ahpcrc.org>
Message-ID: <Pine.SGI.4.44.0204021236380.75196-100000@Downforce.ERC.MsState.Edu>

On Tue, 2 Apr 2002, Richard Walsh wrote:

> Thanks for the estimate.  Do you use SCYLD or another pseudo-single-system-
> image tool?

Nope, we use RH 7.2, PBS, and MPI/Pro, MPICH, and LAM MPI.

> I assume that 75% is a steady state number ... how long did
> it take your group to reach that state?

Our users are a bit "bursty".  The cluster rarely drops below 50%.
Looking back through my records, it hasn't been below 140 processors in
use in several weeks, and has spent most of its time with 400+ in use.
As we near project deadlines, we often have jobs waiting in the queue.
I've seen as many as 1100 processors in use, or requested and waiting.

When we upgraded from 324 to 586 processors, the users were banging on my
door wanting to know when the new nodes would be available.  Within an hour
of releasing the new nodes (and without any notification to the users), they
were already using over 500 processors.  I'm currently working on an
expansion to about 1036 processors, and I fully expect to see it slammed
within a few days of release.

>  If a full reboot is required
> only every 3-4 months then is single node failure your main source of
> cycle loss? Or are other things like inefficient scheduling and lack of
> check-point/restart, etc. important?

PBS is our leading cause of cycle loss.  We now run a cron job on the
headnode that checks every 15 minutes to see if the PBS daemons have died,
and if so, it automatically restarts them.  About 75% of the time that I
have a node fail to accept jobs, it is because its pbs_mom has died, not
because there is anything wrong with the node.
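
A watchdog of the kind described above might look roughly like this.
It is a sketch, not the script actually in use: the daemon names in
DAEMONS are an assumption about a typical PBS head node, the restart()
hook simply launches the daemon by name and would need replacing with
whatever the site really uses, and pgrep must be installed.

```python
import subprocess

# Assumed head-node daemon names; the post does not list them.
DAEMONS = ["pbs_server", "pbs_sched"]


def is_running(name):
    """True if a process with exactly this name exists (pgrep -x exits 0)."""
    result = subprocess.run(["pgrep", "-x", name], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def restart(name):
    """Hypothetical restart hook: launch the daemon by name from PATH."""
    subprocess.run([name], check=False)


def watchdog(daemons, is_running=is_running, restart=restart):
    """Restart every daemon that is not running; return the names restarted."""
    restarted = []
    for name in daemons:
        if not is_running(name):
            restart(name)
            restarted.append(name)
    return restarted
```

A script that calls watchdog(DAEMONS) could then be run from cron every
15 minutes. The check and restart functions are passed in as parameters
so that the logic can be exercised without touching real daemons.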

 _\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_
| Roger L. Smith                        Phone: 662-325-3625               |
| Systems Administrator                 FAX:   662-325-7692               |
| roger at ERC.MsState.Edu                 http://WWW.ERC.MsState.Edu/~roger |
|                       Mississippi State University                      |
|_______________________Engineering Research Center_______________________|


From gabriel.weinstock at dnamerican.com  Tue Apr  2 13:00:03 2002
From: gabriel.weinstock at dnamerican.com (Gabriel J. Weinstock)
Date: Tue, 2 Apr 2002 16:00:03 -0500
Subject: Maui scheduler error
Message-ID: <21193187508190@DNAMERICAN.COM>

Hi,
  We are having trouble getting the Maui scheduler to work. We have no
problem starting the server/scheduler and drone programs. (For testing, we
are not starting the drone on every node in the cluster; is this a problem?)
The setup is 'mauictl start' on the head node, followed by 'nodectl start'
on 2 compute nodes. 'showq' works correctly. The log files show all three
nodes processing correctly, right up until a user submits a job, at which
point the server node writes the following messages to its log file and
exits:

- log file -
04/02 15:21:25 (Sched.java:299) iteration 36
04/02 15:21:25 (Wiki.java:392) Wiki loop event
04/02 15:21:25 (BackfillMod.java:147) backfill scheduling
04/02 15:21:25 (ReservationsMod.java:105) handling reservations
04/02 15:21:25 (JobChecker.java:220) checkpointing...
04/02 15:21:25 (Sched.java:311) scheduling interval took 0.016 seconds
04/02 15:21:29 (BasicWorker.java:430) mauisubmit
04/02 15:21:29 (MauiSubmit.java:96) mauisubmit
04/02 15:21:29 (MauiSubmit.java:128) LRM cmdfile
04/02 15:21:29 (CMD.java:280) Removing envvar HOSTNAME
04/02 15:21:29 (CMD.java:280) Removing envvar MACHTYPE
04/02 15:21:29 (CMD.java:280) Removing envvar HOSTTYPE
04/02 15:21:29 (CMD.java:280) Removing envvar OSTYPE
04/02 15:21:29 (CMD.java:280) Removing envvar _
04/02 15:21:29 (MauiMySQL.java:268) Changing romeda's job account to no-account
04/02 15:21:29 (MauiSubmit.java:199) checking job on RM=Node
04/02 15:21:29 (BasicPolicy.java:111) pre debiting bank for 7200 slotsecs for job=romeda:1017778889:0
04/02 15:21:29 (MauiXMLHandlerImpl.java:284) FATAL: org.xml.sax.SAXParseException: Illegal XML character:  &#x0;.
04/02 15:21:29 (BasicWorker.java:244) Ignoring SAX freak-out: Illegal XML character:  &#x0;.
04/02 15:21:30 (Sched.java:326) ----------------------------------------------------
04/02 15:21:30 (Sched.java:299) iteration 37
04/02 15:21:30 (Wiki.java:392) Wiki loop event
04/02 15:21:30 (BackfillMod.java:147) backfill scheduling
04/02 15:21:30 (BackfillMod.java:164) contemplating job romeda:1017778889:0
04/02 15:21:30 (Sched.java:330) java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException
        at unm.maui.rm.SimpleMatcher.getNodeAvailSlotIDs(SimpleMatcher.java:563)
        at unm.maui.rm.SimpleMatcher.getNodesSlots(SimpleMatcher.java:377)
        at unm.maui.rm.SimpleMatcher.getNodesSlots(SimpleMatcher.java:256)
        at unm.maui.rm.SimpleMatcher.findNodesSlots(SimpleMatcher.java:79)
        at unm.maui.sched.BackfillMod.makeReservation(BackfillMod.java:240)
        at unm.maui.sched.BackfillMod.event(BackfillMod.java:169)
        at unm.maui.sched.Sched.fireLoop(Sched.java:922)
        at unm.maui.sched.Sched.run(Sched.java:306)
        at java.lang.Thread.run(Thread.java:484)
04/02 15:21:30 (Sched.java:347) checkpointing scheduler.
04/02 15:21:30 (Wiki.java:385) shutting down RM=Node
04/02 15:21:30 (Sched.java:359) scheduler finished
- end -

If I try to restart the server daemon after this crash, it immediately exits
again with the message from iteration 37 (ArrayIndexOutOfBoundsException). The
only way to restart the daemon is to create the MySQL database again (wiping
whatever was in it). Here is my .cmd file, which I run with 'mauisubmit
maui_job.cmd':

- maui_job.cmd -
IWD == "/tmp"
WCLimit == 3600

Account == "WWGD190053X"

Tasks == 2
Nodes == 2
TaskPerNode == 1

Arch == x86
OS == Linux

JobType == "mpi.ch_gm"
Exec == "/export/mauisched-1.2/bin/runmpi_gm"
Args == "/export/home/romeda/cpi"

Output == "/tmp/$(MAUI_JOB_USER)2x3gm$(MAUI_JOB_ID).out"
Error == "/tmp/$(MAUI_JOB_USER)2x3gm$(MAUI_JOB_ID).err"
Log == "/tmp/$(MAUI_JOB_USER)2x3gm$(MAUI_JOB_ID).log"
Input == "/dev/null"
- end -

Is the XML error related to the out-of-bounds array exception? We compiled
with the Sun JDK 1.3.1-02 and JavaCC 2.1. We could find no information about
this error on the web. Any help would be greatly appreciated.
Thanks,
Gabe
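
The SAX message in the log complains about character &#x0;, which XML 1.0
does not allow. One thing that could be checked, without assuming it is
the cause, is whether any text handed to the scheduler (for example the
contents of maui_job.cmd) contains such characters. The sketch below
scans a string for characters outside the XML 1.0 Char production; the
function names are illustrative and nothing here is part of Maui.

```python
def is_legal_xml_char(ch):
    """True if ch is allowed by the XML 1.0 'Char' production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_illegal_xml_chars(text):
    """Return (line, column, codepoint) for every character XML 1.0 forbids."""
    hits = []
    # Split on "\n" only: str.splitlines() would swallow some control
    # characters (such as form feed) that are themselves illegal in XML.
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if not is_legal_xml_char(ch):
                hits.append((lineno, col, ord(ch)))
    return hits
```

Reading a file as text and passing its contents to
find_illegal_xml_chars() gives the position of each offending character;
an empty list means the text is clean in this respect.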


From emiller at techskills.com  Tue Apr  2 13:34:27 2002
From: emiller at techskills.com (Eric Miller)
Date: Tue, 2 Apr 2002 16:34:27 -0500
Subject: Syntax for executing 
In-Reply-To: <NMELJLHHFNGMNFFNGJAEIEJHCDAA.emiller@techskills.com>
Message-ID: <NMELJLHHFNGMNFFNGJAEIEKGCDAA.emiller@techskills.com>

Disregard.  SETI is not available in an MPI-enabled format.

My apologies.  Can anyone direct me to a URL that lists some available
programs that I can execute on the cluster?  Preferably something with a
continuous (looping?) graphical output (e.g. SETI). This is a display for
students to visualize and promote educational programs for Linux, like a
museum piece.




From gropp at mcs.anl.gov  Tue Apr  2 13:48:19 2002
From: gropp at mcs.anl.gov (William Gropp)
Date: Tue, 02 Apr 2002 15:48:19 -0600
Subject: Syntax for executing 
In-Reply-To: <NMELJLHHFNGMNFFNGJAEIEKGCDAA.emiller@techskills.com>
References: <NMELJLHHFNGMNFFNGJAEIEJHCDAA.emiller@techskills.com>
Message-ID: <5.1.0.14.2.20020402154730.01bdc3b8@localhost>

At 04:34 PM 4/2/2002 -0500, Eric Miller wrote:
>Disregard.  SETI is not available in an MPI-enabled format.
>
>My apologies.  Can anyone direct me to a URL that lists some available
>programs that I can execute on the cluster?  Preferably something with a
>continuous (looping?) graphical output (e.g. SETI). This is a display for
>students to visualize and promote educational programs for Linux, like a
>museum piece.

pmandel in the MPICH distribution has a -loop option for just this 
purpose.  See the README in mpich/mpe/contrib/mandel .

Bill


From aby_sinha at yahoo.com  Tue Apr  2 19:19:42 2002
From: aby_sinha at yahoo.com (Abhishek sinha)
Date: Tue, 02 Apr 2002 19:19:42 -0800
Subject: apic problems
Message-ID: <3CAA74CE.2070109@yahoo.com>

Hi All

I am using dual processors on a Tyan Tiger 2505T board and am having many
problems with the APIC on the machine. I have looked around on the
newsgroups and mailing lists, with no hints.

Do the return codes at the end of these messages, e.g. 00(02),
 > APIC error on CPU0: 00(08)
 > APIC error on CPU0: 08(08)
 > APIC error on CPU0: 08(08)
 > APIC error on CPU1: 02(02)
 > APIC error on CPU1: 02(08)

mean that this particular board I'm using is crappy, or that the whole 2505T
series cannot handle these kinds of requests?
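
As an aid to reading those lines: to my understanding, the two hex bytes
are the local APIC Error Status Register as read before and after the
kernel rewrites it, and the bit meanings below are the ones listed in the
comment next to that printk in the Linux 2.4 i386 source
(arch/i386/kernel/apic.c). The decoder is only a sketch with illustrative
names, and it says nothing about whether the board is at fault.

```python
import re

# Error Status Register bits, per the comment in the 2.4 kernel source
# (bit 4 is reserved).
ESR_BITS = {
    0: "Send CS error",
    1: "Receive CS error",
    2: "Send accept error",
    3: "Receive accept error",
    5: "Send illegal vector",
    6: "Received illegal vector",
    7: "Illegal register address",
}

LINE_RE = re.compile(
    r"APIC error on CPU(\d+): ([0-9a-fA-F]{2})\(([0-9a-fA-F]{2})\)"
)


def decode_esr(value):
    """Names of the error bits set in an ESR byte, lowest bit first."""
    return [name for bit, name in sorted(ESR_BITS.items()) if value & (1 << bit)]


def decode_line(line):
    """Return (cpu, errors_before, errors_after), or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    cpu = int(m.group(1))
    before = int(m.group(2), 16)
    after = int(m.group(3), 16)
    return cpu, decode_esr(before), decode_esr(after)
```

Read this way, the 02 values correspond to receive checksum errors and
the 08 values to receive accept errors.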


I am pasting the dmesg from the server below

 > Linux version 2.4.7-10smp (bhcompile at stripples.devel.redhat.com) (gcc
 > version 2.96 20000731 (Red Hat Linux 7.1 2.96-98)) #1 SMP Thu Sep 6
 > 17:09:31 EDT 2001
 > BIOS-provided physical RAM map:
 > BIOS-e820: 0000000000000000 - 00000000000a0000 (usable)
 > BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
 > BIOS-e820: 0000000000100000 - 000000007fff0000 (usable)
 > BIOS-e820: 000000007fff0000 - 000000007fff3000 (ACPI NVS)
 > BIOS-e820: 000000007fff3000 - 0000000080000000 (ACPI data)
 > BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
 > Scanning bios EBDA for MXT signature
 > 1151MB HIGHMEM available.
 > found SMP MP-table at 000f5660
 > hm, page 000f5000 reserved twice.
 > hm, page 000f6000 reserved twice.
 > hm, page 000f1000 reserved twice.
 > hm, page 000f2000 reserved twice.
 > On node 0 totalpages: 524272
 > zone(0): 4096 pages.
 > zone(1): 225280 pages.
 > zone(2): 294896 pages.
 > Intel MultiProcessor Specification v1.4
 > Virtual Wire compatibility mode.
 > OEM ID: OEM00000 Product ID: PROD00000000 APIC at: 0xFEE00000
 > Processor #0 Pentium(tm) Pro APIC version 17
 > Processor #1 Pentium(tm) Pro APIC version 17
 > I/O APIC #2 Version 17 at 0xFEC00000.
 > Processors: 2
 > Kernel command line: ro root=/dev/hda2
 > Initializing CPU#0
 > Detected 864.238 MHz processor.
 > Console: colour VGA+ 80x25
 > Calibrating delay loop... 1723.59 BogoMIPS
 > Memory: 2056920k/2097088k available (1396k kernel code, 37736k reserved,
 > 102k data, 240k init, 1179584k highmem)
 > Dentry-cache hash table entries: 262144 (order: 9, 2097152 bytes)
 > Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes)
 > Mount-cache hash table entries: 32768 (order: 6, 262144 bytes)
 > Buffer-cache hash table entries: 131072 (order: 7, 524288 bytes)
 > Page-cache hash table entries: 524288 (order: 10, 4194304 bytes)
 > CPU: Before vendor init, caps: 0387fbff 00000000 00000000, vendor = 0
 > CPU: L1 I cache: 16K, L1 D cache: 16K
 > CPU: L2 cache: 256K
 > Intel machine check architecture supported.
 > Intel machine check reporting enabled on CPU#0.
 > CPU: After vendor init, caps: 0387fbff 00000000 00000000 00000000
 > CPU serial number disabled.
 > CPU: After generic, caps: 0383fbff 00000000 00000000 00000000
 > CPU: Common caps: 0383fbff 00000000 00000000 00000000
 > Enabling fast FPU save and restore... done.
 > Enabling unmasked SIMD FPU exception support... done.
 > Checking 'hlt' instruction... OK.
 > POSIX conformance testing by UNIFIX
 > mtrr: v1.40 (20010327) Richard Gooch (rgooch at atnf.csiro.au)
 > mtrr: detected mtrr type: Intel
 > CPU: Before vendor init, caps: 0383fbff 00000000 00000000, vendor = 0
 > CPU: L1 I cache: 16K, L1 D cache: 16K
 > CPU: L2 cache: 256K
 > Intel machine check reporting enabled on CPU#0.
 > CPU: After vendor init, caps: 0383fbff 00000000 00000000 00000000
 > CPU: After generic, caps: 0383fbff 00000000 00000000 00000000
 > CPU: Common caps: 0383fbff 00000000 00000000 00000000
 > CPU0: Intel Pentium III (Coppermine) stepping 0a
 > per-CPU timeslice cutoff: 730.77 usecs.
 > enabled ExtINT on CPU#0
 > ESR value before enabling vector: 00000000
 > ESR value after enabling vector: 00000000
 > Booting processor 1/1 eip 2000
 > Initializing CPU#1
 > masked ExtINT on CPU#1
 > ESR value before enabling vector: 00000000
 > ESR value after enabling vector: 00000000
 > Calibrating delay loop... 1723.59 BogoMIPS
 > CPU: Before vendor init, caps: 0387fbff 00000000 00000000, vendor = 0
 > CPU: L1 I cache: 16K, L1 D cache: 16K
 > CPU: L2 cache: 256K
 > Intel machine check reporting enabled on CPU#1.
 > CPU: After vendor init, caps: 0387fbff 00000000 00000000 00000000
 > CPU serial number disabled.
 > CPU: After generic, caps: 0383fbff 00000000 00000000 00000000
 > CPU: Common caps: 0383fbff 00000000 00000000 00000000
 > CPU1: Intel Pentium III (Coppermine) stepping 0a
 > Total of 2 processors activated (3447.19 BogoMIPS).
 > ENABLING IO-APIC IRQs
 > ...changing IO-APIC physical APIC ID to 2 ... ok.
 > init IO_APIC IRQs
 > IO-APIC (apicid-pin) 2-0, 2-16, 2-17, 2-18, 2-19, 2-20, 2-21, 2-22, 2-23
 > not connected.
 > ..TIMER: vector=0x31 pin1=2 pin2=0
 > number of MP IRQ sources: 19.
 > number of IO-APIC #2 registers: 24.
 > testing the IO APIC.......................
 >
 > IO APIC #2......
 > .... register #00: 02000000
 > ....... : physical APIC id: 02
 > .... register #01: 00178011
 > ....... : max redirection entries: 0017
 > ....... : IO APIC version: 0011
 > WARNING: unexpected IO-APIC, please mail
 > to linux-smp at vger.kernel.org
 > .... register #02: 00000000
 > ....... : arbitration: 00
 > .... IRQ redirection table:
 > NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
 > 00 000 00 1 0 0 0 0 0 0 00
 > 01 003 03 0 0 0 0 0 1 1 39
 > 02 003 03 0 0 0 0 0 1 1 31
 > 03 003 03 0 0 0 0 0 1 1 41
 > 04 003 03 0 0 0 0 0 1 1 49
 > 05 003 03 0 0 0 0 0 1 1 51
 > 06 003 03 0 0 0 0 0 1 1 59
 > 07 003 03 0 0 0 0 0 1 1 61
 > 08 003 03 0 0 0 0 0 1 1 69
 > 09 003 03 0 0 0 0 0 1 1 71
 > 0a 003 03 1 1 0 1 0 1 1 79
 > 0b 003 03 1 1 0 1 0 1 1 81
 > 0c 003 03 1 1 0 1 0 1 1 89
 > 0d 003 03 0 0 0 0 0 1 1 91
 > 0e 003 03 0 0 0 0 0 1 1 99
 > 0f 003 03 0 0 0 0 0 1 1 A1
 > 10 000 00 1 0 0 0 0 0 0 00
 > 11 000 00 1 0 0 0 0 0 0 00
 > 12 000 00 1 0 0 0 0 0 0 00
 > 13 000 00 1 0 0 0 0 0 0 00
 > 14 000 00 1 0 0 0 0 0 0 00
 > 15 000 00 1 0 0 0 0 0 0 00
 > 16 000 00 1 0 0 0 0 0 0 00
 > 17 000 00 1 0 0 0 0 0 0 00
 > IRQ to pin mappings:
 > IRQ0 -> 0:2
 > IRQ1 -> 0:1
 > IRQ3 -> 0:3
 > IRQ4 -> 0:4
 > IRQ5 -> 0:5
 > IRQ6 -> 0:6
 > IRQ7 -> 0:7
 > IRQ8 -> 0:8
 > IRQ9 -> 0:9
 > IRQ10 -> 0:10
 > IRQ11 -> 0:11
 > IRQ12 -> 0:12
 > IRQ13 -> 0:13
 > IRQ14 -> 0:14
 > IRQ15 -> 0:15
 > .................................... done.
 > Using local APIC timer interrupts.
 > calibrating APIC timer ...
 > ..... CPU clock speed is 864.2437 MHz.
 > ..... host bus clock speed is 132.9603 MHz.
 > cpu: 0, clocks: 1329603, slice: 443201
 > CPU0
 > cpu: 1, clocks: 1329603, slice: 443201
 > CPU1
 > checking TSC synchronization across CPUs: passed.
 > mtrr: your CPUs had inconsistent variable MTRR settings
 > mtrr: probably your BIOS does not setup all CPUs
 > PCI: PCI BIOS revision 2.10 entry at 0xfb3e0, last bus=1
 > PCI: Using configuration type 1
 > PCI: Probing PCI hardware
 > Unknown bridge resource 0: assuming transparent
 > Unknown bridge resource 1: assuming transparent
 > Unknown bridge resource 2: assuming transparent
 > PCI: Using IRQ router VIA [1106/0686] at 00:07.0
 > PCI->APIC IRQ transform: (B0,I6,P0) -> 12
 > PCI->APIC IRQ transform: (B0,I7,P3) -> 12
 > PCI->APIC IRQ transform: (B0,I7,P3) -> 12
 > PCI->APIC IRQ transform: (B0,I13,P0) -> 10
 > PCI->APIC IRQ transform: (B0,I14,P0) -> 11
 > PCI: Enabling Via external APIC routing
 > isapnp: Scanning for PnP cards...
 > isapnp: No Plug & Play device found
 > Linux NET4.0 for Linux 2.4
 > Based upon Swansea University Computer Society NET3.039
 > Initializing RT netlink socket
 > apm: BIOS version 1.2 Flags 0x07 (Driver version 1.14)
 > apm: disabled - APM is not SMP safe.
 > mxt_scan_bios: enter
 > Starting kswapd v1.8
 > allocated 64 pages and 64 bhs reserved for the highmem bounces
 > VFS: Diskquotas version dquot_6.5.0 initialized
 > Detected PS/2 Mouse Port.
 > pty: 2048 Unix98 ptys configured
 > Serial driver version 5.05c (2001-07-08) with MANY_PORTS MULTIPORT
 > SHARE_IRQ SERIAL_PCI ISAPNP enabled
 > ttyS00 at 0x03f8 (irq = 4) is a 16550A
 > Real Time Clock Driver v1.10d
 > block: queued sectors max/low 1365629kB/1234557kB, 4032 slots per queue
 > RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
 > Uniform Multi-Platform E-IDE driver Revision: 6.31
 > ide: Assuming 33MHz PCI bus speed for PIO modes; override with idebus=xx
 > VP_IDE: IDE controller on PCI bus 00 dev 39
 > VP_IDE: chipset revision 6
 > VP_IDE: not 100% native mode: will probe irqs later
 > ide: Assuming 33MHz PCI bus speed for PIO modes; override with idebus=xx
 > VP_IDE: VIA vt82c686b (rev 40) IDE UDMA100 controller on pci00:07.1
 > ide0: BM-DMA at 0xd400-0xd407, BIOS settings: hda:DMA, hdb:pio
 > ide1: BM-DMA at 0xd408-0xd40f, BIOS settings: hdc:DMA, hdd:pio
 > hda: QUANTUM FIREBALLlct20 40, ATA DISK drive
 > hdc: CDU5211, ATAPI CD/DVD-ROM drive
 > ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
 > ide1 at 0x170-0x177,0x376 on irq 15
 > hda: 78177792 sectors (40027 MB) w/418KiB Cache, CHS=4866/255/63, UDMA(33)
 > ide-floppy driver 0.97
 > Partition check:
 > hda: hda1 hda2 hda3
 > FDC 0 is a post-1991 82077
 > ide-floppy driver 0.97
 > md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
 > md: Autodetecting RAID arrays.
 > md: autorun ...
 > md: ... autorun DONE.
 > NET4: Linux TCP/IP 1.0 for NET4.0
 > IP Protocols: ICMP, UDP, TCP, IGMP
 > IP: routing cache hash table of 16384 buckets, 128Kbytes
 > TCP: Hash tables configured (established 524288 bind 65536)
 > Linux IP multicast router 0.06 plus PIM-SM
 > NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
 > RAMDISK: Compressed image found at block 0
 > Freeing initrd memory: 324k freed
 > VFS: Mounted root (ext2 filesystem).
 > Journalled Block Device driver loaded
 > EXT3-fs: INFO: recovery required on readonly filesystem.
 > EXT3-fs: write access will be enabled during recovery.
 > kjournald starting. Commit interval 5 seconds
 > EXT3-fs: recovery complete.
 > EXT3-fs: mounted filesystem with ordered data mode.
 > Freeing unused kernel memory: 240k freed
 > Adding Swap: 2040244k swap-space (priority -1)
 > usb.c: registered new driver usbdevfs
 > usb.c: registered new driver hub
 > usb-uhci.c: $Revision: 1.259 $ time 17:18:11 Sep 6 2001
 > usb-uhci.c: High bandwidth mode enabled
 > usb-uhci.c: USB UHCI at I/O 0xd800, IRQ 12
 > usb-uhci.c: Detected 2 ports
 > usb.c: new USB bus registered, assigned bus number 1
 > hub.c: USB hub found
 > hub.c: 2 ports detected
 > usb-uhci.c: USB UHCI at I/O 0xdc00, IRQ 12
 > usb-uhci.c: Detected 2 ports
 > usb.c: new USB bus registered, assigned bus number 2
 > hub.c: USB hub found
 > hub.c: 2 ports detected
 > usb-uhci.c: v1.251:USB Universal Host Controller Interface driver
 > EXT3 FS 2.4-0.9.8, 25 Aug 2001 on ide0(3,2), internal journal
 > kjournald starting. Commit interval 5 seconds
 > EXT3 FS 2.4-0.9.8, 25 Aug 2001 on ide0(3,1), internal journal
 > EXT3-fs: mounted filesystem with ordered data mode.
 > parport0: PC-style at 0x378 [PCSPP,EPP]
 > parport0: cpp_daisy: aa5500ff(38)
 > parport0: assign_addrs: aa5500ff(38)
 > parport0: cpp_daisy: aa5500ff(38)
 > parport0: assign_addrs: aa5500ff(38)
 > parport_pc: Via 686A parallel port: io=0x378
 > eepro100.c:v1.09j-t 9/29/99 Donald Becker
 > http://cesdis.gsfc.nasa.gov/linux/drivers/eepro100.html
 > eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin
 > and others
 > eth0: Intel Corporation 82557 [Ethernet Pro 100], 00:E0:81:20:55:CC, IRQ
 > 10. Board assembly 567812-052, Physical connectors present: RJ45
 > Primary interface chip i82555 PHY #1.
 > General self-test: passed.
 > Serial sub-system self-test: passed.
 > Internal registers self-test: passed.
 > ROM checksum self-test: passed (0x04f4518b).
 > eth1: Intel Corporation 82557 [Ethernet Pro 100] (#2), 00:E0:81:20:55:CD,
 > IRQ 11.
 > Board assembly 567812-052, Physical connectors present: RJ45
 > Primary interface chip i82555 PHY #1.
 > General self-test: passed.
 > Serial sub-system self-test: passed.
 > Internal registers self-test: passed.
 > ROM checksum self-test: passed (0x04f4518b).
 > APIC error on CPU1: 00(02)
 > APIC error on CPU0: 00(08)
 > APIC error on CPU0: 08(08)
 > APIC error on CPU0: 08(08)
 > APIC error on CPU1: 02(02)
 > APIC error on CPU1: 02(08)

PLEASE HELP

abby


From raysonlogin at yahoo.com  Tue Apr  2 20:07:19 2002
From: raysonlogin at yahoo.com (Rayson Ho)
Date: Tue, 2 Apr 2002 20:07:19 -0800 (PST)
Subject: Uptime data/studies/anecdotes ... ?
In-Reply-To: <Pine.SGI.4.44.0204021008580.75196-100000@Downforce.ERC.MsState.Edu>
Message-ID: <20020403040719.14849.qmail@web11408.mail.yahoo.com>

--- "Roger L. Smith" <roger at ERC.MsState.Edu> wrote:
> We currently run an average of about 75% utilization on our 586
> processor (293 node)  cluster.  We probably have about one node per 
> week crash and hang for various reasons.

The OpenPBS backfilling algorithm is really bad. If you are running
parallel jobs, you should use PBS+Maui.

> We have occasional problems with memory leaks or PBS hangups which
> require large scale reboots of the cluster. (Actually, PBS just died 
> as I'm typing this, but our pbs heartbeat script should restart it 
> automatically in a few minutes).  I'd say we have to do a full reboot
> of the cluster about every 3-4 months.

One bigger problem is (or was; I haven't looked at the PBS code since
last fall) that in each scheduling cycle the scheduler tries to contact
each MOM in the cluster to get resource information. If one of the MOMs
dies, the scheduler hangs, then times out and restarts.
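
To illustrate the alternative, here is a minimal sketch that polls every
node concurrently under one deadline, so a dead MOM is reported instead of
stalling the cycle. This is not OpenPBS code: `poll_nodes`, `fake_query`,
and the node names are hypothetical, and Python is used only for
illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def poll_nodes(nodes, query, timeout):
    """Query every node in parallel; return (results, unresponsive).

    `query` is a hypothetical callable standing in for the request a
    scheduler sends to each node daemon. Nodes that do not answer within
    `timeout` seconds are reported instead of blocking the whole cycle.
    """
    if not nodes:
        return {}, []
    pool = ThreadPoolExecutor(max_workers=len(nodes))
    futures = {pool.submit(query, node): node for node in nodes}
    done, not_done = wait(futures, timeout=timeout)
    # Do not wait for stragglers; their threads finish on their own.
    pool.shutdown(wait=False)
    results = {}
    unresponsive = [futures[f] for f in not_done]
    for f in done:
        node = futures[f]
        try:
            results[node] = f.result()
        except Exception:
            # A node whose query raised is treated like one that timed out.
            unresponsive.append(node)
    return results, sorted(unresponsive)


def fake_query(node):
    """Stand-in for a real resource query; 'node2' simulates a hung daemon."""
    if node == "node2":
        time.sleep(1.0)
    return {"ncpus": 2}


if __name__ == "__main__":
    ok, dead = poll_nodes(["node1", "node2", "node3"], fake_query, timeout=0.2)
    print("answered:", sorted(ok), "unresponsive:", dead)
```

The scheduling cycle can then proceed with the nodes that answered and
mark the rest as down until a later poll succeeds.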

You may try the "Cplant Fault Recovery Patch" and several other patches
if you want to stay with PBS.

> For a bunch of PC hardware running a free OS, this seems like a
> pretty good number to me.  It's not in the same class as our Sun 
> servers (nor even our SGIs!), but then, none of those systems are 
> this large, either.

Another problem (at least in OpenPBS 2.3.12) is that some hard limits
are defined in the source (such as
"#define PBS_ACCT_MAX_RCD 4095" and "#define PBS_NET_MAX_CONNECTIONS 256"),
which may not work in large clusters.

If you want something free, then you may try SGE. It scales quite
nicely (SGE improved a lot in 5.3), it's open source, and integrates
with Maui.

I like SGE better than OpenPBS: at least when one (or more) of your
nodes dies, the cluster continues to operate, and SGE even re-runs the
job for you. Another feature is the shadow master, which restarts the
master daemon on another machine if your master node dies.

I think someone on this list is planning to tell us his experience with
SGE on his beowulf?

Rayson

P.S. 

links:
OpenPBS public home: http://www-unix.mcs.anl.gov/openpbs/
SGE                : http://gridengine.sunsource.net
Maui               : http://www.supercluster.org




From Drake.Diedrich at anu.edu.au  Tue Apr  2 23:11:26 2002
From: Drake.Diedrich at anu.edu.au (Drake Diedrich)
Date: Wed, 3 Apr 2002 17:11:26 +1000
Subject: pvm povray help
In-Reply-To: <3.0.6.32.20020321113924.00846be0@arrhenius.chem.kuleuven.ac.be>
References: <E16nmFV-0004Fo-00@oracle.clara.net> <E16nmFV-0004Fo-00@oracle.clara.net> <3.0.6.32.20020321113924.00846be0@arrhenius.chem.kuleuven.ac.be>
Message-ID: <20020403171126.B26086@duh.anu.edu.au>

On Thu, Mar 21, 2002 at 11:39:24AM +0100, Luc Vereecken wrote:
> >a very large project.
> 
> That shouldn't be a very large project at all. Read the inputfile

   The very large part would be in broadcasting the parsed object tree, so
as to limit the serial overhead of parsing to just one node, rather than
duplicate that effort on all nodes.


From opengeometry at yahoo.ca  Tue Apr  2 23:31:52 2002
From: opengeometry at yahoo.ca (William Park)
Date: Wed, 3 Apr 2002 02:31:52 -0500
Subject: Hyperthreading in P4 Xeon (question)
Message-ID: <20020403023152.A2972@node0.opengeometry.ca>

What is the realistic effect of "hyperthreading" in P4 Xeon?  I'm not
versed in the latest CPU trends.  Does it mean that dual-P4Xeon will
behave like 4-way SMP?

-- 
William Park, Open Geometry Consulting, <opengeometry at yahoo.ca>
8 CPU cluster, NAS, (Slackware) Linux, Python, LaTeX, Vim, Mutt, Tin


From didonato at bigpond.net.au  Wed Apr  3 00:35:03 2002
From: didonato at bigpond.net.au (Christian Di Donato)
Date: Wed, 3 Apr 2002 18:35:03 +1000
Subject: Hyperthreading in P4 Xeon (question)
In-Reply-To: <20020403023152.A2972@node0.opengeometry.ca>
Message-ID: <000301c1daea$76b06a40$99ca8490@claptop>

There is a Whitepaper on the Xeon Processor concerning Hyperthreading
over at the intel site

http://www.intel.com/eBusiness/products/server/processor/xeon/wp020901_sum.htm

-----Original Message-----
From: beowulf-admin at beowulf.org [mailto:beowulf-admin at beowulf.org] On
Behalf Of William Park
Sent: Wednesday, 3 April 2002 5:32 PM
To: beowulf at beowulf.org
Subject: Hyperthreading in P4 Xeon (question)

What is the realistic effect of "hyperthreading" in P4 Xeon?  I'm not
versed in the latest CPU trends.  Does it mean that dual-P4Xeon will
behave like 4-way SMP?

-- 
William Park, Open Geometry Consulting, <opengeometry at yahoo.ca>
8 CPU cluster, NAS, (Slackware) Linux, Python, LaTeX, Vim, Mutt, Tin


From Luc.Vereecken at chem.kuleuven.ac.be  Wed Apr  3 03:32:39 2002
From: Luc.Vereecken at chem.kuleuven.ac.be (Luc Vereecken)
Date: Wed, 03 Apr 2002 13:32:39 +0200
Subject: pvm povray help
In-Reply-To: <20020403171126.B26086@duh.anu.edu.au>
References: <3.0.6.32.20020321113924.00846be0@arrhenius.chem.kuleuven.ac.be>
 <E16nmFV-0004Fo-00@oracle.clara.net>
 <E16nmFV-0004Fo-00@oracle.clara.net>
 <3.0.6.32.20020321113924.00846be0@arrhenius.chem.kuleuven.ac.be>
Message-ID: <3.0.6.32.20020403133239.008b8720@arrhenius.chem.kuleuven.ac.be>

At 17:11 3/04/02 +1000, Drake Diedrich wrote:
>On Thu, Mar 21, 2002 at 11:39:24AM +0100, Luc Vereecken wrote:
>> >a very large project.
>> 
>> That shouldn't be a very large project at all. Read the inputfile
>
>   The very large part would be in broadcasting the parsed object tree, so
>as to limit the serial overhead of parsing to just one node, rather than
>duplicate that effort on all nodes.

Would avoiding that duplication gain you anything?

Current case (IIRC, I haven't used pvm povray recently): 
Every node reads the inputfile (possibly from an inefficient NFS mounted
volume), and parses.

New Case 1 : 
Read the inputfiles on master, broadcast these N bytes, parse for Q seconds
on all nodes. User gained : no need to have the input file on all nodes.
Developer gained : easy to implement.

New Case 2 : 
Read the inputfile on master, parse for Q' seconds on master node,
broadcast M bytes for parsed object tree. User gained: no need to have the
input file on all nodes.

In the second case, you have NODES-1 nodes doing nothing, but you might not
be able to do anything with that free time, e.g. because they are already
allocated to that job, especially since the parsing is fairly short
compared to the rendering.
Assuming identical nodes, the walltime of the parsing is the same
everywhere (Q=Q'), and duplicating that effort doesn't require extra
walltime (so it is irrelevant unless you're charged per used cpu time, or
you have multiple jobs per processor (e.g. SMP) that could reclaim the idle
time). In that case, it depends on whether the parsed object tree (M bytes)
is larger or smaller than the text inputfiles and other required files
(N bytes). If M > N, it takes longer to broadcast the parsed tree; if
N > M, it is quicker to broadcast the parsed tree.
If the master node is faster than the others, its parsing time might be
shorter than that of the slowest of the other nodes (Q' < Q), and then it
is possible that even with M > N, it is faster to distribute the parsed
tree rather than the inputfiles.

The basic question is therefore: how large is (typically) the parsed tree
compared to the original input file? Standard povray include files should
be assumed predistributed, as they should/can be installed on each node
together with the executable. To be honest, I have no idea about this ratio.
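
The comparison above can be written down as a small cost model. This is
only a sketch: it assumes broadcast time is simply size divided by a single
effective bandwidth, and every number in the demo is made up for
illustration, not measured on any povray scene.

```python
def case1_time(n_bytes, bandwidth, parse_slowest):
    """New Case 1: broadcast the N-byte input, then every node parses (Q s)."""
    return n_bytes / bandwidth + parse_slowest


def case2_time(m_bytes, bandwidth, parse_master):
    """New Case 2: master parses (Q' s), then broadcasts the M-byte tree."""
    return parse_master + m_bytes / bandwidth


def better_case(n_bytes, m_bytes, bandwidth, parse_slowest, parse_master):
    """Return 1 or 2, whichever case has the shorter walltime (ties go to 1)."""
    t1 = case1_time(n_bytes, bandwidth, parse_slowest)
    t2 = case2_time(m_bytes, bandwidth, parse_master)
    return 1 if t1 <= t2 else 2


if __name__ == "__main__":
    # Hypothetical figures: N = 2 MB input, M = 10 MB tree, 10 MB/s network.
    n, m, bw = 2e6, 10e6, 10e6
    # Identical nodes (Q = Q' = 5 s): only the M vs N comparison matters.
    print(better_case(n, m, bw, parse_slowest=5.0, parse_master=5.0))
    # Faster master (Q' = 3 s): Case 2 can win even though M > N.
    print(better_case(n, m, bw, parse_slowest=5.0, parse_master=3.0))
```

With measured values for M, N, and the parse times plugged in, the same
three functions would answer the question for a given scene.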

Luc


From kus at free.net  Wed Apr  3 04:06:17 2002
From: kus at free.net (Mikhail Kuzminsky)
Date: Wed, 3 Apr 2002 16:06:17 +0400 (MSD)
Subject: Hyperthreading in P4 Xeon 
Message-ID: <200204031206.QAA10335@nocserv.free.net>

According to William Park
> From beowulf-admin at beowulf.org Wed Apr  3 11:52:00 2002
> From: William Park <opengeometry at yahoo.ca>
> To: beowulf at beowulf.org
> Subject: Hyperthreading in P4 Xeon (question)
> 
> What is the realistic effect of "hyperthreading" in P4 Xeon?
  I haven't seen any data for applications that are typical
for clusters, but there are some other results on the Intel Web site.
The benefit will depend strongly on the application.
For example, if you have an application that needs the full
cache for its working set, its performance will
degrade, because when 2 processes run simultaneously
they share the cache.
>  I'm not
> versed in the latest CPU trends.  Does it mean that dual-P4Xeon will
> behave like 4-way SMP?
  Yes, every physical CPU appears as 2 logical CPUs, and you
may use OpenMP etc.

Mikhail Kuzminsky
Zelinsky Institute of Organic Chemistry
Moscow


From didonato at bigpond.net.au  Wed Apr  3 04:29:25 2002
From: didonato at bigpond.net.au (Christian Di Donato)
Date: Wed, 3 Apr 2002 22:29:25 +1000
Subject: Testing
Message-ID: <000401c1db0b$3438b7a0$99ca8490@claptop>

Can someone just reply to this list and confirm that they are indeed
receiving this. Only one person needs to reply. I'm getting e-mails
bouncing back every time I try to send something to beowulf at beowulf.org


Thanks in Advance 

And Kind Regards


Christian Di Donato


From walke at usna.edu  Wed Apr  3 04:46:46 2002
From: walke at usna.edu (LT V. H. Walke)
Date: 03 Apr 2002 07:46:46 -0500
Subject: Testing
In-Reply-To: <000401c1db0b$3438b7a0$99ca8490@claptop>
References: <000401c1db0b$3438b7a0$99ca8490@claptop>
Message-ID: <1017838087.30683.1.camel@vhwalke.mathsci.usna.edu>

I read you loud and clear.

Vann

On Wed, 2002-04-03 at 07:29, Christian Di Donato wrote:
> Can someone just reply to this list and confirm that they are indeed
> receiving this. Only one person needs to reply. I'm getting e-mails
> bouncing back every time I try to send something to beowulf at beowulf.org
> 
> 
> Thanks in Advance 
> 
> And Kind Regards
> 
> 
> Christian Di Donato
> 
> 
-- 
----------------------------------------------------------------------
  Vann H. Walke                        Office: Chauvenet 341
  Computer Science Dept.               Ph:  410-293-6811
  572 Holloway Road, Stop 9F           Fax: 410-293-2686
  United States Naval Academy          email: walke at usna.edu
  Annapolis, MD 21402-5002             http://www.cs.usna.edu/~walke
----------------------------------------------------------------------


From Daniel.Kidger at quadrics.com  Wed Apr  3 05:56:28 2002
From: Daniel.Kidger at quadrics.com (Daniel Kidger)
Date: Wed, 3 Apr 2002 14:56:28 +0100 
Subject: Testing
Message-ID: <010C86D15E4D1247B9A5DD312B7F5AA74D2D37@stegosaurus.bristol.quadrics.com>

Christian Di Donato [mailto:didonato at bigpond.net.au] wrote:

>Can someone just reply to this list and confirm that they are indeed
>receiving this. Only one person needs to reply. I'm getting e-mails
>bouncing back every time I try to send something to beowulf at beowulf.org

So why can't that someone reply just to you rather than the whole list?

and more importantly 
  - how can anyone know that they are the said 'one person' !


Yours,
Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger at quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
----------------------- www.quadrics.com --------------------


From math at velocet.ca  Wed Apr  3 06:13:23 2002
From: math at velocet.ca (Velocet)
Date: Wed, 3 Apr 2002 09:13:23 -0500
Subject: Testing
In-Reply-To: <010C86D15E4D1247B9A5DD312B7F5AA74D2D37@stegosaurus.bristol.quadrics.com>; from Daniel.Kidger@quadrics.com on Wed, Apr 03, 2002 at 02:56:28PM +0100
References: <010C86D15E4D1247B9A5DD312B7F5AA74D2D37@stegosaurus.bristol.quadrics.com>
Message-ID: <20020403091323.J69845@velocet.ca>

On Wed, Apr 03, 2002 at 02:56:28PM +0100, Daniel Kidger's all...
> 
> Christian Di Donato [mailto:didonato at bigpond.net.au] wrote:
> 
> >Can someone just reply to this list and confirm that they are indeed
> >receiving this. Only one person needs to reply. I'm getting e-mails
> >bouncing back every time I try to send something to beowulf at beowulf.org
> 
> So why cant that someone reply just to you rather than the whole list?
> 
> and more importantly 
>   - how can anyone know that they are the said 'one person' !

Because when he asks for 'only one person' there's an implicit semaphore
called in the operation. Didn't you heed it? Now look what you've done! :)

This would all be funnier if it was still Apr 1.

/kc

> 
> 
> Yours,
> Daniel.
> 
> --------------------------------------------------------------
> Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger at quadrics.com
> One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
> ----------------------- www.quadrics.com --------------------
> 

-- 
Ken Chase, math at velocet.ca  *  Velocet Communications Inc.  *  Toronto, CANADA 


From timm at fnal.gov  Wed Apr  3 06:27:42 2002
From: timm at fnal.gov (Steven Timm)
Date: Wed, 3 Apr 2002 08:27:42 -0600 (CST)
Subject: Hyperthreading in P4 Xeon (question)
In-Reply-To: <20020403023152.A2972@node0.opengeometry.ca>
Message-ID: <Pine.LNX.4.31.0204030824290.11114-100000@snowball.fnal.gov>

I have one such test machine that we are evaluating at the moment.
It's a dual cpu machine but under Linux it shows up looking like it
has four cpu's.  Haven't actually tried yet to see if it really can
run four loads just as well... the specimen we have has DDR SDRAM
and already gets bogged down going with two processes at once.
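
One way to tell logical CPUs from physical packages is to parse
/proc/cpuinfo. A minimal sketch, assuming the kernel reports a
"physical id" line per logical CPU (not every kernel does); `count_cpus`
and the sample text are made up for illustration.

```python
def count_cpus(cpuinfo_text):
    """Return (logical_cpus, physical_packages) from /proc/cpuinfo-style text.

    Each logical CPU stanza is assumed to carry a "processor" line and, on
    kernels that report it, a "physical id" line naming its package.
    physical_packages is None when no "physical id" lines are present.
    """
    logical = 0
    packages = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            packages.add(value.strip())
    physical = len(packages) if packages else None
    return logical, physical


# Made-up excerpt in the shape of a dual-package box with two logical
# CPUs per package; real output has many more fields per stanza.
SAMPLE = """\
processor   : 0
physical id : 0

processor   : 1
physical id : 0

processor   : 2
physical id : 1

processor   : 3
physical id : 1
"""

if __name__ == "__main__":
    print(count_cpus(SAMPLE))
    # On a live Linux system: count_cpus(open("/proc/cpuinfo").read())
```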

Steve Timm

------------------------------------------------------------------
Steven C. Timm (630) 840-8525  timm at fnal.gov  http://home.fnal.gov/~timm/
Fermilab Computing Division/Operating Systems Support
Scientific Computing Support Group--Computing Farms Operations

On Wed, 3 Apr 2002, William Park wrote:

> What is the realistic effect of "hyperthreading" in P4 Xeon?  I'm not
> versed in the latest CPU trends.  Does it mean that dual-P4Xeon will
> behave like 4-way SMP?
>
> --
> William Park, Open Geometry Consulting, <opengeometry at yahoo.ca>
> 8 CPU cluster, NAS, (Slackware) Linux, Python, LaTeX, Vim, Mutt, Tin
>


From hahn at physics.mcmaster.ca  Wed Apr  3 07:50:06 2002
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Wed, 3 Apr 2002 10:50:06 -0500 (EST)
Subject: Hyperthreading in P4 Xeon (question)
In-Reply-To: <20020403023152.A2972@node0.opengeometry.ca>
Message-ID: <Pine.LNX.4.33.0204031046590.20485-100000@coffee.psychology.mcmaster.ca>

> What is the realistic effect of "hyperthreading" in P4 Xeon?  I'm not
> versed in the latest CPU trends.  Does it mean that dual-P4Xeon will
> behave like 4-way SMP?

for some value of "behave like" ;)
that is, it will definitely NOT get twice as fast.  but it will appear
to have 4 CPUs, and can run 4 threads/procs at once (for values of 
"once" > 1 clock cycle ;)

we did a quick test on a dual-prestonia here, and saw a ~5% speedup
on a probably cache-friendly, compute-bound task.


From jurgen at botz.org  Wed Apr  3 10:25:31 2002
From: jurgen at botz.org (Jurgen Botz)
Date: Wed, 03 Apr 2002 10:25:31 -0800
Subject: Linux Software RAID5 Performance 
In-Reply-To: Message from mprinkey@aeolusresearch.com (Michael Prinkey) 
   of "Sun, 31 Mar 2002 14:33:59 EST." <F224lZz02H67yRLVbND000139d2@hotmail.com> 
Message-ID: <18878.1017858331@localhost>

Michael Prinkey wrote:
> Again, performance (see below) is remarkably good, especially considering 
> all of the strikes against this configuration:  EIDE instead of SCSI, UDMA66 
> instead of 100/133, 5400-RPM instead of 7200-RPM, and master/slave drives on 
> each port instead of a single drive per port. 

With regard to the master/slave config... I note that your performance
test is a single reader/writer... in this config with RAID5 I would
expect the performance to be quite good even with 2 drives per IDE
controller.  But if you have several processes doing disk I/O
simultaneously you should see a rather more precipitous drop in
performance than you would with a single drive per IDE controller.
I'm working on testing a very similar config right now and that's 
one of my findings (which I had expected) but our application for this
is not very performance sensitive so it's not a big deal.
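
A rough way to compare single-reader and multi-reader throughput is to
read several files at once and time the total. This is only a sketch:
`concurrent_read_rate` is a hypothetical helper, and the demo uses small
temporary files that will be served from the page cache, so a real test
would point it at files much larger than RAM on the array itself.

```python
import os
import tempfile
import threading
import time


def read_file(path, block_size, totals, index):
    """Read `path` sequentially and record the number of bytes read."""
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            count += len(chunk)
    totals[index] = count


def concurrent_read_rate(paths, block_size=1 << 20):
    """Read all `paths` at once, one thread each; return (bytes, bytes/s)."""
    totals = [0] * len(paths)
    threads = [
        threading.Thread(target=read_file, args=(p, block_size, totals, i))
        for i, p in enumerate(paths)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    total = sum(totals)
    rate = total / elapsed if elapsed > 0 else float("inf")
    return total, rate


if __name__ == "__main__":
    paths = []
    for _ in range(4):
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as f:
            f.write(b"\0" * (1 << 20))
        paths.append(path)
    try:
        for n in (1, 4):
            total, rate = concurrent_read_rate(paths[:n])
            print(n, "reader(s):", total, "bytes,", round(rate / 1e6, 1), "MB/s")
    finally:
        for p in paths:
            os.remove(p)
```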

A more important issue for me is reliability, and I'm somewhat 
concerned about failure modes.  For example, can an IDE drive fail
in such a way that it will disable the controller or the other
drive on the same controller?  If so, that would seriously limit
the usefulness of RAID5 in this config.  In general how good is
Linux software RAID's failure handling?  Etc.

:j


-- 
J?rgen Botz                       | While differing widely in the various
jurgen at botz.org                   | little bits we know, in our infinite
                                  | ignorance we are all equal. -Karl Popper


From ron_chen_123 at yahoo.com  Wed Apr  3 11:02:19 2002
From: ron_chen_123 at yahoo.com (Ron Chen)
Date: Wed, 3 Apr 2002 11:02:19 -0800 (PST)
Subject: Fwd: FreeBSD port of SGE
Message-ID: <20020403190219.24496.qmail@web14701.mail.yahoo.com>

FreeBSD hackers and Beowulf users,

I am porting SGE (software for compute farms, the
so-called batch systems) to the *BSDs, and I am
wondering if someone can take over some of the ports.
I have only just started. Currently, I can get the code
to compile on the *BSDs with "#ifdef BSD"s.

I am starting the system specific part, mainly to get
the load, cpu, and stuff like that.
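
For illustration only, the two simplest of these values can be read from
Python's standard library: `os.getloadavg()` wraps getloadavg(3) on Unix
systems. This is a sketch and not SGE code; `node_load_report` is a
hypothetical helper.

```python
import os


def node_load_report():
    """Return the 1/5/15-minute load averages and CPU count for this host."""
    # os.getloadavg() is Unix-only and raises OSError if the load is unknown.
    load1, load5, load15 = os.getloadavg()
    # os.cpu_count() may return None when the count cannot be determined.
    ncpu = os.cpu_count() or 1
    return {
        "load": (load1, load5, load15),
        "ncpu": ncpu,
        "load_per_cpu": load1 / ncpu,
    }


if __name__ == "__main__":
    print(node_load_report())
```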

I am not done yet, but I just want to tell you that it
is getting there :-)

Someone also started the SGE port to FreeBSD (which
means duplicated work), so if you are interested, or if
you want to be the maintainer of one of the ports (currently,
we have FreeBSD, NetBSD, OpenBSD, Darwin/MacOSX),
please contact me.

More info: gridengine.sunsource.net

Thanks,
 -Ron

--- I wrote:
> Status of the port(s):
> 
> - compiled on FreeBSD, NetBSD, OpenBSD.
> - coding routines to get the load:
>    load: getloadavg(3), kvm_getloadavg(3)
>    #cpu: sysctl(3) hw.ncpu
>    mem : sysctl(3) vm.stats_vm.*
>    proc info: kvm_getprocs(3)
> 
> -Ron
> 
> --- Andy Schwierskott <andy.schwierskott at sun.com>
> wrote:
> > Ron,
> > 
> > > OK, I think I should write a porting-HOWTO.
> > >
> > > Once I am done, can you also include in the
> > "HowTo"
> > > page?
> > 
> > Of course, we (and certainly many developers)
> would
> > be more than happy to
> > add such a page;-)
> > 
> > Andy
> > 





From hahn at physics.mcmaster.ca  Wed Apr  3 11:16:28 2002
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Wed, 3 Apr 2002 14:16:28 -0500 (EST)
Subject: Hyperthreading in P4 Xeon (question)
In-Reply-To: <F129pJbCwOyD9YY8bxv0001bf08@hotmail.com>
Message-ID: <Pine.LNX.4.33.0204031401220.22599-100000@coffee.psychology.mcmaster.ca>

> I can amplify that point.  A commercial CFD application ran significantly 
> slower using 4 threads vs 2 on a dual Prestonia system.  Anything memory 
> limited will probably behave the same way.

well, it's an interesting issue.  afaict, the benefit of HT depends
on what degree your app leaves idle resources.  for instance,
if everything you run is thrashing your dram bandwidth (big arrays,
perhaps), then forget HT - it doesn't add extra dimms!
similarly, if the CPU has just one fsqrt unit, and that's your 
bottleneck, HT doesn't add more units.  there are other resource
nonlinearities, like cache hitrate - the same effect that gives rise
to superlinear SMP speedup will slaughter some apps run on HT...

but if there's other work to be done while one thread is spinning
sqrt's, ie, there are idle resources, then a thread that uses them
will show HT profit...

in some sense, HT works precisely when the system's resources
*don't* match the optimal set your app wants.  I wonder if/when
Intel will start pouring in hordes of extra functional units,
since another 50M transistors will only improve the cache hit rate
a little bit...  of course, it's also true that HT makes bigger
TLB's and more associative caches attractive...


From garcia_garcia_adrian at hotmail.com  Wed Apr  3 11:14:06 2002
From: garcia_garcia_adrian at hotmail.com (Adrian Garcia Garcia)
Date: Wed, 03 Apr 2002 19:14:06 +0000
Subject: DHCP Help Again
Message-ID: <LAW2-F30Tk9v7A0fpsi00001eb3@hotmail.com>

An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20020403/4e0c70c2/attachment.html>

From crhea at mayo.edu  Wed Apr  3 13:04:12 2002
From: crhea at mayo.edu (Cris Rhea)
Date: Wed, 3 Apr 2002 15:04:12 -0600 (CST)
Subject: How do you keep clusters running....
Message-ID: <200204032104.PAA23347@sijer.mayo.edu>

What are folks doing about keeping hardware running on large clusters?

Right now, I'm running 10 Racksaver RS-1200's (for a total of 20 nodes)...

Sure seems like every week or two, I notice dead fans (each RS-1200
has 6 case fans in addition to the 2 CPU fans and 2 power supply fans).

My last fan failure was a CPU fan that toasted the CPU and motherboard.

How are folks with significantly more nodes than mine dealing with constant
maintenance on their nodes?  Do you have whole spare nodes sitting around-
ready to be installed if something fails, or do you have a pile of
spare parts?  Did you get the vendor (if you purchased prebuilt systems)
to supply a stockpile of warranty parts?

One of the problems I'm facing is that every time something croaks, 
Racksaver is very good about replacing it under warranty, but getting
the new parts delivered usually takes several days.

For some things like fans, they sent extras for me to keep on-hand.

For my last fan/CPU/motherboard failure, the node pair will be 
down ~5 days waiting for parts.

Comments? Thoughts? Ideas?

Thanks-

--- Cris


----
  Cristopher J. Rhea                      Mayo Foundation
  Research Computing Facility              Pavilion 2-25
  crhea at Mayo.EDU                        Rochester, MN 55905
  Fax: (507) 266-4486                     (507) 284-0587


From fraser5 at cox.net  Wed Apr  3 13:37:56 2002
From: fraser5 at cox.net (Jim Fraser)
Date: Wed, 3 Apr 2002 16:37:56 -0500
Subject: How do you keep clusters running....
In-Reply-To: <200204032104.PAA23347@sijer.mayo.edu>
Message-ID: <000901c1db57$d222ac90$0300005a@papabear>

Sounds to me like you have a heat problem.  Dual ultra-thins generally run
pretty hot.  There is just no room for any serious air to move through
that case.  The fan diameter is so small that the fans need ridiculous rpms
to move the needed volume, which makes them noisy and prone to fail; add to
that the high heat and you shorten the MTBF to tomorrow.  Most fans fail
quickly in high heat conditions.
I think the basic rack design concept, while rugged and strong, is
fundamentally flawed and overpriced.  I would invest in a serious rack fan
that moves major air out of that case somehow.
Good luck with it.

jim

-----Original Message-----
From: beowulf-admin at beowulf.org [mailto:beowulf-admin at beowulf.org]On
Behalf Of Cris Rhea
Sent: Wednesday, April 03, 2002 4:04 PM
To: beowulf at beowulf.org
Subject: How do you keep clusters running....


What are folks doing about keeping hardware running on large clusters?

Right now, I'm running 10 Racksaver RS-1200's (for a total of 20 nodes)...

Sure seems like every week or two, I notice dead fans (each RS-1200
has 6 case fans in addition to the 2 CPU fans and 2 power supply fans).

My last fan failure was a CPU fan that toasted the CPU and motherboard.

How are folks with significantly more nodes than mine dealing with constant
maintenance on their nodes?  Do you have whole spare nodes sitting around-
ready to be installed if something fails, or do you have a pile of
spare parts?  Did you get the vendor (if you purchased prebuilt systems)
to supply a stockpile of warranty parts?

One of the problems I'm facing is that every time something croaks,
Racksaver is very good about replacing it under warranty, but getting
the new parts delivered usually takes several days.

For some things like fans, they sent extras for me to keep on-hand.

For my last fan/CPU/motherboard failure, the node pair will be
down ~5 days waiting for parts.

Comments? Thoughts? Ideas?

Thanks-

--- Cris


----
  Cristopher J. Rhea                      Mayo Foundation
  Research Computing Facility              Pavilion 2-25
  crhea at Mayo.EDU                        Rochester, MN 55905
  Fax: (507) 266-4486                     (507) 284-0587


From alvin at Maggie.Linux-Consulting.com  Wed Apr  3 13:44:14 2002
From: alvin at Maggie.Linux-Consulting.com (alvin at Maggie.Linux-Consulting.com)
Date: Wed, 3 Apr 2002 13:44:14 -0800 (PST)
Subject: How do you keep clusters running....
In-Reply-To: <200204032104.PAA23347@sijer.mayo.edu>
Message-ID: <Pine.LNX.3.96.1020403134055.5059A-100000@Maggie.Linux-Consulting.com>

hi ya

buy better quality fans...

we use $15.00 fans ( 40x40x10mm ), the stuff used in 1U chassis

( you can get fans as cheap as $4.00, but is a dead $1,000 server
  worth the cost difference of cheap fans ???
  not the place to save $$$ )
	- similarly ... get a better quality (cooler running) power supply too


fans should NOT die... at least not more than once a year ...

c ya
alvin
http://www.linux-1U.net ... 11" deep 1U chassis w/ amd 1700+


On Wed, 3 Apr 2002, Cris Rhea wrote:

> 
> What are folks doing about keeping hardware running on large clusters?
> 
> Right now, I'm running 10 Racksaver RS-1200's (for a total of 20 nodes)...
> 
> Sure seems like every week or two, I notice dead fans (each RS-1200
> has 6 case fans in addition to the 2 CPU fans and 2 power supply fans).
> 
> My last fan failure was a CPU fan that toasted the CPU and motherboard.
> 
> How are folks with significantly more nodes than mine dealing with constant
> maintenance on their nodes?  Do you have whole spare nodes sitting around-
> ready to be installed if something fails, or do you have a pile of
> spare parts?  Did you get the vendor (if you purchased prebuilt systems)
> to supply a stockpile of warranty parts?
> 
> One of the problems I'm facing is that every time something croaks, 
> Racksaver is very good about replacing it under warranty, but getting
> the new parts delivered usually takes several days.
> 
> For some things like fans, they sent extras for me to keep on-hand.
> 
> For my last fan/CPU/motherboard failure, the node pair will be 
> down ~5 days waiting for parts.
> 
> Comments? Thoughts? Ideas?
> 


From nordwall at pnl.gov  Wed Apr  3 14:46:31 2002
From: nordwall at pnl.gov (Doug J Nordwall)
Date: Wed, 03 Apr 2002 14:46:31 -0800
Subject: How do you keep clusters running....
In-Reply-To: <200204032104.PAA23347@sijer.mayo.edu>
References: <200204032104.PAA23347@sijer.mayo.edu>
Message-ID: <1017873992.2054.42.camel@duke>

On Wed, 2002-04-03 at 13:04, Cris Rhea wrote:

    What are folks doing about keeping hardware running on large clusters?
    
    Right now, I'm running 10 Racksaver RS-1200's (for a total of 20 nodes)...
    
    Sure seems like every week or two, I notice dead fans (each RS-1200
    has 6 case fans in addition to the 2 CPU fans and 2 power supply fans).
    

You running lm_sensors on your nodes? That's a handy tool for paying
attention to things like that. We use ours in combination with ganglia
and pump it to a web page and to big brother to see when a cpu might be
getting hot, or a fan might be too slow. We actually saved a dozen
machines that way...we have 32 4-processor racksaver boxes in a rack,
and the rack was not designed to handle racksaver's fan system. That is
to say, there was a solid sidewall on the rack, and it kept in heat. I
set up lm_sensors on all the nodes (homogeneous, so configured on one and
pushed it out to all), then pumped the data into ganglia
(ganglia.sourceforge.net) and then to a web page. I noticed that the
temp on a dozen of the machines was extremely high. So, I took off the
side panel of the rack. The temp dropped by 15 C on all the nodes, and
everything was within normal parameters again.
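
As a rough illustration of the kind of check described above, here is a
minimal Python sketch that scans sensor readings for slow fans and hot
CPUs. The sample text only imitates the general style of `sensors`
output (the real format depends on the chip, driver and lm_sensors
version), and the names SAMPLE, FAN_RE, TEMP_RE and find_alarms are
hypothetical; they are not part of lm_sensors or ganglia.

```python
import re

# Hypothetical sample in the general style of `sensors` output.
# Real output differs by chip, driver and lm_sensors version.
SAMPLE = """\
fan1:      4821 RPM  (min = 3000 RPM)
fan2:         0 RPM  (min = 3000 RPM)
temp1:      +71.0 C  (limit = +60.0 C)
temp2:      +48.5 C  (limit = +60.0 C)
"""

# One pattern per reading type; both are assumptions about the layout above.
FAN_RE = re.compile(r"^(fan\d+):\s+(\d+) RPM\s+\(min = (\d+) RPM\)", re.M)
TEMP_RE = re.compile(
    r"^(temp\d+):\s+\+?(-?[\d.]+) C\s+\(limit = \+?(-?[\d.]+) C\)", re.M
)


def find_alarms(text):
    """Return a list of human-readable alarms for slow fans and hot sensors."""
    alarms = []
    for name, rpm, minimum in FAN_RE.findall(text):
        if int(rpm) < int(minimum):
            alarms.append(f"{name}: {rpm} RPM below minimum {minimum} RPM")
    for name, temp, limit in TEMP_RE.findall(text):
        if float(temp) > float(limit):
            alarms.append(f"{name}: {temp} C above limit {limit} C")
    return alarms


for line in find_alarms(SAMPLE):
    print(line)
```

In practice the text would come from running `sensors` on each node, and
the patterns would need adjusting to whatever that node actually prints.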


    My last fan failure was a CPU fan that toasted the CPU and motherboard.


Ya, we would have seen this on ours earlier...excellent tool

    
    How are folks with significantly more nodes than mine dealing with constant
    maintenance on their nodes?  Do you have whole spare nodes sitting around-
    ready to be installed if something fails, or do you have a pile of
    spare parts?


No, we don't actually, but we've talked about it


      Did you get the vendor (if you purchased prebuilt systems)
    to supply a stockpile of warranty parts?


we use racksaver as well, so our experience is similar. Probably should
talk to our people about getting some spare nodes

    
    One of the problems I'm facing is that every time something croaks, 
    Racksaver is very good about replacing it under warranty, but getting
    the new parts delivered usually takes several days.
    

Ya...this is another area where just monitoring the data can be
helpful...if a fan is failing, you can see it coming (temperature slowly
rises) and you can order it beforehand and schedule downtime.


    ----
      Cristopher J. Rhea                      Mayo Foundation
      Research Computing Facility              Pavilion 2-25
      crhea at Mayo.EDU                        Rochester, MN 55905
      Fax: (507) 266-4486                     (507) 284-0587

-- 
Douglas J Nordwall	http://rex.nmhu.edu/~musashi	
System Administrator	Pacific Northwest National Labs

From tim.carlson at pnl.gov  Wed Apr  3 14:59:56 2002
From: tim.carlson at pnl.gov (Tim Carlson)
Date: Wed, 03 Apr 2002 14:59:56 -0800 (PST)
Subject: How do you keep clusters running....
In-Reply-To: <000901c1db57$d222ac90$0300005a@papabear>
Message-ID: <Pine.LNX.4.44.0204031458050.4260-100000@roach.emsl.pnl.gov>

On Wed, 3 Apr 2002, Jim Fraser wrote:

> Sounds to me like you have a heat problem.  dual ultra thin's generally run
> pretty hot.

If you are putting these boxes in a rack and are not using the Racksaver
rack, you need to take the side off of your rack (assuming you can do
that)

We've got 32 of these in a rack (4 CPU's per 1U) and they were running
really hot until we took the side panel off. 5 minutes later the CPU
temps had dropped 10C.

Tim

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support


From opengeometry at yahoo.ca  Wed Apr  3 12:32:27 2002
From: opengeometry at yahoo.ca (William Park)
Date: Wed, 3 Apr 2002 15:32:27 -0500
Subject: Hyperthreading in P4 Xeon (question)
In-Reply-To: <Pine.LNX.4.33.0204031046590.20485-100000@coffee.psychology.mcmaster.ca>; from hahn@physics.mcmaster.ca on Wed, Apr 03, 2002 at 10:50:06AM -0500
References: <20020403023152.A2972@node0.opengeometry.ca> <Pine.LNX.4.33.0204031046590.20485-100000@coffee.psychology.mcmaster.ca>
Message-ID: <20020403153227.A15201@node0.opengeometry.ca>

On Wed, Apr 03, 2002 at 10:50:06AM -0500, Mark Hahn wrote:
> > What is the realistic effect of "hyperthreading" in P4 Xeon?  I'm not
> > versed in the latest CPU trends.  Does it mean that dual-P4Xeon will
> > behave like 4-way SMP?
> 
> for some value of "behave like" ;)
> that is, it will definitely NOT get twice as fast.  but it will appear
> to have 4 CPUs, and can run 4 threads/procs at once (for values of 
> "once" > 1 clock cycle ;)
> 
> we did a quick test on a dual-prestonia here, and saw a ~5% speedup
> on a probably cache-friendly, compute-bound task.

Hi Mark, Steve, and Michael,

Can you try compiling your kernel, using
    make clean; time make     bzImage modules >& j1
    make clean; time make -j2 bzImage modules >& j2
    make clean; time make -j4 bzImage modules >& j4

-- 
William Park, Open Geometry Consulting, <opengeometry at yahoo.ca>
8 CPU cluster, NAS, (Slackware) Linux, Python, LaTeX, Vim, Mutt, Tin


From James.P.Lux at jpl.nasa.gov  Wed Apr  3 17:15:58 2002
From: James.P.Lux at jpl.nasa.gov (Jim Lux)
Date: Wed, 03 Apr 2002 17:15:58 -0800
Subject: How do you keep clusters running....
In-Reply-To: <200204032104.PAA23347@sijer.mayo.edu>
Message-ID: <5.1.0.14.2.20020403170033.0248fec0@mail1.jpl.nasa.gov>

You know, fans shouldn't fail...... There are fans available with 50,000 
hour MTBFs.. sure, they cost a bit more than $5, but, given the cost of the 
time to replace them (especially if you cook something), it might be a good 
investment.
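
To put an MTBF figure in context for the cluster in question (10
RS-1200s, each with 6 case, 2 CPU and 2 power supply fans), here is a
back-of-envelope sketch in Python. It assumes a constant failure rate
and fans running around the clock, which is a simplification, and the
function names are made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365


def expected_failures_per_year(num_fans, mtbf_hours):
    """Expected fan failures per year, assuming a constant failure rate."""
    return num_fans * HOURS_PER_YEAR / mtbf_hours


def implied_mtbf_hours(num_fans, hours_between_failures):
    """MTBF per fan implied by how often any one fan in the pool fails."""
    return num_fans * hours_between_failures


# 10 chassis, each with 6 case + 2 CPU + 2 power supply fans.
fans = 10 * (6 + 2 + 2)

# Failure rate to expect if every fan really had a 50,000 hour MTBF.
per_year = expected_failures_per_year(fans, 50_000)

# "Every week or two": take 1.5 weeks between failures as a midpoint.
implied = implied_mtbf_hours(fans, 1.5 * 7 * 24)

print(f"{fans} fans, {per_year:.2f} expected failures per year")
print(f"implied MTBF from observed failures: {implied:.0f} hours")
```

Under these assumptions, 100 fans at 50,000 hours MTBF still means
roughly 17.5 failures a year (one every three weeks or so), and a
failure every week or two corresponds to an MTBF of about 25,000 hours
per fan.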

You might cannibalize one of your failed fans to look for the number and 
kind of bearings.  I have heard that some "ball bearing" fans actually have 
sleeve bearings, a sure recipe for short life.  It's not unheard of to have 
some fans that are mislabelled.  Bear in mind that most fans have two 
bearings (one on each end of the shaft) and it is entirely possible to 
build a fan with one sleeve and one ball bearing.

At 03:04 PM 4/3/2002 -0600, Cris Rhea wrote:

>What are folks doing about keeping hardware running on large clusters?
>
>Right now, I'm running 10 Racksaver RS-1200's (for a total of 20 nodes)...
>
>Sure seems like every week or two, I notice dead fans (each RS-1200
>has 6 case fans in addition to the 2 CPU fans and 2 power supply fans).


Jim Lux

Spacecraft Telecommunications Equipment Section
Jet Propulsion Laboratory
4800 Oak Grove Road, Mail Stop 161-213
Pasadena CA 91109

818/354-2075, fax 818/393-6875


From emiller at techskills.com  Wed Apr  3 18:12:46 2002
From: emiller at techskills.com (Eric Miller)
Date: Wed, 3 Apr 2002 21:12:46 -0500
Subject: Node boot disk to designate eth0
In-Reply-To: <20020403040719.14849.qmail@web11408.mail.yahoo.com>
Message-ID: <NMELJLHHFNGMNFFNGJAEIEMICDAA.emiller@techskills.com>

Is there a switch I can pass to the node floppy routine that will cause the
node to boot using a designated ethernet adapter?  I have one onboard 10 Mb
adapter and a PCI 100 Mb adapter (eth0), but the node tries to connect
through the onboard eth1.  I cannot disable the onboard adapter in BIOS
(compaq :( ), so I need to pass a parameter at boot time to use eth0.  Can
this be done?


From becker at scyld.com  Wed Apr  3 19:37:23 2002
From: becker at scyld.com (Donald Becker)
Date: Wed, 3 Apr 2002 22:37:23 -0500 (EST)
Subject: Node boot disk to designate eth0
In-Reply-To: <NMELJLHHFNGMNFFNGJAEIEMICDAA.emiller@techskills.com>
Message-ID: <Pine.LNX.4.33.0204032234290.1253-100000@presario>

On Wed, 3 Apr 2002, Eric Miller wrote:

> Subject: Node boot disk to designate eth0
>
> Is there a switch I can pass to the node floppy routine that will cause the
> node to boot using a designated ethernet adapter?  I have one onboard 10 Mb
> adapter and a PCI 100 Mb adapter (eth0), but the node tries to connect
> through the onboard eth1.  I cannot disable the onboard adapter in BIOS
> (compaq :( ), so I need to pass a parameter at boot time to use eth0.  Can
> this be done?

The Scyld system tries all interfaces (using RARP) to find a master.
That allows the system to work with all network topologies.

To avoid finding the master on eth1, just don't connect that
interface to a master.

-- 
Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993


From emiller at techskills.com  Wed Apr  3 19:50:21 2002
From: emiller at techskills.com (Eric Miller)
Date: Wed, 3 Apr 2002 22:50:21 -0500
Subject: Fw: Node boot disk to designate eth0
Message-ID: <005e01c1db8b$da3a8590$c31fa6ac@xp>

----- Original Message -----
From: "Eric Miller" <emiller at techskills.com>
To: "Donald Becker" <becker at scyld.com>
Sent: Wednesday, April 03, 2002 10:48 PM
Subject: Re: Node boot disk to designate eth0


>
> ----- Original Message -----
> From: "Donald Becker" <becker at scyld.com>
> To: "Eric Miller" <emiller at techskills.com>
> Cc: <Beowulf at beowulf.org>
> Sent: Wednesday, April 03, 2002 10:37 PM
> Subject: Re: Node boot disk to designate eth0
>
>
> > On Wed, 3 Apr 2002, Eric Miller wrote:
> >
> > > Subject: Node boot disk to designate eth0
> > >
> > > Is there a switch I can pass to the node floppy routine that will cause
> > > the node to boot using a designated ethernet adapter?  I have one onboard
> > > 10 Mb adapter and a PCI 100 Mb adapter (eth0), but the node tries to
> > > connect through the onboard eth1.  I cannot disable the onboard adapter
> > > in BIOS (compaq :( ), so I need to pass a parameter at boot time to use
> > > eth0.  Can this be done?
> >
> > The Scyld system tries all interfaces (using RARP) to find a master.
> > That allows the system to work with all network topologies.
> >
> > To avoid finding the master on eth1, just don't connect that
> > interface to a master.
>
> It's odd, I can see that the drivers for both interfaces are being loaded,
> and they are both the correct drivers.  It does not seem to be looking on
> both interfaces, however.  It clearly is looking on only eth1, as it
> specifies it line by line during the RARP requests.  I've tried all the
> obvious, different NIC, different board, etc.
>
> Thanks, I guess I'll leave it alone.
> >
> > --
> > Donald Becker becker at scyld.com
> > Scyld Computing Corporation http://www.scyld.com
> > 410 Severn Ave. Suite 210 Second Generation Beowulf Clusters
> > Annapolis MD 21403 410-990-9993
> >
>


From leandro at ep.petrobras.com.br  Thu Apr  4 05:12:54 2002
From: leandro at ep.petrobras.com.br (Leandro Tavares Carneiro)
Date: 04 Apr 2002 10:12:54 -0300
Subject: How do you keep clusters running....
In-Reply-To: <200204032104.PAA23347@sijer.mayo.edu>
References: <200204032104.PAA23347@sijer.mayo.edu>
Message-ID: <1017925975.30189.87.camel@linux60>

We have a Beowulf cluster here with 64 production nodes and 128
processors, and we have fan problems much like yours.
Our cluster hardware is very cheap, built from motherboards and cases
easily found in the local market, and the problem is critical.
We have 5 spare nodes, and only 3 of them are ready to work. All our
production nodes and the 3 spare nodes that are ready to start are
dual PIII 1GHz; the other 2 spare nodes are dual PIII 800MHz, but those
processors are Slot 1 (SECC2), and we have one node down because we
can't find coolers for it! The cooler vendors say they no longer
produce SECC2 coolers, and I am studying how I can adapt other fans to
those coolers... this is sad but true.
We also have a lot of problems with memory, hard disks and other parts.
Three months ago our cluster nodes had one PIII 500 MHz each, and after
the upgrade to dual 1GHz we now have lots of memory and spare disks.

I think this kind of problem is inevitable with cheap PC parts, and can
be reduced with higher-quality (and higher-priced) parts. We are doing a
study to buy a new cluster for another application, and we have called
Compaq and IBM to see what they have in hardware and software, in the
hope of a future with fewer problems...

Regards, and sorry about my poor English; I am Brazilian and speak
Portuguese...

On Wed, 2002-04-03 at 18:04, Cris Rhea wrote:
> 
> What are folks doing about keeping hardware running on large clusters?
> 
> Right now, I'm running 10 Racksaver RS-1200's (for a total of 20 nodes)...
> 
> Sure seems like every week or two, I notice dead fans (each RS-1200
> has 6 case fans in addition to the 2 CPU fans and 2 power supply fans).
> 
> My last fan failure was a CPU fan that toasted the CPU and motherboard.
> 
> How are folks with significantly more nodes than mine dealing with constant
> maintenance on their nodes?  Do you have whole spare nodes sitting around-
> ready to be installed if something fails, or do you have a pile of
> spare parts?  Did you get the vendor (if you purchased prebuilt systems)
> to supply a stockpile of warranty parts?
> 
> One of the problems I'm facing is that every time something croaks, 
> Racksaver is very good about replacing it under warranty, but getting
> the new parts delivered usually takes several days.
> 
> For some things like fans, they sent extras for me to keep on-hand.
> 
> For my last fan/CPU/motherboard failure, the node pair will be 
> down ~5 days waiting for parts.
> 
> Comments? Thoughts? Ideas?
> 
> Thanks-
> 
> --- Cris
> 
> 
> 
> ----
>   Cristopher J. Rhea                      Mayo Foundation
>   Research Computing Facility              Pavilion 2-25
>   crhea at Mayo.EDU                        Rochester, MN 55905
>   Fax: (507) 266-4486                     (507) 284-0587
-- 
Leandro Tavares Carneiro
Analista de Suporte
EP-CORP/TIDT/INFI
Telefone: 2534-1427


From rgb at phy.duke.edu  Thu Apr  4 06:52:47 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Thu, 4 Apr 2002 09:52:47 -0500 (EST)
Subject: DHCP Help Again
In-Reply-To: <LAW2-F30Tk9v7A0fpsi00001eb3@hotmail.com>
Message-ID: <Pine.LNX.4.44.0204031619270.30309-100000@ganesh.phy.duke.edu>

On Wed, 3 Apr 2002, Adrian Garcia Garcia wrote:

For one thing don't use the range statement -- it tells dhcpd the range
of IP numbers to assign to UNKNOWN ethernet numbers.  You are statically
assigning an IP number in your "free" range to a particular host with a
KNOWN ethernet number below.  I don't know what dhcpd would do in that
case -- something sensible one would hope but then, maybe not.  The
range statement is really there so you can dynamically allocate
addresses from the range to hosts you may never have seen before that
you don't care to ever address by name (as they might well get a
different IP number on the next boot).  

DHCP servers run by ISP's not infrequently use the range feature to
conserve IP numbers -- they only need enough to cover the greatest
number of connections they are likely to have at any one time, not one
IP number per host that might ever connect.  Departments might use it to
give IP numbers to laptops brought in by visitors (with the extra
benefit that they can assign a subnet block that isn't "trusted" by the
usual department servers and/or is firewalled from the outside by an
ip-forwarding/masquerading host).

You want "only" static IP's in your cluster, as you'd like nodo1 to be
the same machine and IP address every time.

Be a bit careful about your use of domain names.  As it happens, I don't
find cluster.org registered yet (amazingly enough!) but it is pretty
easy to pick one that does exist in nameservice in the outside world.
In that case you'll run a serious risk of routing or name resolution
problems depending on things like the search order you use in
/etc/nsswitch.conf.  Even my previous example of rgb.private.net is a
bit risky.

You should run a nameserver (cache only is fine) on your 192.168.1.1
server, presuming it lives on an external network and you care to
resolve global names.

Similarly you may want:

 option routers		192.168.1.1;

if you want internal hosts to be able to get out through your (presumed
gateway) server.

Finally, if you want nodo1 to come up knowing its own name without
hardwiring it in on the node itself, add

 option host-name	nodo1;

to its definition.
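
Pulling the suggestions above together, a static-only setup might be
generated along these lines. This is a Python sketch, not a tested
configuration: the helper names (host_stanza, build_conf) and the NODES
table are hypothetical, the ethernet address and IP number are the ones
from the quoted configuration below, and host-name is written as a
quoted string.  Note that the ISC dhcpd option for nameservers is
spelled domain-name-servers (plural).

```python
def host_stanza(name, mac, ip):
    """Return a dhcpd.conf host block pinning one ethernet address to one IP."""
    return (
        f"host {name} {{\n"
        f"    hardware ethernet {mac};\n"
        f"    fixed-address {ip};\n"
        f'    option host-name "{name}";\n'
        "}\n"
    )


def build_conf(subnet, netmask, gateway, nodes):
    """Build a static-only dhcpd.conf: no range statement, one host per node."""
    parts = [
        f"subnet {subnet} netmask {netmask} {{\n"
        f"    option routers {gateway};\n"
        f"    option domain-name-servers {gateway};\n"
        "}\n"
    ]
    parts.extend(host_stanza(*node) for node in nodes)
    return "\n".join(parts)


# (name, ethernet address, IP number) for each known node.
NODES = [("nodo1", "00:60:97:a1:ef:e0", "192.168.1.2")]

conf = build_conf("192.168.1.0", "255.255.255.0", "192.168.1.1", NODES)
print(conf)
```

Adding a node is then one more line in NODES, which keeps every node on
a fixed address without any dynamic pool.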

I admit that I do tend to lay out my dhcpd.conf a bit differently than
you have it below but I don't think that the differences are
particularly significant, and you have a copy of the one I use anyway if
you want to play with the pieces.  You should find a log trace of
dhcpd's activities in /var/log/messages, which should help with any
further debugging.

On your nodo1 host, make sure that:

cat /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
BOOTPROTO=dhcp
ONBOOT=yes

and

cat /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=nodo1

and that in /etc/modules.conf there is something like:

cat /etc/modules.conf
alias parport_lowlevel parport_pc
alias eth0 tulip

(or instead of tulip, whatever your network module is).

If you then boot your e.g. RH client it SHOULD just come up,
automatically try to start the network on device eth0 using dhcp as its
protocol for obtaining an IP number, ask the dhcp server for an address
and a route, and just "work" when they come back.

  Hope this helps.

       rgb

> server-name "server.cluster.org"
>  
> subnet 192.168.1.0 netmask 255.255.255.0
> {
>   range 192.168.1.2         192.168.1.10   #my client has the ip 192.168.1.2
>                                            #and my server the static ip 192.168.1.1
>  option subnet-mask                             255.255.255.0;
>  option broadcast-address                    192.168.1.255;
>  option domain-name-server                 192.168.1.1;  
>  option domain-name                            "cluster.org";
>  
>  host  nodo1.cluster.org
>  {
>     hardware ethernet 00:60:97:a1:ef:e0; #here is the address of the client's card
>     fixed-address        192.168.1.2;
>  }
> } 
>  
> And finally some files on my server.
>  
> NETWORK
> ------------------------------------------
> networking = yes
> hostname =server.cluster.org
> gatewaydev = eth0
> gatewaye=
> ------------------------------------------
>  
> HOSTS ( In my server and in the client I have the same on this file )
> ------------------------------------------
> 127.0.0.1             localhost
> 192.168.1.1         server.cluster.org
> 192.168.1.2         nodo1.cluster.org
>  
>  
> Ok, that's the information. I am a little confused, could you help me please
> =). I can't detect the mistake; I don't know if it is the server or some card
> =s. Thanks for all.
> 
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From jayne at sphynx.clara.co.uk  Thu Apr  4 10:49:03 2002
From: jayne at sphynx.clara.co.uk (Jayne Heger)
Date: Thu, 4 Apr 2002 18:49:03 +0000
Subject: commercial parallel libraries
Message-ID: <E16tBA7-0003ck-00@scrabble.freeuk.net>

Hi,

I know this is a beowulf list, but I could do with getting some info on any 
(if there are) commercial parallel libraries, the equivalent of pvm and mpi.

Do any of you know the names of any?

Thanks.

Jayne


From gropp at mcs.anl.gov  Thu Apr  4 11:01:59 2002
From: gropp at mcs.anl.gov (William Gropp)
Date: Thu, 04 Apr 2002 13:01:59 -0600
Subject: commercial parallel libraries
In-Reply-To: <E16tBA7-0003ck-00@scrabble.freeuk.net>
Message-ID: <5.1.0.14.2.20020404125850.0197fb88@localhost>

At 06:49 PM 4/4/2002 +0000, Jayne Heger wrote:

>Hi,
>
>I know this is a beowulf list, but I could do with getting some info on any
>(if there are) commercial parallel libraries, the equivalent of pvm and mpi.
>
>Do any of you know the names of any?

MPI is a standard for which there are both freely available and commercial 
implementations.

Bill


From Daniel.Kidger at quadrics.com  Thu Apr  4 11:25:29 2002
From: Daniel.Kidger at quadrics.com (Daniel Kidger)
Date: Thu, 4 Apr 2002 20:25:29 +0100 
Subject: commercial parallel libraries
Message-ID: <010C86D15E4D1247B9A5DD312B7F5AA74D2D43@stegosaurus.bristol.quadrics.com>


-----Original Message-----
William Gropp [mailto:gropp at mcs.anl.gov] wrote:
>At 06:49 PM 4/4/2002 +0000, Jayne Heger wrote:
>
>>Hi,
>>
>>I know this is a beowulf list, but I could do with getting some info on any
>>(if there are) commercial parallel libraries, the equivalent of pvm and mpi.
>>Do any of you know the names of any?
>
>MPI is a standard for which there are both freely available and commercial 
>implementations.

or do you mean something that is 'the equivalent of mpi and pvm'
   but which isn't pvm or mpi (like perhaps ARMCI)?


Yours,
Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger at quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
----------------------- www.quadrics.com --------------------


From Kim.Branson at csiro.au  Thu Apr  4 07:38:50 2002
From: Kim.Branson at csiro.au (Kim Branson)
Date: 05 Apr 2002 01:38:50 +1000
Subject: node problems
Message-ID: <1017934730.20621.38.camel@paracelsus>

Hi all

i have a 64node athlon cluster, at the moment i have about 19 nodes that
are flaky, they stay up for a bit and then fall over. one can still ping
them but not telnet or ftp. I'm trying to keep as many up as possible
(more nodes means i can get the final calculations done for my phd
thesis faster....)

this may be an unrelated problem but i see errors in the logs about
telnet

node01 telnetd[16941]: ttloop: peer died: EOF
xinetd[17099]: warning: can't get client address: Connection reset by peer
Apr  5 00:32:21 node01 rlogind[17099]: Can't get peer name of remote host: Transport endpoint is not connected
Apr  5 00:32:21 node01 rshd[17098]: getpeername: Transport endpoint is not connected
Apr  5 00:32:21 node01 ftpd[17097]: getpeername (in.ftpd): Transport endpoint is not connected
Apr  5 00:32:31 node01 rlogind[17100]: Can't get peer name of remote host: Transport endpoint is not connected
Apr  5 00:32:31 node01 xinetd[17101]: warning: can't get client address: Connection reset by peer
Apr  5 00:32:31 node01 xinetd[17102]: warning: can't get client address: Connection reset by peer
Apr  5 00:32:31 node01 xinetd[17103]: warning: can't get client address: Connection reset by peer
Apr  5 00:32:31 node01 ftpd[17101]: getpeername (in.ftpd): Transport endpoint is not connected

i am using enfuzion to do job dispatch and collect. by looking at
the packets i see the enfuzion director on the head node attempts to
send a UDP packet to the node. all udp ports on the nodes are blocked;
i checked this by scanning a node with nmap. older installs of redhat
(i.e. my workstation) seem to have udp ports enabled.

regardless of the ttloop error the machine appears to work for a while,
i.e. enfuzion logs in, jobs run etc., until suddenly all stops.
the machines remain up, and can be pinged, but no other services (rsh,
ssh etc.) run. If i connect a monitor and keyboard to the node it is also
unresponsive.


this is a problem across many nodes.
has anyone who uses enfuzion seen this error with nodes that are a rh7.1
install?

On one node i have seen on 2 occasions 

CPU 0: Machine Check Exception: 0000000000000004
Bank 2: d40040000000017a at 540040000000017a

decoding this using a util i found on the net

Status: (4) Machine Check in progress.
Restart IP invalid.
parsebank(2): f60020000000017a @ 760020000000017a
        External tag parity error
        Correctable ECC error
        MISC register information valid
        Memory heirarchy error
        Request: Generic error
        Transaction type : Generic
        Memory/IO : I/O


can anyone tell me what the Restart IP invalid means. is this a dead cpu
or a memory problem causing a mce? 
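
For what it's worth, the 0000000000000004 on the "Machine Check
Exception" line appears to be the global machine-check status register
(MCG_STATUS), whose three low bits on x86 CPUs are RIPV (bit 0), EIPV
(bit 1) and MCIP (bit 2). A small Python sketch of that decoding
follows; the table and function names are hypothetical, and the per-bank
status word is deliberately not decoded here.

```python
# Low bits of the x86 global machine-check status register (MCG_STATUS).
MCG_STATUS_FLAGS = [
    (0, "RIPV", "restart IP valid"),
    (1, "EIPV", "error IP valid"),
    (2, "MCIP", "machine check in progress"),
]


def decode_mcg_status(value):
    """Map each flag name to True if its bit is set in the status value."""
    return {name: bool((value >> bit) & 1) for bit, name, _ in MCG_STATUS_FLAGS}


# Value taken from the "Machine Check Exception" line above.
flags = decode_mcg_status(0x0000000000000004)
for bit, name, meaning in MCG_STATUS_FLAGS:
    state = "set" if flags[name] else "clear"
    print(f"bit {bit} {name} ({meaning}): {state}")
```

With a value of 4 only MCIP is set. RIPV being clear is what the decoder
reports as "Restart IP invalid": the CPU does not guarantee that
execution can be reliably resumed at the saved instruction pointer. On
its own that bit does not say whether the CPU or the memory is at fault.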

cheers

Kim
-- 
______________________________________________________________________ 

Kim Branson
Phd Student
Structural Biology
CSIRO Health Sciences and Nutrition
Walter and Eliza Hall Institute
Royal Parade, Parkville, Melbourne, Victoria
Ph 61 03 9662 7136
Email kbranson at wehi.edu.au

______________________________________________________________________ 


From juari at provinet.com.br  Thu Apr  4 14:20:47 2002
From: juari at provinet.com.br (JOELMIR RITTER MULLER                   )
Date: Thu,  4 Apr 2002 19:20:47 -0300
Subject: very high bandwidth, low latency manner?
Message-ID: <200204041920.AA26476752@provinet.com.br>

What is the best means of interconnecting several microcomputers
in a very high bandwidth, low latency manner?
Does anyone have some ideas about this subject?

Cheers, 
Juari R. Müller


From James.P.Lux at jpl.nasa.gov  Thu Apr  4 16:05:30 2002
From: James.P.Lux at jpl.nasa.gov (Jim Lux)
Date: Thu, 04 Apr 2002 16:05:30 -0800
Subject: very high bandwidth, low latency manner?
In-Reply-To: <200204041920.AA26476752@provinet.com.br>
Message-ID: <5.1.0.14.2.20020404160051.00aff7a0@mail1.jpl.nasa.gov>

What's high bandwidth?
What's low latency?
How much money do you want to spend?

Ethernet is cheap, $100-$200/node for 100 Mbps or GBE (by the time you get 
switches, cables, adapters, etc.)
Latency is kind of high (compared to dedicated point-to-point links)


At 07:20 PM 4/4/2002 -0300, JOELMIR RITTER MULLER wrote:

>What is the best means of interconnecting several microcomputers
>in a very high bandwidth, low latency manner?
>Does anyone have some ideas about this subject?
>
>Cheers,
>Juari R. Müller
>
>

Jim Lux
Spacecraft Telecommunications Equipment Section
Jet Propulsion Laboratory
4800 Oak Grove Road, Mail Stop 161-213
Pasadena CA 91109

818/354-2075, fax 818/393-6875


From aby_sinha at yahoo.com  Thu Apr  4 16:43:01 2002
From: aby_sinha at yahoo.com (Abhishek sinha)
Date: Thu, 04 Apr 2002 16:43:01 -0800
Subject: console redirect issue
References: <3CA244E0.5060602@yahoo.com>
Message-ID: <3CACF315.2080704@yahoo.com>

Hi list

Some time ago I posted this problem to the list, and now that I have
solved it I wanted to share my experience. I had console redirection
enabled in the BIOS of a Tyan 2505 T, and every time I booted, the
machine went straight into the BIOS. On the other end I was using
HyperTerminal (customer requirement). I checked the console redirection
with an older version of HyperTerminal (Windows 2000) and found it
working; I mean the system was not going into the BIOS every time. But
when I used the HyperTerminal version 5 that comes with Win2000
Professional, the system went into the BIOS every time it booted,
without anyone touching a key. So much for Microsoft technology, when
the newer version doesn't work and the older version does. Finally we
resorted to using CRT on Windows to do the console redirect and it
worked fine. I was trying to convince the customer to use minicom,
since we were selling Linux-based servers and knew it would work, but
to no avail. Being a tech, I was amazed at what we can do with Linux
since we have the code open. You don't realise it much of the time
until you get an application that doesn't run and you can't do anything
about it. It's ridiculous that the older version of HyperTerminal works
and the newer one shows strange problems...


Abhishek Sinha
California Digital


Abhishek sinha wrote:

> hi list
>
>
> This might be just out of the topic, but i couldnt find help anywhere. 
> I am using serial console redirect on the 2505 t Tyan board. now i am 
> getting strange things that i have never seen before. When i connect 
> the machines with the null modem cable , the machine (where the 
> console redirect is enabled ) goes into the BIOS. If u save and exit 
> again it goes into the BIOS without doing anything. When u disconnect 
> the cable then this does not happen . I tried using a cross over rj45 
> cable. With this i cannot see the POST messages and i can only see the 
> messages when the kernel boots. Is this an issue with the BIOS or some 
> one has been in wonderland and seen this issue .
>
> Please advise
> abhisek
>
>


From raysonlogin at yahoo.com  Thu Apr  4 17:43:34 2002
From: raysonlogin at yahoo.com (Rayson Ho)
Date: Thu, 4 Apr 2002 17:43:34 -0800 (PST)
Subject: very high bandwidth, low latency manner?
In-Reply-To: <5.1.0.14.2.20020404160051.00aff7a0@mail1.jpl.nasa.gov>
Message-ID: <20020405014334.82902.qmail@web11402.mail.yahoo.com>

You may consider Myrinet, VIA, SCI...

(don't have the money to try each of those, so I can't tell you which is
the best ;-( )

http://grappew2k.imag.fr/evalRezo.html (just found this benchmark on
the Net)

Rayson


--- Jim Lux <James.P.Lux at jpl.nasa.gov> wrote:
> What's high bandwidth?
> What's low latency?
> How much money do you want to spend?
> 
> Ethernet is cheap, $100-$200/node for 100 Mbps or GBE (by the time
> you get 
> switches, cables, adapters, etc.)
> Latency is kind of slow (compared to dedicated point to point links)
> 
> 
> 
> 
> 
> 
> At 07:20 PM 4/4/2002 -0300, JOELMIR RITTER MULLER wrote:
> 
> >What is the best means of interconnecting several microcomputers
> >in a very high bandwidth, low latency manner?
> >Does anyone have some ideas about this subject?
> >
> >Cheers,
> >Juari R. Müller
> >
> >
> 
> Jim Lux
> Spacecraft Telecommunications Equipment Section
> Jet Propulsion Laboratory
> 4800 Oak Grove Road, Mail Stop 161-213
> Pasadena CA 91109
> 
> 818/354-2075, fax 818/393-6875
> 


From ron_chen_123 at yahoo.com  Thu Apr  4 20:23:51 2002
From: ron_chen_123 at yahoo.com (Ron Chen)
Date: Thu, 4 Apr 2002 20:23:51 -0800 (PST)
Subject: FreeBSD port of SGE (Compute farm system)
Message-ID: <20020405042351.86759.qmail@web14706.mail.yahoo.com>

Hi,

I compiled the source, changed a few parameters, and
SGE finally runs on FreeBSD. It is running in single-
user mode, with only 1 host. I am doing a little clean
up, and then I will need to make sure my changes do
not affect others (by "#ifdef BSD").

It still does not get the correct system information
yet, but some of the job accounting info is there (at
least run time is correct  8-) ).

It has now been running for several hours and looks stable.
It has run several tens of jobs. "qstat", "qhost", "qacct",
"qconf", and "qdel" look fine and their output makes sense (but I still
need to implement the resource info collecting routines).

I will post the patches tomorrow, together with some
output of the commands. (I will be busy today)

Also, I will move the discussion from the hackers list
to the cluster at freebsd list.

-Ron



From Karl.Bellve at umassmed.edu  Fri Apr  5 07:08:53 2002
From: Karl.Bellve at umassmed.edu (Karl Bellve)
Date: Fri, 05 Apr 2002 10:08:53 -0500
Subject: ISSPL
Message-ID: <3CADBE05.1A608277@umassmed.edu>

Is there an AMD or INTEL optimized version of the ISSPL libraries?

We have an application that I ported from an array processor from CSPI
to a Beowulf system and it uses ISSPL. Right now, the ISSPL library I am
using is just straight C code and doesn't contain any optimization for
the Intel/AMD platform.

Or, is it better to switch to another library, like Intel's Math Kernel
Library, or perhaps just use FFTW? It would be simpler if I could just
find a standard ISSPL library for Intel/AMD.

-- 
Cheers,


Karl Bellve, Ph.D.                   ICQ # 13956200
Biomedical Imaging Group             TLCA# 7938 		
University of Massachusetts
Email: Karl.Bellve at umassmed.edu
Phone: (508) 856-6514
Fax:   (508) 856-1840
PGP Public key: finger kdb at molmed.umassmed.edu


From math at velocet.ca  Fri Apr  5 09:59:56 2002
From: math at velocet.ca (Velocet)
Date: Fri, 5 Apr 2002 12:59:56 -0500
Subject: How do you keep clusters running....
In-Reply-To: <1017925975.30189.87.camel@linux60>; from leandro@ep.petrobras.com.br on Thu, Apr 04, 2002 at 10:12:54AM -0300
References: <200204032104.PAA23347@sijer.mayo.edu> <1017925975.30189.87.camel@linux60>
Message-ID: <20020405125956.D69845@velocet.ca>

On Thu, Apr 04, 2002 at 10:12:54AM -0300, Leandro Tavares Carneiro's all...
> We have here a Beowulf cluster with 64 production nodes and 128
> processors, and we have some problems like yours with fans.
> Our cluster hardware is very cheap, using motherboards and cases
> easily found in the local market, and the problem is critical.
> We have 5 spare nodes, and only 3 of them are ready to work. All our

[..]

> I think this kind of problem is inevitable with cheap PC parts, and can
> be reduced with high-quality (and higher-priced) parts. We are doing a study to
> buy a new cluster for another application, and we have called Compaq and IBM to
> see what they have in hardware and software, in the hope of a future
> with fewer problems...

You can always employ the 'maximum tolerable failure rate' concept and buy for
that rate. I find that in pricing equipment there is a definite non-linear
(exponential?) relationship between MTBF and price. For a failure rate
that's 3-5 times higher you can spend up to 40% less (or better) on equipment.
This isn't a solid number, but it feels within the ballpark to me based on what
I've priced out before on clusters. Others may dispute this, but I am talking
about buying Dell 2U rackmount servers pre-assembled vs a bunch of boards,
CPUs and RAM you slap together yourself.

Using this concept, and setting your maximum tolerable failure rate at a
specific level that suits your needs, e.g. 1 node per month, coupled with an
aggressive RMA schedule with a good vendor, you can get the best
price/performance out of a cluster.  If you can withstand, using my example, a
3-5 times higher failure rate which ends up being 1 node per month, you end up
with 40% more gear.

If you require 100% of all nodes present to be in one mesh involved in
parallel calculations, and a single node failure is catastrophic to the entire
job running since startup, then it's obviously not worth it if your jobs have a
runtime similar to the interval between failures (1 month). A failure rate of
1 node/5 months would work far better in that case, as the average failure
would lose you only 10% of the work you do in 5 months, whereas with 40% more
equipment and 5x the failure rate you may lose most of your work.  (Note I am
not considering that your jobs may instead run in [1 month / 1.4] due to the
speedup from more gear - which will cause jobs to run in ~70% of the time (~3
weeks) - and therefore have a higher success rate in finishing in the
1 node/mo MTBF environment.)

However, if your jobs run on all nodes for only a day, then a failure of a
single node once per month costs you, on average, half a day of lost work
per month. For this concession you get 40% more equipment (possibly meaning 40%
more processing power, depending on your application).

You also need to factor in how much personal time you have to deal with RMAing
and swapping equipment. This may well make any efforts towards this kind of
model impossible if extra time is not available. That notwithstanding, the
cost of extra time can be easily factored into the equation (and knowledgeable
work-study undergrads can be a REALLY cheap alternative here :)

Of course with 40% more power, you may configure two sub-clusters of 70% power
of the original HA design (HA = high availability ~ higher price).  If this
fits your needs, a failure of a single node once per month on average jobs of
a day in length will net you the equivalent loss of a quarter-day total
possible work. The more you isolate sections of the cluster from each other,
the less you will lose when a failure occurs. If you can manually segment your
jobs to run one per node and still achieve near 100% (or more?)  of possible
capacity vs a more parallelized system, then a single node failure is
inconsequential.
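
The loss figures quoted above (half a day, a quarter-day, 10% of five months)
can be reproduced with a few lines of arithmetic. The following is only a
rough sketch under stated assumptions, not a reliability model: the function
name and parameters are made up for illustration, a failure is assumed to land
at a uniformly random point in a job (so half the job is lost on average), and
a failure is assumed to kill only the job on the sub-cluster that contains the
dead node.

```python
def expected_loss(job_days, failures_per_month, subclusters=1):
    """Expected days of whole-cluster work lost per month.

    Assumptions (illustrative only): each failure loses half a job on
    average, and only 1/subclusters of the machine is affected by it.
    """
    lost_per_failure = (job_days / 2.0) / subclusters
    return failures_per_month * lost_per_failure


# Day-long jobs on the whole cluster, one node failure per month.
print(expected_loss(job_days=1, failures_per_month=1))                 # 0.5
# The same jobs on two independent sub-clusters.
print(expected_loss(job_days=1, failures_per_month=1, subclusters=2))  # 0.25
# Month-long jobs (30 days), one failure every five months: lost work as
# a fraction of five months (150 days), which comes out at roughly 10%.
print(expected_loss(job_days=30, failures_per_month=1 / 5) * 5 / 150)
```

The sketch ignores checkpointing, repair time and the speedup from extra
nodes, all of which would reduce the loss further.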

Considering the number and types of failures discussed here, there is
obviously no guarantee that a certain type of cluster setup will save you from
having massive problems. Being able to plan for downtime and manage the costs
associated with it is also obviously part of the design and operation of the
cluster. It's a seesaw-type of balance - if you want more nodes for less money,
be prepared to spend more time fixing them. Of course with any cluster, more
nodes of any type will logically translate into more down/service time - so
there will probably be a non-linear translation of amount of work when
comparing fewer HA nodes vs more cheaper nodes. Of course by this logic,
buying fewer bigger nodes would also result in less work. At some point
this becomes too expensive because you're buying big Suns that are
very expensive per GFLOPS (unless of course, it suits your needs best...).

Another problem with this whole situation that makes it even more complex is
that many cluster installations are subject to strange pricing/operation cost
models. Various parts may actually lie outside your budget responsibility:

One time costs:
- design costs (on paper)
- equipment purchase
- equipment construction/installation
- equipment configuration
- software installation & configuration

Long term/ongoing:
- software maintenance/reconfiguration
- upkeep/repair
- equipment upgrades
- power costs
- cooling costs

There are probably sub categories these could be split into as well.

The issue here is that, say in a university, power and cooling may be paid for
by the university, as well as manual labour for upkeep and repair.  If that is
the case, then getting very power-inefficient but fast CPUs may work well (AMD
Thunderbirds, e.g. :). If you have to pay for your own power, cooling and
manual labour, then you may well just opt for spending more on gear that is
cheaper to run (Athlon XPs) - and at that point may as well go for HA gear as
well (depending on the cost model) to save expensive manual labour (at
commercial rates of >$50/hr you can quickly rack up a node's cost in a day of
work).

We have successfully employed the non-HA equipment design in building one of
our clusters - and in fact there are added advantages. We have observed that
most (for various values of 'most' - 50% to 80%?) failures occur within the
first month of usage. Once you start swapping out bad nodes, you have a
falling rate of failure (though the age of components slowly catches up over a
long time period - things with moving parts, such as fans, especially). With
all problems taken together (swapping over NFS included, as these are diskless
nodes) we have about 1 node crash/fail in some way every 2 months. Of course,
since jobs can be checkpointed, and a single node failing doesn't take down the
whole cluster (as jobs are run on subsets of nodes) not much work is lost
overall. For the increased throughput from more nodes for the money, and
including about 15 minutes of work per month physically messing with the
machines that's directly related to hardware problems and crashes (i.e. unrelated
to the time spent maintaining the cluster as per normal operations), it's been
an overall win on that particular cluster. (We have not had to RMA any
equipment since the start of the 2nd month of operation - under our current
service agreement, RMA would take 1-3 days, and about 20-30 min of labour, and
in the meantime not significantly impact the cluster's performance).

As always, designing your cluster around your needs and limitations is
the biggest win on price/performance. Limitations to this are having
very wide ranges of needs and not having any idea of what capabilities will be
required in the future, along with expensive losses when there's downtime, and
expensive manual labour to get things working again. Barring these kinds of
considerations, commodity equipment with a failure rate that you can deal with
can net noticeable gains - having a planned failure cost related to that rate
will save you from surprises.

No matter what kind of cluster you build you WILL have failures, and designing
to be able to mitigate the impact from such to the highest possible extent is
obviously good planning.

/kc


> On Wed, 2002-04-03 at 18:04, Cris Rhea wrote:
> > 
> > What are folks doing about keeping hardware running on large clusters?
> > 
> > Right now, I'm running 10 Racksaver RS-1200's (for a total of 20 nodes)...
> > 
> > Sure seems like every week or two, I notice dead fans (each RS-1200
> > has 6 case fans in addition to the 2 CPU fans and 2 power supply fans).
> > 
> > My last fan failure was a CPU fan that toasted the CPU and motherboard.
> > 
> > How are folks with significantly more nodes than mine dealing with constant
> > maintenance on their nodes?  Do you have whole spare nodes sitting around-
> > ready to be installed if something fails, or do you have a pile of
> > spare parts?  Did you get the vendor (if you purchased prebuilt systems)
> > to supply a stockpile of warranty parts?
> > 
> > One of the problems I'm facing is that every time something croaks, 
> > Racksaver is very good about replacing it under warranty, but getting
> > the new parts delivered usually takes several days.
> > 
> > For some things like fans, they sent extras for me to keep on-hand.
> > 
> > For my last fan/CPU/motherboard failure, the node pair will be 
> > down ~5 days waiting for parts.
> > 
> > Comments? Thoughts? Ideas?
> > 
> > Thanks-
> > 
> > --- Cris
> > 
> > 
> > 
> > ----
> >   Cristopher J. Rhea                      Mayo Foundation
> >   Research Computing Facility              Pavilion 2-25
> >   crhea at Mayo.EDU                        Rochester, MN 55905
> >   Fax: (507) 266-4486                     (507) 284-0587
> -- 
> Leandro Tavares Carneiro
> Analista de Suporte
> EP-CORP/TIDT/INFI
> Telefone: 2534-1427
> 

-- 
Ken Chase, math at velocet.ca  *  Velocet Communications Inc.  *  Toronto, CANADA 


From cblack at EraGen.com  Tue Apr  2 12:09:34 2002
From: cblack at EraGen.com (Chris Black)
Date: Tue, 02 Apr 2002 14:09:34 -0600
Subject: Lost cycles due to PBS (was Re: Uptime data/studies/anecdotes)
In-Reply-To: <"from roger"@ERC.MsState.Edu>
References: <200204021824.g32IOMa14409@mycroft.ahpcrc.org>
 <Pine.SGI.4.44.0204021236380.75196-100000@Downforce.ERC.MsState.Edu>
Message-ID: <20020402140934.A29446@getafix.EraGen.com>

On Tue, Apr 02, 2002 at 12:46:07PM -0600, Roger L. Smith wrote:
> On Tue, 2 Apr 2002, Richard Walsh wrote:
[stuff deleted]
> PBS is our leading cause of cycle loss.  We now run a cron job on the
> headnode that checks every 15 minutes to see if the PBS daemons have died,
> and if so, it automatically restarts them.  About 75% of the time that I
> have a node fail to accept jobs, it is because its pbs_mom has died, not
> because there is anything wrong with the node.
> 
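
The watchdog described in the quoted text (a cron job that checks every 15
minutes whether the PBS daemons have died and restarts them) might look
something like the sketch below. This is not Roger's actual script: the
function names are invented for illustration, the process check relies on
`pgrep -x`, and the start commands passed in are hypothetical and would need
to match the local PBS installation.

```python
import subprocess


def is_running(name):
    """Return True if a process named exactly `name` exists (via pgrep -x)."""
    result = subprocess.run(["pgrep", "-x", name],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


def restart_dead(daemons, check=is_running, restart=None):
    """Restart every daemon that `check` reports as not running.

    `daemons` maps a process name to the argv list that starts it.
    Returns the names that were restarted, in iteration order.
    """
    if restart is None:
        # Default action: just run the start command and do not raise
        # if it exits non-zero.
        def restart(argv):
            subprocess.run(argv, check=False)
    restarted = []
    for name, argv in daemons.items():
        if not check(name):
            restart(argv)
            restarted.append(name)
    return restarted
```

On a head node this could be called from a small script run by cron every 15
minutes, with a dictionary such as {"pbs_server": [...], "pbs_sched": [...]}
whose start commands are filled in for the site; a similar check for pbs_mom
would have to run on (or reach) each compute node.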

We used to have the same problem with PBS, especially when many jobs were 
in the queue. At that point sometimes the pbs master died as well.
Since we've switched to SGE/GridEngine/CODINE I've been MUCH happier.
Plus there are lots of nifty things you can do with the expandability of 
writing your own load monitors via shell scripts and such.
The whole point of this post is:
GNQS < PBS < Sun Gridengine :)

Chris (who tried two other batch schedulers until settling on SGE)



From tekka99 at libero.it  Tue Apr  2 13:29:55 2002
From: tekka99 at libero.it (Gianluca Cecchi)
Date: Tue, 2 Apr 2002 23:29:55 +0200
Subject: Linux Software RAID5 Performance
References: <F224lZz02H67yRLVbND000139d2@hotmail.com>
Message-ID: <006b01c1da8d$89365dd0$44e01d97@emea.cpqcorp.net>

Which option did you use for the ext3 journal mechanism? It makes a difference,
especially when using "writeback" vs the default "ordered" (see below part of
the "Changes" file for ext3).
Which I/O benchmark did you use?
Thanks,
Gianluca Cecchi

New mount options:

    "mount -o journal=update"
        Mounts a filesystem with a Version 1 journal, upgrading the
        journal dynamically to Version 2.

    "mount -o data=journal"
        Journals all data and metadata, so data is written twice. This
        is the mode which all prior versions of ext3 used.

    "mount -o data=ordered"
        Only journals metadata changes, but data updates are flushed to
        disk before any transactions commit. Data writes are not atomic
        but this mode still guarantees that after a crash, files will
        never contain stale data blocks from old files.

    "mount -o data=writeback"
        Only journals metadata changes, and data updates are entirely
        left to the normal "sync" process. After a crash, files
        may contain stale data blocks from old files: this mode is
        exactly equivalent to running ext2 with a very fast fsck on reboot.

Ordered and Writeback data modes require a Version 2 journal: if you do
not update the journal format then only the Journaled data will be
allowed.

The default data mode is Journaled for a V1 journal, and Ordered for V2.


----- Original Message -----
From: "Michael Prinkey" <mikeprinkey at hotmail.com>
To: <beowulf at beowulf.org>
Sent: Sunday, March 31, 2002 9:33 PM
Subject: Linux Software RAID5 Performance


> Some time ago, a thread discussed the relative performance and stability
> merits of different RAID solutions.  At that time, I gave some results for
> 640-GB arrays that I had built using EIDE drives and Software RAID5.  I just
> recently constructed and installed a 1.0-TB array and had some performance
> numbers to share for it as well.  They are interesting for two reasons:
> First, the filesystem in use is ext3, rather than ext2.  Second, the read
> performance is significantly better (almost 2x) than that of the 640-GB
> units.
>
> The system uses 11 120-GB Maxtor 5400-RPM drives, two Promise Ultra66
> controllers, a P4 1.6-GHz CPU, an Intel 850 motherboard, and 512 MB ECC
> RDRAM.  Drives are configured in RAID5 (9 data, 1 parity, 1 hot spare).
> Four drives are on each Promise controller.  Three are on the on-board EIDE
> controller (UDMA100).  A small boot drive is also on the on-board
> controller.  I had intended to use Ultra100 TX2 controllers, but the latest
> EIDE driver updates with TX2 support are not making it into the latest
> kernels (I'm using 2.4.18), so I opted for the older, slower controllers
> rather than patching.  So, I am both cautious and lazy.  8)
>
> Again, performance (see below) is remarkably good, especially considering
> all of the strikes against this configuration:  EIDE instead of SCSI, UDMA66
> instead of 100/133, 5400-RPM instead of 7200-RPM, and master/slave drives on
> each port instead of a single drive per port.  With some hdparm tuning
> (-c 3 -u 1), the read performance went from 83 MB/sec to 93 MB/sec.  Write
> performance remained essentially unchanged by tuning at 26 MB/sec.  For
> comparison, the 640-GB arrays gave read performance of about 56 MB/sec,
> write performance of 28.5 MB/sec.
>
> Had I more time, I would have tested ext2 vs ext3 to ascertain how much that
> change affected performance.  Likewise, I was considering the use of a raid1
> array as the ext3 journal device to perhaps improve write performance.  Any
> thoughts?
>
> Regards,
>
> Mike Prinkey
> Aeolus Research, Inc.
>
> ----------------------
>
> [root at tera /root]# df; mount; cat /proc/mdstat; cat bonnie10.log
> Filesystem           1k-blocks      Used Available Use% Mounted on
> /dev/hda6             38764268   2601128  34193976   8% /
> /dev/hda1               101089      4965     90905   6% /boot
> /dev/md0             1063591944  58195936 1005396008   6% /raid
> raid640:/raid/home   630296592 284066148 346230444  46% /mnt/tmp
> /dev/hda6 on / type ext2 (rw)
> none on /proc type proc (rw)
> /dev/hda1 on /boot type ext2 (rw)
> none on /dev/pts type devpts (rw,gid=5,mode=620)
> /dev/md0 on /raid type ext3 (rw)
> automount(pid580) on /misc type autofs (rw,fd=5,pgrp=580,minproto=2,maxproto=3)
> raid640:/raid/home on /mnt/tmp type nfs (rw,addr=192.168.0.123)
> Personalities : [raid5]
> read_ahead 1024 sectors
> md0 : active raid5 hdl1[10] hdk1[9] hdj1[8] hdi1[7] hdh1[6] hdg1[5] hdf1[4] hde1[3] hdd1[2] hdc1[1] hdb1[0]
>       1080546624 blocks level 5, 32k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
>
> unused devices: <none>
> Bonnie 1.2: File '/raid/Bonnie.1027', size: 1048576000, volumes: 10
> Writing with putc()...         done:  14810 kB/s  88.9 %CPU
> Rewriting...                   done:  22288 kB/s  13.4 %CPU
> Writing intelligently...       done:  26438 kB/s  21.7 %CPU
> Reading with getc()...         done:  17112 kB/s  97.9 %CPU
> Reading intelligently...       done:  93332 kB/s  32.2 %CPU
> Seek numbers calculated on first volume only
> Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
>                  ---Sequential Output (nosync)---    ---Sequential Input--  --Rnd Seek-
>                  -Per Char-  --Block---  -Rewrite--  -Per Char-  --Block--- --04k (03)-
> Machine      MB  K/sec %CPU  K/sec %CPU  K/sec %CPU  K/sec %CPU  K/sec %CPU   /sec %CPU
> raid05  10*1000  14810 88.9  26438 21.7  22288 13.4  17112 97.9  93332 32.2  206.3  2.1
>
>


From tekka99 at libero.it  Tue Apr  2 13:44:28 2002
From: tekka99 at libero.it (Gianluca Cecchi)
Date: Tue, 2 Apr 2002 23:44:28 +0200
Subject: Syntax for executing 
References: <NMELJLHHFNGMNFFNGJAEIEKGCDAA.emiller@techskills.com>
Message-ID: <00b201c1da8f$91945340$44e01d97@emea.cpqcorp.net>

If also using PVM is not a problem, you could use the PVM-enabled version of the
POV-Ray 3D rendering engine:

http://www.povray.org/

http://pvmpov.sourceforge.net/

Or, there are also MPI patches to povray (but I never used them):

http://www.ce.unipr.it/pardis/parma2/povray/povray.html

http://www.verrall.demon.co.uk/mpipov/


HIH.

Bye,
Gianluca Cecchi
----- Original Message -----
From: "Eric Miller" <emiller at techskills.com>
To: <beowulf at beowulf.org>
Sent: Tuesday, April 02, 2002 11:34 PM
Subject: RE: Syntax for executing


> disregard.  SETI is not available in an MPI-enabled format.
>
> My apologies.  Can anyone direct me to a URL that lists some available
> programs that I can execute on the cluster?  Preferably something with a
> continuous (looping?) graphical output (e.g. SETI). This is a display for
> students to visualize and promote educational programs for Linux, like a
> museum piece.
>
>
> >>>>>>>>>>>>>>>>>>>>>>>>>>
>
> Hey all, got a five-node cluster up running 27-z9, preparing for a 30 node
> cluster.
>
> - What is the syntax to run an executable in the cluster environment?  For
> example, I run
>
> NP=5 mpi-mandel
>
> to run the test fractal program.  How would I execute say, SETI, using the
> cluster?  Assume that the SETI executable is in the PATH.  Also, the older
> version of Scyld had some test code in /usr/mpi-beowulf/*.  Is that gone?
>
> -  What would cause all but one of the processors to show usage in
> beostatus?  The node shows "up" in every other way: hardware identical,
> memory, swap, network, etc....just when I run something, only that one
> processor on one node shows no % usage.
>
> -ETM
>
>   .~.
>   /V\
>  // \\
> /(   )\
>  ^'~'^
>
>


From roger at ERC.MsState.Edu  Wed Apr  3 13:23:44 2002
From: roger at ERC.MsState.Edu (Roger L. Smith)
Date: Wed, 3 Apr 2002 15:23:44 -0600
Subject: How do you keep clusters running....
In-Reply-To: <200204032104.PAA23347@sijer.mayo.edu>
Message-ID: <Pine.SGI.4.44.0204031520190.75196-100000@Downforce.ERC.MsState.Edu>

I don't know how to say this without sounding condescending, but we
resolved this problem by purchasing high quality machines.  We currently
use IBM x330s (although I also had good luck with our SGI 1100's before
SGI discontinued them).  We have enough nodes on hand, that IBM has
stocked a couple of spare motherboards, power supplies, etc., but we don't
need them that often.  I've never had a fan failure.

In general, hardware problems are a very minor part of the care and
feeding of our cluster.


On Wed, 3 Apr 2002, Cris Rhea wrote:

>
> What are folks doing about keeping hardware running on large clusters?
>
> Right now, I'm running 10 Racksaver RS-1200's (for a total of 20 nodes)...
>
> Sure seems like every week or two, I notice dead fans (each RS-1200
> has 6 case fans in addition to the 2 CPU fans and 2 power supply fans).
>
> My last fan failure was a CPU fan that toasted the CPU and motherboard.
>
> How are folks with significantly more nodes than mine dealing with constant
> maintenance on their nodes?  Do you have whole spare nodes sitting around-
> ready to be installed if something fails, or do you have a pile of
> spare parts?  Did you get the vendor (if you purchased prebuilt systems)
> to supply a stockpile of warranty parts?
>
> One of the problems I'm facing is that every time something croaks,
> Racksaver is very good about replacing it under warranty, but getting
> the new parts delivered usually takes several days.
>
> For some things like fans, they sent extras for me to keep on-hand.
>
> For my last fan/CPU/motherboard failure, the node pair will be
> down ~5 days waiting for parts.
>
> Comments? Thoughts? Ideas?
>
> Thanks-
>
> --- Cris
>
>
>
> ----
>   Cristopher J. Rhea                      Mayo Foundation
>   Research Computing Facility              Pavilion 2-25
>   crhea at Mayo.EDU                        Rochester, MN 55905
>   Fax: (507) 266-4486                     (507) 284-0587
>


 _\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_
| Roger L. Smith                        Phone: 662-325-3625               |
| Systems Administrator                 FAX:   662-325-7692               |
| roger at ERC.MsState.Edu                 http://WWW.ERC.MsState.Edu/~roger |
|                       Mississippi State University                      |
|_______________________Engineering Research Center_______________________|


From SGaudet at turbotekcomputer.com  Wed Apr  3 13:26:28 2002
From: SGaudet at turbotekcomputer.com (Steve Gaudet)
Date: Wed, 3 Apr 2002 16:26:28 -0500 
Subject: How do you keep clusters running....
Message-ID: <3450CC8673CFD411A24700105A618BD61BF020@911TURBO>

Hello Cris,

> What are folks doing about keeping hardware running on large clusters?
> 
> Right now, I'm running 10 Racksaver RS-1200's (for a total of 20 nodes)...
> 
> Sure seems like every week or two, I notice dead fans (each RS-1200
> has 6 case fans in addition to the 2 CPU fans and 2 power supply fans).
> 
> My last fan failure was a CPU fan that toasted the CPU and motherboard.
> 
> How are folks with significantly more nodes than mine dealing with constant
> maintenance on their nodes?  Do you have whole spare nodes sitting around-
> ready to be installed if something fails, or do you have a pile of
> spare parts?  Did you get the vendor (if you purchased prebuilt systems)
> to supply a stockpile of warranty parts?
> 
> One of the problems I'm facing is that every time something croaks, 
> Racksaver is very good about replacing it under warranty, but getting
> the new parts delivered usually takes several days.
> 
> For some things like fans, they sent extras for me to keep on-hand.
> 
> For my last fan/CPU/motherboard failure, the node pair will be 
> down ~5 days waiting for parts.
> 
> Comments? Thoughts? Ideas?

------------------------------------------
The vendor of choice should be using quality parts.  We don't see these
issues here.  

Steve Gaudet 
Linux Solutions Engineer
   ..... 
  <(???)> 
 
===================================================================
| Turbotek Computer Corp.    tel:603-666-3062 ext. 21             |
| 8025 South Willow St.      fax:603-666-4519                     |
| Building 2, Unit 105       toll free:800-573-5393               |
| Manchester, NH 03103       e-mail:sgaudet at turbotekcomputer.com  |
|                            web: http://www.turbotekcomputer.com |
===================================================================

  
From haohe at me1.eng.wayne.edu  Wed Apr  3 14:48:14 2002
From: haohe at me1.eng.wayne.edu (Hao He)
Date: Wed, 3 Apr 2002 17:48:14 -0500
Subject: GbE Channel Bonding
Message-ID: <200204032258.RAA15974@me1.eng.wayne.edu>

Does anyone have experience in bonding Gigabit Ethernet cards?
How is the performance?
Thanks.

-HH


From mikeprinkey at hotmail.com  Wed Apr  3 10:10:10 2002
From: mikeprinkey at hotmail.com (Michael Prinkey)
Date: Wed, 03 Apr 2002 13:10:10 -0500
Subject: Hyperthreading in P4 Xeon (question)
Message-ID: <F129pJbCwOyD9YY8bxv0001bf08@hotmail.com>

I can amplify that point.  A commercial CFD application ran significantly 
slower using 4 threads vs 2 on a dual Prestonia system.  Anything memory 
limited will probably behave the same way.

Mike Prinkey
Aeolus Research, Inc.


>From: Mark Hahn <hahn at physics.mcmaster.ca>
>To: William Park <opengeometry at yahoo.ca>
>CC: <beowulf at beowulf.org>
>Subject: Re: Hyperthreading in P4 Xeon (question)
>Date: Wed, 3 Apr 2002 10:50:06 -0500 (EST)
>
> > What is the realistic effect of "hyperthreading" in P4 Xeon?  I'm not
> > versed in the latest CPU trends.  Does it mean that dual-P4Xeon will
> > behave like 4-way SMP?
>
>for some value of "behave like" ;)
>that is, it will definitely NOT get twice as fast.  but it will appear
>to have 4 CPUs, and can run 4 threads/procs at once (for values of
>"once" > 1 clock cycle ;)
>
>we did a quick test on a dual-prestonia here, and saw a ~5% speedup
>on a probably cache-friendly, compute-bound task.
>


From alan at infogroup.it  Wed Apr  3 15:13:39 2002
From: alan at infogroup.it (amedeo pimpini)
Date: Thu, 04 Apr 2002 01:13:39 +0200
Subject: my first diskless beowulf cluster.
Message-ID: <3CAB8CA3.9030809@infogroup.it>

I've encountered a difficulty launching init after mounting root over NFS.

Can somebody help me?

Details follow:

I have compiled a 2.4.7-10 kernel with IP autoconfiguration and root
on NFS, and placed it in /tftpboot.
The kernel mounts /tftpboot but doesn't start init.

On the console of the first workstation:

IP-Config: Got DHCPanswer from 10.1.1.1 my address is 10.0.0.2
...
VFS: Mounted root (nfs filesystem).
Freeing unused kernel memory: 180k freed

Kernel panic: No init found. Try passing init= option to kernel

I recompiled main.c with a printk("%d", errno)
and obtained 8;
perror gives
Error code   8:  Exec format error


If I mv /sbin/init /sbin/init.old

then I obtain error 14.
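
For reference, the two error numbers reported above can be turned into their
symbolic names with a short script. This is only an illustrative sketch run on
an ordinary Linux host, not part of the original report; on Linux, 8 is
ENOEXEC (the "Exec format error" quoted above, which generally means the
kernel did not recognise the file as something it can execute) and 14 is
EFAULT.

```python
import errno
import os

# The two values printed by the instrumented kernel in the report.
for code in (8, 14):
    # errno.errorcode maps the number to its symbolic name;
    # os.strerror returns the C library's message for that number.
    print(code, errno.errorcode[code], os.strerror(code))
```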


On the server, /var/log/messages has:


Apr  3 00:17:14 nut1 dhcpd: Both dynamic and static leases present for 10.1.1.2.
Apr  3 00:17:14 nut1 dhcpd: Either remove host declaration nut2 or remove 10.1.1.2
Apr  3 00:17:14 nut1 dhcpd: from the dynamic address pool for 10.1.0.0
Apr  3 00:17:14 nut1 dhcpd: DHCPREQUEST for 10.1.1.2 from 00:e0:4c:20:6b:8f via eth0
Apr  3 00:17:14 nut1 dhcpd: DHCPACK on 10.1.1.2 to 00:e0:4c:20:6b:8f via eth0
Apr  3 00:17:14 nut1 mountd[1520]: mountproc_translate_mnt_1_svc(/tftpboot/10.1.1.2)
Apr  3 00:17:14 nut1 mountd[1520]: NFS mount of /tftpboot/10.1.1.2 attempted from 10.1.1.2
Apr  3 00:17:14 nut1 mountd[1520]: /tftpboot/10.1.1.2 has been mounted by 10.1.1.2


and tcpdump:

00:21:06.149970 arp who-has nut1 tell nut2
00:21:06.149970 arp reply nut1 is-at 0:e0:4c:f0:6d:fb
00:21:06.149970 nut2.800 > nut1.sunrpc:  udp 56 (DF)
00:21:06.149970 nut1.sunrpc > nut2.800:  udp 28 (DF)
00:21:06.149970 nut2.800 > nut1.sunrpc:  udp 56 (DF)
00:21:06.149970 nut1.sunrpc > nut2.800:  udp 28 (DF)
00:21:06.149970 nut2.800 > nut1.849:  udp 64 (DF)
00:21:06.149970 nut1.849 > nut2.800:  udp 60 (DF)
00:21:06.149970 nut2.56685225 > nut1.nfs: 100 getattr [|nfs] (DF)
00:21:06.149970 nut1.nfs > nut2.56685225: reply ok 96 getattr DIR 47777 
ids 0/0 sz 4096  (DF)
00:21:06.149970 nut2.73462441 > nut1.nfs: 100 fsstat [|nfs] (DF)
00:21:06.159970 nut1.nfs > nut2.73462441: reply ok 48 fsstat [|nfs] (DF)
00:21:06.159970 nut2.90239657 > nut1.nfs: 108 lookup [|nfs] (DF)
00:21:06.159970 nut1.nfs > nut2.90239657: reply ok 128 lookup [|nfs] (DF)
00:21:06.159970 nut2.107016873 > nut1.nfs: 112 lookup [|nfs] (DF)
00:21:06.159970 nut1.nfs > nut2.107016873: reply ok 128 lookup [|nfs] (DF)
00:21:06.159970 nut2.123794089 > nut1.nfs: 108 lookup [|nfs] (DF)
00:21:06.159970 nut1.nfs > nut2.123794089: reply ok 128 lookup [|nfs] (DF)
00:21:06.159970 nut2.140571305 > nut1.nfs: 108 lookup [|nfs] (DF)
00:21:06.159970 nut1.nfs > nut2.140571305: reply ok 128 lookup [|nfs] (DF)
00:21:06.159970 nut2.157348521 > nut1.nfs: 112 read [|nfs] (DF)
00:21:06.159970 nut1 > nut2: (frag 23518:1244 at 2960)
00:21:06.159970 nut1 > nut2: (frag 23518:1480 at 1480+)
00:21:06.159970 nut1.nfs > nut2.157348521: reply ok 1472 read (frag 
23518:1480 at 0+)
00:21:11.149970 arp who-has nut2 tell nut1
00:21:11.149970 arp reply nut2 is-at 0:e0:4c:20:6b:8f


I tagged my kernel with:


mknbi-linux --output=/tftpboot/vmlinux.3com \
    /usr/src/linux-2.4.7-10/arch/i386/boot/bzImage \
    --ip=":10.1.1.1:10.1.1.1:255.255.0.0:"


From rgb at phy.duke.edu  Wed Apr  3 15:27:31 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed, 3 Apr 2002 18:27:31 -0500 (EST)
Subject: How do you keep clusters running....
In-Reply-To: <200204032104.PAA23347@sijer.mayo.edu>
Message-ID: <Pine.LNX.4.44.0204031816470.32578-100000@lucifer.rgb.private.net>

On Wed, 3 Apr 2002, Cris Rhea wrote:

> Comments? Thoughts? Ideas?

 a) Use onboard sensors (hoping your motherboards have them) to shut
nodes down if the CPU temp exceeds an alarm threshold.  That way future
fan failures shouldn't cause system failure, just node shutdown.

 b) Use the largest cases you can manage given your space requirements.
Larger cases have a bit more thermal ballast and can tolerate poor
cooling for a bit longer before catastrophically failing.  Gives you (or
your monitor software) more time to react if nothing else.

 c) With only ten boxes, it sounds like you're having plain old bad
luck, possibly caused by a bad batch of fans.  Relax, perhaps your luck
will improve;-)
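
Point (a) above can be sketched roughly as follows. This is a minimal
illustration, not a finished monitor: it assumes the kernel exposes the
CPU temperature in millidegrees Celsius through a file such as
/sys/class/thermal/thermal_zone0/temp (true on many current Linux
systems, but the path and the 70 C threshold are assumptions; sensor
interfaces differ between boards and kernel versions).

```python
import subprocess


def read_temp_celsius(path):
    """Read a temperature reported in millidegrees Celsius."""
    with open(path) as f:
        return int(f.read().strip()) / 1000.0


def over_threshold(temp_c, alarm_c):
    """True when the reading has reached the alarm threshold."""
    return temp_c >= alarm_c


def check_and_shutdown(path, alarm_c, dry_run=True):
    """Halt the node if the sensor reading reaches alarm_c.

    With dry_run=True the shutdown command is only printed.
    """
    temp_c = read_temp_celsius(path)
    if not over_threshold(temp_c, alarm_c):
        return False
    cmd = ["/sbin/shutdown", "-h", "now"]
    if dry_run:
        print("would run:", " ".join(cmd), "(%.1f C)" % temp_c)
    else:
        subprocess.call(cmd)
    return True
```

Run from cron every minute or so, for example as
check_and_shutdown("/sys/class/thermal/thermal_zone0/temp", 70.0), this
turns a fan failure into a clean node shutdown rather than a cooked CPU.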

With all that said, it is still true that maintenance problems scale
poorly with number of nodes.  One reason (of many) that I prefer not to
get nodes from vendors in another state that I never meet face to face.
If your nodes are built by a local vendor (especially one with a decent
local parts inventory and service department) then it is a bit easier to
get good turnaround on node repairs and minimize downtime, especially
since a local business rapidly learns that to make you happy is more
important to their bottom line than making the next twenty or thirty
customers that might walk through their door happy.

There is also the usual tradeoff between buying "insurance" (e.g.
onsite, 24 hour service contracts) on everything and number of nodes.
There are plenty of companies that will sell you nodes and guarantee
minimal downtime -- for a price.  IBM and Dell come to mind, although
there are many more.  Only you can determine how mission critical it is
to keep your nodes up and what the cost benefit tradeoffs are between
buying fewer nodes (but getting better quality nodes and arranging
guarantees of minimal downtime) or buying more nodes (but risking having
a node or two down pending repairs from time to time).

Cost-benefit analysis is at the heart of beowulf engineering, but you
have to determine the "values" that enter into the analysis based on
your local needs.

   rgb

> 
> Thanks-
> 
> --- Cris
> 
> 
> 
> ----
>   Cristopher J. Rhea                      Mayo Foundation
>   Research Computing Facility              Pavilion 2-25
>   crhea at Mayo.EDU                        Rochester, MN 55905
>   Fax: (507) 266-4486                     (507) 284-0587
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From tim at dolphinics.com  Tue Apr  2 10:05:48 2002
From: tim at dolphinics.com (Tim Wilcox)
Date: Tue, 02 Apr 2002 11:05:48 -0700
Subject: Call for Papers
Message-ID: <3CA9F2FC.7F39E715@dolphinics.com>

                           CALL FOR PAPERS
             Workshop on High-Speed Local Networks (HSLN)
                  as part of the IEEE LCN conference
                     http://www.hcs.ufl.edu/hsln
                       http://www.ieeelcn.org

                         November 6 - 8, 2002
                  Embassy Suites USF, Tampa, Florida

Important dates and contact:
----------------------------
  Paper submission: June 10, 2002
  Notification of acceptance: July 15, 2002
  Camera-ready copy due: August 16, 2002
  General Chair: Alan D. George (george at hcs.ufl.edu)

General Information:
--------------------
The High-Speed Local Networks (HSLN) workshop, within the 27th IEEE
Conference on Local Computer Networks (LCN), focuses on the design,
analysis, implementation, and exploitation of new concepts,
technologies, and applications related to high-performance networks on a
local scale.  This workshop will bring together networking researchers,
engineers, and practitioners from across the spectrum of high-speed
local networks, with participants from industry, academia, and
government.  Original papers that present research results, case
studies, technology development or deployment experience, work in
progress, etc. are solicited, as are survey articles.

Specific areas of interest include (but are not limited to):
- High-speed LANs (e.g. Gigabit Ethernet, 10 Gigabit Ethernet)
- System-area networks (e.g. SCI, Myrinet, ServerNet)
- Storage-area networks (e.g. Fibre Channel) and I/O interconnects
- High-speed networks in embedded systems (e.g. avionics, space systems)

- Protocols, services, and topologies for high-speed local networks
- Routing and switch architectures for high-speed local networks
- Quality of Service (QoS) in high-speed local networks
- Performance analysis of high-speed local networks and systems
- Modeling and simulation of high-speed local networks
- Middleware for high-speed local network communication
- Applications for high-speed local networks (e.g. video on demand)

Paper Submission Instructions:
------------------------------
Authors are invited to submit papers of up to ten camera-ready pages,
in PDF or Postscript format, for presentation at the workshop and
publication in the conference proceedings.  Papers should be
submitted by email to the workshop at hsln at hcs.ufl.edu on or before
June 10, 2002.  Alternatively, send five hard copies via postal mail
to:

  Dr. Alan D. George
  HSLN General Chair
  Department of Electrical and Computer Engineering
  University of Florida
  PO Box 116200, 327 Larsen Hall
  Gainesville, FL 32611-6200

HSLN Organizing Committee:
--------------------------
Workshop Chair:      Industry Chair:               Program Chair:
A.D. George          J.L. Meier                    K.J. Christensen
ECE Department       Advanced Technology Center    CSE Department
Univ of Florida      Rockwell Collins, Inc.        Univ of South Florida

george at hcs.ufl.edu   jlmeier at rockwellcollins.com   christen at csee.usf.edu

HSLN Program Committee:
-----------------------

Jay Bragg (awbragg at yahoo.com)
Consultant

Ron Brightwell (bright at sandia.gov)
Sandia National Labs, New Mexico

Wayne Chang (wchang at arl.army.mil)
Army Research Laboratory

Helen Chen (hycsw at california.sandia.gov)
Sandia National Labs, California

Patrick W. Dowd (dowd at lts.ncsc.mil)
University of Maryland at College Park and U.S. Department of Defense
College Park, MD

Mike Foster (michael.s.foster at boeing.com)
Boeing Corporation

Michael A. Hoard (hoardm at us.ibm.com)
IBM
Beaverton, OR

Cynthia S. Hood (hood at iit.edu)
Illinois Institute of Technology
Chicago, IL

Anestis Karasaridis (karasaridis at att.com)
Network Design and Performance Analysis Dept.
AT&T Labs, Middletown, NJ

Fred Kuhns (fredk at arl.wustl.edu)
Washington University
St. Louis, MI

Michael McKee (mckee026 at umn.edu)
University of Minnesota, Rochester
Rochester, MN

Knut Omang (knuto at fast.no)
University of Oslo
Oslo, Norway

Sarp Oral (oral at hcs.ufl.edu)
University of Florida
Gainesville, FL

D. K. Panda (panda at cis.ohio-state.edu)
Ohio State University
Columbus, Ohio

Anthony Skjellum (tony at MPI-SoftTech.Com)
Mississippi State University
Starkville, MS

Norm Strole (ncstrole at us.ibm.com)
IBM
Research Triangle Park, NC

Rollins Turner (rturner at paradyne.com)
Paradyne Corporation
Largo, FL

William White (wwhite at siue.edu)
Southern Illinois University
Edwardsville, IL

Tim Wilcox (tim.wilcox at dolphinics.com)
Technical Director, Dolphin Interconnect

---


From mikeprinkey at hotmail.com  Tue Apr  2 15:07:21 2002
From: mikeprinkey at hotmail.com (Michael Prinkey)
Date: Tue, 02 Apr 2002 18:07:21 -0500
Subject: Linux Software RAID5 Performance
Message-ID: <F99dmcdn1YcEBdyFj9Z0000dbdc@hotmail.com>

Hi Gianluca,

I used the default "ordered" journaling option.  I haven't really looked
into the different journaling options and their impact on performance.  Does
the ordered option require two writes?  Also, any thoughts on performance
tuning or using an external raid1 journal device?

The benchmark application is Bonnie 1.2.

Thanks,

Mike


>From: "Gianluca Cecchi" <tekka99 at libero.it>
>To: <mprinkey at aeolusresearch.com>, <beowulf at beowulf.org>
>Subject: Re: Linux Software RAID5 Performance
>Date: Tue, 2 Apr 2002 23:29:55 +0200
>
>Which option did you use for the ext3 journal mechanism? It makes a
>difference, especially when using "writeback" vs the default "ordered"
>(see below, part of the "Changes" file for ext3).
>Which I/O benchmark did you use?
>Thanks,
>Gianluca Cecchi
>
>New mount options:
>
>     "mount -o journal=update"
>         Mounts a filesystem with a Version 1 journal, upgrading the
>         journal dynamically to Version 2.
>
>     "mount -o data=journal"
>         Journals all data and metadata, so data is written twice. This
>         is the mode which all prior versions of ext3 used.
>
>     "mount -o data=ordered"
>         Only journals metadata changes, but data updates are flushed to
>         disk before any transactions commit. Data writes are not atomic
>         but this mode still guarantees that after a crash, files will
>         never contain stale data blocks from old files.
>
>     "mount -o data=writeback"
>         Only journals metadata changes, and data updates are entirely
>         left to the normal "sync" process. After a crash, files
>         may contain stale data blocks from old files: this mode is
>         exactly equivalent to running ext2 with a very fast fsck on
>         reboot.
>
>Ordered and Writeback data modes require a Version 2 journal: if you do
>not update the journal format then only the Journaled data will be
>allowed.
>
>The default data mode is Journaled for a V1 journal, and Ordered for V2.
>
>
>----- Original Message -----
>From: "Michael Prinkey" <mikeprinkey at hotmail.com>
>To: <beowulf at beowulf.org>
>Sent: Sunday, March 31, 2002 9:33 PM
>Subject: Linux Software RAID5 Performance
>
>
> > Some time ago, a thread discussed the relative performance and stability
> > merits of different RAID solutions.  At that time, I gave some results for
> > 640-GB arrays that I had built using EIDE drives and Software RAID5.  I just
> > recently constructed and installed a 1.0-TB array and had some performance
> > numbers to share for it as well.  They are interesting for two reasons:
> > First, the filesystem in use is ext3, rather than ext2.  Second, the read
> > performance is significantly better (almost 2x) than that of the 640-GB
> > units.
> >
> > The system uses 11 120-GB Maxtor 5400-RPM drives, two Promise Ultra66
> > controllers, a P4 1.6-GHz CPU, an Intel 850 motherboard, and 512 MB ECC
> > RDRAM.  Drives are configured in RAID5 (9 data, 1 parity, 1 hot spare).
> > Four drives are on each Promise controller.  Three are on the on-board EIDE
> > controller (UDMA100).  A small boot drive is also on the on-board
> > controller.  I had intended to use Ultra100 TX2 controllers, but the latest
> > EIDE driver updates with TX2 support are not making it into the latest
> > kernels (I'm using 2.4.18), so I opted for the older, slower controllers
> > rather than patching.  So, I am both cautious and lazy.  8)
> >
> > Again, performance (see below) is remarkably good, especially considering
> > all of the strikes against this configuration:  EIDE instead of SCSI, UDMA66
> > instead of 100/133, 5400-RPM instead of 7200-RPM, and master/slave drives on
> > each port instead of a single drive per port.  With some hdparm tuning (-c 3
> > -u 1), the read performance went from 83 MB/sec to 93 MB/sec.  Write
> > performance remained essentially unchanged by tuning at 26 MB/sec.  For
> > comparison, the 640-GB arrays gave read performance of about 56 MB/sec,
> > write performance of 28.5 MB/sec.
> >
> > Had I more time, I would have tested ext2 vs ext3 to ascertain how much that
> > change affected performance.  Likewise, I was considering the use of a raid1
> > array as the ext3 journal device to perhaps improve write performance.  Any
> > thoughts?
> >
> > Regards,
> >
> > Mike Prinkey
> > Aeolus Research, Inc.
> >
> > ----------------------
> >
> > [root at tera /root]# df; mount; cat /proc/mdstat; cat bonnie10.log
> > Filesystem           1k-blocks      Used Available Use% Mounted on
> > /dev/hda6             38764268   2601128  34193976   8% /
> > /dev/hda1               101089      4965     90905   6% /boot
> > /dev/md0             1063591944  58195936 1005396008   6% /raid
> > raid640:/raid/home   630296592 284066148 346230444  46% /mnt/tmp
> > /dev/hda6 on / type ext2 (rw)
> > none on /proc type proc (rw)
> > /dev/hda1 on /boot type ext2 (rw)
> > none on /dev/pts type devpts (rw,gid=5,mode=620)
> > /dev/md0 on /raid type ext3 (rw)
> > automount(pid580) on /misc type autofs (rw,fd=5,pgrp=580,minproto=2,maxproto=3)
> > raid640:/raid/home on /mnt/tmp type nfs (rw,addr=192.168.0.123)
> > Personalities : [raid5]
> > read_ahead 1024 sectors
> > md0 : active raid5 hdl1[10] hdk1[9] hdj1[8] hdi1[7] hdh1[6] hdg1[5] hdf1[4] hde1[3] hdd1[2] hdc1[1] hdb1[0]
> >       1080546624 blocks level 5, 32k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
> >
> > unused devices: <none>
> > Bonnie 1.2: File '/raid/Bonnie.1027', size: 1048576000, volumes: 10
> > Writing with putc()...         done:  14810 kB/s  88.9 %CPU
> > Rewriting...                   done:  22288 kB/s  13.4 %CPU
> > Writing intelligently...       done:  26438 kB/s  21.7 %CPU
> > Reading with getc()...         done:  17112 kB/s  97.9 %CPU
> > Reading intelligently...       done:  93332 kB/s  32.2 %CPU
> > Seek numbers calculated on first volume only
> > Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
> >               ---Sequential Output (nosync)--- ---Sequential Input-- --Rnd Seek-
> >               -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --04k (03)-
> > Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU   /sec %CPU
> > raid05 10*1000 14810 88.9 26438 21.7 22288 13.4 17112 97.9 93332 32.2  206.3  2.1
> >
>
>
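
The array size in the quoted /proc/mdstat output can be cross-checked
against the drive layout described in the quoted message (11 drives, 1
hot spare, so 10 active members, one of which is consumed by parity).
The sketch below assumes mdstat reports 1 KiB blocks; the per-device
figure is not stated in the post and is derived here by dividing the
reported total by the 9 data devices.

```python
def raid5_usable_blocks(active_devices, blocks_per_device):
    """RAID5 capacity: one device's worth of space holds parity.

    Hot spares are not active members and must not be counted.
    """
    if active_devices < 2:
        raise ValueError("RAID5 needs at least two active devices")
    return (active_devices - 1) * blocks_per_device


total_drives = 11
hot_spares = 1
active = total_drives - hot_spares            # 10, matching [10/10]
reported_blocks = 1080546624                  # from the mdstat output
per_device = reported_blocks // (active - 1)  # derived, not from the post

usable = raid5_usable_blocks(active, per_device)
print("usable 1K blocks:", usable)
print("usable bytes:", usable * 1024)
```

The df figure for /dev/md0 is somewhat smaller than this because the
filesystem keeps part of the device for its own metadata.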




From mikeprinkey at hotmail.com  Wed Apr  3 11:49:00 2002
From: mikeprinkey at hotmail.com (Michael Prinkey)
Date: Wed, 03 Apr 2002 14:49:00 -0500
Subject: Linux Software RAID5 Performance
Message-ID: <F310ByXj96RxqVWZzRE000150dd@hotmail.com>

Indeed, multiple processes accessing the device may significantly
degrade performance.  Fortunately for us, access speed is limited
by NFS/SMB and the network, not by array performance.  Unfortunately,
the unit is online now and I can't fiddle around with the settings and test
it further.

WRT reliability, we have seen the array drop to degraded mode because of a
single drive failure.  We have also seen a single drive take down the entire
IDE port.  This results in the md device disappearing until you swap out the
offending drive and restart the array.  There is no data loss here.  Usually
one drive goes and the array goes into degraded mode and starts reconstructing
on the spare.  Then the second goes and the array disappears.  It is a bit
disconcerting to do ls /raid and get nothing back.  Changing out the drive
and restarting pulls everything back.

I can honestly say that the only data loss that I have had on these arrays 
came when a maintenance person completely unplugged one of the arrays from 
the UPS.  It caused low-level corruption on 5 of the 9 drives in the array.  
We ended up using a Windows 98 boot floppy with Maxtor's Powermax utility to 
patch them all back up.  It took many hours.  This is the WORST possible 
scenario, BTW.  Even resetting the system gives the EIDE devices a chance to 
flush their caches and maintain low-level integrity.  Cutting the power can 
leave the array/drives inconsistent on the filesystem, device (/dev/md0), 
and hardware-format datagram levels.  So, lock your arrays in a cabinet!  8)

Mike

>From: Jurgen Botz <jurgen at botz.org>
>To: mprinkey at aeolusresearch.com (Michael Prinkey)
>CC: beowulf at beowulf.org
>Subject: Re: Linux Software RAID5 Performance
>Date: Wed, 03 Apr 2002 10:25:31 -0800
>
>Michael Prinkey wrote:
> > Again, performance (see below) is remarkably good, especially considering
> > all of the strikes against this configuration:  EIDE instead of SCSI, UDMA66
> > instead of 100/133, 5400-RPM instead of 7200-RPM, and master/slave drives on
> > each port instead of a single drive per port.
>
>With regard to the master/slave config... I note that your performance
>test is a single reader/writer... in this config with RAID5 I would
>expect the performance to be quite good even with 2 drives per IDE
>controller.  But if you have several processes doing disk I/O
>simultaneously you should see a rather more precipitous drop in
>performance than you would with a single drive per IDE controller.
>I'm working on testing a very similar config right now and that's
>one of my findings (which I had expected) but our application for this
>is not very performance sensitive so it's not a big deal.
>
>A more important issue for me is reliability, and I'm somewhat
>concerned about failure modes.  For example, can an IDE drive fail
>in such a way that it will disable the controller or the other
>drive on the same controller?  If so, that would seriously limit
>the usefulness of RAID5 in this config.  In general how good is
>Linux software RAID's failure handling?  Etc.
>
>:j
>
>
>--
>Jürgen Botz                       | While differing widely in the various
>jurgen at botz.org                   | little bits we know, in our infinite
>                                   | ignorance we are all equal. -Karl Popper
>
>
>




From emiller at techskills.com  Wed Apr  3 17:06:32 2002
From: emiller at techskills.com (Eric Miller)
Date: Wed, 3 Apr 2002 20:06:32 -0500
Subject: another syntax question
In-Reply-To: <20020403153227.A15201@node0.opengeometry.ca>
Message-ID: <NMELJLHHFNGMNFFNGJAEIEMFCDAA.emiller@techskills.com>

For non-parallel applications, is it possible to run individual instances on
diskless nodes?  For example, I want to execute a non-MPI program "A" that
is located in the /bin directory of my master node, but I want to run one
instance of "A" on each of my diskless nodes.

What is the syntax that equates to:

#NP=1 "A" on node0 only
#NP=1 "A" on node1 only
#....
#....
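
One way to picture what is being asked for is a small launcher that
starts the same serial program once per node through a remote shell.
This is only a sketch under assumptions: the node names, the use of rsh,
and the program path /bin/A are placeholders, and the nodes must be able
to see the program (for example through their NFS root). On a Scyld
system the launcher would presumably be bpsh with a node number instead.

```python
import subprocess


def build_commands(nodes, program, remote_shell="rsh"):
    """Build one remote-shell command line per node."""
    return [[remote_shell, node, program] for node in nodes]


def launch_all(commands):
    """Start every command at once, then wait; returns the exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]


if __name__ == "__main__":
    # Hypothetical node names and program path.
    cmds = build_commands(["node0", "node1"], "/bin/A")
    for cmd in cmds:
        print(" ".join(cmd))
    # launch_all(cmds)  # uncomment on a cluster where rsh is set up
```

Each instance runs independently; nothing here provides the
communication or rank numbering that an MPI launcher would.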


From alvin at Maggie.Linux-Consulting.com  Wed Apr 10 20:55:45 2002
From: alvin at Maggie.Linux-Consulting.com (alvin at Maggie.Linux-Consulting.com)
Date: Wed, 10 Apr 2002 20:55:45 -0700 (PDT)
Subject: Call for Papers
In-Reply-To: <3CA9F2FC.7F39E715@dolphinics.com>
Message-ID: <Pine.LNX.3.96.1020410205241.20372A-100000@Maggie.Linux-Consulting.com>

hi tim

am fairly sure St Louis in Missouri is MO for state initials


thanx
alvin
http://www.Linux-1U.net .... 8 Drives in 1U chassis ...


On Tue, 2 Apr 2002, Tim Wilcox wrote:

>                            CALL FOR PAPERS
>              Workshop on High-Speed Local Networks (HSLN)
>                   as part of the IEEE LCN conference
>                      http://www.hcs.ufl.edu/hsln
>                        http://www.ieeelcn.org
> 
>                          November 6 - 8, 2002
>                   Embassy Suites USF, Tampa, Florida
> 
> Fred Kuhns (fredk at arl.wustl.edu)
> Washington University

.... [ snipped ]


> St. Louis, MI

^^^^^^^^^^^^^^^^^^^^^
 
> Michael McKee (mckee026 at umn.edu)
> University of Minnesota, Rochester
> Rochester, MN
> 
> Knut Omang (knuto at fast.no)
> University of Oslo
> Oslo, Norway
> 
> Sarp Oral (oral at hcs.ufl.edu)
> University of Florida
> Gainesville, FL
> 
> D. K. Panda (panda at cis.ohio-state.edu)
> Ohio State University
> Columbus, Ohio
> 
> Anthony Skjellum (tony at MPI-SoftTech.Com)
> Mississippi State University
> Starkville, MS
> 
....


From garcia_garcia_adrian at hotmail.com  Thu Apr  4 09:48:29 2002
From: garcia_garcia_adrian at hotmail.com (Adrian Garcia Garcia)
Date: Thu, 04 Apr 2002 17:48:29 +0000
Subject: DHCP Help
Message-ID: <LAW2-F71K2KmXpMyhzt0000d96e@hotmail.com>

An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20020404/ca30ab16/attachment.html>

From rocky at atipa.com  Thu Apr  4 11:43:58 2002
From: rocky at atipa.com (Rocky McGaugh)
Date: Thu, 4 Apr 2002 13:43:58 -0600 (CST)
Subject: commercial parallel libraries
In-Reply-To: <E16tBA7-0003ck-00@scrabble.freeuk.net>
Message-ID: <Pine.LNX.4.33.0204041331050.11228-100000@rocky.lab.atipa.com>

On Thu, 4 Apr 2002, Jayne Heger wrote:

> 
> Hi,
> 
> I know this is a beowulf list, but I could do with getting some info on any 
> (if there are) commercial parallel libraries, the equivalent of pvm and mpi.
> 
> Do any of you know the names of any?
> 
> Thanks.
> 
> Jayne


MPIPro is a commercial implementation of MPI. I've heard a lot of good 
about their Win/32 implementation, but not as much about their normal unix 
MPI. 

Linda is another commercial parallel API that provides very good support 
and services.


-- 
Rocky McGaugh
Atipa Technologies
rocky at atipatechnologies.com
rmcgaugh at atipa.com
1-785-841-9513 x3110
http://1087800222/
perl -e 'print unpack(u, ".=W=W+F%T:7\!A+F-O;0H`");'


From wewu at oscar.eecs.tufts.edu  Thu Apr  4 12:56:20 2002
From: wewu at oscar.eecs.tufts.edu (wewu at oscar.eecs.tufts.edu)
Date: Thu, 4 Apr 2002 15:56:20 -0500 (EST)
Subject: restrict node access
Message-ID: <Pine.LNX.4.33.0204041546410.32237-100000@oscar.eecs.tufts.edu>

We want to prevent regular users from accessing the cluster nodes with
rlogin, rsh or ssh, but still let PBS run their jobs. Does anybody have a
suggestion? We have installed OSCAR 1.2.1 on our cluster.

Thanks


From s02.sbecker at wittenberg.edu  Thu Apr  4 15:47:53 2002
From: s02.sbecker at wittenberg.edu (s02.sbecker)
Date: Thu, 4 Apr 2002 18:47:53 -0500
Subject: Scyld node boot problem
Message-ID: <3C435198@smtp.wittenberg.edu>

I am using version 27bz-8 for the Scyld disk, with kernel version 
2.2.19-12.beo.  I have a 3c905b card in the slave and a 3c905 in the master.  
I am getting to the third phase of the boot for the slave to where it outputs 
the log file.  Then the node hangs.  Here is the log file for node.0...

node_up: Setting system clock.
node_up: TODO set interface netmask.
node_up: Configuring loopback interface.
node_up: Configuring PCI devices.
setup_fs: Configuring node filesystems...
setup_fs: Using /etc/beowulf/fstab
setup_fs: Checking /dev/ram3 (type=ext2)...
setup_fs: Hmmm...This appears to be a ramdisk. 
setup_fs: I'm going to try to try checking the filesystem (fsck) anyway.
setup_fs: If it is a RAM disk the following will fail harmlessly.
e2fsck 1.20, 25-May-2001 for EXT2 FS 0.5b, 95/08/09
Couldn't find ext2 superblock, trying backup blocks...
e2fsck
The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>

: Bad magic number in super-block while trying to open /dev/ram3
setup_fs: FSCK failure. (OK for RAM disks)
setup_fs: Creating ext2 on /dev/ram3...
mke2fs 1.20, 25-May-2001 for EXT2 FS 0.5b, 95/08/09
setup_fs: Mounting /dev/ram3 on /rootfs//... (type=ext2; options=defaults)
setup_fs: Checking 192.168.1.1:/home (type=nfs)...
setup_fs: Mounting 192.168.1.1:/home on /rootfs//home... (type=nfs; 
options=nolock)
mount: 192.168.1.1:/home failed, reason given by server: Permission denied
Failed to mount 192.168.1.1:/home on /home.


Can someone help?  Thanks.
Shawn


From sp at scali.com  Thu Apr  4 18:08:32 2002
From: sp at scali.com (Steffen Persvold)
Date: Fri, 5 Apr 2002 04:08:32 +0200 (CEST)
Subject: very high bandwidth, low latency manner?
In-Reply-To: <5.1.0.14.2.20020404160051.00aff7a0@mail1.jpl.nasa.gov>
Message-ID: <Pine.LNX.4.30.0204050252190.7535-100000@elin.scali.no>

On Thu, 4 Apr 2002, Jim Lux wrote:

> What's high bandwidth?
> What's low latency?
> How much money do you want to spend?
>
> Ethernet is cheap, $100-$200/node for 100 Mbps or GBE (by the time you get
> switches, cables, adapters, etc.)
> Latency is kind of slow (compared to dedicated point to point links)
>
>

Well, this is a "touchy" topic since different people have different
opinions. There are also different ways of measuring bandwidth, mainly
point to point (two machines talking together) and bisection (dividing
your network in half and letting one half talk to the other, which kind of
shows how the network scales with more nodes).

Also some people like to talk about the hardware bandwidth and hardware
latency, while the thing that really matters (IMHO) is application to
application bandwidth and latency.

I don't want to start a flamewar here, but I _think_ (not knowing real
numbers for other high speed interconnects) that SCI has at least the
lowest latency and maybe also the highest point to point bandwidth :

SCI application to application latency   : 2.5 us
SCI application to application bandwidth : 325 MByte/sec

Note that these numbers are very chipset specific (as most high speed
interconnect numbers are), these numbers are from IA64. Here are numbers
from a popular IA32 platform, the AMD 760MPX :

SCI application to application latency   : 1.8 us
SCI application to application bandwidth : 283 MByte/sec
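
To see how latency and bandwidth figures like these combine, a common
first-order model is time = latency + size / bandwidth. The sketch below
applies it to the two sets of numbers quoted above. It assumes 1 MByte =
10^6 bytes and ignores everything else (protocol switches, contention,
cache effects), so it is an illustration rather than a prediction.

```python
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order cost of one message: startup latency plus streaming."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def half_bandwidth_size(latency_s, bandwidth_bytes_per_s):
    """Message size at which half the peak bandwidth is achieved.

    At this size the latency term equals the streaming term.
    """
    return latency_s * bandwidth_bytes_per_s


# (latency in seconds, bandwidth in bytes/s), from the figures above.
PLATFORMS = {
    "IA64": (2.5e-6, 325e6),
    "AMD 760MPX": (1.8e-6, 283e6),
}

if __name__ == "__main__":
    for name, (lat, bw) in PLATFORMS.items():
        for size in (8, 1024, 1024 * 1024):
            t = transfer_time(size, lat, bw)
            print("%-10s %8d bytes: %10.2f us" % (name, size, t * 1e6))
        print("%-10s half-bandwidth size: %.1f bytes"
              % (name, half_bandwidth_size(lat, bw)))
```

In this model small messages are dominated by latency and large ones by
bandwidth, which is why a single figure rarely settles which
interconnect is "faster" for a given application.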


More "real" performance numbers using MPI over SCI (also collective and
application benchmarks) can be located on Dolphin's homepage
http://www.dolphinics.com

Other popular high speed interconnects I know of are Myrinet (considered
the main competitor to SCI for cluster interconnects) and Giganet. There
are some performance numbers on Myricom's homepage (http://www.myricom.com)
but I doubt that they are for their latest hardware generation (correct me if
I'm wrong).

Best regards,
-- 
  Steffen Persvold   | Scalable Linux Systems |   Try out the world's best
 mailto:sp at scali.com |  http://www.scali.com  | performing MPI implementation:
Tel: (+47) 2262 8950 |   Olaf Helsets vei 6   |      - ScaMPI 1.13.8 -
Fax: (+47) 2262 8951 |   N0621 Oslo, NORWAY   | >320MBytes/s and <4uS latency


From alvin at Maggie.Linux-Consulting.com  Wed Apr 10 21:07:40 2002
From: alvin at Maggie.Linux-Consulting.com (alvin at Maggie.Linux-Consulting.com)
Date: Wed, 10 Apr 2002 21:07:40 -0700 (PDT)
Subject: Call for Papers - oops
In-Reply-To: <20020411040416.0F7FE14093@smtp.x263.net>
Message-ID: <Pine.LNX.3.96.1020410210451.20424B-100000@Maggie.Linux-Consulting.com>

> hi tim
>
> am fairly sure St Louis in Missouri is MO for state initials

- was hoping to help catch the typo before (??) it goes to the 
  hard copy printers

oopps... didnt mean for that to go the list...
my apologies for bothering ya.... ( twice )...

thanx
alvin


From suraj_peri at yahoo.com  Sat Apr  6 03:35:45 2002
From: suraj_peri at yahoo.com (Suraj Peri)
Date: Sat, 6 Apr 2002 03:35:45 -0800 (PST)
Subject: What could be the performance of my cluster 
In-Reply-To: <20020405125956.D69845@velocet.ca>
Message-ID: <20020406113545.91938.qmail@web10504.mail.yahoo.com>

Hi group, 
I was calculating the performance of my cluster. The
features are 

1. 8 nodes
2. Processor: AMD Athlon XP 1800+
3. 8 CPUs
4. 8*1.5 GB DDR RAM
5. 1 Server with 2 processors (AMD MP 1800+) and
2GB DDR RAM

I calculated this to be 48 Mflops . Is this correct ?
if not, what is the correct performance of my cluster.
I also comparatively calculated that my cluster would
be 3 times faster than AlphaServer DS20E ( 833 MHz
alpha 64 bit processor, 4 GB max memory)
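
For reference, a theoretical peak is usually estimated as CPUs x clock
rate x floating-point operations per cycle. The sketch below does that
arithmetic for the 8 compute CPUs. The 1.533 GHz clock is the actual
clock rate of an Athlon XP 1800+; the figure of 2 flops per cycle is an
assumption made only for illustration, and measured performance on a
real code or benchmark will be lower than any such peak.

```python
def peak_gflops(cpus, clock_ghz, flops_per_cycle):
    """Theoretical peak in GFLOPS: an upper bound, not a measurement."""
    return cpus * clock_ghz * flops_per_cycle


CLOCK_GHZ = 1.533        # Athlon XP 1800+ clock rate
FLOPS_PER_CYCLE = 2      # assumption for illustration only

if __name__ == "__main__":
    nodes_only = peak_gflops(8, CLOCK_GHZ, FLOPS_PER_CYCLE)
    with_server = peak_gflops(8 + 2, CLOCK_GHZ, FLOPS_PER_CYCLE)
    print("8 compute CPUs:      %.1f GFLOPS peak" % nodes_only)
    print("plus 2 server CPUs:  %.1f GFLOPS peak" % with_server)
```

Under these assumptions the peak is in the tens of GFLOPS, far above 48
Mflops, so the estimate in question is worth rechecking.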

Is my calculation correct or wrong? please help me
ASAP. thanks in advance.

cheers
suraj.

=====
PIL/BMB/SDU/DK

__________________________________________________
Do You Yahoo!?
Yahoo! Tax Center - online filing with TurboTax
http://taxes.yahoo.com/


From sp at scali.com  Thu Apr  4 18:08:32 2002
From: sp at scali.com (Steffen Persvold)
Date: Fri, 5 Apr 2002 04:08:32 +0200 (CEST)
Subject: very high bandwidth, low latency manner?
In-Reply-To: <5.1.0.14.2.20020404160051.00aff7a0@mail1.jpl.nasa.gov>
Message-ID: <Pine.LNX.4.30.0204050252190.7535-100000@elin.scali.no>

On Thu, 4 Apr 2002, Jim Lux wrote:

> What's high bandwidth?
> What's low latency?
> How much money do you want to spend?
>
> Ethernet is cheap, $100-$200/node for 100 Mbps or GBE (by the time you get
> switches, cables, adapters, etc.)
> Latency is kind of slow (compared to dedicated point to point links)
>
>

Well, this is a "touchy" topic since different people have different
opinions. There are also different ways of measuring bandwidth, mainly
point to point (two machines talking to each other) and bisection (dividing
your network in half and letting one half talk to the other, which
shows how the network scales with more nodes).

Also some people like to talk about the hardware bandwidth and hardware
latency, while the thing that really matters (IMHO) is application to
application bandwidth and latency.

I don't want to start a flamewar here, but I _think_ (not knowing real
numbers for other high speed interconnects) that SCI has at least the
lowest latency and maybe also the highest point to point bandwidth:

SCI application to application latency   : 2.5 us
SCI application to application bandwidth : 325 MByte/sec

Note that these numbers are very chipset specific (as most high speed
interconnect numbers are), these numbers are from IA64. Here are numbers
from a popular IA32 platform, the AMD 760MPX :

SCI application to application latency   : 1.8 us
SCI application to application bandwidth : 283 MByte/sec
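
One common way to read a latency/bandwidth pair like the ones above is
the simple linear model: time = latency + size / bandwidth. The sketch
below applies that model to the quoted SCI figures. The model itself is
an assumption (it ignores protocol switch points, contention and
caching), and the function names are made up for illustration.

```python
def transfer_time_us(message_bytes, latency_us, bandwidth_mbyte_s):
    """Estimated one-way time in microseconds: latency + size / bandwidth.

    With MByte taken as 10**6 bytes, 1 MByte/sec equals 1 byte/us, so
    the division needs no extra unit conversion.
    """
    return latency_us + message_bytes / bandwidth_mbyte_s


def half_bandwidth_size(latency_us, bandwidth_mbyte_s):
    """Message size in bytes at which half the peak bandwidth is reached.

    Effective bandwidth is size / time, and it equals half the peak
    exactly when size / bandwidth == latency.
    """
    return latency_us * bandwidth_mbyte_s


# The two platforms quoted in the message above.
for name, lat, bw in (("IA64", 2.5, 325.0), ("AMD 760MPX", 1.8, 283.0)):
    n_half = half_bandwidth_size(lat, bw)
    t_1k = transfer_time_us(1024, lat, bw)
    print(f"{name}: half bandwidth at {n_half:.0f} bytes, "
          f"1 KByte message in about {t_1k:.2f} us")
```

In this model latency dominates short messages and bandwidth dominates
long ones, which is why both numbers have to be quoted.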


More "real" performance numbers using MPI over SCI (also collective and
application benchmarks) can be located on Dolphin's homepage
http://www.dolphinics.com

Other popular high speed interconnects I know of are Myrinet (considered
the main competitor to SCI for cluster interconnects) and Giganet. There
are some performance numbers on Myricom's homepage (http://www.myricom.com),
but I doubt that they are for their latest hardware generation (correct me
if I'm wrong).

Best regards,
-- 
  Steffen Persvold   | Scalable Linux Systems |   Try out the world's best
 mailto:sp at scali.com |  http://www.scali.com  | performing MPI implementation:
Tel: (+47) 2262 8950 |   Olaf Helsets vei 6   |      - ScaMPI 1.13.8 -
Fax: (+47) 2262 8951 |   N0621 Oslo, NORWAY   | >320MBytes/s and <4uS latency


From tony at MPI-SoftTech.Com  Sun Apr  7 09:36:53 2002
From: tony at MPI-SoftTech.Com (Tony Skjellum)
Date: Sun, 7 Apr 2002 11:36:53 -0500 (CDT)
Subject: Experience with GigE Switches with Jumbo packet support
Message-ID: <Pine.GSO.4.33.0204071135420.14788-100000@mpi.mpi-softtech.com>

Any Beowulf folks out there have specific experience with switches that allow
Jumbo packets?  It seems hard to tell from online specs on various company
pages whether a switch does this or not?

Adapters seem to be readily available...

Any clusters doing this right now?

Thanks,
Tony


From s02.sbecker at wittenberg.edu  Sun Apr  7 19:02:40 2002
From: s02.sbecker at wittenberg.edu (Shawn M Becker s02)
Date: Sun, 07 Apr 2002 22:02:40 -0400
Subject: Scyld slave node boot problem
Message-ID: <5.1.0.14.2.20020407220211.0196d080@mail.wittenberg.edu>

I am using version 27bz-8 for the Scyld disk, with kernel version
2.2.19-12.beo. I have a 3c905b card in the slave and a 3c905 in the master.
I am getting to the third phase of the boot for the slave to where it outputs
the log file. Then the node hangs. Here is the log file for node.0...
node_up: Setting system clock.
node_up: TODO set interface netmask.
node_up: Configuring loopback interface.
node_up: Configuring PCI devices.
setup_fs: Configuring node filesystems...
setup_fs: Using /etc/beowulf/fstab
setup_fs: Checking /dev/ram3 (type=ext2)...
setup_fs: Hmmm...This appears to be a ramdisk.
setup_fs: I'm going to try to try checking the filesystem (fsck) anyway.
setup_fs: If it is a RAM disk the following will fail harmlessly.
e2fsck 1.20, 25-May-2001 for EXT2 FS 0.5b, 95/08/09
Couldn't find ext2 superblock, trying backup blocks...
e2fsck
The superblock could not be read or does not describe a correct ext2
filesystem. If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
e2fsck -b 8193 <device>
: Bad magic number in super-block while trying to open /dev/ram3
setup_fs: FSCK failure. (OK for RAM disks)
setup_fs: Creating ext2 on /dev/ram3...
mke2fs 1.20, 25-May-2001 for EXT2 FS 0.5b, 95/08/09
setup_fs: Mounting /dev/ram3 on /rootfs//... (type=ext2; options=defaults)
setup_fs: Checking 192.168.1.1:/home (type=nfs)...
setup_fs: Mounting 192.168.1.1:/home on /rootfs//home... (type=nfs;
options=nolock)
mount: 192.168.1.1:/home failed, reason given by server: Permission denied
Failed to mount 192.168.1.1:/home on /home.

Can someone help? Thanks.
Shawn


~~~~~~~~~~~~~~~~~~~
Shawn Becker
Wittenberg University
930 N. Fountain
Springfield, OH 45504
(937) 360-7562
~~~~~~~~~~~~~~~~~~~     


From wheeler.mark at ensco.com  Mon Apr  8 04:56:07 2002
From: wheeler.mark at ensco.com (Wheeler.Mark)
Date: Mon, 8 Apr 2002 07:56:07 -0400
Subject: PG Compilers
Message-ID: <8986151694190742869D08450EE4DCDE0CF298@amu-exch.ensco.win>

We are running pgf77 version 3.2-4 on a Linux cluster.

For a standard FORTRAN WRITE statement with IOSTAT=IOS, I am getting a
value of 5.

In the section B.4 (runtime error messages) of the PGI User's Guide, I
do not see values less than 201.

Does anyone know what this error means?

How can I determine what is causing this error?

 
Mark


From ron_chen_123 at yahoo.com  Mon Apr  8 08:01:27 2002
From: ron_chen_123 at yahoo.com (Ron Chen)
Date: Mon, 8 Apr 2002 08:01:27 -0700 (PDT)
Subject: FreeBSD port of SGE (Compute farm system)
In-Reply-To: <20020405042351.86759.qmail@web14706.mail.yahoo.com>
Message-ID: <20020408150127.20897.qmail@web14708.mail.yahoo.com>

Patch and output attached.

Also, I already found 1 problem -- somewhere in execd.
It affects the process' priority in SGEEE mode.

However, I've not fixed it yet, I just want to release
the current patch ASAP to let people try it out.


 -Ron


--- Ron Chen <ron_chen_123 at yahoo.com> wrote:
> Hi,
> 
> I compiled the source, changed a few parameters, and
> SGE finally runs on FreeBSD. It is running in
> single-user mode, with only 1 host. I am doing a little
> clean up, and then I will need to make sure my changes do
> not affect others (by "#ifdef BSD").
> 
> It still does not get the correct system information
> yet, but some of the job accounting info is there
> (at least run time is correct  8-) ).
> 
> It is now running for several hours, it looks
> stable. It ran several tens of jobs. "qstat", 
> "qhost", "qacct", "qconf", "qdel" look fine, output 
> makes sense (but need to implement the resource info
> collecting routines).
> 
> I will post the patches tomorrow, together with some
> output of the commands. (I will be busy today)
> 
> Also, I will move the discussion from the hackers list
> to the cluster at freebsd list.


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: diff.txt
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20020408/9c5e1b9c/attachment.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: output.txt
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20020408/9c5e1b9c/attachment-0001.txt>

From christon at pluto.dsu.edu  Tue Apr  9 07:40:15 2002
From: christon at pluto.dsu.edu (Christoffersen, Neils)
Date: Tue, 9 Apr 2002 09:40:15 -0500 
Subject: Beowulf -- confirmation of subscription -- request 851822
Message-ID: <0718ABB23368D2119FC200008362AF6816BDD7@pluto.dsu.edu>

 
-----Original Message-----
From: beowulf-request at beowulf.org
To: christon at pluto.dsu.edu
Sent: 4/9/02 9:35 AM
Subject: Beowulf -- confirmation of subscription -- request 851822

Beowulf -- confirmation of subscription -- request 851822

We have received a request from 138.247.172.98 for subscription of
your email address, <christon at pluto.dsu.edu>, to the
beowulf at beowulf.org mailing list.  To confirm the request, please send
a message to beowulf-request at beowulf.org, and either:

- maintain the subject line as is (the reply's additional "Re:" is
ok),

- or include the following line - and only the following line - in the
message body: 

confirm 851822

(Simply sending a 'reply' to this message should work from most email
interfaces, since that usually leaves the subject line in the right
form.)

If you do not wish to subscribe to this list, please simply disregard
this message.  Send questions to beowulf-admin at beowulf.org.


From eugen at leitl.org  Wed Apr 10 03:52:30 2002
From: eugen at leitl.org (Eugen Leitl)
Date: Wed, 10 Apr 2002 12:52:30 +0200 (CEST)
Subject: CCL:parallel quantum solutions (fwd)
Message-ID: <Pine.LNX.4.33.0204101252230.664-100000@hydrogen.leitl.org>


---------- Forwarded message ----------
Date: Tue, 09 Apr 2002 10:07:54 -0400
From: David J Giesen <david.giesen at kodak.com>
To: Dr. Bill Davis <bdavis at UTB.edu>, CHEMISTRY at ccl.net
Subject: CCL:parallel quantum solutions

Bill -

This is a long post, but I think vendors that do excellent jobs should
get pats on the back.  I am not a PQS employee nor do I receive
new-customer kick-backs.  Those not interested in a review of PQS
products can hit delete now.

I've been a very satisfied PQS customer since Aug. 2000.  We purchased
an 8-processor Linux cluster at that point, and several months later we
were so pleased we bought a second.  At the end of last year, we
contracted with them to build a 34-processor cluster for general
computational (not just chemistry) use.  

Hardware performance : The setup they use works well for running serial
or parallel codes, and I have PQS, Jaguar and Gaussian (using LINDA)
running in parallel on them.  Based on timings against other
machines/platforms, the PQS machines perform as well as could be
expected.  Our 1.2 GHz athlon PQS machine runs G98 slightly (~10%)
faster than the latest-and-greatest-just-off-the-design-sheet Sun
hardware, and 2-3 times faster than our SGI 194 MHz R10000.  It is ~10%
slower than a 1.5GHz P4.  Both the Athlon and P4 machine used PIII
optimized blas libraries...

Software performance : the PQS software 'is what it is'.  If you are
interested in mainly HF, MP2 and DFT computations, it is very good.  You
can see its capabilities on their website.  Speed-wise, it runs faster
than other codes I use, although it is not faster than Jaguar's
pseudo-spectral methods.  The geometry optimizer is rock solid
dependable as one would expect from code by Pulay and Baker.  PQS uses
PVM for parallel execution.  Without getting into a debate about
parallel paradigms, I'll say simply this: in our hands, I have never had
a PVM job die because of inter-process communication problems while
MPICH/MPI is very flaky and tends to die on about 10-25% of chemistry
jobs (even more for systems using automount) independent of linux, sun
or SGI.  Because PQS uses PVM to set up the parallel system only once
per job, there is less parallel overhead using PQS than with other codes
that set up LINDA parallel systems at every SCF and geometry
optimization step - although for large jobs, these both essentially go
to zero.

Support : PQS is a small company, and the support shows it.  They have
absolutely bent over backwards each time we have had an issue, and
dealing with them is always a pleasure.  We have not had a hardware or
PQS software issue that they haven't resolved to our satisfaction.  In
fairness, you can't expect a 24-hour help line or technicians in suits
to fly in and fix problems.  You should be aware that they are not in
the business of selling/supporting Linux or Gnu software, so some
problems you have on your machine if you veer off the PQS path might be
technically out of their scope.  However, in my experience, they make
every honest effort to solve those as well (and usually do).  Every
machine they shipped us has been stress-tested by an expert for a number
of days before they are delivered.

Ease of use : The machines come setup to run the PQS chemistry code out
of the box.  If you are planning on running one PQS parallel job at a
time across the whole cluster or multiple serial jobs, the included DQS
(not associated with PQS) queuing system works OK.  Running multiple
parallel codes/jobs on the same cluster through the queue does not work
well. Running other codes in parallel through the queue takes some hard
work.  Setting up other parallel codes also takes some work.  This is
not really a function of PQS, however, and you'll find this is true no
matter what machine you get.

Disclaimer : This e-mail does not in any way imply an 'official Kodak'
stance, it is merely the personal opinion of a Kodak employee who uses
PQS products at work.

Dave

Dr. Bill Davis wrote:
> 
> Hi!
> 
> Does anyone have any experience with the PQS hardware/software
> combination, more specifically the QS4-1800S?  Any comments on ease of
> use, support and any other important points would be greatly
> appreciated...Thanks!
> 
> Bill
> 
> 
> --
> **********************************
> Dr. William M. Davis
> Assistant Professor of Chemistry/
> Phi Theta Kappa Advisor
> Dept. of Chemistry and Environmental Science
> University of Texas at Brownsville
> 80 Fort Brown
> Brownsville, TX 78520
> Phone: (956) 574-6646
> Fax: (956) 574-6692
> WWW: unix.utb.edu/~bdavis
> **********************************
> 
> 

-- 
Dr. David J. Giesen
Eastman Kodak Company                           david.giesen at kodak.com
2/83/RL MC 02216                                (ph) 1-585-58(8-0480)
Rochester, NY 14650                             (fax)1-585-588-1839


-= This is automatically added to each message by mailing script =-
CHEMISTRY at ccl.net -- To Everybody  | CHEMISTRY-REQUEST at ccl.net -- To Admins
MAILSERV at ccl.net -- HELP CHEMISTRY or HELP SEARCH
CHEMISTRY-SEARCH at ccl.net -- archive search    |    Gopher: gopher.ccl.net 70
Ftp: ftp.ccl.net  |  WWW: http://www.ccl.net/chemistry/   | Jan: jkl at osc.edu


From fraser5 at cox.net  Wed Apr 10 05:24:05 2002
From: fraser5 at cox.net (Jim Fraser)
Date: Wed, 10 Apr 2002 08:24:05 -0400
Subject: MPICH: works for users not root?
Message-ID: <003801c1e08a$9b6d8840$0300005a@papabear>

I have had this pesky problem running MPI under the bash shell: I cannot
figure out how to get it to work for root.  It works fine for all the
users but not for root.  As root I can rsh to any node OK, but if I do

    rsh node2 -n true

I get a "permission denied".  Again, it works for normal
users.  I have gutted the .bashrc and /etc/bashrc scripts, and the .rhosts
files seem OK.  What could be the problem?  (Linux 7.2)

thanks
jim


From rastapoppolous at yahoo.com  Wed Apr 10 22:29:24 2002
From: rastapoppolous at yahoo.com (k r)
Date: Wed, 10 Apr 2002 22:29:24 -0700 (PDT)
Subject: Scyld and mpi fasta Makefile Problems 
Message-ID: <20020411052924.33098.qmail@web9009.mail.yahoo.com>

hello all, 
 I can't seem to get the included Makefile 
(Makefile.mpi4) for FASTA to compile.

When i compile i get the following error.

mm_file.h:25: conflicting types for `int64_t'
types.h:172: previous declaration of `int64_t'

I did not make any changes to the Makefile. 

any help is appreciated.
Thanks,
Kart



From siegert at sfu.ca  Wed Apr 10 22:43:28 2002
From: siegert at sfu.ca (Martin Siegert)
Date: Wed, 10 Apr 2002 22:43:28 -0700
Subject: MPICH: works for users not root?
In-Reply-To: <003801c1e08a$9b6d8840$0300005a@papabear>; from fraser5@cox.net on Wed, Apr 10, 2002 at 08:24:05AM -0400
References: <003801c1e08a$9b6d8840$0300005a@papabear>
Message-ID: <20020410224328.A19551@stikine.ucs.sfu.ca>

On Wed, Apr 10, 2002 at 08:24:05AM -0400, Jim Fraser wrote:
> I have had this pesky problem of running mpi using the bash shell and trying
> to figure out how to get it to work for root.  It works fine for all the
> users but not root. As root I can rsh to any node ok but if I do a      rsh
> node2 -n true  then I get a permission denied.  Again, it works for normal
> users.   I have gutted the .bashrc and /etc/bashrc scripts and the .rhosts
> seem ok.  what could be the problem?  (Linux 7.2)
> 
> thanks
> jim

The least that you need is a line "rsh" in /etc/securetty.
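
A minimal sketch of how one might check for that line on each node. The
helper name is hypothetical, and treating lines that start with "#" as
comments is an assumption of the sketch.

```python
def missing_securetty_entries(securetty_text, required=("rsh",)):
    """Return the required entries that are absent from a securetty file.

    securetty_text is the full content of /etc/securetty as one string.
    Blank lines and lines starting with '#' are ignored.
    """
    entries = {
        line.strip()
        for line in securetty_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    }
    return [name for name in required if name not in entries]


# Example with an in-memory file: no "rsh" line, so it is reported.
sample = "tty1\ntty2\n"
print(missing_securetty_entries(sample))
```

On a real node the text would come from something like
open("/etc/securetty").read(), run as root.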

Hope this helps.

Cheers,
Martin

========================================================================
Martin Siegert
Academic Computing Services                        phone: (604) 291-4691
Simon Fraser University                            fax:   (604) 291-4242
Burnaby, British Columbia                          email: siegert at sfu.ca
Canada  V5A 1S6
========================================================================


From rastapoppolous at yahoo.com  Wed Apr 10 22:44:14 2002
From: rastapoppolous at yahoo.com (k r)
Date: Wed, 10 Apr 2002 22:44:14 -0700 (PDT)
Subject: mpi fasta Makefile Problems 
Message-ID: <20020411054414.34011.qmail@web9009.mail.yahoo.com>

hello all, 
 I can't seem to get the included Makefile 
(Makefile.mpi4) for FASTA to compile on a beowulf
cluster.

When i compile i get the following error.

mm_file.h:25: conflicting types for `int64_t'
types.h:172: previous declaration of `int64_t'

I did not make any changes to the Makefile. 

any help is appreciated.
Thanks,
Kart




From math at velocet.ca  Wed Apr 10 23:07:43 2002
From: math at velocet.ca (Velocet)
Date: Thu, 11 Apr 2002 02:07:43 -0400
Subject: Linux Software RAID5 Performance
In-Reply-To: <F310ByXj96RxqVWZzRE000150dd@hotmail.com>; from mikeprinkey@hotmail.com on Wed, Apr 03, 2002 at 02:49:00PM -0500
References: <F310ByXj96RxqVWZzRE000150dd@hotmail.com>
Message-ID: <20020411020739.W19272@velocet.ca>

On Wed, Apr 03, 2002 at 02:49:00PM -0500, Michael Prinkey's all...
> Indeed, the multiple processes accessing the device may significantly 
> degrade performance.  Fortunately for us, as well, access speed is limited 
> by the NFS/SMB and the network, not by array performance.  Unfortunately, 
> the unit is online now and I can't fiddle around with the settings and test 
> it further.
> 
> WRT reliability, we have seen the array drop to degraded mode because of a 
> single drive failure.  We have also seen a single drive take down the entire 
> IDE port.  This results in the md device disappearing until you swap out the 
> offending drive and restart the array.  There is no data here.  Usually one 
> drive goes and the array goes into degraded mode and starts reconstructing 
> on the spare.  Then the second goes and the array disappears.  It is a bit 
> disconcerting to do ls /raid and get nothing back.  Changing out the drive 
> and restarting pulls everything back.
> 
> I can honestly say that the only data loss that I have had on these arrays 
> came when a maintenance person completely unplugged one of the arrays from 
> the UPS.  It caused low-level corruption on 5 of the 9 drives in the array.  
> We ended up using a Windows 98 boot floppy with Maxtor's Powermax utility to 
> patch them all back up.  It took many hours.  This is the WORST possible 
> scenario, BTW.  Even resetting the system gives the EIDE devices a chance to 
> flush their caches and maintain low-level integrity.  Cutting the power can 
> leave the array/drives inconsistent on the filesystem, device (/dev/md0), 
> and hardware-format datagram levels.  So, lock your arrays in a cabinet!  8)

ok get an EIDE RAID controller with battery backed-up ram onboard. We pulled a
bunch of the SCSI equiv of such from a netfinity server a customer pawned off
on us.  Rather nice. (anyone want to buy? :)

/kc

> 
> Mike
> 
> >From: Jurgen Botz <jurgen at botz.org>
> >To: mprinkey at aeolusresearch.com (Michael Prinkey)
> >CC: beowulf at beowulf.org
> >Subject: Re: Linux Software RAID5 Performance
> >Date: Wed, 03 Apr 2002 10:25:31 -0800
> >
> >Michael Prinkey wrote:
> > > Again, performance (see below) is remarkably good, especially considering
> > > all of the strikes against this configuration:  EIDE instead of SCSI, UDMA66
> > > instead of 100/133, 5400-RPM instead of 7200-RPM, and master/slave drives on
> > > each port instead of a single drive per port.
> >
> >With regard to the master/slave config... I note that your performance
> >test is a single reader/writer... in this config with RAID5 I would
> >expect the performance to be quite good even with 2 drives per IDE
> >controller.  But if you have several processes doing disk I/O
> >simultaneously you should see a rather more precipitous drop in
> >performance than you would with a single drive per IDE controller.
> >I'm working on testing a very similar config right now and that's
> >one of my findings (which I had expected) but our application for this
> >is not very performance sensitive so it's not a big deal.
> >
> >A more important issue for me is reliability, and I'm somewhat
> >concerned about failure modes.  For example, can an IDE drive fail
> >in such a way that if will disable the controller or the other
> >drive on the same controller?  If so, that would seriously limit
> >the usefulness of RAID5 in this config.  In general how good is
> >Linux software RAID's failure handling?  Etc.
> >
> >:j
> >
> >
> >--
> >Jürgen Botz                       | While differing widely in the various
> >jurgen at botz.org                   | little bits we know, in our infinite
> >                                   | ignorance we are all equal. -Karl Popper
> >
> >

-- 
Ken Chase, math at velocet.ca  *  Velocet Communications Inc.  *  Toronto, CANADA 


From ron_chen_123 at yahoo.com  Thu Apr 11 00:53:52 2002
From: ron_chen_123 at yahoo.com (Ron Chen)
Date: Thu, 11 Apr 2002 00:53:52 -0700 (PDT)
Subject: Lost cycles due to PBS (was Re: Uptime data/studies/anecdotes)
In-Reply-To: <20020402140934.A29446@getafix.EraGen.com>
Message-ID: <20020411075352.26837.qmail@web14703.mail.yahoo.com>

--- Chris Black <cblack at eragen.com> wrote:
> On Tue, Apr 02, 2002 at 12:46:07PM -0600, Roger L. Smith wrote:
> > On Tue, 2 Apr 2002, Richard Walsh wrote:
> [stuff deleted]
> > PBS is our leading cause of cycle loss.  We now run a cron job on the
> > headnode that checks every 15 minutes to see if the PBS daemons have died,
> > and if so, it automatically restarts them.  About 75% of the time that I
> > have a node fail to accept jobs, it is because its pbs_mom has died, not
> > because there is anything wrong with the node.
> > 
> 
> We used to have the same problem with PBS, especially when many jobs were 
> in the queue. At that point sometimes the pbs master died as well.
> Since we've switched to SGE/GridEngine/CODINE I've been MUCH happier.
> Plus there are lots of nifty things you can do with the expandability of 
> writing your own load monitors via shell scripts and such.
> The whole point of this post is:
> GNQS < PBS < Sun Gridengine :)
> 
> Chris (who tried two other batch schedulers until settling on SGE)
> 
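
The cron-driven check described in the quoted text can be sketched
roughly as follows. This is not the script Roger runs. It is an
illustration that assumes the pgrep utility is available, and the daemon
start command in the usage note is a placeholder.

```python
import subprocess


def is_running(name):
    """True if pgrep finds a process whose name matches exactly."""
    result = subprocess.run(["pgrep", "-x", name],
                            stdout=subprocess.DEVNULL)
    return result.returncode == 0


def restart_dead(daemons, running=is_running, start=None):
    """Start every daemon in `daemons` that is not currently running.

    daemons maps a process name to the command (argument list) that
    starts it.  Returns the names that had to be restarted.  `running`
    and `start` can be replaced, for example for a dry run.
    """
    if start is None:
        def start(cmd):
            subprocess.run(cmd, check=False)
    restarted = []
    for name, cmd in daemons.items():
        if not running(name):
            start(cmd)
            restarted.append(name)
    return restarted
```

A script calling restart_dead({"pbs_mom": ["/path/to/pbs_mom"]}) could
then be run from cron every 15 minutes, for example with the schedule
field */15 * * * *. The path is hypothetical and depends on the local
PBS installation.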

I have had a similar experience -- I tried PBS; it is
hard to install, and there are not many scheduling
policies -- and it is also hard to configure.

Then I read the news about SGE, and since it does not
require root access to install/run, I gave it a try. I
did an experiment a few weeks ago -- submitting over
30,000 "sleep jobs" to SGE, and it did not die! If
the master host is down, another machine takes over,
so there is no loss of computing power.

I think SGE 5.3 is better than anything available. I
tried commercial DRM systems and other open source
packages, but so far SGE is by far the best.

BTW, Chris, how many nodes are there in your cluster?

-Ron

P.S. I'm doing a port of SGE to FreeBSD, hope people
find it useful



From keithu at parl.clemson.edu  Thu Apr 11 05:11:23 2002
From: keithu at parl.clemson.edu (Keith Underwood)
Date: Thu, 11 Apr 2002 08:11:23 -0400 (EDT)
Subject: Experience with GigE Switches with Jumbo packet support
In-Reply-To: <Pine.GSO.4.33.0204071135420.14788-100000@mpi.mpi-softtech.com>
Message-ID: <Pine.LNX.4.44.0204110809480.14772-100000@keithu-pc2.parl.clemson.edu>

Most of the Extreme switches support Jumbo packets.  The new line of 
products from Foundry Networks (JetCore is what I think they call it) is 
supposed to support Jumbo packets (even for Fast Ethernet from the way I 
read the spec, if you could find a card to do it).
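
For a feel of what jumbo frames buy on the wire, here is a
back-of-the-envelope sketch of TCP payload efficiency per frame. The
9000-byte jumbo MTU is an assumption (it is a common choice, but
switches and adapters differ), and the calculation ignores TCP options
and any VLAN tag.

```python
# Per-frame Ethernet overhead in bytes that is not part of the MTU:
# preamble + start delimiter (8), MAC header (14), FCS (4),
# minimum inter-frame gap (12).
ETH_OVERHEAD = 38
# IPv4 header (20) + TCP header (20), both without options.
IP_TCP_HEADERS = 40


def tcp_payload_efficiency(mtu):
    """Fraction of wire time that carries TCP payload at a given MTU."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETH_OVERHEAD)


for mtu in (1500, 9000):
    print(f"MTU {mtu}: {tcp_payload_efficiency(mtu):.1%} payload")
```

The wire-level gain is only a few percent. The larger benefit usually
cited for jumbo frames is the lower per-packet load on the hosts, which
this sketch does not model.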

					Keith

On Sun, 7 Apr 2002, Tony Skjellum wrote:

> Any Beowulf folks out there have specific experience with switches that allow
> Jumbo packets?  It seems hard to tell from online specs on various company
> pages whether a switch does this or not?
> 
> Adapters seem to be readily available...
> 
> Any clusters doing this right now?
> 
> Thanks,
> Tony
> 
> 

---------------------------------------------------------------------------
Keith Underwood                   Parallel Architecture Research Lab (PARL)
keithu at parl.clemson.edu                                  Clemson University


From wonglijie at yahoo.com  Thu Apr 11 06:00:08 2002
From: wonglijie at yahoo.com (Li Jie)
Date: Thu, 11 Apr 2002 06:00:08 -0700 (PDT)
Subject: (no subject)
Message-ID: <20020411130008.63691.qmail@web9608.mail.yahoo.com>


hi

May I know if anyone here can provide detailed information on how to start a Beowulf?

I have about 9 machines with Pentium MMX 233 MHz processors, 128 MB RAM, and 2 x 1.99 GB HDDs.

I am also considering various designs, and this is a school project. Thanks for your help!


lijie
<e117>
[2002]
7540832
[2 Cor. 5:7] We live by faith and not by sight.



From rgb at phy.duke.edu  Thu Apr 11 06:18:39 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Thu, 11 Apr 2002 09:18:39 -0400 (EDT)
Subject: DHCP Help
In-Reply-To: <LAW2-F71K2KmXpMyhzt0000d96e@hotmail.com>
Message-ID: <Pine.LNX.4.44.0204110905550.1632-100000@lucifer.rgb.private.net>

On Thu, 4 Apr 2002, Adrian Garcia Garcia wrote:

> Ok I understand most of the tips, but I have some doubts about the domain
> name, I used the domain name "cluster.org" because every documentation
> about DHCP had a domain name in the configuration so ...
> Is it necesary to have a domain name server (like BIND) working together
> with the dhcp server??????

We're getting to where I don't know the answers -- just try it with and
without.  At a guess, the answer is no, you don't need a domain name and
if you use one it can likely be made up -- mine always have been, and
IIRC I've used names that didn't correspond to anything in hosts and
didn't even have an approved ending.  If you do make one up I'd suggest
you stay away from any name (unfortunately like cluster.org or
cluster.net) that MIGHT be registered in nameservice so you can avoid
any possibility of name resolution confusion in the future.  You
definitely don't need a nameserver -- my hosts are all on a private
internal network anyway and not in nameservice.  If you want them to
resolve by name you have to ensure that they are resolvable one of the
ways given for hosts in /etc/nsswitch.conf and the library calls will
take care of the rest.
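
As an illustration of the /etc/hosts route, here is a small sketch that
generates host entries for a private network, with or without a made-up
domain. The network, the node naming scheme and the function name are
all assumptions chosen for the example.

```python
import ipaddress


def hosts_lines(network, first_host, count, prefix="node", domain=None):
    """Build /etc/hosts lines for `count` consecutive addresses.

    Each line is "<address><TAB><names>".  If a domain is given, the
    fully qualified name is listed first and the short name second.
    """
    net = ipaddress.ip_network(network)
    start = ipaddress.ip_address(first_host)
    lines = []
    for i in range(count):
        addr = start + i
        if addr not in net:
            raise ValueError(f"{addr} is outside {net}")
        short = f"{prefix}{i + 1}"
        names = f"{short}.{domain} {short}" if domain else short
        lines.append(f"{addr}\t{names}")
    return lines


# Four nodes on a private 192.168.1.0/24 network, no domain name at all.
for line in hosts_lines("192.168.1.0/24", "192.168.1.1", 4):
    print(line)
```

Entries like these resolve as long as "files" is listed for hosts in
/etc/nsswitch.conf, and no nameserver is involved.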

> One more thing...
> 
> I don't have Internet in my LAN and I don't know if is it necesary the
> domain name?????
> 
> Thanks a lot. I'm newbe and my english is not good =)

Probably not.  It depends on what services you want to run elsewhere.
Mail servers/clients will likely get unhappy without some sort of domain
name defined, maybe a few other things like this.  It is also possible
some distribution-installed tools (assuming in their preconfiguration
that they are on an open LAN) will bitch or break if no domain name is
defined -- I've not tried it so can't tell you.  /etc/hosts based name
resolution per se couldn't care less.

Domain names are used primarily for routing or domain administration.
The correspondence between a domain name and a subnet block or union of
subnet blocks is often useful for both.  If you have a private network,
no routing except between hosts on the same wire/switch, and no need to
differentiate subnet blocks for administrative purposes you can probably
live without.

If you think that there is any reasonable chance that your cluster might
one day end up on a public network it is reasonable to define one
anyway.  If any installed tools complain because there isn't one it is
certainly harmless enough to define one.  I generally do out of sheer
habit and inertia even within my private lan at home.

   rgb

> 
> Adrián .
> 
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From emiller at techskills.com  Thu Apr 11 07:08:16 2002
From: emiller at techskills.com (Eric Miller)
Date: Thu, 11 Apr 2002 10:08:16 -0400
Subject: (no subject)
In-Reply-To: <20020411130008.63691.qmail@web9608.mail.yahoo.com>
Message-ID: <NMELJLHHFNGMNFFNGJAEEEEPCEAA.emiller@techskills.com>

>hi
>may i know if anyone here can provide detailed information on how to start
>a beowulf?
>i have about 9 machines with Pentium MMX 233 Mhz processors, 128 mb RAM, 2
>x 1.99 Gb. HDD
>I am also considering various designs and this is a school project. Thanks
>for your help!

Li, this group is not very "newbie friendly" when you ask for detailed
information.  I am working on a similar project and have only gotten spotty
assistance.  I will tell you that the Scyld system is by far the easiest to
set up, and it works well.  If you are familiar with Linux, you should have
no problems getting a Scyld beowulf up and running; just be sure to read the
docs first, since they explain the NIC requirements on the master and other
important physical setup issues.  After you get the network built, it is
really a well-built distribution, with a GUI and all.  See www.scyld.com.  I
have tried others, but Scyld is far and away the best, with the most
community support.

After you get the cluster up and running, that's where the help seems to
drift off.  Most of the people in this group are upper-level users who know
how to get these MPI-enabled programs to run on their clusters.  If you are
like me, these topics are a little foreign.  If you are looking for
something to run continuously, like a display, they say the Mandelbrot
renderer has a loop function, but I can't get it to work.  Someone suggested
SETI many months ago, which would be perfect, but SETI does not offer an
MPI-enabled program.

Maybe you and I can work together.  I'll help you get your cluster up and
running, then together we can rattle our swords for some detailed assistance
with the MPI programs (and programming).  Contact me at emiller at techskills.com,
good luck!


From pzb at datastacks.com  Wed Apr 10 20:56:53 2002
From: pzb at datastacks.com (Peter Bowen)
Date: 10 Apr 2002 23:56:53 -0400
Subject: Newest RPM's?
In-Reply-To: <1017637861.1772.39.camel@loiosh>
References: <NMELJLHHFNGMNFFNGJAEMEIBCDAA.emiller@techskills.com>
	<1017612125.19271.20.camel@vhwalke.mathsci.usna.edu> 
	<002c01c1d933$ec02a3c0$c31fa6ac@xp>  <1017637861.1772.39.camel@loiosh>
Message-ID: <1018497414.18187.3.camel@gargleblaster.caffeinexchange.org>

On Mon, 2002-04-01 at 00:11, Sean Dilda wrote:
> On Sun, 2002-03-31 at 23:15, Eric Miller wrote:
> I must note, my above answer was given as if you were installing over
> RH6.2.  I do *NOT* recommend installing the binary rpms from a
> RHL6.2-based Scyld Beowulf over a RHL7.2 system.  This is by no means a
> supported method.  I don't know if anything will or won't break in doing
> it, but I would assume that something will, considering how much has
> changed between RHL6.2 and RHL7.2.  If you really want to try, you're
> free to try, but if you want something right now, I'd suggest going
> to the RHL6.2-based setup.

Are there beowulf packages available for RHL7.2?

Thanks.
Peter


From Todd_Henderson at Raytheon.com  Tue Apr  9 07:16:44 2002
From: Todd_Henderson at Raytheon.com (Todd Henderson)
Date: Tue, 09 Apr 2002 09:16:44 -0500
Subject: NASTRAN on cluster
Message-ID: <3CB2F7CC.82F09FF9@raytheon.com>

We're in the process of starting the search for a cluster to replace our
35 XP1000's that come off lease in Sept.  We currently use the cluster
for CFD only, but we have been instructed that it would be beneficial to
all to ensure that NASTRAN can use the new cluster.  Therefore, I was
wondering if anyone out there is running NASTRAN on a cluster.  If so,
what OS and CPUs are you using, and do you have any suggestions?

thanks,
Todd Henderson


From jayne at sphynx.clara.co.uk  Tue Apr  9 09:28:09 2002
From: jayne at sphynx.clara.co.uk (Jayne Heger)
Date: Tue, 9 Apr 2002 16:28:09 +0000
Subject: Parallel povraying baby!!!!
Message-ID: <E16uxLJ-0001sP-00@scrabble.freeuk.net>

Right,
I've now run a parallel application on my Beowulf cluster, and it's working
well!  ;)
When running pvmpov, which is a parallel rendering farm application, I get
these results when I render skyvase.pov (a picture of a vase):


1 host  = 7 min 11 seconds
2 hosts = 3 min 30 seconds
3 hosts = 2 min 18 seconds
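
As a rough check on these figures, the timings can be turned into speedup
(time on one host divided by time on N hosts) and parallel efficiency
(speedup divided by N).  This is a minimal Python sketch; the names
`timings` and `speedup_table` are illustrative and not part of pvmpov.

```python
# Render times quoted above, converted to seconds.
timings = {1: 7 * 60 + 11, 2: 3 * 60 + 30, 3: 2 * 60 + 18}


def speedup_table(timings):
    """Return (hosts, seconds, speedup, efficiency) rows relative to 1 host."""
    base = timings[1]
    rows = []
    for hosts in sorted(timings):
        speedup = base / timings[hosts]
        efficiency = speedup / hosts
        rows.append((hosts, timings[hosts], speedup, efficiency))
    return rows


for hosts, secs, speedup, efficiency in speedup_table(timings):
    print(f"{hosts} host(s): {secs:4d} s  speedup {speedup:.2f}  "
          f"efficiency {efficiency:.2f}")
```

With these numbers the efficiency comes out close to 1 for both 2 and 3
hosts, i.e. nearly linear scaling for this scene.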

One other machine to add yet though!
These are all 486's

This is my final year project at university

What do you think???

kw1el huh????

Jayne


From rickey-co at mug.biglobe.ne.jp  Wed Apr 10 22:15:40 2002
From: rickey-co at mug.biglobe.ne.jp (Iwao Makino)
Date: Thu, 11 Apr 2002 14:15:40 +0900
Subject: very high bandwidth, low latency manner?
In-Reply-To: <Pine.LNX.4.30.0204050252190.7535-100000@elin.scali.no>
References: <Pine.LNX.4.30.0204050252190.7535-100000@elin.scali.no>
Message-ID: <v05010178b8dacb026736@[192.168.1.4]>

I think ... Quadrics<http://www.quadrics.com/> is another one.

Here's quick figures I have on hand....

RH7.2, 2.4.9 kernel for i860 cluster.
On their site, they claim a bandwidth, after protocol, of 340 Mbytes/second
in each direction.  The process-to-process latency for remote write
operations is 2us, and 5us for MPI messages.

But pricing is MUCH higher than SCI/Myrinet.

Best regards,

At 4:08 +0200 5.04.2002, Steffen Persvold wrote:
>On Thu, 4 Apr 2002, Jim Lux wrote:
>
>>  What's high bandwidth?
>>  What's low latency?
>>  How much money do you want to spend?
>I don't want to start a flamewar here, but I _think_ (not knowing real
>numbers for other high speed interconnects) that SCI has at least the
>lowest latency and maybe also the highest point to point bandwidth :
>
>SCI application to application latency   : 2.5 us
>SCI application to application bandwidth : 325 MByte/sec
>
>Note that these numbers are very chipset specific (as most high speed
>interconnect numbers are), these numbers are from IA64. Here are numbers
>from a popular IA32 platform, the AMD 760MPX :
>
>SCI application to application latency   : 1.8 us
>SCI application to application bandwidth : 283 MByte/sec

-- 

Best regards,

Iwao Makino
Hard Data Ltd. Tokyo branch
mailto:iwao at harddata.com
http://www.harddata.com/

--> Now Shipping 1U Dual Athlon DDR <-
--> Ask me about the new Alpha DDR UP1500 Systems  <-


From jobriant at MPI-SoftTech.Com  Thu Apr 11 08:20:00 2002
From: jobriant at MPI-SoftTech.Com (Jennifer O'Briant)
Date: Thu, 11 Apr 2002 10:20:00 -0500 (CDT)
Subject: cluster of IBM Netfinity's
Message-ID: <Pine.GSO.4.33.0204111014190.21949-100000@mpi.mpi-softtech.com>

I have a cluster of 10 IBM Netfinitys that I am upgrading with a second PIII
700 MHz Slot 1 processor.  I am having a hard time finding a fan
and heatsink that will fit in these 1U servers.  Does anyone have any
ideas where I can find a side-mount or side-flow fan that will work?


Jennifer O'Briant
Associate Systems Administrator
MPI Software Technology, Inc.


From rgb at phy.duke.edu  Wed Apr 10 06:31:06 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed, 10 Apr 2002 09:31:06 -0400 (EDT)
Subject: DHCP Help Again
In-Reply-To: <1018442200.3cb431d875b7d@mail1.nada.kth.se>
Message-ID: <Pine.LNX.4.44.0204100842570.29933-100000@ganesh.phy.duke.edu>

On Wed, 10 Apr 2002 tegner at nada.kth.se wrote:

> Quoting "Robert G. Brown" <rgb at phy.duke.edu>:
> Is there a convenient way to obtain static ip-addresses using dhcp without
> having to explicitly write down the mac-addresses in dhcpd.conf?
> 
> Regards,
> 
> /jon

Static?  As in each machine gets a single IP number that remains its own
"forever" through all reboots and which can be identified by a fixed
name in host tables?

Following the time-honored tradition of actually reading the man pages
for dhcpd, we see that the answer is "sort of".  As in: in principle yes,
but only in a weird way, and would you really want to?

First of all, let us consider, how COULD it do this?  All dhcp knows of
a host is its mac address.  System needs IP number.  System broadcasts a
DHCP request.  What can the daemon do?

It can assign the address out of a range without looking at the MAC
address (beyond ensuring that it isn't one that it recognizes already) or
it can look at the MAC address, do a table lookup and find it in the
table, and assign an IP address based on the table that maps MAC->IP.
This is pretty much what actually happens, and of course the lookup
table CAN ensure a static MAC->IP matchup.
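
The two choices described above can be sketched as a toy model in Python.
This is only an illustration of the decision logic, not dhcpd's actual
implementation; `assign_ip`, `static_map`, `pool` and `leases` are
hypothetical names, and the second MAC address is made up.

```python
def assign_ip(mac, static_map, pool, leases):
    """Pick an IP for mac: static table first, then an existing lease,
    then the first free address in the dynamic pool."""
    mac = mac.lower()
    if mac in static_map:            # known host: MAC -> IP table lookup
        return static_map[mac]
    if mac in leases:                # host already holds a dynamic lease
        return leases[mac]
    in_use = set(leases.values()) | set(static_map.values())
    for ip in pool:                  # unknown host: hand out a free address
        if ip not in in_use:
            leases[mac] = ip
            return ip
    raise RuntimeError("address pool exhausted")


# One statically mapped host and a dynamic range of 192.168.1.192-224.
static_map = {"00:60:97:a1:ef:e0": "192.168.1.2"}
pool = [f"192.168.1.{n}" for n in range(192, 225)]
leases = {}

print(assign_ip("00:60:97:A1:EF:E0", static_map, pool, leases))
print(assign_ip("00:11:22:33:44:55", static_map, pool, leases))
```

The known MAC always maps to its table entry, while the unknown one gets
whatever pool address happens to be free at the time.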

The only question is how the lookup table is constructed.

The obvious way is by making explicit per-host entries in the dhcpd.conf
file.  dhcpd reads the file and builds the table from what it finds
there.  You make the dhcpd.conf entries by hand or automagically by
means of a clever script.  In general this isn't a real problem.  You
have to make a per-host entry in e.g. /etc/hosts as well, or you won't
know the NAME the host is going to have to correspond to the IP number
the daemon happened to give it the first time it saw it.  The same
script can do both, given e.g. the MAC address and hostname you wish to
assign as arguments.
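
A minimal sketch of such a script in Python, under stated assumptions: the
function names, the node name g01, the address 192.168.1.101 and the domain
are all hypothetical, and the script only builds the text.  Appending it to
/etc/dhcpd.conf and /etc/hosts and restarting dhcpd is left to the
administrator.

```python
import re

# Six colon-separated hex octets, e.g. 00:60:97:a1:ef:e0
MAC_RE = re.compile(r"^([0-9a-f]{2}:){5}[0-9a-f]{2}$")


def dhcpd_host_entry(hostname, mac, ip):
    """Build a static host stanza for dhcpd.conf."""
    mac = mac.lower()
    if not MAC_RE.match(mac):
        raise ValueError(f"not a MAC address: {mac!r}")
    return (
        f"host {hostname} {{\n"
        f"    hardware ethernet {mac};\n"
        f"    fixed-address {ip};\n"
        f'    option host-name "{hostname}";\n'
        f"}}\n"
    )


def hosts_line(hostname, ip, domain=None):
    """Build the matching /etc/hosts line (FQDN first if a domain is given)."""
    names = [f"{hostname}.{domain}", hostname] if domain else [hostname]
    return f"{ip}\t" + " ".join(names) + "\n"


print(dhcpd_host_entry("g01", "00:60:97:A1:EF:E0", "192.168.1.101"), end="")
print(hosts_line("g01", "192.168.1.101", "internal.domain"), end="")
```

Generating both pieces from the same arguments is what keeps the
hostname<->IP<->MAC chain consistent.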

Now there is nothing to PREVENT the daemon from assigning IP numbers out
of the free range, creating a MAC->IP mapping, and saving the mapping
itself so that it is automagically reloaded after, say, a crash (which
tends to wipe out the table it builds in memory).  By strange chance,
this is pretty much exactly what dhcpd does.  It views IP's assigned out
of a given subnet range as "leases", to be given to hosts for a certain
amount of time and then recovered for reuse.  It saves its current lease
table in /var/lib/dhcp/dhcpd.leases.  Periodically it goes through this
table and "grooms" it, cleaning out expired leases so the IP numbers are
reused.  In many/most cases where range addresses are used, this is just
fine.  Remember, dhcp was "invented" at least in part to simplify
address assignment to rooms full of PC's running WinXX, a well-known
stupid operating system that wouldn't know what to do with a remote
login attempt if it saw one.  Heck, it doesn't know what to do with a
LOCAL login a lot of the time.  The IP<->name map is pretty unimportant
in this case, because you tend never to address the system by its
internet name.  So it's no big deal to let IP addresses for dumb WinXX
clients recycle.

Of course this isn't always true even for WinXX, especially if XX is 2K
or XP or NT.  Sometimes systems people really like to know that log
traces by IP number can be mapped into specific machines just so they
can go around with a sucker rod (see "man syslogd" and do a search on
"sucker") to administer correction, for example, even if they cannot
remotely login to the host in question.

dhcpd allows you to pretty much totally control the lease time used for
any given subnet or range.  You can set it from very short to "very
large", probably 4 billion or so seconds, which is (practically)
"infinity".  Infinity would be your coveted static IP address
assignment.
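
To put "4 billion or so seconds" in perspective, here is the arithmetic as a
short Python sketch.  It assumes the upper limit is the largest value of a
32-bit unsigned count of seconds, which is the width of the lease-time field
in the DHCP options specification; whether a particular dhcpd build accepts
the full range is not verified here.

```python
# Assumption: lease time is a 32-bit unsigned number of seconds.
MAX_LEASE_SECONDS = 2**32 - 1
SECONDS_PER_YEAR = 365.25 * 24 * 3600

years = MAX_LEASE_SECONDS / SECONDS_PER_YEAR
print(f"{MAX_LEASE_SECONDS} s is about {years:.1f} years")
```

That is well over a century, which is indeed infinity for any practical
cluster lifetime.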

Once again I'd argue that although you CAN do this, you probably don't
want to in just about any unixoid context including LAN management and
cluster engineering.  There is something so satisfying, so USEFUL, about
the hostname<->IP map, and in order for this map to correspond to some
SPECIFIC box, you really are building the hostname<->IP<->MAC map,
piecewise.  And of course you need to leave the NICs in the boxes,
since yes the map follows the NIC and not the actual box.  Although it
likely isn't the "only" way to control the complete chain,
simultaneously and explicitly building /etc/hosts (or the NIS, LDAP,
rsync exported versions thereof), the various hostname-related
permissions (e.g. netgroups) and /etc/dhcpd.conf static entries is
arguably the best way.

To emphasize this last point, note that there is additional information
that can be specified in the dhcp static table entries, such as the name
of a per-host kickstart file to be used in installing it and more.  dhcp
is at least an approximation to a centralized configuration data server
and can perform lots of useful services in this arena, not just handing
out IP addresses.  Unfortunately (perhaps? as far as I know?) dhcp's
options can only be passed from its own internal list, so one can't
QUITE use it as a way of globally synchronizing whole tables of
important data (like /etc/hosts,netgroups,passwd) across a subnet as
systems automatically and periodically renew their leases.  The list of
options it supports as it stands now is quite large, though.

I also don't know how susceptible it is to spoofing -- one problem with
daemon-based services like this is that if they aren't uniquely bound at
both ends to an authorized server and somebody puts a faster server on
the same physical network, one can sometimes do something like
dynamically change a system's "identity" in real time and gain access
privileges one otherwise might not have had.  Obviously, sending files
like /etc/passwd around in this way would be a very dangerous thing to
do unless the daemon were re-engineered to use something like ssl to
simultaneously certify the server and encrypt the traffic.

Hope this helps.  BTW, in addition to the always useful man pages for
dhcpd and dhcpd.conf (e.g.) you can and should look at the linux
documentation project site and the various RFCs that specify dhcp's
behavior and option spread.

   rgb


From rgb at phy.duke.edu  Wed Apr 10 09:03:42 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed, 10 Apr 2002 12:03:42 -0400 (EDT)
Subject: DHCP Help Again
In-Reply-To: <1018448717.3cb44b4d7ee2a@mail1.nada.kth.se>
Message-ID: <Pine.LNX.4.44.0204101133270.30097-100000@ganesh.phy.duke.edu>

On Wed, 10 Apr 2002 tegner at nada.kth.se wrote:

> Very helpful! Thanks!
> 
> But I'm still curious about how you make - automagically - the hardware ethernet
> line in dhcpd.conf initially. Say you have 100 machines. One way I would think
> of would be to use kickstart and:
> 
> Install the machines and boot them up in sequence and using the range statement
> in dhcpd.conf (so that the first machine gets 192.168.1.101, the second
> 192.168.1.102 ...)
> 
> Once all nodes are up use some script to extract the mac addresses for all the
> nodes and either modify dhcpd.conf - or - discard dhcp completely and
> hardwire the ip-addresses to each node.
> 
> But I'm sure there are better ways to do this?

Not a whole lot better.  Since our installations tend to be O(10)
systems at a time (10-30, not hundreds) and since we've gotten our local
vendor to label each node with the MAC address before delivery (they've
gotta boot up and burn in each node anyway) we just pop the nodes in a
rack and use a script to insert a static entry for each one in an order
that corresponds to rack order.  After all, even though yes we label the
nodes, it would be a bit silly to have g01 next to g22 next to g13 in
rack order, and since we use the same dhcp server for nodes that we use
for the general department, we cannot guarantee that some other host
won't request and be granted a floating IP number that breaks the
ordered sequence.

The alternative (which would work fine for a cluster with a dedicated,
in-the-local-isolated-net, and hence predictable dhcpd server) is to
write the scriptset you describe, which we've actually considered doing.
Boot the nodes in rack order, with floating addresses hopefully assigned
in strict order from the address range, let them install themselves, and
in the meantime write a script that parses e.g. /var/log/messages for
the DHCP request and offer messages or /var/lib/dhcp/dhcpd.leases for
the MAC and IP mapping and creates the required host and dhcpd.conf
tables.
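
A minimal Python sketch of the leases-parsing half of that idea.  It assumes
the usual dhcpd.leases layout of `lease <ip> { ... hardware ethernet <mac>;
... }` blocks; the sample text, the MAC addresses and the g01, g02 naming
scheme are made up for illustration, and a real script would read the file
from disk instead.

```python
import re

# One lease block: "lease <ip> { ... }" (the body is assumed to contain no "}").
LEASE_RE = re.compile(r"lease\s+(\d+\.\d+\.\d+\.\d+)\s*\{(.*?)\}", re.DOTALL)
HW_RE = re.compile(r"hardware\s+ethernet\s+([0-9A-Fa-f:]{17})\s*;")


def parse_leases(text):
    """Return {ip: mac}; a later block for the same IP overrides an earlier one."""
    mapping = {}
    for ip, body in LEASE_RE.findall(text):
        match = HW_RE.search(body)
        if match:
            mapping[ip] = match.group(1).lower()
    return mapping


def ip_key(ip):
    """Sort key that orders dotted quads numerically rather than as strings."""
    return tuple(int(part) for part in ip.split("."))


sample = """
lease 192.168.1.102 {
  starts 3 2002/04/10 12:01:00;
  ends 3 2002/04/10 18:01:00;
  hardware ethernet 00:60:97:A1:EF:E1;
}
lease 192.168.1.101 {
  starts 3 2002/04/10 12:00:00;
  ends 3 2002/04/10 18:00:00;
  hardware ethernet 00:60:97:a1:ef:e0;
}
"""

table = parse_leases(sample)
for index, ip in enumerate(sorted(table, key=ip_key), start=1):
    print(f"g{index:02d}  {ip}  {table[ip]}")
```

The sorted (name, IP, MAC) rows are what one would then feed into whatever
writes the static dhcpd.conf and /etc/hosts entries.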

We haven't gone this way partly out of laziness -- with tens of systems
at a time to install it will only save work (relative to the time
required to write the scripts) after we've used the scriptset for years
-- and partly because to our direct observation at least one node
install in twenty or thirty will screw up and occur in the wrong order.
This, of course, will screw up EVERYTHING -- either one physically
rearranges the rack or hand edits the tables, either of which costs one
far more than the labor saved in the first place.

There may be a better solution (probably smarter, more complex scripts
that can perform e.g. node insert and delete operations and hence manage
a reordering of the tables without having to hand edit everything) but
more complex scripts require a significant investment in time and one
needs a very clear conceptualization of the design to have a good chance
at ending up with something really usable.  This in turn requires
experience with the simpler scripts and a time living with their
frustrations.  We just don't have enough nodes to do all this except for
the fun of it -- maybe a really big DOE site does but we don't.  So
we'll likely continue to use simple-building block scripts that require
the entry of the MAC address and desired hostname/IP mapping as
parameters (possibly augmented by a script that extracts MAC addresses
from the log files, since even with help from the vendor we often have
nodes or workstations to install with unknown MAC addresses and have to
boot once, get the MAC address, and boot again to do the install).

Not to beat dead horses or anything, but (IMHO) a lot of this management
scriptset development is retarded by the fact that every single system
tool has a configuration file with its own unique format and structure.
I am well on the way to becoming downright religious about using xml as
THE basis for the formatting of this sort of thing, at least where one
can choose to do so in future applications.  If dhcpd.conf and
dhcpd.leases were written in an xml-compliant way, it would both make
much better logical sense and it would be easier to both parse and write
tools to manipulate them.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From rgb at phy.duke.edu  Thu Apr 11 08:54:51 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Thu, 11 Apr 2002 11:54:51 -0400 (EDT)
Subject: Newest RPM's?
In-Reply-To: <1018497414.18187.3.camel@gargleblaster.caffeinexchange.org>
Message-ID: <Pine.LNX.4.44.0204111145390.1141-100000@ganesh.phy.duke.edu>

On 10 Apr 2002, Peter Bowen wrote:

> Are there beowulf packages available for RHL7.2?

Depends on what you mean.  Some of the fundamental tools (PVM, flavors
of MPI, more) are already packaged and in 7.2.  Ditto the full range
of GPL compilers and programming support tools.  Even commercial beowulf
packages like Scyld or a turnkey vendor's arrangement often use RH as a
base, although they aren't always current with the very latest release.

Pretty much all the truly open source beowulf tools either are available
in rpm form that will install under 7.2 (or source rpm that will rebuild
and install under 7.2) or at the very least and worst in a tarball form
that will build and install under any unixoid/posix environment
including 7.2.  In fact, it is almost tautological that this would be so
-- beowulf tools were mostly developed on linux/gnu boxes and RH is at
heart a generic linux/gnu distribution.

Commercial packages (e.g. Portland or Absoft compilers, PBS-Pro) can
almost always be obtained in a form that runs under 7.2.  The only
exceptions are likely to be ones with library issues that haven't yet
been ported.  libc has from time to time changed enough to break things,
so tools developed on e.g. RH 5.2 don't always work on 7.2 without some
porting effort (but I know of no mainstream tools in this category).

So I think the answer would have to be "yes";-)

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From josip at icase.edu  Thu Apr 11 09:00:04 2002
From: josip at icase.edu (Josip Loncaric)
Date: Thu, 11 Apr 2002 12:00:04 -0400
Subject: Naming etc. (Was: DHCP Help)
References: <Pine.LNX.4.44.0204110905550.1632-100000@lucifer.rgb.private.net>
Message-ID: <3CB5B304.49BCC793@icase.edu>

"Robert G. Brown" wrote:
> 
> [...] my hosts are all on a private
> internal network anyway and not in nameservice.

Good policy!  Private hostnames/addresses should remain private because
they are not guaranteed to be unique across the entire Internet.  The
DNS server should contain only registered hostnames/addresses.

The head node of a cluster is typically multi-homed and its public
interface should be DNS registered, but the internal private interface
(and the client nodes on the internal private network) are best resolved
via /etc/hosts, where the internal domain name is determined from the FQDN
form of the name.  If /etc/hosts on client1 contains:

192.168.1.1     client1.internal.domain   client1

then 'dnsdomainname' on client1 returns 'internal.domain' (clearly not
found in any Internet registry).  This would work fine internally, but
NOT outside the cluster (e.g. sendmail may have problems, etc.).
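
The rule at work here, that the domain is whatever follows the first dot of
the canonical (first) name on the host's /etc/hosts line, can be sketched in
Python.  This only mimics the lookup on a string; it is not how
'dnsdomainname' is implemented, and `domain_from_hosts` is a hypothetical
helper.

```python
def domain_from_hosts(hosts_text, hostname):
    """Return the domain part of the canonical name listed for hostname,
    or None if the host is missing or its first name has no dot."""
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        names = line.split()[1:]               # fields after the IP address
        if hostname in names:
            canonical = names[0]
            if "." in canonical:
                return canonical.split(".", 1)[1]
            return None
    return None


hosts = """\
127.0.0.1       localhost
192.168.1.1     client1.internal.domain   client1
"""

print(domain_from_hosts(hosts, "client1"))
```

Listing the short name first on the line would leave the host without a
domain under this rule, which is why the FQDN goes first.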

The /etc/hosts tables should be consistent across the cluster, even if
there are reasons to play tricks.  For example, one typically has all
machines on a fast ethernet (FE) subnet (say 192.168.1.x) but a few may
also have gigabit ethernet (GE) interfaces (say 192.168.2.x).  Using IP
level routing can result in complicated routing tables, because only
specific FE hosts can also be reached via the GE interface.  What about
name level "routing"?  While /etc/hosts can be used to make hostnames of
GE machines resolve to GE addresses on GE machines but to their FE
addresses on the FE-only machines, this can lead to problems with
software packages which assume globally consistent hostname/address
mapping.

For example, grid software (Globus) needs a globally consistent FQDN/IP
mapping.  The grid machine name is the fully-qualified domain name or
Internet name of a grid machine. It should be the name returned by the
"gethostbyname()" function (from libc) and the primary name retrieved
from DNS via nslookup.  The primary name should correspond to the host's
primary interface (if there is more than one) and be fully accessible
across the grid.  The grid could involve private addresses, but those
are visible only WITHIN an organization because private addresses must
not be routable outside an organization.  This is a serious limitation
-- so it is probably best to limit grids to publicly registered hosts
only.  Proxy processes on the head nodes to access internal machines may
be needed.

Most clusters are built around a private subnet, sometimes with IP
masquerading enabled on the head node so that the internal clients can
'call out'.  This still means that internal clients are not visible
externally, i.e. one cannot 'call in' from the outside.  As a
consequence, parallel jobs which assume global TCP connectivity of all
participating machines (e.g. MPICH-G2) will have problems in using two
clusters (each with its own private internal subnet).  At the moment,
every node that you wish to use in an MPICH-G2 job must have a public
IP address and must be fully accessible.  To run jobs
across several clusters with internal private networks, the MPI
programmer would need to provide a proxy process on the head node to
overcome this difficulty.

In summary, naming is a simple concept but just under the surface is a
can of worms created by established programming practices based on
diverse assumptions.  Multiply connected machines and/or public/private
network mixtures need to be set up with great care.  Tricky setups are
fragile; simplicity and transparency work better.

Sincerely,
Josip

-- 
Dr. Josip Loncaric, Research Fellow               mailto:josip at icase.edu
ICASE, Mail Stop 132C           PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center             mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134


From tegner at nada.kth.se  Wed Apr 10 05:36:40 2002
From: tegner at nada.kth.se (tegner at nada.kth.se)
Date: Wed, 10 Apr 2002 14:36:40 +0200 (MET DST)
Subject: DHCP Help Again
In-Reply-To: <Pine.LNX.4.44.0204031619270.30309-100000@ganesh.phy.duke.edu>
References: <Pine.LNX.4.44.0204031619270.30309-100000@ganesh.phy.duke.edu>
Message-ID: <1018442200.3cb431d875b7d@mail1.nada.kth.se>

Quoting "Robert G. Brown" <rgb at phy.duke.edu>:
Is there a convenient way to obtain static ip-addresses using dhcp without
having to explicitly write down the mac-addresses in dhcpd.conf?

Regards,

/jon


> On Wed, 3 Apr 2002, Adrian Garcia Garcia wrote:
> 
> For one thing don't use the range statement -- it tells dhcpd the range
> of IP numbers to assign UNKNOWN ethernet numbers.  You are statically
> assigning an IP number in your "free" range to a particular host with a
> KNOWN ethernet number below.  I don't know what dhcpd would do in that
> case -- something sensible one would hope but then, maybe not.  The
> range statement is really there so you can dynamically allocate
> addresses from the range to hosts you may never have seen before that
> you don't care to ever address by name (as they might well get a
> different IP number on the next boot).  
> 
> DHCP servers run by ISP's not infrequently use the range feature to
> conserve IP numbers -- they only need enough to cover the greatest
> number of connections they are likely to have at any one time, not one
> IP number per host that might ever connect.  Departments might use it to
> give IP numbers to laptops brought in by visitors (with the extra
> benefit that they can assign a subnet block that isn't "trusted" by the
> usual department servers and/or is firewalled from the outside by an
> ip-forwarding/masquerading host).
> 
> You want "only" static IP's in your cluster, as you'd like nodo1 to be
> the same machine and IP address every time.
> 
> Be a bit careful about your use of domain names.  As it happens, I don't
> find cluster.org registered yet (amazingly enough!) but it is pretty
> easy to pick one that does exist in nameservice in the outside world.
> In that case you'll run a serious risk of routing or name resolution
> problems depending on things like the search order you use in
> /etc/nsswitch.conf.  Even my previous example of rgb.private.net is a
> bit risky.
> 
> You should run a nameserver (cache only is fine) on your 192.168.1.1
> server, presuming it lives on an external network and you care to
> resolve global names.
> 
> Similarly you may want:
> 
>  option routers		192.168.1.1;
> 
> if you want internal hosts to be able to get out through your (presumed
> gateway) server.
> 
> Finally, if you want nodo1 to come up knowing its own name without
> hardwiring it in on the node itself, add
> 
>  option host-name	nodo1;
> 
> to its definition.
> 
> I admit that I do tend to lay out my dhcpd.conf a bit differently than
> you have it below but I don't think that the differences are
> particularly significant, and you have a copy of the one I use anyway if
> you want to play with the pieces.  You should find a log trace of
> dhcpd's activities in /var/log/messages, which should help with any
> further debugging.
> 
> On your nodo1 host, make sure that:
> 
> cat /etc/sysconfig/network-scripts/ifcfg-eth0
> DEVICE=eth0
> BOOTPROTO=dhcp
> ONBOOT=yes
> 
> and
> 
> cat /etc/sysconfig/network
> NETWORKING=yes
> HOSTNAME=nodo1
> 
> and that in /etc/modules.conf there is something like:
> 
> cat /etc/modules.conf
> alias parport_lowlevel parport_pc
> alias eth0 tulip
> 
> (or instead of tulip, whatever your network module is).
> 
> If you then boot your e.g. RH client it SHOULD just come up,
> automatically try to start the network on device eth0 using dhcp as its
> protocol for obtaining an IP number, ask the dhcp server for an address
> and a route, and just "work" when they come back.
> 
>   Hope this helps.
> 
>        rgb
> 
> > server-name "server.cluster.org"
> >  
> > subnet 192.168.1.0 netmask 255.255.255.0
> > {
> >   range 192.168.1.2         192.168.1.10   #my client has the ip 192.168.1.2
> >                                            #and my server the static ip 192.168.1.1
> >  option subnet-mask                             255.255.255.0;
> >  option broadcast-address                    192.168.1.255;
> >  option domain-name-server                 192.168.1.1;  
> >  option domain-name                            "cluster.org";
> >  
> >  host  nodo1.cluster.org
> >  {
> >     hardware ethernet 00:60:97:a1:ef:e0; #here is the address of the client's card
> >     fixed-address        192.168.1.2;
> >  }
> > } 
> >  
> > And finally some files on my server.
> >  
> > NETWORK
> > ------------------------------------------
> > networking = yes
> > hostname =server.cluster.org
> > gatewaydev = eth0
> > gatewaye=
> > ------------------------------------------
> >  
> > HOSTS ( In my server and in the client I have the same on this file )
> > ------------------------------------------
> > 127.0.0.1             localhost
> > 192.168.1.1         server.cluster.org
> > 192.168.1.2         nodo1.cluster.org
> >  
> >  
> > Ok thats the information, I am a little confuse, could you help me please
> > =). I can't detect the mistake, I dont know if is the server or some card
> > =s. Thanks for all.
> > 
> 
> -- 
> Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
> Duke University Dept. of Physics, Box 90305
> Durham, N.C. 27708-0305
> Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
> 


From tegner at nada.kth.se  Wed Apr 10 07:25:17 2002
From: tegner at nada.kth.se (tegner at nada.kth.se)
Date: Wed, 10 Apr 2002 16:25:17 +0200 (MET DST)
Subject: DHCP Help Again
In-Reply-To: <Pine.LNX.4.44.0204100842570.29933-100000@ganesh.phy.duke.edu>
References: <Pine.LNX.4.44.0204100842570.29933-100000@ganesh.phy.duke.edu>
Message-ID: <1018448717.3cb44b4d7ee2a@mail1.nada.kth.se>

Very helpful! Thanks!

But I'm still curious about how you make - automagically - the hardware ethernet
line in dhcpd.conf initially. Say you have 100 machines. One way I would think
of would be to use kickstart and:

Install the machines and boot them up in sequence and using the range statement
in dhcpd.conf (so that the first machine gets 192.168.1.101, the second
192.168.1.102 ...)

Once all nodes are up use some script to extract the mac addresses for all the
nodes and either modify dhcpd.conf - or - discard dhcp completely and
hardwire the ip-addresses to each node.

But I'm sure there are better ways to do this?

Thanks again,

/jon

Quoting "Robert G. Brown" <rgb at phy.duke.edu>:

> On Wed, 10 Apr 2002 tegner at nada.kth.se wrote:
> 
> > Quoting "Robert G. Brown" <rgb at phy.duke.edu>:
> > Is there a convenient way to obtain static ip-addresses using dhcp without
> > having to explicitly write down the mac-addresses in dhcpd.conf?
> > 
> > Regards,
> > 
> > /jon
> 
> Static?  As in each machine gets a single IP number that remains its own
> "forever" through all reboots and which can be identified by a fixed
> name in host tables?
> 
> Following the time-honored tradition of actually reading the man pages
> for dhcpd, we see that the answer is "sort of".  As in: in principle yes,
> but only in a weird way, and would you really want to?
> 
> First of all, let us consider, how COULD it do this?  All dhcp knows of
> a host is its mac address.  System needs IP number.  System broadcasts a
> DHCP request.  What can the daemon do?
> 
> It can assign the address out of a range without looking at the MAC
> address (beyond ensuring that it isn't one that it recognizes already) or
> it can look at the MAC address, do a table lookup and find it in the
> table, and assign an IP address based on the table that maps MAC->IP.
> This is pretty much what actually happens, and of course the lookup
> table CAN ensure a static MAC->IP matchup.
> 
> The only question is how the lookup table is constructed.
> 
> The obvious way is by making explicit per-host entries in the dhcpd.conf
> file.  dhcpd reads the file and builds the table from what it finds
> there.  You make the dhcpd.conf entries by hand or automagically by
> means of a clever script.  In general this isn't a real problem.  You
> have to make a per-host entry into e.g. /etc/hosts as well, or you won't
> know the NAME the host is going to have to correspond to the IP number
> the daemon happened to give it the first time it saw it.  The same
> script can do both, given e.g. the MAC address and hostname you wish to
> assign as arguments.
> 
> Now there is nothing to PREVENT the daemon from assigning IP numbers out
> of the free range, creating a MAC->IP mapping, and saving the mapping
> itself so that it is automagically reloaded after, say, a crash (which
> tends to wipe out the table it builds in memory).  By strange chance,
> this is pretty much exactly what dhcpd does.  It views IP's assigned out
> of a given subnet range as "leases", to be given to hosts for a certain
> amount of time and then recovered for reuse.  It saves its current lease
> table in /var/lib/dhcp/dhcpd.leases.  Periodically it goes through this
> table and "grooms" it, cleaning out expired leases so the IP numbers are
> reused.  In many/most cases where range addresses are used, this is just
> fine.  Remember, dhcp was "invented" at least in part to simplify
> address assignment to rooms full of PC's running WinXX, a well-known
> stupid operating system that wouldn't know what to do with a remote
> login attempt if it saw one.  Heck, it doesn't know what to do with a
> LOCAL login a lot of the time.  The IP<->name map is pretty unimportant
> in this case, because you tend never to address the system by its
> internet name.  So it's no big deal to let IP addresses for dumb WinXX
> clients recycle.
> 
> Of course this isn't always true even for WinXX, especially if XX is 2K
> or XP or NT.  Sometimes systems people really like to know that log
> traces by IP number can be mapped into specific machines just so they
> can go around with a sucker rod (see "man syslogd" and do a search on
> "sucker") to administer correction, for example, even if they cannot
> remotely login to the host in question.
> 
> dhcpd allows you to pretty much totally control the lease time used for
> any given subnet or range.  You can set it from very short to "very
> large", probably 4 billion or so seconds, which is (practically)
> "infinity".  Infinity would be your coveted static IP address
> assignment.
> 
> Once again I'd argue that although you CAN do this, you probably don't
> want to in just about any unixoid context including LAN management and
> cluster engineering.  There is something so satisfying, so USEFUL, about
> the hostname<->IP map, and in order for this map to correspond to some
> SPECIFIC box, you really are building the hostname<->IP<->MAC map,
> piecewise.  And of course you need to leave the NIC's in the boxes,
> since yes the map follows the NIC and not the actual box.  Although it
> likely isn't the "only" way to control the complete chain,
> simultaneously and explicitly building /etc/hosts (or the NIS, LDAP,
> rsync exported versions thereof), the various hostname-related
> permissions (e.g. netgroups) and /etc/dhcpd.conf static entries is
> arguably the best way.
> 
> To emphasize this last point, note that there is additional information
> that can be specified in the dhcp static table entries, such as the name
> of a per-host kickstart file to be used in installing it and more.  dhcp
> is at least an approximation to a centralized configuration data server
> and can perform lots of useful services in this arena, not just handing
> out IP addresses.  Unfortunately (perhaps? as far as I know?) dhcp's
> options can only be passed from its own internal list, so one can't
> QUITE use it as a way of globally synchronizing whole tables of
> important data (like /etc/hosts,netgroups,passwd) across a subnet as
> systems automatically and periodically renew their leases.  The list of
> options it supports as it stands now is quite large, though.
> 
> I also don't know how susceptible it is to spoofing -- one problem with
> daemon-based services like this is that if they aren't uniquely bound at
> both ends to an authorized server and somebody puts a faster server on
> the same physical network, one can sometimes do something like
> dynamically change a system's "identity" in real time and gain access
> privileges you otherwise might not have had.  Obviously, sending files
> like /etc/passwd around in this way would be a very dangerous thing to
> do unless the daemon were re-engineered to use something like ssl to
> simultaneously certify the server and encrypt the traffic.
> 
> Hope this helps.  BTW, in addition to the always useful man pages for
> dhcpd and dhcpd.conf (e.g.) you can and should look at the linux
> documentation project site and the various RFCs that specify dhcp's
> behavior and option spread.
> 
>    rgb
> 
> 


From joelja at darkwing.uoregon.edu  Thu Apr 11 09:27:47 2002
From: joelja at darkwing.uoregon.edu (Joel Jaeggli)
Date: Thu, 11 Apr 2002 09:27:47 -0700 (PDT)
Subject: DHCP Help Again
In-Reply-To: <1018442200.3cb431d875b7d@mail1.nada.kth.se>
Message-ID: <Pine.LNX.4.44.0204110922500.5112-100000@twin.uoregon.edu>

You have to have a host-specific value to key on... that would be the mac 
address...

you can approach the problem a different way (dynamic dns) so that the
machines get the same hostname regardless of what ip they get but that's
more trouble than it's worth for a cluster...

On Wed, 10 Apr 2002 tegner at nada.kth.se wrote:

> Quoting "Robert G. Brown" <rgb at phy.duke.edu>:
> Is there a convenient way to obtain static ip-addresses using dhcp without
> having to explicitly write down the mac-addresses in dhcpd.conf?
> 
> Regards,
> 
> /jon
> 
> 
> 
> > On Wed, 3 Apr 2002, Adrian Garcia Garcia wrote:
> > 
> > For one thing don't use the range statement -- it tells dhcpd the range
> > of IP numbers to assign UNKNOWN ethernet numbers.  You are statically
> > assigning an IP number in your "free" range to a particular host with a
> > KNOWN ethernet number below.  I don't know what dhcpd would do in that
> > case -- something sensible one would hope but then, maybe not.  The
> > range statement is really there so you can dynamically allocate
> > addresses from the range to hosts you may never have seen before that
> > you don't care to ever address by name (as they might well get a
> > different IP number on the next boot).  
> > 
> > DHCP servers run by ISP's not infrequently use the range feature to
> > conserve IP numbers -- they only need enough to cover the greatest
> > number of connections they are likely to have at any one time, not one
> > IP number per host that might ever connect.  Departments might use it to
> > give IP numbers to laptops brought in by visitors (with the extra
> > benefit that they can assign a subnet block that isn't "trusted" by the
> > usual department servers and/or is firewalled from the outside by an
> > ip-forwarding/masquerading host).
> > 
> > You want "only" static IP's in your cluster, as you'd like nodo1 to be
> > the same machine and IP address every time.
> > 
> > Be a bit careful about your use of domain names.  As it happens, I don't
> > find cluster.org registered yet (amazingly enough!) but it is pretty
> > easy to pick one that does exist in nameservice in the outside world.
> > In that case you'll run a serious risk of routing or name resolution
> > problems depending on things like the search order you use in
> > /etc/nsswitch.conf.  Even my previous example of rgb.private.net is a
> > bit risky.
> > 
> > You should run a nameserver (cache only is fine) on your 192.168.1.1
> > server, presuming it lives on an external network and you care to
> > resolve global names.
> > 
> > Similarly you may want:
> > 
> >  option routers		192.168.1.1;
> > 
> > if you want internal hosts to be able to get out through your (presumed
> > gateway) server.
> > 
> > Finally, if you want nodo1 to come up knowing its own name without
> > hardwiring it in on the node itself, add
> > 
> >  option host-name	nodo1;
> > 
> > to its definition.
> > 
> > I admit that I do tend to lay out my dhcpd.conf a bit differently than
> > you have it below but I don't think that the differences are
> > particularly significant, and you have a copy of the one I use anyway if
> > you want to play with the pieces.  You should find a log trace of
> > dhcpd's activities in /var/log/messages, which should help with any
> > further debugging.
> > 
> > On your nodo1 host, make sure that:
> > 
> > cat /etc/sysconfig/network-scripts/ifcfg-eth0
> > DEVICE=eth0
> > BOOTPROTO=dhcp
> > ONBOOT=yes
> > 
> > and
> > 
> > cat /etc/sysconfig/network
> > NETWORKING=yes
> > HOSTNAME=nodo1
> > 
> > and that in /etc/modules.conf there is something like:
> > 
> > cat /etc/modules.conf
> > alias parport_lowlevel parport_pc
> > alias eth0 tulip
> > 
> > (or instead of tulip, whatever your network module is).
> > 
> > If you then boot your e.g. RH client it SHOULD just come up,
> > automatically try to start the network on device eth0 using dhcp as its
> > protocol for obtaining an IP number, ask the dhcp server for an address
> > and a route, and just "work" when they come back.
> > 
> >   Hope this helps.
> > 
> >        rgb
> > 
> > > server-name "server.cluster.org"
> > >  
> > > subnet 192.168.1.0 netmask 255.255.255.0
> > > {
> > > >   range 192.168.1.2         192.168.1.10   #my client has the ip 192.168.1.2
> > > >                                            #and my server the static ip 192.168.1.1
> > >  option subnet-mask                             255.255.255.0;
> > >  option broadcast-address                    192.168.1.255;
> > >  option domain-name-server                 192.168.1.1;  
> > >  option domain-name                            "cluster.org";
> > >  
> > >  host  nodo1.cluster.org
> > >  {
> > > >     hardware ethernet 00:60:97:a1:ef:e0; #here is the address of the client's card
> > >     fixed-address        192.168.1.2;
> > >  }
> > > } 
> > >  
> > > And finally some files on my server.
> > >  
> > > NETWORK
> > > ------------------------------------------
> > > networking = yes
> > > hostname =server.cluster.org
> > > gatewaydev = eth0
> > > gatewaye=
> > > ------------------------------------------
> > >  
> > > HOSTS ( In my server and in the client I have the same on this file )
> > > ------------------------------------------
> > > 127.0.0.1             localhost
> > > 192.168.1.1         server.cluster.org
> > > 192.168.1.2         nodo1.cluster.org
> > >  
> > >  
> > > > Ok thats the information, I am a little confuse, could you help me please
> > > > =). I can't detect the mistake, I dont know if is the server or some card
> > > > =s. Thanks for all.
> > > 
> > >
> > 
> > -- 
> > Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
> > Duke University Dept. of Physics, Box 90305
> > Durham, N.C. 27708-0305
> > Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
> > 
> > 
> > 

-- 
-------------------------------------------------------------------------- 
Joel Jaeggli	      Academic User Services   joelja at darkwing.uoregon.edu    
--    PGP Key Fingerprint: 1DE9 8FCA 51FB 4195 B42A 9C32 A30D 121E      --
The accumulation of all powers, legislative, executive, and judiciary, in 
the same hands, whether of one, a few, or many, and whether hereditary, 
selfappointed, or elective, may justly be pronounced the very definition of
tyranny. - James Madison, Federalist Papers 47 -  Feb 1, 1788


From rgb at phy.duke.edu  Thu Apr 11 10:17:51 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Thu, 11 Apr 2002 13:17:51 -0400 (EDT)
Subject: DHCP Help Again
In-Reply-To: <1018448717.3cb44b4d7ee2a@mail1.nada.kth.se>
Message-ID: <Pine.LNX.4.44.0204111308060.1141-100000@ganesh.phy.duke.edu>

On Wed, 10 Apr 2002 tegner at nada.kth.se wrote:

> Very helpful! Thanks!
> 
> But I'm still curious about how you make - automagically - the hardware ethernet
> line in dhcpd.conf initially. Say you have 100 machines. One way I would think
> of would be to use kickstart and:
> 
> Install the machines and boot them up in sequence, using the range statement
> in dhcpd.conf (so that the first machine gets 192.168.1.101, the second
> 192.168.1.102 ...)
> 
> Once all nodes are up, use some script to extract the mac addresses for all the
> nodes and either modify dhcpd.conf - or - discard dhcp completely and
> hardwire the ip-addresses to each node.
> 
> But I'm sure there are better ways to do this?

Not that I know of.  Maybe somebody else knows of one.  I'd just use
perl or bash (either would probably work, although parsing is generally
easier in perl), parse e.g.

Apr 11 08:18:09 lucifer dhcpd: DHCPREQUEST for 192.168.1.140 from 00:20:e0:6d:a0:05 via eth0
Apr 11 08:18:09 lucifer dhcpd: DHCPACK on 192.168.1.140 to 00:20:e0:6d:a0:05 via eth0

from /var/log/messages on the dhcp server, and write an output routine
to generate

# golem (Linux/Windows laptop lilith, second/100BT interface)
host golem {
        hardware ethernet       00:20:e0:6d:a0:05;
        fixed-address           192.168.1.140;
        next-server             192.168.1.131;
        option routers          192.168.1.1;
        option domain-name      "rgb.private.net";
        option host-name        "golem";
}

and

192.168.1.140   golem.rgb.private.net   golem

and append them to /etc/dhcpd.conf and /etc/hosts respectively, and then
distribute copies of the resulting /etc/hosts.  As Josip made eloquently
clear, your private internal network (PIN) should resolve consistently on
all PIN hosts, and it probably should have SOME sort of domainname defined
so that software that might include a getdomainbyname() call, and might
not adequately check for and handle a null value, can cope.  It's hard to
know what assumptions
were made by the designer of every single piece of network software you
might want to run...

Of course you'll probably want to do the b01, b02, b03... hostname
iteration -- I'm just pulling an example at random out of my own log
tables.
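
A minimal sketch of the parse-and-generate step described above, written
in Python rather than perl or bash.  It assumes the log lines look exactly
like the two DHCPACK/DHCPREQUEST samples quoted here; the function names,
the "b" hostname prefix, and the default domain and router values are
illustrative choices, and the next-server line from the golem example is
left out.

```python
import ipaddress
import re

# Matches DHCPACK syslog lines of the form shown in the message above.
# Other dhcpd versions may log in a different format (assumption).
ACK_RE = re.compile(
    r"dhcpd: DHCPACK on (?P<ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"to (?P<mac>[0-9a-f]{2}(?::[0-9a-f]{2}){5})",
    re.IGNORECASE,
)


def collect_acks(log_lines):
    """Return {mac: ip}; the most recent DHCPACK for each MAC wins."""
    acks = {}
    for line in log_lines:
        match = ACK_RE.search(line)
        if match:
            acks[match.group("mac").lower()] = match.group("ip")
    return acks


def host_stanza(name, mac, ip, domain, router):
    """Format one static host entry for dhcpd.conf."""
    return (
        f"host {name} {{\n"
        f"        hardware ethernet       {mac};\n"
        f"        fixed-address           {ip};\n"
        f"        option routers          {router};\n"
        f'        option domain-name      "{domain}";\n'
        f'        option host-name        "{name}";\n'
        f"}}\n"
    )


def generate(log_lines, prefix="b", domain="rgb.private.net",
             router="192.168.1.1"):
    """Return (dhcpd.conf text, /etc/hosts text), nodes numbered in IP order."""
    by_ip = sorted(collect_acks(log_lines).items(),
                   key=lambda item: ipaddress.ip_address(item[1]))
    stanzas, hosts = [], []
    for number, (mac, ip) in enumerate(by_ip, start=1):
        name = f"{prefix}{number:02d}"  # b01, b02, b03, ...
        stanzas.append(host_stanza(name, mac, ip, domain, router))
        hosts.append(f"{ip}\t{name}.{domain}\t{name}\n")
    return "".join(stanzas), "".join(hosts)


# The two sample log lines from the message above.
sample = [
    "Apr 11 08:18:09 lucifer dhcpd: DHCPREQUEST for 192.168.1.140 "
    "from 00:20:e0:6d:a0:05 via eth0",
    "Apr 11 08:18:09 lucifer dhcpd: DHCPACK on 192.168.1.140 "
    "to 00:20:e0:6d:a0:05 via eth0",
]
conf_text, hosts_text = generate(sample)
print(conf_text + hosts_text)
```

In practice one would feed it the lines of /var/log/messages and review
the two outputs by hand before appending them to /etc/dhcpd.conf and
/etc/hosts.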

   rgb


-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From shaeffer at neuralscape.com  Thu Apr 11 03:01:55 2002
From: shaeffer at neuralscape.com (Karen Shaeffer)
Date: Thu, 11 Apr 2002 03:01:55 -0700
Subject: DHCP Help Again
In-Reply-To: <Pine.LNX.4.44.0204100842570.29933-100000@ganesh.phy.duke.edu>; from rgb@phy.duke.edu on Wed, Apr 10, 2002 at 09:31:06AM -0400
References: <1018442200.3cb431d875b7d@mail1.nada.kth.se> <Pine.LNX.4.44.0204100842570.29933-100000@ganesh.phy.duke.edu>
Message-ID: <20020411030155.A30307@synapse.neuralscape.com>

On Wed, Apr 10, 2002 at 09:31:06AM -0400, Robert G. Brown wrote:
> 
> Hope this helps.  BTW, in addition to the always useful man pages for
> dhcpd and dhcpd.conf (e.g.) you can and should look at the linux
> documentation project site and the various RFCs that specify dhcp's
> behavior and option spread.


http://www.amazon.com/exec/obidos/search-handle-form/ref=s_sf_b_as/002-5123550-8208810

Is a reasonably well done book that folks interested in DHCP might consider
acquiring. It provides a comprehensive overview of the subject.

cheers,
Karen
-- 
 Karen Shaeffer
 Neuralscape; Santa Cruz, Ca. 95060
 shaeffer at neuralscape.com  http://www.neuralscape.com


From roger at ERC.MsState.Edu  Thu Apr 11 11:00:04 2002
From: roger at ERC.MsState.Edu (Roger L. Smith)
Date: Thu, 11 Apr 2002 13:00:04 -0500
Subject: DHCP Help Again
In-Reply-To: <1018448717.3cb44b4d7ee2a@mail1.nada.kth.se>
Message-ID: <Pine.SGI.4.44.0204111257070.1146-100000@Downforce.ERC.MsState.Edu>

On Wed, 10 Apr 2002 tegner at nada.kth.se wrote:

> Very helpful! Thanks!
>
> But I'm still curious about how you make - automagically - the hardware ethernet
> line in dhcpd.conf initially. Say you have 100 machines. One way I would think
> of would be to use kickstart and:
>
> Install the machines and boot them up in sequence, using the range statement
> in dhcpd.conf (so that the first machine gets 192.168.1.101, the second
> 192.168.1.102 ...)
>
> Once all nodes are up, use some script to extract the mac addresses for all the
> nodes and either modify dhcpd.conf - or - discard dhcp completely and
> hardwire the ip-addresses to each node.
>
> But I'm sure there are better ways to do this?

That's exactly how I do it.  Then, in the Kickstart configuration script,
I have the node configure itself not to use DHCP anymore.  It is a bit
cumbersome when new nodes are added, but since the nodes that I will be
installing in two weeks are the product of a purchase cycle that started
in February, I don't have to worry about doing it too often.

 _\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_
| Roger L. Smith                        Phone: 662-325-3625               |
| Systems Administrator                 FAX:   662-325-7692               |
| roger at ERC.MsState.Edu                 http://WWW.ERC.MsState.Edu/~roger |
|                       Mississippi State University                      |
|_______________________Engineering Research Center_______________________|


From roger at ERC.MsState.Edu  Thu Apr 11 11:02:46 2002
From: roger at ERC.MsState.Edu (Roger L. Smith)
Date: Thu, 11 Apr 2002 13:02:46 -0500
Subject: DHCP Help Again
In-Reply-To: <Pine.LNX.4.44.0204111308060.1141-100000@ganesh.phy.duke.edu>
Message-ID: <Pine.SGI.4.44.0204111301110.1146-100000@Downforce.ERC.MsState.Edu>

On Thu, 11 Apr 2002, Robert G. Brown wrote:

> > But I'm sure there are better ways to do this?
>
> Not that I know of.  Maybe somebody else knows of one.  I'd just use
> perl or bash (either would probably work, although parsing is generally
> easier in perl), parse e.g.
>
> Apr 11 08:18:09 lucifer dhcpd: DHCPREQUEST for 192.168.1.140 from 00:20:e0:6d:a0:05 via eth0
> Apr 11 08:18:09 lucifer dhcpd: DHCPACK on 192.168.1.140 to 00:20:e0:6d:a0:05 via eth0
>
> from /var/log/messages on the dhcp server, and write an output routine
> to generate

It's actually easier to grab them out of /var/lib/dhcpd.leases, since some
of the information that you're looking for is already in the format that
you need it.
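
A rough sketch of pulling the IP/MAC pairs out of the leases file, again
in Python.  It assumes the usual ISC dhcpd layout of "lease <ip> { ... }"
blocks, each containing a "hardware ethernet" line; the sample text is
made up for illustration (the MAC addresses are borrowed from this
thread), and the function name is hypothetical.

```python
import re

# One "lease <ip> { ... }" block; assumes no nested braces inside a lease.
LEASE_RE = re.compile(r"lease\s+(?P<ip>[\d.]+)\s*\{(?P<body>.*?)\}", re.DOTALL)
MAC_RE = re.compile(r"hardware\s+ethernet\s+(?P<mac>[0-9a-f:]{17})\s*;",
                    re.IGNORECASE)


def leases_to_macs(text):
    """Return {ip: mac}; later lease blocks for an IP override earlier ones."""
    table = {}
    for lease in LEASE_RE.finditer(text):
        mac = MAC_RE.search(lease.group("body"))
        if mac:
            table[lease.group("ip")] = mac.group("mac").lower()
    return table


# Illustrative data only, not taken from a real leases file.
sample = """
lease 192.168.1.101 {
  starts 4 2002/04/11 12:00:00;
  ends 4 2002/04/11 18:00:00;
  hardware ethernet 00:20:e0:6d:a0:05;
}
lease 192.168.1.102 {
  starts 4 2002/04/11 12:05:00;
  ends 4 2002/04/11 18:05:00;
  hardware ethernet 00:60:97:a1:ef:e0;
}
"""
for ip, mac in leases_to_macs(sample).items():
    print(ip, mac)
```

Since the "hardware ethernet" line in the leases file uses the same
syntax as dhcpd.conf, the extracted value can go into a host entry as is.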

 _\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_\|/_
| Roger L. Smith                        Phone: 662-325-3625               |
| Systems Administrator                 FAX:   662-325-7692               |
| roger at ERC.MsState.Edu                 http://WWW.ERC.MsState.Edu/~roger |
|                       Mississippi State University                      |
|_______________________Engineering Research Center_______________________|


From christon at pluto.dsu.edu  Thu Apr 11 11:09:59 2002
From: christon at pluto.dsu.edu (Christoffersen, Neils)
Date: Thu, 11 Apr 2002 13:09:59 -0500
Subject: scyld slave node problems
Message-ID: <0718ABB23368D2119FC200008362AF6816BDDA@pluto.dsu.edu>

Hello all,

I'm setting up a small cluster for my university using the Scyld distro.
The master is up and running and now I'm trying to get the nodes to operate.
However, the node I'm currently working on is having some difficulties. It
seems to be communicating with the master just fine, but when copying the
libraries from the master it starts spitting out "try_do_free_pages failed
for init" and similar messages.  It seems to me that maybe the hard drive is
not being recognized and it's trying to run everything on ram and just
running out of memory.

Does anyone know what could be causing this? I have the node log which I can
attach if you wish (I just don't have it with me at the moment).

Thanks for any help you can lend.

Sincerely
Neils Christoffersen


From canon at nersc.gov  Thu Apr 11 11:16:25 2002
From: canon at nersc.gov (canon at nersc.gov)
Date: Thu, 11 Apr 2002 11:16:25 -0700
Subject: DHCP Help Again 
In-Reply-To: Message from tegner@nada.kth.se 
   of "Wed, 10 Apr 2002 16:25:17 +0200." <1018448717.3cb44b4d7ee2a@mail1.nada.kth.se> 
Message-ID: <200204111816.g3BIGPw13292@pookie.nersc.gov>

Jon,

We install our machines in pretty much this fashion.  I wrote a 
script that yanks out the mac address and builds a dhcp entry that 
I append to the dhcpd.conf file.  It's not the most elegant solution
but it works. Also, I think NPACI/ROCKS includes some utilities to
streamline this process.

--Shane Canon


From RSchilling at affiliatedhealth.org  Thu Apr 11 10:59:55 2002
From: RSchilling at affiliatedhealth.org (Schilling, Richard)
Date: Thu, 11 Apr 2002 10:59:55 -0700
Subject: Parallel povraying baby!!!!
Message-ID: <51FCCCF0C130D211BE550008C724149E01165AF7@mail1.affiliatedhealth.org>

 
Very nice results! Would you be willing to discuss or document the steps you
took to get set up?

Thanks!

--Richard Schilling

-----Original Message-----
From: Jayne Heger
To: Davis, Robin J.; Penfold, Brian; webmaster at wisewolf.com; Clever TW; Roy
Gudz; Stephen.Cooke at severntrent.co.uk; Symon Cook; Tasneem Sharif;
beowulf-newbie at fecundswamp.net; beowulf at beowulf.org; chris
Sent: 9/04/02 17:28
Subject: Parallel povraying baby!!!!


Right,
I've now run a parallel application on my Beowulf Cluster, and it's
working well!  ;)
When running pvmpov, which is a parallel rendering farm application, I get
these results when I render skyvase.pov (a picture of a vase):


1 host  = 7 min 11 seconds
2 hosts = 3 min 30 seconds
3 hosts = 2 min 18 seconds

One other machine to add yet though!
These are all 486's
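
For reference, the timings above can be turned into speedup and parallel
efficiency figures.  This is only a small worked calculation in Python on
the three numbers quoted; the function name is an illustrative choice.

```python
def speedup_table(seconds_by_hosts):
    """Return {hosts: (speedup, efficiency)} relative to the 1-host time."""
    base = seconds_by_hosts[1]
    return {
        hosts: (base / seconds, base / (seconds * hosts))
        for hosts, seconds in sorted(seconds_by_hosts.items())
    }


# Render times for skyvase.pov from the message above, in seconds.
timings = {1: 7 * 60 + 11, 2: 3 * 60 + 30, 3: 2 * 60 + 18}

for hosts, (speedup, efficiency) in speedup_table(timings).items():
    print(f"{hosts} host(s): speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

The speedups come out slightly above the host count, which may simply
reflect timing noise in single runs.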

This is my final year project at university

What do you think???

kw1el huh????

Jayne



From siegert at sfu.ca  Thu Apr 11 11:34:38 2002
From: siegert at sfu.ca (Martin Siegert)
Date: Thu, 11 Apr 2002 11:34:38 -0700
Subject: DHCP Help Again
In-Reply-To: <1018448717.3cb44b4d7ee2a@mail1.nada.kth.se>; from tegner@nada.kth.se on Wed, Apr 10, 2002 at 04:25:17PM +0200
References: <Pine.LNX.4.44.0204100842570.29933-100000@ganesh.phy.duke.edu> <1018448717.3cb44b4d7ee2a@mail1.nada.kth.se>
Message-ID: <20020411113438.C20302@stikine.ucs.sfu.ca>

On Wed, Apr 10, 2002 at 04:25:17PM +0200, tegner at nada.kth.se wrote:
> Very helpful! Thanks!
> 
> But I'm still curious about how you make - automagically - the hardware ethernet
> line in dhcpd.conf initially. Say you have 100 machines. One way I would think
> of would be to use kickstart and:
> 
> Install the machines and boot them up in sequence, using the range statement
> in dhcpd.conf (so that the first machine gets 192.168.1.101, the second
> 192.168.1.102 ...)
> 
> Once all nodes are up, use some script to extract the mac addresses for all the
> nodes and either modify dhcpd.conf - or - discard dhcp completely and
> hardwire the ip-addresses to each node.
> 
> But I'm sure there are better ways to do this?

If you want to use static ip addresses anyway (as I do), why do you
use dhcp at all?

I use a kickstart file with something like

network --bootproto static --device eth3 --ip 172.17.254.1 --netmask 255.255.0.0 --gateway 172.17.0.1 --hostname ks1 --nameserver 172.17.0.1

and have on the master node a set of ip addresses reserved for kickstart
installations:

172.17.254.1		ks1
172.17.254.2		ks2
172.17.254.3		ks3
172.17.254.4		ks4
172.17.254.5		ks5

In the %post section of the kickstart file I then run a script that increases
a counter on the master node, returns that counter as the real ip address
of the new node, and updates the /etc/hosts file on all other nodes.
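
The counter step might look roughly like the following Python sketch.
This is not the actual %post script: the counter file, the 172.17.1.0
base address, the "b" hostname prefix, and the function name are all
assumptions made for illustration, and updating /etc/hosts on the other
nodes is left out.

```python
import ipaddress
from pathlib import Path


def next_node(counter_file, base="172.17.1.0", prefix="b"):
    """Increment the stored counter and return (ip, hostname) for a new node.

    No file locking is done here; if several nodes install at the same
    time, the master would need to serialize calls (assumption).
    """
    path = Path(counter_file)
    count = int(path.read_text()) if path.exists() else 0
    count += 1
    path.write_text(f"{count}\n")
    ip = ipaddress.ip_address(base) + count  # base + 1, base + 2, ...
    return str(ip), f"{prefix}{count:02d}"
```

Called on the master once per installation, this hands out 172.17.1.1 /
b01 first, then 172.17.1.2 / b02, and so on, while the ks1..ks5 addresses
stay reserved for machines that are still installing.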

I have installed my cluster (96 nodes) that way all by myself without any
(big) problems ... maybe I just was too lazy to learn how to deal with dhcp.

Cheers,
Martin

========================================================================
Martin Siegert
Academic Computing Services                        phone: (604) 291-4691
Simon Fraser University                            fax:   (604) 291-4242
Burnaby, British Columbia                          email: siegert at sfu.ca
Canada  V5A 1S6
========================================================================


From joachim at lfbs.RWTH-Aachen.DE  Thu Apr 11 11:46:28 2002
From: joachim at lfbs.RWTH-Aachen.DE (Joachim Worringen)
Date: Thu, 11 Apr 2002 20:46:28 +0200
Subject: very high bandwidth, low latency manner?
References: <Pine.LNX.4.30.0204050252190.7535-100000@elin.scali.no> <v05010178b8dacb026736@[192.168.1.4]>
Message-ID: <3CB5DA04.AF2C609F@lfbs.rwth-aachen.de>

...Iwao Makino wrote:
> 
> I think ... Quadrics<http://www.quadrics.com/> is another one.
[...]
> But pricing is MUCH higher than SCI/Myrinet.

Do you have any pricing information at all? AFAIK, they are only
distributed with Compaq clusters.

 Joachim

-- 
|  _  RWTH|  Joachim Worringen
|_|_`_    |  Lehrstuhl fuer Betriebssysteme, RWTH Aachen
  | |_)(_`|  http://www.lfbs.rwth-aachen.de/~joachim
    |_)._)|  fon: ++49-241-80.27609 fax: ++49-241-80.22339


From emiller at techskills.com  Thu Apr 11 11:52:06 2002
From: emiller at techskills.com (Eric Miller)
Date: Thu, 11 Apr 2002 14:52:06 -0400
Subject: scyld slave node problems
In-Reply-To: <0718ABB23368D2119FC200008362AF6816BDDA@pluto.dsu.edu>
Message-ID: <NMELJLHHFNGMNFFNGJAECEFJCEAA.emiller@techskills.com>

>I'm setting up a small cluster for my university using the Scyld distro.
>The master is up and running and now I'm trying to get the nodes to operate.
>However, the node I'm currently working on is having some difficulties. It
>seems to be communicating with the master just fine, but when copying the
>libraries from the master it starts spitting out "try_do_free_pages failed
>for init" and similar messages.  It seems to me that maybe the hard drive is

There might be a more technical solution, but I had the same problem and was
able to solve it by booting that node diskless.  Just disconnect the hard
drive, and re-boot with the boot disk.  Like I said, there may be a better
way or technical solution, but ".....free_pages" led me to believe it was a
HDD problem.  I booted diskless, and had no problems.


From rok at ucsd.edu  Thu Apr 11 11:31:23 2002
From: rok at ucsd.edu (Robert Konecny)
Date: Thu, 11 Apr 2002 11:31:23 -0700
Subject: DHCP Help Again
In-Reply-To: <Pine.LNX.4.44.0204111308060.1141-100000@ganesh.phy.duke.edu>; from rgb@phy.duke.edu on Thu, Apr 11, 2002 at 01:17:51PM -0400
References: <1018448717.3cb44b4d7ee2a@mail1.nada.kth.se> <Pine.LNX.4.44.0204111308060.1141-100000@ganesh.phy.duke.edu>
Message-ID: <20020411113123.B26495@ucsd.edu>

That's pretty much how insert-ethers from the Rocks clustering software works
(rocks.npaci.edu). You fire it up on the frontend and it starts parsing
/var/log/messages in real time. Then you kickstart a node, and when
insert-ethers sees a lease request from an unknown MAC it updates the
Rocks MySQL database, generates a new dhcpd.conf and restarts dhcpd. Works
like a charm.
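
For anyone who wants to roll their own, here is a rough Python sketch of
the log-parsing step.  It is not the insert-ethers code; the function
names and the "b01, b02, ..." numbering are assumptions, and the sample
log format is the one in rgb's quoted message below.  It only builds the
text of the entries; appending them to the files and restarting dhcpd is
left out.

```python
import re

# Matches dhcpd syslog lines such as:
#   ... dhcpd: DHCPREQUEST for 192.168.1.140 from 00:20:e0:6d:a0:05 via eth0
# The optional "(server)" group tolerates logs that print the server IP.
REQUEST_RE = re.compile(
    r"dhcpd: DHCPREQUEST for (?P<ip>\d+\.\d+\.\d+\.\d+)"
    r"(?: \([^)]*\))? from (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})",
    re.IGNORECASE,
)


def new_hosts(log_lines, known_macs, prefix="b"):
    """Yield (hostname, mac, ip) for each MAC address not seen before."""
    seen = {mac.lower() for mac in known_macs}
    count = 0
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        mac = match.group("mac").lower()
        if mac in seen:
            continue
        seen.add(mac)
        count += 1
        yield f"{prefix}{count:02d}", mac, match.group("ip")


def dhcpd_stanza(hostname, mac, ip):
    """Build a static host declaration for dhcpd.conf."""
    return (
        f"host {hostname} {{\n"
        f"        hardware ethernet       {mac};\n"
        f"        fixed-address           {ip};\n"
        f'        option host-name        "{hostname}";\n'
        f"}}\n"
    )


def hosts_entry(hostname, ip, domain="rgb.private.net"):
    """Build the matching /etc/hosts line."""
    return f"{ip}   {hostname}.{domain}   {hostname}"
```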

robert

On Thu, Apr 11, 2002 at 01:17:51PM -0400, Robert G. Brown wrote:
>
> Not that I know of.  Maybe somebody else knows of one.  I'd just use
> perl or bash (either would probably work, although parsing is generally
> easier in perl), parse e.g.
> 
> Apr 11 08:18:09 lucifer dhcpd: DHCPREQUEST for 192.168.1.140 from 00:20:e0:6d:a0:05 via eth0
> Apr 11 08:18:09 lucifer dhcpd: DHCPACK on 192.168.1.140 to 00:20:e0:6d:a0:05 via eth0
> 
> from /var/log/messages on the dhcp server, and write an output routine
> to generate
> 
> # golem (Linux/Windows laptop lilith, second/100BT interface)
> host golem {
>         hardware ethernet       00:20:e0:6d:a0:05;
>         fixed-address           192.168.1.140;
>         next-server             192.168.1.131;
>         option routers          192.168.1.1;
>         option domain-name      "rgb.private.net";
>         option host-name        "golem";
> }
> 
> and
> 
> 192.168.1.140   golem.rgb.private.net   golem
> 
> and append them to /etc/dhcpd.conf and /etc/hosts respectively, and then
> distribute copies of the resulting /etc/hosts -- as Josip made
> eloquently clear your private internal network should resolve
> consistently on all PIN hosts and probably should have SOME sort of
> domainname defined so that software that might include a
> getdomainbyname() call and might not include an adequate check and
> handle of a null value can cope.  It's hard to know what assumptions
> were made by the designer of every single piece of network software you
> might want to run...
> 
> Of course you'll probably want to do the b01, b02, b03... hostname
> iteration -- I'm just pulling an example at random out of my own log
> tables.
> 
>    rgb
> 
> > 
> > Thanks again,
> > 
> > /jon
> > 
> > Quoting "Robert G. Brown" <rgb at phy.duke.edu>:
> > 
> > > On Wed, 10 Apr 2002 tegner at nada.kth.se wrote:
> > > 
> > > > Quoting "Robert G. Brown" <rgb at phy.duke.edu>:
> > > > Is there a convenient way to obtain static ip-addresses using dhcp
> > > without
> > > > having to explicitly write down the mac-addresses in dhcpd.conf?
> > > > 
> > > > Regards,
> > > > 
> > > > /jon
> > > 
> > > Static?  As in each machine gets a single IP number that remains its own
> > > "forever" through all reboots and which can be identified by a fixed
> > > name in host tables?
> > > 
> > > Following the time-honored tradition of actually reading the man pages
> > > > for dhcpd, we see that the answer is "sort of".  As in: in principle yes,
> > > > but only in a weird way, and would you really want to?
> > > 
> > > First of all, let us consider, how COULD it do this?  All dhcp knows of
> > > a host is its mac address.  System needs IP number.  System broadcasts a
> > > DHCP request.  What can the daemon do?
> > > 
> > > It can assign the address out of a range without looking at the MAC
> > > address (beyond ensuring that isn't one that it recognizes already) or
> > > it can look at the MAC address, do a table lookup and find it in the
> > > table, and assign an IP address based on the table that maps MAC->IP.
> > > This is pretty much what actually happens, and of course the lookup
> > > table CAN ensure a static MAC->IP matchup.
> > > 
> > > The only question is how the lookup table is constructed.
> > > 
> > > > The obvious way is by making explicit per-host entries in the dhcpd.conf
> > > file.  dhcpd reads the file and builds the table from what it finds
> > > there.  You make the dhcpd.conf entries by hand or automagically by
> > > means of a clever script.  In general this isn't a real problem.  You
> > > have to make a per-host entry into e.g. /etc/hosts as well, or you won't
> > > know the NAME the host is going to have to correspond to the IP number
> > > the daemon happened to give it the first time it saw it.  The same
> > > script can do both, given e.g. the MAC address and hostname you wish to
> > > assign as arguments.
> > > 
> > > Now there is nothing to PREVENT the daemon from assigning IP numbers out
> > > of the free range, creating a MAC->IP mapping, and saving the mapping
> > > itself so that it is automagically reloaded after, say, a crash (which
> > > > tends to wipe out the table it builds in memory).  By strange chance,
> > > this is pretty much exactly what dhcpd does.  It views IP's assigned out
> > > of a given subnet range as "leases", to be given to hosts for a certain
> > > amount of time and then recovered for reuse.  It saves its current lease
> > > table in /var/lib/dhcp/dhcpd.leases.  Periodically it goes through this
> > > table and "grooms" it, cleaning out expired leases so the IP numbers are
> > > reused.  In many/most cases where range addresses are used, this is just
> > > fine.  Remember, dhcp was "invented" at least in part to simplify
> > > address assignment to rooms full of PC's running WinXX, a well-known
> > > stupid operating system that wouldn't know what to do with a remote
> > > login attempt if it saw one.  Heck, it doesn't know what to do with a
> > > LOCAL login a lot of the time.  The IP<->name map is pretty unimportant
> > > in this case, because you tend never to address the system by its
> > > internet name.  So it's no big deal to let IP addresses for dumb WinXX
> > > clients recycle.
> > > 
> > > Of course this isn't always true even for WinXX, especially if XX is 2K
> > > or XP or NT.  Sometimes systems people really like to know that log
> > > traces by IP number can be mapped into specific machines just so they
> > > can go around with a sucker rod (see "man syslogd" and do a search on
> > > "sucker") to administer correction, for example, even if they cannot
> > > remotely login to the host in question.
> > > 
> > > dhcpd allows you to pretty much totally control the lease time used for
> > > any given subnet or range.  You can set it from very short to "very
> > > large", probably 4 billion or so seconds, which is (practically)
> > > "infinity".  Infinity would be your coveted static IP address
> > > assignment.
> > > 
> > > Once again I'd argue that although you CAN do this, you probably don't
> > > want to in just about any unixoid context including LAN management and
> > > cluster engineering.  There is something so satisfying, so USEFUL, about
> > > the hostname<->IP map, and in order for this map to correspond to some
> > > SPECIFIC box, you really are building the hostname<->IP<->MAC map,
> > > piecewise.  And of course you need to leave the NIC's in the boxes,
> > > since yes the map follows the NIC and not the actual box.  Although it
> > > likely isn't the "only" way to control the complete chain,
> > > > simultaneously and explicitly building /etc/hosts (or the NIS, LDAP,
> > > > rsync exported versions thereof), the various hostname-related
> > > > permissions (e.g. netgroups) and /etc/dhcpd.conf static entries is
> > > arguably the best way.
> > > 
> > > To emphasize this last point, note that there is additional information
> > > that can be specified in the dhcp static table entries, such as the name
> > > of a per-host kickstart file to be used in installing it and more.  dhcp
> > > is at least an approximation to a centralized configuration data server
> > > and can perform lots of useful services in this arena, not just handing
> > > out IP addresses.  Unfortunately (perhaps? as far as I know?) dhcp's
> > > > options can only be passed from its own internal list, so one can't
> > > QUITE use it as a way of globally synchronizing whole tables of
> > > important data (like /etc/hosts,netgroups,passwd) across a subnet as
> > > systems automatically and periodically renew their leases.  The list of
> > > options it supports as it stands now is quite large, though.
> > > 
> > > I also don't know how susceptible it is to spoofing -- one problem with
> > > daemon-based services like this is that if they aren't uniquely bound at
> > > both ends to an authorized server and somebody puts a faster server on
> > > the same physical network, one can sometimes do something like
> > > > dynamically change a system's "identity" in real time and gain access
> > > privileges you otherwise might not have had.  Obviously, sending files
> > > like /etc/passwd around in this way would be a very dangerous thing to
> > > do unless the daemon were re-engineered to use something like ssl to
> > > simultaneously certify the server and encrypt the traffic.
> > > 
> > > Hope this helps.  BTW, in addition to the always useful man pages for
> > > dhcpd and dhcpd.conf (e.g.) you can and should look at the linux
> > > documentation project site and the various RFCs that specify dhcp's
> > > behavior and option spread.
> > > 
> > >    rgb
> > > 
> > > 
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org
> > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> > 
> 
> -- 
> Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
> Duke University Dept. of Physics, Box 90305
> Durham, N.C. 27708-0305
> Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
> 
> 
> 


From lightdee at netscape.net  Thu Apr 11 11:34:33 2002
From: lightdee at netscape.net (lightdee at netscape.net)
Date: Thu, 11 Apr 2002 14:34:33 -0400
Subject: How do you keep clusters running....
Message-ID: <1D889E10.452B89F1.009FF3AE@netscape.net>

Doug J Nordwall wrote:

>On Wed, 2002-04-03 at 13:04, Cris Rhea wrote:
>
>   What are folks doing about keeping hardware running on large clusters?
>    
>    Right now, I'm running 10 Racksaver RS-1200's (for a total of 20 nodes)...
>
>    Sure seems like every week or two, I notice dead fans (each RS-1200
>    has 6 case fans in addition to the 2 CPU fans and 2 power supply fans).
>     
>
>You running lm_sensors on your nodes? That's a handy tool for paying
>attention to things like that. We use ours in combination with ganglia
>and pump it to a web page and to big brother to see when a cpu might be
>getting hot, or a fan might be too slow. We actually saved a dozen
>machines that way...we have 32 4 processor racksaver boxes in a rack,
>and the rack was not designed to handle racksaver's fan system. That is
>to say, there was a solid sidewall on the rack, and it kept in heat. I
>set up lm_sensors on all the nodes (homogenous, so configured on one and
>pushed it out to all), then pumped the data into ganglia
>(ganglia.sourceforge.net) and then to a web page. I noticed that the
>temp on a dozen of the machines was extremely high. So, I took off the
>side panel of the rack. The temp dropped by 15 C on all the nodes, and
>everything was within normal parameters again.
>
>
>    My last fan failure was a CPU fan that toasted the CPU and motherboard.
>
>
>Ya, we would have seen this on ours earlier...excellent tool

[snip]

We use Clusterworx (from Linux Networx), which isn't open source, but it goes
a step further than Ganglia.  It uses lm_sensors and a power control box
(again from Linux Networx) to actually shut down a node if it is getting
too hot, and the event parameters are all tweakable.  It's always a good
idea to have some kind of cluster monitoring software installed, but it's
nice to be able to set up event triggers in your software in case something
goes wrong and you're not around.
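
To give an idea of what such a trigger boils down to, here is a toy Python
sketch.  It is not Clusterworx's interface; the function name, the dict of
per-node readings and the 70 C limit are assumptions, and the part that
collects readings from lm_sensors and switches off the power is left out.

```python
def overheated_nodes(readings_c, limit_c=70.0):
    """Return the sorted names of nodes whose temperature exceeds limit_c.

    readings_c maps a node name to its latest CPU temperature in Celsius.
    The default limit is an arbitrary example value, not a recommendation.
    """
    return sorted(node for node, temp in readings_c.items() if temp > limit_c)
```

Whatever comes back from this check is the list of nodes the power box
would be told to shut down.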

----
David Henry
Synergy Software, Inc.
lightdee at netscape.net



From ctierney at hpti.com  Thu Apr 11 11:59:20 2002
From: ctierney at hpti.com (Craig Tierney)
Date: Thu, 11 Apr 2002 12:59:20 -0600
Subject: What could be the performance of my cluster
In-Reply-To: <20020406113545.91938.qmail@web10504.mail.yahoo.com>; from suraj_peri@yahoo.com on Sat, Apr 06, 2002 at 03:35:45AM -0800
References: <20020405125956.D69845@velocet.ca> <20020406113545.91938.qmail@web10504.mail.yahoo.com>
Message-ID: <20020411125920.D32605@hpti.com>

It depends on what you are trying to do (doesn't everyone
love that answer). 

The number of flops your cluster can do should
be equal to:

flops = (no. of CPUs) * (clock rate in Hz) * (flops per cycle)

So for your cluster

flops =  8 * 1.53 GHz * 2

  I am assuming that with SSE you can get 2 flops per cycle.

flops = 24.48 Gflops
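
The same arithmetic as a small Python function, in case you want to plug
in other numbers.  The function name and the optional efficiency factor
are my own additions for illustration; the 2 flops per cycle is the SSE
assumption stated above.

```python
def peak_gflops(cpus, clock_ghz, flops_per_cycle, efficiency=1.0):
    """Theoretical peak in Gflops, optionally scaled by an efficiency guess.

    efficiency=1.0 gives the raw peak; any lower value is only an estimate
    that you would have to measure for your own code.
    """
    return cpus * clock_ghz * flops_per_cycle * efficiency


# The cluster in question: 8 CPUs at 1.53 GHz, 2 flops per cycle.
cluster_peak = peak_gflops(8, 1.53, 2)
```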

Now, there are some issues with this.  First, you are never
going to get 1.53*2 Gflops out of a single processor.  Second,
leveraging all 8 cpus to get their maximum is going to be 
difficult if there is any communication between the nodes.

Compilers play a big role in extracting the best performance
out of the system.  If you don't have a commercial compiler
from the likes of Intel or Portland Group, I highly recommend
getting one.  You only have to purchase the compiler for where
you compile, and not where you run.  You can get away with
one copy of the compiler on your server.

If you are trying to compare the AMD system to the DS20E system,
it will depend on what you are actually trying to do.  If 
you are running single precision floating point codes that do
not require all the memory bandwidth a DS20E provides, I would
expect one AMD processor to do the work of one 833 MHz Alpha CPU,
to within 10% (you didn't say if you had 2 CPUs in your DS20E).
At least this is what I am seeing for my codes when comparing
dual Xeons, dual AMDs, and dual API 833 boxes.

Craig


On Sat, Apr 06, 2002 at 03:35:45AM -0800, Suraj Peri wrote:
> Hi group, 
> I was calculating the performance of my cluster. The
> features are 
> 
> 1. 8 nodes
> 2. Processor: AMD Athlon XP 1800+
> 3. 8 CPUs
> 4. 8*1.5 GB DDR RAM
> 5. 1 Server with 2 processorts with AMD MP 1800+ and
> 2GB DDR RAM
> 
> I calculated this to be 48 Mflops . Is this correct ?
> if not, what is the correct performance of my cluster.
> I also comparatively calculated that my cluster would
> be 3 times faster than AlphaServer DS20E ( 833 MHz
> alpha 64 bit processor, 4 GB max memory)
> 
> Is my calculation correct or wrong? please help me
> ASAP. thanks in advance.
> 
> cheers
> suraj.
> 
> =====
> PIL/BMB/SDU/DK
> 

-- 
Craig Tierney (ctierney at hpti.com)


From becker at scyld.com  Thu Apr 11 12:16:47 2002
From: becker at scyld.com (Donald Becker)
Date: Thu, 11 Apr 2002 15:16:47 -0400 (EDT)
Subject: scyld slave node problems
In-Reply-To: <NMELJLHHFNGMNFFNGJAECEFJCEAA.emiller@techskills.com>
Message-ID: <Pine.LNX.4.33.0204111514190.5957-100000@presario>

On Thu, 11 Apr 2002, Eric Miller wrote:

> >I'm setting up a small cluster for my university using the Scyld distro.
> >The master is up and running and now I'm trying to get the nodes to
> operate.
> >However, the node I'm currently working on is having some difficulties. It
> >seems to be communicating with the master just fine, but when copying the
> >libraries from the master it starts spitting out "try_do_free_pages failed
> >for init" and similar messages.

My first guess is that you don't have enough memory (64MB+) on the slave
node.  But this might also be a memory or disk problem.

> There might be a more technical solution, but I had the same problem and was
> able to solve it by booting that node diskless.  Just disconnect the hard
> drive, and re-boot with the boot disk.  Like I said, there may be a better
> way or technical solution, but ".....free_pages" led me to believe it was a
> HDD problem.  I booted diskless, and had no problems.

You should not need to physically disconnect the hard disk.  Just remove
any references to /dev/hda that you added in /etc/beowulf/fstab.
However if you do have a hardware problem, disconnecting the disk might
avoid the symptoms.

-- 
Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993


From skruglik at gmu.edu  Thu Apr 11 12:35:28 2002
From: skruglik at gmu.edu (Stepan Kruglikov)
Date: Thu, 11 Apr 2002 15:35:28 -0400
Subject: scyld slave node problems
References: <0718ABB23368D2119FC200008362AF6816BDDA@pluto.dsu.edu>
Message-ID: <00a901c1e190$0924c730$c932ae81@lyapunov>

Hello,

I solved this problem by increasing memory on each node to 64MB. You can
also get the node up and running with 64MB, set up the node partitions, and
then delete the RAM drive and run with 32MB. Although it works, I recommend
doing it only if you are interested in a proof-of-concept cluster.

Stepan Kruglikov
----- Original Message -----
From: "Christoffersen, Neils" <christon at pluto.dsu.edu>
To: <beowulf at beowulf.org>
Sent: Thursday, April 11, 2002 2:09 PM
Subject: scyld slave node problems


> Hello all,
>
> I'm setting up a small cluster for my university using the Scyld distro.
> The master is up and running and now I'm trying to get the nodes to operate.
> However, the node I'm currently working on is having some difficulties. It
> seems to be communicating with the master just fine, but when copying the
> libraries from the master it starts spitting out "try_do_free_pages failed
> for init" and similar messages.  It seems to me that maybe the hard drive is
> not being recognized and it's trying to run everything on ram and just
> running out of memory.
>
> Does anyone know what could be causing this? I have the node log which I can
> attach if you wish (I just don't have it with me at the moment).
>
> Thanks for any help you can lend.
>
> Sincerely
> Neils Christoffersen


From ctierney at hpti.com  Thu Apr 11 12:26:50 2002
From: ctierney at hpti.com (Craig Tierney)
Date: Thu, 11 Apr 2002 13:26:50 -0600
Subject: very high bandwidth, low latency manner?
In-Reply-To: <3CB5DA04.AF2C609F@lfbs.rwth-aachen.de>; from joachim@lfbs.RWTH-Aachen.DE on Thu, Apr 11, 2002 at 08:46:28PM +0200
References: <Pine.LNX.4.30.0204050252190.7535-100000@elin.scali.no> <v05010178b8dacb026736@[192.168.1.4]> <3CB5DA04.AF2C609F@lfbs.rwth-aachen.de>
Message-ID: <20020411132650.A32674@hpti.com>

I talked to a guy at SC2001 from Quadrics and he said
that list pricing on a Quadrics network was about $3500
per node when you are in the 100s of nodes and up.  
The price includes the cards, cables, switches,
etc.  This doesn't include any sort of discount that you
might get.  Myrinet is about $2000 for an equivalent 
network at list price.   Dolphin/SCI falls around $2245 list 
per node (if the system is > 144 nodes and you have to get
the 3d card).


I heard that Quadrics had a customer that just had to have
an Intel/Quadrics system so either they or he was working
on porting the drivers.   The web page says they support
Linux and Tru64.  You could probably get the hardware without
going through Compaq, but Compaq is most likely buying up
most of the supply.

Craig

-- 
Craig Tierney (ctierney at hpti.com)


On Thu, Apr 11, 2002 at 08:46:28PM +0200, Joachim Worringen wrote:
> ...Iwao Makino wrote:
> > 
> > I think ... Quadrics<http://www.quadrics.com/> is another one.
> [...]
> > But pricing is MUCH higher than SCI/Myrinet.
> 
> Do you have any pricing information at all? AFAIK, they are only
> distributed with Compaq clusters.
> 
>  Joachim
> 
> -- 
> |  _  RWTH|  Joachim Worringen
> |_|_`_    |  Lehrstuhl fuer Betriebssysteme, RWTH Aachen
>   | |_)(_`|  http://www.lfbs.rwth-aachen.de/~joachim
>     |_)._)|  fon: ++49-241-80.27609 fax: ++49-241-80.22339


From epaulson at cs.wisc.edu  Thu Apr 11 12:32:02 2002
From: epaulson at cs.wisc.edu (Erik Paulson)
Date: Thu, 11 Apr 2002 14:32:02 -0500
Subject: (no subject)
In-Reply-To: <NMELJLHHFNGMNFFNGJAEEEEPCEAA.emiller@techskills.com>; from emiller@techskills.com on Thu, Apr 11, 2002 at 10:08:16AM -0400
References: <20020411130008.63691.qmail@web9608.mail.yahoo.com> <NMELJLHHFNGMNFFNGJAEEEEPCEAA.emiller@techskills.com>
Message-ID: <20020411143202.C27111@perdita.cs.wisc.edu>

On Thu, Apr 11, 2002 at 10:08:16AM -0400, Eric Miller wrote:
> 
> After you get the cluster up and running, that's where the help seems to
> drift off.  Most of the people in this group are upper-level users who know
> how to get these MPI enabled programs to run on their clusters.  If you are
> like me, these topics are a little foreign.  If you are looking for
> something to run continuously, like a display, they say the MandelBrot
> renderer has a loop function, but I can't get it to work.  Someone suggested
> SETI many months ago, which would be perfect, but SETI does not offer an MPI
> enabled program.
> 

What possible good would an MPI-enabled SETI at Home do? The whole point of 
SETI at Home is that it's already parallelized.

If you've got N nodes, submit N copies of SETI at home to your queuing system,
and your cluster will get an N times speedup over a single node. I don't see
how you can hope to do better than that.

-Erik


From becker at scyld.com  Thu Apr 11 13:03:31 2002
From: becker at scyld.com (Donald Becker)
Date: Thu, 11 Apr 2002 16:03:31 -0400 (EDT)
Subject: scyld slave node problems
In-Reply-To: <00a901c1e190$0924c730$c932ae81@lyapunov>
Message-ID: <Pine.LNX.4.33.0204111601280.5957-100000@presario>

On Thu, 11 Apr 2002, Stepan Kruglikov wrote:

> I solved this problem by increasing memory on each node to 64MB. You can
> also get the node up and running with 64MB, set up the node partitions, and
> then delete the RAM drive and run with 32MB. Although it works, I recommend
> doing it only if you are interested in a proof-of-concept cluster.

It's possible to trim the cached library list in /etc/beowulf/config and
fit into 32MB.  But only the most trivial application will run with 32MB
and no local disk.

-- 
Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993


From fraser5 at cox.net  Thu Apr 11 13:38:44 2002
From: fraser5 at cox.net (Jim Fraser)
Date: Thu, 11 Apr 2002 16:38:44 -0400
Subject: Will the dual Tyan board boot without a graphics card installed?
Message-ID: <006a01c1e198$dff52c70$0300005a@papabear>

I have had this problem on a couple other boards and it can be annoying.

Thanks,

Jim


From emiller at techskills.com  Thu Apr 11 14:01:24 2002
From: emiller at techskills.com (Eric Miller)
Date: Thu, 11 Apr 2002 17:01:24 -0400
Subject: (no subject)
In-Reply-To: <20020411143202.C27111@perdita.cs.wisc.edu>
Message-ID: <NMELJLHHFNGMNFFNGJAEKEFNCEAA.emiller@techskills.com>

>> <Snip>
>> Someone suggested
>> SETI many months ago, which would be perfect, but SETI does not offer an
MPI
>> enabled program.
>
>What possible good would an MPI-enabled SETI at Home do? The whole point of
>SETI at Home is that it's already parallelized.
>

My definition of parallelized is MPI or PVM enabled code, not _distributed_
applications like SETI.  When demonstrating to students the capabilities of
Linux, it's not nearly as convincing to just start N number of instances on N
nodes.  The magic stuff that we newbie cluster builders seek is not found in
that.  It is found in having a bona-fide cluster with master and slave
nodes, and a single instance of a program being managed and executed by a
group of machines.  Am I alone in this opinion?

>If you've got N nodes, submit N copies of SETI at home to your queuing system,
>and your cluster will get an N times speedup over a single node. I don't
see
>how you can hope to do better than that.

I was aware of this possibility, but do not have the skills to implement it.
Please see my post from weeks ago, March 11th.  It was SETI that I was
referring to:

--For non-parallel applications, is it possible to run individual instances
on
--diskless nodes?  For example, I want to execute a non-MPI program "A" that
--is located in the /bin directory of my master node, but I want to run one
--instance of "A" on each of my diskless nodes.

--What is the syntax that equates to:

--#NP=1 "A" on node0 only
--#NP=1 "A" on node1 only
--#....
--#....


From math at velocet.ca  Thu Apr 11 14:25:02 2002
From: math at velocet.ca (Velocet)
Date: Thu, 11 Apr 2002 17:25:02 -0400
Subject: Will the dual Tyan board boot without a graphics card installed?
In-Reply-To: <006a01c1e198$dff52c70$0300005a@papabear>; from fraser5@cox.net on Thu, Apr 11, 2002 at 04:38:44PM -0400
References: <006a01c1e198$dff52c70$0300005a@papabear>
Message-ID: <20020411172502.F19272@velocet.ca>

On Thu, Apr 11, 2002 at 04:38:44PM -0400, Jim Fraser's all...
> 
> I have had this problem on a couple other boards and it can be annoying.

I have found that clearing the BIOS and setting everything back up
can solve this problem sometimes.

/kc

> 
> Thanks,
> 
> Jim

-- 
Ken Chase, math at velocet.ca  *  Velocet Communications Inc.  *  Toronto, CANADA 


From epaulson at cs.wisc.edu  Thu Apr 11 14:26:24 2002
From: epaulson at cs.wisc.edu (Erik Paulson)
Date: Thu, 11 Apr 2002 16:26:24 -0500
Subject: (no subject)
In-Reply-To: <NMELJLHHFNGMNFFNGJAEKEFNCEAA.emiller@techskills.com>; from emiller@techskills.com on Thu, Apr 11, 2002 at 05:01:24PM -0400
References: <20020411143202.C27111@perdita.cs.wisc.edu> <NMELJLHHFNGMNFFNGJAEKEFNCEAA.emiller@techskills.com>
Message-ID: <20020411162624.E27598@perdita.cs.wisc.edu>

On Thu, Apr 11, 2002 at 05:01:24PM -0400, Eric Miller wrote:
> >> <Snip>
> >> Someone suggested
> >> SETI many months ago, which would be perfect, but SETI does not offer an
> MPI
> >> enabled program.
> >
> >What possible good would an MPI-enabled SETI at Home do? The whole point of
> >SETI at Home is that it's already parallelized.
> >
> 
> My definition of parallelized is MPI or PVM enabled code, not _distributed_
> applications like SETI.  When demonstrating to students the capabilities of
> Linux, it's not nearly as convincing to just start N number of instances on N
> nodes.  The magic stuff that we newbie cluster builders seek is not found in
> that.  It is found in having a bona-fide cluster with master and slave
> nodes, and a single instance of a program being managed and executed by a
> group of machines.  Am I alone in this opinion?
> 

Yes. What you'll discover is that there is no magic to cluster building. 
If your problem can be solved in parallel just by running N unmodified 
copies of your code, then that's the way to do it.  And there's tons of
science to be done this way (in fact, I'd bet there's more to be done this
way than with big MPI jobs)

If your codes to solve your problem need to be parallelized with MPI or
PVM for whatever reason (maybe you don't need to solve N instances of your
code, just one instance and minimize the time, or you need more resources than
any one machine can handle - ie 32 gigs of RAM or some such) then you don't
really have a choice and you have to break down and do it. But again, there's
no magic here. There is not a single instance of your program on the cluster -
if your code is using N nodes, then there are N copies of your program on 
the cluster. (Yes, maybe you're using some quasi-SSI thing like Scyld or 
MOSIX, but as far as I know both of them still transfer the entire memory 
image over to the machine, and don't page things over as needed)

You can write a program that works exactly like an MPI program with 0 MPI
calls - wherever you'd write MPI_Send, just use BSD sockets and send things
that way. Tons more to do (you have to locate all the other processes in the
computation, you have to worry about buffering, failures, etc) but none of
it's unknown.
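
To make that concrete, here is a small sketch of a send/receive pair on
top of plain sockets, written in Python for brevity.  The function names
and the 4-byte length prefix are my own choices for illustration, not
anything MPI specifies, and process discovery and failure handling are
left out entirely.

```python
import socket
import struct


def send_msg(sock, payload):
    """Send one message: a 4-byte big-endian length, then the payload."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes; a stream socket may return them in pieces."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_msg(sock):
    """Receive one message framed by send_msg."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)
```

The length prefix is there because TCP delivers a byte stream, not
messages, which is one of the buffering details a message-passing
library otherwise hides from you.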

> >If you've got N nodes, submit N copies of SETI at home to your queuing system,
> >and your cluster will get an N times speedup over a single node. I don't
> see
> >how you can hope to do better than that.
> 
> I was aware of this possibility, but do not have the skills to implement it.

Yes you do.  Download Condor, or PBS, or Sun Grid Engine, or buy Platform LSF,
and:
A. Install it on N nodes
B. Submit N copies

or, install Scyld or MOSIX. Type:
my_program &

N times.
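
If you would rather script the "N times" part than type it, a minimal
Python sketch follows. The helper name run_copies is made up for this
example, and it only starts the processes; any spreading across nodes is
still up to Scyld, MOSIX or your queuing system.

```python
import subprocess
import sys


def run_copies(cmd, n):
    """Start n independent copies of cmd, wait for all, return their exit codes."""
    procs = [subprocess.Popen(cmd) for _ in range(n)]
    return [p.wait() for p in procs]
```

For example, run_copies(["./my_program"], 8) starts eight copies, and
run_copies([sys.executable, "-c", "pass"], 3) is a harmless way to try it.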

-Erik


From laytonjb at bellsouth.net  Thu Apr 11 13:33:58 2002
From: laytonjb at bellsouth.net (Jeff Layton)
Date: Thu, 11 Apr 2002 16:33:58 -0400
Subject: How do you keep clusters running....
References: <1D889E10.452B89F1.009FF3AE@netscape.net>
Message-ID: <3CB5F336.4AEFE8EA@bellsouth.net>

lightdee at netscape.net wrote:

> Doug J Nordwall wrote:
>
> >On Wed, 2002-04-03 at 13:04, Cris Rhea wrote:
> >
> >   What are folks doing about keeping hardware running on large clusters?
> >
> >    Right now, I'm running 10 Racksaver RS-1200's (for a total of 20 nodes)...
> >
> >    Sure seems like every week or two, I notice dead fans (each RS-1200
> >    has 6 case fans in addition to the 2 CPU fans and 2 power supply fans).
> >
> >
> >You running lm_sensors on your nodes? That's a handy tool for paying
> >attention to things like that. We use ours in combination with ganglia
> >and pump it to a web page and to big brother to see when a cpu might be
> >getting hot, or a fan might be too slow. We actually saved a dozen
> >machines that way...we have 32 4 processor racksaver boxes in a rack,
> >and the rack was not designed to handle racksaver's fan system. That is
> >to say, there was a solid sidewall on the rack, and it kept in heat. I
> >set up lm_sensors on all the nodes (homogenous, so configured on one and
> >pushed it out to all), then pumped the data into ganglia
> >(ganglia.sourceforge.net) and then to a web page. I noticed that the
> >temp on a dozen of the machines was extremely high. So, I took off the
> >side panel of the rack. The temp dropped by 15 C on all the nodes, and
> >everything was within normal parameters again.
> >
> >
> >    My last fan failure was a CPU fan that toasted the CPU and motherboard.
> >
> >
> >Ya, we would have seen this on ours earlier...excellent tool
>
> [snip]
>
> We use Clusterworx, which isn't open source (from Linux Networx), but it goes
> a step further than Ganglia.  It uses lm_sensors and a power control
> box (again from Linux Networx) to actually shut down a node if it is getting
> too hot, and the event parameters are all tweakable.  It's always a good
> idea to have some kind of cluster monitoring software installed, but it's
> nice to be able to set up event triggers in your software in case something
> goes wrong and you're not around.

You can set a shutdown temperature via the BIOS on most
decent motherboards. You can also easily script this up if
you have some power control unit connected to a node
that you can talk to (e.g. APC's stuff). All of the stuff you need
is available as open source. You can hook all of this together
with Ganglia if you want. In fact, Matt has announced (or hinted
at) the next version of Ganglia, which will start to have a number of
new features built in (but not nodal shutdown, if I remember
correctly).
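
To give an idea of how little scripting is involved, here is a minimal
Python sketch of one polling pass. The callables read_temps and power_off
are placeholders for whatever you actually have (lm_sensors output, a
Ganglia metric, a command sent to your power control unit); they are
assumptions for this example, not real APIs, and the 70 C limit is only a
sample value.

```python
def overheated(readings, limit_c=70.0):
    """Return the nodes whose reported temperature exceeds limit_c, sorted by name."""
    return sorted(node for node, temp in readings.items() if temp > limit_c)


def check_once(read_temps, power_off, limit_c=70.0):
    """One polling pass: read every node's temperature, power off anything too hot.

    read_temps() must return a dict {node_name: degrees_C};
    power_off(node_name) is whatever talks to your power control unit.
    """
    hot = overheated(read_temps(), limit_c)
    for node in hot:
        power_off(node)
    return hot
```

Run check_once from cron or a loop with a sleep, and log what it returns
so you know why a node went away.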

Jeff Layton


>
>
> ----
> David Henry
> Synergy Software, Inc.
> lightdee at netscape.net
>


From rgb at phy.duke.edu  Thu Apr 11 15:58:18 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Thu, 11 Apr 2002 18:58:18 -0400 (EDT)
Subject: (no subject)
In-Reply-To: <20020411162624.E27598@perdita.cs.wisc.edu>
Message-ID: <Pine.LNX.4.44.0204111845300.3369-100000@lucifer.rgb.private.net>

On Thu, 11 Apr 2002, Erik Paulson wrote:

> > >If you've got N nodes, submit N copies of SETI@home to your queuing system,
> > >and your cluster will get an N times speedup over a single node. I don't see
> > >how you can hope to do better than that.
> > 
> > I was aware of this possibility, but do not have the skills to implement it.
> 
> Yes you do.  Download Condor, or PBS, or Sun Grid Engine, or buy Platform LSF,
> and:
> A. Install it on N nodes
> B. Submit N copies
> 
> or, install Scyld or MOSIX. Type:
> my_program &
> 
> N times.

And not even for SETI will you get an Nx speedup on N nodes.  There is
ALWAYS a serial fraction even for embarrassingly parallel applications,
and the time required to send the jobs out to the nodes (relative to
just looping N times on the node) is part of it.  In Amdahl's Law N-fold
speedup is the upper bound, not the general, practical limit.
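
To put a number on this, the usual statement of Amdahl's Law for a serial
fraction s is speedup(N) = 1 / (s + (1 - s)/N).  A minimal Python sketch
follows (the function name is just for this example; s is whatever fraction
of the single-node run time cannot be split up, job dispatch included):

```python
def amdahl_speedup(serial_fraction, n_nodes):
    """Amdahl's Law: speedup on n_nodes when serial_fraction of the work is serial."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_nodes)
```

With only a 1% serial fraction, 8 nodes give a speedup of roughly 7.5
rather than 8, and no number of nodes can ever do better than 100.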

This is the basis of Eric's observation about embarrassingly parallel
jobs being ideal for clusters -- they're the ones that often get very
close to N-fold speedup on N nodes for nearly arbitrary N.  "Real"
parallel jobs (ones with nontrivial communications built on MPI or PVM
or raw sockets or even shared memory or some sort of specialized
communications channel) almost never do this well, and more often than
not will only speed up at all up to some maximum number of nodes and then
actually run more slowly if further partitioned.

It's also interesting that master-slave jobs were cited as being "real"
parallel applications as in many cases the master is nothing more than
an intelligent front end for an embarrassingly parallel application core.
What's the difference between using a script or Mosix or even a bunch of
rsh's as the "master" that distributes the jobs and collects the results
and using PVM to do exactly the same thing?  Not much, really, but
perhaps a small edge in network efficiency for that part of things.
This may matter -- if the jobs run a short time and communicate with the
master a long time it will matter -- but in cases where this paradigm
makes sense at all (where the ratio of run to communication is the other
way around -- lots of computation, a little communication) it won't
matter much.

Most of this is in any decent book on parallel computing, including at
least one that is freely available on the web.  Then there is my online
book (which I make no claim for being "decent", but it is free:-).  Lots
of these resources are on or linked to various cluster sites, including:

  http://www.phy.duke.edu/brahma

    rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From rickey-co at mug.biglobe.ne.jp  Thu Apr 11 15:38:24 2002
From: rickey-co at mug.biglobe.ne.jp (Iwao Makino)
Date: Fri, 12 Apr 2002 07:38:24 +0900
Subject: very high bandwidth, low latency manner?
In-Reply-To: <20020411132650.A32674@hpti.com>
References: <Pine.LNX.4.30.0204050252190.7535-100000@elin.scali.no>
 <v05010178b8dacb026736@[192.168.1.4]>
 <3CB5DA04.AF2C609F@lfbs.rwth-aachen.de> <20020411132650.A32674@hpti.com>
Message-ID: <v05010180b8dbbeafa789@[192.168.1.4]>

AFAIK, the 16-node configuration is the 'best buy' in per-node cost.
About $3,500 per node is a good guess for 16 nodes, including software
licensing and hardware (switch, cards and cables). Actually, 16 nodes cost
a little less.

64/128 nodes are in about that range too.  For larger systems
(256/512/1024 and beyond), the general idea is $5,000/node.

These are their offering prices, so I assume some volume discounts are applied
at larger scales.

And yes, we should be able to purchase from vendors other than Compaq, but I
haven't got an answer on those details yet.

At 13:26 -0600 11.04.2002, Craig Tierney wrote:
>I talked to a guy at SC2002 from Quadrics and he said
>that list pricing on a Quadrics network was about $3500
>per node when you are in the 100s of nodes and up.
>The price includes the cards, cables, switches,
>etc.  This doesn't include any sort of discount that you
>might get.  Myrinet is about $2000 for an equivalent
>network at list price.   Dolphin/SCI falls around $2245 list
>per node (if the system is > 144 nodes and you have to get
>the 3d card).

Dolphin/SCI for smaller systems (<144 nodes?) is from $1,695 and
larger ones with 3D chain from $2,245 list.
I haven't tested this new 3D version yet.

>I heard that Quadrics had a customer that just had to have
>an Intel/Quadrics system so either they or he was working
>on porting the drivers.   The web page says they support
>Linux and Tru64.  You could probably get the hardware without
>going through Compaq, but Compaq is most likely buying up
>most of the supply.

I know they work on ServerWorks HE and i860 Xeon; they are also
working on Plumas and GC-LE.

-- 

Best regards,

Iwao Makino
Hard Data Ltd. Tokyo branch
mailto:iwao at harddata.com
http://www.harddata.com/

--> Now Shipping 1U Dual Athlon DDR <-
--> Ask me about the new Alpha DDR UP1500 Systems  <-


From emiller at techskills.com  Thu Apr 11 16:52:04 2002
From: emiller at techskills.com (Eric Miller)
Date: Thu, 11 Apr 2002 19:52:04 -0400
Subject: (no subject)
In-Reply-To: <Pine.LNX.4.44.0204111845300.3369-100000@lucifer.rgb.private.net>
Message-ID: <NMELJLHHFNGMNFFNGJAEMEGCCEAA.emiller@techskills.com>

<snip>
>And not even for SETI will you get an Nx speedup on N nodes.  There is
>ALWAYS a serial fraction even for embarrassingly parallel applications,
>and the time required to send the jobs out to the nodes
<Snip>

I guess what I am fundamentally saying is, for a cluster to be "working its
magic" in a student's eyes, consider two scenarios:

1- Running N iterations of a program, and seeing work*N being done.  It's
like, um... well yeah if I run SETI on 8 systems, then I will crunch 8 times
as many units, but I will NOT crunch 1 unit in 1/8th the time as perceived
on the front end node.

-OR-

2- Having an 8 node cluster running, say, a raytracer.  Then, having a solo
machine running the same application.  Actually seeing ONE instance render
an image (roughly) 8 times faster than a single system (esp. when all of the
systems were pulled out of the trash can!!), THAT is the magic that newbies
and students want to see.  That's the "cool" factor that brings annoying
#^$%s like me to this forum to post questions that are outside the arena of
analyzing proteins and DNA molecules on 256 node AthlonXP rackmounts with
Myrinet.  We are not experts, we have A LOT of questions, and all we want to
do is see Linux do something cool that we can show our
friends/students/selves.

Robert, thank you for your positive and informative reply.


From sp at scali.com  Thu Apr 11 20:29:07 2002
From: sp at scali.com (Steffen Persvold)
Date: Fri, 12 Apr 2002 05:29:07 +0200 (CEST)
Subject: very high bandwidth, low latency manner?
In-Reply-To: <20020411132650.A32674@hpti.com>
Message-ID: <Pine.LNX.4.30.0204120524590.10585-100000@elin.scali.no>

On Thu, 11 Apr 2002, Craig Tierney wrote:

>
> I talked to a guy at SC2002 from Quadrics and he said
> that list pricing on a Quadrics network was about $3500
> per node when you are in the 100s of nodes and up.
> The price includes the cards, cables, switches,
> etc.  This doesn't include any sort of discount that you
> might get.  Myrinet is about $2000 for an equivalent
> network at list price.   Dolphin/SCI falls around $2245 list
> per node (if the system is > 144 nodes and you have to get
> the 3d card).
>
>

These are list prices for the cards only, right? What about the switches
needed? AFAIK Quadrics and Myrinet both need switches, SCI doesn't (which
makes the total system cost a bit lower, doesn't it?).

Regards,
-- 
  Steffen Persvold   | Scalable Linux Systems |   Try out the world's best
 mailto:sp at scali.com |  http://www.scali.com  | performing MPI implementation:
Tel: (+47) 2262 8950 |   Olaf Helsets vei 6   |      - ScaMPI 1.13.8 -
Fax: (+47) 2262 8951 |   N0621 Oslo, NORWAY   | >320MBytes/s and <4uS latency


From sp at scali.com  Thu Apr 11 20:37:17 2002
From: sp at scali.com (Steffen Persvold)
Date: Fri, 12 Apr 2002 05:37:17 +0200 (CEST)
Subject: very high bandwidth, low latency manner?
In-Reply-To: <v05010178b8dacb026736@[192.168.1.4]>
Message-ID: <Pine.LNX.4.30.0204120530500.10585-100000@elin.scali.no>

On Thu, 11 Apr 2002, Iwao Makino wrote:

> I think ... Quadrics<http://www.quadrics.com/> is another one.
>
Yep, sorry I forgot that one.

> Here's quick figures I have on hand....
>
> RH7.2, 2.4.9 kernel for i860 cluster.
> On their site, they claim;
> after protocol, of 340Mbytes/second in each direction. The
> process-to-process latency for remote write operations is 2us, and 5us for
> MPI messages.
>
But this 340MBytes/second and 2us latency is also chipset dependent, as I
mentioned for SCI (in my examples latency was lowest on 760MPX but
bandwidth was highest on IA64 460GX...). I can't imagine that the i860 can
actually perform as well as 340MByte/sec since the Hub-Link (between
the MCH and the P64H) has a limit of 266MByte/sec (AFAIK) ....

> But pricing is MUCH higher than SCI/Myrinet.
>

Certainly.


> Best regards,
>
> At 4:08 +0200 5.04.2002, Steffen Persvold wrote:
> >On Thu, 4 Apr 2002, Jim Lux wrote:
> >
> >>  What's high bandwidth?
> >>  What's low latency?
> >>  How much money do you want to spend?
> >I don't want to start a flamewar here, but I _think_ (not knowing real
> >numbers for other high speed interconnects) that SCI has at least the
> >lowest latency and maybe also the highest point to point bandwidth :
> >
> >SCI application to application latency   : 2.5 us
> >SCI application to application bandwidth : 325 MByte/sec
> >
> >Note that these numbers are very chipset specific (as most high speed
> >interconnect numbers are), these numbers are from IA64. Here are numbers
> >from a popular IA32 platform, the AMD 760MPX :
> >
> >SCI application to application latency   : 1.8 us
> >SCI application to application bandwidth : 283 MByte/sec
>
>

Regards,
-- 
  Steffen Persvold   | Scalable Linux Systems |   Try out the world's best
 mailto:sp at scali.com |  http://www.scali.com  | performing MPI implementation:
Tel: (+47) 2262 8950 |   Olaf Helsets vei 6   |      - ScaMPI 1.13.8 -
Fax: (+47) 2262 8951 |   N0621 Oslo, NORWAY   | >320MBytes/s and <4uS latency


From justin at cs.duke.edu  Thu Apr 11 20:54:16 2002
From: justin at cs.duke.edu (Justin Moore)
Date: Thu, 11 Apr 2002 23:54:16 -0400 (EDT)
Subject: DHCP Help Again
In-Reply-To: <Pine.LNX.4.44.0204111308060.1141-100000@ganesh.phy.duke.edu>
Message-ID: <Pine.GSO.4.43.0204112345030.111-100000@hopi.cs.duke.edu>

Hello all,
   Part of a project I've been working on deals some with boot management
and DHCP specifically.  I'm not familiar with the NPACI solution/codebase,
but I hacked a version of proxydhcp to work with a MySQL backend.  It has
some nice hooks in it which let you know if the machine is booting PXE or
booting from dhclient/pump/whatever.  I think having the DB backend is a
little nicer than having to worry about the leases file (XML or not) since
it gives you more fine-grained control over who has access to the
information and how that information gets parsed.  Plus it can detect when
a new host is coming up and add a mapping to the DB without requiring you
to parse through /var/log/messages. :)

   Obviously my code has some parts which are somewhat project-specific
for me (I don't think everyone wants to boot off the same ramdisk I do by
default :)) but I could post the code in a few weeks (deadlines coming up)
if anyone's interested in such a beast.

   Another nice part of the DB backend is that generating a future
dhcpd.conf file is pretty easy:

mysql_query("SELECT HWaddr,IPaddr FROM nics ORDER BY IPaddr");

and then spew the output to a file as desired. :)
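
A minimal Python sketch of the "spew the output" step is below. It assumes
the query result has already been fetched as (HWaddr, IPaddr) pairs; the
function names and the b01, b02, ... host naming are made up for this
example, and only the hardware ethernet and fixed-address lines are written.

```python
def host_entry(name, hwaddr, ipaddr):
    """Format one dhcpd.conf host block for a known MAC/IP pair."""
    return (
        f"host {name} {{\n"
        f"        hardware ethernet       {hwaddr};\n"
        f"        fixed-address           {ipaddr};\n"
        f"}}\n"
    )


def render_hosts(rows, prefix="b"):
    """rows: iterable of (HWaddr, IPaddr) pairs, already ordered by IP address."""
    return "\n".join(
        host_entry(f"{prefix}{i:02d}", hw, ip)
        for i, (hw, ip) in enumerate(rows, start=1)
    )
```

Writing the result to a scratch file and checking it before replacing the
live dhcpd.conf is probably the safer habit.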

-jdm

Department of Computer Science, Duke University, Durham, NC 27708-0129
Email:  justin at cs.duke.edu

On Thu, 11 Apr 2002, Robert G. Brown wrote:

> On Wed, 10 Apr 2002 tegner at nada.kth.se wrote:
>
> > Very helpful! Thanks!
> >
> > But I'm still curious about how you make - automagically - the hardware ethernet
> > line in dhcpd.conf initially. Say you have 100 machines. One way I would think
> > of would be to use kickstart and:
> >
> > Install the machines and boot them up in sequence and using the range statement
> > in dhcpd.conf (so that the first machine gets 192.168.1.101, the second
> > 192.168.1.102 ...)
> >
> > Once all nodes are up use some script to extract the mac addresses for all the
> > nodes and either modify dhcpd.conf - or - discard of dhcp completely and
> > hardwire the ip-addresses to each node.
> >
> > But I'm sure there are better ways to do this?
>
> Not that I know of.  Maybe somebody else knows of one.  I'd just use
> perl or bash (either would probably work, although parsing is generally
> easier in perl), parse e.g.
>
> Apr 11 08:18:09 lucifer dhcpd: DHCPREQUEST for 192.168.1.140 from 00:20:e0:6d:a0:05 via eth0
> Apr 11 08:18:09 lucifer dhcpd: DHCPACK on 192.168.1.140 to 00:20:e0:6d:a0:05 via eth0
>
> from /var/log/messages on the dhcp server, and write an output routine
> to generate
>
> # golem (Linux/Windows laptop lilith, second/100BT interface)
> host golem {
>         hardware ethernet       00:20:e0:6d:a0:05;
>         fixed-address           192.168.1.140;
>         next-server             192.168.1.131;
>         option routers          192.168.1.1;
>         option domain-name      "rgb.private.net";
>         option host-name        "golem";
> }
>
> and
>
> 192.168.1.140   golem.rgb.private.net   golem
>
> and append them to /etc/dhcpd.conf and /etc/hosts respectively, and then
> distribute copies of the resulting /etc/hosts -- as Josip made
> eloquently clear your private internal network should resolve
> consistently on all PIN hosts and probably should have SOME sort of
> domainname defined so that software that might include a
> getdomainbyname() call and might not include an adequate check and
> handle of a null value can cope.  It's hard to know what assumptions
> were made by the designer of every single piece of network software you
> might want to run...
>
> Of course you'll probably want to do the b01, b02, b03... hostname
> iteration -- I'm just pulling an example at random out of my own log
> tables.
>
>    rgb
>
> --
> Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
> Duke University Dept. of Physics, Box 90305
> Durham, N.C. 27708-0305
> Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
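
The log-parsing approach quoted above could be sketched in Python roughly
as follows. The regular expression is written against the two sample
DHCPREQUEST/DHCPACK lines in the quoted message only; other dhcpd versions
may log differently, so treat the pattern and the function name as
assumptions for this example.

```python
import re

# Matches e.g. "dhcpd: DHCPACK on 192.168.1.140 to 00:20:e0:6d:a0:05 via eth0"
ACK = re.compile(
    r"DHCPACK on (?P<ip>\d+\.\d+\.\d+\.\d+) "
    r"to (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})",
    re.IGNORECASE,
)


def leases_from_log(lines):
    """Map MAC address -> IP address from DHCPACK lines; later lines win."""
    leases = {}
    for line in lines:
        m = ACK.search(line)
        if m:
            leases[m.group("mac").lower()] = m.group("ip")
    return leases
```

The resulting mapping can then be turned into host entries for
/etc/dhcpd.conf and lines for /etc/hosts.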


From hartner at cs.utah.edu  Thu Apr 11 20:55:00 2002
From: hartner at cs.utah.edu (Mark Hartner)
Date: Thu, 11 Apr 2002 21:55:00 -0600 (MDT)
Subject: (no subject)
In-Reply-To: <NMELJLHHFNGMNFFNGJAEMEGCCEAA.emiller@techskills.com>
Message-ID: <Pine.LNX.4.21.0204112150200.27855-100000@famine.cs.utah.edu>

> analyzing proteins and DNA molecules on 256 node AthlonXP rackmounts with
> Myrinet.  We are not experts, we have A LOT of questions, and all we want to
> do is see Linux do something cool that we can show our
> friends/students/selves.

How about encoding some mp3's
www.osl.ui.edu/~jsquyres/bladeenc/

Mark


From patrick at myri.com  Thu Apr 11 22:40:08 2002
From: patrick at myri.com (Patrick Geoffray)
Date: Fri, 12 Apr 2002 01:40:08 -0400
Subject: very high bandwidth, low latency manner?
References: <Pine.LNX.4.30.0204120524590.10585-100000@elin.scali.no>
Message-ID: <3CB67338.8080906@myri.com>

Steffen Persvold wrote:

>>I talked to a guy at SC2002 from Quadrics and he said
>>that list pricing on a Quadrics network was about $3500
>>per node when you are in the 100s of nodes and up.
>>The price includes the cards, cables, switches,
>>etc.  This doesn't include any sort of discount that you
>>might get.  Myrinet is about $2000 for an equivalent
>>network at list price.   Dolphin/SCI falls around $2245 list
>>per node (if the system is > 144 nodes and you have to get
>>the 3d card).
 >
> This is list prices for the cards only, right ? 

Not for Myrinet. Actually $2000 per node is the total cost 
(NIC/cable/port/software) for the high-end products (with L9/200 MHz), 
should be more like $1500 for low-end ones. Craig is spoiled, only buys 
the top stuff :-)

> What about the switches
> needed. AFAIK Quadrics and Myrinet both need switches, SCI don't (which
> makes the total system cost a bit lower doesn't it ?).

Dunno for QSW, but the NIC represents roughly 3/4 of the price per node 
for Myrinet. Sure, as the smallest switch has 8 ports (16 ports chassis 
and one blade with 8 fibers), it is not interesting for very small 
configurations, i.e. less than 8 nodes, but I don't think it's Myricom's 
market.

It's a common mistake to believe that switchless solutions are by 
definition cheaper.

Patrick

----------------------------------------------------------
|   Patrick Geoffray, Ph.D.      patrick at myri.com
|   Myricom, Inc.                http://www.myri.com
|   Cell:  865-389-8852          685 Emory Valley Rd (B)
|   Phone: 865-425-0978          Oak Ridge, TN 37830
----------------------------------------------------------


From manel at labtie.mmt.upc.es  Fri Apr 12 01:53:09 2002
From: manel at labtie.mmt.upc.es (Manel Soria)
Date: Fri, 12 Apr 2002 10:53:09 +0200
Subject: power control
Message-ID: <3CB6A075.5AE6626@labtie.mmt.upc.es>

We need a power control unit for our 72-node cluster. My first
idea was to do it ourselves with a digital I/O card and a set of relays, but
I can't find such a card for Linux.  Actually, I have an ISA card that
is perfect for this application, but for some reason it is more
difficult with the PCI bus. Also, it seems that the "normal" solution is to
buy a commercial APC system.

Any experiences with in-house made power controls? Would you
recommend that we buy the APC product?

--
===============================================
Dr. Manel Soria
ETSEIT - Centre Tecnologic de Transferencia de Calor
C/ Colom 11  08222 Terrassa (Barcelona) SPAIN
Tf:  +34 93 739 8287 ; Fax: +34 93 739 8101
E-Mail: manel at labtie.mmt.upc.es


From suraj_peri at yahoo.com  Fri Apr 12 03:16:15 2002
From: suraj_peri at yahoo.com (Suraj Peri)
Date: Fri, 12 Apr 2002 03:16:15 -0700 (PDT)
Subject: What could be the performance of my cluster 
Message-ID: <20020412101615.77502.qmail@web10507.mail.yahoo.com>

Hi group, 
I was calculating the performance of my cluster. The
features are 

1. 8 nodes
2. Processor: AMD Athlon XP 1800+
3. 8 CPUs
4. 8*1.5 GB DDR RAM
5. 1 Server with 2 processors with AMD MP 1800+ and
2GB DDR RAM

I calculated this to be 48 Mflops. Is this correct?
If not, what is the correct performance of my cluster?
I also calculated, by comparison, that my cluster would
be 3 times faster than an AlphaServer DS20E (833 MHz
Alpha 64-bit processor, 4 GB max memory).

Is my calculation correct or wrong? Please help me
ASAP. Thanks in advance.

cheers
suraj.

=====
PIL/BMB/SDU/DK




From suraj_peri at yahoo.com  Fri Apr 12 03:32:34 2002
From: suraj_peri at yahoo.com (Suraj Peri)
Date: Fri, 12 Apr 2002 03:32:34 -0700 (PDT)
Subject: What could be the performance of my cluster
In-Reply-To: <20020411125920.D32605@hpti.com>
Message-ID: <20020412103234.7580.qmail@web10508.mail.yahoo.com>

Hi Craig, 
Many thanks for your mail. Please excuse me for asking
a dumb question; I am a novice in this area.
I am interested in using this cluster for BLAST
purposes. I want to store ESTs (Expressed Sequence
Tags), GenBank (nucleotide sequence database),
GenPept (protein sequence database) and the total
predicted protein sets of the human genome.
I will use BLAST (Basic Local Alignment Search Tool
algorithm) on this cluster, as the computations are
intensive and time consuming.
So I wanted to compare the AlphaServer DS20E and my
cluster in their computing abilities.
No one in my circle of friends knows about
this. Please help me if you have used clusters for
BLAST purposes.
thanks
Suraj. 

--- Craig Tierney <ctierney at hpti.com> wrote:
> It depends on what you are trying to do (doesn't everyone
> love that answer).
> 
> The number of flops your cluster can do should be equal to:
> 
> flops = (no. of cpus) * (clock rate) * (flops per cycle)
> 
> So for your cluster
> 
> flops = 8 * 1.53 GHz * 2
> 
> I am assuming that with SSE you can get 2 flops per cycle.
> 
> flops = 24.48 Gflops
> 
> Now, there are some issues with this.  First, you are never
> going to get 1.53*2 Gflops out of a single processor.  Second,
> leveraging all 8 cpus to get their maximum is going to be
> difficult if there is any communication between the nodes.
> 
> Compilers play a big role in extracting the best performance
> out of the system.  If you don't have a commercial compiler
> from the likes of Intel or Portland Group, I highly recommend
> getting one.  You only have to purchase the compiler for where
> you compile, and not where you run.  You can get away with
> one copy of the compiler on your server.
> 
> If you are trying to compare the AMD system to the DS20E system,
> it will depend on what you are actually trying to do.  If
> you are running single precision floating point codes that do
> not require all the memory bandwidth a DS20E provides, I would
> think that, to within 10%, that AMD processor will do the work
> of one 833 MHz Alpha CPU (you didn't say if you had 2 CPUs
> in your DS20E).  At least this is what I am seeing
> for my codes when comparing dual Xeons, dual AMDs, and
> dual API 833 boxes.
> 
> Craig
> 
> 
> 
> 
> 
> On Sat, Apr 06, 2002 at 03:35:45AM -0800, Suraj Peri wrote:
> > Hi group, 
> > I was calculating the performance of my cluster. The
> > features are 
> > 
> > 1. 8 nodes
> > 2. Processor: AMD Athlon XP 1800+
> > 3. 8 CPUs
> > 4. 8*1.5 GB DDR RAM
> > 5. 1 Server with 2 processors with AMD MP 1800+ and
> > 2GB DDR RAM
> > 
> > I calculated this to be 48 Mflops . Is this correct ?
> > if not, what is the correct performance of my cluster.
> > I also comparatively calculated that my cluster would
> > be 3 times faster than AlphaServer DS20E ( 833 MHz
> > alpha 64 bit processor, 4 GB max memory)
> > 
> > Is my calculation correct or wrong? please help me
> > ASAP. thanks in advance.
> > 
> > cheers
> > suraj.
> > 
> > =====
> > PIL/BMB/SDU/DK
> > 
> 
> -- 
> Craig Tierney (ctierney at hpti.com)


=====
PIL/BMB/SDU/DK



From rickey-co at mug.biglobe.ne.jp  Thu Apr 11 20:54:59 2002
From: rickey-co at mug.biglobe.ne.jp (Iwao Makino)
Date: Fri, 12 Apr 2002 12:54:59 +0900
Subject: very high bandwidth, low latency manner?
In-Reply-To: <Pine.LNX.4.30.0204120524590.10585-100000@elin.scali.no>
References: <Pine.LNX.4.30.0204120524590.10585-100000@elin.scali.no>
Message-ID: <v05010195b8dc0a1d6063@[192.168.1.4]>

At 5:29 +0200 12.04.2002, Steffen Persvold wrote:
>On Thu, 11 Apr 2002, Craig Tierney wrote:
>
>>
>>  I talked to a guy at SC2002 from Quadrics and he said
>>  that list pricing on a Quadrics network was about $3500
>>  per node when you are in the 100s of nodes and up.
>>  The price includes the cards, cables, switches,
>>  etc.  This doesn't include any sort of discount that you
>>  might get.  Myrinet is about $2000 for an equivalent
>>  network at list price.   Dolphin/SCI falls around $2245 list
>>  per node (if the system is > 144 nodes and you have to get
>>  the 3d card).
>>
>>
>
>This is list prices for the cards only, right ? What about the switches
>needed. AFAIK Quadrics and Myrinet both need switches, SCI don't (which
>makes the total system cost a bit lower doesn't it ?).

As said above, the QsNet per-node price includes everything.
So does Myrinet's, so SCI/Dolphin and Myrinet are about equal.

SCI has the good idea of not using switches, but on the other hand, it is
a little more complex to connect.
-- 

Best regards,

Iwao Makino
Hard Data Ltd. Tokyo branch
mailto:iwao at harddata.com
http://www.harddata.com/

--> Now Shipping 1U Dual Athlon DDR <-
--> Ask me about the new Alpha DDR UP1500 Systems  <-


From sp at scali.com  Fri Apr 12 06:10:34 2002
From: sp at scali.com (Steffen Persvold)
Date: Fri, 12 Apr 2002 15:10:34 +0200 (CEST)
Subject: very high bandwidth, low latency manner?
In-Reply-To: <v05010195b8dc0a1d6063@[192.168.1.4]>
Message-ID: <Pine.LNX.4.30.0204121455250.16089-100000@elin.scali.no>

On Fri, 12 Apr 2002, Iwao Makino wrote:

> At 5:29 +0200 12.04.2002, Steffen Persvold wrote:
> >On Thu, 11 Apr 2002, Craig Tierney wrote:
> >
> >>
> >>  I talked to a guy at SC2002 from Quadrics and he said
> >>  that list pricing on a Quadrics network was about $3500
> >>  per node when you are in the 100s of nodes and up.
> >>  The price includes the cards, cables, switches,
> >>  etc.  This doesn't include any sort of discount that you
> >>  might get.  Myrinet is about $2000 for an equivalent
> >>  network at list price.   Dolphin/SCI falls around $2245 list
> >>  per node (if the system is > 144 nodes and you have to get
> >>  the 3d card).
> >>
> >>
> >
> >This is list prices for the cards only, right ? What about the switches
> >needed. AFAIK Quadrics and Myrinet both need switches, SCI don't (which
> >makes the total system cost a bit lower doesn't it ?).
>
> As said above, the QsNet per-node price includes everything.
> So does Myrinet's, so SCI/Dolphin and Myrinet are about equal.
>

Yes, sorry I missed that statement :)

> SCI has the good idea of not using switches, but on the other hand, it is
> a little more complex to connect.
>

True, but if one of your Myrinet switches breaks down you lose 64 nodes
in a 256 node system (standard "CLOS" configuration). I don't know the
MTBF for Myrinet switches, but I would expect it to be rather high
(redundant power supplies ?).

Please don't misunderstand me, I find the Myrinet interconnect very
interesting and also competitive with SCI both from a technological point
of view and wrt. pricing. The only thing this list is lacking is some head
to head performance comparisons of the different interconnects, e.g. some
NAS benchmarks and maybe also PMB.

Regards,
-- 
  Steffen Persvold   | Scalable Linux Systems |   Try out the world's best
 mailto:sp at scali.com |  http://www.scali.com  | performing MPI implementation:
Tel: (+47) 2262 8950 |   Olaf Helsets vei 6   |      - ScaMPI 1.13.8 -
Fax: (+47) 2262 8951 |   N0621 Oslo, NORWAY   | >320MBytes/s and <4uS latency


From sp at scali.com  Fri Apr 12 06:15:37 2002
From: sp at scali.com (Steffen Persvold)
Date: Fri, 12 Apr 2002 15:15:37 +0200 (CEST)
Subject: power control
In-Reply-To: <3CB6A075.5AE6626@labtie.mmt.upc.es>
Message-ID: <Pine.LNX.4.30.0204121511230.16089-100000@elin.scali.no>

On Fri, 12 Apr 2002, Manel Soria wrote:

>
> We need a power control unit for our 72-node cluster. My first
> idea was to do it ourselves with a digital i/o card and a set of relays, but
> I can't find such a card for Linux.  Actually, I have an ISA card that
> is perfect for this application but for some reason with PCI bus it
> is more difficult. Also, it seems that the "normal" solution is to buy a
> commercial APC system.
>
> Any experiences with in-house made power controls ? Would you
> recommend that we buy the APC product ?
>

We've had success with the Baytech (www.baytechdcd.com) RPC-3 units. The
only disadvantage is that they only have 8 controllable ports (which
means that you need 9 of them...).
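
The count of nine is plain ceiling division: 72 nodes at 8 switched
outlets per unit. A minimal Python sketch of that arithmetic (the
function name is made up for illustration):

```python
import math

def units_needed(nodes, ports_per_unit):
    """Smallest number of power units whose switched outlets cover all nodes."""
    return math.ceil(nodes / ports_per_unit)

# 72 nodes and 8 controllable outlets per unit are the figures from the message.
print(units_needed(72, 8))
```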

Regards,
 --
  Steffen Persvold   | Scalable Linux Systems |   Try out the world's best
 mailto:sp at scali.com |  http://www.scali.com  | performing MPI implementation:
Tel: (+47) 2262 8950 |   Olaf Helsets vei 6   |      - ScaMPI 1.13.8 -
Fax: (+47) 2262 8951 |   N0621 Oslo, NORWAY   | >320MBytes/s and <4uS latency


From rgb at phy.duke.edu  Fri Apr 12 06:17:57 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Fri, 12 Apr 2002 09:17:57 -0400 (EDT)
Subject: (no subject)
In-Reply-To: <NMELJLHHFNGMNFFNGJAEMEGCCEAA.emiller@techskills.com>
Message-ID: <Pine.LNX.4.44.0204120855440.3369-100000@lucifer.rgb.private.net>

On Thu, 11 Apr 2002, Eric Miller wrote:

> Myrinet.  We are not experts, we have A LOT of questions, and all we want to
> do is see Linux do something cool that we can show our
> friends/students/selves.
> 
> Robert, thank you for your positive and informative reply.

I appreciate your interest and understand your goals, actually; whenever
I present beowulfery to a new group (which I seem to do three or four
times a year) I do exactly the same thing -- a bit of a dog and pony
show.  In addition to the pvm povray, I like the pvm mandelbrot set demo
(xep) which I've hacked so the colormap is effectively deeper and so
that it doesn't run out of floating point room so rapidly.  I've been
using or playing with mandelbrot set demo programs long enough that I
can remember when it would take a LONG time to update a single
rubberbanded section.  

Nowadays one can quickly enough get to the bottom of double precision
resolution even on a single CPU -- 13 digits isn't really all that many
when you rubberband down close to an order of magnitude at a time.
Still, with even a small cluster you can get nearly linear speedup and
actually "see" the nodes returning their independent strips -- if you
have a mix of "slow" nodes and faster ones you can even learn some useful
things about parallel programming just watching them come in and
discussing what you see.
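
A rough way to see how quickly the zooming runs out of precision is to
count zoom steps until neighbouring pixels can no longer be told apart
in double precision. The sketch below assumes a view 1000 pixels wide
that starts 4 units across, shrinks tenfold per rubberband, and is
centred on a point of magnitude about 1. Those starting values and the
function name are illustrative assumptions, not taken from xep.

```python
import sys

def max_zooms(width, pixels, center_magnitude=1.0, factor=10.0):
    """Count zoom steps before adjacent pixels collapse onto the same double."""
    # Neighbouring pixels stay distinct only while their spacing exceeds
    # (roughly) the gap between adjacent doubles near the centre coordinate.
    resolution = sys.float_info.epsilon * center_magnitude
    zooms = 0
    while (width / factor) / pixels > resolution:
        width /= factor
        zooms += 1
    return zooms

print(max_zooms(4.0, 1000))
```

With gentler zoom factors the same budget of digits simply lasts for
more steps; the limit on total magnification does not move.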

The only point I was making is that your class should definitely take
the time to go over at least Amdahl's law and one of the improved
estimates that account for both the serial fraction and the
communications time, and get some understanding of the

  embarrassingly parallel (SETI, distributed Monte Carlo)
    -> coarse grained, non-synchronous (pvmpov, xep)
    -> coarse grained, synchronous (lattice partitioned Monte Carlo)
    -> medium-to-fine grained, (non-)synchronous (galactic evolution,
       weather models)

sequencing where for each step up the chain one has to exercise
additional care in engineering an effective cluster to deal with it.  EP
chores (as Eric pointed out) are "perfect" for a cluster because "any"
cluster or parallel computer including the simplest SMP boxes will do.
Coarse grained tasks will also generally run well on a "standard" linux
cluster -- a bunch of boxes on a network, where the kind of network and
whether the boxes are workstations, desktops in active use, or dedicated
nodes doesn't much matter.  When you hit synchronous tasks in general,
but especially the finer grained synchronous tasks (tasks where all
nodes have to complete a parallel computation sequence -- reach a
"barrier" -- and then exchange information before beginning the next
parallel computation sequence) then you really have to start paying
attention to the network (latency and bandwidth both), it helps to have
dedicated nodes that AREN'T doing double duty as workstations (since the
rate of progress is determined by the slowest node), and most of these
tasks have a strict upper bound on the number of nodes that one can
assign to a task and still decrease the time of completion.

This last point is a very important one.  It is easy to see a coarse
grained task speed up N-fold on N nodes and conclude that all problems
can then be solved faster if we just add more nodes.  Make sure that
your students see that this is not so, so that if they ever DO engineer
a compute cluster to accomplish some particular task, they don't just
buy lots of nodes, but instead do the arithmetic first...
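
As one way of doing that arithmetic, here is a minimal Python sketch.
It compares plain Amdahl's law with an estimate that adds a
communication term. The linear form of that term (a fixed cost per
extra node) and all of the timings are assumptions chosen for
illustration; real codes need a measured model.

```python
import math

def amdahl_speedup(serial_time, parallel_time, nodes):
    """Classic Amdahl's law: only the parallel part shrinks with node count."""
    return (serial_time + parallel_time) / (serial_time + parallel_time / nodes)

def speedup_with_comm(serial_time, parallel_time, comm_per_node, nodes):
    """Amdahl plus an assumed communication cost linear in the extra nodes."""
    one_node = serial_time + parallel_time
    many_nodes = (serial_time + parallel_time / nodes
                  + comm_per_node * (nodes - 1))
    return one_node / many_nodes

def best_node_count(parallel_time, comm_per_node):
    """Continuous estimate of the node count that minimises run time."""
    # d/dN (parallel_time / N + comm_per_node * N) = 0  =>  N = sqrt(Tp / Tc)
    return math.sqrt(parallel_time / comm_per_node)

# Hypothetical task: 10 s serial, 990 s parallelisable, 0.1 s per extra node.
ts, tp, tc = 10.0, 990.0, 0.1
for n in (1, 8, 64, 99, 512):
    print(n, round(amdahl_speedup(ts, tp, n), 2),
          round(speedup_with_comm(ts, tp, tc, n), 2))
print("best node count is about", round(best_node_count(tp, tc)))
```

Under these assumptions the second column keeps creeping upward with
more nodes, while the third peaks near the estimated best node count
and then falls, which is the upper bound described above.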

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From bob at drzyzgula.org  Fri Apr 12 06:43:31 2002
From: bob at drzyzgula.org (Bob Drzyzgula)
Date: Fri, 12 Apr 2002 09:43:31 -0400
Subject: power control
In-Reply-To: <3CB6A075.5AE6626@labtie.mmt.upc.es>
References: <3CB6A075.5AE6626@labtie.mmt.upc.es>
Message-ID: <20020412094331.I20839@www2>

We have had good luck with the Pulizzi "Z-line" controllers:
http://www.pulizzi.com/

Their high-end units can be networked into an RS-485 chain,
so that dozens of units can be controlled from a single
serial interface.

--BOb

On Fri, Apr 12, 2002 at 10:53:09AM +0200, Manel Soria wrote:
> 
> We need a power control unit for our 72-node cluster. My first
> idea was to do it ourselves with a digital i/o card and a set of relays, but
> I can't find such a card for Linux.  Actually, I have an ISA card that
> is perfect for this application but for some reason with PCI bus it
> is more difficult. Also, it seems that the "normal" solution is to buy a
> commercial APC system.
> 
> Any experiences with in-house made power controls ? Would you
> recommend that we buy the APC product ?
> 
> --
> ===============================================
> Dr. Manel Soria
> ETSEIT - Centre Tecnologic de Transferencia de Calor
> C/ Colom 11  08222 Terrassa (Barcelona) SPAIN
> Tf:  +34 93 739 8287 ; Fax: +34 93 739 8101
> E-Mail: manel at labtie.mmt.upc.es
> 
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


From timm at fnal.gov  Fri Apr 12 07:19:07 2002
From: timm at fnal.gov (Steven Timm)
Date: Fri, 12 Apr 2002 09:19:07 -0500 (CDT)
Subject: power control
In-Reply-To: <3CB69701.B9AC38C5@labtie.mmt.upc.es>
Message-ID: <Pine.LNX.4.31.0204120916410.1915-100000@snowball.fnal.gov>

We run with the APC units on our cluster.  We use the vertical-mount
strips which are good for 20 amps apiece.  They provide three major
benefits.  One is that you can sequence the power up to have a delay
so that the power draw of all systems coming on at once doesn't
saturate your circuit.  Second is that you can remotely reset a node
that is hung and not responding without going into the computer room.
Third is that you get a real-time monitor of how much current your
systems are drawing.
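
The benefit of sequenced power-up can be sketched with a toy model in
Python. It assumes each node draws a brief inrush current at switch-on
and a lower steady current afterwards, and that the delay between
nodes is longer than the inrush lasts. Only the 20 amp rating comes
from the message; the node count and both currents are hypothetical.

```python
def peak_simultaneous(nodes, inrush_amps):
    """Every supply charges at once, so the inrush currents add up."""
    return nodes * inrush_amps

def peak_staggered(nodes, inrush_amps, steady_amps):
    """Worst case: the last node starts while all the others already run."""
    return (nodes - 1) * steady_amps + inrush_amps

CIRCUIT_AMPS = 20    # strip rating quoted in the message
NODES = 8            # hypothetical
INRUSH_AMPS = 8.0    # hypothetical
STEADY_AMPS = 1.5    # hypothetical

print("all at once:", peak_simultaneous(NODES, INRUSH_AMPS), "A")
print("sequenced:  ", peak_staggered(NODES, INRUSH_AMPS, STEADY_AMPS), "A")
```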

They are somewhat expensive but do have good discounts for educational
institutions and non-profits.

Steve Timm


------------------------------------------------------------------
Steven C. Timm (630) 840-8525  timm at fnal.gov  http://home.fnal.gov/~timm/
Fermilab Computing Division/Operating Systems Support
Scientific Computing Support Group--Computing Farms Operations

On Fri, 12 Apr 2002, Manel Soria wrote:

> We need a power control unit for our 72-node cluster. My first
> idea was to do it ourselves with a digital i/o card and a set of relays, but
> I can't find such a card for Linux.  Actually, I have an ISA card that
> is perfect for this application but for some reason with PCI bus it
> is more difficult. Also, it seems that the "normal" solution is to buy a
> commercial APC system.
>
> Any experiences with in-house made power controls ? Would you
> recommend that we buy the APC product ?
>
> --
> ===============================================
> Dr. Manel Soria
> ETSEIT - Centre Tecnologic de Transferencia de Calor
> C/ Colom 11  08222 Terrassa (Barcelona) SPAIN
> Tf:  +34 93 739 8287 ; Fax: +34 93 739 8101
> E-Mail: manel at labtie.mmt.upc.es
>
>
>


From muno at aem.umn.edu  Fri Apr 12 07:54:05 2002
From: muno at aem.umn.edu (Ray Muno)
Date: Fri, 12 Apr 2002 09:54:05 -0500
Subject: power control
In-Reply-To: <Pine.LNX.4.30.0204121511230.16089-100000@elin.scali.no>
References: <3CB6A075.5AE6626@labtie.mmt.upc.es> <Pine.LNX.4.30.0204121511230.16089-100000@elin.scali.no>
Message-ID: <20020412095405.A10200@aem.umn.edu>

We are using a variety of Baytech power control units in our 2 clusters.

We have 3 RPC28 units powering 48 1U Dual PIII machines in 2 racks.  They
have 30A inputs and 21 outlets (20 controlled, 1 always on). With the Dual
PIII boxes, I can only run 16 from each strip.  Each strip is divided into
a pair of 15A segments, 8 machines per segment. At full load, 10 machines
was too much for a 15A segment.  I was really surprised at the increase in
power draw under load when we first started running these.

In addition, there are 2 RPC4-20 running the disk arrays, server boxes and
ethernet and Myrinet switches in the 2 racks. All told, 130A available in
the 2 racks, all pretty well utilized.
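
These figures can be cross-checked with a few lines of Python. The
sketch assumes the RPC4-20 units have 20A inputs (suggested by the
model name and by the 130A total, but not stated in the message) and
that a segment fails as soon as its load exceeds 15A.

```python
# unit model -> (number of units, input amps per unit)
units = {"RPC28": (3, 30), "RPC4-20": (2, 20)}   # 20 A is an assumption
total_amps = sum(count * amps for count, amps in units.values())

SEGMENT_AMPS = 15
machines_ok, machines_too_many = 8, 10

# 8 machines fit on a segment and 10 do not, which brackets the
# full-load draw of one dual PIII box.
upper_bound = SEGMENT_AMPS / machines_ok         # at most this many amps
lower_bound = SEGMENT_AMPS / machines_too_many   # more than this many amps

machines_total = units["RPC28"][0] * 2 * machines_ok   # 2 segments per strip

print(total_amps, "A available")
print(f"per-machine draw between {lower_bound} and {upper_bound} A")
print(machines_total, "machines on the RPC28 strips")
```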

We also have a pair of RPC3-20 powering 2 racks of Alpha machines. These
have ethernet interfaces but we decided later it was not worth the added 
cost.

I could not be happier with these units.  They can be configured to stage
the startup of the machine in sequence so you do not try and power up 48
machines all at one time. 

On Fri, Apr 12, 2002 at 03:15:37PM +0200, Steffen Persvold wrote:
> On Fri, 12 Apr 2002, Manel Soria wrote:
> 
> >
> > We need a power control unit for our 72-node cluster. My first
> > idea was to do it ourselves with a digital i/o card and a set of relays, but
> > I can't find such a card for Linux.  Actually, I have an ISA card that
> > is perfect for this application but for some reason with PCI bus it
> > is more difficult. Also, it seems that the "normal" solution is to buy a
> > commercial APC system.
> >
> > Any experiences with in-house made power controls ? Would you
> > recommend that we buy the APC product ?
> >
> 
> We've had success with the Baytech (www.baytechdcd.com) RPC-3 units. The
> only disadvantage is that they only have 8 controllable ports (which
> means that you need 9 of them...).
> 
> Regards,
>  --
>   Steffen Persvold   | Scalable Linux Systems |   Try out the world's best
>  mailto:sp at scali.com |  http://www.scali.com  | performing MPI implementation:
> Tel: (+47) 2262 8950 |   Olaf Helsets vei 6   |      - ScaMPI 1.13.8 -
> Fax: (+47) 2262 8951 |   N0621 Oslo, NORWAY   | >320MBytes/s and <4uS latency
> 
=============================================================================
 
 Ray Muno                           http://www.aem.umn.edu/people/staff/muno
 University of Minnesota                          e-mail:   muno at aem.umn.edu
 Aerospace Engineering and Mechanics               Phone:     (612) 625-9531
 110 Union St. S.E.		                     FAX:     (612) 626-1558
 Minneapolis, Mn 55455			

=============================================================================


From patrick at myri.com  Fri Apr 12 08:17:08 2002
From: patrick at myri.com (Patrick Geoffray)
Date: Fri, 12 Apr 2002 11:17:08 -0400
Subject: very high bandwidth, low latency manner?
References: <Pine.LNX.4.30.0204121455250.16089-100000@elin.scali.no>
Message-ID: <3CB6FA74.6040704@myri.com>

Steffen Persvold wrote:

> True, but if one of your Myrinet switches breaks down you lose 64 nodes
> in a 256 node system (standard "CLOS" configuration). I don't know the
> MTBF for Myrinet switches, but I would expect it to be rather high
> (redundant power supplies ?).

The calculated MTBF of the switches is more than 50 years. Actually, if all 6
fans fail, the switch will still work; it will then drop more and more
packets, and the uC will shut down the blades one by one as they reach
the critical temperature limit.
If there is a failure on a blade itself, it will affect only 8 ports.
If there is a failure in a crossbar on the backplane, the mapper will
use a redundant route (there are as many redundant routes as crossbars, so
all 8 crossbars on the backplane would have to fail to lose all
ports).

Chuck gave a very nice talk at Cluster2001 about Clos topology. It
presents things very clearly, I like it a lot:
http://www.cacr.caltech.edu/cluster2001/program/talks/seitz.pdf

Regards.

Patrick

----------------------------------------------------------
|   Patrick Geoffray, Ph.D.      patrick at myri.com
|   Myricom, Inc.                http://www.myri.com
|   Cell:  865-389-8852          685 Emory Valley Rd (B)
|   Phone: 865-425-0978          Oak Ridge, TN 37830
----------------------------------------------------------


From eugen at leitl.org  Fri Apr 12 08:55:02 2002
From: eugen at leitl.org (Eugen Leitl)
Date: Fri, 12 Apr 2002 17:55:02 +0200 (CEST)
Subject: [gamma_sw] New release of GAMMA available (fwd)
Message-ID: <Pine.LNX.4.33.0204121754560.3227-100000@hydrogen.leitl.org>


---------- Forwarded message ----------
Date: Fri, 12 Apr 2002 17:34:47 +0200 (MET DST)
From: Giuseppe Ciaccio <ciaccio at disi.unige.it>
To: GAMMA mailing list <gamma_sw at lists.dsi.uniroma1.it>
Subject: [gamma_sw] New release of GAMMA available

A wonderful, new release of GAMMA is available for download:

	http://www.disi.unige.it/project/gamma,  section "How to install"

Main features:

1)
A driver for the Netgear GA621/GA622 Gigabit Ethernet adapter is now
provided.  The driver has been excellently implemented by Marco Ehlert
(mehlert at cs.uni-potsdam.de), with support of prof. Bettina Schnor
(schnor at cs.uni-potsdam.de) and in cooperation with myself (supported
by prof. Schnor during a nice stage at Potsdam).  I thank Marco and
Bettina very much for this beautiful experience.

The driver has been tested on a 16-node cluster using the GA621 adapters.
We still lack tests on the GA622 (which should be backward-compatible).

Performance numbers are impressive.
The Potsdam testbed was a pair of back-to-back connected PCs, each with
	CPU Intel Pentium III 1 GHz
	motherboard: SuperMicro 370DE6 (chipset: ServerSet III HE-SL)
	133 MHz FSB
	PCI bus 66 MHz, 64 bit
	Netgear GA621 adapter, dedicated to GAMMA
	Linux 2.4.16 + GAMMA

On such a testbed, Marco got the following numbers:

MTU size	Latency (usec)		Throughput (MByte/s)
1500		8.5			118.5
4116		8.5			122
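
To put the throughput figures in context, the Python sketch below
compares them with the payload ceiling of Gigabit Ethernet. It assumes
1 MByte = 10^6 bytes and counts only standard Ethernet framing (14-byte
header, 4-byte FCS, 8-byte preamble, 12-byte inter-frame gap); any
GAMMA header inside the frame is ignored, so the real ceiling is
slightly lower.

```python
LINE_RATE_MBYTES = 1e9 / 8 / 1e6    # Gigabit Ethernet: 125 MByte/s on the wire
FRAME_OVERHEAD = 14 + 4 + 8 + 12    # header, FCS, preamble, inter-frame gap

def payload_ceiling(mtu):
    """Best possible payload rate in MByte/s for frames carrying mtu bytes."""
    return LINE_RATE_MBYTES * mtu / (mtu + FRAME_OVERHEAD)

# MTU -> measured throughput in MByte/s, copied from the table above
measured = {1500: 118.5, 4116: 122.0}

for mtu, rate in measured.items():
    ceiling = payload_ceiling(mtu)
    print(f"MTU {mtu}: ceiling {ceiling:.1f} MByte/s, "
          f"measured is {rate / ceiling:.1%} of it")
```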


2)
Minor changes to the GAMMA user API: the family of set_port() routines
has been slightly rearranged.  This has implications on MPI/GAMMA, a
new release of which is also available for download:
	http://www.disi.unige.it/project/gamma/mpigamma
Older versions of MPI/GAMMA will no longer compile under the current
version of GAMMA.

3)
Documentation has been updated.

The mysterious lock-up problems reported by someone on this mailing list
might have been caused by the use of gcc 2.96.  Still investigating...but
I'm not yet able to reproduce the bug here (because I don't use gcc 2.96 ?).

Enjoy!


Giuseppe Ciaccio               http://www.disi.unige.it/person/CiaccioG/
DISI - Universita' di Genova   via Dodecaneso 35   16146 Genova,   Italy
phone +39 10 353 6638          fax +39 010 3536699 ciaccio at disi.unige.it
------------------------------------------------------------------------

_______________________________________________
gamma_sw mailing list
gamma_sw at lists.dsi.uniroma1.it
http://lists.dsi.uniroma1.it/mailman/listinfo/gamma_sw


From hungjunglu at yahoo.com  Fri Apr 12 08:56:35 2002
From: hungjunglu at yahoo.com (Hung Jung Lu)
Date: Fri, 12 Apr 2002 08:56:35 -0700 (PDT)
Subject: BLAS-1, AMD, Pentium, gcc
Message-ID: <20020412155635.92564.qmail@web12605.mail.yahoo.com>

Hi,

I am thinking of migrating some calculation programs
from Windows to Linux, maybe eventually using a
Beowulf cluster. However, I am kind of worried after
reading in the mailing list archive about the lack of
CPU-optimized BLAS-1 code on Linux systems. Currently
I run on a Wintel (Windows+Pentium) machine, and I
know it's substantially faster than an equivalent AMD
machine, because I use Intel's BLAS (MKL) library.
(I apologize for any misapprehensions in what
follows... I am only starting to explore in this
arena.)

(1) Does anyone know when gcc will have memory
prefetching features? Any time frame? I can notice
very significant performance improvement on my Wintel
machine, and I think it's due to memory prefetching.

(2) I am a bit confused on the following issue: Intel
does release MKL for Linux. So, does this mean that if
I use Pentium, I still get full benefit of the
CPU-optimized features in BLAS-1, even though gcc does
not do memory prefetching? How is this possible?

(3) Related to the above: for general linear algebra
operations, is Pentium processor then better than AMD,
since Intel has the machine-optimized BLAS library? I
get contradictory information sometimes... I've seen
somewhere that Pentium-4 compares unfavorably with AMD
chips in calculation speed... Any opinions?

thanks,

Hung Jung Lu

__________________________________________________
Do You Yahoo!?
Yahoo! Tax Center - online filing with TurboTax
http://taxes.yahoo.com/


From rochus.schmid at ch.tum.de  Fri Apr 12 09:33:18 2002
From: rochus.schmid at ch.tum.de (Dr. Rochus Schmid)
Date: Fri, 12 Apr 2002 18:33:18 +0200
Subject: k7s5a mobo based cluster
Message-ID: <3CB70C4E.9D1E720E@ch.tum.de>

dear wulfers,

i recently assembled a tiny and cheap (i know :-) cluster using the ecs
k7s5a mobo (SIS 735 chipset).
the board is very cheap (~75 EUR) and comes with FE (SIS900) onboard.
just want to tell about my experience (other hardware, netboot +
stream / netpipe results).

i would be happy to know of others using this board. this mail also
contains some questions ... maybe someone can answer or help me here?
if this doesn't interest you please skip - apologies for the bandwidth.

#############
HARDWARE
(currently 4 nodes .. hope to get 4 more :-)
   mobo: k7s5a
   cpu:  athlonxp 1,4 ghz (1600+)
   ram: 256 MB DDR (266MHz, CL2)
   graphics: various pci/agp graphics cards i could find
   floppy
  small tower with 250W PS

nodes are diskless, master has an additional 40GB IDE disk.
switch: D-Link DES 1008D 8 port switch.

##############
POWER /GRAPHICS

the cluster has continuously been up for about 3 weeks now with quite
some load for most of the time. as far as i can tell, the 250W seems to
be ok for the board and the 1,4 ghz athlonxp.

the ami-bios does not allow booting without a graphics adapter. someone
on the net (using a lot of the boards for a SETI at home "farm") told me
that he did not get around it even with tweaking tools for the ami-bios.
i am happy to have a console for maintenance but one has to find a
cheapo graphics card  ... anyone out there managed to avoid this?

##############
NETBOOT / OS

i run a RH7.2 on it with a 2.4.17 kernel with NFS-root.
the bios supports the RPL protocol for netbooting. i tried the rpld for
linux and the board seems to communicate with the rpl-server and
download something, but i didnt get it to boot. the rpld-developers sent
me some patch to "switch off a DMA channel" of the onboard NIC but i
have to admit that i didnt really understand what to do, nor did i try
it. i currently boot from a syslinux floppy and use NFS-root.
did someone manage to netboot linux with this hardware?

###############
STREAM

because of the comments on the gcc versions 2.96 versus 2.95 issue
mentioned on the ATLAS webpages i reinstalled the gcc 2.95 and found
differences also for stream results and therefore i will post both here
(both compiled with -O2 for comparison)
Array size = 2000000, Offset = 0

gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-98)
-O2
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:         666.0980       0.0489       0.0480       0.0559
Scale:        585.3939       0.0547       0.0547       0.0549
Add:          726.1178       0.0662       0.0661       0.0663
Triad:        679.6655       0.0707       0.0706       0.0707

gcc version 2.95.3 20010315 (release)
-O2
Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:         727.6031       0.0440       0.0440       0.0443
Scale:        627.5864       0.0526       0.0510       0.0649
Add:          798.1775       0.0602       0.0601       0.0603
Triad:        727.6691       0.0660       0.0660       0.0661

i guess the obvious conclusion is to reinstall gcc-2.95 when using
RH7.1 or RH7.2.
these results are not as good as reported recently for the nforce
chipset.
i tried to set the bios settings for the ddr-ram to optimal, but i didnt
test/experiment.
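
The size of the compiler effect can be read off the two tables with a
short Python sketch. The rates are copied from the tables above. The
helper stream_rate is an illustrative name; it assumes STREAM's usual
accounting of 8-byte elements, 10^6 bytes per MB, and the minimum time
as the divisor.

```python
gcc_296 = {"Copy": 666.0980, "Scale": 585.3939, "Add": 726.1178, "Triad": 679.6655}
gcc_295 = {"Copy": 727.6031, "Scale": 627.5864, "Add": 798.1775, "Triad": 727.6691}

# Percentage by which the gcc 2.95 build beats the gcc 2.96 build.
gain_percent = {name: (gcc_295[name] / gcc_296[name] - 1.0) * 100.0
                for name in gcc_296}
for name, gain in gain_percent.items():
    print(f"{name:6s} {gain:5.1f}% faster with gcc 2.95")

def stream_rate(arrays_touched, array_size, min_time):
    """Rate in MB/s for a kernel that reads or writes this many arrays."""
    return arrays_touched * 8 * array_size / min_time / 1e6

# Copy touches 2 arrays; this lands close to the reported 666 MB/s.
print(round(stream_rate(2, 2000000, 0.0480), 1))
```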

##########
NetPipe-2.4

The following results are NOT from a crossconnect cable but measured
through the D-Link switch!!

kernel 2.4.17 / NIC-driver SIS900
for MPI: LAM-MPI 6.5.1

NPtcp:
    latency: 33 us
    bandwidth: ~89.7 MBit/s

NPmpi:
    latency: 41 us
    bandwidth: ~82 MBit/s (maximum at about 85 MBit/s)

the latency of around 40 microsec seems to be very low as far as i can
tell from the information on the net (i am absolutely a beginner in this
field). is there anything one can seriously do wrong? i tried it a
couple of times. between different nodes always with basically the same
result.
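
One way to relate the two NetPIPE numbers is the simple linear model
time = latency + size / bandwidth. That model is an assumption here
(NetPIPE itself only reports measurements), and the function names are
made up for illustration. A Python sketch using the figures above:

```python
def transfer_time(size_bytes, latency_s, bandwidth_bits):
    """Time to move one message under the linear latency/bandwidth model."""
    return latency_s + size_bytes * 8 / bandwidth_bits

def effective_bandwidth(size_bytes, latency_s, bandwidth_bits):
    """Throughput in bit/s actually seen for messages of this size."""
    return size_bytes * 8 / transfer_time(size_bytes, latency_s, bandwidth_bits)

def half_performance_size(latency_s, bandwidth_bits):
    """Message size in bytes at which half the peak bandwidth is reached."""
    return latency_s * bandwidth_bits / 8

results = {"NPtcp": (33e-6, 89.7e6), "NPmpi": (41e-6, 82e6)}
for name, (latency, bandwidth) in results.items():
    n_half = half_performance_size(latency, bandwidth)
    print(f"{name}: half of peak bandwidth at about {n_half:.0f} bytes")
```

In this model, messages much smaller than the half-performance size
are dominated by latency and much larger ones by bandwidth.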

###################
i hope i did not annoy the pros on this list too much, and this is
helpful for comparison.
again: please contact me off list if you also use this type of hardware.

thanks and best greetings from munich,

    rochus


--

Dr. Rochus Schmid
Technische Universität München
Lehrstuhl f. Anorganische Chemie
Lichtenbergstrasse 4, 85747 Garching

Tel.    ++49 89 2891 3174
Fax.    ++49 89 2891 3473
Email   rochus.schmid at ch.tum.de


From ctierney at hpti.com  Fri Apr 12 10:01:51 2002
From: ctierney at hpti.com (Craig Tierney)
Date: Fri, 12 Apr 2002 11:01:51 -0600
Subject: very high bandwidth, low latency manner?
In-Reply-To: <3CB67338.8080906@myri.com>; from patrick@myri.com on Fri, Apr 12, 2002 at 01:40:08AM -0400
References: <Pine.LNX.4.30.0204120524590.10585-100000@elin.scali.no> <3CB67338.8080906@myri.com>
Message-ID: <20020412110151.A1491@hpti.com>

On Fri, Apr 12, 2002 at 01:40:08AM -0400, Patrick Geoffray wrote:
> Steffen Persvold wrote:
> 
> >>I talked to a guy at SC2002 from Quadrics and he said
> >>that list pricing on a Quadrics network was about $3500
> >>per node when you are in the 100s of nodes and up.
> >>The price includes the cards, cables, switches,
> >>etc.  This doesn't include any sort of discount that you
> >>might get.  Myrinet is about $2000 for an equivalent
> >>network at list price.   Dolphin/SCI falls around $2245 list
> >>per node (if the system is > 144 nodes and you have to get
> >>the 3d card).
>  >
> > This is list prices for the cards only, right ? 
> 
> Not for Myrinet. Actually $2000 per node is the total cost 
> (NIC/cable/port/software) for the high-end products (with L9/200 MHz), 
> should be more like $1500 for low-end ones. Craig is spoiled, only buys 
> the top stuff :-)

Sorry Patrick.  The problem with trying to state numbers is that
if you get it wrong, then the real knowledgeable ones can point it out.

I had figured the list cost on a 256 node system at about $2000 for
basic hardware.  I was wrong.  I reworked it and it is $1500 per node for
256 nodes (and would be the same for 512 and 1024).

I thought I had decent information.  What I was trying to provide was
info on the three options.  The SCI price is per node (cards, cables,
software).  It is $2245 list.  This would be appropriate for systems over
144 nodes.

The Quadrics number I got from the rep might have been the sales number,
and not a perfect comparison to all the required hardware for a system of
a few hundred nodes.

Craig


From ctierney at hpti.com  Fri Apr 12 10:15:52 2002
From: ctierney at hpti.com (Craig Tierney)
Date: Fri, 12 Apr 2002 11:15:52 -0600
Subject: What could be the performance of my cluster
In-Reply-To: <20020412103234.7580.qmail@web10508.mail.yahoo.com>; from suraj_peri@yahoo.com on Fri, Apr 12, 2002 at 03:32:34AM -0700
References: <20020411125920.D32605@hpti.com> <20020412103234.7580.qmail@web10508.mail.yahoo.com>
Message-ID: <20020412111552.B1491@hpti.com>

All my experience is with oceanography and
atmospheric applications.

Is the BLAST code something that spends lots
of time doing lots of little calculations,
or doing one big calculation?  How important is
the speed of access to the database?  What is
the memory footprint of the code when it runs
on the DS20E?

Craig
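
The peak estimate from the earlier reply quoted below (flops = cpus *
clock * flops per cycle) can be reproduced with a short Python sketch.
The 2 flops per cycle is the assumption stated in that reply, and the
result is a theoretical ceiling, not an expected application rate.

```python
def peak_gflops(cpus, clock_ghz, flops_per_cycle):
    """Theoretical peak in Gflops: every CPU retires its maximum each cycle."""
    return cpus * clock_ghz * flops_per_cycle

# 8 CPUs at 1.53 GHz, assuming 2 flops per cycle with SSE
peak = peak_gflops(8, 1.53, 2)
print(f"{peak:.2f} Gflops theoretical peak")
```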


On Fri, Apr 12, 2002 at 03:32:34AM -0700, Suraj Peri wrote:
> Hi Craig, 
> Many thanks for your mail. please excuse me for asking
> a dumb question and I am novice in this area. 
> I am interested in using this cluster for BLAST
> purposes. I want to store ESTs( Expressed Sequence
> Tags) and GenBank ( nucleotide sequence database) and
> GenPept ( Protein sequence database) and total
> predicted protein sets of Human genome. 
> I will use BLAST ( basic local alignment search tool
> algorithm) on this cluster, as the computations are
> intensive and time consuming. 
> So I wanted to compare the AlphaServer DS20E and my
> cluster in their computing abilities. 
> Because no one in my circle of friends knows about
> this, please help me if you have used clusters for
> BLAST purposes. 
> thanks
> Suraj. 
> 
> --- Craig Tierney <ctierney at hpti.com> wrote:
> > It depends on what you are trying to do (doesn't
> > everyone
> > love that answer). 
> > 
> > The number of flops your cluster can do should
> > be equal to:
> > 
> > flops = (no. of cpus) * (Mhz) * (flops per hz)
> > 
> > So for your cluster
> > 
> > flops =  8 * 1.53 Ghz * 2
> > 
> >   I am assuming that with SSE you can get 2 flops
> > per cycle.
> > 
> > flops = 24.48 Gflops
> > 
> > Now, there are some issues with this.  First, you
> > are never
> > going to get 1.53*2 Gflops out of a single
> > processor.  Second,
> > leveraging all 8 cpus to get their maximum is going
> > to be 
> > difficult if there is any communication between the
> > nodes.
> > 
> > Compilers play a big role in extracting the best
> > performance
> > out of the system.  If you don't have a commercial
> > compiler
> > from the likes of Intel or Portland Group, I highly
> > recommend
> > getting one.  You only have to purchase the compiler
> > for where
> > you compile, and not where you run.  You can get
> > away with
> > one copy of the compiler on your server.
> > 
> > If you are trying to compare the AMD system to the
> > DS20E system,
> > it will depend on what you are actually trying to
> > do.  If 
> > you are running single precision floating point
> > codes that do
> > not require all the memory bandwidth a DS20E
> > provides, I would
> > think that within 10% that AMD processor will do the
> > work
> > of one 833 Mhz Alpha Cpu (You didn't say if you had
> > 2 cpus
> > in your DS20e).   At least this is what I am seeing
> > for my codes when comparing Dual Xeon's, Dual AMD's,
> > and
> > dual API 833 boxes.
> > 
> > Craig
> > 
> > 
> > 
> > 
> > 
> > On Sat, Apr 06, 2002 at 03:35:45AM -0800, Suraj Peri
> > wrote:
> > > Hi group, 
> > > I was calculating the performance of my cluster.
> > The
> > > features are 
> > > 
> > > 1. 8 nodes
> > > 2. Processor: AMD Athlon XP 1800+
> > > 3. 8 CPUs
> > > 4. 8*1.5 GB DDR RAM
> > > 5. 1 Server with 2 processors with AMD MP 1800+
> > and
> > > 2GB DDR RAM
> > > 
> > > I calculated this to be 48 Mflops . Is this
> > correct ?
> > > if not, what is the correct performance of my
> > cluster.
> > > I also comparatively calculated that my cluster
> > would
> > > be 3 times faster than AlphaServer DS20E ( 833 MHz
> > > alpha 64 bit processor, 4 GB max memory)
> > > 
> > > Is my calculation correct or wrong? please help me
> > > ASAP. thanks in advance.
> > > 
> > > cheers
> > > suraj.
> > > 
> > > =====
> > > PIL/BMB/SDU/DK
> > > 
> > 
> > -- 
> > Craig Tierney (ctierney at hpti.com)
> 
> 
> =====
> PIL/BMB/SDU/DK
> 

-- 
Craig Tierney (ctierney at hpti.com)


From tim at dolphinics.com  Fri Apr 12 10:02:04 2002
From: tim at dolphinics.com (Tim Wilcox)
Date: Fri, 12 Apr 2002 11:02:04 -0600
Subject: very high bandwidth, low latency manner?
References: <Pine.LNX.4.30.0204050252190.7535-100000@elin.scali.no> <v05010178b8dacb026736@[192.168.1.4]> <3CB5DA04.AF2C609F@lfbs.rwth-aachen.de> <20020411132650.A32674@hpti.com>
Message-ID: <3CB7130C.60107@dolphinics.com>


Craig Tierney wrote:

>I talked to a guy at SC2002 from Quadrics and he said
>that list pricing on a Quadrics network was about $3500
>per node when you are in the 100s of nodes and up.  
>The price includes the cards, cables, switches,
>etc.  This doesn't include any sort of discount that you
>might get.  Myrinet is about $2000 for an equivalent 
>network at list price.   Dolphin/SCI falls around $2245 list 
>per node (if the system is > 144 nodes and you have to get
>the 3d card).
>
A couple of corrections to this: the 2D card lists at $1695 per node and
is suitable for up to 256 nodes.  No one has built one that size, and it
is correct that 3D (list $2245/node) is recommended for systems
larger than 144 nodes.  This is due to a potential saturation of a ring
for certain communication patterns; it is not always the case.  By going
to 3D you shorten the rings and avoid this up to 1728 nodes.  Anyone
interested in building one?
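
The 144 and 1728 node thresholds are consistent with rings of 12 nodes
per dimension (12^2 and 12^3). That reading is an inference, not
something stated here. Under that assumption, a Python sketch of the
card choice and the resulting list cost (function names are made up
for illustration):

```python
CARD_PRICE = {"2D": 1695, "3D": 2245}   # list price per node, from this message
RING = 12                               # assumed ring length per dimension

def recommended_card(nodes):
    """Pick the card following the 144 / 1728 node thresholds."""
    if nodes <= RING ** 2:
        return "2D"
    if nodes <= RING ** 3:
        return "3D"
    raise ValueError("more nodes than a 3D torus with rings of 12 can hold")

def list_cost(nodes):
    """Total list price in dollars for cards on every node."""
    return nodes * CARD_PRICE[recommended_card(nodes)]

for nodes in (64, 144, 256, 1728):
    print(nodes, recommended_card(nodes), list_cost(nodes))
```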

>
>
>I heard that Quadrics had a customer that just had to have
>an Intel/Quadrics system so either they or he was working
>on porting the drivers.   The web page says they support
>Linux and Tru64.  You could probably get the hardware without
>going through Compaq, but Compaq is most likely buying up
>most of the supply.
>
>Craig
>
The Quadrics looks interesting, but I haven't the resources to afford 
the pleasure of playing with it.  The major issues with it are pricing and 
the lack of nodes out there using it.

Myricom and Dolphin tend to come to about the same price per node, chalk 
it up to friendly competition.

Regards,
Tim Wilcox


From djholm at fnal.gov  Fri Apr 12 10:36:22 2002
From: djholm at fnal.gov (Don Holmgren)
Date: Fri, 12 Apr 2002 12:36:22 -0500
Subject: BLAS-1, AMD, Pentium, gcc
In-Reply-To: <20020412155635.92564.qmail@web12605.mail.yahoo.com>
Message-ID: <Pine.SGI.4.21.0204121210450.26154338-100000@hppc.fnal.gov>

On Fri, 12 Apr 2002, Hung Jung Lu wrote:

> Hi,
> 
> I am thinking in migrating some calculation programs
> from Windows to Linux, maybe eventually using a
> Beowulf cluster. However, I am kind of worried after I
> read in the mailing list archive about lack of
> CPU-optimized BLAS-1 code in Linux systems. Currently
> I run on a Wintel (Windows+Pentium) machine, and I
> know it's substantially faster than equivalent AMD
> machine, because I use the Intel's BLAS (MKL) library.
> (I apologize for any misapprehensions in what
> follows... I am only starting to explore in this
> arena.)
> 
> (1) Does anyone know when gcc will have memory
> prefetching features? Any time frame? I can notice
> very significant performance improvement on my Wintel
> machine, and I think it's due to memory prefetching.


If you mean, "when will gcc's optimizer do automatic prefetching?", I
have no idea.  But, many programmers have been doing manual prefetching
with gcc for quite a while. If you don't mind defining and using
assembler macros, gcc handles it just fine now.  Here's an example:

#define prefetch_loc(addr) \
__asm__ __volatile__ ("prefetchnta %0" \
                      : \
                      : \
                      "m" (*(((char*)(((unsigned int)(addr))&~0x7f)))))

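As a rough illustration of manual prefetching in a loop, here is a
sketch that uses GCC's `__builtin_prefetch` rather than the inline-asm
macro above (the macro's cast to `unsigned int` assumes 32-bit
pointers).  The function name and the prefetch distance of 16 elements
are assumptions for the example, not values from this thread; a useful
distance has to be found by measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prefetch distance, in array elements; the best value is
 * machine- and loop-dependent. */
#define PREFETCH_AHEAD 16

/* Sum n doubles, hinting the cache about data a few iterations ahead. */
double sum_with_prefetch(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* Arguments: address, 0 = prefetch for read,
             * 0 = no temporal locality (a non-temporal hint). */
            __builtin_prefetch(&x[i + PREFETCH_AHEAD], 0, 0);
        }
        total += x[i];
    }
    return total;
}
```

The prefetch is only a hint: the result is the same with or without it,
and whether it helps depends on the access pattern and the hardware.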

> (2) I am a bit confused on the following issue: Intel
> does release MKL for Linux. So, does this mean that if
> I use Pentium, I still get full benefit of the
> CPU-optimized features in BLAS-1, despite of gcc does
> not do memory prefetching? How is this possible?


The Intel compiler produces object files compatible with gcc, and vice
versa.  I would assume they implemented the library with the Intel
compiler, which has full SSE/SSE2 support (including prefetching).  They
list the MKL for Linux as compatible with both gnu and Intel compilers.


> (3) Related to the above: for general linear algebra
> operations, is Pentium processor then better than AMD,
> since Intel has the machine-optimized BLAS library? I
> get contradictory information sometimes... I've seen
> somewhere that Pentium-4 compares unfavorably with AMD
> chips in calculation speed... Any opinions?
> 
> thanks,
> 
> Hung Jung Lu


For the very simple SU3 linear algebra (3X3 complex matrices and 3X1
complex vectors) used in our codes, the Pentium 4 outperforms the Athlon
on most of our SSE-assisted routines.  See the table near the bottom of
   http://qcdhome.fnal.gov/sse/inline.html
for Mflops per gigahertz on various routines for P-III, P4, and Athlon.
Perhaps re-coding in 3DNow! would give the Athlon a boost.

For our codes, which are bound by memory bandwidth, P4's do
significantly better than Athlons because of the faster front side bus
(400 MHz effective).  See
   http://qcdhome.fnal.gov/qcdstream/compare.qcdstream
for a table comparing memory bandwidth and SU3 linear algebra
performance on a 1.2 GHz Athlon, 1.4 GHz P4, and 1.7 GHz P4 (see
   http://qcdhome.fnal.gov/qcdstream/   
for information about this benchmark).

Don Holmgren
Fermilab


From sp at scali.com  Fri Apr 12 10:50:26 2002
From: sp at scali.com (Steffen Persvold)
Date: Fri, 12 Apr 2002 19:50:26 +0200 (CEST)
Subject: very high bandwidth, low latency manner?
In-Reply-To: <20020412110151.A1491@hpti.com>
Message-ID: <Pine.LNX.4.30.0204121919130.16089-100000@elin.scali.no>

On Fri, 12 Apr 2002, Craig Tierney wrote:

> On Fri, Apr 12, 2002 at 01:40:08AM -0400, Patrick Geoffray wrote:
> > Steffen Persvold wrote:
> >
> > >>I talked to a guy at SC2002 from Quadrics and he said
> > >>that list pricing on a Quadrics network was about $3500
> > >>per node when you are in the 100s of nodes and up.
> > >>The price includes the cards, cables, switches,
> > >>etc.  This doesn't include any sort of discount that you
> > >>might get.  Myrinet is about $2000 for an equivelent
> > >>network at list price.   Dolphin/SCI falls around $2245 list
> > >>per node (if the system is > 144 nodes and you have to get
> > >>the 3d card).
> >  >
> > > This is list prices for the cards only, right ?
> >
> > Not for Myrinet. Actually $2000 per node is the total cost
> > (NIC/cable/port/software) for the high-end products (with L9/200 MHz),
> > should be more like $1500 for low-end ones. Craig is spoiled, only buys
> > the top stuff :-)
>
> Sorry Patrick.  The problem with trying to state numbers is that
> if you get it wrong, then the real knowledgeable ones can point it out.
>
> I figure out list cost on a 256 node system at about $2000 before for
> basic hardware.  I as wrong.  I reworked it and it is $1500 for
> 256 (and would be the same for 512 and 1024).

So what is wrong with my calculations:

256 node L9/2MB/133MHz config :

M3F-PCI64B-2 NICs           256 *  $1,195 = $305,920
M3-E128 Switch enclosures     6 * $12,800 =  $76,800
M3-SW16-8F "Leaf" cards      32 *  $2,400 =  $76,800
M3-SPINE-8F "Spine" cards    64 *  $1,600 = $102,400
-----------------------------------------------------
Total cost                                = $561,920
Node cost                                 =   $2,195

and for a L9/2MB/200MHz config :

M3F-PCI64B-2 NICs           256 *  $1,495 = $382,720
M3-E128 Switch enclosures     6 * $12,800 =  $76,800
M3-SW16-8F "Leaf" cards      32 *  $2,400 =  $76,800
M3-SPINE-8F "Spine" cards    64 *  $1,600 = $102,400
-----------------------------------------------------
Total cost                                = $638,720
Node cost                                 =   $2,495

And this is without cable cost (I don't quite know the cable
requirements for the total system, but it is at least approx. $100 per
node).
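
The two tables can be checked mechanically.  The sketch below encodes
the 133 MHz bill of materials exactly as listed above; the struct and
function names are made up for the example.

```c
#include <assert.h>

/* One line of a price list: quantity and unit list price in US dollars. */
struct line_item {
    const char *part;
    long quantity;
    long unit_price;
};

/* The 256-node L9/2MB/133MHz configuration from the table above. */
static const struct line_item myrinet_256_133mhz[] = {
    { "M3F-PCI64B-2 NIC",       256,  1195 },
    { "M3-E128 enclosure",        6, 12800 },
    { "M3-SW16-8F leaf card",    32,  2400 },
    { "M3-SPINE-8F spine card",  64,  1600 },
};

/* Total list price of a bill of materials. */
long total_cost(const struct line_item *items, int count)
{
    long total = 0;
    for (int i = 0; i < count; i++)
        total += items[i].quantity * items[i].unit_price;
    return total;
}

/* Per-node cost, rounded down to whole dollars. */
long node_cost(const struct line_item *items, int count, long nodes)
{
    assert(nodes > 0);
    return total_cost(items, count) / nodes;
}
```

Calling `total_cost(myrinet_256_133mhz, 4)` gives 561920 and
`node_cost(myrinet_256_133mhz, 4, 256)` gives 2195, matching the table.
Swapping the NIC price to 1495 reproduces the 200 MHz figures.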

>
> I thought I had decent information.  What I was trying to provide was
> info on the three options.  The SCI price is per node (cards, cables,
> software).  It is $2245 list.  This would be appropriate for systems over
> 144 nodes.
>

Actually, Wulfkit3 comes in two flavors, depending on whether you want 1U
or can manage with a 2U (or higher) solution. The 1U is $2,445 and the 2U
solution is $2,245 per node.

> The Quadrics number I got from the rep might have been the sales number,
> and not a perfect comparsion to all the required hardware for a system of
> a few hundred nodes.
>

Now we have price comparisons for the interconnects (SCI, Myrinet and
Quadrics). What about performance? Does anyone have NAS/PMB numbers for
~144-node Myrinet/Quadrics clusters? (I can provide some numbers from a
132-node Athlon 760MP based SCI cluster, and I guess also from an 81-node
PIII ServerWorks HE-SL based cluster.)

Regards,
--
  Steffen Persvold   | Scalable Linux Systems |   Try out the world's best
 mailto:sp at scali.com |  http://www.scali.com  | performing MPI implementation:
Tel: (+47) 2262 8950 |   Olaf Helsets vei 6   |      - ScaMPI 1.13.8 -
Fax: (+47) 2262 8951 |   N0621 Oslo, NORWAY   | >320MBytes/s and <4uS latency


From lindahl at keyresearch.com  Fri Apr 12 10:43:35 2002
From: lindahl at keyresearch.com (Greg Lindahl)
Date: Fri, 12 Apr 2002 13:43:35 -0400
Subject: What could be the performance of my cluster
In-Reply-To: <20020412111552.B1491@hpti.com>; from ctierney@hpti.com on Fri, Apr 12, 2002 at 11:15:52AM -0600
References: <20020411125920.D32605@hpti.com> <20020412103234.7580.qmail@web10508.mail.yahoo.com> <20020412111552.B1491@hpti.com>
Message-ID: <20020412134335.B1810@wumpus.skymv.com>

On Fri, Apr 12, 2002 at 11:15:52AM -0600, Craig Tierney wrote:

> Is the BLAST code something that spends lots
> of time trying doing lots of little calculations,
> or doing one big calculation?  How important is
> the speed of access to the database?  What is
> the memory footprint of the code when it runs
> on the DS20E?

It depends.

What BLAST does is compare a set of sequences against a big database of
sequences. The databases come in small, medium, and large (bigger than
2 GByte) sizes; the sequences can either be a single sequence (imagine
a researcher looking up a single protein using a web interface) or a
large set of them. If it's a large set, the problem is embarrassingly
parallel.

The BLAST implementation used by most people isn't parallel. It can be
fairly easily parallelized to divide the big database up into pieces.
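
As a sketch of what "dividing the database into pieces" can look like,
the function below hands each worker a contiguous range of database
sequences.  The names are hypothetical, and splitting by sequence count
is an assumption made for simplicity; a real split would more likely
balance by total sequence length.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [begin, end) of database sequence indices. */
struct db_slice {
    size_t begin;
    size_t end;
};

/* Give worker `rank` (0-based) of `workers` a contiguous share of
 * `total` sequences; the first (total % workers) workers get one extra. */
struct db_slice slice_for_worker(size_t total, size_t workers, size_t rank)
{
    assert(workers > 0 && rank < workers);
    size_t base  = total / workers;
    size_t extra = total % workers;
    struct db_slice s;
    s.begin = rank * base + (rank < extra ? rank : extra);
    s.end   = s.begin + base + (rank < extra ? 1 : 0);
    return s;
}
```

Each worker searches the query against its own slice, and the per-slice
hits are merged afterwards.  The same function applied to a large set of
query sequences gives the embarrassingly parallel case.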

People build fairly different clusters to run BLAST depending on their
details. The guys at Celera Genomics didn't want to use a parallel
version, and their database is bigger than 2 GBytes, so they bought
Alphas. Most people have small enough databases to fit into 2 GBytes,
but search against 1 sequence at a time, so they can't afford to read
the entire database over NFS every time, and keep it on a local disk.

greg


From ctierney at hpti.com  Fri Apr 12 11:19:13 2002
From: ctierney at hpti.com (Craig Tierney)
Date: Fri, 12 Apr 2002 12:19:13 -0600
Subject: very high bandwidth, low latency manner?
In-Reply-To: <Pine.LNX.4.30.0204121919130.16089-100000@elin.scali.no>; from sp@scali.com on Fri, Apr 12, 2002 at 07:50:26PM +0200
References: <20020412110151.A1491@hpti.com> <Pine.LNX.4.30.0204121919130.16089-100000@elin.scali.no>
Message-ID: <20020412121913.B1508@hpti.com>

On Fri, Apr 12, 2002 at 07:50:26PM +0200, Steffen Persvold wrote:
> On Fri, 12 Apr 2002, Craig Tierney wrote:
> 
> > On Fri, Apr 12, 2002 at 01:40:08AM -0400, Patrick Geoffray wrote:
> > > Steffen Persvold wrote:
> > >
> > > >>I talked to a guy at SC2002 from Quadrics and he said
> > > >>that list pricing on a Quadrics network was about $3500
> > > >>per node when you are in the 100s of nodes and up.
> > > >>The price includes the cards, cables, switches,
> > > >>etc.  This doesn't include any sort of discount that you
> > > >>might get.  Myrinet is about $2000 for an equivelent
> > > >>network at list price.   Dolphin/SCI falls around $2245 list
> > > >>per node (if the system is > 144 nodes and you have to get
> > > >>the 3d card).
> > >  >
> > > > This is list prices for the cards only, right ?
> > >
> > > Not for Myrinet. Actually $2000 per node is the total cost
> > > (NIC/cable/port/software) for the high-end products (with L9/200 MHz),
> > > should be more like $1500 for low-end ones. Craig is spoiled, only buys
> > > the top stuff :-)
> >
> > Sorry Patrick.  The problem with trying to state numbers is that
> > if you get it wrong, then the real knowledgeable ones can point it out.
> >
> > I figure out list cost on a 256 node system at about $2000 before for
> > basic hardware.  I as wrong.  I reworked it and it is $1500 for
> > 256 (and would be the same for 512 and 1024).

Your calculations are fine.  I shouldn't be allowed to add and
multiply numbers.  When I redid the numbers I redid them
incorrectly.  From list prices, cables are about $100 each,
and you need two per card.  So add about $200 to your prices.

> 
> So what is wron with my calculations :
> 
> 256 node L9/2MB/133MHz config :
> 
> M3F-PCI64B-2 NICs           256 *  $1,195 = $305,920
> M3-E128 Switch enclosures     6 * $12,800 =  $76,800
> M3-SW16-8F "Leaf" cards      32 *  $2,400 =  $76,800
> M3-SPINE-8F "Spine" cards    64 *  $1,600 = $102,400
> -----------------------------------------------------
> Total cost                                = $561,920
> Node cost                                 =   $2,195
> 
> and for a L9/2MB/200MHz config :
> 
> M3F-PCI64B-2 NICs           256 *  $1,495 = $382,720
> M3-E128 Switch enclosures     6 * $12,800 =  $76,800
> M3-SW16-8F "Leaf" cards      32 *  $2,400 =  $76,800
> M3-SPINE-8F "Spine" cards    64 *  $1,600 = $102,400
> -----------------------------------------------------
> Total cost                                = $638,720
> Node cost                                 =   $2,495
> 
> And this is without cable cost (since I don't quite know the cable
> requirements for the total system, but atleast it is approx $100 per
> node).
> 
> >
> > I thought I had decent information.  What I was trying to provide was
> > info on the three options.  The SCI price is per node (cards, cables,
> > software).  It is $2245 list.  This would be appropriate for systems over
> > 144 nodes.
> >
> 
> Actually, Wulfkit3 comes in two flavors wether you want 1U or can manage
> with a 2U (or higher) solutuion. The 1U is $2,445 and the 2U solution is
> $2,245 per node.
> 
> > The Quadrics number I got from the rep might have been the sales number,
> > and not a perfect comparsion to all the required hardware for a system of
> > a few hundred nodes.
> >
> 
> Now we have price comparisons for the interconnects (SCI,Myrinet and
> Quadrics). What about performance ? Does anyone have NAS/PMB numbers for
> ~144 node Myrinet/Quadrics clusters (I can provide some numbers from a 132
> node Athlon 760MP based SCI cluster, and I guess also a 81 node PIII ServerWorks
> HE-SL based cluster).

I don't think that anyone is going to have numbers on the same hardware.
Too bad.  It would be interesting to see the differences.  However,
that may end all the discussions and that would be no fun.

Craig


From patrick at myri.com  Fri Apr 12 12:48:00 2002
From: patrick at myri.com (Patrick Geoffray)
Date: Fri, 12 Apr 2002 15:48:00 -0400
Subject: very high bandwidth, low latency manner?
References: <Pine.LNX.4.30.0204121919130.16089-100000@elin.scali.no>
Message-ID: <3CB739F0.6000109@myri.com>

Steffen Persvold wrote:

>>I figure out list cost on a 256 node system at about $2000 before for
>>basic hardware.  I as wrong.  I reworked it and it is $1500 for
>>256 (and would be the same for 512 and 1024).
>>

> So what is wron with my calculations :
 >
 > 256 node L9/2MB/133MHz config :
 > Node cost                                 =   $2,195
 > and for a L9/2MB/200MHz config :
 > Node cost                                 =   $2,495

Nothing, it's right for 256 nodes. However:

128 nodes L9/133 MHz config:
Node cost                                   =   $1,595
128 nodes L9/200 MHz config:
Node cost                                   =   $1,895

For more than 128 ports, the number of switches increases to keep a
guaranteed full bisection, which adds about $500 per node. However, up
to 128 nodes you need only one switch, and the numbers I gave are correct.
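
The 128-node figures can be reproduced from the list prices in
Steffen's table if one assumes a single M3-E128 enclosure holding 16
leaf cards and no spine cards.  That configuration is a guess that
happens to match the arithmetic; it is not spelled out in this thread.
A sketch, with hypothetical names:

```c
#include <assert.h>

/* Switch hardware for one configuration, priced from Steffen's table. */
struct fabric {
    long enclosures;   /* M3-E128 at $12,800 each */
    long leaf_cards;   /* M3-SW16-8F at $2,400 each */
    long spine_cards;  /* M3-SPINE-8F at $1,600 each */
};

/* Total list price of the switch hardware. */
long fabric_cost(struct fabric f)
{
    return f.enclosures * 12800 + f.leaf_cards * 2400 + f.spine_cards * 1600;
}

/* NIC price plus an equal share of the switch hardware, in whole dollars. */
long per_node_cost(long nic_price, struct fabric f, long nodes)
{
    assert(nodes > 0);
    return nic_price + fabric_cost(f) / nodes;
}
```

Under that assumption the switch share is $400 per node at 128 nodes
(giving $1,595 and $1,895) and $1,000 per node for the 256-node
configuration in the table (giving $2,195 and $2,495).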

The switchless cost model makes sense for configs larger than the biggest
switch size for switched technologies, i.e. 128 ports for Quadrics and
Myrinet. Surprisingly, the largest SCI cluster is, AFAIK, 132 nodes ;-)

> Now we have price comparisons for the interconnects (SCI,Myrinet and
> Quadrics). What about performance ? Does anyone have NAS/PMB numbers for
> ~144 node Myrinet/Quadrics clusters (I can provide some numbers from a 132
> node Athlon 760MP based SCI cluster, and I guess also a 81 node PIII ServerWorks
> HE-SL based cluster).

Ok, I will say again what I think about these comparisons: it's already
hard to compare dollars (what about discount, what about support, what
about software, etc.) even though they are the same dollars; it's a waste
of time to do that for micro-benchmarks. It's something you do when you
want to publish something in a conference next to a beach.
When a customer asks me about performance, I don't give him my NAS or
PMB numbers; he doesn't care. He wants access to an XXX-node machine to
play with and run his set of applications, or he gives a list of codes
to the vendors for the bid and the vendors guarantee the results because
they are used officially in the bid process. If someone buys a machine
because the NAS numbers look pretty and his CFD code sucks, this guy will
take his stuff and look for a new job.

Do you spend time tuning NAS? I don't. People already told me that the
NAS LU test sucks on MPICH-GM. Well, the LU algorithm in HPL is much
better. How many applications behave like the NAS LU, how many like HPL?
If a customer comes to me because his code behaves like NAS LU, I will
tell him what to tune in his code to be more efficient.

The pitfall with benchmarks is that you want to tune your MPI
implementation to look good on them. In the real world, you cannot expect
to run a code efficiently on a machine without tuning it, especially with
MPI.

My 2 pennies

Patrick

----------------------------------------------------------
|   Patrick Geoffray, Ph.D.      patrick at myri.com
|   Myricom, Inc.                http://www.myri.com
|   Cell:  865-389-8852          685 Emory Valley Rd (B)
|   Phone: 865-425-0978          Oak Ridge, TN 37830
----------------------------------------------------------


From robert at bay13.de  Fri Apr 12 12:25:06 2002
From: robert at bay13.de (Robert Depenbrock)
Date: Fri, 12 Apr 2002 21:25:06 +0200
Subject: What could be the performance of my cluster
References: <20020411125920.D32605@hpti.com> <20020412103234.7580.qmail@web10508.mail.yahoo.com> <20020412111552.B1491@hpti.com> <20020412134335.B1810@wumpus.skymv.com>
Message-ID: <3CB73492.78900CF4@bay13.de>

Greg Lindahl wrote:
> 

Hi Greg,

> On Fri, Apr 12, 2002 at 11:15:52AM -0600, Craig Tierney wrote:
> 
> > Is the BLAST code something that spends lots
> > of time trying doing lots of little calculations,
> > or doing one big calculation?  How important is
> > the speed of access to the database?  What is
> > the memory footprint of the code when it runs
> > on the DS20E?
> 
> It depends.
> 
> What BLAST does is compare a set of sequences against a big database of
> sequences. The databases come in small, medium, and large (bigger than
> 2 GByte) sizes; the sequences can either be a single sequence (imagine
> a researcher looking up a single protein using a web interface) or a
> large set of them. If it's a large set, the problem is embarrassingly
> parallel.
> 
> The BLAST implementation used by most people isn't parallel. It can be
> fairly easily parallelized to divide the big database up into pieces.
> 
> People build fairly different clusters to run BLAST depending on their
> details. The guys at Celera Geonmics didn't want to use a parallel
> version, and their database is bigger than 2 GBytes, so they bought
> Alphas. Most people have small enough databases to fit into 2 GBytes,
> but search against 1 sequence at a time, so they can't afford to read
> the entire database over NFS every time, and keep it on a local disk.

Do you have some sample proteins and databases ?

I would like to test some machines I have available, to mess around a
little bit
(HP PA-RISC series, Sun Fire SPARC, Itanium, PowerPC).

I would like to build a little benchmark around these datasets.

regards
 Robert Depenbrock

-- 
nic-hdl RD-RIPE
http://www.bay13.de/
e-mail: robert at bay13.de
Fingerprint: 1CEF 67DC 52D7 252A 3BCD  9BC4 2C0E AC87 6830 F5DD


From sp at scali.com  Fri Apr 12 13:41:23 2002
From: sp at scali.com (Steffen Persvold)
Date: Fri, 12 Apr 2002 22:41:23 +0200 (CEST)
Subject: very high bandwidth, low latency manner?
In-Reply-To: <3CB739F0.6000109@myri.com>
Message-ID: <Pine.LNX.4.30.0204122220460.16089-100000@elin.scali.no>

On Fri, 12 Apr 2002, Patrick Geoffray wrote:

> Steffen Persvold wrote:
>
> >>I figure out list cost on a 256 node system at about $2000 before for
> >>basic hardware.  I as wrong.  I reworked it and it is $1500 for
> >>256 (and would be the same for 512 and 1024).
> >>
>
> > So what is wron with my calculations :
>  >
>  > 256 node L9/2MB/133MHz config :
>  > Node cost                                 =   $2,195
>  > and for a L9/2MB/200MHz config :
>  > Node cost                                 =   $2,495
>
> Nothing, it's right for 256 nodes. However:
>
> 128 nodes L9/133 MHz config:
> Node cost                                   =   $1,595
> 128 nodes L9/200 MHz config:
> Node cost                                   =   $1,895
>
> For more than 128 ports, the number of switches increases to keep a
> guaranteed full-bissection, it adds about $500 per node. However, up to
> 128 nodes, you need only one switch. and the numbers I gave are correct.
>

Yes, I was just questioning Craig's numbers. I was actually surprised that
the Myrinet node cost didn't increase more when going from 128 to 256
nodes, since it basically involves a lot more hardware (i.e. 4 additional
switch enclosures and 64 additional "spine" cards).

> The switchless cost model makes sense for configs > than the biggest
> switch size for switched technologies, ie. 128 ports for Quadrics and
> Myrinet. Surprisingly, the largest SCI cluster is, AFAIK, 132 nodes ;-)
>

The largest SCI cluster (at least switchless) is indeed 132 nodes.

> > Now we have price comparisons for the interconnects (SCI,Myrinet and
> > Quadrics). What about performance ? Does anyone have NAS/PMB numbers for
> > ~144 node Myrinet/Quadrics clusters (I can provide some numbers from a 132
> > node Athlon 760MP based SCI cluster, and I guess also a 81 node PIII ServerWorks
> > HE-SL based cluster).
>
> Ok, I will say again what I think about these comparaisons: it's already
> hard to compare dollars (what about discount, what about support, what
> about software, etc) despite that it the same dollars, it's wasting time
> to do that for micro-benchmarks. It's something you do when you want to
> publish something in a conference next to a beach.
> When a customer asks me about performance, I don't give him my NAS or
> PMB numbers, he doesn't care. He wants access to a XXX nodes machine to
> play with and run his set of applications, or he gives a list of codes
> to the vendors for the bid and the vendors guarantee the results because
> it's used officially in the bid process. If someone buys a machine
> because the NAS look pretty and his CFD code sucks, this guy will take
> his stuffs and look for a new job.
>
> Do you spend time to tune NAS ? I don't. People already told me that the
> NAS LU test sucks on MPICH-GM. Well, the LU algorithm in HPL is much
> better. How many application behaves like the NAS LU, how many like HPL
> ? If a customer comes to me because his code behaves like NAS LU, I will
>   tell him what to tune in his code to be more efficient.
>
> The pitfall with benchmarks is that you want to tune your MPI
> implementation to looks good on them. In real world, you cannot expect
> to run efficiently a code on a machine without tuning it, specially with
> MPI.
>

I think that most people on this list agree that it is really the
customer's application that counts, not NAS or PMB numbers (and no, I
don't spend much time tuning NAS; it was a bad example). I also agree with
most of your other statements. However, I still think that at least an
MPI-specific benchmark such as PMB (don't know if it's available for
PVM...) will give customers an initial feeling for what interconnect they
need (if they know how their application is architected).


> My 2 pennies
>

Thanks,
-- 
  Steffen Persvold   | Scalable Linux Systems |   Try out the world's best
 mailto:sp at scali.com |  http://www.scali.com  | performing MPI implementation:
Tel: (+47) 2262 8950 |   Olaf Helsets vei 6   |      - ScaMPI 1.13.8 -
Fax: (+47) 2262 8951 |   N0621 Oslo, NORWAY   | >320MBytes/s and <4uS latency


From joachim at sonne.lfbs.rwth-aachen.de  Fri Apr 12 14:02:58 2002
From: joachim at sonne.lfbs.rwth-aachen.de (joachim)
Date: Fri, 12 Apr 2002 23:02:58 +0200 (MEST)
Subject: very high bandwidth, low latency manner?
In-Reply-To: <3CB739F0.6000109@myri.com> from Patrick Geoffray at "Apr 12, 2002 03:48:00 pm"
Message-ID: <200204122102.XAA02537@wikkit.lfbs.rwth-aachen.de>

Fully agreed, but: comparing applications like MM5, GROMACS, ...
based on the interconnect and MPI library (on otherwise identical
systems) *would* make sense. At least for interconnect and MPI
designers, and also for marketing (after carefully choosing the
right cases...). And maybe for some buying decisions for smaller
"home built" systems.

We can discuss this at CAC'02 on Monday. ;)

 regards, Joachim


From lindahl at keyresearch.com  Fri Apr 12 15:23:46 2002
From: lindahl at keyresearch.com (Greg Lindahl)
Date: Fri, 12 Apr 2002 18:23:46 -0400
Subject: What could be the performance of my cluster
In-Reply-To: <3CB73492.78900CF4@bay13.de>; from robert@bay13.de on Fri, Apr 12, 2002 at 09:25:06PM +0200
References: <20020411125920.D32605@hpti.com> <20020412103234.7580.qmail@web10508.mail.yahoo.com> <20020412111552.B1491@hpti.com> <20020412134335.B1810@wumpus.skymv.com> <3CB73492.78900CF4@bay13.de>
Message-ID: <20020412182346.A1990@wumpus.skymv.com>

On Fri, Apr 12, 2002 at 09:25:06PM +0200, Robert Depenbrock wrote:

> Do you have some sample proteins and databases ?

Robert,

I almost got a benchmark suite for BLAST together, but got
side-tracked before I had anything useful. Just like 10,000 other
projects ;-)

greg


From mas at ucla.edu  Fri Apr 12 15:37:50 2002
From: mas at ucla.edu (Michael Stein)
Date: Fri, 12 Apr 2002 15:37:50 -0700
Subject: very high bandwidth, low latency manner? (i860)
In-Reply-To: <Pine.LNX.4.30.0204120530500.10585-100000@elin.scali.no>; from sp@scali.com on Fri, Apr 12, 2002 at 05:37:17AM +0200
References: <v05010178b8dacb026736@[192.168.1.4]> <Pine.LNX.4.30.0204120530500.10585-100000@elin.scali.no>
Message-ID: <20020412153750.A5315@mas1.ats.ucla.edu>

> bandwidth was highest on IA64 460GX...). I can't imagine that the i860 can
> actually perform as well as 340MByte/sec since the Hub-Link (between
> the MCH and the P64H) has a limit of 266MByte/sec (AFAIK) ....

The data sheet for the i860 shows 3 separate Hub-links A, B and C.

A is 266 MByte/sec (and typically runs the 33 MHz 32-bit stuff).

B and C are 533 MByte/sec each and drive the P64Hs
(16 bits * 66 MHz * 4x data xfers).

http://developer.intel.com/design/chipsets/datashts/290713.htm

The pdf is about 1.1 MB.
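
The 533 MByte/sec figure follows directly from the width, clock and
transfers per clock.  A small sketch of that arithmetic (the function
name is hypothetical, and "66 MHz" is taken to mean 66.67 MHz, which is
an assumption about the nominal clock):

```c
#include <assert.h>

/* Peak transfer rate in MByte/s (1 MByte = 10^6 bytes) for a link that
 * is width_bits wide, clocked at clock_mhz, and moves
 * transfers_per_clock words per clock cycle. */
double peak_mbytes_per_sec(int width_bits, double clock_mhz,
                           int transfers_per_clock)
{
    assert(width_bits > 0 && width_bits % 8 == 0);
    assert(clock_mhz > 0.0 && transfers_per_clock > 0);
    return (width_bits / 8) * clock_mhz * transfers_per_clock;
}
```

With 16 bits, 66.67 MHz and 4 transfers per clock this gives about 533;
an 8-bit link under the same assumptions gives about 266, which would
match the figure for Hub-link A.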


From djholm at fnal.gov  Fri Apr 12 16:11:41 2002
From: djholm at fnal.gov (Don Holmgren)
Date: Fri, 12 Apr 2002 18:11:41 -0500
Subject: very high bandwidth, low latency manner? (i860)
In-Reply-To: <20020412153750.A5315@mas1.ats.ucla.edu>
Message-ID: <Pine.SGI.4.21.0204121806370.26265788-100000@hppc.fnal.gov>

Unfortunately the measured performance doesn't match the published
specs.  DMA rates reported by the Myrinet driver on 64/66 cards are
about 315 MB/sec and 225 MB/sec, respectively, for bus writes and reads.  
See the reported measurements on a number of i860-based motherboards at
Greg Lindahl's page,

    http://www.conservativecomputer.com/myrinet/perf.html

This has been a sore point for lots of folks wanting to build clusters
with i860-based machines.
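
To put the gap in numbers: a 64-bit/66 MHz PCI bus peaks at roughly
533 MB/sec (8 bytes per clock), the same figure as the Hub-link spec
quoted below.  A trivial sketch, with a hypothetical function name:

```c
#include <assert.h>

/* Measured rate as a percentage of the theoretical peak. */
double percent_of_peak(double measured_mb_s, double peak_mb_s)
{
    assert(peak_mb_s > 0.0);
    return 100.0 * measured_mb_s / peak_mb_s;
}
```

Against a 533 MB/sec peak, 315 MB/sec is about 59% and 225 MB/sec is
about 42%.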

Don Holmgren
Fermilab


On Fri, 12 Apr 2002, Michael Stein wrote:

> > bandwidth was highest on IA64 460GX...). I can't imagine that the i860 can
> > actually perform as well as 340MByte/sec since the Hub-Link (between
> > the MCH and the P64H) has a limit of 266MByte/sec (AFAIK) ....
> 
> The data sheet for the i860 shows 3 separate Hub-links A, B and C.
> 
> A is 266 MByte/sec (and typically runs the 33 Mhz 32 bit stuff).
> 
> B and C are 533 MByte/sec each and drive the P64Hs.
> (16 bits * 66 Mhz * 4x data xfers).
> 
> http://developer.intel.com/design/chipsets/datashts/290713.htm
> 
> The pdf is about 1.1 MB.
> 


From fraser5 at cox.net  Fri Apr 12 16:51:25 2002
From: fraser5 at cox.net (Jim Fraser)
Date: Fri, 12 Apr 2002 19:51:25 -0400
Subject: BLAS-1, AMD, Pentium, gcc
In-Reply-To: <Pine.SGI.4.21.0204121210450.26154338-100000@hppc.fnal.gov>
Message-ID: <001001c1e27c$f59f1470$0400005a@papabear>

Sure, the optimized BLAS by Intel IS faster (on Intel). The data you
present, while very impressive, are skewed towards Intel because the libs
are optimized ONLY for SSE and Intel chips, while AMD does not really
fully support SSE.
BUT you should replace your stale BLAS code with optimized ATLAS for your
AMD chips... it's a whole new world, my friend!  AMD really kicks some
butt when the libs are optimized for cache size.  It blew me away. The
libs optimize for a specific chip's cache, detect SSE or 3DNow!, and
really exploit them, and the performance is very impressive (as is the
makefile, which runs for quite some time to produce the libs).  Download
the latest developer version, compile, and sit back and smile. WELL WORTH
THE EFFORT, no question.

I got into this to port a CFD code over from intel/mkl/scalapack/mpi to
amd/atlas/scalapack/mpi.  The bang for the buck with AMD is beyond
comparison after you run with this package.  BTW, the ATLAS libs also run
on Intel (they run on ANY chip, for that matter) and improved performance
over the Intel MKL package as well (on some chips; equal on others).  I
don't have all the numbers offhand, but I would suggest you re-run your
case with ATLAS; your conclusion may change.

Try it. It's free.

(PS get the developers source and compile instead of downloading the binary,
the term)
http://www.netlib.org/atlas/


Jim


-----Original Message-----
From: beowulf-admin at beowulf.org [mailto:beowulf-admin at beowulf.org]On
Behalf Of Don Holmgren
Sent: Friday, April 12, 2002 1:36 PM
To: Hung Jung Lu
Cc: beowulf at beowulf.org
Subject: Re: BLAS-1, AMD, Pentium, gcc


On Fri, 12 Apr 2002, Hung Jung Lu wrote:

> Hi,
>
> I am thinking in migrating some calculation programs
> from Windows to Linux, maybe eventually using a
> Beowulf cluster. However, I am kind of worried after I
> read in the mailing list archive about lack of
> CPU-optimized BLAS-1 code in Linux systems. Currently
> I run on a Wintel (Windows+Pentium) machine, and I
> know it's substantially faster than equivalent AMD
> machine, because I use the Intel's BLAS (MKL) library.
> (I apologize for any misapprehensions in what
> follows... I am only starting to explore in this
> arena.)
>
> (1) Does anyone know when gcc will have memory
> prefetching features? Any time frame? I can notice
> very significant performance improvement on my Wintel
> machine, and I think it's due to memory prefetching.


If you mean, "when will gcc's optimizer do automatic prefetching?", I
have no idea.  But, many programmers have been doing manual prefetching
with gcc for quite a while. If you don't mind defining and using
assembler macros, gcc handles it just fine now.  Here's an example:

#define prefetch_loc(addr) \
__asm__ __volatile__ ("prefetchnta %0" \
                      : \
                      : \
                      "m" (*(((char*)(((unsigned int)(addr))&~0x7f)))))


> (2) I am a bit confused on the following issue: Intel
> does release MKL for Linux. So, does this mean that if
> I use Pentium, I still get full benefit of the
> CPU-optimized features in BLAS-1, despite of gcc does
> not do memory prefetching? How is this possible?


The Intel compiler produces object files compatible with gcc, and vice
versa.  I would assume they implemented the library with the Intel
compiler, which has full SSE/SSE2 support (including prefetching).  They
list the MKL for Linux as compatible with both gnu and Intel compilers.


> (3) Related to the above: for general linear algebra
> operations, is Pentium processor then better than AMD,
> since Intel has the machine-optimized BLAS library? I
> get contradictory information sometimes... I've seen
> somewhere that Pentium-4 compares unfavorably with AMD
> chips in calculation speed... Any opinions?
>
> thanks,
>
> Hung Jung Lu


For the very simple SU3 linear algebra (3X3 complex matrices and 3X1
complex vectors) used in our codes, the Pentium 4 outperforms the Athlon
on most of our SSE-assisted routines.  See the table near the bottom of
   http://qcdhome.fnal.gov/sse/inline.html
for Mflops per gigahertz on various routines for P-III, P4, and Athlon.
Perhaps re-coding in 3DNow! would give the Athlon a boost.

For our codes, which are bound by memory bandwidth, P4's do
significantly better than Athlons because of the faster front side bus
(400 MHz effective).  See
   http://qcdhome.fnal.gov/qcdstream/compare.qcdstream
for a table comparing memory bandwidth and SU3 linear algebra
performance on a 1.2 GHz Athlon, 1.4 GHz P4, and 1.7 GHz P4 (see
   http://qcdhome.fnal.gov/qcdstream/
for information about this benchmark).
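As a rough back-of-the-envelope check on the front side bus argument, theoretical peak bandwidth is the effective transfer rate times the bus width. The helper name `fsb_peak_mb_per_s` is hypothetical, and the 8-byte (64-bit) bus width is an assumption that is not stated in the message above; measured bandwidth, as in the tables linked above, will be lower than this peak.

```c
#include <assert.h>

/* Theoretical peak bandwidth of a front side bus in MB/s
 * (1 MB = 10^6 bytes): millions of transfers per second times
 * bytes moved per transfer. */
double fsb_peak_mb_per_s(double effective_mhz, int bus_width_bytes)
{
    assert(effective_mhz > 0.0 && bus_width_bytes > 0);
    return effective_mhz * (double)bus_width_bytes;
}
```

Under those assumptions a 400 MHz effective, 64-bit bus peaks at `fsb_peak_mb_per_s(400.0, 8)`, i.e. 3200 MB/s.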

Don Holmgren
Fermilab

_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf


From suraj_peri at yahoo.com  Sat Apr 13 03:21:52 2002
From: suraj_peri at yahoo.com (Suraj Peri)
Date: Sat, 13 Apr 2002 03:21:52 -0700 (PDT)
Subject: What could be the performance of my cluster
In-Reply-To: <3CB73492.78900CF4@bay13.de>
Message-ID: <20020413102152.24030.qmail@web10506.mail.yahoo.com>

BLAST (Basic Local Alignment Search Tool) takes the
query (either protein or DNA) sequence and tries to
match small patches (let's say it breaks your
sequence into small pieces of 6 letters and then tries
to match them in the database index file).
Once the BLAST algorithm finds any small match, it tries to
extend your query sequence for a further match in the
database. If it finds more, it computes a score and
reports that score. If it doesn't, it reports a
low score, and we do not consider
the lower-scoring hits.
Thus, in my opinion, it does many calculations and
finally shows the scores (P-value).
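The seed-and-extend idea described above can be sketched in a few lines of C. This is a toy under loud assumptions: the function name `best_seed_extension` is hypothetical, matching is exact, and the subject is scanned directly. Real BLAST scores substitutions with a matrix, looks words up in a prebuilt index, and stops extension with a score drop-off, none of which is modeled here.

```c
#include <assert.h>
#include <string.h>

#define WORD_LEN 6   /* word size from the description above */

/* Toy seed-and-extend: find the longest exactly matching stretch between
 * query and subject that contains at least one shared word of WORD_LEN
 * letters.  Returns its length, or 0 if no word matches. */
size_t best_seed_extension(const char *query, const char *subject)
{
    size_t qlen, slen, i, j, best = 0;

    assert(query != NULL && subject != NULL);
    qlen = strlen(query);
    slen = strlen(subject);
    if (qlen < WORD_LEN || slen < WORD_LEN)
        return 0;

    for (i = 0; i + WORD_LEN <= qlen; i++) {
        for (j = 0; j + WORD_LEN <= slen; j++) {
            size_t left = 0, right = WORD_LEN;

            /* Seed: the word at query[i] must match the subject exactly. */
            if (memcmp(query + i, subject + j, WORD_LEN) != 0)
                continue;

            /* Extend to the left while the letters keep matching. */
            while (left < i && left < j &&
                   query[i - left - 1] == subject[j - left - 1])
                left++;

            /* Extend to the right the same way. */
            while (i + right < qlen && j + right < slen &&
                   query[i + right] == subject[j + right])
                right++;

            if (left + right > best)
                best = left + right;
        }
    }
    return best;
}
```

The nested scan is quadratic in the sequence lengths, which is exactly the cost that an index over the database words is meant to avoid.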

Interestingly, BLAST is considered a local alignment
search tool because it tries to match bits of your
query sequence and then extends for more matches.
In contrast, there is another algorithm called FASTA
(Fast Alignment Search Tool); this is global (meaning
it takes big chunks of sequences and then tries to
thread them over the database).
Bill Pearson (its creator) made a PVM version of FASTA,
and his students at Virginia are using it on a Beowulf
cluster.
(You can access that at
ftp://ftp.virginia.edu/pub/fasta/)

In my case my database would be ~80 GB (I hope to
use this much data over NFS).

I am planning to install this algorithm on every
node and then, using MPICH, have each node
access the whole database over NFS.
I am new to this area, and I wonder whether the ideas I
have are practical. We will start configuring
our cluster some time in May.

cheers
suraj.


--- Robert Depenbrock <robert at bay13.de> wrote:
> Greg Lindahl wrote:
>
> Hi Greg,
>
> > On Fri, Apr 12, 2002 at 11:15:52AM -0600, Craig Tierney wrote:
> >
> > > Is the BLAST code something that spends lots
> > > of time doing lots of little calculations,
> > > or doing one big calculation?  How important is
> > > the speed of access to the database?  What is
> > > the memory footprint of the code when it runs
> > > on the DS20E?
> >
> > It depends.
> >
> > What BLAST does is compare a set of sequences against a big database of
> > sequences. The databases come in small, medium, and large (bigger than
> > 2 GByte) sizes; the sequences can either be a single sequence (imagine
> > a researcher looking up a single protein using a web interface) or a
> > large set of them. If it's a large set, the problem is embarrassingly
> > parallel.
> >
> > The BLAST implementation used by most people isn't parallel. It can be
> > fairly easily parallelized to divide the big database up into pieces.
> >
> > People build fairly different clusters to run BLAST depending on their
> > details. The guys at Celera Genomics didn't want to use a parallel
> > version, and their database is bigger than 2 GBytes, so they bought
> > Alphas. Most people have small enough databases to fit into 2 GBytes,
> > but search against 1 sequence at a time, so they can't afford to read
> > the entire database over NFS every time, and keep it on a local disk.
>
> Do you have some sample proteins and databases?
>
> I would like to test some machines I have available to mess around a
> little bit.
> (HP PA-RISC Series, SUN Sparc Fire, Itanium, Power PC).
>
> I would like to build a little benchmark around these datasets.
>
> regards
>  Robert Depenbrock
>
> -- 
> nic-hdl RD-RIPE
> http://www.bay13.de/
> e-mail: robert at bay13.de
> Fingerprint: 1CEF 67DC 52D7 252A 3BCD  9BC4 2C0E AC87 6830 F5DD


=====
PIL/BMB/SDU/DK



From hahn at physics.mcmaster.ca  Sat Apr 13 11:18:39 2002
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Sat, 13 Apr 2002 14:18:39 -0400 (EDT)
Subject: decent performance from G4 Macs?
Message-ID: <Pine.LNX.4.33.0204131358330.25397-100000@coffee.psychology.mcmaster.ca>

I'm doing some benchmarks to evaluate whether current Macs
would make suitable nodes for a serial farm (lots of nodes,
preferably fast CPU and dram, but no serious interconnect.)
I've tried a variety of real codes and benchmarks, but can't
seem to get something like a Mac G4/800 with PC133 to perform
anywhere close to even a P4/1.7/i845/PC133.

I'm using either the gcc 2.95 that comes with OSX or a 
recent 3.1 snapshot (which is MUCH better, but still bad).

is it just that the performance Apple brags about is strictly
in-cache, and/or when doing something ah specialized like 
single-precision SIMD (altivec/velocity engine)?

I haven't really pushed to track down an account on the very
latest dual G4/1000, but afaict it's got the same boring PC133.

is anyone using Macs in clusters, and what kind of performance
are you observing?

thanks, mark hahn.


From wrp at alpha0.bioch.virginia.edu  Sat Apr 13 13:11:25 2002
From: wrp at alpha0.bioch.virginia.edu (William R. Pearson)
Date: Sat, 13 Apr 2002 16:11:25 -0400 (EDT)
Subject: BLAST and FASTA benchmarks
Message-ID: <200204132011.QAA22280@alpha0.bioch.virginia.edu>

There was a bit of misinformation about the difference between the
BLAST and FASTA programs for protein and DNA sequence
comparison.

Both BLAST and FASTA search for local sequence similarity - indeed
they have exactly the same goals, though they use somewhat different
algorithms and statistical approaches.

The advantage of an ES40 or other large shared memory machine for
BLAST is that it has been optimized for searching databases that are
large memory mapped files, and it runs multithreaded.  PVM and MPI
versions of BLAST are not available, but, it is important to remember
that BLAST is extremely fast, and highly optimized to go through a
large amount of memory very quickly; it would be difficult to provide
an equally efficient distributed version - but, of course, a
distributed memory machine would be much cheaper.

PVM and MPI versions of FASTA are available.  FASTA actually is a
package of about a dozen programs that vary more than 100-fold in
speed.  It is easy to make efficient PVM/MPI versions of the slower
algorithms (Smith-Waterman, TFASTY, TFASTX); parallel versions of the
FASTA algorithm are less efficient.

How to benchmark BLAST and FASTA -

As Greg Lindahl pointed out, the appropriate platform for BLAST (less
so for FASTA) depends on the size of the database.  Very few databases
are larger than 2 Gb (I think the person who said he had an 80 Gb
database was mistaken - the largest publicly available sequence
database, Genbank, currently has 17 Gb of sequence data).  In contrast,
protein sequence databases are much smaller, typically 50 - 500 Mb.

If you would like to try searching some protein or DNA sequence
databases, they are available from ftp.ncbi.nih.gov/blast/db.  nr.Z
and swissprot.Z are two representative protein sequence databases,
nt.Z and est_mouse.Z are representative DNA databases.  Simply select
10 - 100 sequences at random from these databases and run them against
the full size databases.
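One way to pick the random query set is to number the records in the database file and draw distinct record numbers, as in the sketch below. The names `sample_records` and `xorshift32` and the choice of generator are assumptions for illustration; reading the chosen records out of the database file is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small xorshift generator so that a benchmark run can be repeated with
 * the same seed.  The state must be nonzero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Choose k distinct record numbers out of 0..n-1 with a partial
 * Fisher-Yates shuffle.  idx is caller-provided space for n entries;
 * on return its first k entries are the sample.  The modulo step has a
 * slight bias, which is harmless for picking benchmark queries. */
void sample_records(size_t *idx, size_t n, size_t k, uint32_t seed)
{
    size_t i;
    uint32_t state = seed ? seed : 1u;   /* avoid the all-zero state */

    assert(idx != NULL && k <= n);
    for (i = 0; i < n; i++)
        idx[i] = i;
    for (i = 0; i < k; i++) {
        /* Pick a position in i..n-1 and swap it into slot i. */
        size_t j = i + (size_t)xorshift32(&state) % (n - i);
        size_t tmp = idx[i];
        idx[i] = idx[j];
        idx[j] = tmp;
    }
}
```

Keeping the seed alongside the timing results makes it possible to rerun exactly the same queries on another machine.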

Bill Pearson


From ron_chen_123 at yahoo.com  Sat Apr 13 16:39:29 2002
From: ron_chen_123 at yahoo.com (Ron Chen)
Date: Sat, 13 Apr 2002 16:39:29 -0700 (PDT)
Subject: decent performance from G4 Macs?
In-Reply-To: <Pine.LNX.4.33.0204131358330.25397-100000@coffee.psychology.mcmaster.ca>
Message-ID: <20020413233929.65399.qmail@web14706.mail.yahoo.com>

--- Mark Hahn <hahn at physics.mcmaster.ca> wrote:
> I'm doing some benchmarks to evaluate whether
> current Macs would make suitable nodes for a serial
> farm (lots of nodes, preferably fast CPU and dram,
> but no serious interconnect.)

Physics or bioscience code?

> I've tried a variety of real codes and benchmarks,
> but can't seem to get something like a Mac G4/800 
> with PC133 to perform anywhere close to even a
> P4/1.7/i845/PC133.
> 
> I'm using either the gcc 2.95 that comes with OSX or
> a recent 3.1 snapshot (which is MUCH better, but
> still bad).

What compiler are you using for the P4? 
 
> is it just that the performance Apple brags about is
> strictly in-cache, and/or when doing something ah
> specialized like single-precision SIMD
>(altivec/velocity engine)?

Apple has some libraries that take advantage of the
Altivec instructions.

http://www.apple.com/downloads/macosx/math_science/applegenentechblast.html

> is anyone using Macs in clusters, and what kind of
> performance
> are you observing?

AFAIK, there are several people using MacOS X in
clusters, the SGE (Sun Grid Engine) project has a port
for Mac OS X.

Maybe you should ask about experience in setting up
Mac OS X compute farms. SGE is specifically written
for that environment.

SGE home:
http://wwws.sun.com/software/gridware/

SGE Open source site:
http://gridengine.sunsource.net

Search for "Mac OS" in the mailing list Archive.
http://gridengine.sunsource.net/servlets/SearchList?listName=dev&by=thread

-Ron



From steveb at aei-potsdam.mpg.de  Sat Apr 13 17:37:42 2002
From: steveb at aei-potsdam.mpg.de (Steven Berukoff)
Date: Sun, 14 Apr 2002 02:37:42 +0200 (MET DST)
Subject: DMA difficulties
Message-ID: <Pine.OSF.4.21.0204140228020.24484-100000@holodec15.aei-potsdam.mpg.de>

Hi all,

This question may be very slightly off-topic, so I apologize.

I'm in the process of setting up a network installation procedure using
PXE/DHCP/NFS/Kickstart w/ RH7.2 for about 150 dual Athlon nodes.  These
nodes use a Maxtor 6L080J4 80.0GB HDD and an ASUS A7M266-D motherboard,
among other things.  One particular note is that I don't need/want CDROMs
in these systems.

Now, a vendor provided me with a couple of test nodes basically to our
specifications, except that they included CDROMs and floppies.  To make a
longish story shorter, I wanted to make sure that the nodes work fine
without the CDROM.

So, I first looked into the BIOS.  I disabled (set to "None") Primary
Slave, Secondary Master/Slave (since my HDD is Primary Master), removed
the CDROM from the list of boot devices, and disabled the Secondary IDE
channel.  Then, I passed the kernel args "ide0=dma hdb=none" to try to
force the HDD to use DMA during the Kickstart installation.

Now, here is the kicker: regardless of the BIOS settings, if I have the
CDROM plugged in (power+IDE, on the secondary channel) the installation
runs ~5 times faster than if the thing isn't there.  This installation
includes installation of ~470 packages plus formatting the HDD.  That's
right, as long as the CDROM is plugged in, everything is peachy, but once
gone, things slow down.

I think this is a problem with the DMA settings, b/c when I pass
"ide=nodma" to the kernel, WITH the CD attached, performance is
slow.  However, I can't even force DMA to be used.

If anyone has any suggestions or similar experiences, please let me know.

Thanks a bunch!
Steve


=====
Steve Berukoff					tel: 49-331-5677233
Albert-Einstein-Institute			fax: 49-331-5677298
Am Muehlenberg 1, D14477 Golm, Germany		email:steveb at aei.mpg.de


From alvin at Maggie.Linux-Consulting.com  Sat Apr 13 18:09:18 2002
From: alvin at Maggie.Linux-Consulting.com (alvin at Maggie.Linux-Consulting.com)
Date: Sat, 13 Apr 2002 18:09:18 -0700 (PDT)
Subject: DMA difficulties
In-Reply-To: <Pine.OSF.4.21.0204140228020.24484-100000@holodec15.aei-potsdam.mpg.de>
Message-ID: <Pine.LNX.3.96.1020413180652.22935A-100000@Maggie.Linux-Consulting.com>

hi ya

i notice that when the cable is attached... things go
bonkers... even if there is no power to the drive ( hd or cdrom )

remove the ide cable from the motherboard if it's not used

and tell the bios NOT to autodetect ide devices
except those that are in fact present

150 nodes.... hummm .... one full cabinet..front and back.. :-)

c ya
alvin
http://www.Linux-1U.net


On Sun, 14 Apr 2002, Steven Berukoff wrote:

> 
> Hi all,
> 
> This question may be very slightly off-topic, so I apologize.
> 
> I'm in the process of setting up a network installation procedure using
> PXE/DHCP/NFS/Kickstart w/ RH7.2 for about 150 dual Athlon nodes.  These
> nodes use a Maxtor 6L080J4 80.0GB HDD and an ASUS A7M266-D motherboard,
> among other things.  One particular note is that I don't need/want CDROMs
> in these systems.
> 
> Now, a vendor provided me with a couple of test nodes basically to our
> specifications, except that they included CDROMs and floppies.  To make a
> longish story shorter, I wanted to make sure that the nodes work fine
> without the CDROM.
> 
> So, I first looked into the BIOS.  I disabled (set to "None") Primary
> Slave, Secondary Master/Slave (since my HDD is Primary Master), removed
> the CDROM from the list of boot devices, and disabled the Secondary IDE
> channel.  Then, I passed the kernel args "ide0=dma hdb=none" to try to
> enforce the HDD to use DMA during the Kickstart installation.
> 
> Now, here is the kicker: regardless of the BIOS settings, if I have the
> CDROM plugged in (power+IDE, on the secondary channel) the installation
> takes ~ 5 times faster than if the thing isn't there.  This installation
> includes installation of ~470 packages plus formatting the HDD.  That's
> right, as long as the CDROM is plugged in, everything is peachy, but once
> gone, things slow down.
> 
> I think this is a problem with the DMA settings, b/c when I pass
> "ide=nodma" to the kernel, WITH the CD attached, performance is
> slow.  However, I can't even force DMA to be used.
> 
> If anyone has any suggestions or similar experiences, please let me know.
> 
> Thanks a bunch!
> Steve
> 
> 
> =====
> Steve Berukoff					tel: 49-331-5677233
> Albert-Einstein-Institute			fax: 49-331-5677298
> Am Muehlenberg 1, D14477 Golm, Germany		email:steveb at aei.mpg.de
> 
> 
> 


From robl at mcs.anl.gov  Sat Apr 13 18:29:56 2002
From: robl at mcs.anl.gov (Robert Latham)
Date: Sat, 13 Apr 2002 20:29:56 -0500
Subject: decent performance from G4 Macs?
In-Reply-To: <Pine.LNX.4.33.0204131358330.25397-100000@coffee.psychology.mcmaster.ca>
References: <Pine.LNX.4.33.0204131358330.25397-100000@coffee.psychology.mcmaster.ca>
Message-ID: <20020414012956.GA20390@mcs.anl.gov>

On Sat, Apr 13, 2002 at 02:18:39PM -0400, Mark Hahn wrote:
> is it just that the performance Apple brags about is strictly
> in-cache, and/or when doing something ah specialized like 
> single-precision SIMD (altivec/velocity engine)?

it's the altivec unit that makes G4s at all interesting.  if you
aren't using the vector unit, yeah, you won't even come close to x86.   

gcc is multi-platform, sure, but its optimizer for x86 has received a
lot of attention, while the powerpc optimizer has not. your
observation that gcc 3.1 performance is better shows that focus on
powerpc optimizations has grown, but yeah, it's going to get less
attention than x86.  too bad, really.  register pressure on a powerpc
is much less than on x86 ( register pressure on just about any arch
not stack-based is less than that on x86 :> )

you are running on mac os x, yes?  is there any chance you could put
linux on it?  if your application is making a significant number of
system calls ( file i/o, network traffic... you know, system calls )
os x will hurt you. I'd be curious to hear if your application
performs better under linux on powerpc (debian, suse, mandrake,
yellowdog; there are many options) than it does under os x on the same
hardware.  ( if you use linux, you'll have to hand-code some assembly
to use the G4. samples abound on the web.  but if you are
compute-intensive anyway, you might not see gains running under linux)

microbenchmarks don't always correlate well with application
performance, but here are lmbench numbers. the hardware is constant
while i varied the operating system: 
http://clustermonkey.org/~laz/pbook/lmbench.powerbook.txt (the numbers
are nearly 8 months old, but the newer versions of os X do not show any
remarkable improvement and in fact regress on some scores)

rgb, do you know what the cputest curves look like for a G4 mac?  

also bear in mind that G4s run significantly cooler than their x86
counterparts, so you might still come out ahead on price/performance,
where price takes into account initial purchase + cost of running the
cluster.  

so there you go. there are lots of reasons why you'll have to actually
spend a bit of effort to move to a new architecture.  i hope no one on
this list finds that idea surprising.  

==rob

-- 
Rob Latham
                                             A215 0178 EA2D B059 8CDF  
                                             B29D F333 664A 4280 315B


From echiu at imservice.com  Sat Apr 13 18:57:36 2002
From: echiu at imservice.com (Eric Chiu)
Date: Sat, 13 Apr 2002 18:57:36 -0700
Subject: CD from "Building Linux Cluster"
References: <Pine.LNX.3.96.1020413180652.22935A-100000@Maggie.Linux-Consulting.com>
Message-ID: <00a101c1e357$ece44f90$e3c0fea9@squaw>

Has anyone set up a cluster using the CD from Spector's book 
"Building Linux Clusters" (O'Reilly)? 

Eric Chiu, author/consultant
Imservice, Inc.
www.imservice.com


From jsmith at structbio.vanderbilt.edu  Sat Apr 13 19:45:41 2002
From: jsmith at structbio.vanderbilt.edu (Jarrod Smith)
Date: Sat, 13 Apr 2002 21:45:41 -0500 (CDT)
Subject: decent performance from G4 Macs?
In-Reply-To: <Pine.LNX.4.33.0204131358330.25397-100000@coffee.psychology.mcmaster.ca>
Message-ID: <Pine.LNX.4.33.0204132142220.8334-100000@scylla.structbio.vanderbilt.edu>

On Sat, 13 Apr 2002, Mark Hahn wrote:
> is it just that the performance Apple brags about is strictly
> in-cache, and/or when doing something ah specialized like 
> single-precision SIMD (altivec/velocity engine)?

I've been making a foray into OS X on G4 hardware recently.  After having
compiled and benchmarked a couple of our compute-intensive codes, I have
wondered the same thing...

So far double-precision floating point has not impressed me in the least
on the G4.

Jarrod Smith


From robl at mcs.anl.gov  Sat Apr 13 19:58:49 2002
From: robl at mcs.anl.gov (Robert Latham)
Date: Sat, 13 Apr 2002 21:58:49 -0500
Subject: CD from "Building Linux Cluster"
In-Reply-To: <00a101c1e357$ece44f90$e3c0fea9@squaw>
References: <Pine.LNX.3.96.1020413180652.22935A-100000@Maggie.Linux-Consulting.com> <00a101c1e357$ece44f90$e3c0fea9@squaw>
Message-ID: <20020414025849.GB20390@mcs.anl.gov>

On Sat, Apr 13, 2002 at 06:57:36PM -0700, Eric Chiu wrote:
> Has anyone set up a cluster using the CD from Spector's book 
> "Building Linux Clusters" (O'Reilly)? 

http://www.oreilly.com/catalog/clusterlinux/

i'm guessing the answer is 'no'

==rob

-- 
Rob Latham
                                             A215 0178 EA2D B059 8CDF  
                                             B29D F333 664A 4280 315B


From walke at usna.edu  Sat Apr 13 19:58:54 2002
From: walke at usna.edu (Vann H. Walke)
Date: 13 Apr 2002 22:58:54 -0400
Subject: CD from "Building Linux Cluster"
In-Reply-To: <00a101c1e357$ece44f90$e3c0fea9@squaw>
References: 	<Pine.LNX.3.96.1020413180652.22935A-100000@Maggie.Linux-Consulting.com> 
	<00a101c1e357$ece44f90$e3c0fea9@squaw>
Message-ID: <1018753136.25541.3.camel@walkeonline.com>

I don't have the book, but suspect that the included software would be
well out of date.  If you're just getting into clustering, I would
suggest trying the Scyld distribution.  You can get it for $3 at
linuxcentral.com.  

Good Luck,
Vann

On Sat, 2002-04-13 at 21:57, Eric Chiu wrote:
> Has anyone set up a cluster using the CD from Spector's book 
> "Building Linux Clusters" (O'Reilly)? 
> 
> Eric Chiu, author/consultant
> Imservice, Inc.
> www.imservice.com
> 
> 


From spoel at xray.bmc.uu.se  Sun Apr 14 00:18:02 2002
From: spoel at xray.bmc.uu.se (David van der Spoel)
Date: Sun, 14 Apr 2002 09:18:02 +0200 (CEST)
Subject: decent performance from G4 Macs?
In-Reply-To: <Pine.LNX.4.33.0204132142220.8334-100000@scylla.structbio.vanderbilt.edu>
Message-ID: <Pine.LNX.4.10.10204140915170.9488-100000@zorn.bmc.uu.se>

On Sat, 13 Apr 2002, Jarrod Smith wrote:

>> is it just that the performance Apple brags about is strictly
>> in-cache, and/or when doing something ah specialized like 
>> single-precision SIMD (altivec/velocity engine)?
>
>I've been making a foray into OS X on G4 hardware recently.  After having
>compiled and benchmarked a couple of our compute-intensive codes, I have
>wondered the same thing...
>
>So far double-precision floating point has not impressed me in the least
>on the G4.

We have done some single precision (gcc with altivec code) tests using
our molecular dynamics code GROMACS. The results are on

http://www.gromacs.org/benchmarks/single.php

the numbers are simulation time/real time, i.e. higher is better. The G4
is slightly slower than an Athlon (w 3DNow)/P3 (w SSE) at the same clock.
Haven't tested double precision yet.


Groeten, David.
________________________________________________________________________
Dr. David van der Spoel, 	Biomedical center, Dept. of Biochemistry
Husargatan 3, Box 576,  	75123 Uppsala, Sweden
phone:	46 18 471 4205		fax: 46 18 511 755
spoel at xray.bmc.uu.se	spoel at gromacs.org   http://zorn.bmc.uu.se/~spoel
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


From echiu at imservice.com  Sat Apr 13 23:22:01 2002
From: echiu at imservice.com (Eric Chiu)
Date: Sat, 13 Apr 2002 23:22:01 -0700
Subject: BladeFrame vs Beowulf
References: <Pine.LNX.3.96.1020413180652.22935A-100000@Maggie.Linux-Consulting.com> <00a101c1e357$ece44f90$e3c0fea9@squaw>
Message-ID: <012201c1e37c$f71fa2a0$e3c0fea9@squaw>

Has anyone worked with one of these BladeFrame systems?
http://www.egenera.com/prod_spec_overview.php

I'm wondering how this compares to a custom-built Beowulf. 
I like how they have consolidated the networking and 
hardware in this proprietary architecture. One of the biggest 
problems in a Beowulf is keeping track of the boxes and ethernet 
connections.

Eric Chiu, author/consultant
Imservice, Inc.
www.imservice.com


From emiller at techskills.com  Sun Apr 14 05:28:21 2002
From: emiller at techskills.com (Eric Miller)
Date: Sun, 14 Apr 2002 08:28:21 -0400
Subject: CD from "Building Linux Cluster"
In-Reply-To: <00a101c1e357$ece44f90$e3c0fea9@squaw>
Message-ID: <NMELJLHHFNGMNFFNGJAEOEIICEAA.emiller@techskills.com>

>Has anyone set up a cluster using the CD from Spector's book
>"Building Linux Clusters" (O'Reilly)?

Eric,

I tried to order that book about a year ago, it was taken out of print (at
least the edition then was).  An email response from the publisher stated
that the book was such low quality that they had to take it off the shelves,
too many returns/reader complaints.

You may have a newer edition.


From opengeometry at yahoo.ca  Sun Apr 14 08:59:12 2002
From: opengeometry at yahoo.ca (William Park)
Date: Sun, 14 Apr 2002 11:59:12 -0400
Subject: CD from "Building Linux Cluster"
In-Reply-To: <NMELJLHHFNGMNFFNGJAEOEIICEAA.emiller@techskills.com>; from emiller@techskills.com on Sun, Apr 14, 2002 at 08:28:21AM -0400
References: <00a101c1e357$ece44f90$e3c0fea9@squaw> <NMELJLHHFNGMNFFNGJAEOEIICEAA.emiller@techskills.com>
Message-ID: <20020414115912.A13058@node0.opengeometry.ca>

On Sun, Apr 14, 2002 at 08:28:21AM -0400, Eric Miller wrote:
> >Has anyone set up a cluster using the CD from Spector's book
> >"Building Linux Clusters" (O'Reilly)?
> 
> Eric,
> 
> I tried to order that book about a year ago, it was taken out of print (at
> least the edition then was).  An email response from the publisher stated
> that the book was such low quality that they had to take it off the shelves,
> too many returns/reader complaints.
> 
> You may have a newer edition.

I have it, but it's so out-of-date now.  Try Mosix or Beowulf.

-- 
William Park, Open Geometry Consulting, <opengeometry at yahoo.ca>
8 CPU cluster, NAS, (Slackware) Linux, Python, LaTeX, Vim, Mutt, Tin


From hahn at physics.mcmaster.ca  Sun Apr 14 09:24:54 2002
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Sun, 14 Apr 2002 12:24:54 -0400 (EDT)
Subject: decent performance from G4 Macs?
In-Reply-To: <20020413233929.65399.qmail@web14706.mail.yahoo.com>
Message-ID: <Pine.LNX.4.33.0204141214020.27058-100000@coffee.psychology.mcmaster.ca>

> --- Mark Hahn <hahn at physics.mcmaster.ca> wrote:
> > I'm doing some benchmarks to evaluate whether
> > current Macs would make suitable nodes for a serial
> > farm (lots of nodes, preferably fast CPU and dram,
> > but no serious interconnect.)
> 
> Physics or bioscience code?

why does it matter?  we're not trying specifically to run BLAST, 
if that's what you're asking.  I don't see any reason why the 
department would matter, but it's a mixture of math, chem,
physics, astro, biologists, and perhaps a few psychologists.

> > I've tried a variety of real codes and benchmarks,
> > but can't seem to get something like a Mac G4/800 
> > with PC133 to perform anywhere close to even a
> > P4/1.7/i845/PC133.
> > 
> > I'm using either the gcc 2.95 that comes with OSX or
> > a recent 3.1 snapshot (which is MUCH better, but
> > still bad).
> 
> What compiler are you using for the P4? 

I'm pretty happy with recent snapshots of gcc 3.1 (pre-release).
(still mystified why gnu fortran people are stuck at F77, but...)

> > is it just that the performance Apple brags about is
> > strictly in-cache, and/or when doing something ah
> > specialized like single-precision SIMD
> >(altivec/velocity engine)?
> 
> Apple has some libraries that take advantage of the
> Altivec instructions.

linpack/lapack/atlas/fftw?

> AFAIK, there are several people using MacOS X in
> clusters, the SGE (Sun Grid Engine) project has a port
> for Mac OS X.

which doesn't give me ANY data on performance.


From opengeometry at yahoo.ca  Sun Apr 14 09:23:27 2002
From: opengeometry at yahoo.ca (William Park)
Date: Sun, 14 Apr 2002 12:23:27 -0400
Subject: [MAILER-DAEMON@x263.net: Undelivered Mail Returned to Sender]
Message-ID: <20020414122327.A13288@node0.opengeometry.ca>

To list maintainer:

Please unsubscribe <qgzeng at x263.net>.  Every time I post to the list, I get
a rejection notice from <smtp.x263.net>.  It should go to you!

-- 
William Park, Open Geometry Consulting, <opengeometry at yahoo.ca>
8 CPU cluster, NAS, (Slackware) Linux, Python, LaTeX, Vim, Mutt, Tin

From heckendo at cs.uidaho.edu  Sun Apr 14 10:05:39 2002
From: heckendo at cs.uidaho.edu (Robert B Heckendorn)
Date: Sun, 14 Apr 2002 10:05:39 -0700 (PDT)
Subject: MPI/PVM for BLAST and FASTA
In-Reply-To: <200204141601.g3EG14G09306@blueraja.scyld.com>
Message-ID: <200204141705.KAA16409@brownlee.cs.uidaho.edu>

Bill Pearson's paragraph raises so many great questions that
maybe Bill or others can answer.

> The advantage of an ES40 or other large shared memory machine for
> BLAST is that it has been optimized for searching databases that are
> large memory mapped files, and it runs multithreaded.  PVM and MPI
> versions of BLAST are not available but, it is important to remember
> that BLAST is extremely fast, and highly optimized to go through a
> large amount of memory very quickly; it would be difficult to provide
> an equally efficient distributed version - but, of course, a
> distributed memory machine would be much cheaper.

I think I could learn a lot by listening to the details of why this
is not done.  So here goes:

Why is it that BLAST is not available for MPI/PVM?  I would think
clusters would be the perfect host for such an application.
Is it that there is no need because BLAST is already so fast and
no one wants to break the database out onto node-resident disks?
Or is it that BLAST is kept running on single-processor or shared-memory
machines so that the DB is always in memory, ready to roll without
loading, and doing the same for a cluster is not worth it
because the same trick is difficult to do on a node given the current
way clusters are built?  I assume the same is true for FASTA?

thanks for the clarification,

-- 
| Robert Heckendorn                        | We may not be the only
| heckendo at cs.uidaho.edu                   | species on the planet but
| http://www.cs.uidaho.edu/~heckendo       | we sure do act like it.
| CS Dept, University of Idaho             |
| Moscow, Idaho, USA   83844-1010          |


From steveb at aei-potsdam.mpg.de  Sun Apr 14 10:24:24 2002
From: steveb at aei-potsdam.mpg.de (Steven Berukoff)
Date: Sun, 14 Apr 2002 19:24:24 +0200 (MET DST)
Subject: DMA difficulties
In-Reply-To: <Pine.LNX.3.96.1020413180652.22935A-100000@Maggie.Linux-Consulting.com>
Message-ID: <Pine.OSF.4.21.0204141907500.24786-100000@holodec15.aei-potsdam.mpg.de>

Hi,

Sorry, I should have given a bit more info.

If the IDE cable is attached, but the power cable is not, the machine will
not complete POST; it will hang.  If the power cable is attached, but the
IDE cable is not, the machine completes POST, and goes forward with the
install.  However, performance is slow.  Only when both cables are
attached to the CDROM does the installation run quickly.

To address Alvin's comments, all settings in the BIOS relevant to the
CDROM are disabled: the CDROM is not listed as a boot device, it's not a
Master or Slave on either IDE channel, and the Secondary IDE channel is
disabled.  Further, no IDE cables are attached where they shouldn't be,
i.e., only the HDD cable is plugged in.  Finally, there is no option in
the BIOS for enabling/disabling autodetection of IDE devices.

To address Mark's comments, the kernel that I'm using is the 2.4.7-10
kernel that comes with RH7.2.  In particular, I'm using the kernel found
in images/pxeboot, which includes support for the network loopback device,
initial ramdisk, etc.  Also, the boot messages say that the HDD is
DMA enabled, although, as I've said, I'm a bit wary of that pronouncement.

I thought about compiling my own kernel for this, instead of using the RH
distro version.  However, going through some of the permutations of kernel
configurations didn't produce a useful product.  Anyone have insights as
to the kernel config that will work for this, or the options in the stock
RH kernel, or how to extract such options?

TIA again for your insights.

Steve


> hi ya
> 
> i notice that when the cable is attached... things goes
> bonkers... even if no power ot the drive ( hd or cdrom )
> 
> remove the ide cable from the motherboard if its not used
> 
> and tell the bios NOT to autodetect ide devices
> except those that is in fact present
> 
> 150 nodes.... hummm .... one full cabinet..front and back.. :-)
> 
> c ya
> alvin
> http://www.Linux-1U.net
> 
> 
> On Sun, 14 Apr 2002, Steven Berukoff wrote:
> 
> > 
> > Hi all,
> > 
> > This question may be very slightly off-topic, so I apologize.
> > 
> > I'm in the process of setting up a network installation procedure using
> > PXE/DHCP/NFS/Kickstart w/ RH7.2 for about 150 dual Athlon nodes.  These
> > nodes use a Maxtor 6L080J4 80.0GB HDD and an ASUS A7M266-D motherboard,
> > among other things.  One particular note is that I don't need/want CDROMs
> > in these systems.
> > 
> > Now, a vendor provided me with a couple of test nodes basically to our
> > specifications, except that they included CDROMs and floppies.  To make a
> > longish story shorter, I wanted to make sure that the nodes work fine
> > without the CDROM.
> > 
> > So, I first looked into the BIOS.  I disabled (set to "None") Primary
> > Slave, Secondary Master/Slave (since my HDD is Primary Master), removed
> > the CDROM from the list of boot devices, and disabled the Secondary IDE
> > channel.  Then, I passed the kernel args "ide0=dma hdb=none" to try to
> > force the HDD to use DMA during the Kickstart installation.
> > 
> > Now, here is the kicker: regardless of the BIOS settings, if I have the
> > CDROM plugged in (power+IDE, on the secondary channel) the installation
> > runs ~5 times faster than if the thing isn't there.  This installation
> > includes installation of ~470 packages plus formatting the HDD.  That's
> > right, as long as the CDROM is plugged in, everything is peachy, but once
> > gone, things slow down.
> > 
> > I think this is a problem with the DMA settings, b/c when I pass
> > "ide=nodma" to the kernel, WITH the CD attached, performance is
> > slow.  However, I can't even force DMA to be used.
> > 
> > If anyone has any suggestions or similar experiences, please let me know.
> > 
> > Thanks a bunch!
> > Steve


=====
Steve Berukoff					tel: 49-331-5677233
Albert-Einstein-Institute			fax: 49-331-5677298
Am Muehlenberg 1, D14477 Golm, Germany		email:steveb at aei.mpg.de


From hahn at physics.mcmaster.ca  Sun Apr 14 10:43:43 2002
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Sun, 14 Apr 2002 13:43:43 -0400 (EDT)
Subject: decent performance from G4 Macs?
In-Reply-To: <20020414012956.GA20390@mcs.anl.gov>
Message-ID: <Pine.LNX.4.33.0204141229550.27058-100000@coffee.psychology.mcmaster.ca>

> On Sat, Apr 13, 2002 at 02:18:39PM -0400, Mark Hahn wrote:
> > is it just that the performance Apple brags about is strictly
> > in-cache, and/or when doing something ah specialized like 
> > single-precision SIMD (altivec/velocity engine)?
> 
> it's the altivec unit that makes G4s at all interesting.  if you
> aren't using the vector unit, yeah, you won't even come close to x86.   

as far as I can tell, the requirement to think highly of G4 is:
	hand-tuned altivec and a tiny working set
which pretty much excludes any general-purpose scientific computing.

> gcc is multi-platform, sure, but its optimizer for x86 has received a
> lot of attention, while the powerpc optimizer has not. your

I'm not sure that's true: I read the gcc developers list and see
significant efforts from Apple people.  and remember that lots of 
code is not inherently vectorizable, so would never win big on SIMD.

> observation that gcc 3.1 performance is better shows that focus on
> powerpc optimizations has grown, but yeah, it's going to get less

afaict, 3.1 improvements are from improved infrastructure, nothing
powerpc-specific.

> you are running on mac os x, yes?  is there any chance you could put
> linux on it?  if your application is making a significant number of
> system calls ( file i/o, network traffic... you know, system calls )

no, I'm really only interested in compute-bound performance.

> also bear in mind that G4s run significantly cooler than their x86
> counterparts, so you might still come out ahead on price/performance,

I've heard Apple/Moto's PR on that, too.  but my recent benchmarking
has made me "think different": the G4 appears to be about the same 
performance as current Intel notebook PIII's.  which, of course,
burn about the same power as G4's...

> where price takes into account initial purchase + cost of running the
> cluster.  

we're in the market for 100-200 CPUs.  it's not obvious to me that it 
matters whether the CPU burns 20 or 50W, since we've already got 
30 kW of Alphas in the room ;)

CPU/MHz		W	note
G4e/1000	21	probably "design" power
PIIIulv/700	 8	"design" power
PIIIt/1113	28	"design" power
P4a/2200	55	"design" power
athxp/1800	66	max power

> so there you go. there are lots of reasons why you'll have to actually
> spend a bit of effort to move to a new architecture.  i hope no one on
> this list finds that idea surprising.  

I certainly do.  powerpc support in gcc is not immature, and the cpu
is supposed to be a general-purpose one.  if my observations are true,
then it's the slowest shipping GP machine, and is only viable if you
can afford to structure your program around its SIMD and cache.

regards, mark hahn.


From gotero at linuxprophet.com  Sun Apr 14 16:54:59 2002
From: gotero at linuxprophet.com (Glen Otero)
Date: 14 Apr 2002 16:54:59 -0700
Subject: CD from "Building Linux Cluster"
In-Reply-To: <1018753136.25541.3.camel@walkeonline.com>
References: 	<Pine.LNX.3.96.1020413180652.22935A-100000@Maggie.Linux-Consulting.com> 
	<00a101c1e357$ece44f90$e3c0fea9@squaw> 
	<1018753136.25541.3.camel@walkeonline.com>
Message-ID: <1018828499.1838.175.camel@prophet>

I tried to build a cluster with the CD when I reviewed that book for
Linux Journal.  Incredibly, the software was released unfinished, and so
building a cluster with it wasn't possible.  The book was pulled from
circulation for this and other editorial reasons.  I recommend Rocks,
Scyld, and OSCAR for building clusters.

Glen

On Sat, 2002-04-13 at 19:58, Vann H. Walke wrote:
> I don't have the book, but suspect that the included software would be
> well out of date.  If you're just getting into clustering, I would
> suggest trying the Scyld distribution.  You can get it for $3 at
> linuxcentral.com.  
> 
> Good Luck,
> Vann
> 
> On Sat, 2002-04-13 at 21:57, Eric Chiu wrote:
> > Has anyone set up a cluster using the CD from Spector's book 
> > "Building Linux Clusters" (O'Reilly)? 
> > 
> > Eric Chiu, author/consultant
> > Imservice, Inc.
> > www.imservice.com
> > 
-- 
Glen Otero, Ph.D.
Linux Prophet
Office:858.792.5561
Mobile:619.917.1772
www.linuxprophet.com
"The Beowulf is primarily a mental phenomenon"


From alvin at Maggie.Linux-Consulting.com  Sun Apr 14 17:37:32 2002
From: alvin at Maggie.Linux-Consulting.com (alvin at Maggie.Linux-Consulting.com)
Date: Sun, 14 Apr 2002 17:37:32 -0700 (PDT)
Subject: DMA difficulties
In-Reply-To: <Pine.OSF.4.21.0204141907500.24786-100000@holodec15.aei-potsdam.mpg.de>
Message-ID: <Pine.LNX.3.96.1020414173440.6958B-100000@Maggie.Linux-Consulting.com>

hi ya steven

if you have the hdd cable plugged in ( am assuming into the motherboard )
but no ide drive on it .... you will get wacky results ...
( whether secondary ide is disabled in the bios or not... )

remove cables that don't go anywhere ( is what am trying to say )
remove 'em if the devices are disabled ..

most bioses do allow you to autodetect or user-define the
devices... but dunno about your motherboard...

the default rh-7.2 kernel should work fine...
( doesn't cough up erroneous messages on boot that i know of.. )

c ya
alvin




From ron_chen_123 at yahoo.com  Sun Apr 14 18:55:13 2002
From: ron_chen_123 at yahoo.com (Ron Chen)
Date: Sun, 14 Apr 2002 18:55:13 -0700 (PDT)
Subject: decent performance from G4 Macs?
In-Reply-To: <Pine.LNX.4.33.0204141214020.27058-100000@coffee.psychology.mcmaster.ca>
Message-ID: <20020415015513.55770.qmail@web14702.mail.yahoo.com>

--- Mark Hahn <hahn at physics.mcmaster.ca> wrote:
> > --- Mark Hahn <hahn at physics.mcmaster.ca> wrote:
> > > I'm doing some benchmarks to evaluate whether current Macs would
> > > make suitable nodes for a serial farm (lots of nodes, preferably
> > > fast CPU and dram, but no serious interconnect.)
> > 
> > Physics or bioscience code?
> 
> why does it matter?  we're not trying specifically to run BLAST,
> if that's what you're asking.  I don't see any reason why the
> department would matter, but it's a mixture of math, chem,
> physics, astro, biologists, and perhaps a few psychologists.
> 
> > > I've tried a variety of real codes and benchmarks, but can't seem
> > > to get something like a Mac G4/800 with PC133 to perform anywhere
> > > close to even a P4/1.7/i845/PC133.
> > > 
> > > I'm using either the gcc 2.95 that comes with OSX or a recent 3.1
> > > snapshot (which is MUCH better, but still bad).
> > 
> > What compiler are you using for the P4? 
> 
> I'm pretty happy with recent snapshots of gcc 3.1 (pre-release).
> (still mystified why gnu fortran people are stuck at F77, but...)
> 
> > > is it just that the performance Apple brags about is strictly
> > > in-cache, and/or when doing something ah specialized like
> > > single-precision SIMD (altivec/velocity engine)?
> > 
> > Apple has some libraries that take advantage of the Altivec
> > instructions.
> 
> linpack/lapack/atlas/fftw?
> 
> > AFAIK, there are several people using MacOS X in clusters, the SGE
> > (Sun Grid Engine) project has a port for Mac OS X.
> 
> which doesn't give me ANY data on performance.
> 

You can use a better compiler for the PPC:

http://www.absoft.com/newproductpage.html


Also, I did not say that SGE would provide you ANY
data on performance -- all I said was that you could
find people using Mac OS X/G4 machines in the cluster
world. (or if you don't like SGE, you can choose PBS,
they also have a Mac OS X port)

-Ron



From wrp at alpha0.bioch.virginia.edu  Sun Apr 14 18:57:56 2002
From: wrp at alpha0.bioch.virginia.edu (William R. Pearson)
Date: Sun, 14 Apr 2002 21:57:56 -0400 (EDT)
Subject: G4's for scientific computing
Message-ID: <200204150157.VAA22514@alpha0.bioch.virginia.edu>

One of the advantages of the MacOSX gcc compiler is that inline
Altivec instructions are available at a high level.  One can
define vector arrays, and do vector operations from 'C' code, e.g.

	while(vec_any_gt(T2, NAUGHT)) {
	  T2 = vec_sub(LSHIFT(T2), RR);
	  FF = vec_max(FF, T2);
	}

We are testing an Altivec FASTA version; an Altivec BLAST was announced
several months ago.  We like Altivec because we can manipulate 8
16-bit integers or 16 8-bit integers at once - biological sequence
comparison code is essentially all integer.  We see 6-fold speedups
when things are done 8-fold parallel.  On our codes a dual 533 G4
with Altivec code is 6X faster than a dual 1 GHz PIII (we don't have a
GHz G4 yet).  Because of the high-level Altivec primitives in the
Apple gcc compiler, vectorizing was very, very easy; we would have to
be much more sophisticated to do the same thing on the PIII (and the
potential speed-up would be 1/2 as large, since the vector is 64, not
128 bits).

I might have agreed with the statement that one must have hand-tuned
Altivec code which pretty much excludes general purpose scientific
computing 4 months ago, but our experience has been very positive -
our programs are not specialized signal processing programs, but, in
retrospect, it was easy to get very dramatic speed up.

Bill Pearson


From wrp at alpha0.bioch.virginia.edu  Sun Apr 14 19:32:20 2002
From: wrp at alpha0.bioch.virginia.edu (William R. Pearson)
Date: Sun, 14 Apr 2002 22:32:20 -0400 (EDT)
Subject: Parallel BLAST
Message-ID: <200204150232.WAA22617@alpha0.bioch.virginia.edu>

  
> Why is it that BLAST is not available for MPI/PVM?  I would think
> clusters would be the perfect host for such an application.
> Is it that there is no need because BLAST is already so fast and
> no one wants to break the database out onto node-resident disks?
> Or is it that BLAST is kept running on single-processor or shared-memory 
> machines so that the DB is always in memory, ready to roll without
> loading, and doing the same for a cluster is not worth it
> because the same trick is difficult to do on a node given the current
> way clusters are built?  I assume the same is true for FASTA?

I suspect that BLAST is not available for MPI/PVM because (1) it is
too fast, and (2) there is not much demand for it.  

95% of the time, BLAST is almost an in-memory grep (the other 5% of
the time it is working on the things it is looking for).  Sequence
comparison is embarrassingly parallel, and very easily threaded.
Distributing the sequence databases and collecting results has more
overhead (there probably aren't many distributed grep programs
either).  FASTA is 5 - 10X slower than BLAST, and Smith-Waterman is
another 5-20X slower than FASTA.  Here, the communications overhead is
low, and distributed systems work OK for FASTA, and great for
Smith-Waterman (where the overhead fraction is very small).

Of course, it is a lot easier to compile a threaded program, and just
run it, than it is to install and configure the MPI or PVM environment
and the programs to run in it.  Bioinformatics software is often run
by computer savvy biologists, not high-performance computing folks,
and not having to install and configure PVM/MPI is a big advantage.
The NCBI probably does not make a PVM/MPI parallel BLAST because there
is very little demand for it, and it does not meet their computational
needs.

Bill Pearson


From lindahl at keyresearch.com  Fri Apr 12 17:08:52 2002
From: lindahl at keyresearch.com (Greg Lindahl)
Date: Fri, 12 Apr 2002 20:08:52 -0400
Subject: very high bandwidth, low latency manner? (i860)
In-Reply-To: <Pine.SGI.4.21.0204121806370.26265788-100000@hppc.fnal.gov>; from djholm@fnal.gov on Fri, Apr 12, 2002 at 06:11:41PM -0500
References: <20020412153750.A5315@mas1.ats.ucla.edu> <Pine.SGI.4.21.0204121806370.26265788-100000@hppc.fnal.gov>
Message-ID: <20020412200852.B2381@wumpus.skymv.com>

On Fri, Apr 12, 2002 at 06:11:41PM -0500, Don Holmgren wrote:

> Unfortunately the measured performance doesn't match the published
> specs. 

In fact, this is *always* true for every PCI and memory system out
there. Measure, measure, measure. The myrinet perftest and STREAM
memory benchmark are your friends.

-- greg


From rickey-co at mug.biglobe.ne.jp  Sat Apr 13 11:57:27 2002
From: rickey-co at mug.biglobe.ne.jp (Iwao Makino)
Date: Sun, 14 Apr 2002 03:57:27 +0900
Subject: decent performance from G4 Macs?
In-Reply-To:  <Pine.LNX.4.33.0204131358330.25397-100000@coffee.psychology.mcmaster.ca>
References:  <Pine.LNX.4.33.0204131358330.25397-100000@coffee.psychology.mcmaster.ca>
Message-ID: <v050101bfb8de29ad161a@[192.168.1.4]>

Mark,

At 14:18 -0400 13.04.2002, Mark Hahn wrote:
>I'm doing some benchmarks to evaluate whether current Macs
>would make suitable nodes for a serial farm (lots of nodes,
>preferably fast CPU and dram, but no serious interconnect.)

I agree about the first two, but as for interconnect, there's Myrinet
for MacOS X!!!
Myricom has now released MPICH-GM for MacOS X as well.

I personally haven't purchased a Mac to test, but since Apple says
that G4s running BLAST are a lot faster than a P4, I'm
quite interested in evaluating them soon.

>I've tried a variety of real codes and benchmarks, but can't
>seem to get something like a Mac G4/800 with PC133 to perform
>anywhere close to even a P4/1.7/i845/PC133.
>
>I'm using either the gcc 2.95 that comes with OSX or a
>recent 3.1 snapshot (which is MUCH better, but still bad).

I think you have to MODIFY code a bit to take advantage of the velocity engine
with the MacOS X gcc.

I thought this was an interesting post, along with the other bluff:
<http://www.apple.com/pr/library/2002/feb/07blast.html>


>is it just that the performance Apple brags about is strictly
>in-cache, and/or when doing something ah specialized like
>single-precision SIMD (altivec/velocity engine)?

I too think that's a big part of it...

-- 

Best regards,

Iwao Makino
Hard Data Ltd. Tokyo branch
mailto:iwao at harddata.com
http://www.harddata.com/

-> HPC cluster specialist<-
-> Scientific Imaging/Life Science/Physical Science/Parallel Computing <-


From bgb at itcnv.com  Sun Apr 14 07:35:00 2002
From: bgb at itcnv.com (bgb at itcnv.com)
Date: Sun, 14 Apr 2002 14:35:00 GMT
Subject: BladeFrame vs Beowulf
In-Reply-To: <012201c1e37c$f71fa2a0$e3c0fea9@squaw> 
References: <Pine.LNX.3.96.1020413180652.22935A-100000@Maggie.Linux-Consulting.com>
            <00a101c1e357$ece44f90$e3c0fea9@squaw>
            <012201c1e37c$f71fa2a0$e3c0fea9@squaw>
Message-ID: <20020414143501.29671.qmail@smtp.itcnv.com>

There is also: 

http://www.rlxtechnologies.com/about/pr_blast.php 

Eric Chiu writes: 

> Has anyone worked on one of these BladeFrame?
> http://www.egenera.com/prod_spec_overview.php 
> 
> I'm wondering how this compares to a custom-built Beowulf. 
> I like how they have consolidated the networking and 
> hardware in this proprietary architecture. One of the biggest 
> problems in a Beowulf is keeping track of the boxes and ethernet 
> connections. 
> 
> Eric Chiu, author/consultant
 

B.G. Bruce
Networking Technologies N.V / Internet Technologies (Curacao) N.V. 

Phone: +599 9 563-1836
Fax:   +599 9 465-3594 

Alternate Email:  bgbruce at it-curacao.com, ancu321 at attglobal.net 


From lindahl at keyresearch.com  Sun Apr 14 17:50:47 2002
From: lindahl at keyresearch.com (Greg Lindahl)
Date: Sun, 14 Apr 2002 17:50:47 -0700
Subject: CD from "Building Linux Cluster"
In-Reply-To: <00a101c1e357$ece44f90$e3c0fea9@squaw>; from echiu@imservice.com on Sat, Apr 13, 2002 at 06:57:36PM -0700
References: <Pine.LNX.3.96.1020413180652.22935A-100000@Maggie.Linux-Consulting.com> <00a101c1e357$ece44f90$e3c0fea9@squaw>
Message-ID: <20020414175047.A12785@wumpus.attbi.com>

On Sat, Apr 13, 2002 at 06:57:36PM -0700, Eric Chiu wrote:

> Has anyone set up a cluster using the CD from Spector's book 
> "Building Linux Clusters" (O'Reilly)? 

I sat in an airport line for an hour once with a woman who knew
Spector. So she asked me what I thought of the book, and you guys know
me well enough to know how good of a spin I put on my answer: "Well,
he seemed to have a clue about high availability, but the Beowulf
section was pretty crappy."

It turns out that he's well aware of that, and was egged on to write a
"complete" book by the editors. Ah well, it's a shame no matter how it
happened.

greg


From erayo at cs.bilkent.edu.tr  Sun Apr 14 21:08:39 2002
From: erayo at cs.bilkent.edu.tr (Eray Ozkural)
Date: Mon, 15 Apr 2002 07:08:39 +0300
Subject: G4's for scientific computing
In-Reply-To: <200204150157.VAA22514@alpha0.bioch.virginia.edu>
References: <200204150157.VAA22514@alpha0.bioch.virginia.edu>
Message-ID: <200204150708.40331.erayo@cs.bilkent.edu.tr>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Monday 15 April 2002 04:57, William R. Pearson wrote:
>
> I might have agreed with the statement that one must have hand-tuned
> Altivec code which pretty much excludes general purpose scientific
> computing 4 months ago, but our experience has been very positive -
> our programs are not specialized signal processing programs, but, in
> retrospect, it was easy to get very dramatic speed up.
>

I imagine fake vector processing would only work for certain types of problems. 
That's not SIMD by any measure. Don't you really need multiple data streams 
for general purpose HPC?

Regards,

- -- 
Eray Ozkural (exa) <erayo at cs.bilkent.edu.tr>
Comp. Sci. Dept., Bilkent University, Ankara
www: http://www.cs.bilkent.edu.tr/~erayo  Malfunction: http://mp3.com/ariza
GPG public key fingerprint: 360C 852F 88B0 A745 F31B  EA0F 7C07 AE16 874D 539C
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQE8ulJHfAeuFodNU5wRAn7oAJ9n7oJC3nfBv29EBYOpypOjBGLUmACcCmPO
kY+ZBvrh1ev4iQnFMkQV4IA=
=YeV6
-----END PGP SIGNATURE-----


From ssy at prg.cpe.ku.ac.th  Sun Apr 14 23:09:30 2002
From: ssy at prg.cpe.ku.ac.th (Somsak Sriprayoonsakul)
Date: Mon, 15 Apr 2002 13:09:30 +0700
Subject: Need many C/C++ MPI programs
Message-ID: <000d01c1e444$2177e860$0100a8c0@yggdrasil>

Hello,
	I need to test my cluster by running many, many MPI parallel
programs. Is there any MPI program archive or similar from which I could
download program source? It would be better if the programs are
written in C/C++ so I could tune their performance and see how they run
on my cluster.

Thanks
Somsak 


From markus at markus-fischer.de  Mon Apr 15 02:40:49 2002
From: markus at markus-fischer.de (Markus Fischer)
Date: Mon, 15 Apr 2002 11:40:49 +0200
Subject: very high bandwidth, low latency manner?
References: <Pine.LNX.4.30.0204121919130.16089-100000@elin.scali.no>
Message-ID: <3CBAA021.DB753C6F@markus-fischer.de>

Steffen Persvold wrote:
> 
> Now we have price comparisons for the interconnects (SCI,Myrinet and
> Quadrics). What about performance ? Does anyone have NAS/PMB numbers for
> ~144 node Myrinet/Quadrics clusters (I can provide some numbers from a 132
> node Athlon 760MP based SCI cluster, and I guess also a 81 node PIII ServerWorks
> HE-SL based cluster).

yes, please.

I would like to get/see some numbers.
I have run tests with SCI for a nonlinear diffusion algorithm on a 96-node
cluster with a 32/33 interface. I thought that the poor
scalability was due to the older interface, so I switched to
an SCI system with 32 nodes and a 64/66 interface.

Still, the speedup was a dog with more than 8 nodes.

In particular, the startup time reaches minutes, which is probably due to
the exporting and mapping of memory.

Yes, the MPI library used was ScaMPI. Thus, I think the
(marketing) numbers you provide
below are not relevant except for applying for more VC.

Even worse, we noticed that the SCI ring structure has an impact on the 
communication pattern/performance of other applications. 
This means we only got the same execution time if the other nodes were
idle or were not running communication-intensive applications.
How will you determine the performance of the algorithm you just invented
in such a case?

We then used a 512-node cluster with Myrinet2000. The algorithm scaled
very well up to 512 nodes.

Markus

> 
> Regards,
> --
>   Steffen Persvold   | Scalable Linux Systems |   Try out the world's best
>  mailto:sp at scali.com |  http://www.scali.com  | performing MPI implementation:
> Tel: (+47) 2262 8950 |   Olaf Helsets vei 6   |      - ScaMPI 1.13.8 -
> Fax: (+47) 2262 8951 |   N0621 Oslo, NORWAY   | >320MBytes/s and <4uS latency
> 


From jacobsgd21 at BrandonU.CA  Mon Apr 15 09:04:42 2002
From: jacobsgd21 at BrandonU.CA (Geoffrey D. Jacobs)
Date: Mon, 15 Apr 2002 11:04:42 -0500
Subject: David HM Spector, "Building Linux Clusters"
Message-ID: <3CBAFA1A.9090802@brandonu.ca>

A waste of ink and paper. This book has no depth, and the included 
software is incomplete.

Look elsewhere for your reference needs.


From tim at dolphinics.com  Mon Apr 15 10:28:42 2002
From: tim at dolphinics.com (Tim Wilcox)
Date: Mon, 15 Apr 2002 11:28:42 -0600
Subject: Need many C/C++ MPI programs
References: <000d01c1e444$2177e860$0100a8c0@yggdrasil>
Message-ID: <3CBB0DCA.3000908@dolphinics.com>


Somsak Sriprayoonsakul wrote:

>Hello,
>	I need to test my cluster by running many, many MPI parallel
>programs. Is there any MPI program archive or similar from which I could
>download program source? It would be better if the programs are
>written in C/C++ so I could tune their performance and see how they run
>on my cluster.
>
There are several benchmarks available with source; I commonly use these 
for testing machines.  Try Linpack at http://www.netlib.org/benchmark/hpl/
which is good for CPU performance.

I also use PMB, http://www.pallas.com/e/products/pmb/download.htm
which is good for interconnect performance.

Tim Wilcox

>
>Thanks
>Somsak 


From SGaudet at turbotekcomputer.com  Mon Apr 15 11:11:55 2002
From: SGaudet at turbotekcomputer.com (Steve Gaudet)
Date: Mon, 15 Apr 2002 14:11:55 -0400
Subject: Parallel BLAST
Message-ID: <3450CC8673CFD411A24700105A618BD6267DC4@911TURBO>



There's also a commercial version from Turbogenomics.

http://www.turbogenomics.com

Offering:

1) Ready to go, plug-n-play solution for parallel BLAST
2) Expertise and 20+ years of experience in parallel computing
3) Dynamic database splitting feature to take advantage of computers that
have less memory than the size of the database
4) Smart load balancing - achieve linear to superlinear speedup
5) No modification made to the NCBI BLAST algorithm to ensure identical
results with the non-parallel version
6) Easy drop-in update whenever NCBI releases newer versions of their
algorithm
7) Excellent support
8) 30-day money-back guarantee


Cheers,


Steve Gaudet 
Linux Solutions Engineer
 
===================================================================
| Turbotek Computer Corp.    tel:603-666-3062 ext. 21             |
| 8025 South Willow St.      fax:603-666-4519                     |
| Building 2, Unit 105       toll free:800-573-5393               |
| Manchester, NH 03103       e-mail:sgaudet at turbotekcomputer.com  |
|                            web: http://www.turbotekcomputer.com |
===================================================================

  
From Hakon.Bugge at scali.com  Tue Apr 16 03:24:37 2002
From: Hakon.Bugge at scali.com (Håkon Bugge)
Date: Tue, 16 Apr 2002 12:24:37 +0200
Subject: very high bandwidth, low latency manner?
In-Reply-To: <3CBAA021.DB753C6F@markus-fischer.de>
References: <Pine.LNX.4.30.0204121919130.16089-100000@elin.scali.no>
Message-ID: <5.1.0.14.0.20020416122156.05491530@62.70.89.10>

Hi,


I am sorry to hear that you were unable to achieve the expected performance
on the mentioned SCI-based systems. You raise a couple of issues, which I
would like to address:

1) Performance.

Performance transparency is always a goal. Nevertheless, an implementation
will sometimes have a performance bug. The two organizations owning the
mentioned systems both have support agreements with Scali. I have checked
the support requests, but cannot find any request in which your incidents
were reported. We find this strange if you truly were aiming to achieve
good performance. We are happy to look into your application and report
our findings back to this news group.

2) Startup time.

You attribute the poor scalability to high startup time and mapping of
memory. This is an interesting hypothesis, and it can easily be verified
by using a switch when you start the program and measuring the difference
between the elapsed time of the application and the time it uses after
MPI_Init() has been called. However, the startup time measured on 64 nodes,
two processors per node, where all processes have set up mappings to all
other processes, is nn seconds. If this contributes to poor scalability,
your application has a very short runtime.

3) SCI ring structure

You state that in a multi-user, multi-process environment it is hard to
get deterministic performance numbers. Indeed, that is true. True sharing
of resources implies that. Whether the resource is a file server, a memory
controller, or a network component, you will probably always be subject to
performance differences. Also, lack of page coloring will contribute to
different execution times, even for a sequential program. You further
indicate that performance numbers reported by, e.g., the Pallas PMB
benchmark can only be used for applying for more VC. I disagree for two
reasons. First, you imply that venture capitalists are naive (and to some
extent stupid). That is not my impression; rather the opposite. Second,
such numbers are a good way to verify or refute your hypothesis that the
SCI ring structure is sensitive to traffic generated by other applications.
PMB's *multi* option is designed to investigate exactly the problem you
mention: run, e.g., MPI_Alltoall() on N/2 of the machine, then measure how
performance is affected when the other N/2 of the machine is also running
MPI_Alltoall(). This is the reason we are interested in performance numbers
to compare against SCI-based systems. It is strange to me that no Pallas
PMB benchmark results have ever been published for a reasonably sized
system based on alternative interconnect technologies. To quote Lord
Kelvin: "If you haven't measured it, you don't know what you're talking
about".

As a bottom line, I would like initiatives to compare cluster
interconnect performance to be appreciated, rather than scrutinized
and characterized as "only usable to apply for more VC".


H
At 11:40 AM 4/15/02 +0200, Markus Fischer wrote:
>Steffen Persvold wrote:
> >
> > Now we have price comparisons for the interconnects (SCI, Myrinet and
> > Quadrics). What about performance? Does anyone have NAS/PMB numbers for
> > ~144 node Myrinet/Quadrics clusters (I can provide some numbers from a
> > 132 node Athlon 760MP based SCI cluster, and I guess also a 81 node
> > PIII ServerWorks HE-SL based cluster).
>
>yes, please.
>
>I would like to get/see some numbers.
>I have run tests with SCI for a non linear diffusion algorithm on a 96 node
>cluster with 32/33 interface. I thought that the poor
>scalability was due to the older interface, so I switched to
>a SCI system with 32 nodes and 64/66 interface.
>
>Still, the speedup values were behaving like a dog with more than 8 nodes.
>
>Especially, the startup time will reach minutes which is probably due to
>the exporting and mapping of memory.
>
>Yes, the MPI library used was Scampi. Thus, I think the
>(marketing) numbers you provide
>below are not relevant except for applying for more VC.
>
>Even worse, we noticed, that the SCI ring structure has an impact on the
>communication pattern/performance of other applications.
>This means we only got the same execution time if other nodes were
>idle or did not have communication-intensive applications.
>How will you determine the performance of the algorithm you just invented
>in such a case ?
>
>We then used a 512 node cluster with Myrinet2000. The algorithm scaled
>very fine up to 512 nodes.
>
>Markus
>
> >
> > Regards,
> > --
> >   Steffen Persvold   | Scalable Linux Systems |   Try out the world's best
> >  mailto:sp at scali.com |  http://www.scali.com  | performing MPI 
> implementation:
> > Tel: (+47) 2262 8950 |   Olaf Helsets vei 6   |      - ScaMPI 1.13.8 -
> > Fax: (+47) 2262 8951 |   N0621 Oslo, NORWAY   | >320MBytes/s and <4uS 
> latency
> >

--
Håkon Bugge; VP Product Development; Scali AS;
mailto:hob at scali.no; http://www.scali.com; fax: +47 22 62 89 51;
Voice: +47 22 62 89 50; Cellular (Europe+US): +47 924 84 514;
Visiting Addr: Olaf Helsets vei 6, Bogerud, N-0621 Oslo, Norway;
Mail Addr:  Scali AS, Postboks 150, Oppsal, N-0619  Oslo, Norway;


From Hakon.Bugge at scali.com  Tue Apr 16 03:33:55 2002
From: Hakon.Bugge at scali.com (Håkon Bugge)
Date: Tue, 16 Apr 2002 12:33:55 +0200
Subject: very high bandwidth, low latency manner?
Message-ID: <5.1.0.14.0.20020416123123.054aac70@62.70.89.10>

I'm sorry. I forgot to fill in the startup time. It's 14.5 seconds for 128
processes on 64 nodes, when all processes have mapped remote memory of all
other 127 processes.
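
To put that figure in proportion, here is a minimal sketch of the
arithmetic. Only the 14.5 second startup time comes from the measurement
above; the total run times are hypothetical values chosen for illustration,
and startup_fraction is a name made up for this example.

```python
def startup_fraction(startup_s, total_elapsed_s):
    """Fraction of wall-clock time spent before MPI_Init() returns."""
    if total_elapsed_s <= 0:
        raise ValueError("total elapsed time must be positive")
    return startup_s / total_elapsed_s


STARTUP_S = 14.5  # measured: 128 processes on 64 nodes, all memory mapped

# Hypothetical total run times (seconds), not taken from any measurement.
for runtime_s in (60, 600, 3600):
    share = startup_fraction(STARTUP_S, runtime_s)
    print(f"{runtime_s:5d} s run: startup is {share:.1%} of elapsed time")
```

Under these assumptions the startup cost only dominates for runs that last
about a minute or less.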

H


From rauch at inf.ethz.ch  Tue Apr 16 06:15:38 2002
From: rauch at inf.ethz.ch (Felix Rauch)
Date: Tue, 16 Apr 2002 15:15:38 +0200 (CEST)
Subject: Memory benchmark (was Re: very high bandwidth, low latency manner?
 (i860))
In-Reply-To: <20020412200852.B2381@wumpus.skymv.com>
Message-ID: <Pine.LNX.4.33.0204161506110.11209-100000@maloney.ethz.ch>

On Fri, 12 Apr 2002, Greg Lindahl wrote:
> The myrinet perftest and STREAM memory benchmark are your friends.

If you need more detailed information about the performance of your
memory system than STREAM offers, then you might want to look at the
ECT benchmarks (developed by colleagues of mine):

Extended Copy Transfer Characterization
http://www.cs.inf.ethz.ch/CoPs/ECT/

- Felix
-- 
Felix Rauch                      | Email: rauch at inf.ethz.ch
Institute for Computer Systems   | Homepage: http://www.cs.inf.ethz.ch/~rauch/
ETH Zentrum / RZ H18             | Phone: ++41 1 632 7489
CH - 8092 Zuerich / Switzerland  | Fax:   ++41 1 632 1307


From jayne at sphynx.clara.co.uk  Tue Apr 16 07:34:17 2002
From: jayne at sphynx.clara.co.uk (Jayne Heger)
Date: Tue, 16 Apr 2002 14:34:17 +0000
Subject: what architecture was MPI and PVM 1st designed for?
Message-ID: <E16xStZ-000Pdh-00@oracle.clara.net>

Hi,

Could anyone tell me what computer architecture MPI and PVM were first
designed for/written on?
Thanks,

Jayne Heger


From eugen at leitl.org  Tue Apr 16 06:40:49 2002
From: eugen at leitl.org (Eugen Leitl)
Date: Tue, 16 Apr 2002 15:40:49 +0200 (CEST)
Subject: OpenMosix
Message-ID: <Pine.LNX.4.33.0204161540050.18441-100000@hydrogen.leitl.org>

http://newsvac.newsforge.com/article.pl?sid=02/04/13/055227

Saturday April 13, 2002 - [ 05:00 AM GMT ]
Bruce Knox writes "Tel Aviv (April 11, 2002) - Dr. Moshe Bar recently 
announced the creation of openMosix, a new OpenSource project. The project 
has quickly attracted a team of volunteer developers from around the globe 
and is off to a very fast start. openMosix is an extension of the Linux
kernel.

For thousands of users, MOSIX has been a reliable, fast and cost-efficient 
clustering platform with users in life sciences, finance, industry, 
high-tech, research and government environments. The goal of openMosix is 
to give to these users continued support and an up-to-date fully GPLv2 
OpenSource platform.

openMosix began as the last verifiable GPL version of MOSIX. All openMosix 
extensions are under the full GPLv2 license, the GNU General Public 
License (GPL) Version 2. The openMosix Copyright is held by Moshe Bar.

openMosix is a Linux kernel extension for single-system image clustering. 
openMosix is perfectly scalable and adaptive. Once you have installed 
openMosix, the nodes in the cluster start talking to one another and the 
cluster adapts itself to the workload.

There is no need to program applications specifically for openMosix. Since 
all openMosix extensions are inside the kernel, every application 
automatically and transparently benefits from the distributed computing 
concept of openMosix. The cluster behaves much as does a SMP, but this 
solution scales to well over a thousand nodes which can themselves be 
SMPs.

OpenSource is more than just free access to software source code. The 
basic idea behind open source is very simple: When programmers can read, 
redistribute, and modify the source code for a piece of software, the 
software evolves. People improve it, people adapt it, people fix bugs. And 
this can happen at a speed that, if one is used to the slow pace of 
conventional software development, seems astonishing. 
the Open Source Initiative

Moshe Bar is an Operating Systems researcher, writer of Byte Magazine 
column Serving With Linux , author of numerous Linux books, and frequent 
contributor to the Linux tree. Moshe lectures for universities, 
corporations, and international organizations. He holds a Bachelor degree 
in mathematics, a M.S. and a Ph.D. in computer science. Moshe runs 
moshebar.com with a mailing list of over 20,000 members, is Chief 
Technical Officer of Qlusters, Inc., and is the Project Manager for 
openMosix. Moshe was born in Israel, grew up in a kibbutz, and now lives 
in Tel Aviv.

The development team of volunteers is truly international. The early team 
members reside in Chile, Spain, Italy, Norway, Germany, Israel, France and 
the United States. Plus, other mailing list queries have come from Canada, 
Pakistan, Oman, Estonia, Finland, India, South Africa, Switzerland, Tonga, 
and Shanghai China. Projects using openMosix already include astrophysics, 
medical research, and university laboratories.

The openMosix project is hosted on SourceForge.net which provides 
collaborative development web tools for the project. Downloads, 
documentation, and additional information are available from 
www.openmosix.org.

MOSIX is a very highly regarded, high performance, low cost, flexible, and 
scaleable Cluster Computing System for Linux. MOSIX was a GPL OpenSource 
project until late 2001. MOSIX, operational since 1983, integrates 
independent computers into a cluster, providing the user with what appears 
to be a single-machine Linux environment. Both the MOSIX Copyright and the 
MOSIX Trademark are owned by Professor Amnon Barak. Amnon Barak is a 
Professor of Computer Science and the Director of the Distributed 
Computing Laboratory in the Institute of Computer Science at the Hebrew 
University of Jerusalem on sabbatical leave for one year.

openMosix is Copyright © 2002 by Moshe Bar.
Linux is Copyright © 2002 by Linus Torvalds.
Mosix is Copyright © 2002 by Amnon Barak.
openMosix is licensed under the GNU General Public License (GPL) Version
2, June 1991 as published by the Free Software Foundation.
All logos and trademarks are the property of their respective owners.
Copyright © 2002 by Moshe Bar"


From eugen at leitl.org  Tue Apr 16 06:45:27 2002
From: eugen at leitl.org (Eugen Leitl)
Date: Tue, 16 Apr 2002 15:45:27 +0200 (CEST)
Subject: GBit Ethernet over Cu evaluation
Message-ID: <Pine.LNX.4.33.0204161544040.18441-100000@hydrogen.leitl.org>

http://www.cs.uni.edu/~gray/gig-over-copper/

Gigabit Over Copper Evaluation

DRAFT

Prepared by Anthony Betz and Paul Gray

April 2, 2002

University of Northern Iowa

Department of Computer Science

Cedar Falls, IA 50614


Given the relatively low cost, backwards compatibility, and wide
availability of gigabit over copper network interfaces,
the migration to commodity gigabit networks has begun. Copper-based
gigabit solutions are now providing an alternative to the often more 
expensive fiber-based network solutions that are typically integrated in 
high performance environments such as today's tightly-coupled cluster 
systems.

But how do these cards compare with their fiber based counterparts? Are 
the Linux-based drivers ready for prime-time? The intent of this paper is 
to provide an extensive comparison of the various Gigabit over copper 
network interface cards available. Since performance is based on numerous 
factors such as bus architecture and the network protocol being used, 
these are the two main subjects of our investigation.

Our bandwidth benchmarks look at sustained throughput using TCP. While 
other communication protocols are available, indeed preferred, for high- 
performance computing, TCP-based benchmarks provide an immediate insight 
into the expected performance of the cards. With PCI-X coming into the 
marketplace in more and more motherboards as well as the multitude of 
systems with more traditional 32-bit PCI subsystems, numerous cards are 
available for today's 64bit and 32bit computer systems. The 64bit cards 
tested were as follows: Syskonnect SK9821, Syskonnect SK9D21, Asante 
Giganix, Ark Soho-GA2000T, 3Com 3c996BT and Intel's E1000 XT. The 32bit 
cards were Ark Soho-GA2500T, D-Link DGE500T. Comparisons for the various 
cards were made with respect to operation in alternate bus configurations 
and varied maximum transmission unit (MTU) sizes of TCP frames (jumbo 
frames). Results were gathered using Netpipe 2.4. By using Netpipe the 
peak sustained throughput would be provided as well as the transfer rate 
for varying packet sizes.

Note: All cards were tested at 1500, 3000, 4000, and 6000 values for the 
TCP MTU size. The drivers for the cards were not modified. Cards based 
upon the dp83820 chipset were limited to 6000MTU due to driver defaults. 
All other cards were tested through 9000MTU.
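
As a rough guide to why MTU matters here, the following sketch estimates
what fraction of the bytes on the wire can carry TCP payload at each MTU.
It is not part of the evaluation: the per-frame overheads are the standard
Ethernet, IPv4 and TCP sizes with no header options assumed, and the
function name is invented for this example. Measured throughput will
normally fall below these ceilings.

```python
# Per-frame costs in bytes. These are standard Ethernet/IPv4/TCP sizes,
# not figures taken from the evaluation itself.
ETH_OVERHEAD = 8 + 14 + 4 + 12   # preamble+SFD, header, FCS, inter-frame gap
IP_TCP_HEADERS = 20 + 20         # IPv4 and TCP headers, assuming no options


def tcp_payload_efficiency(mtu):
    """Fraction of wire bytes that carry TCP payload at a given MTU."""
    if mtu <= IP_TCP_HEADERS:
        raise ValueError("MTU too small to carry any TCP payload")
    return (mtu - IP_TCP_HEADERS) / (mtu + ETH_OVERHEAD)


for mtu in (1500, 3000, 4000, 6000, 9000):   # MTU values named in the note
    eff = tcp_payload_efficiency(mtu)
    print(f"MTU {mtu:5d}: {eff:.2%} payload, "
          f"at most {eff * 1000:.0f} Mbit/s on a 1 Gbit/s link")
```

The header-overhead gain from jumbo frames is only a few percent; any
larger difference seen in practice would have to come from fewer packets
and interrupts per byte, which this sketch does not model.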

[results too voluminous to post]


From rgb at phy.duke.edu  Tue Apr 16 07:18:51 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Tue, 16 Apr 2002 10:18:51 -0400 (EDT)
Subject: what architecture was MPI and PVM 1st designed for?
In-Reply-To: <E16xStZ-000Pdh-00@oracle.clara.net>
Message-ID: <Pine.LNX.4.44.0204161009250.8713-100000@lucifer.rgb.private.net>

On Tue, 16 Apr 2002, Jayne Heger wrote:

> 
> Hi,
> 
> Could anyone tell me what computer architecture MPI and PVM were first 
> designed for/written on?

See

  http://www.epm.ornl.gov/pvm/

and look under "documentation" for "PVM and MPI: A comparison of
features".  Read the "Background" section.

Among many other sources, but this is terse and probably adequate, close
to "horse's mouth" accurate for PVM (but "Project Overview" is also
there and IS horse's mouth:-) and of course I'm sure that the primary
MPI sites have similar historical stuff linked.

In a very terse nutshell, PVM was written for the kitchen sink (whatever
you happened to have handy and networked).  MPI was written by a
consortium of vendors and users to provide a common API for large,
expensive massively parallel computers.  As I understand it this wasn't
really the vendors' idea -- they would've been happy to continue
providing only their proprietary interfaces -- but the government
finally put its foot down as it learned just how much money it was
spending, first on the iron, then on porting code to run on the iron,
and then on NEW iron and RE-porting their ported code to run on the NEW
iron, etc.  Moore's law demanding that they rebuy everything every few
years or actually lose ground, of course...

   rgb

> Thanks,
> 
> Jayne Heger

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From Sebastien.Cabaniols at Compaq.com  Tue Apr 16 03:57:18 2002
From: Sebastien.Cabaniols at Compaq.com (Cabaniols, Sebastien)
Date: Tue, 16 Apr 2002 12:57:18 +0200
Subject: decreasing #define HZ in the linux kernel for CPU/memory bound apps ?
Message-ID: <11EB52F86530894F98FFB1E21F9972547EF918@aeoexc01.emea.cpqcorp.net>

Hi beowulfs!

Would it be interesting to decrease the #define HZ in the linux kernel
for CPU/Memory bound computational farms  ?
(I just posted the same question to lkml)

I mean we very often have only one running process eating 99% of 
the CPU, but we (in fact I) don't know if we lose time doing context
switches ....

Did anyone experiment on that ?

Thanks in advance


From rgb at phy.duke.edu  Tue Apr 16 08:29:03 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Tue, 16 Apr 2002 11:29:03 -0400 (EDT)
Subject: decreasing #define HZ in the linux kernel for CPU/memory bound
 apps ?
In-Reply-To: <11EB52F86530894F98FFB1E21F9972547EF918@aeoexc01.emea.cpqcorp.net>
Message-ID: <Pine.LNX.4.44.0204161052470.8713-100000@lucifer.rgb.private.net>

On Tue, 16 Apr 2002, Cabaniols, Sebastien wrote:

> Hi beowulfs!
> 
> Would it be interesting to decrease the #define HZ in the linux kernel
> for CPU/Memory bound computational farms  ?
> (I just posted the same question to lkml)
> 
> I mean we very often have only one running process eating 99% of 
> the CPU, but we (in fact I) don't know if we lose time doing context
> switches ....
> 
> Did anyone experiment on that ?
> 
> Thanks in advance

This was discussed a long time ago on kernel lists.  IIRC (and it was a
LONG time ago -- years -- so don't shoot me if I don't) the consensus
was that Linus was comfortable keeping HZ where it provided very good
interactive response time FIRST (primary design criterion) and efficient
for long running tasks SECOND (secondary design criterion) so no, they
weren't considering retuning anything anytime soon.  Altering HZ isn't
by any means guaranteed to improve task granularity (the scheduler
already does a damn good job there and is hard to improve). Also,
because there are a LOT of things that use it, written by many people
some of whom may well not have used it RIGHT, altering HZ may cause odd
side effects or break things.  I wouldn't recommend it unless you are
willing to live without or work pretty hard to fix whatever breaks.

The context switch part of the question is a bit easier.  By strange
chance, I'm at this moment running a copy of xmlsysd and wulfstat (my
current-project cluster monitoring toolset) on my home cluster, where
(to help Jayne this morning) I also cranked up pvm and the xep
mandelbrot set application.  So it is easy for me to test this.

During a panel update (with all my nodes whonking away on doing
mandelbrot set iterations) the context switch rate is negligible --
12-16/second -- on true nodes (ones doing nothing but computing or
twiddling their metaphorical thumbs).  The rate hardly changes relative
to the idle load when the system is doing a computation -- the scheduler
is quite efficient.  Interrupt rates on true nodes similarly remain
very close to baseline of a bit more than 100/second even when doing the
computations, which are of course quite coarse grained with only a bit
of network traffic per updated strip per node and strip times on the
order of seconds.  So for a coarse grained, CPU intensive task running
on dedicated nodes I doubt you'd see so much as 1% improvement monkeying
with pretty much any "simple" kernel tuning parameter -- I think that
single numerical jobs run at well OVER 99% efficiency as is.
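
For anyone who wants to check these rates on their own nodes, here is a
minimal sketch. It is not the xmlsysd/wulfstat code; it only assumes that
the kernel exposes cumulative "ctxt" and "intr" counters in /proc/stat, and
the function names are invented for this example.

```python
import time


def read_counters(stat_text):
    """Return (context_switches, interrupts) totals from /proc/stat text."""
    ctxt = intr = None
    for line in stat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "ctxt":
            ctxt = int(fields[1])
        elif fields[0] == "intr":
            intr = int(fields[1])  # first number is the total over all IRQs
    return ctxt, intr


def rates(before, after, interval_s):
    """Per-second rates from two (ctxt, intr) samples taken interval_s apart."""
    return tuple((b - a) / interval_s for a, b in zip(before, after))


def sample(interval_s=5.0, path="/proc/stat"):
    """Read the counters twice and return (ctxt/s, intr/s) over the interval."""
    with open(path) as f:
        first = read_counters(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        second = read_counters(f.read())
    return rates(first, second, interval_s)
```

Calling sample() on an idle node and again while a job is running gives the
two pairs of numbers to compare.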

Note that on workstation-nodes (ones running a GUI and this and that)
the story is quite different, although still good.  For example, I'm
running X, xmms (can't work without music, can we:-), the xep GUI,
wulfstat (the monitoring client), galeon, and a dozen other xterms and
small apps on my desktop; my sons are running X and screensavers on
their systems downstairs (grrr, have to talk to them about that, or just
plain disable that:-) and on THESE nodes the context switch rates range
closer to 1300-1800/sec (the latter for those MP3's).  Interrupt rates
are still just over 100/sec -- this tends to vary only when doing some
sort of very intensive I/O.  Note that even mp3 decoding only takes a
few percent of my desktop's CPU.

However, beauteously enough, when I do an xep rubberband update, I
still get SIMULTANEOUSLY flawlessly decoded mp3's (not so much as a
bobble of the music stream) AND the maximum possible amount of CPU
diverted to the mandelbrot strip computations and their display.

I view this delightful responsiveness of linux as a very important
feature.  I've never hesitated to distribute CPU-intensive work around
on linux workstation nodes with an adequate amount of memory because I'm
totally confident that unless the application fills memory or involves a
very latency-bounded (e.g. small packet network) I/O stream, the
workstation user will notice, basically, "nothing" -- their interactive
response will be changed below the 0.1 second threshold where they are
likely to be ABLE to notice.

The one place I can recall where altering system timings has made a
noticeable difference in performance for certain classes of parallel
tasks is Josip Loncaric's tcp retuning, and I believe that he worked
quite hard at that for a long time to get good results.  Even that has a
price -- the tunings that he makes (again, IIRC, don't shoot me if I'm
wrong Josip:-) aren't really appropriate for use on a WAN as some of the
things that slow TCP down are the ones that make it robust and reliable
across the routing perils of the open internet.

   rgb


-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From leunen.d at fsagx.ac.be  Tue Apr 16 08:39:25 2002
From: leunen.d at fsagx.ac.be (David Leunen)
Date: Tue, 16 Apr 2002 17:39:25 +0200
Subject: Cannot find -lpvfs
Message-ID: <3CBC45AD.4060600@fsagx.ac.be>

Hi all,

We've installed Scyld Beowulf 27bz-8 on our cluster, but we cannot get
the mpich examples to link. Here is the error we get:

/usr/bin/ld: cannot find -lpvfs

This error is thrown on every attempt to link an MPI program.
Any idea?


Have a good day.

David


From hahn at physics.mcmaster.ca  Tue Apr 16 10:35:54 2002
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Tue, 16 Apr 2002 13:35:54 -0400 (EDT)
Subject: decreasing #define HZ in the linux kernel for CPU/memory bound
 apps ?
In-Reply-To: <11EB52F86530894F98FFB1E21F9972547EF918@aeoexc01.emea.cpqcorp.net>
Message-ID: <Pine.LNX.4.33.0204161327190.8190-100000@coffee.psychology.mcmaster.ca>

> Would it be interesting to decrease the #define HZ in the linux kernel
> for CPU/Memory bound computational farms  ?

I'm guessing you're unaware that compute-bound processes
actually get multiple 10ms slices (200ms or so, as I recall,
but I'm remembering a discussion from 2.3.x days.  Ingo's new
scheduler probably preserves this limit.)

> I mean we very often have only one running process eating 99% of 
> the CPU, but we (in fact I) don't know if we lose time doing context
> switches ....

think of the numbers a bit: it's basically impossible to buy
a <1 GHz processor today, so you're getting at O(100M) instrs/HZ.
if you're cache-friendly, you'll probably have >1 instr/cycle,
so scale the number appropriately.  perhaps you're worried about
cache pollution?  the kernel's footprint is fairly small, probably
<4K or so for timer-irq-scheduler-nopreempt.  since a null syscall
is ~1 us or ~1000 instrs, and the work is about the same, I really
don't think there's anything to worry about.
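
a minimal sketch of that arithmetic, for anyone who wants to plug in their
own numbers.  the 1 GHz clock, HZ values and ~1000 instructions of work per
tick are the rough figures from above; the function name and the ipc
parameter are made up for this example, and cache effects are ignored.

```python
def tick_overhead(clock_hz, hz, instrs_per_tick, ipc=1.0):
    """Fraction of CPU cycles spent servicing the timer tick.

    clock_hz        -- CPU clock rate in cycles per second
    hz              -- kernel timer frequency (the HZ constant)
    instrs_per_tick -- instructions executed per timer interrupt (a guess)
    ipc             -- instructions retired per cycle (assumed, not measured)
    """
    cycles_per_tick = clock_hz / hz
    overhead_cycles = instrs_per_tick / ipc
    return overhead_cycles / cycles_per_tick


for hz in (100, 1024):
    share = tick_overhead(clock_hz=1e9, hz=hz, instrs_per_tick=1000)
    print(f"HZ={hz:4d}: about {share:.3%} of cycles go to the timer tick")
```

with these assumptions a 1 GHz cpu at HZ=100 has 10 million cycles per
tick, so the tick itself costs on the order of a hundredth of a percent.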

there are people who run HZ=1024 or higher on ia32; I don't personally
think they know what the heck they're doing, but they like it, and 
don't report any serious problems.


From kus at free.net  Tue Apr 16 12:02:59 2002
From: kus at free.net (Mikhail Kuzminsky)
Date: Tue, 16 Apr 2002 23:02:59 +0400 (MSD)
Subject: again OpenPBS vs SGE
Message-ID: <200204161902.XAA04166@nocserv.free.net>

   
    I'm in the process of choosing a *free* batch queue system
for new Linux cluster(s). We are using GNQS on many SMP
systems and we are happy with it, but GNQS is no longer being
developed.
    The real competition is, IMHO, between OpenPBS and
Codine/SGE (which was highly praised earlier on this list, in particular
by Chris Black).

Some comparisons are presented by Omar Hassaine from Sun
(www.sun.com/products-n-solutions/edu/hpc/presentations/june01/
omar_hassaine.pdf). IMHO, some of these assessments are inconsistent
with some of Chris Black's statements. So, below I'll try to summarize
briefly a few advantages and disadvantages of OpenPBS and SGE
(the ones that look important to me). I would very much appreciate
any remarks, opinions, etc. (especially where I'm wrong).

I. Some PBS minuses

1) The main one is unstable daemons.
2) PBS doesn't support user checkpoint migration.
For example, I run a Gaussian98 job (which creates its own
checkpoint file) on one node, and there is now a subsequent
G98 job which may run on another (free) node, but this other
node doesn't have the necessary G98 checkpoint file.
3) No interface to Globus Grid in OpenPBS
(as opposed to PBSPro).

II. Some PBS pluses

- it appears to be the most popular choice for Linux clusters
- it's possible to receive a job on one node and send it to run
  on a node of *another cluster*

III. Some SGE minuses

1) Does not support "multiclustering"
2) The scheduling algorithms are restricted to a single
   default (this is inconsistent w/Chris Black's message, as
   I understand it)

IV. Some SGE pluses

1) Reliable work
2) Globus Grid is integrated (?? is it correct ?)
3) There is support of job migration


I'm not concerned about the absence of SGE source today (I believe it'll
be available in the near future).

Thanks in advance for your help,
Mikhail Kuzminsky
Zelinsky Institute of Organic Chemistry
Moscow


From lindahl at keyresearch.com  Tue Apr 16 08:31:46 2002
From: lindahl at keyresearch.com (Greg Lindahl)
Date: Tue, 16 Apr 2002 08:31:46 -0700
Subject: decreasing #define HZ in the linux kernel for CPU/memory bound apps ?
In-Reply-To: <11EB52F86530894F98FFB1E21F9972547EF918@aeoexc01.emea.cpqcorp.net>; from Sebastien.Cabaniols@compaq.com on Tue, Apr 16, 2002 at 12:57:18PM +0200
References: <11EB52F86530894F98FFB1E21F9972547EF918@aeoexc01.emea.cpqcorp.net>
Message-ID: <20020416083146.B2918@wumpus.attbi.com>

On Tue, Apr 16, 2002 at 12:57:18PM +0200, Cabaniols, Sebastien wrote:

> Would it be interesting to decrease the #define HZ in the linux kernel
> for CPU/Memory bound computational farms  ?
> (I just posted the same question to lkml)

All pre-compiled user programs would then have the wrong HZ. So
/bin/time wouldn't work anymore.

As for "what HZ would be a good value?", Alpha has always used 1000,
and it isn't a significant performance hit. But x86 started life on
much slower machines, and now we're stuck with 100, unless you want to
rebuild ALL your packages.
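
To make the failure mode concrete, here is a small sketch of what happens
when a program converts clock ticks to seconds with the wrong tick rate.
It assumes a system where the rate can be queried at run time through
sysconf (SC_CLK_TCK); a binary that has HZ compiled in as a constant does
not get that chance. The function name is invented for this example.

```python
import os


def ticks_to_seconds(ticks, clk_tck=None):
    """Convert a clock-tick count to seconds.

    If clk_tck is not given, ask the running system for its tick rate
    instead of relying on a value fixed at build time.
    """
    if clk_tck is None:
        clk_tck = os.sysconf("SC_CLK_TCK")
    return ticks / clk_tck


cpu_ticks = 250  # hypothetical tick count reported for some process

# The same count read with two different assumed tick rates:
print("assuming 100 ticks/s :", ticks_to_seconds(cpu_ticks, 100), "s")
print("assuming 1000 ticks/s:", ticks_to_seconds(cpu_ticks, 1000), "s")
print("asking the system    :", ticks_to_seconds(cpu_ticks), "s")
```

A program built for 100 ticks per second and run where ticks arrive at
1000 per second would report times off by a factor of ten.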

I suspect IA64 uses 100 for compatibility reasons. I wonder how the
x86 emulator on AlphaLinux got around this... hm...

greg


From lindahl at keyresearch.com  Tue Apr 16 08:39:27 2002
From: lindahl at keyresearch.com (Greg Lindahl)
Date: Tue, 16 Apr 2002 08:39:27 -0700
Subject: very high bandwidth, low latency manner?
In-Reply-To: <5.1.0.14.0.20020416122156.05491530@62.70.89.10>; from Hakon.Bugge@scali.com on Tue, Apr 16, 2002 at 12:24:37PM +0200
References: <Pine.LNX.4.30.0204121919130.16089-100000@elin.scali.no> <3CBAA021.DB753C6F@markus-fischer.de> <5.1.0.14.0.20020416122156.05491530@62.70.89.10>
Message-ID: <20020416083927.C2918@wumpus.attbi.com>

On Tue, Apr 16, 2002 at 12:24:37PM +0200, Håkon Bugge wrote:

> Also, lack of page coloring will contribute to 
> different execution times, even for a sequential program.

Andrea's kernel patches now have page coloring in them. The code has
lived a tortured life, originally written by the Real World Computing
guys, rewritten by me, rewritten by Jason Papadopoulos at UMd, and
then by Andrea. 3 continents.

> I disagree for two reasons; 
> first, you imply that venture capitalists are naive (and to some extent 
> stupid).

That's what the local Silicon Valley VC tell me about VC. I guess
non-Silicon Valley VC are smarter, then ;-)

> It is to me strange, that no Pallas PMB 
> benchmark results ever has been published for a reasonable sized system 
> based on alternative interconnect technologies. To quote Lord Kelvin: "If 
> you haven't measured it, you don't know what you're talking about".

Maybe that's because other people are measuring their applications,
and not yet another synthetic benchmark? All-to-all isn't interesting
to me. I have plenty of bisection measurements, though, as that's how
I debug Myrinet. Typical variations are around 2%, by the way.

Lord Kelvin engaged in a 10 year flamewar in the Letters of the RAS
against people who thought the Sun was powered by nuclear fusion. He
believed that it was only 10 million years old, and was powered by
gravitational collapse. His mistake was ignoring geological evidence
because he didn't understand it. He probably wrote that quote during
that flamewar. It didn't make him right.

greg


From raysonlogin at yahoo.com  Tue Apr 16 12:39:39 2002
From: raysonlogin at yahoo.com (Rayson Ho)
Date: Tue, 16 Apr 2002 12:39:39 -0700 (PDT)
Subject: again OpenPBS vs SGE
In-Reply-To: <200204161902.XAA04166@nocserv.free.net>
Message-ID: <20020416193939.95169.qmail@web11403.mail.yahoo.com>

If you are looking for _free_ batch systems, you should choose SGE.

--- Mikhail Kuzminsky <kus at free.net> wrote:
> III. Some SGE minuses 
> 1) Do not support "multiclustering"

I believe you can set up multiple "SGE_CELL"s to partition your cluster
(I've never played with that myself).

Or you can use Globus, or another 3rd-party scheduler, on top of SGE.

> 2) The schedule algorithms are restricted to only one
>    default (this is inconsistent w/Chris Black message, as
>    I understand)

Are you talking about SGE 5.2.x?

Chris Black must be talking about SGE 5.3, which has several nice
advanced scheduler features:

http://www.hardi.se/products/literature/sun_grid_engine.pdf

> 
> IV. Some SGE pluses
> 
> 1) Reliable work

Ron Chen has been talking about the new shadow master on the SGE
mailing list, which he said will improve fault tolerance, but I've
not heard anything more yet...

> 2) Globus Grid is integrated (?? is it correct ?)

correct.

> 3) There is support of job migration
> 

Also, you may want to look at job arrays, which are not available in
PBS. (The other batch system which has job arrays is LSF.)

You can download the SGE source from:

http://gridengine.sunsource.net

Rayson


__________________________________________________
Do You Yahoo!?
Yahoo! Tax Center - online filing with TurboTax
http://taxes.yahoo.com/


From josip at icase.edu  Tue Apr 16 12:58:40 2002
From: josip at icase.edu (Josip Loncaric)
Date: Tue, 16 Apr 2002 15:58:40 -0400
Subject: decreasing #define HZ in the linux kernel for CPU/memory bound apps 
 ?
References: <11EB52F86530894F98FFB1E21F9972547EF918@aeoexc01.emea.cpqcorp.net> <20020416083146.B2918@wumpus.attbi.com>
Message-ID: <3CBC8270.814F9387@icase.edu>

Greg Lindahl wrote:
> 
> As for "what HZ would be a good value?", Alpha has always used 1000,
> and it isn't a significant performance hit. But x86 started life on
> much slower machines, and now we're stuck with 100, unless you want to
> rebuild ALL your packages.
> 
> I suspect IA64 uses 100 for compatibility reasons. I wonder how the
> x86 emulator on AlphaLinux got around this... hm...

A minor correction: HZ=1024 on Alphas and on ia64 (elsewhere HZ=100).

HZ=1024 helps, e.g. it prevents certain kinds of timer-resolved TCP
stalls in kernel 2.2 on Alphas.  However, recompiling user programs
which were built with HZ=100 would be a pain... and one might uncover
new problems with the i386 hardware which has not been tested much with
HZ=1024.
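
To make the two HZ values concrete, here is a minimal sketch of the
arithmetic: the length of one timer tick, and how a short timeout gets
rounded up to whole ticks. The helper names and the 2 ms example are
assumptions chosen for illustration; this is a simplified model, not
kernel code.

```python
import math

def tick_period_ms(hz):
    """Length of one timer tick in milliseconds for a given HZ."""
    return 1000.0 / hz

def ticks_for_delay(delay_ms, hz):
    """Whole ticks needed to cover delay_ms (rounded up to a tick boundary)."""
    return math.ceil(delay_ms * hz / 1000.0)

def effective_delay_ms(delay_ms, hz):
    """Delay actually obtained once the request is rounded up to whole ticks."""
    return ticks_for_delay(delay_ms, hz) * tick_period_ms(hz)

if __name__ == "__main__":
    # Hypothetical 2 ms timeout, evaluated for the two HZ values discussed.
    for hz in (100, 1024):
        print(hz, tick_period_ms(hz), effective_delay_ms(2.0, hz))
```

In this model a 2 ms request costs a full 10 ms tick at HZ=100 but just
under 3 ms at HZ=1024, which is the kind of resolution difference that
matters for timer-driven retransmits.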

Sincerely,
Josip

-- 
Dr. Josip Loncaric, Research Fellow               mailto:josip at icase.edu
ICASE, Mail Stop 132C           PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center             mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134


From ting at fai.fujitsu.com  Tue Apr 16 15:08:59 2002
From: ting at fai.fujitsu.com (Ting)
Date: Tue, 16 Apr 2002 15:08:59 -0700
Subject: Parallel BLAST - help
In-Reply-To: <3450CC8673CFD411A24700105A618BD6267DC4@911TURBO>
Message-ID: <BPEGIDOPLCLDAMMNOCBMOEENOLAA.ting@fai.fujitsu.com>

Hello, All,

  I have a three-node Beowulf cluster with an MPI environment up and
  running, and I have downloaded FASTA from NCBI onto the master node.
  I successfully wrote code to split the data, but unfortunately I have
  no working code to get the data back from the nodes to the host
  (master). :-(

  Can anyone suggest an approach, or a web site where I can find
  runnable code to use?  It would help me a lot.

  Thank you very much.

Ting

-----Original Message-----
From: Steve Gaudet
Sent: Monday, April 15, 2002 11:12 AM
To: 'William R. Pearson'; beowulf at beowulf.org
Subject: RE: Parallel BLAST


> -----Original Message-----
> From: William R. Pearson
> Sent: Sunday, April 14, 2002 10:32 PM
> To: beowulf at beowulf.org
> Subject: Parallel BLAST
>
>
>
> > Why is it that BLAST is not available for MPI/PVM?  I would think
> > clusters would be the prefect host for such an application.
> > Is it there is no need because BLAST is already so fast and
> > no one wants to break the database out onto node-resident disks?
> > Or is it that BLAST is kept running on single processor or
> shared memory
> > machines BLAST so that the DB is always in memory ready to
> roll without
> > loading and doing the same for a cluster is not worth it
> > because the same trick is difficult to do on a node given
> the current
> > way clusters are built?  I assume the same is true for FASTA?
>
> I suspect that BLAST is not available for MPI/PVM because (1) it is
> too fast, and (2) there is not much demand for it.
>
> 95% of the time, BLAST is almost an in-memory grep (the other 5% of
> the time it is working on the things it is looking for).  Sequence
> comparison is embarrassingly parallel, and very easily threaded.
> Distributing the sequence databases and collecting results has more
> overhead (there probably aren't many distributed grep programs
> either).  FASTA is 5 - 10X slower than BLAST, and Smith-Waterman is
> another 5-20X slower than FASTA.  Here, the communications overhead is
> low, and distributed systems work OK for FASTA, and great for
> Smith-Waterman (where the overhead fraction is very small).
>
> Of course, it is a lot easier to compile a threaded program, and just
> run it, than it is to install and configure the MPI or PVM environment
> and the programs to run in it.  Bioinformatics software is often run
> by computer savvy biologists, not high-performance computing folks,
> and not having to install and configure PVM/MPI is a big advantage.
> The NCBI probably does not make a PVM/MPI parallel BLAST because there
> is very little demand for it, and it does not meet their computational
> needs.
--------------

There's also a commercial version from Turbogenomics.

http://www.turbogenomics.com

Offering:

1) Ready to go, plug-n-play solution for parallel BLAST
2) Expertise and 20+ years of experience in parallel computing
3) Dynamic database splitting feature to take advantage of computers that
have less memory than the size of the database
4) Smart load balancing - achieve linear to superlinear speedup
5) No modification made to the NCBI BLAST algorithm to ensure identical
results with the non-parallel version
6) Easy drop-in update whenever NCBI releases newer versions of their
algorithm
7) Excellent support
8) 30-day money-back guarantee


Cheers,


Steve Gaudet
Linux Solutions Engineer
   .....
  <(???)>

===================================================================
| Turbotek Computer Corp.    tel:603-666-3062 ext. 21             |
| 8025 South Willow St.      fax:603-666-4519                     |
| Building 2, Unit 105       toll free:800-573-5393               |
| Manchester, NH 03103       e-mail:sgaudet at turbotekcomputer.com  |
|                            web: http://www.turbotekcomputer.com |
===================================================================


_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf


From aby_sinha at yahoo.com  Tue Apr 16 18:18:16 2002
From: aby_sinha at yahoo.com (Abhishek sinha)
Date: Tue, 16 Apr 2002 18:18:16 -0700
Subject: Dual Xeon Clusters
Message-ID: <3CBCCD58.1080104@yahoo.com>

Hi list members,

I am building a 4-node dual-Xeon cluster. My understanding of
hyperthreading leads me to the conclusion that whether it helps depends
largely on the code; in many cases performance can actually become
worse than without hyperthreading. My questions are: Are there any
benchmarks available for Xeon processors in hyperthreaded mode? Will
the normal benchmarks that we use work on these systems, and would they
give a fair picture of the power of the Xeon? If not, how else can I
measure the performance of Xeon processors in a clustered environment?

I am using 2.2 GHz Xeon processors on an E7500 chipset.

Thanks in advance to all

Abhishek


From ron_chen_123 at yahoo.com  Tue Apr 16 20:58:31 2002
From: ron_chen_123 at yahoo.com (Ron Chen)
Date: Tue, 16 Apr 2002 20:58:31 -0700 (PDT)
Subject: Data management on Beowulf Clusters?
Message-ID: <20020417035831.90871.qmail@web14703.mail.yahoo.com>

Hi,

Is data management a real issue on Beowulf clusters?
Does anyone have problems moving data from one node to
another, or find rcp insufficient?

I recently discovered that the Globus project has
released the Globus ToolKit 2.0, which has some
components for data grids. Here are some of their nice
features that we may be able to take advantage of:

1) do data I/O accounting.

2) we don't depend on a shared filesystem anymore.

3) better security -- GridFTP is integrated with KRB5.

4) better performance in data transfer.

I am wondering if anyone knows if we can take
advantage of GridFTP and other components to solve
data management problems on beowulf clusters?

Any experience is welcome!

And lastly, Globus is an open-source, non-profit
project.

-Ron



From sp at scali.com  Wed Apr 17 00:46:44 2002
From: sp at scali.com (Steffen Persvold)
Date: Wed, 17 Apr 2002 09:46:44 +0200 (CEST)
Subject: Dual Xeon Clusters
In-Reply-To: <3CBCCD58.1080104@yahoo.com>
Message-ID: <Pine.LNX.4.30.0204170933540.31755-100000@elin.scali.no>

On Tue, 16 Apr 2002, Abhishek sinha wrote:

> Hi list members,
>
> I am building a dual Xeon 4-node cluster; My understanding of
> hyperthreading leaves me to conclusion that it depends largely on the
> code to benefit from it. Otherwise in many cases the performance can
> become worse than before using a hyperthreaded Xeon processors. My
> question is ; Are there any benchmarks available for the benchmarking of
> the Xeon processors in hyperthreaded mode ? Will the normal benchmarks
> that we use ..work on these systems and would it give a fair glance at
> the power of the Xeon .? If not what other way can i find the
> performance of Xeon processors in a clustered env.
>
> I am using 2.2 Ghz Xeon processors on an E7500 chipset.
>
Hi,

First of all you will have to use a 2.4.18 kernel with these E7500
motherboards. Second, if you take a look at the linux-kernel mailing list
you will find a patch (originally developed by Ingo Molnar, enhanced a
bit by me) that will do some IRQ balancing on Xeon chipsets (with the
stock 2.4.18 kernel, i860 and E7500 chipsets are only able to handle
interrupts on CPU0). I don't know if this patch has made it into the
2.4.19-pre kernels yet, but you can check them out too.

Finally, I have some bad news about HT. I haven't been able to get it to
run stably enough with 2.4.18 (haven't tested 2.4.19-pre). The thing is
that in the beginning all works fine, but after a random amount of time
things start to slow down. Suddenly you find 'top' using
50% system time, which is not normal. Turning off HT in the BIOS solves
this.

As a side note I can tell you that the PCI architecture on this chipset is
_much_ better than on i860 and you can expect it to perform well with
high speed interconnects (Myrinet, SCI, GBE).

Regards,
-- 
  Steffen Persvold   | Scalable Linux Systems |   Try out the world's best
 mailto:sp at scali.com |  http://www.scali.com  | performing MPI implementation:
Tel: (+47) 2262 8950 |   Olaf Helsets vei 6   |      - ScaMPI 1.13.8 -
Fax: (+47) 2262 8951 |   N0621 Oslo, NORWAY   | >320MBytes/s and <4uS latency


From kus at free.net  Wed Apr 17 01:16:19 2002
From: kus at free.net (Mikhail Kuzminsky)
Date: Wed, 17 Apr 2002 12:16:19 +0400 (MSD)
Subject: again OpenPBS vs SGE 
Message-ID: <200204170816.MAA08490@nocserv.free.net>

According to Rayson Ho
> From raysonlogin at yahoo.com Tue Apr 16 23:39:42 2002
> Date: Tue, 16 Apr 2002 12:39:39 -0700 (PDT)
> From: Rayson Ho <raysonlogin at yahoo.com>
> Subject: Re: again OpenPBS vs SGE
> To: Mikhail Kuzminsky <kus at free.net>, beowulf at beowulf.org
> ...
> 
> > 2) The schedule algorithms are restricted to only one
> >    default (this is inconsistent w/Chris Black message, as
> >    I understand)
> 
> You talking about SGE 5.2.x?
  Yes, I wrote about 5.2.3.1, which is the latest "production" version
currently available.

> Chris Black must be talking about SGE 5.3, which has several advanced
> nice scheduler features:
> 
> http://www.hardi.se/products/literature/sun_grid_engine.pdf
> 

Mikhail Kuzminsky
Zelinsky Institute of Organic Chemistry
Moscow


From shewa at inel.gov  Wed Apr 17 06:45:05 2002
From: shewa at inel.gov (Andrew Shewmaker)
Date: Wed, 17 Apr 2002 07:45:05 -0600
Subject: again OpenPBS vs SGE
References: <20020416193939.95169.qmail@web11403.mail.yahoo.com>
Message-ID: <3CBD7C61.4010108@inel.gov>

Rayson Ho wrote:

>You can download the SGE source from:
>
>http://gridengine.sunsource.net
>
I believe you must join the project (at least as an observer), or else
you won't see the source in the download section.

Andrew


From timm at fnal.gov  Wed Apr 17 06:48:05 2002
From: timm at fnal.gov (Steven Timm)
Date: Wed, 17 Apr 2002 08:48:05 -0500 (CDT)
Subject: Dual Xeon Clusters
In-Reply-To: <3CBCCD58.1080104@yahoo.com>
Message-ID: <Pine.LNX.4.31.0204170839210.9849-100000@snowball.fnal.gov>

> Hi list members,
>
> I am building a dual Xeon 4-node cluster; My understanding of
> hyperthreading leaves me to conclusion that it depends largely on the
> code to benefit from it. Otherwise in many cases the performance can
> become worse than before using a hyperthreaded Xeon processors. My
> question is ; Are there any benchmarks available for the benchmarking of
> the Xeon processors in hyperthreaded mode ? Will the normal benchmarks
> that we use ..work on these systems and would it give a fair glance at
> the power of the Xeon .? If not what other way can i find the
> performance of Xeon processors in a clustered env.
>
> I am using 2.2 Ghz Xeon processors on an E7500 chipset.
>
> Thanks in advance to all
>
> Abhishek


The only way we found to do it in hyperthreading mode under Linux was
just to keep on starting two instances of the process until we got one
started on either 0 or 1 and the other on 2 or 3.   It would be
interesting to see a comparison of the SPEC rate benchmarks on the same
machine with hyperthreading disabled and two processors (which is what
we finally did) versus hyperthreading enabled.

Steve Timm



From eugen at leitl.org  Wed Apr 17 08:14:39 2002
From: eugen at leitl.org (Eugen Leitl)
Date: Wed, 17 Apr 2002 17:14:39 +0200 (CEST)
Subject: /. US DOE gets a $24.5 Million Linux Supercomputer 
Message-ID: <Pine.LNX.4.33.0204171703580.10977-100000@hydrogen.leitl.org>

http://slashdot.org/articles/02/04/17/1324227.shtml?tid=162

An anonymous reader wrote in to say "Pacific Northwest National Laboratory 
(US DOE) signed a $24.5 million dollar contract with HP for a Linux 
supercomputer. This will be one of the top ten fastest computers in the 
world. Some cool features: 8.3 Trillion Floating Point Operations per 
Second, 1.8 Terabytes of RAM, 170 Terabytes of disk, (including a 53 TB 
SAN), and 1400 Intel McKinley and Madison Processors. Nice quote: 'Today's 
announcement shows how HP has worked to help accelerate the shift from 
proprietary platforms to open architectures, which provide increased 
scalability, speed and functionality at a lower cost,' said Rich DeMillo, 
vice president and chief technology officer at HP. Read Details of the 
announcement here or here."


From robl at mcs.anl.gov  Wed Apr 17 08:16:05 2002
From: robl at mcs.anl.gov (Robert Latham)
Date: Wed, 17 Apr 2002 10:16:05 -0500
Subject: Cannot find -lpvfs
In-Reply-To: <3CBC45AD.4060600@fsagx.ac.be>
References: <3CBC45AD.4060600@fsagx.ac.be>
Message-ID: <20020417151605.GG5243@mcs.anl.gov>

On Tue, Apr 16, 2002 at 05:39:25PM +0200, David Leunen wrote:
 
> We've installed Scyld Beowulf 27bz-8 on our cluster. But we cannot make 
> the mpich examples to link the .o files. Here is the error we get:
> 
> /usr/bin/ld: cannot find -lpvfs
> 
> This error is thrown on every try to link an mpi program...
> any idea?

-lpvfs ... that's the PVFS library.  check if you have the pvfs-devel
rpm installed.  

If it *is* installed, your mpicc needs to specify where to find it in
LDFLAGS.  The scyld guys are quite good at integrating all the
software pieces though, so i bet this is not the case :>

==rob

-- 
Rob Latham
                                             A215 0178 EA2D B059 8CDF  
                                             B29D F333 664A 4280 315B


From raysonlogin at yahoo.com  Wed Apr 17 08:24:48 2002
From: raysonlogin at yahoo.com (Rayson Ho)
Date: Wed, 17 Apr 2002 08:24:48 -0700 (PDT)
Subject: again OpenPBS vs SGE
In-Reply-To: <3CBD7C61.4010108@inel.gov>
Message-ID: <20020417152448.49905.qmail@web11406.mail.yahoo.com>

I am an observer of the project, but I have never needed to log on to the
server to download the source via cvs.

http://gridengine.sunsource.net/servlets/ProjectSource

But I think you need to log on to download the source code archives.

Rayson


--- Andrew Shewmaker <shewa at inel.gov> wrote:
> Rayson Ho wrote:
> 
> >You can download the SGE source from:
> >
> >http://gridengine.sunsource.net
> >
> I believe you must join the product (at least as an observer) or else
> you won't see the source
> in the download section.
> 
> Andrew
> 


From hahn at physics.mcmaster.ca  Wed Apr 17 10:14:15 2002
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Wed, 17 Apr 2002 13:14:15 -0400 (EDT)
Subject: Dual Xeon Clusters
In-Reply-To: <Pine.LNX.4.31.0204170839210.9849-100000@snowball.fnal.gov>
Message-ID: <Pine.LNX.4.33.0204171246130.11865-100000@coffee.psychology.mcmaster.ca>

> The only way we found to do it in hyperthreading mode under Linux was
> just to keep on starting two instances of the process until we got one
> started on either 0 or 1 and the other on 2 or 3.   It would be

there are a number of cpu-affinity patches for 2.4 and 2.5
(I doubt anyone has bothered with 2.2)
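
For readers who want to see what explicit placement looks like, here is
a minimal sketch. It uses Python's `os.sched_setaffinity`, a present-day
Linux-only interface, and not the 2.4/2.5 patches mentioned above. The
sibling layout (logical CPUs 0/1 on one physical processor, 2/3 on the
other) is taken from the quoted message and is an assumption; the helper
names are hypothetical.

```python
import os

def one_cpu_per_package(packages):
    """Pick the lowest-numbered logical CPU from each physical package."""
    return [min(siblings) for siblings in packages]

def pin_to_cpu(cpu):
    """Pin the calling process to one logical CPU; return True on success."""
    if not hasattr(os, "sched_setaffinity"):
        return False  # interface not available on this platform
    try:
        os.sched_setaffinity(0, {cpu})  # pid 0 means the calling process
        return True
    except OSError:
        return False  # e.g. the requested CPU does not exist here

# Assumed layout: logical CPUs 0 and 1 share one physical processor,
# logical CPUs 2 and 3 share the other.
PACKAGES = [{0, 1}, {2, 3}]

if __name__ == "__main__":
    targets = one_cpu_per_package(PACKAGES)
    print(targets)
    # Each of the two benchmark processes would call pin_to_cpu() with
    # its own entry from targets before starting work.
```

Pinning one process to each physical processor avoids restarting the
job until the scheduler happens to place the two instances on different
packages.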


From leandro at ep.petrobras.com.br  Wed Apr 17 10:17:57 2002
From: leandro at ep.petrobras.com.br (Leandro Tavares Carneiro)
Date: 17 Apr 2002 14:17:57 -0300
Subject: HIGH MEM suport for up to 64GB
Message-ID: <1019063877.1795.13.camel@linux60>

Hi everyone,

	I am writing to ask whether anyone has tested or used a machine
with more than 4GB of RAM, or paging in virtual memory, on Intel machines.
	We have a Linux Beowulf cluster, and one of our developers has asked
us how much memory a process can allocate. In the tests we
have made, we cannot allocate much more than 3GB, using a dual PIII
with 1GB of RAM and 12GB of swap area for testing.
	We can run 2 processes each allocating roughly 3GB, but one process alone
cannot get past this limit.
	We are using Red Hat Linux 7.2, kernel 2.4.9-21, recompiled with High
Mem support.
	I have run the same test application on an Itanium machine with 1GB
of RAM and 16GB of swap area, and it passed. The application can
allocate more than 5GB of memory, using swap. On this machine we are
using TurboLinux 7, with kernel version 2.4.4-010508-18smp.

Thanks in advance for the help,

Best regards,
-- 
Leandro Tavares Carneiro
Analista de Suporte
EP-CORP/TIDT/INFI
Telefone: 2534-1427


From leandro at ep.petrobras.com.br  Wed Apr 17 10:30:54 2002
From: leandro at ep.petrobras.com.br (Leandro Tavares Carneiro)
Date: 17 Apr 2002 14:30:54 -0300
Subject: HIGH MEM suport for up to 64GB
Message-ID: <1019064654.1795.18.camel@linux60>

Hi,

	Has anyone tested or used a machine with more than 4GB of RAM, or paging
in virtual memory, on Intel machines?
	We have a Linux Beowulf cluster, and one of our developers has asked
us how much memory a process can allocate. In the tests we
have made, we cannot allocate much more than 3GB, using a dual PIII
with 1GB of RAM and 12GB of swap area for testing.
	We can run 2 processes each allocating roughly 3GB, but one process alone
cannot get past this limit.
	We are using Red Hat Linux 7.2, kernel 2.4.9-21, recompiled with High
Mem support.
	I have run the same test application on an Itanium machine with 1GB
of RAM and 16GB of swap area, and it passed. The application can
allocate more than 5GB of memory, using swap. On this machine we are
using TurboLinux 7, with kernel version 2.4.4-010508-18smp.
	If this works, we can improve our applications.

Thanks in advance for the help, and sorry about my bad English.

Best regards,
-- 
Leandro Tavares Carneiro
Analista de Suporte
EP-CORP/TIDT/INFI
Telefone: 2534-1427


From sp at scali.com  Wed Apr 17 11:44:10 2002
From: sp at scali.com (Steffen Persvold)
Date: Wed, 17 Apr 2002 20:44:10 +0200 (CEST)
Subject: HIGH MEM suport for up to 64GB
In-Reply-To: <1019064654.1795.18.camel@linux60>
Message-ID: <Pine.LNX.4.30.0204172036470.31755-100000@elin.scali.no>

On 17 Apr 2002, Leandro Tavares Carneiro wrote:

> Hi,
>
> 	Anyone have tesed or used an machine with more than 4GB of RAM or paging
> in virtual memory on intel machines?
> 	He have an linux beowulf cluster and one of ours developers have asked
> us for how much memory an process can allocate to use. In the tests we
> have made, we cannot allocate much more than 3GB, using an dual PIII
> with 1GB of ram and 12Gb of swap area for testing.
> 	We can use 2 process alocating more or less 3Gb, but one process alone
> canot pass this test.
> 	We have using redhat linux 7.2, kernel 2.4.9-21, recompiled with High
> Mem suport.
> 	I have tested the same test aplication on an Itanium machine, with 1GB
> of ram and 16Gb of swap area, and they passed. The aplication can
> alocate more than 5GB of memory, using swap. In this machine, we are
> using turbolinux 7, with kernel version 2.4.4-010508-18smp.
> 	If this works, we can improve our applications.
>

There is simply no way you can make a 32 bit machine address more than 4GB
of memory in a single application simply because memory pointers are
only 32 bit (2^32 = 4GB). The reason why you can only address 3GB on Linux
is that normally 1GB of the virtual memory area is reserved for the kernel
(can be trimmed down to 512MB, which gives you 3.5GB accessible from
userspace). Sure, you can have several applications each using 3GB if you
have the memory for it (either swap or real) providing that you use the
64GB option in Linux. Huge memory requirement from applications is one of
the reasons people choose a 64bit platform (ppc, sparc, s390, alpha and
ia64).
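
The limits above are plain arithmetic, sketched below. The function
name is hypothetical. The 36-bit figure is the physical address width
used by PAE on 32-bit x86, which is what the 64GB option relies on; the
per-process limit is still set by the 32-bit pointer.

```python
GIB = 2**30  # bytes in one gigabyte (binary)

def user_address_space(pointer_bits, kernel_reserved_bytes):
    """Virtual address space left to one process after the kernel's share."""
    return 2**pointer_bits - kernel_reserved_bytes

if __name__ == "__main__":
    # Default split: 1GB of the 4GB virtual space is reserved for the kernel.
    print(user_address_space(32, 1 * GIB) / GIB)   # 3.0
    # Trimmed split: 512MB reserved for the kernel.
    print(user_address_space(32, GIB // 2) / GIB)  # 3.5
    # Physical memory addressable with 36-bit addresses (the 64GB option).
    print(2**36 // GIB)                            # 64
```

So several 3GB processes can coexist within 64GB of physical memory
plus swap, but no single 32-bit process can see more than its own
3 to 3.5GB.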

Regards,
-- 
  Steffen Persvold   | Scalable Linux Systems |   Try out the world's best
 mailto:sp at scali.com |  http://www.scali.com  | performing MPI implementation:
Tel: (+47) 2262 8950 |   Olaf Helsets vei 6   |      - ScaMPI 1.13.8 -
Fax: (+47) 2262 8951 |   N0621 Oslo, NORWAY   | >320MBytes/s and <4uS latency


From rbw at ahpcrc.org  Wed Apr 17 12:31:31 2002
From: rbw at ahpcrc.org (Richard Walsh)
Date: Wed, 17 Apr 2002 14:31:31 -0500
Subject: /. US DOE gets a $24.5 Million Linux Supercomputer
Message-ID: <200204171931.g3HJVVs22371@mycroft.ahpcrc.org>

Eugen Leitl wrote:

>http://slashdot.org/articles/02/04/17/1324227.shtml?tid=162

>An anonymous reader wrote in to say "Pacific Northwest National Laboratory 
>(US DOE) signed a $24.5 million dollar contract with HP for a Linux 
>supercomputer. This will be one of the top ten fastest computers in the 
>world. Some cool features: 8.3 Trillion Floating Point Operations per 
>Second, 1.8 Terabytes of RAM, 170 Terabytes of disk, (including a 53 TB 
>SAN), and 1400 Intel McKinley and Madison Processors. Nice quote: 'Today's 
>announcement shows how HP has worked to help accelerate the shift from 
>proprietary platforms to open architectures, which provide increased 
>scalability, speed and functionality at a lower cost,' said Rich DeMillo,
>vice president and chief technology officer at HP. Read Details of the
>announcement here or here."

Mmmm ... working through some numbers ...

8.3 TFLOPS (if they are quoting peak) with 1400 processors 
would mean they are getting chips with 1.5 GHz clocks (peak
performance would be 6 GFLOPS per chip [4 ops per clock]). 

Stream numbers for this 1.5 GHz chip (estimated) would be around
250 MFLOPS for the triad. Using the triad as a baseline for sustained
performance, and relating it back to estimated costs for several other
systems (government purchase price only, no recurring costs), this is
$70 per sustained MFLOPS for the McKinley, or more than the CRAY SV2
($65), EV6 ($55), EV7 ($50), and Pentium 4 ($30).
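
The numbers above can be reproduced with a few lines of arithmetic. This
is only a sketch of the estimate: the function names are hypothetical,
and the inputs are the figures quoted in this message (8.3 TFLOPS peak,
1400 processors, 4 ops per clock, an estimated 250 MFLOPS triad per chip,
and the $24.5 million contract price).

```python
def per_chip_peak_gflops(total_tflops, n_chips):
    """Peak GFLOPS per processor from the quoted system peak."""
    return total_tflops * 1000.0 / n_chips

def implied_clock_ghz(peak_gflops, flops_per_clock):
    """Clock rate implied by the per-chip peak and ops issued per clock."""
    return peak_gflops / flops_per_clock

def dollars_per_sustained_mflops(price, n_chips, mflops_per_chip):
    """Purchase price divided by the aggregate sustained (triad) MFLOPS."""
    return price / (n_chips * mflops_per_chip)

if __name__ == "__main__":
    peak = per_chip_peak_gflops(8.3, 1400)            # just under 6 GFLOPS
    print(round(peak, 2), round(implied_clock_ghz(peak, 4), 2))
    print(dollars_per_sustained_mflops(24.5e6, 1400, 250))
```

The per-chip peak comes out just under 6 GFLOPS, consistent with a
clock of about 1.5 GHz, and the cost figure lands at $70 per sustained
MFLOPS.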

Interesting number ... the high-end IA-64 stuff does not look
cheap when stream triad defines sustained performance. Of course,
blocking for cache will push the sustained number up (maybe a lot,
and on all the systems), but you would think that the QCHEM stuff
they run at PNNL (G98) will be mostly memory bound, and therefore
the stream triad sustained performance is not too far off.

I am not sure this looks like a very good deal. 

rbw


#---------------------------------------------------
# Richard Walsh
# Project Manager, Cluster Computing, Computational
#                  Chemistry and Finance
# netASPx, Inc.
# 1200 Washington Ave. So.
# Minneapolis, MN 55415
# VOX:    612-337-3467
# FAX:    612-337-3400
# EMAIL:  rbw at networkcs.com, richard.walsh at netaspx.com
#
#---------------------------------------------------
# "What you can do, or dream you can, begin it;
#  Boldness has genius, power, and magic in it."
#                                  -Goethe
#---------------------------------------------------
# "Without mystery, there can be no authority."
#                                  -Charles DeGaulle
#---------------------------------------------------
# "Why waste time learning when ignorance is
#  instantaneous?"                 -Thomas Hobbes
#---------------------------------------------------
# "In the chaos of a river thrashing, all that water
#  still has to stand in line."    -Dave Dobbyn
#---------------------------------------------------


From mfischer at mufasa.informatik.uni-mannheim.de  Wed Apr 17 10:33:43 2002
From: mfischer at mufasa.informatik.uni-mannheim.de (Markus Fischer)
Date: Wed, 17 Apr 2002 19:33:43 +0200 (MEST)
Subject: very high bandwidth, low latency manner?
In-Reply-To: <5.1.0.14.0.20020416122156.05491530@62.70.89.10>
Message-ID: <Pine.GSO.3.95.1020417190940.15244B-100000@mufasa.ra.informatik.uni-mannheim.de>

On Tue, 16 Apr 2002, Håkon Bugge wrote:

>1) Performance.
>
>Performance transparency is always goal. Nevertheless, sometimes an 
>implementation will have a performance bug. The two organizations owning 
>the mentioned systems, have both support agreements with Scali. I have 
>checked the support requests, but cannot find any request where your 
>incidents were reported. We find this fact strange if you truly were aiming 
>at achieving good performance. We are happy to look into your application 
>and report findings back to this news group.

I don't think we have a performance bug. We have developed
a real world application using frequent communication and
have tested/run it on multiple systems.

We do not intend to modify our algorithms to try
to get better performance on a particular system.

If people need outside help to get performance on a
particular system, then that platform stops being a target
for me unless I can do the tuning myself, which we
did.
Not all codes are public domain (PD), which makes the
previous point important as well.

>2) Startup time.
>
>You contribute the bad scalability to high startup time and mapping of 
>memory. This is an interesting hypothesis; and can easily be verified by 

No, I said that with larger numbers of nodes (I would like to talk
about >100, but here I mean more than 16) the scalability is limited
(the time spent in communication increases significantly and speedup
values decrease after a certain number of nodes). And yes,
the startup time also increases (which I thought to be caused
by the SCI mechanisms of exporting/mapping memory).

>using a switch when you start the program, and measure the difference 
>between the elapsed time of the application and the time it uses after 
>MPI_Init() has been called. However, the startup time measured on 64-nodes, 
>two processors per node, where all processes have set up mapping to all 
>other processes, is nn second. If this contributes to bad scalability, your 
>application has a very short runtime.

I certainly think that scalability has nothing to do with startup time.
And I just checked my earlier posting on this.
>
>3) SCI ring structure
>
>You state that on a multi user, multi-process environment, it is hard to 
>get deterministic performance numbers. Indeed, that is true. True sharing 
>of resources implies that. Whether the resource is a file-server, a memory 
>controller, or a network component, you will probably always be subject to 
>performance differences. Also, lack of page coloring will contribute to 

I think that when running on a dedicated partition of a cluster,
I would not like to see a significant impact from other applications
when their communication increases, nor would I like to affect
my advisor's application.

>different execution times, even for a sequential program. You further 
>indicate that performance numbers reported f. ex. by Pallas PMB benchmark 
>only can be used for applying for more VC. I disagree for two reasons; 
>first, you imply that venture capitalists are naive (and to some extent 
>stupid). That is not my impression, merely the opposite. Secondly, such 
>numbers are a good example to verify/deny your hypothesis that the SCI ring 
>structure is volatile to traffic generated by other applications. PMB's 
>*multi* option is architected to investigate exactly the problem you 
>mention; Run f. ex. MPI_Alltoall() on N/2 of the machine. Then measure how 
>performance is affected when the other N/2 of the machine is also running 
>Alltoall(). This is the reason we are interested in comparative performance 
>numbers to SCI based systems. It is to me strange, that no Pallas PMB 
>benchmark results ever has been published for a reasonable sized system 
>based on alternative interconnect technologies. To quote Lord Kelvin: "If 
>you haven't measured it, you don't know what you're talking about".
>
>As a bottom line, I would appreciate that initiatives to compare cluster 
>interconnect performance should be appreciated, rather than be scrutinized 
>and be phrased as "only usable to apply for more VC".
>

What, then, is the point of putting marketing statements that do
not hold in general into a .signature?

There is also the public-domain SCI-MPICH, to which, judging from
the papers, the same statement applies.

Markus
>
>H
>At 11:40 AM 4/15/02 +0200, Markus Fischer wrote:
>>Steffen Persvold wrote:
>> >
>> > Now we have price comparisons for the interconnects (SCI,Myrinet and
>> > Quadrics). What about performance ? Does anyone have NAS/PMB numbers for
>> > ~144 node Myrinet/Quadrics clusters (I can provide some numbers from a 132
>> > node Athlon 760MP based SCI cluster, and I guess also a 81 node PIII 
>> ServerWorks
>> > HE-SL based cluster).
>>
>>yes, please.
>>
>>I would like to get/see some numbers.
>>I have run tests with SCI for a non linear diffusion algorithm on a 96 node
>>cluster with 32/33 interface. I thought that the poor
>>scalability was due to the older interface, so I switched to
>>a SCI system with 32 nodes and 64/66 interface.
>>
>>Still, the speedup values were behaving like a dog with more than 8 nodes.
>>
>>Especially, the startup time will reach minutes which is probably due to
>>the exporting and mapping of memory.
>>
>>Yes, the MPI library used was Scampi. Thus, I think the
>>(marketing) numbers you provide
>>below are not relevant except for applying for more VC.
>>
>>Even worse, we noticed, that the SCI ring structure has an impact on the
>>communication pattern/performance of other applications.
>>This means we only got the same execution time if other nodes were
>>idle or did not have communication intensive applications.
>>How will you determine the performance of the algorithm you just invented
>>in such a case ?
>>
>>We then used a 512 node cluster with Myrinet2000. The algorithm scaled
>>very fine up to 512 nodes.
>>
>>Markus
>>
>> >
>> > Regards,
>> > --
>> >   Steffen Persvold   | Scalable Linux Systems |   Try out the world's best
>> >  mailto:sp at scali.com |  http://www.scali.com  | performing MPI 
>> implementation:
>> > Tel: (+47) 2262 8950 |   Olaf Helsets vei 6   |      - ScaMPI 1.13.8 -
>> > Fax: (+47) 2262 8951 |   N0621 Oslo, NORWAY   | >320MBytes/s and <4uS 
>> latency
>> >
>
>--
>Håkon Bugge; VP Product Development; Scali AS;
>mailto:hob at scali.no; http://www.scali.com; fax: +47 22 62 89 51;
>Voice: +47 22 62 89 50; Cellular (Europe+US): +47 924 84 514;
>Visiting Addr: Olaf Helsets vei 6, Bogerud, N-0621 Oslo, Norway;
>Mail Addr:  Scali AS, Postboks 150, Oppsal, N-0619  Oslo, Norway;
>
>


From rocky at atipa.com  Wed Apr 17 13:02:40 2002
From: rocky at atipa.com (Rocky McGaugh)
Date: Wed, 17 Apr 2002 15:02:40 -0500 (CDT)
Subject: /. US DOE gets a $24.5 Million Linux Supercomputer
In-Reply-To: <200204171931.g3HJVVs22371@mycroft.ahpcrc.org>
Message-ID: <Pine.LNX.4.33.0204171456170.5279-100000@rocky.lab.atipa.com>

On Wed, 17 Apr 2002, Richard Walsh wrote:

> 
> Eugene Leitel wrote:
> 
> >http://slashdot.org/articles/02/04/17/1324227.shtml?tid=162
> 
> >An anonymous reader wrote in to say "Pacific Northwest National Laboratory 
> >(US DOE) signed a $24.5 million dollar contract with HP for a Linux 
> >supercomputer. This will be one of the top ten fastest computers in the 
> >world. Some cool features: 8.3 Trillion Floating Point Operations per 
> >Second, 1.8 Terabytes of RAM, 170 Terabytes of disk, (including a 53 TB 
> >SAN), and 1400 Intel McKinley and Madison Processors. Nice quote: 'Todays 
> >announcement shows how HP has worked to help accelerate the shift from 
> >proprietary platforms to open architectures, which provide increased 
> >scalability, speed and functionality at a lower cost,' said Rich DeMillo,
> >vice president and chief technology officer at HP. Read Details of the
> >announcement here or here."
> 
> Mmmm ... working through some numbers ...
> 
> 8.3 TFLOPS (if they are quoting peak) with 1400 processors 
> would mean they are getting chips with 1.5 GHz clocks (peak
> performance would be 6 GFLOPS per chip [4 ops per clock]). 
> 

I'll give ya an even 9.0 Tflops (peak, of course) for $3million. Myrinet 
included. Storage not.

Can't deliver till June though, as we're installing a 7.3 at the moment.

-- 
Rocky McGaugh
Atipa Technologies
rocky at atipatechnologies.com
rmcgaugh at atipa.com
1-785-841-9513 x3110
http://1087800222/
perl -e 'print unpack(u, ".=W=W+F%T:7\!A+F-O;0H`");'


From tim.carlson at pnl.gov  Wed Apr 17 14:16:01 2002
From: tim.carlson at pnl.gov (Tim Carlson)
Date: Wed, 17 Apr 2002 14:16:01 -0700 (PDT)
Subject: /. US DOE gets a $24.5 Million Linux Supercomputer
In-Reply-To: <200204171931.g3HJVVs22371@mycroft.ahpcrc.org>
Message-ID: <Pine.LNX.4.44.0204171405310.10892-100000@roach.emsl.pnl.gov>

On Wed, 17 Apr 2002, Richard Walsh wrote:

> Stream numbers for this 1.5 GHz chip (estimated) would be around
> 250 MFLOPS for the triad. Using the triad as a baseline for performance
> for this and several other systems and relating it back to
> some estimated cost for several other systems (government purchase
> price only, no recurring costs) this is $70 per MFLOPS sustained
> for the Mckinley (again using triad) ... or more than the CRAY SV2
> ($65), EV6($55), EV7 ($50), Pentium 4 ($30).

I don't think you can calculate the cost at $70 without subtracting out a
few million dollars for various parts. Off the top of my head

1) 53 TB SAN
2) 1.8 TB RAM
3) Quadrics interconnect
4) 117TB local storage

The list just had a discussion about the cost of Quadrics. People were
guessing something like 3K per box?
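
The quoted $70 figure, and the effect of taking parts out of the price, can be checked with a short Python sketch. The function name and the excluded_usd parameter are made up for illustration, and the $5 million exclusion below is an arbitrary placeholder, not a number from the actual bid.

```python
def dollars_per_mflops(price_usd, n_procs, mflops_per_proc, excluded_usd=0.0):
    """Cost per sustained MFLOPS after removing excluded_usd from the price."""
    sustained_mflops = n_procs * mflops_per_proc
    return (price_usd - excluded_usd) / sustained_mflops


# Figures from the thread: $24.5 million, 1400 processors,
# about 250 MFLOPS per processor on the STREAM triad (estimated).
whole_contract = dollars_per_mflops(24.5e6, 1400, 250)

# Hypothetical: pretend $5 million of the price covers the SAN, RAM,
# interconnect and local disk. This amount is a placeholder only.
compute_only = dollars_per_mflops(24.5e6, 1400, 250, excluded_usd=5.0e6)

print(f"whole contract: ${whole_contract:.2f} per MFLOPS")
print(f"with exclusion: ${compute_only:.2f} per MFLOPS")
```

With nothing excluded this gives $70 per MFLOPS, matching the quoted estimate; any real exclusion lowers the figure in proportion.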

I haven't seen the actual breakdown of costs, and I'm sure it is under
some NDA anyway.

Having said that, I don't work for the MSCF folks, and I don't want to
speak for them. :)

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support


From raysonlogin at yahoo.com  Wed Apr 17 15:38:22 2002
From: raysonlogin at yahoo.com (Rayson Ho)
Date: Wed, 17 Apr 2002 15:38:22 -0700 (PDT)
Subject: /. US DOE gets a $24.5 Million Linux Supercomputer
In-Reply-To: <Pine.LNX.4.44.0204171405310.10892-100000@roach.emsl.pnl.gov>
Message-ID: <20020417223822.75469.qmail@web11407.mail.yahoo.com>

What OS do the machines run? If it is IA64 HP-UX, we need to
subtract out a few more million dollars...

Rayson

--- Tim Carlson <tim.carlson at pnl.gov> wrote:
> On Wed, 17 Apr 2002, Richard Walsh wrote:
> 
> > Stream numbers for this 1.5 GHz chip (estimated) would be around
> > 250 MFLOPS for the triad. Using the triad as a baseline for performance
> > for this and several other systems and relating it back to
> > some estimated cost for several other systems (government purchase
> > price only, no recurring costs) this is $70 per MFLOPS sustained
> > for the Mckinley (again using triad) ... or more than the CRAY SV2
> > ($65), EV6($55), EV7 ($50), Pentium 4 ($30).
> 
> I don't think you can calculate the cost at $70 without subtracting out a
> few million dollars for various parts. Off the top of my head
> 
> 1) 53 TB SAN
> 2) 1.8 TB RAM
> 3) Quadrics interconnect
> 4) 117TB local storage
> 
> The list just had a discussion about the cost of Quadrics. People were
> guessing something like 3K per box?
> 
> I haven't seen the actual breakdown of costs, and I'm sure it is under
> some NDA anyway.
> 
> Having said that, I don't work for the MSCF folks, and I don't want to
> speak for them. :)
> 
> Tim Carlson
> Voice: (509) 376 3423
> Email: Tim.Carlson at pnl.gov
> EMSL UNIX System Support
> 


From lindahl at conservativecomputer.com  Wed Apr 17 15:51:48 2002
From: lindahl at conservativecomputer.com (Greg Lindahl)
Date: Wed, 17 Apr 2002 15:51:48 -0700
Subject: /. US DOE gets a $24.5 Million Linux Supercomputer
In-Reply-To: <200204171931.g3HJVVs22371@mycroft.ahpcrc.org>; from rbw@ahpcrc.org on Wed, Apr 17, 2002 at 02:31:31PM -0500
References: <200204171931.g3HJVVs22371@mycroft.ahpcrc.org>
Message-ID: <20020417155148.A2248@wumpus.skymv.com>

On Wed, Apr 17, 2002 at 02:31:31PM -0500, Richard Walsh wrote:

> 8.3 TFLOPS (if they are quoting peak) with 1400 processors 
> would mean they are getting chips with 1.5 GHz clocks (peak
> performance would be 6 GFLOPS per chip [4 ops per clock]). 
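
The quoted estimate can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function names are made up for illustration, and the 4 floating-point operations per clock is the figure assumed in the quoted text, not a published specification.

```python
def implied_clock_ghz(total_tflops, n_procs, flops_per_clock):
    """Clock rate each chip would need if total_tflops were a peak figure."""
    per_chip_gflops = total_tflops * 1000.0 / n_procs
    return per_chip_gflops / flops_per_clock


def peak_tflops(n_procs, clock_ghz, flops_per_clock):
    """Aggregate peak for n_procs chips at the given clock rate."""
    return n_procs * clock_ghz * flops_per_clock / 1000.0


# Figures from the announcement and the quoted estimate.
clock = implied_clock_ghz(8.3, 1400, 4)
peak = peak_tflops(1400, 1.5, 4)

print(f"implied clock: {clock:.2f} GHz")
print(f"peak at 1.5 GHz: {peak:.1f} TFLOPS")
```

Under those assumptions, 8.3 TFLOPS spread over 1400 processors implies a clock of about 1.48 GHz, and 1400 chips at 1.5 GHz would peak at 8.4 TFLOPS.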

This bid is for an install in the future, and it involves a
combination of McKinley and Madison parts. I don't believe that Intel
has made Madison's specs available, nor has HP made the specs of the
chipset they'll be using available.

It's likely that they aren't quoting peak; PNL prefers figures like
the actual speed of matrix-matrix multiply (DGEMM). Now the Itanium is
reasonably good at delivering a nice % of peak for DGEMM, but it's not
the same as peak. It's a much fairer number to use than peak, and
gives you a good idea of what the Top500 Linpack score will be.

> this is $70 per MFLOPS sustained 
> for the Mckinley (again using triad) ... or more than the CRAY SV2 
> ($65), EV6($55), EV7 ($50), Pentium 4 ($30).

And Cray SV2 figures aren't publicly available either. Cough.

> but you would think that QCHEM stuff 
> they run at PNNL (G98) will be mostly memory bound and therefore 
> the stream triad sustained performance is not too far off.

As you might guess, the bid required that you benchmark PNL's actual
codes at PNL's actual data sizes. I don't believe that your analysis
is correct.

Alas, I can no longer say that FSL was the only time a Linux cluster
won a traditional supercomputing bid.

greg


From shaeffer at neuralscape.com  Wed Apr 17 08:37:03 2002
From: shaeffer at neuralscape.com (Karen Shaeffer)
Date: Wed, 17 Apr 2002 08:37:03 -0700
Subject: /. US DOE gets a $24.5 Million Linux Supercomputer
In-Reply-To: <20020417223822.75469.qmail@web11407.mail.yahoo.com>; from raysonlogin@yahoo.com on Wed, Apr 17, 2002 at 03:38:22PM -0700
References: <Pine.LNX.4.44.0204171405310.10892-100000@roach.emsl.pnl.gov> <20020417223822.75469.qmail@web11407.mail.yahoo.com>
Message-ID: <20020417083703.A30407@synapse.neuralscape.com>

On Wed, Apr 17, 2002 at 03:38:22PM -0700, Rayson Ho wrote:
> What OS do the machines run? If it is IA64 HP-UX, we need to
> subtract out a few more million dollars...
> 
> Rayson
> 

CNET says it will run Linux.

http://news.com.com/2100-1001-884297.html

cheers,
Karen
-- 
 Karen Shaeffer
 Neuralscape; Santa Cruz, Ca. 95060
 shaeffer at neuralscape.com  http://www.neuralscape.com


From tim.carlson at pnl.gov  Wed Apr 17 18:54:11 2002
From: tim.carlson at pnl.gov (Tim Carlson)
Date: Wed, 17 Apr 2002 18:54:11 -0700 (PDT)
Subject: /. US DOE gets a $24.5 Million Linux Supercomputer
In-Reply-To: <20020417223822.75469.qmail@web11407.mail.yahoo.com>
Message-ID: <Pine.GSO.4.44.0204171832570.7887-100000@poincare.emsl.pnl.gov>

On Wed, 17 Apr 2002, Rayson Ho wrote:

> What OS do the machines run? If it is IA64 HP-UX, we need to
> subtract out a few more million dollars...

If it ran HP-UX, I don't think they would have bought it :)

It will run Linux.

And as Greg pointed out, for some of our core applications (NWChem for
one), the machine may be the best bang for the buck. Could you get more
Tflops for less money?  Of course you could. Factor in the fact that you
need fast access to the SAN, sustained saturation of the interconnect,
some horrendous amount of memory bandwidth, etc, etc. The RFP had some
pretty specific (and... err... odd) requirements.

Again.. I don't speak for the MSCF, I know nothing of the other bids,
disclaimer, disclaimer, disclaimer :)

One thing I will say is that from my office which is a good 50 yards from
the big computer room, I should be able to feel the heat this thing is
going to put out.

Tim Carlson
Voice: (509) 376 3423
Email: Tim.Carlson at pnl.gov
EMSL UNIX System Support


From ron_chen_123 at yahoo.com  Wed Apr 17 20:58:37 2002
From: ron_chen_123 at yahoo.com (Ron Chen)
Date: Wed, 17 Apr 2002 20:58:37 -0700 (PDT)
Subject: again OpenPBS vs SGE
In-Reply-To: <200204170816.MAA08490@nocserv.free.net>
Message-ID: <20020418035837.64134.qmail@web14702.mail.yahoo.com>

In fact, SGE 5.3 is the newest production version. See
the "announce" mailing list for details. But Sun
hasn't updated the official web site yet -- maybe due
to marketing strategy or something?

-Ron

--- Mikhail Kuzminsky <kus at free.net> wrote:
> According to Rayson Ho
> > From raysonlogin at yahoo.com Tue Apr 16 23:39:42 2002
> > Date: Tue, 16 Apr 2002 12:39:39 -0700 (PDT)
> > From: Rayson Ho <raysonlogin at yahoo.com>
> > Subject: Re: again OpenPBS vs SGE
> > To: Mikhail Kuzminsky <kus at free.net>, beowulf at beowulf.org
> > ...
> > 
> > > 2) The schedule algorithms are restricted to only one
> > >    default (this is inconsistent w/Chris Black message, as
> > >    I understand)
> > 
> > You talking about SGE 5.2.x?
>   Yes, I wrote about 5.2.3.1 which is last "production" version 
> currently available.
> 
> > Chris Black must be talking about SGE 5.3, which has several advanced
> > nice scheduler features:
> > 
> > http://www.hardi.se/products/literature/sun_grid_engine.pdf
> > 
> 
> Mikhail Kuzminsky
> Zelinsky Institute of Organic Chemistry
> Moscow


From joachim at lfbs.RWTH-Aachen.DE  Thu Apr 18 01:56:59 2002
From: joachim at lfbs.RWTH-Aachen.DE (Joachim Worringen)
Date: Thu, 18 Apr 2002 10:56:59 +0200
Subject: very high bandwidth, low latency manner?
References: <Pine.GSO.3.95.1020417190940.15244B-100000@mufasa.ra.informatik.uni-mannheim.de>
Message-ID: <3CBE8A5B.AADEC997@lfbs.rwth-aachen.de>

Markus Fischer wrote:
> I don't think we have a performance bug. We have developed
> a real world application using frequent communication and
> have tested/run it on multiple systems.

I think Hakon was thinking of a performance bug in ScaMPI (the MPI
library), not in your  application.

> No, I said that with larger numbers of nodes (I would like to talk
> about >100 , but here I mean more than 16) the scalability is limited
> (amount spent in communication increases significantly and speedup
> values decrease after a certain number of nodes) and yes
> the startup time also increases, which I thought to be caused
> by the SCI mechanisms of exporting/mapping mem).

If you could give some numbers, it would help very much. And which kind
of communication pattern is used in this application? Which MPI
communication calls, which message sizes?

> there is also PD SCI-MPICH which from reading papers applies for
> the same statement.

I am the author of SCI-MPICH. I do not understand the meaning of this
sentence of yours ("applies for the same statement"). What are you
referring to?

Anyway, I would be happy to test your application with SCI-MPICH on our
cluster. You may just want to send me an object file linked to dynamic
MPICH libraries, if you cannot publish the source code.

My bottom line is: I do not consider it good style to publicly blame a
product for bad performance without having checked back with the people
behind this product, while being a consultant for another product at the
same time.

 Joachim

-- 
|  _  RWTH|  Joachim Worringen
|_|_`_    |  Lehrstuhl fuer Betriebssysteme, RWTH Aachen
  | |_)(_`|  http://www.lfbs.rwth-aachen.de/~joachim
    |_)._)|  fon: ++49-241-80.27609 fax: ++49-241-80.22339


From alex at compusys.co.uk  Thu Apr 18 01:58:16 2002
From: alex at compusys.co.uk (alex at compusys.co.uk)
Date: Thu, 18 Apr 2002 09:58:16 +0100 (BST)
Subject: Intial Pallas performance with Myrinet on a 860 & E7500
Message-ID: <Pine.LNX.4.21.0204180953011.28436-100000@hpc-support.compusys.co.uk>

For your information, please look at the following performance
measurements for the 'C' class Myrinet2000 cards.

Details of the two machines (optimisation level: -fast & PGI):

- 2.4.17 kernel
- mpich-1.2.1..7b
- gm-1.5.1
- measurement performed between machines

860 Supermicro DCE:
- Dual P4 2 GHz
- C class Myrinet2000

The new E7500 Supermicro DDR:
- Dual P4 1.8GHz 
- C class Myrinet2000, using PCI-X slot


Notice the results for E7500 Sendrecv (and Exchange):

4194304 --> 290.29Mbytes/s

That is more than the ServerWorks LE chipset delivers.
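
The Mbytes/sec column in the tables below can be recomputed from the message size and the reported time. The Python sketch below assumes 1 Mbyte = 2^20 bytes and that each reported time covers one message for PingPong and PingPing, two for Sendrecv and four for Exchange, with t_max used for the latter two. These conventions are inferred from the figures in this posting rather than taken from the PMB documentation, and the function name is made up for illustration.

```python
MIB = 2 ** 20

# Messages moved per reported time sample (assumption inferred from the tables).
MESSAGES_PER_SAMPLE = {
    "PingPong": 1,
    "PingPing": 1,
    "Sendrecv": 2,
    "Exchange": 4,
}


def mbytes_per_sec(benchmark, nbytes, t_usec):
    """Throughput in Mbytes/sec for one table row (t_usec is t or t_max)."""
    seconds = t_usec * 1e-6
    return MESSAGES_PER_SAMPLE[benchmark] * nbytes / MIB / seconds


# 4194304-byte rows from the two result sets in this posting.
rows = [
    ("E7500", "PingPong", 4194304, 17300.95),
    ("E7500", "Sendrecv", 4194304, 27558.70),
    ("E7500", "Exchange", 4194304, 55053.10),
    ("860", "Sendrecv", 4194304, 42112.10),
]

for machine, benchmark, nbytes, t_usec in rows:
    rate = mbytes_per_sec(benchmark, nbytes, t_usec)
    print(f"{machine:6s} {benchmark:9s} {rate:8.2f} Mbytes/sec")
```

On these rows the recomputed values agree with the reported column to two decimals, and the E7500 Sendrecv rate comes out roughly 1.5 times the 860 figure.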


Alex

(shown results are limited due to mailing list size limit)
///////////////////////// E7500 ///////////////////////////////////////

#---------------------------------------------------
# Benchmarking PingPong 
# ( #processes = 2 ) 
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         8.79         0.00
            1         1000         8.96         0.11
            2         1000         8.94         0.21
            4         1000         8.96         0.43
            8         1000         8.96         0.85
           16         1000         9.03         1.69
           32         1000         9.32         3.27
           64         1000         9.44         6.47
          128         1000        12.14        10.06
      2097152           20      8726.80       229.18
      4194304           10     17300.95       231.20

#---------------------------------------------------
# Benchmarking PingPing 
# ( #processes = 2 ) 
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000        11.97         0.00
            1         1000        12.64         0.08
            2         1000        12.09         0.16
            4         1000        12.87         0.30
            8         1000        12.33         0.62
           16         1000        12.36         1.23
           32         1000        11.73         2.60
           64         1000        11.90         5.13
          128         1000        14.65         8.33
      2097152           20     13792.25       145.01
      4194304           10     27535.80       145.27

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv 
# ( #processes = 2 ) 
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000        12.60        12.60        12.60         0.00
            1         1000        12.68        12.69        12.68         0.15
            2         1000        12.76        12.76        12.76         0.30
            4         1000        12.73        12.73        12.73         0.60
            8         1000        12.52        12.53        12.53         1.22
           16         1000        12.59        12.59        12.59         2.42
           32         1000        11.74        11.74        11.74         5.20
           64         1000        11.81        11.81        11.81        10.34
          128         1000        14.41        14.42        14.42        16.93
      2097152           20     13778.64     13778.80     13778.72       290.30
      4194304           10     27558.40     27558.70     27558.55       290.29

#-----------------------------------------------------------------------------
# Benchmarking Exchange 
# ( #processes = 2 ) 
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000        20.93        20.93        20.93         0.00
            1         1000        20.96        20.97        20.97         0.18
            2         1000        20.96        20.97        20.96         0.36
            4         1000        20.96        20.97        20.97         0.73
            8         1000        21.03        21.05        21.04         1.45
           16         1000        21.01        21.02        21.02         2.90
           32         1000        21.30        21.30        21.30         5.73
           64         1000        21.33        21.34        21.33        11.44
          128         1000        23.98        23.98        23.98        20.36
      2097152           20     27563.40     27563.60     27563.50       290.24
      4194304           10     55052.30     55053.10     55052.70       290.63

#----------------------------------------------------------------
# Benchmarking Allreduce 
# ( #processes = 2 ) 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.15         0.15         0.15
            4         1000        19.22        19.23        19.23
            8         1000        19.25        19.26        19.25
           16         1000        19.31        19.33        19.32
           32         1000        20.02        20.03        20.03
           64         1000        20.39        20.40        20.40
          128         1000        25.96        25.97        25.96
      2097152           20     30421.15     30422.15     30421.65
      4194304           10     67887.70     67889.70     67888.70

#----------------------------------------------------------------
# Benchmarking Reduce 
# ( #processes = 2 ) 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.08         0.08         0.08
            4         1000        10.01        10.02        10.01
            8         1000        10.01        10.02        10.01
           16         1000        10.06        10.07        10.07
           32         1000        10.40        10.41        10.40
           64         1000        10.62        10.63        10.63
          128         1000        14.25        14.26        14.25
      2097152           20     25153.70     25231.65     25192.67
      4194304           10     49852.60     50856.40     50354.50

#----------------------------------------------------------------
# Benchmarking Reduce_scatter 
# ( #processes = 2 ) 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.53         0.55         0.54
            4         1000        21.00        21.00        21.00
            8         1000        21.39        21.40        21.39
           16         1000        21.29        21.30        21.29
           32         1000        21.70        21.71        21.70
           64         1000        22.37        22.38        22.37
          128         1000        25.39        25.40        25.40
      2097152           20     41247.20     41436.80     41342.00
      4194304           10     70550.10     70943.20     70746.65

#----------------------------------------------------------------
# Benchmarking Allgather 
# ( #processes = 2 ) 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000        12.16        12.16        12.16
            1         1000        12.61        12.61        12.61
            2         1000        12.52        12.52        12.52
            4         1000        12.41        12.41        12.41
            8         1000        12.69        12.69        12.69
           16         1000        12.71        12.71        12.71
           32         1000        13.26        13.27        13.26
           64         1000        13.06        13.06        13.06
          128         1000        17.72        17.73        17.72
      2097152           20     23349.75     23350.30     23350.03
      4194304           10     38150.00     38151.60     38150.80

#----------------------------------------------------------------
# Benchmarking Allgatherv 
# ( #processes = 2 ) 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000        12.48        12.48        12.48
            1         1000        12.41        12.42        12.42
            2         1000        12.16        12.17        12.16
            4         1000        12.21        12.21        12.21
            8         1000        12.52        12.52        12.52
           16         1000        12.35        12.35        12.35
           32         1000        13.09        13.09        13.09
           64         1000        12.80        12.80        12.80
          128         1000        17.19        17.19        17.19
      2097152           20     19057.95     19058.55     19058.25
      4194304           10     37964.00     37965.39     37964.70

#----------------------------------------------------------------
# Benchmarking Alltoall 
# ( #processes = 2 ) 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000        12.44        12.44        12.44
            1         1000        12.72        12.72        12.72
            2         1000        12.50        12.50        12.50
            4         1000        12.56        12.56        12.56
            8         1000        12.56        12.56        12.56
           16         1000        12.75        12.75        12.75
           32         1000        12.86        12.86        12.86
           64         1000        13.73        13.73        13.73
          128         1000        17.87        17.87        17.87
      2097152           20     19927.10     19927.60     19927.35
      4194304           10     39608.10     39609.79     39608.94

#----------------------------------------------------------------
# Benchmarking Bcast 
# ( #processes = 2 ) 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.09         0.09         0.09
            1         1000         9.09         9.09         9.09
            2         1000         9.07         9.08         9.07
            4         1000         9.06         9.07         9.07
            8         1000         9.09         9.10         9.09
           16         1000         9.15         9.16         9.16
           32         1000         9.44         9.44         9.44
           64         1000         9.56         9.57         9.56
          128         1000        11.63        11.64        11.63
      2097152           20      8740.05      8740.20      8740.12
      4194304           10     17313.69     17313.90     17313.80

#---------------------------------------------------
# Benchmarking Barrier 
# ( #processes = 2 ) 
#---------------------------------------------------
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000        12.45        12.45        12.45


/////////////////////////// 860 //////////////////////////////////


#---------------------------------------------------
# Benchmarking PingPong 
# ( #processes = 2 ) 
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         8.93         0.00
            1         1000         9.14         0.10
            2         1000         9.17         0.21
            4         1000         9.14         0.42
            8         1000         9.41         0.81
           16         1000         9.54         1.60
           32         1000         9.85         3.10
           64         1000        10.06         6.06
          128         1000        12.77         9.56
      2097152           20     11924.45       167.72
      4194304           10     23752.65       168.40

#---------------------------------------------------
# Benchmarking PingPing 
# ( #processes = 2 ) 
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000        11.94         0.00
            1         1000        12.47         0.08
            2         1000        12.83         0.15
            4         1000        13.02         0.29
            8         1000        12.41         0.61
           16         1000        12.82         1.19
           32         1000        12.07         2.53
           64         1000        12.27         4.98
          128         1000        14.50         8.42
      2097152           20     21075.20        94.90
      4194304           10     42104.29        95.00

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv 
# ( #processes = 2 ) 
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000        12.20        12.21        12.20         0.00
            1         1000        12.83        12.84        12.83         0.15
            2         1000        12.85        12.85        12.85         0.30
            4         1000        12.61        12.62        12.62         0.60
            8         1000        12.63        12.63        12.63         1.21
           16         1000        12.55        12.55        12.55         2.43
           32         1000        12.40        12.40        12.40         4.92
           64         1000        12.70        12.70        12.70         9.61
          128         1000        14.54        14.55        14.55        16.78
      2097152           20     21075.50     21075.85     21075.67       189.79
      4194304           10     42110.90     42112.10     42111.50       189.97

#-----------------------------------------------------------------------------
# Benchmarking Exchange 
# ( #processes = 2 ) 
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000        21.23        21.23        21.23         0.00
            1         1000        21.56        21.63        21.59         0.18
            2         1000        21.56        21.57        21.56         0.35
            4         1000        21.49        21.49        21.49         0.71
            8         1000        21.63        21.63        21.63         1.41
           16         1000        21.68        21.68        21.68         2.81
           32         1000        21.87        21.88        21.88         5.58
           64         1000        22.16        22.16        22.16        11.02
          128         1000        24.66        24.66        24.66        19.80
      2097152           20     42154.20     42155.05     42154.62       189.78
      4194304           10     84224.61     84225.20     84224.90       189.97

#----------------------------------------------------------------
# Benchmarking Allreduce 
# ( #processes = 2 ) 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.15         0.16         0.16
            4         1000        19.93        19.94        19.94
            8         1000        20.31        20.33        20.32
           16         1000        20.77        20.78        20.78
           32         1000        21.54        21.55        21.55
           64         1000        21.96        21.97        21.96
          128         1000        26.05        26.06        26.05
      2097152           20     36295.15     36300.15     36297.65
      4194304           10     72057.59     72060.60     72059.09

#----------------------------------------------------------------
# Benchmarking Reduce 
# ( #processes = 2 ) 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.09         0.09         0.09
            4         1000        10.48        10.49        10.49
            8         1000        10.92        10.93        10.92
           16         1000        11.03        11.04        11.03
           32         1000        11.40        11.41        11.40
           64         1000        11.65        11.66        11.65
          128         1000        13.85        13.86        13.86
      2097152           20     24145.65     24442.65     24294.15
      4194304           10     47357.39     48542.51     47949.95

#----------------------------------------------------------------
# Benchmarking Reduce_scatter 
# ( #processes = 2 ) 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.65         0.66         0.66
            4         1000        22.04        22.05        22.05
            8         1000        22.72        22.73        22.73
           16         1000        23.31        23.32        23.31
           32         1000        23.89        23.90        23.90
           64         1000        24.45        24.46        24.45
          128         1000        26.94        26.95        26.94
      2097152           20     33828.60     33844.10     33836.35
      4194304           10     67314.90     67377.90     67346.40

#----------------------------------------------------------------
# Benchmarking Allgather 
# ( #processes = 2 ) 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000        12.16        12.16        12.16
            1         1000        12.32        12.32        12.32
            2         1000        12.33        12.33        12.33
            4         1000        12.36        12.36        12.36
            8         1000        12.61        12.61        12.61
           16         1000        12.71        12.71        12.71
           32         1000        13.04        13.04        13.04
           64         1000        13.86        13.86        13.86
          128         1000        17.59        17.59        17.59
      2097152           20     26607.50     26608.05     26607.78
      4194304           10     53338.10     53338.91     53338.50

#----------------------------------------------------------------
# Benchmarking Allgatherv 
# ( #processes = 2 ) 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000        12.18        12.18        12.18
            1         1000        12.35        12.35        12.35
            2         1000        12.30        12.30        12.30
            4         1000        12.32        12.32        12.32
            8         1000        12.52        12.53        12.53
           16         1000        12.91        12.91        12.91
           32         1000        13.11        13.11        13.11
           64         1000        13.73        13.73        13.73
          128         1000        17.58        17.58        17.58
      2097152           20     26836.70     26838.00     26837.35
      4194304           10     53090.61     53091.80     53091.20

#----------------------------------------------------------------
# Benchmarking Alltoall 
# ( #processes = 2 ) 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000        12.86        12.86        12.86
            1         1000        12.85        12.85        12.85
            2         1000        13.19        13.19        13.19
            4         1000        13.01        13.01        13.01
            8         1000        13.23        13.23        13.23
           16         1000        13.43        13.44        13.44
           32         1000        13.78        13.78        13.78
           64         1000        14.41        14.41        14.41
          128         1000        18.18        18.18        18.18
      2097152           20     27169.85     27170.25     27170.05
      4194304           10     54303.90     54304.40     54304.15

#----------------------------------------------------------------
# Benchmarking Bcast 
# ( #processes = 2 ) 
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.07         0.10         0.08
            1         1000         9.29         9.30         9.29
            2         1000         9.23         9.24         9.24
            4         1000         9.25         9.26         9.26
            8         1000         9.54         9.55         9.55
           16         1000         9.69         9.69         9.69
           32         1000         9.96         9.98         9.97
           64         1000        10.14        10.15        10.15
          128         1000        12.21        12.22        12.21
      2097152           20     11937.15     11937.50     11937.32
      4194304           10     23764.00     23764.71     23764.35

#---------------------------------------------------
# Benchmarking Barrier 
# ( #processes = 2 ) 
#---------------------------------------------------
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000        12.90        12.90        12.90


From alex at compusys.co.uk  Thu Apr 18 02:58:03 2002
From: alex at compusys.co.uk (alex at compusys.co.uk)
Date: Thu, 18 Apr 2002 10:58:03 +0100 (BST)
Subject: Intial Pallas performance with Myrinet on a 860 & E7500
Message-ID: <Pine.LNX.4.21.0204181056110.28436-100000@hpc-support.compusys.co.uk>

For your information, please look at the following performance
measurements for the 'C' class Myrinet2000 cards.

Details of the two machines (optimisation level: -fast & PGI):

- 2.4.17 kernel
- mpich-1.2.1..7b
- gm-1.5.1
- measurement performed between machines

860 Supermicro DCE:
- Dual P4 2 GHz
- C class Myrinet2000

The new E7500 Supermicro DDR:
- Dual P4 1.8GHz 
- C class Myrinet2000, using PCI-X slot


Notice the results for E7500 Sendrecv:

4194304 --> 290.29 Mbytes/s

That is more than the ServerWorks LE chipset delivers.

Alex

(Results shown are truncated because of the mailing list size limit.)
///////////////////////// E7500 //////////////////////////////////////

#---------------------------------------------------
# Benchmarking PingPong 
# ( #processes = 2 ) 
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         8.79         0.00
            1         1000         8.96         0.11
            2         1000         8.94         0.21
            4         1000         8.96         0.43
            8         1000         8.96         0.85
           16         1000         9.03         1.69
           32         1000         9.32         3.27
           64         1000         9.44         6.47
          128         1000        12.14        10.06
      2097152           20      8726.80       229.18
      4194304           10     17300.95       231.20

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv 
# ( #processes = 2 ) 
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000        12.60        12.60        12.60         0.00
            1         1000        12.68        12.69        12.68         0.15
            2         1000        12.76        12.76        12.76         0.30
            4         1000        12.73        12.73        12.73         0.60
            8         1000        12.52        12.53        12.53         1.22
           16         1000        12.59        12.59        12.59         2.42
           32         1000        11.74        11.74        11.74         5.20
           64         1000        11.81        11.81        11.81        10.34
          128         1000        14.41        14.42        14.42        16.93
      2097152           20     13778.64     13778.80     13778.72       290.30
      4194304           10     27558.40     27558.70     27558.55       290.29


/////////////////////////// 860 //////////////////////////////////


#---------------------------------------------------
# Benchmarking PingPong 
# ( #processes = 2 ) 
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         8.93         0.00
            1         1000         9.14         0.10
            2         1000         9.17         0.21
            4         1000         9.14         0.42
            8         1000         9.41         0.81
           16         1000         9.54         1.60
           32         1000         9.85         3.10
           64         1000        10.06         6.06
          128         1000        12.77         9.56
      2097152           20     11924.45       167.72
      4194304           10     23752.65       168.40

#---------------------------------------------------
# Benchmarking PingPing 
# ( #processes = 2 ) 
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000        11.94         0.00
            1         1000        12.47         0.08
            2         1000        12.83         0.15
            4         1000        13.02         0.29
            8         1000        12.41         0.61
           16         1000        12.82         1.19
           32         1000        12.07         2.53
           64         1000        12.27         4.98
          128         1000        14.50         8.42
      2097152           20     21075.20        94.90
      4194304           10     42104.29        95.00

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv 
# ( #processes = 2 ) 
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000        12.20        12.21        12.20         0.00
            1         1000        12.83        12.84        12.83         0.15
            2         1000        12.85        12.85        12.85         0.30
            4         1000        12.61        12.62        12.62         0.60
            8         1000        12.63        12.63        12.63         1.21
           16         1000        12.55        12.55        12.55         2.43
           32         1000        12.40        12.40        12.40         4.92
           64         1000        12.70        12.70        12.70         9.61
          128         1000        14.54        14.55        14.55        16.78
      2097152           20     21075.50     21075.85     21075.67       189.79
      4194304           10     42110.90     42112.10     42111.50       189.97
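
The Mbytes/sec column in the tables above can be reproduced from the
message size and the timing column. The following is a minimal sketch,
not the benchmark's own code: it assumes that "Mbytes" means 2**20 bytes,
that PingPong and PingPing count one message per timed interval while
Sendrecv counts two, and that t_max is the time used for Sendrecv. Those
assumptions are inferred from the numbers in the tables; the function and
dictionary names are made up for illustration.

```python
MIB = 2 ** 20  # assumption: the tables' "Mbytes" are 2**20 bytes

# Messages counted per timed interval (assumed, inferred from the tables).
MESSAGES_PER_SAMPLE = {"PingPong": 1, "PingPing": 1, "Sendrecv": 2}


def mbytes_per_sec(benchmark, nbytes, t_usec):
    """Return throughput in Mbytes/sec for one table row.

    t_usec is the t[usec] column (or t_max[usec] for Sendrecv).
    """
    if t_usec <= 0:
        raise ValueError("time must be positive")
    seconds = t_usec * 1e-6
    return MESSAGES_PER_SAMPLE[benchmark] * nbytes / MIB / seconds


if __name__ == "__main__":
    # Rows copied from the tables above: (label, benchmark, bytes, usec).
    rows = [
        ("E7500 PingPong", "PingPong", 4194304, 17300.95),
        ("E7500 Sendrecv", "Sendrecv", 4194304, 27558.70),
        ("860 PingPong", "PingPong", 4194304, 23752.65),
        ("860 PingPing", "PingPing", 4194304, 42104.29),
    ]
    for label, bench, nbytes, t_usec in rows:
        print(f"{label:16s} {mbytes_per_sec(bench, nbytes, t_usec):8.2f} Mbytes/sec")
```

Under these assumptions the Sendrecv figure is the sum of both directions,
which is why it can exceed the PingPong figure on the same hardware.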


From joachim at lfbs.RWTH-Aachen.DE  Thu Apr 18 03:36:29 2002
From: joachim at lfbs.RWTH-Aachen.DE (Joachim Worringen)
Date: Thu, 18 Apr 2002 12:36:29 +0200
Subject: Intial Pallas performance with Myrinet on a 860 & E7500
References: <Pine.LNX.4.21.0204180953011.28436-100000@hpc-support.compusys.co.uk>
Message-ID: <3CBEA1AD.46C2A773@lfbs.rwth-aachen.de>

Thanks - is it possible to post related numbers for 
- intra-node communication (shared memory)
- mixed inter- and intra-node communication
- maybe more than 2 nodes (if available)

Is there a reason for the "initial" in your subject - do you expect
these numbers to change?

 Joachim

-- 
|  _  RWTH|  Joachim Worringen
|_|_`_    |  Lehrstuhl fuer Betriebssysteme, RWTH Aachen
  | |_)(_`|  http://www.lfbs.rwth-aachen.de/~joachim
    |_)._)|  fon: ++49-241-80.27609 fax: ++49-241-80.22339


From joachim at lfbs.RWTH-Aachen.DE  Thu Apr 18 05:29:34 2002
From: joachim at lfbs.RWTH-Aachen.DE (Joachim Worringen)
Date: Thu, 18 Apr 2002 14:29:34 +0200
Subject: very high bandwidth, low latency manner?
References: <Pine.GSO.3.95.1020418134623.27577A-100000@mufasa.ra.informatik.uni-mannheim.de>
Message-ID: <3CBEBC2E.B2F0BEDE@lfbs.rwth-aachen.de>

Markus Fischer wrote:
> 
> On Thu, 18 Apr 2002, Joachim Worringen wrote:
> 
> >Markus Fischer wrote:
> >If you could give some numbers, it would help very much. And which kind
> >of communication pattern is used in this application? Which MPI
> >communication calls, which message sizes?
> >
> >I am the author of SCI-MPICH. I do not understand the meaning of this
> >sentence of yours ("applies for the same statement"). What are you
> >referring to?
> >
> to have "the best performing"

Please indicate where you read this statement for SCI-MPICH. If it
appears at all, it says "best performing of the evaluated implementations"
in a certain context. I do not (contrary to Scali) say that SCI-MPICH is
the solution to all your problems. Please don't misquote me.

> >MPICH libraries, if you can not publish the source code.
> >
> >My bottom line is: I do not consider it good style to publicly blame a
> >product for bad performance without having checked back with the people
> >behind this product, and being a consultant for another product at the
> >same time.
> 
> the first message was a follow up to other messages as a real
> world application performance issue. As I stated
> earlier, I did not focus on bringing this application to the max
> on every system, but to use an existing system and see how it goes.

This is perfectly understood. But if you experience strange results
which do not relate to what you would expect, wouldn't it be a good idea
to ask (in this case the Scali guys): "Hey, what is the reason for these
ugly numbers?" before publicly announcing "This technology is not able
to scale beyond 8 nodes for a simple application!"? You won't deny that
there are numerous counter-examples.

> I also can read between the lines of some postings, too.

That doesn't help the reader of your postings.

BTW, it's enough to post to the mailing list, CC is not required.

 regards, Joachim

-- 
|  _  RWTH|  Joachim Worringen
|_|_`_    |  Lehrstuhl fuer Betriebssysteme, RWTH Aachen
  | |_)(_`|  http://www.lfbs.rwth-aachen.de/~joachim
    |_)._)|  fon: ++49-241-80.27609 fax: ++49-241-80.22339


From shin at guss.org.uk  Thu Apr 18 06:27:01 2002
From: shin at guss.org.uk (shin at guss.org.uk)
Date: Thu, 18 Apr 2002 14:27:01 +0100
Subject: Replacing Quad proc SMP multi node DEC Alpha Cluster with Linux Dual P4 cluster?
Message-ID: <20020418142700.A680@gre.ac.uk>

Hi,

We have an old cluster setup that has 3 Alpha 4100 nodes (each node
has 4x466 MHz processors) connected with Memory Channel (first version),
with 1 GB RAM per node. The cluster is used to run internal code, which is
mostly CFD (fine-grain synchronous) problems. The code is parallelized
and currently uses DEC's MPI implementation.

We now need to replicate this system at a remote site with an
eye on keeping the cost down, so the idea is to go with a bunch of
dual-processor P4 (2 GHz Xeon?) systems with 2 GB RAM each and a Myrinet
interconnect.

We expect to want to scale up to at least 8 of these dual nodes
initially.

I need to look into the performance of various aspects of the
proposed system as we have no experience in this type of setup.

Disclaimer: I don't necessarily know what I'm talking about - I'm
the hardware/admin guy; the parallel guys do all the coding! Sorry.

I'd appreciate any answers anyone could offer on:

1. In terms of floating point performance, looking at CFP2000 on
www.spec.org, the Xeon should offer much better FP performance
than the older Alphas we have. I could only find results for a 4100
5/533 (which is the closest to our current setup) and these were
much lower than the results from a Dell Precision Workstation 530 with
a 2.0 GHz processor.

So I assume this won't be an issue - we'll get fast processors. Is
there a motherboard that really stands out for offering the best support
for these processors - or should we even be looking at AMD MP systems
now? I'm not sure I have the timescale to get in test systems and
test anything out.

2. Quad systems seem to be way more expensive than duals and I could
only find quad systems running at 900Mhz per proc instead of 2GHz in
the duals - so I assume the quads are out on cost and proc. speed
alone.

3. One of my concerns was the use of MPI across 8 x dual Xeon nodes
versus 3 x quad Alpha nodes. I'm assuming that MPI(CH) will look after
everything necessary for us in terms of communication between
processors within a node and communication across nodes - but is the
speed of memory, throughput etc. a limiting factor on this type of PC
architecture? Will we hit latency issues within a node that we're
not currently hitting?

What sort of memory is recommended? DDR/SDRAM/other?

However, having ruled out the quads above: would they offer better
memory performance than the duals, on a par with the quad Alpha
nodes? (I appreciate it's not a like-for-like comparison.)

4. I think an entry-level Myrinet switch will enable me to connect 8
nodes - at a cost of approx. 2400 USD for a switch and 1700 USD per
Myrinet card per node? And it will offer better performance than our
MC - so I'm assuming that the choice of Myrinet is OK.

5. In terms of cache - we believe that the large cache on the
Alphas helps our performance quite significantly - as far as I can
determine the cache on the Xeons is still 256/512K? Presumably this
won't make that much of a difference as we're scaling out across 8
nodes instead of 3?

Many thanks in advance,
Rgds
Shin


From rbw at ahpcrc.org  Thu Apr 18 07:37:07 2002
From: rbw at ahpcrc.org (Richard Walsh)
Date: Thu, 18 Apr 2002 09:37:07 -0500
Subject: /. US DOE gets a $24.5 Million Linux Supercomputer
Message-ID: <200204181437.g3IEb7w27647@mycroft.ahpcrc.org>

Greg Lindahl wrote:

>This bid is for an install in the future, and it involves a
>combination of McKinley and Madison parts. I don't believe that Intel
>has made Madison's specs available, nor has HP made the specs of the 
>chipset they'll be using available. 

 True, but the dual floating point units in the core are not
 likely to be added to ... so it's a question of what the clock
 is going to be (is my estimate of 1.5 GHz unreasonable?), what
 the impact of the chipset/system bus on memory bandwidth is
 going to be, and cache sizes.

>
>It's likely that they aren't quoting peak; PNL prefers figures like
>the actual speed of matrix-matrix multiply (DGEMM). Now the Itanium is
>reasonably good at delivering a nice % of peak for DGEMM, but it's not
>the same as peak. It's a much fairer number to use than peak, and
>gives you a good idea of what the Top500 Linpack score will be.

 True, the source is not official, I guess, but when no qualifying
 information is given the numbers presented are usually peak. If
 they aren't, then my numbers would need to be reworked on a different
 envelope ;-) ...
 
 ... but there is another issue: if we assume that the 8.3 TFLOPS is DGEMM
 performance at, say, 50% of peak (doing this on a large matrix (G98)
 would require very good bandwidth to memory), then these 1400 processors
 must have a system peak of around 17 TFLOPS. What does this mean for
 clock period ... ??

 Assuming the same number (4) of floating point operations per clock on
 the Madison, each processor would need a peak of 12 GFLOPS.  This would
 mean that the processors would have to be running at 3 GHz (when are they
 taking delivery?).  This seems a bit high to me, seeing as the Itanium is
 sitting at 800 MHz and does not have a 20-stage pipeline like the
 Pentium 4 ... but if the delivery is far enough into the future, who knows.
 The idea that the Madison will have more floating point units seems
 unlikely (how are you going to feed them without real vector memory loads?).
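
 The arithmetic above can be written out as a short sketch. It assumes,
 as the message does, 1400 processors, 4 floating point operations per
 clock, and that the quoted 8.3 TFLOPS is either peak or DGEMM at 50% of
 peak. The function name is made up for illustration.

```python
def required_clock_ghz(quoted_tflops, fraction_of_peak, n_procs, flops_per_clock):
    """Clock rate (GHz) each processor needs for the quoted system figure.

    quoted_tflops    -- system figure as quoted (8.3 in the announcement)
    fraction_of_peak -- 1.0 if the quote is peak, 0.5 if it is DGEMM at 50%
    n_procs          -- number of processors (1400 in the announcement)
    flops_per_clock  -- floating point operations per clock (4 assumed)
    """
    system_peak_flops = quoted_tflops * 1e12 / fraction_of_peak
    per_proc_peak_flops = system_peak_flops / n_procs
    return per_proc_peak_flops / flops_per_clock / 1e9


if __name__ == "__main__":
    for fraction in (1.0, 0.5):
        ghz = required_clock_ghz(8.3, fraction, 1400, 4)
        print(f"quoted figure = {fraction:.0%} of peak -> {ghz:.2f} GHz per processor")
```

 With the quote taken as peak this gives roughly 1.5 GHz; taken as 50% of
 peak it gives roughly 3 GHz, which are the two cases discussed here.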

 My $$ per MFLOPS estimates are ballpark numbers, but did include
 the cost of interconnect (Myrinet or better), and a large chunk
 of disk and memory. But I won't claim they are perfectly
 apples-to-apples. They were estimated based on estimated purchase
 price only ... they do not include total cost of ownership effects
 or factor in expected utilization over the term of ownership (an
 important consideration).

 When I saw the posting, I was surprised at how few IA-64 processors
 (even with the extras) were to be had for ~$25,000,000. The Pentium
 4 looks a better deal at this level of analysis.

 Cheers,

 rbw


From pblaise at cea.fr  Thu Apr 18 08:21:18 2002
From: pblaise at cea.fr (Philippe Blaise - GRENOBLE)
Date: Thu, 18 Apr 2002 17:21:18 +0200
Subject: /. US DOE gets a $24.5 Million Linux Supercomputer
References: <Pine.LNX.4.33.0204171703580.10977-100000@hydrogen.leitl.org>
Message-ID: <3CBEE46E.D0B8158D@cea.fr>

Oh la la, beautiful machine!

Any idea about the QSNet2/elan4 quick specs?
And what file system(s) will be used to achieve 200 MB/s of bandwidth
in parallel: a new super HP/Linux one?

  Philippe Blaise

Eugen Leitl wrote:

> http://slashdot.org/articles/02/04/17/1324227.shtml?tid=162
>
> An anonymous reader wrote in to say "Pacific Northwest National Laboratory
> (US DOE) signed a $24.5 million dollar contract with HP for a Linux
> supercomputer. This will be one of the top ten fastest computers in the
> world. Some cool features: 8.3 Trillion Floating Point Operations per
> Second, 1.8 Terabytes of RAM, 170 Terabytes of disk, (including a 53 TB
> SAN), and 1400 Intel McKinley and Madison Processors. Nice quote: 'Today's
> announcement shows how HP has worked to help accelerate the shift from
> proprietary platforms to open architectures, which provide increased
> scalability, speed and functionality at a lower cost,' said Rich DeMillo,
> vice president and chief technology officer at HP. Read Details of the
> announcement here or here."
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


From mfischer at mufasa.informatik.uni-mannheim.de  Thu Apr 18 04:54:51 2002
From: mfischer at mufasa.informatik.uni-mannheim.de (Markus Fischer)
Date: Thu, 18 Apr 2002 13:54:51 +0200 (MEST)
Subject: very high bandwidth, low latency manner?
In-Reply-To: <3CBE8A5B.AADEC997@lfbs.rwth-aachen.de>
Message-ID: <Pine.GSO.3.95.1020418134623.27577A-100000@mufasa.ra.informatik.uni-mannheim.de>

On Thu, 18 Apr 2002, Joachim Worringen wrote:

>Markus Fischer wrote:
>If you could give some numbers, it would help very much. And which kind
>of communication pattern is used in this application? Which MPI
>communication calls, which message sizes?
>
>I am the author of SCI-MPICH. I do not understand the meaning of this
>sentence of yours ("applies for the same statement"). What are you
>referring to?
>
to have "the best performing" 

>MPICH libraries, if you can not publish the source code. 
>
>My bottom line is: I do not consider it good style to publicly blame a
>product for bad performance without having checked back with the people
>behind this product, and being a consultant for another product at the
>same time.

the first message was a follow up to other messages as a real
world application performance issue. As I stated 
earlier, I did not focus on bringing this application to the max
on every system, but to use an existing system and see how it goes.

I know this general process of 'yes this is the current status but
we have a bunch of fixes which will help you' very well.

I also can read between the lines of some postings, too.

Markus

> Joachim
>
>-- 
>|  _  RWTH|  Joachim Worringen
>|_|_`_    |  Lehrstuhl fuer Betriebssysteme, RWTH Aachen
>  | |_)(_`|  http://www.lfbs.rwth-aachen.de/~joachim
>    |_)._)|  fon: ++49-241-80.27609 fax: ++49-241-80.22339


From raysonlogin at yahoo.com  Thu Apr 18 10:54:06 2002
From: raysonlogin at yahoo.com (Rayson Ho)
Date: Thu, 18 Apr 2002 10:54:06 -0700 (PDT)
Subject: Replacing Quad proc SMP multi node DEC Alpha Cluster with Linux Dual P4 cluster?
In-Reply-To: <20020418142700.A680@gre.ac.uk>
Message-ID: <20020418175406.91343.qmail@web11408.mail.yahoo.com>

--- shin at guss.org.uk wrote:
> 1. In terms of the floating point performance, looking at CFP2000 on
> www.spec.org and the Xeon should offer much better FP performace
> that the older alphas we have. I could only find results for a 4100
> 5/533 (which is the closest to our current setup) and these were
> much lower than the results from Dell Precision Workstation 530 with
> 2.0Ghz proc.

Since you are using the processors in SMP configurations, you should be
looking at SPECfp_rate2000. 

SPECfp2000 tells you how fast your code runs on a single CPU. But with
SPECfp_rate2000, you can see how well a processor scales in an SMP
configuration.

One thing I found last year was that when the processors share the memory
bandwidth on an SMP machine, the performance is really bad. My
configuration was dual P3s with Myrinet. I measured the performance of
the cluster running MPI programs using 8 machines with 1 process on
each machine, and 4 machines with 2 processes on each. To my surprise,
1 process on 8 machines performed better.

My prof. then gave us his results on an IBM server machine (not a PC
server); his results were the opposite. His conclusion was that the
memory bandwidth of PCs does not scale with the number of processors.

(BTW, it was an assignment -- I wasn't the only one who found similar
results -- there were around 20 people in that class)
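
The effect described above can be illustrated with a toy model. This is
only a sketch under strong assumptions, not a measurement: it assumes the
code is limited either by CPU peak or by an even share of the node's
memory bandwidth, and it ignores communication cost entirely. All of the
numbers and names below are hypothetical placeholders.

```python
def per_process_mflops(cpu_peak_mflops, node_mem_bw_mbs, bytes_per_flop, procs_per_node):
    """MFLOPS one process can sustain when the node's memory bandwidth
    is split evenly among the processes running on that node."""
    bandwidth_share_mbs = node_mem_bw_mbs / procs_per_node
    return min(cpu_peak_mflops, bandwidth_share_mbs / bytes_per_flop)


def cluster_mflops(n_nodes, procs_per_node, cpu_peak_mflops, node_mem_bw_mbs, bytes_per_flop):
    """Aggregate MFLOPS over the whole cluster (communication ignored)."""
    rate = per_process_mflops(cpu_peak_mflops, node_mem_bw_mbs,
                              bytes_per_flop, procs_per_node)
    return n_nodes * procs_per_node * rate


if __name__ == "__main__":
    # Hypothetical placeholder values, chosen only to show the trend.
    machine = dict(cpu_peak_mflops=1000.0, node_mem_bw_mbs=800.0, bytes_per_flop=8.0)
    print("8 nodes x 1 process:  ", cluster_mflops(8, 1, **machine))
    print("4 nodes x 2 processes:", cluster_mflops(4, 2, **machine))
```

In this model, a memory-bound code gets no benefit from the second CPU on
a node, so 8 processes spread over 8 nodes outrun 8 processes packed onto
4 nodes. A compute-bound code (small bytes_per_flop) shows no difference.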

> 2. Quad systems seem to be way more expensive than duals and I could
> only find quad systems running at 900Mhz per proc instead of 2GHz in
> the duals - so I assume the quads are out on cost and proc. speed
> alone.

I believe quad systems will not give you double the performance of
the duals, even if you use 2 GHz CPUs. The PC (or should I say
Intel??) architecture has a shared memory bus, which does not scale
with the number of CPUs.

BTW, is AMD MP better?? I've heard that each Athlon MP CPU talks to its
own system chipset.

> 
> 3. One of my concerns was the use of mpi across 8xdual Xeon nodes
> versus 3xquad alpha nodes. I'm assuming that mpi(ch) will look after
> all the necessary for us in terms of communication between
> processors within a node and communication across nodes - but is the
> speed of memory, throughput etc a limiting factor on this type of PC
> architecture? Will we hit latency issues within a node that we're
> not currently hitting?

See above.

You can actually use OpenMP within the nodes and MPI between the nodes.
However, MPICH and LAM MPI are not thread safe...

Rayson




From hahn at physics.mcmaster.ca  Thu Apr 18 11:07:55 2002
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Thu, 18 Apr 2002 14:07:55 -0400 (EDT)
Subject: Dual Xeon Clusters
In-Reply-To: <3CBCCD58.1080104@yahoo.com>
Message-ID: <Pine.LNX.4.33.0204181353590.15475-100000@coffee.psychology.mcmaster.ca>

> I am building a dual Xeon 4-node cluster; My understanding of 
> hyperthreading leaves me to conclusion that it depends largely on the 
> code to benefit from it. 

I think that's an ambiguous way to put it.  yes, the benefit depends
very much on what your code does.  but no, your code doesn't have to 
do anything special: HT simply turns one cpu into a pair of "virtual"
cpus which compete for the same fixed set of resources.  

so you don't necessarily have to use threads to see HT benefit,
since the virtualization is not visible at the programmer level.
but you will certainly not see any HT benefit if your program 
somehow manages to keep every functional unit (including cache and 
ram) busy all the time.

> Otherwise in many cases the performance can 
> become worse than before using a hyperthreaded Xeon processors. My 

well, first, you don't have to turn on HT at all, so a prestonia
can be expected to behave like a northwood.  but I would expect most 
system resources to deteriorate linearly, except for cache, which 
is highly nonlinear when working set size is near cache size.
so if you run two threads/procs on an HT chip with small working
sets (both seeing high cache hit rates), they should get along just fine, 
with half the speed each.  the same goes for very large working sets
(where cache is basically irrelevant).

hmm, maybe that's too verbose to be clear.  in short:
	1. if you normally have some idle resources, HT may let
	you achieve higher efficiency through interleaving.
	2. some physical resources will deteriorate fairly linearly,
	so two procs see half the throughput.
	3. several resources can behave nonlinearly, so two procs
	could interfere badly when interleaved.
	4. most of these issues are the same for traditional timesliced
	multiprocessing, except that HT interleaves much finer.

> If not what other way can i find the 
> performance of Xeon processors in a clustered env.

I don't know why clustering would change anything.


From edwardsa at plk.af.mil  Thu Apr 18 10:21:20 2002
From: edwardsa at plk.af.mil (Arthur H. Edwards)
Date: Thu, 18 Apr 2002 11:21:20 -0600
Subject: /. US DOE gets a $24.5 Million Linux Supercomputer
In-Reply-To: <200204171931.g3HJVVs22371@mycroft.ahpcrc.org>; from rbw@ahpcrc.org on Wed, Apr 17, 2002 at 02:31:31PM -0500
References: <200204171931.g3HJVVs22371@mycroft.ahpcrc.org>
Message-ID: <20020418112120.B12634@plk.af.mil>

On Wed, Apr 17, 2002 at 02:31:31PM -0500, Richard Walsh wrote:
> 
> Eugen Leitl wrote:
> 
> >http://slashdot.org/articles/02/04/17/1324227.shtml?tid=162
> 
> >An anonymous reader wrote in to say "Pacific Northwest National Laboratory 
> >(US DOE) signed a $24.5 million dollar contract with HP for a Linux 
> >supercomputer. This will be one of the top ten fastest computers in the 
> >world. Some cool features: 8.3 Trillion Floating Point Operations per 
> >Second, 1.8 Terabytes of RAM, 170 Terabytes of disk, (including a 53 TB 
> >SAN), and 1400 Intel McKinley and Madison Processors. Nice quote: 'Today's 
> >announcement shows how HP has worked to help accelerate the shift from 
> >proprietary platforms to open architectures, which provide increased 
> >scalability, speed and functionality at a lower cost,' said Rich DeMillo,
> >vice president and chief technology officer at HP. Read Details of the
> >announcement here or here."
> 
> Mmmm ... working through some numbers ...
> 
> 8.3 TFLOPS (if they are quoting peak) with 1400 processors 
> would mean they are getting chips with 1.5 GHz clocks (peak
> performance would be 6 GFLOPS per chip [4 ops per clock]). 
> 
> Stream numbers for this 1.5 GHz chip (estimated) would be around
> 250 MFLOPS for the triad. Using the triad as a baseline for performance
> for this and several other systems and relating it back to 
> some estimated cost for several other systems (government purchase 
> price only, no recurring costs) this is $70 per MFLOPS sustained 
> for the McKinley (again using triad) ... or more than the CRAY SV2 
> ($65), EV6 ($55), EV7 ($50), Pentium 4 ($30).
> 
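
The arithmetic in the two quoted paragraphs above can be reproduced with
a short sketch. It takes the quoted assumptions at face value: 8.3 TFLOPS
is a peak figure, each chip retires 4 floating point operations per
clock, and the estimated STREAM triad rate is 250 MFLOPS per processor.
The function names are illustrative only.

```python
# Back-of-envelope check of the figures quoted above.
# Assumptions (from the quoted message, not measured): 8.3 TFLOPS is peak,
# 4 flops per clock per chip, 250 MFLOPS sustained per chip on the triad.

def implied_clock_ghz(peak_tflops, n_procs, flops_per_clock):
    """Clock rate each chip would need to reach the quoted peak."""
    per_chip_gflops = peak_tflops * 1000.0 / n_procs
    return per_chip_gflops / flops_per_clock

def dollars_per_sustained_mflops(price_usd, n_procs, sustained_mflops_per_proc):
    """Purchase price divided by aggregate sustained MFLOPS."""
    return price_usd / (n_procs * sustained_mflops_per_proc)

clock = implied_clock_ghz(8.3, 1400, 4)
cost = dollars_per_sustained_mflops(24.5e6, 1400, 250)
print(f"implied clock: {clock:.2f} GHz")
print(f"cost: ${cost:.0f} per sustained MFLOPS")
```

Changing the sustained rate per processor shows how sensitive the
dollars-per-MFLOPS figure is to the choice of baseline.
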
> Interesting number ... the high-end IA-64 stuff does not look 
> cheap when stream triad defines sustained performance. Of course, 
> blocking for cache will push the sustained number up (maybe a lot
> and on all the systems), but you would think that QCHEM stuff 
> they run at PNNL (G98) will be mostly memory bound and therefore 
I think they will be using NWChem, an intrinsically parallel code. It
has some really bad serial numbers but apparently scales fairly well.
> the stream triad sustained performance is not too far off.
> 
> I am not sure this looks like a very good deal. 
> 
> rbw
> 
> 
> #---------------------------------------------------
> # Richard Walsh
> # Project Manager, Cluster Computing, Computational
> #                  Chemistry and Finance
> # netASPx, Inc.
> # 1200 Washington Ave. So.
> # Minneapolis, MN 55415
> # VOX:    612-337-3467
> # FAX:    612-337-3400
> # EMAIL:  rbw at networkcs.com, richard.walsh at netaspx.com
> #
> #---------------------------------------------------
> # "What you can do, or dream you can, begin it;
> #  Boldness has genius, power, and magic in it."
> #                                  -Goethe
> #---------------------------------------------------
> # "Without mystery, there can be no authority."
> #                                  -Charles DeGaulle
> #---------------------------------------------------
> # "Why waste time learning when ignornace is
> #  instantaneous?"                 -Thomas Hobbes
> #---------------------------------------------------
> # "In the chaos of a river thrashing, all that water
> #  still has to stand in line."    -Dave Dobbyn
> #---------------------------------------------------

-- 
Arthur H. Edwards
AFRL/VSSE
Bldg. 914
3550 Aberdeen Ave SE
KAFB, NM 87117-5776


From shewa at inel.gov  Thu Apr 18 12:08:30 2002
From: shewa at inel.gov (Andrew Shewmaker)
Date: Thu, 18 Apr 2002 13:08:30 -0600
Subject: Replacing Quad proc SMP multi node DEC Alpha Cluster with Linux
 Dual P4 cluster?
References: <20020418175406.91343.qmail@web11408.mail.yahoo.com>
Message-ID: <3CBF19AE.1050105@inel.gov>

Rayson Ho wrote:

>--- shin at guss.org.uk wrote:
>
>
>One thing I found last year was that when the processors share the memory
>bandwidth on an SMP machine, the performance is really bad. My
>configuration was dual-P3s with Myrinet. I measured the performance of
>the cluster running MPI programs using 8 machines with 1 process on
>each machine, and 2 processes on 4 machines. To my surprise, 1 process
>on 8 machines had better performance.
>
For the 1 process on 8 machines case, were those 8 machines also duals?  If
they were, did you notice if one cpu was taking care of networking overhead
while the other was doing work?  If these were duals, did you also run the
same tests on a similar-speed uniprocessor system?  Just wondering.

>
>
>My prof. then gave us his results on an IBM server machine (not PC
>server), his results were the opposite. His conclusion was that the
>memory bandwidth of PCs does not scale with the number of processors.
>
>(BTW, it was an assignment -- I wasn't the only one who found similar
>results -- there were around 20 people in that class)
>
>>2. Quad systems seem to be way more expensive than duals and I could
>>only find quad systems running at 900Mhz per proc instead of 2GHz in
>>the duals - so I assume the quads are out on cost and proc. speed
>>alone.
>>
>
>I believe the performance of quad systems will not give you double that of
>the duals, even if you use the 2GHz CPUs. The PC (or should I say
>Intel??) architecture has the shared memory bus, which does not scale
>with the #CPUs.
>
>BTW, is AMD MP better?? I've heard that each Athlon MP CPU talks to its
>own system cpuset.
>
I have mostly used dual AMDs in a high throughput rather than high
performance setting.  Our CFD codes would take 99% of each processor and
500-800 MB each of RAM and would not interfere with each other.  Each would
complete in the same amount of time as we saw in the 1:1 case.

We also are using a monte carlo based code over PVM and there is almost no
difference between 1*8 and 2*4.  I don't remember how much memory each
process uses though (less than the above).
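
One way to reconcile the two sets of observations is a toy model in
which a process is limited either by its own compute rate or by its
share of the node's memory bus. This is only a sketch: the bandwidth,
flop rate and job sizes below are made-up numbers, not measurements from
any of the machines discussed.

```python
# Toy model: the runtime of one process is bounded either by its compute
# time or by the time to stream its data over a memory bus that is shared
# by all processes on the same node. All numbers are hypothetical.

def runtime_s(flops, bytes_moved, cpu_flops, bus_bytes_per_s, procs_per_node):
    compute = flops / cpu_flops
    memory = bytes_moved / (bus_bytes_per_s / procs_per_node)
    return max(compute, memory)

CPU = 1.0e9   # hypothetical 1 GFLOPS per processor
BUS = 1.0e9   # hypothetical 1 GB/s memory bus per node

# Memory-bound job: a lot of data moved per flop.
mem_1 = runtime_s(1e9, 4e9, CPU, BUS, 1)
mem_2 = runtime_s(1e9, 4e9, CPU, BUS, 2)

# Compute-bound job: little data moved per flop.
cpu_1 = runtime_s(1e10, 1e9, CPU, BUS, 1)
cpu_2 = runtime_s(1e10, 1e9, CPU, BUS, 2)

print(f"memory-bound:  1/node {mem_1:.0f}s, 2/node {mem_2:.0f}s")
print(f"compute-bound: 1/node {cpu_1:.0f}s, 2/node {cpu_2:.0f}s")
```

In this model the memory-bound job slows down when two processes share a
node, while the compute-bound job does not notice, which would be
consistent with seeing little difference between 1*8 and 2*4 for some
codes and a large difference for others.
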

We have been pleased so far.

Andrew


From lindahl at keyresearch.com  Thu Apr 18 11:12:52 2002
From: lindahl at keyresearch.com (Greg Lindahl)
Date: Thu, 18 Apr 2002 11:12:52 -0700
Subject: /. US DOE gets a $24.5 Million Linux Supercomputer
In-Reply-To: <200204181437.g3IEb7w27647@mycroft.ahpcrc.org>; from rbw@ahpcrc.org on Thu, Apr 18, 2002 at 09:37:07AM -0500
References: <200204181437.g3IEb7w27647@mycroft.ahpcrc.org>
Message-ID: <20020418111252.A2016@wumpus.skymv.com>

On Thu, Apr 18, 2002 at 09:37:07AM -0500, Richard Walsh wrote:

>  True, but the dual-floating point units in the core are not
>  likely to be added to ...

I don't think that's a good guess. Not only are conventional cpus
getting wider over time, but the whole point of EPIC and VLIW in
general is that they potentially go _really_ wide. And most scientific
codes can use lots of functional units.

>  When I saw the posting, I was surprised how few IA-64 processors 
>  (even with the extras) were to be had for ~$25,000,000. The Pentium
>  4 looks a better deal at this level of analysis.

PNL's codes require 64-bit addressing. I don't think anyone was
willing to bid AMD's Hammer, although it was a possibility.

The problem with pricing this bid is that it's all forward-priced. HP
is committing to delivering Madison processors when McKinley isn't
even formally released. That means risk, and that means you need to
subtract off $$ to cover that risk. That skews all of your analysis.

BTW, the interconnect is next generation Quadrics. I've never seen any
specs for it, nor pricing.

greg


From alex at compusys.co.uk  Thu Apr 18 13:57:49 2002
From: alex at compusys.co.uk (alex at compusys.co.uk)
Date: Thu, 18 Apr 2002 21:57:49 +0100 (BST)
Subject: Intial Pallas performance with Myrinet on a 860 & E7500
In-Reply-To: <200204181601.g3IG1jG09278@blueraja.scyld.com>
Message-ID: <Pine.LNX.4.21.0204182130360.4994-100000@hpc-support.compusys.co.uk>

I can show PALLAS -multi numbers up to 94 CPUs or so. It would be
interesting to see whether Dolphin can keep up with that; Myrinet so far
looks pretty solid.

I will make the results public as soon as we have confirmation that the
results are within the boundaries of what is expected.

'Initial' means that with pushing and tweaking we can probably gain that
last 10% of performance.

Alex


From france at handhelds.org  Thu Apr 18 10:11:28 2002
From: france at handhelds.org (George France)
Date: Thu, 18 Apr 2002 13:11:28 -0400
Subject: Replacing Quad proc SMP multi node DEC Alpha Cluster with Linux Dual P4 cluster?
In-Reply-To: <20020418142700.A680@gre.ac.uk>
References: <20020418142700.A680@gre.ac.uk>
Message-ID: <02041813112800.01959@shadowfax.middleearth>

Have you considered a quad processor EV68 (EV7 when they become available) 
Alpha system?  

--George

On Thursday 18 April 2002 09:27, shin at guss.org.uk wrote:
> Hi,
>
> We have an old cluster setup that has 3 Alpha 4100 nodes (each node
> has 4x466 processors) connected with memory channel (first version),
> 1Gb Ram per node. The cluster is used to run internal code which is
> mostly CFD (fine grain synchronous) problems. The code is parallelized
> and currently uses dec's mpi implementation.
>
> We now need to replicate this system at a remote site, and with an
> eye on keeping the cost down, so the idea is to go with a bunch of
> dual processor P4 (2GHz xeon?) systems with 2Gb ram each and myrinet
> interconnect.
>
> We expect to want to scale up to at least 8 of these dual nodes
> initially.
>
> I need to look into the performance of various aspects of the
> proposed system as we have no experience in this type of setup.
>
> Disclaimer: I don't necessarily know what I'm talking about - I'm
> the hardware/admin guy; the parallel guys do all the coding! Sorry.
>
> I'd appreciate any answers anyone could offer on:
>
> 1. In terms of floating point performance, looking at CFP2000 on
> www.spec.org, the Xeon should offer much better FP performance
> than the older alphas we have. I could only find results for a 4100
> 5/533 (which is the closest to our current setup) and these were
> much lower than the results from Dell Precision Workstation 530 with
> 2.0GHz proc.
>
> So I assume this won't be an issue - we'll get fast processors. Is
> there a mboard that really sticks out here for offering best support
> to these processors - or should we even be looking at AMD MP systems
> now. I'm not sure I have the timescale to get in test systems and
> test anything out.
>
> 2. Quad systems seem to be way more expensive than duals and I could
> only find quad systems running at 900Mhz per proc instead of 2GHz in
> the duals - so I assume the quads are out on cost and proc. speed
> alone.
>
> 3. One of my concerns was the use of mpi across 8xdual Xeon nodes
> versus 3xquad alpha nodes. I'm assuming that mpi(ch) will look after
> all the necessary for us in terms of communication between
> processors within a node and communication across nodes - but is the
> speed of memory, throughput etc a limiting factor on this type of PC
> architecture? Will we hit latency issues within a node that we're
> not currently hitting?
>
> What sort of memory is recommended? DDR/SDRAM/other?
>
> However having ruled out the quads above - will they offer better
> memory performance than the duals - on a par with the quad alpha
> nodes? (I appreciate it's not a like for like comparison).
>
> 4. I think an entry level myrinet switch will enable me to connect 8
> nodes - at a cost of approx 2400 USD for a switch and 1700 USD per
> myrinet card per node? And it will offer better performance than our
> MC - so I'm assuming that the choice of myrinet is ok.
>
> 5. In terms of cache - we believe that the large cache on the
> alpha's helps our performance quite significantly - as far as I can
> determine the cache on the xeons is still 256/512K? Presumably this
> won't make that much of a difference as we're scaling out across 8
> nodes instead of 3?
>
> Many thanks in advance,
> Rgds
> Shin
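
For reference, the Myrinet figures in the quoted message work out as
follows. This is a minimal sketch using only the approximate prices
quoted there (2400 USD for a switch, 1700 USD per card per node) and
assuming a single entry-level switch is enough for 8 nodes; cables and
any discounts are not included.

```python
# Interconnect cost from the approximate prices quoted above:
# one entry-level switch plus one Myrinet card per node.
SWITCH_USD = 2400
CARD_USD = 1700

def myrinet_cost(n_nodes, switch_usd=SWITCH_USD, card_usd=CARD_USD):
    """Return (total, per-node) cost, assuming one switch suffices."""
    total = switch_usd + n_nodes * card_usd
    return total, total / n_nodes

total, per_node = myrinet_cost(8)
print(f"8 nodes: {total} USD total, {per_node:.0f} USD per node")
```
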


From shin at guss.org.uk  Fri Apr 19 02:47:17 2002
From: shin at guss.org.uk (shin at guss.org.uk)
Date: Fri, 19 Apr 2002 10:47:17 +0100
Subject: Replacing Quad proc SMP multi node DEC Alpha Cluster with Linux Dual P4 cluster?
In-Reply-To: <02041813112800.01959@shadowfax.middleearth>; from france@handhelds.org on Thu, Apr 18, 2002 at 01:11:28PM -0400
References: <20020418142700.A680@gre.ac.uk> <02041813112800.01959@shadowfax.middleearth>
Message-ID: <20020419104717.B680@gre.ac.uk>

Hi,

On Thu, Apr 18, 2002 at 01:11:28PM -0400, George France wrote:
> Have you considered a quad processor EV68 (EV7 when they become available) 
> Alpha system?  

I ruled this out as being more costly than going down the PC route -
I might be mistaken on that - but the last time I looked (about a yr
ago for another project) it was quite costly. 

I'm not even sure what the alpha future is now - and we have had
good performance from our previous alpha setups - but cost is a real
issue in this particular case and x86 of some variety looked the
cheapest way forward.

Shin


From john.hearns at cern.ch  Fri Apr 19 03:07:15 2002
From: john.hearns at cern.ch (John Hearns)
Date: 19 Apr 2002 12:07:15 +0200
Subject: what architecture was MPI and PVM 1st designed for?
In-Reply-To: <E16xStZ-000Pdh-00@oracle.clara.net>
References: <E16xStZ-000Pdh-00@oracle.clara.net>
Message-ID: <1019210836.14791.8.camel@ues4>

On Tue, 2002-04-16 at 16:34, Jayne Heger wrote:
> 
> Hi,
> 
> Could anyone tell me what computer architecture MPI and PVM were first
> designed for/written on?
> Thanks,
> 

Jayne,
there is a nice discussion on PVM and MPI in chapter 11 of
Dowd and Severance's O'Reilly book on High Performance Computing.

http://www.oreilly.com/catalog/hpc2/

By the way, it was nice to hear that your cluster is flying.
(Hmmm.... I suppose Beowulves howl really).


From Daniel.Kidger at quadrics.com  Fri Apr 19 06:06:00 2002
From: Daniel.Kidger at quadrics.com (Daniel Kidger)
Date: Fri, 19 Apr 2002 14:06:00 +0100
Subject: very high bandwidth, low latency manner?
Message-ID: <010C86D15E4D1247B9A5DD312B7F5AA74D2D51@stegosaurus.bristol.quadrics.com>

Craig Tierney wrote:

>I talked to a guy at SC2002 from Quadrics and he said
>that list pricing on a Quadrics network was about $3500
>per node when you are in the 100s of nodes and up.  
>The price includes the cards, cables, switches,
>etc.  This doesn't include any sort of discount that you
>might get.  Myrinet is about $2000 for an equivalent 
>network at list price.   Dolphin/SCI falls around $2245 list 
>per node (if the system is > 144 nodes and you have to get
>the 3d card).


I guess I should jump in here to give the Quadrics perspective...

I have spent 2 weeks in the USA doing some benchmarking on clusters of
McKinleys under Linux, and I get home to find lots of e-mails talking about
the Quadrics stuff, but none from people either at Quadrics or at a
customer site.

I have only been with Quadrics for six months or so, and (fortunately) it is
not me but the marketing people that decide the pricing scheme.


Quadrics have sold most systems to date as part of Compaq's Alphaserver SC
range. However we also do sell via other vendors, particularly as linux
clusters. Our model though is not to sell direct to end-users but via
systems integrators. 

$3500 per node is maybe about right - pricing would always include all
cards, cables, switches and software.
The cost is admittedly high; after all, as well as having the fastest
line-speed, the Quadrics interconnect sends all data as virtual addresses
(the NIC has its own MMU and TLB). That way any process can read and write
the memory of any other node without any CPU overhead.
The cost also tries to cover the high R+D; with volume sales the price may
end up significantly lower.


Any discussion on costing would be better taken up with our marketing types
- but I am happy to share my knowledge on performance issues. :-)


Yours,
Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger at quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
----------------------- www.quadrics.com --------------------


From David_Walters at sra.com  Fri Apr 19 07:48:16 2002
From: David_Walters at sra.com (Walters, David)
Date: Fri, 19 Apr 2002 10:48:16 -0400
Subject: very high bandwidth, low latency manner?
Message-ID: <F9883C7AB1B4D311B6D000508B6F7D500561265D@flex2.sra.com>

>I talked to a guy at SC2002 from Quadrics and he said
>that list pricing on a Quadrics network was about $3500
>per node when you are in the 100s of nodes and up.  

To heck with the Quadrics, I want a ride on that Time Machine!!  SC2002 is
(was?) in November 2002, IIRC...  Man, what I could accomplish by seeing
SC2002 presentations 6 months in advance...  <grin>

Dave


From france at handhelds.org  Fri Apr 19 05:10:03 2002
From: france at handhelds.org (George France)
Date: Fri, 19 Apr 2002 08:10:03 -0400
Subject: Replacing Quad proc SMP multi node DEC Alpha Cluster with Linux Dual P4 cluster?
In-Reply-To: <20020419104717.B680@gre.ac.uk>
References: <20020418142700.A680@gre.ac.uk> <02041813112800.01959@shadowfax.middleearth> <20020419104717.B680@gre.ac.uk>
Message-ID: <02041908100300.05034@shadowfax.middleearth>

Hello Shin,

On Friday 19 April 2002 05:47, shin at guss.org.uk wrote:
> Hi,
>
> On Thu, Apr 18, 2002 at 01:11:28PM -0400, George France wrote:
> > Have you considered a quad processor EV68 (EV7 when they become
> > available) Alpha system?
>
> I ruled this out as being more costly than going down the PC route -
> I might be mistaken on that - but the last time I looked (about a yr
> ago for another project) it was quite costly.

If you need a 64 bit architecture, then I prefer the Alpha Architecture to 
other 64 bit systems.  The EV67, EV68 and EV7 systems' price vs performance 
appears reasonable to me, assuming that you really need a 64 bit 
Architecture.  If you do not need a 64 bit system aimed at High Performance 
Technical Computing, then a 32 bit system will probably provide a less 
expensive solution.  

>
> I'm not even sure what the alpha future is now - and we have had
> good performance from our previous alpha setups - but cost is a real
> issue in this particular case and x86 of some variety looked the
> cheapest way forward.

The Alpha EV7 CPU will probably be the last Alpha chip produced as we know
it.   I believe the EV7 systems should be out late summer or in the fall.  I
suspect that these systems will be available until at least 2005.

When the Hammer chips / systems or the next release of Itanium come out,
we will have to wait and see how they compare to Alpha.

Best Regards,


--George


From Todd_Henderson at raytheon.com  Fri Apr 19 12:20:51 2002
From: Todd_Henderson at raytheon.com (Todd Henderson)
Date: Fri, 19 Apr 2002 14:20:51 -0500
Subject: 64 bit Intels?
References: <Pine.LNX.4.10.10201161210170.856-100000@vaio.greennet>
Message-ID: <3CC06E13.E1D4FBFA@raytheon.com>

We are currently specing out a new cluster and since we have a corp. agreement with one of the big pc vendors, I
thought I'd contact them.  They claim that the P4 Itanium is 64 bit, only in the 733 and 800 mhz speeds.  Is
Linux on Intel 64 bit for these processors?

Thanks,
Todd


From rfoster at lnxi.com  Sat Apr 20 07:22:15 2002
From: rfoster at lnxi.com (William Harman)
Date: Sat, 20 Apr 2002 14:22:15 +0000
Subject: HIGH MEM suport for up to 64GB
Message-ID: <20020420.Oah.69718200@bart>

Sounds right.  __PAGE_OFFSET is set to 0xC0000000 by default, so
you get 3GB of address space for user processes.  And about 2GB
max for the data segment.

You can change the kernel to give you more than 3GB of address
space and greater than 2GB of data segment for heap allocations.
It's three lines to change: two lines in header files and one
line in one of the assembly init files.


Leandro Tavares Carneiro (leandro at ep.petrobras.com.br) wrote*:
>
>Hi everyone,
>
>	I am writing to ask you all if anyone has tested or used a machine
>with more than 4GB of RAM or paging in virtual memory on intel machines.
>	We have a linux beowulf cluster and one of our developers has asked
>us how much memory a process can allocate to use. In the tests we
>have made, we cannot allocate much more than 3GB, using a dual PIII
>with 1GB of ram and 12GB of swap area for testing.
>	We can use 2 processes allocating more or less 3GB, but one process alone
>cannot pass this test.
>	We are using redhat linux 7.2, kernel 2.4.9-21, recompiled with High
>Mem support.
>	I have run the same test application on an Itanium machine, with 1GB
>of ram and 16GB of swap area, and it passed. The application can
>allocate more than 5GB of memory, using swap. On this machine, we are
>using turbolinux 7, with kernel version 2.4.4-010508-18smp.
>
>Thanks in advance for the help,
>
>Best regards,
>

From bari at onelabs.com  Sun Apr 21 19:24:40 2002
From: bari at onelabs.com (Bari Ari)
Date: Sun, 21 Apr 2002 21:24:40 -0500
Subject: 64 bit Intels?
References: <Pine.LNX.4.10.10201161210170.856-100000@vaio.greennet> <3CC06E13.E1D4FBFA@raytheon.com>
Message-ID: <3CC37468.9090902@onelabs.com>

P4 and Itanium are two different Intel processors. Itanium is 64 bit and 
is currently available in 733 and 800 MHz speed grades. P4 is only 32 bit.

http://developer.intel.com/design/itanium/downloads/249634.htm
vs
http://developer.intel.com/design/Pentium4/datashts/

For info on Linux on IA-64, see:

http://www.linuxia64.org/

Bari

Todd Henderson wrote:

>We are currently specing out a new cluster and since we have a corp. agreement with one of the big pc vendors, I
>thought I'd contact them.  They claim that the P4 Itanium is 64 bit, only in the 733 and 800 mhz speeds.  Is
>Linux on Intel 64 bit for these processors?
>
>Thanks,
>Todd
>


From troy at osc.edu  Mon Apr 22 06:32:41 2002
From: troy at osc.edu (Troy Baer)
Date: Mon, 22 Apr 2002 09:32:41 -0400
Subject: 64 bit Intels?
In-Reply-To: <3CC06E13.E1D4FBFA@raytheon.com>
Message-ID: <Pine.SGI.4.21.0204220929270.20036-100000@paladin.osc.edu>

On Fri, 19 Apr 2002, Todd Henderson wrote:
> We are currently specing out a new cluster and since we have a corp. agreement with one of the big pc vendors, I
> thought I'd contact them.  They claim that the P4 Itanium is 64 bit, only in the 733 and 800 mhz speeds.  Is
> Linux on Intel 64 bit for these processors?

The Itanium isn't a P4 derivative; it's a totally different architecture
that is in fact 64-bit.  Linux on them is 64-bit clean AFAIK.

	--Troy
-- 
Troy Baer                       email:  troy at osc.edu
Science & Technology Support    phone:  614-292-9701
Ohio Supercomputer Center       web:  http://oscinfo.osc.edu


From joe.griffin at mscsoftware.com  Mon Apr 22 06:52:02 2002
From: joe.griffin at mscsoftware.com (Joe Griffin)
Date: Mon, 22 Apr 2002 06:52:02 -0700
Subject: 64 bit Intels?
References: <Pine.SGI.4.21.0204220929270.20036-100000@paladin.osc.edu>
Message-ID: <3CC41582.7060706@mscsoftware.com>

The Itanium is NOT 64 bit like a CRAY is 64 bit.
It is LP64 (longs and pointers are 64 bits).

In FORTRAN: INTEGERs and REALs are still 32 bits.

In C, ints are still 32 bits.

You are allowed larger addressing because
longs and pointers are 64 bits.

You may get 64 bit numerical accuracy
in FORTRAN by use of "DOUBLE PRECISION" but
this capability is the same as on IA32 systems.
The item you gain is that you have a
bigger address space.
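
A quick way to see which of these sizes a given machine uses is to ask
the C compiler's view of them. The sketch below does this through
Python's ctypes module; the data_model helper and its labels are
illustrative names, not part of any library, and the output depends on
the platform it is run on.

```python
import ctypes

# Sizes in bytes of the basic C types, as seen by the C compiler that
# built this Python interpreter.
SIZES = {
    "int": ctypes.sizeof(ctypes.c_int),
    "long": ctypes.sizeof(ctypes.c_long),
    "pointer": ctypes.sizeof(ctypes.c_void_p),
    "float": ctypes.sizeof(ctypes.c_float),
    "double": ctypes.sizeof(ctypes.c_double),
}

def data_model(sizes):
    """Name the data model implied by the int/long/pointer sizes."""
    key = (sizes["int"], sizes["long"], sizes["pointer"])
    return {
        (4, 4, 4): "ILP32",  # typical 32-bit systems such as IA32
        (4, 8, 8): "LP64",   # 64-bit Unix-like systems
        (4, 4, 8): "LLP64",  # 64-bit Windows
    }.get(key, "other")

for name, size in SIZES.items():
    print(f"{name:8s} {size * 8} bits")
print("data model:", data_model(SIZES))
```

On an LP64 system this should report 32-bit ints and floats, and 64-bit
longs, pointers and doubles; on IA32 only the long and pointer sizes
would differ.
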

Regards,
Joe


Troy Baer wrote:
> On Fri, 19 Apr 2002, Todd Henderson wrote:
> 
>>We are currently specing out a new cluster and since we have a corp. agreement with one of the big pc vendors, I
>>thought I'd contact them.  They claim that the P4 Itanium is 64 bit, only in the 733 and 800 mhz speeds.  Is
>>Linux on Intel 64 bit for these processors?
> 
> 
> The Itanium isn't a P4 derivative; it's a totally different architecture
> that is in fact 64-bit.  Linux on them is 64-bit clean AFAIK.
> 
> 	--Troy


From manel at labtie.mmt.upc.es  Mon Apr 22 07:37:29 2002
From: manel at labtie.mmt.upc.es (Manel Soria)
Date: Mon, 22 Apr 2002 16:37:29 +0200
Subject: Maximum room temperature
References: <200204201603.g3KG3iF08234@blueraja.scyld.com>
Message-ID: <3CC42029.5B484258@labtie.mmt.upc.es>

I'm wondering what is the maximum reasonable ambient
temperature to have in a cluster room. In our room
with 72 nodes we have about 29-30 oC (84-86 oF).
Is this too high ? Can this be the cause of hardware
failures ?

Thanks.

--
===============================================
Dr. Manel Soria
ETSEIT - Centre Tecnologic de Transferencia de Calor
C/ Colom 11  08222 Terrassa (Barcelona) SPAIN
Tf:  +34 93 739 8287 ; Fax: +34 93 739 8101
E-Mail: manel at labtie.mmt.upc.es



From raysonlogin at yahoo.com  Mon Apr 22 09:44:01 2002
From: raysonlogin at yahoo.com (Rayson Ho)
Date: Mon, 22 Apr 2002 09:44:01 -0700 (PDT)
Subject: 64 bit Intels?
In-Reply-To: <3CC41582.7060706@mscsoftware.com>
Message-ID: <20020422164401.5570.qmail@web11407.mail.yahoo.com>

--- Joe Griffin <joe.griffin at mscsoftware.com> wrote:
> The Itanium is NOT 64 bit like a CRAY is 64 bit.
> It is an LP64 (longs and pointers).
> 
> In FORTRAN: INTEGERs and REALs are still 32 bits.
> 
> In C, int are still 32 bits.

Those are software issues; you can always define ints as 64-bit on IA64
if you know where to hack the gcc source.

Rayson

 
> You are allowed larger addressing because
> longs and pointers are 64 bits.
> 
> You may get 64 bit numerical accuracy
> in FORTRAN by use of "DOUBLE PRECISION" but
> this capability is the same as on IA32 systems.
> The item you gain is that you have a
> bigger address space.
> 
> Regards,
> Joe
> 
> 
> Troy Baer wrote:
> > On Fri, 19 Apr 2002, Todd Henderson wrote:
> > 
> >>We are currently specing out a new cluster and since we have a corp.
> >>agreement with one of the big pc vendors, I thought I'd contact them.
> >>They claim that the P4 Itanium is 64 bit, only in the 733 and 800 mhz
> >>speeds.  Is Linux on Intel 64 bit for these processors?
> > 
> > 
> > The Itanium isn't a P4 derivative; it's a totally different architecture
> > that is in fact 64-bit.  Linux on them is 64-bit clean AFAIK.
> > 
> > 	--Troy
> 
> 
> 


From aby_sinha at yahoo.com  Sun Apr 21 18:08:10 2002
From: aby_sinha at yahoo.com (Abhishek sinha)
Date: Sun, 21 Apr 2002 18:08:10 -0700
Subject: HIGH MEM suport for up to 64GB
References: <20020420.Oah.69718200@bart>
Message-ID: <3CC3627A.1010301@yahoo.com>

hi all

This is what I did to allocate more than 3 GB out of 4 GB of RAM to
userspace.
In the file /usr/src/linux2.4.*/include/asm-i386/page_offset.h,

under the line #ifdef CONFIG_1GB,
changed #define PAGE_OFFSET_RAW 0xC0000000 to PAGE_OFFSET_RAW 0xE0000000,

then in processor.h changed #define TASK_UNMAPPED_BASE (TASK_SIZE/3) to
(TASK_SIZE/16),

and then recompiled the kernel with high mem support.
this was on 2.4.7-10 , in 2.4.18 there are minor changes
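
To see what these changes buy, here is a rough sketch of the address
space arithmetic on 32-bit x86. It uses the 32-bit offsets 0xC0000000
and 0xE0000000, and assumes that TASK_SIZE equals PAGE_OFFSET and that
TASK_UNMAPPED_BASE is TASK_SIZE divided by the constant in processor.h;
the split helper is an illustrative name only.

```python
GIB = 1 << 30
ADDRESS_SPACE = 1 << 32  # 32-bit virtual address space

def split(page_offset, unmapped_divisor):
    """User/kernel split implied by PAGE_OFFSET on 32-bit x86.

    Assumes TASK_SIZE == PAGE_OFFSET and
    TASK_UNMAPPED_BASE == TASK_SIZE / unmapped_divisor.
    """
    task_size = page_offset
    return {
        "user_gib": task_size / GIB,
        "kernel_gib": (ADDRESS_SPACE - page_offset) / GIB,
        "unmapped_base_gib": task_size / unmapped_divisor / GIB,
    }

default = split(0xC0000000, 3)
modified = split(0xE0000000, 16)

for label, s in (("default", default), ("modified", modified)):
    print(f"{label:8s} user {s['user_gib']:.2f} GiB, "
          f"kernel {s['kernel_gib']:.2f} GiB, "
          f"mmap area starts at {s['unmapped_base_gib']:.3f} GiB")
```

Under these assumptions the default leaves 3 GiB for a process and 1 GiB
for the kernel, and the modified values leave 3.5 GiB and 0.5 GiB.
Lowering TASK_UNMAPPED_BASE moves the start of the mmap area down, which
leaves more contiguous room above it for large mappings but less room
below it for the brk heap.
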

I am not a big kernel hacker and am still learning, so please use it with
backups.
Comments invited.

abhishek


William Harman wrote:

>Sounds right.  __PAGE_OFFSET is set to 0xC0000000 by default, so
>you get 3GB of address space for user processes.  And about 2GB
>max for the data segment.
>
>You can change the kernel to give you more than 3GB of address
>space and greater than 2GB of data segment for heap allocations.
>It's three lines to change. Two lines in header files and one
>line in the one of the assembly init files.
>
>
>
>Leandro Tavares Carneiro (leandro at ep.petrobras.com.br) wrote*:
>
>>Hi everyone,
>>
>>	I am writing to ask you all if anyone has tested or used a machine
>>with more than 4GB of RAM or paging in virtual memory on intel machines.
>>	We have a linux beowulf cluster and one of our developers has asked
>>us how much memory a process can allocate to use. In the tests we
>>have made, we cannot allocate much more than 3GB, using a dual PIII
>>with 1GB of ram and 12GB of swap area for testing.
>>	We can use 2 processes allocating more or less 3GB, but one process alone
>>cannot pass this test.
>>	We are using redhat linux 7.2, kernel 2.4.9-21, recompiled with High
>>Mem support.
>>	I have run the same test application on an Itanium machine, with 1GB
>>of ram and 16GB of swap area, and it passed. The application can
>>allocate more than 5GB of memory, using swap. On this machine, we are
>>using turbolinux 7, with kernel version 2.4.4-010508-18smp.
>>
>>Thanks in advance for the help,
>>
>>Best regards,
>>


From john at computation.com  Mon Apr 22 10:12:48 2002
From: john at computation.com (John Nelson)
Date: Mon, 22 Apr 2002 13:12:48 -0400 (EDT)
Subject: 64 bit Intels?
In-Reply-To: <3CC41582.7060706@mscsoftware.com>
Message-ID: <Pine.LNX.4.44.0204221300200.20241-100000@myrna.lexonia.net>

Has the number of bits per machine instruction also increased to 64 bits?
This would imply that all of your compiled executables have now doubled in
size (although I don't know why you would need 2**32 additional
instructions).  Are all pointers consistently using 64 bits?  If so, there
will be a proportional growth in the size of your executable.

The larger architecture also impacts your data formats. If your data sets
are in binary format, and depending on the language you are using, there
may be incompatibilities as well as new demands on storage.
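
As a small illustration of the data format point, the sketch below uses
Python's struct module to contrast a native C long, whose size depends
on the machine, with a record written using explicitly sized fields. The
record layout (a 32-bit index, a 64-bit offset and a double) is a
made-up example, not a format from any particular code.

```python
import struct

# Native mode ("@") uses the C compiler's own sizes and alignment, so a
# C long is 4 bytes on a 32-bit build and 8 bytes on an LP64 build.
native_long = struct.calcsize("@l")

# Standard mode ("<" = little-endian) fixes the sizes regardless of the
# machine: i = 4 bytes, q = 8 bytes, d = 8 bytes, with no padding.
RECORD = struct.Struct("<iqd")

def pack_record(index, offset, value):
    return RECORD.pack(index, offset, value)

def unpack_record(blob):
    return RECORD.unpack(blob)

blob = pack_record(7, 5 * 2**30, 0.5)
print("native long:", native_long, "bytes")
print("portable record:", RECORD.size, "bytes")
print(unpack_record(blob))
```

A file written with explicit sizes like this reads back the same way on
32-bit and 64-bit machines; a file written by dumping native longs or
pointers does not.
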

Stating the obvious I guess, but there are considerations when going to
larger architectures.

-- John


On Mon, 22 Apr 2002, Joe Griffin wrote:

> The Itanium is NOT 64 bit like a CRAY is 64 bit.
> It is an LP64 (longs and pointers).
> 
> In FORTRAN: INTEGERs and REALs are still 32 bits.
> 
> In C, int are still 32 bits.
> 
> You are allowed larger addressing because
> longs and pointers are 64 bits.

-- 
_____________________________________________________

John T. Nelson
President               |    Computation.com Inc
mail:                   |    john at computation.com
company:                |    http://www.computation.com/
journal of computation: |    http://www.computation.org/
_____________________________________________________
"Providing quality IT consulting services since 1992"


From rgb at phy.duke.edu  Mon Apr 22 11:58:44 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Mon, 22 Apr 2002 14:58:44 -0400 (EDT)
Subject: Maximum room temperature
In-Reply-To: <3CC42029.5B484258@labtie.mmt.upc.es>
Message-ID: <Pine.LNX.4.44.0204221429290.30950-100000@lucifer.rgb.private.net>

On Mon, 22 Apr 2002, Manel Soria wrote:

> I'm wondering what is the maximum reasonable ambient
> temperature to have in a cluster room. In our room
> with 72 nodes we have about 29-30 oC (84-86 oF).
> Is this too high ? Can this be the cause of hardware
> failures ?

Yes, it can.  This is pretty high for a server room.

The best way to think of temperature and heat disposal in a cluster is
to think in layers.  Heat generally flows from hot to cold, at a rate
proportional to the difference in temperature in degrees Kelvin.  More
specifically, the rate of flow is influenced by things like
conductivities, convective flow, and radiative trapping.  

The CPU core generates heat at some roughly constant rate under load.
Current/modern CPUs "can" operate at very high temperatures, on the order
of 100C, although they will almost certainly operate more reliably and
longer at considerably cooler core temperatures.

This heat generally flows from the CPU into the attached heat sink/fan
at a rate determined by the temperature DIFFERENCE between the heatsink
and the CPU.  If the conductivity of the heatsink is high, and the
conductivity of the interface is also high, a small temperature
difference will cause a lot of heat to flow from the hotter to the
cooler.  The CPU is thus cooled until it isn't too much warmer than the
operating temperature of the heatsink.

The heatsink then has to be cooled so that IT is cooler than the desired
operating temperature of the CPU.  The hotter it is, the faster it loses
heat to the ambient air.  The cooler the ambient air, the faster it
loses heat.  Here things get a bit arcane.  Air is not all that great a
conductor of heat.  It does have some heat capacity and will warm up
when in contact with a warmer surface.  Heat sinks therefore generally
have lots of surface area and fans in the case and heatsink itself move
(hopefully cooler) air rapidly across this surface.  All things being
equal, though, when the CPU produces heat at a constant rate the
heatsink/fan/air arrangement can remove heat at that rate only when the
air and the heatsink have a given, approximately constant, temperature
difference.

This warmed air then has to be removed from the case and replaced with
cooler ambient air from the server room, and the warmed air eventually
has to be circulated over actively cooled (refrigerated) coils to remove
the heat from the room altogether and eventually dump it, plus all the
energy required to do the cooling, into the outside air.

The cooler the room air, the cooler all the components inside your
system, especially the CPU.  Cooling down the room air temperature 10C
should reduce the operating temperature of your CPU by very close to
10C.
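
The layered picture above can be put as a minimal sketch: at a fixed CPU
power, each layer needs a roughly fixed temperature difference to carry
the heat, so the CPU sits at the room temperature plus the sum of those
differences.  The three rise values below are hypothetical numbers chosen
only for illustration, not measurements from any real system.

```python
def cpu_temperature_c(ambient_c, layer_rises_c):
    """Steady-state CPU temperature: ambient plus the rise across each layer."""
    return ambient_c + sum(layer_rises_c)


# Hypothetical rises (in C) at a fixed CPU power:
# room air -> case air, case air -> heatsink, heatsink -> CPU core.
LAYER_RISES_C = [10.0, 25.0, 5.0]

warm_room = cpu_temperature_c(30.0, LAYER_RISES_C)
cool_room = cpu_temperature_c(20.0, LAYER_RISES_C)

# The per-layer rises do not change, so the CPU drops by the same 10C as the room.
print(warm_room, cool_room, warm_room - cool_room)  # 70.0 60.0 10.0
```

In practice the rises are only approximately constant (fan speed and air
density change a little with temperature), which is why the drop is "very
close to" rather than exactly 10C.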

Most systems are probably engineered with the assumption that they will
operate in air in the 68-75F temperature range (20-24C), and can
probably tolerate ambient air up to 80F or 27C without much risk.  If
the ambient temperatures get much higher than this, though, your risk of
catastrophic heat-induced failure starts creeping up.  At around
100F/38C it becomes very high indeed -- close to "certain" if you try
operating a system 24 hours under a high load at or above this ambient
air temperature.  If a system is ever operated for an extended period
over 30C (upper 80s F and above) it may not fail, but even if you cool it
back down you may have marginally damaged components that will fail later.

An additional risk for even fairly short periods of high temperature
operation is that hard disks are made of metal that expands when heated.
If a disk expands too much, the write head can actually become
misaligned with the tracks and your disk can be instantly and
irrecoverably trashed.  This can also happen if the disk is COOLED too
much -- it is a bad idea to crank up a laptop after it has sat all night
in a sub-zero car without letting it come to a "normal" operating
temperature first...

If I were you I'd engineer enough cooling to drop the ambient air in
your cluster space by at least 5C, if not 10C, and make sure that there
is enough air circulation and mixing that no systems are in local "hot
spots" (where air exhausted from one system is sucked into another
system, for example). A really happy server room is one you need to wear
a jacket or sweater in to be comfortable, not one that makes you want to
take clothes off...;-)

   rgb

> 
> Thanks.
> 
> --
> ===============================================
> Dr. Manel Soria
> ETSEIT - Centre Tecnologic de Transferencia de Calor
> C/ Colom 11  08222 Terrassa (Barcelona) SPAIN
> Tf:  +34 93 739 8287 ; Fax: +34 93 739 8101
> E-Mail: manel at labtie.mmt.upc.es
> 
> 
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From alvin at Maggie.Linux-Consulting.com  Mon Apr 22 13:33:54 2002
From: alvin at Maggie.Linux-Consulting.com (alvin at Maggie.Linux-Consulting.com)
Date: Mon, 22 Apr 2002 13:33:54 -0700 (PDT)
Subject: Maximum room temperature
In-Reply-To: <3CC42029.5B484258@labtie.mmt.upc.es>
Message-ID: <Pine.LNX.3.96.1020422132638.26297C-100000@Maggie.Linux-Consulting.com>

hi ya manel

if your systems have "health monitoring"...
check your bios to see what it thinks is your
current system temp and cpu temp...


your cpu reliability/performance goes down
by 1/2  for every 10 degree C rise
	- ie  if it would have lasted 5 more years...
	( starting temp is the "normal/nominal temp" at which
	( intel or amd provide a cpu warranty for 5 years
	- if temp went up another 10deg C... its now 2.5 yrs
	- if temp went up 20 degree.... its now 1.25yrs...
	- or some silly guidelines like that..
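
The halving rule of thumb above can be written as a small sketch.  It only
reproduces the arithmetic in the list (5 years, then 2.5, then 1.25); the
function name and the default 10 degree step are illustrative choices, and
the rule itself is a rough guideline, not a vendor specification.

```python
def expected_lifetime_years(nominal_years, degrees_above_nominal_c,
                            halving_step_c=10.0):
    """Rule of thumb: lifetime halves for every `halving_step_c` above nominal."""
    return nominal_years * 0.5 ** (degrees_above_nominal_c / halving_step_c)


for rise_c in (0, 10, 20):
    print(rise_c, expected_lifetime_years(5.0, rise_c))
# 0 5.0
# 10 2.5
# 20 1.25
```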

to test if the ambient temperature is too high...
	- add a regular fan and blow air on it...

	- if the cpu temp drops significantly...
	then its too hot in the room

max cpu temp...
	http://users.erols.com/chare/elec.htm
	http://www.heatsink-guide.com/maxtemp.htm

-- add lm_sensors to your "distro" to read the cpu temp
   and if it gets too high... shutdown the server 
   or at least dont do heavy computations on it
	- add more fans ... and better air flow...

c ya
alvin
http://www.Linux-1U.net/CPU .. more specs ...


On Mon, 22 Apr 2002, Manel Soria wrote:

> I'm wondering what is the maximum reasonable ambient
> temperature to have in a cluster room. In our room
> with 72 nodes we have about 29-30 oC (84-86 oF).
> Is this too high ? Can this be the cause of hardware
> failures ?
> 


From richard_fryer at charter.net  Mon Apr 22 10:01:08 2002
From: richard_fryer at charter.net (Richard Fryer)
Date: Mon, 22 Apr 2002 10:01:08 -0700
Subject: Kidger's comments on Quadric's design and performance
Message-ID: <003601c1ea1f$4c3a75f0$6601a8c0@charterpipeline.com>

On Fri, 19 Apr 2002 14:06:00 +0100
Daniel Kidger <Daniel.Kidger at quadrics.com> wrote:

> after all as well as having the fastest line-speed, the Quadrics
> interconnect sends all data as virtual addresses (the NIC has its
> own MMU and TLB). That way any process can read and write
> the memory of any other node without any CPU overhead.

I appreciate getting a bit of technical detail on Quadrics interfaces.  Is
there a web location that might provide more information - comparative
benchmarks or protocol information or ???

This message also reminded me to ask if a long-held opinion is valid - and
that opinion is "that a cache coherent interconnect would offer performance
enhancement when applications are at the 'more tightly coupled' end of the
spectrum."  I know that present PCI based interfaces can't do that without
invoking software overhead and latencies.  Anyone have data - or an argument
for invalidating this opinion?

I did recently read that the AMD 'HyperTransport' interfaces ARE capable of
cache coherent transactions.  This would appear to allow protocols (such as
SCI) that support cache coherence to operate in that mode.  But I wonder if
it matters to the MPI world.  Seems to me that it would be a factor in
improving scalability (provided that other interconnect issues such as
bandwidth bottlenecks don't prevent it).  My recollection is that the SCI
simulations I saw required very little added traffic to maintain coherency.

Also a brief note about the Dolphin product line, since the issue of link
saturation has come up:  - they DO also sell switches - or at least offer
them.  And if you check the SCI specification, you'll see that there are
some elaborate discussions of fabric architectures that the protocol
supports and switches enable.  What I DO NOT know is if the SCALI software
supports switch-based operation, and also don't know what the impact is on
the system cost per node.  My 'inexperienced' assessment of the appeal in
the Dolphin family is that you can start without the switch and later add it
if the performance benefit warrants.  That's what I'd say if I were selling
them anyway - and didn't know otherwise.  :-)

Richard Fryer
rfryer at beomax.com


From hahn at physics.mcmaster.ca  Mon Apr 22 14:37:31 2002
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Mon, 22 Apr 2002 17:37:31 -0400 (EDT)
Subject: 64 bit Intels?
In-Reply-To: <Pine.LNX.4.44.0204221300200.20241-100000@myrna.lexonia.net>
Message-ID: <Pine.LNX.4.33.0204221717110.1258-100000@coffee.psychology.mcmaster.ca>

>> Have the number of bits per machine instruction also increased to 64 bits?  

not exactly.  ia64 has "bundles" as its atomic instruction-stream format;
a 128b bundle contains three 41b instruction fields as well as a template
field. the legal combinations of instruction fields are fairly constrained,
which means that the compiler is sometimes (often?) forced to put nops
into bundles.
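
a small sketch of the bundle arithmetic: three 41b fields leave
128 - 3*41 = 5 bits for the template.  the field order used here (template
in the low bits, then the three slots) is an assumption for illustration,
and the template and slot values in the usage lines are arbitrary numbers,
not real ia64 encodings.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3
BUNDLE_BITS = TEMPLATE_BITS + SLOTS_PER_BUNDLE * SLOT_BITS  # 5 + 123 = 128

TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Pack a template and three instruction fields into one 128-bit integer."""
    bundle = template & TEMPLATE_MASK
    for i, slot in enumerate(slots):
        bundle |= (slot & SLOT_MASK) << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit integer back into (template, (slot0, slot1, slot2))."""
    template = bundle & TEMPLATE_MASK
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
        for i in range(SLOTS_PER_BUNDLE)
    )
    return template, slots


# Arbitrary illustrative values, not real encodings.
example = pack_bundle(0x10, (1, 2, 3))
print(BUNDLE_BITS, unpack_bundle(example))
```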

> instructions).  Are all pointers consistently using 64 bits?  If so, there
> will be a proportional growth in the size of your executable.

how often are pointers encoded in your executables?  not often, I think.

> The larger architecture also impacts your data formats. If your data sets
> are in binary format, and depending on the language you are using, there
> may be incompatibilities as well as new demands on storage.

it's easy to say that ia64 is/was a pretty crazy thing to do,
but Intel isn't quite *that* far gone that they'd define wholly
new data formats.  modulo the usual endian considerations,
they're using familiar 2's complement integers and IEEE FP.

for PR-level slides:
http://developer.intel.com/design/itanium/idfisa/index.htm

for programmer-level intro:
http://developer.intel.com/design/itanium/downloads/24531703s.htm


From hahn at physics.mcmaster.ca  Mon Apr 22 14:51:49 2002
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Mon, 22 Apr 2002 17:51:49 -0400 (EDT)
Subject: Maximum room temperature
In-Reply-To: <Pine.LNX.3.96.1020422132638.26297C-100000@Maggie.Linux-Consulting.com>
Message-ID: <Pine.LNX.4.33.0204221739080.1258-100000@coffee.psychology.mcmaster.ca>

> 	- or some silly guidelines like that..

uh, yeah, that's the word that came to my mind too.

I get quite upset when my 33KW machineroom is over 23C or so.
when it is at 20C, various bits of hardware report that they're
at 30-35C inside their case.  if you assume a fairly safe
.5 C/W thermal resistance for the heatsink/fan combination,
that means a CPU spec'ed for 100C, sitting in 30C case air, can
technically dissipate 140W.
this is why dual-athlon machines (two 60+W CPUs) are a bit 
tricky to cool, especially since, although current CPUs are spec'ed 
at 90-100C, you REALLY DO NOT WANT TO RUN THEM THAT HOT.  I consider 50C
a fairly hot CPU.
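
the 140W figure follows from treating the heatsink/fan as a single thermal
resistance.  a minimal sketch of that budget, using only the numbers in
this message (.5 C/W, 30C case air, a 100C limit); the function names are
made up for illustration.

```python
def max_power_w(max_cpu_c, case_air_c, resistance_c_per_w):
    """Largest power the heatsink can carry before the CPU hits its limit."""
    return (max_cpu_c - case_air_c) / resistance_c_per_w


def cpu_temp_c(case_air_c, power_w, resistance_c_per_w):
    """Steady-state CPU temperature for a given power and case air temperature."""
    return case_air_c + power_w * resistance_c_per_w


print(max_power_w(100.0, 30.0, 0.5))  # 140.0
print(cpu_temp_c(35.0, 60.0, 0.5))    # 65.0
```

the second line shows why a 60W part is uncomfortable: in 35C case air it
already sits at 65C, well past the 50C called "fairly hot" above.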


From timm at fnal.gov  Mon Apr 22 14:58:21 2002
From: timm at fnal.gov (Steven Timm)
Date: Mon, 22 Apr 2002 16:58:21 -0500 (CDT)
Subject: Maximum room temperature
In-Reply-To: <Pine.LNX.3.96.1020422132638.26297C-100000@Maggie.Linux-Consulting.com>
Message-ID: <Pine.LNX.4.31.0204221656540.20885-100000@snowball.fnal.gov>

On Mon, 22 Apr 2002 alvin at Maggie.Linux-Consulting.com wrote:

>
> hi ya manel
>
> if your systems have "health monitoring"...
> check your bios to see what it thinks is your
> current system temp and cpu temp...
>

Am I missing something here?  The BIOS sensors can only
tell you what the temperature is when the machine is
effectively idle.  A good CPU load can be good for an
increase of 7-10 degrees C.

>
> your cpu reliability/performance goes down
> by 1/2  for every 10 degree C rise
> 	- ie  if it would have lasted 5 more years...
> 	( starting temp is the "normal/nominal temp" at which
> 	( intel or amd provide a cpu warranty for 5 years
> 	- if temp went up another 10deg C... its now 2.5 yrs
> 	- if temp went up 20 degree.... its now 1.25yrs...
> 	- or some silly guidelines like that..
>
> to test if the ambient temperature is too high...
> 	- add a regular fan and blow air on it...
>
> 	- if the cpu temp drops significantly...
> 	then its too hot in the room
>
> max cpu temp...
> 	http://users.erols.com/chare/elec.htm
> 	http://www.heatsink-guide.com/maxtemp.htm
>
> -- add lm_sensors to your "distro" to read the cpu temp

Does anyone have success yet with making
lm_sensors work on the Tyan 246x series of dual AMD motherboards?

>    and if it gets too high... shutdown the server
>    or at least dont do heavy computations on it
> 	- add more fans ... and better air flow...
>
> c ya
> alvin
> http://www.Linux-1U.net/CPU .. more specs ...
>
>
> On Mon, 22 Apr 2002, Manel Soria wrote:
>
> > I'm wondering what is the maximum reasonable ambient
> > temperature to have in a cluster room. In our room
> > with 72 nodes we have about 29-30 oC (84-86 oF).
> > Is this too high ? Can this be the cause of hardware
> > failures ?
> >
>


From walke at usna.edu  Mon Apr 22 15:10:21 2002
From: walke at usna.edu (LT V. H. Walke)
Date: 22 Apr 2002 18:10:21 -0400
Subject: Maximum room temperature (Tyan S246x)
In-Reply-To: <Pine.LNX.4.31.0204221656540.20885-100000@snowball.fnal.gov>
References: <Pine.LNX.4.31.0204221656540.20885-100000@snowball.fnal.gov>
Message-ID: <1019513422.21658.38.camel@vhwalke.mathsci.usna.edu>

Temperature monitoring for the dual processor Tyan board is mostly
working in lm_sensors 2.6.3 (March 22, 2002).  See the tickets
referenced on the lm_sensors web page:   http://www.netroedge.com/~lm78/

Unfortunately, you still have to go into bios on every boot to
initialize the monitoring chips, but everything seems to work. 
Fortunately it seems progress is being made - a solution was posted on
April 11th (again, see the web page).

Good luck,
Vann

-- 
----------------------------------------------------------------------
  Vann H. Walke                        Office: Chauvenet 341
  Computer Science Dept.               Ph:  410-293-6811
  572 Holloway Road, Stop 9F           Fax: 410-293-2686
  United States Naval Academy          email: walke at usna.edu
  Annapolis, MD 21402-5002             http://www.cs.usna.edu/~walke
----------------------------------------------------------------------


From xyzzy at speakeasy.org  Mon Apr 22 15:46:52 2002
From: xyzzy at speakeasy.org (Trent Piepho)
Date: Mon, 22 Apr 2002 15:46:52 -0700 (PDT)
Subject: Maximum room temperature (Tyan S246x)
In-Reply-To: <1019513422.21658.38.camel@vhwalke.mathsci.usna.edu>
Message-ID: <Pine.LNX.4.04.10204221536520.15847-100000@xyzzy.dsl.speakeasy.net>

On 22 Apr 2002, LT V. H. Walke wrote:
> Temperature monitoring for the dual processor Tyan board is mostly
> working in lm_sensors 2.6.3 (March 22, 2002).  See the tickets
> referenced on the lm_sensors web page:   http://www.netroedge.com/~lm78/
> 
> Unfortunately, you still have to go into bios on every boot to
> initialize the monitoring chips, but everything seems to work. 
> Fortunately it seems progress is being made - a solution was posted on
> April 11th (again, see the web page).

I've created a patch for the lm_sensors w83781d driver that lets it
properly initialize and detect the chips in the Tyan dual-amd boards.  This
lets me get temperature monitoring without going into the bios first. 
Temperature and fan speed monitoring works, but voltage doesn't work correctly
for some inputs (like +12V) because Tyan used non-standard resistor values and
I don't know what they are.

I'm planning to give it to the lm sensors people soon.  In the past, they've
been very receptive to my fixes for the supermicro 370DE6 and vid settings for
socketA, P4, and P3-S boards.


From moor007 at bellsouth.net  Mon Apr 22 17:01:22 2002
From: moor007 at bellsouth.net (Timothy W. Moore)
Date: Mon, 22 Apr 2002 18:01:22 -0600
Subject: Ethernet Channel Bonding (ECB)
Message-ID: <3CC4A452.9080403@bellsouth.net>

I am new to the Beowulf cluster computing and am awaiting UPS to deliver 
my systems.  I have been researching ECB and it seems to have mixed 
reviews regarding performance enhancement.  Would/Could someone shed 
some light on this topic to the following effect:

[1] Is it truly necessary?

[2] If using RedHat 7.2, should I re-compile the kernel?

Any and all assistance is truly appreciated!

Tim


From siegert at sfu.ca  Mon Apr 22 17:45:41 2002
From: siegert at sfu.ca (Martin Siegert)
Date: Mon, 22 Apr 2002 17:45:41 -0700
Subject: Ethernet Channel Bonding (ECB)
In-Reply-To: <3CC4A452.9080403@bellsouth.net>; from moor007@bellsouth.net on Mon, Apr 22, 2002 at 06:01:22PM -0600
References: <3CC4A452.9080403@bellsouth.net>
Message-ID: <20020422174541.B24777@stikine.ucs.sfu.ca>

On Mon, Apr 22, 2002 at 06:01:22PM -0600, Timothy W. Moore wrote:
> I am new to the Beowulf cluster computing and am awaiting UPS to deliver 
> my systems.  I have been researching ECB and it seems to have mixed 
> reviews regarding performance enhancement.  Would/Could someone shed 
> some light on this topic to the following effect:
> 
> [1] Is it truly necessary?

That depends on the programs you are planning to run on your cluster. With
respect to performance: I am getting 269Mbit/s bandwidth with 3-way
channel bonded fast ethernet (using 3Com NICs). This is almost exactly
three times as much as I get from a single NIC. The latency is not quite
as good as with a single NIC: 55us vs. 43us (all numbers measured with
netpipe). Thus if your program doesn't need the bandwidth or if your
program is extremely sensitive to latencies then you don't need channel
bonding.
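
To make the bandwidth/latency trade-off concrete, here is a rough sketch
using a simple linear cost model (time = latency + size/bandwidth).  The
model is a simplification, and the single-NIC bandwidth is assumed to be
exactly one third of the bonded figure, which the numbers above only state
approximately.  Under those assumptions it estimates the message size above
which the bonded link is faster.

```python
def transfer_time_s(message_bytes, latency_s, bandwidth_bit_s):
    """Linear cost model: fixed latency plus serialization time."""
    return latency_s + 8 * message_bytes / bandwidth_bit_s


def crossover_bytes(low_latency_link, high_bandwidth_link):
    """Message size at which both links take the same time."""
    lat_a, bw_a = low_latency_link
    lat_b, bw_b = high_bandwidth_link
    return (lat_b - lat_a) / (8 / bw_a - 8 / bw_b)


# (latency in seconds, bandwidth in bit/s)
BONDED = (55e-6, 269e6)
SINGLE = (43e-6, 269e6 / 3)  # assumption: exactly one third of the bonded rate

print(round(crossover_bytes(SINGLE, BONDED), 2))  # about 201.75 bytes
for size in (100, 10_000):
    print(size, transfer_time_s(size, *SINGLE), transfer_time_s(size, *BONDED))
```

With these figures the single NIC only wins for messages of roughly 200
bytes or less, so bonding mostly hurts codes dominated by very small
messages.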

> [2] If using RedHat 7.2, should I re-compile the kernel?

Not necessarily. The bonding.o module is part of all RedHat kernels.
However, I strongly recommend upgrading the kernel to 2.4.18 - I had
problems with all earlier 2.4.x versions.

Regards,
Martin

========================================================================
Martin Siegert
Academic Computing Services                        phone: (604) 291-4691
Simon Fraser University                            fax:   (604) 291-4242
Burnaby, British Columbia                          email: siegert at sfu.ca
Canada  V5A 1S6
========================================================================


From ron_chen_123 at yahoo.com  Mon Apr 22 23:58:36 2002
From: ron_chen_123 at yahoo.com (Ron Chen)
Date: Mon, 22 Apr 2002 23:58:36 -0700 (PDT)
Subject: FYI : Out of the box clustering in SuSE 8.0
Message-ID: <20020423065836.27793.qmail@web14706.mail.yahoo.com>

Just read the news today about SGE 5.3 being included
in SuSE 8.0. From now on, building compute farms is
easier, saving the time on downloading and compiling
the batch systems.

You can read the news here at --
http://zdnet.com.com/2110-1104-888742.html

And also, SGE is mainly used in compute farms, but
some beowulfers use batch systems (like SGE, PBS) to
improve the throughput of their clusters.

Thanks,
 -Ron



From eugen at leitl.org  Tue Apr 23 07:29:44 2002
From: eugen at leitl.org (Eugen Leitl)
Date: Tue, 23 Apr 2002 16:29:44 +0200 (CEST)
Subject: Intel releases C++/Fortran suite V 6.0 for Linux
Message-ID: <Pine.LNX.4.33.0204231628220.3620-100000@hydrogen.leitl.org>

Intel announces the release of Version 6 of the Intel(R) C++ and Fortran 
Compilers for Windows and Linux. Take advantage of performance for your 
software. Please visit our Compilers Home Page today. 

http://www.intel.com/software/products/compilers/


From josip at icase.edu  Tue Apr 23 09:33:05 2002
From: josip at icase.edu (Josip Loncaric)
Date: Tue, 23 Apr 2002 12:33:05 -0400
Subject: Maximum room temperature
References: <200204201603.g3KG3iF08234@blueraja.scyld.com> <3CC42029.5B484258@labtie.mmt.upc.es>
Message-ID: <3CC58CC1.41C0EBDC@icase.edu>

Manel Soria wrote:
> 
> I'm wondering what is the maximum reasonable ambient
> temperature to have in a cluster room. In our room
> with 72 nodes we have about 29-30 oC (84-86 oF).
> Is this too high ? Can this be the cause of hardware
> failures ?

Yes it can.  We start to lose hardware (disks, etc.) whenever
temperature climbs to 85 deg. F (30 deg. C).  Our computer room AC is
set to maintain about 70 deg. F (21 deg. C), and we turn on spare AC
units if this reaches 75 deg. F (about 24 deg. C).  By 80 deg. F (27
deg. C), we start shutting down machines.

BTW, hardware temperature monitoring measures temperatures inside the
boxes, which are higher.  CPU temperatures vary a lot and can easily
reach 55 deg. C when loaded; motherboard temperatures are more stable
(typically about 29-30 deg. C).  We also wrote some periodic scripts
which can e-mail root or even trigger automatic cluster shutdown when
the average motherboard temperatures exceed reasonable limits (e.g.
35-40 deg. C).  Unfortunately, dual CPU machines do not poweroff (Red
Hat's Linux kernel 2.4.9-31smp considers "poweroff" unsafe on SMP
machines) but at least they produce less heat when halted.
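
A minimal sketch of the kind of decision such a periodic script might
make.  This is not the actual script described above: the function name is
made up, and treating 35 deg. C as the "e-mail root" level and 40 deg. C
as the "shut down" level is one assumed reading of the 35-40 deg. C range.
How the temperatures are collected (e.g. from lm_sensors) is left out.

```python
def action_for(motherboard_temps_c, warn_c=35.0, shutdown_c=40.0):
    """Pick an action from the average motherboard temperature across nodes."""
    if not motherboard_temps_c:
        raise ValueError("no temperature readings")
    average_c = sum(motherboard_temps_c) / len(motherboard_temps_c)
    if average_c >= shutdown_c:
        return "shutdown"      # trigger automatic cluster shutdown
    if average_c >= warn_c:
        return "email-root"    # warn the administrator
    return "ok"


print(action_for([29.0, 30.0, 31.0]))  # ok
print(action_for([36.0, 37.0, 35.0]))  # email-root
print(action_for([41.0, 40.0, 42.0]))  # shutdown
```

Using the average rather than the maximum follows the text ("average
motherboard temperatures"); a single badly placed node will not trip the
shutdown by itself.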

Sincerely,
Josip


-- 
Dr. Josip Loncaric, Research Fellow               mailto:josip at icase.edu
ICASE, Mail Stop 132C           PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center             mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134


From timm at fnal.gov  Tue Apr 23 10:31:28 2002
From: timm at fnal.gov (Steven Timm)
Date: Tue, 23 Apr 2002 12:31:28 -0500 (CDT)
Subject: Packet Engines "Hamachi" gigabit ethernet card
Message-ID: <Pine.LNX.4.31.0204231230010.22248-100000@snowball.fnal.gov>

Does anyone have a Hamachi gigabit ethernet card that is
working under any kind of a 2.4 kernel at all?  If so, what did
it take to make it work?  With all the various drivers we have
tried including the latest available, we get the error
message "too much work at interrupt" and no traffic through the card.

Thanks

Steve Timm

------------------------------------------------------------------
Steven C. Timm (630) 840-8525  timm at fnal.gov  http://home.fnal.gov/~timm/
Fermilab Computing Division/Operating Systems Support
Scientific Computing Support Group--Computing Farms Operations


From keithu at parl.clemson.edu  Tue Apr 23 13:21:01 2002
From: keithu at parl.clemson.edu (Keith Underwood)
Date: Tue, 23 Apr 2002 16:21:01 -0400 (EDT)
Subject: Packet Engines "Hamachi" gigabit ethernet card
In-Reply-To: <Pine.LNX.4.31.0204231230010.22248-100000@snowball.fnal.gov>
Message-ID: <Pine.LNX.4.44.0204231616430.9978-200000@keithu-pc2.parl.clemson.edu>

It should be working "soon" in newer 2.4 kernels.  There was a bug
introduced by someone doing some "clean-ups" early in 2.4.  I have 
attached a patch against the 2.4.18 version of the driver that should let 
you get traffic through the card.  I don't know if anything else was 
broken along the way or not.

					Keith


---------------------------------------------------------------------------
Keith Underwood                   Parallel Architecture Research Lab (PARL)
keithu at parl.clemson.edu                                  Clemson University

-------------- next part --------------
--- drivers/net/hamachi.c	Mon Feb 25 14:37:59 2002
+++ drivers/net/hamachi.patched.c	Tue Apr  2 15:32:45 2002
@@ -210,8 +210,10 @@
 /* Condensed bus+endian portability operations. */
 #if ADDRLEN == 64
 #define cpu_to_leXX(addr)	cpu_to_le64(addr)
+#define desc_to_virt(addr) bus_to_virt(le64_to_cpu(addr))
 #else 
 #define cpu_to_leXX(addr)	cpu_to_le32(addr)
+#define desc_to_virt(addr) bus_to_virt(le32_to_cpu(addr))
 #endif   
 
 
@@ -1544,7 +1546,8 @@
 			break;
 		pci_dma_sync_single(hmp->pci_dev, desc->addr, hmp->rx_buf_sz, 
 			PCI_DMA_FROMDEVICE);
-		buf_addr = (u8 *)hmp->rx_ring + entry*sizeof(*desc);
+		//buf_addr = (u8 *)hmp->rx_ring + entry*sizeof(*desc);
+		buf_addr = desc_to_virt(desc->addr);
 		frame_status = le32_to_cpu(get_unaligned((s32*)&(buf_addr[data_size - 12])));
 		if (hamachi_debug > 4)
 			printk(KERN_DEBUG "  hamachi_rx() status was %8.8x.\n",

From joachim at lfbs.RWTH-Aachen.DE  Tue Apr 23 01:01:18 2002
From: joachim at lfbs.RWTH-Aachen.DE (Joachim Worringen)
Date: Tue, 23 Apr 2002 10:01:18 +0200
Subject: Kidger's comments on Quadric's design and performance
References: <003601c1ea1f$4c3a75f0$6601a8c0@charterpipeline.com>
Message-ID: <3CC514CE.8D08E38@lfbs.rwth-aachen.de>

Richard Fryer wrote:
> 
> On Fri, 19 Apr 2002 14:06:00 +0100
> Daniel Kidger <Daniel.Kidger at quadrics.com> wrote:
> 
> > after all as well as having the fastest line-speed, the Quadrics
> > interconnect sends all data as virtual addresses (the NIC has its
> > own MMU and TLB). That way any process can read and write
> > the memory of any other node without any CPU overhead.
> 
> I appreciate getting a bit of technical detail on Quadrics interfaces.  Is
> there a web location that might provide more information - comparative
> benchmarks or protocol information or ???

Of course www.quadrics.com, and Fabrizio Petrini is doing a lot of
evaluation work (http://www.c3.lanl.gov/~fabrizio, esp.
http://www.c3.lanl.gov/~fabrizio/quadrics.html).

> This message also reminded me to ask if a long-held opinion is valid - and
> that opinion is "that a cache coherent interconnect would offer performance
> enhancement when applications are at the 'more tightly coupled' end of the
> spectrum."  I know that present PCI based interfaces can't do that without
> invoking software overhead and latencies.  Anyone have data - or an argument
> for invalidating this opinion?

You would need a programming model other than MPI for that (see below),
maybe OpenMP, as you basically have the characteristics of an SMP system
with cc-NUMA architecture.

> I did recently read that the AMD 'HyperTransport' interfaces ARE capable of
> cache coherent transactions.  This would appear to allow protocols (such as
> SCI) that support cache coherence to operate in that mode.  But I wonder if
> it matters to the MPI world.  Seems to me that it would be a factor in
> improving scalability (provided that other interconnect issues such as
> bandwidth bottlenecks don't prevent it).  My recollection is that the SCI
> simulations I saw required very little added traffic to maintain coherency.

This is true (for an introduction, see
http://www.SCIzzL.com/HowSCIcohWorks.html).

However, for MPI, cache-coherence would not really add a performance
benefit. MPI is designed to be efficient with "write-only" protocols.
One-sided communication may benefit from it, but other techniques like
Cray SHMEM do the same w/o cache-coherence.

And I do not expect anybody except AMD or chipset designers to design
network adapters / bus bridges for something proprietary like
HyperTransport...

> Also a brief note about the Dolphin product line, since the issue of link
> saturation has come up:  - they DO also sell switches - or at least offer
> them.  And if you check the SCI specification, you'll see that there are
> some elaborate discussions of fabric architectures that the protocol
> supports and switches enable.  What I DO NOT know is if the SCALI software
> supports switch-based operation, and also don't know what the impact is on
> the system cost per node.  My 'inexperienced' assessment of the appeal in
> the Dolphin family is that you can start without the switch and later add it
> if the performance benefit warrants.  That's what I'd say if I were selling
> them anyway - and didn't know otherwise.  :-)

The "external" switches are not designed for large-scale HPC
applications (although they scale quite well inside the range of their
supported number of nodes), but for high-performance, high-availability
small-scale cluster or embedded applications, such as those Sun sells. With
ext. switches, you don't have to do anything to keep the network up if a
node fails (and also nothing if it comes back, as SCI is not
source-routed). In torus topologies, re-routing needs to be applied to
bypass bad nodes (Scali does this on-the-fly).

Scali does not support external switches AFAIK (at least doesn't sell
such systems any longer), which is less a technical issue than a
design issue, as the topology is fully transparent for the nodes
accessing the network (they did use switches in the past, see
http://www.scali.com/whitepaper/ehpc97/slide_9.html). 

For large scale applications, distributed switches as in torus
topologies scale better and are more cost-efficient (see
http://www.scali.com/whitepaper/scieurope98/scale_paper.pdf and other
resources). With switches, you need *a lot* of cables and switches
(which doesn't stop Quadrics from doing so - resulting in an impressive 14
miles of cables for a recent system (IIRC) with single cables being up
to 25m in length). It would need to be verified whether such a system built
with a Quadrics-like fat-tree topology using Dolphin's 8-port switches
would scale better than the equivalent torus topology for different
communication patterns. I doubt it. At least, the interconnect would cost
a lot more (at least twice as much, or even more depending on the
dimension of the tree).

SCI-MPICH can be used with arbitrary SCI topologies (because it uses
the SISCI interface and thus runs with Scali or Dolphin SCI drivers). It
is not as closely coupled to the SCI drivers as ScaMPI is.

 Joachim

-- 
|  _  RWTH|  Joachim Worringen
|_|_`_    |  Lehrstuhl fuer Betriebssysteme, RWTH Aachen
  | |_)(_`|  http://www.lfbs.rwth-aachen.de/~joachim
    |_)._)|  fon: ++49-241-80.27609 fax: ++49-241-80.22339


From cpignol at seismiccity.com  Tue Apr 23 15:32:00 2002
From: cpignol at seismiccity.com (Claude Pignol)
Date: Tue, 23 Apr 2002 17:32:00 -0500
Subject: Ethernet Channel Bonding (ECB)
References: <3CC4A452.9080403@bellsouth.net> <20020422174541.B24777@stikine.ucs.sfu.ca>
Message-ID: <3CC5E0E0.3000706@seismiccity.com>

Martin,

Have you generalized channel bonding to all the nodes of your cluster?

Which switch do you use?

Thanks
Claude

Martin Siegert wrote:

>On Mon, Apr 22, 2002 at 06:01:22PM -0600, Timothy W. Moore wrote:
>
>>I am new to the Beowulf cluster computing and am awaiting UPS to deliver 
>>my systems.  I have been researching ECB and it seems to have mixed 
>>reviews regarding performance enhancement.  Would/Could someone shed 
>>some light on this topic to the following effect:
>>
>>[1] Is it truly necessary?
>>
>
>That depends on the programs you are planning to run on your cluster. With
>respect to performance: I am getting 269Mbit/s bandwidth with 3-way
>channel bonded fast ethernet (using 3Com NICs). This is almost exactly
>three times as much as I get from a single NIC. The latency is not quite
>as good as with a single NIC: 55us vs. 43us (all numbers measured with
>netpipe). Thus if your program doesn't need the bandwidth or if your
>program is extremely sensitive to latencies then you don't need channel
>bonding.
>
>>[2] If using RedHat 7.2, should I re-compile the kernel?
>>
>
>Not necessarily. The bonding.o module is part of all RedHat kernels.
>However, I strongly recommend upgrading the kernel to 2.4.18 - I had
>problems with all earlier 2.4.x versions.
>
>Regards,
>Martin
>
>========================================================================
>Martin Siegert
>Academic Computing Services                        phone: (604) 291-4691
>Simon Fraser University                            fax:   (604) 291-4242
>Burnaby, British Columbia                          email: siegert at sfu.ca
>Canada  V5A 1S6
>========================================================================
>

-- 
------------------------------------------------------------------------
Claude Pignol SeismicCity, Inc. <http://www.seismiccity.com>
2900 Wilcrest Dr.    Suite 470  Houston TX 77042
Phone:832 251 1471 Mob:281 703 2933  Fax:832 251 0586


From heckendo at cs.uidaho.edu  Tue Apr 23 17:41:35 2002
From: heckendo at cs.uidaho.edu (Robert B Heckendorn)
Date: Tue, 23 Apr 2002 17:41:35 -0700 (PDT)
Subject: cooling
In-Reply-To: <200204231601.g3NG13b05509@blueraja.scyld.com>
Message-ID: <200204240041.RAA21877@brownlee.cs.uidaho.edu>

We are looking at the facilities issues in installing a beowulf on the
order of 500 nodes.  What facilities is telling us is that it is going
to almost cost us more to buy the cooling for the machine than to buy
the machine itself.  How are people making the air conditioning for their
machines affordable?  Have we miscalculated the HVAC loads?  Are we
being overcharged?

thanks for any guidance.

-- 
| Robert Heckendorn                        | We may not be the only
| heckendo at cs.uidaho.edu                   | species on the planet but
| http://www.cs.uidaho.edu/~heckendo       | we sure do act like it.
| CS Dept, University of Idaho             |
| Moscow, Idaho, USA   83844-1010          |


From bob at drzyzgula.org  Tue Apr 23 18:54:37 2002
From: bob at drzyzgula.org (Bob Drzyzgula)
Date: Tue, 23 Apr 2002 21:54:37 -0400
Subject: cooling
In-Reply-To: <200204240041.RAA21877@brownlee.cs.uidaho.edu>
References: <200204231601.g3NG13b05509@blueraja.scyld.com> <200204240041.RAA21877@brownlee.cs.uidaho.edu>
Message-ID: <20020423215437.A3370@www2>

If the new load requires the installation of new
chillers, it could indeed cost a pile-o'-money. Even
if each node burned electricity at 100 Watts, you
are looking at 50 kW of power consumption, or about
170,000 BTU/hr, requiring about 14 tons of cooling to
remove -- your facilities folks may well be looking
at installing something like one or more Liebert
chillers such as these:
http://www.liebert.com/dynamic/displayproduct.asp?id=545&cycles=60Hz
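
For anyone who wants to redo this arithmetic for a different node count,
here is a minimal Python sketch. It assumes the usual conversion factors
(about 3.412 BTU/hr per watt, 12,000 BTU/hr per ton of cooling) and that
every watt drawn ends up as heat in the room. The function name heat_load
is made up for illustration, and this is a back-of-the-envelope estimate,
not an HVAC design.

```python
# Rough heat-load arithmetic for a cluster room (a sketch, not an HVAC design).
BTU_PER_HR_PER_WATT = 3.412   # assumed conversion: 1 W is about 3.412 BTU/hr
BTU_PER_HR_PER_TON = 12_000   # assumed conversion: 1 ton of cooling = 12,000 BTU/hr


def heat_load(nodes, watts_per_node):
    """Return total power (kW), heat (BTU/hr) and cooling tons for a cluster.

    Assumes all electrical power drawn by the nodes is released as heat.
    """
    watts = nodes * watts_per_node
    btu_per_hr = watts * BTU_PER_HR_PER_WATT
    return {
        "kw": watts / 1000.0,
        "btu_per_hr": btu_per_hr,
        "tons": btu_per_hr / BTU_PER_HR_PER_TON,
    }


if __name__ == "__main__":
    # The 500-node, 100 W/node case discussed above.
    load = heat_load(nodes=500, watts_per_node=100)
    print(f"{load['kw']:.0f} kW, {load['btu_per_hr']:,.0f} BTU/hr, "
          f"{load['tons']:.1f} tons")
```

Plugging in a more pessimistic per-node wattage shows how quickly the
tonnage grows, which is the number the facilities people will care about.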

There could well be additional shortfalls in external
heat exchanger capacity, pipe capacity out to the
heat exchangers, electric power for the computers
and for the chillers, etc. If you don't already
have the raised floor space, that could also add
quite a bit to the cost to cool all those nodes.

As to how we are making the A/C for our systems "affordable",
we do it by virtue of the HVAC budget belonging to
a different division, :-) although that also means
that we don't have *control* over that budget, and
when we hit the ceiling on cooling we kind of have
to just stop installing new equipment until the whining
and begging and pleading might eventually get us
a new chiller -- and even then we might have to give
up some rack space so there'd be a place to put it. :-(

--Bob

On Tue, Apr 23, 2002 at 05:41:35PM -0700, Robert B Heckendorn wrote:
> 
> We are looking at the facilities issues in installing a beowulf on the
> order of 500 nodes.  What facilities is telling us is that it is going
> to almost cost us more to buy the cooling for the machine than to buy
> machine itself.  How are people making the air conditioning for their
> machines affordable?  Have we miscalculated the HVAC loads?  Are we
> being over charged?  
> 
> thanks for any guidance.
> 
> -- 
> | Robert Heckendorn                        | We may not be the only
> | heckendo at cs.uidaho.edu                   | species on the planet but
> | http://www.cs.uidaho.edu/~heckendo       | we sure do act like it.
> | CS Dept, University of Idaho             |
> | Moscow, Idaho, USA   83844-1010          |


From jon at minotaur.com  Tue Apr 23 19:32:47 2002
From: jon at minotaur.com (Jon Mitchiner)
Date: Tue, 23 Apr 2002 22:32:47 -0400
Subject: cooling
References: <200204231601.g3NG13b05509@blueraja.scyld.com> <200204240041.RAA21877@brownlee.cs.uidaho.edu> <20020423215437.A3370@www2>
Message-ID: <024901c1eb38$52a8ba90$0d01a8c0@jonxp>

The other consideration is to have some kind of monitoring/alerting system
for the room.  A client has dedicated cooling equipment for a beowulf
cluster of 52 machines.  Recently the A/C broke one morning and they did
not find out till the afternoon, when someone walked into the small network
room and found the room was in excess of 100 degrees.

I don't want to think about what could have happened if it had happened on a
Friday evening and nobody found out about it until Monday. :)

Jon Mitchiner

----- Original Message -----
From: "Bob Drzyzgula" <bob at drzyzgula.org>
To: "Robert B Heckendorn" <heckendo at cs.uidaho.edu>
Cc: <beowulf at beowulf.org>
Sent: Tuesday, April 23, 2002 9:54 PM
Subject: Re: cooling


> If the new load requires the installation of new
> chillers, it could indeed cost a pile-o'-money. Even
> if each node burned electricity at 100 Watts, you
> are looking at 50 kW of power consumption, or about
> 170,000 BTU/hr, requiring about 14 tons of cooling to
> remove -- your facilities folks may well be looking
> at installing something like one or more Liebert
> chillers such as these:
> http://www.liebert.com/dynamic/displayproduct.asp?id=545&cycles=60Hz
>
> There could well be additional shortfalls in external
> heat exchanger capacity, pipe capacity out to the
> heat exchangers, electric power for the computers
> and for the chillers, etc. If you don't already
> have the raised floor space, that could also add
> quite a bit to the cost to cool all those nodes.
>
> As to how we are making the A/C for our systems "affordable",
> we do it by virtue of the HVAC budget belonging to
> a different division, :-) although that also means
> that we don't have *control* over that budget, and
> when we hit the ceiling on cooling we kind of have
> to just stop installing new equipment until the whining
> and begging and pleading might eventually get us
> a new chiller -- and even then we might have to give
> up some rack space so there'd be a place to put it. :-(
>
> --Bob
>
> On Tue, Apr 23, 2002 at 05:41:35PM -0700, Robert B Heckendorn wrote:
> >
> > We are looking at the facilities issues in installing a beowulf on the
> > order of 500 nodes.  What facilities is telling us is that it is going
> > to almost cost us more to buy the cooling for the machine than to buy
> > machine itself.  How are people making the air conditioning for their
> > machines affordable?  Have we miscalculated the HVAC loads?  Are we
> > being over charged?
> >
> > thanks for any guidance.
> >
> > --
> > | Robert Heckendorn                        | We may not be the only
> > | heckendo at cs.uidaho.edu                   | species on the planet but
> > | http://www.cs.uidaho.edu/~heckendo       | we sure do act like it.
> > | CS Dept, University of Idaho             |
> > | Moscow, Idaho, USA   83844-1010          |


From scott.delinger at ualberta.ca  Tue Apr 23 19:45:36 2002
From: scott.delinger at ualberta.ca (Scott Delinger)
Date: Tue, 23 Apr 2002 20:45:36 -0600
Subject: cooling
Message-ID: <p05111714b8ebcc3ba2d8@[129.128.2.254]>

Hmm. I just bought 126 dual AthlonMP boxes, and needed to renovate 
the lab (electricity and AC). I've now got 7 tons of AC, and a whole 
panel devoted to the clusters in this facility. The reno was about 
CDN$35K (US$1.50), and the machines cost maybe eight times that?
-- 

Scott L. Delinger, Ph.D.		IT Administrator
Department of Chemistry
University of Alberta
Edmonton, Alberta, Canada  T6G 2G2
scott.delinger at ualberta.ca


From scott.delinger at ualberta.ca  Tue Apr 23 19:56:53 2002
From: scott.delinger at ualberta.ca (Scott Delinger)
Date: Tue, 23 Apr 2002 20:56:53 -0600
Subject: cooling
In-Reply-To: <024901c1eb38$52a8ba90$0d01a8c0@jonxp>
References: <200204231601.g3NG13b05509@blueraja.scyld.com>
 <200204240041.RAA21877@brownlee.cs.uidaho.edu>
 <20020423215437.A3370@www2> <024901c1eb38$52a8ba90$0d01a8c0@jonxp>
Message-ID: <p05111716b8ebce9a3122@[129.128.2.254]>

>The other consideration to have is some kind of monitoring/alerting system
>for the room.  A client has a dedicated cooling equipment for a beowulf
>cluster for 52 machines.  Recently the A/C broke one morning and they did
>not find out till the afternoon when someone walked into the small network
>room and found the room was in excess of 100 degrees.
>
>I dont want to think about what could have happened if it happened on a
>friday evening and nobody found about it until Monday. :)

Ah, and on that: http://www.netbotz.com/ (rackmountable and wall- or 
camera-mountable Temp, RH, air speed, door contact, camera, and 
external sensors: all web/SNMP addressable w/email alerts available). 
I've got four WallBotz 310 units, in server rooms, wiring closets, 
and cluster rooms. And the cluster room has a thermostat monitored 
24x7 by our Physical Plant. Braces and belt, when that much hardware 
is on the line.
-- 

Scott L. Delinger, Ph.D.		IT Administrator
Department of Chemistry
University of Alberta
Edmonton, Alberta, Canada  T6G 2G2
scott.delinger at ualberta.ca


From steveb at aei-potsdam.mpg.de  Tue Apr 23 21:01:02 2002
From: steveb at aei-potsdam.mpg.de (Steven Berukoff)
Date: Wed, 24 Apr 2002 06:01:02 +0200 (MET DST)
Subject: cooling
In-Reply-To: <200204240041.RAA21877@brownlee.cs.uidaho.edu>
Message-ID: <Pine.OSF.4.21.0204240555450.29099-100000@holodec15.aei-potsdam.mpg.de>

Hi,

We just purchased ~150 dual AMDs, and are cooling them with 4 Fujitsu
ceiling-mounted air-conditioners: about 50kW of AC cost us about $25k,
which is about 10% of the cost of the machines.

So yeah, you're getting screwed.  It might be that your facilities people
are thinking that you need one huge monolithic cooling system, and you
may; however, cooling in a piecemeal way ends up costing a lot less.

Steve

> We are looking at the facilities issues in installing a beowulf on the
> order of 500 nodes.  What facilities is telling us is that it is going
> to almost cost us more to buy the cooling for the machine than to buy
> machine itself.  How are people making the air conditioning for their
> machines affordable?  Have we miscalculated the HVAC loads?  Are we
> being over charged?  
> 
> thanks for any guidance.
> 
> -- 
> | Robert Heckendorn                        | We may not be the only
> | heckendo at cs.uidaho.edu                   | species on the planet but
> | http://www.cs.uidaho.edu/~heckendo       | we sure do act like it.
> | CS Dept, University of Idaho             |
> | Moscow, Idaho, USA   83844-1010          |
> 


=====
Steve Berukoff					tel: 49-331-5677233
Albert-Einstein-Institute			fax: 49-331-5677298
Am Muehlenberg 1, D14477 Golm, Germany		email:steveb at aei.mpg.de


From heckendo at cs.uidaho.edu  Tue Apr 23 22:00:16 2002
From: heckendo at cs.uidaho.edu (Robert B Heckendorn)
Date: Tue, 23 Apr 2002 22:00:16 -0700 (PDT)
Subject: COTS cooling
In-Reply-To: <200204240407.g3O47Jb22438@blueraja.scyld.com>
Message-ID: <200204240500.WAA23212@brownlee.cs.uidaho.edu>

We don't have to pay for the cooling but the cost of the installation
of cooling is being used as an argument to cut corners on the machine
itself.  :-( So I would love to get the cost of the installation of
cooling down.

One of the responses to my mail said:

"We just purchased ~150 dual AMDs, and are cooling them with 4 Fujitsu
ceiling-mounted air-conditioners: about 50kW of AC cost us about $25k,
which is about 10% of the cost of the machines."

This sounds like COTS cooling to go with our COTS machines.  :-) It
has the nice feature that if one AC goes out the others keep running.
It is also nice in that half a dozen 125KBTU/hr units in the ceiling
would seem to handle a fairly large load and all machines for the next
4 years of expansion.

450W/dualnode * 3.4BTU/hr/W * 400 nodes  = 612K BTU/hr

Does anyone else have comments on this scheme (pros or cons)?
Is anyone doing anything like this?
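
As a sanity check on the unit count, here is a small Python sketch of the
sizing above. It takes the 450 W, 3.4 BTU/hr/W and 400-node figures from
the estimate and the 125K BTU/hr unit size as given. The helper name
units_needed and the idea of checking whether the remaining units can
carry the full load with one unit down are my own illustration, not
anything a vendor has confirmed.

```python
import math

WATTS_PER_NODE = 450         # per dual node, from the estimate above
BTU_PER_HR_PER_WATT = 3.4    # rounded factor used in the estimate above
NODES = 400
UNIT_BTU_PER_HR = 125_000    # one ceiling-mounted unit, as assumed above


def units_needed(load_btu_per_hr, unit_btu_per_hr, spares=0):
    """Smallest number of units covering the load, plus optional spares."""
    return math.ceil(load_btu_per_hr / unit_btu_per_hr) + spares


load = WATTS_PER_NODE * BTU_PER_HR_PER_WATT * NODES
minimum = units_needed(load, UNIT_BTU_PER_HR)
with_spare = units_needed(load, UNIT_BTU_PER_HR, spares=1)

# With half a dozen units installed, does the load still fit if one fails?
survives_one_failure = (6 - 1) * UNIT_BTU_PER_HR >= load

if __name__ == "__main__":
    print(f"load: {load:,.0f} BTU/hr")
    print(f"minimum units: {minimum}, with one spare: {with_spare}")
    print(f"six units survive one failure: {survives_one_failure}")
```

On these numbers five units are the bare minimum, so half a dozen leaves
one unit of headroom but not much room for expansion beyond 400 nodes.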


-- 
| Robert Heckendorn                        | We may not be the only
| heckendo at cs.uidaho.edu                   | species on the planet but
| http://www.cs.uidaho.edu/~heckendo       | we sure do act like it.
| CS Dept, University of Idaho             |
| Moscow, Idaho, USA   83844-1010          |


From rauch at inf.ethz.ch  Wed Apr 24 00:46:38 2002
From: rauch at inf.ethz.ch (Felix Rauch)
Date: Wed, 24 Apr 2002 09:46:38 +0200 (CEST)
Subject: Packet Engines "Hamachi" gigabit ethernet card
In-Reply-To: <Pine.LNX.4.31.0204231230010.22248-100000@snowball.fnal.gov>
Message-ID: <Pine.LNX.4.33.0204240943470.13044-100000@maloney.ethz.ch>

On Tue, 23 Apr 2002, Steven Timm wrote:
> Does anyone have a Hamachi gigabit ethernet card that is
> working under any kind of a 2.4 kernel at all?  If so, what did
> it take to make it work?  With all the various drivers we have
> tried including the latest available, we get the error
> message "too much work at interrupt" and no traffic through the card.

We use the Hamachi GNIC-II cards on 2.4.3 kernels without any problems.

We currently use the driver compiled as a module. Here's some more
information which might be relevant for you:

Apr  6 13:36:39 c1 kernel: hamachi.c:v1.01 5/16/2000  Written by Donald Becker
[...]
Apr  6 13:36:39 c1 kernel: eth1: Hamachi GNIC-II type 10911 at 0xe081d000, 00:e0:b1:04:16:cb, IRQ 20.
Apr  6 13:36:39 c1 kernel: eth1:  64-bit 33 Mhz PCI bus (60), Virtual Jumpers 30, LPA 0000.

Regards,
Felix
-- 
Felix Rauch                      | Email: rauch at inf.ethz.ch
Institute for Computer Systems   | Homepage: http://www.cs.inf.ethz.ch/~rauch/
ETH Zentrum / RZ H18             | Phone: ++41 1 632 7489
CH - 8092 Zuerich / Switzerland  | Fax:   ++41 1 632 1307


From bjornts at mi.uib.no  Wed Apr 24 00:58:06 2002
From: bjornts at mi.uib.no (Bjorn Tore Sund)
Date: Wed, 24 Apr 2002 09:58:06 +0200 (CEST)
Subject: Intel releases C++/Fortran suite V 6.0 for Linux
In-Reply-To: <200204231601.g3NG1Wb05553@blueraja.scyld.com>
Message-ID: <Pine.LNX.4.33.0204240954130.21703-100000@abel.mi.uib.no>

On Tue, 23 Apr 2002 beowulf-request at beowulf.org wrote:
> Date: Tue, 23 Apr 2002 16:29:44 +0200 (CEST)
> From: Eugen Leitl <eugen at leitl.org>
> To: <Beowulf at beowulf.org>
> Subject: Intel releases C++/Fortran suite V 6.0 for Linux
>
>
> Intel announces the release of Version 6 of the Intel(R) C++ and Fortran
> Compilers for Windows and Linux. Take advantage of performance for your
> software. Please visit our Compilers Home Page today.
>
> http://www.intel.com/software/products/compilers/

I've been wanting to test these out, both in the previous versions
and this, but as long as Intel are only releasing them as RedHat
rpms, they are fundamentally useless on a SuSE system.  Or at least
a lot of hassle to install.  Anyone know if Intel are going to come
to their senses and start releasing tarballs, or am I going to have
to go through that hassle?

Bjørn
-- 
Bjørn Tore Sund         Phone:  (+47) 555-84894      Stupidity is like a
System administrator    Fax:    (+47) 555-89672      fractal; universal and
Math. Department        Mobile: (+47) 918 68075      infinitely repetitive.
University of Bergen    VIP:    81724
teknisk at mi.uib.no       Email:  bjornts at mi.uib.no    http://www.mi.uib.no/


From jcownie at etnus.com  Wed Apr 24 02:09:17 2002
From: jcownie at etnus.com (James Cownie)
Date: Wed, 24 Apr 2002 10:09:17 +0100
Subject: A better "Titanium" reference
Message-ID: <170Im5-0I7-00@etnus.com>

This is a better reference site for Titanium than the one I gave in my
previous mail :-

  http://www.cs.berkeley.edu/Research/Projects/titanium/

-- Jim 

James Cownie	<jcownie at etnus.com>
Etnus, LLC.     +44 117 9071438
http://www.etnus.com


From javier.iglesias at freesurf.ch  Wed Apr 24 03:18:46 2002
From: javier.iglesias at freesurf.ch (javier.iglesias at freesurf.ch)
Date: Wed, 24 Apr 2002 11:18:46 +0100
Subject: Suggestions on fiber Gigabit NICs
Message-ID: <1019639926.webexpressdV3.1.f@smtp.freesurf.ch>

Hi all,

To cope with some network bottleneck problems leading to calculation
crashes, we are considering migrating the master of our 18-node cluster
(dual AMD 1600+/Tyan Tiger MP/FastEthernet/Scyld 27-b8) to Gigabit.

I would like to get your feelings/experiences on two fiber Gigabit 
NICs :

1) Netgear GA-621
-> http://www.netgear.com/product_view.asp?xrp=1&yrp=1&zrp=106

2) 3Com 3C996-SX
-> 
http://www.3com.com/products/en_US/detail.jsp?tab=features&pathtype=purchase&sku=3C996-SX


We have a really nice Extreme Networks Summit 48 Ethernet switch
-> http://www.extremenetworks.com/products/datasheets/summit24.asp
that offers 2 fiber Gigabit ports we want to put to work :)

Any other suggestions? Want to share comments?

Thanks in advance for your help !


--javier



From daniel.pfenniger at obs.unige.ch  Wed Apr 24 02:51:28 2002
From: daniel.pfenniger at obs.unige.ch (Daniel Pfenniger)
Date: Wed, 24 Apr 2002 11:51:28 +0200
Subject: Intel releases C++/Fortran suite V 6.0 for Linux
References: <Pine.LNX.4.33.0204240954130.21703-100000@abel.mi.uib.no>
Message-ID: <3CC68020.2080609@obs.unige.ch>

Bjorn Tore Sund wrote:

...
> 
> I've been wanting to test these out, both in the previous versions
> and this, but as long as Intel are only releasing them as RedHat
> rpms, they are fundamentally useless on a SuSE system.  Or at least
> a lot of hassle to install.  Anyone know if Intel are going to come
> to their senses and start releasing tarballs, or am I going to have
> to go through that hassle?
> 
> Bjørn


At least on Mandrake 8.2 no particular problem occurred.
The installation just detects which kernel and glibc versions
are there and installs the files using rpm (by default, but modifiable,
in /opt/intel/).
The installation script warns that the compilers have not been
tested on a non-RedHat distribution, etc.
There is also an uninstall script that uses rpm too.

On Intel-based computers the compilers are sufficiently interesting
in boosting performance (sometimes by over 100% w.r.t. gcc/g77) to
invest perhaps 30min in their installation and preliminary testing.
The remaining "hassle" is perhaps to extend the PATH and LD_LIBRARY_PATH
variables, or add a line to /etc/ld.so.conf and run ldconfig.

	Dan


From rgb at phy.duke.edu  Wed Apr 24 06:07:51 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed, 24 Apr 2002 09:07:51 -0400 (EDT)
Subject: cooling
In-Reply-To: <200204240041.RAA21877@brownlee.cs.uidaho.edu>
Message-ID: <Pine.LNX.4.44.0204240822230.27746-100000@lucifer.rgb.private.net>

On Tue, 23 Apr 2002, Robert B Heckendorn wrote:

> We are looking at the facilities issues in installing a beowulf on the
> order of 500 nodes.  What facilities is telling us is that it is going
> to almost cost us more to buy the cooling for the machine than to buy
> machine itself.  How are people making the air conditioning for their
> machines affordable?  Have we miscalculated the HVAC loads?  Are we
> being over charged?  

No, this is one of the miracles of modern beowulfery.  Our new facility
in the physics department here is a modest sized room, perhaps 5mx13m.
It has 75 KW of power in umpty 20A and 15A (120VAC) circuits.  It has a
heat exchange unit in one end of the room (unfortunately we were unable
to commandeer the small room next door which would have put it and its
noise out of the space itself) that is about 3mx3mx3m (to the ceiling,
anyway) and that eats an extra half-meter or more on the sides in wasted
space (across from the door, fortunately), making the first 3m+ of the
room unusable for anything but entrance and AC.

The room did require a certain amount of prep -- old floor out, new
floor in, asbestos removal, paint.  It did require fairly extensive
wiring for all of the nodes -- a couple of large power distribution
panels, power poles every couple of meters where they can service
clusters of racks, a nifty thermal kill for the room power (room temp
hits a preset of say, 30-35C and bammo, all nodes are just shut down the
hard way).  It did require a certain number of overhead cable trays and
so forth.  Still, I believe that the AC alone (one capable of removing
75 KW continuously) dominated the cost of the $150K renovation.  It was
so expensive that we had to really work to convince the University to do
it at all, and share the space with another department to ensure that it
is filled as much as possible.

Right now we are probably balancing along at the point where the cost
of the nodes in the room equals the cost of renovation -- we probably have
on the order of $150K worth of systems racked up and shelved.  However,
we are also ordering new nodes and upgrades pretty steadily as grants
and so forth come in, and will likely have well over $250K worth of
hardware in the room by the end of the year (which will translate into
order 250 CPUs -- even buying duals, our nodes (without myrinet and with
only some nodes on gigabit ethernet) are costing roughly $1K/cpu in a 2U
dual athlon rackmount configuration).

By the time the room is FULL (or as full as we can get it), probably in
a couple of years, it should have order of 500 cpus (we're highball
estimating 150W per CPU, although we're hoping for an average that is
more like 100W -- high end Athlons draw about 70W loaded all by
themselves, and then there is the rest of the system).  At that point
our node investment will likely exceed our renovation expense by 3 to 1
or better, and of course the value to the University in grant-funded
research enabled by all of those nodes will be higher still -- every
postdoc or faculty person grant-supported by research done with the
cluster will probably net the university $30K or more in indirect costs.
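
The 500-CPU figure is just the room's power budget divided by the per-CPU
estimate. A minimal Python sketch of that division, assuming the 75 KW
feed (and matching AC) is the binding limit rather than floor space; the
function name cpus_supported is only for illustration:

```python
ROOM_POWER_WATTS = 75_000   # the room's 75 KW of power, matched by the AC


def cpus_supported(room_watts, watts_per_cpu):
    """Whole number of CPUs that fit inside the room's power budget."""
    return room_watts // watts_per_cpu


if __name__ == "__main__":
    # Highball estimate (150 W/CPU) versus the hoped-for average (100 W/CPU).
    for watts_per_cpu in (150, 100):
        count = cpus_supported(ROOM_POWER_WATTS, watts_per_cpu)
        print(f"{watts_per_cpu} W/CPU -> {count} CPUs")
```

So if the average really does come in nearer 100 W, power and cooling
would allow roughly half again as many CPUs, space permitting.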

Overall, I therefore think that this is a solid win for the University
and an investment essential to keeping the University current and
competitive in its theoretical physics (and statistics, the group with
whom we share) research.  The University has at this point some two or
three similar facilities in several buildings on campus.  Computer
science has an even (much) larger cluster/server facility that it shares
with e.g. math (which has at least one large cluster doing imaging
research supported by petrochemical companies).  I believe that they are
considering the construction of an even larger centralized facility to
put genomic research and some biomed engineering clusters in.

In a way this is wistfully interesting.  Old Guys (tm) will remember
well the days of totally centralized compute resources, where huge,
expensive facilities housed rows of e.g. IBM 370s.  There were high
priests who cared for and fed these beasts, acolytes who scurried in and
out, and one prayed to them in the form of Fortran IV card decks with
HASP job control prologue/epilogues and awaited the granting of your
prayers in the form of a green-barred lineprinter output (charged per
page including the bloody header page) placed into the box labelled with
your last name initial.  It was all very solemn, expensive, and
ritualized.

Then first the minicomputer, then the PC, liberated us from all of that.
An IBM PC didn't run as fast as a 370, but time on the 370 was billed at
something like $1/minute of CPU and time on the PC, even at a capital
cost of $5K for the PC itself (yes, they were expensive little pups) was
amortized out over YEARS (at 1440 minutes/day).  Even using the PC as a
terminal to the 370 allowed one to edit remotely instead of on a
timeshare basis (billed at $1/minute, damn it!) and saved one loads of
connect time (hence money).  And then came Sun workstations, faster PCs,
linux and somewhere in there computing became almost completely
decentralized with a client/server paradigm -- yes, there were a few
centralized servers, but most actual computation and presentation was
done at the desktop.

Even early beowulfs were largely spread out and not horribly
centralized.  An 8 node or 16 node system could fit in an office, a 32
node or even 64 node shelved beowulf could fit in a small server room.
The beauty of them was that you bought one for YOUR research, you didn't
share it (time or otherwise), and once you figured out how to put it all
together it didn't require much care and feeding, certainly not at the
high priest/acolyte stage (although cooling even 32 nodes starts to be
serious business).

Alas, we now seem to have come full circle.  Beowulfs are indeed COTS
supercomputers, but high density beowulfs are rackmounted and put in
centralized, expensive, often shared server rooms and strongly resemble
those centralized computers from which we once were freed.

I exaggerate the woe, of course.  The whole cluster NOW is transparently
accessible at gigabit speeds from your desktop across campus (and
wouldn't be any MORE accessible if you were sitting at a workstation in
the room with it listening to 80 dB worth of AC roar in your ear), linux
is excruciatingly stable (when it isn't unstable as hell, of course:-),
and once you get the nodes installed and burned in a human needs to
actually visit the cluster room only once in a long while.  We've
replaced the high priests and acolytes with sysadmin wizards and
application/programming gurus but this is a welcome change, actually
(they may appear similar but philosophically they are very different
indeed:-).  Still, the centralization threatens to a greater or lesser
extent the freedom -- it puts control much more into the hands of
administrators, costs more, involves more people in decisions.

Not much to do with the original question, sure, but I needed a little
philosophical ramble to start my day.  Now I have to write an hour exam
for my kiddies, which is less fun.  Last day of class, though, which IS
fun!  Hooray!

(It isn't only the students that anticipate summer...;-)

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From rgb at phy.duke.edu  Wed Apr 24 06:16:17 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed, 24 Apr 2002 09:16:17 -0400 (EDT)
Subject: cooling
In-Reply-To: <024901c1eb38$52a8ba90$0d01a8c0@jonxp>
Message-ID: <Pine.LNX.4.44.0204240908580.27746-100000@lucifer.rgb.private.net>

On Tue, 23 Apr 2002, Jon Mitchiner wrote:

> The other consideration to have is some kind of monitoring/alerting system
> for the room.  A client has a dedicated cooling equipment for a beowulf
> cluster for 52 machines.  Recently the A/C broke one morning and they did
> not find out till the afternoon when someone walked into the small network
> room and found the room was in excess of 100 degrees.
> 
> I dont want to think about what could have happened if it happened on a
> friday evening and nobody found about it until Monday. :)

A very good idea is a thermal kill switch on the master power panels,
mentioned in my previous reply.  If room temperature hits a preset FOR
ANY REASON, all nodes go down, period.

Strategically, one would still want to monitor node and room temperature
and install alarms and automated node shutdown scripts as previously
mentioned in many discussions, but set THOSE alarms and shutdowns to go
off at (say) 25C and 30C and set the room kill switch at (say) 35C.
That way if cooling fails, first you get mail/pages/human alarms (at
25C, assuming room temperature is set to 20C and ordinarily stays there
+/- 2C), then at 30C (or when CPU temps pass a preset alarm that goes
off when ambient room temperature gets about there) nodes start shutting
themselves down, and only if the humans and shutdowns fail to control
the temperature or get the AC running again and the temperature keeps
climbing does the room kill go off.

This protects the systems against ANY POSSIBILITY that they will operate
for an extended time at a "dangerous" temperature.
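
The three tiers can be written down as a simple policy table. The Python
sketch below uses the example thresholds from the paragraph above (25C,
30C, 35C). The function and action names are made up for illustration,
and it only models the decision: the real room kill is a hardware switch
on the power panels that must work even if every script and daemon has
failed, so nothing like this should stand in for it.

```python
# Example thresholds from the discussion above (degrees C); tune per room.
ALERT_C = 25.0           # mail/page the humans
NODE_SHUTDOWN_C = 30.0   # nodes shut themselves down cleanly
ROOM_KILL_C = 35.0       # hardware thermal kill cuts room power


def action_for(temp_c):
    """Return the most severe action warranted at a given room temperature."""
    if temp_c >= ROOM_KILL_C:
        return "room_kill"
    if temp_c >= NODE_SHUTDOWN_C:
        return "node_shutdown"
    if temp_c >= ALERT_C:
        return "alert"
    return "ok"


if __name__ == "__main__":
    # Walk a rising temperature through the tiers.
    for temp in (20.0, 26.0, 31.0, 36.0):
        print(f"{temp:.0f}C -> {action_for(temp)}")
```

Checking the most severe tier first means a sensor reading that jumps
straight past the lower thresholds still maps to the right action.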

Conservative people, or people with evidence that the properly
functioning room has a very stable temperature profile, could reduce the
alarm margins even further -- 35C is already quite a bit hotter than one
wants to let the ambient air reach.  30C would be better, but it doesn't
leave much room for less intrusive alarms and scripts to take effect.

   rgb

> 
> Jon Mitchiner
> 
> ----- Original Message -----
> From: "Bob Drzyzgula" <bob at drzyzgula.org>
> To: "Robert B Heckendorn" <heckendo at cs.uidaho.edu>
> Cc: <beowulf at beowulf.org>
> Sent: Tuesday, April 23, 2002 9:54 PM
> Subject: Re: cooling
> 
> 
> > If the new load requires the installation of new
> > chillers, it could indeed cost a pile-o'-money. Even
> > if each node burned electricity at 100 Watts, you
> > are looking at 50 kW of power consumption, or about
> > 170,000 BTU/hr, requiring about 14 tons of cooling to
> > remove -- your facilities folks may well be looking
> > at installing something like one or more Liebert
> > chillers such as these:
> > http://www.liebert.com/dynamic/displayproduct.asp?id=545&cycles=60Hz
> >
> > There could well be additional shortfalls in external
> > heat exchanger capacity, pipe capacity out to the
> > heat exchangers, electric power for the computers
> > and for the chillers, etc. If you don't already
> > have the raised floor space, that could also add
> > quite a bit to the cost to cool all those nodes.
> >
> > As to how we are making the A/C for our systems "affordable",
> > we do it by virtue of the HVAC budget belonging to
> > a different division, :-) although that also means
> > that we don't have *control* over that budget, and
> > when we hit the ceiling on cooling we kind of have
> > to just stop installing new equipment until the whining
> > and begging and pleading might eventually get us
> > a new chiller -- and even then we might have to give
> > up some rack space so there'd be a place to put it. :-(
> >
> > --Bob
> >
> > On Tue, Apr 23, 2002 at 05:41:35PM -0700, Robert B Heckendorn wrote:
> > >
> > > We are looking at the facilities issues in installing a beowulf on the
> > > order of 500 nodes.  What facilities is telling us is that it is going
> > > to cost us almost more to buy the cooling for the machine than to buy
> > > the machine itself.  How are people making the air conditioning for their
> > > machines affordable?  Have we miscalculated the HVAC loads?  Are we
> > > being overcharged?
> > >
> > > thanks for any guidance.
> > >
> > > --
> > > | Robert Heckendorn                        | We may not be the only
> > > | heckendo at cs.uidaho.edu                   | species on the planet but
> > > | http://www.cs.uidaho.edu/~heckendo       | we sure do act like it.
> > > | CS Dept, University of Idaho             |
> > > | Moscow, Idaho, USA   83844-1010          |
> > > _______________________________________________
> > > Beowulf mailing list, Beowulf at beowulf.org
> > > To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From rgb at phy.duke.edu  Wed Apr 24 06:50:38 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed, 24 Apr 2002 09:50:38 -0400 (EDT)
Subject: COTS cooling
In-Reply-To: <200204240500.WAA23212@brownlee.cs.uidaho.edu>
Message-ID: <Pine.LNX.4.44.0204240933130.27746-100000@lucifer.rgb.private.net>

On Tue, 23 Apr 2002, Robert B Heckendorn wrote:

> We don't have to pay for the cooling but the cost of the installation
> of cooling is being used as an argument to cut corners on the machine
> itself.  :-( So I would love to get the cost of the installation of
> cooling down.
> 
> One of the responses to my mail said:
> 
> "We just purchased ~150 dual AMDs, and are cooling them with 4 Fujitsu
> ceiling-mounted air-conditioners: about 50kW of AC cost us about $25k,
> which is about 10% of the cost of the machines."
> 
> This sounds like COTS cooling to go with our COTS machines.  :-) It
> has the nice feature that if one AC goes out the others keep running.
> It is also nice in that half a dozen 125KBTU/hr units in the ceiling
> would seem to handle a fairly large load and all machines for the next
> 4 years of expansion.
> 
> 450W/dualnode * 3.4BTU/hr/W * 400 nodes  = 612K BTU/hr
> 
> Does anyone else have comments on this scheme (pros or cons)?
> Is anyone doing anything like this?
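
As a quick sanity check on the arithmetic in the quoted estimate, here
is a small Python sketch.  It assumes the standard conversions of about
3.412 BTU/hr per watt and 12,000 BTU/hr per ton of cooling; the function
names are mine, and the 500 x 100 W case is the one from earlier in this
thread.

```python
BTU_PER_HR_PER_WATT = 3.412   # 1 W is about 3.412 BTU/hr
BTU_PER_HR_PER_TON = 12_000   # 1 ton of cooling = 12,000 BTU/hr


def heat_load_btu_per_hr(nodes, watts_per_node):
    """Heat to be removed, assuming all electrical power ends up as heat."""
    return nodes * watts_per_node * BTU_PER_HR_PER_WATT


def cooling_tons(btu_per_hr):
    """Convert a heat load in BTU/hr to tons of cooling."""
    return btu_per_hr / BTU_PER_HR_PER_TON


if __name__ == "__main__":
    for nodes, watts in ((500, 100), (400, 450)):
        load = heat_load_btu_per_hr(nodes, watts)
        print(f"{nodes} nodes x {watts} W: {load:,.0f} BTU/hr, "
              f"{cooling_tons(load):.1f} tons")
```

With 3.412 rather than the rounded 3.4, 400 nodes at 450 W come to
roughly 614K BTU/hr, or about 51 tons, so half a dozen 125K BTU/hr units
(750K BTU/hr total) would cover it with some margin.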

An AC consists of two separate components.  One is the heat
exchanger/blower/ductwork, which generally lives "in" the space being
cooled. To remove a lot of heat, one has to move a lot of air over a lot
of cold surface, so whether you use one large blower or several smaller
ones, you have to move the same volume of air over the same area cooled
the same amount, and one ALSO has to locate the ducting so that cold air
flows out, gets pulled through your systems, and then goes into the
return efficiently.  Otherwise your room will have hot spots and cool
spots, and hot-spot nodes will fail.  You can feel the heat just walking
past banks of nodes in our room, but there is always a feel of cold air
going past you towards the heat as well.

The other is the chiller, the part that actually takes the heat from the
room, squeezes it out (literally) into the outside air, and returns
cold-something (water, coolant, whatever) to the in-room heat exchanger.
Window unit ACs combine the two into a single package -- in central AC
and building AC they almost always are distinct.

In a centralized operation, the chillers might be located far from the
room.  They may have constraints on WHERE they can be located (typically
on the roof, for example, but likely only on certain parts of the scarce
roof real estate).  Getting unplanned insulated high volume pipes
through a big building made out of steel reinforced concrete is
nontrivial, getting power to the roof for new chillers is nontrivial,
putting the (heavy) chillers on the roof where they won't just fall
through onto the people working below is nontrivial.  Nontrivial =
expensive.  Then there is a wide range of ways that people/organizations
"bill" for this sort of construction -- beware creative accounting.

So sure, it might cost a relatively small amount to install a relatively
large number of relatively cheap chillers and heat exchangers IF your
room e.g. has an outside wall with a preexisting window or ductwork to
the outside, there is plenty of room outside for a concrete pad and
wiring to support the chillers, and so forth.  OTOH, this might not work
if you are in the basement (we are) of a large building with an existing
chiller delivery system and "have" to add or upgrade capacity in your
existing chiller farm (as does Bob, clearly, and likely many others),
if you get "billed" according to how they account for the cost (which
may add in all sorts of administrative, architectural, and engineering
expenses and not just the cost of the hardware), and if your room is
fairly small and has relatively low ceilings (precluding lots of
ceiling mounted heat exchangers).

YMMV.  Some sites might be able to do things more cheaply than others,
and it wouldn't surprise me at all to see a factor of 4 difference in
cost from one end to the other.  Ours was quite expensive, but it gives
us a pretty good node density in a relatively small space, and SPACE is
very, very "expensive" in our building (which we "share" with math while
both departments try to grow).

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From maurice at harddata.com  Wed Apr 24 07:26:51 2002
From: maurice at harddata.com (Maurice Hilarius)
Date: Wed, 24 Apr 2002 08:26:51 -0600
Subject: Beowulf digest, Vol 1 #841 - 9 msgs
In-Reply-To: <200204241357.g3ODvUb00963@blueraja.scyld.com>
Message-ID: <5.1.0.14.2.20020424082446.03967e10@mail.harddata.com>

With regards to your message at 07:57 AM 4/24/02, 
beowulf-request at beowulf.org. Where you stated:

>Message: 5
>Date: Wed, 24 Apr 2002 11:18:46 +0100
>From: javier.iglesias at freesurf.ch
>To: beowulf at beowulf.org
>Subject: Suggestions on fiber Gigabit NICs
>
>Hi all,
>
>To cope with some network bottleneck problems leading to calculation
>crashes, we envisage to migrate our 18-nodes' (bi-AMD 1600+/Tyan Tiger
>MP/FastEthernet/Scyld 27-b8) master to Gigabit.
>
>I would like to get your feelings/experiences on two fiber Gigabit
>NICs :
You should probably consider the SysKonnect and Intel offerings too.

Also, if all the machines in the cluster are close together, why fibre?
Copper is a lot less expensive.

Also, why Scyld?
More modern bproc and other parts are available as downloadable GPL materials.


With our best regards,

Maurice W. Hilarius       Telephone: 01-780-456-9771
Hard Data Ltd.               FAX:       01-780-456-9772
11060 - 166 Avenue        mailto:maurice at harddata.com
Edmonton, AB, Canada      http://www.harddata.com/
    T5X 1Y3

Ask me about the UP1500 Alpha - Full systems from $3,500!


From sp at scali.com  Tue Apr 23 16:45:24 2002
From: sp at scali.com (Steffen Persvold)
Date: Wed, 24 Apr 2002 01:45:24 +0200 (CEST)
Subject: Kidger's comments on Quadric's design and performance
In-Reply-To: <3CC514CE.8D08E38@lfbs.rwth-aachen.de>
Message-ID: <Pine.LNX.4.30.0204240046250.5858-100000@elin.scali.no>

On Tue, 23 Apr 2002, Joachim Worringen wrote:
[snip]
> Richard Fryer wrote:
> > Also a brief note about the Dolphin product line, since the issue of link
> > saturation has come up:  - they DO also sell switches - or at least offer
> > them.  And if you check the SCI specification, you'll see that there are
> > some elaborate discussions of fabric architectures that the protocol
> > supports and switches enable.  What I DO NOT know is if the SCALI software
> > supports switch-based operation, and also don't know what the impact is on
> > the system cost per node.  My 'inexperienced' assessment of the appeal in
> > the Dolphin family is that you can start without the switch and later add it
> > if the performance benefit warrants.  That's what I'd say if I were selling
> > them anyway - and didn't know otherwise.  :-)
>
> The "external" switches are not designed for large-scale HPC
> applications (although they scale quite well inside the range of their
> supported number of nodes), but for high-performance, high-availability
> small-scale cluster or embedded applications, such as those Sun sells. With
> ext. switches, you don't have to do anything to keep the network up if a
> node fails (and also nothing if it comes back, as SCI is not
> source-routed). In torus topologies, re-routing needs to be applied to
> bypass bad nodes (Scali does this on-the-fly).
>
> Scali does not support external switches AFAIK (at least doesn't sell
> such systems any longer), which is less a technical issue than a
> design issue, as the topology is fully transparent for the nodes
> accessing the network (they did use switches in the past, see
> http://www.scali.com/whitepaper/ehpc97/slide_9.html).
>

In theory there is no problem using the Scali SCI driver in an SCI-switched
environment; we just haven't got the software to manage the switch (i.e.
set up the routing tables). You could use Dolphin SW to manage the switch
though...

We used to use switches (and even the Dolphin driver) back in the SPARC
SBus days because (IIRC) the Dolphin SBus cards didn't have separate
out/in connectors (necessary to build ringlets and toruses).

> For large scale applications, distributed switches as in torus
> topologies scale better and are more cost-efficient (see
> http://www.scali.com/whitepaper/scieurope98/scale_paper.pdf and other
> resources). With switches, you need *a lot* of cables and switches
> (which doesn't stop Quadrics from doing so - resulting in an impressive 14
> miles of cables for a recent system (IIRC) with single cables being up
> to 25m in length). It would need to be verified whether such a system built
> with a Quadrics-like fat-tree topology using Dolphin's 8-port switches
> would scale better than the equivalent torus topology for different
> communication patterns. I doubt it. At least, the interconnect would cost
> a lot more (at least twice, or even more depending on the dimension of
> the tree).
>
> SCI-MPICH can be used with arbitrary SCI topologies (because it uses
> the SISCI interface and thus runs with Scali or Dolphin SCI drivers). It
> is not as closely coupled to the SCI drivers as ScaMPI is.
>

It is true that ScaMPI uses a (proprietary) interface between userspace
(MPI library) and kernel space (SCI driver), but the SCI topology is
still transparent to the userspace layer.

Best regards,
Steffen


From jcownie at etnus.com  Wed Apr 24 01:42:03 2002
From: jcownie at etnus.com (James Cownie)
Date: Wed, 24 Apr 2002 09:42:03 +0100
Subject: Kidger's comments on Quadric's design and performance 
In-Reply-To: Message from Joachim Worringen <joachim@lfbs.RWTH-Aachen.DE> 
   of "Tue, 23 Apr 2002 10:01:18 +0200." <3CC514CE.8D08E38@lfbs.rwth-aachen.de> 
Message-ID: <170ILj-0HT-00@etnus.com>

> > This message also reminded me to ask if a long-held opinion is valid - and
> > that opinion is "that a cache coherent interconnect would offer performance
> > enhancement when applications are at the 'more tightly coupled' end of the
> > spectrum."  I know that present PCI based interfaces can't do that without
> > invoking software overhead and latencies.  Anyone have data - or an argument
> > for invalidating this opinion?
> 
> You would need another programming model than MPI for that (see below),
> maybe OpenMP as you basically have the characteristics of a SMP system
> with cc-NUMA architecture.

No, you don't have an SMP model.

You need to distinguish between a system which has a single address
space and one with multiple address spaces accessed explicitly.
You can have a cache coherent interface in the second, but that
doesn't make it into the first. 

What you have in the Quadrics (assuming it's still like the Meiko in
this respect) is an explicit cache coherent remote store access model.

You can access remote store without the active collaboration of the
owner of the remote store (so it's not message passing), but you have
to _know_ that you're accessing remotely and generate different code
(maybe execute a channel program) to do it. You can't just indirect a
random int * and fetch from remote store.

In the OpenMP model you generally don't know which accesses are
remote; all of the OpenMP threads live in the same address space and can
pass pointers around at will. The compiler does not know which
references will be to non-local store.

Languages for the explicit remote store access model include

UPC               http://hpc.gwu.edu/~upc/
Co-array Fortran  http://www.co-array.org/
Titanium          http://www.cs.berkeley.edu/~liblit/titanium/

Of course these languages can also run on SMP machines (and indeed one
might hope that they can achieve better performance than something
like OpenMP, because the compiler can better lay out shared areas to
avoid false sharing effects and has better knowledge about which
accesses are to shared variables).

Enjoy

-- Jim 

James Cownie	<jcownie at etnus.com>
Etnus, LLC.     +44 117 9071438
http://www.etnus.com


From michael.worsham at intermedia.com  Wed Apr 24 06:10:31 2002
From: michael.worsham at intermedia.com (Worsham, Michael A.)
Date: Wed, 24 Apr 2002 09:10:31 -0400
Subject: Liquid cooling?
Message-ID: <EBFF9C6EBD5FD211838700805FA792A609B02865@TPAEXCH4>

Has anyone attempted to create a Beowulf cluster using extreme methods of
cooling, such as liquid cooling?

Example sites: http://www.koolance.com/, http://www.senfu.com.tw/, &
http://www.overclockershideout.com/

-- M


From jcownie at etnus.com  Wed Apr 24 06:26:53 2002
From: jcownie at etnus.com (James Cownie)
Date: Wed, 24 Apr 2002 14:26:53 +0100
Subject: Kidger's comments on Quadric's design and performance
Message-ID: <170MnN-0Lm-00@etnus.com>

Sorry if you get something like this message twice, I submitted it
once and nothing has come back, although my correction to one of the
www addresses went through :-(

Joachim Worringen <joachim at lfbs.RWTH-Aachen.DE> wrote

> > This message also reminded me to ask if a long-held opinion is valid - and
> > that opinion is "that a cache coherent interconnect would offer performance
> > enhancement when applications are at the 'more tightly coupled' end of the
> > spectrum."  I know that present PCI based interfaces can't do that without
> > invoking software overhead and latencies.  Anyone have data - or an argument
> > for invalidating this opinion?
> 
> You would need another programming model than MPI for that (see below),
> maybe OpenMP as you basically have the characteristics of a SMP system
> with cc-NUMA architecture.

No, you are confusing two completely different issues. To support
OpenMP you need a single address space which spans the processors. 

You can have cache coherent communication interfaces which do not
implement such a thing. (If it's still the same as it was at Meiko,
the Quadrics is an example of such an interface).

What Quadrics provides is an explicit remote store access model. You
can perform reads or writes cache coherently to a remote process'
address space, but you have to know that you're doing a remote access
and do something different to achieve it. You can't just indirect
through some random pointer and have that fetch data.

OpenMP assumes a single address space within which pointers can be
passed around freely, so will not implement easily on top of an
interface like Quadrics, even though that is (I believe) cache
coherent at both ends.

Languages which are built on an explicit remote store access model
include

Co-Array Fortran	http://www.co-array.org
UPC                     http://hpc.gwu.edu/~upc
Titanium                http://www.cs.berkeley.edu/Research/Projects/titanium/

In these languages the compiler always knows which accesses may be
remote.
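
A toy Python sketch of the distinction, in case it helps.  It is only a
model: the Process and Interconnect classes and the remote_get and
remote_put names are hypothetical and are not the Quadrics (or any
other) API.  The point is just that each process has its own store, and
a remote access has to name the owner explicitly instead of going
through an ordinary reference.

```python
class Process:
    """Hypothetical stand-in for one process with its own address space."""

    def __init__(self, rank):
        self.rank = rank
        self.store = {}  # local memory: name -> value


class Interconnect:
    """Hypothetical one-sided interface: the target process runs no code."""

    def __init__(self, processes):
        self.processes = {p.rank: p for p in processes}

    def remote_get(self, rank, name):
        # Explicit remote read: the caller must say which rank owns the data.
        return self.processes[rank].store[name]

    def remote_put(self, rank, name, value):
        # Explicit remote write into the owner's store.
        self.processes[rank].store[name] = value


if __name__ == "__main__":
    p0, p1 = Process(0), Process(1)
    net = Interconnect([p0, p1])
    p1.store["x"] = 42

    print("x" in p0.store)         # a plain local lookup on p0 finds nothing
    print(net.remote_get(1, "x"))  # explicit remote read from rank 1
    net.remote_put(1, "x", 43)     # explicit remote write; p1 takes no part
    print(p1.store["x"])
```

In an OpenMP-style single address space there would be no remote_get at
all; any reference might turn out to be remote, which is exactly what
the compiler cannot see.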

Of course such languages can also run on SMP boxes and use a
"genuinely" shared memory (and, indeed one might hope that the extra
information available in such languages allows the compiler to
generate better code for such a machine than one can generate from
OpenMP, since it should be able to avoid much false sharing).

-- Jim 

James Cownie	<jcownie at etnus.com>
Etnus, LLC.     +44 117 9071438
http://www.etnus.com


From jnemmers at helix.nih.gov  Wed Apr 24 07:45:19 2002
From: jnemmers at helix.nih.gov (Justin Nemmers)
Date: Wed, 24 Apr 2002 10:45:19 -0400
Subject: Burn-in Utilities
Message-ID: <p05100307b8ec74c09a69@[165.112.138.83]>

All:
	I am in search of a utility that will allow me to burn in a
new PC.  Ideally, it would peg the procs at 100% as well as exercise
the memory (as much as 2GB/node).  I know there is a Sun-provided
utility to do this on Sparc systems, but does anyone have a
suggestion for a Linux-based utility (Perl would work, too) that will
do the same thing?

Cheers,
Justin
-- 

System Administrator
National Institutes of Health
Center for Information Technology
9000 Rockville PK
Building 12B 2N/207
Bethesda, MD 20892-5680
301.496.0396
http://biowulf.nih.gov


From math at velocet.ca  Wed Apr 24 08:30:08 2002
From: math at velocet.ca (Velocet)
Date: Wed, 24 Apr 2002 11:30:08 -0400
Subject: Burn-in Utilities
In-Reply-To: <p05100307b8ec74c09a69@[165.112.138.83]>; from jnemmers@helix.nih.gov on Wed, Apr 24, 2002 at 10:45:19AM -0400
References: <p05100307b8ec74c09a69@[165.112.138.83]>
Message-ID: <20020424113008.E56252@velocet.ca>

On Wed, Apr 24, 2002 at 10:45:19AM -0400, Justin Nemmers's all...
> All:
> 	I am in search of a utility that will allow me to burn-in a 
> new PC.  Ideally, it would peg the procs at 100% as well as exercise
> the memory (as much as 2GB/node).  I know there is a Sun-provided
> utility to do this on Sparc systems, but does anyone have a
> suggestion for a Linux-based utility (Perl would work, too) that will
> do the same thing?

The packages (in Debian and Red Hat AFAIK) cpuburn and memtest will do
you nicely.

We run 5 or so of each of burnMMX, burnK7 and memtest on our Athlon machines for
2-3 days and see if even one crashes. We've had a crash on machines tested
AFTER being in service with no problems for 3-4 months. So it's definitely a
hardcore exercise.  Oh, we also stick dnetc on them on top of all that just to
make sure it's hurting.  I think they're set to generate the most heat
possible in the CPU during operation. They definitely draw the most current -
when we were first setting up our cluster and weren't sure of power draw,
8 dual 1.333GHz Athlon boards (no drives) would run G98 fine on a 15
amp circuit - as soon as we ran burnMMX/burnK7 we'd blow breakers.

We run 5-10 to get a nice high context switch rate going and exercise the OS as
well ;) We (through trial and error) found that running only 1 each of
burnMMX/burnK7 at a time will often not crash for days, whereas running 5-10
will.
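
A rough Python sketch of that kind of harness, for what it's worth.
This is only an illustration, not the setup described above: the
function name and parameters are made up, and it assumes the burn
programs keep running until killed, so any copy that exits on its own
before the deadline is reported as a failure.

```python
import subprocess
import time


def run_burn_in(commands, copies, duration_s, poll_s=1.0):
    """Start `copies` of each command; return those that exited early.

    Each command is an argv list.  The result is a list of
    (command, returncode) pairs for processes that stopped by themselves.
    """
    procs = [(cmd, subprocess.Popen(cmd))
             for cmd in commands for _ in range(copies)]
    deadline = time.monotonic() + duration_s
    try:
        # Poll until the deadline, or until any process exits on its own.
        while time.monotonic() < deadline:
            if any(proc.poll() is not None for _, proc in procs):
                break
            time.sleep(poll_s)
        # Anything that has already exited did so before we killed it.
        return [(cmd, proc.returncode) for cmd, proc in procs
                if proc.poll() is not None]
    finally:
        # Clean up whatever is still running.
        for _, proc in procs:
            if proc.poll() is None:
                proc.kill()
            proc.wait()


# Hypothetical use, assuming burnMMX and burnK7 are on PATH:
#   run_burn_in([["burnMMX"], ["burnK7"]], copies=5,
#               duration_s=3 * 24 * 3600, poll_s=60)
```

An empty list after the full duration means every copy survived; this
only catches a program exiting, not a hard lockup of the whole machine.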

(In fact, we only consider a crash within 12 hours to be a reason to RMA it if
it's slated for a workstation running Windows.  12 hours of that test is almost
equivalent to a crash every 3-6 months of regular LINUX desktop use (and with
Windows how can you tell? :))

It's actually surprising how well you can measure the quality of boards that
way. Out of 40 Tyan 246x boards we found one bad stick of RAM and 0 bad CPUs
or boards using this method. However, with ECS K75As we found 1/10 boards as
shipped to us would die in 1-6 hours under this load, and another 1/10 would
die within the 2-3 days. while ! burnMMX; do RMA_via_VAR; done

Nonetheless we've never seen every unit of a certain brand always crash within
that time - eventually we get good boards - so using proper sorting after
testing in this manner you can always end up with a set of good boards (at
least as far as these tests are concerned). So far with any board that makes
it past 2-3 days of this we've never seen a problem with Gaussian98, Gromacs
or distributed-net afterwards (at least until we hit long term electron
migration path problems due to regular CPU heat wear and tear...) but none of
our boards/CPUs (the PcChips M817 LMRs are hitting 16 months of continuous
operation) are there yet.

/kc


> 
> Cheers,
> Justin
> -- 
> 
> System Administrator
> National Institutes of Health
> Center for Information Technology
> 9000 Rockville PK
> Building 12B 2N/207
> Bethesda, MD 20892-5680
> 301.496.0396
> http://biowulf.nih.gov

-- 
Ken Chase, math at velocet.ca  *  Velocet Communications Inc.  *  Toronto, CANADA 


From walke at usna.edu  Wed Apr 24 08:32:56 2002
From: walke at usna.edu (LT V. H. Walke)
Date: 24 Apr 2002 11:32:56 -0400
Subject: Burn-in Utilities
In-Reply-To: <p05100307b8ec74c09a69@[165.112.138.83]>
References: <p05100307b8ec74c09a69@[165.112.138.83]>
Message-ID: <1019662376.28412.11.camel@vhwalke.mathsci.usna.edu>

Try

cpu-burn http://users.ev1.net/~redelm/

or MemTest86 http://www.teresaudio.com/memtest86/

Good luck,
Vann


On Wed, 2002-04-24 at 10:45, Justin Nemmers wrote:
> All:
> 	I am in search of a utility that will allow me to burn-in a 
> new PC.  Ideally, it would peg the procs at 100% as well as exercise
> the memory (as much as 2GB/node).  I know there is a Sun-provided
> utility to do this on Sparc systems, but does anyone have a
> suggestion for a Linux-based utility (Perl would work, too) that will
> do the same thing?
> 
> Cheers,
> Justin
> -- 
> 
> System Administrator
> National Institutes of Health
> Center for Information Technology
> 9000 Rockville PK
> Building 12B 2N/207
> Bethesda, MD 20892-5680
> 301.496.0396
> http://biowulf.nih.gov
-- 
----------------------------------------------------------------------
  Vann H. Walke                        Office: Chauvenet 341
  Computer Science Dept.               Ph:  410-293-6811
  572 Holloway Road, Stop 9F           Fax: 410-293-2686
  United States Naval Academy          email: walke at usna.edu
  Annapolis, MD 21402-5002             http://www.cs.usna.edu/~walke
----------------------------------------------------------------------


From cozzi at nd.edu  Wed Apr 24 08:44:04 2002
From: cozzi at nd.edu (Marc Cozzi)
Date: Wed, 24 Apr 2002 10:44:04 -0500
Subject: Liquid cooling?
Message-ID: <F163413C9250D211A55C0060979D52803DFA7B@hertz.rad.nd.edu>

I have a difficult time understanding all this overclocking/cooling stuff.
If you go to some of the web sites listed below you get a real sense of
all show, no go: cases carved out on the side, plastic windows installed,
neon lights installed inside, extra fans on the top, bottom, sides, front
and back.

The original stuff is engineered, for God's sake! Although in some cases
poorly.  Liquid cooling an overclocked AMD chip could cost between $50 and
$400!!!  Wouldn't that money be better spent toward a faster chip? Perhaps
as much as 250% faster compared to what little you may get with
overclocking? Not to mention the warranty problems with damn near
everything in the box, and the potential problems with broken cooling
lines inside and out of the boxes. I would think for most of us time is
money and the maintenance of such systems (we are talking of clusters and
not single systems) would be prohibitive. Unless the boss is easily fooled.

Reminds me of the 1966 Chevrolet Impalas with hydraulics, dingo balls, neon
license plate frames...

OK, maybe it makes limited sense for 1U type systems...
 
   --marc
 

-----Original Message-----
From: Worsham, Michael A. [mailto:michael.worsham at intermedia.com]
Sent: April 24, 2002 8:11 AM
To: 'beowulf at beowulf.org'
Subject: Liquid cooling?


Has anyone attempted to create a Beowulf cluster using extreme methods of
cooling, such as liquid cooling?

Example sites: http://www.koolance.com/, http://www.senfu.com.tw/, &
http://www.overclockershideout.com/

-- M 


From josip at icase.edu  Wed Apr 24 08:56:53 2002
From: josip at icase.edu (Josip Loncaric)
Date: Wed, 24 Apr 2002 11:56:53 -0400
Subject: cooling
References: <Pine.OSF.4.21.0204240555450.29099-100000@holodec15.aei-potsdam.mpg.de>
Message-ID: <3CC6D5C5.A83691AA@icase.edu>

Steven Berukoff wrote:
> 
> We just purchased ~150 dual AMDs, and are cooling them with 4 Fujitsu
> ceiling-mounted air-conditioners: about 50kW of AC cost us about $25k,
> which is about 10% of the cost of the machines.

Your setup also has the advantage of redundancy.  We've got a large AC
(a 5-ton unit) plus two smaller/self-contained AC units for those
inevitable times when the large unit is not performing properly.

AC typically breaks down when you need it the most (i.e. on the hottest
day when it has to work the hardest).  We try to have some redundancy
and the ability to stage 1-2-3 AC units as needed.  Oversized AC to
handle even the hottest days is not as efficient, and when it breaks
down, most of the computers have to be powered off...

Sincerely,
Josip

P.S.  Temperature alarms are a highly recommended.  One can even wire
the temperature sensor to kill the computer power at the circuit breaker
box when the temperature exceeds dangerous levels.  Automatic handling
of overheating is needed because when AC fails, the temperature in a
small room (heated by 5-10KW dissipated by the computers) can go above
100 deg F within ~30 minutes.  This may be faster than a system manager
can drive in on a weekend -- so automated shutdown may be needed even if
the temperature alarm pages someone.
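
The staged response described above (page someone first, cut power if
nobody can get there in time) could be sketched roughly as follows.  This
is only an illustration: the thresholds, the function names and the idea
of feeding it readings from some sensor are assumptions, and the actual
cutoff (a relay at the breaker box, a UPS command) is site-specific and
not shown.

```python
# Hypothetical thresholds in degrees Fahrenheit; tune them for your room.
ALARM_F = 85.0      # page a human
SHUTDOWN_F = 100.0  # cut power without waiting for anyone


def classify_temperature(temp_f, alarm_f=ALARM_F, shutdown_f=SHUTDOWN_F):
    """Map one sensor reading to an action: 'ok', 'alarm' or 'shutdown'."""
    if temp_f >= shutdown_f:
        return "shutdown"
    if temp_f >= alarm_f:
        return "alarm"
    return "ok"


def minutes_until(limit_f, current_f, rate_f_per_min):
    """Linear estimate of the time left before limit_f is reached."""
    if current_f >= limit_f:
        return 0.0
    if rate_f_per_min <= 0:
        return float("inf")
    return (limit_f - current_f) / rate_f_per_min
```

With a room at 70 deg F warming at 1 deg F per minute,
minutes_until(100, 70, 1.0) gives 30 minutes, which is the kind of margin
mentioned above and is why the shutdown branch should not depend on a
human answering the page.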

-- 
Dr. Josip Loncaric, Research Fellow               mailto:josip at icase.edu
ICASE, Mail Stop 132C           PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center             mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134


From john.hearns at cern.ch  Wed Apr 24 08:57:13 2002
From: john.hearns at cern.ch (John Hearns)
Date: 24 Apr 2002 17:57:13 +0200
Subject: Liquid cooling?
In-Reply-To: <EBFF9C6EBD5FD211838700805FA792A609B02865@TPAEXCH4>
References: <EBFF9C6EBD5FD211838700805FA792A609B02865@TPAEXCH4>
Message-ID: <1019663834.6257.10.camel@ues4>

On Wed, 2002-04-24 at 15:10, Worsham, Michael A. wrote:
> Has anyone attempted to create a beowulf cluster using extreme methods of
> cooling, such as liquid cooling?
> 
> Example sites: http://www.koolance.com/, http://www.senfu.com.tw/, &
> http://www.overclockershideout.com/
> 

Well, I think Robert Brown has FINALLY been beaten here.
You're not going to install Freon tanks, complete with plastic
fish are you Bob? 
I just have this bizarre vision of Bob in an aqualung visiting
a Freon-flooded machine room...


From becker at scyld.com  Wed Apr 24 09:34:43 2002
From: becker at scyld.com (Donald Becker)
Date: Wed, 24 Apr 2002 12:34:43 -0400 (EDT)
Subject: List moderation and related info
In-Reply-To: <170MnN-0Lm-00@etnus.com>
Message-ID: <Pine.LNX.4.33.0204241203450.17528-100000@presario>

On Wed, 24 Apr 2002, James Cownie wrote:
> Sorry if you get something like this message twice, I submitted it
> once and nothing has come back, although my correction to one of the
> www addresses went through :-(

Your messages to the list are held for moderation due to the header
contents.  Your messages don't appear to be coming from your subscribed
address.  I've since added an exception that should allow your posts to
be automatically approved.

The message filters on the Beowulf lists are frequently updated, and for
good reason.  There are many attempts to post spam and virus emails.
Only a tiny fraction manages to slip through, and I add new rules to
attempt to catch future copies.  But the spammers learn new tricks.
Readers should expect occasional bogus messages.

On the topic of moderation-holds: I try to do the approval moderation each
day.  I'm sometimes traveling and don't have the opportunity to
moderate.  (In the past delegating this work has not been reliable.)
To avoid moderation holds:
  Have your post appear to come from your subscribed email address
  Use a non-trivial subject line
  Post from an IP address that can be reverse-resolved
  Don't post from any machine in ".kr", ".cn" or ".pt".
  Avoid any mention of printer supplies or "enlargement" drugs
     (unless the latter is directly related to your cluster use ;->)


-- 
Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993


From joachim at lfbs.RWTH-Aachen.DE  Wed Apr 24 09:16:01 2002
From: joachim at lfbs.RWTH-Aachen.DE (Joachim Worringen)
Date: Wed, 24 Apr 2002 18:16:01 +0200
Subject: Kidger's comments on Quadric's design and performance
References: <170MnN-0Lm-00@etnus.com>
Message-ID: <3CC6DA41.367BEB55@lfbs.rwth-aachen.de>

James Cownie wrote:
> 
> Sorry if you get something like this message twice, I submitted it
> once and nothing has come back, although my correction to one of the
> www addresses went through :-(
> 
> Joachim Worringen <joachim at lfbs.RWTH-Aachen.DE> wrote
> 
> > > This message also reminded me to ask if a long-held opinion is valid - and
> > > that opinion is "that a cache coherent interconnect would offer performance
> > > enhancement when applications are at the 'more tightly coupled' end of the
> > > spectrum."  I know that present PCI based interfaces can't do that without
> > > invoking software overhead and latencies.  Anyone have data - or an argument
> > > for invalidating this opinion?
> >
> > You would need another programming model than MPI for that (see below),
> > maybe OpenMP as you basically have the characteristics of a SMP system
> > with cc-NUMA architecture.
> 
> No, you are confusing two completely different issues. To support
> OpenMP you need a single address space which spans the processors.

You are right, this is completely different. However, I did not mean
that connecting nodes of a cluster with a cache-coherent interface
"gives you an SMP", but more precisely "gives the shared parts of the
distributed distinct address spaces nearly SMP-like access
characteristics", with respect to a suitable programming model. 

This would enable a matching OpenMP-Compiler/run-time-lib to generate
and run code with (more or less) SMP-like performance as does the OMNI
OpenMP-Compiler (currently on top of a software DSM library SCASH on top
of SCore, see http://www.hpcc.jp/Omni - this is all software which is
much more performance-sensitive to bad data-placement and has generally a
much higher overhead than such a hw-based solution would have).

There is something similar on top of SCI, namely the HAMSTER project
(http://hamster.informatik.tu-muenchen.de/), but w/o OpenMP, IIRC, and
still some software-overhead to "simulate" cachable remote memory on top
of SCI-connected PCs.

With Quadrics, this should be possible in an even more efficient manner
due to the hardware-MMU and -TLB on the adapter.

To have a real cc-NUMA-SMP, the integration needs to be higher (HP
X-Class, DG/IBM NUMA-Q, ...), this is for sure.  The question is: are
large-scale SMPs as sold by IBM, Sun, ... not the better solution for
such tasks? Quadrics is expensive, and you still have to manage a bunch
of PCs instead of a nice, single SMP.

  Joachim

-- 
|  _  RWTH|  Joachim Worringen
|_|_`_    |  Lehrstuhl fuer Betriebssysteme, RWTH Aachen
  | |_)(_`|  http://www.lfbs.rwth-aachen.de/~joachim
    |_)._)|  fon: ++49-241-80.27609 fax: ++49-241-80.22339


From trey at ERC.MsState.Edu  Wed Apr 24 09:43:23 2002
From: trey at ERC.MsState.Edu (Trey Breckenridge)
Date: Wed, 24 Apr 2002 11:43:23 -0500 (CDT)
Subject: cooling
Message-ID: <200204241643.g3OGhNGU010172@ERC.MsState.Edu>

>Steven Berukoff wrote:
>> 
>> We just purchased ~150 dual AMDs, and are cooling them with 4 Fujitsu
>> ceiling-mounted air-conditioners: about 50kW of AC cost us about $25k,
>> which is about 10% of the cost of the machines.

One of the major disadvantages of overhead A/C units is that when the
overflow drain clogs (and it will), the excess water will spill into
your machines.  Another point to consider is access for maintenance of
the A/C.  In some cases, it may require you to shut down and move your
racks for the maintenance personnel to have full access to the A/C units.

A second disadvantage with ceiling mounted units (with the supply-side
vent in the ceiling) is that your cooling efficiency will be lower than
with an under-floor based system.  Basically, from a ceiling mounted
unit, the cool air from the supply will have to travel the entire
height of the room in order to "reach" the machine that is the farthest
away (your bottommost machine in a rack) which may be 10 feet.  When
cooling from under the floor in a raised floor scenario, the furthest
machine from the supply is maybe 6-8 feet away (the topmost machine in
your rack).  The difference in the cold air velocity at 6 feet versus
10 feet may be significant.  My experience is that it does make a
difference.  In our data center with ceiling mounted A/C's, the
machines at the bottom of our racks run considerably hotter than the
top machines.  However, in our second data center, we have under-floor
A/C.  The machine temperatures in the racks there are much more
consistent (and cooler on average) from top to bottom.

Of course, all of this is just my opinion.

__________________________________________________________________________
   Trey Breckenridge - Computing Systems Manager - trey at ERC.MsState.Edu
         Mississippi State University Engineering Research Center


From pzb at datastacks.com  Wed Apr 24 09:59:34 2002
From: pzb at datastacks.com (Peter Bowen)
Date: 24 Apr 2002 12:59:34 -0400
Subject: Burn-in Utilities
In-Reply-To: <p05100307b8ec74c09a69@[165.112.138.83]>
References: <p05100307b8ec74c09a69@[165.112.138.83]>
Message-ID: <1019667574.6829.3.camel@gargleblaster.caffeinexchange.org>

On Wed, 2002-04-24 at 10:45, Justin Nemmers wrote:
> All:
> 	I am in search of a utility that will allow me to burn-in a 
> new PC.  Ideally, it would peg the procs at 100% as well as exercise 
> the memory (as much as 2Gb/Node).  I know there is a Sun provided 
> utility to do this on Sparc systems, but does anyone have a 
> suggestion for a linux-based (perl would work, too) that will do the 
> same thing?

The best answer is cerberus.  VA Linux Systems wrote it for burn-in
testing their machines, and open sourced it for others to use.  Red Hat
is maintaining a version of it that probably will do many of the things
you want.

See http://people.redhat.com/bmatthews/cerberus and
http://sourceforge.net/projects/va-ctcs for more info.
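
For a quick sanity check before installing a real burn-in suite, the
general idea (keep every CPU busy, write and re-read memory) could be
sketched in Python as below.  This is not how cerberus works and is not
a substitute for it; the function names are made up for illustration,
and a simple pattern check like this one will miss most real memory
faults.

```python
import multiprocessing
import time


def memory_pattern_ok(n_bytes, pattern=0xA5):
    """Fill a buffer with one byte value and check that it reads back intact."""
    buf = bytearray([pattern]) * n_bytes
    return buf.count(pattern) == n_bytes


def spin(seconds):
    """Keep one CPU busy with integer arithmetic for roughly `seconds`."""
    deadline = time.time() + seconds
    x = 1
    while time.time() < deadline:
        x = (x * 1103515245 + 12345) % 2147483648
    return x


def burn(seconds, workers=None):
    """Run one spin() per CPU in parallel; returns the number of workers used."""
    workers = workers or multiprocessing.cpu_count()
    with multiprocessing.Pool(workers) as pool:
        pool.map(spin, [seconds] * workers)
    return workers
```

To use it, call something like burn(600) from under an
`if __name__ == "__main__":` guard and watch the load and temperatures
while it runs.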

Thanks.
Peter


From maurice at harddata.com  Wed Apr 24 10:41:59 2002
From: maurice at harddata.com (Maurice Hilarius)
Date: Wed, 24 Apr 2002 11:41:59 -0600
Subject: Beowulf digest, Vol 1 #843 - 2 msgs
In-Reply-To: <200204241601.g3OG17b06015@blueraja.scyld.com>
Message-ID: <5.1.0.14.2.20020424114047.07630ba0@mail.harddata.com>

With regards to your message at 10:01 AM 4/24/02, 
beowulf-request at beowulf.org. Where you stated:

>On Wed, 2002-04-24 at 15:10, Worsham, Michael A. wrote:
> > Has anyone attempted to create a beowulf cluster using extreme methods of
> > cooling, such as liquid cooling?
> >
> > Example sites: http://www.koolance.com/, http://www.senfu.com.tw/, &
> > http://www.overclockershideout.com/
> >
>
>Well, I think Robert Brown has FINALLY been beaten here.
>You're not going to install Freon tanks, complete with plastic
>fish are you Bob?
>I just have this bizarre vision of Bob in an aqualung visiting
>a Freon-flooded machine room...

Interesting you should ask, as we are about a month away from shipping our 
clusters (dual athlon and dual XEON) with liquid cooling using a 
centralised radiator and pump unit per rack of machines..


With our best regards,

Maurice W. Hilarius       Telephone: 01-780-456-9771
Hard Data Ltd.               FAX:       01-780-456-9772
11060 - 166 Avenue        mailto:maurice at harddata.com
Edmonton, AB, Canada      http://www.harddata.com/
    T5X 1Y3

Ask me about the UP1500 Alpha - Full systems from $3,500!


From rgb at phy.duke.edu  Wed Apr 24 10:49:03 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed, 24 Apr 2002 13:49:03 -0400 (EDT)
Subject: Liquid cooling?
In-Reply-To: <1019663834.6257.10.camel@ues4>
Message-ID: <Pine.LNX.4.44.0204241344180.17167-100000@ganesh.phy.duke.edu>

On 24 Apr 2002, John Hearns wrote:

> On Wed, 2002-04-24 at 15:10, Worsham, Michael A. wrote:
> > Has anyone attempted to create a beowulf cluster using extreme methods of
> > cooling, such as liquid cooling?
> > 
> > Example sites: http://www.koolance.com/, http://www.senfu.com.tw/, &
> > http://www.overclockershideout.com/
> > 
> 
> Well, I think Robert Brown has FINALLY been beaten here.
> You're not going to install Freon tanks, complete with plastic
> fish are you Bob? 
> I just have this bizarre vision of Bob in an aqualung visiting
> a Freon-flooded machine room...

Oh no, this has all been discussed on the list before (many
times, actually -- look back at the archives with google to find some of
them) and MY favorite solution is to build a really large computer room
in, say, Antarctica and just put fans in the windows.

Liquid solutions (no pun intended:-) tend to be expensive, messy,
environmentally nasty (if you don't use water), risky (water and
electricity don't mix well) and, as you note, servicing the machines in
a full immersion rack can be, well, "involved".

;-)

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From rgb at phy.duke.edu  Wed Apr 24 10:51:22 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Wed, 24 Apr 2002 13:51:22 -0400 (EDT)
Subject: List moderation and related info
In-Reply-To: <Pine.LNX.4.33.0204241203450.17528-100000@presario>
Message-ID: <Pine.LNX.4.44.0204241350130.17167-100000@ganesh.phy.duke.edu>

On Wed, 24 Apr 2002, Donald Becker wrote:

>   Don't post from any machine in ".kr", ".cn" or ".pt".

Good choices:-)

>   Avoid any mention of printer supplies or "enlargement" drugs
>      (unless the latter is directly related to your cluster use ;->)

But wait, then how did this original message get in?  How did this
reply get in?

Self-referential systems are all too confusing...;-)

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From fraser5 at cox.net  Wed Apr 24 11:14:50 2002
From: fraser5 at cox.net (Jim Fraser)
Date: Wed, 24 Apr 2002 14:14:50 -0400
Subject: Liquid cooling?
In-Reply-To: <1019663834.6257.10.camel@ues4>
Message-ID: <000001c1ebbb$ed5e3370$0800005a@papabear>

While I think overclocking is kinda silly in extreme cases (like hot-rods)
and nearly pointless for most real serious computing applications, I think
the water-cooling has real merits and could be considered for dense
clusters.
1) Water cools orders of magnitude better than air
2) It is far quieter
3) CPU temps hardly vary as compared to air (even under load) (better
stability)
4) It does not have to be that much more expensive than high-end air cooling
(there is a real price to cool dual CPUs in a 1U steel box.)
5) Leaks are almost unheard-of, and are not as catastrophic as they sound
(distilled water is generally not a problem, but a leak is a possible mode
of failure)
6) Dense water-cooled systems could easily be engineered to remove bulk
heat far better than the rows of tiny little cheap fans whirring at 7000
rpm...talk about failure rates!?! Fans are the parts most prone to failures
that result in hardware breakdowns.  Don't discard water cooling.
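
To put a rough number on point 1, one can compare how much heat a given
volume of each fluid carries per degree of temperature rise.  The sketch
below uses approximate room-temperature textbook values for density and
specific heat; it only covers heat carried by the coolant flow, not the
heat transfer at the CPU surface, so treat it as a ballpark.

```python
# Approximate room-temperature properties (textbook values).
WATER = {"density_kg_m3": 998.0, "cp_j_per_kg_k": 4182.0}
AIR = {"density_kg_m3": 1.2, "cp_j_per_kg_k": 1005.0}


def volumetric_heat_capacity(fluid):
    """Joules absorbed per cubic metre of fluid per kelvin of temperature rise."""
    return fluid["density_kg_m3"] * fluid["cp_j_per_kg_k"]


def flow_for_load(watts, delta_t_k, fluid):
    """Volume flow (m^3/s) needed to carry `watts` with a `delta_t_k` coolant rise."""
    return watts / (volumetric_heat_capacity(fluid) * delta_t_k)


# How many times more heat a volume of water carries than the same volume of air.
ratio = volumetric_heat_capacity(WATER) / volumetric_heat_capacity(AIR)
```

With these values the ratio comes out at roughly 3500, so a few
millilitres of water per second can carry away what takes many litres of
air per second.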


Also for the non-budget minded, there is the complete board submergence
route with hydrofluoroether (HFE).  3M makes it:
http://products.3m.com/usenglish/mfg_industrial/elec_materials.jhtml?powurl=SKKXCT77P5be2FCSL3BCQXgeGST1T4S9TCgv5NGBVHDQ19gl
This is expensive (~250 bucks+/gal) but exceptionally effective.  I think
CRAY made a machine that used a similar fluid once.
A couple of weeks ago I saw the guys on TECHTV demonstrate this stuff in a
fish-tank like set-up:
http://www.techtv.com/screensavers/supergeek/story/0,24330,3380128,00.html

jim




From James.P.Lux at jpl.nasa.gov  Wed Apr 24 12:47:07 2002
From: James.P.Lux at jpl.nasa.gov (Jim Lux)
Date: Wed, 24 Apr 2002 12:47:07 -0700
Subject: Liquid cooling?
In-Reply-To: <000001c1ebbb$ed5e3370$0800005a@papabear>
References: <1019663834.6257.10.camel@ues4>
Message-ID: <5.1.0.14.2.20020424123459.00b1e010@mail1.jpl.nasa.gov>

Jim makes a number of good points, some of which would also apply to 
conduction cooled equipment (as used in space applications and some 
avionics, where you can't depend on the air density). In my efforts to 
design a "field usable Beowulf" which is entirely sealed (mud/dust proof, 
etc.), I had come to similar conclusions about the viability of liquid 
cooling (mass gets to be a bit of a problem, though).

However, a basic philosophical issue arises...  The "beauty" of a Beowulf 
is that it uses "commodity" computers, and so, can capitalize on enormous 
efficiencies of scale for the huge consumer market.  As you start to go 
towards less consumer configurations, you're straying, to a certain extent 
from the "pile of cheap PC's" paradigm and back towards the "behemoth in 
the machine room" model.  Certainly, the 500+ node computers that folks are 
putting together with 1U cases, dozens of machines in dozens of racks, with 
dedicated cooling, etc., is straying pretty far from the original Beowulf 
concept.

It still is cluster computing, and maybe what makes it Beowulf'ish is not 
the hardware, per se, but the fact that you are using "off the shelf" 
cheap/free software (Linux, e.g.), and "off the shelf" interconnects??

By the way, I wouldn't fool with DI water in a totally immersed system.. 
too corrosive.  If you don't want to pop for the Fluorinerts, then various 
silicone and mineral oils would work well, are non-toxic, and inexpensive. 
These have been used for decades for immersed cooling of all sorts of stuff 
(transformers, for instance). You'd want to assess compatibility with 
adhesives and existing coatings.  And, you better hope that it really does 
improve reliability... it's going to be a mess to service.  You might want 
to do some hard core burn in first and get past the infant mortalities, 
before you "literally take the plunge".

And, while plunging the Mobo in oil wouldn't bother it, I wonder if the 
same is true of things like the hard disk drive?  They're probably vented, 
and/or have moving parts that are outside the sealed area.

At 02:14 PM 4/24/2002 -0400, Jim Fraser wrote:

>While I think overclocking is kinda silly in extreme cases (like hot-rods)
>and nearly pointless for most real serious computing applications, I think
>the water-cooling has real merits and could be considered for dense
>clusters.
>1) Water cools orders of magnitude better than air
>2) It is far quieter
>3) CPU temps hardly vary as compared to air (even under load) (better
>stability)
>4) It does not have to be that much more expensive than high-end air cooling
>(there is a real price to cool dual CPUs in a 1U steel box.)
>5) Leaks are almost unheard-of, and are not as catastrophic as they sound
>(distilled water is generally not a problem, but a leak is a possible mode
>of failure)
>6) Dense water-cooled systems could easily be engineered to remove bulk
>heat far better than the rows of tiny little cheap fans whirring at 7000
>rpm...talk about failure rates!?! Fans are the parts most prone to failures
>that result in hardware breakdowns.  Don't discard water cooling.
>
>jim
>
>
>---
>On Wed, 2002-04-24 at 15:10, Worsham, Michael A. wrote:
> > Has anyone attempted to create a beowulf cluster using extreme methods of
> > cooling, such as liquid cooling?
> >
> > Example sites: http://www.koolance.com/, http://www.senfu.com.tw/, &
> > http://www.overclockershideout.com/
> >
>
>Well, I think Robert Brown has FINALLY been beaten here.
>You're not going to install Freon tanks, complete with plastic
>fish are you Bob?
>I just have this bizarre vision of Bob in an aqualung visiting
>a Freon-flooded machine room...
>

Jim Lux
Spacecraft Telecommunications Equipment Section
Jet Propulsion Laboratory
4800 Oak Grove Road, Mail Stop 161-213
Pasadena CA 91109

818/354-2075, fax 818/393-6875


From math at velocet.ca  Wed Apr 24 14:59:16 2002
From: math at velocet.ca (Velocet)
Date: Wed, 24 Apr 2002 17:59:16 -0400
Subject: Liquid cooling?
In-Reply-To: <Pine.LNX.4.44.0204241344180.17167-100000@ganesh.phy.duke.edu>; from rgb@phy.duke.edu on Wed, Apr 24, 2002 at 01:49:03PM -0400
References: <1019663834.6257.10.camel@ues4> <Pine.LNX.4.44.0204241344180.17167-100000@ganesh.phy.duke.edu>
Message-ID: <20020424175916.D12933@velocet.ca>

On Wed, Apr 24, 2002 at 01:49:03PM -0400, Robert G. Brown's all...
> On 24 Apr 2002, John Hearns wrote:
> 
> > On Wed, 2002-04-24 at 15:10, Worsham, Michael A. wrote:
> > > Has anyone attempted to create a beowulf cluster using extreme methods of
> > > cooling, such as liquid cooling?
> > > 
> > > Example sites: http://www.koolance.com/, http://www.senfu.com.tw/, &
> > > http://www.overclockershideout.com/
> > > 
> > 
> > Well, I think Robert Brown has FINALLY been beaten here.
> > You're not going to install Freon tanks, complete with plastic
> > fish are you Bob? 
> > I just have this bizarre vision of Bob in an aqualung visiting
> > a Freon-flooded machine room...
> 
> Oh no, this has all been discussed on the list before (many
> times, actually -- look back at the archives with google to find some of
> them) and MY favorite solution is to build a really large computer room
> in, say, Antarctica and just put fans in the windows.
> 
> Liquid solutions (no pun intended:-) tend to be expensive, messy,
> environmentally nasty (if you don't use water), risky (water and
> electricity don't mix well) and, as you note, servicing the machines in
> a full immersion rack can be, well, "involved".

Wasn't someone suggesting putting a huge machine room in Alaska for this
reason? Right near 'pacific rim fabric' and right near some huge
power plants in Alaska or what not?

Environmental damage notwithstanding.

Anyone ever sell the heat generated from the clusters to someone else? :)

/kc

> 
> ;-)
> 
>    rgb
> 
> -- 
> Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
> Duke University Dept. of Physics, Box 90305
> Durham, N.C. 27708-0305
> Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
> 
> 
> 

-- 
Ken Chase, math at velocet.ca  *  Velocet Communications Inc.  *  Toronto, CANADA 


From gerry at cs.tamu.edu  Wed Apr 24 15:48:58 2002
From: gerry at cs.tamu.edu (Gerry Creager N5JXS)
Date: Wed, 24 Apr 2002 17:48:58 -0500
Subject: List moderation and related info
References: <Pine.LNX.4.44.0204241350130.17167-100000@ganesh.phy.duke.edu>
Message-ID: <3CC7365A.7030105@cs.tamu.edu>

Robert G. Brown wrote:

> On Wed, 24 Apr 2002, Donald Becker wrote:
> 
> 
>>  Don't post from any machine in ".kr", ".cn" or ".pt".
>>
> 
> Good choices:-)


Some of my more entertaining non-technical reading, recently, has come 
from sites in those countries...

 
>>  Avoid any mention of printer supplies or "enlargement" drugs
>>     (unless the latter is directly related to your cluster use ;->)
>>
> 
> But wait, then how did this original message get in?  How did this
> reply get in?


Error in the human-scanning function?

 
> Self-referential systems are all too confusing...;-)


And to think that I learned to call 'em "self-eating watermelons"


gerry
--
Gerry Creager -- gerry at cs.tamu.edu
Network Engineering
Academy for Advanced Telecommunications and Learning Technologies		
Texas A&M University	979.458.4020  (Phone) -- 979.847.8578  (Fax)


From canon at nersc.gov  Wed Apr 24 16:01:48 2002
From: canon at nersc.gov (canon at nersc.gov)
Date: Wed, 24 Apr 2002 16:01:48 -0700
Subject: Power Strips and cables
Message-ID: <200204242301.g3ON1m930090@pookie.nersc.gov>

Greetings,

I was curious what novel ways everyone is
powering rack systems.  I'm mainly curious about
high density (1U) systems.  We currently are
racking 32 nodes in a rack and using 4 20A,
vertically mounted power strips.  I'm concerned 
that the next wave of machines will be too deep to 
accommodate the four power strips.  What is everyone
doing for these types of scenarios?  I don't
need remote power management at this point.
I'm more concerned with just fitting everything
in neatly.  Also, I've seen custom power cords
that clean things up.  Does anyone know a vendor
or supplier for these types of things?

Thanks in advance,

--Shane Canon


From Chester.Fitch at mdx.com  Wed Apr 24 16:14:59 2002
From: Chester.Fitch at mdx.com (Fitch, Chester)
Date: Wed, 24 Apr 2002 17:14:59 -0600
Subject: Liquid cooling?
Message-ID: <19E8BE159FECD4118FE700508BEE12D2012FF7EE@mdx-email1.den.mdx.com>

Yes, there was some talk on /. a while back about putting a server farm up
on the North slope of Alaska... here's the link:
http://slashdot.org/article.pl?sid=01/05/14/159258&mode=thread 

Idea was lots of cooling capacity (especially in winter) and lots of
low-cost natural gas to power the thing.. Problems, however, included
staffing and getting the data traffic to/from the lower 48 states.. (not to
mention the time required for a service call!) 

Interesting idea -- I actually used the idea as an exercise in class last
semester - as an (obviously extreme) exercise in facilities management. Point
was to get them to think about all the infrastructure we often take for
granted..
    
As far as selling the heat generated by our systems... A co-generation
facility off of the computer room? (Hmm.. Maybe, for some of our bigger
beowulfs..) But if your campus buildings have steam heating, well, there
might be something to it..

;-)

Chet 



From serguei.patchkovskii at sympatico.ca  Wed Apr 24 16:15:13 2002
From: serguei.patchkovskii at sympatico.ca (Serguei Patchkovskii)
Date: Wed, 24 Apr 2002 19:15:13 -0400
Subject: Intel releases C++/Fortran suite V 6.0 for Linux
References: <Pine.LNX.4.33.0204240954130.21703-100000@abel.mi.uib.no>
Message-ID: <004e01c1ebe5$e37f6660$6401a8c0@sympatico.ca>

> From: "Bjorn Tore Sund" <bjornts at mi.uib.no>
> I've been wanting to test these out, both in the previous versions
> and this, but as long as Intel are only releasing them as RedHat
> rpms, they are fundamentally useless on a SuSE system.  Or at least
> a lot of hassle to install.

No they are not - the installation script works without any changes, at
least on Suse 7.1 and Suse 7.3. It may whine a little about the kernel and
glibc versions, but that's it. I've been running both 5.x and 6.x (beta)
versions of Intel's compiler under Suse for the last six months, and haven't
seen -any- Suse-specific problems.

Serguei


From purp at wildbrain.com  Wed Apr 24 21:49:09 2002
From: purp at wildbrain.com (Jim Meyer)
Date: 24 Apr 2002 21:49:09 -0700
Subject: COTS cooling
In-Reply-To: <200204240500.WAA23212@brownlee.cs.uidaho.edu>
References: <200204240500.WAA23212@brownlee.cs.uidaho.edu>
Message-ID: <1019710150.3596.37.camel@milagro.wildbrain.com>

On Tue, 2002-04-23 at 22:00, Robert B Heckendorn wrote:
> We don't have to pay for the cooling but the cost of the installation
> of cooling is being used as an argument to cut corners on the machine
> itself.  :-( So I would love to get the cost of the installation of
> cooling down.

I just faced a similar circumstance; we're building a new facility and
our CFO originally nixed raised floors and serious cooling because the
general contractor showed him a big pricetag with no context. I didn't
end up involved in the project until six months later.

I was lucky enough to get three magic formulas and an excellent bit of
advice. The formulas (I is current in amps, E is voltage in volts):

KVA @ 3 Phase = ((I*E)*1.73)/1000
BTU/hr        = (((I*E)*0.8)/1000)*3413
AC Tonnage    = (BTU/hr)/12000
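
For anyone who would rather script this than build it in a spreadsheet,
the three formulas can be written out directly.  A minimal sketch: the
function names are made up, and reading the 0.8 factor as a power factor
is an assumption (the formulas came without one).

```python
def kva_three_phase(amps, volts):
    """Apparent power in kVA for a three-phase feed: I * E * 1.73 / 1000."""
    return amps * volts * 1.73 / 1000.0


def btu_per_hour(amps, volts, factor=0.8):
    """Heat load in BTU/hr: (I * E * 0.8 / 1000) kW times 3413 BTU/hr per kW."""
    return amps * volts * factor / 1000.0 * 3413.0


def ac_tons(btu_hr):
    """Tons of air conditioning: one ton removes 12000 BTU/hr."""
    return btu_hr / 12000.0
```

As a worked example, 100 A at 208 V gives about 56,800 BTU/hr, or
roughly 4.7 tons of AC.  For comparison, the 50 kW of AC mentioned
earlier in the thread is 50 * 3413 = 170,650 BTU/hr, or about 14.2 tons.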

The advice: Create a spreadsheet. Show these formulas. Show replacement
cost of your equipment, both current and future if you plan to expand in
that room. Total it all up. Then show the costs of installation against
that. Context.

For us, it showed that the total buildout of a real computer room would
cost less than 10% of the cost of the machines. That and a discussion of
raised failure rates due to heat turned the corner on that one.

Good luck!

--j
-- 
Jim Meyer, Geek At Large                              purp at wildbrain.com


From rgb at phy.duke.edu  Wed Apr 24 22:01:52 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Thu, 25 Apr 2002 01:01:52 -0400 (EDT)
Subject: Liquid cooling?
In-Reply-To: <20020424175916.D12933@velocet.ca>
Message-ID: <Pine.LNX.4.44.0204250046040.23586-100000@lucifer.rgb.private.net>

On Wed, 24 Apr 2002, Velocet wrote:

> Anyone ever sell the heat generated from the clusters to someone else? :)

Step right up, folks, I gotcher heat right here, yes ma'am, packaged to
go.  How about you, sir, couldja use some heat?  Whatzat?  It's hot
enough outside already today?  Snark snark snark -- every crowd's got
one, folks, a JOKER!  THIS heat is special, Geuuwiiiine Beowulf Cluster
heat, imported from one of the O-riginal beowulf clusters, you can't
find heat like this any more, it's practically antique heat.

No, now stop that.  Quit walking away.  And don't shake your head like
that, sir, why, one day you'll be LACKING some heat and NEEDING some
heat and then you'll be sorry indeed you didn't take advantage of this
special offer, never mind that it is midsummer and hotter than H*** out
here... well, OK then.  I guess I'll just keep my heat.  Have to dump it
outside again, or worse, pay to have it hauled away.  Don't nobody seem
to WANT heat anymore, and a few months ago everybody was begging me for
heat.

Sigh.  That didn't go too well.  The problem with heat is one or t'other
of those darned laws of thermodynamics -- always making more of it when
it is waste, can't get enough of it when it is an energy source, can't
(generally speaking, although there are specific exceptions) take waste
heat and use it to make more organized energy without an ever-cooler
reservoir to ultimately dump it in.  Otherwise, by the time you run your
cluster hot enough for the heat to be "useful" (which requires a
significant temperature differential relative to ambient) you're frying
the cluster's innards.
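
To put a rough number on that, here is a minimal sketch assuming the
textbook Carnot limit (efficiency = 1 - T_cold/T_hot, temperatures in
kelvin). The 40 C exhaust and 20 C room temperatures are made-up
illustrative values, not measurements from any cluster.

```python
def carnot_efficiency(t_hot_c, t_cold_c):
    """Upper bound on the fraction of heat that can be turned into work.

    Temperatures are given in degrees Celsius and converted to kelvin.
    """
    t_hot_k = t_hot_c + 273.15
    t_cold_k = t_cold_c + 273.15
    if t_hot_k <= t_cold_k:
        raise ValueError("hot reservoir must be warmer than cold reservoir")
    return 1.0 - t_cold_k / t_hot_k


if __name__ == "__main__":
    # Hypothetical case: 40 C exhaust air, 20 C ambient.
    eta = carnot_efficiency(40.0, 20.0)
    print("Carnot limit: %.1f%% of the waste heat" % (100.0 * eta))
```

Even this ideal bound is only a few percent for such a small
temperature difference, and any real engine would recover less.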

On a modest scale, of course, sure.  In the winter, I recycle my home
cluster's heat.  In the summer, I probably pay more than I gained in the
winter to remove it and dump it outdoors, but at least I break (more)
APPROXIMATELY even.  The same cycle probably holds elsewhere -- where it
is already cold, you can probably reuse the heat IF you can get it from
here to there (heat actually being pretty difficult and expensive to
pack up and ship from where you make it to where you MIGHT want it).
Where it is already hot, folks just look at you funny when you try to
sell them your garbage.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From rgb at phy.duke.edu  Thu Apr 25 00:19:01 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Thu, 25 Apr 2002 03:19:01 -0400 (EDT)
Subject: Tyan Tiger 2460
Message-ID: <Pine.LNX.4.44.0204250237120.23586-100000@lucifer.rgb.private.net>

Dear List,

We've had problems (as have others on this list) getting our 2U
rackmount Tyan Tiger 2460 motherboards to boot/install/run reliably and
stably.  Seth (our systems guy) and I worked on a couple of the boxes
today armed with a 32 bit riser, a 64 bit riser, and an ATI rage video
card and a 3c905m NIC.  

We took the PCI cards off of their frames so we could mount them
vertically directly in the slots for testing.  We also dismounted the
risers so we could try them in different slots as well. The following is
a summary of our findings.

  a) Only the video card would work in slot 1.  Period.  If we put the
3c905 in slot one all by itself (using the BIOS console), the system
would behave erratically, actually mistaking the number and speed of
processors during boot and crashing under heavy network loads if and
when it booted.

  b) If slot one had video or was empty, the system would work fine for
all other vertical configurations.  That is, video in 1, net in 6, video
in 2, net in 3 or vice versa, video in 5, NIC in 2, etc.  I don't know
that we tested every combination but we didn't find another that failed
in all our tests.  Slot 1 alone seems to be the ringer.

It is not a 64 vs 32 bit slot question or a power question per se, as
far as we can tell.  Slots 1-4 are all apparently identical 32 bit, five
volt slots, slots 5+ are 32 bit five volt slots, and both the 3c905 and
ATI are slotted for 3.3/32 bit slots with the extra notch near the
back.  There is no reason that we can see for the 3c905 to work in slot
2, 3, 4, 5, 6, 7 but not in slot 1.

This is further verified by the fact that we had a 2466 to play with as
well, which has two 64/66 3.3 volt slots, and the cards worked perfectly
in them in any order.

  c) Our real torment comes from the riser.  Most riser cards are
designed so they HAVE to plug into slot 1 so that their physical
framework can hold the cards sideways in the remaining room over the PCI
bus.  Plugged into slot 2, there isn't generally room to fit a full
height card (or the support frame) into the remaining space to the side.
With the riser in slot 1, no combination of cards in the riser that
included the NIC would work, and even the video alone in the slot that
should have been a "straight through" connection appeared to have
problems, although a system without a NIC is useless to us so the issue
is moot.  Again, the most common symptom was that the system wouldn't
even get the CPU info correct at the bios level before any boot is even
initiated, and if the boot/install succeeded at all the system was
highly unstable under any kind of load.

The problem persisted, identically, when we put the 64 bit riser (which
we were really counting on to fix things) into slot 1 and plugged the
NIC and video into it, in either order.  We had hoped that the problem
was just the 32 bit riser not correctly connecting lines needed for the
power/clock to automatically set to the needs of the card and that the
64 bit card would "fix" this.  As noted above, the problem is all slot
1, though, in any card orientation even without the riser at all.

HOWEVER, being clever little beasties, we put the dismounted (32 bit)
riser in slot 2 with the extra cabled keys in slots 3 and 4, added the
dismounted PCI cards to any slots we felt like and voila!  The system,
she work perfectly.  Right number of CPUs, flawless boot/install, still
running under heavy load for ten hours or so now.

Since the 3c905 is a highly reliable NIC (and the ATI rage is ditto a
reliable video card and for that matter we also saw the problem earlier
with other NICs, e.g. tulips) that work perfectly in many, many
systems, one has to be at least tempted to conclude that this is a
reproducible BUG in the 2460 Tiger motherboard, either in the BIOS or
(worse) in the physical wiring of slot 1. We are reporting it to Tyan as
such to see if they are aware of it (couldn't find it on their website
if they are) and if they know of any fix.  In the meantime, we are
testing a workaround consisting of a riser with a flexible ribbon
connecting the primary slot, so that it can be installed offset from
where it is plugged into the PCI bus.  We hypothesize that if we mount
this riser in the framework (so it sits physically above slot 1 and can
take full height cards) but plug it into slots 2-4, it will work fine
and the systems will stabilize.

Of course the RIGHT solution would be to keep our perfectly good cards
and risers and get Tyan to replace the 2460's (if there isn't a bios
upgrade that fixes the ones we have).  Given the frustration and
downtime and lost productivity we have suffered, giving us 2466
replacements seems reasonable to me:-).

Anyway, this explains to at least some extent why such a wide range of
experiences has been reported for these motherboards on the list.
People who rackmounted them probably had problems, although I'm willing
to believe that there are riser cards out there or particular card
combinations that would "fix" the problem, possibly without the owner
ever knowing it existed.  People who tower mounted them probably did not
have problems, especially if they used an AGP video card or put their
video and NIC into the regular 32 bit slots (or in any event
"accidentally" avoided putting something into slot 1 that wouldn't work
there).  The discussion above may help anybody out there who is still
having problems -- rearrange your cards as described above and all
SHOULD be well and/or replace your riser and/or get Tyan to make it
right.

BTW, so far the 2466 runs fine, as noted by many listvolken.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From john.hearns at cern.ch  Thu Apr 25 01:42:37 2002
From: john.hearns at cern.ch (John Hearns)
Date: 25 Apr 2002 10:42:37 +0200
Subject: IBM Lego brick storage
Message-ID: <1019724158.7365.24.camel@ues4>

Perhaps not relevant to Beowulfery,
but there is discussion of cooling :-)

IBM makes a Lego-brick-like storage cube:
http://www.eetimes.com/at/news/OEG20020423S0091


From math at velocet.ca  Thu Apr 25 09:18:04 2002
From: math at velocet.ca (Velocet)
Date: Thu, 25 Apr 2002 12:18:04 -0400
Subject: Tyan Tiger 2460
In-Reply-To: <Pine.LNX.4.44.0204250237120.23586-100000@lucifer.rgb.private.net>; from rgb@phy.duke.edu on Thu, Apr 25, 2002 at 03:19:01AM -0400
References: <Pine.LNX.4.44.0204250237120.23586-100000@lucifer.rgb.private.net>
Message-ID: <20020425121804.U12933@velocet.ca>

On Thu, Apr 25, 2002 at 03:19:01AM -0400, Robert G. Brown's all...
> Dear List,
> 
> We've had problems (as have others on this list) getting our 2U
> rackmount Tyan Tiger 2460 motherboards to boot/install/run reliably and
> stably.  Seth (our systems guy) and I worked on a couple of the boxes
> today armed with a 32 bit riser, a 64 bit riser, and an ATI rage video
> card and a 3c905m NIC.  

re Tigers....

We got back a 2466 from RMA that was somehow fried. New replacement board came
back. The new bios reports "V4.0 rel 6" and also "Phoenix 4.01".

I saw this change from previous versions and decided to try our Tbirds in it
that we had tried before under previous BIOS versions (and I can't remember the
version #s from before and I can't reboot any nodes to find out :)

Well, something has changed because it warns that the processors are non
MP and so it will operate uniprocessor as SMP is unsupported with non MPs.
Can't flash back to a previous BIOS version either. So Tyan musta struck some
deal with AMD on this. :) I'm wondering why they bothered, really, since
Tbirds are almost out of production anyway.

We still have a few test boards running happily with dual Tbird 1.33GHz
on both 2460s and 2466s, I assume on the older BIOS.

No major problems with either type of board, except those weird Addtron
GBE cards which y'all should stay away from. :)

/kc


> 
> We took the PCI cards off of their frames so we could mount them
> vertically directly in the slots for testing.  We also dismounted the
> risers so we could try them in different slots as well. The following is
> a summary of our findings.
> 
>   a) Only the video card would work in slot 1.  Period.  If we put the
> 3c905 in slot one all by itself (using the BIOS console), the system
> would behave erratically, actually mistaking the number and speed of
> processors during boot and crashing under heavy network loads if and
> when it booted.
> 
>   b) If slot one had video or was empty, the system would work fine for
> all other vertical configurations.  That is, video in 1, net in 6, video
> in 2, net in 3 or vice versa, video in 5, NIC in 2, etc.  I don't know
> that we tested every combination but we didn't find another that failed
> in all our tests.  Slot 1 alone seems to be the ringer.
> 
> It is not a 64 vs 32 bit slot question or a power question per se, as
> far as we can tell.  Slots 1-4 are all apparently identical 32 bit, five
> volt slots, slots 5+ are 32 bit five volt slots, and both the 3c905 and
> ATI are slotted for 3.3/32 bit slots with the extra notch near the
> back.  There is no reason that we can see for the 3c905 to work in slot
> 2, 3, 4, 5, 6, 7 but not in slot 1.
> 
> This is further verified by the fact that we had a 2566 to play with as
> well, which has two 64/66 3.3 volt slots, and the cards worked perfectly
> in them in any order.
> 
>   c) Our real torment comes from the riser.  Most riser cards are
> designed so they HAVE to plug into slot 1 so that their physical
> framework can hold the cards sideways in the remaining room over the PCI
> bus.  Plugged into slot 2, there isn't generally room to fit a full
> height card (or the support frame) into the remaining space to the side.
> With the riser in slot 1, no combination of cards in the riser that
> included the NIC would work, and even the video alone in the slot that
> should have been a "straight through" connection appeared to have
> problems, although a system without a NIC is useless to us so the issue
> is moot.  Again, the most common symptom was that the system wouldn't
> even get the CPU info correct at the bios level before any boot is even
> initiated, and if the boot/install succeeded at all the system was
> highly unstable under any kind of load.
> 
> The problem persisted, identically, when we put the 64 bit riser (which
> we were really counting on to fix things) into slot 1 and plugged the
> NIC and video into it, in either order.  We had hoped that the problem
> was just the 32 bit riser not correctly connecting lines needed for the
> power/clock to automatically set to the needs of the card and that the
> 64 bit card would "fix" this.  As noted above, the problem is all slot
> 1, though, in any card orientation even without the riser at all.
> 
> HOWEVER, being clever little beasties, we put the dismounted (32 bit)
> riser in slot 2 with the extra cabled keys in slots 3 and 4, added the
> dismounted PCI cards to any slots we felt like and voila!  The system,
> she work perfectly.  Right number of CPUs, flawless boot/install, still
> running under heavy load for ten hours or so now.
> 
> Since the 3c905 is a highly reliable NIC (and the ATI rage is ditto a
> reliable video card and for that matter we also saw the problem earlier
> with other NICs, e.g. tulipsj) that work perfectly in many, many
> systems, one has to be at least tempted to conclude that this is a
> reproducible BUG in the 2460 Tiger motherboard, either in the BIOS or
> (worse) in the physical wiring of slot 1. We are reporting it to Tyan as
> such to see if they are aware of it (couldn't find it on their website
> if they are) and if they know of any fix.  In the meantime, we are
> testing a workaround consisting of a riser with a flexible ribbon
> connecting the primary slot, so that it can be installed offset from
> where it is plugged into the PCI bus.  We hypothesize that if we mount
> this riser in the framework (so it sits physically above slot 1 and can
> take full height cards) but plug it into slots 2-4, it will work fine
> and the systems will stabilize.
> 
> Of course the RIGHT solution would be to keep our perfectly good cards
> and risers and get Tyan to replace the 2460's (if there isn't a bios
> upgrade that fixes the ones we have).  Given the frustration and
> downtime and lost productivity we have suffered, giving us 2466
> replacements seems reasonable to me:-).
> 
> Anyway, this explains to at least some extent why such a wide range of
> experiences has been reported for these motherboards on the list.
> People who rackmounted them probably had problems, although I'm willing
> to believe that there are riser cards out there or particular card
> combinations that would "fix" the problem, possibly without the owner
> ever knowing it existed.  People who tower mounted them probably did not
> have problems, especially if they used an AGP video card or put their
> video and NIC into the regular 32 bit slots (or in any event
> "accidentally" avoided putting something into slot 1 that wouldn't work
> there).  The discussion above may help anybody out there who is still
> having problems -- rearrange your cards as described above and all
> SHOULD be well and/or replace your riser and/or get Tyan to make it
> right.
> 
> BTW, so far the 2466 runs fine, as noted by many listvolken.
> 
>    rgb

-- 
Ken Chase, math at velocet.ca  *  Velocet Communications Inc.  *  Toronto, CANADA 


From kus at free.net  Thu Apr 25 09:20:14 2002
From: kus at free.net (Mikhail Kuzminsky)
Date: Thu, 25 Apr 2002 20:20:14 +0400 (MSD)
Subject: Tyan Tiger 2460 (Re)
Message-ID: <200204251620.UAA24432@nocserv.free.net>

According to Robert G. Brown
> From beowulf-admin at beowulf.org Thu Apr 25 11:50:34 2002
> From: "Robert G. Brown" <rgb at phy.duke.edu>
> To: Beowulf Mailing List <beowulf at beowulf.org>
> Subject: Tyan Tiger 2460
> 
> We've had problems (as have others on this list) getting our 2U
> rackmount Tyan Tiger 2460 motherboards to boot/install/run reliably and
> stably. 
> 
> ... to conclude that this is a
> reproducible BUG in the 2460 Tiger motherboard, either in the BIOS or
> (worse) in the physical wiring of slot 1...

> BTW, so far the 2466 runs fine, as noted by many listvolken.
> 
  This is not the only problem with Tyan dual motherboards. There are
also problems with the hardware monitor chips on both the 2460 and the
2466: to get lm_sensors working you have to do some trick with the BIOS
at boot. Moreover, on the Thunder with Tualatin chips lm_sensors can't
work at all. Maybe Supermicro boards are more stable ...
 
Mikhail Kuzminsky
Zelinsky Institute of Organic Chemistry
Moscow


From gabriel.weinstock at dnamerican.com  Thu Apr 25 09:58:22 2002
From: gabriel.weinstock at dnamerican.com (Gabriel J. Weinstock)
Date: Thu, 25 Apr 2002 12:58:22 -0400
Subject: network card problems
Message-ID: <17193765606726@DNAMERICAN.COM>

One of my co-workers is having a problem with his CNet SinglePoint 10/100 
CardBus PC card on his laptop, and I thought I would ask the gurus here since 
so many of you have done kernel/network driver development.

Essentially, what is happening is that the driver is dumping a huge number of
messages to the syslog facility, to the point of filling up his root 
partition. The messages are as follows:
-----
kernel: rtl8139_rx_interrupt: eth0: In rtl8139_rx(), current ef74 BufAddr efd8, free to ef64, Cmd 0c.
kernel: rtl8139_rx_interrupt: eth0:  rtl8139_rx() status 602001, size 0060, cur ef74.
kernel: rtl8139_rx_interrupt: eth0: Done rtl8139_rx(), current efd8 BufAddr efd8, free to efc8, Cmd 0d.
kernel: rtl8139_interrupt: eth0: interrupt status=0x0000 ackstat=0x0000 new intstat=0x0000.
kernel: rtl8139_interrupt: eth0: exiting interrupt, intr_status=0x0000.
kernel: rtl8139_interrupt: eth0: interrupt status=0x0000 ackstat=0x0001 new intstat=0x0001.
-----
I have no clue as to why this is happening and what to do about it.
If anyone has any suggestions, I would greatly appreciate it.
Thanks,
Gabriel


From jlb17 at duke.edu  Thu Apr 25 10:47:58 2002
From: jlb17 at duke.edu (Joshua Baker-LePain)
Date: Thu, 25 Apr 2002 13:47:58 -0400 (EDT)
Subject: Tyan Tiger 2460
In-Reply-To: <20020425121804.U12933@velocet.ca>
Message-ID: <Pine.LNX.4.44.0204251344260.1591-100000@chaos.egr.duke.edu>

On Thu, 25 Apr 2002 at 12:18pm, Velocet wrote

> We got back a 2466 from RMA that was somehow fried. New replacement board came
> back. The new bios reports "V4.0 rel 6" and also "Phoenix 4.01".
> 
> I saw this change from previous versions and decided to try our Tbirds in it
> that we had tried before under previous BIOS versions (and I cant remember the
> version #s from before and I cant reboot any nodes to find out :)
> 
> Well something has changed because it warns that the processors are non
> MP and so it will operate uniprocessor as SMP is unsupported with non MPs.
> Cant flash back to a previous bios version either. So Tyan musta struck some
> deal with AMD on this. :) Im wondering why they bothered, really, since
> Tbirds are almost out of production anyway.

Tyan's products page reports that there are two versions of the S2466.  
The new ones are the S2466N-4M.  The main difference listed is that the 
4Ms have functional onboard USB (v1.1), whereas the original S2466Ns 
were shipping with those addon 4port PCI USB cards.  That's why there are 
now two BIOSes for the S2466, and they warn that flashing one type of 
board with the BIOS for the other is Bad.

They must have taken the opportunity in fixing the USB problem to also 
"fix" the non-SMP chip in SMP config "problem".

-- 
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University


From gary at umsl.edu  Thu Apr 25 11:48:24 2002
From: gary at umsl.edu (Gary Stiehr)
Date: Thu, 25 Apr 2002 13:48:24 -0500
Subject: preemption in PBSPro with MPICH on Linux
Message-ID: <3CC84F78.4010106@umsl.edu>

Hi,

Does anyone have experience using the preemption feature of PBSPro with 
MPICH jobs on Linux clusters?  I believe in the release notes for PBSPro 
5.2, it says that it will only send SIGSTOP to the process that PBS 
started (i.e., the mpirun process).  Therefore, if that process started 
other processes (as is the case with my MPICH job), the other processes 
will continue to run.  Does anyone know of a way to suspend all 
processes started from an MPICH job?

I need to do this because some MPICH jobs last several weeks and other 
smaller jobs submitted would have to wait if I do not use preemption.  I 
suppose another method would be to make sure that the long MPICH job 
checkpoints and then just have PBS kill the job after a certain amount 
of time.

Any experiences and/or suggestions would be appreciated.

Thanks,
Gary Stiehr
Information Technology Services
University of Missouri - St. Louis
gary at umsl.edu


From becker at scyld.com  Thu Apr 25 18:52:00 2002
From: becker at scyld.com (Donald Becker)
Date: Thu, 25 Apr 2002 21:52:00 -0400 (EDT)
Subject: network card problems
In-Reply-To: <17193765606726@DNAMERICAN.COM>
Message-ID: <Pine.LNX.4.33.0204252150250.1031-100000@presario>

On Thu, 25 Apr 2002, Gabriel J. Weinstock wrote:

> One of my co-workers is having a problem with his CNet SinglePoint 10/100
> CardBus PC card on his laptop, and I thought I would ask the gurus here since
> so many of you have done kernel/network driver development.

This is the wrong list to ask this question.  The appropriate place is
the realtek at scyld.com mailing list.  See
   http://www.scyld.com/mailman/listinfo/

> Essentially, what is happening is that the driver is dumping huge amount of
> messages to the syslog facility, to the point of filling up his root
> partition. The messages are as follows:

He turned on debugging to the highest level.
Solution: Don't do that.

-- 
Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993


From jcownie at etnus.com  Thu Apr 25 01:44:59 2002
From: jcownie at etnus.com (James Cownie)
Date: Thu, 25 Apr 2002 09:44:59 +0100
Subject: Kidger's comments on Quadric's design and performance 
In-Reply-To: Message from Joachim Worringen <joachim@lfbs.RWTH-Aachen.DE> 
   of "Wed, 24 Apr 2002 18:16:01 +0200." <3CC6DA41.367BEB55@lfbs.rwth-aachen.de> 
Message-ID: <170es7-0HS-00@etnus.com>

Joachim Worringen wrote :-
> > No, you are confusing two completely different issues. To support
> > OpenMP you need a single address space which spans the processors.
> 
> You are right, this is completely different. However, I did not mean
> that connecting nodes of a cluster with a cache-coherent interface
> "gives you an SMP", but more precisely "gives the shared parts of the
> distributed distinct address spaces nearly SMP-like access
> characteristics", with respect to a suitable programming model. 

...

> With Quadrics, this should be possible in an even more efficient manner
> due to the hardware-MMU and -TLB on the adapter.

(One caveat, I'm assuming here that the Quadrics' model remains the
same as it was when we were all back at Meiko).

I think you still do not understand Quadrics' model. There is _no_
shared part of the address space. Access to remote address spaces is
never achieved directly by an arbitrary load/store from any CPU, it
always requires an access to the communication processor.

As I said before the model is of _explicit_ remote store access. You
have to generate different instructions to perform a remote access.

On the other hand, all remote accesses are fully cache coherent both
locally and remotely.

The issue of cache coherence of the interface is unrelated to the
issue of how you cause a remote store access.

> To have a real cc-NUMA-SMP, the integration needs to be higher (HP
> X-Class, DG/IBM NUMA-Q, ...), this is for sure.  The question is: are
> large-scale SMPs as sold by IBM, Sun, ... not the better solution for
> such tasks? Quadrics is expensive, and you still have to manage a bunch
> of PCs instead a nice, single SMP.

But, as I said, Quadrics doesn't pretend to be a cc-NUMA-SMP at
all. Their technology is used to build _big_ clusters which may contain
the SMPs as nodes, but certainly scale above the range where the SMPs
run out. (See the recently announced HP/Quadrics PNL cluster, or the
Compaq SCs at LANL and CEA, for instance).

-- Jim 

James Cownie	<jcownie at etnus.com>
Etnus, LLC.     +44 117 9071438
http://www.etnus.com


From SGaudet at turbotekcomputer.com  Thu Apr 25 10:09:56 2002
From: SGaudet at turbotekcomputer.com (Steve Gaudet)
Date: Thu, 25 Apr 2002 13:09:56 -0400
Subject: Tyan Tiger 2460 (Re)
Message-ID: <3450CC8673CFD411A24700105A618BD6267E5E@911TURBO>

Hello 

> According to Robert G. Brown
> > From beowulf-admin at beowulf.org Thu Apr 25 11:50:34 2002
> > From: "Robert G. Brown" <rgb at phy.duke.edu>
> > To: Beowulf Mailing List <beowulf at beowulf.org>
> > Subject: Tyan Tiger 2460
> > 
> > We've had problems (as have others on this list) getting our 2U
> > rackmount Tyan Tiger 2460 motherboards to boot/install/run 
> reliably and
> > stably. 
> > 
> > ... to conclude that this is a
> > reproducible BUG in the 2460 Tiger motherboard, either in 
> the BIOS or
> > (worse) in the physical wiring of slot 1...
> 
> > BTW, so far the 2466 runs fine, as noted by many listvolken.
> > 
>   It's not only problem w/Tyan dual motherboards. The problem
> exist also w/correct work of Hardware Monitor chips (for work of
> lm_sensors it's necessary to do (at the boot) some trick w/BIOS), 
> for both 2460 and 2466. Moreover, for Thunder w/Tualatin 
> chips lm_sensors
> can't work. May be Supermicro boards are more stable ...

Has anyone reported these issues to Tyan's technical support?  Tyan knows
that their products sell very well in the Linux cluster market.  So not
addressing these issues would cause customers to look elsewhere.

Regards,

Steve Gaudet 
Linux Solutions Engineer
 
===================================================================
| Turbotek Computer Corp.    tel:603-666-3062 ext. 21             |
| 8025 South Willow St.      fax:603-666-4519                     |
| Building 2, Unit 105       toll free:800-573-5393               |
| Manchester, NH 03103       e-mail:sgaudet at turbotekcomputer.com  |
|                            web: http://www.turbotekcomputer.com |
===================================================================

  
From sp at scali.com  Thu Apr 25 12:02:44 2002
From: sp at scali.com (Steffen Persvold)
Date: Thu, 25 Apr 2002 21:02:44 +0200 (CEST)
Subject: [NFS] NFS clients behind a masqueraded gateway
In-Reply-To: <Pine.LNX.4.30.0204180946500.10622-100000@elin.scali.no>
Message-ID: <Pine.LNX.4.30.0204252059310.16930-100000@elin.scali.no>

Hi Wulfers,

I'm taking the liberty to post my question here as well since it might be
relevant for some of you and you might have some experience with this.

I've also posted the mail to the NFS mailing list, but I haven't gotten
any answers (yet).

Any pointers to what the problem might be are highly appreciated.

On Thu, 18 Apr 2002, Steffen Persvold wrote:

> Hi all,
>
> I'm experiencing some problems with a cluster setup. The cluster is set up
> in a way that you have a frontend machine configured as a masquerading
> gateway and all the compute nodes behind it on a private network (i.e the
> frontend has two network interfaces). User home directories and also other
> data directories which should be available to the cluster (i.e statically
> mounted in the same location on both frontend and nodes) are located on
> external NFS servers (IRIX and Linux servers). This seems to work fine
> when the cluster is in use, but if the cluster is idle for some time (e.g
> over night), the NFS directories has become unavailable and trying to
> reboot the frontend results in a complete hang when it tries to unmount
> the NFS directories (it hangs in a fuser command). The frontend and all
> the nodes are running RedHat 7.2, but with a stock 2.4.18 kernel (plus
> Trond's seekdir patch, thanks for the help BTW).
>
> Ideas anyone ?
>
> Thanks in advance,
>


Best regards,
Steffen


From rgb at phy.duke.edu  Thu Apr 25 20:21:21 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Thu, 25 Apr 2002 23:21:21 -0400 (EDT)
Subject: Tyan Tiger 2460 (Re)
In-Reply-To: <3450CC8673CFD411A24700105A618BD6267E5E@911TURBO>
Message-ID: <Pine.LNX.4.44.0204252255590.6466-100000@lucifer.rgb.private.net>

On Thu, 25 Apr 2002, Steve Gaudet wrote:

> Has anyone reported these issues to Tyan's technical support?  Tyan knows
> that their products sell very well in the Linux cluster market.  So not
> addressing these issues would cause customers to look elsewhere.

We filed a ticket yesterday, but AFAIK no response yet.  I've been
snooping their website (and 3com's, and google, and anything else I can
think of) trying to find some hint that this problem has been reported
before.  It is very strange that such a complete screw up could still
exist -- how can anybody be using Tigers in 1U or 2U cases if this
problem is universal?  I posted here half hoping to hear somebody say
"Ah, you need to change XXX in the bios, you idiot".  I'd even welcome
the idiot part, if only I could get things to work.

Still, it is completely reproducible on the Tigers we have, it has hit
other Tiger users at Duke, some of whom posted here a month or two ago
and finally gave up and replaced their 2460's with 2466's in sheer
frustration (or so we suspect -- we only figured out the details I
posted yesterday and haven't had time to verify that the problems are
indeed the same) and it is a REAL problem.  The 3c905 unfortunately is
about 1 cm too tall to fit in a 2U case straight up if the metal
backplate is taken off, although the ATI Rage actually will fit.

I just registered (shudder, but what's a bit more SPAM in my already
hi-cal email diet:-) at amdmb.com, which looks like a support forum for
AMD motherboards, and will try a post there.  It looks like it gets the
attention of both AMD and Tyan engineers and may be faster than Tyan's
support page to return something useful.  People are (purportedly)
SELLING cluster nodes with Tiger 2460's in a 1U or 2U chassis with a
riser; it MUST work somehow, for some bios flash or configuration.

If not and the Tiger 2460 has a dark secret (it doesn't, uhh, actually
work in a 1-2U rackmount configuration, oops) well, I'm preparing a few
flaming torches and oiling my pitchfork...unless of course Tyan makes it
right.  Very quickly -- this has cost us at least a month of
potential productivity at this point, and looks to cost us at least a
few days more.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From heckendo at cs.uidaho.edu  Fri Apr 26 00:04:12 2002
From: heckendo at cs.uidaho.edu (Robert B Heckendorn)
Date: Fri, 26 Apr 2002 00:04:12 -0700 (PDT)
Subject: liquid cooling
In-Reply-To: <200204251601.g3PG1Bb04333@blueraja.scyld.com>
Message-ID: <200204260704.AAA12953@brownlee.cs.uidaho.edu>

> Interesting idea -- I actually used the idea as an exercise in class last
> semester - as a (obviously extreme) exercise in facilities management. Point
> was to get them to think about all the infrastructure we often take for
> granted..
>     
> As far as selling the heat generated by our systems... A co-generation
> facility off of the computer room? (Hmm.. Maybe, for some of our bigger
> beowulfs..) But if your campus buildings have steam heating, well, there
> might be something to it..

The best idea we heard was to use the waste heat from the beowulf
to heat the university swimming pool.  :-)

-- 
| Robert Heckendorn                        | We may not be the only
| heckendo at cs.uidaho.edu                   | species on the planet but
| http://www.cs.uidaho.edu/~heckendo       | we sure do act like it.
| CS Dept, University of Idaho             |
| Moscow, Idaho, USA   83844-1010          |


From mack.joseph at epa.gov  Fri Apr 26 06:43:34 2002
From: mack.joseph at epa.gov (Joseph Mack)
Date: Fri, 26 Apr 2002 09:43:34 -0400
Subject: howto increase MTU size on 100Mbps FE
Message-ID: <3CC95986.EB9B5B9F@epa.gov>

I know that jumbo frames increase throughput rate on GigE and was
wondering if a similar thing is possible with regular FE.

according to
 
http://sd.wareonearth.com/~phil/jumbo.html
 
the MTU of 1500 was chosen for 10Mbps ethernet and was kept
for 100Mbps and 1Gbps ethernet for backwards compatibility
on mixed networks. However MTU=1500 is too small
for 100Mbps and 1Gbps ethernet. In Gbps ethernet jumbo frames
(ie a bigger MTU) are used to increase throughput.
 
With netpipe I found that throughput on FE was approx linear with
increasing MTU up to the max=1500 bytes. I assume that there
is no sharp corner at 1500 and if in principle larger
frames could be sent, then throughput should also increase
for FE. (Let's assume that the larger packets will
never get off the LAN and will never need to be fragmented).
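
As a rough sanity check on how much a larger MTU could buy from framing
alone, here is a minimal sketch (not a measurement). It assumes standard
Ethernet framing overhead per frame (8 bytes preamble/SFD, 14 bytes
header, 4 bytes FCS, 12 bytes inter-frame gap) and 40 bytes of IPv4+TCP
headers without options; the function names are made up for illustration.

```python
# Per-frame overhead on the wire, in bytes (standard Ethernet framing).
PREAMBLE_SFD = 8
ETH_HEADER = 14
FCS = 4
INTERFRAME_GAP = 12
# Assumption: IPv4 + TCP headers with no options.
IP_TCP_HEADERS = 40


def wire_bytes(mtu: int) -> int:
    """Bytes of link time consumed by one full-sized frame."""
    return mtu + PREAMBLE_SFD + ETH_HEADER + FCS + INTERFRAME_GAP


def tcp_goodput_fraction(mtu: int) -> float:
    """Fraction of the raw link rate left for TCP payload."""
    return (mtu - IP_TCP_HEADERS) / wire_bytes(mtu)


def frames_per_second(mtu: int, link_bps: float) -> float:
    """Full-sized frames per second needed to saturate the link."""
    return link_bps / (8 * wire_bytes(mtu))


if __name__ == "__main__":
    for mtu in (576, 1500, 3000, 9000):
        print(mtu,
              round(tcp_goodput_fraction(mtu), 4),
              round(frames_per_second(mtu, 100e6)))
```

Under these assumptions the framing gain above MTU=1500 is only a few
percent, while the number of frames per second the host must handle drops
in proportion to the MTU. That would suggest (as an interpretation, not
something verified here) that any large throughput gain comes from lower
per-packet cost on the host rather than from the wire itself.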
 
I couldn't increase the MTU above 1500 with ifconfig or ip link.
I found that the MTU seemed to be defined in
 
linux/include/if_ether.h
as
ETH_DATA_LEN and ETH_FRAME_LEN
 
and increased these by 1500, recompiled the kernel and net-tools
and rebooted. I still can't install a device with MTU>1500

VLAN sends a packet larger than the standard MTU, having an
extra 4 bytes of out of band data. The VLAN people have 
problems with larger MTUs. Here's their mailing list

http://www.WANfear.com/pipermail/vlan/

where I found the following e-mails

http://www.WANfear.com/pipermail/vlan/2002q2/002385.html
http://www.WANfear.com/pipermail/vlan/2002q2/002399.html
http://www.WANfear.com/pipermail/vlan/2002q2/002401.html

which indicate that the MTU is set in the NIC driver and
that in some cases the MTU=1500 is coded into the hardware
or is at least hard to change.

I don't know whether regular commodity switches (eg Netgear
FS series) care about packet size, but I was going to 
try to send packets over a cross-over cable initially. 

Am I barking up sensible trees here?

Thanks Joe

-- 
Joseph Mack PhD, Senior Systems Engineer, Lockheed Martin
contractor to the National Environmental Supercomputer Center, 
mailto:mack.joseph at epa.gov ph# 919-541-0007, RTP, NC, USA


From maurice at harddata.com  Fri Apr 26 08:25:51 2002
From: maurice at harddata.com (Maurice Hilarius)
Date: Fri, 26 Apr 2002 09:25:51 -0600
Subject: [Fwd: Tyan Tiger 2460]
In-Reply-To: <3CC94369.4DBF1BAF@scyld.com>
Message-ID: <5.1.0.14.2.20020426090916.072721e0@mail.harddata.com>

With regards to your message at 06:09 AM 4/26/02, Karen Keadle-Calvert. 
Where you stated:
>Daniel,
>
>Thought this might be of interest.  Didn't know if it would apply to
>your situation or not.
>
>Karen
>
>-------- Original Message --------
>Subject: Tyan Tiger 2460
>Date: Thu, 25 Apr 2002 03:19:01 -0400 (EDT)
>From: "Robert G. Brown" <rgb at phy.duke.edu>
>To: Beowulf Mailing List <beowulf at beowulf.org>
>CC: Matthew Durbin <matthew.durbin at amd.com>
>
>Dear List,
>
>We've had problems (as have others on this list) getting our 2U
>rackmount Tyan Tiger 2460 motherboards to boot/install/run reliably and
>stably.  Seth (our systems guy) and I worked on a couple of the boxes
>today armed with a 32 bit riser, a 64 bit riser, and an ATI rage video
>card and a 3c905m NIC.
>
>We took the PCI cards off of their frames so we could mount them
>vertically directly in the slots for testing.  We also dismounted the
>risers so we could try them in different slots as well. The following is
>a summary of our findings.
>
>   a) Only the video card would work in slot 1.  Period.  If we put the
>3c905 in slot one all by itself (using the BIOS console), the system
>would behave erratically, actually mistaking the number and speed of
>processors during boot and crashing under heavy network loads if and
>when it booted.

That is basically correct, with SOME video cards.
In general the BIOS and bus setup seem to prefer the first slot be used by 
video, but it really seems to matter what card it is more than anything 
else. In general the ATI RageXL cards are not happy, but the RAGE Pro are, 
and many TNT2 cards work well over all slots.

>   b) If slot one had video or was empty, the system would work fine for
>all other vertical configurations.  That is, video in 1, net in 6, video
>in 2, net in 3 or vice versa, video in 5, NIC in 2, etc.  I don't know
>that we tested every combination but we didn't find another that failed
>in all our tests.  Slot 1 alone seems to be the ringer.

If you are using a riser the other slots are mainly irrelevant.
In some risers they use extension boards to derive addressing from the next 
two slots, and in others they use some logic on the riser. It is advisable 
to use the Tyan M2039 riser as it seems to behave well with this, although, 
depending on cards used sometimes we see the ability to only support two 
out of three cards on the riser.

>It is not a 64 vs 32 bit slot question or a power question per se, as
>far as we can tell.  Slots 1-4 are all apparently identical 32 bit, five
>volt slots, slots 5+ are 32 bit five volt slots, and both the 3c905 and
>ATI are slotted for 3.3/32 bit slots with the extra notch near the
>back.  There is no reason that we can see for the 3c905 to work in slot
>2, 3, 4, 5, 6, 7 but not in slot 1.
>
>This is further verified by the fact that we had a 2466 to play with as
>well, which has two 64/66 3.3 volt slots, and the cards worked perfectly
>in them in any order.

In the case of the 2466 the only drawback with what you describe is that 
generally to get 33MHz cards running off a riser in slot1 or 2 usually 
requires the motherboard to be jumpered to 33MHz on the 64 bit PCI. There 
ARE however NICs and video cards that will run on a 66MHz bus successfully, 
but it does require some testing to find the right choices..

>   c) Our real torment comes from the riser.  Most riser cards are
>designed so they HAVE to plug into slot 1 so that their physical
>framework can hold the cards sideways in the remaining room over the PCI
>bus.  Plugged into slot 2, there isn't generally room to fit a full
>height card (or the support frame) into the remaining space to the side.
>With the riser in slot 1, no combination of cards in the riser that
>included the NIC would work, and even the video alone in the slot that
>should have been a "straight through" connection appeared to have
>problems, although a system without a NIC is useless to us so the issue
>is moot.  Again, the most common symptom was that the system wouldn't
>even get the CPU info correct at the bios level before any boot is even
>initiated, and if the boot/install succeeded at all the system was
>highly unstable under any kind of load.

Again, I think you are mostly seeing a riser card issue. We have used 
different risers with 3COM, Intel, and DLink NICs successfully, with the 
riser plugged into slot 1.
These have included some 32 bit, and a few 64 bit risers. In general we 
have the best results, supporting 64 bit, on the Tyan riser. But with 32 
bit only cards we are successful with more generic models.

>Of course the RIGHT solution would be to keep our perfectly good cards
>and risers and get Tyan to replace the 2460's (if there isn't a bios
>upgrade that fixes the ones we have).  Given the frustration and
>downtime and lost productivity we have suffered, giving us 2466
>replacements seems reasonable to me:-).
While I am sure that this would be a possible solution, I feel that the 
right solution is to use a different (better) riser card.

>Anyway, this explains to at least some extent why such a wide range of
>experiences has been reported for these motherboards on the list.
Most of the problems I see are caused by:
1) Obsolete BIOS versions
2) Poor RAM
3) problems with cooling
4) Inappropriate BIOS setup choices
5) Riser cards with issues

>BTW, so far the 2466 runs fine, as noted by many listvolken.


2466 is actually MUCH more difficult to deal with, especially if you want 
to use a 64 bit/66MHz card, as the bus is very particular about what cards 
you use. 5 volt cards are definitely going to make problems on most risers, 
in our testing.

Still as you mention, people have had success, but you can not just throw 
ANY riser or NIC or (especially) video card in and have it work..


With our best regards,

Maurice W. Hilarius       Telephone: 01-780-456-9771
Hard Data Ltd.               FAX:       01-780-456-9772
11060 - 166 Avenue        mailto:maurice at harddata.com
Edmonton, AB, Canada      http://www.harddata.com/
    T5X 1Y3

Ask me about the UP1500 Alpha - Full systems from $3,500!


From rgb at phy.duke.edu  Fri Apr 26 08:55:56 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Fri, 26 Apr 2002 11:55:56 -0400 (EDT)
Subject: [Fwd: Tyan Tiger 2460]
In-Reply-To: <5.1.0.14.2.20020426090916.072721e0@mail.harddata.com>
Message-ID: <Pine.LNX.4.44.0204261131020.16666-100000@sanity.phy.duke.edu>

On Fri, 26 Apr 2002, Maurice Hilarius wrote:

> >   a) Only the video card would work in slot 1.  Period.  If we put the
> >3c905 in slot one all by itself (using the BIOS console), the system
> >would behave erratically, actually mistaking the number and speed of
> >processors during boot and crashing under heavy network loads if and
> >when it booted.
> 
> That is basically correct, with SOME video cards.
> In general the BIOS and bus setup seem to prefer the first slot be used by 
> video, but it really seems to matter what card it is more than anything 
> else. In general the ATI RageXL cards are not happy, but the RAGE Pro are, 
> and many TNT2 cards work well over all slots.

You misunderstand.  The video card works fine in all slots.  The system
locks with a 3c905C-TX-M in slot 1 even with the system stripped so that
2 processors, a stick of certified registered ECC DDR, and the 3C905C
are it (NOTHING else plugged into the mobo).  Tyan is now refusing to
own the problem, and we're on the phone with 3Com to see if we can get
some help at that end.  They, at least, are being constructive.

I was mistaken that only video works in slot 1.  A Netgear in slot 1
still appears to work, but it doesn't support PXE or WOL and we need PXE
for the nodes.

Or maybe I misunderstand, if you are pointing out that the 2460 does
have a history of problems with certain video cards in slot 1 AS WELL AS
the 3c905.

> >   b) If slot one had video or was empty, the system would work fine for
> >all other vertical configurations.  That is, video in 1, net in 6, video
> >in 2, net in 3 or vice versa, video in 5, NIC in 2, etc.  I don't know
> >that we tested every combination but we didn't find another that failed
> >in all our tests.  Slot 1 alone seems to be the ringer.
> 
> If you are using a riser the other slots are mainly irrelevant.
> In some risers they use extension boards to derive addressing from the next 
> two slots, and in others they use some logic on the riser. It is advisable 
> to use the Tyan M2039 riser as it seems to behave well with this, although, 
> depending on cards used sometimes we see the ability to only support two 
> out of three cards on the riser.

We've just verified that the system works fine with a riser that plugs
into the data lines in slot 2, not slot 1 (via a ribbon cable).  It
really is a slot 1/3C905 issue.  Interestingly, the system does work if
the FSB is set back to 200 MHz.

I've now gone through dozens of threads on the motherboard on amdmb.com,
and the 246X motherboards appear to be extremely "finicky", often
requiring immense amounts of energy to find a companion configuration
that will work.  Tyan refused to acknowledge that there were any
problems with the motherboard at all when we were on the phone with
them, which is odd given the dozens of threads reporting them in the
forum.  I consider it a "problem" when a motherboard is so bleeding edge
sensitive to timing/configuration issues that moving a well-known stable
PCI card from a major manufacturer over a slot makes the system break.
Obviously Tyan doesn't;-)

> >It is not a 64 vs 32 bit slot question or a power question per se, as
> >far as we can tell.  Slots 1-4 are all apparently identical 32 bit, five
> >volt slots, slots 5+ are 32 bit five volt slots, and both the 3c905 and
> >ATI are slotted for 3.3/32 bit slots with the extra notch near the
> >back.  There is no reason that we can see for the 3c905 to work in slot
> >2, 3, 4, 5, 6, 7 but not in slot 1.
> >
> >This is further verified by the fact that we had a 2466 to play with as
> >well, which has two 64/66 3.3 volt slots, and the cards worked perfectly
> >in them in any order.
> 
> In the case of the 2466 the only drawback with what you describe is that 
> generally to get 33MHz cards running off a riser in slot1 or 2 usually 
> requires the motherboard to be jumpered to 33MHz on the 64 bit PCI. There 
> ARE however NICs and video cards that will run on a 66MHz bus successfully, 
> but it does require some testing to find the right choices..
> 
> >   c) Our real torment comes from the riser.  Most riser cards are
> >designed so they HAVE to plug into slot 1 so that their physical
> >framework can hold the cards sideways in the remaining room over the PCI
> >bus.  Plugged into slot 2, there isn't generally room to fit a full
> >height card (or the support frame) into the remaining space to the side.
> >With the riser in slot 1, no combination of cards in the riser that
> >included the NIC would work, and even the video alone in the slot that
> >should have been a "straight through" connection appeared to have
> >problems, although a system without a NIC is useless to us so the issue
> >is moot.  Again, the most common symptom was that the system wouldn't
> >even get the CPU info correct at the bios level before any boot is even
> >initiated, and if the boot/install succeeded at all the system was
> >highly unstable under any kind of load.
> 
> Again, I think you are mostly seeing a riser card issue. We have used 
> different risers with 3COM, Intel, and DLink NICs successfully, with the 
> riser plugged into slot 1.
> These have included some 32 bit, and a few 64 bit risers. In general we 
> have the best results, supporting 64 bit, on the Tyan riser. But with 32 
> bit only cards we are successful with more generic models.

It's not a riser issue.  The system locks, as noted above, if the 3c905
is the ONLY card in the system and is plugged vertically into slot 1 (no
riser in the system at all).

The only riser-related issue is that it does seem to be related to the
use of the slot 1 data lines and not the power rails, since the other
riser slots draw power from other PCI slots with little extension
cables, and a 3c905 in any slot-1 mounted riser then causes the lockup.

> 
> Of course the RIGHT solution would be to keep our perfectly good cards
> >and risers and get Tyan to replace the 2460's (if there isn't a bios
> >upgrade that fixes the ones we have).  Given the frustration and
> >downtime and lost productivity we have suffered, giving us 2466
> >replacements seems reasonable to me:-).
> While I am sure that this would be a possible solution, I feel that the 
> right solution is to use a different (better) riser card.
> 
> >Anyway, this explains to at least some extent why such a wide range of
> >experiences has been reported for these motherboards on the list.
> Most of the problems I see are caused by:
> 1) Obsolete BIOS versions
> 2) Poor RAM
> 3) problems with cooling
> 4) Inappropriate BIOS setup choices
> 5) Riser cards with issues
> 
> >BTW, so far the 2466 runs fine, as noted by many listvolken.
> 
> 
> 2466 is actually MUCH more difficult to deal with, especially if you want 
> to use a 64 bit/66MHz card, as the bus is very particular about what cards 
> you use. 5 volt cards are definitely going to make problems on most risers, 
> in our testing.

The good thing about the 2466 is that it has onboard 100BT in addition
to the serial console, so that one doesn't necessarily need any cards at
all to run as a simple node.  If one wants to use it as a gigabit-linked
node, then one probably wants a 64/66 card anyway.  We've only been
playing with one since yesterday, but it does seem a bit better (with
what we've tested) than the 2460, but then, our 2460's do not work at
all in the configuration we're trying to run.

> 
> Still as you mention, people have had success, but you can not just throw 
> ANY riser or NIC or (especially) video card in and have it work..

Overall, the Tyans seem a bit on the maddening side.  Marginal hardware
is Evil.  I'm sure we'll eventually get things worked out (we're trying
to microconfigure the 3c905 in ITS bios on the phone with 3com now) but
it costs a lot in time, energy, and lost productivity.  (Well, looks
like configuring the 3c905 bios by hand didn't do it).

So far, the only solutions we've found appear to be displaced risers or
(possibly) different NICs.  Someone suggested that EEpro's work in a
slot 1 riser, and they do PXE and perform well.  Setting the FSB back
isn't an option.

I do appreciate your help and the remarks/suggestions above.  If I sound
abrupt, it is due to two nights running up til 3 websurfing on this
issue, and a pending meeting on why our cluster nodes still aren't in
production this afternoon.

   rgb

> 
> 
> 
> With our best regards,
> 
> Maurice W. Hilarius       Telephone: 01-780-456-9771
> Hard Data Ltd.               FAX:       01-780-456-9772
> 11060 - 166 Avenue        mailto:maurice at harddata.com
> Edmonton, AB, Canada      http://www.harddata.com/
>     T5X 1Y3
> 
> Ask me about the UP1500 Alpha - Full systems from $3,500!
> 
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From SGaudet at turbotekcomputer.com  Fri Apr 26 05:34:36 2002
From: SGaudet at turbotekcomputer.com (Steve Gaudet)
Date: Fri, 26 Apr 2002 08:34:36 -0400
Subject: Tyan Tiger 2460 (Re)
Message-ID: <3450CC8673CFD411A24700105A618BD6267E6F@911TURBO>

Hello,

> On Thu, 25 Apr 2002, Steve Gaudet wrote:
> 
> > Has anyone reported these issues to Tyan's technical support?  Tyan
> > knows that their products sell very well in the Linux cluster market.
> > So not addressing these issues would cause customers to look elsewhere.
> 
> We filed a ticket yesterday, but AFAIK no response yet.  I've been
> snooping their website (and 3com's, and google, and anything else I can
> think of) trying to find some hint that this problem has been reported
> before.  It is very strange that such a complete screw up could still
> exist -- how can anybody be using Tigers in 1U or 2U cases if this
> problem is universal?  I posted here half hoping to hear somebody say
> "Ah, you need to change XXX in the bios, you idiot".  I'd even welcome
> the idiot part, if only I could get things to work.

I also submitted a ticket on this and made them aware of the list.  I'll
follow up with a call next week if I don't hear something soon.  We're also
going to do some testing here in our lab.

Cheers,

Steve Gaudet 
Linux Solutions Engineer
   ..... 
  <(???)> 
 
===================================================================
| Turbotek Computer Corp.    tel:603-666-3062 ext. 21             |
| 8025 South Willow St.      fax:603-666-4519                     |
| Building 2, Unit 105       toll free:800-573-5393               |
| Manchester, NH 03103       e-mail:sgaudet at turbotekcomputer.com  |
|                            web: http://www.turbotekcomputer.com |
===================================================================

  
From rok at ucsd.edu  Fri Apr 26 09:36:15 2002
From: rok at ucsd.edu (Robert Konecny)
Date: Fri, 26 Apr 2002 09:36:15 -0700
Subject: [NFS] NFS clients behind a masqueraded gateway
In-Reply-To: <Pine.LNX.4.30.0204252059310.16930-100000@elin.scali.no>; from sp@scali.com on Thu, Apr 25, 2002 at 09:02:44PM +0200
References: <Pine.LNX.4.30.0204180946500.10622-100000@elin.scali.no> <Pine.LNX.4.30.0204252059310.16930-100000@elin.scali.no>
Message-ID: <20020426093615.A3960@ucsd.edu>

Hi Steffen,

you're probably hitting NAT timeouts and NFS clients don't handle this 
well. We have the same setup on our cluster but we are using automountd on 
our NFS mounted directories - which works nicely around this problem.

cheers,

robert


On Thu, Apr 25, 2002 at 09:02:44PM +0200, Steffen Persvold wrote:
> Hi Wulfers,
> 
> I'm taking the liberty to post my question here as well since it might be
> relevant for some of you and you might have some experience with this.
> 
> I've also posted the mail to the NFS mailing list, but I haven't gotten
> any answers (yet).
> 
> Any pointers to what the problem might be are highly appreciated.
> 
> On Thu, 18 Apr 2002, Steffen Persvold wrote:
> 
> > Hi all,
> >
> > I'm experiencing some problems with a cluster setup. The cluster is set up
> > in a way that you have a frontend machine configured as a masquerading
> > gateway and all the compute nodes behind it on a private network (i.e the
> > frontend has two network interfaces). User home directories and also other
> > data directories which should be available to the cluster (i.e statically
> > mounted in the same location on both frontend and nodes) are located on
> > external NFS servers (IRIX and Linux servers). This seems to work fine
> > when the cluster is in use, but if the cluster is idle for some time (e.g
> > over night), the NFS directories have become unavailable and trying to
> > reboot the frontend results in a complete hang when it tries to unmount
> > the NFS directories (it hangs in a fuser command). The frontend and all
> > the nodes are running RedHat 7.2, but with a stock 2.4.18 kernel (plus
> > Trond's seekdir patch, thanks for the help BTW).
> >
> > Ideas anyone ?
> >
> > Thanks in advance,
> >
> 
> 
> Best regards,
> Steffen
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf


From bogdan.costescu at iwr.uni-heidelberg.de  Fri Apr 26 14:08:32 2002
From: bogdan.costescu at iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Fri, 26 Apr 2002 23:08:32 +0200 (CEST)
Subject: howto increase MTU size on 100Mbps FE
In-Reply-To: <3CC95986.EB9B5B9F@epa.gov>
Message-ID: <Pine.LNX.4.44.0204262228020.9178-100000@kenzo.iwr.uni-heidelberg.de>

On Fri, 26 Apr 2002, Joseph Mack wrote:

> I know that jumbo frames increase throughput rate on GigE and was
> wondering if a similar thing is possible with regular FE.

I was just about to ask the same thing... but to be more precise I was 
looking for first-hand experience with MPI/PVM over oversized Ethernet 
frames. As you say, the current situation doesn't really encourage 
experiments with Fast Ethernet, but I was thinking especially of people 
using Gigabit Ethernet.

> I couldn't increase the MTU above 1500 with ifconfig or ip link.

There are also limitations in the drivers. The upper network layers will 
refuse to set a value that is not supported by the driver.

> VLAN sends a packet larger than the standard MTU, having an
> extra 4 bytes of out of band data. The VLAN people have 
> problems with larger MTUs. 

I think that most of the problems come from the fact that most cards do 
not have real VLAN support, i.e. allowing the extra 4 bytes _only_ if the 
VLAN tag is present. Most of the cards allow oversized frames but there is 
no control over size and/or VLAN tag presence.
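
The check described above (accept the extra 4 bytes only when a VLAN tag
is actually present) can be sketched in software as follows. This is only
an illustration of the rule, not how any particular NIC or driver
implements it; it assumes 802.1Q tagging (TPID 0x8100 in the EtherType
position) and a frame buffer that excludes the 4-byte FCS. The function
name is hypothetical.

```python
ETH_HEADER_LEN = 14   # dst MAC (6) + src MAC (6) + EtherType (2)
ETH_DATA_LEN = 1500   # standard maximum payload
VLAN_TAG_LEN = 4      # 802.1Q tag: TPID (2) + TCI (2)
TPID_8021Q = 0x8100   # value seen at bytes 12-13 of a tagged frame


def frame_length_ok(frame: bytes) -> bool:
    """Return True if the frame (without FCS) has a legal length.

    An untagged frame may be at most 1514 bytes; the extra 4 bytes are
    allowed only when the 802.1Q tag is really there.
    """
    if len(frame) < ETH_HEADER_LEN:
        return False
    ethertype = int.from_bytes(frame[12:14], "big")
    limit = ETH_HEADER_LEN + ETH_DATA_LEN
    if ethertype == TPID_8021Q:
        limit += VLAN_TAG_LEN
    return len(frame) <= limit
```

A receiver that simply raises its length limit to 1518 bytes for every
frame would accept the oversized untagged frame that this check rejects.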

> which indicate that the MTU is set in the NIC driver and
> that in some cases the MTU=1500 is coded into the hardware
> or is at least hard to change.

There are actually some hardware limitations: cards have FIFO buffers 
which are designed based on normal Ethernet frame size; while VLAN's 4 
extra bytes usually fit, a 4-8 KiByte packet usually doesn't. Some drivers 
also take active measures to prevent Tx underruns which are probably 
disturbed by oversized frames.

> I don't know whether regular commodity switches (eg Netgear
> FS series) care about packet size, but I was going to 
> try to send packets over a cross-over cable initially. 

That's another question. Store-and-forward switches probably need to store 
the whole packet before transmitting it further...

<shameless plug>
A week ago, I released a patch to add large MTU/VLAN support to the 3c59x 
driver in 2.4.18. So far I haven't received any feedback about it... It 
still needs some work, but I first wanted to test functional support.
http://www.iwr.uni-heidelberg.de/groups/biocomp/bogdan/tornado/index.html
</shameless plug>

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De


From becker at scyld.com  Fri Apr 26 15:27:21 2002
From: becker at scyld.com (Donald Becker)
Date: Fri, 26 Apr 2002 18:27:21 -0400 (EDT)
Subject: howto increase MTU size on 100Mbps FE
In-Reply-To: <3CC95986.EB9B5B9F@epa.gov>
Message-ID: <Pine.LNX.4.33.0204261804410.5109-100000@presario>

On Fri, 26 Apr 2002, Joseph Mack wrote:

> I know that jumbo frames increase throughput rate on GigE and was
> wondering if a similar thing is possible with regular FE.

I used to track which FE NICs support oversized frames.  Jumbo frames
turned out to be so problematic that I've stopped maintaining the table.

> the MTU of 1500 was chosen for 10Mbps ethernet and was kept
> for 100Mbps and 1Gbps ethernet for backwards compatibility

Yup, 1500 bytes was chosen for interactive response on original
Ethernet.  (Note: originally Ethernet was 3Mbps, but commercial
equipment started at 10Mbps.)

The backwards compatibility issue is severe.  The only way to
automatically support jumbo frames is using the paged autonegotiation
information, and there is no standard established for this.

Jumbo frames *will* break equipment that isn't expecting oversized
packets.  If you detect a receive jabber (which is what a jumbo frame
looks like), you are allowed (and _should_) disable your receiver for a
period of time.  The rationale is that a network with an on-going
problem is likely to be generating flawed packets that shouldn't be
interpreted as valid.

> VLAN sends a packet larger than the standard MTU, having an
> extra 4 bytes of out of band data. The VLAN people have
> problems with larger MTUs. Here's their mailing list
> http://www.WANfear.com/pipermail/vlan/

Most of the VLAN people don't initially understand the capability of the
NICs, or why disabling Rx length checks is a Very Bad Idea.
There are many modern NIC types that have explicit VLAN support, and
VLAN should only be used with those NICs.  (Generic clients do not
require VLAN support.)

> which indicate that the MTU is set in the NIC driver and
> that in some cases the MTU=1500 is coded into the hardware
> or is at least hard to change.

Hardware that isn't expecting to handle oversized frames might break in
unexpected ways when Rx frame size checking is disabled.  Breaking for
every packet is fine.  Occasionally corrupting packets as a counter
rolls over might never be pinned on the NIC.

The driver also comes into play.  Most drivers are designed to receive
packets into a single skbuff, assigned to a single descriptor.  With
jumbo frames the driver might need to be redesigned with multiple
descriptors per packet.  This adds complexity and might introduce new
race conditions.  Another aspect is that dynamic Tx FIFO threshold
code is likely to be broken when the threshold size exceeds 2KB.  This
is a lurking failure -- it will not reveal itself until the PCI is very
busy, then Boom...
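
To make the "multiple descriptors per packet" point concrete, here is a
small sketch of how one received frame would be spread over fixed-size
receive buffers. The 1536-byte buffer size and the function name are
assumptions chosen for illustration; real drivers pick their own sizes
and descriptor layouts.

```python
def split_into_buffers(frame_len: int, buf_len: int = 1536):
    """Return (offset, length) pairs, one per receive buffer used.

    buf_len is a hypothetical per-descriptor buffer size; a standard
    frame fits in one buffer, a jumbo frame spans several.
    """
    if frame_len <= 0 or buf_len <= 0:
        raise ValueError("lengths must be positive")
    return [(off, min(buf_len, frame_len - off))
            for off in range(0, frame_len, buf_len)]


if __name__ == "__main__":
    print(len(split_into_buffers(1514)))   # standard frame
    print(len(split_into_buffers(9014)))   # 9000-byte MTU frame
```

With these assumed sizes a standard frame needs one descriptor and a
9000-byte-MTU frame needs six, and the driver has to reassemble (or
chain) the pieces before handing the packet up the stack.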

> I don't know whether regular commodity switches (eg Netgear
> FS series) care about packet size, but I was going to
> try to send packets over a cross-over cable initially.

Most switches very much care about packet size.  Consider what happens
in store-and-forward mode.
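
A back-of-the-envelope sketch of the store-and-forward cost: the switch
must receive the whole frame before it can start sending it, so each hop
adds at least one frame time of latency (and needs buffer space for the
whole frame). The 9018-byte figure below is an assumed jumbo frame (9000
bytes of payload plus header and FCS), not a value from this thread.

```python
def store_and_forward_delay_us(frame_bytes: int, link_bps: float) -> float:
    """Minimum added latency per hop, in microseconds.

    This is just the time to clock the full frame in at the link rate;
    switch processing time is ignored.
    """
    return frame_bytes * 8 / link_bps * 1e6


if __name__ == "__main__":
    for size in (64, 1518, 9018):
        print(size, store_and_forward_delay_us(size, 100e6))
```

At 100 Mbps a standard 1518-byte frame costs roughly 121 microseconds per
hop, and the assumed 9018-byte frame roughly 721 microseconds.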

All of these issues can be fixed or addressed on a case-by-case basis.
If you know the hardware you are using, and the symptoms of the
potential problems, it's fine to use jumbo frames.  But I would never
ship a turn-key product or preconfigured software that used jumbo frames
by default.  It should always require expertise and explicit action for
the end user to turn it on.

-- 
Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993


From atctam at csis.hku.hk  Fri Apr 26 21:33:17 2002
From: atctam at csis.hku.hk (Anthony Tam)
Date: Sat, 27 Apr 2002 12:33:17 +0800
Subject: Large chassis switch
Message-ID: <5.1.0.14.0.20020427121505.03813570@staff.csis.hku.hk>

Hi,

We are going to construct a PC cluster with more than 200 nodes.
Due to the limited budget, we decided to use Fast Ethernet as the
interconnect. We are thinking of using either Alpine 3808 from
Extreme or FastIron 1500 from Foundry, could anybody comment on
their performances? Or, if you have any web links that point to
some performance/evaluations of switches of similar type, this
would be appreciated.

Thanks.


Cheers

Anthony


       e Y8               d8    88
      d8b Y8     88*8e   d8888  88*e    88 88   88*8e  Y8b Y888
     d888b Y8    88 88b   88    88 88  88   88  88 88b  Y8b Y8
    d888888888   88 888   88    88 88  88   88  88 888   Y8b
   d888    b Y8  88 888   888   88 88   88 88   88 888    88
                                                         88
                                                        88


From chaimj at singnet.com.sg  Fri Apr 26 21:29:21 2002
From: chaimj at singnet.com.sg (Chai Mee Joon)
Date: Sat, 27 Apr 2002 12:29:21 +0800 (SGT)
Subject: Commercial:  Cluster Service (request for comments)
Message-ID: <1306.192.168.1.88.1019881761.squirrel@cdr.shogi.com.sg>

Hi everyone,


We provide a cluster computing service accessible online from anywhere.
Our OpenMosix Clusters are available now, and Scyld Beowulf Clusters
will be available in June 2002.


Please take a look at:

http://cluster.homelinux.net


Comments please !


Best regards,

Chai Mee Joon


From fraser5 at cox.net  Sat Apr 27 06:12:47 2002
From: fraser5 at cox.net (Jim Fraser)
Date: Sat, 27 Apr 2002 09:12:47 -0400
Subject: LAN question...
Message-ID: <000801c1eded$3aa77690$0800005a@papabear>

I am sorry this is not a direct beowulf question but I am trying to
understand the whole Wake-on-LAN feature and Resume on PME# features on my
motherboards that I am using in a cluster.  I figured someone here might
know more about this than me. (because I know very little on this subject
and I am having difficulty finding out from the hardware folks info as well)

My questions are:

1) What exactly is PME#?  My motherboard can "Resume on PME#" and in the
bios I have it active and it says that you can resume if there is traffic
on the onboard network adapter or PCI LAN card and that I need an ATX power
supply (which I understand and have).  Is there a special number or code that
I must send the NIC to turn on the system?  If I try and telnet to the
system (while it is off) I get no response.

2) Is "resume on PME#" different then WOL?

3) what does PME# stand for? I am guessing but Power Management...something
number?

4) Some of the newer external NIC's clam to support WOL and don't have a WOL
cable  that goes to the motherboard WOL connector (Linksys NIC's).   In the
manual it says that the cable is not needed anymore as this can be activated
thru the PCI bus if your mother board supports PXE (which it does).  Again,
is there a magic code to have the NIC send a signal to the motherboard to
have it turn itself on? Also, does it have to go in a special PCI slot?

Any help you can shed would be appreciated.


Thanks,

Jim


From fraser5 at cox.net  Sat Apr 27 09:42:53 2002
From: fraser5 at cox.net (Jim Fraser)
Date: Sat, 27 Apr 2002 12:42:53 -0400
Subject: LAN question... eureka!
In-Reply-To: <000801c1eded$3aa77690$0800005a@papabear>
Message-ID: <000b01c1ee0a$9573c340$0800005a@papabear>

I managed to find a description of a "magic packet" on the AMD site, and
with a little more sniffing around found a Perl script someone had written
that sends a magic packet (some stuff, then the MAC address of the card 16
times) to the card, and the machine turned on.  For some reason, it would
only work if I broadcast the packet to 255.255.255.255 along with the MAC
address, but it works.  (If I sent it to the machine's exact address, it
didn't.)

If anyone still has anything to add, feel free.  I still don't fully
understand this, but I got it working.
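
For reference, the magic packet described above can be sketched in a few
lines of Python.  This is a minimal illustration, not the Perl script
mentioned in the post: the function names, the placeholder MAC address and
the choice of UDP port 9 are assumptions made for the example.  The payload
layout (six 0xFF bytes followed by the MAC address repeated 16 times) is the
standard magic packet format.  Broadcasting is the usual approach, most
likely because a powered-off machine cannot answer ARP, so a packet sent to
its own IP address may never be delivered to the card.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return the 102-byte Wake-on-LAN payload for a MAC like '00:11:22:33:44:55'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected 6 octets in MAC address, got {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)
    # Six bytes of 0xFF, then the target MAC repeated 16 times.
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP (port 9 is a common convention, assumed here)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # Without SO_BROADCAST the kernel refuses to send to a broadcast address.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))


# Example (placeholder MAC address):
#   send_magic_packet("00:11:22:33:44:55")
```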

Jim




From ajiang at mail.eecis.udel.edu  Sat Apr 27 18:24:29 2002
From: ajiang at mail.eecis.udel.edu (Ao Jiang)
Date: Sat, 27 Apr 2002 21:24:29 -0400 (EDT)
Subject: Help! Scyld Beowuld Installation:
In-Reply-To: <000b01c1ee0a$9573c340$0800005a@papabear>
Message-ID: <Pine.GSO.4.33.0204272112020.3641-100000@ren.eecis.udel.edu>

  Hi,
  I am a beginning Beowulf user. I tried to install Scyld Beowulf
release 2.0 Preview. I installed the master node successfully, but I ran
into a problem when installing the slave nodes. According to the
Installation Guide, when I boot a slave node from CD-ROM, it will be
listed in beosetup by MAC address. That happened! I dragged it to the
center column and clicked 'apply', and the node was assigned to node 0.
But the slave node is always 'down', even after I reboot the master node.
  By the way, the slave node keeps printing the message "Sending RARP
request..."
  If you can give me some suggestions, I'll appreciate it!

  Tom


From agrajag at scyld.com  Sat Apr 27 19:33:27 2002
From: agrajag at scyld.com (Sean DIlda)
Date: 27 Apr 2002 22:33:27 -0400
Subject: Help! Scyld Beowuld Installation:
In-Reply-To: <Pine.GSO.4.33.0204272112020.3641-100000@ren.eecis.udel.edu>
References: <Pine.GSO.4.33.0204272112020.3641-100000@ren.eecis.udel.edu>
Message-ID: <1019961207.1762.4.camel@loiosh>

On Sat, 2002-04-27 at 21:24, Ao Jiang wrote:
>   Hi,
>   I am a beginner of Beowulf user. I tried to install it using Scyld
> Beowulf release 2.0 Preview. I installed the master node successfully. But

Ack! That version's over a year and a half old. If you are really
interested in our software, I strongly recommend getting a newer
version.

> when I installed the slave nodes, I met a problem. According to
> Installation guide, when I boot the slave node from CD-Rom, the slave
> node will be listed in beosetup by MAC address. It happened! When I
> dragged it to center column, clicked 'apply'. The node is assigned
> to node 0. But the slave node is always 'down', even if I reboot the
> master node. The status slave node still is 'down'.
>   By the way, the slave node always send signal "Sending RARP request..."

Based on the message you see on the slave node, I'd say it sounds like
either the file isn't getting written during the apply, or the daemons
aren't getting SIGHUPed like they should.

My guess is it's the daemons not getting SIGHUPed.  Were there any
problems during boot?  A failed step would have shown a red 'FAILED'
message instead of the green 'OK' message.

You might want to try forcing the beowulf daemons to restart.  As root,
do: /sbin/service beowulf restart
If you see failed messages when it's shutting stuff down, don't worry
about it; however, if everything isn't 'OK' when they try to start again,
then there is a problem there.

From ajiang at mail.eecis.udel.edu  Sun Apr 28 14:20:49 2002
From: ajiang at mail.eecis.udel.edu (Ao Jiang)
Date: Sun, 28 Apr 2002 17:20:49 -0400 (EDT)
Subject: Help! Scyld Beowuld Installation:
In-Reply-To: <1019961207.1762.4.camel@loiosh>
Message-ID: <Pine.GSO.4.33.0204281714570.1486-100000@ren.eecis.udel.edu>

   Hi,
   Thanks a lot for your suggestion.
   When I booted the system, the only 'Fail' item was: Bring up interface
eth1, Determining IP address of eth1... I guess the reason is that the
protocol of eth1 is set to DHCP and it is active on boot.
   I tried /sbin/service beowulf restart... everything is OK! But my
problem still exists.
   When I shut down the system, the only 'Fail' item is: Starting kill all
/etc/rc.d/rc6.d/S00killall:etc/init.d/beoserv No such file or directory.

   BTW, how much does the new version of Scyld Beowulf cost? Is it
possible to get a low-price version?

   Thanks again.

   Tom


From lindahl at keyresearch.com  Sun Apr 28 12:16:16 2002
From: lindahl at keyresearch.com (Greg Lindahl)
Date: Sun, 28 Apr 2002 12:16:16 -0700
Subject: COTS cooling
In-Reply-To: <200204240500.WAA23212@brownlee.cs.uidaho.edu>; from heckendo@cs.uidaho.edu on Tue, Apr 23, 2002 at 10:00:16PM -0700
References: <200204240407.g3O47Jb22438@blueraja.scyld.com> <200204240500.WAA23212@brownlee.cs.uidaho.edu>
Message-ID: <20020428121616.C11625@wumpus.attbi.com>

On Tue, Apr 23, 2002 at 10:00:16PM -0700, Robert B Heckendorn wrote:

> 450W/dualnode * 3.4BTU/hr/W * 400 nodes  = 612K BTU/hr

One thing that I haven't seen anyone point out is that these nodes
won't actually pull 450W in actual operation. However, you also need
to spec enough additional cooling to (1) survive the failure of one of
the AC units, and (2) account for the fact that the efficiency of the
AC units falls over time, by as much as 30%.
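
The sizing arithmetic in this exchange can be written out as a short
Python sketch.  It reproduces the quoted 612K BTU/hr figure and then applies
the two margins mentioned above (one spare AC unit and up to 30% efficiency
loss).  The function names and the 120,000 BTU/hr unit capacity are
hypothetical values chosen only for illustration; 3.4 BTU/hr per watt is the
rounded conversion factor used in the quoted line.

```python
import math

BTU_PER_HR_PER_WATT = 3.4  # rounded factor from the quoted estimate


def heat_load_btu_hr(watts_per_node: float, nodes: int) -> float:
    """Heat load in BTU/hr, assuming all electrical power ends up as heat."""
    return watts_per_node * BTU_PER_HR_PER_WATT * nodes


def ac_units_needed(load_btu_hr: float, unit_capacity_btu_hr: float,
                    derating: float = 0.30, spares: int = 1) -> int:
    """AC units required once each unit is derated and spare units are added."""
    effective_capacity = unit_capacity_btu_hr * (1.0 - derating)
    return math.ceil(load_btu_hr / effective_capacity) + spares


load = heat_load_btu_hr(450, 400)       # worst case: 450 W per dual node
print(f"{load:,.0f} BTU/hr")            # 612,000 BTU/hr
print(ac_units_needed(load, 120_000))   # 9 units of a hypothetical 120,000 BTU/hr size
```

Lowering `watts_per_node` to a measured draw shows how much of the margin
comes from the nameplate 450 W figure rather than from the redundancy and
aging allowances.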

greg


From ole at scali.no  Sun Apr 28 06:53:27 2002
From: ole at scali.no (Ole W. Saastad)
Date: Sun, 28 Apr 2002 15:53:27 +0200
Subject: Hyperthreading in P4
Message-ID: <3CCBFED7.FCF1DA94@scali.no>

New Pentium 4 processors have hyperthreading capabilities,
and when it is enabled Linux sees 4 CPUs on each dual node.

I have done some testing with OpenMP programs and found that
for OpenMP threaded programs there is no performance gain from using
hyperthreading. Using a number of threads equal to the number
of real processors seems to be optimal.
However, these are the results from just a few OpenMP programs and
might not tell the full story.
I would like comments from others who have played with threads
and hyperthreading in a Pentium 4 processor environment.


Hyperthreading is claimed to perform better when running a large
number of processes and a high number of threads. But that most
probably applies to different applications, or to many independent
requests to, say, a web or database server.
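
One way to act on the "one thread per real processor" observation is to
count physical cores separately from the logical CPUs that Linux reports.
The Python sketch below does this by parsing /proc/cpuinfo text.  It assumes
the `physical id` and `core id` fields printed by current x86 Linux kernels;
kernels of this era may not expose them, in which case every logical CPU is
counted as its own core.  `count_cpus` is a hypothetical helper name.

```python
def count_cpus(cpuinfo_text: str) -> tuple:
    """Return (logical_cpus, physical_cores) from /proc/cpuinfo-style text."""
    logical = 0
    cores = set()
    # Each logical CPU is described by one blank-line-separated block.
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue
        logical += 1
        # Hyperthreaded siblings share the same (package, core) pair.
        package = fields.get("physical id", "cpu" + fields["processor"])
        cores.add((package, fields.get("core id", "0")))
    return logical, len(cores)


# Example use on a live system:
#   logical, physical = count_cpus(open("/proc/cpuinfo").read())
#   then run the OpenMP program with OMP_NUM_THREADS set to `physical`.
```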


-- 
Ole W. Saastad, Dr.Scient. Scali AS P.O.Box 150 Oppsal 0619 Oslo NORWAY 
Tel:+47 22 62 89 68(dir) mailto:ole at scali.no http://www.scali.com 
Are you meeting Petaflop requirements with Gigaflops performance ?
    - Scali Terarack bringing Teraflops to the masses.


From suzen at theochem.tu-muenchen.de  Sun Apr 28 12:30:35 2002
From: suzen at theochem.tu-muenchen.de (Mehmet Ali Suzen)
Date: Sun, 28 Apr 2002 21:30:35 +0200
Subject: MPICH p4_error No space left on device.
Message-ID: <3CCC4DDB.8030901@theochem.tu-muenchen.de>

Hello,
I'm experiencing a problem with MPICH 1.2.2.1 on an SMP Linux cluster.
Just after launching my program with mpirun, I receive this error:

p4_error : Last Message No space left on device.

What may cause this p4_error? Or where could I find a detailed
description of p4_error?
I've checked the MPICH user manual, but it didn't help much.

Thanks for any comment.

Cheers,

Mehmet


From lindahl at keyresearch.com  Sun Apr 28 16:00:41 2002
From: lindahl at keyresearch.com (Greg Lindahl)
Date: Sun, 28 Apr 2002 16:00:41 -0700
Subject: COTS cooling
In-Reply-To: <1019710150.3596.37.camel@milagro.wildbrain.com>; from purp@wildbrain.com on Wed, Apr 24, 2002 at 09:49:09PM -0700
References: <200204240500.WAA23212@brownlee.cs.uidaho.edu> <1019710150.3596.37.camel@milagro.wildbrain.com>
Message-ID: <20020428160041.A12006@wumpus.attbi.com>

On Wed, Apr 24, 2002 at 09:49:09PM -0700, Jim Meyer wrote:

> For us, it showed that the total buildout of a real computer room would
> cost less than 10% of the cost of the machines. That and a discussion of
> raised failure rates due to heat turned the corner on that one.

That seems to be a rule of thumb; it's the number I've seen for some
very large machine rooms. Of course it's often more expensive to
change an existing room than to add stuff to one you're building.

greg

p.s. your company website isn't very linux or mozilla friendly, ah well.


From dan at systemsfirm.net  Sun Apr 28 23:06:27 2002
From: dan at systemsfirm.net (Daniel R. Philpott)
Date: Mon, 29 Apr 2002 02:06:27 -0400
Subject: Tyan Tiger 2460
Message-ID: <A4DF80F6AB5FB442A2E5B8838062D73F17E0@topcat.systemsfirm.net>

> From: Robert G. Brown [mailto:rgb at phy.duke.edu] 
>
> a couple of the boxes today armed with a 32 bit riser, a 64 
> bit riser, and an ATI rage video card and a 3c905m NIC.  
> 
>   a) Only the video card would work in slot 1.  Period.  If 
> we put the 3c905 in slot one all by itself (using the BIOS 
> console), the system would behave erratically, actually 
> 
>   b) If slot one had video or was empty, the system would 
> work fine for all other vertical configurations.  That is, 
> 
> The problem persisted, identically, when we put the 64 bit 
> riser (which we were really counting on to fix things) into 
> slot 1 and plugged the NIC and video into it, in either 
> order.  
> 
> HOWEVER, being clever little beasties, we put the dismounted 
> (32 bit) riser in slot 2 with the extra cabled keys in slots 
> 3 and 4, added the dismounted PCI cards to any slots we felt 
> like and voila!  The system, she work perfectly.  

Just a quick observation, looking at the different configurations you
tried makes me think that the working configuration may have modified
(slightly) the latency of the PCI bus.  Have you tried modifying the PCI
latency in BIOS to work without the extra hardware?

I don't use the Tiger 2460 boards (I was a sucker for the S2462) so I
can't test this hypothesis but maybe someone else can test it out.  

Dan


From rgb at phy.duke.edu  Mon Apr 29 08:44:19 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Mon, 29 Apr 2002 11:44:19 -0400 (EDT)
Subject: Tyan Tiger 2460
In-Reply-To: <A4DF80F6AB5FB442A2E5B8838062D73F17E0@topcat.systemsfirm.net>
Message-ID: <Pine.LNX.4.44.0204291140390.15521-100000@ganesh.phy.duke.edu>

On Mon, 29 Apr 2002, Daniel R. Philpott wrote:

> Just a quick observation, looking at the different configurations you
> tried makes me think that the working configuration may have modified
> (slightly) the latency of the PCI bus.  Have you tried modifying the PCI
> latency in BIOS to work without the extra hardware?
> 
> I don't use the Tiger 2460 boards (I was a sucker for the S2462) so I
> can't test this hypothesis but maybe someone else can test it out.  

We did play with some of the slot settings in the bios to no avail.
We're in mid-test of the slot 2 riser solution (which has worked fine
over the whole weekend at a load average between 3 and 6 and with the
disk and network being heavily banged on all the while, so it looks like
it works).  As soon as it finishes (or I feel like grabbing another of
the idle chassis to test with and open it up) I'll give this a more
focused try.

It does have the feel of a latency issue, but why slot 1?  Why slot 1
when the system is (otherwise) entirely empty?  I'm not asking you,
understand, just looking to heaven for Enlightenment...

   rgb


-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From jlb17 at duke.edu  Mon Apr 29 09:05:47 2002
From: jlb17 at duke.edu (Joshua Baker-LePain)
Date: Mon, 29 Apr 2002 12:05:47 -0400 (EDT)
Subject: Processor contention(?) and network bandwidth on AMD
Message-ID: <Pine.LNX.4.44.0204251348470.1591-100000@chaos.egr.duke.edu>

This is probably in the category of "Yup, that's the way it is, deal with 
it", but, just in case anyone has any ideas, I'm throwing it out there.

In the course of testing the gigabit connection in a new server, I noticed 
that overloaded dual AMD systems take a big hit in network bandwidth.  I'm 
testing with ttcp, and all connections were made over the same switch (HP 
Procurve 2324).

As an example, the results for a Tiger MPX (S2466) based node with dual 
1900+s and using the integrated 3Com are:

unloaded:                                         11486.6 KB/real sec
2 matlab simulations:                             10637.8 KB/real sec
2 matlab simulations and 2 SETI@homes (nice -19):  6645.4 KB/real sec

Ouch.  This is on RedHat 7.2 with kernel 2.4.9-31.
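
To put the three ttcp figures in proportion, the relative drops can be
worked out with a short Python sketch.  The helper names are hypothetical,
and the conversion to Mbit/s assumes that ttcp's "KB" means 1024 bytes.

```python
def percent_drop(baseline: float, rate: float) -> float:
    """Percentage by which `rate` falls below `baseline`."""
    return 100.0 * (1.0 - rate / baseline)


def to_mbit_per_s(kb_per_s: float) -> float:
    """Convert KB/s to Mbit/s, assuming 1 KB = 1024 bytes."""
    return kb_per_s * 1024 * 8 / 1e6


unloaded, two_matlab, overloaded = 11486.6, 10637.8, 6645.4

print(f"unloaded:      {to_mbit_per_s(unloaded):.1f} Mbit/s")               # 94.1
print(f"2 matlab:      {percent_drop(unloaded, two_matlab):.1f}% slower")   # 7.4
print(f"matlab + SETI: {percent_drop(unloaded, overloaded):.1f}% slower")   # 42.1
```

Under that assumption the unloaded figure is close to Fast Ethernet wire
speed, while the overloaded case loses a little over 40% of it.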

I eliminated every variable I could think of -- I tried this on an S2462 
(Thunder MP) based system, I used a PCI Intel eepro100 card rather than 
the built-in 3Com, I upgraded to an almost vanilla (French Vanilla?) 
2.4.18 kernel (the one from SGI's 1.1 XFS release).  All showed the same 
results (well, 2.4.18 didn't show much of a drop with just the two 
matlabs, but bandwidth still crashed with matlab+SETI).  The one Intel system I 
tested (dual PIII 933 on an i860) showed very little bandwidth drop with 
load, and no extra drop for an overload.

Any ideas?  Is there any way to fix this?  Or is the answer just to not 
run background nice jobs on cluster nodes?

Thanks.

-- 
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University


From aby_sinha at yahoo.com  Mon Apr 29 09:56:06 2002
From: aby_sinha at yahoo.com (abhishek Sinha)
Date: Mon, 29 Apr 2002 09:56:06 -0700 (PDT)
Subject: Hyperthreading in P4
In-Reply-To: <3CCBFED7.FCF1DA94@scali.no>
Message-ID: <20020429165606.5649.qmail@web20710.mail.yahoo.com>

Hi

I am also using a dual Xeon 2.2 GHz box, and it seems
that the box is slower than my ordinary Pentium III.
My guess is that the reason is the kernel. If I watch
/proc/interrupts, I see all of the interrupts on a single CPU.
After some research on the net I found that it requires some
kind of IRQ routing patch (Ingo's, I guess) so that
the CPUs perform better.
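
The interrupt imbalance described above can be quantified by summing the
per-CPU columns of /proc/interrupts.  The Python sketch below is only an
illustration: `per_cpu_interrupt_totals` is a hypothetical helper, and it
assumes the usual layout of a header row of CPU names followed by one row of
counts per interrupt source.

```python
def per_cpu_interrupt_totals(text: str) -> dict:
    """Sum interrupt counts per CPU from /proc/interrupts-style text."""
    lines = text.strip("\n").splitlines()
    cpus = lines[0].split()               # header row: CPU0 CPU1 ...
    totals = dict.fromkeys(cpus, 0)
    for line in lines[1:]:
        fields = line.split()
        if not fields or not fields[0].endswith(":"):
            continue                      # skip anything that is not "LABEL: counts..."
        for cpu, field in zip(cpus, fields[1:]):
            if not field.isdigit():
                break                     # reached the controller/device description
            totals[cpu] += int(field)
    return totals


# Example use on a live system:
#   print(per_cpu_interrupt_totals(open("/proc/interrupts").read()))
```

If nearly all of the total lands on CPU0, the machine is showing the
behavior described in this post.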

I haven't had much exposure to writing OpenMP programs
and then testing the power of the Xeon.

Comments?

abhishek


=====
And I dont want the world to see me
Coz i know that they won't understand
When Everything else is meant to be broken
I just want u to know who i m ....



From josip at icase.edu  Mon Apr 29 10:54:34 2002
From: josip at icase.edu (Josip Loncaric)
Date: Mon, 29 Apr 2002 13:54:34 -0400
Subject: howto increase MTU size on 100Mbps FE
References: <Pine.LNX.4.33.0204261804410.5109-100000@presario>
Message-ID: <3CCD88DA.67FDCB8D@icase.edu>

Donald Becker wrote:
> 
> On Fri, 26 Apr 2002, Joseph Mack wrote:
> 
> > I know that jumbo frames increase throughput rate on GigE and was
> > wondering if a similar thing is possible with regular FE.
> 
> I used to track which FE NICs support oversized frames.  Jumbo frames
> turned out to be so problematic that I've stopped maintaining the table.
> [...]

> The backwards compatibility issue is severe.

Jumbo frames are great for reducing host frame processing overhead, but,
unfortunately, we arrived at the same conclusion: jumbo frames and
normal equipment do not mix well.  If you have a separate network where
all participants use jumbo frames, fine; otherwise, things get messy.

Alteon (a key proponent of jumbo frames) has some suggestions: define a
normal frame VLAN including everybody and a (smaller) jumbo frame VLAN;
then use their ACEswitch 180 to automatically fragment UDP datagrams
when routing from a jumbo frame VLAN to a non-jumbo frame VLAN (TCP is
supposed to negotiate MTU for each connection, so it should not need
this help).  This sounds simple, but it requires support for 802.1Q VLAN
tagging in the Linux kernel if a machine is to participate in both the
jumbo frame and the non-jumbo frame VLAN.  Moreover, in practice this mix is
fragile for many reasons, as Donald Becker has explained...

One of the problems I've seen involves UDP packets generated by NFS. 
When a large UDP packet (jumbo frame MTU=9000) is fragmented into 6
standard (MTU=1500) UDP packets, the receiver is likely to drop some of
these 6 fragments because they are arriving too closely spaced in time. 
If even one fragment is dropped, the NFS has to resend that jumbo UDP
packet, and the process can repeat.  This results in a drastic NFS
performance drop (almost 100:1 in our experience).  To restore
performance, you need significant interrupt mitigation on the receiver's
NIC (e.g. receive all 6 packets before interrupting), but this can hurt
MPI application performance.  NFS-over-TCP may be another good solution
(untested!).
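
The fragmentation effect described above can be sketched numerically in
Python.  The helper names are hypothetical, and the loss model (each
fragment dropped independently with the same probability) is a simplifying
assumption; the drops described here are bursty, so real behavior can be
worse.  The header sizes are those of IPv4 without options and of UDP.

```python
import math

IP_HEADER = 20   # bytes, IPv4 header without options
UDP_HEADER = 8   # bytes


def fragment_count(udp_payload: int, mtu: int = 1500) -> int:
    """Number of IP fragments needed to carry one UDP datagram at a given MTU."""
    ip_payload = udp_payload + UDP_HEADER
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil(ip_payload / per_fragment)


def datagram_delivery_probability(fragments: int, fragment_loss: float) -> float:
    """Chance that all fragments arrive, assuming independent losses."""
    return (1.0 - fragment_loss) ** fragments


print(fragment_count(8192))   # 6 fragments for an 8 KB NFS transfer
print(fragment_count(8972))   # 7 fragments for a completely full 9000-byte packet
print(round(datagram_delivery_probability(6, 0.10), 3))   # 0.531
```

Because losing any one fragment forces the whole datagram to be resent, even
a modest per-fragment loss rate is amplified: with 10% of fragments dropped,
only about half of the six-fragment datagrams get through on the first try.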

We got good gigabit ethernet bandwidth using jumbo frames (about 2-3
times better than normal frames using NICs with Alteon chipsets and the
acenic driver), but in the end full compatibility with existing
non-jumbo equipment won the argument: we went back to normal frames. 
The frame processing overhead does not seem as bad now that CPUs are so
much faster (2GHz+), even with our gigabit ethernet, and particularly
not with fast ethernet.

However, if we had a separate jumbo-frame-only gigabit ethernet network,
we'd stick to jumbo frames.  Jumbo frames are simply a better solution
for bulk data transfer, even with fast CPUs.

Sincerely,
Josip

-- 
Dr. Josip Loncaric, Research Fellow               mailto:josip at icase.edu
ICASE, Mail Stop 132C           PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center             mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134


From Daniel.Kidger at quadrics.com  Mon Apr 29 10:54:35 2002
From: Daniel.Kidger at quadrics.com (Daniel Kidger)
Date: Mon, 29 Apr 2002 18:54:35 +0100
Subject: Hyperthreading in P4
Message-ID: <010C86D15E4D1247B9A5DD312B7F5AA74D2D63@stegosaurus.bristol.quadrics.com>

Ole W. Saastad [mailto:ole at scali.no] wrote:

>New Pentium 4 processors has hyper threading capabilities
>and when setting this the linux sees 4 cpus on each dual node.

>I have done some testing with OpenMP programs and found that
>for OpenMP threaded programs there is no performance gain in using 
>the hypertheading. Using a number of threads that equal the number 
>of real processors seems to be optimal.

Having a multi-threaded processor should help codes which are limited by
memory *latency*.
I doubt if memory-bandwidth limited codes would benefit much, since memory
bandwidth is a limited resource which is already oversubscribed on many
dual-P4 nodes.

Also there are no more floating-point units than on a standard P4, so CPU
limited codes won't see any improvement either.

Perhaps the interesting area, though, is where the CPU can issue instructions
to the FPU *AND* to the integer execution units concurrently but for
different threads. This would perhaps allow general Linux system services to
run without impacting the performance of application codes?

Yours,
Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger at quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
----------------------- www.quadrics.com --------------------


From hahn at physics.mcmaster.ca  Mon Apr 29 12:40:30 2002
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Mon, 29 Apr 2002 15:40:30 -0400 (EDT)
Subject: Processor contention(?) and network bandwidth on AMD
In-Reply-To: <Pine.LNX.4.44.0204251348470.1591-100000@chaos.egr.duke.edu>
Message-ID: <Pine.LNX.4.33.0204291522590.20026-100000@coffee.psychology.mcmaster.ca>

> This is probably in the category of "Yup, that's the way it is, deal with 
> it", but, just in case anyone has any ideas, I'm throwing it out there.

well, there are several contributing factors, which are probably
mitigated by running a decently modern (ie 2.4.18) kernel.

for instance, at gigabit speeds, you're almost certainly generating
nontrivial MM load.  there's been a huge amount of improvement in 2.4's
in how they handle ram.

2.4 (vs 2.2) has some fairly profound changes to the structure of the
network stack, including efforts to make zero-copy possible (which 
affects alignment of packets, I think, even if you're not sendfile'ing.)
I've also heard mutterings from Alan Cox-ish people that the scheme for
waking up user-space is a bit too timid (resulting in the stack punting,
and relying on ksoftirqd to eventually do the deed.)

and of course, there's a big update to the scheduler impending or 
already merged ("Ingo's O(1) scheduler").  it seems to be a lot smarter
about issues like migrating procs, affinity, waking up the right tasks, etc.

> unloaded:                                         11486.6 KB/real sec
> 2 matlab simulations:                             10637.8 KB/real sec
> 2 matlab simulations and 2 SETI@homes (nice -19):  6645.4 KB/real sec

SETI@home is obviously in the "so don't do that" category.  I expect your
matlab was decelerated by a similar amount.

though I think that another of Ingo's goals with the new scheduler
was to give more intuitive nice -19 behavior.  that is, most people 
think of nice -19 as a way to spend otherwise idle CPU on something.
Linux (and at least some other Unixes) have *NEVER* done this - 
figure around 5% CPU, and that's ignoring cache/etc effects.
anyway, I think Ingo tries to keep -19 pretty close to idle-only.
(there are fundamental issues with really implementing an idle-only
form of scheduling, since you wind up with "priority inversion",
where the idle-only task holds a lock when high-pri jobs want to do
something...)

> Ouch.  This is on RedHat 7.2 with kernel 2.4.9-31.

ugh.

NOTE TO ALL BEOWULF USERS: seriously consider running 2.4.18
or better, and *definitely* try out gcc 3.1 ASAP.  

these updates have changed my life </evangelical>


> 2.4.18 kernel (the one from SGI's 1.1 XFS release).  All showed the same 
> results (well, 2.4.18 didn't show much of a drop with just the two 
> matlabs, but still crashed with matlab+SETI).  The one Intel system I 

hmm, come to think of it, I think Ingo's scheduler isn't merged 
in 2.4.19-pre yet.

> tested (dual PIII 933 on an i860) showed very little bandwidth drop with 

i860 is a P4 chipset afaik...

> load, and no extra drop for an overload.

then again, it would not be astonishing if PIII and P4 showed
dramatically different effects of cache pollution.  remember that 
the P4 depends rather strongly on seeing a decent hit rate in 
its many caches (trace cache, normal I+D caches, TLB, prediction tables, etc)


From jlb17 at duke.edu  Mon Apr 29 12:49:32 2002
From: jlb17 at duke.edu (Joshua Baker-LePain)
Date: Mon, 29 Apr 2002 15:49:32 -0400 (EDT)
Subject: Processor contention(?) and network bandwidth on AMD
In-Reply-To: <Pine.LNX.4.33.0204291522590.20026-100000@coffee.psychology.mcmaster.ca>
Message-ID: <Pine.LNX.4.44.0204291544380.1681-100000@chaos.egr.duke.edu>

On Mon, 29 Apr 2002 at 3:40pm, Mark Hahn wrote

> well, there are several contributing factors, which are probably
> mitigated by running a decently modern (ie 2.4.18) kernel.
> 
> for instance, at gigabit speeds, you're almost certainly generating
> nontrivial MM load.  there's been a huge amount of improvement in 2.4's
> in how they handle ram.

These tests (obviously) were only at FE speed, though.  The receiving end 
was gigabit, but the sending end (the dual AMD nodes) was FE.

> > unloaded:                                         11486.6 KB/real sec
> > 2 matlab simulations:                             10637.8 KB/real sec
> > 2 matlab simulations and 2 SETI@homes (nice -19):  6645.4 KB/real sec
> 
> SETI@home is obviously in the "so don't do that" category.  I expect your
> matlab was decelerated by a similar amount.

Sure, but it was just an example of a niced background load, which 
"shouldn't" interfere with anything.  It certainly shouldn't crash 
bandwidth like that.

> > tested (dual PIII 933 on an i860) showed very little bandwidth drop with 
> 
> i860 is a P4 chipset afaik...

Oops -- not enough coffee.  I meant i840.  Dual PIII with RDRAM.

-- 
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University


From siegert at sfu.ca  Mon Apr 29 13:13:30 2002
From: siegert at sfu.ca (Martin Siegert)
Date: Mon, 29 Apr 2002 13:13:30 -0700
Subject: Processor contention(?) and network bandwidth on AMD
In-Reply-To: <Pine.LNX.4.33.0204291522590.20026-100000@coffee.psychology.mcmaster.ca>; from hahn@physics.mcmaster.ca on Mon, Apr 29, 2002 at 03:40:30PM -0400
References: <Pine.LNX.4.44.0204251348470.1591-100000@chaos.egr.duke.edu> <Pine.LNX.4.33.0204291522590.20026-100000@coffee.psychology.mcmaster.ca>
Message-ID: <20020429131330.A10649@stikine.ucs.sfu.ca>

On Mon, Apr 29, 2002 at 03:40:30PM -0400, Mark Hahn wrote:
<snip>
> NOTE TO ALL BEOWULF USERS: seriously consider running 2.4.18
> or better, and *definitely* try out gcc 3.1 ASAP.  
> 
> these updates have changed my life </evangelical>

I can only confirm that 2.4.18 improves "life" dramatically.
Before that certain MPI jobs (using LAM) would just hang.

Now with respect to gcc-3.x:
It has been pointed out that gcc-3.0 is extremely bad for Athlons
and (only) bad for PIIIs:
http://math-atlas.sourceforge.net/errata.html#gcc3.0
My recent exercise in compiling atlas only confirms this.

Have these issues been resolved in gcc-3.1?

Martin

========================================================================
Martin Siegert
Academic Computing Services                        phone: (604) 291-4691
Simon Fraser University                            fax:   (604) 291-4242
Burnaby, British Columbia                          email: siegert at sfu.ca
Canada  V5A 1S6
========================================================================


From SGaudet at turbotekcomputer.com  Mon Apr 29 11:33:18 2002
From: SGaudet at turbotekcomputer.com (Steve Gaudet)
Date: Mon, 29 Apr 2002 14:33:18 -0400
Subject: Hyperthreading in P4
Message-ID: <3450CC8673CFD411A24700105A618BD6267E92@911TURBO>


> I am also using a dual Xeon 2.2 Ghz box and it seems
> that the box is slower than my normal pentium 3 also.
> The reason i guess is the kernel. If i watch my
> /proc/interrupts i see all of them on the single CPU .
> Upon research on the net i found that it required some
> kind of IRQ routing patch , (ingo's i guess) so that
> the CPU's perform better.
> 
> I havent had much exposure to writing Open MP programs
> and then testing the power of the Xeon.
> 
> comments??


Key issues are:
1.  Code must be threaded.
2.  Hyper-Threading must be enabled in both the BIOS and the O.S.
	- RH has a patch available on their site.

As for the performance vs PIII, I strongly recommend that application
developers use our C and Fortran compilers for Linux.  GCC is not
well-optimized for Netburst architecture, PGI is o.k., and our compilers
really fly!  In addition, our compilers generate the best code for PIII,
AMD, and P4P/Xeon based systems so you really can't lose.

Pls see the following Hyper-Threading whitepaper (preliminary) for all the
gory details about req'ts.


Cheers,


Steve Gaudet 
Linux Solutions Engineer
 
===================================================================
| Turbotek Computer Corp.    tel:603-666-3062 ext. 21             |
| 8025 South Willow St.      fax:603-666-4519                     |
| Building 2, Unit 105       toll free:800-573-5393               |
| Manchester, NH 03103       e-mail:sgaudet at turbotekcomputer.com  |
|                            web: http://www.turbotekcomputer.com |
===================================================================


-------------- next part --------------
A non-text attachment was scrubbed...
Name: HyperthreadingOSenabling.doc
Type: application/msword
Size: 1554944 bytes
Desc: not available
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20020429/7efbb061/attachment.doc>

From math at velocet.ca  Mon Apr 29 13:56:52 2002
From: math at velocet.ca (Velocet)
Date: Mon, 29 Apr 2002 16:56:52 -0400
Subject: Processor contention(?) and network bandwidth on AMD
In-Reply-To: <Pine.LNX.4.33.0204291522590.20026-100000@coffee.psychology.mcmaster.ca>; from hahn@physics.mcmaster.ca on Mon, Apr 29, 2002 at 03:40:30PM -0400
References: <Pine.LNX.4.44.0204251348470.1591-100000@chaos.egr.duke.edu> <Pine.LNX.4.33.0204291522590.20026-100000@coffee.psychology.mcmaster.ca>
Message-ID: <20020429165652.U8530@velocet.ca>

On Mon, Apr 29, 2002 at 03:40:30PM -0400, Mark Hahn's all...

> and of course, there's a big update to the scheduler impending or 
> already merged ("Ingo's O(1) scheduler").  it seems to be a lot smarter
> about issues like migrating procs, affinity, waking up the right tasks, etc.
> 
> > unloaded:                                         11486.6 KB/real sec
> > 2 matlab simulations:                             10637.8 KB/real sec
> > 2 matlab simulations and 2 SETI@homes (nice -19):  6645.4 KB/real sec
> 
> SETI@home is obviously in the "so don't do that" category.  I expect your
> matlab was decelerated by a similar amount.
> 
> though I think that another of Ingo's goals with the new scheduler
> was to give more intuitive nice -19 behavior.  that is, most people 
> think of nice -19 as a way to spend otherwise idle CPU on something.
> Linux (and at least some other Unixes) have *NEVER* done this - 
> figure around 5% CPU, and that's ignoring cache/etc effects.
> anyway, I think Ingo tries to keep -19 pretty close to idle-only.
> (there are fundamental issues with really implementing an idle-only
> form of scheduling, since you wind up with "priority inversion",
> where the idle-only task holds a lock when high-pri jobs want to do
> something...)

How does freebsd do this then? They've had idle (and realtime) priority
in the kernels for a couple years. And there are no problems with
priority inversion (which was Mike Shaver's answer to me for linux's
lack of idle time priority 2 years ago when I asked him if Linux was
going to be incorporating it) - rather, they have lock breaking code working
nicely in freebsd.

Freebsd's idle priority gives 30 levels of idle priority to play with.
Anything at a lower level of idle priority gets NO time on the cpu at all
until there is some available. This is quite nice when something like
Gaussian98 is running and you want to put a higher priority g98 job on
without having (as in linux) a nice level 19 G98 fighting and thrashing
the cache vs the nice level 0 g98. I notice a total speedup of at least
1-2% from running the two jobs sequentially vs putting 1 at 19 and 1 at 0.
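
For comparison, here is a minimal sketch of the linux side of this:
starting a background job at nice 19 from Python.  This is ordinary
nice(2) behaviour, not freebsd's idle priority class, so the job still
competes for some CPU.  The helper name run_niced is made up for
illustration.

```python
import os
import subprocess
import sys


def run_niced(cmd, increment=19):
    """Run cmd in a child process whose niceness is raised by `increment`.

    Hypothetical helper for illustration.  os.nice() is called in the
    child between fork and exec, so the parent's priority is unchanged.
    """
    return subprocess.run(
        cmd,
        preexec_fn=lambda: os.nice(increment),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # The child reports its own niceness; os.nice(0) reads it without changing it.
    result = run_niced([sys.executable, "-c", "import os; print(os.nice(0))"])
    print("child niceness:", result.stdout.strip())
```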

I find this system VERY useful for scheduling jobs on various machines
shared by different groups. It really guarantees that the CPU will be
used primarily for one type of job and not another and avoids
cache thrashing quite nicely - until something goes to disk, and then
the idle job wakes up, etc... is it worth using the extra cpu and possibly
thrashing the cache, or is it more efficient to wait for a bigger chunk
of free CPU?

/kc


> 
> > Ouch.  This is on RedHat 7.2 with kernel 2.4.9-31.
> 
> ugh.
> 
> NOTE TO ALL BEOWULF USERS: seriously consider running 2.4.18
> or better, and *definitely* try out gcc 3.1 ASAP.  
> 
> these updates have changed my life </evangelical>
> 
> 
> > 2.4.18 kernel (the one from SGI's 1.1 XFS release).  All showed the same 
> > results (well, 2.4.18 didn't show much of a drop with just the two 
> > matlabs, but still crashed with matlab+SETI).  The one Intel system I 
> 
> hmm, come to think of it, I think Ingo's scheduler isn't merged 
> in 2.4.19-pre yet.
> 
> > tested (dual PIII 933 on an i860) showed very little bandwidth drop with 
> 
> i860 is a P4 chipset afaik...
> 
> > load, and no extra drop for an overload.
> 
> then again, it would not be astonishing if PIII and P4 showed
> dramatically different effects of cache pollution.  remember that 
> the P4 depends rather strongly on seeing a decent hit rate in 
> its many caches (trace cache, normal I+D caches, TLB, prediction tables, etc)
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

-- 
Ken Chase, math at velocet.ca  *  Velocet Communications Inc.  *  Toronto, CANADA 


From rgb at phy.duke.edu  Mon Apr 29 14:07:24 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Mon, 29 Apr 2002 17:07:24 -0400 (EDT)
Subject: Processor contention(?) and network bandwidth on AMD
In-Reply-To: <Pine.LNX.4.44.0204291544380.1681-100000@chaos.egr.duke.edu>
Message-ID: <Pine.LNX.4.44.0204291638030.16146-100000@ganesh.phy.duke.edu>

On Mon, 29 Apr 2002, Joshua Baker-LePain wrote:

> > > unloaded:                                         11486.6 KB/real sec
> > > 2 matlab simulations:                             10637.8 KB/real sec
> > > 2 matlab simulations and 2 SETI@homes (nice -19):  6645.4 KB/real sec
> > 
> > SETI@home is obviously in the "so don't do that" category.  I expect your
> > matlab was decelerated by a similar amount.
> 
> Sure, but it was just an example of a niced background load, which 
> "shouldn't" interfere with anything.  It certainly shouldn't crash 
> bandwidth like that.

Joshua,

Actually, running a heavy background load can (as you have observed)
significantly affect network times, especially if it is the receiver
that is loaded.  As to whether or not it "should", I cannot say (kind of
a value judgement there:-), but one can try to understand it.  There are
deliberate tradeoffs made in the tuning of the kernel and for better or
worse the linux tradeoffs optimize "user response time" at the expense
of a variety of things that might improve throughput on a purely
computational load or throughput on the network or pretty much anything
else.  Sometimes one can retune -- Josip Loncaric's TCP patch is one
such retuning, but one can also envision changing timeslice granularity
and other things to optimize one thing at the expense of others.
Generally such a retuning is a Bad Idea.  Right now the kernel is pretty
damn good, overall, and all components are delicately balanced.  As
Mark's previous reply made clear, some naive retunings would just lock
up the system (or really make performance go to hell) as important
components starve.

It isn't too hard to see why loading the receiver might decrease the
efficiency of the network.  Imagine the network component of the kernel
from the point of view of the stream receiver (not the transmitter).  It
never knows when the next packet/message will come through.  The kernel
does its best to do OTHER work in the gaps between packets by installing
top half and bottom half handlers and the like (so it does no more work
than absolutely necessary when the asynchronous interrupt is first
received, postponing what it can until later) to provide the illusion of
seamless access to the CPU and other resources for running processes.
One side effect of this is that there are times when the delivery of
packets is delayed so that a background application can complete a
timeslice it was given "in between" packets when the system was
momentarily idle.

What this ends up meaning is that when the system is BUSY, it de facto
delays the delivery of packets that it has buffered for fractions of the
many timeslices of CPU the system is allocating to the competing tasks
when the network process is momentarily idle (blocked, waiting for the
next packet). If it didn't do this a high speed packet stream could (for
example) starve running processes for CPU by forcing them to wait for
the whole stream to complete.

Processing the text of TCP packets (not to mention the interrupts and
context switches themselves) is a nontrivial load on the CPU in its own
right, so much so that people try NOT to run high-performance network
connections for fine-grained code over TCP if they can avoid it.  The
network stack ends up contending for CPU with everything else that is
running, and it makes no sense to retune things so that this is never
true as the cure will likely be worse than the disease for most usage
patterns.  

Curiously, transmitting works more efficiently than receiving, probably
because the transmitter is in charge of the scheduling.  In very crude
terms the transmitter is never interrupted or delayed by other processes
-- it just gets its timeslice, executes a send or stream of sends,
eventually blocks (moving up in priority while blocked) or finishes its
timeslice, and then moves on.  No delays to speak of.

Try this:

  Do your netpipe transmitter on an unloaded host, a host at load 1 and
at load 2.
  Do your netpipe receiver on an unloaded host, a host at load 1 and one
at load 2.

Fill in the matrix -- load 0 to load 0, load 0 to load 1, etc.

I found (in similar tests done years ago) that a TRANSMITTER could be
loaded to 2 (per cpu) with only a small degradation of throughput, but
loading a RECEIVER would drop throughput dramatically, by as much as
50%.
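
The test matrix described above can be sketched as follows.  This only
enumerates the nine transmitter/receiver load combinations and prints an
empty table to fill in; it does not run netpipe, and the function names
are invented for the example.

```python
from itertools import product

LOADS = (0, 1, 2)  # background load on each host


def load_matrix(loads=LOADS):
    """Return every (transmitter_load, receiver_load) pair to measure."""
    return list(product(loads, repeat=2))


def print_blank_matrix(loads=LOADS):
    """Print an empty table: rows = transmitter load, columns = receiver load."""
    print("tx\\rx" + "".join(f"{rx:>10}" for rx in loads))
    for tx in loads:
        print(f"{tx:<5}" + "".join(f"{'?':>10}" for _ in loads))


if __name__ == "__main__":
    for tx, rx in load_matrix():
        print(f"measure: transmitter at load {tx}, receiver at load {rx}")
    print_blank_matrix()
```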

  rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From jlb17 at duke.edu  Mon Apr 29 14:22:41 2002
From: jlb17 at duke.edu (Joshua Baker-LePain)
Date: Mon, 29 Apr 2002 17:22:41 -0400 (EDT)
Subject: Processor contention(?) and network bandwidth on AMD
In-Reply-To: <Pine.LNX.4.44.0204291638030.16146-100000@ganesh.phy.duke.edu>
Message-ID: <Pine.LNX.4.44.0204291717030.1681-100000@chaos.egr.duke.edu>

On Mon, 29 Apr 2002 at 5:07pm, Robert G. Brown wrote

> Try this:
> 
>   Do your netpipe transmitter on an unloaded host, a host at load 1 and
> at load 2.
>   Do your netpipe receiver on an unloaded host, a host at load 1 and one
> at load 2.
> 
> Fill in the matrix -- load 0 to load 0, load 0 to load 1, etc.
> 
> I found (in similar tests done years ago) that a TRANSMITTER could be
> loaded to 2 (per cpu) with only a small degradation of throughput, but
> loading a RECEIVER would drop throughput dramatically, by as much as
> 50%.

I will indeed do these tests and post a followup.  One thing, though, that 
I don't know that I made clear.  I understand and accept that a higher 
load is going to affect bandwidth.  What I'm pointing out (and wondering 
why) is that the effect is *far* greater on dual AMD based systems than it 
is on, e.g., the dual PIII systems I have.

As an aside, at Mark's suggestion, I reniced the ksoftirqds on a S2466 
based system, and saw vast improvement.  For no load, 2 matlabs, and 2 
matlabs+2SETIs I saw (with the cursed 2.4.9-31 RH kernel):

ksoftirqds reniced to 0:

11463.5 KB/real sec
10637   KB/real sec
9585.39 KB/real sec

And reniced to -19:

11481.8 KB/real sec
10632.7 KB/real sec
9347.31 KB/real sec

FWIW.
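
To put the numbers side by side, a small sketch that computes the
bandwidth lost relative to the unloaded case.  The figures are the ones
quoted in this thread; the function name is arbitrary.  The overload
drop goes from roughly 42% originally to roughly 16% and 19% with the
ksoftirqds reniced.

```python
def percent_drop(baseline, loaded):
    """Percentage of bandwidth lost relative to the unloaded baseline."""
    return 100.0 * (baseline - loaded) / baseline


# KB/real sec: (no load, 2 matlabs, 2 matlabs + 2 SETIs)
runs = {
    "original post": (11486.6, 10637.8, 6645.4),
    "ksoftirqds reniced to 0": (11463.5, 10637.0, 9585.39),
    "ksoftirqds reniced to -19": (11481.8, 10632.7, 9347.31),
}

if __name__ == "__main__":
    for label, (idle, matlab, overload) in runs.items():
        print(
            f"{label}: {percent_drop(idle, matlab):.1f}% with 2 matlabs, "
            f"{percent_drop(idle, overload):.1f}% with matlabs + SETIs"
        )
```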

-- 
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University


From rocky at atipa.com  Mon Apr 29 16:09:08 2002
From: rocky at atipa.com (Rocky McGaugh)
Date: Mon, 29 Apr 2002 18:09:08 -0500 (CDT)
Subject: Hyperthreading in P4
In-Reply-To: <3450CC8673CFD411A24700105A618BD6267E92@911TURBO>
Message-ID: <Pine.LNX.4.33.0204291802370.18438-100000@rocky.lab.atipa.com>

On Mon, 29 Apr 2002, Steve Gaudet wrote:

> 
> Key issues are:
> 1.  Code must be threaded.
> 2.  BIOS and O.S. must be enabled.
> 	- RH has a patch available on their site.
> 
> As for the performance vs PIII, I strongly recommend that application
> developers use our C and Fortran compilers for Linux.  GCC is not
> well-optimized for Netburst architecture, PGI is o.k., and our compilers
> really fly!  In addition, our compilers generate the best code for PIII,
> AMD, and P4P/Xeon based systems so you really can't lose.
> 
> Pls see the following Hyper-Threading whitepaper (preliminary) for all the
> gory details about req'ts.> 
> 
> 

Fine and dandy. The problem is that with no way to bind processes to 
processors, it's quite easy for 2 heavy processes (or threads) to migrate 
to the same physical CPU, leaving 2 smaller threads (or processes) on the 
other physical CPU.

For CPU bound apps, it's too unpredictable without the processor affinity 
stuff.
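
As an illustration of what "processor affinity stuff" looks like once it
exists, here is a minimal sketch using Python's os.sched_setaffinity.
It assumes a Linux kernel that provides the affinity system calls, which
is not the case for the kernels discussed in this thread.  It also does
not work out which logical CPUs are hyperthread siblings of the same
physical CPU; that mapping would have to come from elsewhere.

```python
import os


def pin_to_one_cpu(pid=0):
    """Restrict a process (0 = the caller) to a single allowed CPU.

    Assumes os.sched_setaffinity is available (Linux only).  Returns
    the resulting affinity set.
    """
    allowed = os.sched_getaffinity(pid)
    target = {min(allowed)}  # lowest-numbered CPU we are allowed to use
    os.sched_setaffinity(pid, target)
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    print("now restricted to CPUs:", sorted(pin_to_one_cpu()))
```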


-- 
Rocky McGaugh
Atipa Technologies
rocky at atipatechnologies.com
rmcgaugh at atipa.com
1-785-841-9513 x3110
http://1087800222/
perl -e 'print unpack(u, ".=W=W+F%T:7\!A+F-O;0H`");'


From fruechtl at fecit.co.uk  Tue Apr 30 02:20:36 2002
From: fruechtl at fecit.co.uk (Herbert Fruchtl)
Date: Tue, 30 Apr 2002 10:20:36 +0100
Subject: Attachmants (was: Hyperthreading in P4)
References: <200204292059.g3TKxcD06021@blueraja.scyld.com>
Message-ID: <3CCE61E4.20F3D7B6@fecit.co.uk>

Would you PLEEEEEASE not send huge binary attachments to the list! Put
them on the web, send them on demand, whatever. Hundreds of elm users
hate you and will never buy from your company again :-)

I apologise if others already complained. I only get the daily digest
(where attachments are definitely unreadable).

  Herbert


From ole at scali.com  Tue Apr 30 05:00:03 2002
From: ole at scali.com (Ole W. Saastad)
Date: Tue, 30 Apr 2002 14:00:03 +0200
Subject: OpenMP and P4 hyperthreading
Message-ID: <3CCE8743.6ED1D681@scali.no>

Will hyper-threading make sense when using OpenMP and more threads than
physical processors?

I have run the NPB2.3 benchmark in the C/OpenMP version on a dual
Pentium Xeon system and found some interesting results. For most of the
benchmarks there is no gain in using hyperthreading, as expected, but
for the ep benchmark there is a significant speedup. This benchmark
contains a loop with transcendentals like ln, exp and pow (pow is a
combination of ln and exp). The ep benchmark is supposed to scale almost
perfectly as it is embarrassingly parallel (hence the name ep), but it
was somewhat unexpected that the speedup using four threads was so
significant.

For all the others there is a slowdown of 0 to 11%, but for ep there
is a speedup of 34%.
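
To show what "embarrassingly parallel" means here, a toy sketch of a
transcendental-heavy loop split into independent chunks.  This is not
the NPB ep kernel, only a loop of the same general shape; the function
names and the expression being summed are invented.  Because the chunks
share no state, they can be computed in any order (or at the same time)
and simply added up.

```python
import math
import random
from functools import partial


def ep_chunk(seed, n):
    """Sum a transcendental-heavy expression over n pseudo-random points.

    Each chunk owns its random generator, so chunks share no state.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.uniform(1.0, 2.0)
        total += math.exp(-x) * math.log(x) + math.pow(x, 0.5)
    return total


def ep_total(seeds, n, mapper=map):
    """Combine independent chunks; `mapper` can be any map-like callable."""
    return sum(mapper(partial(ep_chunk, n=n), seeds))


if __name__ == "__main__":
    print(ep_total(range(4), 100_000))
```

The built-in map runs the chunks one after another; a pool's map method
could be passed as `mapper` to run them concurrently.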

The results can be viewed at :

http://computational-battery.org/

I have received a lot of comments about the hyperthreading due to my
former posting, but little actual benchmark results. It would be
interesting to see if there are other programs or problems that can
benefit from the hyperthreading. 

-- 
Ole W. Saastad, Dr.Scient. Scali AS P.O.Box 150 Oppsal 0619 Oslo NORWAY 
Tel:+47 22 62 89 68(dir) mailto:ole at scali.no http://www.scali.com 
Are you meeting Petaflop requirements with Gigaflops performance ?
    - Scali Terarack bringing Teraflops to the masses.


From raju at linux-delhi.org  Tue Apr 30 06:38:32 2002
From: raju at linux-delhi.org (Raju Mathur)
Date: Tue, 30 Apr 2002 19:08:32 +0530
Subject: Attachmants (was: Hyperthreading in P4)
In-Reply-To: <3CCE61E4.20F3D7B6@fecit.co.uk>
References: <200204292059.g3TKxcD06021@blueraja.scyld.com>
	<3CCE61E4.20F3D7B6@fecit.co.uk>
Message-ID: <15566.40536.998049.36069@mail.linux-delhi.org>

This is a Mailman-administered list, and Mailman has pretty decent
options to filter out messages containing all sorts of unnecessary
content.  For instance, on the Linux-India-* lists which I manage we
quarantine all messages with any MIME content (including HTML) for
administrator action.  While it's a bit of a PITA for the list
administrator, it definitely keeps the list coherent and easy on both
the high-bandwidth and us III-world 28.8-dialup types :-)

The new version of Mailman (which was still Beta, last I checked) had
command-line tools for doing regular list maintenance.  Quite an
improvement over that sucky web interface (IMNSHO).  Vive l'keyboard!

BTW, most (all?) mailers have options for exploding digests into
individual messages, after which you'll be able to read the
attachments just fine.

Regards,

-- Raju

>>>>> "Herbert" == Herbert Fruchtl <fruechtl at fecit.co.uk> writes:

    Herbert> Would you PLEEEEEASE not send huge binary attachments to
    Herbert> the list! Put them on the web, send them on demand,
    Herbert> whatever. Hundreds of elm users hate you and will never
    Herbert> buy from your company again :-)

    Herbert> I apologise if others already complained. I only get the
    Herbert> daily digest (where attachments are definitely
    Herbert> unreadable).

-- 
Raju Mathur          raju at kandalaya.org           http://kandalaya.org/
                     It is the mind that moves


From becker at scyld.com  Tue Apr 30 08:45:27 2002
From: becker at scyld.com (Donald Becker)
Date: Tue, 30 Apr 2002 11:45:27 -0400 (EDT)
Subject: List Attachmants (was: Hyperthreading in P4)
In-Reply-To: <15566.40536.998049.36069@mail.linux-delhi.org>
Message-ID: <Pine.LNX.4.33.0204301128190.5109-100000@presario>

> >>>>> "Herbert" == Herbert Fruchtl <fruechtl at fecit.co.uk> writes:
>
>     Herbert> Would you PLEEEEEASE not send huge binary attachments to
>     Herbert> the list! Put them on the web, send them on demand,

I apologize for allowing this very large message to get through.

Some subscribers were lucky, and didn't see the message.  Sending a
2.4MB message to several thousand subscribers saturated our link.  Once
I figured out what was happening, I shut down the mailer and manually
deleted the large messages from the queue.  (That's why the mailer was
down overnight.)

The Klez virus is part of the problem here.  There have been so many
Klez messages that I assumed the initial complaints were mistaken about
the message source.

On Tue, 30 Apr 2002, Raju Mathur wrote:

> This is a Mailman-administered list, and Mailman has pretty decent
> options to filter out messages containing all sorts of unnecessary
> content.  For instance, on the Linux-India-* lists which I manage we
> quarantine all messages with any MIME content (including HTML) for
> administrator action.

We are running Mailman 2.0.6.
I don't see a moderation option for MIME, although I might have missed it.

> The new version of Mailman (which was still Beta, last I checked) had
> command-line tools for doing regular list maintenance.  Quite an
> improvement over that sucky web interface (IMNSHO).  Vive l'keyboard!

OOoooh... I'm updating when it comes out of beta.  It's very time
consuming to use the web interface to delete the same spam from two
dozen lists.  I'm hoping the new version has "discard" patterns as well
as the current "hold for moderation" patterns.


-- 
Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993


From SGaudet at turbotekcomputer.com  Tue Apr 30 10:19:42 2002
From: SGaudet at turbotekcomputer.com (Steve Gaudet)
Date: Tue, 30 Apr 2002 13:19:42 -0400
Subject: List Attachmants (was: Hyperthreading in P4)
Message-ID: <3450CC8673CFD411A24700105A618BD6267EA5@911TURBO>

Don,

> 
> I apologize for allowing this very large message to get through.
> 
> Some subscribers were lucky, and didn't see the message.  Sending a
> 2.4MB message to several thousand subscribers saturated our 
> link.  Once
> I figured out what was happening, I shut down the mailer and manually
> deleted the large messages from the queue.  (That's why the mailer was
> down overnight.)

Very sorry about this.  I didn't realize I put together such a big file.  It
won't happen again.  I'll put all this up on our web site this week.

Again I apologize for this screw up.

Steve Gaudet 


From math at velocet.ca  Tue Apr 30 10:25:12 2002
From: math at velocet.ca (Velocet)
Date: Tue, 30 Apr 2002 13:25:12 -0400
Subject: List Attachmants (was: Hyperthreading in P4)
In-Reply-To: <Pine.LNX.4.33.0204301128190.5109-100000@presario>; from becker@scyld.com on Tue, Apr 30, 2002 at 11:45:27AM -0400
References: <15566.40536.998049.36069@mail.linux-delhi.org> <Pine.LNX.4.33.0204301128190.5109-100000@presario>
Message-ID: <20020430132512.R8530@velocet.ca>

On Tue, Apr 30, 2002 at 11:45:27AM -0400, Donald Becker's all...
> > >>>>> "Herbert" == Herbert Fruchtl <fruechtl at fecit.co.uk> writes:
> >
> >     Herbert> Would you PLEEEEEASE not send huge binary attachments to
> >     Herbert> the list! Put them on the web, send them on demand,
> 
> I apologize for allowing this very large message to get through.
> 
> Some subscribers were lucky, and didn't see the message.  Sending a
> 2.4MB message to several thousand subscribers saturated our link.  Once
> I figured out what was happening, I shut down the mailer and manually
> deleted the large messages from the queue.  (That's why the mailer was
> down overnight.)
> 
> The Klez virus is part of the problem here.  There have been so many
> Klez messages that I assumed the initial complaints were mistaken about
> the message source.
> 
> On Tue, 30 Apr 2002, Raju Mathur wrote:
> 
> > This is a Mailman-administered list, and Mailman has pretty decent
> > options to filter out messages containing all sorts of unnecessary
> > content.  For instance, on the Linux-India-* lists which I manage we
> > quarantine all messages with any MIME content (including HTML) for
> > administrator action.
> 
> We are running Mailman 2.0.6.
> I don't see a moderation option for MIME, although I might have missed it.

In the new mailman (which we also use) there should be header filtering.
It should be possible to filter based on Content-type:, however that
doesn't help with content length.
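
A minimal sketch of the kind of check being discussed, combining a
Content-type test with a size limit.  It uses Python's standard email
module and is a standalone illustration, not Mailman configuration or
Mailman's API; the 40 kB limit and the text/plain-only policy are
arbitrary assumptions.

```python
from email import message_from_string


def should_hold(raw_message, max_bytes=40_000):
    """Return a reason string if the message should be held, else None.

    Standalone illustration; the limit and the text/plain-only policy
    are assumptions, not Mailman defaults.
    """
    if len(raw_message.encode("utf-8", "replace")) > max_bytes:
        return "too large"
    msg = message_from_string(raw_message)
    for part in msg.walk():
        if part.is_multipart():
            continue  # container; its children are visited by walk()
        if part.get_content_type() != "text/plain":
            return "non-text part: " + part.get_content_type()
    return None


if __name__ == "__main__":
    sample = "From: someone@example.org\nSubject: hello\n\nplain text body\n"
    print(should_hold(sample))
```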

> > The new version of Mailman (which was still Beta, last I checked) had
> > command-line tools for doing regular list maintenance.  Quite an
> > improvement over that sucky web interface (IMNSHO).  Vive l'keyboard!
> 
> OOoooh... I'm updating when it comes out of beta.  It's very time
> consuming to use the web interface to delete the same spam from two
> dozen lists.  I'm hoping the new version has "discard" patterns as well
> as the current "hold for moderation" pattens.

The new one can auto discard spam, and you can even do it sans notification.
(I do it with, so I get the spam, but just to see what's being discarded
in case some lost sheep subscriber is posting incorrectly to my relatively
private lists.)

It's not bad. Now if I could only figure out why it uses a cluster worth
of CPU to deliver messages, I'd be happy with mailman. :) (1.2Ghz CPU
doing about 20-30% cpu 24hrs a day to send 2000 posts-recipient :( )

/kc

> 
> 
> -- 
> Donald Becker				becker at scyld.com
> Scyld Computing Corporation		http://www.scyld.com
> 410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
> Annapolis MD 21403			410-990-9993
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

-- 
Ken Chase, math at velocet.ca  *  Velocet Communications Inc.  *  Toronto, CANADA 


From jlb17 at duke.edu  Tue Apr 30 12:50:58 2002
From: jlb17 at duke.edu (Joshua Baker-LePain)
Date: Tue, 30 Apr 2002 15:50:58 -0400 (EDT)
Subject: Processor contention(?) and network bandwidth on AMD
In-Reply-To: <Pine.LNX.4.44.0204291638030.16146-100000@ganesh.phy.duke.edu>
Message-ID: <Pine.LNX.4.44.0204301538150.1681-100000@chaos.egr.duke.edu>

On Mon, 29 Apr 2002 at 5:07pm, Robert G. Brown wrote

> Try this:
> 
>   Do your netpipe transmitter on an unloaded host, a host at load 1 and
> at load 2.
>   Do your netpipe receiver on an unloaded host, a host at load 1 and one
> at load 2.
> 
> Fill in the matrix -- load 0 to load 0, load 0 to load 1, etc.
> 
> I found (in similar tests done years ago) that a TRANSMITTER could be
> loaded to 2 (per cpu) with only a small degradation of throughput, but
> loading a RECEIVER would drop throughput dramatically, by as much as
> 50%.

Done -- the results are at <http://www.duke.edu/~jlb17/bwtest.pdf>.  Keep 
in mind these were pretty quick and dirty tests using the systems I have 
on hand.

Between Athlon systems, it seems the transmitter vs. receiver loading 
doesn't make much of a difference.  A newer Intel based system (dual PIIIs 
on a Serverworks HE-SL chipset) shows the same bandwidth hit with overload 
as the Athlon systems.  But older systems (well, a couple of years 
anyway) don't show this hit, which is what set me off on all this.  Maybe 
it's an issue of chipset support?

Anyways, thanks for listening to me babble on about all this.

-- 
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University


From vanw at tticluster.com  Tue Apr 30 15:11:59 2002
From: vanw at tticluster.com (Kevin Van Workum)
Date: Tue, 30 Apr 2002 18:11:59 -0400 (EDT)
Subject: Dolphin Wulfkit
Message-ID: <Pine.LNX.4.33.0204301808030.16960-100000@tticluster.com>

I know this has been discussed before, but I'd like to know the "current" 
opinion. What are your experiences with the Dolphin Wulfkit interconnect? 
Any major issues (compatibility/linux7.2/MPI/etc)? General comments are
welcome too.

-- 
Kevin Van Workum
www.tsunamictechnologies.com
ONLINE COMPUTER CLUSTERS

__/__ __/__ *
 /     /   /
/     /   /


From rgb at phy.duke.edu  Tue Apr 30 16:03:57 2002
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Tue, 30 Apr 2002 19:03:57 -0400 (EDT)
Subject: xmlsysd, wulfstat (cluster monitor apps, beta)
Message-ID: <Pine.LNX.4.44.0204301825080.8222-100000@lucifer.rgb.private.net>

Dearest DBUG (and beowulf list) persons,

Announcing xmlsysd and its companion application, wulfstat.

xmlsysd is a lightweight, throttleable daemon that runs either as a
forking daemon or out of xinetd (the latter by default).  When one
connects to it, it accepts a very simple command language that basically
a) configures it to deliver certain kinds of /proc and
systems-call-derived information, generally throttling it so it doesn't
return anything you aren't interested in; and b) causes it to wrap up
that information in an xml-formatted message and return it to the
caller.  Security is managed any of several ways -- by ipchains or
iptables, using tcp wrappers, or using xinetd's internal ip-level
security features (or using ssl or ssh tunnels, for the truly paranoid
or those who want to monitor across a WAN).
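
As a rough illustration of step b), here is a sketch that wraps the
contents of /proc/loadavg in a small XML message.  The element and
attribute names are invented for the example and are not the actual
xmlsysd message format.

```python
import xml.etree.ElementTree as ET


def loadavg_to_xml(hostname, loadavg_text):
    """Wrap the contents of /proc/loadavg in a small XML message.

    Element and attribute names are invented for this sketch.
    """
    fields = loadavg_text.split()
    root = ET.Element("host", name=hostname)
    load = ET.SubElement(root, "loadavg")
    # The first three fields are the 1, 5 and 15 minute load averages.
    for tag, value in zip(("load1", "load5", "load15"), fields[:3]):
        ET.SubElement(load, tag).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    with open("/proc/loadavg") as f:
        print(loadavg_to_xml("localhost", f.read()))
```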

wulfstat is a companion client application that uses the xmlsysd's
running on a collection of cluster nodes or LAN workstation hosts to
gather information about the nodes or hosts and present it in a simple
tty (e.g. xterm, konsole) accessible tabular form, updating the table
every N seconds (default 5).  Think of it as vmstat, procinfo, ifconfig,
uptime, free, date, the upper part of the top command, and a bit more
all rolled into a single application so that you can monitor whole
connected sets of this information across an entire cluster with some
reasonable granularity.

Such a tool has obvious uses -- for cluster users, it allows them to
monitor host load averages, look for idle resources, monitor memory
usage, obtain information at a glance about remote cpu type and clock,
cache size, monitor network loads, and even see what fraction of a
cluster node's up time has been spent "doing work" instead of idle.
Most of this is equally useful to systems administrators seeking to
monitor LAN host activity -- crashing systems are often signalled by
anomalous consumption of memory or a steady rise in cpu usage, for
example.

The toolset has now been in use for some time and has been reasonably
stable for several weeks (in spite of my constant poking at it to add
new features or fix tiny problems).  I am therefore releasing it as
version 0.1.0 BETA for wider testing, although at the moment it seems to
be doing fine in production.

It is expected that wulfstat is just the first of a number of monitoring
applications that will be developed that use the daemon.  The daemon,
for example, can also be used to monitor tasks on remote nodes by
username and/or taskname and/or run status, although the application
that actually permits name and task lists to be managed on the user side
and the returned results properly displayed has yet to be written.  Full
GUI and/or web applications should also be straightforward to build,
although this time I learned my lesson and built the tty application
FIRST (for xmlsysd's predecessor, procstatd, I built a GUI application
and have regretted it ever after).  It is also expected that at least a
few more features will be added to the daemon (it lacks lm-sensors
support at this point, for example).

The daemon >>should<< have just enough power to form the basis for a
load balancing or job distribution system -- it can certainly
efficiently provide realtime monitoring of many of the components upon
which a queuing decision might be based, including load, memory and
network utilization, non-root tasks running or waiting to run, and even
CPU type, clock, and cache.  It does not run as a privileged user,
however, and is not designed to manage the actual distribution or
control of jobs.
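
A minimal sketch of the sort of queuing decision that could be built on
top of such monitoring data: pick the node with the lowest load per CPU.
The dictionary layout and key names are hypothetical; a real tool would
fill them in from whatever the daemon returns.

```python
def pick_node(nodes):
    """Return the name of the node with the lowest 1-minute load per CPU.

    `nodes` maps a hostname to a dict with hypothetical keys "load1"
    and "cpus".
    """
    if not nodes:
        raise ValueError("no nodes to choose from")
    return min(nodes, key=lambda name: nodes[name]["load1"] / nodes[name]["cpus"])


if __name__ == "__main__":
    cluster = {
        "b01": {"load1": 2.0, "cpus": 2},
        "b02": {"load1": 0.5, "cpus": 1},
        "b03": {"load1": 1.9, "cpus": 2},
    }
    print("next job goes to:", pick_node(cluster))
```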

Still, I expect and hope that wulfstat and xmlsysd together will be
immediately useful to cluster people who install it.  The included
documentation should be adequate although not overwhelming -- there are
man pages for both xmlsysd and wulfstat that are very nearly up to date
-- and I'm available to help with installations that don't seem to work
correctly.  The one "gotcha" of wulfstat is that it does require libxml2
(and hence probably RH 7.2 or better) to run -- you will need to ensure
that this RPM is installed on the hosts where wulfstat is to run.
xmlsysd similarly requires libxml to run on the cluster nodes.

I would greatly appreciate feedback and bug reports, if any, from
anybody who chooses to install it and give it a try.

To retrieve it in RPM form, you can use the URLs below:

   http://www.phy.duke.edu/brahma/xmlsysd-0.1.0-beta.i386.rpm
   http://www.phy.duke.edu/brahma/xmlsysd-0.1.0-beta.src.rpm

   http://www.phy.duke.edu/brahma/wulfstat-0.1.0-beta.i386.rpm
   http://www.phy.duke.edu/brahma/wulfstat-0.1.0-beta.src.rpm

If anybody needs it in tarball form (not in source or binary rpm form)
they should contact me directly.  I can easily generate one (or it can
be extracted from the source rpm) but I can't guarantee the instructions
for installation or configuration -- they are encapsulated already in
the RPMs.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu


From ajiang at mail.eecis.udel.edu  Tue Apr 30 20:19:05 2002
From: ajiang at mail.eecis.udel.edu (Ao Jiang)
Date: Tue, 30 Apr 2002 23:19:05 -0400 (EDT)
Subject: Screen dump analysis:
In-Reply-To: <NGELIDAMILOAGLEOBABKEEAPCAAA.jiangao@udel.edu>
Message-ID: <Pine.GSO.4.33.0204302316220.17881-100000@ren.eecis.udel.edu>

   Hi,
   I have some questions about installing Scyld Beowulf (version 2.0 preview).
 I hope someone can give me some direction. Thanks a lot!


 After booting up the Beowulf system, the slave node shows:
 "Boot: System boot phase 1 in progress...
...
 Sending RARP request..."
 But the request seems to be sent forever, and the status of the nodes is
 always 'down' in Beosetup on the master node.

 I checked the website and found that this may be caused by beoserv not
 seeing the RARP requests. Unfortunately, I couldn't find an effective way
 to solve it.

 So would you mind giving me some suggestions on how to fix the beoserv
 problem? Or how to send messages to the slave nodes manually?

 The interesting thing is:

 When I tried to reboot the master node, the slave nodes seemed to receive
 messages from the master node, rebooted too, and entered phase 2 or 3. But
 some slave nodes show errors. I don't know what they mean or what the
 causes are. If it is a hardware problem, which device is it?
 The following is the screen dump:

 Slave node 1:

 "
 EXT2-fs error (device ramdisk(1,3))
 Ext2_add_entry: bad entry in directory #8193; directory entry across
 blocks-offset=13680, inode=9096, rec_len=2064, name_len=5.
 "
 The status of node 1 is error.

 Slave node 2:

"
 Boot: System boot phase 2 in progress...
 autorun...Done
 VFS: Cannot open root device 03:01
 Kernel panic: VFS unable to mount root fs on 03:01
"
 The status of node 2 is error.

 Tom


From alvin at Maggie.Linux-Consulting.com  Tue Apr 30 20:39:55 2002
From: alvin at Maggie.Linux-Consulting.com (alvin at Maggie.Linux-Consulting.com)
Date: Tue, 30 Apr 2002 20:39:55 -0700 (PDT)
Subject: Screen dump analysis:
In-Reply-To: <Pine.GSO.4.33.0204302316220.17881-100000@ren.eecis.udel.edu>
Message-ID: <Pine.LNX.3.96.1020430203604.27389A-100000@Maggie.Linux-Consulting.com>

hiya

>  autorun...Done
>  VFS: Cannot open root device 03:01
>  Kernel panic: VFS unable to mount root fs on 03:01
> "
>  The status of node 2 is error.

the system/kernel you are booting is looking for / on /dev/hda1
but can't find it...
	- you probably copied kernels from different machines
	onto this one
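
for reference, a quick sketch of how the "03:01" in the panic message
decodes to /dev/hda1.  it assumes the standard linux IDE major numbers
(3 for hda/hdb, 22 for hdc/hdd, 64 minors per drive) and only handles
those; decode_root_dev is just an illustrative name, not a real tool.

```python
# Linux IDE block-device majors and the two drives each one serves
IDE_MAJORS = {3: ("hda", "hdb"), 22: ("hdc", "hdd")}


def decode_root_dev(code):
    """Turn a 'MM:mm' hex pair from a VFS panic message into a /dev name."""
    major_hex, minor_hex = code.split(":")
    major, minor = int(major_hex, 16), int(minor_hex, 16)
    if major not in IDE_MAJORS or minor > 127:
        raise ValueError("device %s is not handled by this sketch" % code)
    drive = IDE_MAJORS[major][minor // 64]  # each drive owns 64 minors
    partition = minor % 64                  # 0 means the whole disk
    return "/dev/" + drive + (str(partition) if partition else "")


print(decode_root_dev("03:01"))
```

so a kernel that panics on 03:01 was told its root is the first partition
of the first IDE disk, whatever the disk actually holds.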

lilo: vmlinuz  root=/dev/hda3
	if your /  is located on /dev/hda3

once it comes up...fix /etc/lilo.conf, re-run lilo  and then reboot

c ya
alvin

